Psychometric Tests#
To obtain results for the psychometric tests, the recorded data has to be processed. This page explains the steps involved in the processing and provides some background information about the psychometric tests.
Data Collection#
Data for the psychometric tests have been collected using the MultiplEYE-psychometric-tests repository. Find details about it in the project's README.md.
Theoretical Background#
The psychometric tests in the MultiplEYE battery are carefully selected to provide a comprehensive assessment of cognitive abilities relevant to language processing and eye-tracking research. Each test targets specific cognitive domains while maintaining cross-linguistic validity and experimental efficiency. (confirm)
Lewandowsky WMC Battery#
Working Memory Capacity (WMC) is a fundamental cognitive construct that represents the amount of information that can be held in mind and simultaneously processed. WMC is correlated with general intelligence (g-factor), language comprehension, and reading ability. (confirm) The Lewandowsky WMC battery provides a comprehensive assessment through four complementary tasks:
Memory Update (MU): Participants continuously update a running memory store with new information while discarding old items. This task measures the ability to dynamically manipulate information in working memory.
Operation Span (OS): Participants must remember items while performing simple mathematical operations, capturing the dual-processing demands of working memory.
Sentence Span (SS): Similar to OS but with sentence verification tasks instead of mathematical operations, tapping verbal working memory specifically relevant to language processing.
Spatial Short-Term Memory (SSTM): Assesses visuospatial memory through recall of spatial sequences, providing a non-verbal complement to the verbal tasks.
Our implementation follows the methodology described in Lewandowsky et al. [LOYE10], specifically the data format and scoring details outlined in the appendix. The scoring approach calculates the proportion of items recalled correctly for each trial, then computes the mean of these trial-level proportions to obtain the final task score. This method ensures that each task contributes equally to the overall WMC construct, regardless of differences in the number of trials or set sizes across tasks.
For each trial, the score is calculated as the proportion of correctly recalled items:
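$$\text{score}_{\text{trial}} = \frac{n_{\text{correct}}}{n_{\text{presented}}}$$

where \(n_{\text{correct}}\) is the number of correctly recalled items in the trial and \(n_{\text{presented}}\) is the trial's set size.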
The final task score is the mean of all trial scores:
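$$\text{score}_{\text{task}} = \frac{1}{N}\sum_{i=1}^{N} \text{score}_{\text{trial},\,i}$$

where \(N\) is the number of trials in the task.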
The spatial task is scored on pattern similarity, computed as the sum of the dot-to-dot similarities. Dot-to-dot similarity is determined by "awarding 2 points for no distance between a recalled dot and a presented dot, 1 point for a deviation of one cell in any direction, and 0 points if the deviation exceeded one cell". Finally, this score is normalized:
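$$\text{score}_{\text{SSTM}} = \frac{\text{raw pattern-similarity score}}{240}$$

where 240 is the maximum attainable score in the current version (see below).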
The total WMC score is the average across all tasks:
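$$\text{LWMC\_Total\_score\_mean} = \frac{\text{MU} + \text{OS} + \text{SS} + \text{SSTM}}{4}$$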
This implementation processes data from CSV files rather than the original .dat format described in the paper, but maintains the same scoring logic and produces equivalent results. For the spatial task, we extract the raw score from the .dat file and normalize it by dividing by 240.0, consistent with the maximum attainable score in the current version.
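A minimal sketch of this trial-level scoring in Python (illustrative only; the DataFrame column names `trial` and `correct` are assumptions, not the actual CSV schema):

```python
import pandas as pd

def score_span_task(trials: pd.DataFrame) -> float:
    """Mean of per-trial proportions of correctly recalled items.

    Expects one row per recalled item, with a boolean `correct` column
    and a `trial` column identifying the trial (hypothetical names).
    """
    trial_scores = trials.groupby("trial")["correct"].mean()
    return float(trial_scores.mean())

def score_sstm(raw_similarity_score: float, max_score: float = 240.0) -> float:
    """Normalize the raw SSTM pattern-similarity score to [0, 1]."""
    return raw_similarity_score / max_score
```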
Returned Results:
LWMC_MU_score: Memory Update task score
LWMC_OS_score: Operation Span task score
LWMC_SS_score: Sentence Span task score
LWMC_SSTM_score: Spatial Short-Term Memory task score
LWMC_Total_score_mean: Average score across all four tasks
Detailed outputs:
LWMC_MU_time_sec: Mean response time for Memory Update
LWMC_OS_time_sec: Mean response time for Operation Span
LWMC_SS_time_sec: Mean response time for Sentence Span
Rapid Automatized Naming (RAN)#
Rapid Automatized Naming is a classic measure of processing speed and automaticity in cognitive retrieval. Originally developed to predict reading ability, RAN assesses the efficiency with which participants can access and articulate well-learned information. The task requires participants to name a series of familiar items (typically digits or letters) as quickly as possible. RAN performance is strongly predictive of reading fluency across languages and is considered a marker of the automaticity of cognitive processes (citation?).
For our data, there are two trials per session. Both trials consist of (…) (confirm). The results are the times taken to complete each of the two trials.
Returned Results:
RAN_practice_rt_sec: Reaction time for the practice trial
RAN_experimental_rt_sec: Reaction time for the experimental trial
Stroop Test#
The Stroop test is one of the most widely used measures of cognitive control and inhibitory functioning. It demonstrates the phenomenon of interference—when automatic processing conflicts with task demands. In the color-word Stroop, participants must inhibit the automatic tendency to read words while naming the ink color. The difference in performance between congruent (word matches color) and incongruent (word conflicts with color) trials provides a sensitive measure of inhibitory control [Str35].
Mathematical Formulas:
For each condition (incongruent, congruent, neutral), basic metrics are calculated as:
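For condition \(c \in \{\text{incongruent}, \text{congruent}, \text{neutral}\}\) with \(n_c\) trials:

$$\overline{RT}_c = \frac{1}{n_c}\sum_{i=1}^{n_c} RT_{c,i}, \qquad \text{accuracy}_c = \frac{n_{c,\text{correct}}}{n_c}$$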
Interference effects (focused on congruent vs incongruent comparison):
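Assuming the conventional direction, so that larger values indicate stronger interference:

$$\text{StroopRTEffect} = \overline{RT}_{\text{incongruent}} - \overline{RT}_{\text{congruent}}, \qquad \text{StroopAccuracyEffect} = \text{accuracy}_{\text{congruent}} - \text{accuracy}_{\text{incongruent}}$$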
Returned Results:
StroopAccuracyEffect: Accuracy interference effect
StroopRTEffect_sec: Reaction time interference effect
Detailed outputs:
Stroop_incongruent_rt_mean_sec: Mean RT for incongruent trials
Stroop_incongruent_accuracy: Accuracy for incongruent trials
Stroop_incongruent_num_items: Number of incongruent trials
Stroop_congruent_rt_mean_sec: Mean RT for congruent trials
Stroop_congruent_accuracy: Accuracy for congruent trials
Stroop_congruent_num_items: Number of congruent trials
Stroop_neutral_rt_mean_sec: Mean RT for neutral trials
Stroop_neutral_accuracy: Accuracy for neutral trials
Stroop_neutral_num_items: Number of neutral trials
Flanker Task#
The Flanker task complements the Stroop by measuring inhibitory control in a spatial attention domain. Participants must respond to a central target while ignoring surrounding “flanker” distractors. The task creates conflict between automatic attentional capture by the flankers and the goal-directed focus on the target. Like the Stroop, the difference between incongruent and congruent conditions provides a measure of cognitive control, but through spatial rather than semantic interference [EE74].
Mathematical Formulas:
For each condition (incongruent, congruent), basic metrics are calculated as:
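As for the Stroop test, for condition \(c \in \{\text{incongruent}, \text{congruent}\}\) with \(n_c\) trials:

$$\overline{RT}_c = \frac{1}{n_c}\sum_{i=1}^{n_c} RT_{c,i}, \qquad \text{accuracy}_c = \frac{n_{c,\text{correct}}}{n_c}$$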
Spatial interference effects:
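Again assuming the conventional direction:

$$\text{FlankerRTEffect} = \overline{RT}_{\text{incongruent}} - \overline{RT}_{\text{congruent}}, \qquad \text{FlankerAccuracyEffect} = \text{accuracy}_{\text{congruent}} - \text{accuracy}_{\text{incongruent}}$$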
Returned Results:
FlankerAccuracyEffect: Accuracy interference effect
FlankerRTEffect_sec: Reaction time interference effect
Detailed outputs:
Flanker_incongruent_rt_mean_sec: Mean RT for incongruent trials
Flanker_incongruent_accuracy: Accuracy for incongruent trials
Flanker_incongruent_num_items: Number of incongruent trials
Flanker_congruent_rt_mean_sec: Mean RT for congruent trials
Flanker_congruent_accuracy: Accuracy for congruent trials
Flanker_congruent_num_items: Number of congruent trials
PLAB (Pimsleur Language Aptitude Battery)#
The PLAB test assesses language learning aptitude through a battery of tasks that measure different components of language ability. Unlike the other cognitive tests, PLAB specifically targets abilities relevant to second language acquisition, including auditory discrimination, memory for foreign language sounds, and grammatical pattern recognition. This makes it particularly valuable for research on multilingualism and language learning [PRS04].
Returned Results:
PLAB_rt_mean_sec: Mean reaction time across all PLAB trials
PLAB_accuracy: Overall accuracy across all PLAB trials
Detailed outputs:
PLAB_num_items: Total number of PLAB trials
WikiVocab#
WikiVocab is a modern vocabulary assessment that uses items drawn from Wikipedia corpora across multiple languages. It provides a cross-linguistically valid measure of vocabulary breadth by including both real words and carefully constructed pseudo-words. The balanced scoring approach (averaging performance on real and pseudo-words) makes it comparable across languages with different writing systems and vocabulary structures [vRSL+23].
Large, representative item pools from Wikipedia
Balanced scoring controls for response biases
Cross-linguistic comparability through pseudo-word calibration
For some languages, equivalent to LexTALE (Lextale.com scoring)
Mathematical Formulas:
Accuracy calculations for real and pseudo words:
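$$\text{accuracy}_{\text{real}} = \frac{n_{\text{real},\text{correct}}}{n_{\text{real}}}, \qquad \text{accuracy}_{\text{pseudo}} = \frac{n_{\text{pseudo},\text{correct}}}{n_{\text{pseudo}}}$$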
Balanced LexTALE-style scoring:
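Assuming the standard LexTALE convention of averaging the real-word and pseudo-word accuracies:

$$\text{WikiVocab\_incorrect\_correct\_score} = \frac{\text{accuracy}_{\text{real}} + \text{accuracy}_{\text{pseudo}}}{2}$$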
Returned Results:
WikiVocab_rt_mean_sec: Mean reaction time across all trials
WikiVocab_accuracy: Overall accuracy across all trials
WikiVocab_incorrect_correct_score: Balanced LexTALE-style score
Detailed outputs:
WikiVocab_num_items: Total number of items
WikiVocab_num_pseudo_words: Number of pseudo-word items
WikiVocab_num_real_words: Number of real word items
WikiVocab_pseudo_correct: Fraction correct for pseudo words
WikiVocab_real_correct: Fraction correct for real words
WikiVocab_overall_correct: Overall fraction correct
Calculating the Psychometric Tests#
As hinted in the Running the Pipelines section, there are project scripts for the psychometric tests. Before the calculation can run, however, the input data must be structured correctly.
Preparing the Data#
For each session, there should be a folder containing one subfolder for each test with its data.
The data should be structured as follows, e.g. for the session identifier (sid)
008_SQ_CH_1_PT2 and the data collection identifier MultiplEYE_SQ_CH_Zurich_1_2025:
data/MultiplEYE_SQ_CH_Zurich_1_2025/psychometric-tests-sessions/
├── ...
├── 008_SQ_CH_1_PT2
│ ├── 008_SQ_CH_1_PT2.yaml
│ ├── PLAB
│ │ ├── SQCH1_008_PT2_2025-09-13_15-18-01.csv
│ │ ├── SQCH1_008_PT2_2025-09-13_15-18-01.log
│ │ └── SQCH1_008_PT2_2025-09-13_15-18-01.psydat
│ ├── RAN
│ │ ├── SQCH1_008_PT2_2025-09-13_15-12-11.log
│ │ ├── SQCH1_8_S2_2025-09-13_15-12-11.csv
│ │ └── audio_SQCH1_008_PT2_2025-09-13_15-12-11
│ │ ├── SQCH1_8_S2_trial1.wav
│ │ └── SQCH1_8_S2_trial2.wav
│ ├── Stroop_Flanker
│ │ ├── SQCH1_008_PT2_2025-09-13_15-13-23.csv
│ │ ├── SQCH1_008_PT2_2025-09-13_15-13-23.log
│ │ └── SQCH1_008_PT2_2025-09-13_15-13-23.psydat
│ ├── WMC
│ │ ├── MU-8.dat
│ │ ├── OS-8.dat
│ │ ├── SQCH1_008_PT2_2025-09-13_14-31-31.csv
│ │ ├── SQCH1_008_PT2_2025-09-13_14-31-31.log
│ │ ├── SQCH1_008_PT2_2025-09-13_14-31-31.psydat
│ │ ├── SS-8.dat
│ │ ├── SSTM-8.dat
│ │ └── sstm_detailed
│ │ └── SSTM-8.dat
│ ├── WikiVocab
│ │ ├── SQCH1_008_PT2_2025-09-13_15-26-53.csv
│ │ └── images
│ └── psychometric_details_008_SQ_CH_1_PT2.csv (detail output for session)
The psychometric_details_008_SQ_CH_1_PT2.csv file is written as an output of the psychometric tests
and contains all detailed measures for the session.
Furthermore, an overview of all sessions will be written to
data/{data_collection_id}/{data_collection_id}_overview.yaml
If the data happens to be structured first by test folder and then by session folder,
it can be restructured using the preprocessing.scripts.restructure_psycho_tests:main
script.
This can be invoked like this:
restructure_psycho_tests
In short, if your data structure looks like the one below, restructure_psycho_tests can convert it into
the desired structure shown above.
data/MultiplEYE_SQ_CH_Zurich_1_2025/psychometric-tests-sessions/core_data/
├── PLAB
│ ├── 001_SQ_CH_1_PT2
│ ├── ...
│ └── 026_SQ_CH_1_PT1
├── RAN
│ ├── 001_SQ_CH_1_PT2
│ ├── ...
│ └── 026_SQ_CH_1_PT1
├── Stroop_Flanker
│ ├── 001_SQ_CH_1_PT2
│ ├── ...
│ └── 026_SQ_CH_1_PT1
├── WMC
│ ├── 001_SQ_CH_1_PT2
│ ├── ...
│ └── 026_SQ_CH_1_PT1
├── WikiVocab
│ ├── 001_SQ_CH_1_PT2
│ ├── ...
│ └── 026_SQ_CH_1_PT1
└── participant_configs_SQ_CH_1
├── 001_SQ_CH_1_PT2.yaml
├── ...
└── 026_SQ_CH_1_PT1.yaml
Usually, the command needs no arguments if the Configuration was set up correctly.
Otherwise, you can find out how to override the global settings
by invoking restructure_psycho_tests --help.
It is not a problem if a session did not finish all tests.
For such sessions, the tests without data are logged to the console.
The calculation described below handles this case and only computes results for tests with existing data.
Running the Calculations#
The psychometric test calculations are performed by the preprocess_psychometric_tests() function,
which processes each test separately and combines the results into overview and detailed output
files.
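For orientation, a minimal programmatic sketch (the module path below is an assumption made by analogy with preprocessing.scripts.restructure_psycho_tests; the documented entry point is the command shown further down):

```python
# Illustrative only: the import path is assumed, not documented; the
# function name preprocess_psychometric_tests comes from the text above.
from preprocessing.scripts.preprocess_psychometric_tests import preprocess_psychometric_tests

# Processes each session under psychometric-tests-sessions/, writing the
# overview YAML and one detailed CSV per session.
preprocess_psychometric_tests()
```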
Overview vs. Detailed Output#
The pipeline generates two types of output for each participant:
Overview file ({data_collection_id}_overview.yaml): Contains primary metrics as per the original paper outputs.
Detailed file (psychometric_details_{session_id}.csv): Contains more detailed values.
Command Line Usage#
The tests can be calculated with the following command:
preprocess_psychometric_tests
For details on how to overwrite the global settings,
run preprocess_psychometric_tests --help.
Data Handling#
The calculations automatically handle missing data. If a participant didn’t complete certain tests, those values will be omitted from the output files and logged to the console. This ensures robust processing even with incomplete datasets.