Test-retest reliability for common tasks in vision science

Research in perception and attention has typically sought to evaluate cognitive mechanisms according to the average response to a manipulation. Recently, there has been a shift toward appreciating the value of individual differences and the insight gained by exploring the impacts of between-participant variation on human cognition. However, a recent study suggests that many robust, well-established cognitive control tasks suffer from surprisingly low levels of test-retest reliability (Hedge et al., 2018b). We tested a large sample of undergraduate students (n = 160) in two sessions (separated by 1–3 weeks) on four commonly used tasks in vision science. We implemented measures that spanned the range of visual processing, including motion coherence (MoCo), useful field of view (UFOV), multiple-object tracking (MOT), and visual working memory (VWM). Intraclass correlations ranged from excellent to poor, suggesting that some task measures are more suitable for assessing individual differences than others. VWM capacity (ICC = 0.89), MoCo threshold (ICC = 0.60), UFOV middle accuracy (ICC = 0.60), and UFOV outer accuracy (ICC = 0.74) showed good-to-excellent reliability. Other measures, namely the maximum number of items tracked in MOT (ICC = 0.41) and UFOV number accuracy (ICC = 0.48), showed moderate reliability; the MOT threshold (ICC = 0.36) and UFOV inner accuracy (ICC = 0.30) showed poor reliability. In this paper, we present these results alongside a summary of reliabilities estimated previously for other vision science tasks. We then offer recommendations for evaluating test-retest reliability when considering a task for use in evaluating individual differences.
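
As an illustration of the statistic reported above, the following minimal Python sketch computes one common form of the intraclass correlation, ICC(2,1) (two-way random effects, absolute agreement, single measurement), from a participants-by-sessions score matrix. The abstract does not state which ICC form the authors used, so the choice of ICC(2,1), the function name `icc_2_1`, and the toy scores are assumptions made purely for illustration.

```python
import numpy as np


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `scores` is an (n_participants, k_sessions) array-like.
    Hypothetical helper, not taken from the study.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()

    # Sums of squares for participants (rows), sessions (columns), and residual.
    ss_rows = k * np.sum((x.mean(axis=1) - grand_mean) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand_mean) ** 2)
    ss_total = np.sum((x - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Toy data (made up): every participant improves by one point in session 2.
toy_scores = [[1, 2], [2, 3], [3, 4]]
print(round(icc_2_1(toy_scores), 3))
```

Because this form measures absolute agreement, a uniform shift between sessions (for example, a practice effect) lowers the estimate even when the rank order of participants is perfectly preserved.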

Assessing Individual Differences in Genome-Wide Gene Expression in Human Whole Blood: Reliability Over Four Hours and Stability Over 10 Months

Twin Research and Human Genetics ◽

10.1375/twin.12.4.372 ◽

2009 ◽

Vol 12 (4) ◽

pp. 372-380 ◽

Cited By ~ 4

Author(s):

Emma L. Meaburn ◽

Cathy Fernandes ◽

Ian W. Craig ◽

Robert Plomin ◽

Leonard C. Schalkwyk

Keyword(s):

Gene Expression ◽

Individual Differences ◽

Blood Collection ◽

Retest Reliability ◽

Long Term Stability ◽

Genome Wide ◽

Intraclass Correlations ◽

Average Expression Level ◽

Test Retest Reliability ◽

Probe Set

Abstract: Studying the causes and correlates of natural variation in gene expression in healthy populations assumes that individual differences in gene expression can be reliably and stably assessed across time. However, this is yet to be established. We examined 4-hour test–retest reliability and 10-month test–retest stability of individual differences in gene expression in ten 12-year-old children. Blood was collected on four occasions: 10 a.m. and 2 p.m. on Day 1 and 10 months later at 10 a.m. and 2 p.m. Total RNA was hybridized to Affymetrix-U133 plus 2.0 arrays. For each probeset, the correlation across individuals between 10 a.m. and 2 p.m. on Day 1 estimates test–retest reliability. We identified 3,414 variable and abundantly expressed probesets whose 4-hour test–retest reliability exceeded .70, a conventionally accepted level of reliability, which we had 80% power to detect. Of the 3,414 reliable probesets, 1,752 were also significantly reliable 10 months later. We assessed the long-term stability of individual differences in gene expression by correlating the average expression level for each probeset across the two 4-hour assessments on Day 1 with the average level of each probeset across the two 4-hour assessments 10 months later. 1,291 (73.7%) of the 1,752 probesets that reliably detected individual differences across 4 hours on two occasions, 10 months apart, also stably detected individual differences across 10 months. Heritability, as estimated from the MZ twin intraclass correlations, is twice as high for the 1,752 reliable probesets versus all present probesets on the array (0.68 vs 0.34), and is even higher (0.76) for the 1,291 reliable probesets that are also stable across 10 months. The 1,291 probesets that reliably detect individual differences from a single peripheral blood collection and stably detect individual differences over 10 months are promising targets for research on the causes (e.g., eQTLs) and correlates (e.g., psychopathology) of individual differences in gene expression.

Test-Retest Reliability of the Eurofit Test Battery Administered to University Students

Perceptual and Motor Skills ◽

10.2466/pms.2002.95.3f.1295 ◽

2002 ◽

Vol 95 (3_suppl) ◽

pp. 1295-1300 ◽

Cited By ~ 34

Author(s):

Nikolaos Tsigilis ◽

Helen Douda ◽

Savvas P. Tokmakidis

Keyword(s):

University Students ◽

Undergraduate Students ◽

Intraclass Correlation ◽

Test Battery ◽

Retest Reliability ◽

Test Items ◽

Motor Fitness ◽

Fitness Tests ◽

Test Retest Reliability ◽

Motor Fitness Tests

The purpose of this study was to examine the test-retest reliability of the Eurofit motor fitness tests performed by university students. A total of 98 undergraduate students who were enrolled in physical education departments in Greece participated (29 men aged 19.5 ± 2.7 yr. and 66 women aged 19.4 ± 2.7 yr.). All Eurofit motor fitness tests and anthropometric measurements were obtained twice with one week between the two measurements. Intraclass correlation coefficients were satisfactory (above .70) for most tests. The only exception was the plate-tapping test, which yielded a low value (R = .57). Further, the majority of the Eurofit test battery fitted well within the 95% confidence interval, and only three Eurofit motor fitness test items (flamingo balance, plate tapping, and sit-ups) presented a confidence limit below the value of .70. These findings indicated that the Eurofit test battery yielded reliable data for undergraduate students. However, modifications should be considered to improve the reliability of certain test items for application to undergraduates.

Effect sizes and test-retest reliability of the fMRI-based Neurologic Pain Signature

10.1101/2021.05.29.445964 ◽

2021 ◽

Author(s):

Xiaochun Han ◽

Yoni K. Ashar ◽

Philip Kragel ◽

Bogdan Petre ◽

Victoria Schelkun ◽

...

Keyword(s):

Individual Differences ◽

Effect Size ◽

Mental States ◽

Effect Sizes ◽

Medium Effect ◽

Retest Reliability ◽

Medium Effect Size ◽

Pain Reports ◽

Nociceptive Input ◽

Test Retest Reliability

Identifying biomarkers that predict mental states with large effect sizes and high test-retest reliability is a growing priority for fMRI research. We examined a well-established multivariate brain measure that tracks pain induced by nociceptive input, the Neurologic Pain Signature (NPS). In N = 295 participants across eight studies, NPS responses showed a very large effect size in predicting within-person single-trial pain reports (d = 1.45) and medium effect size in predicting individual differences in pain reports (d = 0.49, average r = 0.20). The NPS showed excellent short-term (within-day) test-retest reliability (ICC = 0.84, with average 69.5 trials/person). Reliability scaled with the number of trials within-person, with ≥60 trials required for excellent test-retest reliability. Reliability was comparable in two additional studies across 5-day (N = 29, ICC = 0.74, 30 trials/person) and 1-month (N = 40, ICC = 0.46, 5 trials/person) test-retest intervals. The combination of strong within-person correlations and only modest between-person correlations between the NPS and pain reports indicates that the two measures have different sources of between-person variance. The NPS is not a surrogate for individual differences in pain reports, but can serve as a reliable measure of pain-related physiology and mechanistic target for interventions.
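
The dependence of reliability on trial count described above is often approximated with the classical Spearman-Brown prophecy formula. The sketch below is not the authors' analysis: it assumes trials behave like parallel measurements, and the function names and example values are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the number of trials is multiplied by length_factor.

    Classical Spearman-Brown formula; assumes the added trials are parallel
    measurements of the same quantity (an assumption, not a finding of the study).
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must lie strictly between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def length_factor_for_target(reliability, target):
    """Factor by which the trial count must grow to reach `target` reliability."""
    if not (0.0 < reliability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliability and target must lie strictly between 0 and 1")
    return target * (1.0 - reliability) / (reliability * (1.0 - target))


# Made-up example: a measure with reliability 0.5 at the current trial count.
print(spearman_brown(0.5, 2))              # predicted reliability with twice the trials
print(length_factor_for_target(0.5, 0.8))  # trial multiplier needed to reach 0.8
```

The formula predicts diminishing returns: each additional block of trials raises reliability by less than the previous one, which is one reason a fixed trial threshold can be a useful planning heuristic.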

Learning from the Reliability Paradox: How Theoretically Informed Generative Models Can Advance the Social, Behavioral, and Brain Sciences

10.31234/osf.io/xr7y3 ◽

2020 ◽

Cited By ~ 4

Author(s):

Nathaniel Haines ◽

Peter D. Kvam ◽

Louis H. Irving ◽

Colin Smith ◽

Theodore P. Beauchaine ◽

...

Keyword(s):

Individual Differences ◽

Theory Development ◽

Generative Models ◽

Parameter Estimates ◽

Group Level ◽

Retest Reliability ◽

Individual Level ◽

Generative Modeling ◽

Behavioral Tasks ◽

Test Retest Reliability

Behavioral tasks (e.g., Stroop task) that produce replicable group-level effects (e.g., Stroop effect) often fail to reliably capture individual differences between participants (e.g., low test-retest reliability). This “reliability paradox” has led many researchers to conclude that most behavioral tasks cannot be used to develop and advance theories of individual differences. However, these conclusions are derived from statistical models that provide only superficial summary descriptions of behavioral data, thereby ignoring theoretically relevant data-generating mechanisms that underlie individual-level behavior. More generally, such descriptive methods lack the flexibility to test and develop increasingly complex theories of individual differences. To resolve this theory-description gap, we present generative modeling approaches, which involve using background knowledge to specify how behavior is generated at the individual level, and in turn how the distributions of individual-level mechanisms are characterized at the group level, all in a single joint model. Generative modeling shifts our focus away from estimating descriptive statistical “effects” toward estimating psychologically meaningful parameters, while simultaneously accounting for measurement error that would otherwise attenuate individual difference correlations. Using simulations and empirical data from the Implicit Association Test and Stroop, Flanker, Posner Cueing, and Delay Discounting tasks, we demonstrate how generative models yield (1) higher test-retest reliability estimates, and (2) more theoretically informative parameter estimates relative to traditional statistical approaches. Our results reclaim optimism regarding the utility of behavioral paradigms for testing and advancing theories of individual differences, and emphasize the importance of formally specifying and checking model assumptions to reduce theory-description gaps and facilitate principled theory development.
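
The attenuation mentioned above can be made concrete with Spearman's classical formula, in which the observed correlation equals the true correlation scaled by the square root of the product of the two reliabilities. This is a deliberately simple stand-in, not the hierarchical generative models the abstract advocates; the function names and numbers below are hypothetical.

```python
import math


def attenuated_correlation(true_r, rel_x, rel_y):
    """Correlation expected between two noisy measures.

    true_r: correlation between the error-free scores.
    rel_x, rel_y: reliabilities of the two measures, each in (0, 1].
    """
    return true_r * math.sqrt(rel_x * rel_y)


def disattenuated_correlation(observed_r, rel_x, rel_y):
    """Spearman's correction: estimate of the error-free correlation."""
    return observed_r / math.sqrt(rel_x * rel_y)


# Made-up example: a true correlation of 0.5 measured with two unreliable tasks.
observed = attenuated_correlation(0.5, rel_x=0.36, rel_y=0.64)
print(observed)
print(disattenuated_correlation(observed, rel_x=0.36, rel_y=0.64))
```

The correction divides by estimated reliabilities, so it inherits their sampling error; joint models of the kind described in the abstract instead account for measurement error within a single model.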

Individual differences and test-retest reliability in neural and mood effects of tACS

Brain Stimulation ◽

10.1016/j.brs.2018.12.761 ◽

2019 ◽

Vol 12 (2) ◽

pp. 534

Author(s):

K. Clancy ◽

N. Kartvelishvili ◽

W. Li

Keyword(s):

Individual Differences ◽

Retest Reliability ◽

Mood Effects ◽

Test Retest Reliability

Test–Retest Reliability and the Effects of Walking Speed on Stride Time Variability During Continuous, Overground Walking in Healthy Young Adults

Journal of Applied Biomechanics ◽

10.1123/jab.2020-0138 ◽

2020 ◽

pp. 1-7

Author(s):

Nicholas S. Ryan ◽

Paul A. Bruno ◽

John M. Barden

Keyword(s):

Young Adults ◽

Walking Speed ◽

Time Variability ◽

Clinical Settings ◽

Stride Time ◽

Real Difference ◽

Overground Walking ◽

Retest Reliability ◽

Intraclass Correlations ◽

Test Retest Reliability

Studies have investigated the reliability and effect of walking speed on stride time variability during walking trials performed on a treadmill. The objective of this study was to investigate the reliability of stride time variability and the effect of walking speed on stride time variability, during continuous, overground walking in healthy young adults. Participants completed: (1) 2 walking trials at their preferred walking speed on 1 day and another trial 2 to 4 days later and (2) 1 trial at their preferred walking speed, 1 trial approximately 20% to 25% faster than their preferred walking speed, and 1 trial approximately 20% to 25% slower than their preferred walking speed on a separate day. Data from a waist-mounted accelerometer were used to determine the consecutive stride times for each trial. The reliability of stride time variability outcomes was generally poor (intraclass correlations: .167–.487). Although some significant differences in stride time variability were found between the preferred walking speed, fast, and slow trials, individual between-trial differences were generally below the estimated minimum difference considered to be a real difference. The development of a protocol to improve the reliability of stride time variability outcomes during continuous, overground walking would be beneficial to improve their application in research and clinical settings.
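
A minimum difference "considered to be a real difference" is commonly derived from the standard error of measurement (SEM), which in turn depends on the ICC. The sketch below uses the widely cited formulation SEM = SD × sqrt(1 − ICC) and MD = z × sqrt(2) × SEM; the abstract does not say which computation the authors used, so this formulation, the function names, and the example values are assumptions.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM from the between-participant standard deviation and the ICC."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def minimum_difference(sd, icc, z=1.96):
    """Smallest between-trial change exceeding measurement error.

    z = 1.96 corresponds to a 95% confidence level; sqrt(2) accounts for
    error being present in both of the measurements being compared.
    """
    return z * math.sqrt(2.0) * standard_error_of_measurement(sd, icc)


# Made-up example values, not taken from the study.
print(standard_error_of_measurement(sd=10.0, icc=0.75))
print(minimum_difference(sd=10.0, icc=0.75))
```

With a low ICC the SEM approaches the full standard deviation, so the minimum difference becomes large relative to typical between-trial changes, which is consistent with the pattern the abstract describes.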

Stable individual differences in strategies within, but not between, visual search tasks

Quarterly Journal of Experimental Psychology ◽

10.1177/1747021820929190 ◽

2020 ◽

pp. 174702182092919 ◽

Cited By ~ 1

Author(s):

Alasdair DF Clarke ◽

Jessica L Irons ◽

Warren James ◽

Andrew B Leber ◽

Amelia R Hunt

Keyword(s):

Individual Differences ◽

Visual Search ◽

Search Task ◽

The Other ◽

Retest Reliability ◽

Search Tasks ◽

And Performance ◽

Context Specific ◽

Test Retest Reliability ◽

Over Time

A striking range of individual differences has recently been reported in three different visual search tasks. These differences in performance can be attributed to strategy, that is, the efficiency with which participants control their search to complete the task quickly and accurately. Here, we ask whether an individual’s strategy and performance in one search task is correlated with how they perform in the other two. We tested 64 observers and found that even though the test–retest reliability of the tasks was high, an observer’s performance and strategy in one task was not predictive of their behaviour in the other two. These results suggest search strategies are stable over time, but context-specific. To understand visual search, we therefore need to account not only for differences between individuals but also how individuals interact with the search task and context.

Large-scale analysis of test–retest reliabilities of self-regulation measures

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1818430116 ◽

2019 ◽

Vol 116 (12) ◽

pp. 5472-5477 ◽

Cited By ~ 88

Author(s):

A. Zeynep Enkavi ◽

Ian W. Eisenberg ◽

Patrick G. Bissett ◽

Gina L. Mazza ◽

David P. MacKinnon ◽

...

Keyword(s):

Individual Differences ◽

Large Scale ◽

Response Times ◽

Self Regulation ◽

Self Report ◽

Model Parameters ◽

Retest Reliability ◽

Large Scale Analysis ◽

Behavioral Tasks ◽

Test Retest Reliability

The ability to regulate behavior in service of long-term goals is a widely studied psychological construct known as self-regulation. This wide interest is in part due to the putative relations between self-regulation and a range of real-world behaviors. Self-regulation is generally viewed as a trait, and individual differences are quantified using a diverse set of measures, including self-report surveys and behavioral tasks. Accurate characterization of individual differences requires measurement reliability, a property frequently characterized in self-report surveys, but rarely assessed in behavioral tasks. We remedy this gap by (i) providing a comprehensive literature review on an extensive set of self-regulation measures and (ii) empirically evaluating test–retest reliability of this battery in a new sample. We find that dependent variables (DVs) from self-report surveys of self-regulation have high test–retest reliability, while DVs derived from behavioral tasks do not. This holds both in the literature and in our sample, although the test–retest reliability estimates in the literature are highly variable. We confirm that this is due to differences in between-subject variability. We also compare different types of task DVs (e.g., model parameters vs. raw response times) in their suitability as individual difference DVs, finding that certain model parameters are as stable as raw DVs. Our results provide greater psychometric footing for the study of self-regulation and provide guidance for future studies of individual differences in this domain.

Validation of a general nutrition knowledge questionnaire in a Turkish student sample

Public Health Nutrition ◽

10.1017/s1368980011003594 ◽

2012 ◽

Vol 15 (11) ◽

pp. 2074-2085 ◽

Cited By ~ 24

Author(s):

A Aylin Alsaffar

Keyword(s):

Construct Validity ◽

Undergraduate Students ◽

Eating Habits ◽

Nutrition Knowledge ◽

Internal Reliability ◽

Student Sample ◽

Retest Reliability ◽

Knowledge Questionnaire ◽

Turkish People ◽

Test Retest Reliability

Abstract
Objective: To validate the general nutrition knowledge questionnaire developed by Parmenter and Wardle (1999) in a Turkish student sample.
Design: The original questionnaire of Parmenter and Wardle (1999) was modified and translated into Turkish. The modified questionnaire was administered to second year undergraduate students. Some students completed the questionnaire twice for the measurement of test–retest reliability. Statistical analysis was performed on the responses to measure the internal reliability, test–retest reliability and construct validity.
Setting: Students completed the questionnaire under supervision. The questionnaire was completed at the end of lectures. Retest was carried out two weeks after first administration of the test.
Subjects: A total of 195 undergraduate students studying either nutrition and dietetics (n 90) or engineering (n 105) participated in the study. Of these, 125 students completed the questionnaire on two occasions.
Results: Overall internal reliability (Cronbach's α = 0·89) and test–retest reliability (0·86) were high. Significant differences between the scores of the two groups of students indicated that the questionnaire had satisfactory construct validity.
Conclusions: The modified version of the general nutrition knowledge questionnaire can be used as a tool to examine the nutrition knowledge of adults in Turkey. In the next stage of the study, some adjustments need to be made to the items that led to low reliability values so that these items will be more applicable to the eating habits and patterns of Turkish people.
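
For readers unfamiliar with the internal reliability statistic reported above, the following minimal sketch computes Cronbach's α from a respondents-by-items score matrix using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The function name and the toy responses are hypothetical and unrelated to the study's data.

```python
import numpy as np


def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_respondents, k_items) array-like of scores."""
    x = np.asarray(item_scores, dtype=float)
    n_items = x.shape[1]
    if n_items < 2:
        raise ValueError("at least two items are required")

    # Sample variances (ddof=1) of each item and of the summed total score.
    item_variances = x.var(axis=0, ddof=1)
    total_variance = x.sum(axis=1).var(ddof=1)

    return n_items / (n_items - 1) * (1.0 - item_variances.sum() / total_variance)


# Made-up responses: three items that rank respondents identically.
toy_responses = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(cronbach_alpha(toy_responses))
```

Note that α describes consistency among items within a single administration, whereas test–retest reliability describes the agreement of scores across administrations; the two can diverge.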

Reliability of the Discounting Inventory: An extension into substance-use population

Polish Psychological Bulletin ◽

10.1515/ppb-2017-0033 ◽

2017 ◽

Vol 48 (2) ◽

pp. 293-300 ◽

Cited By ~ 5

Author(s):

Marta Malesza ◽

Maria Maczuga

Keyword(s):

Substance Use ◽

Individual Differences ◽

Psychometric Properties ◽

Internal Consistency ◽

Reliability Measures ◽

Retest Reliability ◽

Social Discounting ◽

Scale Scores ◽

Test Retest Reliability

Abstract: Recent research introduced the Discounting Inventory (DI), which allows the measurement of individual differences in delay, probabilistic, effort, and social discounting rates. The goal of this investigation was to determine several aspects of the reliability of the Discounting Inventory using the responses of 385 participants (200 non-smokers and 185 current smokers). Two types of reliability are of interest: internal consistency and test-retest stability. A secondary aim was to extend such reliability measures beyond non-clinical participants. The current study aimed to measure the reliability of the DI in nicotine-dependent and non-nicotine-dependent individuals. It is concluded that the internal consistency of the DI is excellent, and that the test-retest reliability results suggest that items intended to measure three types of discounting were likely testing trait, rather than state, factors, regardless of whether “non-smokers” were included in, or excluded from, the analyses (probabilistic discounting scale scores being the exception). With these cautions in mind, however, the psychometric properties of the DI appear to be very good.
