Large-scale analysis of test–retest reliabilities of self-regulation measures

2019 ◽  
Vol 116 (12) ◽  
pp. 5472-5477 ◽  
Author(s):  
A. Zeynep Enkavi ◽  
Ian W. Eisenberg ◽  
Patrick G. Bissett ◽  
Gina L. Mazza ◽  
David P. MacKinnon ◽  
...  

The ability to regulate behavior in service of long-term goals is a widely studied psychological construct known as self-regulation. This wide interest is in part due to the putative relations between self-regulation and a range of real-world behaviors. Self-regulation is generally viewed as a trait, and individual differences are quantified using a diverse set of measures, including self-report surveys and behavioral tasks. Accurate characterization of individual differences requires measurement reliability, a property frequently characterized in self-report surveys, but rarely assessed in behavioral tasks. We remedy this gap by (i) providing a comprehensive literature review on an extensive set of self-regulation measures and (ii) empirically evaluating test–retest reliability of this battery in a new sample. We find that dependent variables (DVs) from self-report surveys of self-regulation have high test–retest reliability, while DVs derived from behavioral tasks do not. This holds both in the literature and in our sample, although the test–retest reliability estimates in the literature are highly variable. We confirm that this is due to differences in between-subject variability. We also compare different types of task DVs (e.g., model parameters vs. raw response times) in their suitability as individual difference DVs, finding that certain model parameters are as stable as raw DVs. Our results provide greater psychometric footing for the study of self-regulation and provide guidance for future studies of individual differences in this domain.
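To make the abstract's central point concrete, here is a minimal simulation sketch (hypothetical numbers, not the authors' data or pipeline) of why test-retest reliability tracks between-subject variability: with measurement noise held constant, a DV with little true spread across people yields a much lower retest correlation.

```python
# Sketch: test-retest reliability as the correlation between two sessions,
# which shrinks when between-subject variance is small relative to noise.
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 150

def simulate_retest(between_sd, noise_sd):
    """Simulate two sessions of a measure with a stable trait component."""
    trait = rng.normal(0, between_sd, n_subjects)      # true individual differences
    t1 = trait + rng.normal(0, noise_sd, n_subjects)   # session 1
    t2 = trait + rng.normal(0, noise_sd, n_subjects)   # session 2
    return np.corrcoef(t1, t2)[0, 1]

# Survey-like DV: wide true spread relative to measurement noise.
print(f"wide spread:   r = {simulate_retest(between_sd=1.0, noise_sd=0.5):.2f}")
# Task-like DV: same noise, but little true between-subject variability.
print(f"narrow spread: r = {simulate_retest(between_sd=0.2, noise_sd=0.5):.2f}")
```

In expectation the retest correlation equals between-subject variance divided by total variance, roughly 0.80 in the first case and 0.14 in the second, even though the measurement noise is identical.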


2020 ◽  
Author(s):  
Nathaniel Haines ◽  
Peter D. Kvam ◽  
Louis H. Irving ◽  
Colin Smith ◽  
Theodore P. Beauchaine ◽  
...  

Behavioral tasks (e.g., Stroop task) that produce replicable group-level effects (e.g., Stroop effect) often fail to reliably capture individual differences between participants (e.g., low test-retest reliability). This “reliability paradox” has led many researchers to conclude that most behavioral tasks cannot be used to develop and advance theories of individual differences. However, these conclusions are derived from statistical models that provide only superficial summary descriptions of behavioral data, thereby ignoring theoretically relevant data-generating mechanisms that underlie individual-level behavior. More generally, such descriptive methods lack the flexibility to test and develop increasingly complex theories of individual differences. To resolve this theory-description gap, we present generative modeling approaches, which involve using background knowledge to specify how behavior is generated at the individual level, and in turn how the distributions of individual-level mechanisms are characterized at the group level—all in a single joint model. Generative modeling shifts our focus away from estimating descriptive statistical “effects” toward estimating psychologically meaningful parameters, while simultaneously accounting for measurement error that would otherwise attenuate individual difference correlations. Using simulations and empirical data from the Implicit Association Test and Stroop, Flanker, Posner Cueing, and Delay Discounting tasks, we demonstrate how generative models yield (1) higher test-retest reliability estimates, and (2) more theoretically informative parameter estimates relative to traditional statistical approaches. Our results reclaim optimism regarding the utility of behavioral paradigms for testing and advancing theories of individual differences, and emphasize the importance of formally specifying and checking model assumptions to reduce theory-description gaps and facilitate principled theory development.
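The attenuation mechanism this abstract points to can be illustrated with a toy simulation (hypothetical numbers, not the authors' hierarchical Bayesian model): trial-level noise deflates observed individual-difference correlations, and correcting for reliability, here via a simple split-half estimate and Spearman's disattenuation formula as a stand-in for joint generative modeling, recovers the true value.

```python
# Toy illustration: trial noise attenuates test-retest correlations, and a
# reliability correction recovers the latent correlation -- the same job a
# hierarchical generative model performs within a single joint model.
import numpy as np

rng = np.random.default_rng(1)
n_sub, n_trials = 200, 40
true_effect = rng.normal(50, 20, n_sub)          # true per-person effect (ms)

def session(effects, sd=120.0):
    """Trial-level effects for one session (subjects x trials)."""
    return effects[:, None] + rng.normal(0, sd, (n_sub, n_trials))

s1, s2 = session(true_effect), session(true_effect)
m1, m2 = s1.mean(axis=1), s2.mean(axis=1)

def split_half(trials):
    """Odd/even split-half reliability with Spearman-Brown correction."""
    r = np.corrcoef(trials[:, 0::2].mean(1), trials[:, 1::2].mean(1))[0, 1]
    return 2 * r / (1 + r)

rel1, rel2 = split_half(s1), split_half(s2)
r_obs = np.corrcoef(m1, m2)[0, 1]
print(f"observed test-retest r: {r_obs:.2f}")              # ~0.5, attenuated
print(f"disattenuated estimate: {r_obs / np.sqrt(rel1 * rel2):.2f}")  # ~1.0
```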


2020 ◽  
Author(s):  
Alexander Samuel Weigard ◽  
D. Angus Clark ◽  
Chandra Sripada

Conflict tasks play a central role in the study of self-control. These tasks feature a condition assumed to demand top-down control and a matched condition where control demands are assumed to be absent, and individual differences in control ability are indexed by subtracting measures of performance (e.g., reaction time) across these conditions. Subtraction-based metrics of top-down control have recently been criticized for having low test-retest reliability, weak intercorrelations across conceptually similar tasks, and weak relationships with self-report measures of self-control. Concurrently, there is growing evidence that task-general cognitive efficiency, indexed by the drift rate parameter of the diffusion model (Ratcliff, 1978), constitutes a cohesive, reliable individual difference dimension. However, no previous studies have examined the measurement properties of subtraction metrics of top-down control as compared to drift rate in the same sample, or compared their respective associations with self-report measures. In this re-analysis of open data drawn from a large recent study (Eisenberg et al., 2019; N=522), we find that subtraction metrics fail to form cohesive latent factors, that the resulting factors have poor test-retest reliability, and that they exhibit tenuous connections to questionnaire measures of self-control. In contrast, cognitive efficiency measures from the same tasks form a robust, reliable latent factor that shows moderate associations with self-control. Importantly, this latent cognitive efficiency variable is constructed from conditions that both were, and were not, previously assumed to index control. These findings invite a reconceptualization of subtraction-based tasks, pointing to task-general efficiency as a central individual difference dimension relevant to self-regulation.
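The two DV types contrasted here can be computed side by side. The sketch below uses the closed-form EZ-diffusion approximation (Wagenmakers et al., 2007) as a stand-in for full diffusion-model fitting, which is not the estimation method of the original studies, and the per-condition summaries are hypothetical.

```python
# Sketch: a subtraction score (incongruent minus congruent RT) versus a
# diffusion-model drift rate estimated from accuracy and RT variance.
import numpy as np

def ez_drift_rate(prop_correct, rt_var, s=0.1):
    """EZ-diffusion drift rate from accuracy and correct-RT variance (s^2)."""
    pc = np.clip(prop_correct, 1e-4, 1 - 1e-4)   # guard against 0/1 accuracy
    L = np.log(pc / (1 - pc))                    # logit of accuracy
    x = L * (L * pc**2 - L * pc + pc - 0.5) / rt_var
    return np.sign(pc - 0.5) * s * x**0.25

# Hypothetical per-condition summaries for one participant (RTs in seconds).
congruent   = {"prop_correct": 0.97, "rt_var": 0.012, "mean_rt": 0.52}
incongruent = {"prop_correct": 0.93, "rt_var": 0.020, "mean_rt": 0.61}

subtraction_metric = incongruent["mean_rt"] - congruent["mean_rt"]
v_con = ez_drift_rate(congruent["prop_correct"], congruent["rt_var"])
v_inc = ez_drift_rate(incongruent["prop_correct"], incongruent["rt_var"])

print(f"subtraction metric: {subtraction_metric * 1000:.0f} ms")
print(f"drift rates: congruent {v_con:.3f}, incongruent {v_inc:.3f}")
```

The paper's point is that drift rates from both conditions, pooled into a task-general efficiency factor, behave as a cohesive individual difference, whereas the difference score does not.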


2000 ◽  
Vol 16 (1) ◽  
pp. 53-58 ◽  
Author(s):  
Hans Ottosson ◽  
Martin Grann ◽  
Gunnar Kullgren

Summary: Short-term stability or test-retest reliability of self-reported personality traits is likely to be biased if the respondent is affected by a depressive or anxiety state. However, in some studies, DSM-oriented self-reported instruments have proved to be reasonably stable in the short term, regardless of co-occurring depressive or anxiety disorders. In the present study, we examined the short-term test-retest reliability of a new self-report questionnaire for personality disorder diagnosis (DIP-Q) on a clinical sample of 30 individuals, having either a depressive, an anxiety, or no axis-I disorder. Test-retest scorings from subjects with depressive disorders were mostly unstable, with a significant change in fulfilled criteria between entry and retest for three out of ten personality disorders: borderline, avoidant and obsessive-compulsive personality disorder. Scorings from subjects with anxiety disorders were unstable only for cluster C and dependent personality disorder items. In the absence of co-morbid depressive or anxiety disorders, mean dimensional scores of DIP-Q showed no significant differences between entry and retest. Overall, the effect from state on trait scorings was moderate, and it is concluded that test-retest reliability for DIP-Q is acceptable.
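The stability check described here, testing whether dimensional scores differ between entry and retest, reduces to a paired comparison. A minimal sketch with hypothetical data (not the DIP-Q dataset):

```python
# Sketch: paired t-test for a significant change in dimensional scores
# between entry and short-term retest (n.s. -> scores are stable).
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(2)
entry = rng.normal(4.0, 1.5, 30)            # entry dimensional scores, n = 30
retest = entry + rng.normal(0.0, 0.8, 30)   # retest scores a short time later

t, p = ttest_rel(entry, retest)
print(f"paired t-test: t = {t:.2f}, p = {p:.3f}")
```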


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Adam Polnay ◽  
Helen Walker ◽  
Christopher Gallacher

Purpose: Relational dynamics between patients and staff in forensic settings can be complicated and demanding for both sides. Reflective practice groups (RPGs) bring clinicians together to reflect on these dynamics. To date, evaluation of RPGs has lacked quantitative focus and a suitable quantitative tool. Therefore, a self-report tool was designed. This paper aims to pilot The Relational Aspects of CarE (TRACE) scale with clinicians in a high-secure hospital and investigate its psychometric properties. Design/methodology/approach: A multi-professional sample of 80 clinicians was recruited, completing the TRACE and the Attitudes to Personality Disorder Questionnaire (APDQ). Exploratory factor analysis (EFA) determined factor structure and internal consistency of TRACE. A subset was selected to measure test–retest reliability. TRACE was cross-validated against the APDQ. Findings: EFA found five factors underlying the 20 TRACE items: “awareness of common responses,” “discussing and normalising feelings,” “utilising feelings,” “wish to care,” and “awareness of complicated affects.” This factor structure is complex, but items clustered logically to the key areas originally used to generate them. Internal consistency (α = 0.66, 95% confidence interval (CI) = 0.55–0.76) demonstrated borderline acceptability. TRACE demonstrated good test–retest reliability (intra-class correlation = 0.94, 95% CI = 0.78–0.98) and face validity. TRACE indicated a slight negative correlation with the APDQ. A larger data set is needed to substantiate these preliminary findings. Practical implications: Early indications suggested TRACE was valid and reliable, suitable to measure the effectiveness of reflective practice. Originality/value: The TRACE was a distinctive measure that filled a methodological gap in the literature.
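The internal-consistency statistic reported above is Cronbach's alpha. Here is a minimal sketch of its computation on hypothetical item responses (not the TRACE data):

```python
# Sketch: Cronbach's alpha for an items-by-respondents matrix.
import numpy as np

def cronbach_alpha(items):
    """items: 2D array, rows = respondents, columns = scale items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total score
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(3)
latent = rng.normal(0, 1, (80, 1))                 # one shared factor
responses = latent + rng.normal(0, 3.0, (80, 20))  # 20 noisy items, 80 raters
print(f"alpha = {cronbach_alpha(responses):.2f}")
```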


2005 ◽  
Vol 11 (3) ◽  
pp. 338-342 ◽  
Author(s):  
Ruth Ann Marrie ◽  
Gary Cutter ◽  
Tuula Tyry ◽  
Olympia Hadjimichael ◽  
Timothy Vollmer

The North American Research Committee on Multiple Sclerosis (NARCOMS) Registry is a multiple sclerosis (MS) self-report registry with more than 24 000 participants. Participants report disability status upon enrolment, and semi-annually thereafter, using Performance Scales (PS), Patient Determined Disease Steps (PDDS) and a pain question. In November 2000 and 2001, we also collected the Pain Effects Scale (PES). Our aim was to validate the NARCOMS pain question using the PES as our criterion measure. We measured correlations between the pain question and age, disease duration, various PS subscales and PDDS to assess construct validity. We correlated pain question responses in participants who reported no change in PDDS or the PS subscales between questionnaires to determine test-retest reliability. We measured responsiveness in participants who reported a substantial change in the sensory and spasticity PS subscales. The correlation between the pain question and PES was r=0.61 in November 2000, and r=0.64 in November 2001 (both P<0.0001). Correlations of the pain question with age and disease duration were low, indicating divergent validity. Correlations between the pain question and the spasticity and sensory PS subscales and PDDS were moderate, indicating convergent validity. Test-retest reliability was r=0.84 (P<0.0001). Responsiveness was 70.7%. The pain question is a valid self-report measure of pain in MS.
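The validity logic above, strong correlation with the criterion, weak correlation with unrelated variables, can be sketched in a few lines with hypothetical data (not the NARCOMS registry):

```python
# Sketch: convergent vs divergent validity as a pair of correlations.
import numpy as np

rng = np.random.default_rng(4)
n = 500
true_pain = rng.normal(0, 1, n)
pain_question = true_pain + rng.normal(0, 0.8, n)  # single-item pain report
pes = true_pain + rng.normal(0, 0.7, n)            # criterion scale (PES-like)
age = rng.normal(50, 12, n)                        # theoretically unrelated

print(f"convergent r (criterion): {np.corrcoef(pain_question, pes)[0, 1]:.2f}")
print(f"divergent  r (age):       {np.corrcoef(pain_question, age)[0, 1]:.2f}")
```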


2020 ◽  
Vol 117 (32) ◽  
pp. 19061-19071 ◽  
Author(s):  
Samantha Joel ◽  
Paul W. Eastwick ◽  
Colleen J. Allison ◽  
Ximena B. Arriaga ◽  
Zachary G. Baker ◽  
...  

Given the powerful implications of relationship quality for health and well-being, a central mission of relationship science is explaining why some romantic relationships thrive more than others. This large-scale project used machine learning (i.e., Random Forests) to 1) quantify the extent to which relationship quality is predictable and 2) identify which constructs reliably predict relationship quality. Across 43 dyadic longitudinal datasets from 29 laboratories, the top relationship-specific predictors of relationship quality were perceived-partner commitment, appreciation, sexual satisfaction, perceived-partner satisfaction, and conflict. The top individual-difference predictors were life satisfaction, negative affect, depression, attachment avoidance, and attachment anxiety. Overall, relationship-specific variables predicted up to 45% of variance at baseline, and up to 18% of variance at the end of each study. Individual differences also performed well (21% and 12%, respectively). Actor-reported variables (i.e., own relationship-specific and individual-difference variables) predicted two to four times more variance than partner-reported variables (i.e., the partner’s ratings on those variables). Importantly, individual differences and partner reports had no predictive effects beyond actor-reported relationship-specific variables alone. These findings imply that the sum of all individual differences and partner experiences exert their influence on relationship quality via a person’s own relationship-specific experiences, and effects due to moderation by individual differences and moderation by partner-reports may be quite small. Finally, relationship-quality change (i.e., increases or decreases in relationship quality over the course of a study) was largely unpredictable from any combination of self-report variables. This collective effort should guide future models of relationships.
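A minimal sketch of the analytic approach named in this abstract: a Random Forest predicting a continuous outcome from many self-report predictors, scored by cross-validated variance explained. The variables and numbers below are hypothetical stand-ins, not the 43-dataset pipeline.

```python
# Sketch: Random Forest regression with out-of-sample R^2, the "percent of
# variance predicted" metric the abstract reports.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n = 400
# Hypothetical predictors (e.g., perceived partner commitment, appreciation).
X = rng.normal(0, 1, (n, 10))
# Outcome depends on the first three predictors plus noise.
relationship_quality = X[:, :3].sum(axis=1) + rng.normal(0, 1.5, n)

model = RandomForestRegressor(n_estimators=500, random_state=0)
r2 = cross_val_score(model, X, relationship_quality, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2.mean():.2f}")
```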


2021 ◽  
Author(s):  
Xiaochun Han ◽  
Yoni K. Ashar ◽  
Philip Kragel ◽  
Bogdan Petre ◽  
Victoria Schelkun ◽  
...  

Identifying biomarkers that predict mental states with large effect sizes and high test-retest reliability is a growing priority for fMRI research. We examined a well-established multivariate brain measure that tracks pain induced by nociceptive input, the Neurologic Pain Signature (NPS). In N = 295 participants across eight studies, NPS responses showed a very large effect size in predicting within-person single-trial pain reports (d = 1.45) and medium effect size in predicting individual differences in pain reports (d = 0.49, average r = 0.20). The NPS showed excellent short-term (within-day) test-retest reliability (ICC = 0.84, with average 69.5 trials/person). Reliability scaled with the number of trials within-person, with ≥60 trials required for excellent test-retest reliability. Reliability was comparable in two additional studies across 5-day (N = 29, ICC = 0.74, 30 trials/person) and 1-month (N = 40, ICC = 0.46, 5 trials/person) test-retest intervals. The combination of strong within-person correlations and only modest between-person correlations between the NPS and pain reports indicates that the two measures have different sources of between-person variance. The NPS is not a surrogate for individual differences in pain reports, but can serve as a reliable measure of pain-related physiology and mechanistic target for interventions.
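The trials-versus-reliability tradeoff reported here follows the Spearman-Brown prophecy formula, which projects the reliability of a k-trial average from single-trial reliability. A minimal sketch with an assumed single-trial value (illustrative, not the NPS data):

```python
# Sketch: Spearman-Brown projection of reliability as trial count grows.
def spearman_brown(rel_1, k):
    """Projected reliability of a k-trial average from 1-trial reliability."""
    return k * rel_1 / (1 + (k - 1) * rel_1)

rel_single_trial = 0.08  # assumed single-trial reliability
for k in (5, 30, 60, 100):
    print(f"{k:>3} trials -> projected ICC ~ {spearman_brown(rel_single_trial, k):.2f}")
```

With these assumed numbers the projection crosses the conventional 0.75 "excellent" threshold somewhere between 30 and 60 trials, mirroring the abstract's finding that roughly 60 or more trials are needed.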


2020 ◽  
Author(s):  
Julia Velten ◽  
Gerrit Hirschfeld ◽  
Milena Meyers ◽  
Jürgen Margraf

Background: The Sexual Interest and Desire Inventory Female (SIDI-F) is a clinician-administered scale that allows for a comprehensive assessment of symptoms related to Hypoactive Sexual Desire Dysfunction (HSDD). As self-report questionnaires may facilitate less socially desirable responding, and as time and resources are scarce in many clinical and research settings, a self-report version (SIDI-F-SR) was developed. Aim: To investigate the agreement between the SIDI-F and the SIDI-F-SR and to assess the psychometric properties of the SIDI-F-SR. Methods: A total of 170 women (mean age = 36.61 years, SD = 10.61, range = 20-69) with HSDD provided data on the SIDI-F, administered by a clinical psychologist via telephone, and the SIDI-F-SR, delivered as an Internet-based questionnaire. A subset of 19 women answered the SIDI-F-SR twice over a period of 14 weeks. Outcomes: Intraclass correlation and predictors of absolute agreement between the SIDI-F and SIDI-F-SR were examined, along with the internal consistency, test-retest reliability, and criterion-related validity of the SIDI-F-SR. Results: There was high agreement between the SIDI-F and SIDI-F-SR (ICC=.86). On average, women scored about one point higher on the self-report than on the clinician-administered scale. Agreement was higher in younger women and in those with severe symptoms. Internal consistency of the SIDI-F-SR was acceptable (α=.76) and comparable to the SIDI-F (α=.74). When corrections for restriction of range were applied, internal consistency of the SIDI-F-SR increased to .91. Test-retest reliability was good (r=.74). Criterion-related validity was low but comparable between the SIDI-F and SIDI-F-SR.
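The agreement statistic used here is an absolute-agreement intraclass correlation. A minimal sketch of a two-way random-effects, absolute-agreement, single-measures ICC (Shrout & Fleiss ICC(2,1)) on hypothetical clinician vs. self-report scores, including a constant offset like the one-point difference reported above:

```python
# Sketch: ICC(2,1) -- absolute agreement penalizes systematic offsets
# between administration modes, unlike a plain Pearson correlation.
import numpy as np

def icc_2_1(Y):
    """Y: subjects x raters matrix of scores."""
    n, k = Y.shape
    grand = Y.mean()
    ms_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    resid = Y - Y.mean(1, keepdims=True) - Y.mean(0, keepdims=True) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

rng = np.random.default_rng(6)
true_score = rng.normal(25, 8, 170)
clinician = true_score + rng.normal(0, 3, 170)        # interview-administered
self_report = true_score + 1 + rng.normal(0, 3, 170)  # ~1 point higher
print(f"ICC(2,1) = {icc_2_1(np.column_stack([clinician, self_report])):.2f}")
```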


2000 ◽  
Vol 87 (3) ◽  
pp. 750-752 ◽  
Author(s):  
J. E. Hovens ◽  
I. Bramsen ◽  
H. M. van der Ploeg ◽  
I. E. W. Reuling

Three groups of first-year male and female medical students (total N = 90) completed the Trauma and Life Events Self-report Inventory twice. Test-retest reliability for the three different time periods was .82, .89, and .75, respectively.

