Standard error in the Jacobson and Truax Reliable Change Index: The “classical approach” leads to poor estimates

2004 ◽  
Vol 10 (6) ◽  
pp. 899-901 ◽  
Author(s):  
NANCY R. TEMKIN

Different authors have used different estimates of variability in the denominator of the Reliable Change Index (RCI). Maassen attempts to clarify some of the differences and the assumptions underlying them. In particular, he compares the ‘classical’ approach using an estimate SEd supposedly based on measurement error alone with an estimate SDiff based on the variability of observed differences in a population that should have no true change. Maassen concludes that SEd is not only based on classical theory but also properly estimates variability due to measurement error and practice effect, while SDiff overestimates variability by counting the variability due to practice twice. Simulations show Maassen to be wrong on both counts. With an error rate nominally set to 10%, RCI estimates using SDiff wrongly declare change in 10.4% and 9.4% of simulated cases without true change, while estimates using SEd wrongly declare change in 17.5% and 12.3% of the simulated cases (p < .000000001 and p < .008, respectively). In the simulation that separates measurement error and practice effects, SEd estimates the variability of change due to measurement error to be .34, when the true variability due to measurement error was .014. Neuropsychologists should not use SEd in the denominator of the RCI. (JINS, 2004, 10, 899–901.)
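A Monte Carlo sketch in the spirit of the simulations described above illustrates why the choice of denominator matters. All parameter values here are invented for illustration, not Temkin's actual simulation settings; the key ingredient is a practice effect that varies across individuals, which SDiff captures and the classical SEd does not.

```python
import numpy as np

# Illustrative false-positive comparison: SDiff (SD of observed
# differences) vs. the 'classical' SEd = SD1 * sqrt(2 * (1 - r12)).
rng = np.random.default_rng(0)
n_people, n_sims = 100, 2000
sd_true, sd_err, sd_prac = 10.0, 3.0, 4.0  # ability, error, practice spread
z_crit = 1.645                             # nominal two-tailed 10% error rate

fp_sdiff = fp_sed = 0.0
for _ in range(n_sims):
    ability = rng.normal(0, sd_true, n_people)
    x1 = ability + rng.normal(0, sd_err, n_people)
    # Retest with no true change: mean practice gain of 5 that
    # varies from person to person.
    x2 = (ability + 5 + rng.normal(0, sd_prac, n_people)
          + rng.normal(0, sd_err, n_people))
    d = x2 - x1
    d_c = d - d.mean()              # practice-corrected differences
    s_diff = d.std(ddof=1)          # SDiff: SD of observed differences
    r12 = np.corrcoef(x1, x2)[0, 1]
    se_d = x1.std(ddof=1) * np.sqrt(2 * (1 - r12))  # classical SEd
    fp_sdiff += np.mean(np.abs(d_c) > z_crit * s_diff)
    fp_sed += np.mean(np.abs(d_c) > z_crit * se_d)

rate_sdiff, rate_sed = fp_sdiff / n_sims, fp_sed / n_sims
print(rate_sdiff, rate_sed)  # SEd's false-positive rate exceeds SDiff's
```

Under this model SDiff holds close to the nominal 10% rate, while SEd exceeds it, mirroring the direction of the abstract's 10.4%/9.4% versus 17.5%/12.3% contrast.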

2004 ◽  
Vol 10 (6) ◽  
pp. 888-893 ◽  
Author(s):  
GERARD H. MAASSEN

Researchers and clinicians using Jacobson and Truax's index to assess the reliability of change in patients, or its counterpart by Chelune et al., which takes practice effects into account, are confused by the different ways of calculating the standard error encountered in the literature (see the discussion started in this journal by Hinton-Bayre). This article compares the characteristics of (1) the standard error used by Jacobson and Truax, (2) the standard error of difference scores used by Temkin et al., and (3) an adaptation of Jacobson and Truax's approach that accounts for the difference between initial and final variance. It is theoretically demonstrated that the last variant is preferable, a conclusion corroborated by real data. (JINS, 2004, 10, 888–893.)


2007 ◽  
Vol 23 (3) ◽  
pp. 157-165 ◽  
Author(s):  
Carmen Hagemeister

Abstract. When concentration tests are completed repeatedly, reaction time and error rate decrease considerably, but the underlying ability does not improve. To address this validity problem, this study tested whether the practice effect between tests and within tests can be used to determine whether persons have already completed a given test. The power law of practice postulates that practice effects are greater in unpracticed than in practiced persons. Two experiments were carried out in which the participants completed the same tests at the beginning and at the end of two test sessions set about 3 days apart. In both experiments, logistic regression could indeed classify persons according to previous practice through the practice effect between the tests at the beginning and at the end of the session and, less well but still significantly, through the practice effect within the first test of the session. Further analyses showed that the practice effects correlated more highly with initial performance than was to be expected for mathematical reasons; typically, persons with long reaction times have larger practice effects. Thus, small practice effects alone do not allow one to conclude that a person has worked on the test before.


1999 ◽  
Vol 5 (4) ◽  
pp. 357-369 ◽  
Author(s):  
NANCY R. TEMKIN ◽  
ROBERT K. HEATON ◽  
IGOR GRANT ◽  
SUREYYA S. DIKMEN

A major use of neuropsychological assessment is to measure changes in functioning over time; that is, to determine whether a difference in test performance indicates a real change in the individual or just chance variation. Using 7 illustrative test measures and retest data from 384 neurologically stable adults, this paper compares different methods of predicting retest scores, and of determining whether observed changes in performance are unusual. The methods include the Reliable Change Index, with and without correction for practice effect, and models based upon simple and multiple regression. For all test variables, the most powerful predictor of follow-up performance was initial performance. Adding demographic variables and overall neuropsychological competence at baseline significantly but slightly improved prediction of all follow-up scores. The simple Reliable Change Index without correction for practice performed least well, with high error rates and large prediction intervals (confidence intervals). Overall prediction accuracy was similar for the other three methods; however, different models produce large differences in predicted scores for some individuals, especially those with extremes of initial test performance, overall competency, or demographics. All 5 measures from the Halstead–Reitan Battery had residual (observed − predicted score) variability that increased with poorer initial performance. Two variables showed significant nonnormality in the distribution of residuals. For accurate prediction with smallest prediction–confidence intervals, we recommend multiple regression models with attention to differential variability and nonnormality of residuals. (JINS, 1999, 5, 357–369.)
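The uncorrected and practice-corrected RCI variants compared in this study can be written out directly. The following is a generic sketch of the Jacobson–Truax form and the Chelune et al. practice adjustment, with made-up numbers rather than the paper's data:

```python
def rci(x1, x2, s_diff, mean_practice=0.0):
    """Reliable Change Index: mean_practice=0.0 gives the uncorrected
    Jacobson-Truax form; a nonzero value gives the Chelune et al.
    practice-effect correction."""
    return (x2 - x1 - mean_practice) / s_diff

# Hypothetical norms from neurologically stable retest controls:
s_diff = 4.0         # SD of test-retest difference scores
mean_practice = 3.0  # average retest gain attributable to practice

uncorrected = rci(50, 56, s_diff)               # 1.5
corrected = rci(50, 56, s_diff, mean_practice)  # 0.75: gain is mostly practice
print(uncorrected, corrected)
```

The example shows why the correction matters: an apparent 6-point gain shrinks to an unremarkable z of 0.75 once the average practice effect is removed.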


2000 ◽  
Vol 6 (3) ◽  
pp. 364-364 ◽  
Author(s):  
NANCY R. TEMKIN ◽  
ROBERT K. HEATON ◽  
IGOR GRANT ◽  
SUREYYA S. DIKMEN

Hinton-Bayre (2000) raises a point that may occur to many readers who are familiar with the Reliable Change Index (RCI). In our previous paper comparing four models for detecting significant change in neuropsychological performance (Temkin et al., 1999), we used a formula for calculating Sdiff, the measure of variability for the test–retest difference, that differs from the one Hinton-Bayre has seen employed in other studies of the RCI. In fact, there are two ways of calculating Sdiff—a direct method and an approximate method. As stated by Jacobson and Truax (1991, p. 14), the direct method is to compute “the standard error of the difference between the two test scores” or equivalently √(s1² + s2² − 2s1s2rxx′), where si is the standard deviation at time i and rxx′ is the test–retest correlation or reliability coefficient. Jacobson and Truax also provide a formula for the approximation of Sdiff when one does not have access to retest data on the population of interest, but does have a test–retest reliability coefficient and an estimate of the cross-sectional standard deviation, i.e., the standard deviation at a single point in time. This approximation assumes that the standard deviations at Time 1 and Time 2 are equal, which may be close to true in many cases. Since we had the longitudinal data to directly calculate the standard error of the difference between scores at Time 1 and Time 2, we used the direct method. Which method is preferable? When the needed data are available, it is the one we used.
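In code, the direct formula and the Jacobson–Truax approximation look like this (a sketch with illustrative values; the function names are mine):

```python
import math

def s_diff_direct(s1, s2, r):
    # Standard error of the difference from both time points' SDs
    # and the test-retest correlation r
    return math.sqrt(s1**2 + s2**2 - 2 * s1 * s2 * r)

def s_diff_approx(s1, r):
    # Jacobson & Truax approximation: assumes the SDs at Time 1 and
    # Time 2 are equal, so one cross-sectional SD suffices
    return s1 * math.sqrt(2 * (1 - r))

print(s_diff_direct(10, 10, 0.9))  # 4.472..., identical to the approximation
print(s_diff_approx(10, 0.9))      # 4.472...
print(s_diff_direct(10, 12, 0.9))  # 5.291...: the two diverge when SDs differ
```

The last line is the crux of the reply: when Time 1 and Time 2 standard deviations differ, only the direct method reflects the actual variability of difference scores.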


2021 ◽  
Author(s):  
Andrew Athan McAleavey

The reliable change index (RCI) is a widely used statistical tool designed to account for measurement error when evaluating difference scores. Because of its conceptual simplicity and computational ease, it persists in research and applied psychology. However, researchers have repeatedly demonstrated ways in which the RCI is insufficient or invalid for various applications, which is troubling given how widely the tool is used in research and clinical psychology. The aims of this manuscript are to describe, in non-technical terms, the formulation and assumptions of the RCI; to offer guidance on when the RCI is (and is not) appropriate; and to identify what is needed to calculate the RCI properly when it is used. Several criteria are identified to help determine whether the RCI is appropriate for a specific use. It is apparent that the RCI is the best available method in only a small number of situations, is frequently miscalculated, and produces incorrect inferences more often than simple alternatives, largely because it is highly insensitive to real changes. Specific alternatives are offered that may better operationalize common inferential tasks, including when more than two observations are available and when false negatives are as costly as false positives.


Author(s):  
Dustin B Hammers ◽  
Kevin Duff

Abstract Objective: This study attempted to clarify the applicability of standard error (SE) terms in clinical research when examining the impact of short-term practice effects on cognitive performance via reliable change methodology. Method: This study compared McSweeney's SE of the estimate (SEest) to Crawford and Howell's SE for prediction of the regression (SEpred) using a developmental sample of 167 participants with either normal cognition or mild cognitive impairment (MCI) assessed twice over 1 week. Using these SEs, previously published standardized regression-based (SRB) reliable change prediction equations were then applied to an independent sample of 143 participants with MCI. Results: This clinical developmental sample yielded nearly identical SE values (e.g., 3.697 vs. 3.719 for HVLT-R Total Recall SEest and SEpred, respectively), and the resultant SRB-based discrepancy z scores were comparable and strongly correlated (r = 1.0, p < .001). Observed follow-up scores for our sample with MCI were consistently below expectation compared to predictions based on Duff's SRB algorithms. Conclusions: These results appear to replicate and extend previous work showing that the calculation of the SEest and SEpred from a clinical sample of cognitively intact and MCI participants yields similar values and can be incorporated into SRB reliable change statistics with comparable results. As a result, neuropsychologists utilizing reliable change methods in research investigation (or clinical practice) should carefully balance mathematical accuracy and ease of use, among other factors, when determining which SE metric to use.
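Why the two SE terms come out nearly identical can be seen from their standard regression relationship: SEpred inflates SEest by a factor that shrinks toward 1 as n grows. A sketch using the abstract's HVLT-R value, evaluated at the sample mean of the predictor for simplicity (the omitted distance-from-mean term is why the reported 3.719 is slightly larger):

```python
import math

n = 167         # developmental sample size (from the abstract)
se_est = 3.697  # HVLT-R Total Recall SE of the estimate (from the abstract)

# SEpred = SEest * sqrt(1 + 1/n + (x0 - mean_x)**2 / sum_sq_x);
# the last term is 0 for a baseline score at the sample mean.
se_pred_at_mean = se_est * math.sqrt(1 + 1 / n)
print(se_pred_at_mean)  # ~3.708, within rounding of the reported 3.719
```

With n = 167 the inflation factor is under 1%, which is why the two SEs, and the z scores built from them, are practically interchangeable here.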


2021 ◽  
Author(s):  
Ron D. Hays ◽  
Mary E. Slaughter ◽  
Karen L. Spritzer ◽  
Patricia M. Herman

Abstract Background: Identifying how many individuals significantly improve (“responders”) provides important supplementary information, beyond group mean change, about the effects of treatment options. This supplemental information can enhance interpretation of clinical trials and observational studies. This study provides a comparison of five ways of estimating the significance of individual change. Methods: Secondary analyses of the Impact Stratification Score (ISS) for chronic low back pain, which was administered at two timepoints in two samples: 1) three months apart in an observational study of 1,680 patients undergoing chiropractic care; and 2) 6 weeks apart in a randomized trial of 720 active-duty military personnel with low back pain. The ISS is the sum of the PROMIS-29 v2.1 physical function, pain interference, and pain intensity scores and has a possible range of 8 (least impact) to 50 (greatest impact). The five methods of evaluating individual change compared were: 1) standard deviation index; 2) standard error of measurement (SEM); 3) standard error of estimate; 4) standard error of prediction; and 5) reliable change index. Results: Internal consistency reliability of the ISS at baseline was 0.90 in Sample 1 and 0.92 in Sample 2. Effect size of change on the ISS was -0.16 in Sample 1 and -0.59 in Sample 2. The denominators for the five methods in Sample 1 (Sample 2) were 7.6 (8.4) for the standard deviation index, 2.4 (2.4) for the SEM, 2.3 (2.3) for the standard error of estimate, and 3.3 (3.4) for the standard error of prediction and the reliable change index. The amount of change on the ISS needed for significant individual change in both samples was about 15-16 for the standard deviation index, 5 for the SEM and for the standard error of estimate, and 7 for the standard error of prediction and reliable change index. The percentage of people classified as responders ranged from 1% (standard deviation index in Sample 1) to 57% (SEM and standard error of estimate in Sample 2). Conclusions: The standard error of prediction and reliable change index estimates of significant change are consistent with retrospective ratings of change of at least moderately better in prior research. These two methods are less likely than the others to classify people as responders who have not actually gotten better.
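The Sample 1 denominators can be reproduced from the reported baseline SD (7.6) and reliability (0.90), assuming the standard formulas, which the abstract does not print, so this is a sketch: SEM = SD·√(1−r), SE of estimate = SD·√(r(1−r)), SE of prediction = SD·√(1−r²), and the RCI denominator = SEM·√2.

```python
import math

sd, rel = 7.6, 0.90  # Sample 1 baseline SD and internal-consistency reliability

denominators = {
    "SD index": sd,
    "SEM": sd * math.sqrt(1 - rel),
    "SE of estimate": sd * math.sqrt(rel * (1 - rel)),
    "SE of prediction": sd * math.sqrt(1 - rel**2),
    "RCI (SEM * sqrt(2))": sd * math.sqrt(2 * (1 - rel)),
}
for name, d in denominators.items():
    # 1.96 * denominator = change needed to call an individual a responder
    print(f"{name}: denominator {d:.2f}, threshold {1.96 * d:.1f}")
```

Under these assumed formulas the values round to the abstract's 7.6, 2.4, 2.3, and 3.3, and the 1.96-scaled thresholds land near the reported 15, 5, and 7.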


2004 ◽  
Vol 10 (6) ◽  
pp. 902-903 ◽ 
Author(s):  
GERARD H. MAASSEN

Due to space limitations I have chosen to confine my reply to the comments by Temkin (this issue, pp. 899–901) that most directly touch on the concepts of practice effects and reliable change. Temkin seems to portray my adherence to the classical approach as a private affair. However, Temkin herself (Temkin et al., 1999) reported using the most widely applied procedures of Jacobson and Truax and of Chelune et al., which are based on the classical approach. For unexplained reasons they substituted a different standard error. The unsatisfactory justification later given in their reply to Hinton-Bayre's (2000) letter revealed what was presumably the actual reason: unfamiliarity with psychometrics, including classical test theory (CTT). Not surprisingly, Temkin ignores this historical aspect in her comment. Nevertheless, the new post-hoc arguments she brings up deserve, of course, a fair evaluation.


2021 ◽  
Author(s):  
Ron D. Hays ◽  
Mary E. Slaughter ◽  
Patricia M. Herman

Abstract Background: Identifying how many individuals significantly improve (“responders”) provides important supplementary information, beyond group mean change, about the effects of treatment options. This supplemental information can enhance interpretation of clinical trials and observational studies. This study provides a comparison of five ways of estimating the significance of individual change. Methods: Secondary analyses of the Impact Stratification Score (ISS) for chronic low back pain, which was administered at two timepoints three months apart in an observational study of 1,680 patients undergoing chiropractic care. The ISS is the sum of the PROMIS-29 v2.1 physical function, pain interference, and pain intensity scores and has a possible range of 8 (least impact) to 50 (greatest impact). The five methods of evaluating individual change compared were: 1) standard deviation index; 2) confidence interval around the standard error of measurement (SEM); 3) standard error of estimate; 4) standard error of prediction; and 5) reliable change index. Results: Internal consistency reliability of the ISS at baseline was 0.90. Effect size of change on the ISS was -0.16 using the SD (7.6) at baseline. The denominators for the five methods were 7.6 for the standard deviation index, 2.4 for the confidence interval around the SEM, 2.3 for the standard error of estimate, and 3.3 for the standard error of prediction and the reliable change index. The amount of change on the ISS needed for significant individual change was 15 for the standard deviation index, 5 for the confidence interval around the SEM and for the standard error of estimate, and 7 for the standard error of prediction and reliable change index. The percentage of people classified as responders ranged from 1% (standard deviation index) to 22% (standard error of prediction and reliable change index). Conclusions: The standard error of prediction and reliable change index estimates of significant change are consistent with retrospective ratings of change of at least moderately better in prior research. These two methods are less likely than the others to classify people as responders who have not actually gotten better.

