Validation Methods for Aggregate-Level Test Scale Linking: A Case Study Mapping School District Test Score Distributions to a Common Scale

2019, pp. 107699861987408
Author(s): Sean F. Reardon, Demetra Kalogrides, Andrew D. Ho

Linking score scales across different tests is considered speculative and fraught, even at the aggregate level. We introduce and illustrate validation methods for aggregate linkages, using the challenge of linking U.S. school district average test scores across states as a motivating example. We show that aggregate linkages can be validated both directly and indirectly under certain conditions, such as when scores for at least some target units (districts) are available on a common test (e.g., the National Assessment of Educational Progress). We introduce precision-adjusted random effects models to estimate linking error, for populations and for subpopulations, for averages and for progress over time. These models allow us to distinguish linking error from sampling variability and to illustrate how linking error plays a larger role in aggregates with smaller sample sizes. Assuming that the target districts generalize to the full population of districts, we show that standard errors for district means are generally less than .2 standard deviation units, leading to reliabilities above .7 for roughly 90% of districts. We also show how sources of imprecision and linking error contribute to district comparisons made within versus between states. This approach is applicable whenever the essential counterfactual question, "what would means/variances/progress for the aggregate units be, had students taken the other test?", can be answered directly for at least some of the units.
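To make the error decomposition concrete, the following is a minimal Python sketch of a precision-adjusted, method-of-moments decomposition in the spirit the abstract describes. It is not the authors' actual model; the district counts, sample sizes, and linking-error SD `tau` are all illustrative assumptions. The key idea is that squared residuals between linked and directly observed district means overstate the linking-error variance by the known sampling variance, so subtracting the sampling variance before averaging isolates the linking component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not the authors' data): D districts, each with a linked
# mean whose sampling variance is known from its student sample size, and a
# directly observed mean on the common test available for validation.
D = 2000
n = rng.integers(20, 2000, size=D)            # students per district
true_mean = rng.normal(0.0, 0.35, size=D)     # true district means (student SD units)
tau = 0.05                                    # assumed linking-error SD
samp_var = 1.0 / n                            # sampling variance of each district mean
linked = true_mean + rng.normal(0.0, tau, size=D) + rng.normal(0.0, np.sqrt(samp_var))

# Method-of-moments estimate of the linking-error variance tau^2: squared
# residuals against the direct means exceed tau^2 by the known sampling
# variance, so subtract it before averaging.
resid = linked - true_mean                    # observable only for validation districts
tau2_hat = np.mean(resid**2 - samp_var)

# Reliability of each linked district mean: between-district signal variance
# over signal plus noise (linking error plus sampling error).
signal_var = np.var(true_mean)
reliability = signal_var / (signal_var + tau2_hat + samp_var)
print(f"estimated tau^2 = {tau2_hat:.4f} (simulated value {tau**2:.4f})")
print(f"districts with reliability > .7: {np.mean(reliability > 0.7):.0%}")
```

Because `samp_var` shrinks with district size, the sketch also reproduces the abstract's qualitative point: linking error is a larger share of total error in small districts than in large ones.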

2020, pp. 107699862095666
Author(s): Alina A. von Davier

In this commentary, I share my perspective on the goals of assessments in general and on linking assessments that were developed according to different specifications and for different purposes, and I propose several considerations for the authors and the readers. This brief commentary is structured around three perspectives: (1) the context of this research, (2) the methodology proposed here, and (3) the consequences for applied research.


2020, pp. 107699862094917
Author(s): Mark L. Davison

This paper begins by situating the linking methods of Reardon, Kalogrides, and Ho within the broader literature on linking. Trends in the validity data suggest that there may be a conditional bias in the estimates of district means, but the data in the article are not conclusive on this point. Further, the data used in their case study may support the validity of the methods only over a limited range of the ability continuum. Applications of the method are then discussed. Contrary to the title, the application of the linking results is not limited to aggregate-level data. Because the potential application is so broad, further research is needed on issues such as the possibility of conditional bias and the validity of estimates over the full range of possible values. Validity is not a dichotomous concept in which it either exists or does not. The evidence reported by Reardon et al. provides substantial, but incomplete, support for the validity of the linked measures in this case study.


2020, pp. 107699862094826
Author(s): Daniel Bolt

The studies presented by Reardon, Kalogrides, and Ho provide preliminary support for a National Assessment of Educational Progress–based aggregate linking of state assessments when used for research purposes. In this commentary, I suggest future efforts to explore possible sources of district-level bias, to evaluate predictive accuracy at the state level, and to better understand the performance of the linking when applied to the inevitably nonrepresentative district samples that will be encountered in research studies.


2020, pp. 107699862096008
Author(s): Tim Moses, Neil J. Dorans

The Reardon, Kalogrides, and Ho article on validation methods for aggregate-level test scale linking attempts to validate a district-level scale-aligning procedure that appears to be a new solution to an old problem. Their aligning procedure uses the National Assessment of Educational Progress (NAEP) scale to piece together a patchwork of data structures from different tests of different constructs, obtained under different administration conditions and used in different ways by different states. In this article, we critique their linking and validation efforts. Our critique has three components. First, we review the recommendations for linking state assessments to NAEP from several studies and commentaries to provide background against which to interpret Reardon et al.'s validation attempts. Second, we replicate the Reardon et al. empirical validations of their proposed linking procedure to demonstrate that correlations between district means on two tests can be high even when (1) the constructs being measured by the tests are different and (2) the district-level means estimated using the Reardon et al. linking approach differ substantially from actual district-level means. We then suggest additional checks for construct similarity and subpopulation invariance, drawn from other concordance studies, that could be used to assess whether the inferences made by Reardon et al. are warranted. Finally, until such checks are made, we urge cautious use of the Reardon et al. results.
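To illustrate the replication's central point, that high aggregate correlations do not establish construct similarity, here is a hypothetical Python simulation (not Moses and Dorans's actual replication): two tests share a district-level factor but measure distinct constructs at the student level, and their district means nonetheless correlate highly because aggregation averages away the construct-specific student variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical illustration: two tests measure different constructs at the
# student level, but district composition shifts both, so district means
# correlate strongly anyway.
D, n = 500, 200                                  # districts, students per district
district_effect = rng.normal(0, 0.4, D)          # shared district-level factor
math_scores = district_effect[:, None] + rng.normal(0, 1, (D, n))
reading_scores = district_effect[:, None] + rng.normal(0, 1, (D, n))  # distinct construct

# Student-level scores on the two constructs are only weakly correlated...
r_student = np.corrcoef(math_scores.ravel(), reading_scores.ravel())[0, 1]
# ...yet district means correlate highly, because averaging over n students
# suppresses the construct-specific variance and keeps the shared factor.
r_district = np.corrcoef(math_scores.mean(axis=1), reading_scores.mean(axis=1))[0, 1]
print(f"student-level r = {r_student:.2f}, district-level r = {r_district:.2f}")
```

Under these assumed parameters the student-level correlation is near .14 while the district-level correlation exceeds .95, which is why the commentary asks for construct-similarity and subpopulation-invariance checks beyond aggregate correlations.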


2021, Vol. 46(2), pp. 209-218
Author(s): Andrew D. Ho, Sean F. Reardon, Demetra Kalogrides
