Flexibility at the Price of Volatility: Concurrent Calibration in Multistage Tests in Practice Using a 2PL Model

Multistage test (MST) designs promise efficient student ability estimates, an indispensable asset for individual diagnostics in high-stakes educational assessments. In high-stakes testing, annually changing test forms are required because publicly known test items impair accurate student ability estimation, and items of bad model fit must be continually replaced to guarantee test quality. This requires a large and continually refreshed item pool as the basis for high-stakes MST. In practice, the calibration of newly developed items to feed annually changing tests is highly resource intensive. Piloting based on a representative sample of students is often not feasible, given that, for schools, participation in actual high-stakes assessments already requires considerable organizational effort. Hence, under practical constraints, the calibration of newly developed items may take place on the go in the form of a concurrent calibration in MST designs. Based on a simulation approach this paper focuses on the performance of Rasch vs. 2PL modeling in retrieving item parameters when items are for practical reasons non-optimally placed in multistage tests. Overall, the results suggest that the 2PL model performs worse in retrieving item parameters compared to the Rasch model when there is non-optimal item assembly in the MST; especially in retrieving parameters at the margins. The higher flexibility of 2PL modeling, where item discrimination is allowed to vary, seems to come at the cost of increased volatility in parameter estimation. Although the overall bias may be modest, single items can be affected by severe biases when using a 2PL model for item calibration in the context of non-optimal item placement.

Download Full-text

What is at stake in knowing the content and capabilities of children’s minds?

Theory and Research in Education ◽

10.1177/1477878504046524 ◽

2004 ◽

Vol 2 (3) ◽

pp. 283-308 ◽

Cited By ~ 12

Author(s):

Stephen P. Norris ◽

Jacqueline P. Leighton ◽

Linda M. Phillips

Keyword(s):

Construct Validity ◽

High Stakes Testing ◽

Cognitive Models ◽

Test Design ◽

Educational Testing ◽

High Stakes ◽

Test Items ◽

Reasoning Strategies ◽

Interpretation Process ◽

The One

Many significant changes in perspective have to take place before efforts to learn the content and capabilities of children’s minds can hold much sway in educational testing. The language of testing, especially of high stakes testing, remains firmly in the realm of ‘behaviors’, ‘performance’ and ‘competency’ defined in terms of behaviors, test items, or observations. What is on children’s minds is not taken into account as integral to the test design and interpretation process. The point of this article is to argue that behaviorist-based validation models are ill-founded, and to recommend basing tests on cognitive models that theorize the content and capabilities of children’s minds in terms of such features as meta-cognition, reasoning strategies, and principles of sound thinking. This approach is the one most likely to yield the construct validity for tests long endorsed by many testing theorists. The implications of adopting a cognitive basis for testing that might be upsetting to many current practices are explored.

Download Full-text

You Can Play the Game Without Knowing the Rules – But You’re Better Off Knowing Them

European Journal of Psychological Assessment ◽

10.1027/1015-5759/a000637 ◽

2021 ◽

pp. 1-9

Author(s):

Julie Levacher ◽

Marco Koch ◽

Johanna Hissbach ◽

Frank M. Spinath ◽

Nicolas Becker

Keyword(s):

Test Preparation ◽

High Stakes Testing ◽

Control Group ◽

High Stakes ◽

Test Fairness ◽

Test Items ◽

Item Functioning ◽

Confirmatory Factor ◽

Differential Item Functioning Analysis ◽

Item Properties

Abstract. Due to their high item difficulties and excellent psychometric properties, construction-based figural matrices tasks are of particular interest when it comes to high-stakes testing. An important prerequisite is that test preparation – which is likely to occur in this context – does not impair test fairness or item properties. The goal of this study was to provide initial evidence concerning the influence of test preparation. We administered test items to a sample of N = 882 participants divided into two groups, but only one group was given information about the rules employed in the test items. The probability of solving the items was significantly higher in the test preparation group than in the control group ( M = 0.61, SD = 0.19 vs. M = 0.41, SD = 0.25; t(54) = 3.42, p = .001; d = .92). Nevertheless, a multigroup confirmatory factor analysis, as well as a differential item functioning analysis, indicated no differences between the item properties in the two groups. The results suggest that construction-based figural matrices are suitable in the context of high-stakes testing when all participants are provided with test preparation material so that test fairness is ensured.

Download Full-text