Comparing Graphical and Verbal Representations of Measurement Error in Test Score Reports

2014 · Vol. 19 (2) · pp. 116-138
Author(s): Rebecca Zwick, Diego Zapata-Rivera, Mary Hegarty

2018 · Vol. 26 (2) · pp. 123-142
Author(s): Dorien Hopster-den Otter, Selia N. Muilenburg, Saskia Wools, Bernard P. Veldkamp, Theo J. H. M. Eggen

2016 · Vol. 34 (2) · pp. 175-195
Author(s): Donald Powers, Mary Schedl, Spiros Papageorgiou

The aim of this study was to develop, for the benefit of both test takers and test score users, enhanced TOEFL ITP® test score reports that go beyond the simple numerical scores currently reported. To do so, we applied traditional scale anchoring (proficiency scaling) to item difficulty data in order to develop performance descriptors for multiple levels of each of the three sections of the TOEFL ITP. A novel constraint was that these levels should correspond to those established in an earlier study that mapped (i.e., aligned) TOEFL ITP scores to a widely accepted framework for describing language proficiency, the Common European Framework of Reference (CEFR). The data used in the present study came from administrations of five current operational forms of the recently revised TOEFL ITP test. The outcome is a set of performance descriptors for several levels of TOEFL ITP scores on each of the three sections of the test. We believe the contribution is twofold: (1) an enhanced interpretation of scores for one widely used assessment of English language proficiency and (2) a modest addition to the literature on developing proficiency descriptors, using an approach that combines elements of both scale anchoring and test score mapping.
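The scale-anchoring step can be illustrated with a short sketch. This is a minimal, hypothetical example rather than the study's operational procedure: it assumes each item is anchored to the lowest score level at which the estimated probability of a correct response reaches a criterion (here .65), and the level names, criterion value, and data are placeholders.

```python
# Illustrative sketch of traditional scale anchoring (not the study's actual procedure).
# Each item is anchored to the lowest score level at which examinees answer it
# correctly with at least `criterion` probability.

from collections import defaultdict

def anchor_items(p_correct_by_level, criterion=0.65):
    """
    p_correct_by_level: dict mapping item_id -> {level_name: proportion correct},
        where level names sort from lowest to highest proficiency.
    Returns a dict mapping level_name -> list of item_ids anchored to that level.
    """
    anchored = defaultdict(list)
    for item_id, by_level in p_correct_by_level.items():
        for level in sorted(by_level):  # assumes names sort low -> high (e.g., B1 < B2 < C1)
            if by_level[level] >= criterion:
                anchored[level].append(item_id)
                break  # anchor to the *lowest* qualifying level only
    return dict(anchored)

# Hypothetical proportions correct at three CEFR-linked levels
example = {
    "item_017": {"B1": 0.41, "B2": 0.72, "C1": 0.90},  # anchors to B2
    "item_052": {"B1": 0.70, "B2": 0.85, "C1": 0.95},  # anchors to B1
}
print(anchor_items(example))
```

In practice, the items anchored to each level would then be reviewed by content experts to draft the performance descriptors for that level.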


1972 · Vol. 15 (4) · pp. 852-860
Author(s): Zoe Zehel, Ralph L. Shelton, William B. Arndt, Virginia Wright, Mary Elbert

Fourteen children who misarticulated some phones of the /s/ phoneme were tape-recorded articulating several lists of items involving /s/. The lists included the McDonald Deep Test for /s/, three lists similar to McDonald's but altered in broad context, and an /s/ sound production task. Scores from the lists were correlated, compared for differences in means, or both. Item sets determined by immediate context were also compared for differences between means. All lists were found to be significantly correlated. The comparison of means indicated that both broad and immediate context were related to test results. The estimated omega-squared (ω²) statistic was used to evaluate the percentage of test score variance attributable to context.
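For reference, the omega-squared effect size mentioned above can be computed from one-way ANOVA sums of squares. The sketch below uses Hays' estimator, ω̂² = (SS_between − (k − 1)·MS_within) / (SS_total + MS_within); the context conditions and scores are hypothetical, not the study's data.

```python
# Minimal sketch of the estimated omega-squared (omega^2) effect size for a one-way
# ANOVA, i.e., the share of score variance attributable to a factor such as
# phonetic context. All data below are hypothetical.

import numpy as np

def omega_squared(groups):
    """groups: list of 1-D arrays, one array of scores per context condition."""
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()
    k = len(groups)
    n_total = all_scores.size

    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ss_total = ss_between + ss_within
    ms_within = ss_within / (n_total - k)

    # Hays' estimator: (SS_between - (k - 1) * MS_within) / (SS_total + MS_within)
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

# Hypothetical articulation scores under three broad-context conditions
ctx_a = np.array([14, 12, 15, 13, 16])
ctx_b = np.array([10, 9, 11, 12, 10])
ctx_c = np.array([8, 7, 9, 8, 10])
print(f"estimated omega^2 = {omega_squared([ctx_a, ctx_b, ctx_c]):.3f}")
```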


1999 · Vol. 15 (2) · pp. 91-98
Author(s): Lutz F. Hornke

Summary: Item parameters for several hundred items were estimated from empirical data on several thousand subjects, using the one-parameter (1PL) and two-parameter (2PL) logistic models. Model fit showed that only a subset of the items complied sufficiently; these well-fitting items were assembled into item banks. In several simulation studies, 5,000 simulated response records, along with person parameters, were generated in accordance with a computerized adaptive testing (CAT) procedure. A target reliability of .80, i.e., a standard error of measurement of .44, was used as the stopping rule to end CAT testing, and we also recorded how often each item was administered across simulees. Person-parameter estimates from the CAT correlated higher than .90 with the simulated true values. For the 1PL item banks, most simulees needed more than 20 but fewer than 30 items to reach the preset level of measurement error. Testing based on item banks that complied with the 2PL, however, required on average only 10 items to end testing at the same measurement error level. Both results demonstrate the precision and economy of computerized adaptive testing. Empirical evaluations in everyday use will show whether these trends hold up in practice; if so, CAT becomes feasible and reasonable with some 150 well-calibrated 2PL items.
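The kind of simulation loop described, 2PL item selection by maximum Fisher information with a standard-error-based stopping rule, can be sketched as follows. The item parameters, prior, grid, and scoring method (EAP) are illustrative assumptions, not the calibrated banks or the estimation procedure used in the study.

```python
# Minimal sketch of a 2PL CAT simulation: maximum-information item selection,
# EAP scoring on a grid, and a stopping rule of SEM <= 0.44 (reliability ~ .80).
# Item parameters are random placeholders, not calibrated values.

import numpy as np

rng = np.random.default_rng(0)

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    p = p_2pl(theta, a, b)
    return a**2 * p * (1 - p)

def eap_estimate(responses, a, b, grid=np.linspace(-4, 4, 81)):
    """EAP theta estimate and posterior SD under a standard normal prior."""
    log_post = -0.5 * grid**2  # log prior, up to a constant
    for u, ai, bi in zip(responses, a, b):
        p = p_2pl(grid, ai, bi)
        log_post += u * np.log(p) + (1 - u) * np.log(1 - p)
    post = np.exp(log_post)
    post /= post.sum()
    theta_hat = (grid * post).sum()
    se = np.sqrt(((grid - theta_hat) ** 2 * post).sum())
    return theta_hat, se

def run_cat(true_theta, a, b, se_target=0.44, max_items=60):
    """Administer items until the posterior SD drops below se_target."""
    administered, responses = [], []
    theta_hat, se = 0.0, np.inf
    while se > se_target and len(administered) < max_items:
        # pick the unused item with maximum information at the current estimate
        info = item_information(theta_hat, a, b)
        info[administered] = -np.inf
        j = int(np.argmax(info))
        administered.append(j)
        responses.append(int(rng.random() < p_2pl(true_theta, a[j], b[j])))
        theta_hat, se = eap_estimate(responses, a[administered], b[administered])
    return theta_hat, se, len(administered)

# Hypothetical bank of 150 well-calibrated 2PL items
a = rng.uniform(0.8, 2.0, 150)   # discriminations
b = rng.normal(0.0, 1.0, 150)    # difficulties
theta_hat, se, n_used = run_cat(true_theta=0.5, a=a, b=b)
print(f"theta_hat={theta_hat:.2f}, SEM={se:.2f}, items used={n_used}")
```

With a 1PL bank (all discriminations equal), the same loop typically needs more items to reach the same standard error, which is the contrast the summary reports.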

