An analysis on the optimal number of options in multiple-choice items of the National Assessment of Educational Achievement

2014, Vol 21 (2), pp. 107-128
Author(s): Young-Ju Lee

1993, Vol 53 (1), pp. 241-247
Author(s): Kevin D. Crehan, Thomas M. Haladyna, Britton W. Brewer

1992, Vol 20 (4), pp. 251-260
Author(s): Michael E. Martinez, John J. Ferris, William Kraft, Winton H. Manning

Large-scale testing is dominated by the multiple-choice question format. Widespread use of the format is explained, in part, by the ease with which multiple-choice items can be scored automatically. This article examines automated scoring procedures for an alternative item type: figural response. Figural response items call for the completion or modification of figural material, including illustrations, diagrams, and graphs. Nineteen science items were written in cooperation with the National Assessment of Educational Progress and printed with a special ink, invisible to scanning equipment. The items were answered with pencils; response sheets were then scanned and the resulting data were processed by computer-based scoring algorithms. Implications of this technology for the future of large-scale testing are discussed.
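
The article does not describe its scoring algorithms in detail, so the following Python sketch is only an illustration, with an invented mark format, keyed region, and scoring rule: it shows how scanned pencil-mark coordinates from a figural response sheet might be scored automatically against a keyed region of a diagram or graph.

from dataclasses import dataclass

@dataclass
class Mark:
    x: float  # horizontal scan coordinate of a pencil mark
    y: float  # vertical scan coordinate of a pencil mark

def score_figural_item(marks, target_box, required_marks=1):
    # Award 1 point when enough marks fall inside the keyed region of the
    # illustration, diagram, or graph; otherwise award 0.
    x_min, y_min, x_max, y_max = target_box
    hits = sum(1 for m in marks
               if x_min <= m.x <= x_max and y_min <= m.y <= y_max)
    return 1 if hits >= required_marks else 0

# Example: one mark near the keyed bar of a hypothetical graph item.
print(score_figural_item([Mark(12.4, 7.9)], target_box=(12.0, 7.5, 13.0, 8.5)))  # prints 1

In practice the scoring rules would depend on the particular item, but the core idea is the same: the scanned marks are reduced to coordinates and compared against keyed regions defined for each item.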


Psychometrika, 2021
Author(s): Qian Wu, Monique Vanerum, Anouk Agten, Andrés Christiansen, Frank Vandenabeele, ...

1987, Vol 47 (2), pp. 513-522
Author(s): Steven V. Owen, Robin D. Froman

1990, Vol 1990 (1), pp. i-29
Author(s): Randy Elliot Bennett, Donald A. Rock, Minhwei Wang

2021, pp. 001316442098810
Author(s): Stefanie A. Wind, Yuan Ge

Practical constraints in rater-mediated assessments limit the availability of complete data. Instead, most scoring procedures include one or two ratings for each performance, with overlapping performances across raters or linking sets of multiple-choice items to facilitate model estimation. These incomplete scoring designs present challenges for detecting rater biases, or differential rater functioning (DRF). The purpose of this study is to illustrate and explore the sensitivity of DRF indices in realistic sparse rating designs documented in the literature, which include different types and levels of connectivity among raters and students. The results indicated that it is possible to detect DRF in sparse rating designs, but the sensitivity of DRF indices varies across designs. We consider the implications of our findings for practice related to monitoring raters in performance assessments.
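
As a rough illustration of the connectivity issue (the rating design below is invented, not one of the study's designs, and no DRF index is computed), this Python sketch checks whether a sparse assignment of raters to student performances forms a single linked network, which is a precondition for placing rater effects on a common scale.

from collections import defaultdict, deque

# (rater, student) pairs: each pair means that rater scored that student's performance.
design = [("R1", "S1"), ("R1", "S2"), ("R2", "S2"), ("R2", "S3"), ("R3", "S4")]

def is_connected(pairs):
    # Build a bipartite rater-student network and test whether every rater and
    # student can be reached from any other through shared scoring assignments.
    graph = defaultdict(set)
    for rater, student in pairs:
        graph[("rater", rater)].add(("student", student))
        graph[("student", student)].add(("rater", rater))
    start = next(iter(graph))
    seen, queue = {start}, deque([start])
    while queue:
        for neighbor in graph[queue.popleft()]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(graph)

print(is_connected(design))  # False: rater R3 shares no students with R1 or R2

Sparser designs with fewer overlapping performances tend toward weaker connectivity, which is one reason the sensitivity of DRF indices can vary across designs.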

