A Comparison of the Partial Credit and Graded Response Models in Computerized Adaptive Testing

1992 ◽  
Vol 5 (1) ◽  
pp. 17-34 ◽  
Author(s):  
R.J. De Ayala ◽  
Barbara G. Dodd ◽  
William R. Koch


2021 ◽  
Author(s):  
Conrad J. Harrison ◽  
Bao Sheng Loe ◽  
Inge Apon ◽  
Chris J. Sidey-Gibbons ◽  
Marc C. Swan ◽  
...  

BACKGROUND There are two philosophical approaches to contemporary psychometrics: Rasch measurement theory (RMT) and item response theory (IRT). Either measurement strategy can be applied to computerized adaptive testing (CAT). There are potential benefits of IRT over RMT with regard to measurement precision, but also potential risks to measurement generalizability. RMT CAT assessments have demonstrated good performance with the CLEFT-Q, a patient-reported outcome measure for use in orofacial clefting.
OBJECTIVE To test whether the post-hoc application of IRT (graded response models, GRMs, and multidimensional GRMs) to RMT-validated CLEFT-Q appearance scales could improve CAT accuracy at given assessment lengths.
METHODS Partial credit Rasch models, unidimensional GRMs and a multidimensional GRM were calibrated for each of the 7 CLEFT-Q appearance scales (which measure the appearance of the face, jaw, teeth, nose, nostrils, cleft lip scar and lips) using data from the CLEFT-Q field test. A second, simulated dataset was generated with 1000 plausible response sets to each scale. Rasch and GRM scores were calculated for each simulated response set, scaled to 0-100 scores, and compared by Pearson’s correlation coefficient, root mean square error (RMSE), mean absolute error (MAE) and 95% limits of agreement. For the face, teeth and jaw scales, we repeated this in an independent, real patient dataset. We then used the simulated data to compare the performance of a range of fixed-length CAT assessments generated with partial credit Rasch models, unidimensional GRMs and the multidimensional GRM. Median standard error of measurement (SEM) was recorded for each assessment. CAT scores were scaled to 0-100 and compared to linear assessment Rasch scores with RMSE, MAE and 95% limits of agreement. This was repeated in the independent, real patient dataset with the RMT and unidimensional GRM CAT assessments for the face, teeth and jaw scales to test the generalizability of our simulated data analysis.
RESULTS Linear assessment scores generated by Rasch models and unidimensional GRMs showed close agreement, with RMSE ranging from 2.2 to 6.1, and MAE ranging from 1.5 to 4.9 in the simulated dataset. These findings were closely reproduced in the real patient dataset. Unidimensional GRM CAT algorithms achieved lower median SEM than their Rasch counterparts, but reproduced linear assessment scores with very similar accuracy (RMSE, MAE and 95% limits of agreement). The multidimensional GRM had poorer accuracy than the unidimensional models at comparable assessment lengths.
CONCLUSIONS Partial credit Rasch models and GRMs produce very similar CAT scores. GRM CAT assessments achieve a lower SEM, but this does not translate into better accuracy. Commonly used SEM heuristics for target measurement reliability should not be generalized across CAT assessments built with different psychometric models. In this study, a relatively parsimonious multidimensional GRM CAT algorithm performed more poorly than unidimensional GRM comparators.
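The agreement statistics reported above can be reproduced directly from two vectors of matched 0-100 scaled scores. Below is a minimal sketch in Python, assuming `rasch_scores` and `grm_scores` are NumPy arrays of scores for the same simulated respondents (the variable names and simulated data are illustrative, not the study's own code or data).

```python
import numpy as np
from scipy import stats

def agreement_metrics(rasch_scores, grm_scores):
    """Compare two sets of 0-100 scaled scores from the same respondents."""
    diff = grm_scores - rasch_scores
    r, _ = stats.pearsonr(rasch_scores, grm_scores)   # Pearson correlation
    rmse = np.sqrt(np.mean(diff ** 2))                # root mean square error
    mae = np.mean(np.abs(diff))                       # mean absolute error
    # Bland-Altman 95% limits of agreement: mean difference +/- 1.96 SD
    loa = (diff.mean() - 1.96 * diff.std(ddof=1),
           diff.mean() + 1.96 * diff.std(ddof=1))
    return {"pearson_r": r, "RMSE": rmse, "MAE": mae, "95%_LoA": loa}

# Illustrative data only: scores that agree up to random noise
rng = np.random.default_rng(0)
rasch_scores = rng.uniform(0, 100, 1000)
grm_scores = rasch_scores + rng.normal(0, 3, 1000)
print(agreement_metrics(rasch_scores, grm_scores))
```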


2008 ◽  
Vol 216 (1) ◽  
pp. 12-21 ◽  
Author(s):  
A. Michiel Hol ◽  
Harrie C.M. Vorst ◽  
Gideon J. Mellenbergh

A computerized adaptive testing (CAT) procedure was simulated with ordinal polytomous personality data collected using a conventional paper-and-pencil testing format. An adapted Dutch version of the dominance scale of Gough and Heilbrun’s Adjective Check List (ACL) was used. This version contained Likert response scales with five categories. Item parameters were estimated using Samejima’s graded response model from the responses of 1,925 subjects. The CAT procedure was simulated using the responses of 1,517 other subjects. The value of the required standard error in the CAT stopping rule was manipulated. The relationship between CAT latent trait estimates and estimates based on all dominance items was studied. Additionally, the pattern of relationships between the CAT latent trait estimates and the other ACL scales was compared to that between latent trait estimates based on the entire item pool and the other ACL scales. The CAT procedure yielded latent trait estimates qualitatively equivalent to those based on all items while substantially reducing the number of items administered (with a stopping rule of 0.4, about 33% of the 36 items were used).
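The simulation described above can be illustrated with a simplified CAT loop under Samejima's graded response model: items are selected by maximum Fisher information at the current trait estimate, the trait is re-estimated by expected a posteriori (EAP) after each response, and administration stops once the posterior standard error falls below the chosen threshold (e.g., 0.4). The sketch below uses a made-up 36-item bank of 5-category items; it is an illustration under those assumptions, not the authors' simulation code.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Samejima GRM category probabilities for one item.
    a: discrimination; b: ascending thresholds (length = n_categories - 1)."""
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b))))  # P(X >= k), k = 1..K-1
    cum = np.concatenate(([1.0], cum, [0.0]))                  # add P(X >= 0) = 1, P(X >= K) = 0
    return cum[:-1] - cum[1:]                                   # per-category probabilities

def grm_item_info(theta, a, b):
    """Fisher information of a GRM item at theta."""
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b))))
    cum = np.concatenate(([1.0], cum, [0.0]))
    dcum = a * cum * (1.0 - cum)                 # derivative of each cumulative curve
    probs = cum[:-1] - cum[1:]
    dprobs = dcum[:-1] - dcum[1:]
    return np.sum(dprobs ** 2 / np.clip(probs, 1e-10, None))

def eap_estimate(responses, items, grid=np.linspace(-4, 4, 121)):
    """EAP trait estimate and posterior SD given responses to administered items."""
    log_post = -0.5 * grid ** 2                  # standard normal prior (up to a constant)
    for (a, b), x in zip(items, responses):
        probs = np.array([grm_category_probs(t, a, b)[x] for t in grid])
        log_post += np.log(np.clip(probs, 1e-10, None))
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    theta = np.sum(grid * post)
    se = np.sqrt(np.sum((grid - theta) ** 2 * post))
    return theta, se

def simulate_cat(true_theta, item_bank, se_stop=0.4, max_items=36, rng=None):
    """Administer items by maximum information until the posterior SE <= se_stop."""
    if rng is None:
        rng = np.random.default_rng()
    administered, responses = [], []
    theta, se = 0.0, np.inf
    while len(administered) < max_items and se > se_stop:
        remaining = [i for i in range(len(item_bank)) if i not in administered]
        best = max(remaining, key=lambda i: grm_item_info(theta, *item_bank[i]))
        probs = grm_category_probs(true_theta, *item_bank[best])
        responses.append(rng.choice(len(probs), p=probs))       # simulate a graded response
        administered.append(best)
        theta, se = eap_estimate(responses, [item_bank[i] for i in administered])
    return theta, se, len(administered)

# Hypothetical 36-item bank of 5-point Likert items (random parameters for illustration)
rng = np.random.default_rng(1)
bank = [(rng.uniform(1.0, 2.5), np.sort(rng.normal(0, 1, 4))) for _ in range(36)]
print(simulate_cat(true_theta=0.5, item_bank=bank, se_stop=0.4, rng=rng))
```

Tightening the stopping rule (a smaller `se_stop`) forces more items to be administered before the posterior standard error criterion is met, which is the trade-off manipulated in the study.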


2015 ◽  
Vol 23 (88) ◽  
pp. 593-610
Author(s):  
Patrícia Costa ◽  
Maria Eugénia Ferrão

This study aims to provide statistical evidence of the complementarity between classical test theory and item response models for certain educational assessment purposes. Such complementarity might support, at a reduced cost, the future development of innovative procedures for item calibration in adaptive testing. Classical test theory and the generalized partial credit model are applied to tests comprising multiple choice, short answer, completion, and partially scored open response items. Datasets are derived from the tests administered to the Portuguese population of students enrolled in the 4th and 6th grades. The results show a very strong association between the difficulty estimates obtained from classical test theory and those from item response models, corroborating the statistical theory of mental testing.
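The reported association can be examined by correlating item-level difficulty statistics from the two frameworks. A minimal sketch, assuming classical difficulty is taken as the mean observed item score rescaled by the maximum score, and that `irt_difficulty` holds item location estimates from a generalized partial credit model calibrated elsewhere (both the data and the names here are placeholders, not the study's data):

```python
import numpy as np
from scipy import stats

def classical_difficulty(scores, max_score):
    """Classical difficulty per item: mean observed score divided by the maximum score."""
    return np.asarray(scores).mean(axis=0) / max_score

# Placeholder data: 500 examinees x 20 items scored 0-2, plus GPCM item locations
rng = np.random.default_rng(2)
item_scores = rng.integers(0, 3, size=(500, 20))
irt_difficulty = rng.normal(0, 1, 20)  # stand-in for estimates from an IRT calibration

p_values = classical_difficulty(item_scores, max_score=2)
r, _ = stats.pearsonr(p_values, irt_difficulty)
rho, _ = stats.spearmanr(p_values, irt_difficulty)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

Note that classical difficulty indices (proportion of maximum score) and IRT location parameters typically correlate negatively, since easier items have higher classical indices but lower locations, so the strength of the association is judged by its magnitude.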

