Improving Measurement Quality and Efficiency with Adaptive Testing

1982 · Vol 6 (4) · pp. 473-492
Author(s): David J. Weiss

Approaches to adaptive (tailored) testing based on item response theory are described and research results are summarized. Through appropriate combinations of item pool design and test termination criteria, adaptive tests can be designed (1) to improve both measurement quality and measurement efficiency, yielding measurements of equal precision at all trait levels; (2) to improve measurement efficiency for test batteries using item pools designed for conventional test administration; and (3) to improve the accuracy and efficiency of testing for classification (e.g., mastery testing). Research results show that tests based on item response theory (IRT) can achieve measurements of equal precision at all trait levels, given an adequately designed item pool; these results contrast with those of conventional tests, which require trading bandwidth for fidelity/precision of measurement. Data also show reductions in bias, inaccuracy, and root mean square error of ability estimates. Improvements in test fidelity observed in simulation studies are supported by live-testing data, which showed that adaptive tests required half as many items as conventional tests to achieve equal levels of reliability, and almost one-third as many to achieve equal levels of validity. When used with item pools from conventional tests, both simulation and live-testing results show reductions in test battery length relative to conventional tests, with no reduction in measurement quality. Adaptive tests designed for dichotomous classification also improve on conventional tests designed for the same purpose: simulation studies show reductions in test length and improvements in classification accuracy for adaptive versus conventional tests, and live-testing studies comparing adaptive tests with "optimal" conventional tests support these findings. Thus, the research data show that IRT-based adaptive testing exploits the capabilities of IRT to improve the quality and/or efficiency of measurement for each examinee.
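To make the adaptive logic concrete, the sketch below (a minimal illustration, not any of the specific procedures Weiss reviews) administers items from a 2PL pool by maximum Fisher information at the current ability estimate and stops once the standard error falls below a target, the kind of termination criterion that yields roughly equal precision across trait levels. The pool parameters, the EAP estimator, and the `answer_fn` callback are all illustrative assumptions.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def eap_estimate(a_used, b_used, responses, grid=np.linspace(-4, 4, 161)):
    """EAP ability estimate and posterior SD under a standard normal prior."""
    post = np.exp(-0.5 * grid ** 2)                    # unnormalized prior
    for ai, bi, u in zip(a_used, b_used, responses):
        p = p_2pl(grid, ai, bi)
        post *= p ** u * (1.0 - p) ** (1 - u)          # likelihood of each response
    post /= post.sum()
    theta = float(np.sum(grid * post))
    se = float(np.sqrt(np.sum((grid - theta) ** 2 * post)))
    return theta, se

def adaptive_test(pool_a, pool_b, answer_fn, se_target=0.3, max_items=30):
    """Administer items by maximum information until the SE target or item cap is reached."""
    used, responses = [], []
    theta, se = 0.0, float("inf")
    while se > se_target and len(used) < max_items:
        info = [item_info(theta, ai, bi) if i not in used else -np.inf
                for i, (ai, bi) in enumerate(zip(pool_a, pool_b))]
        j = int(np.argmax(info))                       # most informative unused item
        used.append(j)
        responses.append(answer_fn(j))                 # 0/1 response scored by the caller
        theta, se = eap_estimate([pool_a[i] for i in used],
                                 [pool_b[i] for i in used], responses)
    return theta, se, used
```

A fixed-precision stopping rule such as `se_target=0.3` is what produces equiprecise measurement; replacing it with a fixed test length recovers the efficiency-only design.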

2019 · Vol 79 (6) · pp. 1133-1155
Author(s): Emre Gönülateş

This article introduces the Quality of Item Pool (QIP) Index, a novel approach to quantifying how adequate a computerized adaptive test's item pool is for a given set of test specifications and examinee population. The index ranges from 0 to 1, with values close to 1 indicating that the item pool presents near-optimal items to examinees throughout the test. It can be used to compare different item pools or to diagnose the deficiencies of a given pool by quantifying its deviation from a perfect item pool. Simulation studies were conducted to evaluate the capacity of the index to detect the inadequacies of two simulated item pools, and its value was compared with existing methods of evaluating the quality of computerized adaptive tests (CATs). Results showed that the QIP Index can detect even slight deviations between a proposed item pool and an optimal item pool, and can uncover shortcomings of an item pool that other CAT outcomes cannot. CAT developers can use the QIP Index to diagnose the weaknesses of an item pool and as a guide for improving it.
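The article defines the QIP Index precisely; that formula is not reproduced here. As a purely hypothetical illustration of the underlying idea, comparing the items an examinee actually received against the best items an ideal pool could have supplied at each interim ability estimate, one might form a ratio like the following. The function name, the `max_a` assumption, and the simple averaging are invented for illustration and are not Gönülateş's definition.

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def pool_adequacy_ratio(administered, interim_thetas, max_a=2.5):
    """Hypothetical pool-adequacy ratio (NOT the published QIP formula).

    administered   : list of (a, b) parameters of the items actually given
    interim_thetas : interim ability estimates at which each item was chosen
    max_a          : assumed discrimination of an 'ideal' item located at theta
    """
    delivered = np.array([info_2pl(t, a, b)
                          for (a, b), t in zip(administered, interim_thetas)])
    # an ideal item sits at b = theta with the best plausible discrimination
    ideal = np.array([info_2pl(t, max_a, t) for t in interim_thetas])
    return float(np.mean(delivered / ideal))
```

Under the stated assumption that no real item exceeds `max_a`, the ratio stays in [0, 1], with values near 1 indicating the pool supplied close-to-ideal items at every step.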


1996 · Vol 21 (4) · pp. 365-389
Author(s): Martha L. Stocking

Modern applications of computerized adaptive testing are typically grounded in item response theory (IRT; Lord, 1980). While the IRT foundations of adaptive testing provide a number of approaches to adaptive test scoring that may seem natural and efficient to psychometricians, these approaches can be harder for test takers, test score users, and interested regulatory institutions to comprehend. An alternative method, based on the more familiar equated number-correct score and identical to that used to score and equate many conventional tests, is explored and compared with one that relies more directly on IRT. It is concluded that scoring adaptive tests with the familiar number-correct score, accompanied by the equating necessary to adjust for the intentional differences in adaptive test difficulty, is a statistically viable, although slightly less efficient, method of adaptive test scoring. To enhance the prospects for enlightened public debate about adaptive testing, it may be preferable to use this more familiar approach. Public attention would then likely be focused on issues more central to adaptive testing, namely, its adaptive nature.
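One standard way to realize the equated number-correct approach described above (a sketch under generic 2PL assumptions, not necessarily Stocking's exact procedure) is to invert the test characteristic curve of the items actually administered: find the theta at which the expected number-correct score equals the observed number correct, and then equate the result to a reference conventional form.

```python
import numpy as np
from scipy.optimize import brentq

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def tcc(theta, a, b):
    """Test characteristic curve: expected number-correct score at theta."""
    return float(np.sum(p_2pl(theta, np.asarray(a), np.asarray(b))))

def theta_from_number_correct(x, a, b, lo=-6.0, hi=6.0):
    """Invert the TCC to map an observed number-correct score x to theta."""
    if x <= tcc(lo, a, b) or x >= tcc(hi, a, b):
        raise ValueError("score outside the invertible range of the TCC")
    return brentq(lambda t: tcc(t, a, b) - x, lo, hi)

# Illustrative parameters: 5 administered items, 3 answered correctly
a = [1.2, 0.9, 1.5, 1.1, 0.8]
b = [-0.5, 0.0, 0.3, 0.8, 1.2]
theta_hat = theta_from_number_correct(3, a, b)
```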


Mathematics · 2020 · Vol 8 (8) · pp. 1290
Author(s): Zheng-Yun Zhuang · Chi-Kit Ho · Paul Juinn Bing Tan · Jia-Ming Ying · Jin-Hua Chen

Administering A/B exams usually relies on a set of items. Problems arise when a question bank must be established in advance and when inconsistency in the knowledge points tested by the two papers undermines the exams' fairness. These issues are critical for a large multi-teacher course in which instructors rotate, so the course and examination content change every few years. Even without an item pool, however, a fair test should still be guaranteed for the randomly assigned students. Using a data-driven decision-making approach, this study collected and pre-processed data from a term test of a compulsory general course and applied item response theory to estimate the difficulty, discrimination, and lower asymptote of each item in the two exam papers. Binary goal programming was then used to analyze and balance the fairness of the A/B exams without an item pool. As a result, pairs of associated questions in the two papers were optimized for overall balance on the three dimensions (the goals) through paired exchanges of items. The resulting papers maintain consistency in the knowledge points tested and ensure the fairness of the term test, a key psychological factor that motivates continued study. The application is novel in that the teachers had no pre-set question bank yet could formulate the fairest strategy for the A/B exam papers. The model can be employed to address similar issues in teaching practice.
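The published model is a binary goal program; the sketch below is a deliberately simplified greedy pairwise-swap heuristic that illustrates the same balancing idea, exchanging paired items between the A and B papers whenever a swap reduces the weighted gap between the forms on the three estimated IRT dimensions (difficulty, discrimination, lower asymptote). The parameter values and weights are illustrative assumptions, not estimates from the study.

```python
import numpy as np

def imbalance(A, B, weights):
    """Weighted absolute gap between the column sums of the two forms."""
    gap = np.abs(A.sum(axis=0) - B.sum(axis=0))
    return float(np.dot(weights, gap))

def balance_by_swaps(A, B, weights=(1.0, 1.0, 1.0)):
    """Greedily swap paired items between forms A and B while the imbalance drops.

    A, B : arrays of shape (n_pairs, 3); row i holds the estimated
           (difficulty, discrimination, lower asymptote) of the i-th paired item.
    """
    A = np.array(A, dtype=float)
    B = np.array(B, dtype=float)
    improved = True
    while improved:
        improved = False
        for i in range(len(A)):
            before = imbalance(A, B, weights)
            A[i], B[i] = B[i].copy(), A[i].copy()      # tentative exchange of pair i
            if imbalance(A, B, weights) < before:
                improved = True                        # keep the beneficial swap
            else:
                A[i], B[i] = B[i].copy(), A[i].copy()  # undo the swap
    return A, B

# Illustrative 3-item forms: columns are (difficulty, discrimination, lower asymptote)
exam_A = [[0.2, 1.1, 0.20], [1.0, 0.8, 0.15], [-0.5, 1.4, 0.25]]
exam_B = [[0.6, 0.9, 0.10], [0.1, 1.3, 0.20], [-0.2, 1.0, 0.18]]
balanced_A, balanced_B = balance_by_swaps(exam_A, exam_B)
```

An exact binary goal program would instead introduce 0/1 swap variables and deviation variables for each goal and minimize the weighted deviations with an integer solver; the greedy version above only approximates that optimum.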


Author(s): Anju Devianee Keetharuth · Jakob Bue Bjorner · Michael Barkham · John Browne · Tim Croudace · ...

Purpose: ReQoL-10 and ReQoL-20 have been developed for use as outcome measures with individuals aged 16 and over who are experiencing mental health difficulties. This paper reports modelling results from the item response theory (IRT) analyses that were used for item reduction.
Methods: A pool of items was developed from several stages of preparatory work, including focus groups and a previous psychometric survey. After confirming that the ReQoL item pool was sufficiently unidimensional for scoring, IRT model parameters were estimated using Samejima's Graded Response Model (GRM). All 39 mental health items were evaluated for item fit and for differential item functioning with respect to age, gender, ethnicity, and diagnosis. Scales were evaluated for overall measurement precision and known-groups validity (by care setting type and self-rating of overall mental health).
Results: The study recruited 4266 participants with a wide range of mental health diagnoses from multiple settings. The IRT parameters demonstrated excellent coverage of the latent construct, with the centres of the item information functions ranging from −0.98 to 0.21 and discrimination (slope) parameters ranging from 1.4 to 3.6. Only two poorly fitting items were identified, and no differential item functioning of concern was found. Scales showed excellent measurement precision and known-groups validity.
Conclusion: The results from the IRT analyses confirm the robust structural properties and internal construct validity of the ReQoL instruments. The strong psychometric evidence generated guided item selection for the final versions of the ReQoL measures.
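For reference, the category probabilities of Samejima's Graded Response Model used in these analyses take the standard form sketched below. The item parameters in the example are illustrative, not ReQoL estimates, although the slope lies inside the 1.4 to 3.6 range reported above.

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """Category response probabilities under Samejima's Graded Response Model.

    theta      : latent trait value
    a          : item discrimination (slope)
    thresholds : ordered category boundary parameters b_1 < ... < b_{m-1}
    Returns an array of m category probabilities that sums to 1.
    """
    b = np.asarray(thresholds, dtype=float)
    # cumulative probability of responding in category k or higher
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    upper = np.concatenate(([1.0], p_star))   # P*_0 = 1
    lower = np.concatenate((p_star, [0.0]))   # P*_m = 0
    return upper - lower

# Example: a 4-category item with slope 2.1 and thresholds spanning the trait range
probs = grm_category_probs(theta=-0.5, a=2.1, thresholds=[-1.5, -0.3, 0.9])
```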


2020 · Vol 44 (5) · pp. 362-375
Author(s): Tyler Strachan · Edward Ip · Yanyan Fu · Terry Ackerman · Shyh-Huei Chen · ...

As a method for deriving a "purified" measure along a dimension of interest from response data that are potentially multidimensional, the projective item response theory (PIRT) approach requires first fitting a multidimensional item response theory (MIRT) model to the data and then projecting onto the dimension of interest. This study explores how accurate the PIRT results are when the estimated MIRT model is misspecified. Specifically, we focus on using a (potentially misspecified) two-dimensional (2D) MIRT for projection because of its advantages, including interpretability, identifiability, and computational stability, over higher-dimensional models. Two large simulation studies (I and II) examined whether fitting a 2D-MIRT is sufficient to recover the PIRT parameters when multiple nuisance dimensions exist in the test items, which were generated under compensatory MIRT and bifactor models, respectively. Various factors were manipulated, including sample size, test length, latent factor correlation, and number of nuisance dimensions. The results of simulation studies I and II showed that PIRT was, overall, robust to a misspecified 2D-MIRT. Two smaller simulation studies (III and IV) evaluated recovery of the PIRT model parameters when the correctly specified higher-dimensional MIRT or bifactor model was fitted to the response data. In addition, a real data set was used to illustrate the robustness of PIRT.
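The projection step at the heart of PIRT can be sketched as marginalizing the nuisance dimension of a compensatory 2D-MIRT item against its conditional distribution given the dimension of interest, assuming a bivariate normal latent density. The fragment below is a minimal numerical illustration of that idea, not the estimation pipeline used in the study; all parameter values are assumptions.

```python
import numpy as np

def p_2d_mirt(theta1, theta2, a1, a2, d):
    """Compensatory 2D-MIRT (2PL form) probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(a1 * theta1 + a2 * theta2 + d)))

def projected_irf(theta1, a1, a2, d, rho, n_nodes=41):
    """Projected item response function on dimension 1.

    Integrates the nuisance dimension out against its conditional
    distribution theta2 | theta1 ~ N(rho * theta1, 1 - rho**2),
    assuming a standard bivariate normal latent density with correlation rho.
    """
    mean = rho * theta1
    sd = np.sqrt(1.0 - rho ** 2)
    nodes = np.linspace(mean - 4 * sd, mean + 4 * sd, n_nodes)
    weights = np.exp(-0.5 * ((nodes - mean) / sd) ** 2)
    weights /= weights.sum()
    return float(np.sum(weights * p_2d_mirt(theta1, nodes, a1, a2, d)))

# Example: an item loading on both dimensions, latent correlation 0.5
p = projected_irf(theta1=0.3, a1=1.2, a2=0.7, d=-0.4, rho=0.5)
```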

