An Alternative Method for Scoring Adaptive Tests

1996 ◽  
Vol 21 (4) ◽  
pp. 365-389 ◽  
Author(s):  
Martha L. Stocking

Modern applications of computerized adaptive testing are typically grounded in item response theory (IRT; Lord, 1980). While the IRT foundations of adaptive testing provide a number of approaches to adaptive test scoring that may seem natural and efficient to psychometricians, these approaches may be more demanding for test takers, test score users, and interested regulatory institutions to comprehend. An alternative method, based on more familiar equated number-correct scores and identical to that used to score and equate many conventional tests, is explored and compared with one that relies more directly on IRT. It is concluded that scoring adaptive tests using the familiar number-correct score, accompanied by the necessary equating to adjust for the intentional differences in adaptive test difficulty, is a statistically viable, although slightly less efficient, method of adaptive test scoring. To enhance the prospects for enlightened public debate about adaptive testing, it may be preferable to use this more familiar approach. Public attention would then likely be focused on issues more central to adaptive testing, namely, the adaptive nature of the test.
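
To make the contrast concrete, the following is a minimal Python sketch of the two scoring routes the abstract compares: equated number-correct scoring, illustrated here by inverting the test characteristic curve of the items an examinee actually received, versus pattern scoring that uses the full IRT likelihood. The 3PL item parameters and responses are hypothetical, and the inversion shown is only a schematic stand-in for the operational equating Stocking describes.

```python
# Sketch: number-correct scoring via the test characteristic curve (TCC)
# versus IRT pattern scoring, for one examinee's adaptively chosen items.
import numpy as np

def p_correct(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def tcc(theta, a, b, c):
    """Expected number-correct score on the administered items at theta."""
    return p_correct(theta, a, b, c).sum()

def theta_from_number_correct(x, a, b, c, grid=np.linspace(-4, 4, 801)):
    """Equated number-correct scoring: find theta whose TCC equals the observed score."""
    expected = np.array([tcc(t, a, b, c) for t in grid])
    return grid[np.argmin(np.abs(expected - x))]

def theta_mle(u, a, b, c, grid=np.linspace(-4, 4, 801)):
    """IRT pattern scoring: maximize the likelihood of the full response vector."""
    loglik = [np.sum(u * np.log(p_correct(t, a, b, c)) +
                     (1 - u) * np.log(1 - p_correct(t, a, b, c))) for t in grid]
    return grid[np.argmax(loglik)]

# Hypothetical parameters for the items one examinee happened to receive.
a = np.array([1.2, 0.9, 1.5, 1.1, 1.3])
b = np.array([-0.5, 0.0, 0.4, 0.8, 1.2])
c = np.full(5, 0.2)
u = np.array([1, 1, 1, 0, 0])                       # observed responses

print(theta_from_number_correct(u.sum(), a, b, c))  # number-correct based estimate
print(theta_mle(u, a, b, c))                        # IRT pattern-scoring estimate
```

Both estimates live on the same ability metric; the number-correct route discards the information about which particular items were answered correctly, which is the source of its slight loss of efficiency.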

1995 ◽  
Vol 13 (2) ◽  
pp. 151-162 ◽  
Author(s):  
Mary E. Lunz ◽  
Betty Bergstrom

Computerized adaptive testing (CAT) uses a computer algorithm to construct and score the best possible individualized or tailored test for each candidate. The computer also provides a complete record of all responses and changes to responses, as well as their effects on candidate performance. The detail of the data from computerized adaptive tests makes it possible to track initial responses and response alterations, and their effect on candidate ability estimates, as well as the statistical performance of the examination. The purpose of this study was to track the effect of candidate response patterns on a computerized adaptive test. A ninety-item certification examination was divided into nine units of ten items each to track the pattern of initial responses and response alterations, and their effects on ability estimates and test precision, across the nine test units. Test precision was affected most by response alterations made during early segments of the test. While candidates generally benefit from altering responses, individual candidates showed different patterns of response alterations across test segments. Overall, test precision was only minimally affected, suggesting that the tailoring of CAT is minimally affected by response alterations.
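
A minimal Python sketch of this kind of tracking is shown below: ability is re-estimated after each ten-item unit, once from the initial responses and once from the final (altered) responses, so the effect of the alterations on the running estimate can be seen. The item difficulties, response pattern, and altered items are all hypothetical, and a simple Rasch grid-search estimator stands in for whatever scoring the operational examination used.

```python
# Sketch: track ability estimates across nine 10-item units, comparing
# initial responses with responses after hypothetical alterations.
import numpy as np

def rasch_mle(u, b, grid=np.linspace(-4, 4, 801)):
    """Grid-search maximum-likelihood Rasch ability estimate."""
    p = 1 / (1 + np.exp(-(grid[:, None] - b[None, :])))
    loglik = (u * np.log(p) + (1 - u) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

rng = np.random.default_rng(0)
b = rng.normal(0, 1, 90)                       # 90 hypothetical item difficulties
initial = (rng.random(90) < 0.6).astype(int)   # hypothetical initial responses
final = initial.copy()
final[[7, 23, 55]] ^= 1                        # three hypothetical response alterations

for unit in range(1, 10):                      # cumulative estimate after each 10-item unit
    n = unit * 10
    print(unit, rasch_mle(initial[:n], b[:n]), rasch_mle(final[:n], b[:n]))
```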


2021 ◽  
Author(s):  
Bryant A Seamon ◽  
Steven A Kautz ◽  
Craig A Velozo

Abstract Objective Administrative burden often prevents clinical assessment of balance confidence in people with stroke. A computerized adaptive test (CAT) version of the Activities-specific Balance Confidence Scale (ABC CAT) can dramatically reduce this burden. The objective of this study was to test the measurement precision and efficiency of an ABC CAT for balance confidence in people with stroke. Methods We conducted a retrospective cross-sectional simulation study with data from 406 adults approximately 2 months post-stroke in the Locomotor Experience Applied Post-Stroke (LEAPS) trial. Item parameters for CAT calibration were estimated with the Rasch model using a random sample of participants (n = 203). Computer simulation was used with response data from the remaining 203 participants to evaluate the ABC CAT algorithm under varying stopping criteria. We compared estimated levels of balance confidence from each simulation with the actual levels predicted from the Rasch model, using Pearson correlations and mean standard error (SE). Results Results from simulations with number of items as a stopping criterion strongly correlated with actual ABC scores (full item, r = 1; 12-item, r = 0.994; 8-item, r = 0.98; 4-item, r = 0.929). Mean SE increased as the number of items administered decreased (full item, SE = 0.31; 12-item, SE = 0.33; 8-item, SE = 0.38; 4-item, SE = 0.49). A precision-based stopping rule (mean SE = 0.5) also strongly correlated with actual ABC scores (r = 0.941) and balanced the number of items administered against precision (mean number of items = 4.37; range = 4–9). Conclusions An ABC CAT can determine accurate and precise measures of balance confidence in people with stroke with as few as 4 items. Individuals with lower balance confidence may require a greater number of items (up to 9), which may be attributed to the LEAPS trial excluding more functionally impaired persons. Impact Statement Computerized adaptive testing can drastically reduce the ABC’s test administration time while maintaining accuracy and precision. This should greatly enhance clinical utility, facilitating adoption of clinical practice guidelines in stroke rehabilitation. Lay Summary If you have had a stroke, your physical therapist will likely test your balance confidence. A computerized adaptive test version of the ABC scale can accurately measure balance confidence with as few as 4 questions, which takes much less time.
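
The simulation logic described above can be sketched in a few lines of Python. The version below uses hypothetical dichotomous Rasch items rather than the polytomous ABC items, so it reaches the SE threshold more slowly than the study's CAT did; it is meant only to show the two stopping rules being compared (a fixed item count and a precision target), not to reproduce the reported numbers.

```python
# Sketch: CAT simulation with two stopping rules, a fixed number of items
# and a standard-error threshold, using hypothetical Rasch item difficulties.
import numpy as np

def rasch_p(theta, b):
    return 1 / (1 + np.exp(-(theta - b)))

def eap(u, b, grid=np.linspace(-4, 4, 401)):
    """EAP ability estimate and its posterior SD (used here as the SE)."""
    p = rasch_p(grid[:, None], np.asarray(b)[None, :])
    like = np.prod(np.where(np.asarray(u)[None, :] == 1, p, 1 - p), axis=1)
    post = like * np.exp(-grid**2 / 2)
    post /= post.sum()
    mean = (grid * post).sum()
    return mean, np.sqrt(((grid - mean) ** 2 * post).sum())

def simulate_cat(true_theta, bank_b, max_items, se_stop, rng=np.random.default_rng(1)):
    used, u = [], []
    theta, se = 0.0, np.inf
    while len(used) < max_items and se > se_stop:
        remaining = [i for i in range(len(bank_b)) if i not in used]
        # Rasch information is maximal for the item whose difficulty is nearest theta.
        nxt = min(remaining, key=lambda i: abs(bank_b[i] - theta))
        used.append(nxt)
        u.append(int(rng.random() < rasch_p(true_theta, bank_b[nxt])))
        theta, se = eap(u, [bank_b[i] for i in used])
    return theta, se, len(used)

bank = np.linspace(-3, 3, 16)                          # 16 hypothetical item difficulties
print(simulate_cat(0.8, bank, max_items=4, se_stop=0.0))    # fixed 4-item stopping rule
print(simulate_cat(0.8, bank, max_items=16, se_stop=0.5))   # precision-based stopping rule
```

The first call stops after a fixed four items; the second keeps administering the most informative remaining item until the posterior SE drops below 0.5 or the bank is exhausted.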



1982 ◽  
Vol 6 (4) ◽  
pp. 473-492 ◽  
Author(s):  
David J. Weiss

Approaches to adaptive (tailored) testing based on item response theory are described and research results summarized. Through appropriate combinations of item pool design and use of different test termination criteria, adaptive tests can be designed (1) to improve both measurement quality and measurement efficiency, resulting in measurements of equal precision at all trait levels; (2) to improve measurement efficiency for test batteries using item pools designed for conventional test administration; and (3) to improve the accuracy and efficiency of testing for classification (e.g., mastery testing). Research results show that tests based on item response theory (IRT) can achieve measurements of equal precision at all trait levels, given an adequately designed item pool; these results contrast with those of conventional tests, which require a tradeoff of bandwidth for fidelity/precision of measurement. Data also show reductions in bias, inaccuracy, and root mean square error of ability estimates. Improvements in test fidelity observed in simulation studies are supported by live-testing data, which showed adaptive tests requiring half as many items as conventional tests to achieve equal levels of reliability, and almost one-third as many to achieve equal levels of validity. When used with item pools from conventional tests, both simulation and live-testing results show reductions in test battery length relative to conventional tests, with no reduction in the quality of measurement. Adaptive tests designed for dichotomous classification also represent improvements over conventional tests designed for the same purpose. Simulation studies show reductions in test length and improvements in classification accuracy for adaptive vs. conventional tests; live-testing studies in which adaptive tests were compared with "optimal" conventional tests support these findings. Thus, the research data show that IRT-based adaptive testing takes advantage of the capabilities of IRT to improve the quality and/or efficiency of measurement for each examinee.
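
As an illustration of the third design goal, classification testing, here is a minimal Python sketch of an adaptive mastery test: items near the cut score are administered until a confidence interval around the ability estimate lies entirely above or below the cut. The item bank, cut score, and confidence level are hypothetical, and EAP scoring under the Rasch model stands in for whichever estimator an operational program would use.

```python
# Sketch: adaptive mastery (classification) testing with a
# confidence-interval stopping rule around a cut score.
import numpy as np

def rasch_p(theta, b):
    return 1 / (1 + np.exp(-(theta - b)))

def eap(u, b, grid=np.linspace(-4, 4, 401)):
    """EAP ability estimate and its posterior SD."""
    p = rasch_p(grid[:, None], np.asarray(b)[None, :])
    like = np.prod(np.where(np.asarray(u)[None, :] == 1, p, 1 - p), axis=1)
    post = like * np.exp(-grid**2 / 2)
    post /= post.sum()
    mean = (grid * post).sum()
    return mean, np.sqrt(((grid - mean) ** 2 * post).sum())

def adaptive_mastery(true_theta, cut=0.0, max_items=30, z=1.64, rng=np.random.default_rng(7)):
    bank = np.linspace(-2, 2, 60)          # hypothetical item difficulties
    used, u = [], []
    for _ in range(max_items):
        remaining = [i for i in range(len(bank)) if i not in used]
        nxt = min(remaining, key=lambda i: abs(bank[i] - cut))   # pick items near the cut
        used.append(nxt)
        u.append(int(rng.random() < rasch_p(true_theta, bank[nxt])))
        theta, se = eap(u, bank[used])
        if theta - z * se > cut:           # interval entirely above the cut
            return "master", len(used)
        if theta + z * se < cut:           # interval entirely below the cut
            return "non-master", len(used)
    return "undecided", len(used)

print(adaptive_mastery(true_theta=0.9))
print(adaptive_mastery(true_theta=-0.8))
```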


TESTFÓRUM ◽  
2016 ◽  
Vol 5 (7) ◽  
pp. 41-51 ◽
Author(s):  
Lenka Fiřtová

Computerized adaptive testing is a way to create customized tests and to develop new, interactive types of items; the results can then be used to create a tailor-made learning program. Scio started to develop adaptive tests in 2011, the adaptive test of English (SCATE) being one of the first. The aim was to create an item pool covering all the proficiency levels defined by the CEFR. In addition, we also created items for complete beginners and labeled this category “A0”. The items were divided into categories with respect to their difficulty and piloted on several hundred students whose level of English was known from prior testing (holders of internationally recognized certificates such as FCE, CAE, and TOEFL). When taking the test, respondents are presented with items that are appropriate for their ability level. The main idea is to present respondents with items that they do not find too easy but that they are still able to solve. First, respondents are presented with a set of randomly chosen items of varying difficulty, which yields a rough estimate of their ability level. The remaining items are then selected with respect to the respondents’ previous answers. The scoring algorithm is based on Measurement Decision Theory (MDT). Adaptive testing ensures that students are presented only with items they are likely to find interesting. The test thus becomes not only an assessment tool but also a tool that facilitates the learning process.
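
Since the abstract names Measurement Decision Theory as the scoring method, here is a minimal Python sketch of the MDT idea: the examinee's level is treated as one of a small set of CEFR categories, and each scored response updates a posterior over those categories. The per-level probabilities of a correct answer are hypothetical placeholders for the values a pilot calibration would supply.

```python
# Sketch: Measurement Decision Theory classification over CEFR-style levels.
import numpy as np

levels = ["A0", "A1", "A2", "B1", "B2", "C1", "C2"]

def update(posterior, p_correct_by_level, response):
    """Bayes update of the posterior over levels after one scored response."""
    like = p_correct_by_level if response == 1 else 1 - p_correct_by_level
    posterior = posterior * like
    return posterior / posterior.sum()

posterior = np.full(len(levels), 1 / len(levels))        # uniform prior over levels
# Hypothetical calibration: P(correct | level) for two piloted items.
item_b1 = np.array([0.05, 0.15, 0.35, 0.60, 0.80, 0.90, 0.95])   # a B1-level item
item_c1 = np.array([0.01, 0.03, 0.10, 0.25, 0.50, 0.75, 0.90])   # a C1-level item

posterior = update(posterior, item_b1, response=1)       # got the B1 item right
posterior = update(posterior, item_c1, response=0)       # missed the C1 item
print(dict(zip(levels, np.round(posterior, 3))))
print("classified as:", levels[int(np.argmax(posterior))])
```

Item selection can then favor items that best separate the currently most probable levels, which is what keeps the test at the edge of what the respondent can still solve.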


2017 ◽  
Vol 42 (5) ◽  
pp. 476-482 ◽  
Author(s):  
Dagmar Amtmann ◽  
Alyssa M Bamer ◽  
Jiseon Kim ◽  
Fraser Bocell ◽  
Hyewon Chung ◽  
...  

Background: New health status instruments can be administered by computerized adaptive test or short forms. The Prosthetic Limb Users Survey of Mobility (PLUS-M™) is a self-report measure of mobility for prosthesis users with lower limb loss. This study used the PLUS-M to examine the advantages and disadvantages of computerized adaptive test and short form administration. Objectives: To compare scores obtained from the computerized adaptive test with scores obtained from fixed-length short forms (7-item and 12-item) in order to provide guidance to researchers and clinicians on how to select the best form of administration for different uses. Study design: Cross-sectional, observational study. Methods: Individuals with lower limb loss completed the PLUS-M by computerized adaptive test and short forms. Administration time, correlations between the scores, and standard errors were compared. Results: Scores and standard errors from the computerized adaptive test, 7-item short form, and 12-item short form were highly correlated, and all forms of administration were efficient. The computerized adaptive test required less time to administer than either the paper or the electronic short forms; however, the time savings were minimal compared with the 7-item short form. Conclusion: Results indicate that the PLUS-M computerized adaptive test is the most efficient, and differences in scores between administration methods are minimal. The main advantage of the computerized adaptive test was more reliable scores at higher levels of mobility compared with the short forms. Clinical relevance Health-related item banks, like the Prosthetic Limb Users Survey of Mobility (PLUS-M™), can be administered by computerized adaptive testing (CAT) or as fixed-length short forms (SFs). Results of this study will help clinicians and researchers decide whether they should invest in a CAT administration system or whether SFs are more appropriate.
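
The main advantage reported above, more reliable CAT scores at high mobility levels, follows from where the administered items sit relative to the respondent. The Python sketch below makes that concrete with hypothetical Rasch item difficulties: a fixed 7-item short form centered on average mobility versus the 7 items a CAT would tend to pick for a high-mobility respondent. Neither the difficulties nor the model are the PLUS-M's actual calibration.

```python
# Sketch: standard error at a high mobility level for a fixed short form
# versus an adaptively targeted set of the same length.
import numpy as np

def rasch_info(theta, b):
    """Fisher information of Rasch items with difficulties b at ability theta."""
    p = 1 / (1 + np.exp(-(theta - np.asarray(b))))
    return (p * (1 - p)).sum()

bank = np.linspace(-3, 3, 40)              # hypothetical item bank
short_form = np.linspace(-1.5, 1.5, 7)     # fixed 7-item form centered on average mobility

theta = 2.5                                # a respondent with high mobility
cat_items = sorted(bank, key=lambda d: abs(d - theta))[:7]   # 7 items a CAT would favor

for name, items in [("short form", short_form), ("CAT", cat_items)]:
    se = 1 / np.sqrt(rasch_info(theta, items))
    print(f"{name}: SE at theta={theta} is {se:.2f}")
```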


2011 ◽  
Vol 2 (1) ◽  
pp. 1 ◽  
Author(s):  
David J. Weiss

The process of constructing a fixed-length conventional test frequently focuses on maximizing internal consistency reliability by selecting test items that are of average difficulty and high discrimination (a “peaked” test). The effect of constructing such a test, when viewed from the perspective of item response theory, is test scores that are precise for examinees whose trait levels are near the point at which the test is peaked; as examinee trait levels deviate from the mean, the precision of their scores decreases substantially. Results of a small simulation study demonstrate that when peaked tests are “off target” for an examinee, their scores are biased and have spuriously high standard deviations, reflecting substantial amounts of error. These errors can reduce the correlations of these kinds of scores with other variables and adversely affect the results of standard statistical tests. By contrast, scores from adaptive tests are essentially unbiased and have standard deviations that are much closer to true values. Basic concepts of adaptive testing are introduced and fully adaptive computerized tests (CATs) based on IRT are described. Several examples of response records from CATs are discussed to illustrate how CATs function. Some operational issues, including item exposure, content balancing, and enemy items, are also briefly discussed. It is concluded that because CAT constructs a unique test for each examinee, scores from CATs will be more precise and should provide better data for social science research and applications. DOI: 10.2458/azu_jmmss_v2i1_weiss
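
The “off target” loss of precision described above can be seen directly from the test information function. The following minimal Python sketch uses hypothetical 2PL parameters for a peaked test, thirty highly discriminating items all of average difficulty, and prints how the standard error grows as the examinee's trait level moves away from the peak; an adaptive test, by re-targeting items, would hold this standard error roughly constant across trait levels.

```python
# Sketch: standard error of a "peaked" test across the trait range.
import numpy as np

def info_2pl(theta, a, b):
    """Test information of 2PL items (a = discrimination, b = difficulty) at theta."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return (a**2 * p * (1 - p)).sum()

a = np.full(30, 1.5)                       # 30 highly discriminating items
b = np.zeros(30)                           # all peaked at average difficulty (theta = 0)

for theta in [-3, -2, -1, 0, 1, 2, 3]:
    se = 1 / np.sqrt(info_2pl(theta, a, b))
    print(f"theta = {theta:+d}: SE = {se:.2f}")   # SE grows rapidly off target
```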


1999 ◽  
Vol 15 (2) ◽  
pp. 91-98 ◽  
Author(s):  
Lutz F. Hornke

Summary: Item parameters for several hundred items were estimated from empirical data on several thousand subjects. Estimates under the logistic one-parameter (1PL) and two-parameter (2PL) models were evaluated. However, model fit showed that only a subset of the items complied sufficiently; these remaining items were assembled into well-fitting item banks. In several simulation studies, 5,000 simulated response records were generated, along with person parameters, in accordance with a computerized adaptive testing procedure. A general reliability of .80, or a standard error of measurement of .44, was used as the stopping rule to end CAT testing. We also recorded how often each item was used across all simulees. Person-parameter estimates based on CAT correlated higher than .90 with the simulated true values. For all 1PL-fitting item banks, most simulees needed more than 20 but fewer than 30 items to reach the preset level of measurement error. However, testing based on item banks that complied with the 2PL revealed that, on average, only 10 items were sufficient to end testing at the same measurement error level. Both results clearly demonstrate the precision and economy of computerized adaptive testing. Empirical evaluations from everyday use will show whether these trends hold up in practice. If so, CAT will become possible and reasonable with some 150 well-calibrated 2PL items.
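
The pairing of a reliability of .80 with a standard error of .44 follows from the usual approximation reliability ≈ 1 − SE² on a unit-variance ability scale, and the 1PL-versus-2PL item counts reported above can be roughly reproduced with a back-of-the-envelope calculation. The Python sketch below assumes every administered item is perfectly on target and uses a hypothetical discrimination of 1.6 for the 2PL bank, so it illustrates the mechanism rather than the study's actual simulations.

```python
# Sketch: how many on-target items are needed to reach SE <= 0.44
# under the 1PL versus a higher-discrimination 2PL.
import numpy as np

print(np.sqrt(1 - 0.80))                   # ≈ 0.447, the SE target quoted above

def item_info(theta, a, b):
    """Fisher information of one 2PL item at theta (a = 1 gives the 1PL case)."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

for label, a in [("1PL (a = 1.0)", 1.0), ("2PL, high discrimination (a = 1.6)", 1.6)]:
    info, n = 0.0, 0
    while info == 0 or 1 / np.sqrt(info) > 0.44:
        info += item_info(theta=0.0, a=a, b=0.0)   # each item administered exactly on target
        n += 1
    print(label, "items needed:", n)
```

With these idealized assumptions the 1PL bank needs about 21 items and the 2PL bank about 9 to reach the SE target, in line with the ranges reported in the abstract.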

