CUSUM-Based Person-Fit Statistics for Adaptive Testing

2001 ◽  
Vol 26 (2) ◽  
pp. 199-217 ◽  
Author(s):  
Edith M.L.A. van Krimpen-Stoop ◽  
Rob R. Meijer

Item scores that do not fit an assumed item response theory model may cause the latent trait value to be estimated inaccurately. Several person-fit statistics for detecting nonfitting score patterns have been proposed for paper-and-pencil tests, but the use of person-fit analysis in computerized adaptive tests (CAT) has hardly been explored. Because the distributions of existing person-fit statistics have been shown not to hold in a CAT, new person-fit statistics are proposed in this study, and critical values for these statistics are derived from existing statistical theory. The proposed statistics are sensitive to runs of correct or incorrect item scores and are based on all items administered in a CAT or on subsets of items, using observed and expected item scores and cumulative sum (CUSUM) procedures. The theoretical and empirical distributions of the statistics are compared and detection rates are investigated. Results showed that the nominal and empirical Type I error rates were comparable for CUSUM procedures when the number of items in each subset and the number of measurement points were not too small. Detection rates of the CUSUM procedures were superior to those of other fit statistics. Applications of the statistics are discussed.
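As an illustration of the general mechanics the abstract describes, below is a minimal CUSUM sketch over item-score residuals, assuming a 2PL model and a fixed ability estimate; the per-item scaling and the bound h are illustrative placeholders, not the statistics or critical values derived in the article, and in a real CAT the ability estimate would be updated after each administered item.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def cusum_person_fit(scores, theta, a, b, h=0.5):
    """Run upper and lower CUSUMs over the scaled residuals
    (x_i - P_i) / n in administration order; signal misfit as soon
    as either sum crosses the (illustrative) bound h."""
    scores = np.asarray(scores, dtype=float)
    residuals = (scores - p_correct(theta, np.asarray(a), np.asarray(b))) / len(scores)
    c_plus = c_minus = 0.0
    for r in residuals:
        c_plus = max(0.0, c_plus + r)    # accumulates runs of unexpected successes
        c_minus = min(0.0, c_minus + r)  # accumulates runs of unexpected failures
        if c_plus > h or -c_minus > h:
            return True
    return False
```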

2017 ◽  
Vol 42 (5) ◽  
pp. 343-358
Author(s):  
Yan Xia ◽  
Yi Zheng

Snijders developed a family of person-fit indices that asymptotically follow the standard normal distribution when the ability parameter is estimated. So far, lz*, U*, W*, ζ1*, and ζ2* from this family have been proposed in the literature. One property shared by lz*, U*, and W* (and by ζ1* and ζ2* under some specific conditions) is that they employ symmetric weight functions and thus flag spurious scores on easy and difficult items in the same manner. However, when the purpose is to detect only spuriously high scores on difficult items, arising from, for example, cheating, guessing, or item preknowledge, symmetric weight functions may reduce the detection rates for the target aberrant response patterns. By specifying two types of asymmetric weight functions, this study proposes SHa(λ)* (λ = 1/2 or 1) and SHb(β)* (β = 2 or 3) within Snijders's framework to specifically detect spuriously high scores on difficult items. Two simulation studies investigated the Type I error rates and empirical power of SHa(λ)* and SHb(β)*, compared with lz*, U*, W*, ζ1*, and ζ2*. The empirical results demonstrated satisfactory performance of the proposed indices. Recommendations are also made on the choice among person-fit indices for specific purposes.
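To make the role of the weight function concrete, here is a hedged sketch of a standardized weighted-residual person-fit statistic of this general form under a 2PL model. The asymmetric weight shown (zero for easy items, growing with difficulty) is an illustrative stand-in, and the sketch deliberately omits Snijders's correction for an estimated ability.

```python
import numpy as np

def weighted_person_fit(scores, theta, a, b, w):
    """Standardized weighted residual statistic
    W = sum_i w_i (x_i - P_i) / sqrt(sum_i w_i^2 P_i (1 - P_i)),
    approximately N(0, 1) at the true theta (no correction for
    estimation of theta, which Snijders's derivation supplies)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # 2PL probabilities
    w = np.asarray(w, dtype=float)
    return np.sum(w * (scores - p)) / np.sqrt(np.sum(w**2 * p * (1 - p)))

b = np.linspace(-2, 2, 40)        # illustrative difficulty parameters

# A symmetric weight treats easy and hard items alike ...
w_symmetric = np.ones_like(b)

# ... while this illustrative asymmetric weight is zero for easy items,
# so only unexpected successes on difficult items move the statistic.
w_asymmetric = np.maximum(b, 0.0)
```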


2019 ◽  
Vol 44 (3) ◽  
pp. 167-181 ◽  
Author(s):  
Wenchao Ma

Limited-information fit measures appear promising for assessing the goodness of fit of dichotomous-response cognitive diagnosis models (CDMs), but their performance has not been examined for polytomous-response CDMs. This study investigates the performance of the M_ord statistic and the standardized root mean square residual (SRMSR) for an ordinal-response CDM, the sequential generalized deterministic inputs, noisy "and" gate (sequential G-DINA) model. Simulation studies showed that the M_ord statistic had well-calibrated Type I error rates, but its correct detection rates were influenced by factors such as item quality, sample size, and the number of response categories. The SRMSR was also influenced by many factors, so the common practice of comparing the SRMSR against a prespecified cut-off (e.g., .05) may not be appropriate. A real data set was analyzed as well to illustrate the use of the M_ord statistic and the SRMSR in practice.
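For concreteness, a minimal sketch of an SRMSR-style computation follows. It assumes the observed and model-implied item-pair correlation matrices have already been obtained from the data and the fitted CDM, which is where the substantive modeling work lies.

```python
import numpy as np

def srmsr(obs_corr, model_corr):
    """Standardized root mean square residual: root mean square of the
    differences between observed and model-implied correlations over
    the unique item pairs (upper triangle, diagonal excluded)."""
    iu = np.triu_indices_from(obs_corr, k=1)
    resid = obs_corr[iu] - model_corr[iu]
    return float(np.sqrt(np.mean(resid ** 2)))
```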


2011 ◽  
Vol 71 (6) ◽  
pp. 986-1005 ◽  
Author(s):  
Ying Li ◽  
André A. Rupp

This study investigated the Type I error rate and power of the multivariate extension of the S − χ² statistic for unidimensional and multidimensional item response theory (UIRT and MIRT, respectively) models as well as full-information bifactor (FI-bifactor) models through simulation. Manipulated factors included test length, sample size, latent trait characteristics such as discrimination pattern and intertrait correlations, and type of model misspecification. Nominal Type I error rates were observed under all conditions. The power of the S − χ² statistic for UIRT models was high against MIRT and FI-bifactor models that were structurally most distinct from the UIRT models, but was low otherwise. The power of the S − χ² statistic to detect misfit between MIRT and FI-bifactor models was low across all conditions because of the structural similarity of these two models. Finally, information-based indices of relative model-data fit and latent variable correlations were obtained, and these showed the expected patterns across conditions.
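As a reference point for the family of statistics being extended, here is a hedged sketch of the raw-score-group form underlying an Orlando–Thissen-type S − χ² for a single item. The expected proportions would come from the fitted model's recursive raw-score likelihood, which is assumed to be computed elsewhere.

```python
import numpy as np

def s_chi2_item(n_k, o_k, e_k):
    """S - chi^2 for one item: sum over raw-score groups k of
    N_k * (O_k - E_k)^2 / (E_k * (1 - E_k)), where O_k and E_k are the
    observed and model-expected proportions correct in group k and
    N_k is the group size."""
    n_k, o_k, e_k = (np.asarray(x, dtype=float) for x in (n_k, o_k, e_k))
    return float(np.sum(n_k * (o_k - e_k) ** 2 / (e_k * (1 - e_k))))
```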


Author(s):  
Önder Sünbül ◽ 
Seha Yormaz

In this study, the Type I error and power rates of the ω and GBT (generalized binomial test) answer-copying indices were investigated for several nominal alpha levels, for 40- and 80-item test lengths, with a 10,000-examinee sample size, under several test-level restrictions. The Type I error rates of both indices were found to be below the acceptable nominal alpha levels. The power study showed that average test difficulty strongly affected the power (true detection) rates of the indices: the power of both ω and GBT increased clearly as test difficulty increased. Contrary to expectations, average test discrimination was not as influential as average test difficulty. The interaction of item discrimination and difficulty showed that when b parameters were below 0 and discrimination was weak, both ω and GBT had weak power; when b parameters were below 0 and discrimination was high, the power of both answer-copying indices was also very weak. Results for test length showed that the power rates of both ω and GBT tended to increase with test length. Finally, ω performed slightly better than, or very close to, GBT for the 80-item test length, whereas ω outperformed GBT in power for the 40-item test length.
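To illustrate what the two indices compute, here is a hedged sketch assuming the model-implied match probabilities p_i (the probability, under a nominal response model, that the suspected copier would give the source's answer on item i) are already available; obtaining those probabilities is the substantive modeling step.

```python
import numpy as np

def omega_index(matches, p):
    """omega: standardized difference between the observed number of
    answer matches and its model expectation, referred to N(0, 1)."""
    p = np.asarray(p, dtype=float)
    mu, var = p.sum(), np.sum(p * (1 - p))
    return (matches - mu) / np.sqrt(var)

def gbt_pvalue(matches, p):
    """GBT: exact upper-tail probability of observing at least `matches`
    matches under a generalized (Poisson) binomial with per-item
    match probabilities p, built up by convolution."""
    dist = np.array([1.0])
    for pi in np.asarray(p, dtype=float):
        dist = np.convolve(dist, [1 - pi, pi])  # add one item's match/no-match
    return float(dist[matches:].sum())
```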



Methodology ◽  
2015 ◽  
Vol 11 (1) ◽  
pp. 3-12 ◽  
Author(s):  
Jochen Ranger ◽  
Jörg-Tobias Kuhn

In this manuscript, a new approach to the analysis of person fit is presented that is based on the information matrix test of White (1982). This test can be interpreted as a test of trait stability during the measurement situation. The test statistic approximately follows a χ²-distribution; in small samples, the approximation can be improved by a higher-order expansion. A simulation study exploring the performance of the test suggests that it adheres well to the nominal Type I error rate, although it tends to be conservative in very short scales. The power of the test is compared with the power of four alternative tests of person fit; this comparison corroborates that the power of the information matrix test is similar to that of the alternatives. Advantages and areas of application of the information matrix test are discussed.
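The core of White's information matrix test is the identity that, under a correctly specified model, the Hessian of the log-likelihood and the outer product of its gradient cancel in expectation. A minimal per-item sketch under a 2PL model follows; the actual test aggregates these indicators and normalizes by their variance to obtain the χ² statistic, which is omitted here.

```python
import numpy as np

def info_matrix_indicators(scores, theta, a, b):
    """Per-item indicators d_i = hess_i + grad_i**2 of the item
    log-likelihood in theta under a 2PL model. By White's information
    matrix identity, E[d_i] = 0 when the model is correctly specified,
    so large standardized sums of the d_i signal misfit."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    grad = a * (scores - p)            # d/dtheta of the item log-likelihood
    hess = -(a ** 2) * p * (1 - p)     # d^2/dtheta^2 of the item log-likelihood
    return hess + grad ** 2
```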


2014 ◽  
Vol 53 (05) ◽  
pp. 343-343

We have to report marginal changes in the empirical Type I error rates for the cut-offs 2/3 and 4/7 in Tables 4, 5, and 6 of the paper "Influence of Selection Bias on the Test Decision – A Simulation Study" by M. Tamm, E. Cramer, L. N. Kennes, and N. Heussen (Methods Inf Med 2012; 51: 138–143). In a small number of cases, the floating-point representation of numeric values in SAS resulted in incorrect categorization due to representation error in the computed differences. We corrected the simulation by applying the SAS round function in the calculation process, using the same seeds as before. In Table 4, the value for cut-off 2/3 changes from 0.180323 to 0.153494. In Table 5, the value for cut-off 4/7 changes from 0.144729 to 0.139626, and the value for cut-off 2/3 changes from 0.114885 to 0.101773. In Table 6, the value for cut-off 4/7 changes from 0.125528 to 0.122144, and the value for cut-off 2/3 changes from 0.099488 to 0.090828. The sentence on p. 141 "E.g. for block size 4 and q = 2/3 the type I error rate is 18% (Table 4)." has to be replaced by "E.g. for block size 4 and q = 2/3 the type I error rate is 15.3% (Table 4)." All changes are smaller than 0.03 and do not affect the interpretation of the results or our recommendations.
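The failure mode described here is easy to reproduce in any floating-point system. A toy illustration in Python follows (the erratum concerns SAS, but the representation issue is the same):

```python
# A difference that is mathematically equal to the cutoff 2/3 can land
# on the wrong side of it after floating-point rounding. The intended
# computation below is 0.8 - 2/15 = 2/3 exactly.
q = 2 / 3                          # stored as 0.6666666666666666
diff = 0.8 - 0.13333333333333333   # stored as 0.6666666666666667

print(diff > q)   # True: representation error miscategorizes the case

# Rounding both sides before comparing restores the intended result.
print(round(diff, 10) > round(q, 10))  # False
```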


2021 ◽  
pp. 001316442199489
Author(s):  
Luyao Peng ◽  
Sandip Sinharay

Wollack et al. (2015) suggested the erasure detection index (EDI) for detecting fraudulent erasures for individual examinees. Wollack and Eckerly (2017) and Sinharay (2018) extended this index to suggest three EDIs for detecting fraudulent erasures at the aggregate or group level. This article follows up on that research and suggests a new aggregate-level EDI that incorporates the empirical best linear unbiased predictor (EBLUP) from the literature on linear mixed-effects models (e.g., McCulloch et al., 2008). A simulation study shows that the new EDI has greater power than the indices of Wollack and Eckerly (2017) and Sinharay (2018) while maintaining satisfactory Type I error rates. A real data example is also included.
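To give a sense of the EBLUP ingredient: in a simple one-way random-effects model for examinee-level erasure scores nested in groups, the EBLUP of a group effect is the group-mean deviation shrunk toward zero by a reliability factor. The sketch below is illustrative only, not the article's index, and assumes the variance components have been estimated elsewhere.

```python
import numpy as np

def eblup_group_effects(y, groups, tau2, sigma2):
    """EBLUPs of the group effects u_g in y_ij = mu + u_g + e_ij:
    shrink each group-mean deviation from the grand mean by
    tau2 / (tau2 + sigma2 / n_g), where tau2 and sigma2 are the
    between- and within-group variances. Large positive values
    flag groups with unusually high erasure scores."""
    y, groups = np.asarray(y, dtype=float), np.asarray(groups)
    grand_mean = y.mean()
    effects = {}
    for g in np.unique(groups):
        y_g = y[groups == g]
        shrinkage = tau2 / (tau2 + sigma2 / len(y_g))
        effects[g] = shrinkage * (y_g.mean() - grand_mean)
    return effects
```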

