Bayesian modelling of differential item functioning: type I error and power rates in the presence of non-normal ability distributions, impact, and anchor set contamination

Author(s): W. Holmes Finch, Brian F. French
2020, Vol 45 (1), pp. 37-53

Author(s): Wenchao Ma, Ragip Terzi, Jimmy de la Torre

This study proposes a multiple-group cognitive diagnosis model to account for the fact that students in different groups may use distinct attributes, or use the same attributes in different ways (e.g., conjunctive, disjunctive, and compensatory), to solve problems. Based on the proposed model, this study systematically investigates the performance of the likelihood ratio (LR) test and the Wald test in detecting differential item functioning (DIF). A forward anchor item search procedure is also proposed to identify a set of anchor items with invariant item parameters across groups. Results showed that the LR and Wald tests with the forward anchor item search algorithm produced better-calibrated Type I error rates than the ordinary LR and Wald tests, especially when items were of low quality. A set of real data was also analyzed to illustrate the use of these DIF detection procedures.
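As a rough illustration of the kind of Wald-type comparison evaluated here (a generic sketch, not the authors' multiple-group cognitive diagnosis model implementation), the code below computes a Wald DIF statistic from an item's parameter estimates and covariance matrices obtained separately in a reference and a focal group. The parameter values and covariance matrices are made-up placeholders.

```python
# Generic Wald-type DIF statistic: compares an item's parameter estimates
# between a reference and a focal group. Minimal sketch with placeholder
# numbers; not the multiple-group CDM calibration used in the study.
import numpy as np
from scipy.stats import chi2

def wald_dif(est_ref, est_foc, cov_ref, cov_foc):
    """Wald statistic for H0: the item's parameters are equal across groups."""
    diff = np.asarray(est_ref) - np.asarray(est_foc)
    pooled_cov = np.asarray(cov_ref) + np.asarray(cov_foc)
    stat = float(diff @ np.linalg.inv(pooled_cov) @ diff)
    df = diff.size
    return stat, df, chi2.sf(stat, df)

# Placeholder estimates for one item (e.g., two item parameters per group).
est_ref = [0.15, 0.10]
est_foc = [0.25, 0.08]
cov_ref = np.diag([0.002, 0.001])
cov_foc = np.diag([0.003, 0.001])

stat, df, p = wald_dif(est_ref, est_foc, cov_ref, cov_foc)
print(f"Wald = {stat:.2f}, df = {df}, p = {p:.4f}")
```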


2016, Vol 77 (3), pp. 415-428
Author(s): David R. J. Fikis, T. C. Oshima

Purification of the test is a well-accepted procedure for enhancing the performance of tests for differential item functioning (DIF). As defined by Lord, purification requires re-estimating the ability parameters after removing DIF items and before conducting the final DIF analysis. IRTPRO 3 is a recently updated program for item response theory analyses, with built-in DIF tests but no purification procedures. A simulation study was conducted to investigate the effect of two new purification methods. The results suggested that one of the purification procedures showed significantly improved power and Type I error control. The procedure, which can be cumbersome to carry out by hand, can be applied easily by practitioners through the web-based program developed for this study.
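Lord-style purification re-estimates the matching criterion without the flagged items before the final DIF pass. As a rough observed-score stand-in for that idea (not the IRTPRO-based procedures studied here), the sketch below iteratively flags items with a logistic-regression DIF test and recomputes the matching rest score from the currently clean items; the data are simulated for illustration.

```python
# Iterative purification sketch: flag DIF items with a logistic-regression
# test, drop them from the matching score, and repeat until the flagged set
# stabilizes. Uses an observed rest score in place of re-estimated IRT
# ability, so it only mimics the logic of Lord's purification.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, n_items = 1000, 10
group = rng.integers(0, 2, n)                 # 0 = reference, 1 = focal
theta = rng.normal(0, 1, n)
dif_shift = np.zeros(n_items)
dif_shift[0] = 0.8                            # item 0 carries uniform DIF
prob = 1 / (1 + np.exp(-(theta[:, None] - dif_shift * group[:, None])))
resp = (rng.random((n, n_items)) < prob).astype(int)

def lr_dif_pvalue(y, match, grp):
    """LR test of adding group and group-by-match terms to a logistic model."""
    base = sm.Logit(y, sm.add_constant(match)).fit(disp=0)
    full = sm.Logit(y, sm.add_constant(np.column_stack([match, grp, match * grp]))).fit(disp=0)
    return chi2.sf(2 * (full.llf - base.llf), df=2)

flagged = set()
for _ in range(10):                           # purification iterations
    clean = [j for j in range(n_items) if j not in flagged]
    new_flags = {j for j in range(n_items)
                 if lr_dif_pvalue(resp[:, j],
                                  resp[:, [k for k in clean if k != j]].sum(axis=1),
                                  group) < 0.05}
    if new_flags == flagged:
        break
    flagged = new_flags

print("Items flagged after purification:", sorted(flagged))
```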


2020, Vol 18 (1), pp. 2-26
Author(s): Yan Liu, Chanmin Kim, Amery D. Wu, Paul Gustafson, Edward Kroc, ...

To evaluate the performance of propensity score approaches for differential item functioning (DIF) analysis, this simulation study assessed bias, mean square error, Type I error, and power under different levels of effect size and a variety of model misspecification conditions, including different covariate types and patterns of missing covariate data.
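The abstract does not spell out a specific estimator, so the sketch below shows one plausible reading (an assumption, not the authors' exact procedure): estimate a propensity score for focal-group membership from observed covariates, form inverse-probability weights, and run a weighted logistic-regression test of uniform DIF. Data and covariates are simulated, and the weighted fit is used only for point estimates.

```python
# Propensity-score-weighted DIF sketch: balance the groups on observed
# covariates via inverse-probability weighting, then test one item for
# uniform DIF with a weighted logistic regression. Simulated data; one
# plausible implementation, not the specific estimators from the study.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, n_items = 2000, 10
covs = rng.normal(size=(n, 2))                                  # observed covariates
group = rng.binomial(1, 1 / (1 + np.exp(-0.8 * covs[:, 0])))    # membership depends on covariates
theta = 0.5 * covs[:, 0] + rng.normal(size=n)                   # ability also related to covariates
resp = (rng.random((n, n_items)) < 1 / (1 + np.exp(-theta[:, None]))).astype(int)
item, rest = resp[:, 0], resp[:, 1:].sum(axis=1)                # studied item and rest score

# Step 1: propensity score for focal-group membership given the covariates.
ps = sm.Logit(group, sm.add_constant(covs)).fit(disp=0).predict(sm.add_constant(covs))
weights = np.where(group == 1, 1 / ps, 1 / (1 - ps))            # inverse-probability weights

# Step 2: weighted logistic regression; the group coefficient is the uniform-DIF effect.
X = sm.add_constant(np.column_stack([rest, group]))
fit = sm.GLM(item, X, family=sm.families.Binomial(), var_weights=weights).fit()
print("Uniform DIF estimate:", round(fit.params[2], 3), "p =", round(fit.pvalues[2], 4))
```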


2022, pp. 001316442110684
Author(s): Natalie A. Koziol, J. Marc Goodrich, HyeonJin Yoon

Differential item functioning (DIF) analysis is often used to examine validity evidence for alternate-form test accommodations. Unfortunately, traditional approaches for evaluating DIF are prone to selection bias. This article proposes a novel DIF framework that capitalizes on regression discontinuity design analysis to control for selection bias. A simulation study compared the new framework with traditional logistic regression with respect to the Type I error and power rates of the uniform DIF test statistics, and the bias and root mean square error of the corresponding effect size estimators. The new framework better controlled the Type I error rate and demonstrated minimal bias, but suffered from low power and a lack of precision. Implications for practice are discussed.
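The article's framework is not reproduced here. As a heavily simplified sketch of how a regression discontinuity setup can be attached to an item-level model (my assumption, not the authors' specification), the code below fits a logistic model with the centered assignment variable, an indicator for falling below the accommodation cutoff, and their interaction; the coefficient on the indicator estimates the discontinuity at the cutoff for the studied item and is read as the uniform DIF effect.

```python
# Sharp-RD-style sketch for one item: the jump in the response probability at
# the eligibility cutoff is attributed to the accommodated (alternate) form.
# Heavily simplified illustration with simulated data; not the framework
# proposed in the article.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 3000
screener = rng.normal(size=n)                    # assignment (running) variable
cutoff = -0.5
accommodated = (screener < cutoff).astype(int)   # alternate form assigned below the cutoff
theta = 0.7 * screener + rng.normal(size=n)
dif_effect = 0.4                                 # built-in uniform DIF on the studied item
p = 1 / (1 + np.exp(-(theta + dif_effect * accommodated)))
item = rng.binomial(1, p)

centered = screener - cutoff
X = sm.add_constant(np.column_stack([centered, accommodated, centered * accommodated]))
fit = sm.Logit(item, X).fit(disp=0)
print("Estimated discontinuity (uniform DIF) at the cutoff:",
      round(fit.params[2], 3), "p =", round(fit.pvalues[2], 4))
```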


Psych, 2021, Vol 3 (4), pp. 619-639
Author(s): Rudolf Debelak, Dries Debeer

Multistage tests are a widely used and efficient type of test presentation that aims to provide accurate ability estimates while keeping the test relatively short. Multistage tests typically rely on the psychometric framework of item response theory. Violations of item response models and other assumptions underlying a multistage test, such as differential item functioning (DIF), can lead to inaccurate ability estimates and unfair measurements. There is a practical need for methods to detect such problematic model violations. This study compares and evaluates three methods for detecting DIF with respect to continuous person covariates in data from multistage tests: a linear logistic regression test and two adaptations of a recently proposed score-based DIF test. While all tests show a satisfactory Type I error rate, the score-based tests show greater power against three types of DIF effects.
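To illustrate the first of the three methods, a logistic-regression DIF test against a continuous covariate, the sketch below compares nested logistic models for one item using an ability estimate and a continuous covariate such as age. The data are simulated single-stage responses; the routing structure of a real multistage test is ignored, and the covariate name is purely illustrative.

```python
# Logistic-regression DIF test for a continuous person covariate: compare a
# model using only the ability estimate with one adding the covariate and its
# interaction with ability. Simulated data for illustration only.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(11)
n = 1500
age = rng.uniform(20, 60, n)                 # continuous person covariate
theta = rng.normal(size=n)                   # treated as the ability estimate
dif_slope = 0.02                             # item gets easier with age (uniform DIF)
p = 1 / (1 + np.exp(-(theta + dif_slope * (age - age.mean()))))
item = rng.binomial(1, p)

base = sm.Logit(item, sm.add_constant(theta)).fit(disp=0)
full = sm.Logit(item, sm.add_constant(np.column_stack([theta, age, theta * age]))).fit(disp=0)
lr = 2 * (full.llf - base.llf)               # 2 df: covariate main effect + interaction
print(f"LR = {lr:.2f}, p = {chi2.sf(lr, df=2):.4f}")
```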


Methodology, 2012, Vol 8 (4), pp. 134-145
Author(s): Fabiola González-Betanzos, Francisco J. Abad

The current research compares the effects of several strategies for establishing the anchor subtest when testing for differential item functioning (DIF) with the IRT likelihood ratio test in one- and two-stage procedures. Two one-stage strategies were examined: (1) "one item" and (2) "all other items" used as the anchor. Additionally, two two-stage strategies were tested: (3) "one anchor item with posterior anchor test augmentation" and (4) "all other items with purification." The strategies were compared in a simulation study in which sample size, DIF size, type of DIF, and software implementation (MULTILOG vs. IRTLRDIF) were manipulated. Results indicated that Procedure (1) was more efficient than Procedure (2). Purification was found to improve Type I error rates substantially with the "all other items" strategy, whereas "posterior anchor test augmentation" did not yield a significant improvement. Regarding the software used, MULTILOG generally offered better results than IRTLRDIF.
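Whatever anchor strategy is used, the IRT likelihood ratio test itself reduces to comparing the log-likelihoods of a compact model (studied item constrained equal across groups, with the anchor defining the metric) and an augmented model (studied item's parameters freed). The helper below shows only that final comparison, with made-up log-likelihood values standing in for what a calibration program such as MULTILOG or IRTLRDIF would return; the model fitting itself is not reproduced.

```python
# Final step of the IRT likelihood ratio DIF test: G^2 comparison of a compact
# model (studied item's parameters equal across groups) against an augmented
# model (those parameters freed). Log-likelihood values are placeholders.
from scipy.stats import chi2

def irt_lr_test(loglik_compact, loglik_augmented, n_freed_params):
    g2 = 2 * (loglik_augmented - loglik_compact)
    return g2, chi2.sf(g2, df=n_freed_params)

g2, p = irt_lr_test(loglik_compact=-10452.3,    # placeholder value
                    loglik_augmented=-10448.1,  # placeholder value
                    n_freed_params=2)           # e.g., a and b of a 2PL item
print(f"G^2 = {g2:.2f}, p = {p:.4f}")
```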

