A Robust Method for Detecting Item Misfit in Large Scale Assessments

2021 ◽  
Author(s):  
Matthias von Davier ◽  
Ummugul Bezirhan

Viable methods for the identification of item misfit or Differential Item Functioning (DIF) are central to scale construction and sound measurement. Many approaches rely on the derivation of a limiting distribution under the assumption that a certain model fits the data perfectly. Typical assumptions such as the monotonicity and population independence of item functions are present even in classical test theory but are more explicitly stated when using item response theory or other latent variable models for the assessment of item fit. The work presented here provides an alternative approach that does not assume perfect model data fit, but rather uses Tukey’s concept of contaminated distributions and proposes an application of robust outlier detection in order to flag items for which adequate model data fit cannot be established.
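As a rough illustration of the flagging idea described above (a sketch under assumptions, not the authors' exact procedure), the snippet below applies a robust median/MAD outlier rule to a hypothetical vector of item-fit statistics and flags items whose robust z-score exceeds a cutoff. The function name robust_flags and the example values are illustrative only.

```python
# Minimal sketch (not the authors' exact procedure): flag potentially misfitting
# items by applying a robust outlier rule to a vector of item-fit statistics.
# The fit statistics themselves are assumed to have been computed elsewhere.
import numpy as np

def robust_flags(fit_stats, c=3.0):
    """Flag items whose fit statistic lies far from the bulk of the items.

    Uses the median and the MAD (scaled for consistency with the standard
    deviation under normality) so that a few contaminated items do not
    distort the reference distribution.
    """
    fit_stats = np.asarray(fit_stats, dtype=float)
    med = np.median(fit_stats)
    mad = np.median(np.abs(fit_stats - med)) * 1.4826  # consistency factor
    robust_z = (fit_stats - med) / mad
    return np.abs(robust_z) > c  # True = flag item for review

# Hypothetical example: one item with a clearly aberrant fit value.
stats = [0.1, -0.3, 0.2, 0.05, -0.1, 2.8, 0.15, -0.2]
print(robust_flags(stats))  # only the sixth item is flagged
```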


2014 ◽  
Vol 10 (2) ◽  
pp. 212-230 ◽  
Author(s):  
Jørgen Sjaastad

This article presents the basic rationale of Rasch theory and seven core properties of Rasch modeling: analyses of test targeting, person separation, person fit, item fit, differential item functioning, functioning of response categories, and tests of unidimensionality. Illustrative examples are provided for each property in turn, drawing on a Rasch analysis of data from a survey in which 9th-grade students responded to questions regarding their mathematics competence. The relationship between Rasch theory and classical test theory is also commented on. Rasch theory provides science and mathematics education researchers with valuable tools to evaluate the psychometric quality of tests and questionnaires and to support their development.
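As a minimal illustration of two of the properties listed above, the sketch below assumes a dichotomous Rasch model with person and item parameters already estimated and computes the model probability together with an unweighted outfit mean-square item-fit index; the data and parameter values are hypothetical.

```python
# Illustrative sketch, assuming a dichotomous Rasch model with person abilities
# and item difficulties already estimated: the model probability and the
# (unweighted) outfit mean-square item-fit index.
import numpy as np

def rasch_prob(theta, beta):
    """P(X = 1) under the Rasch model for ability theta and difficulty beta."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def outfit_msq(responses, theta, beta):
    """Outfit mean-square for one item: mean of squared standardized residuals."""
    p = rasch_prob(theta, beta)
    z2 = (responses - p) ** 2 / (p * (1 - p))
    return z2.mean()  # values far from 1 suggest item misfit

# Hypothetical data: six persons answering one item of difficulty 0.5.
theta = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
x = np.array([0, 0, 1, 0, 1, 1])
print(outfit_msq(x, theta, beta=0.5))
```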


2021 ◽  
Vol 104 (3) ◽  
pp. 003685042110283
Author(s):  
Meltem Yurtcu ◽  
Hülya Kelecioglu ◽  
Edward L Boone

Bayesian Nonparametric (BNP) modelling can be used to obtain more detailed information in test equating studies and to increase the accuracy of equating by accounting for covariates. In this study, two covariates, one continuous and one discrete, are included in the equating under the BNP model. Equated scores were obtained with this model for a single-group design with a small sample. These equated scores were compared with those from the mean and linear equating methods of classical test theory. Across the three methods, the scores equated with the BNP model produced a distribution closer to that of the target test. Classical methods can still give good results with small error in small samples, which makes such equating studies valuable; however, including covariates in classical test equating rests on assumptions that are difficult to satisfy, especially with small groups. The BNP model is more beneficial than frequentist methods in this respect. Information about booklets and covariates can be obtained from the distributions and equated scores produced by the BNP model, which makes it possible to compare sub-categories and can indicate the presence of differential item functioning (DIF). Therefore, the BNP model can be used actively in test equating studies while simultaneously providing an opportunity to examine the characteristics of individual participants. It allows test equating even with a small sample and yields values closer to the scores on the target test.
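For context, the snippet below sketches the two classical baselines the BNP model is compared against, mean and linear equating for a single-group design; the BNP model itself is not shown, and the score vectors are hypothetical.

```python
# Sketch of the two classical baselines mentioned above (mean and linear
# equating for a single-group design). Scores on form X are placed on the
# scale of form Y; the BNP model itself is not shown.
import numpy as np

def mean_equate(x, scores_x, scores_y):
    """Shift X scores so the two forms have equal means."""
    return x - np.mean(scores_x) + np.mean(scores_y)

def linear_equate(x, scores_x, scores_y):
    """Match both the mean and the standard deviation of the target form Y."""
    slope = np.std(scores_y, ddof=1) / np.std(scores_x, ddof=1)
    return slope * (x - np.mean(scores_x)) + np.mean(scores_y)

# Hypothetical small-sample example.
scores_x = np.array([12, 15, 18, 20, 22, 25])
scores_y = np.array([14, 16, 19, 23, 24, 28])
print(mean_equate(18, scores_x, scores_y), linear_equate(18, scores_x, scores_y))
```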


The purpose of this study was to examine differences in the sensitivity of three methods, IRT-Likelihood Ratio (IRT-LR), Mantel-Haenszel (MH), and Logistic Regression (LR), in detecting gender differential item functioning (DIF) on the National Mathematics Examination (Ujian Nasional: UN) for the 2014/2015 academic year in North Sumatera Province, Indonesia. A DIF item is unfair: it advantages test takers from one group and disadvantages those from another even when they have the same ability. The presence of DIF was examined for grouping by gender, with men as the reference group (R) and women as the focal group (F). The study used an experimental 3x1 design with one factor (method) and three treatments, namely the three DIF detection methods. There are 5 packages of the 2015 UN Mathematics test (codes 1107, 2207, 3307, 4407, and 5507). The 2207 package was taken as the sample data, consisting of 5000 participants (3067 women, 1933 men) responding to 40 UN items. Item selection based on classical test theory (CTT) retained 32 of the 40 UN items, while selection based on item response theory (IRT) retained 18 items. Using R 3.3.3 and IRTLRDIF 2.0, 5 items were detected as DIF by the IRT-Likelihood Ratio method (IRT-LR), 4 items by the Logistic Regression method (LR), and 3 items by the Mantel-Haenszel method (MH). Because a single DIF detection run is not sufficient to test the sensitivity of the three methods, six analysis groups were formed: (4400, 40), (4400, 32), (4400, 18), (3000, 40), (3000, 32), and (3000, 18); 40 random data sets (without repetition) were generated in each group, and DIF detection was conducted on the items in each data set. Although the data showed imperfect model fit, the three-parameter logistic model (3PL) was chosen as the most suitable model. Based on Tukey's HSD post hoc test, the IRT-LR method was more sensitive than the MH and LR methods in the (4400, 40) and (3000, 40) groups. IRT-LR was no longer more sensitive than LR in the (4400, 32) and (3000, 32) groups, but was still more sensitive than MH. In the (4400, 18) and (3000, 18) groups, IRT-LR was more sensitive than LR, but not significantly more sensitive than MH. The LR method was consistently more sensitive than the MH method across all analysis groups.
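As an illustration of one of the three methods, the sketch below computes the Mantel-Haenszel common odds ratio and the ETS delta metric for a single item, assuming examinees have already been stratified by total score; it is a generic sketch, not the analysis pipeline used in the study.

```python
# Minimal sketch of Mantel-Haenszel DIF for one item, assuming examinees have
# been stratified (e.g., by total test score). Per stratum the 2x2 table is:
# reference correct (A), reference incorrect (B), focal correct (C),
# focal incorrect (D). Generic illustration, not the study's exact analysis.
import numpy as np

def mantel_haenszel_dif(correct, group, strata):
    """Return the MH common odds ratio and the ETS delta-MH for one item.

    correct: 0/1 item responses; group: array of 'R' (reference) / 'F' (focal);
    strata: matching variable (e.g., total score) per examinee.
    """
    num, den = 0.0, 0.0
    for s in np.unique(strata):
        m = strata == s
        A = np.sum((group[m] == 'R') & (correct[m] == 1))
        B = np.sum((group[m] == 'R') & (correct[m] == 0))
        C = np.sum((group[m] == 'F') & (correct[m] == 1))
        D = np.sum((group[m] == 'F') & (correct[m] == 0))
        T = A + B + C + D
        if T > 0:
            num += A * D / T
            den += B * C / T
    alpha = num / den
    delta = -2.35 * np.log(alpha)  # ETS treats |delta| >= 1.5 as large DIF
    return alpha, delta
```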


Author(s):  
Ansgar Opitz ◽  
Moritz Heene ◽  
Frank Fischer

Abstract. A significant problem that assessments of scientific reasoning face at the level of higher education is the question of domain generality, that is, whether a test will produce biased results for students from different domains. This study applied three recently developed methods of analyzing differential item functioning (DIF) to evaluate the domain-generality assumption of a common scientific reasoning test. Additionally, we evaluated the usefulness of these new tree- and lasso-based methods for analyzing DIF and compared them with methods based on classical test theory. We gave the scientific reasoning test to 507 university students majoring in physics, biology, or medicine. All three DIF analysis methods indicated a domain bias in about one-third of the items, mostly benefiting biology students. We did not find this bias using methods based on classical test theory; those methods indicated instead that all items were easier for physics students than for biology students. Thus, the tree- and lasso-based methods provide clear added value for test evaluation. Taken together, our analyses indicate that the scientific reasoning test is neither entirely domain-general nor entirely domain-specific. We advise against using it in high-stakes situations involving domain comparisons.
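The sketch below conveys the general idea behind a lasso-based DIF screen (it is not the specific tree- or lasso-based methods or software used in the study): for each item, the response is regressed on a matching rest-score plus a domain indicator under an L1 penalty, and items whose group coefficient is not shrunk to zero are flagged.

```python
# Rough sketch of the idea behind a lasso-based DIF screen (not the specific
# methods or software used in the study): per item, regress the response on a
# matching rest-score plus a group indicator under an L1 penalty and flag items
# whose group coefficient survives the penalty.
import numpy as np
from sklearn.linear_model import LogisticRegression

def lasso_dif_flags(responses, group, penalty_C=0.5):
    """responses: (n_persons, n_items) 0/1 matrix; group: 0/1 domain indicator."""
    n_items = responses.shape[1]
    flags = []
    for j in range(n_items):
        rest_score = responses.sum(axis=1) - responses[:, j]  # matching variable
        X = np.column_stack([rest_score, group])
        model = LogisticRegression(penalty='l1', solver='liblinear', C=penalty_C)
        model.fit(X, responses[:, j])
        flags.append(abs(model.coef_[0, 1]) > 0)  # group effect not shrunk to zero
    return np.array(flags)
```

The penalty strength penalty_C is a tuning choice; in practice it would be selected by cross-validation or an information criterion rather than fixed in advance.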


2021 ◽  
Vol 12 ◽  
Author(s):  
David Alpizar ◽  
Brian F. French

The Motivational-Developmental Assessment (MDA) measures a university student’s motivational and developmental attributes by utilizing overlapping constructs measured across four writing prompts. The MDA’s format may lead to the violation of the local item independence (LII) assumption for unidimensional item response theory (IRT) scoring models, or the uncorrelated errors assumption for scoring models in classical test theory (CTT) due to the measurement of overlapping constructs within a prompt. This assumption violation is known as a testlet effect, which can be viewed as a method effect. The application of a unidimensional IRT or CTT model to score the MDA can result in imprecise parameter estimates when this effect is ignored. To control for this effect in the MDA responses, we first examined the presence of local dependence via a restricted bifactor model and Yen’s Q3 statistic. Second, we applied bifactor models to account for the testlet effect in the responses, as this effect is modeled as an additional latent variable in a factor model. Results support the presence of local dependence in two of the four MDA prompts, and the use of the restricted bifactor model to account for the testlet effect in the responses. Modeling the testlet effect through the restricted bifactor model supports a scoring inference in a validation argument framework. Implications are discussed.
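A minimal sketch of the Q3 check mentioned above is given below; it assumes that expected response probabilities for every person-item pair are available from a fitted unidimensional model, and it uses the conventional cutoff of .2 only as an illustrative default.

```python
# Minimal sketch of Yen's Q3 check. It assumes the expected probability of a
# correct/endorsed response for every person-item pair has already been
# obtained from a fitted unidimensional model.
import numpy as np

def q3_matrix(responses, expected):
    """Pairwise correlations of item residuals (observed minus expected)."""
    residuals = responses - expected             # (n_persons, n_items)
    return np.corrcoef(residuals, rowvar=False)  # (n_items, n_items)

def flag_dependent_pairs(responses, expected, cutoff=0.2):
    """Return item pairs whose Q3 exceeds a conventional cutoff (e.g., .2)."""
    q3 = q3_matrix(responses, expected)
    n = q3.shape[0]
    return [(i, j, q3[i, j]) for i in range(n) for j in range(i + 1, n)
            if q3[i, j] > cutoff]
```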


2017 ◽  
Vol 78 (4) ◽  
pp. 679-707 ◽  
Author(s):  
Stefanie A. Wind ◽  
Eli Jones

Previous research includes frequent admonitions regarding the importance of establishing connectivity in data collection designs prior to the application of Rasch models. However, details regarding the influence of characteristics of the linking sets used to establish connections among facets, such as locations on the latent variable, model–data fit, and sample size, have not been thoroughly explored. These considerations are particularly important in assessment systems that involve large proportions of missing data (i.e., sparse designs) and are associated with high-stakes decisions, such as teacher evaluations based on teaching observations. The purpose of this study is to explore the influence of characteristics of linking sets in sparsely connected rating designs on examinee, rater, and task estimates. A simulation design whose characteristics were intended to reflect practical large-scale assessment networks with sparse connections was used to consider the influence of locations on the latent variable, model–data fit, and sample size within linking sets on the stability and model–data fit of estimates. Results suggested that parameter estimates for the examinee and task facets are quite robust to modifications in the size, model–data fit, and latent-variable location of the link. Parameter estimates for the rater facet, while still quite robust, are more sensitive to reductions in link size. The implications are discussed as they relate to research, theory, and practice.
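The snippet below is a simplified data-generation sketch in the spirit of such a simulation (not the authors' design): dichotomous ratings are drawn from a many-facet Rasch-type model in which each examinee is rated by only two randomly assigned raters, so most of the examinee-by-rater matrix is missing; all sizes and parameter values are hypothetical.

```python
# Simplified data-generation sketch (not the authors' exact design):
# dichotomous ratings from a many-facet Rasch-type model with a sparse
# examinee-by-rater assignment. Connectivity here arises from random overlap
# rather than a purposely constructed linking set.
import numpy as np

rng = np.random.default_rng(1)
n_examinees, n_raters = 200, 10
theta = rng.normal(0, 1, n_examinees)      # examinee locations
severity = rng.normal(0, 0.5, n_raters)    # rater severities

ratings = np.full((n_examinees, n_raters), np.nan)
for i in range(n_examinees):
    assigned = rng.choice(n_raters, size=2, replace=False)  # sparse design
    for r in assigned:
        p = 1.0 / (1.0 + np.exp(-(theta[i] - severity[r])))
        ratings[i, r] = rng.binomial(1, p)

print(np.round(np.nanmean(ratings, axis=0), 2))  # observed positive-rating rate per rater
```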


Author(s):  
Jerhi Wahyu Fernanda ◽  
Noer Hidayah

Assessment marks the end of the learning process and can be carried out through examinations. The items used must be able to measure students' abilities. Classical Test Theory (CTT) and the Rasch model are statistical analyses for examining test items. The design of this study is descriptive research. Analysis of 50 items using the CTT method showed that only 21 items met the item difficulty and item discrimination criteria. The Rasch model analysis indicated that, overall, item quality was good based on the pattern of the item information function curves. This analysis also showed that 42 items were adequate because they met the item fit criteria, while 8 items need to be evaluated further. Analysis using the Rasch model performed better than CTT, so the 8 items found inadequate by that analysis should be revised by changing the case studies used in those items and by developing innovative teaching methods for the material covered by those 8 items.

Keywords: item analysis, Classical Test Theory, Rasch model.
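A minimal sketch of the two CTT indices referred to above, item difficulty as the proportion correct and item discrimination as the corrected item-total correlation, is shown below; the response matrix and the cutoffs mentioned in the comment are illustrative, not the study's exact criteria.

```python
# Minimal sketch of the two CTT indices: item difficulty as the proportion
# correct and item discrimination as the corrected item-total (point-biserial)
# correlation. Assumes a 0/1 response matrix.
import numpy as np

def ctt_item_analysis(responses):
    """responses: (n_students, n_items) 0/1 matrix."""
    difficulty = responses.mean(axis=0)  # proportion answering correctly
    n_items = responses.shape[1]
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest = responses.sum(axis=1) - responses[:, j]  # total score without item j
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, discrimination

# Items are often retained when, e.g., 0.3 <= difficulty <= 0.7 and
# discrimination >= 0.2; the cutoffs used in the study itself are not shown here.
```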


2020 ◽  
Author(s):  
Kazuhiro Yamaguchi ◽  
Jonathan Templin

Quantifying the reliability of latent variable estimates in diagnostic classification models has been a difficult topic, complicated by the classification-based nature of these models. In this study, we derive observed score reliability indices based on diagnostic classification models as an extension of classical test theory-based reliability. Additionally, we derive conditional observed sum- and sub-score distributions. In this manner, various conditional expectations and conditional standard error of measurement estimates can be calculated for both total- and sub-scores of a test. The proposed methods provide a variety of expectations and standard errors for attribute estimates, which we demonstrate in an analysis of an empirical test.
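One ingredient of such indices can be sketched as follows: given per-item correct probabilities conditional on one attribute profile (which would come from a fitted diagnostic classification model), the conditional sum-score distribution is obtained by recursive convolution, and a conditional expectation and conditional standard error of measurement follow directly; the probability values below are hypothetical.

```python
# Sketch of one ingredient of such indices: the observed sum-score distribution
# conditional on a latent attribute profile, obtained by recursively convolving
# item-level correct probabilities (the probabilities themselves would come
# from a fitted diagnostic classification model).
import numpy as np

def conditional_sum_score_dist(p_correct):
    """Distribution of the sum score given per-item P(correct) for one profile."""
    dist = np.array([1.0])                    # P(sum = 0) before any item
    for p in p_correct:
        dist = np.convolve(dist, [1 - p, p])  # add one Bernoulli item
    return dist                               # length = n_items + 1

# Hypothetical per-item probabilities for one attribute profile.
p = [0.9, 0.8, 0.75, 0.6, 0.55]
dist = conditional_sum_score_dist(p)
scores = np.arange(len(dist))
mean = np.sum(scores * dist)                        # conditional expected score
sem = np.sqrt(np.sum((scores - mean) ** 2 * dist))  # conditional SEM
print(mean, sem)
```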

