The Multidimensionality of Measurement Bias in High‐Stakes Testing: Using Machine Learning to Evaluate Complex Sources of Differential Item Functioning

Author(s):  
William C. M. Belzak
2020 ◽  
Vol 3 (1) ◽  
pp. 5-19
Author(s):  
Don Yao

Differential item functioning (DIF) is a technique used to examine whether items function differently across different groups. DIF analysis helps detect bias in an assessment to ensure the fairness of the assessment. However, most of the previous research has focused on high-stakes assessments. There is a dearth of research emphasizing low-stakes assessments, which are also significant for the test development and validation process. Additionally, gender differences in test performance are a particular concern for researchers evaluating whether a test is fair. The present study investigated whether test items of the General English Proficiency Test for Kids (GEPT-Kids) are free of bias in terms of gender differences. A mixed-method sequential explanatory research design was adopted with two phases. In phase I, test performance data of 492 participants from five Chinese-speaking cities were analyzed by the Mantel-Haenszel (MH) method to detect gender DIF. In phase II, items that manifested DIF were subjected to content analysis by three experienced reviewers to identify possible sources of DIF. The results showed that three items were detected with moderate gender DIF through statistical methods and three items were identified as possibly biased by expert judgment. The results provide preliminary contributions to DIF analysis for low-stakes assessment in the field of language assessment. In addition, young language learners, especially in the Chinese context, have drawn renewed attention. Thus, the results may also add to the body of literature that can shed some light on test development for young language learners.
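As a rough illustration of the Mantel-Haenszel approach named in the abstract above, the following sketch computes the MH common odds ratio for one item across matched total-score strata and converts it to the ETS delta scale. It is not the study's own code: the function names, the input layout (one 2x2 table per score level), and the magnitude-only classification are assumptions made for illustration. The full ETS rules also involve significance tests, which are omitted here.

```python
import math


def mantel_haenszel_dif(strata):
    """Return (alpha_MH, delta_MH) for one item.

    strata: iterable of (a, b, c, d) counts, one tuple per matched score level
    (hypothetical layout):
      a = reference group correct, b = reference group incorrect,
      c = focal group correct,     d = focal group incorrect.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        raise ValueError("odds ratio undefined for these counts")
    alpha = num / den                # MH common odds ratio
    delta = -2.35 * math.log(alpha)  # ETS delta metric (MH D-DIF)
    return alpha, delta


def ets_category(delta):
    """Magnitude-only ETS-style label (simplified: no significance test)."""
    size = abs(delta)
    if size < 1.0:
        return "A"  # negligible
    if size < 1.5:
        return "B"  # moderate
    return "C"      # large
```

A call such as `mantel_haenszel_dif([(30, 10, 20, 20)])` returns the odds ratio and delta for a single stratum. In practice one tuple would be supplied per total-score level, with the two gender groups as reference and focal groups.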


Author(s):  
Julie Levacher ◽  
Marco Koch ◽  
Johanna Hissbach ◽  
Frank M. Spinath ◽  
Nicolas Becker

Abstract. Due to their high item difficulties and excellent psychometric properties, construction-based figural matrices tasks are of particular interest when it comes to high-stakes testing. An important prerequisite is that test preparation – which is likely to occur in this context – does not impair test fairness or item properties. The goal of this study was to provide initial evidence concerning the influence of test preparation. We administered test items to a sample of N = 882 participants divided into two groups, but only one group was given information about the rules employed in the test items. The probability of solving the items was significantly higher in the test preparation group than in the control group (M = 0.61, SD = 0.19 vs. M = 0.41, SD = 0.25; t(54) = 3.42, p = .001; d = .92). Nevertheless, a multigroup confirmatory factor analysis, as well as a differential item functioning analysis, indicated no differences between the item properties in the two groups. The results suggest that construction-based figural matrices are suitable in the context of high-stakes testing when all participants are provided with test preparation material so that test fairness is ensured.


2021 ◽  
Vol 6 ◽  
Author(s):  
Francisca Calderón ◽  
Jorge González

School Climate is an essential aspect of every school community. It relates to perceptions of the school environment experienced by various members of the educational system. Research has shown that an appropriate school climate affects not only the quality of life of all members of the educational system, but also learning outcomes and educational improvement. This study aims to explore a measure of School Climate among Chilean students. A sample of 176,126 10th grade students was used to investigate the factor structure of the items composing the School Climate construct, and to evaluate the potential presence of Differential Item Functioning between male and female groups. Both exploratory and confirmatory factor analysis as well as Rasch models were used to analyze the scale. Differential item functioning between male and female groups was investigated using the Langer-improved Wald test. The results indicated a multidimensional structure of the School Climate construct and that measurement bias between male and female groups exists in some of the items measuring the construct.


Diagnostica ◽  
2021 ◽  
Vol 67 (1) ◽  
pp. 13-23
Author(s):  
Ariana Garrote ◽  
Elisabeth Moser Opitz

Abstract. In this study, the MARKO-D test (Mathematik- und Rechenkonzepte im Vorschulalter–Diagnose) was trialed with a sample of children from German-speaking Switzerland (N = 555) in their first and second year of kindergarten, and it was analyzed whether the age norms of the German sample can be transferred to Switzerland. In addition, the test was examined for measurement invariance over time with a subsample (n = 87). The results of the unidimensional Rasch model show that the instrument is suitable for Switzerland. However, test performance depends on kindergarten attendance. For Switzerland, norms per kindergarten half-year would therefore have to be used in addition to age norms. The differential item functioning analysis showed that 17 of 55 items are affected by large measurement non-invariance over time. To use the instrument in longitudinal studies, it would need to be developed further.
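For orientation, the unidimensional Rasch model referred to in this abstract is usually written as below. This is the standard textbook form, not a formula quoted from the study. The symbols are assumptions introduced here: $\theta_p$ for the ability of child $p$, $b_i$ for the difficulty of item $i$, and $b_i^{(t_1)}$, $b_i^{(t_2)}$ for item difficulties at two measurement occasions.

```latex
P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}
% Measurement invariance over time holds for item i when its difficulty is
% the same at both occasions, b_i^{(t_1)} = b_i^{(t_2)}; an item shows DIF
% over time when these two difficulties differ.
```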

