Checking Equity: Why Differential Item Functioning Analysis Should Be a Routine Part of Developing Conceptual Assessments

We provide a tutorial on differential item functioning (DIF) analysis, an analytic method useful for identifying potentially biased items in assessments. After explaining a number of methodological approaches, we test for gender bias in two scenarios that demonstrate why DIF analysis is crucial for developing assessments, particularly because simply comparing two groups’ total scores can lead to incorrect conclusions about test fairness. First, a significant difference between groups on total scores can exist even when items are not biased, as we illustrate with data collected during the validation of the Homeostasis Concept Inventory. Second, item bias can exist even when the two groups have exactly the same distribution of total scores, as we illustrate with a simulated data set. We also present a brief overview of how DIF analysis has been used in the biology education literature to illustrate the way DIF items need to be reevaluated by content experts to determine whether they should be revised or removed from the assessment. Finally, we conclude by arguing that DIF analysis should be used routinely to evaluate items in developing conceptual assessments. These steps will ensure more equitable—and therefore more valid—scores from conceptual assessments.

Download Full-text

Differential Item Functioning in Brief Instruments of Disordered Eating

European Journal of Psychological Assessment ◽

10.1027/1015-5759/a000472 ◽

2019 ◽

Vol 35 (6) ◽

pp. 823-833 ◽

Cited By ~ 4

Author(s):

Desiree Thielemann ◽

Felicitas Richter ◽

Bernd Strauss ◽

Elmar Braehler ◽

Uwe Altmann ◽

...

Keyword(s):

Differential Item Functioning ◽

Disordered Eating ◽

Structural Equation ◽

Young Female ◽

Eating Attitudes ◽

Equation Model ◽

German Population ◽

Test Fairness ◽

Item Functioning ◽

Multiple Indicator

Abstract. Most instruments for the assessment of disordered eating were developed and validated in young female samples. However, they are often used in heterogeneous general population samples. Therefore, brief instruments of disordered eating should assess the severity of disordered eating equally well between individuals with different gender, age, body mass index (BMI), and socioeconomic status (SES). Differential item functioning (DIF) of two brief instruments of disordered eating (SCOFF, Eating Attitudes Test [EAT-8]) was modeled in a representative sample of the German population ( N = 2,527) using a multigroup item response theory (IRT) and a multiple-indicator multiple-cause (MIMIC) structural equation model (SEM) approach. No DIF by age was found in both questionnaires. Three items of the EAT-8 showed DIF across gender, indicating that females are more likely to agree than males, given the same severity of disordered eating. One item of the EAT-8 revealed slight DIF by BMI. DIF with respect to the SCOFF seemed to be negligible. Both questionnaires are equally fair across people with different age and SES. The DIF by gender that we found with respect to the EAT-8 as screening instrument may be also reflected in the use of different cutoff values for men and women. In general, both brief instruments assessing disordered eating revealed their strengths and limitations concerning test fairness for different groups.

Download Full-text

Sensitivity Of Differential Item Functioning Detection Methods On National Mathematics Examination In North Sumatera Province, Indonesia

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e1226.0585c19 ◽

2019 ◽

Vol 8 (5C) ◽

pp. 1538-1549

Keyword(s):

Differential Item Functioning ◽

Likelihood Ratio ◽

Classical Test Theory ◽

Test Theory ◽

Detection Methods ◽

Ratio Method ◽

Suitable Model ◽

Data Set ◽

Item Functioning ◽

Mathematics Examination

The purpose of this study was to examine the differences in sensitivity of three methods: IRT-Likelihood Ratio (IRT-LR), Mantel-Haenszel (MH) and Logistics Regression (LR), in detecting gender differential item functioning (DIF) on National Mathematics Examination (Ujian Nasional: UN) for 2014/2015 academic year in North Sumatera Province of Indonesia. DIF item shows the unfairness. It advantages the test takers of certain groups and disadvantages other group test takers, in the case they have the same ability. The presence of DIF was reviewed in grouping by gender: men as reference groups (R) and women as focus groups (F). This study used the experimental method, 3x1 design, with one factor (i.e. method) with three treatments, in the form of 3 different DIF detection methods. There are 5 types of UN Mathematics Year 2015 packages (codes: 1107, 2207, 3307, 4407 and 5507). The 2207 package code was taken as the sample data, consisting of 5000 participants (3067 women, 1933 men; for 40 UN items). Item selection was carried out based on the classical test theory (CTT) on 40 UN items, producing 32 items that fulfilled, and item response theory selection (IRT) produced 18 items that fulfilled. With program R 3.333 and IRTLRDIF 2.0, it was found 5 items were detected as DIF by the IRT-Likelihood Ratio-method (IRTLR), 4 items were detected as DIF by the Logistic Regression method (LR), and 3 items were detected as DIF by the MantelHaenszel method (MH). To test the sensitivity of the three methods, it is not enough with just one time DIF detection, but formed six groups of data analysis: (4400,40),(4400,32), (4400,18), (3000,40), (3000,32), (3000,18), and generate 40 random data sets (without repetitions) in each group, and conduct detecting DIF on the items in each data set. Although the data lacks model fit, the 3 parameter logistic model (3PL) is chosen as the most suitable model. With the Tukey's HSD post hoc test, the IRT-LR method is known to be more sensitive than the MH and LR methods in the group (4400,40) and (3000,40). The IRT-LR method is not longer more sensitive than LR in the group (4400,32) and (3000,32), but still more sensitive than MH. In the groups (4400,18) and (3000,18) the IRT-LR method is more sensitive than LR, but not significantly more sensitive than MH. The LR method is consistently tested to be more sensitive than the MH method in the entire analysis groups.

Download Full-text

A Bifactor Multidimensional Item Response Theory Model for Differential Item Functioning Analysis on Testlet-Based Items

Applied Psychological Measurement ◽

10.1177/0146621611428447 ◽

2011 ◽

Vol 35 (8) ◽

pp. 604-622 ◽

Cited By ~ 16

Author(s):

Hirotaka Fukuhara ◽

Akihito Kamata

Keyword(s):

Item Response Theory ◽

Differential Item Functioning ◽

Item Response ◽

Estimation Method ◽

Multidimensional Item Response Theory ◽

Multidimensional Item Response ◽

Response Theory ◽

Data Set ◽

Detection Rates ◽

Item Functioning

A differential item functioning (DIF) detection method for testlet-based data was proposed and evaluated in this study. The proposed DIF model is an extension of a bifactor multidimensional item response theory (MIRT) model for testlets. Unlike traditional item response theory (IRT) DIF models, the proposed model takes testlet effects into account, thus estimating DIF magnitude appropriately when a test is composed of testlets. A fully Bayesian estimation method was adopted for parameter estimation. The recovery of parameters was evaluated for the proposed DIF model. Simulation results revealed that the proposed bifactor MIRT DIF model produced better estimates of DIF magnitude and higher DIF detection rates than the traditional IRT DIF model for all simulation conditions. A real data analysis was also conducted by applying the proposed DIF model to a statewide reading assessment data set.

Download Full-text

A Comparative Study on using Contingency Table Approaches in Differential Item Functioning Analysis

10.9734/bpi/mplle/v8/3998f ◽

2021 ◽

pp. 125-142

Author(s):

Jose Quito Pedrajita

Keyword(s):

Comparative Study ◽

Differential Item Functioning ◽

Contingency Table ◽

Item Functioning ◽

Differential Item Functioning Analysis

Download Full-text

Differential Item Functioning Analysis for Repeated Measures Design Social Survey Data: A Method for Detecting Social Demands Difference in Big-Data Era

2018 5th International Conference on Behavioral, Economic, and Socio-Cultural Computing (BESC) ◽

10.1109/besc.2018.8697289 ◽

2018 ◽

Author(s):

Tian Lan ◽

Zhongxuan Lin ◽

Tour Liu

Keyword(s):

Big Data ◽

Differential Item Functioning ◽

Survey Data ◽

Repeated Measures ◽

Social Survey ◽

Repeated Measures Design ◽

Item Functioning ◽

Differential Item Functioning Analysis

Download Full-text

Differential Item Functioning Analysis of the 2003–04 NHANES Physical Activity Questionnaire

Research Quarterly for Exercise and Sport ◽

10.1080/02701367.2011.10599770 ◽

2011 ◽

Vol 82 (3) ◽

pp. 381-390 ◽

Cited By ~ 3

Author(s):

Yong Gao ◽

Weimo Zhu

Keyword(s):

Physical Activity ◽

Differential Item Functioning ◽

Physical Activity Questionnaire ◽

Item Functioning ◽

Differential Item Functioning Analysis ◽

Activity Questionnaire

Download Full-text

Differential Item Functioning Analysis for Accommodated Versus Nonaccommodated Students

Educational Assessment ◽

10.1080/10627190902816264 ◽

2009 ◽

Vol 14 (1) ◽

pp. 38-56 ◽

Cited By ~ 15

Author(s):

Holmes Finch ◽

Karen Barton ◽

Patrick Meyer

Keyword(s):

Differential Item Functioning ◽

Item Functioning ◽

Differential Item Functioning Analysis

Download Full-text

Differential item functioning analysis of the Vanderbilt Expertise Test for cars

Journal of Vision ◽

10.1167/15.13.23 ◽

2015 ◽

Vol 15 (13) ◽

pp. 23 ◽

Cited By ~ 2

Author(s):

Woo-Yeol Lee ◽

Sun-Joo Cho ◽

Rankin W. McGugin ◽

Ana Beth Van Gulick ◽

Isabel Gauthier

Keyword(s):

Differential Item Functioning ◽

Item Functioning ◽

Differential Item Functioning Analysis

Download Full-text

Equivalence and Bias in the South African Substance Use Contextual Risk Instrument

Psychological Reports ◽

10.1177/0033294116685865 ◽

2017 ◽

Vol 120 (1) ◽

pp. 158-178

Author(s):

Ishreen Rawoot ◽

Maria Ann Florence

Keyword(s):

Substance Use ◽

Differential Item Functioning ◽

South African ◽

Economic Status ◽

Scale Level ◽

Contextual Risk ◽

Item Functioning ◽

Differential Item Functioning Analysis ◽

Preventative Interventions ◽

Phi Coefficient

This article forms part of a larger study that sought to develop and validate a scale to measure individual and contextual factors associated with adolescent substance use in low-socio-economic status South African communities. The scale was developed to inform the process of designing preventative interventions in these communities. This study assessed the construct equivalence and item bias across different language versions of the scale. Exploratory factor analysis, equality of reliabilities, and the Tucker’s phi coefficient of congruence were employed to assess whether the two language versions were equivalent at a scale level. Differential item functioning analysis was conducted using ordinal logistic regression and the Mantel-Haenszel method at an item level. The findings revealed that there are significant differences between the two groups at a scale level. Items were flagged as presenting with moderate to large differential item functioning. The biased items have to be closely examined in order to decide how to address the bias.

Download Full-text

deltaPlotR: AnRPackage for Differential Item Functioning Analysis with Angoff's Delta Plot

Journal of Statistical Software ◽

10.18637/jss.v059.c01 ◽

2014 ◽

Vol 59 (Code Snippet 1) ◽

Cited By ~ 1

Author(s):

David Magis ◽

Bruno Facon

Keyword(s):

Differential Item Functioning ◽

Item Functioning ◽

Differential Item Functioning Analysis

Download Full-text