Examining the Impact of Differential Item Functioning on Classification Accuracy in Cognitive Diagnostic Models

Cognitive diagnostic models (CDMs) are of growing interest in educational research because of their ability to provide diagnostic information about examinees’ strengths and weaknesses across a variety of content areas. An important step in ensuring appropriate uses and interpretations of CDMs is to understand the impact of differential item functioning (DIF). While methods of detecting DIF in CDMs have been identified, there is limited understanding of the extent to which DIF affects classification accuracy. This simulation study gives practitioners a reference for understanding how different magnitudes and types of DIF interact with CDM item types, group distributions, and sample sizes to influence attribute- and profile-level classification accuracy. The results suggest that attribute-level classification accuracy is robust to DIF of large magnitudes in most conditions, while profile-level classification accuracy is negatively influenced by the inclusion of DIF. Conditions of unequal group distributions and DIF located on simple structure items had the greatest effect in decreasing classification accuracy. The article closes by considering implications of the results and future directions.
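The distinction between attribute-level and profile-level classification accuracy can be illustrated with a minimal sketch. It assumes that true and estimated mastery profiles are available as 0/1 vectors, which is the case in a simulation. The function name and the toy data are hypothetical and are not taken from the study.

```python
def classification_accuracy(true_profiles, estimated_profiles):
    """Return (per-attribute accuracies, profile-level accuracy).

    Each profile is a sequence of 0/1 mastery indicators, one per attribute.
    Hypothetical helper for illustration only.
    """
    n_examinees = len(true_profiles)
    n_attributes = len(true_profiles[0])
    attribute_hits = [0] * n_attributes
    profile_hits = 0
    for true, est in zip(true_profiles, estimated_profiles):
        # Attribute level: each attribute is scored on its own.
        for k in range(n_attributes):
            if true[k] == est[k]:
                attribute_hits[k] += 1
        # Profile level: every attribute must match at once.
        if tuple(true) == tuple(est):
            profile_hits += 1
    attribute_acc = [hits / n_examinees for hits in attribute_hits]
    return attribute_acc, profile_hits / n_examinees


# Toy data (made up): four examinees, three attributes.
true_profiles = [(1, 0, 1), (0, 0, 1), (1, 1, 1), (0, 1, 0)]
estimated_profiles = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (0, 1, 0)]

attribute_acc, profile_acc = classification_accuracy(true_profiles, estimated_profiles)
print(attribute_acc, profile_acc)
```

Because a profile counts as correct only when all of its attributes are correct, profile-level accuracy can never exceed the lowest attribute-level accuracy, which is consistent with profile-level results being more sensitive to DIF.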


A Comparison of Differential Item Functioning Detection Methods in Cognitive Diagnostic Models

Frontiers in Psychology ◽ 10.3389/fpsyg.2019.01137 ◽ 2019 ◽ Vol 10

Author(s): Yanlou Liu ◽ Hao Yin ◽ Tao Xin ◽ Laicheng Shao ◽ Lu Yuan

Keyword(s): Differential Item Functioning ◽ Detection Methods ◽ Item Functioning ◽ Diagnostic Models ◽ Cognitive Diagnostic Models


Factors Affecting Differential Item Functioning Within the Framework of Cognitive Diagnostic Models: A Comparison of Three Methods

PsycEXTRA Dataset ◽ 10.1037/e589302013-001 ◽ 2013

Author(s): Su-Pin Hung ◽ Hung-Yu Huang

Keyword(s): Differential Item Functioning ◽ Factors Affecting ◽ Item Functioning ◽ Diagnostic Models ◽ Cognitive Diagnostic Models


Differential item functioning of the Boston Naming Test in cognitively normal African American and Caucasian older adults

Journal of the International Neuropsychological Society ◽ 10.1017/s1355617709990361 ◽ 2009 ◽ Vol 15 (5) ◽ pp. 758-768 ◽ Cited By ~ 33

Author(s): Otto Pedraza ◽ Neill R. Graff-Radford ◽ Glenn E. Smith ◽ Robert J. Ivnik ◽ Floyd B. Willis ◽ ...

Keyword(s): African American ◽ Item Response Theory ◽ Differential Item Functioning ◽ Item Response ◽ Group Performance ◽ Response Theory ◽ Boston Naming Test ◽ Item Functioning ◽ The Impact ◽ Naming Test

Abstract: Scores on the Boston Naming Test (BNT) are frequently lower for African American adults than for Caucasian adults. Although demographically based norms can mitigate the impact of this discrepancy on the likelihood of erroneous diagnostic impressions, a growing consensus suggests that group norms do not sufficiently address or advance our understanding of the underlying psychometric and sociocultural factors that lead to between-group score discrepancies. Using item response theory and methods to detect differential item functioning (DIF), the current investigation moves beyond comparisons of the summed total score to examine whether the conditional probability of responding correctly to individual BNT items differs between African American and Caucasian adults. Participants included 670 adults aged 52 and older who took part in Mayo’s Older Americans and Older African Americans Normative Studies. Under a two-parameter logistic item response theory framework and after correction for the false discovery rate, 12 items were shown to demonstrate DIF. Of these 12 items, 6 (“dominoes,” “escalator,” “muzzle,” “latch,” “tripod,” and “palette”) were also identified in additional analyses using hierarchical logistic regression models and represent the strongest evidence for race/ethnicity-based DIF. These findings afford a finer characterization of the psychometric properties of the BNT and expand our understanding of between-group performance. (JINS, 2009, 15, 758–768.)
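To make the idea of a group difference in the conditional probability of a correct response concrete, the following minimal sketch evaluates a two-parameter logistic (2PL) item response function for two groups at the same ability level. The function name, the parameter values, and the choice to represent DIF as a shift in item difficulty only (uniform DIF) are illustrative assumptions, not values or procedures from the study.

```python
import math


def p_correct_2pl(theta, a, b):
    """2PL item response function: P(correct | theta).

    theta: examinee ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters. The item is assumed to be harder for the
# focal group (b = 0.5) than for the reference group (b = 0.0), with the
# same discrimination in both groups.
theta = 0.0
p_reference = p_correct_2pl(theta, a=1.2, b=0.0)
p_focal = p_correct_2pl(theta, a=1.2, b=0.5)

# A nonzero gap at equal ability is what a DIF analysis looks for.
dif_gap = p_reference - p_focal
print(round(p_reference, 3), round(p_focal, 3), round(dif_gap, 3))
```

Comparing probabilities at a fixed ability level, rather than comparing total scores, is what separates DIF from an overall difference in group performance.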


The impact of English language learner status on screening for emotional and behavioral disorders: A differential item functioning (DIF) study

Psychology in the Schools ◽ 10.1002/pits.22103 ◽ 2018 ◽ Vol 55 (3) ◽ pp. 229-239 ◽ Cited By ~ 2

Author(s): Matthew C. Lambert ◽ Allen G. Garcia ◽ Stacy-Ann A. January ◽ Michael H. Epstein

Keyword(s): Differential Item Functioning ◽ Behavioral Disorders ◽ English Language Learner ◽ English Language ◽ Emotional And Behavioral Disorders ◽ Language Learner ◽ Item Functioning ◽ The Impact


Geriatric Anxiety Scale: item response theory analysis, differential item functioning, and creation of a ten-item short form (GAS-10)

International Psychogeriatrics ◽ 10.1017/s1041610214000210 ◽ 2014 ◽ Vol 27 (7) ◽ pp. 1099-1111 ◽ Cited By ~ 22

Author(s): Anne E. Mueller ◽ Daniel L. Segal ◽ Brandon Gavett ◽ Meghan A. Marty ◽ Brian Yochim ◽ ...

Keyword(s): Older Adults ◽ Item Response Theory ◽ Psychometric Properties ◽ Differential Item Functioning ◽ Item Response ◽ Short Form ◽ Anxiety Scale ◽ Response Theory ◽ Item Functioning ◽ The Impact

Abstract. Background: The Geriatric Anxiety Scale (GAS; Segal, D. L., June, A., Payne, M., Coolidge, F. L. and Yochim, B., 2010, Journal of Anxiety Disorders, 24, 709–714, doi:10.1016/j.janxdis.2010.05.002) is a self-report measure of anxiety that was designed to address unique issues associated with anxiety assessment in older adults. This study is the first to use item response theory (IRT) to examine the psychometric properties of a measure of anxiety in older adults. Method: A large sample of older adults (n = 581; mean age = 72.32 years, SD = 7.64 years, range = 60 to 96 years; 64% women; 88% European American) completed the GAS. IRT properties were examined. The presence of differential item functioning (DIF) or measurement bias by age and sex was assessed, and a ten-item short form of the GAS (called the GAS-10) was created. Results: All GAS items had discrimination parameters of 1.07 or greater. Items from the somatic subscale tended to have lower discrimination parameters than items on the cognitive or affective subscales. Two items were flagged for DIF, but the impact of the DIF was negligible. Women scored significantly higher than men on the GAS and its subscales. Participants in the young-old group (60 to 79 years old) scored significantly higher on the cognitive subscale than participants in the old-old group (80 years old and older). Conclusions: Results from the IRT analyses indicated that the GAS and GAS-10 have strong psychometric properties among older adults. We conclude by discussing implications and future research directions.


A Simulation Study to Assess the Effect of the Number of Response Categories on the Power of Ordinal Logistic Regression for Differential Item Functioning Analysis in Rating Scales

Computational and Mathematical Methods in Medicine ◽ 10.1155/2016/5080826 ◽ 2016 ◽ Vol 2016 ◽ pp. 1-8 ◽ Cited By ~ 3

Author(s): Elahe Allahyari ◽ Peyman Jafari ◽ Zahra Bagheri

Keyword(s): Logistic Regression ◽ Sample Size ◽ Differential Item Functioning ◽ Rating Scales ◽ Error Rates ◽ Ordinal Logistic Regression ◽ Type I ◽ Item Functioning ◽ Quality Of Life Scale ◽ The Impact

Objective. The present study uses simulated data to find the optimal number of response categories for achieving adequate power in the ordinal logistic regression (OLR) model for differential item functioning (DIF) analysis in psychometric research. Methods. A hypothetical ten-item quality of life scale with three, four, and five response categories was simulated. The power and type I error rates of the OLR model for detecting uniform DIF were investigated under different combinations of ability distribution (θ), sample size, sample size ratio, and the magnitude of uniform DIF across reference and focal groups. Results. When θ was distributed identically in the reference and focal groups, increasing the number of response categories from 3 to 5 resulted in an increase of approximately 8% in the power of the OLR model for detecting uniform DIF. The power of OLR was less than 0.36 when the ability distribution in the reference and focal groups was highly skewed to the left and right, respectively. Conclusions. The clearest conclusion from this research is that the minimum number of response categories for DIF analysis using OLR is five. However, the impact of the number of response categories in detecting DIF was lower than might be expected.


Assessing the impact of uniform and nonuniform differential item functioning items on Rasch measure: the polytomous case

Computational Statistics ◽ 10.1007/s00180-014-0542-x ◽ 2014 ◽ Vol 30 (2) ◽ pp. 441-461 ◽ Cited By ~ 3

Author(s): Silvia Golia

Keyword(s): Differential Item Functioning ◽ Item Functioning ◽ The Impact


Measurement Disparities in Frailty Among Kidney Transplant Patients: Impact of Differential Item Functioning

Innovation in Aging ◽ 10.1093/geroni/igab046.1728 ◽ 2021 ◽ Vol 5 (Supplement_1) ◽ pp. 447-447

Author(s): Nadia Chu ◽ Alden Gross ◽ Xiaomeng Chen ◽ Qian-Li Xue ◽ Karen Bandeen-Roche ◽ ...

Keyword(s): Weight Loss ◽ Differential Item Functioning ◽ Kidney Transplant ◽ Frailty Phenotype ◽ Multiple Indicators ◽ Transplant Patients ◽ Black Candidates ◽ Item Functioning ◽ The Impact ◽ Low Activity

Abstract: Frailty is commonly measured for clinical risk stratification during transplant evaluation and is more prevalent among older, non-White kidney transplant (KT) patients. However, group differences may be partially attributable to misclassification resulting from measurement bias (differential item functioning/DIF). We examined the extent to which DIF affects estimates of age, sex, and race differences in frailty (physical frailty phenotype/PFP) prevalence among 4,300 candidates and 1,396 recipients. We used Multiple Indicators Multiple Causes with dichotomous indicators to assess uniform DIF in PFP criteria attributable to age (≥65 vs. 18-64 years), sex, and race (Black vs. White). Among candidates (mean age = 55 years), 41% were female, 46% were Black, and 19% were frail. After controlling for mean frailty level, females were more likely to endorse exhaustion (OR = 1.20, p = 0.003), but less likely to endorse low activity (OR = 0.83, p = 0.01). Younger candidates were more likely to endorse weight loss (OR = 1.30, p = 0.005), exhaustion (OR = 1.60, p < 0.001), and low activity (OR = 1.80, p < 0.001). Black candidates were more likely to endorse exhaustion (OR = 1.25, p < 0.001), but less likely to endorse weakness (OR = 0.79, p < 0.001). Among recipients (mean age = 54 years), 40% were female, 39% were Black, and 15% were frail. Younger recipients were more likely to endorse weight loss (OR = 1.55, p = 0.005) and low activity (OR = 1.61, p = 0.02); however, no DIF was detected by sex or race. Results highlight the impact of DIF for specific PFP measures by age, sex, and race among candidates, but only by age for recipients. Further research is needed to ascertain whether candidate- and/or recipient-specific thresholds to correct for DIF could improve risk prediction and equitable access to KT for older, female, and Black candidates.


Cognitive Diagnostic Models With Attribute Hierarchies: Model Estimation With a Restricted Q-Matrix Design

Applied Psychological Measurement ◽ 10.1177/0146621618765721 ◽ 2018 ◽ Vol 43 (4) ◽ pp. 255-271 ◽ Cited By ~ 4

Author(s): Dongbo Tu ◽ Shiyu Wang ◽ Yan Cai ◽ Jeff Douglas ◽ Hua-Hua Chang

Keyword(s): Classification Accuracy ◽ Latent Class ◽ Estimation Procedure ◽ Syllogistic Reasoning ◽ Estimation Methods ◽ Diagnostic Models ◽ Cognitive Diagnostic Models ◽ Attribute Hierarchy ◽ Q Matrix ◽ Matrix Design

Attribute hierarchy is a common assumption in the educational context, where the mastery of one attribute is assumed to be a prerequisite to the mastery of another. The attribute hierarchy can be incorporated through a restricted Q-matrix that implies the specified structure. Latent class–based cognitive diagnostic models (CDMs) usually do not assume a hierarchical structure among attributes, which means all attribute profiles are possible in a population of interest. This study investigates how different estimation methods affect classification accuracy for a family of CDMs when they are combined with a restricted Q-matrix design. A simulation study is used to explain the misclassification caused by an unrestricted estimation procedure. The advantages of the restricted estimation procedure, which uses attribute hierarchies to increase classification accuracy, are further illustrated through a real data analysis of a syllogistic reasoning diagnostic assessment. This research can provide guidelines for educational and psychological researchers and practitioners who use CDMs to analyze data with a restricted Q-matrix design, and can make them aware that classification results may be contaminated if attribute hierarchies are ignored.
