Detection of Differential Item Functioning Using Mantel-Haenszel, Standardization Proportion and BILOG-MG Procedures

Differential item functioning (DIF) analysis identifies whether an item favours a particular group of respondents once the groups are matched on ability. Numerous procedures for detecting DIF are reported in the literature; the Mantel-Haenszel (MH), Standardized Proportion Difference (SPD), and BILOG-MG procedures are frequently used to ensure the fairness of assessments. The aim of the present study was to compare the characteristics of these procedures using empirical data. We found that MH and SPD provide comparable results, whereas BILOG-MG flagged a large number of items whose DIF magnitude was trivial from a test development perspective. The results also showed that the MH and SPD indices provide an effect size measure of DIF, which helps item writers and practitioners decide on further action.
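
As a rough illustration of the two effect size indices named above, the following sketch computes the Mantel-Haenszel common odds ratio, its rescaling to the ETS delta metric, and the standardized proportion difference from counts tabulated per matched score level. It uses the textbook formulas rather than the study's own code, and the function names and the input layout (one tuple of counts per score level) are assumptions made for this example.

```python
import math

# Each stratum is one matched score level, given as
# (ref_correct, ref_wrong, focal_correct, focal_wrong).
# This layout is an assumption made for the example.

def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio pooled over score levels (reference vs. focal)."""
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty score level carries no information
        num += a * d / n
        den += b * c / n
    return num / den

def mh_d_dif(odds_ratio):
    """Rescale the odds ratio to the ETS delta metric (MH D-DIF)."""
    return -2.35 * math.log(odds_ratio)

def standardized_p_difference(strata):
    """Focal-weighted mean of (focal - reference) proportions correct."""
    weighted_diff = 0.0
    total_weight = 0
    for a, b, c, d in strata:
        n_ref, n_foc = a + b, c + d
        if n_ref == 0 or n_foc == 0:
            continue  # both groups are needed to compare proportions
        weighted_diff += n_foc * (c / n_foc - a / n_ref)
        total_weight += n_foc
    return weighted_diff / total_weight

# Two score levels with made-up counts.
example = [(40, 10, 30, 20), (30, 20, 20, 30)]
alpha = mantel_haenszel_odds_ratio(example)
print(alpha, mh_d_dif(alpha), standardized_p_difference(example))
```

An odds ratio above 1 (a negative MH D-DIF) and a negative standardized difference both indicate that the item favours the reference group at matched ability.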


A new stopping criterion for Rasch trees based on the Mantel-Haenszel effect size measure for differential item functioning

10.31234/osf.io/2jph8 ◽

2021 ◽

Author(s):

Mirka Henninger ◽

Rudolf Debelak ◽

Carolin Strobl

Keyword(s):

Sample Size ◽

Differential Item Functioning ◽

Effect Size ◽

Classification Scheme ◽

Statistical Significance ◽

Data Driven ◽

Stopping Criterion ◽

Item Functioning ◽

Item Parameters ◽

Size Measure

To detect differential item functioning (DIF), Rasch trees search for optimal split points in covariates and identify subgroups of respondents in a data-driven way. To determine whether and in which covariate a split should be performed, Rasch trees use statistical significance tests. Consequently, Rasch trees are more likely to label small DIF effects as significant in larger samples. This leads to larger trees, which split the sample into more subgroups. An approach driven by effect size rather than sample size would be more desirable. To achieve this, we suggest implementing an additional stopping criterion: the popular ETS classification scheme based on the Mantel-Haenszel odds ratio. This criterion helps us to evaluate whether a split in a Rasch tree is based on a substantial or an ignorable difference in item parameters, and it allows the Rasch tree to stop growing when DIF between the identified subgroups is small. Furthermore, it supports identifying DIF items and quantifying DIF effect sizes in each split. Based on simulation results, we conclude that the Mantel-Haenszel effect size further reduces unnecessary splits in Rasch trees under the null hypothesis, or when the sample size is large but DIF effects are negligible. To make the stopping criterion easy to use for applied researchers, we have implemented the procedure in the statistical software R. Finally, we discuss how DIF effects between different nodes in a Rasch tree can be interpreted and emphasize how much purification strategies for the Mantel-Haenszel procedure matter for tree stopping and DIF item classification.
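
The idea of an effect-size-based stopping rule can be sketched as follows. This is a simplified illustration, not the authors' R implementation: it classifies items by the magnitude of the Mantel-Haenszel delta alone (below 1 is A, from 1 to below 1.5 is B, 1.5 or more is C), whereas the full ETS scheme also involves significance tests. The function names and the rule "stop when every item is in category A" are assumptions made for this example.

```python
import math

def ets_category(mh_odds_ratio):
    """Magnitude-only ETS label for one item (significance tests omitted)."""
    delta = -2.35 * math.log(mh_odds_ratio)  # MH D-DIF on the delta scale
    size = abs(delta)
    if size < 1.0:
        return "A"  # negligible DIF
    if size < 1.5:
        return "B"  # moderate DIF
    return "C"      # large DIF

def should_stop_splitting(item_odds_ratios):
    """Hypothetical rule: do not split if all items show negligible DIF."""
    return all(ets_category(alpha) == "A" for alpha in item_odds_ratios)

# Odds ratios between two candidate subgroups, one value per item (made up).
print(should_stop_splitting([1.0, 1.1, 0.95]))  # all items negligible
print(should_stop_splitting([1.0, 2.0, 0.95]))  # one item with large DIF
```

With a rule of this kind, a split that is statistically significant only because the sample is large is rejected when no item exceeds the negligible category.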


A non-parametric effect size measure capturing changes in central tendency and shape of data distributions more flexibly than Cohen’s d

10.21203/rs.2.21070/v1 ◽

2020 ◽

Author(s):

Jörn Lötsch ◽

Alfred Ultsch

Keyword(s):

Effect Size ◽

Data Science ◽

Heterogeneous Data ◽

Machine Learning Algorithms ◽

Central Tendency ◽

Effect Size Measure ◽

Specific Data ◽

Size Measure ◽

Cohen's D

Calculating the magnitude of treatment effects or of differences between two groups is a common task in quantitative science. Standard effect size measures based on differences, such as the commonly used Cohen's d, fail to capture treatment-related effects on the data if those effects are not reflected in the central tendency. “Impact” is a novel nonparametric measure of effect size obtained as the sum of two separate components: (i) the change in the central tendency of the group-specific data, normalized to the overall variability, and (ii) the difference in the probability density of the group-specific data. Results obtained on artificial data and empirical biomedical data showed that “Impact” outperforms Cohen's d because of this additional component. It is shown that in a multivariate setting, while standard statistical analyses and Cohen's d are not able to identify effects that lead to changes in the form of the data distribution, “Impact” correctly captures them. The proposed effect size measure shares with machine learning algorithms the ability to observe such effects. It is numerically stable even for degenerate distributions consisting of singular values. Therefore, the proposed effect size measure is particularly well suited for data science and artificial intelligence-based knowledge discovery from big and heterogeneous data.


Cliff's Delta Calculator: A non-parametric effect size program for two groups of observations

Universitas Psychologica ◽

10.11144/javeriana.upsy10-2.cdcp ◽

2010 ◽

Vol 10 (2) ◽

pp. 545-555 ◽

Cited By ~ 101

Author(s):

Guillermo Macbeth ◽

Eugenia Razumiejczyk ◽

Rubén Daniel Ledesma

Keyword(s):

Effect Size ◽

Worked Examples ◽

Visual Interpretation ◽

Effect Size Measure ◽

P Values ◽

Size Measure ◽

Parametric Case ◽

Algorithmic Approaches ◽

Complementary Analysis ◽

Non Parametric

Cliff's Delta is a non-parametric effect size measure that quantifies the size of the difference between two groups of observations, going beyond the interpretation of p-values. It can be understood as a useful complement to the corresponding hypothesis test. Over the last two decades, methodologists and leading institutions in the behavioral sciences have strongly encouraged the use of effect size measures. The aim of this contribution is to introduce the Cliff's Delta Calculator, software that performs this analysis and offers some interpretation tips. Differences from and similarities with the parametric case are analysed and illustrated. The implementation of this free program is fully described and compared with other calculators. Alternative algorithmic approaches are analysed mathematically, and a basic linear algebra proof of their equivalence is formally presented. Two worked examples from cognitive psychology are discussed. A visual interpretation of Cliff's Delta is suggested. The availability, installation and applications of the program are presented and discussed.
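
For readers unfamiliar with the statistic, a minimal sketch of Cliff's Delta is given below. It uses the direct pairwise definition (the proportion of pairs in which the first group is larger, minus the proportion in which it is smaller) and is not the algorithm of the Cliff's Delta Calculator itself. The function name is an assumption made for this example.

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta via direct comparison of every (x, y) pair."""
    if not xs or not ys:
        raise ValueError("both groups must contain at least one observation")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    # Ties count toward neither term, so the result lies in [-1, 1].
    return (greater - less) / (len(xs) * len(ys))

# Made-up scores for two small groups.
print(cliffs_delta([3, 4], [1, 2, 3]))
```

A value of 1 or -1 means the two groups do not overlap at all, and 0 means that neither group tends to exceed the other. The pairwise loop costs time proportional to the product of the group sizes, which is fine for small samples.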


Evaluating Type I Error and Power Rates Using an Effect Size Measure With the Logistic Regression Procedure for DIF Detection

Applied Measurement in Education ◽

10.1207/s15324818ame1404_2 ◽

2001 ◽

Vol 14 (4) ◽

pp. 329-349 ◽

Cited By ~ 229

Author(s):

Michael G. Jodoin ◽

Mark J. Gierl

Keyword(s):

Logistic Regression ◽

Effect Size ◽

Type I Error ◽

Type I ◽

Effect Size Measure ◽

Regression Procedure ◽

Size Measure ◽

Logistic Regression Procedure


Gain Score in Item Response Theory as an Effect Size Measure

Educational and Psychological Measurement ◽

10.1177/0013164404264118 ◽

2004 ◽

Vol 64 (5) ◽

pp. 758-780 ◽

Cited By ~ 19

Author(s):

Wen-Chung Wang ◽

Wu Chyi-In

Keyword(s):

Item Response Theory ◽

Item Response ◽

Effect Size ◽

Response Theory ◽

Effect Size Measure ◽

Gain Score ◽

Size Measure


An effect size measure and Bayesian analysis of single-case designs

Journal of School Psychology ◽

10.1016/j.jsp.2013.12.002 ◽

2014 ◽

Vol 52 (2) ◽

pp. 213-230 ◽

Cited By ~ 38

Author(s):

Hariharan Swaminathan ◽

H. Jane Rogers ◽

Robert H. Horner

Keyword(s):

Bayesian Analysis ◽

Effect Size ◽

Single Case ◽

Effect Size Measure ◽

Size Measure ◽

Single Case Designs


Statistical and extra-statistical considerations in differential item functioning analyses

SA Journal of Industrial Psychology ◽

10.4102/sajip.v30i4.173 ◽

2004 ◽

Vol 30 (4) ◽

Cited By ~ 1

Author(s):

G. K. Huysamen

Keyword(s):

Differential Item Functioning ◽

South African ◽

Predictive Validity ◽

Test Development ◽

African Society ◽

South African Society ◽

Item Functioning ◽

Research Findings

This article briefly describes the main procedures for performing differential item functioning (DIF) analyses and points out some of the statistical and extra-statistical implications of these methods. Research findings on the sources of DIF, including those associated with translated tests, are reviewed. Because DIF analyses do not take into account the correlations between a test and relevant criteria, eliminating differentially functioning items does not necessarily improve predictive validity or reduce predictive bias. The implications of past DIF research findings for test development in the multilingual and multicultural South African society are considered.
