Examining the Effects of Differential Item (Functioning and Differential) Test Functioning on Selection Decisions: When Are Statistically Significant Effects Practically Important?

2004 ◽  
Vol 89 (3) ◽  
pp. 497-508 ◽  
Author(s):  
Stephen Stark ◽  
Oleksandr S. Chernyshenko ◽  
Fritz Drasgow


2021 ◽
pp. 001316442110015
Author(s):  
Dimiter M. Dimitrov ◽  
Dimitar V. Atanasov

This study offers an approach to testing for differential item functioning (DIF) in a recently developed measurement framework, referred to as the D-scoring method (DSM). Under the proposed approach, called the P–Z method of testing for DIF, the item response functions of two groups (reference and focal) are compared by transforming their probabilities of correct item response, estimated under the DSM, into Z-scale normal deviates. Using the linear relationship between such Z-deviates, testing for DIF is reduced to testing two basic statistical hypotheses about equal variances and equal means of the Z-deviates for the reference and focal groups. The results from a simulation study support the efficiency (low Type I error and high power) of the proposed P–Z method. Furthermore, it is shown that the P–Z method is directly applicable in testing for differential test functioning. Recommendations for practical use and future research, including possible applications of the P–Z method in an IRT context, are also provided.
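The transformation step described in the abstract can be sketched in a few lines: each group's estimated probabilities of correct response are mapped to Z-scale normal deviates via the inverse normal CDF, after which DIF testing concerns the variances and means of those deviates. The probabilities below are illustrative numbers, not values from the study, and the raw summaries shown stand in for the method's actual test statistics.

```python
from statistics import NormalDist, mean, variance

# Hypothetical estimated probabilities of correct response for six items,
# reference and focal groups (illustrative values only).
p_ref = [0.82, 0.74, 0.65, 0.58, 0.49, 0.40]
p_foc = [0.78, 0.70, 0.60, 0.55, 0.44, 0.36]

nd = NormalDist()
# Transform probabilities into Z-scale normal deviates.
z_ref = [nd.inv_cdf(p) for p in p_ref]
z_foc = [nd.inv_cdf(p) for p in p_foc]

# The P-Z method reduces DIF testing to two hypotheses about these
# Z-deviates: equal variances and equal means across the two groups.
var_ratio = variance(z_ref) / variance(z_foc)  # input to an equal-variances test
mean_diff = mean(z_ref) - mean(z_foc)          # input to an equal-means test
```

Because the inverse normal CDF is monotone, uniformly lower probabilities for the focal group translate into a positive `mean_diff`; a variance ratio far from 1 would point at nonuniform DIF.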


2013 ◽  
Vol 34 (3) ◽  
pp. 170-183 ◽  
Author(s):  
Eunike Wetzel ◽  
Benedikt Hell

Large mean differences are consistently found in the vocational interests of men and women. These differences may be attributable to real differences in the underlying traits. However, they may also depend on the properties of the instrument being used. It is conceivable that, in addition to the intended dimension, items assess a second dimension that differentially influences responses by men and women. This question is addressed in the present study by analyzing a widely used German interest inventory (Allgemeiner Interessen-Struktur-Test, AIST-R) regarding differential item functioning (DIF) using a DIF estimate in the framework of item response theory. Furthermore, the impact of DIF at the scale level is investigated using differential test functioning (DTF) analyses. Several items on the AIST-R’s scales showed significant DIF, especially on the Realistic, Social, and Enterprising scales. Removal of DIF items reduced gender differences on the Realistic scale, though gender differences on the Investigative, Artistic, and Social scales remained practically unchanged. Thus, responses to some AIST-R items appear to be influenced by a secondary dimension apart from the interest domain the items were intended to measure.


2019 ◽  
Vol 38 (5) ◽  
pp. 627-641
Author(s):  
Beyza Aksu Dunya ◽  
Clark McKown ◽  
Everett Smith

Emotion recognition (ER) involves understanding what others are feeling by interpreting nonverbal behavior, including facial expressions. The purpose of this study is to evaluate the psychometric properties of a web-based social ER assessment designed for children in kindergarten through third grade. Data were collected from two separate samples of children. The first sample included 3,224 children and the second sample included 4,419 children. Data were calibrated using the Rasch dichotomous model. Differential item and test functioning were also evaluated across gender and ethnicity. Across both samples, we found consistent item fit, unidimensional item structure, and adequate item targeting. Analyses of differential item functioning (DIF) found six out of 111 items displaying DIF across gender and no items demonstrating DIF across ethnicity. The analyses of person measure calibrations with and without DIF items yielded no evidence of differential test functioning (DTF) across gender and ethnicity groups in both samples.
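Under the dichotomous Rasch model used for calibration, the probability of a correct response depends only on the difference between person ability θ and item difficulty b, and gender DIF appears as a gap between an item's difficulties when calibrated separately in the two groups. A minimal sketch with hypothetical difficulty values:

```python
import math

def rasch_p(theta: float, b: float) -> float:
    # Dichotomous Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b))
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical difficulties for one item calibrated separately by gender;
# a gap larger than its standard error is the signature of DIF.
b_boys, b_girls = -0.20, 0.35
gap = b_girls - b_boys  # difficulty difference in logits
```

At θ = b the model gives a 0.5 probability of success, which is why calibration reports difficulties on the same logit scale as abilities.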


2020 ◽  
pp. 073428292094552
Author(s):  
Maryellen Brunson McClain ◽  
Bryn Harris ◽  
Sarah E. Schwartz ◽  
Megan E. Golson

Although the racial/ethnic demographics in the United States are changing, few studies evaluate the cultural and linguistic responsiveness of commonly used autism spectrum disorder screening and diagnostic assessment measures. The purpose of this study is to evaluate item and test functioning of the Autism Spectrum Rating Scales (ASRS) in a sample of racially/ethnically diverse parents of children (nonclinical) between the ages of 6 and 18 ( N = 806). This study is a follow-up to a prior publication examining the factor structure of the ASRS in a similar sample. The present study furthers the examination of measurement invariance of the ASRS in racially/ethnically diverse populations by conducting differential item functioning (DIF) and differential test functioning (DTF) analyses with a larger sample. Results indicate test-level invariance; however, five items are noninvariant across parent reporters from different racial/ethnic groups. Implications for practice and directions for future research are discussed.


1998 ◽  
Vol 24 (2) ◽  
Author(s):  
E. Van Zyl ◽  
D. Visser

The elimination of unfair discrimination and cultural bias of any kind is a contentious workplace issue in contemporary South Africa. To ensure fairness in testing, psychometric instruments are subjected to empirical investigations for the detection of possible bias that could lead to selection decisions constituting unfair discrimination. This study was conducted to explore the possible existence of differential item functioning (DIF), or potential bias, in the Figure Classification Test (A121) by means of the Mantel-Haenszel chi-square technique. The sample consisted of 498 men at a production company in the Western Cape. Although statistical analysis revealed significant differences between the mean test scores of three racial groups on the test, very few items were identified as having statistically significant DIF. The possibility is discussed that, despite the presence of some DIF, the differences between the means may not be due to the measuring instrument itself being biased, but rather to extraneous sources of variation, such as the unequal education and socio-economic backgrounds of the racial groups. It was concluded that there is very little evidence of item bias in the test.
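The Mantel-Haenszel technique used in this study stratifies examinees by a matching criterion (typically total test score) and, for each item, pools the 2×2 group-by-response tables across strata into a common odds ratio. A minimal sketch with illustrative counts (not data from the study); an odds ratio near 1 suggests no DIF on the item:

```python
# Each stratum is a matched-score level:
# (ref_correct, ref_incorrect, focal_correct, focal_incorrect)
strata = [
    (30, 10, 25, 15),
    (40, 20, 35, 25),
    (20, 25, 15, 30),
]

# Mantel-Haenszel common odds ratio:
# sum(A_j * D_j / T_j) / sum(B_j * C_j / T_j), with T_j the stratum total.
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den  # > 1 favors the reference group on this item
```

The associated MH chi-square statistic tests whether this common odds ratio differs significantly from 1, which is the per-item significance test the abstract refers to.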


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Philseok Lee ◽  
Seang-Hwane Joo

To address faking issues associated with Likert-type personality measures, multidimensional forced-choice (MFC) measures have recently gained prominence as components of personnel assessment systems. Despite various efforts to investigate the fake resistance of MFC measures, previous research has mainly focused on the scale mean differences between honest and faking conditions. Given the recent psychometric advancements in MFC measures (e.g., Stark et al., 2005; Brown & Maydeu-Olivares, 2011; Joo et al., 2019; Lee et al., 2019), there is a need to investigate the fake resistance of MFC measures through a new methodological lens. This research investigates the fake resistance of MFC measures through recently proposed differential item functioning (DIF) and differential test functioning (DTF) methodologies for MFC measures (Lee, Joo, & Stark, 2020). Overall, our results show that MFC measures are more fake resistant than Likert-type measures at the item and test levels. However, MFC measures may still be susceptible to faking if they include many mixed blocks, that is, blocks combining positively and negatively keyed statements. Future research may need to identify an optimal strategy for designing mixed blocks in MFC measures that satisfies the goals of both validity and scoring accuracy. Practical implications and limitations are discussed in the paper.


Diagnostica ◽  
2021 ◽  
Vol 67 (1) ◽  
pp. 13-23
Author(s):  
Ariana Garrote ◽  
Elisabeth Moser Opitz

Summary. In this study, the MARKO-D test (Mathematik- und Rechenkonzepte im Vorschulalter – Diagnose) was administered to a sample of children from German-speaking Switzerland ( N = 555) in the first and second year of kindergarten, and it was analyzed whether the age norms from the German sample can be transferred to Switzerland. In addition, the test was examined for measurement invariance over time with a subsample ( n = 87). The results of the unidimensional Rasch model show that the instrument is suitable for Switzerland. Test performance, however, depends on kindergarten attendance; for Switzerland, norms per kindergarten half-year would therefore have to be used in addition to age norms. Differential item functioning analysis showed that 17 of 55 items are affected by substantial measurement non-invariance over time. To be usable for longitudinal studies, the instrument would have to be developed further.

