You Can Play the Game Without Knowing the Rules – But You’re Better Off Knowing Them

Abstract. Due to their high item difficulties and excellent psychometric properties, construction-based figural matrices tasks are of particular interest when it comes to high-stakes testing. An important prerequisite is that test preparation – which is likely to occur in this context – does not impair test fairness or item properties. The goal of this study was to provide initial evidence concerning the influence of test preparation. We administered test items to a sample of N = 882 participants divided into two groups, but only one group was given information about the rules employed in the test items. The probability of solving the items was significantly higher in the test preparation group than in the control group ( M = 0.61, SD = 0.19 vs. M = 0.41, SD = 0.25; t(54) = 3.42, p = .001; d = .92). Nevertheless, a multigroup confirmatory factor analysis, as well as a differential item functioning analysis, indicated no differences between the item properties in the two groups. The results suggest that construction-based figural matrices are suitable in the context of high-stakes testing when all participants are provided with test preparation material so that test fairness is ensured.

Download Full-text

Gender-related Differential Item Functioning Analysis on an ESL Test

Journal of Language Testing & Assessment ◽

10.23977/langta.2020.030102 ◽

2020 ◽

Vol 3 (1) ◽

pp. 5-19

Author(s):

Don Yao ◽

Keyword(s):

Differential Item Functioning ◽

Language Learners ◽

Test Performance ◽

Test Development ◽

The Body ◽

High Stakes ◽

Test Items ◽

Item Functioning ◽

Differential Item Functioning Analysis ◽

Two Phases

Differential item functioning (DIF) is a technique used to examine whether items function differently across different groups. The DIF analysis helps detect bias in an assessment to ensure the fairness of the assessment. However, most of the previous research has focused on high-stakes assessments. There is a dearth in research that laying emphasis on low-stakes assessments, which is also significant for the test development and validation process. Additionally, gender difference in test performance is always a particular concern for researchers to evaluate whether a test is fair or not. This present study investigated whether test items of the General English Proficiency Test for Kids (GEPT-Kids) are free of bias in terms of gender differences. A mixed-method sequential explanatory research design was adopted with two phases. In phase I, test performance data of 492 participants from five Chinese speaking cities were analyzed by the Mantel-Haenszel (MH) method to detect gender DIF. In phase II, items that manifested DIF were subject to content analysis through three experienced reviewers to identify possible sources of DIF. The results showed that three items were detected with moderate gender DIF through statistical methods and three items were identified as possible biased items by expert judgment. The results provide preliminary contributions to DIF analysis for low-stakes assessment in the field of language assessment. Besides, young language learners, especially in the Chinese context, have been drawn renewed attention. Thus, the results may also add to the body of literature that can shed some light on the test development for young language learners.

Download Full-text

What is at stake in knowing the content and capabilities of children’s minds?

Theory and Research in Education ◽

10.1177/1477878504046524 ◽

2004 ◽

Vol 2 (3) ◽

pp. 283-308 ◽

Cited By ~ 12

Author(s):

Stephen P. Norris ◽

Jacqueline P. Leighton ◽

Linda M. Phillips

Keyword(s):

Construct Validity ◽

High Stakes Testing ◽

Cognitive Models ◽

Test Design ◽

Educational Testing ◽

High Stakes ◽

Test Items ◽

Reasoning Strategies ◽

Interpretation Process ◽

The One

Many significant changes in perspective have to take place before efforts to learn the content and capabilities of children’s minds can hold much sway in educational testing. The language of testing, especially of high stakes testing, remains firmly in the realm of ‘behaviors’, ‘performance’ and ‘competency’ defined in terms of behaviors, test items, or observations. What is on children’s minds is not taken into account as integral to the test design and interpretation process. The point of this article is to argue that behaviorist-based validation models are ill-founded, and to recommend basing tests on cognitive models that theorize the content and capabilities of children’s minds in terms of such features as meta-cognition, reasoning strategies, and principles of sound thinking. This approach is the one most likely to yield the construct validity for tests long endorsed by many testing theorists. The implications of adopting a cognitive basis for testing that might be upsetting to many current practices are explored.

Download Full-text

Third through Sixth Graders' Perceptions of High-Stakes Testing

Journal of Literacy Research ◽

10.1207/s15548430jlr3702_5 ◽

2005 ◽

Vol 37 (2) ◽

pp. 237-260 ◽

Cited By ~ 30

Author(s):

Cheri Foster Triplett ◽

Mary Alice Barksdale

Keyword(s):

Test Anxiety ◽

Test Preparation ◽

High Stakes Testing ◽

Sixth Graders ◽

High Stakes ◽

Power Politics ◽

Content Areas ◽

Language And Culture ◽

High Stakes Tests ◽

Written Descriptions

This study examined elementary students' perceptions of high-stakes testing through the use of drawings and writings. On the day after students completed their high-stakes tests in the spring, 225 students were asked to “draw a picture about your recent testing experience.” The same students then responded in writing to the prompt “tell me about your picture.” During data analysis, nine categories were constructed from the themes in students' drawings and written descriptions: Emotions, Easy, Content Areas, Teacher Role, Student Metaphors, Fire, Power/Politics, Adult Language, and Culture of Testing. Each of these categories was supported by drawings and written descriptions. Two additional categories were compelling because of their prevalence in students' drawings: Accoutrements of Testing and Isolation. The researchers examine the prevailing negativity in students' responses and suggest ways to decrease students' overall test anxiety, including making changes in the overall testing culture and changing the role teachers play in test preparation.

Download Full-text

PND27 CONFIRMATORY FACTOR ANALYSIS AND DIFFERENTIAL ITEM FUNCTIONING ANALYSIS OF THE MIGRAINE-SPECIFIC QUALITY OF LIFE QUESTIONNAIRE VERSION 2.1 IN CHRONIC MIGRAINEURS

Value in Health ◽

10.1016/s1098-3015(10)72689-9 ◽

2010 ◽

Vol 13 (3) ◽

pp. A142

Author(s):

R Rendas-Baum ◽

GA Maglinte ◽

M DeRosa ◽

M Yang ◽

SF Varon

Keyword(s):

Quality Of Life ◽

Factor Analysis ◽

Life Questionnaire ◽

Quality Of Life Questionnaire ◽

Item Functioning ◽

Confirmatory Factor ◽

Differential Item Functioning Analysis ◽

Questionnaire Version ◽

Chronic Migraineurs

Download Full-text

Flexibility at the Price of Volatility: Concurrent Calibration in Multistage Tests in Practice Using a 2PL Model

Frontiers in Education ◽

10.3389/feduc.2021.679864 ◽

2021 ◽

Vol 6 ◽

Author(s):

Laura A. Helbling ◽

Stéphanie Berger ◽

Angela Verschoor

Keyword(s):

High Stakes Testing ◽

Practical Reasons ◽

Test Quality ◽

High Stakes ◽

Test Items ◽

Item Parameters ◽

Concurrent Calibration ◽

Student Ability ◽

To Come ◽

Multistage Test

Multistage test (MST) designs promise efficient student ability estimates, an indispensable asset for individual diagnostics in high-stakes educational assessments. In high-stakes testing, annually changing test forms are required because publicly known test items impair accurate student ability estimation, and items of bad model fit must be continually replaced to guarantee test quality. This requires a large and continually refreshed item pool as the basis for high-stakes MST. In practice, the calibration of newly developed items to feed annually changing tests is highly resource intensive. Piloting based on a representative sample of students is often not feasible, given that, for schools, participation in actual high-stakes assessments already requires considerable organizational effort. Hence, under practical constraints, the calibration of newly developed items may take place on the go in the form of a concurrent calibration in MST designs. Based on a simulation approach this paper focuses on the performance of Rasch vs. 2PL modeling in retrieving item parameters when items are for practical reasons non-optimally placed in multistage tests. Overall, the results suggest that the 2PL model performs worse in retrieving item parameters compared to the Rasch model when there is non-optimal item assembly in the MST; especially in retrieving parameters at the margins. The higher flexibility of 2PL modeling, where item discrimination is allowed to vary, seems to come at the cost of increased volatility in parameter estimation. Although the overall bias may be modest, single items can be affected by severe biases when using a 2PL model for item calibration in the context of non-optimal item placement.

Download Full-text

Test fairness and assessment of differential item functioning of mathematics achievement test for senior secondary students in Cross River state, Nigeria using item response theory

Global Journal of Educational Research ◽

10.4314/gjedr.v20i1.6 ◽

2021 ◽

Vol 20 (1) ◽

pp. 55-62

Author(s):

Anthony Pius Effiom

Keyword(s):

Item Response Theory ◽

Mathematics Achievement ◽

Differential Item Functioning ◽

Item Response ◽

Female Students ◽

Response Theory ◽

Test Fairness ◽

Cross River State ◽

Test Items ◽

Item Functioning

This study used Item Response Theory approach to assess Differential Item Functioning (DIF) and detect item bias in Mathematics Achievement Test (MAT). The MAT was administered to 1,751 SS2 students in public secondary schools in Cross River State. Instrumentation research design was used to develop and validate a 50-item instrument. Data were analysed using the maximum likelihood estimation technique of BILOG-MG V3 software. The result of the study revealed that 6% of the total items exhibited differential item functioning between the male and female students. Based on the analysis, the study observed that there was sex bias on some of the test items in the MAT. DIF analysis attempt at eliminating irrelevant factors and sources of bias from any kind for a test to yield valid results is among the best methods of recent. As such, test developers and policymakers are recommended to take into serious consideration and exercise care in fair test practice by dedicating effort to more unbiased test development and decision making. Examination bodies should adopt the Item Response Theory in educational testing and test developers should therefore be mindful of the test items that can cause bias in response pattern between male and female students or any sub-group of consideration. Keywords: Assessment, Differential Item Functioning, Validity, Reliability, Test Fairness, Item Bias, Item Response Theory.

Download Full-text

Checking Equity: Why Differential Item Functioning Analysis Should Be a Routine Part of Developing Conceptual Assessments

CBE—Life Sciences Education ◽

10.1187/cbe.16-10-0307 ◽

2017 ◽

Vol 16 (2) ◽

pp. rm2 ◽

Cited By ~ 16

Author(s):

Patrícia Martinková ◽

Adéla Drabinová ◽

Yuan-Ling Liaw ◽

Elizabeth A. Sanders ◽

Jenny L. McFarland ◽

...

Keyword(s):

Differential Item Functioning ◽

Simulated Data ◽

Biology Education ◽

Test Fairness ◽

Data Set ◽

Education Literature ◽

Item Functioning ◽

Significant Difference ◽

Differential Item Functioning Analysis ◽

Methodological Approaches

We provide a tutorial on differential item functioning (DIF) analysis, an analytic method useful for identifying potentially biased items in assessments. After explaining a number of methodological approaches, we test for gender bias in two scenarios that demonstrate why DIF analysis is crucial for developing assessments, particularly because simply comparing two groups’ total scores can lead to incorrect conclusions about test fairness. First, a significant difference between groups on total scores can exist even when items are not biased, as we illustrate with data collected during the validation of the Homeostasis Concept Inventory. Second, item bias can exist even when the two groups have exactly the same distribution of total scores, as we illustrate with a simulated data set. We also present a brief overview of how DIF analysis has been used in the biology education literature to illustrate the way DIF items need to be reevaluated by content experts to determine whether they should be revised or removed from the assessment. Finally, we conclude by arguing that DIF analysis should be used routinely to evaluate items in developing conceptual assessments. These steps will ensure more equitable—and therefore more valid—scores from conceptual assessments.

Download Full-text

The Multidimensionality of Measurement Bias in High‐Stakes Testing: Using Machine Learning to Evaluate Complex Sources of Differential Item Functioning

Educational Measurement Issues and Practice ◽

10.1111/emip.12486 ◽

2022 ◽

Author(s):

William C. M. Belzak

Keyword(s):

Machine Learning ◽

Differential Item Functioning ◽

High Stakes Testing ◽

Measurement Bias ◽

High Stakes ◽

Item Functioning

Download Full-text

High-Stakes Testing & Student Learning

Education Policy Analysis Archives ◽

10.14507/epaa.v10n18.2002 ◽

2002 ◽

Vol 10 ◽

pp. 18 ◽

Cited By ~ 98

Author(s):

Audrey L. Amrein ◽

David C. Berliner

Keyword(s):

Student Learning ◽

Test Preparation ◽

High Stakes Testing ◽

High Stakes ◽

Preparation Programs ◽

Testing Program ◽

History Of ◽

The Individual ◽

Testing Policies

A brief history of high-stakes testing is followed by an analysis of eighteen states with severe consequences attached to their testing programs. These 18 states were examined to see if their high-stakes testing programs were affecting student learning, the intended outcome of high-stakes testing policies promoted throughout the nation. Scores on the individual tests that states use were not analyzed for evidence of learning. Such scores are easily manipulated through test-preparation programs, narrow curricula focus, exclusion of certain students, and so forth. Student learning was measured by means of additional tests covering some of the same domain as each state's own high-stakes test. The question asked was whether transfer to these domains occurs as a function of a state's high-stakes testing program.

Download Full-text

The Embittered Mind

Journal of Individual Differences ◽

10.1027/1614-0001/a000208 ◽

2016 ◽

Vol 37 (4) ◽

pp. 213-222 ◽

Cited By ~ 4

Author(s):

Hansjörg Znoj ◽

Sandra Abegglen ◽

Ulrike Buchkremer ◽

Michael Linden

Keyword(s):

Emotional Reaction ◽

Control Group ◽

Systematic Research ◽

Roc Curve Analysis ◽

Confirmatory Factor Analyses ◽

Receiver Operating Characteristic Curves ◽

Confirmatory Factor ◽

Different Dimensions ◽

First Time ◽

Common Understanding

Abstract. There is a growing interest in embitterment as psychological concept. However, little systematic research has been conducted to characterize this emotional reaction. Still, there is an ongoing debate about the distinctiveness of embitterment and its dimensions. Additionally, a categorical and a dimensional perspective on embitterment have been developed independently over the last decade. The present study investigates the dimensions of embitterment by bringing these two different approaches together, for the first time. The Bern Embitterment Inventory (BEI) was given to 49 patients diagnosed with “Posttraumatic Embitterment Disorder (PTED)” and a matched control group of 49 patients with psychological disorders with other dominant emotional dysregulations. The ability to discriminate between the two groups was assessed by t-tests and Receiver Operating Characteristic Curves (ROC curve analysis). PTED patients scored significantly higher on the BEI than the patients of the control group. ROC analyses indicated diagnostic accuracy of the inventory. Further, we conducted Confirmatory Factor Analyses (CFA) to examine the different dimensions of embitterment and their relations. As a result, we found four characteristic dimensions of embitterment, namely disappointment, lack of acknowledge, pessimism, and misanthropy. In general, our findings showed a common understanding of embitterment as a unique but multidimensional emotional reaction to distressful life-events.

Download Full-text