About Still Nonignorable Consequences of (Partially) Ignoring Missing Item Responses in Large-scale Assessment

2020 ◽  
Author(s):  
Alexander Robitzsch

In recent literature, alternative models for handling missing item responses in large-scale assessments have been proposed, motivated by simulations and by arguments based on test theory (Rose, 2013). In these approaches, it is argued that missing item responses should never be scored as incorrect but should rather be treated as ignorable (e.g., Pohl et al., 2014). The present contribution shows that these arguments have limited validity and illustrates the consequences in a country comparison within the PIRLS 2011 study. Treating missing item responses in any way other than recoding them as incorrect leads to significant changes in country rankings, which has nonignorable consequences for the validity of the results. Additionally, two alternative item response models based on different assumptions about missing item responses are proposed.
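To make the contrast concrete, the following is a minimal numerical sketch with invented data (not the PIRLS 2011 analysis): scoring missing responses as incorrect pulls a simple proportion-correct score down, whereas leaving them out of the calculation does not.

```python
# Toy illustration (invented data, not the PIRLS analysis): how two
# treatments of missing item responses change a proportion-correct score.
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 10)).astype(float)   # 5 examinees, 10 items
X[rng.random(X.shape) < 0.2] = np.nan                 # ~20% missing responses

score_as_wrong = np.nan_to_num(X, nan=0.0).mean(axis=1)  # missing counted as 0
ignore_missing = np.nanmean(X, axis=1)                   # missing left out

print(np.round(score_as_wrong, 2))
print(np.round(ignore_missing, 2))
```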

Author(s):  
Alexander Robitzsch

Missing item responses are prevalent in educational large-scale assessment studies like the Programme for International Student Assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated a model-based treatment based on the latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item itself. The scoring of missing item responses as wrong and the latent ignorable model are submodels of the Mislevy-Wu model. This article uses the PISA 2018 mathematics dataset to investigate the consequences of different missing data treatments on country means. The obtained country means can differ substantially across the scaling models. In contrast to previous statements in the literature, scoring missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item itself, after conditioning on the latent response propensity, was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected from two perspectives. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, we argue that model fit should play only a minor role in choosing psychometric models in large-scale assessment studies because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.
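For orientation, here is one possible logistic parameterization of the response-indicator part of the Mislevy-Wu idea; it is an illustrative sketch, and the article's exact specification may differ.

```latex
% Illustrative parameterization (the article's exact specification may
% differ). R_{pi} is the response indicator, X_{pi} the (possibly
% unobserved) item response, \theta_p the ability, \xi_p the response
% propensity.
P(R_{pi} = 1 \mid X_{pi}, \theta_p, \xi_p)
  = \operatorname{logit}^{-1}\!\left( \gamma_i + \xi_p + \delta_i X_{pi} \right)
% \delta_i = 0: latent ignorability (missingness independent of X_{pi}
%   given \xi_p).
% \delta_i very large: correct responses are essentially never missing,
%   so missingness can only stem from incorrect responses, which is the
%   situation in which scoring missing responses as wrong is appropriate.
```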


2021 ◽  
Vol 11 (4) ◽  
pp. 1653-1687
Author(s):  
Alexander Robitzsch

Missing item responses are prevalent in educational large-scale assessment studies such as the Programme for International Student Assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated for a model-based treatment based on the latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item itself. The scoring of missing item responses as wrong and the latent ignorable model are submodels of the Mislevy-Wu model. In an illustrative simulation study, it is shown that the Mislevy-Wu model provides unbiased parameter estimates. Moreover, the simulation replicates the finding from various simulation studies in the literature that scoring missing item responses as wrong yields biased estimates if the latent ignorability assumption holds in the data-generating model. However, if missing item responses can only arise from incorrect item responses, applying an item response model that relies on latent ignorability results in biased estimates. The Mislevy-Wu model guarantees unbiased parameter estimates if this more general model holds in the data-generating model. In addition, this article uses the PISA 2018 mathematics dataset as a case study to investigate the consequences of different missing data treatments on country means and country standard deviations. The obtained country means and standard deviations can differ substantially across the scaling models. In contrast to previous statements in the literature, scoring missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item itself, after conditioning on the latent response propensity, was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected from two perspectives. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, in the discussion section, we argue that model fit should play only a minor role in choosing psychometric models in large-scale assessment studies because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.
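The bias pattern described above can be illustrated with a minimal simulation sketch (invented parameters and plain proportion-correct scores rather than the article's scaling models): when missingness depends only on the response propensity, scoring missing responses as wrong is biased; when missingness can arise only from incorrect responses, ignoring the missing responses is biased instead.

```python
# Minimal sketch of the bias pattern described above (invented parameters,
# not the article's simulation design). Responses follow a Rasch model;
# delta controls how strongly missingness depends on the latent response.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 2000, 30
theta = rng.normal(0.0, 1.0, n_persons)          # abilities
b = rng.normal(0.0, 1.0, n_items)                # item difficulties
xi = rng.normal(0.0, 1.0, n_persons)             # response propensity

p_correct = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random((n_persons, n_items)) < p_correct).astype(float)

def simulate_missing(delta):
    # P(observed) = logit^-1(1 + xi + delta * X); delta = 0 is latent
    # ignorability, large delta means mostly incorrect responses go missing.
    eta = 1.0 + xi[:, None] + delta * X
    observed = rng.random((n_persons, n_items)) < 1 / (1 + np.exp(-eta))
    Xobs = X.copy()
    Xobs[~observed] = np.nan
    return Xobs

for delta in (0.0, 4.0):
    Xobs = simulate_missing(delta)
    as_wrong = np.nan_to_num(Xobs, nan=0.0).mean()   # score missing as 0
    ignored = np.nanmean(Xobs)                        # drop missing responses
    print(f"delta={delta}: true={X.mean():.3f}, "
          f"as-wrong={as_wrong:.3f}, ignored={ignored:.3f}")
```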


Author(s):  
Theresa Rohm ◽  
Claus H Carstensen ◽  
Luise Fischer ◽  
Timo Gnambs

In large-scale educational assessments, interviewers should ensure standardized settings for all participants. However, in practice, many interviewers do not strictly adhere to standardized field protocols. Therefore, systematic interviewer effects on the measurement of mathematical competence were examined in a representative sample of N = 5,139 German adults. To account for interviewers working in specific geographical regions, interviewer and area effects were disentangled using cross-classified multilevel item response models. These analyses showed that interviewer behavior distorted competence measurements, whereas regional effects were negligible. On a more general note, it is demonstrated how conspicuous interviewer behavior can be identified with Bayesian multilevel models.
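A sketch of the general structure of such a cross-classified model is given below; the notation is illustrative and not necessarily the authors' exact specification.

```latex
% Illustrative cross-classified multilevel Rasch model. Person p responds
% to item i, is interviewed by interviewer j(p), and lives in area k(p).
\operatorname{logit} P(X_{pi} = 1)
  = \theta_p + u_{j(p)} + v_{k(p)} - \beta_i,
\qquad
u_j \sim N(0, \sigma^2_{\text{interviewer}}), \quad
v_k \sim N(0, \sigma^2_{\text{area}})
% The two crossed variance components separate interviewer effects from
% regional effects; a near-zero \sigma^2_{\text{area}} corresponds to the
% finding that regional effects were negligible.
```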


Author(s):  
Pāvels Pestovs ◽  
Dace Namsone ◽  
Līga Čakāne ◽  
Ilze Saleniece

One of the goals of the National Development Plan 2014-2020 is to reduce the proportion of students with low cognitive skills and, at the same time, increase the proportion of students with higher-level cognitive skills. In line with those goals, the National Centre for Education is implementing the project “Competency-based approach to curriculum”, funded by the European Social Fund. The purpose of the research described in this article is to find out to what extent the current large-scale national assessments for 6th Grade are coherent with the new curriculum and what improvements are needed to align the national assessments with the national curriculum. The theoretical framework of the research was developed by analysing the frameworks of the Programme for International Student Assessment (PISA), the Trends in International Mathematics and Science Study (TIMSS), and the Progress in International Reading Literacy Study (PIRLS), as well as the framework of the revised national curriculum in Latvia. The national 6th Grade assessments of 2018 are analysed using classical test theory and the Rasch model, and the indicators of the test items are mapped onto the developed theoretical framework. The authors conclude that the national 6th Grade tests assess elements of literacy, numeracy, and scientific literacy. Students perform well on test items of low cognitive depth, but there are too few test items of high cognitive depth that allow pupils to demonstrate skills in new contexts, which is an essential goal of the new national curriculum. Further research is required on the use of data from large-scale assessments in supporting and guiding student instruction and learning.
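As a hedged illustration of the classical test theory side of such an item analysis, the following sketch computes two standard item indices, item difficulty and the corrected item-total correlation, on invented data.

```python
# Sketch of two classical test theory item indices (invented data, not the
# national assessment items): difficulty as proportion correct, and the
# corrected item-total correlation (item excluded from its own total).
import numpy as np

rng = np.random.default_rng(2)
X = (rng.random((200, 12)) < rng.uniform(0.3, 0.9, 12)).astype(float)

difficulty = X.mean(axis=0)                       # proportion correct per item
total = X.sum(axis=1)
item_total = np.array([
    np.corrcoef(X[:, i], total - X[:, i])[0, 1]   # corrected item-total r
    for i in range(X.shape[1])
])

print(np.round(difficulty, 2))
print(np.round(item_total, 2))
```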


2020 ◽  
Vol 29 (4) ◽  
pp. 996-1014
Author(s):  
R Gorter ◽  
J-P Fox ◽  
I Eekhout ◽  
MW Heymans ◽  
JWR Twisk

In medical research, repeated questionnaire data are often used to measure and model latent variables across time. Through a novel imputation method, a direct comparison is made between latent growth analysis under classical test theory and under item response theory, while also including the effects of missing item responses. For both classical test theory and item response theory, the effects of item missingness on latent growth parameter estimates are examined in a simulation study based on longitudinal item response data. Several missing data mechanisms and conditions are evaluated in the simulation study. The additional effects of missingness on differences between classical test theory- and item response theory-based latent growth analysis are directly assessed by rescaling the multiple imputations. The multiple imputation method is used to generate latent variable and item scores from the posterior predictive distributions to account for missing item responses in observed multilevel binary response data. It is shown that a multivariate probit model, as a novel imputation model, improves the latent growth analysis when dealing with data that are missing at random (MAR) under classical test theory. The study also shows that the parameter estimates for the latent growth model under item response theory exhibit less bias and smaller MSEs than the estimates under classical test theory.
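A hedged sketch of how a multivariate probit model can serve as the imputation model for binary item responses is given below; the notation is illustrative and may differ from the authors' specification.

```latex
% Illustrative multivariate probit imputation model for binary item
% responses. For person p, item i, and time point t, a latent continuous
% variable underlies each observed response:
Z_{pit} = \lambda_{it}\,\theta_{pt} + \varepsilon_{pit},
\qquad \varepsilon_{pit} \sim N(0, 1),
\qquad X_{pit} = \mathbf{1}\{ Z_{pit} > 0 \}
% Missing X_{pit} are filled in by drawing Z_{pit} (and hence X_{pit}) from
% the posterior predictive distribution; rescaling these multiple
% imputations allows the CTT- and IRT-based latent growth analyses to be
% compared directly.
```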


2018 ◽  
Vol 43 (7) ◽  
pp. 543-561 ◽  
Author(s):  
Yuan-Pei Chang ◽  
Chia-Yi Chiu ◽  
Rung-Ching Tsai

Cognitive diagnostic computerized adaptive testing (CD-CAT) has been suggested by researchers as a diagnostic tool for assessment and evaluation. Although model-based CD-CAT is relatively well researched in the context of large-scale assessment systems, this type of system has not received the same degree of research and development in small-scale settings, such as at the course-based level, where it would be most useful. The main obstacle is that the statistical estimation techniques successfully applied in the context of large-scale assessment require large samples to guarantee reliable calibration of the item parameters and an accurate estimation of the examinees’ proficiency class membership. Such samples are simply not obtainable in course-based settings. Therefore, this study proposes a nonparametric item selection (NPS) method that does not require any parameter calibration and thus can be used in small educational programs. The proposed nonparametric CD-CAT uses the nonparametric classification (NPC) method to estimate an examinee’s attribute profile; based on the examinee’s item responses, the item that best discriminates between the estimated attribute profile and the other attribute profiles is then selected. The simulation results show that the NPS method outperformed the parametric CD-CAT algorithms it was compared with, and the differences were substantial when the calibration samples were small.
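The following Python sketch illustrates the general idea behind NPC-based classification and item selection, using a hypothetical Q-matrix and DINA-style ideal response patterns; it is not the authors' exact NPS algorithm.

```python
# Hedged sketch of nonparametric classification (NPC) and a simple item
# selection rule for CD-CAT: classify by Hamming distance to DINA-style
# ideal response patterns, then pick the unanswered item that best
# separates the current profile estimate from its closest rivals.
import itertools
import numpy as np

Q = np.array([[1, 0, 0],   # hypothetical Q-matrix: items x attributes
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])

# All 2^K attribute profiles and their conjunctive (DINA-like) ideal
# responses: an item is solved iff all required attributes are mastered.
profiles = np.array(list(itertools.product([0, 1], repeat=Q.shape[1])))
ideal = (profiles @ Q.T == Q.sum(axis=1)).astype(int)   # profiles x items

answered = {0: 1, 3: 0, 4: 1}     # item index -> observed response (toy data)

# NPC step: classify by Hamming distance on the answered items.
items = list(answered)
obs = np.array([answered[i] for i in items])
dist = np.abs(ideal[:, items] - obs).sum(axis=1)
best = int(np.argmin(dist))

# Item selection step: among unanswered items, choose the one whose ideal
# response under the current estimate disagrees most often with the ideal
# responses of the closest rival profiles.
rivals = [p for p in np.argsort(dist) if p != best][:3]
candidates = [i for i in range(Q.shape[0]) if i not in answered]
next_item = max(candidates,
                key=lambda i: int(np.sum(ideal[rivals, i] != ideal[best, i])))

print("estimated profile:", profiles[best], "next item:", next_item)
```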

