About Still Nonignorable Consequences of (Partially) Ignoring Missing Item Responses in Large-scale Assessment

2020 ◽  
Author(s):  
Alexander Robitzsch

In recent literature, alternative models for handling missing item responses in large-scale assessments have been proposed, motivated by simulations and by arguments based on test theory (Rose, 2013). In these approaches, it is argued that missing item responses should never be scored as incorrect but should rather be treated as ignorable (e.g., Pohl et al., 2014). The present contribution shows that these arguments have limited validity and illustrates the consequences in a country comparison within the PIRLS 2011 study. Treating missing item responses in any way other than recoding them as incorrect leads to significant changes in country rankings, which has nonignorable consequences for the validity of the results. Additionally, two alternative item response models based on different assumptions about missing item responses are proposed.
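To make the contrast concrete, the following is a minimal numerical sketch with invented data (not the PIRLS 2011 analysis): scoring missing responses as incorrect pulls a simple proportion-correct score down, whereas leaving them out of the calculation does not.

```python
# Toy illustration (invented data, not the PIRLS analysis): how two
# treatments of missing item responses change a proportion-correct score.
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 10)).astype(float)   # 5 examinees, 10 items
X[rng.random(X.shape) < 0.2] = np.nan                 # ~20% missing responses

score_as_wrong = np.nan_to_num(X, nan=0.0).mean(axis=1)  # missing counted as 0
ignore_missing = np.nanmean(X, axis=1)                   # missing left out

print(np.round(score_as_wrong, 2))
print(np.round(ignore_missing, 2))
```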

Author(s):  
Alexander Robitzsch

Missing item responses are prevalent in educational large-scale assessment studies like the Programme for International Student Assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated a model-based treatment based on the latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item itself. The scoring of missing item responses as wrong and the latent ignorable model are submodels of the Mislevy-Wu model. This article uses the PISA 2018 mathematics dataset to investigate the consequences of different missing data treatments on country means. The obtained country means can differ substantially across the scaling models. In contrast to previous statements in the literature, scoring missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item itself, after conditioning on the latent response propensity, was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected from two perspectives. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, we argue that model fit should play only a minor role in choosing psychometric models in large-scale assessment studies because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.
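For orientation, here is one possible logistic parameterization of the response-indicator part of the Mislevy-Wu idea; it is an illustrative sketch, and the article's exact specification may differ.

```latex
% Illustrative parameterization (the article's exact specification may
% differ). R_{pi} is the response indicator, X_{pi} the (possibly
% unobserved) item response, \theta_p the ability, \xi_p the response
% propensity.
P(R_{pi} = 1 \mid X_{pi}, \theta_p, \xi_p)
  = \operatorname{logit}^{-1}\!\left( \gamma_i + \xi_p + \delta_i X_{pi} \right)
% \delta_i = 0: latent ignorability (missingness independent of X_{pi}
%   given \xi_p).
% \delta_i very large: correct responses are essentially never missing,
%   so missingness can only stem from incorrect responses, which is the
%   situation in which scoring missing responses as wrong is appropriate.
```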


2021 ◽  
Vol 11 (4) ◽  
pp. 1653-1687
Author(s):  
Alexander Robitzsch

Missing item responses are prevalent in educational large-scale assessment studies such as the Programme for International Student Assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated for a model-based treatment based on the latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item itself. The scoring of missing item responses as wrong and the latent ignorable model are submodels of the Mislevy-Wu model. In an illustrative simulation study, it is shown that the Mislevy-Wu model provides unbiased parameter estimates. Moreover, the simulation replicates the finding from various simulation studies in the literature that scoring missing item responses as wrong yields biased estimates if the latent ignorability assumption holds in the data-generating model. However, if missing item responses can only arise from incorrect item responses, applying an item response model that relies on latent ignorability results in biased estimates. The Mislevy-Wu model guarantees unbiased parameter estimates if this more general model holds in the data-generating model. In addition, this article uses the PISA 2018 mathematics dataset as a case study to investigate the consequences of different missing data treatments on country means and country standard deviations. The obtained country means and standard deviations can differ substantially across the scaling models. In contrast to previous statements in the literature, scoring missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item itself, after conditioning on the latent response propensity, was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected from two perspectives. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, in the discussion section, we argue that model fit should play only a minor role in choosing psychometric models in large-scale assessment studies because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.
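The bias pattern described above can be illustrated with a minimal simulation sketch (invented parameters and plain proportion-correct scores rather than the article's scaling models): when missingness depends only on the response propensity, scoring missing responses as wrong is biased; when missingness can arise only from incorrect responses, ignoring the missing responses is biased instead.

```python
# Minimal sketch of the bias pattern described above (invented parameters,
# not the article's simulation design). Responses follow a Rasch model;
# delta controls how strongly missingness depends on the latent response.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 2000, 30
theta = rng.normal(0.0, 1.0, n_persons)          # abilities
b = rng.normal(0.0, 1.0, n_items)                # item difficulties
xi = rng.normal(0.0, 1.0, n_persons)             # response propensity

p_correct = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random((n_persons, n_items)) < p_correct).astype(float)

def simulate_missing(delta):
    # P(observed) = logit^-1(1 + xi + delta * X); delta = 0 is latent
    # ignorability, large delta means mostly incorrect responses go missing.
    eta = 1.0 + xi[:, None] + delta * X
    observed = rng.random((n_persons, n_items)) < 1 / (1 + np.exp(-eta))
    Xobs = X.copy()
    Xobs[~observed] = np.nan
    return Xobs

for delta in (0.0, 4.0):
    Xobs = simulate_missing(delta)
    as_wrong = np.nan_to_num(Xobs, nan=0.0).mean()   # score missing as 0
    ignored = np.nanmean(Xobs)                        # drop missing responses
    print(f"delta={delta}: true={X.mean():.3f}, "
          f"as-wrong={as_wrong:.3f}, ignored={ignored:.3f}")
```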


Author(s):  
Theresa Rohm ◽  
Claus H Carstensen ◽  
Luise Fischer ◽  
Timo Gnambs

In large-scale educational assessments, interviewers should ensure standardized settings for all participants. However, in practice, many interviewers do not strictly adhere to standardized field protocols. Therefore, systematic interviewer effects on the measurement of mathematical competence were examined in a representative sample of N = 5,139 German adults. To account for interviewers working in specific geographical regions, interviewer and area effects were disentangled using cross-classified multilevel item response models. These analyses showed that interviewer behavior distorted competence measurements, whereas regional effects were negligible. On a more general note, it is demonstrated how conspicuous interviewer behavior can be identified with Bayesian multilevel models.
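A sketch of the general structure of such a cross-classified model is given below; the notation is illustrative and not necessarily the authors' exact specification.

```latex
% Illustrative cross-classified multilevel Rasch model. Person p responds
% to item i, is interviewed by interviewer j(p), and lives in area k(p).
\operatorname{logit} P(X_{pi} = 1)
  = \theta_p + u_{j(p)} + v_{k(p)} - \beta_i,
\qquad
u_j \sim N(0, \sigma^2_{\text{interviewer}}), \quad
v_k \sim N(0, \sigma^2_{\text{area}})
% The two crossed variance components separate interviewer effects from
% regional effects; a near-zero \sigma^2_{\text{area}} corresponds to the
% finding that regional effects were negligible.
```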


Author(s):  
Pāvels Pestovs ◽  
Dace Namsone ◽  
Līga Čakāne ◽  
Ilze Saleniece

One of the goals of the National Development Plan 2014-2020 is to reduce the proportion of students with low cognitive skills and, at the same time, increase the proportion of students with higher-level cognitive skills. In line with those goals, the National Centre for Education is implementing the project “Competency-based approach to curriculum”, funded by the European Social Fund. The purpose of the research described in this article is to find out to what extent the current large-scale national assessments for 6th Grade are coherent with the new curriculum and what improvements are needed to align the national assessments with the national curriculum. The theoretical framework of the research was developed by analysing the frameworks of the Programme for International Student Assessment (PISA), the Trends in International Mathematics and Science Study (TIMSS), and the Progress in International Reading Literacy Study (PIRLS), as well as the framework of the revised national curriculum in Latvia. The national 6th Grade assessments of 2018 are analysed using classical test theory and the Rasch model, and the indicators of the test items are mapped onto the developed theoretical framework. The authors conclude that the national 6th Grade tests assess elements of literacy, numeracy, and scientific literacy. Students perform well on test items of low cognitive depth, but there are too few test items of high cognitive depth that allow pupils to demonstrate skills in new contexts, which is an essential goal of the new national curriculum. Further research is required on the use of data from large-scale assessments in supporting and guiding student instruction and learning.
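As a hedged illustration of the classical test theory side of such an item analysis, the following sketch computes two standard item indices, item difficulty and the corrected item-total correlation, on invented data.

```python
# Sketch of two classical test theory item indices (invented data, not the
# national assessment items): difficulty as proportion correct, and the
# corrected item-total correlation (item excluded from its own total).
import numpy as np

rng = np.random.default_rng(2)
X = (rng.random((200, 12)) < rng.uniform(0.3, 0.9, 12)).astype(float)

difficulty = X.mean(axis=0)                       # proportion correct per item
total = X.sum(axis=1)
item_total = np.array([
    np.corrcoef(X[:, i], total - X[:, i])[0, 1]   # corrected item-total r
    for i in range(X.shape[1])
])

print(np.round(difficulty, 2))
print(np.round(item_total, 2))
```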


2020 ◽  
Vol 29 (4) ◽  
pp. 996-1014
Author(s):  
R Gorter ◽  
J-P Fox ◽  
I Eekhout ◽  
MW Heymans ◽  
JWR Twisk

In medical research, repeated questionnaire data are often used to measure and model latent variables across time. Through a novel imputation method, a direct comparison is made between latent growth analysis under classical test theory and under item response theory, while also including the effects of missing item responses. For both classical test theory and item response theory, the effects of item missingness on latent growth parameter estimates are examined in a simulation study based on longitudinal item response data. Several missing data mechanisms and conditions are evaluated in the simulation study. The additional effects of missingness on differences between classical test theory- and item response theory-based latent growth analysis are directly assessed by rescaling the multiple imputations. The multiple imputation method is used to generate latent variable and item scores from the posterior predictive distributions to account for missing item responses in observed multilevel binary response data. It is shown that a multivariate probit model, as a novel imputation model, improves the latent growth analysis when dealing with data that are missing at random (MAR) under classical test theory. The study also shows that the parameter estimates for the latent growth model under item response theory exhibit less bias and smaller MSEs than the estimates under classical test theory.
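A hedged sketch of how a multivariate probit model can serve as the imputation model for binary item responses is given below; the notation is illustrative and may differ from the authors' specification.

```latex
% Illustrative multivariate probit imputation model for binary item
% responses. For person p, item i, and time point t, a latent continuous
% variable underlies each observed response:
Z_{pit} = \lambda_{it}\,\theta_{pt} + \varepsilon_{pit},
\qquad \varepsilon_{pit} \sim N(0, 1),
\qquad X_{pit} = \mathbf{1}\{ Z_{pit} > 0 \}
% Missing X_{pit} are filled in by drawing Z_{pit} (and hence X_{pit}) from
% the posterior predictive distribution; rescaling these multiple
% imputations allows the CTT- and IRT-based latent growth analyses to be
% compared directly.
```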


2018 ◽  
Vol 43 (7) ◽  
pp. 543-561 ◽  
Author(s):  
Yuan-Pei Chang ◽  
Chia-Yi Chiu ◽  
Rung-Ching Tsai

Cognitive diagnostic computerized adaptive testing (CD-CAT) has been suggested by researchers as a diagnostic tool for assessment and evaluation. Although model-based CD-CAT is relatively well researched in the context of large-scale assessment systems, this type of system has not received the same degree of research and development in small-scale settings, such as at the course-based level, where it would be most useful. The main obstacle is that the statistical estimation techniques successfully applied in the context of large-scale assessment require large samples to guarantee reliable calibration of the item parameters and an accurate estimation of the examinees’ proficiency class membership. Such samples are simply not obtainable in course-based settings. Therefore, this study proposes a nonparametric item selection (NPS) method that does not require any parameter calibration and thus can be used in small educational programs. The proposed nonparametric CD-CAT uses the nonparametric classification (NPC) method to estimate an examinee’s attribute profile; based on the examinee’s item responses, the item that best discriminates between the estimated attribute profile and the other attribute profiles is then selected. The simulation results show that the NPS method outperformed the parametric CD-CAT algorithms it was compared with, and the differences were substantial when the calibration samples were small.
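The following Python sketch illustrates the general idea behind NPC-based classification and item selection, using a hypothetical Q-matrix and DINA-style ideal response patterns; it is not the authors' exact NPS algorithm.

```python
# Hedged sketch of nonparametric classification (NPC) and a simple item
# selection rule for CD-CAT: classify by Hamming distance to DINA-style
# ideal response patterns, then pick the unanswered item that best
# separates the current profile estimate from its closest rivals.
import itertools
import numpy as np

Q = np.array([[1, 0, 0],   # hypothetical Q-matrix: items x attributes
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])

# All 2^K attribute profiles and their conjunctive (DINA-like) ideal
# responses: an item is solved iff all required attributes are mastered.
profiles = np.array(list(itertools.product([0, 1], repeat=Q.shape[1])))
ideal = (profiles @ Q.T == Q.sum(axis=1)).astype(int)   # profiles x items

answered = {0: 1, 3: 0, 4: 1}     # item index -> observed response (toy data)

# NPC step: classify by Hamming distance on the answered items.
items = list(answered)
obs = np.array([answered[i] for i in items])
dist = np.abs(ideal[:, items] - obs).sum(axis=1)
best = int(np.argmin(dist))

# Item selection step: among unanswered items, choose the one whose ideal
# response under the current estimate disagrees most often with the ideal
# responses of the closest rival profiles.
rivals = [p for p in np.argsort(dist) if p != best][:3]
candidates = [i for i in range(Q.shape[0]) if i not in answered]
next_item = max(candidates,
                key=lambda i: int(np.sum(ideal[rivals, i] != ideal[best, i])))

print("estimated profile:", profiles[best], "next item:", next_item)
```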

