Evaluating the Performances of Missing Data Handling Methods in Ability Estimation From Sparse Data

2020 ◽  
Vol 80 (5) ◽  
pp. 932-954 ◽  
Author(s):  
Jiaying Xiao ◽  
Okan Bulut

Large amounts of missing data could distort item parameter estimation and lead to biased ability estimates in educational assessments. Therefore, missing responses should be handled properly before estimating any parameters. In this study, two Monte Carlo simulation studies were conducted to compare the performance of four methods in handling missing data when estimating ability parameters. The methods were full-information maximum likelihood (FIML), zero replacement, and multiple imputation with chain equations utilizing classification and regression trees (MICE-CART) and random forest imputation (MICE-RFI). For the two imputation methods, missing responses were considered as a valid response category to enhance the accuracy of imputations. Bias, root mean square error, and the correlation between true ability parameters and estimated ability parameters were used to evaluate the accuracy of ability estimates for each method. Results indicated that FIML outperformed the other methods under most conditions. Zero replacement yielded accurate ability estimates when missing proportions were very high. The performances of MICE-CART and MICE-RFI were quite similar but these two methods appeared to be affected differently by the missing data mechanism. As the number of items increased and missing proportions decreased, all the methods performed better. In addition, the information on missing data could improve the performance of MICE-RFI and MICE-CART when the data set is sparse and the missing data mechanism is missing at random.
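The three accuracy criteria used in the study (bias, root mean square error, and the correlation between true and estimated abilities) are simple to compute. A minimal NumPy sketch, with a function name of my own choosing rather than anything from the study:

```python
import numpy as np

def ability_recovery_metrics(theta_true, theta_est):
    """Bias, RMSE, and Pearson correlation between true and
    estimated ability parameters."""
    theta_true = np.asarray(theta_true, dtype=float)
    theta_est = np.asarray(theta_est, dtype=float)
    bias = np.mean(theta_est - theta_true)
    rmse = np.sqrt(np.mean((theta_est - theta_true) ** 2))
    corr = np.corrcoef(theta_true, theta_est)[0, 1]
    return bias, rmse, corr
```

A constant upward shift of the estimates, for example, shows up entirely in the bias and RMSE while leaving the correlation at 1.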

2018 ◽  
Vol 79 (3) ◽  
pp. 495-511 ◽  
Author(s):  
Dee Duygu Cetin-Berber ◽  
Halil Ibrahim Sari ◽  
Anne Corinne Huggins-Manley

Routing examinees to modules based on their ability level is a very important aspect in computerized adaptive multistage testing. However, the presence of missing responses may complicate estimation of examinee ability, which may result in misrouting of individuals. Therefore, missing responses should be handled carefully. This study investigated multiple missing data methods in computerized adaptive multistage testing, including two imputation techniques, the use of full information maximum likelihood and the use of scoring missing data as incorrect. These methods were examined under the missing completely at random, missing at random, and missing not at random frameworks, as well as other testing conditions. Comparisons were made to baseline conditions where no missing data were present. The results showed that imputation and the full information maximum likelihood methods outperformed incorrect scoring methods in terms of average bias, average root mean square error, and correlation between estimated and true thetas.
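The full information maximum likelihood treatment compared above amounts to letting missing responses drop out of the likelihood when estimating ability, rather than scoring them as incorrect. A hedged sketch under the two-parameter logistic model, with item parameters a and b assumed known and function names that are illustrative, not from the study:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_lik(theta, resp, a, b):
    """2PL negative log-likelihood; NaN responses are simply skipped
    (FIML-style) instead of being scored as incorrect."""
    mask = ~np.isnan(resp)
    p = 1.0 / (1.0 + np.exp(-a[mask] * (theta - b[mask])))
    x = resp[mask]
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def estimate_theta(resp, a, b):
    """Maximum likelihood ability estimate over a bounded theta range."""
    res = minimize_scalar(neg_log_lik, bounds=(-4, 4),
                          args=(resp, a, b), method="bounded")
    return res.x
```

Scoring the same missing response as 0 instead of skipping it pulls the ability estimate downward, which is the misrouting risk the abstract describes.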


Psychometrika ◽  
1990 ◽  
Vol 55 (2) ◽  
pp. 371-390 ◽  
Author(s):  
Robert K. Tsutakawa ◽  
Jane C. Johnson

2021 ◽  
pp. 001316442110204
Author(s):  
Kang Xue ◽  
Anne Corinne Huggins-Manley ◽  
Walter Leite

In data collected from virtual learning environments (VLEs), item response theory (IRT) models can be used to guide the ongoing measurement of student ability. However, such applications of IRT rely on unbiased item parameter estimates associated with test items in the VLE. Without formal piloting of the items, one can expect a large amount of nonignorable missing data in the VLE log file data, which is expected to reduce the accuracy of IRT item parameter estimation and, in turn, of any future ability estimates used in the VLE. In the psychometric literature, methods for handling missing data have been studied mostly under conditions in which the data set and the amount of missing data are smaller than those arising from VLEs. In this article, we introduce a semisupervised learning method to deal with the large proportion of missingness in VLE data from which one needs to obtain unbiased item parameter estimates. First, we explored the factors related to the missing data. Then we implemented a semisupervised learning method under the two-parameter logistic IRT model to estimate the latent abilities of students. Last, we applied two adjustment methods designed to reduce bias in item parameter estimates. The proposed framework showed its potential for obtaining unbiased item parameter estimates that can then be fixed in the VLE in order to obtain ongoing ability estimates for operational purposes.


Psych ◽  
2021 ◽  
Vol 3 (4) ◽  
pp. 673-693
Author(s):  
Shenghai Dai

The presence of missing responses in assessment settings is inevitable and may yield biased parameter estimates in psychometric modeling if ignored or handled improperly. Many methods have been proposed to handle missing responses in assessment data, which are often dichotomous or polytomous. Their applications remain limited, however, partly because (1) the literature offers no sufficient support for an optimal method; (2) many practitioners and researchers are not familiar with these methods; and (3) these methods are usually not implemented in psychometric software, so missing responses need to be handled separately. This article introduces and reviews the missing response handling methods commonly used in psychometrics, along with the literature that examines and compares their performance. Further, the use of the TestDataImputation package in R is introduced and illustrated with an example data set and a simulation study. The corresponding R code is provided.


2018 ◽  
Vol 19 (2) ◽  
pp. 174-193 ◽  
Author(s):  
José LP da Silva ◽  
Enrico A Colosimo ◽  
Fábio N Demarqui

Generalized estimating equations (GEEs) are a well-known method for the analysis of categorical longitudinal data. This method presents computational simplicity and provides consistent parameter estimates that have a population-averaged interpretation. However, with missing data, the resulting parameter estimates are consistent only under the strong assumption of missing completely at random (MCAR). Some corrections can be done when the missing data mechanism is missing at random (MAR): inverse probability weighting GEE (WGEE) and multiple imputation GEE (MIGEE). A recent method combining ideas of these two approaches has a doubly robust property in the sense that one only needs to correctly specify the weight or the imputation model in order to obtain consistent estimates for the parameters. In this work, a proportional odds model is assumed and a doubly robust estimator is proposed for the analysis of ordinal longitudinal data with intermittently missing responses and covariates under the MAR mechanism. In addition, the association structure is modelled by means of either the correlation coefficient or local odds ratio. The performance of the proposed method is compared to both WGEE and MIGEE through a simulation study. The method is applied to a dataset related to rheumatic mitral stenosis.
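The weighting idea behind WGEE can be shown in miniature: each observed case is up-weighted by the inverse of its estimated probability of being observed, so that complete cases stand in for similar cases that went missing under MAR. A minimal sketch of an inverse-probability-weighted mean, not the authors' doubly robust estimator:

```python
import numpy as np

def ipw_mean(y, observed, p_obs):
    """Horvitz-Thompson style weighted mean: observed values are
    up-weighted by 1 / P(observed) to correct for MAR missingness."""
    y = np.asarray(y, dtype=float)
    r = np.asarray(observed, dtype=bool)
    p = np.asarray(p_obs, dtype=float)
    w = np.where(r, 1.0 / p, 0.0)          # unobserved cases get zero weight
    return np.sum(w * np.where(r, y, 0.0)) / np.sum(w)
```

The doubly robust construction in the article pairs such weights with an imputation model, so that consistency survives misspecification of either one.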


Author(s):  
Seçil Ömür Sünbül

<p>This study aimed to investigate the impact of different missing data handling methods on DINA model parameter estimation and classification accuracy. Simulated data were used, generated by manipulating the number of items and the sample size. In the generated data, two missing data mechanisms (missing completely at random and missing at random) were created at three different amounts of missing data. The missing data were then completed using four methods: treating missing data as incorrect, person mean imputation, two-way imputation, and expectation-maximization algorithm imputation. As a result, both the s (slip) and g (guess) parameter estimates and the classification accuracies were affected by the missing data rates, the missing data handling methods, and the missing data mechanisms.</p>
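Two-way imputation, one of the methods compared above, fills each missing entry from the person mean, item mean, and grand mean of the observed responses. A minimal sketch; clipping to [0, 1] for scored items is my assumption, as some implementations round to 0/1 instead:

```python
import numpy as np

def two_way_imputation(X):
    """Fill each missing entry with person mean + item mean - grand mean,
    computed over observed responses only; clip to [0, 1] for scored items."""
    X = np.asarray(X, dtype=float)
    person_mean = np.nanmean(X, axis=1, keepdims=True)
    item_mean = np.nanmean(X, axis=0, keepdims=True)
    grand_mean = np.nanmean(X)
    filled = np.clip(person_mean + item_mean - grand_mean, 0.0, 1.0)
    return np.where(np.isnan(X), filled, X)
```

Observed entries pass through unchanged; only the NaN cells receive the two-way estimate.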


2020 ◽  
Vol 9 (2) ◽  
pp. 755-763
Author(s):  
Shamihah Muhammad Ghazali ◽  
Norshahida Shaadan ◽  
Zainura Idrus

Missing values occur in many data sets across research areas. This is recognized as a data quality problem because missing values can affect the performance of analysis results. To overcome the problem, the incomplete data set needs to be treated by replacing missing values with an imputation method, so the pattern of missing values must be explored beforehand to determine a suitable method. This paper discusses the application of data visualisation as a smart technique for missing data exploration, aiming to increase understanding of missing data behaviour, including the missing data mechanism (MCAR, MAR, and MNAR), the distribution pattern of missingness in terms of percentage, and the gap size. It presents several data visualisation tools from five R packages: visdat, VIM, ggplot2, Amelia, and UpSetR. For illustration, several graphics based on an air quality data set in Malaysia were produced and discussed to show the contribution of the visualisation tools in providing input and insight into the pattern of data missingness. The results show that missing values in the air quality data set for the chosen sites in Malaysia behave as missing at random (MAR), with a small percentage of missingness but long gap sizes.
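The percentage and gap-size summaries described above can be computed directly before any plotting. A minimal sketch for one time-ordered series, with a function name of my own choosing:

```python
import numpy as np

def missingness_profile(x):
    """Percent missing and the run lengths (gap sizes) of consecutive
    NaNs in a time-ordered series."""
    x = np.asarray(x, dtype=float)
    miss = np.isnan(x)
    pct = 100.0 * miss.mean()
    gaps, run = [], 0
    for m in miss:
        if m:
            run += 1
        elif run:
            gaps.append(run)
            run = 0
    if run:                      # series may end inside a gap
        gaps.append(run)
    return pct, gaps
```

Long runs in the gap list are exactly the "long gap size of missingness" pattern the paper reports for the Malaysian air quality series.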


2022 ◽  
Vol 2022 ◽  
pp. 1-8
Author(s):  
Xiaoying Lv ◽  
Ruonan Zhao ◽  
Tongsheng Su ◽  
Liyun He ◽  
Rui Song ◽  
...  

Objective. To explore the optimal fitting path for missing scale data so that the fitted data approximate the real situation of the patients' data. Methods. Based on the complete data set of the SDS scores of 507 patients with stroke, simulated data sets were constructed in R under the Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) mechanisms, with missing rates of 5%, 10%, 15%, 20%, 25%, 30%, 35%, and 40% under each of the three mechanisms. Mean substitution (MS), random forest regression (RFR), and predictive mean matching (PMM) were used to fit the data. Root mean square error (RMSE), the width of the 95% confidence interval (95% CI), and the Spearman correlation coefficient (SCC) were used to evaluate the fitting effect and determine the optimal fitting path. Results. When dealing with missing scale data, the optimal fitting path is as follows: (1) under the MCAR mechanism, the MS method is the most convenient when the missing proportion is less than 20%; when it is greater than 20%, RFR is the best fitting method. (2) Under the MAR mechanism, the MS method is the most convenient when the missing proportion is less than 35%; when it is greater than 35%, RFR yields a better correlation. (3) Under the MNAR mechanism, RFR is the best fitting method, especially when the missing proportion is greater than 30%. In practice, complete case deletion is the most commonly used approach when the missing proportion is small, but the RFR algorithm can greatly expand the applicable sample and save the cost of clinical research when the missing proportion is less than 30%. The best way to handle missing data should be chosen according to the missing data mechanism and proportion in the actual data, balancing the statistical analysis capacity of the research team, the effectiveness of the method, and the understanding of readers.
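Simulation sets like those described above are built by amputing a complete data set at a target rate. A hedged sketch of MCAR amputation and one simple MAR scheme in which missingness probability depends on a fully observed covariate; this is an illustration, not the authors' R code:

```python
import numpy as np

def ampute_mcar(X, rate, rng=None):
    """Set a given proportion of entries to NaN completely at random."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

def ampute_mar(X, rate, driver, rng=None):
    """MAR: a row's chance of losing entries depends on an observed
    covariate `driver`, not on the (possibly missing) values themselves."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float).copy()
    d = np.asarray(driver, dtype=float)
    p = rate * d / d.mean()          # per-row probabilities averaging `rate`
    X[rng.random(X.shape) < p[:, None]] = np.nan
    return X
```

Under the MAR scheme, rows with larger driver values accumulate more missingness while the overall rate stays near the target, which is the contrast between the MCAR and MAR simulation sets in the study.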


2016 ◽  
Author(s):  
Lucas Jardim ◽  
Luis Mauricio Bini ◽  
José Alexandre Felizola Diniz-Filho ◽  
Fabricio Villalobos

Summary

Given the prevalence of missing data on species' traits (the Raunkiaeran shortfall) and its importance for theoretical and empirical investigations, several methods have been proposed to fill sparse databases. Despite its advantages, imputation of missing data can introduce biases. Here, we evaluate the bias in descriptive statistics, model parameters, and phylogenetic signal estimation from imputed databases under different missing and imputing scenarios.

We simulated coalescent phylogenies and traits under Brownian motion and different Ornstein-Uhlenbeck evolutionary models. Missing values were created under three scenarios: missing completely at random, missing at random but phylogenetically structured, and missing at random but correlated with some other variable. We considered four methods for handling missing data: deletion of missing values, imputation based on the observed mean trait value, Phylogenetic Eigenvector Maps, and Multiple Imputation by Chained Equations. Finally, we assessed estimation errors in descriptive statistics (mean, variance), regression coefficients, Moran's correlogram, and Blomberg's K for imputed traits.

We found that the percentage of missing data, the missing data mechanism, the Ornstein-Uhlenbeck strength, and the handling method all shaped estimation errors. When data were missing completely at random, descriptive statistics were well estimated, but Moran's correlogram and Blomberg's K were not, depending on the handling method. Handling methods performed worse when data were missing at random but phylogenetically structured; in this case, adding phylogenetic information provided better estimates. Although the error caused by imputation was correlated with estimation errors, the relationship is not linear, with estimation errors getting larger as the imputation error increases.

Imputed trait databases could bias ecological and evolutionary analyses. We advise researchers to share their raw data along with their imputed database, flagging imputed data and providing information on the imputation process. Users can then consider the pattern of missing data and look for the best method to overcome this problem. In addition, we suggest the development of phylogenetic methods that consider imputation uncertainty and phylogenetic autocorrelation and preserve the level of phylogenetic signal of the original data.


2020 ◽  
Vol 35 (4) ◽  
pp. 589-614
Author(s):  
Melanie-Angela Neuilly ◽  
Ming-Li Hsieh ◽  
Alex Kigerl ◽  
Zachary K. Hamilton

Research on homicide missing data conventionally posits a Missing At Random pattern despite the relationship between missing data and clearance. The latter, however, cannot be satisfactorily modeled using variables traditionally available in homicide datasets. For this reason, it has been argued that missingness in homicide data instead follows a Nonignorable pattern. Hence, the use of multiple imputation strategies, as recommended in the field for ignorable patterns, would pose a threat to the validity of results obtained in that way. This study examines missing data mechanisms using a set of primary data collected in New Jersey. After comparing Listwise Deletion, Multiple Imputation, Propensity Score Matching, and Log-Multiplicative Association Models, our findings underscore that data in homicide datasets are indeed Missing Not At Random.

