Comparison of Single and MICE Imputation Methods for Missing Values: A Simulation Study

2021 ◽  
Vol 29 (2) ◽  
Author(s):  
Nurul Azifah Mohd Pauzi ◽  
Yap Bee Wah ◽  
Sayang Mohd Deni ◽  
Siti Khatijah Nor Abdul Rahim ◽  
Suhartono

High-quality data are essential in every field of research for valid findings. Missing data are common in datasets and arise for a variety of reasons, such as incomplete responses, equipment malfunction, and data entry errors. Single and multiple imputation methods have been developed to impute missing values. This study investigated the performance of single imputation using the mean and of multiple imputation using Multivariate Imputation by Chained Equations (MICE) via a simulation study. Values missing completely at random (MCAR) were generated for ten missing rates (proportions of missing data), from 5% to 50%, for different sample sizes. Mean Square Error (MSE) was used to evaluate the performance of the imputation methods. The choice of imputation method depends on the data type: mean imputation is commonly used for continuous variables, while MICE can handle both continuous and categorical variables. The simulation results indicate that group mean imputation (GMI) performed better than overall mean imputation (OMI) and MICE, with the lowest MSE for all sample sizes and missing rates. The MSE of OMI, GMI, and MICE increases as the missing rate increases. MICE performed worst (highest MSE) when the missing rate exceeded 15%. Overall, GMI was superior to OMI and MICE for all missing rates and sample sizes under the MCAR mechanism. An application to a real dataset confirmed the simulation findings. This study can guide researchers and practitioners in choosing a suitable imputation method for data with missing values.
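The OMI/GMI/MICE comparison above can be sketched numerically. The data-generating process, the two-group structure, and the 20% MCAR rate below are illustrative assumptions, not the study's actual simulation design:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Two groups with different means; 20% of values removed completely at random.
n = 500
group = rng.integers(0, 2, n)
y_true = np.where(group == 0, 10.0, 20.0) + rng.normal(0.0, 1.0, n)
miss = rng.random(n) < 0.20                        # MCAR indicator

df = pd.DataFrame({"group": group,
                   "y": np.where(miss, np.nan, y_true)})

# OMI: replace every missing value with the overall mean.
omi = df["y"].fillna(df["y"].mean())

# GMI: replace each missing value with the mean of its own group.
gmi = df["y"].fillna(df.groupby("group")["y"].transform("mean"))

# MSE evaluated on the masked cells only, against the known truth.
mse_omi = float(np.mean((omi[miss] - y_true[miss]) ** 2))
mse_gmi = float(np.mean((gmi[miss] - y_true[miss]) ** 2))
print(f"MSE OMI = {mse_omi:.2f}, MSE GMI = {mse_gmi:.2f}")
```

When the groups differ in level, GMI's MSE stays near the within-group noise variance while OMI's also absorbs the between-group spread, which mirrors the study's finding that GMI has the lowest MSE.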

Author(s):  
Wisam A. Mahmood ◽  
Mohammed S. Rashid ◽  
Teaba Wala Aldeen

Missing values commonly occur in medical research and can introduce substantial bias if neglected or handled poorly. Several standard statistical methods have been developed for dealing with them, yet no single method yields uniformly credible estimates. When a dataset contains missing values, the effective sample size shrinks and efficiency decreases. Earlier scholarly work has proposed a number of imputation methods to address these challenges. Common methods include the complete case method, mean imputation, Last Observation Carried Forward (LOCF), the Expectation-Maximization (EM) algorithm, Markov Chain Monte Carlo (MCMC), hot deck (HOT), regression imputation, k-nearest neighbor (KNN), k-means clustering, fuzzy k-means clustering, support vector machines, and multiple imputation (MI). In the present paper, a simulation study investigates the efficacy of these archetypal imputation methods in a longitudinal data setting under missingness completely at random (MCAR). Missingness was introduced in three blocks, at a low rate of 5% as well as at higher rates of 30% and 50%. The simulation comparison showed that the LOCF method was more biased than the other methods in most situations.
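The bias of LOCF under a trend, which drives the study's conclusion, can be seen in a toy example; the series and the missing positions below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy longitudinal series for one subject with MCAR gaps. LOCF carries the
# last observed value forward, which biases estimates when the series trends.
t = np.arange(10)
y = 2.0 * t                       # a steadily rising outcome
y_obs = y.astype(float)
y_obs[[3, 4, 7]] = np.nan         # 30% of the visits are missing

locf = pd.Series(y_obs).ffill()   # Last Observation Carried Forward
bias = (locf - y).mean()          # negative: LOCF underestimates the rise
print(locf.tolist(), bias)
```

Each carried-forward value lags the true rising trajectory, so the mean bias is negative here; under MCAR with trending outcomes this systematic lag is exactly why LOCF fares worse than the alternatives.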


2021 ◽  
Vol 7 ◽  
pp. e619
Author(s):  
Khaled M. Fouad ◽  
Mahmoud M. Ismail ◽  
Ahmad Taher Azar ◽  
Mona M. Arafa

Real-world data analysis and processing with data mining techniques often face observations that contain missing values; indeed, the existence of missing values is a main challenge in mining datasets. Missing values in a dataset should be imputed to improve the accuracy and performance of data mining methods. Existing techniques that use the k-nearest neighbors algorithm for imputation leave determining an appropriate k value as a challenging task. Other existing imputation techniques are based on hard clustering algorithms, but when records are not well separated, as in the case of missing data, hard clustering often provides a poor description. In general, imputation based on similar records is more accurate than imputation based on all of the dataset's records, so improving the similarity among records can improve imputation performance. This paper proposes two numerical missing data imputation methods. A hybrid missing data imputation method, called KI, is proposed first; it incorporates the k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each missing record is discovered through record similarity using the k-nearest neighbors algorithm (kNN), and, to improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method is then used to impute the missing values of the incomplete records by using the global correlation structure among the selected records. An enhanced hybrid missing data imputation method, called FCKI, is then proposed as an extension of KI. It integrates fuzzy c-means, k-nearest neighbors, and iterative imputation algorithms to impute the missing data in a dataset. The fuzzy c-means algorithm is selected because records can belong to multiple clusters at the same time, which can further improve similarity.
FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors, applying two levels of similarity to achieve higher imputation accuracy. The performance of the proposed imputation techniques is assessed using fifteen datasets of different sizes with varying missing ratios for three types of missing data (MCAR, MAR, and MNAR), all generated in this work. The proposed imputation techniques are compared with other missing data imputation methods by means of three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.
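The kNN step of the KI/FCKI pipeline can be sketched as follows. This is a minimal illustration only: it uses a fixed k on synthetic data and skips the automatic k estimation, the fuzzy c-means clustering, and the iterative refinement described above:

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill each missing cell from the k most similar complete records,
    measuring similarity by Euclidean distance over the observed columns."""
    X = X.astype(float)
    out = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]         # fully observed records
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # NaN coordinates drop out of the distance via nansum.
        d = np.sqrt(np.nansum((complete - row) ** 2, axis=1))
        nbrs = complete[np.argsort(d)[:k]]
        out[i, miss] = nbrs[:, miss].mean(axis=0)  # average the neighbors
    return out

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 4))
X_full[:, 3] = X_full[:, 0] + 0.1 * rng.normal(size=200)  # correlated column
X = X_full.copy()
X[rng.random(X.shape) < 0.05] = np.nan                    # ~5% MCAR

X_hat = knn_impute(X, k=5)
print("any NaN left:", np.isnan(X_hat).any())
```

KI's refinement would then re-estimate the filled cells from the correlation structure among the selected neighbors, and FCKI would first narrow the neighbor search to the record's fuzzy cluster.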


Missing data imputation is an essential task because removing all records with missing values discards useful information from the other attributes. This paper estimates prediction performance for an autism dataset with imputed missing values. Statistical imputation methods, such as mean imputation and imputation with zero or a constant, and machine learning imputation methods, such as k-nearest neighbour and chained-equation imputation, were compared with the proposed deep learning imputation method. Predictions of patients with autistic spectrum disorder were made using a support vector machine on the imputed dataset. Among the imputation methods, the deep learning algorithm outperformed the statistical and machine learning imputation methods. This was validated by the significant differences in p-values revealed by Friedman's test.


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper applies some advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. MCAR, MAR, and MNAR missingness mechanisms are applied to the data set, with five missing data levels: 5%, 10%, 20%, 30%, and 40%. The imputation method used is missForest, an iterative imputation method related to the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR setting had the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy: missForest had the lowest imputation error (RMSE and MAE) among the compared methods and can thus be considered appropriate for analyzing air quality data.
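The missForest idea, iterating per-column predictions until the fill stabilises, can be sketched as below. Note the stand-in: missForest fits a random forest per column, while this sketch uses an ordinary least-squares fit to stay short, and the data are synthetic:

```python
import numpy as np

def iterative_impute(X, n_iter=10):
    """missForest-style loop: initialise missing cells with column means,
    then repeatedly regress each incomplete column on the other columns
    and re-predict its missing cells. missForest uses a random forest as
    the per-column learner; least squares stands in here for brevity."""
    X = X.astype(float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)   # mean-initialise
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            A = np.c_[np.ones(len(X)), np.delete(filled, j, axis=1)]
            beta, *_ = np.linalg.lstsq(A[obs], filled[obs, j], rcond=None)
            filled[miss[:, j], j] = A[miss[:, j]] @ beta
    return filled

rng = np.random.default_rng(3)
truth = rng.normal(size=(300, 3))
truth[:, 2] = 0.8 * truth[:, 0] - 0.5 * truth[:, 1] + 0.1 * rng.normal(size=300)
X = truth.copy()
X[rng.random(X.shape) < 0.2] = np.nan                   # 20% MCAR

filled = iterative_impute(X)
mask2 = np.isnan(X) & (np.arange(3) == 2)               # missing cells, column 2
rmse = float(np.sqrt(np.mean((filled - truth)[mask2] ** 2)))
print(f"RMSE for column 2: {rmse:.3f}")
```

Because the imputed column is predicted from correlated covariates, the error falls well below plain mean imputation, which is the same mechanism that lets missForest exploit climatological control variables in the paper.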


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5947
Author(s):  
Liang Zhang

Building operation data are important for the monitoring, analysis, modeling, and control of building energy systems. However, missing data is one of the major data quality issues, making data imputation techniques increasingly important. There are two key research gaps for missing sensor data imputation in buildings: the lack of a customized and automated imputation methodology, and the difficulty of validating data imputation methods. In this paper, a framework is developed to address these two gaps. First, a validation data generation module is developed based on pattern recognition to create a validation dataset that quantifies the performance of data imputation methods. Second, a pool of data imputation methods is tested against the validation dataset to find an optimal single imputation method for each sensor; together these per-sensor choices are termed an ensemble method. The method can reflect the specific mechanism and randomness of missing data from each sensor. The effectiveness of the framework is demonstrated on 18 sensors from a real campus building. The overall accuracy of data imputation for those sensors improves by 18.2% on average compared with the best single data imputation method.
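The per-sensor selection step can be sketched as below. The sensor names, the two-method candidate pool, and the simple 20% validation mask are hypothetical; the paper's framework derives the validation mask from each sensor's recognised missing-data patterns and tests a larger pool of methods:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_fill(x):
    """Replace NaNs with the sensor's overall mean."""
    return np.where(np.isnan(x), np.nanmean(x), x)

def interp_fill(x):
    """Replace NaNs by linear interpolation between observed readings."""
    idx = np.arange(len(x))
    ok = ~np.isnan(x)
    return np.where(ok, x, np.interp(idx, idx[ok], x[ok]))

# Two hypothetical sensors: a smooth temperature-like signal (interpolation
# should win) and a noisy CO2-like signal fluctuating around a constant.
sensors = {
    "supply_air_temp": np.sin(np.linspace(0, 6, 200)),
    "co2_noisy": 400 + rng.normal(0, 20, 200),
}

best_method = {}
for name, truth in sensors.items():
    x = truth.copy()
    val_mask = rng.random(len(x)) < 0.2      # hide known values for validation
    x[val_mask] = np.nan
    scores = {
        "mean": np.mean((mean_fill(x)[val_mask] - truth[val_mask]) ** 2),
        "interp": np.mean((interp_fill(x)[val_mask] - truth[val_mask]) ** 2),
    }
    best_method[name] = min(scores, key=scores.get)
print(best_method)
```

Hiding observed values and scoring each candidate on them is what lets the framework pick a different winner per sensor, which is the source of the reported accuracy gain over any single method.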


2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not tied to any particular scientific domain; it arises in economics, sociology, education, the behavioural sciences, and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete it. However, this leads to ineffective and biased results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In this paper, selected multiple imputation methods are considered, with special attention paid to using principal component analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA-based imputation compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Missing values were then imputed using MICE, missForest, and the PCA-based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as the multiple imputation method providing the lowest imputation errors for all types of missingness, while PCA-based imputation does not perform well in terms of accuracy.
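The NRMSE accuracy measure can be computed as below, using one common definition (RMSE on the artificially removed cells divided by the standard deviation of the true values, so 0 is perfect and values near 1 are no better than predicting the mean); the paper's exact normalisation may differ:

```python
import numpy as np

def nrmse(imputed, truth, mask):
    """Normalised RMSE on the artificially removed cells: RMSE divided by
    the standard deviation of the true values at those cells."""
    err = imputed[mask] - truth[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(truth[mask])

# Tiny worked example: mean imputation on three hidden cells of 0..9.
truth = np.arange(10.0)
mask = np.zeros(10, dtype=bool)
mask[[2, 5, 8]] = True
mean_imp = np.where(mask, truth[~mask].mean(), truth)
print(round(float(nrmse(mean_imp, truth, mask)), 3))
```

Because it is scale-free, NRMSE lets imputation errors be averaged and compared across the 10 UCI data sets even though their variables have different units and ranges.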


2018 ◽  
Vol 39 (1) ◽  
pp. 88-117 ◽  
Author(s):  
Julianne M. Edwards ◽  
W. Holmes Finch

Missing data is a common problem faced by psychometricians and measurement professionals. A number of techniques have been proposed to handle missing data in Item Response Theory. These include several data imputation methods (corrected item mean substitution imputation, response function imputation, multiple imputation, and the EM algorithm) as well as approaches that do not rely on imputing missing values (treating the item as not presented, or coding missing responses as incorrect or as fractionally correct). Although multiple imputation has demonstrated the best performance among these methods in prior research, its model parameter estimation error (MAE) remained relatively high. Given this elevated MAE even for the best-performing missing data methods, this simulation study explored the performance of a set of potentially promising data imputation methods based on recursive partitioning. The results demonstrate that approaches combining multivariate imputation by chained equations with recursive partitioning algorithms yield data with relatively low estimation MAE for both item difficulty and item discrimination. Implications of these findings are discussed.


2013 ◽  
Vol 11 (7) ◽  
pp. 2779-2786
Author(s):  
Rahul Singhai

One relevant problem in data preprocessing is the presence of missing data, which leads to poor quality of the patterns extracted after mining. Imputation is one of the widely used procedures that replace the missing values in a data set with probable values. The advantage of this approach is that the missing data treatment is independent of the learning algorithm used, allowing the user to select the most suitable imputation method for each situation. This paper analyzes various imputation methods proposed in the field of statistics with respect to data mining. A comparative analysis of three different imputation approaches for imputing missing attribute values in data mining is given, identifying the most promising method. An artificial input data file (of numeric type) with 1000 records is used to investigate the performance of these methods, and a Z-test approach is used to test their significance.


Author(s):  
Mehmet S. Aktaş ◽  
Sinan Kaplan ◽  
Hasan Abacı ◽  
Oya Kalipsiz ◽  
Utku Ketenci ◽  
...  

Missing data is a common problem for data clustering quality. Most real-life datasets have missing data, which in turn affects clustering tasks. This chapter investigates appropriate data treatment methods for varying missing-data scarcity distributions, including gamma, Gaussian, and beta distributions. The analyzed data imputation methods include mean, hot-deck, regression, k-nearest neighbor, expectation maximization, and multiple imputation. To reveal the proper methods for dealing with missing data, data mining tasks such as clustering are used for evaluation. Through experimental studies, this chapter identifies the correlation between missing data imputation methods and missing data distributions for clustering tasks. The results of the experiments indicate that the expectation maximization and k-nearest neighbor methods provide the best results across the varying missing-data scarcity distributions.


2018 ◽  
Vol 2018 ◽  
pp. 1-11 ◽  
Author(s):  
Xiaobo Chen ◽  
Yan Xiao

Missing data is a frequently encountered problem in the environmental research community. To facilitate the analysis and management of air quality data, for example the PM2.5 concentration in this study, a commonly adopted strategy for handling missing values is to generate a complete data set using imputation methods. Many imputation methods based on temporal or spatial correlation have been developed for this purpose in the existing literature. The various methods differ in how they characterize the dependence relationship among data samples with different mathematical models, which is crucial for missing data imputation. In this paper, we propose two novel and principled imputation methods based on the nuclear norm of a matrix, since it measures such dependence in a global fashion. The first method, termed global nuclear norm minimization (GNNM), imputes missing values by directly minimizing the nuclear norm of the whole sample matrix, thereby maximizing the linear dependence of the samples. The second method, called local nuclear norm minimization (LNNM), concentrates on each sample and its most similar samples, which are estimated from the imputation results of the first method. In this way, the nuclear norm minimization can be performed on highly correlated samples instead of the whole sample matrix as in GNNM, reducing the adverse impact of irrelevant samples. The two methods are evaluated on a data set of PM2.5 concentrations measured every hour by 22 monitoring stations, with missing values simulated at different percentages. The imputed values are compared with the ground truth values to evaluate the imputation performance of the different methods. The experimental results verify the effectiveness of our methods, especially LNNM, for missing air quality data imputation.
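The global variant can be sketched with a soft-impute-style singular value thresholding loop, a standard way to approach nuclear norm minimization; the shrinkage parameter, iteration count, and rank-1 test matrix below are illustrative choices rather than the paper's settings, and LNNM would additionally restrict the minimization to each sample's most similar records:

```python
import numpy as np

def svt_impute(X, tau=0.5, n_iter=200):
    """Soft-impute sketch of nuclear norm minimization: alternately
    (1) soft-threshold the singular values of the current estimate,
    which shrinks the nuclear norm, and (2) restore observed entries."""
    miss = np.isnan(X)
    Y = np.where(miss, 0.0, X)                    # zeros in the gaps to start
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        L = (U * np.maximum(s - tau, 0.0)) @ Vt   # shrink singular values
        Y = np.where(miss, L, X)                  # keep observed cells fixed
    return Y

# Rank-1 ground truth with ~20% of entries removed at random.
rng = np.random.default_rng(7)
truth = np.outer(rng.normal(size=30), rng.normal(size=20))
X = truth.copy()
X[rng.random(truth.shape) < 0.2] = np.nan

rec = svt_impute(X)
rmse = float(np.sqrt(np.mean((rec - truth)[np.isnan(X)] ** 2)))
print(f"RMSE on the missing entries: {rmse:.3f}")
```

Because the ground truth is low-rank, shrinking the singular values forces the completed matrix toward the linearly dependent structure the nuclear norm rewards, and the hidden entries are recovered almost exactly.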

