Missing Data Handling by Mean Imputation Method and Statistical Analysis of Classification Algorithm

Author(s):  
K. Maheswari ◽  
P. Packia Amutha Priya ◽  
S. Ramkumar ◽  
M. Arun
Author(s):  
Seçil Ömür Sünbül

<p>This study investigated the impact of different missing data handling methods on DINA model parameter estimation and classification accuracy. Simulated data were generated by manipulating the number of items and the sample size. Two missing data mechanisms (missing completely at random and missing at random) were then imposed at three different amounts of missing data. The missing values were handled by treating missing data as incorrect, person mean imputation, two-way imputation, and expectation-maximization algorithm imputation. Both the s and g parameter estimates and the classification accuracies were affected by the missing data rate, the missing data handling method, and the missing data mechanism.</p>
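As a rough illustration of three of the handling methods named in the abstract above, the sketch below fills a small dichotomous (0/1) response matrix. It is not the study's code. It assumes missing responses are stored as `None`, that every person and every item has at least one observed response, and that fractional imputed values are rounded to 0/1 at a 0.5 threshold (the rounding rule is an assumption). The two-way value used here is person mean + item mean - overall mean.

```python
from statistics import mean


def _observed(values):
    """Keep only the observed (non-None) responses."""
    return [v for v in values if v is not None]


def missing_as_incorrect(X):
    """Score every missing response as incorrect (0)."""
    return [[0 if v is None else v for v in row] for row in X]


def person_mean_impute(X):
    """Fill a missing response with the person's observed mean, rounded to 0/1."""
    out = []
    for row in X:
        pm = mean(_observed(row))
        out.append([int(pm >= 0.5) if v is None else v for v in row])
    return out


def two_way_impute(X):
    """Fill with person mean + item mean - overall mean, rounded to 0/1."""
    pm = [mean(_observed(row)) for row in X]          # one mean per person (row)
    im = [mean(_observed(col)) for col in zip(*X)]    # one mean per item (column)
    om = mean(_observed([v for row in X for v in row]))
    return [
        [int(pm[i] + im[j] - om >= 0.5) if v is None else v
         for j, v in enumerate(row)]
        for i, row in enumerate(X)
    ]


# Hypothetical 3 persons x 3 items response matrix
X = [[1, 1, None],
     [0, None, 0],
     [1, 0, 1]]
print(person_mean_impute(X))
print(two_way_impute(X))
```

Observed responses are left unchanged by all three functions; only the `None` cells are filled.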


2020 ◽  
Vol 79 (Suppl 1) ◽  
pp. 519.1-519
Author(s):  
A. Alsaber ◽  
A. Al-Herz ◽  
J. Pan ◽  
K. Saleh ◽  
A. Al-Awadhi ◽  
...  

Background: Missing data in clinical epidemiological research violate the intention-to-treat principle, reduce statistical power, and can induce bias if they are related to the patient’s response to treatment. In multiple imputation (MI), covariates are included in the imputation equation to predict the values of missing data. Objectives: To find the best approach to estimate and impute the missing values in Kuwait Registry for Rheumatic Diseases (KRRD) patient data. Methods: Several methods were implemented for dealing with missing data: Multivariate Imputation by Chained Equations (MICE), K-Nearest Neighbors (KNN), Bayesian Principal Component Analysis (BPCA), EM with Bootstrapping (Amelia II), Sequential Random Forest (MissForest), and mean imputation. The best imputation method was judged by the minimum Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Kolmogorov–Smirnov D test statistic (KS) between the imputed data points and the original data points that were subsequently set to missing. Results: A total of 1,685 rheumatoid arthritis (RA) patients and 10,613 hospital visits were included in the registry. Several variables had missing values exceeding 5% of the total values: duration of RA (13.0%), smoking history (26.3%), rheumatoid factor (7.93%), anti-citrullinated peptide antibodies (20.5%), anti-nuclear antibodies (20.4%), sicca symptoms (19.2%), family history of a rheumatic disease (28.5%), steroid therapy (5.94%), ESR (5.16%), CRP (22.9%), and SDAI (38.0%). Among the methods used, MissForest gave the highest level of accuracy in estimating the missing values. It had the least imputation errors for both continuous and categorical variables at each frequency of missingness, and it had the smallest prediction differences when the models used imputed laboratory values. In both data sets, MICE had the second least imputation errors and prediction differences, followed by KNN and mean imputation. Conclusion: MissForest is a highly accurate method of imputation for missing data in KRRD and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in clinical predictive models. This approach can be used in registries to improve the accuracy of data, including the ones for rheumatoid arthritis patients. Disclosure of Interests: None declared
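The three evaluation scores named in the abstract above (RMSE, MAE, and the Kolmogorov–Smirnov D statistic) can be sketched with the Python standard library as follows. This is an illustrative sketch, not the registry study's code. The function names and the toy numbers are hypothetical, and the KS statistic is computed here as the two-sample maximum distance between empirical CDFs.

```python
import math
from bisect import bisect_right


def rmse(original, imputed):
    """Root mean square error between original and imputed values."""
    return math.sqrt(sum((o - i) ** 2 for o, i in zip(original, imputed)) / len(original))


def mae(original, imputed):
    """Mean absolute error between original and imputed values."""
    return sum(abs(o - i) for o, i in zip(original, imputed)) / len(original)


def ks_statistic(a, b):
    """Two-sample KS D: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical values that were known, then deliberately set to missing
original = [1.0, 2.0, 3.0, 4.0]
# What plain mean imputation would put back if the observed mean were 2.5
imputed = [2.5, 2.5, 2.5, 2.5]

print(rmse(original, imputed), mae(original, imputed), ks_statistic(original, imputed))
```

Lower values on all three scores indicate imputed values that sit closer to the held-out originals, which is the comparison criterion the abstract describes.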


2021 ◽  
Vol 29 (2) ◽  
Author(s):  
Nurul Azifah Mohd Pauzi ◽  
Yap Bee Wah ◽  
Sayang Mohd Deni ◽  
Siti Khatijah Nor Abdul Rahim ◽  
Suhartono

High-quality data are essential in every field of research for valid findings. Missing data are common in datasets and arise for a variety of reasons, such as incomplete responses, equipment malfunction, and data entry errors. Single and multiple imputation methods have been developed to fill in missing values. This study investigated, via a simulation study, the performance of single imputation using the mean and of multiple imputation using Multivariate Imputation by Chained Equations (MICE). Missing completely at random (MCAR) data were generated for ten missing rates (proportions of missing data), from 5% to 50%, and for different sample sizes. Mean Square Error (MSE) was used to evaluate the performance of the imputation methods. The choice of imputation method depends on the data type: mean imputation is commonly used for continuous variables, while MICE can handle both continuous and categorical variables. The simulation results indicate that group mean imputation (GMI) performed better than overall mean imputation (OMI) and MICE, with the lowest MSE for all sample sizes and missing rates. The MSE of OMI, GMI, and MICE increases as the missing rate increases. MICE has the lowest performance (i.e., the highest MSE) when the missing rate exceeds 15%. Overall, GMI is superior to OMI and MICE for all missing rates and sample sizes under the MCAR mechanism. An application to a real dataset confirmed the simulation findings. These findings can guide researchers and practitioners in choosing a suitable imputation method when a dataset contains missing values.
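A minimal sketch of the comparison described in the abstract above: values are deleted under MCAR, filled by overall mean imputation (OMI) or group mean imputation (GMI), and scored by MSE on the deleted cells. It is not the authors' simulation code. The function names, the toy data, and the grouping variable are hypothetical, and every group is assumed to keep at least one observed value.

```python
import random
from statistics import mean


def mcar_mask(n, rate, seed=0):
    """MCAR: each value is missing independently with probability `rate`."""
    rng = random.Random(seed)
    return [rng.random() < rate for _ in range(n)]


def overall_mean_impute(values, missing):
    """OMI: fill every missing value with the mean of all observed values."""
    fill = mean(v for v, m in zip(values, missing) if not m)
    return [fill if m else v for v, m in zip(values, missing)]


def group_mean_impute(values, groups, missing):
    """GMI: fill a missing value with the observed mean of its own group."""
    observed = {}
    for v, g, m in zip(values, groups, missing):
        if not m:
            observed.setdefault(g, []).append(v)
    fills = {g: mean(vs) for g, vs in observed.items()}
    return [fills[g] if m else v for v, g, m in zip(values, groups, missing)]


def mse_on_missing(values, imputed, missing):
    """MSE between true and imputed values, over the deleted cells only."""
    return mean((v - i) ** 2 for v, i, m in zip(values, imputed, missing) if m)


# Hypothetical data: two groups with clearly different levels
values = [1, 2, 3, 101, 102, 103]
groups = ["A", "A", "A", "B", "B", "B"]
missing = [True, False, False, False, False, True]  # fixed mask for a reproducible demo

print(mse_on_missing(values, overall_mean_impute(values, missing), missing))
print(mse_on_missing(values, group_mean_impute(values, groups, missing), missing))
```

In a simulation, `mcar_mask(len(values), rate)` would replace the fixed mask and the comparison would be repeated across missing rates and sample sizes. When groups differ strongly in level, as in this toy data, GMI's error is much smaller than OMI's.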


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addressed some advanced techniques to deal with missing values in a data set measuring air quality using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data techniques are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait, aggregated to a daily basis. Logarithm transformation was carried out for all pollutant data, in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the other imputation methods and, thus, can be considered to be appropriate for analyzing air quality data.


Author(s):  
Caio Ribeiro ◽  
Alex A. Freitas

Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected, based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicabilities and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better performing classifiers, in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data had very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that can be achieved through the proposed data-driven approach.
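The feature-wise selection idea in the abstract above can be sketched as follows: for each feature, hide each known value in turn, estimate it with every candidate method, and keep the method with the lowest error. This is only a hedged illustration. The abstract does not name the five methods the authors ranked, so the two candidates here (mean and median) are stand-ins, and the leave-one-out scoring, the function names, and the toy columns are all assumptions. Each feature is assumed to have at least two known values.

```python
from statistics import mean, median

# Stand-in candidate methods; the paper's own five methods are not listed in the abstract
CANDIDATES = {"mean": mean, "median": median}


def loo_error(known, estimator):
    """Mean absolute error when each known value is hidden and re-estimated."""
    errors = []
    for i, truth in enumerate(known):
        rest = known[:i] + known[i + 1:]
        errors.append(abs(estimator(rest) - truth))
    return mean(errors)


def select_and_impute(column):
    """Pick the lowest-error method for this feature, then fill its missing values."""
    known = [v for v in column if v is not None]
    scores = {name: loo_error(known, fn) for name, fn in CANDIDATES.items()}
    best = min(scores, key=scores.get)
    fill = CANDIDATES[best](known)
    return best, [fill if v is None else v for v in column]


# Hypothetical features: one with an outlier, one roughly symmetric
print(select_and_impute([1, 1, None, 1, 1, 100]))
print(select_and_impute([0, 0, 10, 10, 5, None]))
```

The first feature selects the median because the outlier distorts the mean, and the second selects the mean. Choosing per feature lets different columns of the same dataset use different methods.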


Author(s):  
Craig K. Enders ◽  
Amanda N. Baraldi
