Imputation Methods for Missing Data for a Proposed VASA Dataset

Preprocessing is the preparation of raw data before the actual statistical method is applied. Data preprocessing is one of the most vital steps in the data mining process; it deals with the preparation and transformation of the initial dataset. It is important because analysing data that have not been properly preprocessed can lead to inaccurate and meaningless results. Almost every study has missing data, and the way those values are handled introduces an element of uncertainty into the analysis, so missing values must be treated in a way that yields an efficient and valid analysis. Missing-value imputation is one of the steps in data cleaning. Here, four different imputation methods are compared: Mean, Singular Value Decomposition (SVD), K-Nearest Neighbors (KNN), and Bayesian Principal Component Analysis (BPCA). The comparison was performed on the real VASA dataset using performance evaluation criteria such as Mean Square Error (MSE) and Root Mean Square Error (RMSE). BPCA proved to be the best imputation method and deserves further consideration in practice.
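A minimal sketch of the comparison protocol described above, assuming a complete numeric matrix into which missingness is injected artificially so that imputations can be scored against known values. Only mean and KNN imputation are shown (both ship with scikit-learn); SVD- and BPCA-based imputation require specialized libraries. The iris data merely stands in for the VASA dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
X = load_iris().data                       # stand-in for the VASA dataset
mask = rng.random(X.shape) < 0.10          # hide 10% of entries at random
X_missing = X.copy()
X_missing[mask] = np.nan

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    X_hat = imputer.fit_transform(X_missing)
    mse = np.mean((X_hat[mask] - X[mask]) ** 2)   # error on hidden entries only
    print(f"{name}: MSE={mse:.4f}, RMSE={np.sqrt(mse):.4f}")
```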

Symmetry ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 1594
Author(s):  
Samih M. Mostafa ◽  
Abdelrahman S. Eladimy ◽  
Safwat Hamad ◽  
Hirofumi Amano

In most scientific studies such as data analysis, the existence of missing data is a critical problem, and selecting the appropriate approach to deal with it is a challenge. In this paper, the authors perform a fair comparative study of several practical imputation methods for handling missing values against two proposed imputation algorithms. The proposed algorithms depend on the Bayesian Ridge technique under two different feature selection conditions. They differ from existing approaches in that they accumulate the imputed features; those imputed features are then incorporated into the Bayesian Ridge equation for predicting the missing values in the next incomplete selected feature. The authors applied the proposed algorithms to eight datasets with different amounts of missing values generated under different missingness mechanisms. Performance was measured in terms of imputation time, root mean square error (RMSE), coefficient of determination (R2), and mean absolute error (MAE). The results showed that performance varies depending on the percentage of missing values, the size of the dataset, and the missingness mechanism. In addition, the performance of the proposed methods is slightly better.
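A sketch of Bayesian-Ridge-based iterative imputation in the spirit of the proposed algorithms (the authors' cumulative feature-selection step is their own and is not reproduced here): scikit-learn's IterativeImputer regresses each incomplete feature on the others, here with a BayesianRidge estimator. The synthetic data and missingness rate are illustrative assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, :4] @ np.array([0.5, -1.0, 0.3, 0.8])  # correlated feature
X[rng.random(X.shape) < 0.15] = np.nan                # 15% MCAR missingness

imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10,
                           random_state=0)
X_imputed = imputer.fit_transform(X)
print(np.isnan(X_imputed).sum())  # 0: every missing entry has been filled
```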


2021 ◽  
Vol 7 ◽  
pp. e619
Author(s):  
Khaled M. Fouad ◽  
Mahmoud M. Ismail ◽  
Ahmad Taher Azar ◽  
Mona M. Arafa

Real-world data analysis and processing with data mining techniques often face observations that contain missing values; indeed, the existence of missing values is the main challenge of mining datasets. The missing values in a dataset should be imputed using an imputation method to improve the accuracy and performance of data mining methods. Some existing techniques use the k-nearest neighbors algorithm for imputing missing values, but determining an appropriate k value can be a challenging task. Other existing imputation techniques are based on hard clustering algorithms; when records are not well separated, as in the case of missing data, hard clustering often provides a poor description of the data. In general, imputation based on similar records is more accurate than imputation based on the entire dataset's records, so improving the similarity among records can improve imputation performance. This paper proposes two numerical missing data imputation methods. A hybrid missing data imputation method, called KI, is initially proposed; it incorporates the k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each missing record is discovered through record similarity using the k-nearest neighbors algorithm (kNN), and to improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method is then used to impute the missing values of the incomplete records using the global correlation structure among the selected records. An enhanced hybrid missing data imputation method, called FCKI, is then proposed as an extension of KI. It integrates fuzzy c-means, k-nearest neighbors, and iterative imputation algorithms to impute missing data. The fuzzy c-means algorithm is selected because records can belong to multiple clusters at the same time, which can further improve similarity. FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors, thereby applying two levels of similarity to achieve a higher imputation accuracy. The performance of the proposed imputation techniques is assessed using fifteen datasets of different sizes with varying missing ratios for three types of missing data (MCAR, MAR, and MNAR), all generated in this work. The proposed techniques are compared with other missing data imputation methods by means of three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.
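A minimal sketch of the KI idea under simplifying assumptions: neighbors are found on a provisionally mean-imputed copy, and iterative imputation is then run only over each record's neighborhood. The paper's automatic estimation of k and the fuzzy-c-means extension FCKI are not reproduced; the sketch also assumes every feature has at least one observed value in each neighborhood.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.neighbors import NearestNeighbors

def ki_impute(X, k=10):
    """Impute each incomplete row from its k nearest neighbors (KI-style sketch)."""
    X = X.astype(float)
    X_prov = SimpleImputer(strategy="mean").fit_transform(X)  # provisional fill
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_prov)
    X_out = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:            # incomplete rows
        _, idx = nn.kneighbors(X_prov[i:i + 1])
        local = X[idx[0]]                                     # row i + neighbors
        local_imp = IterativeImputer(random_state=0).fit_transform(local)
        pos = list(idx[0]).index(i)                           # locate row i
        X_out[i] = local_imp[pos]
    return X_out
```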


Author(s):  
Yuzhe Liu ◽  
Vanathi Gopalakrishnan

Many clinical research datasets have a large percentage of missing values, which directly impacts their usefulness for training high-accuracy classifiers in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine whether increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models against unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
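A sketch of the augmentation comparison under stated assumptions: a generic decision-tree classifier stands in for Bayesian Rule Learning, a public dataset stands in for the pediatric cardiomyopathy data, and KNN stands in for the four imputation methods compared in the paper.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X, y = load_breast_cancer(return_X_y=True)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.05] = np.nan      # 5% of entries go missing

# unaugmented: train only on rows with no missing values
complete = ~np.isnan(X_miss).any(axis=1)
clf = DecisionTreeClassifier(random_state=0)
acc_complete = cross_val_score(clf, X_miss[complete], y[complete]).mean()

# imputation-augmented: impute first, then train on every row
X_imp = KNNImputer(n_neighbors=5).fit_transform(X_miss)
acc_augmented = cross_val_score(clf, X_imp, y).mean()
print(f"complete-case (n={complete.sum()}): {acc_complete:.3f}")
print(f"imputation-augmented (n={len(y)}): {acc_augmented:.3f}")
```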


2018 ◽  
Author(s):  
Yeping Lina Qiu ◽  
Hong Zheng ◽  
Olivier Gevaert

Motivation: The presence of missing values is a frequent problem encountered in genomic data analysis. Lost data can be an obstacle to downstream analyses that require complete data matrices. State-of-the-art imputation techniques, including Singular Value Decomposition (SVD) and K-Nearest Neighbors (KNN) based methods, usually achieve good performance but are computationally expensive, especially for large datasets such as those involved in pan-cancer analysis.
Results: This study describes a new method: a denoising autoencoder with partial loss (DAPL) as a deep learning based alternative for data imputation. Results on pan-cancer gene expression data and DNA methylation data from over 11,000 samples demonstrate significant improvement over a standard denoising autoencoder, both for missing-at-random cases with a range of missing percentages and for missing-not-at-random cases based on expression level and GC-content. We discuss the advantages of DAPL over traditional imputation methods and show that it achieves comparable or better performance with less computational burden.
Availability: https://github.com/gevaertlab/
Contact: [email protected]
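A compact PyTorch sketch of the partial-loss idea as we read it (the authors' DAPL implementation lives at the GitHub link above; this reconstruction is our assumption): the reconstruction loss is computed only over observed entries, so missing positions never contribute gradient. All dimensions and the zero-fill convention are illustrative.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, d)
    def forward(self, x):
        return self.dec(self.enc(x))

def partial_mse(x_hat, x, observed):
    # mean squared error over observed entries only ("partial loss")
    return ((x_hat - x)[observed] ** 2).mean()

d = 100
X = torch.randn(512, d)
observed = torch.rand(512, d) > 0.2                   # 20% of entries missing
X_in = torch.where(observed, X, torch.zeros_like(X))  # zero-fill the gaps

model = Autoencoder(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(50):
    opt.zero_grad()
    loss = partial_mse(model(X_in), X, observed)
    loss.backward()
    opt.step()

# keep observed values; take reconstructions only where data were missing
X_imputed = torch.where(observed, X, model(X_in).detach())
```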


2021 ◽  
Vol 8 (3) ◽  
pp. 215-226
Author(s):  
Parisa Saeipourdizaj ◽  
Parvin Sarbakhsh ◽  
Akbar Gholampour

Background: In air quality studies, missing data are very common, arising for reasons such as machine failure or human error. The approach used in dealing with such missing data can affect the results of the analysis. The main aims of this study were to review the types of missingness mechanisms and imputation methods, to apply some of them to the imputation of missing PM10 and O3 values in Tabriz, and to compare their efficiency. Methods: Mean imputation, the EM algorithm, regression, classification and regression trees, predictive mean matching (PMM), interpolation, moving average, and K-nearest neighbors (KNN) were used. PMM was investigated by considering the spatial and temporal dependencies in the model. Missing data were randomly simulated with 10%, 20%, and 30% missing values. The efficiency of the methods was compared using the coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE). Results: Based on the results for all indicators, interpolation, moving average, and KNN had the best performance, respectively. PMM did not perform well either with or without spatio-temporal information. Conclusion: Given that pollution measurements always depend on the preceding and following observations, methods whose computations are based on neighbouring values in time performed better than the others; for pollutant data, these methods are therefore recommended.
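A sketch of the two best-performing approaches on a toy pollutant series, assuming hourly PM10-like data; pandas provides both time-based interpolation and a centered moving average directly, and both exploit the before-and-after structure the conclusion highlights.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
idx = pd.date_range("2020-01-01", periods=500, freq="h")
pm10 = pd.Series(50 + 10 * np.sin(np.arange(500) / 24)
                 + rng.normal(0, 3, 500), index=idx)
pm10[rng.random(500) < 0.20] = np.nan        # 20% of readings go missing

filled_interp = pm10.interpolate(method="time")   # linear in elapsed time
# centered 24-hour moving average; NaNs in the window are skipped
filled_ma = pm10.fillna(pm10.rolling(24, min_periods=1, center=True).mean())
```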


2019 ◽  
Author(s):  
Bret Beheim ◽  
Quentin Atkinson ◽  
Joseph Bulbulia ◽  
Will M Gervais ◽  
Russell Gray ◽  
...  

Whitehouse et al. have recently used the Seshat archaeo-historical databank to argue that beliefs in moralizing gods appear in world history only after the formation of complex “megasocieties” of around one million people. Inspection of the authors’ data, however, shows that 61% of Seshat data points on moralizing gods are missing values, mostly from smaller populations below one million people, and during the analysis the authors re-coded these data points to signify the absence of beliefs in moralizing gods. When we confine the analysis to the extant data or use various standard imputation methods, the reported finding is reversed: moralizing gods precede increases in social complexity. We suggest that the reported “megasociety threshold” for the emergence of moralizing gods is thus solely a consequence of the decision to re-code nearly two-thirds of the Seshat data from unknown values to known absences of moralizing gods.


2020 ◽  
Author(s):  
David N Borg ◽  
Robert Nguyen ◽  
Nicholas J Tierney

Missing data are often unavoidable. The reasons values go missing, along with decisions about how missing data are handled (deleted or imputed), can have a profound effect on the validity and accuracy of study results. In this article, we aimed to: estimate the proportion of studies in football research that included a missing data statement, highlight several practices to avoid in relation to missing data, and provide recommendations for exploring, visualising, and reporting missingness. Football-related articles published in 2019 were studied. A survey of 136 articles, sampled at random, was conducted to determine whether a missing data statement was included. As expected, the proportion of studies in football research that included a missing data statement was low, at only 11.0% (95% CI: 6.3% to 17.5%), suggesting that missingness is seldom considered by researchers. We recommend that researchers describe the number and percentage of missing values, including when there are no missing values. Exploratory analysis should be conducted to explore missing values, and visualisations describing missingness overall should be provided in the paper, or at least in supplementary materials. Missing values should almost always be imputed, and imputation methods should be explored to ensure they are appropriately representative. Researchers should consider these recommendations and pay greater attention to missing data and its influence on research results.
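A sketch of the recommended reporting step, assuming the data sit in a pandas DataFrame: the count and percentage of missing values per variable, reported even when zero, ready to paste into a paper or its supplementary materials.

```python
import pandas as pd

def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-variable missingness summary: counts and percentages."""
    n_missing = df.isna().sum()
    return pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(1),
    }).sort_values("n_missing", ascending=False)

# example: report includes variables with zero missing values as recommended
df = pd.DataFrame({"distance": [10.2, None, 9.8], "sprints": [12, 15, 14]})
print(missingness_report(df))
```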


2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not related to any particular scientific domain; it arises in economics, sociology, education, the behavioural sciences, and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete them. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In this paper, selected multiple imputation methods were examined, with special attention paid to using principal component analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA-based imputation compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Missing values were then imputed using MICE, missForest, and the PCA-based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as the multiple imputation method providing the lowest imputation error rates for all types of missingness; PCA-based imputation does not perform well in terms of accuracy.
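A short sketch of the accuracy measure used above: NRMSE over the artificially masked entries, normalised here by the range of the true values (normalising by their standard deviation is an equally common convention; which one the paper uses is not stated in the abstract).

```python
import numpy as np

def nrmse(x_true, x_imputed, mask):
    """RMSE over masked entries, normalised by the range of the true values."""
    err = x_true[mask] - x_imputed[mask]
    rmse = np.sqrt(np.mean(err ** 2))
    return rmse / (x_true[mask].max() - x_true[mask].min())
```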


Author(s):  
Mohamed Reda Abonazel

This paper reviews two important problems in regression analysis, outliers and missing data, as well as some methods for handling them. Two applications are introduced to study these methods using R code, providing practical guidance for researchers dealing with these problems in regression modelling with R. Finally, a Monte Carlo simulation study was conducted to compare different methods of handling missing data in the regression model. The simulation results indicate that, under our simulation factors, the k-nearest neighbors method is the best method for estimating missing values in regression models.
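A minimal Monte Carlo sketch in the spirit of the study (Python stands in for the paper's R code, and the design factors are illustrative): repeatedly simulate a regression dataset, inject MCAR missingness, impute with mean versus KNN, and compare the error of the estimated coefficients.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
beta = np.array([1.0, -2.0, 0.5])
errors = {"mean": [], "knn": []}
for rep in range(100):
    X = rng.normal(size=(200, 3))
    y = X @ beta + rng.normal(0, 1, 200)
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < 0.2] = np.nan      # 20% MCAR missingness
    for name, imp in [("mean", SimpleImputer()), ("knn", KNNImputer())]:
        b_hat = LinearRegression().fit(imp.fit_transform(X_miss), y).coef_
        errors[name].append(np.sum((b_hat - beta) ** 2))  # coefficient error

for name, e in errors.items():
    print(f"{name}: mean squared coefficient error = {np.mean(e):.4f}")
```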

