A practical comparison of single and multiple imputation methods to handle complex missing data in air quality datasets

2014 ◽  
Vol 134 ◽  
pp. 23-33 ◽  
Author(s):  
M.P. Gómez-Carracedo ◽  
J.M. Andrade ◽  
P. López-Mahía ◽  
S. Muniategui ◽  
D. Prada
2021 ◽  
Author(s):  
Adrienne D. Woods ◽  
Pamela Davis-Kean ◽  
Max Andrew Halvorson ◽  
Kevin Michael King ◽  
Jessica A. R. Logan ◽  
...  

A common challenge in developmental research is the amount of incomplete and missing data that occurs from respondents failing to complete tasks or questionnaires, as well as from disengaging from the study (i.e., attrition). This missingness can lead to biases in parameter estimates and, hence, in the interpretation of findings. These biases can be addressed through statistical techniques that adjust for missing data, such as multiple imputation. Although this technique is highly effective, it has not been widely adopted by developmental scientists given barriers such as lack of training or misconceptions about imputation methods and instead utilizing default methods within software like listwise deletion. This manuscript is intended to provide practical guidelines for developmental researchers to follow when examining their data for missingness, making decisions about how to handle that missingness, and reporting the extent of missing data biases and specific multiple imputation procedures in publications.


2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not related to any particular scientific domain, it arises in economics, sociology, education, behavioural sciences or medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis and the typical approach to missing data is simply to delete them. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state of the art technique for handling missing data is multiple imputation. In the paper, some selected multiple imputation methods were taken into account. Special attention was paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations as compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Then, missing values were imputed with the use of MICE, missForest and the PCA‑based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest rates of imputation errors for all types of missingness. PCA‑based imputation does not perform well in terms of accuracy.


2010 ◽  
Vol 7 (1) ◽  
Author(s):  
Claudio Quintano ◽  
Rosalia Castellano ◽  
Antonella Rocca

In the field of data quality, imputation is the most used method for handling missing data. The performance of imputation techniques is influenced by various factors, especially when data represent only a sample of population, for example the survey design characteristics. In this paper, we compare the results of different multiple imputation methods in terms of final estimates when outliers occur in a dataset. Consequently, in order to evaluate the influence of outliers on the performance of these methods, the procedure is applied before and after that we have identified and removed them. For this purpose, missing data were simulated on data coming from sample ISTAT annual survey on Small and Medium Enterprises. MAR mechanism is assumed for missing data. The methods are based on the multiple imputation through the Markov Chain Monte Carlo (MCMC), the propensity score and the mixture models. The results highlight the strong influence of data characteristics on final estimates.


Author(s):  
Marc J. Lajeunesse

This chapter discusses possible solutions for dealing with partial information and missing data from published studies. These solutions can improve the amount of information extracted from individual studies, and increase the representation of data for meta-analysis. It begins with a description of the mechanisms that generate missing information within studies, followed by a discussion of how gaps of information can influence meta-analysis and the way studies are quantitatively reviewed. It then suggests some practical solutions to recovering missing statistics from published studies. These include statistical acrobatics to convert available information (e.g., t-test) into those that are more useful to compute effect sizes, as well as a heuristic approaches that impute (fill gaps) missing information when pooling effect sizes. Finally, the chapter discusses multiple-imputation methods that account for the uncertainty associated with filling gaps of information when performing meta-analysis.


2018 ◽  
Vol 22 (4) ◽  
pp. 391-409
Author(s):  
John M. Roberts ◽  
Aki Roberts ◽  
Tim Wadsworth

Incident-level homicide datasets such as the Supplementary Homicide Reports (SHR) commonly exhibit missing data. We evaluated multiple imputation methods (that produce multiple completed datasets, across which imputed values may vary) via unique data that included actual values, from police agency incident reports, of seemingly missing SHR data. This permitted evaluation under a real, not assumed or simulated, missing data mechanism. We compared analytic results based on multiply imputed and actual data; multiple imputation rather successfully recovered victim–offender relationship distributions and regression coefficients that hold in the actual data. Results are encouraging for users of multiple imputation, though it is still important to minimize the extent of missing information in SHR and similar data.


2018 ◽  
Author(s):  
Kieu Trinh Do ◽  
Simone Wahl ◽  
Johannes Raffler ◽  
Sophie Molnos ◽  
Michael Laimighofer ◽  
...  

AbstractBACKGROUNDUntargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation.METHODSWe investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established genetically metabolic quantitative trait loci.RESULTSRun day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.CONCLUSIONMissing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes.Key messagesUntargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects.Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets.Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and correctly estimate effects of genetic variants on metabolite levels.KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.


Sign in / Sign up

Export Citation Format

Share Document