A comparison of multiple imputation methods for handling missing values in longitudinal data in the presence of a time-varying covariate with a non-linear association with time: a simulation study

Abstract. Background: Poor data quality limits wider use of data from routine health information systems (RHIS), especially in low- and middle-income countries. An important part of this problem is missing values, which arise when health facilities, for a variety of reasons, fail to submit their reports to the central system. Methods: Using data from the Health Management Information System (HMIS) of the Democratic Republic of the Congo (DRC) and the advent of the COVID-19 pandemic as an illustrative case study, we implemented six commonly used imputation methods on the DRC’s HMIS datasets and evaluated their performance through several statistical techniques: simple linear regression; segmented regression, which is widely used in interrupted time series studies; parametric comparisons through t-tests; and non-parametric comparisons through Wilcoxon rank-sum tests. We also examined the performance of these six imputation methods under different missingness mechanisms and tested their stability to changes in the data. Results: For regression analyses, no substantial difference was found among the results from all methods except mean imputation and exclusion & interpolation when the RHIS dataset contained less than 20% missing values. However, as the proportion of missing values grew, machine learning methods such as missForest and k-NN started to produce biased estimates, and they were also found to lack robustness to minimal changes in the data or to consecutive missingness. In contrast, multiple imputation generated the least biased estimates overall and was the most robust to all changes in the data. For comparing group means through t-tests, the results from mean imputation and exclusion & interpolation disagreed with the true inference obtained using the complete data, suggesting that these two methods not only lead to biased regression estimates but also generate unreliable t-test results.
Conclusions: We recommend the use of multiple imputation for addressing missing values in RHIS datasets. Where the computing resources needed for multiple imputation are unavailable, one may consider seasonal decomposition as the next best method. Mean imputation and exclusion & interpolation, however, always produced biased and misleading results in the subsequent analyses, and their use for handling missing values should therefore be discouraged. Keywords: Missing Data; Routine Health Information Systems (RHIS); Health Management Information System (HMIS); Health Services Research; Low- and middle-income countries (LMICs); Multiple imputation

Comparison of Selected Multiple Imputation Methods for Continuous Variables – Preliminary Simulation Study Results

Acta Universitatis Lodziensis Folia oeconomica ◽

10.18778/0208-6018.339.05 ◽

2019 ◽

Vol 6 (339) ◽

pp. 73-98

Author(s):

Małgorzata Aleksandra Misztal

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Imputation Accuracy ◽

Imputation Method ◽

Data Sets ◽

Continuous Variables ◽

Imputation Methods ◽

Study Results ◽

Almost All

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not specific to any particular scientific domain; it arises in economics, sociology, education, behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete the incomplete observations. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. This paper considers selected multiple imputation methods, with special attention paid to principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA-based imputations as compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data in 10 complete data sets from the UCI repository of machine learning databases. The missing values were then imputed with MICE, missForest and the PCA-based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of these analyses, missForest can be recommended as a multiple imputation method providing the lowest imputation error rates for all types of missingness. PCA-based imputation does not perform well in terms of accuracy.
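
The evaluation loop described in this abstract (mask values at random, impute, score with NRMSE) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the helper names `make_mcar`, `mean_impute` and `nrmse` are hypothetical, simple mean imputation stands in for MICE, missForest or MIPCA, and normalising by the variance of the true values at the masked positions is one common NRMSE convention assumed here.

```python
import math
import random


def make_mcar(values, proportion, seed=0):
    """Copy `values`, setting a fixed share of entries to None.

    Positions are chosen uniformly at random, i.e. missing completely
    at random (MCAR). Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    n_missing = round(len(values) * proportion)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(values):
    """Stand-in imputer: replace each None with the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def nrmse(true_values, imputed_values, masked_values):
    """NRMSE over the masked positions only.

    Assumed convention: sqrt(MSE / variance of the true values at the
    masked positions), using the population variance. Raises
    ZeroDivisionError if nothing is masked or that variance is zero.
    """
    idx = [i for i, v in enumerate(masked_values) if v is None]
    mse = sum((true_values[i] - imputed_values[i]) ** 2 for i in idx) / len(idx)
    true_missing = [true_values[i] for i in idx]
    mu = sum(true_missing) / len(true_missing)
    var = sum((t - mu) ** 2 for t in true_missing) / len(true_missing)
    return math.sqrt(mse / var)


if __name__ == "__main__":
    rng = random.Random(42)
    complete = [rng.gauss(50.0, 10.0) for _ in range(200)]
    for proportion in (0.1, 0.3, 0.5):
        masked = make_mcar(complete, proportion, seed=1)
        imputed = mean_impute(masked)
        print(proportion, nrmse(complete, imputed, masked))
```

A lower NRMSE means the imputed values are closer to the values that were masked; a real comparison would replace `mean_impute` with each method under study and repeat over many data sets and missingness mechanisms.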

A SAS macro for a simulation study of imputation methods for missing values—an application of Bebbington's algorithm

Public Health ◽

10.1016/s0033-3506(98)00597-6 ◽

1998 ◽

Vol 112 (2) ◽

pp. 129-132

Author(s):

S Wang

Keyword(s):

Simulation Study ◽

Missing Values ◽

Imputation Methods ◽

Sas Macro

Comparison of Imputation Methods for Missing Values in Longitudinal Data Under Missing Completely at Random (MCAR) Mechanism

African Journal of Applied Statistics ◽

10.16929/ajas/241.213 ◽

2017 ◽

Vol 4 (1) ◽

pp. 241-258

Author(s):

Lotsi Anani

Keyword(s):

Longitudinal Data ◽

Missing Values ◽

Imputation Methods ◽

Missing Completely At Random

Multiple Imputation for Missing Values in Homicide Incident Data: An Evaluation Using Unique Test Data

Homicide Studies ◽

10.1177/1088767918778309 ◽

2018 ◽

Vol 22 (4) ◽

pp. 391-409

Author(s):

John M. Roberts ◽

Aki Roberts ◽

Tim Wadsworth

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Actual Data ◽

Regression Coefficients ◽

Similar Data ◽

Missing Information ◽

Imputation Methods ◽

Unique Data ◽

Incident Reports

Incident-level homicide datasets such as the Supplementary Homicide Reports (SHR) commonly exhibit missing data. We evaluated multiple imputation methods (that produce multiple completed datasets, across which imputed values may vary) via unique data that included actual values, from police agency incident reports, of seemingly missing SHR data. This permitted evaluation under a real, not assumed or simulated, missing data mechanism. We compared analytic results based on multiply imputed and actual data; multiple imputation rather successfully recovered victim–offender relationship distributions and regression coefficients that hold in the actual data. Results are encouraging for users of multiple imputation, though it is still important to minimize the extent of missing information in SHR and similar data.
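
When an analysis is run on several completed datasets, the per-dataset results are usually combined with Rubin's rules. The sketch below shows that pooling step for a single coefficient; the function name `pool_rubin` and the input numbers are hypothetical, and the abstract does not state which pooling procedure the authors used.

```python
def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the point estimate from each completed dataset
    variances: the squared standard error from each completed dataset
    Returns (pooled_estimate, total_variance). Requires m >= 2.
    """
    m = len(estimates)
    pooled = sum(estimates) / m
    # Within-imputation variance: average of the per-dataset variances.
    within = sum(variances) / m
    # Between-imputation variance: sample variance of the estimates.
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


if __name__ == "__main__":
    # Made-up coefficient estimates from three completed datasets.
    estimate, variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(estimate, variance)
```

The between-imputation term is what carries the extra uncertainty due to the missing values; it shrinks as the imputed datasets agree more closely.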

Handling incomplete data classification using imputed feature selected bagging (IFBag) method

Intelligent Data Analysis ◽

10.3233/ida-205331 ◽

2021 ◽

Vol 25 (4) ◽

pp. 825-846

Author(s):

Ahmad Jaffar Khan ◽

Basit Raza ◽

Ahmad Raza Shahid ◽

Yogan Jaya Kumar ◽

Muhammad Faheem ◽

...

Keyword(s):

Multiple Imputation ◽

Ensemble Learning ◽

Incomplete Data ◽

Missing Values ◽

Learning Approach ◽

Imputation Methods ◽

Real World Datasets ◽

Almost All ◽

Bagging Ensemble

Almost all real-world datasets contain missing values. If missing values are not handled correctly, they can degrade the performance of a classifier. A common approach to classification with incomplete data is imputation, which transforms incomplete data into complete data. Single imputation methods are generally less accurate than multiple imputation methods, which are often much more computationally expensive. This study proposes an imputed feature selected bagging (IFBag) method, which uses multiple imputation, feature selection and a bagging ensemble learning approach to construct a number of base classifiers that classify new incomplete instances without any need for imputation in the testing phase. In bagging ensemble learning, the data are resampled multiple times with replacement, which introduces diversity in the data and can thus yield more accurate classifiers. The experimental results show that the proposed IFBag method is fast and achieves 97.26% accuracy for classification with incomplete data when compared with commonly used methods.
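
The bagging step mentioned in this abstract (resampling with replacement, then combining base classifiers) can be sketched as below. This is not the IFBag method itself: imputation and feature selection are omitted, the 1-nearest-neighbour base learner on a single numeric feature is an assumption chosen for brevity, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_one_nn(rows):
    """Toy base learner (assumption): 1-nearest neighbour on one feature.

    rows is a list of (feature, label) pairs; the returned function
    predicts the label of the closest training feature.
    """
    def predict(x):
        return min(rows, key=lambda row: abs(row[0] - x))[1]
    return predict


def bagging_predict(rows, x, n_estimators=25, seed=0):
    """Train one base learner per bootstrap sample and take a majority vote."""
    rng = random.Random(seed)
    models = [fit_one_nn(bootstrap_sample(rows, rng)) for _ in range(n_estimators)]
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    data = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (1.0, "b"), (1.1, "b"), (1.2, "b")]
    print(bagging_predict(data, 0.05), bagging_predict(data, 1.15))
```

Because each base learner sees a different bootstrap sample, individual learners can disagree; the majority vote averages out that variation.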

Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies

10.1101/260281 ◽

2018 ◽

Cited By ~ 2

Author(s):

Kieu Trinh Do ◽

Simone Wahl ◽

Johannes Raffler ◽

Sophie Molnos ◽

Michael Laimighofer ◽

...

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Statistical Power ◽

Missing Values ◽

Biological Evaluation ◽

List Type ◽

Robust Performance ◽

Metabolomics Data ◽

Imputation Methods ◽

Biochemical Pathways

Abstract. Background: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation. Methods: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established genetically metabolic quantitative trait loci. Results: Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable. Conclusion: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes.
Key messages: Untargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects. Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets. Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and correctly estimate effects of genetic variants on metabolite levels. KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.
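
The basic idea of KNN imputation on observations can be sketched as follows. This is a simplified illustration and not the study's procedure: variable pre-selection is omitted, neighbours are drawn only from fully observed rows, distances are plain Euclidean distances over the columns observed in the incomplete row, and the function name `knn_impute` is hypothetical.

```python
import math


def knn_impute(rows, k=10):
    """Fill None entries with the mean of the k nearest complete rows.

    rows: list of equal-length lists of floats, with None for missing.
    Assumes at least one fully observed row exists to serve as a donor.
    """
    complete = [row for row in rows if None not in row]
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        # Measure distance using only the columns observed in this row.
        observed = [j for j, value in enumerate(row) if value is not None]
        target = [row[j] for j in observed]
        neighbours = sorted(
            complete,
            key=lambda donor: math.dist(target, [donor[j] for j in observed]),
        )[:k]
        filled = [
            value if value is not None
            else sum(donor[j] for donor in neighbours) / len(neighbours)
            for j, value in enumerate(row)
        ]
        result.append(filled)
    return result


if __name__ == "__main__":
    data = [
        [1.0, 10.0],
        [2.0, 20.0],
        [3.0, 30.0],
        [10.0, 100.0],
        [2.1, None],
    ]
    print(knn_impute(data, k=2)[4])
```

In practice the columns would usually be scaled before computing distances, so that no single variable dominates the neighbour search.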

Recursive Partitioning Methods for Data Imputation in the Context of Item Response Theory: A Monte Carlo Simulation

Psicológica Journal ◽

10.2478/psicolj-2018-0005 ◽

2018 ◽

Vol 39 (1) ◽

pp. 88-117 ◽

Cited By ~ 1

Author(s):

Julianne M. Edwards ◽

W. Holmes Finch

Keyword(s):

Missing Data ◽

Item Response Theory ◽

Multiple Imputation ◽

Item Response ◽

Missing Values ◽

Recursive Partitioning ◽

Data Imputation ◽

Response Theory ◽

Imputation Methods ◽

Missing Responses

Abstract. Missing data is a common problem faced by psychometricians and measurement professionals. To address this issue, a number of techniques have been proposed to handle missing data in the context of Item Response Theory. These methods include several types of data imputation (corrected item mean substitution imputation, response function imputation, multiple imputation, and the EM algorithm) as well as approaches that do not rely on the imputation of missing values (treating the item as not presented, or coding missing responses as incorrect or as fractionally correct). Of these methods, even though multiple imputation has demonstrated the best performance in prior research, higher MAE was still present. Given this higher model parameter estimation MAE for even the best performing missing data methods, this simulation study’s goal was to explore the performance of a set of potentially promising data imputation methods based on recursive partitioning. Results of this study demonstrated that approaches that combine multivariate imputation by chained equations and recursive partitioning algorithms yield data with relatively low estimation MAE for both item difficulty and item discrimination. Implications of these findings are discussed.
