scholarly journals Comparison of Performance of Recursive Partitioning Models and Parametric Imputation Models Based On Predictive Mean Matching In Multiple Imputation By Chained Equations, in Order to Impute The Missing Values of The Binary Outcome, in The Presence of An Interaction Effect

Author(s):  
Sara Javadi ◽  
Abbas Bahrampour ◽  
Mohammad Mehdi Saber ◽  
Mohammad Reza Baneshi

Abstract Background: Among the new multiple imputation methods, Multiple Imputation by Chained ‎Equations (MICE) is a ‎popular ‎approach for implementing multiple imputations because of its ‎flexibility. Our main focus in this study ‎is to ‎compare the performance of parametric ‎imputation models based on predictive mean matching and ‎recursive partitioning methods ‎in multiple imputation by chained equations in the ‎presence of interaction in the ‎data.Methods: We compared the performance of parametric and tree-based imputation methods via simulation using two data generation models. For each combination of data generation model and imputation method, the following steps were performed: data generation, removal of observations, imputation, logistic regression analysis, and calculation of bias, Coverage Probability (CP), and Confidence Interval (CI) width for each coefficient Furthermore, model-based and empirical SE, and estimated proportion of the variance attributable to the missing data (λ) were calculated.Results: ‎We have shown by simulation that to impute a binary response in ‎observations involving an ‎interaction, manually interring the interaction term into the imputation model in the ‎predictive mean matching ‎model improves the performance of the PMM method compared to the recursive partitioning models in ‎ ‎multiple imputation by chained equations.‎ The parametric method in which we entered the interaction model into the imputation model (MICE-‎‎‎Interaction) led to smaller bias, slightly higher coverage probability for the interaction effect, but it ‎had ‎slightly ‎wider confidence intervals than tree-based imputation (especially classification and ‎regression ‎trees). Conclusions: The application of MICE-Interaction led to better performance than ‎recursive ‎partitioning methods in MICE, although ‎the user is interested in estimating the interaction and does not ‎know ‎enough about the structure of the observations, recursive partitioning methods can be ‎suggested to impute ‎the ‎missing values.

2018 ◽  
Vol 39 (1) ◽  
pp. 88-117 ◽  
Author(s):  
Julianne M. Edwards ◽  
W. Holmes Finch

AbstractMissing data is a common problem faced by psychometricians and measurement professionals. To address this issue, there are a number of techniques that have been proposed to handle missing data regarding Item Response Theory. These methods include several types of data imputation methods - corrected item mean substitution imputation, response function imputation, multiple imputation, and the EM algorithm, as well as approaches that do not rely on the imputation of missing values - treating the item as not presented, coding missing responses as incorrect, or as fractionally correct. Of these methods, even though multiple imputation has demonstrated the best performance in prior research, higher MAE was still present. Given this higher model parameter estimation MAE for even the best performing missing data methods, this simulation study’s goal was to explore the performance of a set of potentially promising data imputation methods based on recursive partitioning. Results of this study demonstrated that approaches that combine multivariate imputation by chained equations and recursive partitioning algorithms yield data with relatively low estimation MAE for both item difficulty and item discrimination. Implications of these findings are discussed.


2017 ◽  
Vol 10 (19) ◽  
pp. 1-7 ◽  
Author(s):  
Geeta Chhabra ◽  
Vasudha Vashisht ◽  
Jayanthi Ranjan ◽  
◽  
◽  
...  

Author(s):  
V. Jinubala ◽  
P. Jeyakumar

Data Mining is an emerging research field in the analysis of agricultural data. In fact the most important problem in extracting knowledge from the agriculture data is the missing values of the attributes in the selected data set. If such deficiencies are there in the selected data set then it needs to be cleaned during preprocessing of the data in order to obtain a functional data. The main objective of this paper is to analyse the effectiveness of the various imputation methods in producing a complete data set that can be more useful for applying data mining techniques and presented a comparative analysis of the imputation methods for handling missing values. The pest data set of rice crop collected throughout Maharashtra state under Crop Pest Surveillance and Advisory Project (CROPSAP) during 2009-2013 was used for analysis. The different methodologies like Deleting of rows, Mean & Median, Linear regression and Predictive Mean Matching were analysed for Imputation of Missing values. The comparative analysis shows that Predictive Mean Matching Methodology was better than other methods and effective for imputation of missing values in large data set.


2021 ◽  
Author(s):  
Shuo Feng ◽  
Celestin Hategeka ◽  
Karen Ann Grépin

Abstract Background : Poor data quality is limiting the greater use of data sourced from routine health information systems (RHIS), especially in low and middle-income countries. An important part of this issue comes from missing values, where health facilities, for a variety of reasons, miss their reports into the central system. Methods : Using data from the Health Management Information System (HMIS) and the advent of COVID-19 pandemic in the Democratic Republic of the Congo (DRC) as an illustrative case study, we implemented six commonly-used imputation methods using the DRC’s HMIS datasets and evaluated their performance through various statistical techniques, i.e., simple linear regression, segmented regression which is widely used in interrupted time series studies, and parametric comparisons through t-tests and non-parametric comparisons through Wilcoxon Rank-Sum tests. We also examined the performance of these six imputation methods under different missing mechanisms and tested their stability to changes in the data. Results : For regression analyses, there was no substantial difference found in the results generated from all methods except mean imputation and exclusion & interpolation when the RHIS dataset contained less than 20% missing values. However, as the missing proportion grew, machine learning methods such as missForest and k -NN started to produce biased estimates, and they were found to be also lack of robustness to minimal changes in data or to consecutive missingness. On the other hand, multiple imputation generated the overall most unbiased estimates and was the most robust to all changes in data. For comparing group means through t-tests, the results from mean imputation and exclusion & interpolation disagreed with the true inference obtained using the complete data, suggesting that these two methods would not only lead to biased regression estimates but also generate unreliable t-test results. Conclusions : We recommend the use of multiple imputation in addressing missing values in RHIS datasets. In cases necessary computing resources are unavailable to multiple imputation, one may consider seasonal decomposition as the next best method. Mean imputation and exclusion & interpolation, however, always produced biased and misleading results in the subsequent analyses, and thus their use in the handling of missing values should be discouraged. Keywords : Missing Data; Routine Health Information Systems (RHIS); Health Management Information System (HMIS); Health Services Research; Low and middle-income countries (LMICs); Multiple imputation


Missing of partial data is a problem that is prevalent in most of the datasets used for statistical analysis. In this study, we analyzed the missing values in ISBSG R1 2018 dataset and addressed the problem through imputation, a machine learning technique which can increase the availability of data. Additionally, we compare the performance of three imputation methods: Classification and Regression Trees (CART), Polynomial Regression (PR), Predictive Mean Matching (PMM), and Random Forest (RF) applied to ISBSG R1 2018 dataset available from International Standards Benchmarks Group. Through imputation, we were able to increase data availability by four times. We also evaluated the performance of these methods against the original dataset without imputation using an ensemble of Linear Regression, Gradient Boosting, Random Forest, and ANN. Imputation using CART can increase the availability of the overall dataset but only at the loss of some predictive capability of the model. However, CART remains the option of choice to extend the usability of the data by retaining rows that are otherwise removed from the dataset in traditional methods. In our experiments, this approach has been able to increase the usability of the original dataset to 63%, but with 2 to 3% decrease in its overall predictive performance.


2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not related to any particular scientific domain, it arises in economics, sociology, education, behavioural sciences or medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis and the typical approach to missing data is simply to delete them. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state of the art technique for handling missing data is multiple imputation. In the paper, some selected multiple imputation methods were taken into account. Special attention was paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations as compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Then, missing values were imputed with the use of MICE, missForest and the PCA‑based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest rates of imputation errors for all types of missingness. PCA‑based imputation does not perform well in terms of accuracy.


2018 ◽  
Vol 22 (4) ◽  
pp. 391-409
Author(s):  
John M. Roberts ◽  
Aki Roberts ◽  
Tim Wadsworth

Incident-level homicide datasets such as the Supplementary Homicide Reports (SHR) commonly exhibit missing data. We evaluated multiple imputation methods (that produce multiple completed datasets, across which imputed values may vary) via unique data that included actual values, from police agency incident reports, of seemingly missing SHR data. This permitted evaluation under a real, not assumed or simulated, missing data mechanism. We compared analytic results based on multiply imputed and actual data; multiple imputation rather successfully recovered victim–offender relationship distributions and regression coefficients that hold in the actual data. Results are encouraging for users of multiple imputation, though it is still important to minimize the extent of missing information in SHR and similar data.


2021 ◽  
Vol 25 (4) ◽  
pp. 825-846
Author(s):  
Ahmad Jaffar Khan ◽  
Basit Raza ◽  
Ahmad Raza Shahid ◽  
Yogan Jaya Kumar ◽  
Muhammad Faheem ◽  
...  

Almost all real-world datasets contain missing values. Classification of data with missing values can adversely affect the performance of a classifier if not handled correctly. A common approach used for classification with incomplete data is imputation. Imputation transforms incomplete data with missing values to complete data. Single imputation methods are mostly less accurate than multiple imputation methods which are often computationally much more expensive. This study proposes an imputed feature selected bagging (IFBag) method which uses multiple imputation, feature selection and bagging ensemble learning approach to construct a number of base classifiers to classify new incomplete instances without any need for imputation in testing phase. In bagging ensemble learning approach, data is resampled multiple times with substitution, which can lead to diversity in data thus resulting in more accurate classifiers. The experimental results show the proposed IFBag method is considerably fast and gives 97.26% accuracy for classification with incomplete data as compared to common methods used.


2018 ◽  
Author(s):  
Kieu Trinh Do ◽  
Simone Wahl ◽  
Johannes Raffler ◽  
Sophie Molnos ◽  
Michael Laimighofer ◽  
...  

AbstractBACKGROUNDUntargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation.METHODSWe investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established genetically metabolic quantitative trait loci.RESULTSRun day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.CONCLUSIONMissing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes.Key messagesUntargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects.Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets.Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and correctly estimate effects of genetic variants on metabolite levels.KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.


Sign in / Sign up

Export Citation Format

Share Document