A SAS macro for a simulation study of imputation methods for missing values—an application of Bebbington's algorithm

Public Health ◽  
1998 ◽  
Vol 112 (2) ◽  
pp. 129-132
Author(s):  
S Wang
Author(s):  
Wisam A. Mahmood ◽  
Mohammed S. Rashid ◽  
Teaba Wala Aldeen

Missing values are common in medical research and can introduce substantial bias when they are ignored or handled poorly. Standard statistical methods for dealing with them have been developed and are available, yet no method has so far been established as reliably yielding credible estimates. Missing values also reduce the usable data size and lower efficiency. Earlier scholarly work has proposed a number of imputation methods for handling missing values. Commonly used methods include the complete case method, mean imputation (Mean), Last Observation Carried Forward (LOCF), the Expectation-Maximization (EM) algorithm, Markov Chain Monte Carlo (MCMC), Hot Deck (HOT), Regression Imputation (Regress), K-nearest neighbor (KNN), K-Mean Clustering, Fuzzy K-Mean Clustering, Support Vector Machine, and Multiple Imputation (MI). In the present paper, a simulation study investigates the efficacy of the above-mentioned imputation methods in a longitudinal data setting under missing completely at random (MCAR). Missingness was introduced in three cases: a low level of 5% and higher levels of 30% and 50%. The simulation study showed that LOCF had more bias than the other methods in most situations.
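To make the LOCF finding concrete, the following is a minimal Python sketch, not the authors' simulation design. The panel layout (subjects by visits), the linear time trend, the always-observed baseline visit, and the names `locf` and `final_visit_bias` are all assumptions introduced for illustration. Missingness is drawn independently per cell, so the realized missing rate only approximates the nominal 5%, 30%, or 50%.

```python
import numpy as np

def locf(panel):
    """Carry each subject's last observed value forward along the visit axis."""
    filled = panel.copy()
    for t in range(1, filled.shape[1]):
        gap = np.isnan(filled[:, t])
        filled[gap, t] = filled[gap, t - 1]
    return filled

def final_visit_bias(rate, seed=0, n=500, visits=5, slope=1.0):
    """Bias of the LOCF estimate of the final-visit mean under MCAR.

    Hypothetical setup: outcomes rise linearly over visits with unit-variance
    noise, and every cell after baseline is missing with probability `rate`.
    """
    rng = np.random.default_rng(seed)
    truth = slope * np.arange(visits) + rng.normal(size=(n, visits))
    observed = truth.copy()
    mask = rng.random((n, visits)) < rate  # MCAR: independent of the data
    mask[:, 0] = False                     # assumption: baseline always observed
    observed[mask] = np.nan
    filled = locf(observed)
    return float(filled[:, -1].mean() - truth[:, -1].mean())

if __name__ == "__main__":
    for rate in (0.05, 0.30, 0.50):
        print(f"missing rate {rate:.0%}: LOCF bias = {final_visit_bias(rate):.3f}")
```

With an upward trend, carried-forward values come from earlier, lower visits, so the LOCF estimate of the final mean is pulled downward and the effect grows with the missing rate. Under a flat trend (`slope=0.0`) this particular source of bias disappears, which is why the sketch should not be read as a general ranking of methods.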


2019 ◽  
Vol 19 (1) ◽  
Author(s):  
Anurika Priyanjali De Silva ◽  
Margarita Moreno-Betancur ◽  
Alysha Madhu De Livera ◽  
Katherine Jane Lee ◽  
Julie Anne Simpson

2021 ◽  
Vol 29 (2) ◽  
Author(s):  
Nurul Azifah Mohd Pauzi ◽  
Yap Bee Wah ◽  
Sayang Mohd Deni ◽  
Siti Khatijah Nor Abdul Rahim ◽  
Suhartono

High-quality data are essential in every field of research for valid findings. Missing data are common and occur for a variety of reasons, such as incomplete responses, equipment malfunction, and data entry errors. Single and multiple imputation methods have been developed to fill in missing values. This study investigated the performance of single imputation using the mean and of multiple imputation using Multivariate Imputation by Chained Equations (MICE) via a simulation study. Missing values were generated completely at random (MCAR) for ten missing rates (proportions of missing data), from 5% to 50%, and for different sample sizes. Mean Square Error (MSE) was used to evaluate the performance of the imputation methods. The choice of imputation method depends on the data type: mean imputation is commonly used for continuous variables, while MICE can handle both continuous and categorical variables. The simulation results indicate that group mean imputation (GMI) performed better than overall mean imputation (OMI) and MICE, with the lowest MSE for all sample sizes and missing rates. The MSE of OMI, GMI, and MICE increases as the missing rate increases. MICE has the lowest performance (i.e. highest MSE) when the missing rate exceeds 15%. Overall, GMI is superior to OMI and MICE for all missing rates and sample sizes under the MCAR mechanism. An application to a real dataset confirmed the simulation findings. These findings can guide researchers and practitioners on which imputation method is more suitable when data are missing.
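The comparison of overall and group mean imputation can be sketched in a few lines of Python. This is an illustrative setup, not the study's design: the two-group data with well-separated means, the sample size, the 20% missing rate, and every function name below are assumptions, and MICE is left out because it needs a dedicated library.

```python
import numpy as np

def make_mcar(values, rate, rng):
    """Copy `values` and set a fraction `rate` of entries to NaN, chosen uniformly at random."""
    out = values.astype(float).copy()
    n_missing = int(round(rate * out.size))
    idx = rng.choice(out.size, size=n_missing, replace=False)
    out[idx] = np.nan
    return out

def overall_mean_impute(x):
    """OMI: fill every gap with the mean of all observed values."""
    filled = x.copy()
    filled[np.isnan(x)] = np.nanmean(x)
    return filled

def group_mean_impute(x, groups):
    """GMI: fill each gap with the observed mean of its own group."""
    filled = x.copy()
    for g in np.unique(groups):
        in_group = groups == g
        filled[in_group & np.isnan(x)] = np.nanmean(x[in_group])
    return filled

def imputation_mse(truth, observed, filled):
    """MSE between true and imputed values, over the missing cells only."""
    missing = np.isnan(observed)
    return float(np.mean((truth[missing] - filled[missing]) ** 2))

def run_demo(seed=0, n=200, rate=0.2):
    rng = np.random.default_rng(seed)
    groups = np.repeat([0, 1], n // 2)
    # Assumption: group means of 0 and 5 with unit variance.
    truth = rng.normal(loc=np.where(groups == 0, 0.0, 5.0), scale=1.0)
    observed = make_mcar(truth, rate, rng)
    mse_omi = imputation_mse(truth, observed, overall_mean_impute(observed))
    mse_gmi = imputation_mse(truth, observed, group_mean_impute(observed, groups))
    return observed, mse_omi, mse_gmi

if __name__ == "__main__":
    _, mse_omi, mse_gmi = run_demo()
    print(f"OMI MSE = {mse_omi:.3f}, GMI MSE = {mse_gmi:.3f}")
```

GMI only helps to the extent that the grouping variable explains the outcome. When the group means coincide, the two methods give nearly the same MSE.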


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Mar Rodríguez-Girondo ◽  
Niels van den Berg ◽  
Michel H. Hof ◽  
Marian Beekman ◽  
Eline Slagboom

Background: Although human longevity tends to cluster within families, genetic studies on longevity have had limited success in identifying longevity loci. One of the main causes of this limited success is the selection of participants. Studies generally include sporadically long-lived individuals, i.e. individuals with the longevity phenotype but without a genetic predisposition for longevity. The inclusion of these individuals causes phenotype heterogeneity, which results in power reduction and bias. A way to avoid sporadically long-lived individuals and reduce sample heterogeneity is to include family history of longevity as a selection criterion using a longevity family score. A main challenge when developing family scores is the large difference in family size, caused either by real differences in sibship sizes or by missing data. Methods: We discussed the statistical properties of two existing longevity family scores, the Family Longevity Selection Score (FLoSS) and the Longevity Relatives Count (LRC) score, and evaluated their performance in dealing with differential family size. We proposed a new longevity family score, the mLRC score, an extension of the LRC based on random effects modeling, which is robust to family size and missing values. The performance of the new mLRC as a selection tool was evaluated in an intensive simulation study and illustrated in a large real dataset, the Historical Sample of the Netherlands (HSN). Results: Empirical scores such as the FLoSS and LRC cannot properly deal with differential family size and missing data. Our simulation study showed that mLRC is not affected by family size and provides more accurate selections of long-lived families. The analysis of 1105 sibships of the Historical Sample of the Netherlands showed that the selection of long-lived individuals based on the mLRC score predicts excess survival in the validation set better than the selection based on the LRC score.
Conclusions: Model-based score systems such as the mLRC score help to reduce heterogeneity in the selection of long-lived families. The power of future studies into the genetics of longevity can likely be improved, and their bias reduced, by selecting long-lived cases using the mLRC.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Pooja Rani ◽  
Rajneesh Kumar ◽  
Anurag Jain

Purpose: Decision support systems developed using machine learning classifiers have become a valuable tool in predicting various diseases. However, the performance of these systems is adversely affected by the missing values in medical datasets. Imputation methods are used to predict these missing values. In this paper, a new imputation method called hybrid imputation optimized by the classifier (HIOC) is proposed to predict missing values efficiently. Design/methodology/approach: The proposed HIOC is developed by using a classifier to combine multivariate imputation by chained equations (MICE), K nearest neighbor (KNN), mean and mode imputation methods in an optimum way. Performance of HIOC has been compared to MICE, KNN, and mean and mode methods. Four classifiers, support vector machine (SVM), naive Bayes (NB), random forest (RF) and decision tree (DT), have been used to evaluate the performance of imputation methods. Findings: The results show that HIOC performed efficiently even with a high rate of missing values. It reduced root mean square error (RMSE) by up to 17.32% in the heart disease dataset and 34.73% in the breast cancer dataset. Correct prediction of missing values improved the accuracy of the classifiers in predicting diseases. It increased classification accuracy by up to 18.61% in the heart disease dataset and 6.20% in the breast cancer dataset. Originality/value: The proposed HIOC is a new hybrid imputation method that can efficiently predict missing values in any medical dataset.


2010 ◽  
Vol 6 (3) ◽  
pp. 1-10 ◽  
Author(s):  
Shichao Zhang

In this paper, the author designs an efficient method for iteratively imputing missing target values with semi-parametric kernel regression imputation, known as the semi-parametric iterative imputation algorithm (SIIA). Even when there is little prior knowledge about the datasets, the proposed iterative imputation method, which imputes each missing value several times until the algorithm converges in each model, utilizes a substantial amount of useful information. This information includes instances with missing values, and the method captures the real dataset distribution more easily than parametric or nonparametric imputation techniques. Experimental results show that the author's imputation method outperforms existing methods in terms of imputation accuracy, particularly when the missing ratio is high.
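The idea of re-imputing each missing value until the estimates stabilize can be illustrated with a short Python sketch. It is not the SIIA itself: it swaps in a plain Nadaraya-Watson kernel regression with a Gaussian kernel and a single fully observed predictor, and the names `nw_predict` and `iterative_kernel_impute`, the bandwidth, and the stopping rule are all assumptions made for illustration.

```python
import numpy as np

def nw_predict(x_train, y_train, x_query, bandwidth):
    """Nadaraya-Watson estimate at `x_query` using a Gaussian kernel."""
    diffs = (x_query[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    return (weights @ y_train) / weights.sum(axis=1)

def iterative_kernel_impute(x, y, bandwidth=0.3, max_iter=20, tol=1e-6):
    """Impute NaNs in `y` from the fully observed predictor `x`.

    Each pass refits the kernel regression on all rows, including the
    currently imputed ones, and then re-imputes the missing targets.
    """
    missing = np.isnan(y)
    filled = y.copy()
    if not missing.any():
        return filled
    filled[missing] = np.nanmean(y)  # starting point: observed mean
    for _ in range(max_iter):
        previous = filled[missing].copy()
        filled[missing] = nw_predict(x, filled, x[missing], bandwidth)
        if np.max(np.abs(filled[missing] - previous)) < tol:
            break  # imputed values have stabilized
    return filled

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 50)
    truth = np.sin(2 * np.pi * x)
    y = truth.copy()
    y[np.arange(2, 48, 5)] = np.nan
    filled = iterative_kernel_impute(x, y, bandwidth=0.1)
    print(np.round(filled[np.arange(2, 48, 5)], 3))
```

Because the imputed rows feed back into the next fit, incomplete instances contribute information rather than being discarded, which is the property the abstract highlights. Observed targets are never overwritten.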


Symmetry ◽  
2020 ◽  
Vol 12 (11) ◽  
pp. 1792
Author(s):  
Shu-Fen Huang ◽  
Ching-Hsue Cheng

Medical data usually have missing values; hence, imputation methods have become an important issue. In previous studies, many imputation methods, such as expectation-maximization and regression-based imputation, assumed that the variable data followed a multivariate normal distribution. These assumptions may lead to deviations in the results, which sometimes create a bottleneck. In addition, directly deleting instances with missing values may cause several problems, such as losing important data, producing invalid research samples, and leading to research deviations. Therefore, this study proposed a safe-region imputation method for handling medical data with missing values; we also built a medical prediction model and compared deleting instances with missing values against imputation methods in terms of the generated rules, accuracy, and AUC. First, this study used kNN imputation, multiple imputation, and the proposed imputation to impute the missing data and then applied four attribute selection methods to select the important attributes. Then, we used the decision tree (C4.5), random forest, REP tree, and LMT classifiers to generate the rules, accuracy, and AUC for comparison. Because there were four datasets with imbalanced classes (asymmetric classes), the AUC was an important criterion. In the experiment, we collected four open medical datasets from UCI and one international stroke trial dataset. The results show that the proposed safe-region imputation is better than the listed imputation methods and that imputing offers better results than directly deleting instances with missing values in terms of the number of rules, accuracy, and AUC. These results will provide a reference for medical stakeholders.


2020 ◽  
Vol 48 (14) ◽  
pp. e83-e83 ◽  
Author(s):  
Shisheng Wang ◽  
Wenxue Li ◽  
Liqiang Hu ◽  
Jingqiu Cheng ◽  
Hao Yang ◽  
...  

Mass spectrometry (MS)-based quantitative proteomics experiments frequently generate data with missing values, which may profoundly affect downstream analyses. A wide variety of imputation methods have been established to deal with the missing-value issue. To date, however, there is a scarcity of efficient, systematic, and easy-to-handle tools that are tailored for the proteomics community. Herein, we developed a user-friendly and powerful stand-alone software, NAguideR, to enable implementation and evaluation of different missing value methods offered by 23 widely used missing-value imputation algorithms. NAguideR further evaluates data imputation results through classic computational criteria and, unprecedentedly, proteomic empirical criteria, such as quantitative consistency between different charge-states of the same peptide, different peptides belonging to the same proteins, and individual proteins participating in protein complexes and functional interactions. We applied NAguideR to three label-free proteomic datasets featuring peptide-level, protein-level, and phosphoproteomic variables respectively, all generated by data independent acquisition mass spectrometry (DIA-MS) with substantial biological replicates. The results indicate that NAguideR is able to discriminate the optimal imputation methods for DIA-MS experiments from sub-optimal and low-performance algorithms. NAguideR further provides downloadable tables and figures supporting flexible data analysis and interpretation. NAguideR is freely available at http://www.omicsolution.org/wukong/NAguideR/ and the source code is available at https://github.com/wangshisheng/NAguideR/.

