Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study

Abstract Background LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in single analysis. However, especially non-targeted metabolite profiling approaches generate vast arrays of data that are prone to aberrations such as missing values. No matter the reason for the missing values in the data, coherent and complete data matrix is always a pre-requisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce the bias in the statistical analysis. Results Here we present our results after evaluating nine imputation methods in four different percentages of missing values of different origin. The performance of each imputation method was analyzed by Normalized Root Mean Squared Error (NRMSE). We demonstrated that random forest (RF) had the lowest NRMSE in the estimation of missing values for Missing at Random (MAR) and Missing Completely at Random (MCAR). In case of absent values due to Missing Not at Random (MNAR), the left truncated data was best imputed with minimum value imputation. We also tested the different imputation methods for datasets containing missing data of various origin, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times with the use of metabolomics datasets where the missing values were introduced to represent absent data of different origin. Conclusion Type and rate of missingness affects the performance and suitability of imputation methods. RF-based imputation method performs best in most of the tested scenarios, including combinations of different types and rates of missingness. Therefore, we recommend using random forest-based imputation for imputing missing metabolomics data, and especially in situations where the types of missingness are not known in advance.

Download Full-text

A framework for testing different imputation methods for tabular datasets

10.1101/773762 ◽

2019 ◽

Author(s):

Tabea Kossen ◽

Michelle Livne ◽

Vince I Madai ◽

Ivana Galinovic ◽

Dietmar Frey ◽

...

Keyword(s):

Linear Model ◽

Missing Values ◽

Mean Squared Error ◽

Missing At Random ◽

Imputation Method ◽

Similar Data ◽

Missing Value ◽

Imputation Methods ◽

Listwise Deletion ◽

Clinical Dataset

AbstractBackground and purposeHandling missing values is a prevalent challenge in the analysis of clinical data. The rise of data-driven models demands an efficient use of the available data. Methods to impute missing values are thus crucial. Here, we developed a publicly available framework to test different imputation methods and compared their impact in a typical stroke clinical dataset as a use case.MethodsA clinical dataset based on the 1000Plus stroke study with 380 completed-entries patients was used. 13 common clinical parameters including numerical and categorical values were selected. Missing values in a missing-at-random (MAR) and missing-completely-at-random (MCAR) fashion from 0% to 60% were simulated and consequently imputed using the mean, hot-deck, multiple imputation by chained equations, expectation maximization method and listwise deletion. The performance was assessed by the root mean squared error, the absolute bias and the performance of a linear model for discharge mRS prediction.ResultsListwise deletion was the worst performing method and started to be significantly worse than any imputation method from 2% (MAR) and 3% (MCAR) missing values on. The underlying missing value mechanism seemed to have a crucial influence on the identified best performing imputation method. Consequently no single imputation method outperformed all others. A significant performance drop of the linear model started from 11% (MAR+MCAR) and 18% (MCAR) missing values.ConclusionsIn the presented case study of a typical clinical stroke dataset we confirmed that listwise deletion should be avoided for dealing with missing values. Our findings indicate that the underlying missing value mechanism and other dataset characteristics strongly influence the best choice of imputation method. For future studies with similar data structure, we thus suggest to use the developed framework in this study to select the most suitable imputation method for a given dataset prior to analysis.

Download Full-text

Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018)

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18031333 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1333

Author(s):

Ahmad R. Alsaber ◽

Jiazhu Pan ◽

Adeeba Al-Hurban

Keyword(s):

Air Quality ◽

Missing Data ◽

Random Forest ◽

Missing Values ◽

Imputation Method ◽

Environmental Data ◽

Environmental Research ◽

Quality Data ◽

Data Set ◽

Air Quality Data

In environmental research, missing data are often a challenge for statistical modeling. This paper addressed some advanced techniques to deal with missing values in a data set measuring air quality using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data techniques are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait, aggregated to a daily basis. Logarithm transformation was carried out for all pollutant data, in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the other imputation methods and, thus, can be considered to be appropriate for analyzing air quality data.

Download Full-text

Kernel weighted least square approach for imputing missing values of metabolomics data

Scientific Reports ◽

10.1038/s41598-021-90654-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Nishith Kumar ◽

Md. Aminul Hoque ◽

Masahiro Sugimoto

Keyword(s):

Missing Data ◽

Large Scale ◽

Missing Values ◽

Kernel Weight ◽

Least Square ◽

Data Matrix ◽

Data Imputation ◽

Metabolomics Data ◽

Missing Value ◽

Missing Data Imputation

AbstractMass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantified data that often contain missing cells in the data matrix as well as outliers that originate for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, all conventional existing techniques only solve the missing value problems. They do not relieve the problems of outliers. Therefore, outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based proposed missing data imputation technique that resolves the problems of missing values and outliers. We evaluated the performance of the proposed method and other conventional and recently developed missing imputation techniques using both artificially generated data and experimentally measured data analysis in both the absence and presence of different rates of outliers. Performances based on both artificial data and real metabolomics data indicate the superiority of our proposed kernel weight-based missing data imputation technique to the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https://github.com/NishithPaul/tWLSA.

Download Full-text

HIOC: a hybrid imputation method to predict missing values in medical datasets

International Journal of Intelligent Computing and Cybernetics ◽

10.1108/ijicc-03-2021-0042 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Pooja Rani ◽

Rajneesh Kumar ◽

Anurag Jain

Keyword(s):

Breast Cancer ◽

Heart Disease ◽

Missing Values ◽

Imputation Method ◽

Support Vector ◽

Correct Prediction ◽

Breast Cancer Dataset ◽

Cancer Dataset ◽

Content Type ◽

Imputation Methods

PurposeDecision support systems developed using machine learning classifiers have become a valuable tool in predicting various diseases. However, the performance of these systems is adversely affected by the missing values in medical datasets. Imputation methods are used to predict these missing values. In this paper, a new imputation method called hybrid imputation optimized by the classifier (HIOC) is proposed to predict missing values efficiently.Design/methodology/approachThe proposed HIOC is developed by using a classifier to combine multivariate imputation by chained equations (MICE), K nearest neighbor (KNN), mean and mode imputation methods in an optimum way. Performance of HIOC has been compared to MICE, KNN, and mean and mode methods. Four classifiers support vector machine (SVM), naive Bayes (NB), random forest (RF) and decision tree (DT) have been used to evaluate the performance of imputation methods.FindingsThe results show that HIOC performed efficiently even with a high rate of missing values. It had reduced root mean square error (RMSE) up to 17.32% in the heart disease dataset and 34.73% in the breast cancer dataset. Correct prediction of missing values improved the accuracy of the classifiers in predicting diseases. It increased classification accuracy up to 18.61% in the heart disease dataset and 6.20% in the breast cancer dataset.Originality/valueThe proposed HIOC is a new hybrid imputation method that can efficiently predict missing values in any medical dataset.

Download Full-text

Estimating Semi-Parametric Missing Values with Iterative Imputation

International Journal of Data Warehousing and Mining ◽

10.4018/jdwm.2010070101 ◽

2010 ◽

Vol 6 (3) ◽

pp. 1-10 ◽

Cited By ~ 3

Author(s):

Shichao Zhang

Keyword(s):

Prior Knowledge ◽

Efficient Method ◽

Missing Values ◽

Imputation Accuracy ◽

Imputation Method ◽

Imputation Methods ◽

Real Dataset ◽

Regression Imputation ◽

Target Values ◽

Nonparametric Imputation

In this paper, the author designs an efficient method for imputing iteratively missing target values with semi-parametric kernel regression imputation, known as the semi-parametric iterative imputation algorithm (SIIA). While there is little prior knowledge on the datasets, the proposed iterative imputation method, which impute each missing value several times until the algorithms converges in each model, utilize a substantially useful amount of information. Additionally, this information includes occurrences involving missing values as well as capturing the real dataset distribution easier than the parametric or nonparametric imputation techniques. Experimental results show that the author’s imputation methods outperform the existing methods in terms of imputation accuracy, in particular in the situation with high missing ratio.

Download Full-text

A Safe-Region Imputation Method for Handling Medical Data with Missing Values

Symmetry ◽

10.3390/sym12111792 ◽

2020 ◽

Vol 12 (11) ◽

pp. 1792

Author(s):

Shu-Fen Huang ◽

Ching-Hsue Cheng

Keyword(s):

Missing Values ◽

Medical Data ◽

Imputation Method ◽

Attribute Selection ◽

Multivariate Normal ◽

Imputation Methods ◽

Safe Region ◽

Imbalanced Classes ◽

Stroke Trial ◽

Better Than

Medical data usually have missing values; hence, imputation methods have become an important issue. In previous studies, many imputation methods based on variable data had a multivariate normal distribution, such as expectation-maximization and regression-based imputation. These assumptions may lead to deviations in the results, which sometimes create a bottleneck. In addition, directly deleting instances with missing values may have several problems, such as losing important data, producing invalid research samples, and leading to research deviations. Therefore, this study proposed a safe-region imputation method for handling medical data with missing values; we also built a medical prediction model and compared the removed missing values with imputation methods in terms of the generated rules, accuracy, and AUC. First, this study used the kNN imputation, multiple imputation, and the proposed imputation to impute the missing data and then applied four attribute selection methods to select the important attributes. Then, we used the decision tree (C4.5), random forest, REP tree, and LMT classifier to generate the rules, accuracy, and AUC for comparison. Because there were four datasets with imbalanced classes (asymmetric classes), the AUC was an important criterion. In the experiment, we collected four open medical datasets from UCI and one international stroke trial dataset. The results show that the proposed safe-region imputation is better than the listing imputation methods and after imputing offers better results than directly deleting instances with missing values in the number of rules, accuracy, and AUC. These results will provide a reference for medical stakeholders.

Download Full-text

Missing Value Imputation using XGboost for Label-Free Mass Spectrometry-Based Proteomics Data

10.1101/2021.04.08.438945 ◽

2021 ◽

Author(s):

Jian Song ◽

Changbin Yu

Keyword(s):

Mass Spectrometry ◽

High Performance ◽

Missing Values ◽

Mean Squared Error ◽

Learning Algorithm ◽

Pearson Correlation ◽

Data Matrix ◽

Label Free ◽

Proteomics Data ◽

Benchmark Datasets

AbstractThe label-free mass spectrometry-based proteomics data inevitably suffer from the problem of missing values. The existence of missing values prevents the downstream analyses which need a complete data matrix. Our motivation is to introduce the state-of-art machine learning algorithm XGboost to realize a method of imputation which can improve the accuracy of imputation. But in practical, XGboost has many parameters need to be tuned to deliver on its potential high performance. Although cross validation may find the best parameters, it is much time-consuming. Alternatively, we empirically determined the parameters to two kinds of base learners of XGboost. To explore the robustness and performance of XGboost based imputation with predetermined parameters, we conducted tests on three benchmark datasets. As a comparative, six common imputation methods were also experimented in terms of normalized root mean squared error and Pearson correlation coefficient. The comparative experimental results indicated that the XGboost based imputation method using the linear base learner is competitive to or out-performs its competitors, including the random forest based imputation, by achieving smaller imputation errors and better structure preservation under the empirical parameters for the three benchmark datasets.

Download Full-text

NMF-Based Approach for Missing Values Imputation of Mass Spectrometry Metabolomics Data

Molecules ◽

10.3390/molecules26195787 ◽

2021 ◽

Vol 26 (19) ◽

pp. 5787

Author(s):

Jingjing Xu ◽

Yuanshan Wang ◽

Xiangnan Xu ◽

Kian-Kai Cheng ◽

Daniel Raftery ◽

...

Keyword(s):

Mass Spectrometry ◽

Missing Values ◽

Nearest Neighbors ◽

Imputation Method ◽

Challenging Problem ◽

K Nearest Neighbors ◽

Metabolomics Data ◽

Global And Local ◽

Sample Heterogeneity ◽

Non Negative Matrix Factorization

In mass spectrometry (MS)-based metabolomics, missing values (NAs) may be due to different causes, including sample heterogeneity, ion suppression, spectral overlap, inappropriate data processing, and instrumental errors. Although a number of methodologies have been applied to handle NAs, NA imputation remains a challenging problem. Here, we propose a non-negative matrix factorization (NMF)-based method for NA imputation in MS-based metabolomics data, which makes use of both global and local information of the data. The proposed method was compared with three commonly used methods: k-nearest neighbors (kNN), random forest (RF), and outlier-robust (ORI) missing values imputation. These methods were evaluated from the perspectives of accuracy of imputation, retrieval of data structures, and rank of imputation superiority. The experimental results showed that the NMF-based method is well-adapted to various cases of data missingness and the presence of outliers in MS-based metabolic profiles. It outperformed kNN and ORI and showed results comparable with the RF method. Furthermore, the NMF method is more robust and less susceptible to outliers as compared with the RF method. The proposed NMF-based scheme may serve as an alternative NA imputation method which may facilitate biological interpretations of metabolomics data.

Download Full-text

Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data

10.1101/171967 ◽

2017 ◽

Cited By ~ 1

Author(s):

Runmin Wei ◽

Jingye Wang ◽

Mingming Su ◽

Erik Jia ◽

Tianlu Chen ◽

...

Keyword(s):

Mass Spectrometry ◽

Missing Values ◽

Pearson Correlation ◽

Imputation Accuracy ◽

Metabolomics Data ◽

Missing Value ◽

Sample Distribution ◽

Imputation Methods ◽

Missing Value Imputation ◽

Squared Error

AbstractIntroductionMissing values exist widely in mass-spectrometry (MS) based metabolomics data. Various methods have been applied for handling missing values, but the selection of methods can significantly affect following data analyses and interpretations. According to the definition, there are three types of missing values, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).ObjectivesThe aim of this study was to comprehensively compare common imputation methods for different types of missing values using two separate metabolomics data sets (977 and 198 serum samples respectively) to propose a strategy to deal with missing values in metabolomics studies.MethodsImputation methods included zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC). Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate the imputation accuracy for MCAR/MAR and MNAR correspondingly. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes sum of squared error were used to evaluate the overall sample distribution. Student’s t-test followed by Pearson correlation analysis was conducted to evaluate the effect of imputation on univariate statistical analysis.ResultsOur findings demonstrated that RF imputation performed the best for MCAR/MAR and QRILC was the favored one for MNAR.ConclusionCombining with “modified 80% rule”, we proposed a comprehensive strategy and developed a public-accessible web-tool for missing value imputation in metabolomics data.

Download Full-text

Evaluating Imputation Methods to Improve Data Availability in a Software Estimation Dataset

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1025.0982s1119 ◽

2019 ◽

Vol 8 (2S11) ◽

pp. 153-159

Keyword(s):

Random Forest ◽

Missing Values ◽

Polynomial Regression ◽

Predictive Performance ◽

International Standards ◽

Data Availability ◽

Gradient Boosting ◽

Imputation Methods ◽

Original Dataset ◽

Predictive Mean Matching

Missing of partial data is a problem that is prevalent in most of the datasets used for statistical analysis. In this study, we analyzed the missing values in ISBSG R1 2018 dataset and addressed the problem through imputation, a machine learning technique which can increase the availability of data. Additionally, we compare the performance of three imputation methods: Classification and Regression Trees (CART), Polynomial Regression (PR), Predictive Mean Matching (PMM), and Random Forest (RF) applied to ISBSG R1 2018 dataset available from International Standards Benchmarks Group. Through imputation, we were able to increase data availability by four times. We also evaluated the performance of these methods against the original dataset without imputation using an ensemble of Linear Regression, Gradient Boosting, Random Forest, and ANN. Imputation using CART can increase the availability of the overall dataset but only at the loss of some predictive capability of the model. However, CART remains the option of choice to extend the usability of the data by retaining rows that are otherwise removed from the dataset in traditional methods. In our experiments, this approach has been able to increase the usability of the original dataset to 63%, but with 2 to 3% decrease in its overall predictive performance.

Download Full-text