BayesMetab: treatment of missing values in metabolomic studies using a Bayesian modeling approach

2019 ◽  
Vol 20 (S24) ◽  
Author(s):  
Jasmit Shah ◽  
Guy N. Brock ◽  
Jeremy Gaskins

Abstract Background With the rise of metabolomics, the development of methods to address analytical challenges in the analysis of metabolomics data is of great importance. Missing values (MVs) are pervasive, and their treatment can have a substantial impact on downstream statistical analyses. The MV problem in metabolomics is quite challenging: a value can be missing because the metabolite is not biologically present in the sample, because it is present but at a concentration below the lower limit of detection (LOD), or because it is present but undetected due to technical issues in sample pre-processing. The first two cases are considered missing not at random (MNAR), while the third is an example of missing at random (MAR). Typically, such MVs are substituted by a minimum value, which may lead to severely biased results in downstream analyses. Results We develop a Bayesian model, called BayesMetab, that systematically accounts for missing values using a Markov chain Monte Carlo (MCMC) algorithm with data augmentation, allowing each MV to be due either to truncation below the LOD or to technical reasons unrelated to abundance. Based on a variety of performance metrics (power for detecting differential abundance, area under the curve, and bias and MSE for parameter estimates), our simulation results indicate that BayesMetab outperforms other imputation algorithms when missingness is a mixture of MAR and MNAR. Further, our approach is competitive with methods tailored specifically to MNAR in situations where missing data are entirely MNAR. Applying our approach to an analysis of metabolomics data from a mouse model of myocardial infarction revealed several statistically significant metabolites, not previously identified, that were of direct biological relevance to the study.
Conclusions Our findings demonstrate that BayesMetab has improved performance in imputing the missing values and performing statistical inference compared to other current methods when missing values are due to a mixture of MNAR and MAR. Analysis of real metabolomics data strongly suggests this mixture is likely to occur in practice, and thus, it is important to consider an imputation model that accounts for a mixture of missing data types.
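The data-augmentation idea described in the abstract can be illustrated with a toy Gibbs-style step: values censored below the LOD are redrawn from a normal model truncated above at the LOD, and the completed data then drive a parameter update. This is a minimal sketch of the general technique, not the BayesMetab implementation; the normal model, flat prior, and all numeric values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def impute_below_lod(mu, sigma, lod, rng):
    """Draw a replacement for a value censored below the LOD,
    i.e. from N(mu, sigma^2) truncated to (-inf, lod)."""
    b = (lod - mu) / sigma  # upper truncation point in standard units
    return truncnorm.rvs(-np.inf, b, loc=mu, scale=sigma, random_state=rng)

# One toy data-augmentation sweep: fill the censored entries, then
# update the mean from the completed data (flat-prior Gibbs update).
x = np.array([5.2, 4.8, np.nan, 5.5, np.nan])  # NaN = below LOD
lod = 4.0                                      # assumed detection limit
mu, sigma = 5.0, 0.5                           # current parameter state
for i in np.where(np.isnan(x))[0]:
    x[i] = impute_below_lod(mu, sigma, lod, rng)
mu = rng.normal(x.mean(), sigma / np.sqrt(len(x)))
```

Iterating this pair of steps (impute censored cells, then update parameters) is what allows the uncertainty about the censored values to propagate into the parameter estimates, rather than being frozen at a single minimum-value substitution.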

Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to readily available software, applying these modern missing-data methods poses no major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that led to the missing values in an empirical study. This article is Part 1: it first introduces Rubin's classical definition of missing data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation. Second, it presents a selection of visualization tools, available in different R packages, for describing and exploring missing data structures.
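Rubin's three mechanisms can be made concrete with a small simulation: under MCAR the missingness indicator is pure noise, under MAR it depends only on an observed covariate, and under MNAR it depends on the unobserved value itself, which biases naive complete-case estimates. The sketch below is illustrative only; the variable names and missingness rates are assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
age = rng.normal(40, 12, n)      # fully observed covariate
income = rng.normal(50, 10, n)   # variable that may go missing

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# MCAR: missingness is independent of everything
miss_mcar = rng.random(n) < 0.2
# MAR: missingness depends only on the observed covariate (age)
miss_mar = rng.random(n) < 0.4 * sigmoid((age - 40) / 5)
# MNAR: missingness depends on the unobserved value itself
miss_mnar = rng.random(n) < 0.4 * sigmoid((income - 50) / 5)

# Complete-case means: unbiased under MCAR, biased low under MNAR,
# because high incomes are preferentially dropped
mean_true = income.mean()
mean_obs_mcar = income[~miss_mcar].mean()
mean_obs_mnar = income[~miss_mnar].mean()
```

Under MAR, the bias can be repaired by conditioning on the observed covariate (which is what multiple imputation and FIML exploit); under MNAR, no amount of conditioning on observed data alone recovers the truth.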


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nishith Kumar ◽  
Md. Aminul Hoque ◽  
Masahiro Sugimoto

Abstract Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional, large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers, which arise for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, conventional techniques address only the missing-value problem; they do not mitigate outliers, and outliers in the dataset therefore decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that resolves both missing values and outliers. We evaluated the performance of the proposed method and of other conventional and recently developed missing imputation techniques using both artificially generated data and experimentally measured data, in both the absence and the presence of different rates of outliers. Performance on both artificial data and real metabolomics data indicates the superiority of our kernel weight-based missing data imputation technique over the existing alternatives. For user convenience, an R package of the proposed technique was developed, which is available at https://github.com/NishithPaul/tWLSA.
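As a loose illustration of the kernel-weighting idea (not the authors' tWLSA algorithm), a Nadaraya-Watson-style imputer fills a missing value with a Gaussian-kernel-weighted average of observed values, so that distant, potentially outlying donors receive near-zero weight. The covariate setup and bandwidth below are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel_impute_column(y, x, bandwidth=0.2):
    """Fill NaNs in y with a Nadaraya-Watson (Gaussian-kernel-weighted)
    average of observed y values, weighted by closeness in covariate x.
    Donors far away in x -- including gross outliers -- get ~zero weight."""
    y = y.astype(float).copy()
    obs = ~np.isnan(y)
    for i in np.where(~obs)[0]:
        w = np.exp(-0.5 * ((x[obs] - x[i]) / bandwidth) ** 2)
        y[i] = np.sum(w * y[obs]) / np.sum(w)
    return y

# Toy demo: y varies smoothly with x; knock out three y values
x = np.linspace(0, 4, 100)
y = np.sin(x) + rng.normal(0, 0.05, 100)
y_miss = y.copy()
y_miss[[10, 40, 70]] = np.nan
y_hat = kernel_impute_column(y_miss, x)
max_err = np.abs(y_hat[[10, 40, 70]] - y[[10, 40, 70]]).max()
```

The smooth downweighting is what distinguishes a kernel-weight approach from a plain k-nearest-neighbour average, where every selected donor counts equally regardless of how atypical it is.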


2014 ◽  
Vol 926-930 ◽  
pp. 3830-3833
Author(s):  
Zhi Hui Fu ◽  
Cui Xin Peng ◽  
Bin Li

Missing data are often a problem in statistical modeling, and how to estimate item parameters in item response theory (IRT) in the presence of missing data is an interesting issue. The Bayesian paradigm offers a natural model-based solution by treating missing values as random variables and estimating their posterior distributions. In this article, based on a data augmentation scheme using the Gibbs sampler, we propose a Bayesian procedure to estimate the multidimensional two-parameter logistic (2PL) model with missing responses.
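One augmentation sweep of the kind the abstract describes can be sketched as follows: missing item responses are filled in as Bernoulli draws under the current 2PL parameters, after which the person and item parameters would be updated from the completed response matrix. This toy uses a unidimensional 2PL for brevity (the paper treats the multidimensional case), and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Toy persons x items response matrix with ignorable missingness
theta = rng.normal(0, 1, 200)            # person abilities
a = np.array([1.2, 0.8, 1.5])            # item discriminations
b = np.array([-0.5, 0.0, 0.7])           # item difficulties
P = p_correct(theta[:, None], a, b)      # shape (200, 3)
Y = (rng.random(P.shape) < P).astype(float)
Y[rng.random(Y.shape) < 0.15] = np.nan   # 15% missing responses

# One data-augmentation step: draw the missing responses from the
# current model; a full Gibbs sampler would next update theta, a, b
# from the completed matrix Y_aug and iterate.
miss = np.isnan(Y)
Y_aug = Y.copy()
Y_aug[miss] = (rng.random(miss.sum()) < P[miss]).astype(float)
```

Because the completed matrix changes at every iteration, the sampler averages over plausible fill-ins rather than committing to a single imputed dataset.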


Author(s):  
Brett K Beaulieu-Jones ◽  
Daniel R Lavage ◽  
John W Snyder ◽  
Jason H Moore ◽  
Sarah A Pendergrass ◽  
...  

BACKGROUND Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results. OBJECTIVE The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. METHODS We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling). RESULTS Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation. CONCLUSIONS The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.
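As an illustration of the chained-equations idea evaluated above (not the authors' code), scikit-learn's IterativeImputer models each incomplete column as a function of the other columns and cycles until convergence. The simulated correlated "lab measures" below are assumptions for the demo.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500
# Two correlated toy "lab measures"; the second predicts the first
x2 = rng.normal(0, 1, n)
x1 = 2 * x2 + rng.normal(0, 0.3, n)
X = np.column_stack([x1, x2])

# Knock out 20% of the first measure completely at random
mask = rng.random(n) < 0.2
X_miss = X.copy()
X_miss[mask, 0] = np.nan

# Chained equations: each incomplete column is regressed on the others,
# and the fitted model's predictions replace the missing cells
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_miss)

rmse = float(np.sqrt(np.mean((X_filled[mask, 0] - X[mask, 0]) ** 2)))
```

For proper multiple imputation (as the abstract notes, only some MICE variants support this), the imputation step must inject draw-level noise so that repeated imputed datasets differ; a single deterministic fill understates uncertainty.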


2021 ◽  
Author(s):  
Nwamaka Okafor ◽  
Declan Delaney

IoT sensors are becoming an increasingly important supplement to traditional monitoring systems, particularly for in-situ monitoring. Data collected using IoT sensors are often plagued with missing values arising from sensor faults, network failures, drift, and other operational issues. Missing data can have a substantial impact on in-field sensor calibration methods. The goal of this research is to achieve effective calibration of sensors in the context of such missing data. To this end, two objectives are addressed in this paper: 1) identify and examine effective imputation strategies for missing data in IoT sensors, and 2) determine sensor calibration performance when calibration techniques are applied to datasets with imputed values. Specifically, this paper examines the performance of the Variational Autoencoder (VAE), Neural Network with Random Weights (NNRW), Multiple Imputation by Chained Equations (MICE), random forest-based imputation (missForest), and K-Nearest Neighbour (KNN) methods for imputing missing values in IoT sensor data. Furthermore, the performance of sensor calibration via different supervised algorithms trained on the imputed datasets was evaluated. The analysis showed that the VAE technique outperforms the others in imputing missing values at different proportions of missingness on two real-world datasets. Experimental results also showed improved calibration performance with the imputed datasets.
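The KNN baseline in such comparisons can be sketched with scikit-learn's KNNImputer, which fills a missing reading from the k rows most similar on the channels that are still observed. The co-located two-sensor setup below is an assumed toy, not one of the authors' datasets.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
n = 300
# Two co-located "sensors" tracking the same slowly varying signal
signal = np.cumsum(rng.normal(0, 1, n))
s1 = signal + rng.normal(0, 0.2, n)
s2 = signal + rng.normal(0, 0.2, n)
X = np.column_stack([s1, s2])

# Drop 10% of sensor-1 readings, as a network dropout might
mask = rng.random(n) < 0.1
X_miss = X.copy()
X_miss[mask, 0] = np.nan

# Each gap is filled from the 5 rows most similar on the working channel
X_filled = KNNImputer(n_neighbors=5).fit_transform(X_miss)
mae = float(np.abs(X_filled[mask, 0] - X[mask, 0]).mean())
```

A downstream calibration model would then be trained on `X_filled` rather than on the gappy raw matrix, which is the pipeline the paper evaluates.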


2018 ◽  
Author(s):  
Kieu Trinh Do ◽  
Simone Wahl ◽  
Johannes Raffler ◽  
Sophie Molnos ◽  
Michael Laimighofer ◽  
...  

Abstract BACKGROUND Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values and of strategies to handle them has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD), or randomly, for instance as a consequence of sample preparation. METHODS We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks and to increase statistical power while preserving the strength of established genetic associations with metabolite levels (metabolite quantitative trait loci). RESULTS Run day-dependent, LOD-based missing data account for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable. CONCLUSION Missing data in untargeted MS-based metabolomics occur for various reasons. Based on our results, we recommend KNN-based imputation on observations with variable pre-selection, since it showed robust results in all evaluation schemes.
Key messages:
Untargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects.
Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets.
Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and to correctly estimate effects of genetic variants on metabolite levels.
KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.


2021 ◽  
Author(s):  
Nishith Kumar ◽  
Md. Hoque ◽  
Masahiro Sugimoto

Abstract Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomics analyses. It yields a high-dimensional, large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers, which originate from several sources, both technical and biological. Although several missing data imputation techniques can be found in the literature, the conventional techniques solve only the missing-value problem and do not relieve the problem of outliers; outliers in the dataset therefore deteriorate the accuracy of imputation. To overcome both the missing data and outlier problems, we developed a new kernel weight function-based missing data imputation technique that resolves both missing values and outliers. We evaluated the performance of the proposed method and of nine other conventional missing imputation techniques using both artificially generated data and experimentally measured data, in both the absence and the presence of different rates of outliers. Performance based on both artificial data and real metabolomics data indicates that our kernel weight-based missing data imputation technique is a better performer than the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique has been developed, which is available at https://github.com/NishithPaul/tWLSA.


2017 ◽  
Author(s):  
Brett K. Beaulieu-Jones ◽  
Daniel R. Lavage ◽  
John W. Snyder ◽  
Jason H. Moore ◽  
Sarah A Pendergrass ◽  
...  

ABSTRACT Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. Here, we provide detailed procedures for when and how to conduct imputation of EHR data. We demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. We analyzed clinical lab measures from 602,366 patients in the Geisinger Health System EHR. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness. Our results show that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods were suitable for multiple imputation. The analyses described provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.


Author(s):  
William H Clark ◽  
Steven Hauser ◽  
William C Headley ◽  
Alan J Michaels

Applications of machine learning are subject to three major components that contribute to the final performance metrics. Within the category of neural networks, and deep learning specifically, the first two are the architecture of the model being trained and the training approach used. This work focuses on the third component: the data used during training. The primary questions that arise are "what is in the data" and "what within the data matters?" Looking into the radio frequency machine learning (RFML) field of automatic modulation classification (AMC), as an example of a tool used for situational awareness, the use of synthetic, captured, and augmented data is examined and compared to provide insights about the quantity and quality of data necessary to achieve desired performance levels. Three questions are discussed within this work: (1) how useful a synthetically trained system is expected to be when deployed without considering the environment within the synthesis; (2) how augmentation can be leveraged within the RFML domain; and (3) what impact knowledge of degradations to the signal caused by the transmission channel has on the performance of a system. In general, each of the examined data types makes a useful contribution to a final application, but captured data germane to the intended use case will always provide more significant information and enable the greatest performance. Despite the benefit of captured data, the difficulties and costs of live collection often make the quantity of data needed to achieve peak performance impractical. This paper helps quantify the balance between real and synthetic data, offering concrete examples where training data are parametrically varied in size and source.


