2021 ◽  
Addressing Missing Data in Untargeted Metabolomics: Identifying Missingness Type and Evaluating the Impact of Imputation Methods on Experimental Replication

Author(s):  
Trenton J. Davis ◽  
Tarek R. Firzli ◽  
Emily A. Higgins Keppler ◽  
Matt Richardson ◽  
Heather D. Bean

Missing data is a significant issue in metabolomics that is often neglected during data pre-processing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC×GC) data sets. We also present these goals in the context of experimental replication, whereby imputation is conducted in a within-replicate-based fashion—the first description and evaluation of this strategy—and introduce an R package, MetabImpute, to carry out these analyses. Our results indicate that, in these two data sets, missingness was most likely of the missing-at-random (MAR) and missing-not-at-random (MNAR) types, as opposed to missing completely at random (MCAR). Gibbs sampler imputation and random forest gave the best results when imputing MAR and MNAR data, compared against single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal components analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches increased the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.
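The single-value strategies compared above, and the within-replicate variant, can be sketched briefly. This is a minimal Python illustration under assumed conventions (samples in rows, features in columns, NaN marking missing values); the function names are hypothetical and this is not the MetabImpute implementation:

```python
import numpy as np

def impute_single_value(x, method="half_min"):
    """Fill NaNs in a 1-D feature vector with a single value; methods
    mirror those compared above: zero, minimum, mean, median, and
    half-minimum (half the smallest observed value, a common stand-in
    for signals below the detection limit)."""
    observed = x[~np.isnan(x)]
    fill = {
        "zero": 0.0,
        "min": observed.min(),
        "mean": observed.mean(),
        "median": float(np.median(observed)),
        "half_min": observed.min() / 2.0,
    }[method]
    out = x.copy()
    out[np.isnan(out)] = fill
    return out

def impute_within_replicates(matrix, groups, method="half_min"):
    """Impute each feature separately within each replicate group,
    rather than across the whole data set (a hypothetical sketch of the
    within-replicate idea, not the MetabImpute implementation)."""
    out = matrix.astype(float).copy()
    for g in np.unique(groups):
        rows = groups == g
        for j in range(out.shape[1]):
            col = out[rows, j]
            # skip columns with nothing missing or nothing observed
            if np.isnan(col).any() and (~np.isnan(col)).any():
                out[rows, j] = impute_single_value(col, method)
    return out
```

Within-replicate imputation fills each feature using only the values observed in the same replicate group, which is what ties the imputed values to the replication structure.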


Author(s):  
Fereshteh Shahoveisi ◽  
Atena Oladzad ◽  
Luis E. del Rio Mendoza ◽  
Seyedali Hosseinirad ◽  
Susan Ruud ◽  
...  

The polyploid nature of canola (Brassica napus) represents a challenge for the accurate identification of single nucleotide polymorphisms (SNPs) and the detection of quantitative trait loci (QTL). In this study, combinations of eight phenotype scoring systems and six SNP calling and filtering parameters were evaluated for their efficiency in detecting QTL associated with response to Sclerotinia stem rot, caused by Sclerotinia sclerotiorum, in two doubled haploid (DH) canola mapping populations. Most QTL were detected with the lesion length, relative area under the disease progress curve (rAUDPC) for lesion length, and binomial plant-mortality data sets. Binomial data derived from lesion size were less efficient for QTL detection. Including additional phenotypic sets in the analysis increased the number of significant QTL 2.3-fold; however, the continuous data sets were more efficient. Of the two filtering parameters used to analyze genotyping-by-sequencing (GBS) data, imputation of missing data increased QTL detection in one population, which had a high level of missing data, but not in the other. Including segregation-distorted SNPs increased QTL detection but did not significantly affect their R2 values. Twelve of the 16 detected QTL were on chromosomes A02 and C01, and the rest were on A07, A09, and C03. Marker A02-7594120, associated with a QTL on chromosome A02, was detected in both populations. The results of this study suggest that the impact of genotypic variant calling and filtering parameters may be population-dependent, while deriving additional phenotype scoring systems, such as rAUDPC and binary mortality data sets, may improve the efficiency of QTL detection.
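The rAUDPC phenotype mentioned above is conventionally computed by the trapezoidal rule and then normalized by the maximum possible area over the assessment window. A small sketch under that standard definition (not necessarily the authors' exact scoring code):

```python
def audpc(times, scores):
    """Area under the disease progress curve via the trapezoidal rule."""
    return sum(
        (scores[i] + scores[i + 1]) / 2.0 * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )

def relative_audpc(times, scores, max_score):
    """AUDPC divided by the maximum possible area (max_score held over
    the whole assessment window), giving a value between 0 and 1."""
    duration = times[-1] - times[0]
    return audpc(times, scores) / (max_score * duration)
```

The relative form makes scores comparable across experiments with different assessment schedules.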


2020 ◽  
Vol 14 (Supplement_1) ◽  
pp. S388-S389
Author(s):  
J Chen ◽  
S Hunter ◽  
K Kisfalvi ◽  
R A Lirio

Abstract Background Missing data are common in IBD trials. Depending on their volume and nature, missing data can reduce the statistical power for detecting treatment differences, introduce potential bias, and invalidate conclusions. Non-responder imputation (NRI), in which patients with missing data are considered treatment failures, is widely used to handle missing data for dichotomous efficacy endpoints in IBD trials. However, it does not consider the mechanisms leading to missing data and can potentially underestimate the treatment effect. We proposed a hybrid (HI) approach combining NRI and multiple imputation (MI) as an alternative to NRI in the analyses of two phase 3 trials of vedolizumab (VDZ) in patients with moderate-to-severe UC, VISIBLE 1 and VARSITY. Methods VISIBLE 1 and VARSITY assessed efficacy using dichotomous endpoints based on the complete Mayo score; full methodologies have been reported previously.1,2 Our proposed HI approach aims to impute missing Mayo scores instead of imputing the missing dichotomous efficacy endpoint. To assess the impact of dropouts under different missing data mechanisms (categorised as ‘missing not at random’ [MNAR] and ‘missing at random’ [MAR]), HI was implemented as a potential sensitivity analysis, where dropouts owing to safety or lack of efficacy were imputed using NRI (assuming MNAR) and other missing data were imputed using MI (assuming MAR). For MI, each component of the Mayo score was imputed via a multivariate stepwise approach using a fully conditional specification ordinal logistic method. Missing baseline scores were imputed using baseline characteristics data. Missing scores from each subsequent visit were imputed using all previous visits in a stepwise fashion. Fifty imputation data sets were computed for each component of the Mayo score, and the complete Mayo score and relevant efficacy endpoints were derived subsequently.
The analysis was performed within each imputed data set to determine the treatment difference, 95% CI, and p-value, which were then combined via Rubin’s rules.3 Results Tables 1 and 2 show a comparison of efficacy in the two studies using the primary NRI analysis vs. the alternative HI approach for handling missing data. Conclusion The HI and NRI approaches can provide consistent efficacy analyses in IBD trials. The HI approach can serve as a useful sensitivity analysis to assess the impact of dropouts under different missing data mechanisms and to evaluate the robustness of efficacy conclusions.
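Rubin's rules, used above to combine the per-imputation results, pool the m point estimates by their mean and combine within- and between-imputation variability into a total variance. A minimal sketch (illustrative only, not the trial analysis code):

```python
import math

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their squared standard
    errors (variances) with Rubin's rules: pooled estimate is the mean,
    total variance is W + (1 + 1/m) * B."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    w = sum(variances) / m                                   # within-imputation
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation
    total = w + (1 + 1 / m) * b
    return q_bar, math.sqrt(total)
```

The between-imputation term is inflated by 1 + 1/m to account for using a finite number of imputations.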


Author(s):  
Daniel M Portik ◽  
John J Wiens

Abstract Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from hundreds or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these data sets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data (∼5000 loci) and subsampled data sets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic data sets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several “best practices” for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the phylogenetic results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses. 
[Alignment; concatenated analysis; phylogenomics; sequence length heterogeneity; species-tree analysis; trimming]
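One of the alignment properties compared above, the number of informative sites, follows the standard parsimony definition: a site is informative when at least two character states each occur in at least two sequences. A small sketch under that definition (illustrative, not the authors' pipeline):

```python
from collections import Counter

def parsimony_informative_sites(alignment):
    """Count parsimony-informative sites in a list of equal-length
    sequences: a site qualifies when at least two character states each
    occur in at least two sequences (gap and ambiguity symbols
    '-', '?', 'N' are ignored)."""
    count = 0
    for column in zip(*alignment):
        states = Counter(c for c in column if c not in "-?N")
        if sum(1 for n in states.values() if n >= 2) >= 2:
            count += 1
    return count
```

Trimming methods that remove gappy columns change this count directly, which is why it is a useful summary for comparing alignment/trimming combinations.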


SAGE Open ◽  
2016 ◽  
Vol 6 (4) ◽  
pp. 215824401666822 ◽  
Author(s):  
Simon Grund ◽  
Oliver Lüdtke ◽  
Alexander Robitzsch

The treatment of missing data can be difficult in multilevel research because state-of-the-art procedures such as multiple imputation (MI) may require advanced statistical knowledge or a high degree of familiarity with certain statistical software. In the missing data literature, pan has been recommended for MI of multilevel data. In this article, we provide an introduction to MI of multilevel missing data using the R package pan, and we discuss its possibilities and limitations in accommodating typical questions in multilevel research. To make pan more accessible to applied researchers, we make use of the mitml package, which provides a user-friendly interface to the pan package and several tools for managing and analyzing multiply imputed data sets. We illustrate the use of pan and mitml with two empirical examples that represent common applications of multilevel models, and we discuss how these procedures may be used in conjunction with other software.


2021 ◽  
Vol 7 ◽  
Author(s):  
Hannes Doerfler ◽  
Dana-Adriana Botesteanu ◽  
Stefan Blech ◽  
Ralf Laux

Metabolomics has been increasingly applied to biomarker discovery, as untargeted metabolic profiling represents a powerful exploratory tool for identifying causal links between biomarkers and disease phenotypes. In the present work, we used untargeted metabolomics to investigate plasma specimens of rats, dogs, and mice treated with small-molecule drugs designed for improved glycemic control of type 2 diabetes mellitus patients via activation of GPR40. The in vivo pharmacology of GPR40 is not yet fully understood, and compounds targeting this receptor have been found to induce drug-induced liver injury (DILI). Metabolomic analysis using an integrated UPLC-TWIMS-HRMS platform was applied to detect metabolic differences between treated and non-treated animals within two 4-week toxicity studies in rat and dog and one 2-week toxicity study in mouse. Multivariate statistics of the untargeted metabolomics data subsequently revealed the presence of several significantly upregulated endogenous compounds in the treated animals whose plasma levels are known to be affected during DILI. A specific bile acid metabolite useful as an endogenous probe for drug–drug interaction studies was identified (chenodeoxycholic acid-24 glucuronide), as well as a metabolic precursor indicative of acidic bile acid biosynthesis (7α-hydroxy-3-oxo-4-cholestenoic acid). These results correlate with typical liver toxicity parameters at the individual level.


2011 ◽  
Vol 26 (S2) ◽  
pp. 572-572
Author(s):  
N. Resseguier ◽  
H. Verdoux ◽  
F. Clavel-Chapelon ◽  
X. Paoletti

Introduction The CES-D scale is commonly used to assess depressive symptoms (DS) in large population-based studies. Missing values in items of the scale may create biases. Objectives To explore reasons for not completing items of the CES-D scale and to perform a sensitivity analysis of the prevalence of DS to assess the impact of different missing data hypotheses. Methods 71,412 women included in the French E3N cohort returned a questionnaire containing the CES-D scale in 2005; 45% presented at least one missing value in the scale. An interview study was carried out on a random sample of 204 participants to examine the different hypotheses for the missing value mechanism. The prevalence of DS was estimated according to different methods for handling missing values: complete cases analysis, single imputation, and multiple imputation under MAR (missing at random) and MNAR (missing not at random) assumptions. Results The interviews showed that participants were not embarrassed to fill in questions about DS. Potential reasons for nonresponse were identified. The MAR and MNAR hypotheses remained plausible and were explored. Among complete responders, the prevalence of DS was 26.1%. After multiple imputation under the MAR assumption, it was 28.6%, 29.8%, and 31.7% among women presenting up to 4, up to 10, and up to 20 missing values, respectively. The estimates were robust after applying various MNAR scenarios in the sensitivity analysis. Conclusions The CES-D scale can easily be used to assess DS in large cohorts. Multiple imputation under the MAR assumption allows missing values to be handled reliably.
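MNAR scenarios in a sensitivity analysis like the one above are often implemented by delta adjustment: impute as under MAR, then shift the imputed values by a chosen offset to mimic systematically higher symptom levels among non-responders. A minimal sketch of that general approach (an assumption about the mechanics, not necessarily the authors' exact procedure):

```python
import numpy as np

def delta_adjusted_impute(y, delta, rng):
    """Single-imputation sketch of an MNAR delta-adjustment scenario:
    draw missing scores from the observed distribution (as under MAR),
    then add delta to the drawn values. delta = 0 reproduces the MAR
    scenario; larger delta encodes stronger MNAR departures."""
    observed = y[~np.isnan(y)]
    out = y.copy()
    missing = np.isnan(y)
    out[missing] = rng.choice(observed, size=int(missing.sum())) + delta
    return out
```

Re-estimating the prevalence across a grid of delta values shows how robust the MAR-based estimate is to plausible MNAR departures.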


Author(s):  
Seçil Ömür Sünbül

In this study, we aimed to investigate the impact of different missing data handling methods on DINA model parameter estimation and classification accuracy. Simulated data were used, generated by manipulating the number of items and the sample size. In the generated data, two different missing data mechanisms (missing completely at random and missing at random) were created according to three different amounts of missing data. The missing data were then completed using four methods: treating missing responses as incorrect, person mean imputation, two-way imputation, and expectation-maximization algorithm imputation. As a result, it was observed that both the s and g parameter estimates and the classification accuracies were affected by the missing data rates, the missing data handling methods, and the missing data mechanisms.
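Two-way imputation, one of the methods evaluated above, fills a missing response with person mean + item mean − grand mean, each computed over the observed entries. A minimal sketch of the continuous (unrounded) version, assuming every person and every item has at least one observed response:

```python
import numpy as np

def two_way_impute(scores):
    """Fill missing responses in a persons-by-items matrix with
    person mean + item mean - grand mean, all computed over observed
    entries (continuous two-way imputation, without rounding)."""
    x = scores.astype(float).copy()
    mask = np.isnan(x)
    person_means = np.nanmean(x, axis=1)
    item_means = np.nanmean(x, axis=0)
    grand_mean = np.nanmean(x)
    rows, cols = np.where(mask)
    x[rows, cols] = person_means[rows] + item_means[cols] - grand_mean
    return x
```

For dichotomous item responses, the imputed values are typically rounded to 0/1 before fitting the DINA model.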


Author(s):  
Evrim Acar ◽  
Gozde Gurdeniz ◽  
Morten A. Rasmussen ◽  
Daniela Rago ◽  
Lars O. Dragsted ◽  
...  

Metabolomics focuses on the detection of chemical substances in biological fluids such as urine and blood using a number of analytical techniques including Nuclear Magnetic Resonance (NMR) spectroscopy and Liquid Chromatography-Mass Spectrometry (LC-MS). Among the major challenges in analysis of metabolomics data are (i) joint analysis of data from multiple platforms, and (ii) capturing easily interpretable underlying patterns, which could be further utilized for biomarker discovery. In order to address these challenges, the authors formulate joint analysis of data from multiple platforms as a coupled matrix factorization problem with sparsity penalties on the factor matrices. They developed an all-at-once optimization algorithm, called CMF-SPOPT (Coupled Matrix Factorization with SParse OPTimization), which is a gradient-based optimization approach solving for all factor matrices simultaneously. Using numerical experiments on simulated data, the authors demonstrate that CMF-SPOPT can capture the underlying sparse patterns in data. Furthermore, on a real data set of blood samples collected from a group of rats, the authors use the proposed approach to jointly analyze metabolomics data sets and identify potential biomarkers for apple intake. Advantages and limitations of the proposed approach are also discussed using illustrative examples on metabolomics data sets.
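The coupled factorization described above shares one factor matrix across platforms while penalizing the platform-specific factors for sparsity. A minimal proximal-gradient (ISTA-style) sketch of that objective, which is not the all-at-once CMF-SPOPT algorithm itself (matrix shapes and step sizes are illustrative assumptions):

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: proximal operator of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def cmf_step(X1, X2, A, B, C, lr=0.005, lam=0.1):
    """One proximal-gradient step on
    f = 0.5||X1 - A B^T||^2 + 0.5||X2 - A C^T||^2 + lam(||B||_1 + ||C||_1),
    where A (samples x components) is shared by both data sets and
    B, C hold the per-platform variable loadings."""
    R1 = X1 - A @ B.T              # residual, platform 1
    R2 = X2 - A @ C.T              # residual, platform 2
    gA = -(R1 @ B) - (R2 @ C)      # the shared factor sees both residuals
    gB = -(R1.T @ A)
    gC = -(R2.T @ A)
    return A - lr * gA, soft(B - lr * gB, lr * lam), soft(C - lr * gC, lr * lam)

def cmf_loss(X1, X2, A, B, C, lam=0.1):
    return (0.5 * np.sum((X1 - A @ B.T) ** 2)
            + 0.5 * np.sum((X2 - A @ C.T) ** 2)
            + lam * (np.abs(B).sum() + np.abs(C).sum()))
```

The soft-thresholding step is what drives loadings toward exact zeros, giving the easily interpretable sparse patterns the authors aim for.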


2020 ◽  
Vol 29 (9) ◽  
pp. 2697-2716
Author(s):  
Hélène Jacqmin-Gadda ◽  
Anaïs Rouanet ◽  
Robert D Mba ◽  
Viviane Philipps ◽  
Jean-François Dartigues

Quantile regressions are increasingly used to provide population norms for quantitative variables. Indeed, they do not require any Gaussian assumption for the response and allow its entire distribution to be characterized through different quantiles. Quantile regressions are especially useful for providing norms of cognitive scores in the elderly, which may help general practitioners identify subjects with unexpectedly low cognitive levels in routine examinations. These norms may be estimated from cohorts of the elderly using quantile regression for longitudinal data, but this requires properly accounting for selection by death, dropout, and intermittent missing data. In this work, we extend the weighted estimating equation approach to estimate conditional quantiles in the population currently alive from mortal cohorts with dropout and intermittent missing data. Suitable weight estimation procedures are provided for both monotone and intermittent missing data and under two missing-at-random assumptions, in which the observation probability given that the subject is alive either depends on the survival time (p-MAR assumption) or does not (u-MAR assumption). Inference is performed through subject-level bootstrap. The method is validated in a simulation study and applied to the French cohort Paquid to estimate quantiles of a cognitive test in the elderly population currently alive. On the one hand, the simulations show that the u-MAR analysis is quite robust when the true missingness mechanism is p-MAR. This is a useful result because computing suitable weights for intermittent missing data under the p-MAR assumption is intractable. On the other hand, the simulations, along with the real data analysis, highlight the usefulness of suitable weights for intermittent missing data. This method is implemented in the R package weightQuant.
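For a single marginal quantile, the weighted estimating equation reduces to a weighted sample quantile, where each observed subject carries weight 1/P(observed | alive). A minimal sketch of that reduction (illustrative only; the paper's method handles full conditional quantile regression with estimated weights):

```python
def weighted_quantile(y, w, tau):
    """Estimate the tau-th quantile from observed values y reweighted by
    inverse observation probabilities w (w_i = 1/P(observed | alive)):
    sort the data and return the first value at which the cumulative
    weight reaches tau of the total weight."""
    pairs = sorted(zip(y, w))
    total = sum(w)
    cum = 0.0
    for value, weight in pairs:
        cum += weight
        if cum >= tau * total:
            return value
    return pairs[-1][0]
```

Up-weighting subjects who were unlikely to be observed corrects the selection induced by dropout and intermittent missingness among those still alive.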

