Addressing Missing Data in Untargeted Metabolomics: Identifying Missingness Type and Evaluating the Impact of Imputation Methods on Experimental Replication

10.33774/chemrxiv-2021-v9djt-v2 ◽

2021 ◽

Author(s):

Trenton J. Davis ◽

Tarek R. Firzli ◽

Emily A. Higgins Keppler ◽

Matt Richardson ◽

Heather D. Bean

Keyword(s):

Missing Data ◽

Biomarker Discovery ◽

Missing At Random ◽

R Package ◽

Untargeted Metabolomics ◽

Data Sets ◽

Metabolomics Data ◽

Regression Imputation ◽

Experimental Replication ◽

The Impact

Missing data is a significant issue in metabolomics that is often neglected during data pre-processing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC×GC) data sets. We also present these goals in the context of experimental replication, whereby imputation is conducted in a within-replicate-based fashion (the first description and evaluation of this strategy), and introduce an R package, MetabImpute, to carry out these analyses. Our results indicate that, in these two data sets, missingness was most likely of the missing at random (MAR) and missing not at random (MNAR) types as opposed to missing completely at random (MCAR). Gibbs sampler imputation and Random Forest gave the best results when imputing MAR and MNAR data compared against single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal components analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches led to an increase in the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.
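The idea of imputing with respect to replication can be illustrated with a small sketch. The Python code below is not the MetabImpute algorithm or its API; it is a simplified illustration that assumes missing values are marked as None and fills each one with the mean of the observed values from the same replicate group. The function name, the group labels, and the choice of a group mean are all hypothetical.

```python
from statistics import mean


def impute_within_replicates(intensities, replicate_ids):
    """Fill missing values (None) for one feature using only same-group replicates.

    Hypothetical helper for illustration; not part of MetabImpute.
    """
    # Collect the observed (non-missing) values for each replicate group.
    observed = {}
    for value, group in zip(intensities, replicate_ids):
        observed.setdefault(group, [])
        if value is not None:
            observed[group].append(value)

    imputed = []
    for value, group in zip(intensities, replicate_ids):
        if value is not None:
            imputed.append(value)                  # keep measured values
        elif observed[group]:
            imputed.append(mean(observed[group]))  # fill from sibling replicates
        else:
            imputed.append(None)                   # nothing observed in this group
    return imputed


# One peak measured in two samples (A and B), three replicates each.
peak = [10.0, None, 14.0, None, None, None]
groups = ["A", "A", "A", "B", "B", "B"]
result = impute_within_replicates(peak, groups)
print(result)
```

In this sketch a group with no observed replicates is left missing rather than borrowing values from other samples, which is one possible way to avoid creating a peak where none was measured.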


IMPUTATION OF MISSING DATA WITH DIFFERENT MISSINGNESS MECHANISM

Jurnal Teknologi ◽

10.11113/jt.v57.1523 ◽

2012 ◽

Vol 57 (1) ◽

Author(s):

HO MING KANG ◽

FADHILAH YUSOF ◽

ISMAIL MOHAMAD

Keyword(s):

Missing Data ◽

Missing Values ◽

Missing At Random ◽

Absolute Error ◽

Data Sets ◽

Missing Completely At Random ◽

Missingness Mechanism ◽

Mean Imputation ◽

The Mean ◽

Estimation Of Missing Data

This paper presents a study on the estimation of missing data. Data samples with different missingness mechanisms, namely Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR), are simulated accordingly. The expectation maximization (EM) algorithm and mean imputation (MI) are applied to these data sets, and their performances are compared using the mean absolute error (MAE) and root mean square error (RMSE). The results showed that EM estimates the missing data with smaller errors than mean imputation (MI) for all three missingness mechanisms. However, the graphical results showed that EM failed to estimate the missing values in the missing quadrants when the situation is MNAR.
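The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: it simulates only the MCAR case, applies only mean imputation (the EM algorithm is omitted), and the sample size, distribution, and missing fraction are arbitrary assumptions.

```python
import math
import random


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def mae_rmse(truth, imputed, missing_idx):
    """Mean absolute error and root mean square error over the imputed positions."""
    errors = [imputed[i] - truth[i] for i in missing_idx]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


random.seed(0)
truth = [random.gauss(50.0, 10.0) for _ in range(200)]

# MCAR: every value has the same chance of being deleted, regardless of its size.
missing_idx = set(random.sample(range(len(truth)), 40))
masked = [None if i in missing_idx else v for i, v in enumerate(truth)]

imputed = mean_impute(masked)
mae, rmse = mae_rmse(truth, imputed, sorted(missing_idx))
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

To mimic MNAR instead, the deletion step could remove values based on their own magnitude (for example, only the smallest ones), in which case the observed mean becomes a biased fill value.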


Other Issues in Statistics I

Critical Thinking in Clinical Research ◽

10.1093/med/9780199324491.003.0013 ◽

2018 ◽

pp. 257-283

Author(s):

Tamara Jorquiera ◽

Hang Lee ◽

Felipe Fregni ◽

Andre Brunoni

Keyword(s):

Missing Data ◽

Trial Design ◽

Case Analysis ◽

Missing At Random ◽

Covariate Adjustment ◽

Missing Not At Random ◽

Single Imputation ◽

Missing Completely At Random ◽

Data Collection Process ◽

The Impact

This chapter discusses the problem of incomplete or missing data. The three types of missing data mechanisms are examined: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). It discusses how to reduce the occurrence of missing data through trial design and by improving the data collection process. The chapter also provides methods to handle missing data during the analysis stage, using strategies such as not replacing the lost data (complete case analysis), replacing each lost value with a single value (single imputation), and replacing the lost data with multiple values for each lost observation (multiple imputation). It then discusses sensitivity analysis, which measures the impact of different methods of handling missing data on the results and helps to justify the choice of the particular method applied. Finally, it reviews covariate adjustment as another topic in statistics.


Assessing the effect of phenotyping scoring systems and SNP calling and filtering methods on detection of QTL associated with reaction of Brassica napus to Sclerotinia sclerotiorum

PhytoFrontiers™ ◽

10.1094/phytofr-10-20-0029-r ◽

2021 ◽

Author(s):

Fereshteh Shahoveisi ◽

Atena Oladzad ◽

Luis E. del Rio Mendoza ◽

Seyedali Hosseinirad ◽

Susan Ruud ◽

...

Keyword(s):

Brassica Napus ◽

Missing Data ◽

Sclerotinia Sclerotiorum ◽

Scoring Systems ◽

Mortality Data ◽

Lesion Length ◽

Data Sets ◽

Qtl Detection ◽

Snp Calling ◽

The Impact

The polyploid nature of canola (Brassica napus) represents a challenge for the accurate identification of single nucleotide polymorphisms (SNPs) and the detection of quantitative trait loci (QTL). In this study, combinations of eight phenotyping scoring systems and six SNP calling and filtering parameters were evaluated for their efficiency in detection of QTL associated with response to Sclerotinia stem rot, caused by Sclerotinia sclerotiorum, in two doubled haploid (DH) canola mapping populations. Most QTL were detected in lesion length, relative areas under the disease progress curve (rAUDPC) for lesion length, and binomial-plant mortality data sets. Binomial data derived from lesion size were less efficient in QTL detection. Inclusion of additional phenotypic sets to the analysis increased the numbers of significant QTL by 2.3-fold; however, the continuous data sets were more efficient. Between two filtering parameters used to analyze genotyping by sequencing (GBS) data, imputation of missing data increased QTL detection in one population with a high level of missing data but not in the other. Inclusion of segregation-distorted SNPs increased QTL detection but did not impact their R² values significantly. Twelve of the 16 detected QTL were on chromosomes A02 and C01, and the rest were on A07, A09, and C03. Marker A02-7594120, associated with a QTL on chromosome A02, was detected in both populations. Results of this study suggest that the impact of genotypic variant calling and filtering parameters may be population dependent, while deriving additional phenotyping scoring systems such as rAUDPC datasets and mortality binary may improve QTL detection efficiency.


Exploring the Impact of Missing Data on Residual-Based Dimensionality Analysis for Measurement Models

Educational and Psychological Measurement ◽

10.1177/0013164420939634 ◽

2020 ◽

pp. 001316442093963

Author(s):

Stefanie A. Wind ◽

Randall E. Schumacker

Keyword(s):

Missing Data ◽

Parallel Analysis ◽

Supplementary Information ◽

Parameter Estimates ◽

Additional Information ◽

Measurement Models ◽

Standardized Residuals ◽

Components Analysis ◽

Survey Responses ◽

The Impact

Researchers frequently use Rasch models to analyze survey responses because these models provide accurate parameter estimates for items and examinees when there are missing data. However, researchers have not fully considered how missing data affect the accuracy of dimensionality assessment in Rasch analyses such as principal components analysis (PCA) of standardized residuals. Because adherence to unidimensionality is a prerequisite for the appropriate interpretation and use of Rasch model results, insight into the impact of missing data on the accuracy of this approach is critical. We used a simulation study to examine the accuracy of standardized residual PCA with various proportions of missing data and multidimensionality. We also explored an adaptation of modified parallel analysis in combination with standardized residual PCA as a source of additional information about dimensionality when missing data are present. Our results suggested that missing data impact the accuracy of PCA on standardized residuals, and that the adaptation of modified parallel analysis provides useful supplementary information about dimensionality when there are missing data.


P427 A hybrid approach of handling missing data in inflammatory bowel disease (IBD) trials: results from VISIBLE 1 and VARSITY

Journal of Crohn's and Colitis ◽

10.1093/ecco-jcc/jjz203.556 ◽

2020 ◽

Vol 14 (Supplement_1) ◽

pp. S388-S389

Author(s):

J Chen ◽

S Hunter ◽

K Kisfalvi ◽

R A Lirio

Keyword(s):

Sensitivity Analysis ◽

Missing Data ◽

Statistical Power ◽

Hybrid Approach ◽

Missing At Random ◽

P Value ◽

Two Phase ◽

Treatment Difference ◽

Mayo Score ◽

The Impact

Background: Missing data is common in IBD trials. Depending on the volume and nature of missing data, it can reduce statistical power for detecting a treatment difference, introduce potential bias, and invalidate conclusions. Non-responder imputation (NRI), where patients with missing data are considered treatment failures, is widely used to handle missing data for dichotomous efficacy endpoints in IBD trials. However, it does not consider the mechanisms leading to missing data and can potentially underestimate the treatment effect. We proposed a hybrid (HI) approach combining NRI and multiple imputation (MI) as an alternative to NRI in the analyses of two phase 3 trials of vedolizumab (VDZ) in patients with moderate-to-severe UC: VISIBLE 1 [1] and VARSITY [2]. Methods: VISIBLE 1 and VARSITY assessed efficacy using dichotomous endpoints based on the complete Mayo score. Full methodologies have been reported previously [1,2]. Our proposed HI approach is aimed at imputing missing Mayo scores, instead of imputing the missing dichotomous efficacy endpoint. To assess the impact of dropouts under different missing data mechanisms (categorised as 'missing not at random [MNAR]' and 'missing at random [MAR]'), HI was implemented as a potential sensitivity analysis, where dropouts owing to safety or lack of efficacy were imputed using NRI (assuming MNAR) and other missing data were imputed using MI (assuming MAR). For MI, each component of the Mayo score was imputed via a multivariate stepwise approach using a fully conditional specification ordinal logistic method. Missing baseline scores were imputed using baseline characteristics data. Missing scores from each subsequent visit were imputed using all previous visits in a stepwise fashion. Fifty imputation datasets were computed for each component of the Mayo score. The complete Mayo score and relevant efficacy endpoints were derived subsequently. The analysis was performed within each imputed dataset to determine the treatment difference, 95% CI and p-value, which were then combined via Rubin's rules [3]. Results: Tables 1 and 2 show a comparison of efficacy in the two studies using the primary NRI analysis vs. the alternative HI approach for handling missing data. Conclusion: HI and NRI approaches can provide consistent efficacy analyses in IBD trials. The HI approach can serve as a useful sensitivity analysis to assess the impact of dropouts under different missing data mechanisms and evaluate the robustness of efficacy conclusions.
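The pooling step mentioned in the Methods, combining per-dataset results via Rubin's rules, can be sketched as follows. This is a generic illustration of the standard formulas (pooled estimate = mean of the per-dataset estimates; total variance = within-imputation variance + (1 + 1/m) × between-imputation variance), not the trial's analysis code. The function name and the three example estimates are made up; the abstract itself used fifty imputed datasets.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical treatment differences and squared standard errors
# from three imputed datasets.
estimates = [0.10, 0.12, 0.14]
variances = [0.004, 0.004, 0.004]
q_bar, total_var = pool_rubin(estimates, variances)
print(f"pooled difference = {q_bar:.3f}, pooled SE = {total_var ** 0.5:.4f}")
```

The between-imputation term is what makes the pooled standard error larger than a single-dataset one: it carries the extra uncertainty that comes from not knowing the missing values.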


Do Alignment and Trimming Methods Matter for Phylogenomic (UCE) Analyses?

Systematic Biology ◽

10.1093/sysbio/syaa064 ◽

2020 ◽

Cited By ~ 1

Author(s):

Daniel M Portik ◽

John J Wiens

Keyword(s):

Missing Data ◽

Molecular Phylogenetics ◽

Species Tree ◽

Sequence Length ◽

Data Sets ◽

Full Data ◽

Tree Methods ◽

Phylogenomic Analyses ◽

Alignment Errors ◽

The Impact

Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from hundreds or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these data sets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data (~5000 loci) and subsampled data sets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic data sets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several “best practices” for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the phylogenetic results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses.
[Alignment; concatenated analysis; phylogenomics; sequence length heterogeneity; species-tree analysis; trimming]


Multiple Imputation of Multilevel Missing Data

SAGE Open ◽

10.1177/2158244016668220 ◽

2016 ◽

Vol 6 (4) ◽

pp. 215824401666822 ◽

Cited By ~ 17

Author(s):

Simon Grund ◽

Oliver Lüdtke ◽

Alexander Robitzsch

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Multilevel Models ◽

R Package ◽

Data Sets ◽

Multilevel Data ◽

Statistical Knowledge ◽

Multilevel Research ◽

User Friendly ◽

High Degree

The treatment of missing data can be difficult in multilevel research because state-of-the-art procedures such as multiple imputation (MI) may require advanced statistical knowledge or a high degree of familiarity with certain statistical software. In the missing data literature, pan has been recommended for MI of multilevel data. In this article, we provide an introduction to MI of multilevel missing data using the R package pan, and we discuss its possibilities and limitations in accommodating typical questions in multilevel research. To make pan more accessible to applied researchers, we make use of the mitml package, which provides a user-friendly interface to the pan package and several tools for managing and analyzing multiply imputed data sets. We illustrate the use of pan and mitml with two empirical examples that represent common applications of multilevel models, and we discuss how these procedures may be used in conjunction with other software.


JMASM 54: A Comparison of Four Different Estimation Approaches for Prognostic Survival Oral Cancer Model

Journal of Modern Applied Statistical Methods ◽

10.22237/jmasm/1594045320 ◽

2020 ◽

Vol 18 (2) ◽

pp. 2-6

Author(s):

Thomas R. Knapp

Keyword(s):

Missing Data ◽

Oral Cancer ◽

Missing At Random ◽

Cancer Model ◽

Missing Not At Random ◽

Opposing View ◽

Missing Completely At Random ◽

Almost All

Rubin (1976, and elsewhere) claimed that there are three kinds of “missingness”: missing completely at random; missing at random; and missing not at random. He gave examples of each. The article that now follows takes an opposing view by arguing that almost all missing data are missing not at random.


Using the CES-D scale in a large cohort study and dealing with missing data: Application to the French E3N cohort

European Psychiatry ◽

10.1016/s0924-9338(11)72279-9 ◽

2011 ◽

Vol 26 (S2) ◽

pp. 572-572

Author(s):

N. Resseguier ◽

H. Verdoux ◽

F. Clavel-Chapelon ◽

X. Paoletti

Keyword(s):

Sensitivity Analysis ◽

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Large Population ◽

Missing At Random ◽

Population Based ◽

Missing Value ◽

Perform Sensitivity Analysis ◽

The Impact

Introduction: The CES-D scale is commonly used to assess depressive symptoms (DS) in large population-based studies. Missing values in items of the scale may create biases. Objectives: To explore reasons for not completing items of the CES-D scale and to perform sensitivity analysis of the prevalence of DS to assess the impact of different missing data hypotheses. Methods: In 2005, 71,412 women included in the French E3N cohort returned a questionnaire containing the CES-D scale; 45% had at least one missing value in the scale. An interview study was carried out on a random sample of 204 participants to examine the different hypotheses for the missing value mechanism. The prevalence of DS was estimated according to different methods for handling missing values: complete case analysis, single imputation, and multiple imputation under MAR (missing at random) and MNAR (missing not at random) assumptions. Results: The interviews showed that participants were not embarrassed to fill in questions about DS. Potential reasons for nonresponse were identified. MAR and MNAR hypotheses remained plausible and were explored. Among complete responders, the prevalence of DS was 26.1%. After multiple imputation under the MAR assumption, it was 28.6%, 29.8% and 31.7% among women presenting up to 4, up to 10 and up to 20 missing values, respectively. The estimates were robust after applying various scenarios of MNAR data for the sensitivity analysis. Conclusions: The CES-D scale can easily be used to assess DS in large cohorts. Multiple imputation under the MAR assumption allows missing values to be handled reliably.
