A Statistical Framework for Data Purification with Application to Microbiome Data Analysis

2021 ◽  
Author(s):  
Zequn Sun ◽  
Jing Zhao ◽  
Zhaoqian Liu ◽  
Qin Ma ◽  
Dongjun Chung

Abstract: Identification of disease-associated microbial species is of great biological and clinical interest. However, this investigation remains challenging due to heterogeneity in microbial composition between individuals, data quality issues, and complex relationships among species. In this paper, we propose a novel data purification algorithm that eliminates noisy observations, which increases the statistical power to detect disease-associated microbial species. We illustrate the proposed algorithm using metagenomic data generated from colorectal cancer patients.

F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 1492 ◽  
Author(s):  
Ben J. Callahan ◽  
Kris Sankaran ◽  
Julia A. Fukuyama ◽  
Paul J. McMurdie ◽  
Susan P. Holmes

High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or microbial composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, including both parametric and nonparametric methods. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2 and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests, partial least squares and linear models as well as nonparametric testing using community networks and the ggnetwork package.
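The "subsampling to equalize library sizes" step mentioned above (often called rarefying) can be sketched as follows. This is only an illustration of the common approach that the abstract contrasts with model-based estimates, not the authors' R workflow; the function name `rarefy`, the seed, and the toy counts are assumptions made for the example.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a per-taxon count vector, without replacement, down to `depth` reads."""
    total = sum(counts)
    if depth > total:
        raise ValueError("depth exceeds library size")
    rng = random.Random(seed)
    # Expand counts into one taxon index per read, then draw `depth` reads.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(reads, depth)
    rarefied = [0] * len(counts)
    for taxon in kept:
        rarefied[taxon] += 1
    return rarefied


# Hypothetical samples with very different library sizes (100 vs 1000 reads).
sample_a = [50, 30, 20, 0]
sample_b = [500, 250, 240, 10]
depth = min(sum(sample_a), sum(sample_b))
rarefied_a = rarefy(sample_a, depth)
rarefied_b = rarefy(sample_b, depth)
print(rarefied_a, rarefied_b)
```

Note that rarefying discards 900 of the 1000 reads in the deeper sample, which is the kind of information loss that motivates the statistical models the abstract refers to.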


2020 ◽  
Author(s):  
Quy Xuan Cao ◽  
Xinxin Sun ◽  
Karun Rajesh ◽  
Naga Chalasani ◽  
Kayla Gelow ◽  
...  

Abstract Background: Accuracy of microbial community detection in 16S rRNA marker-gene and metagenomic studies suffers from contamination and sequencing errors that lead to either falsely identifying microbial taxa that were not in the sample or misclassifying the taxa of DNA fragment reads. Filtering is defined as removing taxa that are present in a small number of samples and have small counts in the samples where they are observed. This approach reduces extreme sparsity of microbiome data and has been shown to correctly remove contaminant taxa in cultured "mock" datasets, where the true taxa compositions are known. Although filtering is frequently used, careful evaluation of its effect on the data analysis and scientific conclusions remains unreported. Here, we assess the effect of filtering on the alpha and beta diversity estimation, as well as its impact on identifying taxa that discriminate between disease states. Results: The effect of filtering on microbiome data analysis is illustrated on four datasets: two mock quality control datasets where the same cultured samples with known microbial composition are processed at different labs and two disease study datasets. Results show that in microbiome quality control datasets, filtering reduces the magnitude of differences in alpha diversity and alleviates technical variability between labs, while preserving between-sample similarity (beta diversity). In the disease study datasets, DESeq2 and linear discriminant analysis Effect Size (LEfSe) methods were used to identify taxa that are differentially abundant across groups of samples, and random forest models were used to rank features with the largest contribution towards disease classification. Results reveal that filtering retains significant taxa and preserves the model classification ability measured by the area under the receiver operating characteristic curve (AUC). 
The comparison between filtering and contaminant removal method shows that they have complementary effects and are advised to be used in conjunction. Conclusions: Filtering reduces the complexity of microbiome data, while preserving their integrity in downstream analysis. This leads to mitigation of the classification methods' sensitivity and reduction of technical variability, allowing researchers to generate more reproducible and comparable results in microbiome data analysis.
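As a rough illustration of the filtering rule defined in this abstract (drop taxa that are both present in few samples and low in count where present), a minimal sketch might look like the following. The thresholds `min_prevalence` and `min_count`, the function name, and the toy table are assumptions for the example; the abstract does not state the cutoffs the authors used.

```python
def filter_rare_taxa(table, min_prevalence=0.5, min_count=5):
    """Keep taxa unless they are BOTH rare across samples AND low in count where present.

    table: dict mapping taxon name -> list of read counts, one per sample.
    """
    kept = {}
    for taxon, counts in table.items():
        present = [c for c in counts if c > 0]
        prevalence = len(present) / len(counts)   # fraction of samples with the taxon
        max_count = max(present, default=0)       # largest count where observed
        is_rare = prevalence < min_prevalence
        is_low = max_count < min_count
        if not (is_rare and is_low):
            kept[taxon] = counts
    return kept


# Hypothetical count table: four samples, four taxa.
table = {
    "taxon_A": [120, 98, 143, 110],  # common and abundant: kept
    "taxon_B": [0, 0, 2, 0],         # rare and low count: removed
    "taxon_C": [0, 0, 350, 0],       # rare but abundant where present: kept
    "taxon_D": [1, 2, 0, 3],         # low count but prevalent: kept
}
filtered = filter_rare_taxa(table)
print(sorted(filtered))
```

Requiring both conditions follows the wording of the definition; a stricter variant that removes taxa meeting either condition would be a different design choice.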


2021 ◽  
Author(s):  
Andrew Hinton ◽  
Peter J. Mucha

Abstract Background: Numerous metagenomic studies aim to discover associations between the microbial composition of an environment (e.g., gut, skin, oral) and a phenotype of interest. Multivariate analysis (MVA) is often performed in these studies without critical a priori knowledge of which taxa are associated with the phenotype being studied. Consequently, non-parametric MVA methods are applied directly to all taxa surveyed independent of noise. This approach typically reduces statistical power in settings where true associations among only a few taxa are obscured by high dimensionality (i.e., sparse association signals). At the same time, the inclusion of all taxa can confound the extraction of key biological insights. Further, low sample size and compositional sample space constraints exist in these data whereby beyond-study generalizability may be reduced if not properly accounted for. More powerful association tests that are interpretable and directly account for compositional constraints while detecting sparse association signals are needed. Methods: We developed Selection-Energy-Permutation (SelEnergyPerm), a non-parametric group association test with embedded feature selection. SelEnergyPerm directly accounts for compositional constraints by selecting parsimonious log ratio signatures from the set of all pairwise log ratios (PLR) between features (OTUs, taxa, etc.). To do this, network methods are used to rank, select, and maximize the between-group association of a candidate log ratio subset. This process is then repeated with an appropriate permutation testing design to simultaneously determine the significance of the selected signatures and association. Results: Simulation results show SelEnergyPerm selects small independent sets of log ratios that capture strong associations in a range of scenarios with small and large dimensional feature spaces. 
Additionally, our simulation results demonstrate SelEnergyPerm consistently detects/rejects associations in synthetic data with sparse, dense, or no association signals. We demonstrate the novel benefits of our method in four case studies utilizing publicly available 16S rRNA and whole-genome sequencing datasets. Conclusions: Tools to analyze complex high-dimensional metagenomic datasets with sparse association signals using robust PLR have not been sufficiently developed previously. We propose SelEnergyPerm, a novel framework for the discovery of phenotype-associated, metagenomic log ratio signatures for characterizing and understanding alterations in microbial community structure. SelEnergyPerm is implemented in R, available at https://github.com/andrew84830813/selEnergyPermR.
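Two building blocks named in this abstract, pairwise log ratios (PLR) between features and permutation testing, can be sketched in isolation. This is not the SelEnergyPerm implementation: the network-based selection step is omitted, and a simple absolute difference in group means stands in for the method's own association statistic. The pseudocount, function names, and toy data are all assumptions for illustration.

```python
import math
import random
from itertools import combinations


def pairwise_log_ratios(counts, pseudocount=1.0):
    """Return {(i, j): log((x_i + pc) / (x_j + pc))} for all feature pairs i < j."""
    return {
        (i, j): math.log((counts[i] + pseudocount) / (counts[j] + pseudocount))
        for i, j in combinations(range(len(counts)), 2)
    }


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Permutation p-value for the absolute difference in group means (stand-in statistic)."""
    rng = random.Random(seed)

    def stat(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        # Small tolerance guards against floating-point ties.
        if stat(pooled[:n_a], pooled[n_a:]) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical counts for three taxa in five cases and five controls.
cases = [[80, 10, 10], [90, 5, 5], [85, 10, 5], [75, 15, 10], [88, 6, 6]]
controls = [[10, 80, 10], [5, 90, 5], [12, 78, 10], [8, 85, 7], [15, 75, 10]]

# Test one candidate log ratio: taxon 0 versus taxon 1.
lr_cases = [pairwise_log_ratios(s)[(0, 1)] for s in cases]
lr_controls = [pairwise_log_ratios(s)[(0, 1)] for s in controls]
p_value = permutation_pvalue(lr_cases, lr_controls)
print(p_value)
```

Working with log ratios rather than raw proportions is what lets a test of this kind respect the compositional constraint, since a ratio between two taxa does not change when the total read count does.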


2021 ◽  
Vol 11 ◽  
Author(s):  
Quy Cao ◽  
Xinxin Sun ◽  
Karun Rajesh ◽  
Naga Chalasani ◽  
Kayla Gelow ◽  
...  

Background: The accuracy of microbial community detection in 16S rRNA marker-gene and metagenomic studies suffers from contamination and sequencing errors that lead to either falsely identifying microbial taxa that were not in the sample or misclassifying the taxa of DNA fragment reads. Removing contaminants and filtering rare features are two common approaches to deal with this problem. While contaminant detection methods use auxiliary sequencing process information to identify known contaminants, filtering methods remove taxa that are present in a small number of samples and have small counts in the samples where they are observed. The latter approach reduces the extreme sparsity of microbiome data and has been shown to correctly remove contaminant taxa in cultured “mock” datasets, where the true taxa compositions are known. Although filtering is frequently used, careful evaluation of its effect on the data analysis and scientific conclusions remains unreported. Here, we assess the effect of filtering on the alpha and beta diversity estimation as well as its impact on identifying taxa that discriminate between disease states. Results: The effect of filtering on microbiome data analysis is illustrated on four datasets: two mock quality control datasets where the same cultured samples with known microbial composition are processed at different labs and two disease study datasets. Results show that in microbiome quality control datasets, filtering reduces the magnitude of differences in alpha diversity and alleviates technical variability between labs while preserving the between-sample similarity (beta diversity). In the disease study datasets, DESeq2 and linear discriminant analysis Effect Size (LEfSe) methods were used to identify taxa that are differentially abundant across groups of samples, and random forest models were used to rank features with the largest contribution toward disease classification. 
Results reveal that filtering retains significant taxa and preserves the model classification ability measured by the area under the receiver operating characteristic curve (AUC). The comparison between the filtering and the contaminant removal method shows that they have complementary effects and are advised to be used in conjunction. Conclusions: Filtering reduces the complexity of microbiome data while preserving their integrity in downstream analysis. This leads to mitigation of the classification methods' sensitivity and reduction of technical variability, allowing researchers to generate more reproducible and comparable results in microbiome data analysis.
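To make the alpha and beta diversity quantities concrete, here is a small sketch using two widely used measures: the Shannon index for within-sample (alpha) diversity and Bray-Curtis dissimilarity for between-sample (beta) diversity. The abstract does not say which indices the authors computed, so these choices, the function names, and the toy counts are assumptions.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors (0 = identical, 1 = disjoint)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


# Hypothetical sample: two dominant taxa plus two singletons that filtering would drop.
unfiltered = [100, 100, 1, 1]
filtered = [100, 100, 0, 0]

print(shannon(unfiltered), shannon(filtered))
print(bray_curtis(unfiltered, filtered))
```

In this toy case, removing the two singleton taxa lowers the Shannon index slightly while the Bray-Curtis dissimilarity between the filtered and unfiltered profiles stays close to zero, which mirrors the qualitative pattern the abstract describes.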


2021 ◽  
Author(s):  
Yueqiong Ni ◽  
Zoltan Lohinai ◽  
Yoshitaro Heshiki ◽  
Balazs Dome ◽  
Judit Moldvay ◽  
...  

Abstract: Cachexia is associated with decreased survival in cancer patients and has a prevalence of up to 80%. The etiology of cachexia is poorly understood, and limited treatment options exist. Here, we investigated the role of the human gut microbiome in cachexia by integrating shotgun metagenomics and plasma metabolomics of 31 lung cancer patients. The cachexia group showed significant differences in the gut microbial composition, functional pathways of the metagenome, and the related plasma metabolites compared to non-cachectic patients. Branched-chain amino acids (BCAAs), methylhistamine, and vitamins were significantly depleted in the plasma of cachexia patients, which was also reflected in the depletion of relevant gut microbiota functional pathways. The enrichment of BCAAs and 3-oxocholic acid in non-cachectic patients was positively correlated with the gut microbial species Prevotella copri and Lactobacillus gasseri, respectively. Furthermore, the gut microbiota capacity for lipopolysaccharide biosynthesis was significantly enriched in cachectic patients. The involvement of the gut microbiome in cachexia was further observed in a high-performance machine learning model using solely gut microbial features. Our study demonstrates the links between cachectic host metabolism and specific gut microbial species and functions in a clinical setting, suggesting that the gut microbiota could have an influence on cachexia with possible therapeutic applications.


2020 ◽  
Vol 16 (12) ◽  
pp. e1008473
Author(s):  
Pamela N. Luna ◽  
Jonathan M. Mansbach ◽  
Chad A. Shaw

Changes in the composition of the microbiome over time are associated with myriad human illnesses. Unfortunately, the lack of analytic techniques has hindered researchers’ ability to quantify the association between longitudinal microbial composition and time-to-event outcomes. Prior methodological work developed the joint model for longitudinal and time-to-event data to incorporate time-dependent biomarker covariates into the hazard regression approach to disease outcomes. The original implementation of this joint modeling approach employed a linear mixed effects model to represent the time-dependent covariates. However, when the distribution of the time-dependent covariate is non-Gaussian, as is the case with microbial abundances, researchers require different statistical methodology. We present a joint modeling framework that uses a negative binomial mixed effects model to determine longitudinal taxon abundances. We incorporate these modeled microbial abundances into a hazard function with a parameterization that not only accounts for the proportional nature of microbiome data, but also generates biologically interpretable results. Herein we demonstrate the performance improvements of our approach over existing alternatives via simulation as well as a previously published longitudinal dataset studying the microbiome during pregnancy. The results demonstrate that our joint modeling framework for longitudinal microbiome count data provides a powerful methodology to uncover associations between changes in microbial abundances over time and the onset of disease. This method offers the potential to equip researchers with a deeper understanding of the associations between longitudinal microbial composition changes and disease outcomes. This new approach could potentially lead to new diagnostic biomarkers or inform clinical interventions to help prevent or treat disease.


mSystems ◽  
2021 ◽  
Vol 6 (1) ◽  
Author(s):  
Gongchao Jing ◽  
Lu Liu ◽  
Zengbin Wang ◽  
Yufeng Zhang ◽  
Li Qian ◽  
...  

ABSTRACT Metagenomic data sets from diverse environments have been growing rapidly. To ensure accessibility and reusability, tools that quickly and informatively correlate new microbiomes with existing ones are in demand. Here, we introduce Microbiome Search Engine 2 (MSE 2), a microbiome database platform for searching query microbiomes in the global metagenome data space based on the taxonomic or functional similarity of a whole microbiome to those in the database. MSE 2 consists of (i) a well-organized and regularly updated microbiome database that currently contains over 250,000 metagenomic shotgun and 16S rRNA gene amplicon samples associated with unified metadata collected from 798 studies, (ii) an enhanced search engine that enables real-time and fast (<0.5 s per query) searches against the entire database for best-matched microbiomes using overall taxonomic or functional profiles, and (iii) a Web-based graphical user interface for user-friendly searching, data browsing, and tutoring. MSE 2 is freely accessible via http://mse.ac.cn. For standalone searches of customized microbiome databases, the kernel of the MSE 2 search engine is provided at GitHub (https://github.com/qibebt-bioinfo/meta-storms). IMPORTANCE A search-based strategy is useful for large-scale mining of microbiome data sets, such as a bird’s-eye view of the microbiome data space and disease diagnosis via microbiome big data. Here, we introduce Microbiome Search Engine 2 (MSE 2), a microbiome database platform for searching query microbiomes against the existing microbiome data sets on the basis of their similarity in taxonomic structure or functional profile. Key improvements include database extension, data compatibility, a search engine kernel, and a user interface. The new ability to search the microbiome space via functional similarity greatly expands the scope of search-based mining of the microbiome big data.


2017 ◽  
Vol 34 (8) ◽  
pp. 1411-1413 ◽  
Author(s):  
Nick Weber ◽  
David Liou ◽  
Jennifer Dommer ◽  
Philip MacMenamin ◽  
Mariam Quiñones ◽  
...  

2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 2570-2570
Author(s):  
Alya Heirali ◽  
Bo Chen ◽  
Matthew Wong ◽  
Pierre H.H. Schneeberger ◽  
Victor Rey ◽  
...  

2570 Background: A number of studies have demonstrated that the gut microbiome of responders to immune checkpoint inhibitors (ICI) is compositionally different compared to that of non-responders. However, differences in study design, patient cohorts and bioinformatic analyses make it challenging to identify bacterial species consistently associated with response to ICI across different cohorts and cancer types. Methods: We leveraged the statistical power of mega- and meta-analyses to identify bacterial species consistently associated with response to ICI using data from three published fecal metagenomic studies (Gopalakrishnan et al., Science 2018; Matson et al., Science 2018; Routy et al., Science 2018). Metagenomic data was uniformly processed and analyzed using MetaPhlAn v2.0. We conducted a two-part modelling approach of bacterial species present in at least 20% of samples to account for both prevalence and relative abundance differences between responders/non-responders. Results: A total of 190 patients (n = 103 responders; n = 87 non-responders) were included from the three studies. Data from Routy et al. was analyzed as subsets based on tumor type for a total of 4 analyzed cohorts. We identified five species including Bacteroides thetaiotaomicron, Clostridium bolteae, Holdemania filiformis, Clostridiaceae bacterium JC118 and Escherichia coli that were concordantly significantly different between responders and non-responders using both meta- and mega-analyses. B. thetaiotaomicron and Clostridium bolteae relative abundance (RA) were independently predictive of non-response to immunotherapy when data sets were combined and analyzed using mega-analyses (AUC 0.59 95% CI 0.51-0.68 and AUC 0.61 95% CI 0.52-0.69, respectively). 
Conclusions: Despite inter-cohort heterogeneity in tumor type, treatment regimens, and sequencing modalities, meta- and mega-analysis of published metagenomic studies identified generalizable bacterial species associated with ICI response or lack thereof. B. thetaiotaomicron and C. bolteae were predictors of non-response to ICI, suggesting the clinical potential of narrow-spectrum antibiotics targeting non-response-associated bacterial species to improve outcomes in ICI recipients.
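The two quantities reported in this abstract, a prevalence/abundance split and a per-species AUC, can be illustrated with a short sketch. The abstract does not specify the statistical tests behind its two-part model, so the summary below only separates the two parts descriptively, and the AUC is computed from its rank-based definition. Function names and the relative-abundance values are invented for the example.

```python
def two_part_summary(abundances):
    """Split a species' relative abundances into prevalence and mean abundance when present."""
    nonzero = [a for a in abundances if a > 0]
    prevalence = len(nonzero) / len(abundances)
    mean_when_present = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return prevalence, mean_when_present


def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive outscores a negative (ties count as 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical relative abundances of one species (illustrative values only).
non_responders = [0.00, 0.03, 0.05, 0.02, 0.08]
responders = [0.00, 0.00, 0.01, 0.04, 0.00]

print(two_part_summary(non_responders))
print(two_part_summary(responders))
# Treat non-response as the positive class, as in the abstract.
print(auc(non_responders, responders))
```

An AUC of 0.5 corresponds to chance-level discrimination, which puts the reported values of 0.59 and 0.61 in context as modest single-species predictors.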


2020 ◽  
Vol 21 (S6) ◽  
Author(s):  
Yuanyuan Ma ◽  
Junmin Zhao ◽  
Yingjun Ma

Abstract Background With the rapid development of high-throughput techniques, vast amounts of heterogeneous omics data have accumulated (e.g., genomics, proteomics and metabolomics data). Integrating information from multiple sources or views to obtain profound insight into the complicated relations among micro-organisms, nutrients and host environment is challenging. In this paper we propose a multi-view Hessian regularization based symmetric nonnegative matrix factorization algorithm (MHSNMF) for clustering heterogeneous microbiome data. Compared with many existing approaches, the advantages of MHSNMF lie in: (1) MHSNMF combines multiple Hessian regularization to leverage the high-order information from the same cohort of instances with multiple representations; (2) MHSNMF utilizes the advantages of SNMF and naturally handles the complex relationship among microbiome samples; (3) using the consensus matrix obtained by MHSNMF, we also design a novel approach to predict the classification of new microbiome samples. Results We conduct extensive experiments on two real-world datasets (Three-source dataset and Human Microbiome Plan dataset). The experimental results show that the proposed MHSNMF algorithm outperforms other baseline and state-of-the-art methods. Compared with other methods, MHSNMF achieves the best performance (accuracy: 95.28%, normalized mutual information: 91.79%) on microbiome data. This suggests the potential application of MHSNMF in microbiome data analysis. Conclusions Results show that the proposed MHSNMF algorithm can effectively combine the phylogenetic, transporter, and metabolic profiles into a unified paradigm to analyze the relationships among different microbiome samples. Furthermore, the proposed prediction method based on MHSNMF has been shown to be effective in judging the types of new microbiome samples.

