Uniform Manifold Approximation and Projection (UMAP) Reveals Composite Patterns and Resolves Visualization Artifacts in Microbiome Data

mSystems ◽  
2021 ◽  
Author(s):  
George Armstrong ◽  
Cameron Martino ◽  
Gibraan Rahman ◽  
Antonio Gonzalez ◽  
Yoshiki Vázquez-Baeza ◽  
...  

UMAP provides an additional method to visualize microbiome data. The method is extensible to any beta diversity metric used with PCoA, and our results demonstrate that UMAP can indeed improve visualization quality and correspondence with biological and technical variables of interest.

2020 ◽  
Author(s):  
Quy Xuan Cao ◽  
Xinxin Sun ◽  
Karun Rajesh ◽  
Naga Chalasani ◽  
Kayla Gelow ◽  
...  

Abstract Background: Accuracy of microbial community detection in 16S rRNA marker-gene and metagenomic studies suffers from contamination and sequencing errors that lead to either falsely identifying microbial taxa that were not in the sample or misclassifying the taxa of DNA fragment reads. Filtering is defined as removing taxa that are present in a small number of samples and have small counts in the samples where they are observed. This approach reduces extreme sparsity of microbiome data and has been shown to correctly remove contaminant taxa in cultured "mock" datasets, where the true taxa compositions are known. Although filtering is frequently used, careful evaluation of its effect on the data analysis and scientific conclusions remains unreported. Here, we assess the effect of filtering on the alpha and beta diversity estimation, as well as its impact on identifying taxa that discriminate between disease states. Results: The effect of filtering on microbiome data analysis is illustrated on four datasets: two mock quality control datasets where the same cultured samples with known microbial composition are processed at different labs and two disease study datasets. Results show that in microbiome quality control datasets, filtering reduces the magnitude of differences in alpha diversity and alleviates technical variability between labs, while preserving between-sample similarity (beta diversity). In the disease study datasets, DESeq2 and linear discriminant analysis Effect Size (LEfSe) methods were used to identify taxa that are differentially abundant across groups of samples, and random forest models were used to rank features with the largest contribution towards disease classification. Results reveal that filtering retains significant taxa and preserves the model classification ability measured by the area under the receiver operating characteristic curve (AUC). 
The comparison between filtering and the contaminant removal method shows that they have complementary effects, and using them in conjunction is advised. Conclusions: Filtering reduces the complexity of microbiome data, while preserving their integrity in downstream analysis. This leads to mitigation of the classification methods' sensitivity and reduction of technical variability, allowing researchers to generate more reproducible and comparable results in microbiome data analysis.
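
The filtering rule defined in the abstract (drop taxa that occur in few samples and have small counts where they do occur) can be illustrated with a minimal sketch. The function name `filter_rare_taxa`, the thresholds `min_samples` and `min_count`, and the toy table are assumptions made for illustration; the abstract does not state the cutoffs the authors used.

```python
from typing import Dict, List


def filter_rare_taxa(
    counts: Dict[str, List[int]],
    min_samples: int = 2,   # hypothetical prevalence cutoff
    min_count: int = 5,     # hypothetical abundance cutoff
) -> Dict[str, List[int]]:
    """Remove taxa that are BOTH low-prevalence and low-count.

    counts maps a taxon name to its read counts, one entry per sample.
    """
    kept = {}
    for taxon, row in counts.items():
        observed = [c for c in row if c > 0]
        # Present in only a small number of samples.
        rare = len(observed) < min_samples
        # Small counts in every sample where the taxon is observed.
        low = all(c < min_count for c in observed)
        if not (rare and low):
            kept[taxon] = row
    return kept


table = {
    "taxon_a": [120, 98, 143, 110],  # abundant everywhere: kept
    "taxon_b": [0, 0, 3, 0],         # one sample, tiny count: removed
    "taxon_c": [0, 0, 250, 0],       # one sample, large count: kept
    "taxon_d": [1, 2, 1, 3],         # small counts, all samples: kept
}
filtered = filter_rare_taxa(table)
print(sorted(filtered))
```

Requiring both conditions keeps a taxon that is rare but abundant in one sample, which matches the "and" in the definition above; a stricter variant that filters on either condition alone would be a different rule.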


Biometrics ◽  
2021 ◽  
Author(s):  
J. Liu ◽  
Xinlian Zhang ◽  
T. Chen ◽  
T. Wu ◽  
T. Lin ◽  
...  

2021 ◽  
Author(s):  
Yushu Shi ◽  
Liangliang Zhang ◽  
Christine Peterson ◽  
Kim-Anh Do ◽  
Robert Jenq

Abstract Background: In microbiome data analysis, unsupervised clustering is often used to identify naturally occurring clusters, which can then be assessed for associations with characteristics of interest. In this work, we systematically compared beta diversity and clustering methods commonly used in microbiome analyses. We applied these to four published datasets where highly distinct microbiome profiles could be seen between sample groups. Results: Although no single method outperformed the others consistently, we did identify key scenarios where certain methods can underperform. Specifically, the Bray Curtis metric resulted in poor clustering in a dataset where high-abundance OTUs were relatively rare. In contrast, the unweighted UniFrac metric clustered poorly when used on a dataset with a high prevalence of low-abundance OTUs. To test our proposition, we systematically modified properties of the poorly performing datasets and found that this approach resulted in improved Bray Curtis and unweighted UniFrac performance. Based on these observations, we rationally combined the Bray Curtis metric and the unweighted UniFrac metrics and found that this new beta diversity metric showed high performance across all datasets. We also evaluated our findings by examining a clinical dataset where clusters are less separated. Conclusions: Our systematic evaluation of clustering performance in these five datasets demonstrates that there is no existing clustering method that universally performs best across all datasets. We propose a combined metric of Bray Curtis and unweighted UniFrac that capitalizes on the complementary strengths of the two metrics.
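
As a point of reference for the metrics compared above, here is a minimal sketch of the standard Bray-Curtis dissimilarity between two samples, plus a pairwise distance matrix. Unweighted UniFrac needs a phylogenetic tree and is not shown, and the abstract does not say how the two metrics were combined, so no combination is attempted here. The function names and toy counts are illustrative.

```python
from typing import List, Sequence


def bray_curtis(x: Sequence[float], y: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    if len(x) != len(y):
        raise ValueError("samples must have the same number of taxa")
    total = sum(x) + sum(y)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return sum(abs(a - b) for a, b in zip(x, y)) / total


def distance_matrix(samples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    return [[bray_curtis(a, b) for b in samples] for a in samples]


samples = [
    [10, 0, 5],
    [5, 5, 5],
    [0, 20, 0],
]
dm = distance_matrix(samples)
for row in dm:
    print([round(v, 3) for v in row])
```

Because the numerator is driven by absolute count differences, high-abundance taxa dominate the value, which is consistent with the abstract's observation that the metric behaves differently depending on how common high-abundance OTUs are.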


2021 ◽  
pp. 101-127
Author(s):  
Anna M. Plantinga ◽  
Michael C. Wu


mSystems ◽  
2019 ◽  
Vol 4 (1) ◽  
Author(s):  
Cameron Martino ◽  
James T. Morton ◽  
Clarisse A. Marotz ◽  
Luke R. Thompson ◽  
Anupriya Tripathi ◽  
...  

ABSTRACT: The central aims of many host or environmental microbiome studies are to elucidate factors associated with microbial community compositions and to relate microbial features to outcomes. However, these aims are often complicated by difficulties stemming from high dimensionality, non-normality, sparsity, and the compositional nature of microbiome data sets. A key tool in microbiome analysis is beta diversity, defined by the distances between microbial samples. Many different distance metrics have been proposed, all with varying discriminatory power on data with differing characteristics. Here, we propose a compositional beta diversity metric rooted in a centered log-ratio transformation and matrix completion called robust Aitchison PCA. We demonstrate the benefits of compositional transformations upstream of beta diversity calculations through simulations. Additionally, we demonstrate improved effect size, classification accuracy, and robustness to sequencing depth over the current methods on several decreased sample subsets of real microbiome data sets. Finally, we highlight the ability of this new beta diversity metric to retain the feature loadings linked to sample ordinations revealing salient intercommunity niche feature importance. IMPORTANCE: By accounting for the sparse compositional nature of microbiome data sets, robust Aitchison PCA can yield high discriminatory power and salient feature ranking between microbial niches. The software to perform this analysis is available under an open-source license and can be obtained at https://github.com/biocore/DEICODE; additionally, a QIIME 2 plugin is provided to perform this analysis at https://library.qiime2.org/plugins/q2-deicode.
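
The centered log-ratio step mentioned in the abstract can be sketched for a single sample as follows. This is an illustration under assumptions, not the DEICODE implementation: it takes logs of the nonzero counts only, centers them on their own mean, and marks zeros as missing (`None`) so that a later matrix-completion step, not shown here, could fill them in. The function name `robust_clr` is hypothetical.

```python
import math
from typing import List, Optional, Sequence


def robust_clr(sample: Sequence[float]) -> List[Optional[float]]:
    """Centered log-ratio over the nonzero entries of one sample.

    Zeros are returned as None (treated as unobserved), which avoids
    adding a pseudocount before taking logarithms.
    """
    logs = [math.log(c) for c in sample if c > 0]
    if not logs:
        return [None] * len(sample)
    mean_log = sum(logs) / len(logs)
    return [math.log(c) - mean_log if c > 0 else None for c in sample]


transformed = robust_clr([10, 0, 1000])
print(transformed)

# The observed (non-missing) values are centered, so they sum to ~0.
observed = [v for v in transformed if v is not None]
print(abs(sum(observed)) < 1e-12)
```

Centering on the log mean makes the values depend only on ratios between taxa within a sample, so multiplying every count by a constant (for example, deeper sequencing) leaves the transformed values unchanged.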


2020 ◽  
Author(s):  
Yushu Shi ◽  
Liangliang Zhang ◽  
Christine Peterson ◽  
Kim-Anh Do ◽  
Robert Jenq

Abstract Background: In microbiome data analysis, unsupervised clustering is often used to identify naturally occurring clusters, which can then be assessed for associations with characteristics of interest. In this work, we systematically compared beta diversity and clustering methods commonly used in microbiome analyses. We applied these to four published datasets where highly distinct microbiome profiles could be seen between sample groups. Results: Although no single method outperformed the others consistently, we did identify key scenarios where certain methods can underperform. Specifically, the Bray Curtis metric resulted in poor clustering in a dataset where high-abundance OTUs were relatively rare. In contrast, the unweighted UniFrac metric clustered poorly when used on a dataset with a high prevalence of low-abundance OTUs. To test our proposition, we systematically modified properties of the poorly performing datasets and found that this approach resulted in improved Bray Curtis and unweighted UniFrac performance. Conclusions: Based on these observations, we rationally combined the Bray Curtis metric and the unweighted UniFrac metrics and found that this new beta diversity metric showed high performance across all datasets.


2021 ◽  
Vol 11 ◽  
Author(s):  
Quy Cao ◽  
Xinxin Sun ◽  
Karun Rajesh ◽  
Naga Chalasani ◽  
Kayla Gelow ◽  
...  

Background: The accuracy of microbial community detection in 16S rRNA marker-gene and metagenomic studies suffers from contamination and sequencing errors that lead to either falsely identifying microbial taxa that were not in the sample or misclassifying the taxa of DNA fragment reads. Removing contaminants and filtering rare features are two common approaches to deal with this problem. While contaminant detection methods use auxiliary sequencing process information to identify known contaminants, filtering methods remove taxa that are present in a small number of samples and have small counts in the samples where they are observed. The latter approach reduces the extreme sparsity of microbiome data and has been shown to correctly remove contaminant taxa in cultured “mock” datasets, where the true taxa compositions are known. Although filtering is frequently used, careful evaluation of its effect on the data analysis and scientific conclusions remains unreported. Here, we assess the effect of filtering on the alpha and beta diversity estimation as well as its impact on identifying taxa that discriminate between disease states. Results: The effect of filtering on microbiome data analysis is illustrated on four datasets: two mock quality control datasets where the same cultured samples with known microbial composition are processed at different labs and two disease study datasets. Results show that in microbiome quality control datasets, filtering reduces the magnitude of differences in alpha diversity and alleviates technical variability between labs while preserving between-sample similarity (beta diversity). In the disease study datasets, DESeq2 and linear discriminant analysis Effect Size (LEfSe) methods were used to identify taxa that are differentially abundant across groups of samples, and random forest models were used to rank features with the largest contribution toward disease classification. 
Results reveal that filtering retains significant taxa and preserves the model classification ability measured by the area under the receiver operating characteristic curve (AUC). The comparison between the filtering and the contaminant removal method shows that they have complementary effects and are advised to be used in conjunction. Conclusions: Filtering reduces the complexity of microbiome data while preserving their integrity in downstream analysis. This leads to mitigation of the classification methods' sensitivity and reduction of technical variability, allowing researchers to generate more reproducible and comparable results in microbiome data analysis.
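
The AUC used above as the measure of classification ability can be computed directly from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a generic illustration with made-up labels and scores; it is not the authors' evaluation code.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical disease labels (1 = case, 0 = control) and model scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

Computing this value once on the unfiltered feature table and once on the filtered one is one way to check the abstract's claim that classification ability is preserved.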


Author(s):  
E Martins Camara ◽  
Tubino Andrade Andrade-Tub ◽  
T Pontes Franco ◽  
LN dos Santos ◽  
AFGN dos Santos ◽  
...  

2014 ◽  
Vol 25 (3-4) ◽  
pp. 53-68
Author(s):  
I. V. Goncharenko ◽  
H. M. Holyk

Cenotic diversity and the leading ecological factors of its floristic differentiation were studied using two areas as examples: the Kyiv parks "Nivki" and "Teremki". It is shown that in a megalopolis the Galeobdoloni-Carpinetum impatientosum parviflorae subassociation forms under anthropogenic pressure on the typical ecotope of near-Dnieper hornbeam-oak forests on fresh gray forest soils. The degree of anthropogenic transformation of cenofloras can be estimated by the number of species of the Robinietea and Galio-Urticetea classes, as well as of neophytes and cultivars. Phytoindication of the hemeroby index may also be used in the calculation. We propose a modified index of biotic dispersion (normalized by alpha diversity) for estimating the ecophytocenotic range (beta diversity) of a series of relevés. We found that alpha diversity initially increases at low levels of anthropogenic pressure (due to the invasion of anthropophytes) and then decreases as human impact grows (due to the loss of aboriginal species). We also found that beta diversity (differential diversity) decreases under the influence of the anthropogenic factor, making the plant cover more homogeneous. Vegetation classification was performed with a new original method of cluster analysis, designated DRSA («distance-ranked sorting assembling»). We suggest validating classification quality on a "seriation" diagram, which is a distance matrix between objects with gradient filling. Dark diagonal blocks confirm cluster density (intracluster compactness), while uncolored off-diagonal blocks are evidence of cluster isolation (intercluster distinctness). In addition, the separation of clusters (syntaxa) in ordination space suggests their independence. For phytoindication we propose including only species with more than 10% constancy. Furthermore, to describe syntaxonomic amplitude we suggest using the 25%-75% interquartile range instead of the mean and standard deviation. 
It is shown that a comparative analysis of syntaxa for each ecofactor is conveniently carried out using violin (bulb) plots. A new approach to the phytoindication of syntaxa, designated R-phytoindication, is proposed. In this approach the ecofactor values calculated for individual relevés are not taken into account; instead, the composition of the cenoflora with species constancies is used, which minimizes the influence of non-typical species on phytoindication. We suggest describing a syntaxon’s amplitude with more robust statistics: the optimum of the amplitude (central tendency) by the median (instead of the arithmetic mean), and the range of tolerance by the interquartile range (instead of the standard deviation). We assessed the amplitudes of syntaxa by the phytoindication method for moisture (Hd), acidity (Rc), soil nitrogen content (Nt), wetting variability (vHd), light regime (Lc), and salt regime (Sl). We found no significant differences in these ecofactors among the ecotopes of our syntaxa, which confirmed the variant syntaxonomic rank for all syntaxa. We found that the core of the species composition of our phytocenoses consists of plants with moderate requirements for moisture, soil nitrogen, light, and salt regime. We show that the leading factor of syntaxonomic differentiation is a hidden anthropogenic one that is not subject to direct measurement. However, this hidden factor of "human pressure" correlated with phytoindication parameters (variables) that can be measured "directly" from the species composition of plant communities. The most strongly correlated ecofactors were soil nitrogen, wetting variability, light regime, and hemeroby. The last is empirically the most indicative for assessing "human impact". We establish the concept of «hemeroby of phytocenosis» (tolerance to human impact), which can be calculated approximately as the mean or the median of the hemeroby scores of the individual species present in it.
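
The robust summary proposed in the abstract (median for the optimum, interquartile range for the tolerance, species below 10% constancy excluded) can be sketched with Python's standard `statistics` module. The species names, indicator scores, and constancy values below are invented for illustration, and the function name `syntaxon_amplitude` is hypothetical; the abstract does not specify a quartile convention, so the "inclusive" method is an assumption.

```python
import statistics
from typing import Dict, Tuple


def syntaxon_amplitude(
    indicator: Dict[str, float],   # species -> ecofactor score (e.g. Nt)
    constancy: Dict[str, float],   # species -> constancy in percent
    min_constancy: float = 10.0,   # cutoff stated in the abstract
) -> Tuple[float, Tuple[float, float]]:
    """Return (median, (Q1, Q3)) over species above the constancy cutoff."""
    values = [
        score
        for species, score in indicator.items()
        if constancy.get(species, 0.0) > min_constancy
    ]
    if len(values) < 2:
        raise ValueError("need at least two species above the cutoff")
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return median, (q1, q3)


# Hypothetical cenoflora: scores on an ordinal ecofactor scale.
indicator = {"sp_a": 3, "sp_b": 4, "sp_c": 5, "sp_d": 6, "sp_e": 7, "sp_f": 9}
constancy = {"sp_a": 80, "sp_b": 55, "sp_c": 40, "sp_d": 25, "sp_e": 12, "sp_f": 5}

optimum, tolerance = syntaxon_amplitude(indicator, constancy)
print(optimum, tolerance)
```

In this toy data the low-constancy species `sp_f` carries the most extreme score; excluding it and summarizing with quartiles keeps a single atypical species from shifting the estimated amplitude, which is the motivation given in the abstract.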

