A broken promise: microbiome differential abundance methods do not control the false discovery rate

2017 ◽  
Vol 20 (1) ◽  
pp. 210-221 ◽  
Author(s):  
Stijn Hawinkel ◽  
Federico Mattiello ◽  
Luc Bijnens ◽  
Olivier Thas

Abstract High-throughput sequencing technologies allow easy characterization of the human microbiome, but the statistical methods to analyze microbiome data are still in their infancy. Differential abundance methods aim at detecting associations between the abundances of bacterial species and subject grouping factors. The results of such methods are important to identify the microbiome as a prognostic or diagnostic biomarker or to demonstrate efficacy of prodrug or antibiotic drugs. Because of a lack of benchmarking studies in the microbiome field, no consensus exists on the performance of the statistical methods. We have compared a large number of popular methods through extensive parametric and nonparametric simulation as well as real data shuffling algorithms. The results are consistent over the different approaches and all point to an alarming excess of false discoveries. This raises great doubts about the reliability of discoveries in past studies and imperils reproducibility of microbiome experiments. To further improve method benchmarking, we introduce a new simulation tool that generates correlated count data following any univariate count distribution; the correlation structure may be inferred from real data. Most simulation studies discard the correlation between species, but our results indicate that this correlation can negatively affect the performance of statistical methods.
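
The label-shuffling idea behind this kind of benchmark can be sketched as follows. This is a minimal illustration, not the authors' benchmarking code: once group labels are permuted at random, no taxon is truly associated with the grouping, so every taxon that remains significant after Benjamini-Hochberg adjustment is a false discovery. The names `benjamini_hochberg` and `count_null_discoveries`, and the `test` callback, are hypothetical.

```python
import random


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def count_null_discoveries(counts, labels, test, alpha=0.05, seed=0):
    """Shuffle labels, test every taxon, and count BH-significant taxa.

    counts: list of per-taxon lists of sample counts (taxa x samples).
    labels: group label per sample.
    test:   hypothetical callback (taxon_counts, labels) -> p-value.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)  # breaks any real association with the grouping
    pvalues = [test(taxon, shuffled) for taxon in counts]
    adjusted = benjamini_hochberg(pvalues)
    return sum(p <= alpha for p in adjusted)
```

Because every null hypothesis is true after shuffling, the proportion of repeated shuffles (different `seed` values) that yield at least one discovery estimates the false discovery rate, which a valid method should keep at or below `alpha`.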

2020 ◽  
Author(s):  
Chan Wang ◽  
Jiyuan Hu ◽  
Martin J. Blaser ◽  
Huilin Li

Abstract Motivation The human microbiome is inherently dynamic and its dynamic nature plays a critical role in maintaining health and driving disease. With an increasing number of longitudinal microbiome studies, scientists are eager to learn the comprehensive characterization of microbial dynamics and their implications to the health and disease-related phenotypes. However, due to the challenging structure of longitudinal microbiome data, few analytic methods are available to characterize the microbial dynamics over time. Results We propose a microbial trend analysis (MTA) framework for the high-dimensional and phylogenetically-based longitudinal microbiome data. In particular, MTA can perform three tasks: 1) capture the common microbial dynamic trends for a group of subjects on the community level and identify the dominant taxa; 2) examine whether or not the microbial overall dynamic trends are significantly different in groups; 3) classify an individual subject based on its longitudinal microbial profiling. Our extensive simulations demonstrate that the proposed MTA framework is robust and powerful in hypothesis testing, taxon identification, and subject classification. Our real data analyses further illustrate the utility of MTA through a longitudinal study in mice. Conclusions The proposed MTA framework is an attractive and effective tool in investigating dynamic microbial pattern from longitudinal microbiome studies.


2021 ◽  
Author(s):  
Quang P. Nguyen ◽  
Anne G. Hoen ◽  
H. Robert Frost

Abstract Research in human associated microbiomes often involves the analysis of taxonomic count tables generated via high-throughput sequencing. It is difficult to apply statistical tools as the data is high-dimensional, sparse, and strictly compositional. An approachable way to alleviate high-dimensionality and sparsity is to aggregate variables into pre-defined sets. Set-based analysis is ubiquitous in the genomics literature, and has demonstrable impact in improving interpretability and power of downstream analysis. Unfortunately, there is a lack of sophisticated set-based analysis methods specific to microbiome taxonomic data, where current practice often employs abundance summation as a technique for aggregation. This approach prevents comparison across sets of different sizes, does not preserve inter-sample distances, and amplifies protocol bias. Here, we attempt to fill this gap with a new single sample taxon set enrichment method based on the isometric log ratio transformation and the competitive null hypothesis commonly used in the enrichment analysis literature. Our approach, titled competitive isometric log ratio (cILR), generates sample-specific enrichment scores as the scaled log ratio of the subcomposition defined by taxa within a set and the subcomposition defined by its complement. We provide sample-level significance testing by estimating an empirical null distribution of our test statistic with valid p-values. Herein we demonstrate using both real data applications and simulations that cILR controls for type I error even under high sparsity and high inter-taxa correlation scenarios. Additionally, it provides informative scores that can be inputs to downstream differential abundance and prediction tasks. Author summary The study of human associated microbiomes relies on genomic surveys via high-throughput sequencing. However, microbiome taxonomic data is sparse and high dimensional which prevents the application of standard statistical techniques.
One approach to address this problem is to perform analyses at the level of taxon sets. Set-based analysis has a long history in the genomics literature, with demonstrable impact in improving both power and interpretability. Unfortunately, there is not a lot of research in developing new set-based tools for microbiome taxonomic data specifically, given that compared to other ‘omics data types microbiome taxonomic data is strictly compositional. We developed a new tool to generate taxon set enrichment scores at the sample level by combining the isometric log-ratio and the competitive null hypothesis. Our scores can be used for statistical inference, and as inputs to other downstream analyses such as differential abundance and prediction models. We demonstrate the performance of our method against competing approaches across both real data analyses and simulation studies.
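
The enrichment score described above can be sketched as an isometric log-ratio balance between a taxon set and its complement. This is a minimal illustration, not the authors' implementation: the scaling constant is the standard ILR balance coefficient, and the pseudocount used to handle zeros, along with the name `set_balance_score`, are assumptions made for this example.

```python
import math


def set_balance_score(counts, set_idx, pseudocount=1.0):
    """Scaled log ratio of geometric means: taxa in the set vs. the rest.

    counts:      counts for one sample, one entry per taxon.
    set_idx:     indices of the taxa that belong to the set.
    pseudocount: assumed zero-handling choice; added to every count.
    """
    in_set = set(set_idx)
    inside = [math.log(c + pseudocount) for i, c in enumerate(counts) if i in in_set]
    outside = [math.log(c + pseudocount) for i, c in enumerate(counts) if i not in in_set]
    p, q = len(inside), len(outside)
    if p == 0 or q == 0:
        raise ValueError("the set and its complement must both be non-empty")
    scale = math.sqrt(p * q / (p + q))
    # Difference of mean logs equals the log ratio of geometric means.
    return scale * (sum(inside) / p - sum(outside) / q)
```

With `pseudocount=0` and strictly positive counts, the score is unchanged when the whole sample is multiplied by a constant, which is the property that makes log ratios suitable for compositional data.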


2018 ◽  
Author(s):  
Nathan LaPierre ◽  
Serghei Mangul ◽  
Mohammed Alser ◽  
Igor Mandric ◽  
Nicholas C. Wu ◽  
...  

Abstract Background High throughput sequencing has spurred the development of metagenomics, which involves the direct analysis of microbial communities in various environments such as soil, ocean water, and the human body. Many existing methods based on marker genes or k-mers have limited sensitivity or are too computationally demanding for many users. Additionally, most work in metagenomics has focused on bacteria and archaea, neglecting to study other key microbes such as viruses and eukaryotes. Results Here we present a method, MiCoP (Microbiome Community Profiling), that uses fast-mapping of reads to build a comprehensive reference database of full genomes from viruses and eukaryotes to achieve maximum read usage and enable the analysis of the virome and eukaryome in each sample. We demonstrate that mapping of metagenomic reads is feasible for the smaller viral and eukaryotic reference databases. We show that our method is accurate on simulated and mock community data and identifies many more viral and fungal species than previously-reported results on real data from the Human Microbiome Project. Conclusions MiCoP is a mapping-based method that proves more effective than existing methods at abundance profiling of viruses and eukaryotes in metagenomic samples. MiCoP can be used to detect the full diversity of these communities. The code, data, and documentation are publicly available on GitHub at: https://github.com/smangul1/MiCoP


Author(s):  
Gülendam Bozdayı ◽  
Işıl Fidan

The viral component of the human microbiome is referred to as the ‘virobiota’. The virobiota is the sum of all viruses found in or on humans. The set of all genes of the virobiota is referred to as the ‘virome’. The human virome consists of viruses that infect eukaryotic cells, bacteriophages that infect prokaryotic cells, and virus-derived genetic elements found in the human genome, such as endogenous retroviruses. The development of new sequencing technologies, such as high-throughput sequencing techniques, has allowed the analysis of the human virome. Many new viruses have been discovered lately using next-generation sequencing technology. In recent years, studies of the human virome have increased as changes in the virome have been observed in diseases. Alterations in the human virome may be associated with infectious and inflammatory diseases, cancer, and autoimmunity. Understanding how the virome affects human health and disease can enable the development of potential therapeutic approaches that target members of the virome.


2020 ◽  
Vol 6 (1) ◽  
Author(s):  
Huang Lin ◽  
Shyamal Das Peddada

Abstract Increasingly, researchers are discovering associations between the microbiome and a wide range of human diseases such as obesity, inflammatory bowel diseases, HIV, and so on. The first step towards microbiome-wide association studies is the characterization of the composition of the human microbiome under different conditions. Determination of differentially abundant microbes between two or more environments, known as differential abundance (DA) analysis, is a challenging and important problem that has received considerable interest during the past decade. It is well documented in the literature that the observed microbiome data (OTU/SV table) are relative abundances with an excess of zeros. Since relative abundances sum to a constant, these data are necessarily compositional. In this article we review some recent methods for DA analysis and describe their strengths and weaknesses.
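
The compositional constraint can be illustrated with a small example. In this sketch (the helper name `closure` and the numbers are invented for illustration), only one taxon changes in absolute abundance, yet every relative abundance shifts because the proportions must still sum to one.

```python
def closure(abundances):
    """Convert absolute abundances to relative abundances that sum to 1."""
    total = sum(abundances)
    return [a / total for a in abundances]


# Invented absolute abundances for three taxa in two conditions.
before = [100, 100, 200]
after = [300, 100, 200]  # only the first taxon actually changed

rel_before = closure(before)
rel_after = closure(after)  # the two unchanged taxa now show lower proportions
```

A taxon-by-taxon comparison of `rel_before` and `rel_after` would suggest that the second and third taxa decreased, although their absolute abundances are identical in both conditions.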


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Chan Wang ◽  
Jiyuan Hu ◽  
Martin J. Blaser ◽  
Huilin Li

Abstract Background The human microbiome is inherently dynamic and its dynamic nature plays a critical role in maintaining health and driving disease. With an increasing number of longitudinal microbiome studies, scientists are eager to learn the comprehensive characterization of microbial dynamics and their implications to the health and disease-related phenotypes. However, due to the challenging structure of longitudinal microbiome data, few analytic methods are available to characterize the microbial dynamics over time. Results We propose a microbial trend analysis (MTA) framework for the high-dimensional and phylogenetically-based longitudinal microbiome data. In particular, MTA can perform three tasks: 1) capture the common microbial dynamic trends for a group of subjects at the community level and identify the dominant taxa; 2) examine whether or not the microbial overall dynamic trends are significantly different between groups; 3) classify an individual subject based on its longitudinal microbial profiling. Our extensive simulations demonstrate that the proposed MTA framework is robust and powerful in hypothesis testing, taxon identification, and subject classification. Our real data analyses further illustrate the utility of MTA through a longitudinal study in mice. Conclusions The proposed MTA framework is an attractive and effective tool in investigating dynamic microbial pattern from longitudinal microbiome studies.


Author(s):  
Hye-Won Cho ◽  
Yong-Bin Eom

High-throughput DNA sequencing technologies have facilitated the in silico forensic analysis of the human microbiome. Specific microbial species or communities obtained from the crime scene provide evidence of human contacts and their body fluids. The microbial community is influenced by geographic, ethnic, lifestyle, and environmental factors such as urbanization. An understanding of the effects of these external stressors on the human microbiome and determination of stable and changing elements are important in selecting appropriate targets for investigation. In this study, the Forensic Microbiome Database (FMD) (http://www.fmd.jcvi.org) containing the microbiome data of various locations in the human body in 35 countries was used. We focused on skin, saliva, vaginal fluid, and stool and found that the microbiome distribution differed according to the body part as well as the geographic location. In the case of skin samples, Staphylococcus species were more abundant than Corynebacterium species among Asians compared with Americans. Holdemanella and Fusobacterium were specific to the saliva of Korean and Japanese populations. Lactobacillus was found in the vaginal fluids of individuals in all countries, whereas Serratia and Enterobacter were endemic to Bolivia and Congo, respectively. This study is the first attempt to collate and describe the observed variation in microbiomes from the forensic microbiome database. As additional microbiome databases are reported by studies worldwide, the diversity of the applications may exceed and expand beyond the initial identification of the host.


2017 ◽  
Author(s):  
J. Rivera-Pinto ◽  
J.J. Egozcue ◽  
V. Pawlowsky–Glahn ◽  
R. Paredes ◽  
M. Noguera-Julian ◽  
...  

Abstract High-throughput sequencing technologies have revolutionized microbiome research by allowing the relative quantification of microbiome composition and function in different environments. One of the main goals in microbiome analysis is the identification of microbial species that are differentially abundant among groups of samples, or whose abundance is associated with a variable of interest. Most available methods for microbiome abundance testing perform univariate tests for each microbial species or taxa separately, ignoring the compositional nature of microbiome data. We propose an alternative approach for microbiome abundance testing that consists of the identification of two groups of taxa whose relative abundance, or balance, is associated with the response variable of interest. This approach is appealing, since it has direct translation to the biological concept of ecological balance between species in an ecosystem. In this work, we present selbal, a greedy stepwise algorithm for balance selection. We illustrate the algorithm with 16S abundance data from an HIV-microbiome study and a Crohn-microbiome study. Importance A more meaningful approach for microbiome abundance testing is presented. Instead of testing each taxon separately we propose to explore abundance balances among groups of taxa. This approach acknowledges the compositional nature of microbiome data.


2020 ◽  
Author(s):  
Ruochen Jiang ◽  
Wei Vivian Li ◽  
Jingyi Jessica Li

Abstract Microbiome studies have gained increased attention since many discoveries revealed connections between human microbiome compositions and diseases. A critical challenge in microbiome research is that excess non-biological zeros distort taxon abundances, complicate data analysis, and jeopardize the reliability of scientific discoveries. To address this issue, we propose the first imputation method, mbImpute, to identify and recover likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. Comprehensive simulations verified that mbImpute achieved better imputation accuracy under multiple measures than five state-of-the-art imputation methods designed for non-microbiome data. In real data applications, we demonstrate that mbImpute improved the power and reproducibility of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer.


2021 ◽  
Author(s):  
Ozvan Bocher ◽  
Thomas E. Ludwig ◽  
Gaëlle Marenne ◽  
Jean-François Deleuze ◽  
Suryakant Suryakant ◽  
...  

Rare variant association tests (RVAT) have been developed to study the contribution of rare variants widely accessible through high-throughput sequencing technologies. RVAT require aggregating rare variants into testing units and filtering variants to retain only the most likely causal ones. In the exome, genes are natural testing units and variants are usually filtered based on their functional consequences. However, when dealing with whole-genome sequence (WGS) data, both steps are challenging. No natural biological unit is available for aggregating rare variants. Sliding windows procedures have been proposed to circumvent this difficulty; however, they are blind to biological information and result in a large number of tests. We propose a new strategy to perform RVAT on WGS data: “RAVA-FIRST” (RAre Variant Association using Functionally-InfoRmed STeps) comprising three steps. (1) New testing units are defined genome-wide based on functionally-adjusted Combined Annotation Dependent Depletion (CADD) scores of variants observed in the GnomAD populations, which are referred to as “CADD regions”. (2) A region-dependent filtering of rare variants is applied in each CADD region. (3) A functionally-informed burden test is performed with sub-scores computed for each genomic category within each CADD region. Both on simulations and real data, RAVA-FIRST was found to outperform other WGS-based RVAT. Applied to a WGS dataset of venous thromboembolism patients, RAVA-FIRST identified an intergenic region on chromosome 18 that is enriched for rare variants in early-onset patients and that was missed by standard sliding windows procedures. RAVA-FIRST enables new investigations of rare non-coding variants in complex diseases, facilitated by its implementation in the R package Ravages.

