Rank normalization empowers a t-test for microbiome differential abundance analysis while controlling for false discoveries

Author(s):  
Matthew L Davis ◽  
Yuan Huang ◽  
Kai Wang

Abstract A major task in the analysis of microbiome data is to identify microbes associated with differing biological conditions. Before conducting analysis, raw data must first be adjusted so that counts from different samples are comparable. A typical approach is to estimate normalization factors by which all counts in a sample are multiplied or divided. However, the inherent variation associated with the estimation of normalization factors is often not accounted for in subsequent analysis, leading to a loss of precision. Rank normalization is a nonparametric alternative to the estimation of normalization factors in which each count for a microbial feature is replaced by its intrasample rank. Although rank normalization has been successfully applied to microarray analysis in the past, it has yet to be explored for microbiome data, which are characterized by high frequencies of zeros, strongly correlated features and compositionality. We propose to use rank normalization as an alternative to the estimation of normalization factors and examine its performance when paired with a two-sample t-test. On a rigorous third-party benchmarking simulation, it is shown to offer strong control over the false discovery rate and, at sample sizes greater than 50 per treatment group, to offer an improvement in performance over commonly used normalization factors paired with t-tests, Wilcoxon rank-sum tests and methodologies implemented by R packages. On two real datasets, it yielded valid and reproducible results that were strongly in agreement with the original findings and the existing literature, further demonstrating its robustness and future potential. Availability: The data underlying this article are available online along with R code and supplementary materials at https://github.com/matthewlouisdavisBioStat/Rank-Normalization-Empowers-a-T-Test.
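The core idea of the abstract above (replace each count by its rank within its own sample, then compare groups with a t-test) can be illustrated with a minimal Python sketch. This is not the authors' R implementation: the helper names, the toy counts, the use of average ranks for ties and the choice of the Welch (unequal-variance) form of the t statistic are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def welch_t(x, y):
    """Welch two-sample t statistic (an assumed variant of the t-test)."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return 0.0 if se == 0 else (mean(x) - mean(y)) / se


# Hypothetical toy counts: one list per sample, one entry per microbial feature.
group_a = [[0, 0, 12, 30, 7], [0, 3, 40, 25, 9], [2, 0, 8, 60, 15]]
group_b = [[0, 20, 10, 30, 5], [1, 50, 45, 40, 6], [0, 18, 9, 70, 0]]

# Rank normalization: ranks are computed within each sample, across features.
ranks_a = [average_ranks(sample) for sample in group_a]
ranks_b = [average_ranks(sample) for sample in group_b]

# One t statistic per feature, comparing intrasample ranks between groups.
n_features = len(group_a[0])
t_stats = [
    welch_t([s[f] for s in ranks_a], [s[f] for s in ranks_b])
    for f in range(n_features)
]
print(t_stats)
```

Because ranks depend only on the ordering of counts within a sample, no per-sample normalization factor has to be estimated. Turning the statistics into p-values and applying a false discovery rate correction are left out of this sketch.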


2020 ◽  
Vol 6 (1) ◽  
Author(s):  
Huang Lin ◽  
Shyamal Das Peddada

Abstract Increasingly, researchers are discovering associations between the microbiome and a wide range of human diseases such as obesity, inflammatory bowel diseases, HIV, and so on. The first step towards microbiome-wide association studies is the characterization of the composition of the human microbiome under different conditions. Determination of differentially abundant microbes between two or more environments, known as differential abundance (DA) analysis, is a challenging and important problem that has received considerable interest during the past decade. It is well documented in the literature that the observed microbiome data (OTU/SV table) are relative abundances with an excess of zeros. Since relative abundances sum to a constant, these data are necessarily compositional. In this article we review some recent methods for DA analysis and describe their strengths and weaknesses.
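The consequence of compositionality mentioned above can be seen in a small numeric sketch. The counts and the helper name below are hypothetical and only illustrate the sum-to-a-constant constraint; they are not taken from the reviewed methods.

```python
def relative_abundance(counts):
    """Convert absolute counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical absolute abundances of three taxa under two conditions.
# Only taxon 0 changes in absolute terms.
before = [100, 100, 100]
after = [400, 100, 100]

rel_before = relative_abundance(before)
rel_after = relative_abundance(after)

# Taxa 1 and 2 are unchanged in absolute terms, yet their relative
# abundances fall because the proportions must still sum to 1.
for taxon, (rb, ra) in enumerate(zip(rel_before, rel_after)):
    print(f"taxon {taxon}: {rb:.3f} -> {ra:.3f}")
```

A naive test on the proportions would flag all three taxa as changed, which is one reason DA methods treat compositional data with care.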



2020 ◽  
Vol 36 (13) ◽  
pp. 3959-3965
Author(s):  
Yuanjing Ma ◽  
Yuan Luo ◽  
Hongmei Jiang

Abstract Motivation Microbial communities have been shown to be closely related to many diseases. The identification of differentially abundant microbial species is clinically meaningful for finding disease-related pathogenic or probiotic bacteria. However, certain characteristics of microbiome data have hindered the accuracy and effectiveness of differential abundance analysis. The abundances or counts of microbiome species are usually on different scales and exhibit zero-inflation and over-dispersion. Normalization is a crucial step before the differential abundance test. However, existing normalization methods typically try to adjust counts on different scales to a common scale by constructing size factors with the assumption that count distributions across samples are equivalent up to a certain percentile. These methods often yield undesirable results when differentially abundant species are of low to medium abundance level. For differential abundance analysis, existing methods often use a single distribution to model the dispersion of species, which lacks the flexibility to capture a single species’ distinctiveness. These methods tend to detect many false positives and often lack power when the effect size is small. Results We develop a novel framework for differential abundance analysis on sparse high-dimensional marker gene microbiome data. Our methodology relies on a novel network-based normalization technique and a two-stage zero-inflated mixture count regression model (RioNorm2). Our normalization method aims to find a group of relatively invariant microbiome species across samples and conditions in order to construct the size factor. Another contribution of the paper is that our testing approach can take under-sampling and over-dispersion into consideration by separating microbiome species into two groups and modeling them separately. 
In comprehensive simulation studies, our method is consistently powerful and robust across settings with different sample sizes, library sizes and effect sizes. We also demonstrate the effectiveness of our novel framework using a published dataset of metastatic melanoma and find biological insights from the results. Availability and implementation The R package ‘RioNorm2’ can be installed from Github at https://github.com/yuanjing-ma/RioNorm2. Supplementary information Supplementary data are available at Bioinformatics online.



2021 ◽  
Author(s):  
Shulei Wang

Differential abundance analysis is an essential and commonly used tool to characterize the difference between microbial communities. However, identifying differentially abundant microbes remains a challenging problem because the observed microbiome data are inherently compositional, excessively sparse, and distorted by experimental bias. Besides these major challenges, the results of differential abundance analysis also depend largely on the choice of analysis unit, adding another practical complexity to this already complicated problem. In this work, we introduce a new differential abundance test called the MsRDB test, which embeds the sequences into a metric space and integrates a multi-scale adaptive strategy for utilizing spatial structure to identify differentially abundant microbes. Compared with existing methods, the MsRDB test can detect differentially abundant microbes at the finest resolution offered by the data and provide adequate detection power while being robust to zero counts, compositional effects, and experimental bias in the microbial compositional data set. Applications to both simulated and real microbial compositional data sets demonstrate the usefulness of the MsRDB test.



2021 ◽  
Vol 15 (Supplement_1) ◽  
pp. S063-S063
Author(s):  
A Pisani ◽  
P Rausch ◽  
S Ellul ◽  
C Bang ◽  
T Tabone ◽  
...  

Abstract Background Members of the Enterobacteriaceae have been associated with active Crohn’s Disease (CD), possibly as a result of intestinal inflammation via production of a lipopolysaccharide that can trigger TLR4 signalling. This study aims to assess whether this association persists in CD patients in remission and whether it correlates with disease phenotype. Methods Stool samples of 32 CD patients in remission and 97 healthy controls were analyzed by 16S rRNA sequencing. High-quality amplicon sequence variants (ASVs) were derived and classified via DADA2. Results ASV 6-Escherichia/Shigella uncl. was found to be more abundant in CD (padj=0.0003), while ASV 24, another member of the Escherichia/Shigella cluster, was identified as an indicator species for CD (padj=0.09). Differential abundance analysis according to phenotype as per the Montreal classification revealed that, compared to patients with the B1 phenotype, patients with the B2 and/or B3 phenotype have a higher abundance of Escherichia/Shigella uncl. (ASVs 13, 31, 282 and 422), Klebsiella uncl. (ASVs 75 and 101) and Enterobacter uncl. (ASV 219) (Figure 1). Furthermore, patients with L3 involvement had higher abundances of Klebsiella uncl. (ASVs 75 and 101) and Parasutterella uncl. (ASVs 22, 53, 120, 199, 249 and 510), the latter being a member of the Proteobacteria, compared to patients with L1 and/or L2 involvement. No significant association with “Age of Onset” was identified. In addition, network analyses revealed a strongly correlated group of Enterobacteriaceae ASVs (Klebsiella, Escherichia/Shigella, Enterobacter, Citrobacter) which appear to associate collectively with CD. 
Abstract DOP25 – Figure 1: Heatmap visualizing significant differentially abundant ASVs in CD patients with respect to behaviour subgroups.
Conclusion Enterobacteriaceae persist in the faecal microbiota at significantly higher levels than in controls despite remission and, furthermore, are associated with the more severe phenotypes of stricturing and penetrating disease. Further studies might indicate whether microbiota assessment at diagnosis might predict CD subtypes and therefore influence therapeutic choices.



Genetics ◽  
2021 ◽  
Author(s):  
Jonas Wallin ◽  
Małgorzata Bogdan ◽  
Piotr A Szulc ◽  
R W Doerge ◽  
David O Siegmund

Abstract Ghost quantitative trait loci (QTL) are false discoveries in QTL mapping that arise from the “accumulation” of polygenic effects uniformly distributed over the genome. The locations on the chromosome that are strongly correlated with the total of the polygenic effects depend on a specific sample correlation structure determined by the genotypes at all loci. The problem is particularly severe when the same genotypes are used to study multiple QTL, e.g. when using recombinant inbred lines or studying expression QTL. In this case, the ghost QTL phenomenon can lead to false hotspots, where multiple QTL show apparent linkage to the same locus. We illustrate the problem using the classic backcross design and suggest that it can be solved by applying an extended mixed effect model in which the random effects are allowed to have a nonzero mean. We provide formulas for estimating the thresholds for the corresponding t-test statistics and use them in a stepwise selection strategy, which allows for the simultaneous detection of several QTL. Extensive simulation studies illustrate that our approach eliminates ghost QTL/false hotspots while preserving a high power of true QTL detection.



F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 726
Author(s):  
Mike W.C. Thang ◽  
Xin-Yi Chua ◽  
Gareth Price ◽  
Dominique Gorse ◽  
Matt A. Field

Metagenomic sequencing is an increasingly common tool in environmental and biomedical sciences. While software for detailing the composition of microbial communities using 16S rRNA marker genes is relatively mature, researchers are increasingly interested in identifying changes exhibited within microbial communities under differing environmental conditions. In order to gain maximum value from metagenomic sequence data we must improve the existing analysis environment by providing accessible and scalable computational workflows able to generate reproducible results. Here we describe a complete end-to-end open-source metagenomics workflow running within Galaxy for 16S differential abundance analysis. The workflow accepts 454 or Illumina sequence data (either overlapping or non-overlapping paired-end reads) and outputs lists of the operational taxonomic units (OTUs) exhibiting the greatest change under differing conditions. A range of analysis steps and graphing options are available, giving users a high level of control over their data and analyses. Additionally, users are able to input complex sample-specific metadata which can be incorporated into differential analysis and used for grouping / colouring within graphs. Detailed tutorials containing sample data and existing workflows are available for three different input types: overlapping and non-overlapping read pairs as well as pre-generated Biological Observation Matrix (BIOM) files. Using the Galaxy platform we developed MetaDEGalaxy, a complete metagenomics differential abundance analysis workflow. MetaDEGalaxy is designed for bench scientists working with 16S data who are interested in comparative metagenomics. MetaDEGalaxy builds on momentum within the wider Galaxy metagenomics community with the hope that more tools will be added as existing methods mature.



2021 ◽  
Author(s):  
Zachary D Wallen

Background: When studying the relationship between the microbiome and a disease, a common question is which individual microbes are differentially abundant between the disease state and the healthy state. Numerous differential abundance (DA) testing methods exist and range from standard statistical tests to methods specifically designed for microbiome data. Comparison studies of DA testing methods have been performed, but none were performed on microbiome datasets collected for the study of real, complex disease. We therefore performed DA testing of microbial genera using 16 DA methods in two large, uniformly collected gut microbiome datasets on Parkinson disease (PD), and compared their results. Results: Pairwise concordances between methods ranged from 46% to 99%. Average pairwise concordance per dataset was 76%, and dropped to 62% when taking replication of signals across datasets into account. Certain methods consistently resulted in above-average concordances (e.g. Kruskal-Wallis, ALDEx2, GLM with centered-log-ratio transform), while others consistently resulted in lower-than-average concordances (e.g. edgeR, fitZIG). Overall, ~80% of genera tested were detected as differentially abundant by at least one method in each dataset. Requiring associations to replicate across datasets reduced significant signals by almost half. Further requiring signals to be replicated by the majority of methods (≥8) yielded 19 associations. Only one genus (Agathobacter) was replicated by all methods. Hierarchical clustering revealed three groups of DA signatures that were (1) replicated by the majority of methods and included genera previously associated with PD, (2) replicated by few or no methods, and (3) replicated by a subset of methods and included rarer genera, all enriched in PD. Conclusions: Differential abundance tests yielded varied results. 
Using one method on one dataset may find true associations, but may also detect non-reproducible signals, adding to inconsistency in the literature. To help lower false positives, one might analyze data with two or more DA methods to gauge concordance, and use a built-in replication dataset to show reproducibility. This study corroborated previously reported microorganism associations in PD, and revealed a potential new group of microorganisms whose abundance is significantly elevated in PD, and might be worth pursuing in future investigations.
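The recommendation to gauge concordance across DA methods can be sketched in a few lines of Python. The abstract does not define its concordance measure, so the definition below (the fraction of genera on which two methods make the same significant / not significant call) is an assumption, and the method names and calls are hypothetical.

```python
from itertools import combinations


def concordance(calls_a, calls_b):
    """Fraction of features on which two methods make the same call."""
    agree = sum(a == b for a, b in zip(calls_a, calls_b))
    return agree / len(calls_a)


# Hypothetical significance calls (True = called differentially abundant)
# for five genera from three placeholder methods.
calls = {
    "method_A": [True, True, False, False, True],
    "method_B": [True, False, False, False, True],
    "method_C": [True, True, True, False, False],
}

# Concordance for every pair of methods, then the average over pairs.
pairwise = {
    (m1, m2): concordance(calls[m1], calls[m2])
    for m1, m2 in combinations(calls, 2)
}
average = sum(pairwise.values()) / len(pairwise)

# Indices of genera called significant by a strict majority of methods.
n_methods = len(calls)
majority = [
    g for g in range(5)
    if sum(c[g] for c in calls.values()) > n_methods / 2
]

print(pairwise)
print(average, majority)
```

Applying the same majority filter to a second, independently collected dataset and keeping only genera that pass in both would mirror the replication step described in the abstract.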



Author(s):  
Oliver Gutiérrez-Hernández ◽  
Luis Ventura García

Multiplicity arises when data analysis involves multiple simultaneous inferences, increasing the chance of spurious findings. It is a widespread problem frequently ignored by researchers. In this paper, we perform an exploratory analysis of the Web of Science database for COVID-19 observational studies. We examined 100 top-cited COVID-19 peer-reviewed articles based on p-values, including up to 7100 simultaneous tests, with 50% including >34 tests, and 20% >100 tests. We found that the larger the number of tests performed, the larger the number of significant results (r = 0.87, p < 10⁻⁶). The number of p-values in the abstracts was not related to the number of p-values in the papers. However, the highly significant results (p < 0.001) in the abstracts were strongly correlated (r = 0.61, p < 10⁻⁶) with the number of p < 0.001 significances in the papers. Furthermore, the abstracts included a higher proportion of significant results (0.91 vs. 0.50), and 80% reported only significant results. Only one reviewed paper addressed multiplicity-induced type I error inflation, pointing to potentially spurious results bypassing the peer-review process. We conclude that special attention must be paid to the increased chance of false discoveries in observational studies, including non-replicated striking discoveries with a potentially large social impact. We propose some easy-to-implement measures to assess and limit the effects of multiplicity.
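To make the scale of the multiplicity problem concrete, the sketch below computes the chance of at least one false positive among m tests and applies the standard Benjamini-Hochberg step-up procedure. It is not necessarily one of the measures the authors propose: the independence assumption, the 0.05 level and the example p-values are all assumptions chosen for illustration.

```python
def familywise_error(m, alpha=0.05):
    """P(at least one false positive) among m independent true-null tests."""
    return 1 - (1 - alpha) ** m


def benjamini_hochberg(pvalues, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= q * k / m.
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Test counts taken from the abstract's medians and upper range.
for m in (34, 100):
    print(m, round(familywise_error(m), 3))

# Hypothetical p-values from ten simultaneous tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals)
print(rejected)
```

In this toy example five p-values fall below 0.05 when taken one at a time, but only the two smallest survive the false discovery rate correction.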



2013 ◽  
Vol 10 (12) ◽  
pp. 1200-1202 ◽  
Author(s):  
Joseph N Paulson ◽  
O Colin Stine ◽  
Héctor Corrada Bravo ◽  
Mihai Pop


2017 ◽  
Vol 95 (9) ◽  
pp. 855-857
Author(s):  
Henrique Reggiani ◽  
Jorge Meléndez

The differential abundance analysis method can improve the precision of stellar chemical abundances. The method compares the equivalent widths of a given line in a star with those of the same line in a star considered to be a standard representative of its class, using high-resolution, high signal-to-noise ratio spectra. The method has achieved great results by reducing measurement errors to unprecedentedly low levels. However, to date, there has been no consistent analysis of the actual improvements of this method over a classical analysis in metal-poor stars. Here we present a comparison between the errors of a classical stellar analysis and a differential analysis among low-metallicity stars.


