Discrete False-Discovery Rate Improves Identification of Differentially Abundant Microbes

ABSTRACT Differential abundance testing is a critical task in microbiome studies that is complicated by the sparsity of data matrices. Here we adapt for microbiome studies a solution from the field of gene expression analysis to produce a new method, discrete false-discovery rate (DS-FDR), that greatly improves the power to detect differential taxa by exploiting the discreteness of the data. Additionally, DS-FDR is relatively robust to the number of noninformative features, and thus removes the problem of filtering taxonomy tables by an arbitrary abundance threshold. We show by using a combination of simulations and reanalysis of nine real-world microbiome data sets that this new method outperforms existing methods at the differential abundance testing task, producing a false-discovery rate that is up to threefold more accurate and halving the number of samples required to find a given difference (thus increasing the efficiency of microbiome experiments considerably). We therefore expect DS-FDR to be widely applied in microbiome studies. IMPORTANCE DS-FDR can achieve higher statistical power to detect significant findings in sparse and noisy microbiome data compared to the commonly used Benjamini-Hochberg procedure and other FDR-controlling procedures.
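
For orientation, the Benjamini-Hochberg baseline named in the abstract above can be sketched in a few lines. This is a minimal illustration of the standard step-up rule only, not of DS-FDR itself; the function name `benjamini_hochberg` and the example p-values are assumptions made for this sketch.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Standard step-up rule: find the largest rank k (1-based, ascending
    p-values) with p_(k) <= (k / m) * alpha, then reject ranks 1..k.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example))
```

Because the rule is step-up, a hypothesis can be rejected even when its own threshold is not met, as long as a larger-ranked p-value passes its threshold.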

The Functional False Discovery Rate with Applications to Genomics

10.1101/241133 ◽

2017 ◽

Cited By ~ 2

Author(s):

Xiongzhi Chen ◽

David G. Robinson ◽

John D. Storey

Keyword(s):

Gene Expression ◽

False Discovery Rate ◽

Genetic Marker ◽

Read Depth ◽

False Discovery Rates ◽

Additional Information ◽

False Discovery ◽

Gene Expression Trait ◽

Genetics Of Gene Expression ◽

False Discoveries

Abstract The false discovery rate measures the proportion of false discoveries among a set of hypothesis tests called significant. This quantity is typically estimated based on p-values or test statistics. In some scenarios, there is additional information available that may be used to more accurately estimate the false discovery rate. We develop a new framework for formulating and estimating false discovery rates and q-values when an additional piece of information, which we call an “informative variable”, is available. For a given test, the informative variable provides information about the prior probability that a null hypothesis is true or the power of that particular test. The false discovery rate is then treated as a function of this informative variable. We consider two applications in genomics. Our first is a genetics of gene expression (eQTL) experiment in yeast where every pair of genetic marker and gene expression trait is tested for association. The informative variable in this case is the distance between each genetic marker and gene. Our second application is to detect differentially expressed genes in an RNA-seq study carried out in mice. The informative variable in this study is the per-gene read depth. The framework we develop is quite general, and it should be useful in a broad range of scientific applications.

On “Field Significance” and the False Discovery Rate

Journal of Applied Meteorology and Climatology ◽

10.1175/jam2404.1 ◽

2006 ◽

Vol 45 (9) ◽

pp. 1181-1189 ◽

Cited By ~ 280

Author(s):

D. S. Wilks

Keyword(s):

False Discovery Rate ◽

Statistical Power ◽

Statistical Significance ◽

Significance Test ◽

P Value ◽

Test Statistic ◽

Global Test ◽

Additional Advantage ◽

Counting Procedure ◽

False Discovery

Abstract The conventional approach to evaluating the joint statistical significance of multiple hypothesis tests (i.e., “field,” or “global,” significance) in meteorology and climatology is to count the number of individual (or “local”) tests yielding nominally significant results and then to judge the unusualness of this integer value in the context of the distribution of such counts that would occur if all local null hypotheses were true. The sensitivity (i.e., statistical power) of this approach is potentially compromised both by the discrete nature of the test statistic and by the fact that the approach ignores the confidence with which locally significant tests reject their null hypotheses. An alternative global test statistic that has neither of these problems is the minimum p value among all of the local tests. Evaluation of field significance using the minimum local p value as the global test statistic, which is also known as the Walker test, has strong connections to the joint evaluation of multiple tests in a way that controls the “false discovery rate” (FDR, or the expected fraction of local null hypothesis rejections that are incorrect). In particular, using the minimum local p value to evaluate field significance at a level αglobal is nearly equivalent to the slightly more powerful global test based on the FDR criterion. An additional advantage shared by Walker’s test and the FDR approach is that both are robust to spatial dependence within the field of tests. The FDR method not only provides a more broadly applicable and generally more powerful field significance test than the conventional counting procedure but also allows better identification of locations with significant differences, because fewer than αglobal × 100% (on average) of apparently significant local tests will have resulted from local null hypotheses that are true.
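
The two global tests compared in this abstract can be sketched as follows. The sketch assumes the usual form of the Walker criterion for K independent local tests, p_Walker = 1 - (1 - α_global)^(1/K), and the step-up FDR criterion p_(j) <= (j/K) α_global; the abstract states neither formula, and the function names are hypothetical.

```python
def walker_threshold(alpha_global, n_tests):
    """Critical value for the minimum p-value among n_tests independent tests."""
    return 1.0 - (1.0 - alpha_global) ** (1.0 / n_tests)


def walker_test(p_values, alpha_global=0.05):
    """Declare field significance if the smallest local p-value is small enough."""
    return min(p_values) <= walker_threshold(alpha_global, len(p_values))


def fdr_global_test(p_values, alpha_global=0.05):
    """Declare field significance if any sorted p-value meets its FDR threshold."""
    k = len(p_values)
    ranked = sorted(p_values)
    return any(p <= (j / k) * alpha_global
               for j, p in enumerate(ranked, start=1))


# Hypothetical field of 100 local tests: ten moderately small p-values.
field = [0.004] * 10 + [0.5] * 90
print(walker_test(field), fdr_global_test(field))
```

In this made-up field no single p-value is extreme enough for the Walker test, but the cluster of moderately small values satisfies the FDR criterion at rank 10, which illustrates why the FDR-based global test can be the more powerful of the two.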

Analyzing Large Gene Expression Data Sets

Computational Text Analysis ◽

10.1093/oso/9780198567400.003.0014 ◽

2006 ◽

Author(s):

Soumya Raychaudhuri

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Analysis ◽

Gene Expression Analysis ◽

Data Sets ◽

Expression Data ◽

Clustering Methods ◽

Biologically Relevant ◽

Large Gene ◽

Functional Coherence

The most interesting and challenging gene expression data sets to analyze are large multidimensional data sets that contain expression values for many genes across multiple conditions. In these data sets the use of scientific text can be particularly useful, since there are a myriad of genes examined under vastly different conditions, each of which may induce or repress expression of the same gene for different reasons. There is an enormous complexity to the data that we are examining: each gene is associated with dozens if not hundreds of expression values as well as multiple documents built up from vocabularies consisting of thousands of words. In Section 2.4 we reviewed common gene expression strategies, most of which revolve around defining groups of genes based on common profiles. A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present computational methods that leverage the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in gene expression data analysis offers an opportunity to incorporate background functional information about the genes when defining expression clusters. In Chapter 5 we saw how literature-based approaches could help in the analysis of single condition experiments. Here we will apply the strategies introduced in Chapter 6 to assess the coherence of groups of genes to enhance gene expression analysis approaches. The methods proposed here could, in fact, be applied to any multivariate genomics data type. The key concepts discussed in this chapter are listed in the frame box. We begin with a discussion of gene groups and their role in expression analysis; we briefly discuss strategies to assign keywords to groups and strategies to assess their functional coherence. We apply functional coherence measures to gene expression analysis; for examples we focus on a yeast expression data set. We first demonstrate how functional coherence can be used to focus in on the key biologically relevant gene groups derived by clustering methods such as self-organizing maps and k-means clustering.

A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets

Molecular & Cellular Proteomics ◽

10.1074/mcp.m114.046995 ◽

2015 ◽

Vol 14 (9) ◽

pp. 2394-2404 ◽

Cited By ~ 157

Author(s):

Mikhail M. Savitski ◽

Mathias Wilhelm ◽

Hannes Hahne ◽

Bernhard Kuster ◽

Marcus Bantscheff

Keyword(s):

False Discovery Rate ◽

Data Sets ◽

Rate Estimation ◽

Proteomic Data ◽

False Discovery ◽

False Discovery Rate Estimation

Estimation of statistical power and false discovery rate of QTL mapping methods through computer simulation

Chinese Science Bulletin ◽

10.1007/s11434-012-5239-3 ◽

2012 ◽

Vol 57 (21) ◽

pp. 2701-2710 ◽

Cited By ~ 11

Author(s):

HuiHui Li ◽

LuYan Zhang ◽

JianKang Wang

Keyword(s):

Computer Simulation ◽

Qtl Mapping ◽

False Discovery Rate ◽

Statistical Power ◽

False Discovery

Testing Mediation Effects in High-Dimensional Microbiome Data with False Discovery Rate Control

10.21203/rs.3.rs-1105471/v1 ◽

2021 ◽

Author(s):

Ye Yue ◽

Yijuan Hu

Keyword(s):

False Discovery Rate ◽

Relative Abundance ◽

Mediation Analysis ◽

Type I Error ◽

New Method ◽

Type I ◽

Inverse Regression ◽

Global Test ◽

Mediation Effects ◽

False Discovery

Abstract Background: Understanding whether and which microbes played a mediating role between an exposure and a disease outcome is essential for researchers to develop clinical interventions to treat the disease by modulating the microbes. Existing methods for mediation analysis of the microbiome are often limited to a global test of community-level mediation or selection of mediating microbes without control of the false discovery rate (FDR). Further, while the null hypothesis of no mediation at each microbe is a composite null that consists of three types of null (no exposure-microbe association, no microbe-outcome association given the exposure, or neither), most existing methods for the global test such as MedTest and MODIMA treat the microbes as if they are all under the same type of null. Results: We propose a new approach based on inverse regression that regresses the (possibly transformed) relative abundance of each taxon on the exposure and the exposure-adjusted outcome to assess the exposure-taxon and taxon-outcome associations simultaneously. Then the association p-values are used to test mediation at both the community and individual taxon levels. This approach fits nicely into our Linear Decomposition Model (LDM) framework, so our new method is implemented in the LDM and enjoys all the features of the LDM, i.e., allowing an arbitrary number of taxa to be tested, supporting continuous, discrete, or multivariate exposures and outcomes as well as adjustment of confounding covariates, accommodating clustered data, and offering analysis at the relative abundance or presence-absence scale. We refer to this new method as LDM-med. Using extensive simulations, we showed that LDM-med always controlled the type I error of the global test and had compelling power over existing methods; LDM-med always preserved the FDR of testing individual taxa and had much better sensitivity than alternative approaches. In contrast, MedTest and MODIMA had severely inflated type I error when different taxa were under different types of null. The flexibility of LDM-med for a variety of mediation analyses is illustrated by the application to a murine microbiome dataset, which identified a plausible mediator. Conclusions: Inverse regression coupled with the LDM is a strategy that performs well and is capable of handling mediation analysis in a wide variety of microbiome studies.

An averaging strategy to reduce variability in target-decoy estimates of false discovery rate

10.1101/440594 ◽

2018 ◽

Cited By ~ 1

Author(s):

Uri Keich ◽

Kaipo Tamura ◽

William Stafford Noble

Keyword(s):

False Discovery Rate ◽

Statistical Power ◽

Shotgun Proteomics ◽

Database Search ◽

Proteomics Data ◽

Decoy Database ◽

Software Toolkit ◽

True Proportion ◽

False Discovery ◽

False Discoveries

Abstract Decoy database search with target-decoy competition (TDC) provides an intuitive, easy-to-implement method for estimating the false discovery rate (FDR) associated with spectrum identifications from shotgun proteomics data. However, the procedure can yield different results for a fixed dataset analyzed with different decoy databases, and this decoy-induced variability is particularly problematic for smaller FDR thresholds, datasets or databases. In such cases, the nominal FDR might be 1% but the true proportion of false discoveries might be 10%. The averaged TDC protocol combats this problem by exploiting multiple independently shuffled decoy databases to provide an FDR estimate with reduced variability. We provide a tutorial introduction to aTDC, describe an improved variant of the protocol that offers increased statistical power, and discuss how to deploy aTDC in practice using the Crux software toolkit.
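
The basic TDC estimate that this abstract starts from can be sketched as a ratio of decoy wins to target wins above a score threshold. This is a simplified illustration of plain TDC under stated assumptions, not the averaged aTDC protocol and not the Crux implementation; the function name `tdc_fdr_estimate`, the `(score, is_decoy)` tuple layout, and the scores are hypothetical.

```python
def tdc_fdr_estimate(psms, threshold):
    """Estimate the FDR among target matches scoring at or above `threshold`.

    `psms` holds one (score, is_decoy) tuple per spectrum: the winner of the
    competition between that spectrum's best target and best decoy match.
    """
    n_target = sum(1 for score, is_decoy in psms
                   if score >= threshold and not is_decoy)
    n_decoy = sum(1 for score, is_decoy in psms
                  if score >= threshold and is_decoy)
    if n_target == 0:
        # Assumption for this sketch: report 0.0 when nothing is discovered.
        return 0.0
    # Decoy wins stand in for the unknown number of false target wins.
    return min(1.0, n_decoy / n_target)


# Hypothetical competition winners, for illustration only.
psms = [(9.1, False), (8.7, False), (8.0, True), (7.5, False),
        (7.2, False), (6.9, True), (6.0, False)]
print(tdc_fdr_estimate(psms, threshold=7.0))
```

With few decoy wins above the threshold, a single additional decoy changes the estimate substantially, which is the decoy-induced variability the abstract describes for small thresholds and datasets.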

Genetic regulation and heritability of miRNA and mRNA expression link to phosphorus utilization and gut microbiome

Open Biology ◽

10.1098/rsob.200182 ◽

2021 ◽

Vol 11 (2) ◽

pp. 200182

Author(s):

Siriluck Ponsuksili ◽

Michael Oster ◽

Henry Reyer ◽

Frieder Hadlich ◽

Nares Trakooljul ◽

...

Keyword(s):

False Discovery Rate ◽

Mrna Expression ◽

Genetic Regulation ◽

Snp Markers ◽

Phenotypic Data ◽

Genomic Heritability ◽

False Discovery ◽

P Utilization ◽

Microbiome Data

Improved utilization of phytates and mineral phosphorus (P) in monogastric animals contributes significantly to preserving the finite resource of mineral P and mitigating environmental pollution. In order to identify pathways and to prioritize candidate genes related to P utilization (PU), the genomic heritability of 77 and 80 trait-dependent expressed miRNAs and mRNAs in 482 Japanese quail were estimated and eQTL (expression quantitative trait loci) were detected. In total, 104 miR-eQTL (microRNA expression quantitative trait loci) were associated with SNP markers (false discovery rate less than 10%) including 41 eQTL of eight miRNAs. Similarly, 944 mRNA-eQTL were identified at the 5% false discovery rate threshold, with 573 being cis-eQTL of 36 mRNAs. High heritabilities of miRNA and mRNA expression coincide with highly significant eQTL. Integration of phenotypic data with transcriptome and microbiome data of the same animals revealed genetically regulated mRNA and miRNA transcripts (SMAD3, CAV1, ENNPP6, ATP2B4, miR-148a-3p, miR-146b-5p, miR-16-5p, miR-194, miR-215-5p, miR-199-3p, miR-1388a-3p) and microbes (Candidatus Arthromitus, Enterococcus) that are associated with PU. The results reveal novel insights into the role of mRNAs and miRNAs in host gut tissue functions, which are involved in PU and other related traits, in terms of the genetic regulation and inheritance of their expression and in association with microbiota components.
