An averaging strategy to reduce variability in target-decoy estimates of false discovery rate

Abstract: Decoy database search with target-decoy competition (TDC) provides an intuitive, easy-to-implement method for estimating the false discovery rate (FDR) associated with spectrum identifications from shotgun proteomics data. However, the procedure can yield different results for a fixed dataset analyzed with different decoy databases, and this decoy-induced variability is particularly problematic for smaller FDR thresholds, datasets or databases. In such cases, the nominal FDR might be 1% but the true proportion of false discoveries might be 10%. The averaged TDC (aTDC) protocol combats this problem by exploiting multiple independently shuffled decoy databases to provide an FDR estimate with reduced variability. We provide a tutorial introduction to aTDC, describe an improved variant of the protocol that offers increased statistical power, and discuss how to deploy aTDC in practice using the Crux software toolkit.
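As a rough illustration of the idea in this abstract, the sketch below estimates FDR at a score threshold by target-decoy competition and then averages that estimate over several decoy sets. It is a simplified sketch under stated assumptions, not the published aTDC protocol: the function names `tdc_fdr` and `naive_averaged_fdr` are hypothetical, ties are assigned to the target, and the estimate is the plain decoy-to-target ratio.

```python
from statistics import mean


def tdc_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR at `threshold` by target-decoy competition.

    target_scores[i] and decoy_scores[i] are the best target and best
    decoy score for spectrum i (higher is better).
    """
    target_wins = 0
    decoy_wins = 0
    for t, d in zip(target_scores, decoy_scores):
        # Competition: only the better of the two matches is kept.
        if max(t, d) < threshold:
            continue
        if t >= d:  # assumption: ties go to the target
            target_wins += 1
        else:
            decoy_wins += 1
    # Plain ratio estimate; guard against division by zero.
    return decoy_wins / max(target_wins, 1)


def naive_averaged_fdr(target_scores, decoy_score_sets, threshold):
    """Average the TDC estimate over several independent decoy sets.

    This only illustrates why averaging reduces decoy-induced
    variability; it is not the aTDC procedure itself.
    """
    return mean(tdc_fdr(target_scores, d, threshold) for d in decoy_score_sets)
```

With a single decoy set, the estimate can swing noticeably when only a handful of decoys pass the threshold; averaging over several sets smooths that swing, which is the motivation the abstract gives.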

Mass spectrometrists should search for all peptides, but assess only the ones they care about

10.1101/094581 ◽

2017 ◽

Author(s):

Adriaan Sticker ◽

Lennart Martens ◽

Lieven Clement

Keyword(s):

False Discovery Rate ◽

Statistical Power ◽

Mass Spectra ◽

Shotgun Proteomics ◽

False Discovery ◽

Scientific Hypothesis ◽

Calculation Results ◽

Classical Strategy

Abstract: In shotgun proteomics, identified mass spectra that are deemed irrelevant to the scientific hypothesis are often discarded. Noble (2015) therefore urged researchers to remove irrelevant peptides from the database prior to searching to improve statistical power. Here, however, we argue that both the classical method and Noble’s revised method produce suboptimal peptide identifications and have problems in controlling the false discovery rate (FDR). Instead, we show that searching for all expected peptides and removing irrelevant peptides prior to FDR calculation results in more reliable identifications at a controlled FDR level than the classical strategy, which discards irrelevant peptides after FDR calculation, or Noble’s strategy, which discards irrelevant peptides prior to searching.

The Functional False Discovery Rate with Applications to Genomics

10.1101/241133 ◽

2017 ◽

Cited By ~ 2

Author(s):

Xiongzhi Chen ◽

David G. Robinson ◽

John D. Storey

Keyword(s):

Gene Expression ◽

False Discovery Rate ◽

Genetic Marker ◽

Read Depth ◽

False Discovery Rates ◽

Additional Information ◽

False Discovery ◽

Gene Expression Trait ◽

Genetics Of Gene Expression ◽

False Discoveries

Abstract: The false discovery rate measures the proportion of false discoveries among a set of hypothesis tests called significant. This quantity is typically estimated based on p-values or test statistics. In some scenarios, there is additional information available that may be used to more accurately estimate the false discovery rate. We develop a new framework for formulating and estimating false discovery rates and q-values when an additional piece of information, which we call an “informative variable”, is available. For a given test, the informative variable provides information about the prior probability that a null hypothesis is true or the power of that particular test. The false discovery rate is then treated as a function of this informative variable. We consider two applications in genomics. Our first is a genetics of gene expression (eQTL) experiment in yeast where every pair of genetic marker and gene expression trait is tested for association. The informative variable in this case is the distance between each genetic marker and gene. Our second application is to detect differentially expressed genes in an RNA-seq study carried out in mice. The informative variable in this study is the per-gene read depth. The framework we develop is quite general, and it should be useful in a broad range of scientific applications.
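For orientation, the conventional p-value-based quantities that this abstract takes as its starting point can be written as follows. This is standard background rather than the functional framework the paper develops; the symbols are assumptions introduced here: V is the number of false discoveries, R the number of tests called significant, m the number of tests, and \hat{\pi}_0 an estimate of the proportion of true null hypotheses.

```latex
\begin{align}
  \mathrm{FDR} &= \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right], \\
  \widehat{\mathrm{FDR}}(t) &= \frac{\hat{\pi}_0 \, m \, t}{\max\bigl(\#\{i : p_i \le t\},\, 1\bigr)}, \\
  \hat{q}(p_i) &= \min_{t \ge p_i} \widehat{\mathrm{FDR}}(t).
\end{align}
```

The q-value of a test is thus the smallest estimated FDR at which that test would still be called significant.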

Deep Semi-Supervised Learning Improves Universal Peptide Identification of Shotgun Proteomics Data

10.1101/2020.11.12.380881 ◽

2020 ◽

Author(s):

John T. Halloran ◽

Gregor Urban ◽

David Rocke ◽

Pierre Baldi

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Peptide Identification ◽

Shotgun Proteomics ◽

Database Search ◽

Supervised Machine Learning ◽

Superior Performance ◽

Support Vector ◽

Proteomics Data ◽

Learning Classifier

Abstract: Semi-supervised machine learning post-processors critically improve peptide identification of shotgun proteomics data. Such post-processors accept the peptide-spectrum matches (PSMs) and feature vectors resulting from a database search, train a machine learning classifier, and recalibrate PSMs using the trained parameters, often yielding significantly more identified peptides across q-value thresholds. However, current state-of-the-art post-processors rely on shallow machine learning methods, such as support vector machines. In contrast, the powerful training capabilities of deep learning models have displayed superior performance to shallow models in an ever-growing number of other fields. In this work, we show that deep models significantly improve the recalibration of PSMs compared to the most accurate and widely used post-processors, such as Percolator and PeptideProphet. Furthermore, we show that deep learning is able to adaptively analyze complex datasets and features for more accurate universal post-processing, leading to both improved Prosit analysis and markedly better recalibration of recently developed database-search functions.

On “Field Significance” and the False Discovery Rate

Journal of Applied Meteorology and Climatology ◽

10.1175/jam2404.1 ◽

2006 ◽

Vol 45 (9) ◽

pp. 1181-1189 ◽

Cited By ~ 280

Author(s):

D. S. Wilks

Keyword(s):

False Discovery Rate ◽

Statistical Power ◽

Statistical Significance ◽

Significance Test ◽

P Value ◽

Test Statistic ◽

Global Test ◽

Additional Advantage ◽

Counting Procedure ◽

False Discovery

Abstract: The conventional approach to evaluating the joint statistical significance of multiple hypothesis tests (i.e., “field,” or “global,” significance) in meteorology and climatology is to count the number of individual (or “local”) tests yielding nominally significant results and then to judge the unusualness of this integer value in the context of the distribution of such counts that would occur if all local null hypotheses were true. The sensitivity (i.e., statistical power) of this approach is potentially compromised both by the discrete nature of the test statistic and by the fact that the approach ignores the confidence with which locally significant tests reject their null hypotheses. An alternative global test statistic that has neither of these problems is the minimum p value among all of the local tests. Evaluation of field significance using the minimum local p value as the global test statistic, which is also known as the Walker test, has strong connections to the joint evaluation of multiple tests in a way that controls the “false discovery rate” (FDR, or the expected fraction of local null hypothesis rejections that are incorrect). In particular, using the minimum local p value to evaluate field significance at a level α_global is nearly equivalent to the slightly more powerful global test based on the FDR criterion. An additional advantage shared by Walker’s test and the FDR approach is that both are robust to spatial dependence within the field of tests. The FDR method not only provides a more broadly applicable and generally more powerful field significance test than the conventional counting procedure but also allows better identification of locations with significant differences, because fewer than α_global × 100% (on average) of apparently significant local tests will have resulted from local null hypotheses that are true.
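The two global tests compared in this abstract can be sketched in a few lines. The sketch assumes the commonly stated forms: the Walker test rejects the global null when the smallest of K local p values is at most 1 − (1 − α_global)^(1/K), and the FDR criterion is taken to be the Benjamini-Hochberg step-up rule, which rejects every local test whose p value is no larger than the largest sorted p value p_(j) satisfying p_(j) ≤ (j/K) α_global. The function names are hypothetical.

```python
def walker_threshold(alpha_global, k):
    """Critical value for the minimum of k local p values (Walker test)."""
    return 1.0 - (1.0 - alpha_global) ** (1.0 / k)


def walker_test(p_values, alpha_global):
    """Return True if the global null hypothesis is rejected."""
    return min(p_values) <= walker_threshold(alpha_global, len(p_values))


def fdr_rejections(p_values, alpha_global):
    """Indices of local tests rejected under the step-up FDR criterion.

    Field significance is declared if the returned list is non-empty.
    """
    k = len(p_values)
    # Indices of the tests, ordered from smallest to largest p value.
    order = sorted(range(k), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p value meets its threshold.
        if p_values[i] <= alpha_global * rank / k:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])
```

For small α_global the Walker critical value is close to α_global / K, which is the FDR threshold for the smallest p value; the FDR criterion can additionally reject through the higher-ranked p values, consistent with the abstract's description of it as slightly more powerful.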

Estimation of statistical power and false discovery rate of QTL mapping methods through computer simulation

Chinese Science Bulletin ◽

10.1007/s11434-012-5239-3 ◽

2012 ◽

Vol 57 (21) ◽

pp. 2701-2710 ◽

Cited By ~ 11

Author(s):

HuiHui Li ◽

LuYan Zhang ◽

JianKang Wang

Keyword(s):

Computer Simulation ◽

Qtl Mapping ◽

False Discovery Rate ◽

Statistical Power ◽

False Discovery

A Heuristic Method for Assigning a False-discovery Rate for Protein Identifications from Mascot Database Search Results

Molecular & Cellular Proteomics ◽

10.1074/mcp.m400215-mcp200 ◽

2005 ◽

Vol 4 (6) ◽

pp. 762-772 ◽

Cited By ~ 131

Author(s):

D. Brent Weatherly ◽

James A. Atwood ◽

Todd A. Minning ◽

Cameron Cavola ◽

Rick L. Tarleton ◽

...

Keyword(s):

False Discovery Rate ◽

Heuristic Method ◽

Database Search ◽

Search Results ◽

False Discovery

Combining high resolution and exact calibration to boost statistical power: A well-calibrated score function for high-resolution MS2 data

10.1101/290858 ◽

2018 ◽

Author(s):

Andy Lin ◽

J. Jeffry Howbert ◽

William Stafford Noble

Keyword(s):

Mass Spectrometry ◽

High Resolution ◽

Statistical Power ◽

State Of The Art ◽

Score Function ◽

Shotgun Proteomics ◽

Database Search ◽

Mass Spectrometry Data ◽

P Value ◽

Score Functions

Abstract: To achieve accurate assignment of peptide sequences to observed fragmentation spectra, a shotgun proteomics database search tool must make good use of the very high resolution information produced by state-of-the-art mass spectrometers. However, making use of this information while also ensuring that the search engine’s scores are well calibrated (i.e., that the score assigned to one spectrum can be meaningfully compared to the score assigned to a different spectrum) has proven to be challenging. Here, we describe a database search score function, the “residue evidence” (res-ev) score, that achieves both of these goals simultaneously. We also demonstrate how to combine calibrated res-ev scores with calibrated XCorr scores to produce a “combined p-value” score function. We provide a benchmark consisting of four mass spectrometry data sets, which we use to compare the combined p-value to the score functions used by several existing search engines. Our results suggest that the combined p-value achieves state-of-the-art performance, generally outperforming MS Amanda and Morpheus and performing comparably to MS-GF+. The res-ev and combined p-value score functions are freely available as part of the Tide search engine in the Crux mass spectrometry toolkit (http://crux.ms).

Unbiased False Discovery Rate Estimation for Shotgun Proteomics Based on the Target-Decoy Approach

Journal of Proteome Research ◽

10.1021/acs.jproteome.6b00144 ◽

2016 ◽

Vol 16 (2) ◽

pp. 393-397 ◽

Cited By ~ 30

Author(s):

Lev I. Levitsky ◽

Mark V. Ivanov ◽

Anna A. Lobas ◽

Mikhail V. Gorshkov

Keyword(s):

False Discovery Rate ◽

Shotgun Proteomics ◽

Rate Estimation ◽

False Discovery ◽

False Discovery Rate Estimation

Empirical approach to false discovery rate estimation in shotgun proteomics

Rapid Communications in Mass Spectrometry ◽

10.1002/rcm.4417 ◽

2010 ◽

Vol 24 (4) ◽

pp. 454-462 ◽

Cited By ~ 12

Author(s):

Anton A. Goloborodko ◽

Corina Mayerhofer ◽

Alexander R. Zubarev ◽

Irina A. Tarasova ◽

Alexander V. Gorshkov ◽

...

Keyword(s):

False Discovery Rate ◽

Shotgun Proteomics ◽

Empirical Approach ◽

Rate Estimation ◽

False Discovery ◽

False Discovery Rate Estimation

Discrete False-Discovery Rate Improves Identification of Differentially Abundant Microbes

mSystems ◽

10.1128/msystems.00092-17 ◽

2017 ◽

Vol 2 (6) ◽

Cited By ~ 25

Author(s):

Lingjing Jiang ◽

Amnon Amir ◽

James T. Morton ◽

Ruth Heller ◽

Ery Arias-Castro ◽

...

Keyword(s):

Gene Expression ◽

False Discovery Rate ◽

Real World ◽

Statistical Power ◽

Gene Expression Analysis ◽

New Method ◽

Data Sets ◽

Differential Abundance ◽

False Discovery ◽

Microbiome Data

Abstract: Differential abundance testing is a critical task in microbiome studies that is complicated by the sparsity of data matrices. Here we adapt for microbiome studies a solution from the field of gene expression analysis to produce a new method, discrete false-discovery rate (DS-FDR), that greatly improves the power to detect differential taxa by exploiting the discreteness of the data. Additionally, DS-FDR is relatively robust to the number of noninformative features, and thus removes the problem of filtering taxonomy tables by an arbitrary abundance threshold. We show by using a combination of simulations and reanalysis of nine real-world microbiome data sets that this new method outperforms existing methods at the differential abundance testing task, producing a false-discovery rate that is up to threefold more accurate, and halves the number of samples required to find a given difference (thus increasing the efficiency of microbiome experiments considerably). We therefore expect DS-FDR to be widely applied in microbiome studies. Importance: DS-FDR can achieve higher statistical power to detect significant findings in sparse and noisy microbiome data compared to the commonly used Benjamini-Hochberg procedure and other FDR-controlling procedures.
