New mixture models for decoy-free false discovery rate estimation in mass spectrometry proteomics

Yisu Peng; Shantanu Jain; Yong Fuga Li; Michal Greguš; Alexander R. Ivanov; Olga Vitek; Predrag Radivojac

doi:10.1093/bioinformatics/btaa807

New mixture models for decoy-free false discovery rate estimation in mass spectrometry proteomics

Bioinformatics ◽

10.1093/bioinformatics/btaa807 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i745-i753

Author(s):

Yisu Peng ◽

Shantanu Jain ◽

Yong Fuga Li ◽

Michal Greguš ◽

Alexander R. Ivanov ◽

...

Keyword(s):

Mass Spectrometry ◽

False Discovery Rate ◽

Mixture Models ◽

Experimental Spectrum ◽

Supplementary Information ◽

Accurate Estimation ◽

Component Mixture ◽

Second Best ◽

False Discovery ◽

Score Distributions

Abstract Motivation Accurate estimation of false discovery rate (FDR) of spectral identification is a central problem in mass spectrometry-based proteomics. Over the past two decades, target-decoy approaches (TDAs) and decoy-free approaches (DFAs) have been widely used to estimate FDR. TDAs use a database of decoy species to faithfully model score distributions of incorrect peptide-spectrum matches (PSMs). DFAs, on the other hand, fit two-component mixture models to learn the parameters of correct and incorrect PSM score distributions. While conceptually straightforward, both approaches lead to problems in practice, particularly in experiments that push instrumentation to the limit and generate low fragmentation-efficiency and low signal-to-noise-ratio spectra. Results We introduce a new decoy-free framework for FDR estimation that generalizes present DFAs while exploiting more search data in a manner similar to TDAs. Our approach relies on multi-component mixtures, in which score distributions corresponding to the correct PSMs, best incorrect PSMs and second-best incorrect PSMs are modeled by the skew normal family. We derive EM algorithms to estimate parameters of these distributions from the scores of best and second-best PSMs associated with each experimental spectrum. We evaluate our models on multiple proteomics datasets and a HeLa cell digest case study consisting of more than a million spectra in total. We provide evidence of improved performance over existing DFAs and improved stability and speed over TDAs without any performance degradation. We propose that the new strategy has the potential to extend beyond peptide identification and reduce the need for TDA on all analytical platforms. Availability and implementation https://github.com/shawn-peng/FDR-estimation. Supplementary information Supplementary data are available at Bioinformatics online.
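To make the decoy-free idea concrete, the following is a minimal sketch of the simpler two-component baseline that the abstract says present DFAs use, not the multi-component skew normal model the authors propose. It assumes Gaussian components in place of the skew normal family, and the function names (`fit_two_component_em`, `mixture_fdr`) are hypothetical and do not come from the authors' repository.

```python
import math
import numpy as np


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def normal_sf(t, mu, sigma):
    """Survival function P(X > t) of a normal distribution."""
    return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2)))


def fit_two_component_em(scores, n_iter=200):
    """Fit a two-component Gaussian mixture to PSM scores with EM.

    Assumption: Gaussian components stand in for the skew normal family.
    Returns (weight, mu, sigma), ordered so that index 0 is the
    lower-scoring (incorrect) component and index 1 the correct one.
    """
    x = np.asarray(scores, dtype=float)
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    sigma = np.array([x.std(), x.std()])
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each score.
        dens = weight[:, None] * normal_pdf(x[None, :], mu[:, None], sigma[:, None])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate weights, means and standard deviations.
        nk = resp.sum(axis=1)
        weight = nk / x.size
        mu = (resp * x).sum(axis=1) / nk
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk
        sigma = np.maximum(np.sqrt(var), 1e-6)  # guard against collapse
    order = np.argsort(mu)
    return weight[order], mu[order], sigma[order]


def mixture_fdr(threshold, weight, mu, sigma):
    """Estimated FDR among PSMs scoring above `threshold`."""
    incorrect = weight[0] * normal_sf(threshold, mu[0], sigma[0])
    correct = weight[1] * normal_sf(threshold, mu[1], sigma[1])
    return incorrect / (incorrect + correct)
```

Given fitted parameters, `mixture_fdr(t, weight, mu, sigma)` is the estimated fraction of incorrect PSMs among all PSMs scoring above `t`; scanning `t` gives the score cut-off for a target FDR.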


DiffNetFDR: differential network analysis with false discovery rate control

Bioinformatics ◽

10.1093/bioinformatics/btz051 ◽

2019 ◽

Vol 35 (17) ◽

pp. 3184-3186

Author(s):

Xiao-Fei Zhang ◽

Le Ou-Yang ◽

Shuo Yang ◽

Xiaohua Hu ◽

Hong Yan

Keyword(s):

False Discovery Rate ◽

Graphical Models ◽

Biological Significance ◽

R Package ◽

Supplementary Information ◽

Gaussian Graphical Models ◽

Multiple Testing Procedure ◽

False Discovery ◽

Differential Network ◽

Shiny App

Abstract Summary To identify biological network rewiring under different conditions, we develop a user-friendly R package, named DiffNetFDR, to implement two methods developed for testing the difference in different Gaussian graphical models. Compared to existing tools, our methods have the following features: (i) they are based on Gaussian graphical models which can capture the changes of conditional dependencies; (ii) they determine the tuning parameters in a data-driven manner; (iii) they take a multiple testing procedure to control the overall false discovery rate; and (iv) our approach defines the differential network based on partial correlation coefficients so that the spurious differential edges caused by the variants of conditional variances can be excluded. We also develop a Shiny application to provide easier analysis and visualization. Simulation studies are conducted to evaluate the performance of our methods. We also apply our methods to two real gene expression datasets. The effectiveness of our methods is validated by the biological significance of the identified differential networks. Availability and implementation R package and Shiny app are available at https://github.com/Zhangxf-ccnu/DiffNetFDR. Supplementary information Supplementary data are available at Bioinformatics online.


A synthetic peptide library for benchmarking crosslinking mass spectrometry search engines

10.1101/821447 ◽

2019 ◽

Author(s):

Rebecca Beveridge ◽

Johannes Stadlmann ◽

Josef M. Penninger ◽

Karl Mechtler

Keyword(s):

Mass Spectrometry ◽

False Discovery Rate ◽

Search Engines ◽

Synthetic Peptide ◽

External Validation ◽

Data Interpretation ◽

Peptide Libraries ◽

Mass Spectrometry Data ◽

Biological Interactions ◽

False Discovery

We have created synthetic peptide libraries to benchmark crosslinking mass spectrometry search engines for different types of crosslinker. The unique benefit of using a library is knowing which identified crosslinks are true and which are false. Here we have used mass spectrometry data generated from measurement of the peptide libraries to evaluate the most frequently applied search algorithms in crosslinking mass-spectrometry. When filtered to an estimated false discovery rate of 5%, false crosslink identification ranged from 5.2% to 11.3% for search engines with inbuilt validation strategies for error estimation. When different external validation strategies were applied to one single search output, false crosslink identification ranged from 2.4% to a surprising 32%, despite being filtered to an estimated 5% false discovery rate. Remarkably, the use of MS-cleavable crosslinkers did not reduce the false discovery rate compared to non-cleavable crosslinkers, results from which have far-reaching implications in structural biology. We anticipate that the datasets acquired during this research will further drive optimisation and development of search engines and novel data-interpretation technologies, thereby advancing our understanding of vital biological interactions.


A generalizable method for false-discovery rate estimation in mass spectrometry-based lipidomics

10.1101/2020.02.18.946483 ◽

2020 ◽

Author(s):

Grant M. Fujimoto ◽

Jennifer E. Kyle ◽

Joon-Yong Lee ◽

Thomas O. Metz ◽

Samuel H. Payne

Keyword(s):

Mass Spectrometry ◽

Data Analysis ◽

False Discovery Rate ◽

Current Method ◽

Data Interpretation ◽

Related Field ◽

Statistical Confidence ◽

False Discovery ◽

False Discovery Rate Estimation ◽

And Function

Abstract Mass spectrometry (MS)-based lipidomics is revolutionizing lipid research with high throughput identification and quantification of hundreds to thousands of lipids with the goal of elucidating lipid metabolism and function. Estimates of statistical confidence in lipid identification are essential for downstream data interpretation in a biological context. In the related field of proteomics, a variety of methods for estimating false-discovery are available, and understanding the statistical confidence of identifications is typically required for data analysis and hypothesis testing. However, there is no current method for estimating the false discovery rate (FDR) or statistical confidence for MS-based lipid identifications. This has slowed the adoption of MS-based lipidomics research, as all identifications require manual inspection and validation to ensure their accuracy. We present here the first generalizable method for FDR estimation, a target/decoy approach, that allows those conducting MS-based lipidomics research to confidently adjust spectral score thresholds to minimize false discovery and to enable full automation of data analysis.


57 Non-targeted metabolomic profiles within the uterine milieu of porcine pregnancies containing populations of uniform or diverse spherical, ovoid, or tubular conceptuses during initiation of embryo elongation

Reproduction Fertility and Development ◽

10.1071/rdv31n1ab57 ◽

2019 ◽

Vol 31 (1) ◽

pp. 154

Author(s):

J. Miles ◽

E. Wright-Johnson ◽

S. Walsh ◽

C. Corey ◽

L. Yao ◽

...

Keyword(s):

Mass Spectrometry ◽

Principal Component Analysis ◽

False Discovery Rate ◽

Principal Component ◽

Component Analysis ◽

False Discovery ◽

Chromatography Mass Spectrometry ◽

Ms Analysis ◽

Rate Adjustment ◽

False Discovery Rate Adjustment

Alterations in the signalling of critical molecular factors within the uterine milieu result in deficiencies in embryo elongation, leading directly to embryonic loss as well as delayed elongation. The objective of this study was to identify metabolites within the uterine environment from populations of uniform and diverse porcine conceptuses as they transition between spherical, ovoid, and tubular conceptuses during the initiation of embryo elongation. White crossbred gilts (n=38) were bred at standing oestrus (designated Day 0) and again 24h later and randomly assigned to collection group. At Day 9, 10, or 11 of gestation, reproductive tracts were collected immediately following harvest and flushed with 40mL of RPMI-1640 media. Conceptus morphologies were assessed from each pregnancy to assign to 1 of 5 treatment groups based on these morphologies: (1) uniform spherical (n=8); (2) diverse spherical and ovoid (n=8); (3) uniform ovoid (n=8); (4) diverse ovoid and tubular (n=8); and (5) uniform tubular (n=6). Subsequently uterine flushings from these pregnancies were submitted for non-targeted profiling by gas chromatography-mass spectrometry (GC-MS) and ultra-performance liquid chromatography-mass spectrometry (UPLC-MS) techniques. Raw spectral data were processed using the XCMS package in R (R Foundation for Statistical Computing, Vienna, Austria) and features were clustered using RAMclustR. Unsupervised multivariate principal component analysis was performed in R using the pcaMethods package, and univariate ANOVA was performed in R with a Benjamini-Hochberg false discovery rate adjustment. Principal component analysis of the GC-MS and UPLC-MS data identified 153 and 104 metabolites, respectively. Of the identified metabolites, 51 and 71 metabolites from the GC-MS and UPLC-MS analysis, respectively, corresponded to known compounds. After false discovery rate adjustment of the GC-MS and UPLC-MS data, 38 and 59 metabolites from the GC-MS and UPLC-MS analysis, respectively, differed (P<0.05) in uterine flushings from pregnancies for the 5 conceptus stages. Some metabolites were greater (P<0.05) in abundance for uterine flushings containing earlier stage conceptuses (i.e. spherical) such as uric acid, tryptophan, 5-hydroxy-L-tryptophan, and L-tyrosine. In contrast, some metabolites were greater (P<0.05) in abundance for uterine flushings containing later stage conceptuses (i.e. tubular) such as creatinine, serine, isovaleryl-L-carnitine, and lauric diethanolamide. These data illustrate several putative metabolites that change within the uterine milieu as porcine embryos transition between spherical, ovoid, and tubular conceptuses. Funding was provided by USDA-NIFA-AFRI Grant no. 2017-67015-26456.
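For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned in this abstract, a minimal sketch of the standard step-up adjustment follows. The study itself ran the analysis in R; this Python version is an illustration only, and the function name `benjamini_hochberg` and the example P-values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest P-value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted


# Hypothetical P-values from four univariate tests.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A metabolite would then be reported as differing between groups when its adjusted P-value falls below the chosen level, such as the 0.05 used in the abstract.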


RECAP reveals the true statistical significance of ChIP-seq peak calls

Bioinformatics ◽

10.1093/bioinformatics/btz150 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3592-3598 ◽

Cited By ~ 1

Author(s):

Justin G Chitpin ◽

Aseel Awdeh ◽

Theodore J Perkins

Keyword(s):

False Discovery Rate ◽

Statistical Significance ◽

Statistical Hypothesis ◽

Supplementary Information ◽

Peak Calling ◽

Statistical Hypothesis Testing ◽

P Values ◽

False Discovery ◽

Genomic Regions ◽

And Control

Abstract Motivation Chromatin Immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice—once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, thus the true significance or reliability of peak calls remains unknown. Results Using simulated and real ChIP-seq data, we show that three well-known peak callers, MACS, SICER and diffReps, output biased P-values and false discovery rate estimates that can be many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate a monotone transform correcting for biases built into peak calling algorithms. When applied to null hypothesis data, where there is no enrichment between ChIP-seq and control, P-values recalibrated by RECAP are approximately uniformly distributed. On data where there is genuine enrichment, RECAP P-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls. Availability and implementation The RECAP software is available through www.perkinslab.ca or on github at https://github.com/theodorejperkins/RECAP. Supplementary information Supplementary data are available at Bioinformatics online.


Using mixture models to detect differentially expressed genes

Australian Journal of Experimental Agriculture ◽

10.1071/ea05051 ◽

2005 ◽

Vol 45 (8) ◽

pp. 859 ◽

Cited By ~ 13

Author(s):

G. J. McLachlan ◽

R. W. Bean ◽

L. Ben-Tovim Jones ◽

J. X. Zhu

Keyword(s):

False Discovery Rate ◽

Mixture Models ◽

False Negative ◽

False Negative Rate ◽

Multiple Hypothesis Testing ◽

Differentially Expressed ◽

False Discovery ◽

Mixture Model Approach ◽

Number Of Classes ◽

Selection Of

An important and common problem in microarray experiments is the detection of genes that are differentially expressed in a given number of classes. As this problem concerns the selection of significant genes from a large pool of candidate genes, it needs to be carried out within the framework of multiple hypothesis testing. In this paper, we focus on the use of mixture models to handle the multiplicity issue. With this approach, a measure of the local false discovery rate is provided for each gene, and it can be implemented so that the implied global false discovery rate is bounded as with the Benjamini-Hochberg methodology based on tail areas. The latter procedure is too conservative, unless it is modified according to the prior probability that a gene is not differentially expressed. An attractive feature of the mixture model approach is that it provides a framework for the estimation of this probability and its subsequent use in forming a decision rule. The rule can also be formed to take the false negative rate into account.
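The quantities named in this abstract can be written compactly. The notation below is an assumption introduced for illustration and does not appear in the listing: $z_j$ is the test statistic for gene $j$, $\pi_0$ is the prior probability that a gene is not differentially expressed, and $f_0$ and $f_1$ are the densities of the statistic for null and differentially expressed genes.

```latex
\begin{align}
  f(z_j) &= \pi_0 f_0(z_j) + (1 - \pi_0)\, f_1(z_j), \\
  \tau_0(z_j) &= \frac{\pi_0 f_0(z_j)}{f(z_j)}
    \quad \text{(local false discovery rate of gene } j\text{)}, \\
  \widehat{\mathrm{FDR}}(c) &= \frac{\sum_{j} \tau_0(z_j)\, \mathbf{1}\{\tau_0(z_j) \le c\}}
                                   {\sum_{j} \mathbf{1}\{\tau_0(z_j) \le c\}}.
\end{align}
```

Under this sketch, a gene is declared differentially expressed when $\tau_0(z_j) \le c$, and the implied global FDR is the average local rate over the selected genes, so it cannot exceed $c$.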


PEAK DETECTION IN MASS SPECTROMETRY BY GABOR FILTERS AND ENVELOPE ANALYSIS

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720009004229 ◽

2009 ◽

Vol 07 (03) ◽

pp. 547-569 ◽

Cited By ~ 10

Author(s):

NHA NGUYEN ◽

HENG HUANG ◽

SOONTORN ORAINTARA ◽

AN VO

Keyword(s):

Mass Spectrometry ◽

False Discovery Rate ◽

Gabor Filter ◽

Signal To Noise Ratio ◽

Peak Detection ◽

Envelope Analysis ◽

Local Maxima ◽

False Discovery ◽

Lower False Discovery Rate ◽

True Position

Mass Spectrometry (MS) is increasingly being used to discover disease-related proteomic patterns. The peak detection step is one of the most important steps in the typical analysis of MS data. Recently, many new algorithms have been proposed to increase the true position rate with a low false discovery rate in peak detection. Most of them follow one of two approaches: denoising or decomposing. In previous studies, the decomposing approach showed more potential than the denoising approach. In this paper, we propose two novel methods, named GaborLocal and GaborEnvelop, both of which can detect more true peaks with a lower false discovery rate than previous methods. We employ the method of Gaussian local maxima to detect peaks, because it is robust to noise in signals. A new approach, peak rank, is defined for the first time to identify peaks instead of using the signal-to-noise ratio. Meanwhile, the Gabor filter is used to amplify important information and compress noise in the raw MS signal. Moreover, we also propose the envelope analysis to improve the quantification of peaks and remove more false peaks. The proposed methods have been applied to a real SELDI-TOF spectrum with known polypeptide positions. The experimental results demonstrate that our methods outperform other commonly used methods in terms of the Receiver Operating Characteristic (ROC) curve.


KIMI: Knockoff Inference for Motif Identification from molecular sequences with controlled false discovery rate

Bioinformatics ◽

10.1093/bioinformatics/btaa912 ◽

2020 ◽

Author(s):

Xin Bai ◽

Jie Ren ◽

Yingying Fan ◽

Fengzhu Sun

Keyword(s):

False Discovery Rate ◽

Motif Discovery ◽

Rapid Development ◽

Biological Significance ◽

Supplementary Information ◽

Sequence Motif ◽

Metagenomic Sequencing ◽

Simulation Studies ◽

Sequencing Technologies ◽

False Discovery

Abstract Motivation The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into understanding the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses, plasmids, etc. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all the k-mers as features for prediction without selecting relevant k-mers for the different groups of sequences, i.e. unique nucleotide patterns containing biological significance. Results To select k-mers for distinguishing different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework based on model-X Knockoffs regarded as the state-of-the-art statistical method for FDR control, for sequence motif discovery with arbitrary target FDR level, such that reproducibility can be theoretically guaranteed. KIMI is shown through simulation studies to be effective in simultaneously controlling FDR and yielding high power, outperforming the broadly used Benjamini–Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and implement KIMI on a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on relevant k-mers selected by KIMI. Availability and implementation Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI. Supplementary information Supplementary data are available at Bioinformatics online.


Beyond target-decoy competition: stable validation of peptide and protein identifications in mass spectrometry-based discovery proteomics

10.1101/765057 ◽

2019 ◽

Cited By ~ 1

Author(s):

Yohann Couté ◽

Christophe Bruley ◽

Thomas Burger

Keyword(s):

Mass Spectrometry ◽

False Discovery Rate ◽

Bottom Up ◽

Mass Spectrometers ◽

Popular Method ◽

Protein Levels ◽

False Discovery ◽

Modern Mass

Abstract In bottom-up discovery proteomics, target-decoy competition (TDC) is the most popular method for false discovery rate (FDR) control. Despite unquestionable statistical foundations, this method has drawbacks, including its hitherto unknown intrinsic lack of stability vis-à-vis practical conditions of application. Although some consequences of this instability have already been empirically described, they may have been misinterpreted. This article provides evidence that TDC has become less reliable as the accuracy of modern mass spectrometers improved. We therefore propose to replace TDC by a totally different method to control the FDR at spectrum, peptide and protein levels, while benefiting from the theoretical guarantees of the Benjamini-Hochberg framework. As this method is simpler to use, faster to compute and more stable than TDC, we argue that it is better adapted to the standardization and throughput constraints of current proteomic platforms.
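As background for the TDC procedure this abstract critiques, here is a minimal sketch of the commonly used decoy-count estimate, assuming each spectrum has already been assigned to its best-scoring target or decoy match. It is not the replacement method the authors propose, and the function name `tdc_fdr` and the example scores are hypothetical.

```python
def tdc_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR among target PSMs scoring at or above `threshold`.

    Assumes the inputs are the scores of spectra whose best match was a
    target or a decoy, respectively (i.e. after competition). The number
    of decoys above the threshold is used as a proxy for the number of
    incorrect target matches above it.
    """
    n_targets = sum(score >= threshold for score in target_scores)
    n_decoys = sum(score >= threshold for score in decoy_scores)
    if n_targets == 0:
        return 0.0
    return min(1.0, n_decoys / n_targets)


# Hypothetical scores for six target wins and three decoy wins.
targets = [10, 9, 8, 7, 3, 2]
decoys = [4, 3, 1]
print(tdc_fdr(targets, decoys, threshold=3))
```

Because the estimate is a ratio of two counts, a small number of decoys near the cut-off can shift it noticeably, which is one way to picture the stability concern raised in the abstract.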


A semi-parametric approach for mixture models: Application to local false discovery rate estimation

Computational Statistics & Data Analysis ◽

10.1016/j.csda.2007.02.028 ◽

2007 ◽

Vol 51 (12) ◽

pp. 5483-5493 ◽

Cited By ~ 24

Author(s):

Stéphane Robin ◽

Avner Bar-Hen ◽

Jean-Jacques Daudin ◽

Laurent Pierre

Keyword(s):

False Discovery Rate ◽

Mixture Models ◽

Local False Discovery Rate ◽

Parametric Approach ◽

Rate Estimation ◽

False Discovery ◽

False Discovery Rate Estimation
