In-Depth Benchmarking of DIA-type Proteomics Data Analysis Strategies Using a Large-Scale Benchmark Dataset Comprising Inter-Patient Heterogeneity

Author(s):  
Klemens Fröhlich ◽  
Eva Brombacher ◽  
Matthias Fahrner ◽  
Daniel Vogele ◽  
Lucas Kook ◽  
...  

Abstract An overwhelming number of proteomics software tools and algorithms have been published for the different steps of data-independent acquisition (DIA) analysis of clinical samples. Nonetheless, there is still a lack of comprehensive benchmark studies evaluating which combinations of those isolated components perform best. Here, we used 92 lymph nodes from distinct patients to create a unique benchmark dataset representing real-world inter-individual heterogeneity. The publicly available dataset comprises 118 LC-MS/MS runs with > 12 million MS2 spectra and allowed us to objectively evaluate how well different combinations of spectral libraries, DIA software, sparsity reduction, normalization and statistical tests detect differentially abundant proteins, while also taking sample size into account. Evaluation of 2 million data analysis workflows showed that a spectral library refined by gas-phase fractionation, in combination with DIA-NN and Significance Analysis of Microarrays, reliably detected differentially abundant proteins. Furthermore, DIA-NN and Spectronaut robustly avoided the false detection of truly absent proteins.
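To make the scale of such a benchmark concrete, the sketch below enumerates hypothetical combinations of spectral library, DIA software, sparsity reduction, normalization and statistical test, and counts the resulting workflows; the option names and the scoring stub are illustrative assumptions, not the study's actual configuration.

```python
# Minimal sketch of enumerating DIA analysis workflow combinations and scoring
# them against a known ground truth; the option lists and the scoring helper
# are illustrative assumptions, not the benchmark's actual implementation.
from itertools import product

libraries = ["DDA_library", "GPF_refined_library", "predicted_library"]
dia_software = ["DIA-NN", "Spectronaut"]
sparsity_reduction = ["none", "drop_50pct_missing"]
normalization = ["median", "quantile", "vsn"]
stat_tests = ["t-test", "limma", "SAM"]

def score_workflow(combo, ground_truth):
    """Placeholder: run the workflow and return sensitivity/FDR vs. ground truth."""
    raise NotImplementedError

workflows = list(product(libraries, dia_software, sparsity_reduction,
                         normalization, stat_tests))
print(f"{len(workflows)} candidate workflows to evaluate")
# for combo in workflows:
#     metrics = score_workflow(combo, ground_truth)
```

In the actual study the grid is far larger and each combination is additionally evaluated across different sample sizes, which is what pushes the number of workflows into the millions.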

2019 ◽  
Author(s):  
B. Van Puyvelde ◽  
S. Willems ◽  
R. Gabriels ◽  
S. Daled ◽  
L. De Clerck ◽  
...  

Data-Independent Acquisition (DIA) generates comprehensive yet complex mass spectrometric data, which imposes the use of data-dependent acquisition (DDA) libraries for deep peptide-centric detection. We here show that DIA can be redeemed from this dependency by combining predicted fragment intensities and retention times with narrow-window DIA. This eliminates variation in library building and omits stochastic sampling, finally making the DIA workflow fully deterministic. Especially for clinical proteomics, this has the potential to facilitate inter-laboratory comparison.

Significance of the Study: Data-independent acquisition (DIA) is quickly developing into the most comprehensive strategy to analyse a sample on a mass spectrometer. Correspondingly, a wave of data analysis strategies has followed suit, improving the yield from DIA experiments with each iteration. As a result, a worldwide wave of investments in DIA is already taking place in anticipation of clinical applications. Yet, there is considerable confusion about the most useful and efficient way to handle DIA data, given the plethora of possible approaches with little regard for compatibility and complementarity. In our manuscript, we outline the currently available peptide-centric DIA data analysis strategies in a unified graphic called the DIAmond DIAgram. This leads us to an innovative and easily adoptable approach based on predicted spectral information. Most importantly, our contribution removes what is arguably the biggest bottleneck in the field: the current need for data-dependent acquisition (DDA) prior to DIA analysis. Fractionation, stochastic data acquisition, processing and identification all introduce bias into the library. By generating libraries through data-independent, i.e. deterministic, acquisition, stochastic sampling is now fully omitted from the DIA workflow. This is a crucial step towards increased standardization. Additionally, our results demonstrate that a proteome-wide predicted spectral library can substitute for an exhaustive DDA Pan-Human library built from 331 prior DDA runs.
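As a rough illustration of the library-building step this approach replaces, the following sketch assembles a small predicted spectral library from peptide sequences alone; predict_fragments and predict_rt are hypothetical placeholders standing in for dedicated prediction models, not the tools used in the study.

```python
# Conceptual sketch of assembling a predicted (DDA-free) spectral library;
# predict_fragments and predict_rt are hypothetical placeholders for dedicated
# prediction models, not the software used in the manuscript.
import pandas as pd

def predict_fragments(peptide: str) -> dict:
    """Placeholder for a fragment-intensity predictor (returns ion -> intensity)."""
    return {f"b{i}": 0.0 for i in range(2, len(peptide))}

def predict_rt(peptide: str) -> float:
    """Placeholder for a retention-time predictor (returns RT in minutes)."""
    return float(len(peptide))

peptides = ["AAAAAKPK", "LGEYGFQNALIVR"]  # example sequences only
library = pd.DataFrame({
    "peptide": peptides,
    "predicted_rt": [predict_rt(p) for p in peptides],
    "fragments": [predict_fragments(p) for p in peptides],
})
print(library.head())
```

The point of the sketch is that every entry is derived deterministically from the sequence, so no prior DDA runs (and none of their stochastic sampling) enter the library.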


2020 ◽  
Author(s):  
Leon Bichmann ◽  
Shubham Gupta ◽  
George Rosenberger ◽  
Leon Kuchenbecker ◽  
Timo Sachsenberg ◽  
...  

ABSTRACT Data-independent acquisition (DIA) is becoming a leading analysis method in biomedical mass spectrometry. Its main advantages include greater reproducibility, sensitivity and dynamic range compared to data-dependent acquisition (DDA). However, data analysis is complex and often requires expert knowledge when dealing with large-scale data sets. Here we present DIAproteomics, a multi-functional, automated high-throughput pipeline implemented in Nextflow that allows proteomics and peptidomics DIA datasets to be easily processed on diverse compute infrastructures. Its central components are well-established tools such as the OpenSwathWorkflow for DIA spectral library search and PyProphet for false discovery rate assessment. In addition, it provides options to generate spectral libraries from existing DDA data and to carry out retention time and chromatogram alignment. The output includes annotated tables and diagnostic visualizations from statistical post-processing, as well as fold-changes computed across pairwise conditions predefined in an experimental design. DIAproteomics is open-source software and available under a permissive license to the scientific community at https://www.openms.de/diaproteomics/.
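As a conceptual illustration of the final post-processing stage described above (fold-changes across pairwise conditions defined in an experimental design), the sketch below computes log2 fold-changes from a toy protein quantification matrix; the table layout and column names are assumptions, not DIAproteomics' actual output schema.

```python
# Illustrative sketch of computing log2 fold-changes across pairwise conditions
# from a protein quantification matrix and an experimental design table; column
# names and values are toy assumptions, not the pipeline's actual output format.
import numpy as np
import pandas as pd
from itertools import combinations

quant = pd.DataFrame(  # proteins x samples intensity matrix (toy values)
    {"s1": [10.0, 200.0], "s2": [12.0, 180.0], "s3": [40.0, 60.0], "s4": [38.0, 70.0]},
    index=["P1", "P2"],
)
design = pd.Series({"s1": "control", "s2": "control", "s3": "treated", "s4": "treated"})

log2 = np.log2(quant)
results = {}
for a, b in combinations(design.unique(), 2):
    mean_a = log2.loc[:, design[design == a].index].mean(axis=1)
    mean_b = log2.loc[:, design[design == b].index].mean(axis=1)
    results[f"{b}_vs_{a}"] = mean_b - mean_a  # log2 fold-change per protein
print(pd.DataFrame(results))
```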


2021 ◽  
Author(s):  
Alejandro Fernandez-Vega ◽  
Federica Farabegoli ◽  
Maria Mercedes Alonso-Martinez ◽  
Ignacio Ortea

Data-independent acquisition (DIA) methods have gained great popularity in bottom-up quantitative proteomics, as they overcome the irreproducibility and under-sampling limitations of data-dependent acquisition (DDA). diaPASEF, recently developed for the timsTOF Pro mass spectrometers, has brought further improvements to DIA, providing additional ion separation in the ion mobility dimension and increasing sensitivity. Several studies have benchmarked different workflows for DIA quantitative proteomics, but mostly on instruments from Sciex and Thermo, so their results cannot be extrapolated to diaPASEF data. In this work, using a real-life sample set typical of routine proteomics experiments, we compared the results of analyzing PASEF data with different combinations of library-based and library-free analysis, combining the tools of the FragPipe suite and DIA-NN, including MS1-level LFQ with DDA-PASEF data, and also comparing with the workflows available in Spectronaut. We verified that library-free workflows, which performed poorly until recently, have greatly improved in the latest versions of the software tools and now perform as well as or even better than library-based ones. We report this information so that users planning a relative quantitative proteomics study on a timsTOF Pro mass spectrometer can make an informed decision on how to acquire their samples (diaPASEF for DIA analysis, or DDA-PASEF for MS1-level LFQ) and on what to expect from each of the data analysis alternatives offered by the recently optimized tools for TIMS-PASEF data.
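A minimal sketch of how two such workflow outputs might be compared follows, looking at protein overlap and quantification precision (CV across replicates); the in-memory tables and column names are assumptions rather than the actual report formats written by the tools.

```python
# Illustrative comparison of two workflow outputs on protein overlap and
# quantification precision (CV across replicates); toy tables only, not the
# report formats of FragPipe, DIA-NN, or Spectronaut.
import numpy as np
import pandas as pd

def cv_per_protein(quant: pd.DataFrame) -> pd.Series:
    """Coefficient of variation (%) across replicate columns, per protein."""
    return quant.std(axis=1) / quant.mean(axis=1) * 100

lib_based = pd.DataFrame({"r1": [100.0, 50.0], "r2": [110.0, 55.0]}, index=["P1", "P2"])
lib_free = pd.DataFrame({"r1": [95.0, 300.0], "r2": [105.0, 310.0]}, index=["P1", "P3"])

overlap = lib_based.index.intersection(lib_free.index)
print(f"proteins quantified by both workflows: {len(overlap)}")
print("median CV, library-based:", np.median(cv_per_protein(lib_based)))
print("median CV, library-free:", np.median(cv_per_protein(lib_free)))
```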


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Yi Yang ◽  
Guoquan Yan ◽  
Siyuan Kong ◽  
Mengxi Wu ◽  
Pengyuan Yang ◽  
...  

Abstract Large-scale profiling of intact glycopeptides is critical but challenging in glycoproteomics. Data-independent acquisition (DIA) is an emerging technology with deep proteome coverage and accurate quantitative capability in proteomics studies, but it is still at an early stage of development in the field of glycoproteomics. We propose GproDIA, a framework for the proteome-wide characterization of intact glycopeptides from DIA data with comprehensive statistical control by a 2-dimensional false discovery rate approach and a glycoform inference algorithm, enabling accurate identification of intact glycopeptides using wide isolation windows. We further utilize a semi-empirical spectrum prediction strategy to expand the coverage of spectral libraries of glycopeptides. We benchmark our method for N-glycopeptide profiling on DIA data of yeast and human serum samples, demonstrating that DIA with GproDIA outperforms data-dependent acquisition-based methods for glycoproteomics in terms of capacity and data completeness of identification, as well as accuracy and precision of quantification. We expect that this work can provide a powerful tool for glycoproteomic studies.
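As a toy illustration of controlling error at two separate levels, the sketch below applies Benjamini-Hochberg filtering to two independent sets of p-values and accepts only identifications passing both thresholds; GproDIA's actual 2-dimensional FDR procedure for glycopeptides is considerably more involved than this stand-in.

```python
# Toy illustration of requiring error control at two levels (e.g., peptide-level
# and glycoform-level), using Benjamini-Hochberg as a stand-in; not GproDIA's
# actual 2-dimensional FDR algorithm.
import numpy as np

def bh_qvalues(pvals: np.ndarray) -> np.ndarray:
    """Benjamini-Hochberg q-values for an array of p-values."""
    order = np.argsort(pvals)
    ranked = pvals[order] * len(pvals) / (np.arange(len(pvals)) + 1)
    q = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    out = np.empty_like(q)
    out[order] = np.clip(q, 0, 1)
    return out

peptide_p = np.array([0.001, 0.02, 0.3, 0.04])    # toy peptide-level p-values
glycoform_p = np.array([0.01, 0.05, 0.2, 0.001])  # toy glycoform-level p-values
keep = (bh_qvalues(peptide_p) < 0.05) & (bh_qvalues(glycoform_p) < 0.05)
print("accepted glycopeptides:", int(keep.sum()))
```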


2019 ◽  
Author(s):  
Veit Schwämmle ◽  
Christina E Hagensen ◽  
Adelina Rogowska-Wrzesinska ◽  
Ole N. Jensen

Abstract Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment-intrinsic data structures and variation, and often also with reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores for large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of possible combinations of experimental strategies, mass spectrometry techniques and informatics methods complicates the choice of appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss Test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with the integrated Miss Test achieves higher confidence and higher sensitivity on artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry-based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10%-20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve the coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss Test is available at http://computproteomics.bmb.sdu.dk/Apps/PolySTest.
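To illustrate the underlying idea of testing abundance and missingness jointly, the sketch below combines a Welch t-test on observed intensities with Fisher's exact test on missing-value counts, merging the two p-values with Fisher's method; this is a conceptual stand-in, not PolySTest's actual Miss Test algorithm.

```python
# Conceptual sketch of jointly considering abundance differences and missingness:
# a Welch t-test on observed intensities plus Fisher's exact test on missing-value
# counts, combined via Fisher's method. Illustration only; not the Miss Test
# implementation in PolySTest.
import numpy as np
from scipy import stats

def combined_test(group_a, group_b) -> float:
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    obs_a, obs_b = a[~np.isnan(a)], b[~np.isnan(b)]
    # Abundance component: Welch t-test on observed values (if enough of them).
    if min(len(obs_a), len(obs_b)) > 1:
        p_abund = stats.ttest_ind(obs_a, obs_b, equal_var=False).pvalue
    else:
        p_abund = 1.0
    # Missingness component: Fisher's exact test on missing vs. observed counts.
    table = [[int(np.isnan(a).sum()), int(len(a) - np.isnan(a).sum())],
             [int(np.isnan(b).sum()), int(len(b) - np.isnan(b).sum())]]
    p_miss = stats.fisher_exact(table)[1]
    # Combine the two components with Fisher's method.
    return stats.combine_pvalues([p_abund, p_miss], method="fisher")[1]

# Example: a feature partially missing in one condition but fully observed in the other.
print(combined_test([21.0, 20.5, np.nan, np.nan], [25.0, 24.5, 24.8, 25.2]))
```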


2019 ◽  
Author(s):  
Stefani N. Thomas ◽  
Betty Friedrich ◽  
Michael Schnaubelt ◽  
Daniel W. Chan ◽  
Hui Zhang ◽  
...  

Summary The National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) has established a two-dimensional liquid chromatography-tandem mass spectrometry (2DLC-MS/MS) workflow using isobaric tagging to compare protein abundance across samples. The workflow has been used for large-scale clinical proteomic studies with deep proteomic coverage within and outside of CPTAC. SWATH-MS, an instance of data-independent acquisition (DIA) proteomic methods, was recently developed as an alternative proteomic approach. In this study, we used SWATH-MS to analyze remaining aliquots of peptides from the original retrospective TCGA samples generated for the CPTAC ovarian cancer proteogenomic study (Zhang et al., 2016). The SWATH-MS results indicated that both methods confidently identified differentially expressed proteins in enriched pathways associated with the robust Mesenchymal subtype of high-grade serous ovarian cancer (HGSOC) and with the homologous recombination deficient tumors also present in the original study. The results demonstrated that SWATH/DIA-MS presents a promising complementary or orthogonal alternative to the CPTAC harmonized proteomic method, with the advantages of simpler, faster, and cheaper workflows, as well as lower sample consumption; however, the SWATH/DIA-MS workflow resulted in shallower proteome coverage. Overall, we concluded that both analytical methods are suitable to characterize clinical samples such as those in the high-grade serous ovarian cancer study, providing proteomic workflow alternatives for cancer researchers depending on the specific goals and context of their studies.


eLife ◽  
2020 ◽  
Vol 9 ◽  
Author(s):  
Christoph N Schlaffner ◽  
Konstantin Kahnert ◽  
Jan Muntel ◽  
Ruchi Chauhan ◽  
Bernhard Y Renard ◽  
...  

Improvements in LC-MS/MS methods and technology have enabled the identification of thousands of modified peptides in a single experiment. However, protein regulation by post-translational modifications (PTMs) is not binary, making methods to quantify the modification extent crucial to understanding the role of PTMs. Here, we introduce FLEXIQuant-LF, a software tool for large-scale identification of differentially modified peptides and quantification of their modification extent without knowledge of the types of modifications involved. We developed FLEXIQuant-LF using label-free quantification of unmodified peptides and robust linear regression to quantify the modification extent of peptides. As proof of concept, we applied FLEXIQuant-LF to data-independent acquisition (DIA) data of the anaphase-promoting complex/cyclosome (APC/C) during mitosis. The unbiased FLEXIQuant-LF approach to assessing the modification extent in quantitative proteomics data provides a better understanding of the function and regulation of PTMs. The software is available at https://github.com/SteenOmicsLab/FLEXIQuantLF.
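The core idea, robust linear regression of a sample's unmodified-peptide intensities against a reference so that peptides falling well below the fit indicate modification, can be sketched as follows with toy data; this is an assumption-laden illustration using RANSAC, not FLEXIQuant-LF's exact implementation.

```python
# Minimal sketch of the robust-regression idea behind quantifying modification
# extent: fit a robust linear model of a sample's peptide intensities against a
# reference, then score peptides by how far they fall below the fit.
# Toy data and scoring only; not FLEXIQuant-LF's exact implementation.
import numpy as np
from sklearn.linear_model import RANSACRegressor

reference = np.array([1e5, 2e5, 3e5, 4e5, 5e5, 6e5])           # reference intensities
sample = np.array([1.1e5, 1.9e5, 3.2e5, 1.0e5, 5.1e5, 6.2e5])  # peptide 4 looks reduced

model = RANSACRegressor().fit(reference.reshape(-1, 1), sample)
expected = model.predict(reference.reshape(-1, 1))
rm_score = sample / expected                  # relative (unmodified) abundance
modification_extent = np.clip(1 - rm_score, 0, 1)
print(np.round(modification_extent, 2))       # peptide 4 shows a high extent
```

The robust fit downweights the reduced peptide as an outlier, so the regression reflects the unmodified population and the shortfall of that peptide is read out as its modification extent.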


2019 ◽  
Author(s):  
Yasset Perez-Riverol ◽  
Pablo Moreno

Abstract The recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming an increasingly complex and convoluted process involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics in recent years, and this trend is likely to continue. However, most computational proteomics and metabolomics tools are targeted at and designed for single desktop applications, limiting the scalability and reproducibility of the data analysis. In this paper we give an overview of the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the data analysis. We discuss the combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis. Finally, we introduce to the proteomics and metabolomics communities a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments: Galaxy and Nextflow.
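As a minimal sketch of how a containerized analysis step could be wrapped for reproducibility, the snippet below launches a Docker container from Python; the image name, tool command, and paths are hypothetical placeholders, not actual BioContainers images or CLI options.

```python
# Illustrative sketch of wrapping a containerized analysis step from Python;
# image name, tool command, and paths are hypothetical placeholders.
import subprocess
from pathlib import Path

def run_containerized_step(image: str, command: list, workdir: Path) -> None:
    """Run one analysis step inside a Docker container, mounting the work directory."""
    full_cmd = [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",   # mount the input/output directory into the container
        image,
        *command,
    ]
    subprocess.run(full_cmd, check=True)

# Example call (placeholders only):
# run_containerized_step("example/proteomics-tool:1.0",
#                        ["analyze", "/data/input.mzML", "--out", "/data/result.tsv"],
#                        Path("/tmp/analysis"))
```

In practice, workflow engines such as Nextflow or Galaxy handle this container invocation, data staging, and scheduling automatically, which is exactly the combination the paper advocates.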


Author(s):  
Fynn M. Hansen ◽  
Maria C. Tanzer ◽  
Franziska Brüning ◽  
Isabell Bludau ◽  
Brenda A. Schulman ◽  
...  

SUMMARY Protein ubiquitination is involved in virtually all cellular processes. Enrichment strategies employing antibodies targeting ubiquitin-derived diGly remnants combined with mass spectrometry (MS) have enabled investigations of ubiquitin signaling at a large scale. However, the power of data-independent acquisition (DIA) with regard to sensitivity in single-run analysis and data completeness has not yet been explored. We developed a sensitive workflow combining diGly antibody-based enrichment and optimized Orbitrap-based DIA with comprehensive spectral libraries together containing more than 90,000 diGly peptides. This approach identified 35,000 diGly peptides in single measurements of proteasome inhibitor-treated cells, double the number achieved by data-dependent acquisition and with higher quantitative accuracy. Applied to TNF-alpha signaling, the workflow comprehensively captured known sites while adding many novel ones. A first systems-wide investigation of ubiquitination across the circadian cycle uncovered hundreds of cycling ubiquitination sites and dozens of cycling ubiquitin clusters within individual membrane protein receptors and transporters, highlighting novel connections between metabolism and circadian regulation.

