ProteomicsBrowser: MS/proteomics data visualization and investigation

Gang Peng; Rashaun Wilson; Yishuo Tang; TuKiet T Lam; Angus C Nairn; Kenneth Williams; Hongyu Zhao

doi:10.1093/bioinformatics/bty958

Bioinformatics ◽

10.1093/bioinformatics/bty958 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2313-2314 ◽

Cited By ~ 2

Author(s):

Gang Peng ◽

Rashaun Wilson ◽

Yishuo Tang ◽

TuKiet T Lam ◽

Angus C Nairn ◽

...

Keyword(s):

Mass Spectrometry ◽

Data Visualization ◽

Large Scale ◽

Supplementary Information ◽

Proteomics Data ◽

Sequence Coverage ◽

Protein Database ◽

Post Translational Modifications ◽

Parent Protein ◽

Focus Analysis

Abstract Summary Large-scale, quantitative proteomics data are being generated at ever increasing rates by high-throughput, mass spectrometry technologies. However, due to the complexity of these large datasets as well as the increasing numbers of post-translational modifications (PTMs) that are being identified, developing effective methods for proteomic visualization has been challenging. ProteomicsBrowser was designed to meet this need for comprehensive data visualization. Using peptide information files exported from mass spectrometry search engines or quantitative tools as input, the peptide sequences are aligned to an internal protein database such as UniProtKB. Each identified peptide ion including those with PTMs is then visualized along the parent protein in the Browser. A unique property of ProteomicsBrowser is the ability to combine overlapping peptides in different ways to focus analysis of sequence coverage, charge state or PTMs. ProteomicsBrowser includes other useful functions, such as a data filtering tool and basic statistical analyses to qualify quantitative data. Availability and implementation ProteomicsBrowser is implemented in Java8 and is available at https://medicine.yale.edu/keck/nida/proteomicsbrowser.aspx and https://github.com/peng-gang/ProteomicsBrowser. Supplementary information Supplementary data are available at Bioinformatics online.
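The alignment and coverage idea in this abstract can be illustrated with a small sketch. It is not ProteomicsBrowser's Java implementation: it assumes exact substring matching of unmodified peptide sequences against a single parent protein, and the names `map_peptides` and `sequence_coverage` are hypothetical.

```python
from typing import List, Tuple

# (peptide, start, end) with a 0-based start and an exclusive end
Hit = Tuple[str, int, int]


def map_peptides(protein: str, peptides: List[str]) -> List[Hit]:
    """Locate every exact occurrence of each peptide in the protein sequence."""
    hits: List[Hit] = []
    for pep in peptides:
        if not pep:
            continue  # skip empty strings, which would match everywhere
        start = protein.find(pep)
        while start != -1:
            hits.append((pep, start, start + len(pep)))
            start = protein.find(pep, start + 1)
    return hits


def sequence_coverage(protein: str, hits: List[Hit]) -> float:
    """Fraction of residues covered by at least one mapped peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for _, start, end in hits:
        for i in range(start, end):
            covered[i] = True
    return sum(covered) / len(protein)
```

Overlapping peptides are merged implicitly here because coverage is tracked per residue. Grouping by charge state or PTM, as the abstract describes, would need extra fields on each hit.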

MSpectraAI: a powerful platform for deciphering proteome profiling of multi-tumor mass spectrometry data by using deep neural networks

BMC Bioinformatics ◽

10.1186/s12859-020-03783-0 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Shisheng Wang ◽

Hongwen Zhu ◽

Hu Zhou ◽

Jingqiu Cheng ◽

Hao Yang

Keyword(s):

Mass Spectrometry ◽

Neural Networks ◽

Large Scale ◽

Deep Neural Networks ◽

Spectral Feature ◽

Mass Spectrometry Data ◽

Learning Approaches ◽

Proteomics Data ◽

Proteome Profiling ◽

Analytical Technique

Abstract Background Mass spectrometry (MS) has become a promising analytical technique to acquire proteomics information for the characterization of biological samples. Nevertheless, most studies focus on the final proteins identified through a suite of algorithms by using partial MS spectra to compare with the sequence database, while the pattern recognition and classification of raw mass-spectrometric data remain unresolved. Results We developed an open-source and comprehensive platform, named MSpectraAI, for analyzing large-scale MS data through deep neural networks (DNNs); this system involves spectral-feature swath extraction, classification, and visualization. Moreover, this platform allows users to create their own DNN model by using Keras. To evaluate this tool, we collected the publicly available proteomics datasets of six tumor types (a total of 7,997,805 mass spectra) from the ProteomeXchange consortium and classified the samples based on the spectra profiling. The results suggest that MSpectraAI can distinguish different types of samples based on the fingerprint spectrum and achieve better prediction accuracy at the MS1 level (average 0.967). Conclusion This study deciphers proteome profiling of raw mass spectrometry data and broadens the promising application of the classification and prediction of proteomics data from multi-tumor samples using deep learning methods. MSpectraAI also shows a better performance compared to the other classical machine learning approaches.

RobNorm: model-based robust normalization method for labeled quantitative mass spectrometry proteomics data

Bioinformatics ◽

10.1093/bioinformatics/btaa904 ◽

2020 ◽

Author(s):

Meng Wang ◽

Lihua Jiang ◽

Ruiqi Jian ◽

Joanne Y Chan ◽

Qing Liu ◽

...

Keyword(s):

Mass Spectrometry ◽

Protein Expression ◽

Real Data ◽

Tissue Expression ◽

Supplementary Information ◽

Systematic Bias ◽

Proteomics Data ◽

Robust Fitting ◽

Fitting Method ◽

The One

Abstract Motivation Data normalization is an important step in processing proteomics data generated in mass spectrometry experiments, which aims to reduce sample-level variation and facilitate comparisons of samples. Previously published methods for normalization primarily depend on the assumption that the distribution of protein expression is similar across all samples. However, this assumption fails when the protein expression data is generated from heterogeneous samples, such as from various tissue types. This led us to develop a novel data-driven method for improved normalization to correct the systematic bias while maintaining underlying biological heterogeneity. Results To robustly correct the systematic bias, we used the density-power-weight method to down-weigh outliers and extended the one-dimensional robust fitting method described in the previous work to our structured data. We then constructed a robustness criterion and developed a new normalization algorithm, called RobNorm. In simulation studies and analysis of real data from the genotype-tissue expression project, we compared and evaluated the performance of RobNorm against other normalization methods. We found that the RobNorm approach exhibits the greatest reduction in systematic bias while maintaining across-tissue variation, especially for datasets from highly heterogeneous samples. Availability and implementation https://github.com/mwgrassgreen/RobNorm. Supplementary information Supplementary data are available at Bioinformatics online.

PolySTest: Robust statistical testing of proteomics data with missing values improves detection of biologically relevant features

10.1101/765818 ◽

2019 ◽

Cited By ~ 1

Author(s):

Veit Schwämmle ◽

Christina E Hagensen ◽

Adelina Rogowska-Wrzesinska ◽

Ole N. Jensen

Keyword(s):

Mass Spectrometry ◽

Large Scale ◽

Missing Values ◽

Statistical Tests ◽

Ground Truth ◽

Statistical Testing ◽

Molecular Networks ◽

Proteomics Data ◽

Biologically Relevant ◽

Data Browsing

Abstract Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment intrinsic data structures and variations and often also reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores of large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of different combinations of experimental strategies, mass spectrometry techniques and informatics methods complicate the decision of choosing appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss Test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss Test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10%-20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss Test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.

FLEXIQuant-LF to quantify protein modification extent in label-free proteomics data

eLife ◽

10.7554/elife.58783 ◽

2020 ◽

Vol 9 ◽

Author(s):

Christoph N Schlaffner ◽

Konstantin Kahnert ◽

Jan Muntel ◽

Ruchi Chauhan ◽

Bernhard Y Renard ◽

...

Keyword(s):

Protein Modification ◽

Large Scale ◽

Software Tool ◽

Label Free ◽

Anaphase Promoting Complex ◽

Proteomics Data ◽

Single Experiment ◽

Post Translational Modifications ◽

Data Independent Acquisition ◽

Modified Peptides

Improvements in LC-MS/MS methods and technology have enabled the identification of thousands of modified peptides in a single experiment. However, protein regulation by post-translational modifications (PTMs) is not binary, making methods to quantify the modification extent crucial to understanding the role of PTMs. Here, we introduce FLEXIQuant-LF, a software tool for large-scale identification of differentially modified peptides and quantification of their modification extent without knowledge of the types of modifications involved. We developed FLEXIQuant-LF using label-free quantification of unmodified peptides and robust linear regression to quantify the modification extent of peptides. As proof of concept, we applied FLEXIQuant-LF to data-independent-acquisition (DIA) data of the anaphase promoting complex/cyclosome (APC/C) during mitosis. The unbiased FLEXIQuant-LF approach to assess the modification extent in quantitative proteomics data provides a better understanding of the function and regulation of PTMs. The software is available at https://github.com/SteenOmicsLab/FLEXIQuantLF.
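The role of robust regression in this abstract can be sketched as follows. This is an illustration, not the FLEXIQuant-LF code: a Theil-Sen estimator stands in for whichever robust regression the tool uses, the function names are hypothetical, and reading a low observed-to-predicted ratio as a sign of modification is an assumption made only for this example.

```python
from itertools import combinations
from statistics import median
from typing import List, Sequence, Tuple


def theil_sen(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


def intensity_ratios(reference: Sequence[float], sample: Sequence[float]) -> List[float]:
    """Observed sample intensity divided by the intensity the robust fit predicts.

    Assumes every predicted intensity is non-zero.
    """
    slope, intercept = theil_sen(reference, sample)
    return [obs / (slope * ref + intercept) for ref, obs in zip(reference, sample)]
```

With unmodified-peptide intensities of one protein in a reference and a sample, most peptides lie on the fitted line and get a ratio near 1. A peptide whose ratio falls well below 1 is a candidate for differential modification under the assumption stated above.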

PolySTest: Robust Statistical Testing of Proteomics Data with Missing Values Improves Detection of Biologically Relevant Features

Molecular & Cellular Proteomics ◽

10.1074/mcp.ra119.001777 ◽

2020 ◽

Vol 19 (8) ◽

pp. 1396-1408 ◽

Cited By ~ 2

Author(s):

Veit Schwämmle ◽

Christina E. Hagensen ◽

Adelina Rogowska-Wrzesinska ◽

Ole N. Jensen

Keyword(s):

Mass Spectrometry ◽

Large Scale ◽

Missing Values ◽

Statistical Tests ◽

Ground Truth ◽

Statistical Testing ◽

Molecular Networks ◽

Proteomics Data ◽

Biologically Relevant ◽

Data Browsing

Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment intrinsic data structures and variations and often also reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores of large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of different combinations of experimental strategies, mass spectrometry techniques and informatics methods complicate the decision of choosing appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10–20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.

CHICKN: extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis

BMC Bioinformatics ◽

10.1186/s12859-021-03969-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Olga Permiakova ◽

Romain Guibert ◽

Alexandra Kraut ◽

Thomas Fortin ◽

Anne-Marie Hesse ◽

...

Keyword(s):

Mass Spectrometry ◽

Large Scale ◽

Clustering Algorithm ◽

Optimal Transport ◽

State Of The Art ◽

Data Representation ◽

Machine Learning Algorithms ◽

Mass Spectrometry Data ◽

Proteomics Data ◽

Chromatographic Elution

Abstract Background The clustering of data produced by liquid chromatography coupled to mass spectrometry analyses (LC-MS data) has recently gained interest to extract meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data whose size, dimensionality and expected number of clusters are too large to be processed by classical machine learning algorithms, so that most of the state-of-the-art relies on single pass linkage-based algorithms. Results We propose a clustering algorithm that solves the powerful but computationally demanding kernel k-means objective function in a scalable way. As a result, it can process LC-MS data in an acceptable time on a multicore machine. To do so, we combine three essential features: a compressive data representation, Nyström approximation and a hierarchical strategy. In addition, we propose new kernels based on optimal transport, which can be interpreted as intuitive similarity measures between chromatographic elution profiles. Conclusions Our method, referred to as CHICKN, is evaluated on proteomics data produced in our lab, as well as on benchmark data coming from the literature. From a computational viewpoint, it is particularly efficient on raw LC-MS data. From a data analysis viewpoint, it provides clusters which differ from those resulting from state-of-the-art methods, while achieving similar performances. This highlights the complementarity of differently principled algorithms to extract the best from complex LC-MS data.
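To make the optimal-transport similarity between elution profiles concrete, here is a minimal sketch. It assumes two non-negative profiles sampled on the same evenly spaced retention-time grid and uses the one-dimensional Wasserstein-1 distance, which equals the area between the two cumulative distributions. The exponential kernel form and the names `wasserstein_1d` and `ot_kernel` are assumptions for illustration and may differ from the kernels defined in the CHICKN paper.

```python
import math
from itertools import accumulate
from typing import Sequence


def wasserstein_1d(profile_a: Sequence[float], profile_b: Sequence[float],
                   spacing: float = 1.0) -> float:
    """Wasserstein-1 distance between two profiles on a shared, even grid."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must be sampled on the same grid")
    total_a, total_b = sum(profile_a), sum(profile_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("profiles must have positive total intensity")
    # Normalise each profile to unit mass and accumulate into a CDF.
    cdf_a = accumulate(v / total_a for v in profile_a)
    cdf_b = accumulate(v / total_b for v in profile_b)
    return spacing * sum(abs(ca - cb) for ca, cb in zip(cdf_a, cdf_b))


def ot_kernel(profile_a: Sequence[float], profile_b: Sequence[float],
              gamma: float = 1.0) -> float:
    """Similarity in (0, 1]: identical shapes give 1, distant shapes approach 0."""
    return math.exp(-gamma * wasserstein_1d(profile_a, profile_b))
```

Two profiles with the same shape shifted by one grid step have distance equal to one grid spacing, so the kernel decays smoothly with retention-time shift rather than dropping to zero as soon as the peaks stop overlapping.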

PDV: an integrative proteomics data viewer

Bioinformatics ◽

10.1093/bioinformatics/bty770 ◽

2018 ◽

Vol 35 (7) ◽

pp. 1249-1251 ◽

Cited By ~ 24

Author(s):

Kai Li ◽

Marc Vaudel ◽

Bing Zhang ◽

Yan Ren ◽

Bo Wen

Keyword(s):

Large Scale ◽

De Novo ◽

Source Code ◽

Peptide Identification ◽

Supplementary Information ◽

Visualization Tool ◽

Command Line ◽

Proteomics Data ◽

Desktop Computers ◽

Wide Range

Abstract Summary Data visualization plays critical roles in proteomics studies, ranging from quality control of MS/MS data to validation of peptide identification results. Herein, we present PDV, an integrative proteomics data viewer that can be used to visualize a wide range of proteomics data, including database search results, de novo sequencing results, proteogenomics files, MS/MS data in mzML/mzXML format and data from public proteomics repositories. PDV is a lightweight visualization tool that enables intuitive and fast exploration of diverse, large-scale proteomics datasets on standard desktop computers in both graphical user interface and command line modes. Availability and implementation PDV software and the user manual are freely available at http://pdv.zhang-lab.org. The source code is available at https://github.com/wenbostar/PDV and is released under the GPL-3 license. Supplementary information Supplementary data are available at Bioinformatics online.

MS2AI: Automated repurposing of public peptide LC-MS data for machine learning applications

10.1101/2021.01.27.428375 ◽

2021 ◽

Author(s):

Tobias Greisager Rehfeldt ◽

Konrad Krawczyk ◽

Mathias Bøgebjerg ◽

Veit Schwämmle ◽

Richard Röttger

Keyword(s):

Machine Learning ◽

Mass Spectrometry ◽

Large Scale ◽

Peptide Identification ◽

Training Data ◽

Supplementary Information ◽

Large Sample Size ◽

Raw Data ◽

Machine Learning Applications ◽

Rich Data

Abstract Motivation Liquid-chromatography mass-spectrometry (LC-MS) is the established standard for analyzing the proteome in biological samples by identification and quantification of thousands of proteins. Machine learning (ML) promises to considerably improve the analysis of the resulting data; however, there is yet to be any tool that mediates the path from raw data to modern ML applications. More specifically, ML applications are currently hampered by three major limitations: (1) absence of balanced training data with large sample size; (2) unclear definition of sufficiently information-rich data representations for e.g. peptide identification; (3) lack of benchmarking of ML methods on specific LC-MS problems. Results We created the MS2AI pipeline that automates the process of gathering vast quantities of mass spectrometry (MS) data for large scale ML applications. The software retrieves raw data from either in-house sources or from the proteomics identifications database, PRIDE. Subsequently, the raw data is stored in a standardized format amenable for ML encompassing MS1/MS2 spectra and peptide identifications. This tool bridges the gap between MS and AI, and to this effect we also present an ML application in the form of a convolutional neural network for the identification of oxidized peptides. Availability An open source implementation of the software can be found freely available for non-commercial use at https://gitlab.com/roettgerlab/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.

ProteoCombiner: integrating bottom-up with top-down proteomics data for improved proteoform assessment

Bioinformatics ◽

10.1093/bioinformatics/btaa958 ◽

2020 ◽

Author(s):

Diogo B Lima ◽

Mathieu Dupré ◽

Magalie Duchateau ◽

Quentin Giai Gianetto ◽

Martial Rey ◽

...

Keyword(s):

Search Engines ◽

High Performance ◽

Large Scale ◽

Supplementary Information ◽

Supplementary Data ◽

Post Translational Modification ◽

Proteomics Data ◽

Top Down ◽

Proteomic Data ◽

Demonstration Video

Abstract Motivation We present high-performance software that integrates shotgun with top-down proteomic data. The tool can deal with multiple experiments and search engines. It enables rapid and easy visualization, manual validation and comparison of the identified proteoform sequences, including the post-translational modification characterization. Results We demonstrate the effectiveness of our approach on a large-scale Escherichia coli dataset; ProteoCombiner unambiguously shortlisted proteoforms among those identified by the multiple search engines. Availability and implementation ProteoCombiner, a demonstration video and user tutorial are freely available at https://proteocombiner.pasteur.fr, for academic use; all data are thus available from the ProteomeXchange consortium (identifier PXD017618). Supplementary information Supplementary data are available at Bioinformatics online.

AlphaMap: An open-source Python package for the visual annotation of proteomics data with sequence specific knowledge

10.1101/2021.07.30.454433 ◽

2021 ◽

Author(s):

Eugenia Voytik ◽

Isabell Bludau ◽

Sander Willems ◽

Fynn Hansen ◽

Andreas-David Brunner ◽

...

Keyword(s):

Research Question ◽

Automated Analysis ◽

Experimental Information ◽

Visual Exploration ◽

Supplementary Information ◽

Proteomics Data ◽

Post Translational Modifications ◽

Analysis Platform ◽

Crucial Part ◽

Python Package

Integrating experimental information across proteomic datasets with the wealth of publicly available sequence annotations is a crucial part in many proteomic studies that currently lacks an automated analysis platform. Here we present AlphaMap, a Python package that facilitates the visual exploration of peptide-level proteomics data. Identified peptides and post-translational modifications in proteomic datasets are mapped to their corresponding protein sequence and visualized together with prior knowledge from UniProt and with expected proteolytic cleavage sites. The functionality of AlphaMap can be accessed via an intuitive graphical user interface or - more flexibly - as a Python package that allows its integration into common analysis workflows for data visualization. AlphaMap produces publication-quality illustrations and can easily be customized to address a given research question. Availability and implementation: AlphaMap is implemented in Python and released under an Apache license. The source code and one-click installers are freely available at https://github.com/MannLabs/alphamap. Supplementary information: A detailed user guide for AlphaMap is provided as supplementary data.
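The "expected proteolytic cleavage sites" mentioned in this abstract can be illustrated for one common protease. The sketch below does not use the AlphaMap API. It applies the widely used trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages, and the function names are hypothetical.

```python
from typing import List


def tryptic_cleavage_sites(protein: str) -> List[int]:
    """0-based positions at which the sequence is cut (cut falls before index)."""
    sites = []
    for i in range(len(protein) - 1):
        # Trypsin rule: cleave C-terminal to K or R, but not before P.
        if protein[i] in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    return sites


def digest(protein: str) -> List[str]:
    """Split the protein into fully cleaved tryptic peptides."""
    if not protein:
        return []
    bounds = [0] + tryptic_cleavage_sites(protein) + [len(protein)]
    return [protein[start:end] for start, end in zip(bounds, bounds[1:])]
```

Plotting these positions along the protein next to the identified peptides shows whether a peptide boundary coincides with an expected cleavage site, which is the kind of overlay the abstract describes.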
