SW-Tandem: a highly efficient tool for large-scale peptide identification with parallel spectrum dot product on Sunway TaihuLight

2019 ◽  
Vol 35 (19) ◽  
pp. 3861-3863 ◽  
Author(s):  
Chuang Li ◽  
Kenli Li ◽  
Tao Chen ◽  
Yunping Zhu ◽  
Qiang He

Abstract
Summary: Tandem mass spectrometry-based database searching is a widely acknowledged and adopted method for identifying peptide sequences in shotgun proteomics. However, database searching is extremely computationally expensive and can take days or even weeks to process a large spectra dataset. To address this critical issue, this paper presents SW-Tandem, a new tool for large-scale peptide sequencing. SW-Tandem parallelizes the spectrum dot product scoring algorithm and leverages the advantages of Sunway TaihuLight, the No. 1 supercomputer in the world in 2017. Sunway TaihuLight is powered by the brand-new many-core SW26010 processors and provides a peak computational performance greater than 100 PFlops. To fully utilize Sunway TaihuLight's capacity, SW-Tandem employs three mechanisms to accelerate large-scale peptide identification: memory-access optimizations, double buffering and vectorization. The results of experiments conducted on multiple datasets demonstrate the performance of SW-Tandem against three state-of-the-art tools for peptide identification: X!!Tandem, MR-Tandem and MSFragger. In addition, SW-Tandem shows high scalability in experiments on extremely large datasets sized up to 12 GB.
Availability and implementation: SW-Tandem is an open-source software tool implemented in C++. The source code and the parameter settings are available at https://github.com/Logic09/SW-Tandem.
Supplementary information: Supplementary data are available at Bioinformatics online.
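The core kernel SW-Tandem parallelizes is spectrum dot product (SDP) scoring between experimental and theoretical spectra. As a rough illustration of what that score computes (not SW-Tandem's actual C++ kernel; the bin width, normalization and peak handling below are assumptions), a minimal NumPy sketch is:

```python
# Minimal, illustrative sketch of spectrum dot product (SDP) scoring:
# bin two spectra onto a shared m/z grid and take their dot product.
# Bin width, normalization and peak weighting are assumptions, not
# SW-Tandem's actual implementation.
import numpy as np

def bin_spectrum(mz, intensity, bin_width=0.5, max_mz=2000.0):
    """Accumulate peak intensities into fixed-width m/z bins and normalize."""
    bins = np.zeros(int(max_mz / bin_width) + 1)
    idx = np.minimum((np.asarray(mz) / bin_width).astype(int), len(bins) - 1)
    np.add.at(bins, idx, intensity)
    norm = np.linalg.norm(bins)
    return bins / norm if norm > 0 else bins

def sdp_score(exp_mz, exp_int, theo_mz, theo_int):
    """Normalized dot product between experimental and theoretical spectra."""
    return float(np.dot(bin_spectrum(exp_mz, exp_int),
                        bin_spectrum(theo_mz, theo_int)))

# Toy example: one experimental spectrum scored against one candidate peptide.
experimental = ([175.1, 304.2, 433.2], [120.0, 80.0, 45.0])
theoretical  = ([175.1, 304.2, 500.0], [1.0, 1.0, 1.0])
print(sdp_score(*experimental, *theoretical))
```

In SW-Tandem this inner product is what gets vectorized and scheduled across the many-core processing elements, with double buffering and memory-access optimizations hiding the cost of streaming spectra through the processor.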

2020 ◽  
Vol 36 (9) ◽  
pp. 2862-2871
Author(s):  
Chiung-Ting Wu ◽  
Yizhi Wang ◽  
Yinxue Wang ◽  
Timothy Ebbels ◽  
Ibrahim Karaman ◽  
...  

Abstract
Motivation: Liquid chromatography–mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention times (RT) of the same compound in different samples, and these must be subsequently corrected (aligned) during data processing. Classic alignment methods such as in the popular XCMS package often assume a single time-warping function for each sample. Thus, the potentially varying RT drift for compounds with different masses in a sample is neglected in these methods. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot effectively correct all misalignments. For a large-scale experiment involving many samples, the existence of misalignment becomes inevitable and concerning.
Results: Here, we describe an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on warping functions of neighboring samples. Validated with both realistic synthetic data and internal quality control samples, ncGTW applied to two large-scale metabolomics LC-MS datasets identifies many misaligned features and successfully realigns them. These features would otherwise be discarded or left uncorrected by existing methods. The ncGTW software tool is currently developed as a plug-in to detect and realign misaligned features present in standard XCMS output.
Availability and implementation: An R package of ncGTW is freely available at Bioconductor and https://github.com/ChiungTingWu/ncGTW. A detailed user's manual and a vignette are provided within the package.
Supplementary information: Supplementary data are available at Bioinformatics online.
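ncGTW's central object is a compound-specific warping function between samples. The package itself is in R and couples neighboring samples through a graph-based (GTW) formulation; purely as a self-contained illustration of what a single per-compound warping function looks like, the sketch below runs classic dynamic time warping on two toy extracted-ion chromatograms. It is not ncGTW's algorithm, which additionally imposes neighbor-wise constraint edges and solves the warping jointly across many samples.

```python
# Minimal dynamic time warping (DTW) sketch for aligning one compound's
# extracted-ion chromatogram between two samples. Illustrative only.
import numpy as np

def dtw_path(x, y):
    """Return the accumulated cost matrix and an optimal warping path."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Trace back from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[1:, 1:], path[::-1]

# Two toy chromatographic peaks with a small retention-time shift.
t = np.arange(30)
sample_a = np.exp(-(t - 12) ** 2 / 8.0)
sample_b = np.exp(-(t - 15) ** 2 / 8.0)
_, path = dtw_path(sample_a, sample_b)
print(path[:5])
```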


2020 ◽  
Vol 36 (19) ◽  
pp. 4960-4962
Author(s):  
Lauren Marazzi ◽  
Andrew Gainer-Dewar ◽  
Paola Vera-Licona

Abstract
Summary: OCSANA+ is a Cytoscape app for identifying nodes to drive a system toward a desired long-term behavior, prioritizing combinations of interventions in large-scale complex networks, and estimating the effects of node perturbations in signaling networks, all based on the analysis of the network's structure. OCSANA+ includes an updated version of the OCSANA (optimal combinations of interventions from network analysis) software tool with cutting-edge and rigorously tested algorithms, together with recently developed structure-based control algorithms for non-linear systems and an algorithm for estimating signal flow. All these algorithms are based on the network's topology. OCSANA+ is implemented as a Cytoscape app to provide a user interface for running analyses and visualizing results.
Availability and implementation: The OCSANA+ app and its tutorial can be downloaded from the Cytoscape App Store or https://veraliconaresearchgroup.github.io/OCSANA-Plus/. The source code and computations are available at https://github.com/VeraLiconaResearchGroup/OCSANA-Plus_SourceCode.
Supplementary information: Supplementary data are available at Bioinformatics online.
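All of OCSANA+'s analyses are structure-based, i.e. driven by the wiring of the directed network rather than by kinetic parameters. As a loose, hypothetical illustration of that idea (not the app's actual scoring functions or control algorithms), the sketch below ranks intermediate nodes by how often they sit on simple paths from source nodes to target nodes:

```python
# Illustrative sketch of structure-based intervention-node ranking: count how
# often each node appears on simple paths from sources to targets in a directed
# network. This mimics the flavor of path-based prioritization but is NOT
# OCSANA+'s scoring function.
import networkx as nx
from collections import Counter

def path_participation(G, sources, targets, cutoff=6):
    """Score intermediate nodes by participation in source-to-target paths."""
    score = Counter()
    for s in sources:
        for t in targets:
            for path in nx.all_simple_paths(G, s, t, cutoff=cutoff):
                for node in path[1:-1]:          # exclude the endpoints
                    score[node] += 1
    return score.most_common()

# Toy signaling network: two receptors converging on one phenotype node.
G = nx.DiGraph([("R1", "A"), ("R2", "A"), ("A", "B"), ("B", "Phenotype"),
                ("R2", "C"), ("C", "Phenotype")])
print(path_participation(G, sources=["R1", "R2"], targets=["Phenotype"]))
```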


2018 ◽  
Vol 35 (7) ◽  
pp. 1249-1251 ◽  
Author(s):  
Kai Li ◽  
Marc Vaudel ◽  
Bing Zhang ◽  
Yan Ren ◽  
Bo Wen

Abstract
Summary: Data visualization plays critical roles in proteomics studies, ranging from quality control of MS/MS data to validation of peptide identification results. Herein, we present PDV, an integrative proteomics data viewer that can be used to visualize a wide range of proteomics data, including database search results, de novo sequencing results, proteogenomics files, MS/MS data in mzML/mzXML format and data from public proteomics repositories. PDV is a lightweight visualization tool that enables intuitive and fast exploration of diverse, large-scale proteomics datasets on standard desktop computers in both graphical user interface and command line modes.
Availability and implementation: PDV software and the user manual are freely available at http://pdv.zhang-lab.org. The source code is available at https://github.com/wenbostar/PDV and is released under the GPL-3 license.
Supplementary information: Supplementary data are available at Bioinformatics online.
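A core task behind any MS/MS viewer such as PDV is matching observed peaks to theoretical fragment ions so that they can be annotated in the spectrum plot. The short Python sketch below is not PDV's code; it only illustrates that annotation logic, with a deliberately truncated residue-mass table and no modifications considered.

```python
# Hedged sketch of fragment-ion annotation: compute singly charged b/y ions for
# an unmodified peptide and match them to observed peaks within a tolerance.
PROTON, WATER = 1.007276, 18.010565
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
        "E": 129.04259, "R": 156.10111}

def by_ions(peptide):
    """Singly charged b and y fragment m/z values for an unmodified peptide."""
    masses = [MONO[aa] for aa in peptide]
    ions = {}
    for i in range(1, len(peptide)):
        ions[f"b{i}"] = sum(masses[:i]) + PROTON
        ions[f"y{len(peptide) - i}"] = sum(masses[i:]) + WATER + PROTON
    return ions

def annotate(peaks, peptide, tol=0.02):
    """Label observed peaks that match a theoretical fragment within tol Da."""
    theo = by_ions(peptide)
    return [(mz, name) for mz in peaks
            for name, t_mz in theo.items() if abs(mz - t_mz) <= tol]

# Toy spectrum for the hypothetical peptide PEPTLK: y1, y2 and b3 are matched.
print(annotate([147.113, 260.197, 324.155], "PEPTLK"))
```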


2021 ◽  
Author(s):  
Tobias Greisager Rehfeldt ◽  
Konrad Krawczyk ◽  
Mathias Bøgebjerg ◽  
Veit Schwämmle ◽  
Richard Röttger

Abstract
Motivation: Liquid chromatography–mass spectrometry (LC-MS) is the established standard for analyzing the proteome in biological samples by identification and quantification of thousands of proteins. Machine learning (ML) promises to considerably improve the analysis of the resulting data; however, there is as yet no tool that mediates the path from raw data to modern ML applications. More specifically, ML applications are currently hampered by three major limitations: (1) the absence of balanced training data with large sample sizes; (2) an unclear definition of sufficiently information-rich data representations, e.g. for peptide identification; (3) a lack of benchmarking of ML methods on specific LC-MS problems.
Results: We created the MS2AI pipeline, which automates the process of gathering vast quantities of mass spectrometry (MS) data for large-scale ML applications. The software retrieves raw data from either in-house sources or from the proteomics identifications database PRIDE. Subsequently, the raw data are stored in a standardized format amenable to ML, encompassing MS1/MS2 spectra and peptide identifications. This tool bridges the gap between MS and AI, and to this effect we also present an ML application in the form of a convolutional neural network for the identification of oxidized peptides.
Availability: An open-source implementation of the software is freely available for non-commercial use at https://gitlab.com/roettgerlab/. Contact: [email protected].
Supplementary information: Supplementary data are available at Bioinformatics online.
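MS2AI's value is that it hands ML code a standardized representation of MS1/MS2 spectra rather than vendor raw files. Purely as an illustration of one common way to make a peak list CNN-ready (the bin count, grid and normalization below are assumptions, not MS2AI's stored format), a spectrum can be rasterized into a fixed-length intensity vector:

```python
# Hedged sketch of turning a raw MS2 peak list into a fixed-size array that a
# convolutional network can consume. Illustrative only.
import numpy as np

def spectrum_to_vector(mz, intensity, n_bins=2000, max_mz=2000.0):
    """Bin a peak list into a normalized, fixed-length intensity vector."""
    vec = np.zeros(n_bins, dtype=np.float32)
    idx = np.clip((np.asarray(mz) / max_mz * n_bins).astype(int), 0, n_bins - 1)
    np.maximum.at(vec, idx, intensity)          # keep the tallest peak per bin
    return vec / vec.max() if vec.max() > 0 else vec

# A toy spectrum becomes one fixed-length input row for a 1D CNN.
x = spectrum_to_vector([175.1, 304.2, 433.2, 879.9], [120.0, 80.0, 45.0, 200.0])
print(x.shape, x.max())
```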


2019 ◽  
Author(s):  
Chiung-Ting Wu ◽  
David M. Herrington ◽  
Yizhi Wang ◽  
Timothy Ebbels ◽  
Ibrahim Karaman ◽  
...  

Abstract
Motivation: Liquid chromatography–mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention times (RT) of the same compound in different samples, and these must be subsequently corrected (aligned) during data processing. Classic alignment methods such as in the popular XCMS package often assume a single time-warping function for each sample. Thus, the potentially varying RT drift for compounds with different masses in a sample is neglected in these methods. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot completely correct misalignments. For a large-scale experiment involving many samples, the existence of misalignment becomes inevitable and concerning.
Results: Here we describe an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on warping functions of neighboring samples. Validated with both realistic synthetic data and internal quality control samples, ncGTW applied to two large-scale metabolomics LC-MS datasets identifies many misaligned features and successfully realigns them. These features would otherwise be discarded or left uncorrected by existing methods. The ncGTW software tool is currently developed as a plug-in to the XCMS package.
Availability and Implementation: An R package of ncGTW is freely available at https://github.com/ChiungTingWu/ncGTW. A detailed user's manual and a vignette are provided within the package. Contact: [email protected], [email protected].
Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Florian Plaza Oñate ◽  
Emmanuelle Le Chatelier ◽  
Mathieu Almeida ◽  
Alessandra C. L. Cervino ◽  
Franck Gauthier ◽  
...  

Abstract
Motivation: Analysis toolkits for shotgun metagenomic data achieve strain-level characterization of complex microbial communities by capturing intra-species gene content variation. Yet, these tools are hampered by the limited extent of reference genomes, which are far from covering all microbial variability, as many species are still not sequenced or have only a few strains available. Binning co-abundant genes obtained from de novo assembly is a powerful reference-free technique for discovering and reconstituting the gene repertoires of microbial species. While current methods accurately identify species core parts, they miss many accessory genes or split them into small gene groups that remain unassociated with core clusters.
Results: We introduce MSPminer, a computationally efficient software tool that reconstitutes Metagenomic Species Pan-genomes (MSPs) by binning co-abundant genes across metagenomic samples. MSPminer relies on a new robust measure of proportionality coupled with an empirical classifier to group and distinguish not only species core genes but also accessory genes. Applied to a large-scale metagenomic dataset, MSPminer successfully delineates in a few hours the gene repertoires of 1,661 microbial species with similar specificity and higher sensitivity than existing tools. The taxonomic annotation of MSPs reveals hitherto unknown microorganisms and brings coherence to the nomenclature of the species of the human gut microbiota. The provided MSPs can be readily used for taxonomic profiling and biomarker discovery in human gut metagenomic samples. In addition, MSPminer can be applied to gene count tables from other ecosystems to perform similar analyses.
Availability: The binary is freely available for non-commercial users at enterome.fr/site/downloads/. Contact: [email protected].
Supplementary information: Available in the file named Supplementary Information.pdf.
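MSPminer's key ingredient is a robust measure of proportionality between gene abundance profiles across samples, plus an empirical classifier that separates core from accessory genes. As a loose stand-in for that idea (plain Pearson correlation and a fixed threshold are illustrative assumptions, not MSPminer's measure or classifier), co-abundance binning against a seed gene can be sketched as:

```python
# Illustrative sketch of binning co-abundant genes across metagenomic samples:
# score each gene's abundance profile against a seed gene and keep the ones
# above a threshold. Pearson correlation is only a stand-in here; MSPminer
# uses its own robust proportionality measure and an empirical classifier.
import numpy as np

def coabundance_bin(counts, seed_gene, threshold=0.8):
    """counts maps gene -> per-sample abundance vector; returns genes whose
    abundance profile tracks the seed gene across samples."""
    seed = counts[seed_gene]
    members = []
    for gene, profile in counts.items():
        r = np.corrcoef(seed, profile)[0, 1]
        if r >= threshold:
            members.append((gene, round(float(r), 3)))
    return sorted(members, key=lambda g: -g[1])

counts = {
    "core_1": np.array([10, 50, 0, 200, 5], dtype=float),
    "core_2": np.array([12, 60, 0, 220, 6], dtype=float),      # same species, core
    "accessory_1": np.array([0, 55, 0, 210, 0], dtype=float),  # subset of strains
    "unrelated": np.array([100, 2, 80, 1, 90], dtype=float),   # different species
}
print(coabundance_bin(counts, "core_1"))
```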


2020 ◽  
Vol 36 (9) ◽  
pp. 2690-2696
Author(s):  
Jarkko Toivonen ◽  
Pratyush K Das ◽  
Jussi Taipale ◽  
Esko Ukkonen

Abstract
Motivation: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominant model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher-order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that can estimate such models simultaneously for both monomers and their dimers have been missing.
Results: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm, MODER2, for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparison with earlier PPM and ADM techniques. The ADM models explain the data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), with the ADM mixture models learned by MODER2 being the best on average.
Availability and implementation: The software implementation is available from https://github.com/jttoivon/moder2.
Supplementary information: Supplementary data are available at Bioinformatics online.
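The practical difference between a PPM and an ADM is that the ADM conditions each base on its left neighbor. The toy parameters in the sketch below are invented purely for illustration; MODER2 learns such per-position transition matrices (and their dimeric combinations) from SELEX-type data with EM, which this snippet does not attempt.

```python
# Minimal sketch contrasting a PPM with an adjacent dinucleotide matrix (ADM,
# i.e. a first-order Markov model) when scoring one binding-site candidate.
import numpy as np

BASES = "ACGT"

def ppm_logscore(site, ppm):
    """Sum of log probabilities, each position scored independently."""
    return sum(np.log(ppm[i][BASES.index(b)]) for i, b in enumerate(site))

def adm_logscore(site, initial, transitions):
    """First position from `initial`; each later base conditioned on the
    previous one via a per-position 4x4 transition matrix."""
    score = np.log(initial[BASES.index(site[0])])
    for i in range(1, len(site)):
        prev, cur = BASES.index(site[i - 1]), BASES.index(site[i])
        score += np.log(transitions[i - 1][prev][cur])
    return score

L = 4
ppm = np.full((L, 4), 0.25)                       # uninformative toy PPM
initial = np.array([0.4, 0.1, 0.4, 0.1])
transitions = np.full((L - 1, 4, 4), 0.25)
transitions[1, BASES.index("C"), BASES.index("G")] = 0.7   # favor the CG step
transitions[1] /= transitions[1].sum(axis=1, keepdims=True)
print(ppm_logscore("ACGT", ppm), adm_logscore("ACGT", initial, transitions))
```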


2019 ◽  
Vol 35 (14) ◽  
pp. i417-i426 ◽  
Author(s):  
Erin K Molloy ◽  
Tandy Warnow

Abstract
Motivation: At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n⁵) running time for datasets with n species.
Results: Here we present a new method called 'TreeMerge' that improves on NJMerge in two ways: it is guaranteed to return a tree, and it has dramatically faster running time within the same divide-and-conquer framework, only O(n²) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets that they would otherwise fail on, when given 64 GB of memory and a 48-h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All.
Availability and implementation: TreeMerge is publicly available on GitHub (http://github.com/ekmolloy/treemerge).
Supplementary information: Supplementary data are available at Bioinformatics online.
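TreeMerge, like NJMerge before it, merges disjoint subset trees with the help of pairwise distances computed on the full taxon set. Purely as background on that distance-based style of tree building (this is textbook neighbor joining, not TreeMerge's merging procedure), a bare-bones NJ on a toy matrix looks like:

```python
# Background sketch: bare-bones neighbor joining (NJ) on a toy distance matrix,
# illustrating classic distance-based agglomeration. Not TreeMerge's algorithm.
import numpy as np

def neighbor_joining(D, names):
    """Return a nested-tuple tree built by standard NJ from distance matrix D."""
    D = D.astype(float).copy()
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        row_sums = D.sum(axis=1)
        # Q-matrix; diagonal masked so a node is never joined with itself.
        Q = (n - 2) * D - row_sums[:, None] - row_sums[None, :]
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)
        # Distances from the new internal node to every remaining node.
        new_d = 0.5 * (D[i] + D[j] - D[i, j])
        keep = [k for k in range(n) if k not in (i, j)]
        D = np.vstack([np.column_stack([D[np.ix_(keep, keep)], new_d[keep]]),
                       np.append(new_d[keep], 0.0)])
        nodes = [nodes[k] for k in keep] + [(nodes[i], nodes[j])]
    return (nodes[0], nodes[1])

D = np.array([[0, 5, 9, 9],
              [5, 0, 10, 10],
              [9, 10, 0, 8],
              [9, 10, 8, 0]], dtype=float)
print(neighbor_joining(D, ["A", "B", "C", "D"]))   # (('A', 'B'), ('C', 'D'))
```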


2009 ◽  
Vol 25 (5) ◽  
pp. 662-663 ◽  
Author(s):  
Olivier Martin ◽  
Armand Valsesia ◽  
Amalio Telenti ◽  
Ioannis Xenarios ◽  
Brian J. Stevenson
