ProteoCombiner: integrating bottom-up with top-down proteomics data for improved proteoform assessment

Bioinformatics ◽

10.1093/bioinformatics/btaa958 ◽

2020 ◽

Author(s):

Diogo B Lima ◽

Mathieu Dupré ◽

Magalie Duchateau ◽

Quentin Giai Gianetto ◽

Martial Rey ◽

...

Keyword(s):

Search Engines ◽

High Performance ◽

Large Scale ◽

Supplementary Information ◽

Supplementary Data ◽

Post Translational Modification ◽

Proteomics Data ◽

Top Down ◽

Proteomic Data ◽

Demonstration Video

Abstract. Motivation: We present high-performance software that integrates shotgun (bottom-up) with top-down proteomic data. The tool can handle multiple experiments and search engines, and it enables rapid and easy visualization, manual validation and comparison of the identified proteoform sequences, including post-translational modification characterization. Results: We demonstrate the effectiveness of our approach on a large-scale Escherichia coli dataset; ProteoCombiner unambiguously shortlisted proteoforms among those identified by the multiple search engines. Availability and implementation: ProteoCombiner, a demonstration video and a user tutorial are freely available at https://proteocombiner.pasteur.fr for academic use; all data are available from the ProteomeXchange consortium (identifier PXD017618). Supplementary information: Supplementary data are available at Bioinformatics online.

Large-scale analysis of post-translational modifications in E. coli under glucose-limiting conditions

10.1101/051185 ◽

2016 ◽

Author(s):

Colin W Brown ◽

Viswanadham Sridhara ◽

Daniel R Boutz ◽

Maria D Person ◽

Edward M Marcotte ◽

...

Keyword(s):

Large Scale ◽

Biological Function ◽

Growth Conditions ◽

Growth Cycle ◽

Post Translational Modification ◽

Proteomics Data ◽

Proteomic Data ◽

Large Scale Analysis ◽

Wide Range ◽

Domains Of Life

Abstract. Background: Post-translational modification (PTM) of proteins is central to many cellular processes across all domains of life, but despite decades of study and a wealth of genomic and proteomic data, the biological function of many PTMs remains unknown. This is especially true for prokaryotic PTM systems, many of which have only recently been recognized and studied in depth. It is increasingly apparent that a deep sampling of abundance across a wide range of environmental stresses, growth conditions, and PTM types, rather than simply cataloging targets for a handful of modifications, is critical to understanding the complex pathways that govern PTM deposition and downstream effects. Results: We utilized a deeply-sampled dataset of MS/MS proteomic analysis covering 9 timepoints spanning the Escherichia coli growth cycle and an unbiased PTM search strategy to construct a temporal map of abundance for all PTMs within a 400 Da window of mass shifts. Using this map, we are able to identify novel targets and temporal patterns for N-terminal Nα acetylation, C-terminal glutamylation, and asparagine deamidation. Furthermore, we identify a possible relationship between N-terminal Nα acetylation and regulation of protein degradation in stationary phase, pointing to a previously unrecognized biological function for this poorly-understood PTM. Conclusions: Unbiased detection of PTM in MS/MS proteomics data facilitates the discovery of novel modification types and previously unobserved dynamic changes in modification across growth timepoints.

EARRINGS: an efficient and accurate adapter trimmer entails no a priori adapter sequences

Bioinformatics ◽

10.1093/bioinformatics/btab025 ◽

2021 ◽

Author(s):

Ting-Hsuan Wang ◽

Cheng-Ching Huang ◽

Jui-Hung Hung

Keyword(s):

Open Source Software ◽

Large Scale ◽

A Priori ◽

Supplementary Information ◽

Supplementary Data ◽

Comparable Accuracy ◽

Meta Analyses ◽

Next Generation Sequencing Ngs ◽

Adapter Trimming ◽

Generation Sequencing

Abstract. Motivation: Cross-sample comparisons or large-scale meta-analyses based on next generation sequencing (NGS) involve replicable and universal data preprocessing, including removing adapter fragments in contaminated reads (i.e. adapter trimming). Modern adapter trimmers require users to provide candidate adapter sequences for each sample, but these are sometimes unavailable or falsely documented in the repositories (such as GEO or SRA); large-scale meta-analyses are therefore jeopardized by suboptimal adapter trimming. Results: Here we introduce a set of fast and accurate adapter detection and trimming algorithms that entail no a priori adapter sequences. These algorithms were implemented in modern C++ with SIMD and multithreading for speed. Our experiments and benchmarks show that the implementation (i.e. EARRINGS), without being given any hint of adapter sequences, can reach accuracy comparable to, and throughput higher than, that of existing adapter trimmers. EARRINGS is particularly useful in meta-analyses of a large batch of datasets and can be incorporated into any sequence analysis pipeline at any scale. Availability and implementation: EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS. Supplementary information: Supplementary data are available at Bioinformatics online.

GWASpro: a high-performance genome-wide association analysis server

Bioinformatics ◽

10.1093/bioinformatics/bty989 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2512-2514 ◽

Cited By ~ 4

Author(s):

Bongsong Kim ◽

Xinbin Dai ◽

Wenchao Zhang ◽

Zhaohong Zhuang ◽

Darlene L Sanchez ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Linear Mixed Model ◽

Association Studies ◽

Learning Curves ◽

Experimental Designs ◽

Genome Wide Association ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Genome Wide

Abstract. Summary: We present GWASpro, a high-performance web server for the analysis of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data coupled with complex replicated experimental designs, such as those found in plant science investigations, and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. Availability and implementation: GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. Supplementary information: Supplementary data are available at Bioinformatics online.

OpenBioLink: a benchmarking framework for large-scale biomedical link prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa274 ◽

2020 ◽

Vol 36 (13) ◽

pp. 4097-4098 ◽

Cited By ~ 3

Author(s):

Anna Breit ◽

Simon Ott ◽

Asan Agibetov ◽

Matthias Samwald

Keyword(s):

Link Prediction ◽

Large Scale ◽

Source Code ◽

Machine Learning Algorithms ◽

Knowledge Networks ◽

Supplementary Information ◽

Supplementary Data ◽

Biomedical Knowledge ◽

High Quality ◽

Baseline Evaluation

Abstract. Summary: Recently, novel machine-learning algorithms have shown potential for predicting undiscovered links in biomedical knowledge networks. However, dedicated benchmarks for measuring algorithmic progress have not yet emerged. With OpenBioLink, we introduce a large-scale, high-quality and highly challenging biomedical link prediction benchmark to transparently and reproducibly evaluate such algorithms. Furthermore, we present preliminary baseline evaluation results. Availability and implementation: Source code and data are openly available at https://github.com/OpenBioLink/OpenBioLink. Supplementary information: Supplementary data are available at Bioinformatics online.

BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text

Bioinformatics ◽

10.1093/bioinformatics/btaa837 ◽

2020 ◽

Author(s):

Ronghui You ◽

Yuxuan Liu ◽

Hiroshi Mamitsuka ◽

Shanfeng Zhu

Keyword(s):

Full Text ◽

High Performance ◽

Large Scale ◽

Learning Strategy ◽

Learning To Rank ◽

Representation Learning ◽

Supplementary Information ◽

Medical Subject Headings ◽

The Difference ◽

Contextual Representation

Abstract. Motivation: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture only some pre-defined sections in full text and (iii) ignores the whole MEDLINE database. Results: We propose a computationally lighter, full-text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which lets BERTMeSH capture deep semantics of full text, and (ii) a transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only, with no full text) in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH, with the difference being statistically significant. Also, prediction of 20 K test articles needed 5 min by BERTMeSH, while it took more than 10 h by FullMeSH, proving the computational efficiency of BERTMeSH. Supplementary information: Supplementary data are available at Bioinformatics online.
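The Micro F-measure quoted in the abstract above is, in its standard definition, computed by pooling true positives, false positives and false negatives over all test articles before taking precision and recall. The following is a minimal sketch of that standard definition only; the function name `micro_f_measure` and the toy label sets are hypothetical and are not taken from the BERTMeSH code or its evaluation data.

```python
from typing import Iterable, Set, Tuple


def micro_f_measure(
    gold: Iterable[Set[str]], predicted: Iterable[Set[str]]
) -> Tuple[float, float, float]:
    """Return (micro-precision, micro-recall, micro-F) over paired label sets."""
    tp = fp = fn = 0
    # Pool the counts across all articles instead of averaging per-article scores.
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - gold_labels)   # predicted but not in the gold set
        fn += len(gold_labels - pred_labels)   # in the gold set but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy example with made-up labels for two articles (illustrative only).
gold = [{"Humans", "Proteomics"}, {"Escherichia coli"}]
predicted = [{"Humans", "Algorithms"}, {"Escherichia coli"}]
print(micro_f_measure(gold, predicted))
```

Because the counts are pooled, articles with many headings contribute more to the score than articles with few, which is the usual reason for reporting the micro-averaged variant on multi-label indexing tasks.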

Top-Down Garbage Collector: a tool for selecting high-quality top-down proteomics mass spectra

Bioinformatics ◽

10.1093/bioinformatics/btz085 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3489-3490 ◽

Cited By ~ 1

Author(s):

Diogo B Lima ◽

André R F Silva ◽

Mathieu Dupré ◽

Marlon D M Santos ◽

Milan A Clasen ◽

...

Keyword(s):

Quality Control ◽

Mass Spectra ◽

Rate Increase ◽

Supplementary Information ◽

Supplementary Data ◽

Top Down ◽

High Quality ◽

Garbage Collector ◽

E Coli ◽

Spectral Libraries

Abstract. Motivation: We present the first tool for unbiased quality control of top-down proteomics datasets. Our tool can select high-quality top-down proteomics spectra, serve as a gateway for building top-down spectral libraries and, ultimately, improve identification rates. Results: We demonstrate that a twofold rate increase for two E. coli top-down proteomics datasets may be achievable. Availability and implementation: http://patternlabforproteomics.org/tdgc, freely available for academic use. Supplementary information: Supplementary data are available at Bioinformatics online.

A parallel computational framework for ultra-large-scale sequence clustering analysis

Bioinformatics ◽

10.1093/bioinformatics/bty617 ◽

2018 ◽

Vol 35 (3) ◽

pp. 380-388 ◽

Cited By ~ 2

Author(s):

Wei Zheng ◽

Qi Mao ◽

Robert J Genco ◽

Jean Wactawski-Wende ◽

Michael Buck ◽

...

Keyword(s):

Parallel Computing ◽

High Performance ◽

Large Scale ◽

De Novo ◽

Rapid Development ◽

Operational Taxonomic Unit ◽

Supplementary Information ◽

Computational Framework ◽

Speed Up ◽

Scale Sequence

Abstract. Motivation: The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step to be performed in sequence analysis. However, existing methods scale poorly with respect to the unprecedented growth of input data size. As high-performance computing systems are becoming widely accessible, it is highly desirable that a clustering method can easily scale to handle large-scale sequence datasets by leveraging the power of parallel computing. Results: In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and comes with theoretical guarantees on both accuracy and efficiency. The proposed framework was implemented on Apache Spark, which allows for easy and efficient utilization of parallel computing resources. Experiments performed on various datasets demonstrated that SLAD can significantly speed up a number of popular de novo OTU picking methods while maintaining the same level of accuracy. In particular, the experiment on the Earth Microbiome Project dataset (∼2.2B reads, 437 GB) demonstrated the excellent scalability of the proposed method. Availability and implementation: Open-source software for the proposed method is freely available at https://www.acsu.buffalo.edu/~yijunsun/lab/SLAD.html. Supplementary information: Supplementary data are available at Bioinformatics online.

PyMethylProcess—convenient high-throughput preprocessing workflow for DNA methylation data

Bioinformatics ◽

10.1093/bioinformatics/btz594 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5379-5381 ◽

Cited By ~ 2

Author(s):

Joshua J Levy ◽

Alexander J Titus ◽

Lucas A Salas ◽

Brock C Christensen

Keyword(s):

Large Scale ◽

Supplementary Information ◽

Scale Production ◽

Methylation Data ◽

Supplementary Data ◽

Data Preparation ◽

Methylation Array ◽

Project Home Page ◽

Large Scale Production ◽

Set Up

Abstract. Summary: Performing highly parallelized preprocessing of methylation array data using Python can accelerate data preparation for downstream methylation analyses, including large-scale, production-ready machine learning pipelines. We present a highly reproducible, scalable pipeline (PyMethylProcess) that can be quickly set up and deployed through Docker and pip. Availability and implementation: Project Home Page: https://github.com/Christensen-Lab-Dartmouth/PyMethylProcess. Available on PyPI (pymethylprocess) and Docker (joshualevy44/pymethylprocess). Supplementary information: Supplementary data are available at Bioinformatics online.

SOAPMetaS: profiling large metagenome datasets efficiently on distributed clusters

Bioinformatics ◽

10.1093/bioinformatics/btaa697 ◽

2020 ◽

Author(s):

Shixu He ◽

Zhibo Huang ◽

Xiaohan Wang ◽

Lin Fang ◽

Shengkang Li ◽

...

Keyword(s):

Big Data ◽

Large Volume ◽

Machine Tools ◽

High Performance ◽

Marker Gene ◽

Source Code ◽

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

Multiple Sample

Abstract. Summary: The rapid increase of data size in metagenome research has raised the demand for new tools to process large datasets efficiently. To accelerate the metagenome profiling process in the scenario of big data, we developed SOAPMetaS, a marker gene-based multiple-sample metagenome profiling tool built on Apache Spark. SOAPMetaS demonstrates high performance and scalability in processing large datasets. It can process 80 samples of FASTQ data, summing up to 416 GiB, in around half an hour, and the accuracy of the species profiling results of SOAPMetaS is similar to that of MetaPhlAn2. SOAPMetaS can deal with a large volume of metagenome data more efficiently than commonly used single-machine tools. Availability and implementation: Source code is implemented in Java and freely available at https://github.com/BGI-flexlab/SOAPMetaS. Supplementary information: Supplementary data are available at Bioinformatics online.

Quantification of aneuploidy in targeted sequencing data using ASCETS

Bioinformatics ◽

10.1093/bioinformatics/btaa980 ◽

2020 ◽

Author(s):

Liam F Spurr ◽

Mehdi Touat ◽

Alison M Taylor ◽

Adrian M Dubuc ◽

Juliann Shih ◽

...

Keyword(s):

Copy Number ◽

Large Scale ◽

Genomic Analysis ◽

Targeted Sequencing ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Copy Number Changes ◽

Panel Sequencing ◽

Chromosome Level

Abstract. Summary: The expansion of targeted panel sequencing efforts has created opportunities for large-scale genomic analysis, but tools for copy-number quantification on panel data are lacking. We introduce ASCETS, a method for the efficient quantitation of arm and chromosome-level copy-number changes from targeted sequencing data. Availability and implementation: ASCETS is implemented in R and is freely available to non-commercial users on GitHub: https://github.com/beroukhim-lab/ascets, along with detailed documentation. Supplementary information: Supplementary data are available at Bioinformatics online.
