RabbitMash: accelerating hash-based genome analysis on modern multi-core architectures

Bioinformatics ◽

10.1093/bioinformatics/btaa754 ◽

2020 ◽

Author(s):

Zekun Yin ◽

Xiaoming Xu ◽

Jinxiao Zhang ◽

Yanjie Wei ◽

Bertil Schmidt ◽

...

Keyword(s):

Full Advantage ◽

Genome Analysis ◽

Large Scale ◽

Supplementary Information ◽

Supplementary Data ◽

Genome Analysis Toolkit

Abstract Motivation Mash is a popular hash-based genome analysis toolkit with applications to important downstream analysis tasks such as clustering and assembly. However, Mash is currently not able to fully exploit the capabilities of modern multi-core architectures, which in turn leads to high runtimes for large-scale genomic datasets. Results We present RabbitMash, an efficient, highly optimized implementation of Mash which can take full advantage of modern hardware including multi-threading, vectorization and fast I/O. We show that our approach achieves speedups of at least 1.3, 9.8, 8.5 and 4.4 compared to Mash for the operations sketch, dist, triangle and screen, respectively. Furthermore, RabbitMash is able to compute the all-versus-all distances of 100 321 genomes in <5 min on a 40-core workstation while Mash requires over 40 min. Availability and implementation RabbitMash is available at https://github.com/ZekunYin/RabbitMash. Supplementary information Supplementary data are available at Bioinformatics online.
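
For readers unfamiliar with hash-based genome comparison, the following is a minimal Python sketch of the general idea behind the sketch and dist operations named in the abstract: reduce each sequence to a small set of k-mer hashes, then estimate a distance from the overlap. It is an illustration under stated assumptions, not the Mash or RabbitMash implementation. The function names, the BLAKE2b hash and the default parameters are choices made here for the example, and details such as canonical k-mers are omitted.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def sketch(seq, k=21, size=1000):
    """Bottom-s MinHash sketch: the `size` smallest distinct k-mer hashes.

    BLAKE2b is an assumption made for this example; it is not the hash
    function used by Mash.
    """
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size=1000):
    """Estimate the Jaccard index from two sketches.

    Take the `size` smallest hashes of the merged sketches and count how
    many of them occur in both inputs.
    """
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)), defined as 1.0 when j == 0."""
    if j <= 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def distance(seq_a, seq_b, k=21, size=1000):
    """Sketch both sequences and return their estimated distance."""
    j = jaccard_estimate(sketch(seq_a, k, size), sketch(seq_b, k, size), size)
    return mash_distance(j, k)
```

An all-versus-all comparison such as the triangle operation repeats the pairwise step for every pair of sketches. Each pair is independent of the others, which is why this workload is a natural fit for multi-threading.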

EARRINGS: an efficient and accurate adapter trimmer entails no a priori adapter sequences

Bioinformatics ◽

10.1093/bioinformatics/btab025 ◽

2021 ◽

Author(s):

Ting-Hsuan Wang ◽

Cheng-Ching Huang ◽

Jui-Hung Hung

Keyword(s):

Open Source Software ◽

Large Scale ◽

A Priori ◽

Supplementary Information ◽

Supplementary Data ◽

Comparable Accuracy ◽

Meta Analyses ◽

Next Generation Sequencing Ngs ◽

Adapter Trimming ◽

Generation Sequencing

Abstract Motivation Cross-sample comparisons or large-scale meta-analyses based on next generation sequencing (NGS) involve replicable and universal data preprocessing, including removing adapter fragments in contaminated reads (i.e. adapter trimming). Modern adapter trimmers require users to provide candidate adapter sequences for each sample, which are sometimes unavailable or falsely documented in the repositories (such as GEO or SRA); large-scale meta-analyses are therefore jeopardized by suboptimal adapter trimming. Results Here we introduce a set of fast and accurate adapter detection and trimming algorithms that entail no a priori adapter sequences. These algorithms were implemented in modern C++ with SIMD and multithreading to accelerate their speed. Our experiments and benchmarks show that the implementation (i.e. EARRINGS), without being given any hint of adapter sequences, can reach comparable accuracy to, and higher throughput than, existing adapter trimmers. EARRINGS is particularly useful in meta-analyses of a large batch of datasets and can be incorporated in any sequence analysis pipeline at any scale. Availability and implementation EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS. Supplementary information Supplementary data are available at Bioinformatics online.

OpenBioLink: a benchmarking framework for large-scale biomedical link prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa274 ◽

2020 ◽

Vol 36 (13) ◽

pp. 4097-4098 ◽

Cited By ~ 3

Author(s):

Anna Breit ◽

Simon Ott ◽

Asan Agibetov ◽

Matthias Samwald

Keyword(s):

Link Prediction ◽

Large Scale ◽

Source Code ◽

Machine Learning Algorithms ◽

Knowledge Networks ◽

Supplementary Information ◽

Supplementary Data ◽

Biomedical Knowledge ◽

High Quality ◽

Baseline Evaluation

Abstract Summary Recently, novel machine-learning algorithms have shown potential for predicting undiscovered links in biomedical knowledge networks. However, dedicated benchmarks for measuring algorithmic progress have not yet emerged. With OpenBioLink, we introduce a large-scale, high-quality and highly challenging biomedical link prediction benchmark to transparently and reproducibly evaluate such algorithms. Furthermore, we present preliminary baseline evaluation results. Availability and implementation Source code and data are openly available at https://github.com/OpenBioLink/OpenBioLink. Supplementary information Supplementary data are available at Bioinformatics online.

PyMethylProcess—convenient high-throughput preprocessing workflow for DNA methylation data

Bioinformatics ◽

10.1093/bioinformatics/btz594 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5379-5381 ◽

Cited By ~ 2

Author(s):

Joshua J Levy ◽

Alexander J Titus ◽

Lucas A Salas ◽

Brock C Christensen

Keyword(s):

Large Scale ◽

Supplementary Information ◽

Scale Production ◽

Methylation Data ◽

Supplementary Data ◽

Data Preparation ◽

Methylation Array ◽

Project Home Page ◽

Large Scale Production ◽

Set Up

Abstract Summary Performing highly parallelized preprocessing of methylation array data using Python can accelerate data preparation for downstream methylation analyses, including large-scale, production-ready machine learning pipelines. We present a highly reproducible, scalable pipeline (PyMethylProcess) that can be quickly set up and deployed through Docker and PIP. Availability and implementation Project Home Page: https://github.com/Christensen-Lab-Dartmouth/PyMethylProcess. Available on PyPI (pymethylprocess), Docker (joshualevy44/pymethylprocess). Supplementary information Supplementary data are available at Bioinformatics online.

Quantification of aneuploidy in targeted sequencing data using ASCETS

Bioinformatics ◽

10.1093/bioinformatics/btaa980 ◽

2020 ◽

Author(s):

Liam F Spurr ◽

Mehdi Touat ◽

Alison M Taylor ◽

Adrian M Dubuc ◽

Juliann Shih ◽

...

Keyword(s):

Copy Number ◽

Large Scale ◽

Genomic Analysis ◽

Targeted Sequencing ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Copy Number Changes ◽

Panel Sequencing ◽

Chromosome Level

Abstract Summary The expansion of targeted panel sequencing efforts has created opportunities for large-scale genomic analysis, but tools for copy-number quantification on panel data are lacking. We introduce ASCETS, a method for the efficient quantitation of arm and chromosome-level copy-number changes from targeted sequencing data. Availability and implementation ASCETS is implemented in R and is freely available to non-commercial users on GitHub: https://github.com/beroukhim-lab/ascets, along with detailed documentation. Supplementary information Supplementary data are available at Bioinformatics online.

RCytoGPS: An R Package for Reading and Visualizing Cytogenetics Data

Bioinformatics ◽

10.1093/bioinformatics/btab683 ◽

2021 ◽

Author(s):

Zachary B Abrams ◽

Dwayne G Tally ◽

Lynne V Abruzzo ◽

Kevin R Coombes

Keyword(s):

Large Scale ◽

International System ◽

Genetic Data ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Computational Tools ◽

Text Format ◽

Karyotype Analyses ◽

Computational Analyses

Abstract Summary Cytogenetics data, or karyotypes, are among the most common clinically used forms of genetic data. Karyotypes are stored as standardized text strings using the International System for Human Cytogenomic Nomenclature (ISCN). Historically, these data have not been used in large-scale computational analyses due to limitations in the ISCN text format and structure. Recently developed computational tools such as CytoGPS have enabled large-scale computational analyses of karyotypes. To further enable such analyses, we have now developed RCytoGPS, an R package that takes JSON files generated from CytoGPS.org and converts them into objects in R. This conversion facilitates the analysis and visualizations of karyotype data. In effect this tool streamlines the process of performing large-scale karyotype analyses, thus advancing the field of computational cytogenetic pathology. Availability and Implementation Freely available at https://CRAN.R-project.org/package=RCytoGPS. The code for the underlying CytoGPS software can be found at https://github.com/i2-wustl/CytoGPS. Supplementary information There is no supplementary data.

dbgap2x: An R package to explore and extract data from the database of Genotypes and Phenotypes (dbGaP)

Bioinformatics ◽

10.1093/bioinformatics/btz680 ◽

2019 ◽

Cited By ~ 1

Author(s):

Grégoire Versmée ◽

Laura Versmée ◽

Mikaël Dusenne ◽

Niloofar Jalali ◽

Paul Avillach

Keyword(s):

Data Sharing ◽

Large Scale ◽

Genomic Data ◽

R Package ◽

National Institutes Of Health ◽

Supplementary Information ◽

Supplementary Data ◽

Complex Procedure ◽

Range Of Functions ◽

The Relationship

Abstract Summary Based on the Genomic Data Sharing Policy issued in August 2007, the National Institutes of Health (NIH) has supported several repositories such as the database of Genotypes and Phenotypes (dbGaP). dbGaP is an online repository that provides access to large-scale genetic and phenotypic datasets with more than 1,000 studies. However, navigating the website and understanding the relationship between the studies are not easy tasks. Moreover, the decryption of the files is a complex procedure. In this study we propose the dbgap2x R package that covers a broad range of functions for searching dbGaP studies, exploring the characteristics of a study and easily decrypting the files from dbGaP. Availability and implementation dbgap2x is an R package with the code available at https://github.com/gversmee/dbgap2x. A containerized version including the package, a Jupyter server and a Notebook example is available at https://hub.docker.com/r/gversmee/dbgap2x. Supplementary information Supplementary data are available at Bioinformatics online.

Automated exploration of gene ontology term and pathway networks with ClueGO-REST

Bioinformatics ◽

10.1093/bioinformatics/btz163 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3864-3866 ◽

Cited By ~ 6

Author(s):

Bernhard Mlecnik ◽

Jérôme Galon ◽

Gabriela Bindea

Keyword(s):

Experimental Data ◽

Large Scale ◽

Bioinformatic Analysis ◽

Supplementary Information ◽

Gene Ontology Term ◽

Supplementary Data ◽

Ontology Term ◽

Biological Interpretation ◽

Programmatic Access ◽

Pathway Networks

Abstract Summary Large-scale technologies produce massive amounts of experimental data that need to be investigated. To improve their biological interpretation we have developed ClueGO, a Cytoscape App that selects representative Gene Ontology terms and pathways for one or multiple lists of genes/proteins and visualizes them as functionally organized networks. Because of its reliability, user-friendliness and support of many species, ClueGO has gained a large community of users. To further allow scientists programmatic access to ClueGO with R, Python, JavaScript etc., we implemented the cyREST API into ClueGO. In this article we describe this novel, complementary way of accessing ClueGO via REST, and provide R and Python examples to demonstrate how ClueGO workflows can be integrated into bioinformatic analysis pipelines. Availability and implementation ClueGO is available in the Cytoscape App Store (http://apps.cytoscape.org/apps/cluego). Supplementary information Supplementary data are available at Bioinformatics online.

SimkaMin: fast and resource frugal de novo comparative metagenomics

Bioinformatics ◽

10.1093/bioinformatics/btz685 ◽

2019 ◽

Author(s):

Gaëtan Benoit ◽

Mahendra Mariadassou ◽

Stéphane Robin ◽

Sophie Schbath ◽

Pierre Peterlongo ◽

...

Keyword(s):

Large Scale ◽

De Novo ◽

Supplementary Information ◽

Metagenomic Data ◽

Supplementary Data ◽

Comparative Metagenomics ◽

Large Sets ◽

Efficient Data ◽

Genomic Similarity

Abstract Motivation De novo comparative metagenomics is one of the most straightforward ways to analyze large sets of metagenomic data. The latest methods use the fraction of shared k-mers to estimate genomic similarity between read sets. However, those methods, while extremely efficient, are still limited by computational needs for practical usage outside of large computing facilities. Results We present SimkaMin, a quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities. One billion metagenomic reads can be analyzed in <3 min, with tiny memory (1.09 GB) and disk (≈0.3 GB) requirements and without altering the quality of the downstream comparative analyses, making SimkaMin a tool perfectly tailored for very large-scale metagenomic projects. Availability and implementation https://github.com/GATB/simka. Supplementary information Supplementary data are available at Bioinformatics online.
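
The two dissimilarities named in the abstract can be illustrated on toy data. The Python sketch below computes them exactly from full k-mer counts. It does not reproduce SimkaMin's subsampling scheme, which estimates the same quantities from a subset of k-mers, and the function names and inputs are assumptions made for this example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Abundance-based Bray-Curtis dissimilarity.

    BC = 1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b))
    """
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[km], b[km]) for km in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


def jaccard_dissimilarity(a, b):
    """Presence/absence Jaccard dissimilarity: 1 - |A and B| / |A or B|."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


# Two tiny "read sets" that share the 3-mers ACG and CGT.
sample_a = kmer_counts(["ACGTAC"], k=3)
sample_b = kmer_counts(["ACGTTT"], k=3)
print(bray_curtis(sample_a, sample_b))
print(jaccard_dissimilarity(sample_a, sample_b))
```

Bray-Curtis weights each k-mer by its abundance, whereas Jaccard only records whether a k-mer is present, so the two can rank the same pair of samples differently.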

RCytoGPS: An R Package for Reading and Visualizing Cytogenetics Data

10.1101/2021.03.16.389791 ◽

2021 ◽

Author(s):

Zachary B. Abrams ◽

Dwayne G. Tally ◽

Lynne V. Abruzzo ◽

Kevin R. Coombes

Keyword(s):

Large Scale ◽

International System ◽

Genetic Data ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Computational Tools ◽

Text Format ◽

Karyotype Analyses ◽

Computational Analyses

Abstract Summary Cytogenetics data, or karyotypes, are among the most common clinically used forms of genetic data. Karyotypes are stored as standardized text strings using the International System for Human Cytogenomic Nomenclature (ISCN). Historically, these data have not been used in large-scale computational analyses due to limitations in the ISCN text format and structure. Recently developed computational tools such as CytoGPS have enabled large-scale computational analyses of karyotypes. To further enable such analyses, we have now developed RCytoGPS, an R package that takes JSON files generated from CytoGPS.org and converts them into objects in R. This conversion facilitates the analysis and visualizations of karyotype data. In effect this tool streamlines the process of performing large-scale karyotype analyses, thus advancing the field of computational cytogenetic pathology. Availability and Implementation Freely available at https://CRAN.R-project.org/package=RCytoGPS. Supplementary information Supplementary data are available at Bioinformatics online.

ProteoCombiner: integrating bottom-up with top-down proteomics data for improved proteoform assessment

Bioinformatics ◽

10.1093/bioinformatics/btaa958 ◽

2020 ◽

Author(s):

Diogo B Lima ◽

Mathieu Dupré ◽

Magalie Duchateau ◽

Quentin Giai Gianetto ◽

Martial Rey ◽

...

Keyword(s):

Search Engines ◽

High Performance ◽

Large Scale ◽

Supplementary Information ◽

Supplementary Data ◽

Post Translational Modification ◽

Proteomics Data ◽

Top Down ◽

Proteomic Data ◽

Demonstration Video

Abstract Motivation We present high-performance software that integrates shotgun with top-down proteomic data. The tool can deal with multiple experiments and search engines. It enables rapid and easy visualization, manual validation and comparison of the identified proteoform sequences, including the post-translational modification characterization. Results We demonstrate the effectiveness of our approach on a large-scale Escherichia coli dataset; ProteoCombiner unambiguously shortlisted proteoforms among those identified by the multiple search engines. Availability and implementation ProteoCombiner, a demonstration video and user tutorial are freely available at https://proteocombiner.pasteur.fr, for academic use; all data are thus available from the ProteomeXchange consortium (identifier PXD017618). Supplementary information Supplementary data are available at Bioinformatics online.
