Large scale microbiome profiling in the cloud

Camilo Valdes; Vitalii Stebliankin; Giri Narasimhan

doi:10.1093/bioinformatics/btz356

Large scale microbiome profiling in the cloud

Bioinformatics ◽

10.1093/bioinformatics/btz356 ◽

2019 ◽

Vol 35 (14) ◽

pp. i13-i22 ◽

Cited By ~ 1

Author(s):

Camilo Valdes ◽

Vitalii Stebliankin ◽

Giri Narasimhan

Keyword(s):

Large Scale ◽

Bacterial Population ◽

Reference Genome ◽

Supplementary Information ◽

Bacterial Genomes ◽

Reference Collection ◽

Order Of Magnitude ◽

Spark Framework ◽

Reference Genomes ◽

Microbiome Profiling

Abstract

Motivation: Bacterial metagenomics profiling for metagenomic whole genome sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale well to larger reference genome collections. However, large reference genome collections can provide a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources.

Results: We developed Flint, a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark’s built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43 552 bacterial genomes from Ensembl. Flint runs on Amazon’s Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40 K genomes on 64 machines in 67 s, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the need to store large quantities of intermediate alignments.

Availability and implementation: Flint is open source software, available under the MIT License (MIT). Source code is available at https://github.com/camilo-v/flint.

Supplementary information: Supplementary data are available at Bioinformatics online.
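The throughput and cost figures quoted for Flint can be turned into a rough per-sample estimate. The sketch below is only back-of-the-envelope arithmetic: the function names are hypothetical, and it assumes cost scales linearly with read count at the sustained rate, ignoring cluster start-up time and billing granularity.

```python
# Figures quoted in the Flint abstract.
READS_PER_HOUR = 55_000_000        # sustained streaming mapping rate
CLUSTER_COST_PER_HOUR_USD = 8.00   # hourly cluster cost


def cost_per_million_reads(reads_per_hour=READS_PER_HOUR,
                           cost_per_hour=CLUSTER_COST_PER_HOUR_USD):
    """Cluster cost in USD for mapping one million reads at the sustained rate."""
    return cost_per_hour / (reads_per_hour / 1_000_000)


def estimated_cost_usd(n_reads,
                       reads_per_hour=READS_PER_HOUR,
                       cost_per_hour=CLUSTER_COST_PER_HOUR_USD):
    """Assumed linear estimate: hours needed multiplied by the hourly cost."""
    hours = n_reads / reads_per_hour
    return hours * cost_per_hour


if __name__ == "__main__":
    print(f"USD per million reads: {cost_per_million_reads():.3f}")
    print(f"USD for 110 million reads: {estimated_cost_usd(110_000_000):.2f}")
```

Under these assumptions a million reads costs roughly 15 US cents of cluster time; real bills depend on how the provider rounds usage.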


AStrap: identification of alternative splicing from transcript sequences without a reference genome

Bioinformatics ◽

10.1093/bioinformatics/bty1008 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2654-2656 ◽

Cited By ~ 5

Author(s):

Guoli Ji ◽

Wenbin Ye ◽

Yaru Su ◽

Moliang Chen ◽

Guangzao Huang ◽

...

Keyword(s):

Machine Learning ◽

Alternative Splicing ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Supplementary Information ◽

Model Organisms ◽

Sequencing Data ◽

Extensive Evaluation ◽

Reference Genomes

Abstract

Summary: Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity; however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources.

Availability and implementation: AStrap is available for download at https://github.com/BMILAB/AStrap.

Supplementary information: Supplementary data are available at Bioinformatics online.


DeepMAsED: evaluating the quality of metagenomic assemblies

Bioinformatics ◽

10.1093/bioinformatics/btaa124 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3011-3017 ◽

Cited By ~ 5

Author(s):

Olga Mineeva ◽

Mateo Rojas-Carulla ◽

Ruth E Ley ◽

Bernhard Schölkopf ◽

Nicholas D Youngblut

Keyword(s):

Large Scale ◽

State Of The Art ◽

Ground Truth ◽

Supplementary Information ◽

Learning Approach ◽

Wide Range ◽

Metagenome Assembly ◽

Model Training ◽

Reference Genomes

Abstract

Motivation: Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies.

Results: We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state of the art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications.

Conclusions: DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is re-training the model with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects.

Availability and implementation: DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED.

Supplementary information: Supplementary data are available at Bioinformatics online.


Crisflash: open-source software to generate CRISPR guide RNAs against genomes annotated with individual variation

Bioinformatics ◽

10.1093/bioinformatics/btz019 ◽

2019 ◽

Vol 35 (17) ◽

pp. 3146-3147 ◽

Cited By ~ 10

Author(s):

Adrien L S Jacquin ◽

Duncan T Odom ◽

Margus Lukk

Keyword(s):

Open Source Software ◽

Software Tool ◽

Supplementary Information ◽

Small Scale ◽

Genome Sequences ◽

Guide RNAs ◽

Genome Modification ◽

Order Of Magnitude ◽

sgRNA Design ◽

Reference Genomes

Abstract

Summary: The CRISPR/Cas9 system requires short guide RNAs (sgRNAs) to direct genome modification. Most currently available tools for sgRNA design operate only with standard reference genomes, and are best suited for small-scale projects. To address these limitations, we developed Crisflash, a software tool for fast sgRNA design and potential off-target discovery, built for performance and flexibility. Crisflash can rapidly design CRISPR guides against any sequenced genome or genome sequences, and can optimize guide accuracy by incorporating user-supplied variant data. Crisflash is over an order of magnitude faster than comparable tools, even using a single CPU core, and efficiently and robustly scores the potential off-targeting of all possible candidate CRISPR guide oligonucleotides.

Availability and implementation: https://github.com/crisflash

Supplementary information: Supplementary data are available at Bioinformatics online.
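To make "all possible candidate CRISPR guide oligonucleotides" concrete, here is a minimal sketch of candidate enumeration. It is not Crisflash's implementation: it assumes the common SpCas9 convention of a 20-nt protospacer immediately followed by an NGG PAM, scans the forward strand only, and performs no off-target scoring or variant handling. The function name is hypothetical.

```python
import re


def find_candidate_guides(sequence, guide_len=20):
    """Return (start, protospacer) pairs for forward-strand sites.

    A site is `guide_len` unambiguous bases followed directly by an
    NGG PAM (assumed SpCas9 convention). A lookahead is used so that
    overlapping candidate sites are all reported.
    """
    sequence = sequence.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})[ACGT]GG)" % guide_len)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    demo = "A" * 20 + "TGG"
    for start, guide in find_candidate_guides(demo):
        print(start, guide)
```

A fuller tool would also scan the reverse complement and score each candidate against the rest of the genome, which is where most of the computational cost lies.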


Kmer2SNP: reference-free SNP calling from raw reads based on matching

10.1101/2020.05.17.100305 ◽

2020 ◽

Author(s):

Yanbo Li ◽

Yu Lin

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Fundamental Problem ◽

Disease Diagnosis ◽

Hybrid Assembly ◽

SNP Calling ◽

Sequencing Technologies ◽

Order Of Magnitude ◽

Maximum Weight Matching ◽

Reference Genomes

Abstract

The development of DNA sequencing technologies provides the opportunity to call heterozygous SNPs for each individual. SNP calling is a fundamental problem of genetic analysis and has many applications, such as gene-disease diagnosis, drug design, and ancestry inference. Reference-based SNP calling approaches generate highly accurate results, but they face serious limitations, especially because high-quality reference genomes are not available for many species. Although reference-free approaches have the potential to call SNPs without using the reference genome, they have not been widely applied to large and complex genomes because existing approaches suffer from low recall/precision or high runtime.

We develop a reference-free algorithm, Kmer2SNP, to call SNPs directly from raw reads. Kmer2SNP first computes the k-mer frequency distribution from reads and identifies potential heterozygous k-mers which only appear in one haplotype. Kmer2SNP then constructs a graph by choosing these heterozygous k-mers as vertices and connecting edges between pairs of heterozygous k-mers that might correspond to SNPs. Kmer2SNP further assigns a weight to each edge using overlapping information between heterozygous k-mers, computes a maximum weight matching and finally outputs SNPs as edges between k-mer pairs in the matching.

We benchmark Kmer2SNP against reference-free methods including hybrid (assembly-based) and assembly-free methods on both simulated and real datasets. Experimental results show that Kmer2SNP achieves better SNP calling quality while being an order of magnitude faster than the state-of-the-art methods. Kmer2SNP shows the potential of calling SNPs only using k-mers from raw reads without assembly. The source code is freely available at https://github.com/yanboANU/Kmer2SNP.
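The pipeline described in the Kmer2SNP abstract (count k-mers, keep likely heterozygous ones, connect pairs that could be the two alleles of a SNP, then pick a matching) can be illustrated with a toy sketch. Several parts are assumptions rather than the authors' method: the heterozygous band `low..high` is supplied by hand, two k-mers are paired only if they differ at the middle base, the edge weight is simply the smaller of the two counts (the paper uses overlap information), reverse complements are ignored, and a greedy matching stands in for a true maximum weight matching.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def candidate_snp_pairs(reads, k, low, high):
    """Return k-mer pairs chosen by a greedy matching (toy illustration)."""
    counts = count_kmers(reads, k)
    # Assumed heuristic: heterozygous k-mers have counts in [low, high].
    het = [kmer for kmer, c in counts.items() if low <= c <= high]

    # Group k-mers that are identical except at the middle position.
    mid = k // 2
    groups = {}
    for kmer in het:
        key = kmer[:mid] + "N" + kmer[mid + 1:]
        groups.setdefault(key, []).append(kmer)

    # Build weighted edges between k-mers in the same group.
    edges = []
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            edges.append((min(counts[a], counts[b]), a, b))

    # Greedy matching: heaviest edges first, each k-mer used at most once.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))
    used, matching = set(), []
    for _, a, b in edges:
        if a not in used and b not in used:
            used.update((a, b))
            matching.append((a, b))
    return matching


if __name__ == "__main__":
    reads = ["AACGT"] * 3 + ["AATGT"] * 3
    print(candidate_snp_pairs(reads, k=5, low=2, high=4))
```

The matching step matters because one heterozygous k-mer may be compatible with up to three partners; selecting a matching keeps each k-mer in at most one reported SNP.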


HaploTypo: a variant-calling pipeline for phased genomes

Bioinformatics ◽

10.1093/bioinformatics/btz933 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2569-2571 ◽

Cited By ~ 3

Author(s):

Cinta Pegueroles ◽

Verónica Mixão ◽

Laia Carreté ◽

Manu Molina ◽

Toni Gabaldón

Keyword(s):

Genetic Variation ◽

Genetic Variant ◽

Reference Genome ◽

Variant Calling ◽

Supplementary Information ◽

Haplotype Structure ◽

Supplementary Data ◽

Heterozygous Variant ◽

Reference Genomes

Abstract

Summary: An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome.

Availability and implementation: HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo, and as a Docker image.

Supplementary information: Supplementary data are available at Bioinformatics online.


PaSiT: a novel approach based on short-oligonucleotide frequencies for efficient bacterial identification and typing

Bioinformatics ◽

10.1093/bioinformatics/btz964 ◽

2020 ◽

Vol 36 (8) ◽

pp. 2337-2344 ◽

Cited By ~ 1

Author(s):

Gleb Goussarov ◽

Ilse Cleenwerck ◽

Mohamed Mysara ◽

Natalie Leys ◽

Pieter Monsieurs ◽

...

Keyword(s):

Large Scale ◽

Bacterial Species ◽

Supplementary Information ◽

Nucleotide Identity ◽

Average Nucleotide Identity ◽

Bacterial Genomes ◽

Short Oligonucleotide ◽

Novel Approach ◽

Novel Method ◽

Alignment Step

Abstract

Motivation: One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore not suitable for large-scale comparisons. Short oligonucleotide-based methods offer a faster alternative, but at the expense of accuracy. Here, we aim to address this shortcoming by providing software that implements a novel method based on short-oligonucleotide frequencies to compute inter-genomic distances.

Results: Our tetranucleotide and hexanucleotide implementations, which were optimized based on a taxonomically well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short oligonucleotide-based method TETRA and average nucleotide identity, for identifying bacterial species and strains, respectively. Moreover, the lightweight nature of this method makes it applicable for large-scale analyses.

Availability and implementation: The method introduced here was implemented, together with other existing methods, in GenDisCal, dependency-free software written in C, available as source code from https://github.com/LM-UGent/GenDisCal. The software supports multithreading and has been tested on Windows and Linux (CentOS). In addition, a Java-based graphical user interface that acts as a wrapper for the software is also available.

Supplementary information: Supplementary data are available at Bioinformatics online.
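The general idea of an alignment-free distance based on short-oligonucleotide frequencies can be sketched as follows. This is not the PaSiT formula, which the abstract does not specify: the sketch simply builds tetranucleotide frequency vectors and compares them with a Euclidean distance, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Relative frequencies of all 4**k unambiguous k-mers, in a fixed order."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)  # skips k-mers containing N etc.
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def frequency_distance(seq_a, seq_b, k=4):
    """Euclidean distance between k-mer frequency vectors (assumed metric)."""
    freq_a = kmer_frequencies(seq_a, k)
    freq_b = kmer_frequencies(seq_b, k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(freq_a, freq_b)))


if __name__ == "__main__":
    print(frequency_distance("ACGTACGTACGT", "ACGTACGTACGT"))
    print(frequency_distance("AAAAAAAA", "CCCCCCCC"))
```

Because each genome is reduced to a fixed-length vector (256 entries for tetranucleotides), pairwise comparison cost is independent of genome length, which is what makes this family of methods attractive for large-scale analyses.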


GenomeWarp: an alignment-based variant coordinate transformation

Bioinformatics ◽

10.1093/bioinformatics/btz218 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4389-4391

Author(s):

Cory Y McLean ◽

Yeongwoo Hwang ◽

Ryan Poplin ◽

Mark A DePristo

Keyword(s):

Reference Genome ◽

Source Code ◽

Supplementary Information ◽

Data Generation ◽

Data Types ◽

Error Corrections ◽

Reference Genome Assembly ◽

Genome Wide ◽

Genome Assemblies ◽

Reference Genomes

Abstract

Summary: Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transformation for coordinate-only data types, but none supports accurate transformation of genome-wide short variation. Here we present GenomeWarp, a tool for efficiently transforming variants between genome assemblies. GenomeWarp transforms regions and short variants in a conservative manner to minimize false positive and negative variants in the target genome, and converts over 99% of regions and short variants from a representative human genome.

Availability and implementation: GenomeWarp is written in Java. All source code and the user manual are freely available at https://github.com/verilylifesciences/genomewarp.

Supplementary information: Supplementary data are available at Bioinformatics online.


Nebula: ultra-efficient mapping-free structural variant genotyper

Nucleic Acids Research ◽

10.1093/nar/gkab025 ◽

2021 ◽

Author(s):

Parsoa Khorsand ◽

Fereydoun Hormozdiari

Keyword(s):

Large Scale ◽

Structural Variants ◽

Sequencing Technologies ◽

Generic Framework ◽

Common Genetic Variants ◽

Order Of Magnitude ◽

Complex Events ◽

Comparable Accuracy ◽

Using Data ◽

Computational Resources

Abstract

Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second- and third-generation whole-genome sequencing technologies. However, the genotyping of these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, utilizes the changes in the count of k-mers to predict the genotype of structural variants. We show that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping structural variants, but also has comparable accuracy to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
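The intuition behind genotyping from k-mer counts can be shown with a deliberately simplified sketch. It is not Nebula's model: it assumes a set of k-mers that occur only on the variant allele, assumes their expected count is zero, half, or all of a known homozygous k-mer count for genotypes 0/0, 0/1 and 1/1, and picks the closest expectation. All names and the nearest-expectation rule are illustrative assumptions.

```python
def genotype_from_kmer_counts(alt_kmer_counts, homozygous_count):
    """Pick the genotype whose expected k-mer count is closest to the mean.

    alt_kmer_counts: observed counts of k-mers specific to the variant allele.
    homozygous_count: expected count of such a k-mer when both haplotypes
        carry the variant (assumed known, e.g. from sample-wide k-mer depth).
    """
    if not alt_kmer_counts:
        raise ValueError("at least one k-mer count is required")
    if homozygous_count <= 0:
        raise ValueError("homozygous_count must be positive")

    mean_count = sum(alt_kmer_counts) / len(alt_kmer_counts)
    expected = {
        "0/0": 0.0,                     # variant absent
        "0/1": homozygous_count / 2.0,  # variant on one haplotype
        "1/1": float(homozygous_count), # variant on both haplotypes
    }
    return min(expected, key=lambda g: abs(mean_count - expected[g]))


if __name__ == "__main__":
    print(genotype_from_kmer_counts([14, 16, 15], homozygous_count=30))
```

Counting k-mers avoids read mapping entirely, which is the source of the speed advantage the abstract reports; a practical genotyper would also need to handle sequencing errors and k-mers shared with other genomic loci.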


General Unified Microbiome Profiling Pipeline (GUMPP) for Large Scale, Streamlined and Reproducible Analysis of Bacterial 16S rRNA Data to Predicted Microbial Metagenomes, Enzymatic Reactions and Metabolic Pathways

Metabolites ◽

10.3390/metabo11060336 ◽

2021 ◽

Vol 11 (6) ◽

pp. 336

Author(s):

Boštjan Murovec ◽

Leon Deutsch ◽

Blaž Stres

Keyword(s):

16S rRNA ◽

Metabolic Pathways ◽

Large Scale ◽

Enzymatic Reactions ◽

Operational Taxonomic Units ◽

Biochemical Pathways ◽

Meaningful Information ◽

Novel Biomarkers ◽

Reproducible Analysis ◽

Microbiome Profiling

The General Unified Microbiome Profiling Pipeline (GUMPP) was developed for large-scale, streamlined and reproducible analysis of bacterial 16S rRNA data and prediction of microbial metagenomes, enzymatic reactions and metabolic pathways from amplicon data. The GUMPP workflow introduces reproducible data analyses at each of three levels of resolution: genus, operational taxonomic units (OTUs) and amplicon sequence variants (ASVs). The ability to support reproducible analyses enables production of datasets that ultimately identify the biochemical pathways characteristic of disease pathology. These datasets, coupled with biostatistics and machine learning approaches, can play a significant role in extracting meaningful information from a wide set of 16S rRNA datasets. Adopting GUMPP in gut-microbiota-related research enables a focus on generating novel biomarkers that can lead to mechanistic hypotheses applicable to the development of novel therapies in personalized medicine.


Reference flow: reducing reference bias using multiple population genomes

Genome Biology ◽

10.1186/s13059-020-02229-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Nae-Chyun Chen ◽

Brad Solomon ◽

Taher Mun ◽

Sheila Iyer ◽

Ben Langmead

Keyword(s):

Genetic Variation ◽

Reference Genome ◽

Alignment Method ◽

Sequencing Data ◽

Computational Overhead ◽

Reference Flow ◽

Multiple Population ◽

Reference Bias ◽

Flow Alignment ◽

Reference Genomes

Abstract

Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.
