Haploflow: Strain-resolved de novo assembly of viral genomes

Viral infections often involve multiple related viral strains, arising from coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assessed Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. Haploflow reconstructed high-quality strain-resolved assemblies from clinical HCMV samples and SARS-CoV-2 genomes from wastewater metagenomes identical to genomes from clinical isolates.
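
To make the graph structure concrete, here is a minimal sketch of a de Bruijn graph built from reads, with nodes as (k-1)-mers and edges weighted by k-mer counts. It is not Haploflow's flow algorithm; the function names and toy reads are illustrative assumptions. It only shows how a difference between two related sequences appears as a branching node whose outgoing edges carry separate coverage.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return {prefix (k-1)-mer: {suffix (k-1)-mer: k-mer count}}."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge from its prefix to its suffix.
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. candidate strain variation."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Two hypothetical reads that differ at a single position.
reads = ["ACGTAC", "ACGTTC"]
graph = build_de_bruijn(reads, k=4)
branches = branching_nodes(graph)
```

In this toy input the shared prefix edge is seen twice, while each branch after the differing base is seen once; a strain-aware assembler has to use such coverage differences to decide which paths belong together.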

Scalable Genome Assembly through Parallel de Bruijn Graph Construction for Multiple k-mers

Scientific Reports ◽

10.1038/s41598-019-51284-9 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Kanak Mahadik ◽

Christopher Wright ◽

Milind Kulkarni ◽

Saurabh Bagchi ◽

Somali Chaterji

Keyword(s):

De Novo ◽

De Bruijn Graph ◽

High Quality ◽

De Bruijn Graphs ◽

Sequencing Technologies ◽

De Bruijn ◽

Similar Accuracy ◽

Valued Graph ◽

Assembly Algorithms ◽

Level Parallelism

Remarkable advancements in high-throughput gene sequencing technologies have led to an exponential growth in the number of sequenced genomes. However, the unavailability of highly parallel and scalable de novo assembly algorithms has hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBG). However, this process of sequentially iterating from small to large k-values slows down assembly. In this paper, we propose ScalaDBG, which transforms this sequential process by building DBGs for each distinct k-value in parallel. We develop an innovative mechanism to “patch” a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism, by both scaling up on all cores of a node and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG completes assembling the genome faster than IDBA-UD, but with similar accuracy on a variety of datasets (6.8X faster for one of the most complex genomes in our dataset).
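
The idea of building one graph per k-value concurrently, rather than iterating from small to large k, can be sketched as below. This is only an illustration under simplifying assumptions: each "graph" is reduced to its k-mer set, a thread pool stands in for the multi-core and multi-node parallelism described in the abstract, and the patching step is not shown. All names are hypothetical and not taken from ScalaDBG.

```python
from concurrent.futures import ThreadPoolExecutor


def kmer_set(reads, k):
    """Collect the distinct k-mers of all reads (the node set of a DBG)."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def build_graphs(reads, k_values):
    """Submit one independent construction task per k-value."""
    with ThreadPoolExecutor() as pool:
        futures = {k: pool.submit(kmer_set, reads, k) for k in k_values}
        return {k: future.result() for k, future in futures.items()}


reads = ["ACGTACGT", "CGTACGTT"]
graphs = build_graphs(reads, [3, 4, 5])
```

Because no task depends on another task's output, the constructions can run side by side; the sequential dependency only returns when results from a smaller k are merged into a larger k.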

CAMISIM: Simulating metagenomes and microbial communities

10.1101/300970 ◽

2018 ◽

Cited By ~ 4

Author(s):

Adrian Fritz ◽

Peter Hofmann ◽

Stephan Majda ◽

Eik Dahms ◽

Johannes Dröge ◽

...

Keyword(s):

Microbial Communities ◽

De Novo ◽

Real Data ◽

Small Data ◽

Data Sets ◽

Sequencing Data ◽

Taxonomic Profiling ◽

Benchmark Data ◽

Sequencing Technologies ◽

Wide Range

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets is required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second and third generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM

Error-estimation-guided rebuilding of de novo models increases the success rate of ab initio phasing

Acta Crystallographica Section D Biological Crystallography ◽

10.1107/s0907444912037961 ◽

2012 ◽

Vol 68 (11) ◽

pp. 1522-1534 ◽

Cited By ~ 5

Author(s):

Rojan Shrestha ◽

David Simoncini ◽

Kam Y. J. Zhang

Keyword(s):

Protein Structure ◽

Ab Initio ◽

Diffraction Data ◽

Structure Prediction ◽

De Novo ◽

Coarse Grained ◽

Data Sets ◽

Molecular Replacement ◽

High Quality ◽

Protein Targets

Recent advancements in computational methods for protein-structure prediction have made it possible to generate the high-quality de novo models required for ab initio phasing of crystallographic diffraction data using molecular replacement. Despite those encouraging achievements in ab initio phasing using de novo models, its success is limited to those targets for which high-quality de novo models can be generated. In order to increase the scope of targets to which ab initio phasing with de novo models can be successfully applied, it is necessary to reduce the errors in the de novo models that are used as templates for molecular replacement. Here, an approach is introduced that can identify and rebuild the residues with larger errors, which subsequently reduces the overall Cα root-mean-square deviation (CA-RMSD) from the native protein structure. The error in a predicted model is estimated from the average pairwise geometric distance per residue computed among selected lowest energy coarse-grained models. This score is subsequently employed to guide a rebuilding process that focuses on more error-prone residues in the coarse-grained models. This rebuilding methodology has been tested on ten protein targets that were unsuccessful using previous methods. The average CA-RMSD of the coarse-grained models was improved from 4.93 to 4.06 Å. For those models with CA-RMSD less than 3.0 Å, the average CA-RMSD was improved from 3.38 to 2.60 Å. These rebuilt coarse-grained models were then converted into all-atom models and refined to produce improved de novo models for molecular replacement. Seven diffraction data sets were successfully phased using rebuilt de novo models, indicating the improved quality of these rebuilt de novo models and the effectiveness of the rebuilding process. Software implementing this method, called MORPHEUS, can be downloaded from http://www.riken.jp/zhangiru/software.html.
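
The per-residue error estimate described above (average pairwise distance among selected models) and the CA-RMSD used to report improvement can be sketched as follows. This is a rough illustration, not the MORPHEUS implementation: it assumes the models are already superimposed, contain only Cα coordinates, and have the same number of residues, and the function names are hypothetical.

```python
import math
from itertools import combinations


def per_residue_error(models):
    """Average pairwise Cα distance per residue across all model pairs.

    models: list of models, each a list of (x, y, z) tuples,
    assumed pre-superimposed and of equal length.
    """
    pairs = list(combinations(models, 2))
    n_residues = len(models[0])
    return [
        sum(math.dist(a[i], b[i]) for a, b in pairs) / len(pairs)
        for i in range(n_residues)
    ]


def ca_rmsd(model, native):
    """Root-mean-square deviation over matched Cα positions."""
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(model, native))
    return math.sqrt(squared / len(model))


# Three toy two-residue models that agree on residue 0 and disagree on residue 1.
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
model_c = [(0.0, 0.0, 0.0), (1.0, 4.0, 0.0)]
errors = per_residue_error([model_a, model_b, model_c])
```

Residues where the models disagree most receive the highest score, which is the signal a rebuilding step could use to decide where to focus.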

A de novo genome assembler based on MapReduce and bi-directed de Bruijn graph

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm.2016.7822494 ◽

2016 ◽

Author(s):

Yuehua Zhang ◽

Pengfei Xuan ◽

Yunsheng Wang ◽

Pradip K. Srimani ◽

Feng Luo

Keyword(s):

De Novo ◽

De Bruijn Graph ◽

De Bruijn ◽

Genome Assembler

Faucet: streaming de novo assembly graph construction

10.1101/125658 ◽

2017 ◽

Author(s):

Roye Rozov ◽

Gil Goldshlager ◽

Eran Halperin ◽

Ron Shamir

Keyword(s):

Resource Use ◽

De Novo ◽

State Of The Art ◽

Supplementary Information ◽

De Bruijn Graph ◽

Assembly Quality ◽

Metagenome Assembly ◽

Streaming Algorithm ◽

Supplementary Material ◽

De Bruijn

Motivation: We present Faucet, a 2-pass streaming algorithm for assembly graph construction. Faucet builds an assembly graph incrementally as each read is processed. Thus, reads need not be stored locally, as they can be processed while downloading data and then discarded. We demonstrate this functionality by performing streaming graph assembly of publicly available data, and observe that the ratio of disk use to raw data size decreases as coverage is increased. Results: Faucet pairs the de Bruijn graph obtained from the reads with additional metadata derived from them. We show that these metadata (coverage counts collected at junction k-mers and connections bridging between junction pairs) contain the most salient information needed for assembly, and demonstrate that they enable cleaning of metagenome assembly graphs, greatly improving contiguity while maintaining accuracy. We compared Faucet’s resource use and assembly quality to state-of-the-art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched the performance of other assemblers optimizing resource efficiency, namely Minia and LightAssembler. However, on the metagenomes tested, Faucet’s outputs had 14-110% higher mean NGA50 lengths compared to Minia, and 2-11-fold higher mean NGA50 lengths compared to LightAssembler, the only other streaming assembler available. Availability: Faucet is available at https://github.com/Shamir-Lab/. Supplementary information: Supplementary data are available at Bioinformatics online.

Clover: a clustering-oriented de novo assembler for Illumina sequences

BMC Bioinformatics ◽

10.1186/s12859-020-03788-9 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Ming-Feng Hsieh ◽

Chin Lung Lu ◽

Chuan Yi Tang

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Low Cost ◽

De Bruijn Graph ◽

Illumina Platform ◽

Sequencing Errors ◽

Sequencing Technologies ◽

String Graph ◽

Clustering Approach ◽

De Bruijn

Background: Next-generation sequencing technologies revolutionized genomics by producing high-throughput reads at low cost, and this progress has prompted the recent development of de novo assemblers. Multiple assembly methods based on the de Bruijn graph have been shown to be efficient for Illumina reads. However, the sequencing errors generated by the sequencer complicate de novo assembly and affect the quality of downstream genomic research. Results: In this paper, we develop a de Bruijn assembler, called Clover (clustering-oriented de novo assembler), that utilizes a novel k-mer clustering approach from the overlap-layout-consensus concept to deal with the sequencing errors generated by the Illumina platform. We further evaluate Clover’s performance against several de Bruijn graph assemblers (ABySS, SOAPdenovo, SPAdes and Velvet), overlap-layout-consensus assemblers (Bambus2, CABOG and MSR-CA) and a string graph assembler (SGA) on three datasets (Staphylococcus aureus, Rhodobacter sphaeroides and human chromosome 14). The results show that Clover achieves superior assembly quality in terms of corrected N50 and E-size while remaining competitive in run time with every assembler except SOAPdenovo. In addition, Clover was involved in the sequencing projects of the bacterial genomes Acinetobacter baumannii TYTH-1 and Morganella morganii KT. Conclusions: The novel clustering-based approach of Clover, which integrates the flexibility of the overlap-layout-consensus approach and the efficiency of the de Bruijn graph method, has high potential for de novo assembly. Clover is freely available as open source software from https://oz.nthu.edu.tw/~d9562563/src.html.
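
For readers unfamiliar with the contiguity metrics named above, the sketch below computes N50 and E-size from a list of contig lengths, using their commonly cited definitions: N50 is the length of the contig at which the cumulative length of the longest contigs first reaches half of the total, and E-size is the sum of squared contig lengths divided by the genome length. The "corrected" variants are generally computed the same way after splitting contigs at detected misassemblies, which is not shown here. The function names and toy lengths are illustrative assumptions, not taken from Clover.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be a non-empty list of positive integers")


def e_size(lengths, genome_length):
    """Expected length of the contig covering a randomly chosen genome position."""
    return sum(length * length for length in lengths) / genome_length


contigs = [80, 70, 50, 40, 30, 20]
contig_n50 = n50(contigs)
contig_e_size = e_size(contigs, genome_length=sum(contigs))
```

Both metrics reward long contigs, so they are usually reported alongside error counts; an assembler can inflate either one by joining sequences incorrectly.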

Inference of viral quasispecies with a paired de Bruijn graph

Bioinformatics ◽

10.1093/bioinformatics/btaa782 ◽

2020 ◽

Author(s):

Borja Freire ◽

Susana Ladra ◽

Jose R Paramá ◽

Leena Salmela

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

De Bruijn Graph ◽

Viral Quasispecies ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Sequencing Errors ◽

High Throughput Sequencing Data ◽

De Bruijn

Motivation: RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are based either on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. Results: We present viaDBG, a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to that of PEHaplo, but viaDBG can also retrieve low-abundance quasispecies, which are often missed by PEHaplo. Availability and implementation: viaDBG is implemented in C++ and is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. Supplementary information: Supplementary data are available at Bioinformatics online.

MetaVelvet-DL: a MetaVelvet deep learning extension for de novo metagenome assembly

BMC Bioinformatics ◽

10.1186/s12859-020-03737-6 ◽

2021 ◽

Vol 22 (S6) ◽

Author(s):

Kuo-ching Liang ◽

Yasubumi Sakakibara

Keyword(s):

Deep Learning ◽

Short Term Memory ◽

De Novo ◽

Single Species ◽

De Bruijn Graph ◽

Support Vector ◽

Sequence Information ◽

Metagenomic Sample ◽

De Bruijn ◽

Metagenome Sequencing

Background: The increasing use of whole metagenome sequencing has spurred the need to improve de novo assemblers to facilitate the discovery of unknown species and the analysis of their genomic functions. MetaVelvet-SL is a short-read de novo metagenome assembler that partitions a multi-species de Bruijn graph into single-species sub-graphs. This study aimed to improve the performance of MetaVelvet-SL by using a deep learning-based model to predict the partition nodes in a multi-species de Bruijn graph. Results: This study showed that the recent advances in deep learning offer the opportunity to better exploit sequence information and differentiate genomes of different species in a metagenomic sample. We developed an extension to MetaVelvet-SL, which we named MetaVelvet-DL, that builds an end-to-end architecture using Convolutional Neural Network and Long Short-Term Memory units. The deep learning model in MetaVelvet-DL can more accurately predict how to partition a de Bruijn graph than the Support Vector Machine-based model in MetaVelvet-SL can. Assembly of the Critical Assessment of Metagenome Interpretation (CAMI) dataset showed that after removing chimeric assemblies, MetaVelvet-DL produced longer single-species contigs, with fewer misassembled contigs than MetaVelvet-SL did. Conclusions: MetaVelvet-DL provides more accurate de novo assemblies of whole metagenome data. The authors believe that this improvement can help in furthering the understanding of microbiomes by providing a more accurate description of the metagenomic samples under analysis.

BubbleGun: Enumerating Bubbles and Superbubbles in Genome Graphs

10.1101/2021.03.23.436631 ◽

2021 ◽

Author(s):

Fawaz Dabbaghie ◽

Jana Ebler ◽

Tobias Marschall

Keyword(s):

De Novo ◽

General Purpose ◽

Supplementary Information ◽

De Bruijn Graph ◽

De Bruijn Graphs ◽

Third Generation Sequencing ◽

Human Sample ◽

Fast Development ◽

De Bruijn ◽

Generation Sequencing

Motivation: With the fast development of third generation sequencing machines, de novo genome assembly is becoming routine even for larger genomes. Graph-based representations of genomes arise both as part of the assembly process and in the context of pangenomes representing a population. In both cases, polymorphic loci lead to bubble structures in such graphs. Detecting bubbles is hence an important task when working with genomic variants in the context of genome graphs. Results: Here, we present a fast general-purpose tool, called BubbleGun, for detecting bubbles and superbubbles in genome graphs. Furthermore, BubbleGun detects and outputs runs of linearly connected bubbles and superbubbles, which we call bubble chains. We showcase its utility on de Bruijn graphs and compare our results to vg’s snarl detection. We show that BubbleGun is considerably faster than vg, especially in bigger graphs, where it reports all bubbles in less than 30 minutes on a human sample de Bruijn graph of around 2 million nodes. Availability: BubbleGun is available and documented at https://github.com/fawaz-dabbaghieh/bubble_gun under the MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.
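
As a small illustration of what a bubble looks like in a directed graph, the sketch below finds only the simplest case: a source node with exactly two successors, each of which has the source as its only predecessor and the same single successor (the sink). It is an assumption-laden toy, not BubbleGun's algorithm; superbubbles and bubble chains need a more general search, and the function name and graph encoding are hypothetical.

```python
def find_simple_bubbles(successors):
    """Return (source, (branch_a, branch_b), sink) for each simple bubble.

    successors: dict mapping each node to the set of nodes it points to.
    """
    # Invert the adjacency map so each node's predecessors can be checked.
    predecessors = {}
    for node, outs in successors.items():
        for out in outs:
            predecessors.setdefault(out, set()).add(node)

    bubbles = []
    for source, outs in successors.items():
        if len(outs) != 2:
            continue
        branch_a, branch_b = sorted(outs)
        # Both branches must be entered only from the source.
        if predecessors.get(branch_a) != {source}:
            continue
        if predecessors.get(branch_b) != {source}:
            continue
        outs_a = successors.get(branch_a, set())
        outs_b = successors.get(branch_b, set())
        # Both branches must leave through the same single sink.
        if len(outs_a) == 1 and outs_a == outs_b:
            bubbles.append((source, (branch_a, branch_b), next(iter(outs_a))))
    return bubbles


toy_graph = {"s": {"a", "b"}, "a": {"t"}, "b": {"t"}, "t": {"u"}, "u": set()}
bubbles = find_simple_bubbles(toy_graph)
```

In a de Bruijn graph, the two branches of such a bubble typically correspond to two alleles or an error at a single locus.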

X-ray and UV radiation-damage-induced phasing using synchrotron serial crystallography

Acta Crystallographica Section D Structural Biology ◽

10.1107/s2059798318001535 ◽

2018 ◽

Vol 74 (4) ◽

pp. 366-378 ◽

Cited By ~ 5

Author(s):

Nicolas Foos ◽

Carolin Seuring ◽

Robin Schubert ◽

Anja Burkhardt ◽

Olof Svensson ◽

...

Keyword(s):

Success Rate ◽

Radiation Damage ◽

Uv Radiation ◽

De Novo ◽

Data Sets ◽

Individual Data ◽

High Quality ◽

X Ray ◽

Serial Crystallography

Specific radiation damage can be used to determine phases de novo from macromolecular crystals. This method is known as radiation-damage-induced phasing (RIP). One limitation of the method is that the dose of individual data sets must be minimized, which in turn leads to data sets with low multiplicity. A solution to this problem is to use data from multiple crystals. However, the resulting signal can be degraded by a lack of isomorphism between crystals. Here, it is shown that serial synchrotron crystallography in combination with selective merging of data sets can be used to determine high-quality phases for insulin and thaumatin, and that the increased multiplicity can greatly enhance the success rate of the experiment.
