TraRECo: A Greedy Approach based de novo Transcriptome Assembler with Read Error Correction using Consensus Matrix

Abstract Background: Challenges in developing a good de novo transcriptome assembler include how to deal with read errors and sequence repeats. Almost all de novo assemblers use a de Bruijn graph, whose complexity grows linearly with data size but which suffers from errors and repeats. Although one can correct errors by inspecting the topological structure of the graph, this is not an easy task when there are too many branches. There are two research directions: improving either graph reliability or path search precision. We focused on improving the reliability. Results: We present TraRECo, a greedy approach to de novo assembly employing error-aware graph construction. The idea is similar to the overlap-layout-consensus approach used for genome assembly, but differs in that consensus is made throughout the entire graph construction step. Basically, we built contigs by direct read alignment within a distance margin and performed junction search to construct splicing graphs. While doing so, however, a contig of length l was represented by a 4×l matrix (called the consensus matrix), each element of which was the base count of the reads aligned so far. A representative sequence is obtained by taking the majority in each column of the consensus matrix and is used for further read alignment. Once splicing graphs were obtained, we used IsoLasso to find paths with noticeable read depth. The experiments using real and simulated reads showed that the method provides considerable improvements in sensitivity and reasonably better performance when both sensitivity and precision are compared. This was achieved by allowing more erroneous reads to participate in graph construction, which, in turn, improved the quality of the depth information used in the subsequent path search step. The results for simulated reads also showed that challenges remain, since a non-negligible percentage of transcripts with high abundance were not recovered by the assemblers we considered. Conclusion: De novo assembly is mainly used to explore not-yet-discovered isoforms and must be able to represent as many reads as possible in an efficient way. In this sense, TraRECo provides a potential alternative for improving graph reliability, even though the computational burden can be much higher than that of the single k-mer de Bruijn graph approach.
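
The consensus matrix described in the abstract above can be illustrated with a minimal sketch. This is not the TraRECo implementation: the function names (`new_consensus`, `add_read`, `representative`) are hypothetical, reads are assumed to be already aligned without gaps at a known offset, and uncovered columns are reported as `N`.

```python
BASES = "ACGT"

def new_consensus(length):
    """Create a 4 x l matrix of base counts (rows follow BASES order)."""
    return [[0] * length for _ in BASES]

def add_read(matrix, read, offset):
    """Add one read, assumed to align without gaps starting at `offset`."""
    for i, base in enumerate(read):
        matrix[BASES.index(base)][offset + i] += 1

def representative(matrix):
    """Take the majority base in each column; ties go to the first base in BASES."""
    out = []
    for col in zip(*matrix):
        best = max(range(4), key=lambda r: col[r])
        out.append(BASES[best] if col[best] > 0 else "N")
    return "".join(out)

# Three reads agree, one carries a substitution (G -> C) at position 2.
m = new_consensus(5)
for read, offset in [("ACGT", 0), ("ACCT", 0), ("ACGT", 0), ("GTA", 2)]:
    add_read(m, read, offset)
print(representative(m))  # ACGTA
```

Because the erroneous read still contributes counts to the other columns, it adds to the depth information instead of being discarded, which is the effect the abstract attributes to the consensus matrix.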

Scalable Genome Assembly through Parallel de Bruijn Graph Construction for Multiple k-mers

Scientific Reports ◽

10.1038/s41598-019-51284-9 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Kanak Mahadik ◽

Christopher Wright ◽

Milind Kulkarni ◽

Saurabh Bagchi ◽

Somali Chaterji

Keyword(s):

De Novo ◽

De Bruijn Graph ◽

High Quality ◽

De Bruijn Graphs ◽

Sequencing Technologies ◽

De Bruijn ◽

Similar Accuracy ◽

Valued Graph ◽

Assembly Algorithms ◽

Level Parallelism

Abstract Remarkable advancements in high-throughput gene sequencing technologies have led to an exponential growth in the number of sequenced genomes. However, the unavailability of highly parallel and scalable de novo assembly algorithms has hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBG). However, sequentially iterating from small to large k-values slows down the assembly. In this paper, we propose ScalaDBG, which transforms this sequential process by building DBGs for each distinct k-value in parallel. We develop an innovative mechanism to "patch" a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism, both scaling up on all cores of a node and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG completes assembling the genome faster than IDBA-UD, with similar accuracy, on a variety of datasets (6.8X faster for one of the most complex genomes in our dataset).
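
As a rough illustration of building one de Bruijn graph per k-value concurrently, the sketch below uses a thread pool. It is not ScalaDBG: the names `build_dbg` and `build_all` are hypothetical, the "patch" step between graphs is not shown, and Python threads only illustrate the structure of the computation (a real implementation would use processes or multiple nodes).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_dbg(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

def build_all(reads, k_values):
    """Build the graph for every k independently instead of one after another."""
    with ThreadPoolExecutor() as pool:
        graphs = pool.map(lambda k: build_dbg(reads, k), k_values)
        return dict(zip(k_values, graphs))

graphs = build_all(["ACGTAC"], [3, 4])
print(sorted(graphs[3]))  # ['AC', 'CG', 'GT', 'TA']
```

The graphs are independent of each other at construction time, which is what makes the per-k builds parallelizable in the first place.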

A de novo genome assembler based on MapReduce and bi-directed de Bruijn graph

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm.2016.7822494 ◽

2016 ◽

Author(s):

Yuehua Zhang ◽

Pengfei Xuan ◽

Yunsheng Wang ◽

Pradip K. Srimani ◽

Feng Luo

Keyword(s):

De Novo ◽

De Bruijn Graph ◽

De Bruijn ◽

Genome Assembler

Faucet: streaming de novo assembly graph construction

10.1101/125658 ◽

2017 ◽

Author(s):

Roye Rozov ◽

Gil Goldshlager ◽

Eran Halperin ◽

Ron Shamir

Keyword(s):

Resource Use ◽

De Novo ◽

State Of The Art ◽

Supplementary Information ◽

De Bruijn Graph ◽

Assembly Quality ◽

Metagenome Assembly ◽

Streaming Algorithm ◽

Supplementary Material ◽

De Bruijn

Abstract Motivation: We present Faucet, a two-pass streaming algorithm for assembly graph construction. Faucet builds an assembly graph incrementally as each read is processed. Thus, reads need not be stored locally, as they can be processed while downloading data and then discarded. We demonstrate this functionality by performing streaming graph assembly of publicly available data, and observe that the ratio of disk use to raw data size decreases as coverage is increased. Results: Faucet pairs the de Bruijn graph obtained from the reads with additional metadata derived from them. We show that these metadata - coverage counts collected at junction k-mers and connections bridging between junction pairs - contain the most salient information needed for assembly, and demonstrate that they enable cleaning of metagenome assembly graphs, greatly improving contiguity while maintaining accuracy. We compared Faucet's resource use and assembly quality to state-of-the-art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched the performance of other assemblers optimizing resource efficiency - namely, Minia and LightAssembler. However, on the metagenomes tested, Faucet's outputs had 14-110% higher mean NGA50 lengths compared to Minia, and 2-11-fold higher mean NGA50 lengths compared to LightAssembler, the only other streaming assembler available. Availability: Faucet is available at https://github.com/Shamir-Lab/[email protected],[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
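
The idea of collecting coverage counts only at junction k-mers can be sketched in two passes over the reads. This is a simplified illustration, not Faucet's algorithm: it assumes a junction is a k-mer with more than one observed successor or predecessor, it keeps exact sets in memory instead of any compact streaming structure, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def kmers(read, k):
    """Yield the k-mers of a read in order."""
    return (read[i:i + k] for i in range(len(read) - k + 1))

def find_junctions(reads, k):
    """Pass 1: k-mers with more than one distinct successor or predecessor."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        ks = list(kmers(read, k))
        for a, b in zip(ks, ks[1:]):
            succ[a].add(b)
            pred[b].add(a)
    nodes = set(succ) | set(pred)
    return {x for x in nodes
            if len(succ.get(x, ())) > 1 or len(pred.get(x, ())) > 1}

def junction_coverage(reads, k, junctions):
    """Pass 2: count occurrences, but only for junction k-mers."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in junctions:
                counts[km] += 1
    return counts

reads = ["ACGTT", "ACGTA", "ACGTA"]
junctions = find_junctions(reads, 3)
print(junctions)                               # {'CGT'}
print(junction_coverage(reads, 3, junctions))  # Counter({'CGT': 3})
```

Restricting the counters to junctions keeps the stored metadata small relative to the number of k-mers, since non-branching stretches of the graph carry no per-k-mer counts in this sketch.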

Clover: a clustering-oriented de novo assembler for Illumina sequences

BMC Bioinformatics ◽

10.1186/s12859-020-03788-9 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Ming-Feng Hsieh ◽

Chin Lung Lu ◽

Chuan Yi Tang

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Low Cost ◽

De Bruijn Graph ◽

Illumina Platform ◽

Sequencing Errors ◽

Sequencing Technologies ◽

String Graph ◽

Clustering Approach ◽

De Bruijn

Abstract Background: Next-generation sequencing technologies revolutionized genomics by producing high-throughput reads at low cost, and this progress has prompted the recent development of de novo assemblers. Multiple assembly methods based on the de Bruijn graph have been shown to be efficient for Illumina reads. However, the sequencing errors generated by the sequencer complicate de novo assembly and affect the quality of downstream genomic research. Results: In this paper, we develop a de Bruijn assembler, called Clover (clustering-oriented de novo assembler), that utilizes a novel k-mer clustering approach from the overlap-layout-consensus concept to deal with the sequencing errors generated by the Illumina platform. We further evaluate Clover's performance against several de Bruijn graph assemblers (ABySS, SOAPdenovo, SPAdes and Velvet), overlap-layout-consensus assemblers (Bambus2, CABOG and MSR-CA) and a string graph assembler (SGA) on three datasets (Staphylococcus aureus, Rhodobacter sphaeroides and human chromosome 14). The results show that Clover achieves superior assembly quality in terms of corrected N50 and E-size while remaining competitive in run time with all assemblers except SOAPdenovo. In addition, Clover was involved in the sequencing projects of the bacterial genomes Acinetobacter baumannii TYTH-1 and Morganella morganii KT. Conclusions: The novel clustering-based approach of Clover, which integrates the flexibility of the overlap-layout-consensus approach and the efficiency of the de Bruijn graph method, has high potential for de novo assembly. Clover is freely available as open source software from https://oz.nthu.edu.tw/~d9562563/src.html.

Inference of viral quasispecies with a paired de Bruijn graph

Bioinformatics ◽

10.1093/bioinformatics/btaa782 ◽

2020 ◽

Author(s):

Borja Freire ◽

Susana Ladra ◽

Jose R Paramá ◽

Leena Salmela

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

De Bruijn Graph ◽

Viral Quasispecies ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Sequencing Errors ◽

High Throughput Sequencing Data ◽

De Bruijn

Abstract Motivation: RNA viruses exhibit a high mutation rate and thus exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are based either on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. Results: We present viaDBG, a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to that of PEHaplo, but viaDBG is also able to retrieve low-abundance quasispecies, which are often missed by PEHaplo. Availability and implementation: viaDBG is implemented in C++ and is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. Supplementary information: Supplementary data are available at Bioinformatics online.

MetaVelvet-DL: a MetaVelvet deep learning extension for de novo metagenome assembly

BMC Bioinformatics ◽

10.1186/s12859-020-03737-6 ◽

2021 ◽

Vol 22 (S6) ◽

Author(s):

Kuo-ching Liang ◽

Yasubumi Sakakibara

Keyword(s):

Deep Learning ◽

Short Term Memory ◽

De Novo ◽

Single Species ◽

De Bruijn Graph ◽

Support Vector ◽

Sequence Information ◽

Metagenomic Sample ◽

De Bruijn ◽

Metagenome Sequencing

Abstract Background: The increasing use of whole metagenome sequencing has spurred the need to improve de novo assemblers to facilitate the discovery of unknown species and the analysis of their genomic functions. MetaVelvet-SL is a short-read de novo metagenome assembler that partitions a multi-species de Bruijn graph into single-species sub-graphs. This study aimed to improve the performance of MetaVelvet-SL by using a deep learning-based model to predict the partition nodes in a multi-species de Bruijn graph. Results: This study showed that recent advances in deep learning offer the opportunity to better exploit sequence information and differentiate genomes of different species in a metagenomic sample. We developed an extension to MetaVelvet-SL, which we named MetaVelvet-DL, that builds an end-to-end architecture using Convolutional Neural Network and Long Short-Term Memory units. The deep learning model in MetaVelvet-DL can predict how to partition a de Bruijn graph more accurately than the Support Vector Machine-based model in MetaVelvet-SL can. Assembly of the Critical Assessment of Metagenome Interpretation (CAMI) dataset showed that, after removing chimeric assemblies, MetaVelvet-DL produced longer single-species contigs, with fewer misassembled contigs, than MetaVelvet-SL did. Conclusions: MetaVelvet-DL provides more accurate de novo assemblies of whole metagenome data. The authors believe that this improvement can help in furthering the understanding of microbiomes by providing a more accurate description of the metagenomic samples under analysis.

Haploflow: Strain-resolved de novo assembly of viral genomes

10.1101/2021.01.25.428049 ◽

2021 ◽

Author(s):

A. Fritz ◽

A. Bremges ◽

Z.-L. Deng ◽

T.-R. Lesker ◽

J. Götting ◽

...

Keyword(s):

Viral Infections ◽

De Novo ◽

De Bruijn Graph ◽

Data Sets ◽

High Quality ◽

Viral Genomes ◽

Benchmark Data ◽

Flow Algorithm ◽

De Bruijn ◽

Host Evolution

In viral infections, multiple related viral strains are often present, due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assessed Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. Haploflow reconstructed high-quality strain-resolved assemblies from clinical HCMV samples and SARS-CoV-2 genomes from wastewater metagenomes identical to genomes from clinical isolates.

BubbleGun: Enumerating Bubbles and Superbubbles in Genome Graphs

10.1101/2021.03.23.436631 ◽

2021 ◽

Author(s):

Fawaz Dabbaghie ◽

Jana Ebler ◽

Tobias Marschall

Keyword(s):

De Novo ◽

General Purpose ◽

Supplementary Information ◽

De Bruijn Graph ◽

De Bruijn Graphs ◽

Third Generation Sequencing ◽

Human Sample ◽

Fast Development ◽

De Bruijn ◽

Generation Sequencing

Abstract Motivation: With the fast development of third generation sequencing machines, de novo genome assembly is becoming routine even for larger genomes. Graph-based representations of genomes arise both as part of the assembly process and in the context of pangenomes representing a population. In both cases, polymorphic loci lead to bubble structures in such graphs. Detecting bubbles is hence an important task when working with genomic variants in the context of genome graphs. Results: Here, we present a fast general-purpose tool, called BubbleGun, for detecting bubbles and superbubbles in genome graphs. Furthermore, BubbleGun detects and outputs runs of linearly connected bubbles and superbubbles, which we call bubble chains. We showcase its utility on de Bruijn graphs and compare our results to vg's snarl detection. We show that BubbleGun is considerably faster than vg, especially in bigger graphs, where it reports all bubbles in less than 30 minutes on a human sample de Bruijn graph of around 2 million nodes. Availability: BubbleGun is available and documented at https://github.com/fawaz-dabbaghieh/bubble_gun under MIT [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
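
To make the notion of a bubble concrete, the sketch below finds only the simplest case: a source node with exactly two branch nodes that each have a single incoming and a single outgoing edge and rejoin at the same sink. It does not handle superbubbles or bubble chains and is not BubbleGun's algorithm; the graph representation (a dict from node to a list of successors) and the function name are assumptions made for illustration.

```python
def simple_bubbles(graph):
    """Return (source, branch_a, branch_b, sink) tuples for simple bubbles.

    `graph` maps each node to a list of its successor nodes.
    """
    # In-degree of every node, counted from the successor lists.
    indeg = {}
    for node, succs in graph.items():
        indeg.setdefault(node, 0)
        for s in succs:
            indeg[s] = indeg.get(s, 0) + 1

    bubbles = []
    for source, succs in graph.items():
        branches = sorted(set(succs))
        if len(branches) != 2:
            continue
        a, b = branches
        out_a, out_b = graph.get(a, []), graph.get(b, [])
        if (indeg[a] == 1 and indeg[b] == 1
                and len(out_a) == 1 and len(out_b) == 1
                and out_a[0] == out_b[0]
                and out_a[0] != source):
            bubbles.append((source, a, b, out_a[0]))
    return bubbles

graph = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": ["u"], "u": []}
print(simple_bubbles(graph))  # [('s', 'a', 'b', 't')]
```

In a de Bruijn graph, the two branches of such a structure typically correspond to two alleles at a polymorphic locus, or to a sequencing error against the true sequence.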

Robust data storage in DNA by de Bruijn graph-based decoding

10.21203/rs.3.rs-382900/v1 ◽

2021 ◽

Author(s):

Lifu Song ◽

Feng Geng ◽

Ziyi Song ◽

Bing-Zhi Li ◽

Ying-Jin Yuan

Keyword(s):

Data Storage ◽

Large Scale ◽

Search Algorithm ◽

De Bruijn Graph ◽

Large Scale Data ◽

Dna Strands ◽

Pcr Products ◽

Path Search ◽

De Bruijn ◽

Linear Decoding

Abstract Data storage in DNA, which stores information in polymers, is a promising technology offering high density and long-term stability. However, the indels, strand rearrangements, and strand breaks that emerge during synthesis, amplification, sequencing, and storage of DNA molecules need to be handled. Here, we report a de Bruijn graph-based, greedy path search algorithm (DBG-GPS), which can handle all these issues through efficient reconstruction of the DNA strands. DBG-GPS achieves accurate data recovery with low-quality, deep error-prone PCR products and accelerated aged DNA samples (solution, 70℃ for two weeks). The robustness of DBG-GPS was verified with 100 repeated retrievals using PCR products with massive unspecific amplifications. Moreover, DBG-GPS shows linear decoding complexity and is more than 100 times faster than the multiple alignment-based methods, indicating that it is a suitable solution for large-scale data storage.
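
A greedy path search over a de Bruijn graph can be sketched as follows. This is a toy illustration and not the authors' DBG-GPS: it assumes the first k-mer of the strand and the strand length are known, k is at least 2, low-count k-mers are treated as errors via a hypothetical `min_count` threshold, and the walk simply follows the most frequent extension at each step.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer observed in the (possibly erroneous) reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def greedy_path(counts, start, length, min_count=2):
    """Extend `start` one base at a time along the best-supported k-mer."""
    k = len(start)
    strand = start
    while len(strand) < length:
        suffix = strand[-(k - 1):]
        candidates = [(counts[suffix + b], b) for b in "ACGT"
                      if counts[suffix + b] >= min_count]
        if not candidates:
            break  # no sufficiently supported extension
        # Highest count wins; equal counts are broken by base letter.
        strand += max(candidates)[1]
    return strand

# Three correct copies of the strand and one copy with a substitution.
reads = ["ACGTTGCA"] * 3 + ["ACGTAGCA"]
counts = count_kmers(reads, 4)
print(greedy_path(counts, "ACGT", 8))  # ACGTTGCA
```

Each extension step inspects at most four candidate k-mers, so the walk takes time proportional to the strand length, which is consistent with the linear decoding complexity mentioned in the abstract.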
