Bidirectional Variable-Order de Bruijn Graphs

Compressed suffix trees and bidirectional FM-indexes can store a set of strings and support queries that let us explore the set of substrings they contain, adding and deleting characters on both the left and right, but they can use much more space than a de Bruijn graph for the strings. Bowe et al.’s BWT-based de Bruijn graph representation (Proc. Workshop on Algorithms for Bioinformatics, pp. 225–235, 2012) can be made bidirectional as well, at the cost of increasing its space usage by a small constant, but it fixes the length of the substrings. Boucher et al. (Proc. Data Compression Conference, pp. 383–392, 2015) generalized Bowe et al.’s representation to support queries about variable-length substrings, but at the cost of bidirectionality. In this paper we show how to make Boucher et al.’s variable-order implementation of de Bruijn graphs bidirectional.

BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs

PeerJ ◽

10.7717/peerj.5611 ◽

2018 ◽

Vol 6 ◽

pp. e5611 ◽

Cited By ~ 1

Author(s):

Rongjie Wang ◽

Junyi Li ◽

Yang Bai ◽

Tianyi Zang ◽

Yadong Wang

Keyword(s):

Data Compression ◽

Genome Sequencing ◽

De Bruijn Graph ◽

Storage Space ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Efficient Data ◽

De Bruijn ◽

Next Generation Sequencing NGS ◽

NGS Data

Dramatic increases in the data produced by next-generation sequencing (NGS) technologies demand data compression tools that save storage space. However, effective and efficient compression of genome sequencing data has remained an unresolved challenge in NGS data studies. In this paper, we propose a novel alignment-free and reference-free compression method, BdBG, which is the first to compress genome sequencing data with dynamic de Bruijn graphs built from the data after bucketing. Compared with existing de Bruijn graph methods, BdBG stores only a list of bucket indexes and bifurcations for the raw read sequences, which effectively reduces storage space. Experimental results on several genome sequencing datasets show the effectiveness of BdBG over three state-of-the-art methods. BdBG is written in Python and is open-source software distributed under the MIT license, available for download at https://github.com/rongjiewang/BdBG.

A Pseudo de Bruijn Graph Representation for Discretization Orders for Distance Geometry

Bioinformatics and Biomedical Engineering - Lecture Notes in Computer Science ◽

10.1007/978-3-319-16483-0_50 ◽

2015 ◽

pp. 514-523 ◽

Cited By ~ 6

Author(s):

Antonio Mucherino

Keyword(s):

Distance Geometry ◽

Graph Representation ◽

De Bruijn Graph ◽

De Bruijn ◽

Discretization Orders

Scalable Genome Assembly through Parallel de Bruijn Graph Construction for Multiple k-mers

Scientific Reports ◽

10.1038/s41598-019-51284-9 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Kanak Mahadik ◽

Christopher Wright ◽

Milind Kulkarni ◽

Saurabh Bagchi ◽

Somali Chaterji

Keyword(s):

De Novo ◽

De Bruijn Graph ◽

High Quality ◽

De Bruijn Graphs ◽

Sequencing Technologies ◽

De Bruijn ◽

Similar Accuracy ◽

Valued Graph ◽

Assembly Algorithms ◽

Level Parallelism

Remarkable advancements in high-throughput gene sequencing technologies have led to exponential growth in the number of sequenced genomes. However, the unavailability of highly parallel and scalable de novo assembly algorithms has hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBG). However, sequentially iterating from small to large k-values slows down assembly. In this paper, we propose ScalaDBG, which transforms this sequential process by building DBGs for each distinct k-value in parallel. We develop an innovative mechanism to “patch” a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism, both scaling up on all cores of a node and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG completes assembling the genome faster than IDBA-UD, with similar accuracy, on a variety of datasets (6.8X faster for one of the most complex genomes in our dataset).
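The idea of building one de Bruijn graph per k-value concurrently can be illustrated with a minimal Python sketch. This is not ScalaDBG's implementation: the function names `build_dbg` and `build_all` are hypothetical, the graph is a plain dictionary from (k-1)-mer nodes to successor sets, the contig "patching" step is omitted, and a thread pool is used only to show the structure (CPU-bound Python code would need processes or native code for real speedups).

```python
from concurrent.futures import ThreadPoolExecutor


def build_dbg(reads, k):
    """Map each (k-1)-mer node to the set of successors implied by the reads' k-mers."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # A k-mer is an arc from its (k-1)-prefix to its (k-1)-suffix.
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return graph


def build_all(reads, k_values):
    """Build one graph per k-value; the builds are independent, so they can run concurrently."""
    with ThreadPoolExecutor() as pool:
        graphs = pool.map(lambda k: build_dbg(reads, k), k_values)
        return dict(zip(k_values, graphs))


graphs = build_all(["ACGTAC", "GTACGA"], [3, 4])
for k, graph in graphs.items():
    print(k, {node: sorted(succ) for node, succ in sorted(graph.items())})
```

Because each build reads the same input and writes only its own graph, no synchronization is needed until the graphs are combined.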

Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields

BMC Bioinformatics ◽

10.1186/s12859-020-03740-x ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Aranka Steyaert ◽

Pieter Audenaert ◽

Jan Fostier

Keyword(s):

Genomic Sequence ◽

Conditional Random Field ◽

Accurate Determination ◽

Next Generation Sequencing Data ◽

De Bruijn Graph ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Sequencing Errors ◽

Expectation Maximisation ◽

De Bruijn

Background: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. k+1-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and the presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data.

Results: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner.

Conclusions: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
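To make the definition of multiplicity concrete, the following Python sketch counts how often each k-mer (node) and (k+1)-mer (arc) occurs in a known sequence. It only illustrates the quantity being inferred, not the CRF model: the function name `multiplicities` is hypothetical, the toy sequence is invented, and only a single strand is considered (no reverse complements).

```python
from collections import Counter


def multiplicities(genome, k):
    """Return (node_mult, arc_mult): occurrence counts of every k-mer and (k+1)-mer."""
    # Node multiplicity: number of times each k-mer occurs in the sequence.
    nodes = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    # Arc multiplicity: number of times each (k+1)-mer occurs in the sequence.
    arcs = Counter(genome[i:i + k + 1] for i in range(len(genome) - k))
    return nodes, arcs


nodes, arcs = multiplicities("ACGTACGA", 3)
print(nodes)  # the repeated 3-mer "ACG" gets multiplicity 2
print(arcs)
```

In practice the genome is unknown, so these counts must be estimated from read coverage, which is the ambiguity the abstract describes.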

Variable-Order de Bruijn Graphs

2015 Data Compression Conference ◽

10.1109/dcc.2015.70 ◽

2015 ◽

Cited By ~ 22

Author(s):

Christina Boucher ◽

Alex Bowe ◽

Travis Gagie ◽

Simon J. Puglisi ◽

Kunihiko Sadakane

Keyword(s):

Variable Order ◽

De Bruijn Graphs ◽

De Bruijn

Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Algorithms for Molecular Biology ◽

10.1186/1748-7188-8-22 ◽

2013 ◽

Vol 8 (1) ◽

Cited By ~ 151

Author(s):

Rayan Chikhi ◽

Guillaume Rizk

Keyword(s):

Bloom Filter ◽

Graph Representation ◽

De Bruijn Graph ◽

De Bruijn

The Maximum Independent Sets of de Bruijn Graphs of Diameter 3

The Electronic Journal of Combinatorics ◽

10.37236/681 ◽

2011 ◽

Vol 18 (1) ◽

Author(s):

Dustin A. Cartwright ◽

María Angélica Cueto ◽

Enrique A. Tobis

Keyword(s):

Recurrence Relation ◽

Generating Function ◽

Independent Sets ◽

De Bruijn Graph ◽

Alphabet Size ◽

De Bruijn Graphs ◽

Exponential Generating Function ◽

De Bruijn

The nodes of the de Bruijn graph $B(d,3)$ consist of all strings of length $3$, taken from an alphabet of size $d$, with edges between words which are distinct substrings of a word of length $4$. We give an inductive characterization of the maximum independent sets of the de Bruijn graphs $B(d,3)$ and for the de Bruijn graph of diameter three with loops removed, for arbitrary alphabet size. We derive a recurrence relation and an exponential generating function for their number. This recurrence allows us to construct exponentially many comma-free codes of length 3 with maximal cardinality.
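The definition above can be sketched in Python for small alphabets. This is a brute-force illustration only, not the paper's inductive characterization: the function names `de_bruijn_3` and `max_independent_set` are hypothetical, the alphabet is assumed to be the digits 0..d-1, and loops are dropped (words such as 0000 give a node paired with itself), which corresponds to the loop-free variant mentioned in the abstract.

```python
from itertools import combinations, product


def de_bruijn_3(d):
    """Nodes: all length-3 words over {0..d-1}. Edges: distinct 3-letter substrings of a 4-letter word."""
    nodes = ["".join(map(str, w)) for w in product(range(d), repeat=3)]
    edges = set()
    for w in product(range(d), repeat=4):
        word = "".join(map(str, w))
        u, v = word[:3], word[1:]
        if u != v:  # skip loops
            edges.add(frozenset((u, v)))
    return nodes, edges


def max_independent_set(nodes, edges):
    """Exhaustive search, largest subsets first; feasible only for very small graphs."""
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if not any(frozenset(pair) in edges for pair in combinations(subset, 2)):
                return set(subset)
    return set()


nodes, edges = de_bruijn_3(2)
best = max_independent_set(nodes, edges)
print(len(nodes), len(edges), sorted(best))
```

The exhaustive search grows exponentially with the number of nodes, so it is practical only for the smallest alphabets; the recurrence in the paper is what handles arbitrary alphabet size.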

Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ

10.1101/548123 ◽

2019 ◽

Cited By ~ 5

Author(s):

Ilia Minkin ◽

Paul Medvedev

Keyword(s):

Single Machine ◽

De Bruijn Graph ◽

Genome Alignment ◽

Whole Genome ◽

Reconstruction Algorithms ◽

De Bruijn Graphs ◽

Significant Step ◽

De Bruijn ◽

Whole Genome Alignment ◽

Computational Resources

Multiple whole-genome alignment is a challenging problem in bioinformatics. Despite many successes, current methods are not able to keep up with the growing number, length, and complexity of assembled genomes, especially when computational resources are limited. Approaches based on compacted de Bruijn graphs to identify and extend anchors into locally collinear blocks have potential for scalability, but current methods do not scale to mammalian genomes. We present an algorithm, SibeliaZ-LCB, for identifying collinear blocks in closely related genomes based on analysis of the de Bruijn graph. We further incorporate this into a multiple whole-genome alignment pipeline called SibeliaZ. SibeliaZ shows run-time improvements over other methods while maintaining accuracy. On sixteen recently assembled strains of mice, SibeliaZ runs in under 16 hours on a single machine, while other tools did not run to completion for eight mice within a week. SibeliaZ makes a significant step towards improving scalability of multiple whole-genome alignment and collinear block reconstruction algorithms on a single machine.

Buffering Updates Enables Efficient Dynamic de Bruijn Graphs

10.1101/2021.03.16.435535 ◽

2021 ◽

Cited By ~ 1

Author(s):

Jarno Alanko ◽

Bahar Alipanahi ◽

Jonathen Settle ◽

Christina Boucher ◽

Travis Gagie

Keyword(s):

Graph Model ◽

Biological Data ◽

Theory And Practice ◽

De Bruijn Graph ◽

Efficient Manner ◽

De Bruijn Graphs ◽

Trade Offs ◽

Order Of Magnitude ◽

Efficient Data ◽

De Bruijn

Motivation: The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduction in the late 1990s. It has been used for a variety of purposes including genome assembly (Zerbino and Birney, 2008; Bankevich et al., 2012b; Peng et al., 2012), variant detection (Alipanahi et al., 2020b; Iqbal et al., 2012), and storage of assembled genomes (Chikhi et al., 2016). For this reason, there have been over a dozen methods for building and representing the de Bruijn graph and its variants in a space- and time-efficient manner.

Results: Although there exists a plethora of space-efficient data structures for storing the de Bruijn graph, the majority of them make a compression-mutability trade-off. In particular, with the exception of a few methods (Muggli et al., 2019; Holley and Melsted, 2020; Crawford et al., 2018), compressed and compact de Bruijn graphs do not allow the graph to be efficiently updated, that is, for data to be added or deleted. The most recent compressed dynamic de Bruijn graph (Alipanahi et al., 2020a) relies on dynamic bit vectors, which are slow in theory and practice. To address this shortcoming, we present BufBOSS, a compressed dynamic de Bruijn graph that removes the necessity of dynamic bit vectors by buffering data that should be added to or removed from the graph. We implement our method and compare its performance to Bifrost, DynamicBOSS, and FDBG. Our experiments demonstrate that BufBOSS achieves attractive trade-offs compared to other tools in terms of time, memory and disk, and has the best deletion performance by an order of magnitude out of all the tools that are able to perform deletions. Our implementation is available at https://github.com/jnalanko/[email protected]

Hybrid correction of highly noisy long reads using a variable-order de Bruijn graph

Bioinformatics ◽

10.1093/bioinformatics/bty521 ◽

2018 ◽

Vol 34 (24) ◽

pp. 4213-4222 ◽

Cited By ~ 14

Author(s):

Pierre Morisse ◽

Thierry Lecroq ◽

Arnaud Lefebvre

Keyword(s):

De Bruijn Graph ◽

Variable Order ◽

Long Reads ◽

De Bruijn
