The Maximum Independent Sets of de Bruijn Graphs of Diameter 3

The nodes of the de Bruijn graph $B(d,3)$ consist of all strings of length $3$, taken from an alphabet of size $d$, with edges between words which are distinct substrings of a word of length $4$. We give an inductive characterization of the maximum independent sets of the de Bruijn graphs $B(d,3)$ and of the de Bruijn graph of diameter three with loops removed, for arbitrary alphabet size. We derive a recurrence relation and an exponential generating function for their number. This recurrence allows us to construct exponentially many comma-free codes of length 3 with maximal cardinality.
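
The definition above can be made concrete with a small brute-force sketch. It is not the paper's inductive characterization: it simply builds the graph from the stated adjacency rule and enumerates independent sets, which is only feasible for very small $d$. The function names are hypothetical, and excluding looped (constant) words from independent sets in the version with loops is an assumption about how the abstract is to be read.

```python
from itertools import combinations, product


def de_bruijn_edges(d, n=3):
    """Nodes, undirected non-loop edges, and looped nodes of B(d, n)."""
    words = list(product(range(d), repeat=n))
    edges, loops = set(), set()
    for u in words:
        for c in range(d):
            v = u[1:] + (c,)       # shift left and append one symbol
            if u == v:
                loops.add(u)       # constant words such as (0, 0, 0)
            else:
                edges.add(frozenset((u, v)))
    return words, edges, loops


def max_independent_sets(nodes, edges):
    """Brute force: all independent sets of maximum size (tiny graphs only)."""
    for size in range(len(nodes), 0, -1):
        found = [
            s for s in combinations(nodes, size)
            if not any(frozenset(p) in edges for p in combinations(s, 2))
        ]
        if found:
            return found
    return [()]


words, edges, loops = de_bruijn_edges(2)
# Loops removed: every word may appear in an independent set.
loop_free = max_independent_sets(words, edges)
# Loops kept (assumption): a looped word is adjacent to itself, so it is excluded.
with_loops = max_independent_sets([w for w in words if w not in loops], edges)
print(len(loop_free[0]), len(loop_free))
print(len(with_loops[0]), len(with_loops))
```

The enumeration is exponential in the number of nodes, so already $d = 3$ (27 nodes) is slow; the sketch is meant only to illustrate the objects being counted.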

On Hypercubes in de Bruijn Graphs

Parallel Processing Letters ◽

10.1142/s0129626498000274 ◽

1998 ◽

Vol 08 (02) ◽

pp. 259-268

Author(s):

Thomas Andreae ◽

Martin Hintz

Keyword(s):

Complete Solution ◽

De Bruijn Graph ◽

Alphabet Size ◽

De Bruijn Graphs ◽

De Bruijn

We prove that the hypercube of odd dimension $2k + 1$ is a subgraph of the de Bruijn graph of alphabet size $d$ and diameter 2 if and only if $d \geq 3 \cdot 2^{k-1}$. This complements previous results of Heydemann, Opatrny, and Sotteau (1994) and Andreae et al. (1995), thus yielding a complete solution of the problem of determining, for all integers $m, n \geq 2$, the least number $d = d(m, n)$ for which the hypercube of dimension $m$ is a subgraph of the de Bruijn graph of alphabet size $d$ and diameter $n$.
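
As a sanity check, reading the threshold as $3 \cdot 2^{k-1}$, one can compare it with plain node counting. The notation $Q_{2k+1}$ for the hypercube and $B(d,2)$ for the de Bruijn graph of diameter 2 is assumed here for illustration. Counting gives only a necessary condition, not the theorem itself.

```latex
\[
  |V(Q_{2k+1})| = 2^{2k+1}, \qquad |V(B(d,2))| = d^{2},
\]
\[
  d = 3 \cdot 2^{k-1}
  \;\Longrightarrow\;
  d^{2} = 9 \cdot 2^{2k-2} = \tfrac{9}{8} \cdot 2^{2k+1} \;\geq\; 2^{2k+1}.
\]
```

So at the stated threshold the de Bruijn graph always has enough nodes to host the hypercube; the content of the result is that no smaller $d$ works, which counting alone does not show.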

Scalable Genome Assembly through Parallel de Bruijn Graph Construction for Multiple k-mers

Scientific Reports ◽

10.1038/s41598-019-51284-9 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Kanak Mahadik ◽

Christopher Wright ◽

Milind Kulkarni ◽

Saurabh Bagchi ◽

Somali Chaterji

Keyword(s):

De Novo ◽

De Bruijn Graph ◽

High Quality ◽

De Bruijn Graphs ◽

Sequencing Technologies ◽

De Bruijn ◽

Similar Accuracy ◽

Valued Graph ◽

Assembly Algorithms ◽

Level Parallelism

Remarkable advancements in high-throughput gene sequencing technologies have led to an exponential growth in the number of sequenced genomes. However, the unavailability of highly parallel and scalable de novo assembly algorithms has hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBG). However, this process of sequentially iterating from small to large k-values slows down the process of assembly. In this paper, we propose ScalaDBG, which metamorphoses this sequential process, building DBGs for each distinct k-value in parallel. We develop an innovative mechanism to “patch” a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism, by both scaling up on all cores of a node, and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG completes assembling the genome faster than IDBA-UD, but with similar accuracy on a variety of datasets (6.8X faster for one of the most complex genomes in our dataset).
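
The idea of building one graph per k-value concurrently, rather than iterating over k sequentially, can be sketched as follows. This is a toy illustration, not ScalaDBG's implementation: the function names are hypothetical, the patching step is omitted, and Python threads only show the structure of the computation (they do not give real CPU parallelism for this workload).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_dbg(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def build_all(reads, k_values):
    """Build one graph per k concurrently instead of one k after another."""
    k_values = list(k_values)
    with ThreadPoolExecutor() as pool:
        graphs = pool.map(lambda k: build_dbg(reads, k), k_values)
        return dict(zip(k_values, graphs))


graphs = build_all(["ACGTAC"], [3, 4])
print(sorted(graphs))
```

Each graph is independent of the others at construction time, which is what makes the per-k builds easy to distribute.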

Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields

BMC Bioinformatics ◽

10.1186/s12859-020-03740-x ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Aranka Steyaert ◽

Pieter Audenaert ◽

Jan Fostier

Keyword(s):

Genomic Sequence ◽

Conditional Random Field ◽

Accurate Determination ◽

Next Generation Sequencing Data ◽

De Bruijn Graph ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Sequencing Errors ◽

Expectation Maximisation ◽

De Bruijn

Background: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. k+1-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
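
For contrast, the kind of per-node estimate the abstract describes as using "the information in nodes and arcs individually" might look like the sketch below. It is a deliberately naive baseline, not the CRF model and not code from the linked repository; the function name and the example coverage values are made up for illustration.

```python
def naive_multiplicities(coverage, mean_coverage):
    """Estimate each node's multiplicity on its own as round(coverage / mean)."""
    return {node: round(cov / mean_coverage) for node, cov in coverage.items()}


# Hypothetical per-node coverages with an assumed mean coverage of 30x.
estimates = naive_multiplicities({"ACG": 31.0, "CGT": 58.0, "GTA": 2.0}, 30.0)
print(estimates)
```

An estimate of 0 marks a k-mer as a likely sequencing error and 2 as a likely repeat. Because each node is judged in isolation, coverage noise near the rounding boundaries leads to inconsistent neighbouring estimates, which is the weakness a joint model is meant to address.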

Closed Forms for Derangement Numbers in Terms of the Hessenberg Determinants

10.20944/preprints201610.0035.v1 ◽

2016 ◽

Cited By ~ 2

Author(s):

Feng Qi ◽

Jiao-Lian Zhao ◽

Bai-Ni Guo

Keyword(s):

Recurrence Relation ◽

Generating Function ◽

Higher Order ◽

Order Derivative ◽

Exponential Generating Function ◽

Higher Order Derivative

In the paper, the authors find closed forms for derangement numbers in terms of the Hessenberg determinants, discover a recurrence relation of derangement numbers, present a formula for any higher order derivative of the exponential generating function of derangement numbers, and compute some related Hessenberg and tridiagonal determinants.
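
For orientation, the classical recurrence and exponential generating function for derangement numbers can be checked numerically. The sketch uses the textbook identities $D_n = (n-1)(D_{n-1} + D_{n-2})$ and $\sum_n D_n x^n / n! = e^{-x}/(1-x)$; these are not necessarily the recurrence or the closed forms obtained in the paper, and the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def derangements(n_max):
    """D_0..D_n_max via the classical recurrence D_n = (n - 1)(D_{n-1} + D_{n-2})."""
    d = [1, 0]
    for n in range(2, n_max + 1):
        d.append((n - 1) * (d[n - 1] + d[n - 2]))
    return d[: n_max + 1]


def egf_coefficients(n_max):
    """Coefficients of e^{-x} / (1 - x): partial sums of (-1)^k / k!."""
    total, out = Fraction(0), []
    for k in range(n_max + 1):
        total += Fraction((-1) ** k, factorial(k))
        out.append(total)
    return out


D = derangements(6)
coeffs = egf_coefficients(6)
# The n-th EGF coefficient times n! should reproduce D_n exactly.
print(D, all(c * factorial(n) == D[n] for n, c in enumerate(coeffs)))
```

Exact rational arithmetic via `Fraction` avoids floating-point error when comparing the two definitions.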

Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ

10.1101/548123 ◽

2019 ◽

Cited By ~ 5

Author(s):

Ilia Minkin ◽

Paul Medvedev

Keyword(s):

Single Machine ◽

De Bruijn Graph ◽

Genome Alignment ◽

Whole Genome ◽

Reconstruction Algorithms ◽

De Bruijn Graphs ◽

Significant Step ◽

De Bruijn ◽

Whole Genome Alignment ◽

Computational Resources

Multiple whole-genome alignment is a challenging problem in bioinformatics. Despite many successes, current methods are not able to keep up with the growing number, length, and complexity of assembled genomes, especially when computational resources are limited. Approaches based on compacted de Bruijn graphs to identify and extend anchors into locally collinear blocks have potential for scalability, but current methods do not scale to mammalian genomes. We present an algorithm, SibeliaZ-LCB, for identifying collinear blocks in closely related genomes based on analysis of the de Bruijn graph. We further incorporate this into a multiple whole-genome alignment pipeline called SibeliaZ. SibeliaZ shows run-time improvements over other methods while maintaining accuracy. On sixteen recently-assembled strains of mice, SibeliaZ runs in under 16 hours on a single machine, while other tools did not run to completion for eight mice within a week. SibeliaZ makes a significant step towards improving scalability of multiple whole-genome alignment and collinear block reconstruction algorithms on a single machine.

Buffering Updates Enables Efficient Dynamic de Bruijn Graphs

10.1101/2021.03.16.435535 ◽

2021 ◽

Cited By ~ 1

Author(s):

Jarno Alanko ◽

Bahar Alipanahi ◽

Jonathen Settle ◽

Christina Boucher ◽

Travis Gagie

Keyword(s):

Graph Model ◽

Biological Data ◽

Theory And Practice ◽

De Bruijn Graph ◽

Efficient Manner ◽

De Bruijn Graphs ◽

Trade Offs ◽

Order Of Magnitude ◽

Efficient Data ◽

De Bruijn

Motivation: The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduction in the late 1990s. It has been used for a variety of purposes including genome assembly (Zerbino and Birney, 2008; Bankevich et al., 2012b; Peng et al., 2012), variant detection (Alipanahi et al., 2020b; Iqbal et al., 2012), and storage of assembled genomes (Chikhi et al., 2016). For this reason, there have been over a dozen methods for building and representing the de Bruijn graph and its variants in a space- and time-efficient manner. Results: Although there exists a plethora of space-efficient data structures for storing the de Bruijn graph, the majority of them make a compression-mutability trade-off. In particular, with the exception of a few methods (Muggli et al., 2019; Holley and Melsted, 2020; Crawford et al., 2018), compressed and compact de Bruijn graphs do not allow for the graph to be efficiently updated, allowing for data to be added or deleted. The most recent compressed dynamic de Bruijn graph (Alipanahi et al., 2020a) relies on dynamic bit vectors, which are slow in theory and practice. To address this shortcoming, we present BufBOSS, a compressed dynamic de Bruijn graph that removes the necessity of dynamic bit vectors by buffering data that should be added or removed from the graph. We implement our method and compare its performance to Bifrost, DynamicBOSS, and FDBG. Our experiments demonstrate that BufBOSS achieves attractive trade-offs compared to other tools in terms of time, memory and disk, and has the best deletion performance by an order of magnitude out of all the tools that are able to perform deletions. Our implementation is available at https://github.com/jnalanko/[email protected]
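
The buffering idea can be illustrated with a toy k-mer set that keeps a static base structure plus pending additions and deletions, and rebuilds the base only when the buffers fill up. This is a minimal sketch of the general technique under stated assumptions, not the BufBOSS data structure: the class name, the `buffer_limit` parameter, and the use of a plain `frozenset` as a stand-in for a compressed static index are all hypothetical.

```python
class BufferedKmerSet:
    """Toy dynamic k-mer set: a static base plus pending add/delete buffers."""

    def __init__(self, kmers=(), buffer_limit=4):
        self.base = frozenset(kmers)      # stand-in for a static compressed structure
        self.added = set()                # k-mers waiting to be merged into the base
        self.deleted = set()              # base k-mers waiting to be removed
        self.buffer_limit = buffer_limit  # hypothetical flush threshold

    def add(self, kmer):
        self.deleted.discard(kmer)
        if kmer not in self.base:
            self.added.add(kmer)
        self._maybe_flush()

    def delete(self, kmer):
        self.added.discard(kmer)
        if kmer in self.base:
            self.deleted.add(kmer)
        self._maybe_flush()

    def __contains__(self, kmer):
        return kmer in self.added or (kmer in self.base and kmer not in self.deleted)

    def _maybe_flush(self):
        # Rebuild the static base in one batch once enough updates are pending.
        if len(self.added) + len(self.deleted) >= self.buffer_limit:
            self.base = frozenset((self.base | self.added) - self.deleted)
            self.added.clear()
            self.deleted.clear()


s = BufferedKmerSet(["ACG", "CGT"], buffer_limit=2)
s.add("GTA")      # buffered; the base is untouched
s.delete("ACG")   # second pending update triggers a batch rebuild
print(sorted(s.base))
```

Queries consult the buffers before the base, so updates are visible immediately while the expensive rebuild is amortised over many operations.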

cloudSPAdes: assembly of synthetic long reads using de Bruijn graphs

Bioinformatics ◽

10.1093/bioinformatics/btz349 ◽

2019 ◽

Vol 35 (14) ◽

pp. i61-i70 ◽

Cited By ~ 4

Author(s):

Ivan Tolstoganov ◽

Anton Bankevich ◽

Zhoutao Chen ◽

Pavel A Pevzner

Keyword(s):

Narrow Range ◽

State Of The Art ◽

Supplementary Information ◽

De Bruijn Graph ◽

Hybrid Assembly ◽

De Bruijn Graphs ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

New Applications

Motivation: The recently developed barcoding-based synthetic long read (SLR) technologies have already found many applications in genome assembly and analysis. However, although some new barcoding protocols are emerging and the range of SLR applications is being expanded, the existing SLR assemblers are optimized for a narrow range of parameters and are not easily extendable to new barcoding technologies and new applications such as metagenomics or hybrid assembly. Results: We describe the algorithmic challenge of the SLR assembly and present a cloudSPAdes algorithm for SLR assembly that is based on analyzing the de Bruijn graph of SLRs. We benchmarked cloudSPAdes across various barcoding technologies/applications and demonstrated that it improves on the state-of-the-art SLR assemblers in accuracy and speed. Availability and implementation: Source code and installation manual for cloudSPAdes are available at https://github.com/ablab/spades/releases/tag/cloudspades-paper. Supplementary information: Supplementary data are available at Bioinformatics online.

Inference of viral quasispecies with a paired de Bruijn graph

Bioinformatics ◽

10.1093/bioinformatics/btaa782 ◽

2020 ◽

Author(s):

Borja Freire ◽

Susana Ladra ◽

Jose R Paramá ◽

Leena Salmela

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

De Bruijn Graph ◽

Viral Quasispecies ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Sequencing Errors ◽

High Throughput Sequencing Data ◽

De Bruijn

Motivation: RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. Results: We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo, but viaDBG is also able to retrieve low-abundance quasispecies, which are often missed by PEHaplo. Availability and implementation: viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. Supplementary information: Supplementary data are available at Bioinformatics online.

Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs

Bioinformatics ◽

10.1093/bioinformatics/btz102 ◽

2019 ◽

Vol 36 (5) ◽

pp. 1374-1381 ◽

Cited By ~ 9

Author(s):

Antoine Limasset ◽

Jean-François Flot ◽

Pierre Peterlongo

Keyword(s):

Supplementary Information ◽

De Bruijn Graph ◽

Sequence Information ◽

Short Read ◽

De Bruijn Graphs ◽

Short Reads ◽

Sequencing Errors ◽

Long Read ◽

De Bruijn ◽

Read Accuracy

Motivation: Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large datasets or consider reads as mere suites of k-mers, without taking into account their full-length sequence information. Results: We propose a new method to correct short reads using de Bruijn graphs and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance, then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond. Availability and implementation: The implementation is open source, available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package. Supplementary information: Supplementary data are available at Bioinformatics online.
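
The first filtering step, discarding low-abundance k-mers as probable sequencing errors, can be sketched on raw k-mer counts. This is only an illustration of abundance filtering, not Bcool's code: the tool works on a compacted graph and also filters by unitig abundance, and the function names and the threshold value below are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    return Counter(read[i:i + k] for read in reads for i in range(len(read) - k + 1))


def solid_kmers(reads, k, min_abundance):
    """Keep k-mers seen at least min_abundance times (a hypothetical threshold)."""
    return {kmer for kmer, n in count_kmers(reads, k).items() if n >= min_abundance}


# The third read carries a single substitution (T -> G) relative to the others.
reads = ["ACGTA", "ACGTA", "ACGGA"]
print(sorted(solid_kmers(reads, 3, 2)))
```

The k-mers introduced by the substitution occur only once and fall below the threshold, so a graph built from the surviving k-mers no longer contains the erroneous path.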

Improved Bounds on Cutwidths of Shuffle-Exchange and de Bruijn Graphs

Parallel Processing Letters ◽

10.1142/s0129626404001945 ◽

2004 ◽

Vol 14 (03n04) ◽

pp. 361-366 ◽

Cited By ~ 1

Author(s):

Burkhard Monien ◽

Imrich Vrťo

Keyword(s):

Upper Bound ◽

De Bruijn Graph ◽

Best Constant ◽

De Bruijn Graphs ◽

Exchange Graph ◽

De Bruijn

We prove that the cutwidth of the $n$-dimensional shuffle-exchange graph is at most $\lceil 2^{n+1}/n \rceil$, for $n \geq 10$. This essentially improves on the previous best constant factors. As a consequence we obtain an improved upper bound for the cutwidth of the de Bruijn graph.
