Efficient de Bruijn graph construction for genome assembly using a hash table and auxiliary vector data structures

Background. Rapid advances in sequencing technologies have made it routine for sequencing laboratories to produce millions of high-quality reads from DNA samples. The de Bruijn graph is a popular data structure in the genome assembly literature for representing and processing these reads efficiently. Because a de Bruijn graph contains a very large number of nodes, memory consumption and running time are the main barriers, and the problem has received significant attention in the recent literature. Results. In this paper, we present an approach called HaVec that attempts to balance memory consumption against running time. HaVec stores the de Bruijn graph in a hash table together with an auxiliary vector data structure, thereby improving both total memory usage and running time. A noteworthy feature of HaVec is that it produces no false-positive errors. Conclusions. Graph construction generally takes the major share of the time in an assembly process, and HaVec can be seen as a significant advance in this respect. We anticipate that HaVec will be extremely useful in de Bruijn graph-based genome assembly.
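
To make the idea of hash-based de Bruijn graph construction concrete, here is a minimal sketch. It is not HaVec: the abstract does not describe HaVec's hash function or the layout of its auxiliary vectors, so this example simply assumes a plain Python dictionary keyed by (k-1)-mers. The function name `build_de_bruijn` and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph from reads using a hash table (dict).

    Each k-mer contributes one edge from its (k-1)-mer prefix to its
    (k-1)-mer suffix. This is an illustrative sketch, not HaVec's layout.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    # Return a plain dict so later lookups do not create empty entries.
    return dict(graph)


reads = ["ACGTAC", "GTACG"]  # toy input, assumed for illustration
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because the exact k-mers are stored as keys, a lookup in this sketch never reports a k-mer that was not inserted. The price is memory, which is the trade-off the abstract says HaVec tries to balance.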

Improving the efficiency of de Bruijn graph construction using compact universal hitting sets

10.1101/2020.11.08.373050 ◽

2020 ◽

Author(s):

Yael Ben-Ari ◽

Lianrong Pu ◽

Yaron Orenstein ◽

Ron Shamir

Keyword(s):

Data Structures ◽

High Throughput ◽

DNA Sequences ◽

Genome Assembly ◽

High Throughput Sequencing ◽

De Bruijn Graph ◽

Sequencing Analysis ◽

A Genome ◽

Data Structures And Algorithms ◽

De Bruijn

Abstract High-throughput sequencing techniques generate large volumes of DNA sequencing data at ultra-fast speed and extremely low cost. As a consequence, sequencing techniques have become ubiquitous in biomedical research and are used in hundreds of genomic applications. Efficient data structures and algorithms have been developed to handle the large datasets produced by these techniques. The prevailing method to index DNA sequences in those data structures and algorithms is by k-mers (k-long substrings) known as minimizers. Minimizers are the smallest k-mers selected in every consecutive window of a fixed length in a sequence, where the smallest is determined according to a predefined order, e.g., lexicographic. Recently, a new k-mer order based on a universal hitting set (UHS) was suggested. While several studies have shown that orders based on a small UHS have improved properties, the utility of using a small UHS in high-throughput sequencing analysis tasks has not been demonstrated to date. Here, we demonstrate the practical benefit of UHSs for the first time, in the genome assembly task. Reconstructing a genome from billions of short reads is a fundamental task in high-throughput sequencing analyses. de Bruijn graph construction is a key step in genome assembly, which often requires very large amounts of memory and long computation time. A critical bottleneck lies in the partitioning of DNA sequences into bins. The sequences in each bin are assembled separately, and the final de Bruijn graph is constructed by merging the bin-specific subgraphs. We incorporated a UHS-based order in the bin partition step of the Minimum Substring Partitioning algorithm of Li et al. (2013). Using a UHS-based order instead of lexicographic- or random-ordered minimizers produced lower density minimizers with more balanced bin partitioning, which led to a reduction in both runtime and memory usage.

Improved Parallel Processing of Massive De Bruijn Graph for Genome Assembly

Web Technologies and Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-642-37401-2_12 ◽

2013 ◽

pp. 96-107 ◽

Cited By ~ 1

Author(s):

Li Zeng ◽

Jiefeng Cheng ◽

Jintao Meng ◽

Bingqiang Wang ◽

Shengzhong Feng

Keyword(s):

Parallel Processing ◽

Genome Assembly ◽

De Bruijn Graph ◽

De Bruijn

RMI-DBG Algorithm: A more agile Iterative de Bruijn Graph Algorithm in Short Read Genome Assembly

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021500050 ◽

2021 ◽

Author(s):

Zeinab Zare Hosseini ◽

Shekoufeh Kolahdouz Rahimi ◽

Esmaeil Forouzan ◽

Ahmad Baraani

Keyword(s):

Genome Assembly ◽

Graph Algorithm ◽

De Bruijn Graph ◽

Short Read ◽

De Bruijn

A dynamic hashing approach to build the de Bruijn graph for genome assembly

2013 IEEE International Conference of IEEE Region 10 (TENCON 2013) ◽

10.1109/tencon.2013.6719008 ◽

2013 ◽

Cited By ~ 1

Author(s):

Kun Zhao ◽

Weiguo Liu ◽

Gerrit Voss ◽

Wolfgang Muller-Wittig

Keyword(s):

Genome Assembly ◽

De Bruijn Graph ◽

De Bruijn ◽

Dynamic Hashing

Accelerating De Bruijn Graph-Based Genome Assembly for High-Throughput Short Read Data

2013 International Conference on Parallel and Distributed Systems ◽

10.1109/icpads.2013.68 ◽

2013 ◽

Cited By ~ 1

Author(s):

Kun Zhao ◽

Weiguo Liu ◽

Gerrit Voss ◽

Wolfgang Mueller-Wittig

Keyword(s):

High Throughput ◽

Genome Assembly ◽

De Bruijn Graph ◽

Short Read ◽

De Bruijn

Aligning optical maps to de Bruijn graphs

Bioinformatics ◽

10.1093/bioinformatics/btz069 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3250-3256 ◽

Cited By ~ 1

Author(s):

Kingshuk Mukherjee ◽

Bahar Alipanahi ◽

Tamer Kahveci ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

Sequence Data ◽

Supplementary Information ◽

De Bruijn Graph ◽

Structural Variations ◽

Regular Feature ◽

A Genome ◽

De Bruijn ◽

Optical Maps

Abstract Motivation Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have mainly been used in a post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single-molecule optical maps (called Rmaps) or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data, optical maps and sequence data, which will facilitate the use of optical maps in the sequence assembly step itself. Results We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem, which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data. Availability and implementation The software for aligning optical maps to de Bruijn graphs, omGraph, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/omGraph. Supplementary information Supplementary data are available at Bioinformatics online.

Assembly of Long Error-Prone Reads Using de Bruijn Graphs

10.1101/048413 ◽

2016 ◽

Cited By ~ 6

Author(s):

Yu Lin ◽

Jeffrey Yuan ◽

Mikhail Kolmogorov ◽

Max W. Shen ◽

Pavel A. Pevzner

Keyword(s):

Real Time ◽

Single Molecule ◽

Genome Assembly ◽

State Of The Art ◽

De Bruijn Graph ◽

Consensus Approach ◽

De Bruijn Graphs ◽

De Bruijn

Abstract The recent breakthroughs in assembling long error-prone reads (such as reads generated by Single Molecule Real Time technology) were based on the overlap-layout-consensus approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the overlap-layout-consensus approach is the only practical paradigm for assembling long error-prone reads. Below we show how to generalize de Bruijn graphs to assemble long error-prone reads and describe the ABruijn assembler, which results in more accurate genome reconstructions than the existing state-of-the-art algorithms.

Succinct Dynamic de Bruijn Graphs

10.1101/2020.04.01.018481 ◽

2020 ◽

Cited By ~ 1

Author(s):

Bahar Alipanahi ◽

Alan Kuhnle ◽

Simon J. Puglisi ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Data Structures ◽

Large Scale ◽

High Throughput Sequencing ◽

De Bruijn Graph ◽

Sequencing Data ◽

Efficient Manner ◽

De Bruijn Graphs ◽

High Throughput Sequencing Data ◽

Efficient Data ◽

De Bruijn

Abstract Motivation The de Bruijn graph is one of the fundamental data structures for analysis of high-throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction, e.g., to add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work on building the graph in an efficient and mutable manner. Hence, most space-efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. Results In this paper we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g., greater than 15 billion k-mers). Competing dynamic methods either cannot be constructed on large-scale datasets, e.g., FDBG (Crawford et al., 2018), or cannot support both addition and deletion, e.g., BiFrost (Holley and Melsted, 2019). Availability DynamicBOSS is publicly available at https://github.com/baharpan/[email protected]

Succinct Dynamic de Bruijn Graphs

Bioinformatics ◽

10.1093/bioinformatics/btaa546 ◽

2020 ◽

Author(s):

Bahar Alipanahi ◽

Alan Kuhnle ◽

Simon J Puglisi ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Data Structures ◽

Large Scale ◽

High Throughput Sequencing ◽

Supplementary Information ◽

De Bruijn Graph ◽

Sequencing Data ◽

Efficient Manner ◽

De Bruijn Graphs ◽

High Throughput Sequencing Data ◽

De Bruijn

Abstract Motivation The de Bruijn graph is one of the fundamental data structures for analysis of high-throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction, e.g., to add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work on building the graph in an efficient and mutable manner. Hence, most space-efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. Results In this paper we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g., greater than 15 billion k-mers). Competing dynamic methods either cannot be constructed on large-scale datasets, e.g., FDBG (Crawford et al., 2018), or cannot support both addition and deletion, e.g., BiFrost (Holley and Melsted, 2019). Availability DynamicBOSS is publicly available at https://github.com/baharpan/dynboss. Supplementary information Supplementary data are available at Bioinformatics online.
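
What "supporting both addition and deletion" means for a de Bruijn graph can be illustrated with a deliberately simple sketch. It is not DynamicBOSS and it is not succinct: it assumes a node-centric graph stored as an ordinary Python set of k-mers, with edges implied by (k-1)-length overlaps. The class name `DynamicKmerGraph` and its methods are hypothetical.

```python
class DynamicKmerGraph:
    """Node-centric de Bruijn graph over a mutable set of k-mers.

    Edges are implicit: u -> v exists when both k-mers are present and
    the last k-1 characters of u equal the first k-1 characters of v.
    Illustrative only; a hash set is far from a succinct structure.
    """

    def __init__(self, k):
        self.k = k
        self.kmers = set()

    def add_sequence(self, seq):
        # Insert every k-mer of seq; no rebuild of existing data needed.
        for i in range(len(seq) - self.k + 1):
            self.kmers.add(seq[i:i + self.k])

    def remove_kmer(self, kmer):
        # Deleting a node also removes its implicit incident edges.
        self.kmers.discard(kmer)

    def successors(self, kmer):
        # Try the four possible one-base extensions of the suffix.
        suffix = kmer[1:]
        return [suffix + base for base in "ACGT" if suffix + base in self.kmers]


g = DynamicKmerGraph(k=3)
g.add_sequence("ACGTA")      # inserts ACG, CGT, GTA
print(g.successors("ACG"))   # ['CGT']
g.remove_kmer("CGT")
print(g.successors("ACG"))   # []
g.add_sequence("CGA")
print(g.successors("ACG"))   # ['CGA']
```

A hash set makes updates trivial but stores every k-mer explicitly. The abstract's point is that space-efficient representations usually give up this mutability, which is the gap DynamicBOSS is presented as closing.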
