Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer

How to apply de Bruijn graphs to genome assembly

Nature Biotechnology ◽

10.1038/nbt.2023 ◽

2011 ◽

Vol 29 (11) ◽

pp. 987-991 ◽

Cited By ~ 285

Author(s):

Phillip E C Compeau ◽

Pavel A Pevzner ◽

Glenn Tesler

Keyword(s):

Genome Assembly ◽

De Bruijn Graphs ◽

De Bruijn

Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ

10.1101/548123 ◽

2019 ◽

Cited By ~ 5

Author(s):

Ilia Minkin ◽

Paul Medvedev

Keyword(s):

Single Machine ◽

De Bruijn Graph ◽

Genome Alignment ◽

Whole Genome ◽

Reconstruction Algorithms ◽

De Bruijn Graphs ◽

Significant Step ◽

De Bruijn ◽

Whole Genome Alignment ◽

Computational Resources

Abstract. Multiple whole-genome alignment is a challenging problem in bioinformatics. Despite many successes, current methods are not able to keep up with the growing number, length, and complexity of assembled genomes, especially when computational resources are limited. Approaches based on compacted de Bruijn graphs to identify and extend anchors into locally collinear blocks have potential for scalability, but current methods do not scale to mammalian genomes. We present an algorithm, SibeliaZ-LCB, for identifying collinear blocks in closely related genomes based on analysis of the de Bruijn graph. We further incorporate this into a multiple whole-genome alignment pipeline called SibeliaZ. SibeliaZ shows run-time improvements over other methods while maintaining accuracy. On sixteen recently-assembled strains of mice, SibeliaZ runs in under 16 hours on a single machine, while other tools did not run to completion for eight mice within a week. SibeliaZ makes a significant step towards improving scalability of multiple whole-genome alignment and collinear block reconstruction algorithms on a single machine.

cloudSPAdes: assembly of synthetic long reads using de Bruijn graphs

Bioinformatics ◽

10.1093/bioinformatics/btz349 ◽

2019 ◽

Vol 35 (14) ◽

pp. i61-i70 ◽

Cited By ~ 4

Author(s):

Ivan Tolstoganov ◽

Anton Bankevich ◽

Zhoutao Chen ◽

Pavel A Pevzner

Keyword(s):

Narrow Range ◽

State Of The Art ◽

Supplementary Information ◽

De Bruijn Graph ◽

Hybrid Assembly ◽

De Bruijn Graphs ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

New Applications

Abstract. Motivation: The recently developed barcoding-based synthetic long read (SLR) technologies have already found many applications in genome assembly and analysis. However, although some new barcoding protocols are emerging and the range of SLR applications is being expanded, the existing SLR assemblers are optimized for a narrow range of parameters and are not easily extendable to new barcoding technologies and new applications such as metagenomics or hybrid assembly. Results: We describe the algorithmic challenge of the SLR assembly and present a cloudSPAdes algorithm for SLR assembly that is based on analyzing the de Bruijn graph of SLRs. We benchmarked cloudSPAdes across various barcoding technologies/applications and demonstrated that it improves on the state-of-the-art SLR assemblers in accuracy and speed. Availability and implementation: Source code and installation manual for cloudSPAdes are available at https://github.com/ablab/spades/releases/tag/cloudspades-paper. Supplementary information: Supplementary data are available at Bioinformatics online.

De novo whole-genome assembly of Chrysanthemum makinoi, a key wild chrysanthemum

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab358 ◽

2021 ◽

Author(s):

Natascha van Lieshout ◽

Martijn van Kaauwen ◽

Linda Kodde ◽

Paul Arens ◽

Marinus J M Smulders ◽

...

Keyword(s):

Ab Initio ◽

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Its Sequence ◽

Whole Genome ◽

Annotation Pipeline ◽

Long Reads ◽

Oxford Nanopore ◽

The World

Abstract. Chrysanthemum is among the top ten cut, potted and perennial garden flowers in the world. Despite this, to date, only the genomes of two wild diploid chrysanthemums have been sequenced and assembled. Here we present the most complete and contiguous chrysanthemum de novo assembly published so far, as well as a corresponding ab initio annotation. The cultivated hexaploid varieties are thought to originate from a hybrid of wild chrysanthemums, among which the diploid Chrysanthemum makinoi has been mentioned. Using a combination of Oxford Nanopore long reads, Pacific Biosciences long reads, Illumina short reads, Dovetail sequences and a genetic map, we assembled 3.1 Gb of its sequence into 9 pseudochromosomes, with an N50 of 330 Mb and BUSCO complete score of 92.1%. Our ab initio annotation pipeline predicted 95 074 genes and marked 80.0% of the genome as repetitive. This genome assembly of C. makinoi provides an important step forward in understanding the chrysanthemum genome, evolution and history.

Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

10.1101/2020.10.21.349605 ◽

2020 ◽

Author(s):

Jamshed Khan ◽

Rob Patro

Keyword(s):

Large Scale ◽

De Bruijn Graph ◽

Comparative Genomic ◽

De Bruijn Graphs ◽

Long Reads ◽

Genomic Analyses ◽

Finite State ◽

De Bruijn ◽

Colored De Bruijn Graph ◽

Memory Compaction

Abstract. Motivation: The construction of the compacted de Bruijn graph from a large collection of reference genomes is a task of increasing interest in genomic analyses. For example, compacted colored reference de Bruijn graphs are increasingly used as sequence indices for the purposes of alignment of short and long reads. Also, as we sequence and assemble a greater diversity of individual genomes, the compacted colored de Bruijn graph can be used as the basis for methods aiming to perform comparative genomic analyses on these genomes. While algorithms have been developed to construct the compacted colored de Bruijn graph from reference sequences, there is still room for improvement, especially in the memory and the runtime performance as the number and the scale of the genomes over which the de Bruijn graph is built grow. Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the colored compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel modeling scheme of the de Bruijn graph vertices as finite-state automata, and constrains the state-space for the automata to enable tracking of their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that the algorithm scales much better than existing approaches, especially as the number and scale of the input references grow. For example, on a typical shared-memory machine, Cuttlefish constructed the compacted graph for 100 human genomes in less than 7 hours, using ~29 GB of memory; no other tested tool successfully completed this task on the testing hardware. We also applied Cuttlefish on 11 diverse conifer plant genomes, and the compacted graph was constructed in under 11 hours, using ~84 GB of memory, while the only other tested tool able to complete this compaction on our hardware took more than 16 hours and ~289 GB of memory. Availability: Cuttlefish is written in C++14, and is available under an open source license at https://github.com/COMBINE-lab/[email protected]
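
For readers unfamiliar with the term, "compaction" means collapsing every non-branching path of the de Bruijn graph into a single string (a unitig). The sketch below is a naive, in-memory illustration of that idea only. It is not Cuttlefish's finite-state-automaton algorithm, it ignores reverse complements and colors, and the function name `compact_unitigs` is a hypothetical choice made for this example.

```python
from collections import defaultdict

def compact_unitigs(kmers):
    """Naively compact an edge-centric de Bruijn graph into unitigs.

    Nodes are (k-1)-mers and each k-mer is an edge. Assumptions of this
    sketch: forward strand only, and isolated cycles (where every node
    has exactly one predecessor and one successor) are not reported.
    """
    succ, pred = defaultdict(set), defaultdict(set)
    for kmer in set(kmers):
        succ[kmer[:-1]].add(kmer[1:])
        pred[kmer[1:]].add(kmer[:-1])
    nodes = set(succ) | set(pred)

    def is_internal(node):
        # A node lies strictly inside a non-branching path.
        return len(succ[node]) == 1 and len(pred[node]) == 1

    unitigs = []
    for node in nodes:
        if is_internal(node):
            continue  # unitigs start only at branching or terminal nodes
        for nxt in succ[node]:
            path = node + nxt[-1]
            while is_internal(nxt):
                nxt = next(iter(succ[nxt]))
                path += nxt[-1]
            unitigs.append(path)
    return sorted(unitigs)

# Two toy sequences, ACGTA and ACGTC, share the prefix ACGT and then diverge.
unitigs = compact_unitigs(["ACG", "CGT", "GTA", "GTC"])
```

Here the shared prefix collapses into one unitig and each divergent edge remains its own unitig. Holding every k-mer in Python dictionaries is precisely the kind of memory cost that the abstract says Cuttlefish is designed to avoid.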

Assembly of Long Error-Prone Reads Using de Bruijn Graphs

10.1101/048413 ◽

2016 ◽

Cited By ~ 6

Author(s):

Yu Lin ◽

Jeffrey Yuan ◽

Mikhail Kolmogorov ◽

Max W. Shen ◽

Pavel A. Pevzner

Keyword(s):

Real Time ◽

Single Molecule ◽

Genome Assembly ◽

State Of The Art ◽

De Bruijn Graph ◽

Consensus Approach ◽

De Bruijn Graphs ◽

De Bruijn

Abstract. The recent breakthroughs in assembling long error-prone reads (such as reads generated by Single Molecule Real Time technology) were based on the overlap-layout-consensus approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the overlap-layout-consensus approach is the only practical paradigm for assembling long error-prone reads. Below we show how to generalize de Bruijn graphs to assemble long error-prone reads and describe the ABruijn assembler, which results in more accurate genome reconstructions than the existing state-of-the-art algorithms.

What do Eulerian and Hamiltonian cycles have to do with genome assembly?

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008928 ◽

2021 ◽

Vol 17 (5) ◽

pp. e1008928

Author(s):

Paul Medvedev ◽

Mihai Pop

Keyword(s):

Genome Assembly ◽

Linear Time ◽

Hamiltonian Cycles ◽

De Bruijn Graphs ◽

Genome Reconstruction ◽

Assembly Algorithm ◽

A Genome ◽

De Bruijn ◽

Do So

Many students are taught about genome assembly using the dichotomy between the complexity of finding Eulerian and Hamiltonian cycles (easy versus hard, respectively). This dichotomy is sometimes used to motivate the use of de Bruijn graphs in practice. In this paper, we explain that while de Bruijn graphs have indeed been very useful, the reason has nothing to do with the complexity of the Hamiltonian and Eulerian cycle problems. We give 2 arguments. The first is that a genome reconstruction is never unique and hence an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. The second is that even if an arbitrary genome reconstruction was desired, one could do so in linear time in both the Eulerian and Hamiltonian paradigms.
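
The textbook Eulerian formulation that this abstract refers to can be sketched as follows: build a de Bruijn graph whose nodes are (k-1)-mers and whose edges are k-mers, then walk an Eulerian cycle with Hierholzer's algorithm, which runs in time linear in the number of edges. This is an illustration of the classroom model only. As the abstract itself argues, practical assemblers do not work this way, and the function names and the toy circular genome are assumptions made for the example.

```python
from collections import defaultdict

def de_bruijn_graph(kmers):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_cycle(graph):
    """Hierholzer's algorithm.

    Assumes the graph is connected and balanced (in-degree equals
    out-degree at every node), so an Eulerian cycle exists.
    """
    remaining = {node: list(succs) for node, succs in graph.items()}
    start = next(iter(remaining))
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            cycle.append(stack.pop())            # dead end: emit the node
    return cycle[::-1]

def spell_circular(cycle):
    """Spell a circular string from a node cycle, skipping the repeated start node."""
    return "".join(node[0] for node in cycle[:-1])

# All 3-mers of the toy circular genome ATGC.
kmers = ["ATG", "TGC", "GCA", "CAT"]
cycle = eulerian_cycle(de_bruijn_graph(kmers))
genome = spell_circular(cycle)
```

With repeated k-mers the graph generally admits several Eulerian cycles, which is one way to see the abstract's first argument that a genome reconstruction is never unique.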

Whole genome sequencing and assembly of a Caenorhabditis elegans genome with complex genomic rearrangements using the MinION sequencing device

10.1101/099143 ◽

2017 ◽

Cited By ~ 12

Author(s):

JR Tyson ◽

NJ O’Neil ◽

M Jain ◽

HE Olsen ◽

P Hieter ◽

...

Keyword(s):

Caenorhabditis Elegans ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Genome Assembly ◽

De Novo ◽

Sequence Data ◽

Genomic Rearrangements ◽

Whole Genome ◽

C Elegans ◽

Long Reads

Abstract. Advances in 3rd generation sequencing have opened new possibilities for ‘benchtop’ whole genome sequencing. The MinION is a portable device that uses nanopore technology and can sequence long DNA molecules. MinION long reads are well suited for sequencing and de novo assembly of complex genomes with large repetitive elements. Long reads also facilitate the identification of complex genomic rearrangements such as those observed in tumor genomes. To assess the feasibility of the de novo assembly of large complex genomes using both MinION and Illumina platforms, we sequenced the genome of a Caenorhabditis elegans strain that contains a complex acetaldehyde-induced rearrangement and a biolistic bombardment-mediated insertion of a GFP containing plasmid. Using ∼5.8 gigabases of MinION sequence data, we were able to assemble a C. elegans genome containing 145 contigs (N50 contig length = 1.22 Mb) that covered >99% of the 100,286,401 bp reference genome. In contrast, using ∼8.04 gigabases of Illumina sequence data, we were able to assemble a C. elegans genome in 38,645 contigs (N50 contig length = ∼26 kb) containing 117 Mb. From the MinION genome assembly we identified the complex structures of both the acetaldehyde-induced mutation and the biolistic-mediated insertion. To date, this is the largest genome to be assembled exclusively from MinION data and is the first demonstration that the long reads of MinION sequencing can be used for whole genome assembly of large (100 Mb) genomes and the elucidation of complex genomic rearrangements.
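
The N50 contig length quoted in this abstract is a standard contiguity statistic: the length L such that contigs of length L or longer together contain at least half of the assembled bases. A minimal sketch of the usual definition follows. The function name `n50` and the toy contig lengths are illustrative assumptions, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Toy assembly: 290 bases in total, so the threshold is 145 bases.
toy_n50 = n50([80, 70, 50, 40, 30, 20])
```

In the toy case the two longest contigs (80 + 70 = 150) are the first to reach the halfway threshold, so the N50 is 70. A few long contigs raise N50 sharply, which is why the 145-contig MinION assembly and the 38,645-contig Illumina assembly reported above differ so much on this measure.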

Integration of string and de Bruijn graphs for genome assembly

Bioinformatics ◽

10.1093/bioinformatics/btw011 ◽

2016 ◽

Vol 32 (9) ◽

pp. 1301-1307 ◽

Cited By ~ 5

Author(s):

Yao-Ting Huang ◽

Chen-Fu Liao

Keyword(s):

Genome Assembly ◽

De Bruijn Graphs ◽

De Bruijn

Assembly of long error-prone reads using de Bruijn graphs

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1604560113 ◽

2016 ◽

Vol 113 (52) ◽

pp. E8396-E8405 ◽

Cited By ~ 85

Author(s):

Yu Lin ◽

Jeffrey Yuan ◽

Mikhail Kolmogorov ◽

Max W. Shen ◽

Mark Chaisson ◽

...

Keyword(s):

Genome Assembly ◽

De Bruijn Graph ◽

De Bruijn Graphs ◽

De Bruijn

The recent breakthroughs in assembling long error-prone reads were based on the overlap-layout-consensus (OLC) approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the OLC approach is the only practical paradigm for assembling long error-prone reads. We show how to generalize de Bruijn graphs for assembling long error-prone reads and describe the ABruijn assembler, which combines the de Bruijn graph and the OLC approaches and results in accurate genome reconstructions.
