GPU acceleration of Darwin read overlapper for de novo assembly of long DNA reads

Abstract Background. In Overlap-Layout-Consensus (OLC) based de novo assembly, every read must be compared with every other read to find overlaps. This makes the process slow and limits the practicality of de novo assembly methods at large scale in the field. Darwin is a fast and accurate read overlapper that can be used for de novo assembly of state-of-the-art third-generation long DNA reads. Darwin is designed to be hardware-friendly and can be accelerated on specialized hardware to achieve higher performance. Results. This work accelerates Darwin on GPUs. Using real PacBio data, our GPU implementation on a Tesla K40 shows a speedup of 109x over 8 CPU threads of an Intel Xeon machine and 24x over 64 threads of an IBM Power8 machine. The GPU implementation supports both linear and affine gap scoring models, and the results show that it achieves the same high speedup for both scoring schemes. Conclusions. The GPU implementation proposed in this work is significantly faster than the CPU version, making Darwin usable as a practical read overlapper in a DNA assembly pipeline. Furthermore, our GPU acceleration can also be used for fast Smith-Waterman alignment between long DNA reads. GPU hardware is now commonly available in the field, which makes the proposed acceleration accessible to a wider audience. The implementation is available at https://github.com/Tongdongq/darwin-gpu.
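The linear and affine gap scoring models mentioned in the abstract can be illustrated with a small CPU-only sketch of Smith-Waterman local alignment. This is not the Darwin GPU code: the function name `smith_waterman_affine` and the default scores are assumptions chosen for illustration. A gap of length k is charged `gap_open + k * gap_extend`, so setting `gap_open=0` gives the linear gap model.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Return the best local alignment score between sequences a and b.

    Hypothetical helper for illustration. A gap of length k costs
    gap_open + k * gap_extend; gap_open=0 reduces to linear gap scoring.
    """
    neg_inf = float("-inf")
    m = len(b)
    # prev_h[j]: best score of an alignment ending at (i-1, j)
    # prev_f[j]: best score ending at (i-1, j) with a gap in b (vertical move)
    prev_h = [0] * (m + 1)
    prev_f = [neg_inf] * (m + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (m + 1)
        cur_f = [neg_inf] * (m + 1)
        e = neg_inf  # best score ending at (i, j) with a gap in a (horizontal move)
        for j in range(1, m + 1):
            # Either open a new gap from H or extend an existing one.
            e = max(cur_h[j - 1] - gap_open - gap_extend, e - gap_extend)
            cur_f[j] = max(prev_h[j] - gap_open - gap_extend,
                           prev_f[j] - gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


if __name__ == "__main__":
    read_a = "ACGTACGT"
    read_b = "ACGTTACGT"  # read_a with one inserted base
    print("affine:", smith_waterman_affine(read_a, read_b))
    print("linear:", smith_waterman_affine(read_a, read_b, gap_open=0))
```

The sketch keeps only two rows of the dynamic programming matrix and returns the score without a traceback, which is enough to compare the two scoring models on short inputs.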

Accurate long-read de novo assembly evaluation with Inspector

Genome Biology ◽

10.1186/s13059-021-02527-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yu Chen ◽

Yixin Zhang ◽

Amy Y. Wang ◽

Min Gao ◽

Zechen Chong

Keyword(s):

Genome Assembly ◽

De Novo Assembly ◽

In Silico ◽

Large Scale ◽

De Novo ◽

Small Scale ◽

De Novo Genome Assembly ◽

Consensus Sequences ◽

Assembly Evaluation ◽

Long Read

Abstract Long-read de novo genome assembly continues to advance rapidly. However, there is a lack of effective tools to accurately evaluate the assembly results, especially for structural errors. We present Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions. Based on in silico and long-read assembly results from multiple long-read data and assemblers, we demonstrate that in addition to providing generic metrics, Inspector can accurately identify both large-scale and small-scale assembly errors.

Optimizing de novo genome assembly from PCR-amplified metagenomes

10.7287/peerj.preprints.27453 ◽

2018 ◽

Author(s):

Simon Roux ◽

Gareth Trubl ◽

Danielle Goudeau ◽

Nandita Nath ◽

Estelle Couradeau ◽

...

Keyword(s):

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Pcr Amplification ◽

Error Rates ◽

De Novo Genome Assembly ◽

Low Input ◽

Assembly Algorithm ◽

Coverage Bias ◽

Assembly Pipeline

Background. Metagenomics has transformed our understanding of microbial diversity across ecosystems, with recent advances enabling de novo assembly of genomes from metagenomes. These metagenome-assembled genomes are critical to provide ecological, evolutionary, and metabolic context for all the microbes and viruses yet to be cultivated. Metagenomes can now be generated from nanogram to subnanogram amounts of DNA. However, these libraries require several rounds of PCR amplification before sequencing, and recent data suggest these typically yield smaller and more fragmented assemblies than regular metagenomes. Methods. Here we evaluate de novo assembly methods of 169 PCR-amplified metagenomes, including 25 for which an unamplified counterpart is available, to optimize specific assembly approaches for PCR-amplified libraries. We first evaluated coverage bias by mapping reads from PCR-amplified metagenomes onto reference contigs obtained from unamplified metagenomes of the same samples. Then, we compared different assembly pipelines in terms of assembly size (number of bp in contigs ≥ 10kb) and error rates to evaluate which are the best suited for PCR-amplified metagenomes. Results. Read mapping analyses revealed that the depth of coverage within individual genomes is significantly more uneven in PCR-amplified datasets versus unamplified metagenomes, with regions of high depth of coverage enriched in short inserts. This enrichment scales with the number of PCR cycles performed, and is presumably due to preferential amplification of short inserts. Standard assembly pipelines are confounded by this type of coverage unevenness, so we evaluated other assembly options to mitigate these issues. 
We found that a pipeline combining read deduplication and an assembly algorithm originally designed to recover genomes from libraries generated after whole genome amplification (single-cell SPAdes) frequently improved assembly of contigs ≥ 10kb by 10 to 100-fold for low input metagenomes. Conclusions. PCR-amplified metagenomes have enabled scientists to explore communities traditionally challenging to describe, including some with extremely low biomass or from which DNA is particularly difficult to extract. Here we show that a modified assembly pipeline can lead to an improved de novo genome assembly from PCR-amplified datasets, and enables a better genome recovery from low input metagenomes.
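Two quantities from this abstract are easy to make concrete: the assembly size metric (number of bp in contigs ≥ 10kb) and read deduplication. The sketch below is an illustration only. The abstract does not say how deduplication was performed, so exact-sequence matching is an assumption here, and the function names `assembly_size` and `deduplicate_reads` are hypothetical.

```python
def assembly_size(contig_lengths, min_len=10_000):
    """Total number of bp in contigs of at least min_len bp."""
    return sum(length for length in contig_lengths if length >= min_len)


def deduplicate_reads(reads):
    """Drop reads whose sequence was already seen, keeping first occurrences.

    Assumption: duplicates are exact sequence matches; real pipelines may
    use other criteria.
    """
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


if __name__ == "__main__":
    print(assembly_size([5_000, 10_000, 25_000]))
    print(deduplicate_reads(["ACGT", "ACGT", "TTGA"]))
```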

De novo assembly of highly polymorphic metagenomic data using in situ generated reference sequences and a novel BLAST-based assembly pipeline

BMC Bioinformatics ◽

10.1186/s12859-017-1630-z ◽

2017 ◽

Vol 18 (1) ◽

Cited By ~ 9

Author(s):

You-Yu Lin ◽

Chia-Hung Hsieh ◽

Jiun-Hong Chen ◽

Xuemei Lu ◽

Jia-Horng Kao ◽

...

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Metagenomic Data ◽

Assembly Pipeline ◽

Reference Sequences

Reducing the number of artifactual repeats in de novo assembly of RNA-Seq data by optimizing the assembly pipeline

Gene Reports ◽

10.1016/j.genrep.2017.08.003 ◽

2017 ◽

Vol 9 ◽

pp. 7-12

Author(s):

Wei-Kang Lee ◽

Nur Afiza Mohd Zainuddin ◽

Hui-Ying Teh ◽

Yi-Yi Lim ◽

Mohd Uzair Jaafar ◽

...

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Rna Seq ◽

Assembly Pipeline

State of the art de novo assembly of human genomes from massively parallel sequencing data

Human Genomics ◽

10.1186/1479-7364-4-4-271 ◽

2010 ◽

Vol 4 (4) ◽

pp. 271 ◽

Cited By ~ 49

Author(s):

Yingrui Li ◽

Yujie Hu ◽

Lars Bolund ◽

Jun Wang

Keyword(s):

De Novo Assembly ◽

De Novo ◽

State Of The Art ◽

Massively Parallel Sequencing ◽

Massively Parallel ◽

Sequencing Data ◽

Parallel Sequencing ◽

Human Genomes

PaKman: Scalable Assembly of Large Genomes on Distributed Memory Machines

10.1101/523068 ◽

2019 ◽

Author(s):

Priyanka Ghosh ◽

Sriram Krishnamoorthy ◽

Ananth Kalyanaraman

Keyword(s):

Genome Assembly ◽

Large Scale ◽

Distributed Memory ◽

High Throughput Sequencing ◽

De Novo ◽

State Of The Art ◽

Fundamental Problem ◽

Parallel Computer ◽

Assembly Process ◽

Data Movement

Abstract De novo genome assembly is a fundamental problem in bioinformatics that aims to assemble the DNA sequence of an unknown genome from numerous short DNA fragments (aka reads) obtained from it. With the advent of high-throughput sequencing technologies, billions of reads can be generated in a matter of hours, necessitating efficient parallelization of the assembly process. While multiple parallel solutions have been proposed in the past, conducting assembly at large scale remains a challenging problem because of the inherent complexities associated with data movement, and irregular access footprints of memory and I/O operations. In this paper, we present a novel algorithm, called PaKman, to address the problem of performing large-scale genome assemblies on a distributed memory parallel computer. Our approach focuses on improving performance through a combination of novel data structures and algorithmic strategies for reducing the communication and I/O footprint during the assembly process. PaKman presents a solution for the two most time-consuming phases in the full genome assembly pipeline, namely, k-mer counting and contig generation. A key aspect of our algorithm is its graph data structure, which comprises fat nodes (or what we call “macro-nodes”) that reduce the communication burden during contig generation. We present an extensive performance and qualitative evaluation of our algorithm, including comparisons to other state-of-the-art parallel assemblers. Our results demonstrate the ability to achieve near-linear speedups on up to 8K cores (tested); outperform state-of-the-art distributed memory and shared memory tools in performance while delivering comparable (if not better) quality; and reduce time to solution significantly. For instance, PaKman is able to generate a high-quality set of assembled contigs for complex genomes such as the human and wheat genomes in a matter of minutes on 8K cores.
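The k-mer counting phase named in this abstract can be sketched serially as follows. This is not PaKman's algorithm: the function names are hypothetical, and assigning each k-mer to an owning rank by a CRC32 hash is an assumption, shown only to indicate why counts must be exchanged between processes on a distributed memory machine.

```python
import zlib
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def partition_kmers(counts, num_ranks):
    """Assign each k-mer to one owning rank using a deterministic hash.

    Assumption for illustration: ownership is crc32(kmer) % num_ranks.
    """
    partitions = [dict() for _ in range(num_ranks)]
    for kmer, count in counts.items():
        owner = zlib.crc32(kmer.encode("ascii")) % num_ranks
        partitions[owner][kmer] = count
    return partitions


if __name__ == "__main__":
    counts = count_kmers(["ACGTACGT"], k=4)
    print(dict(counts))
    for rank, part in enumerate(partition_kmers(counts, num_ranks=2)):
        print(rank, part)
```

In a real distributed run each rank would count k-mers from its own share of the reads and then send each count to the owning rank, which is where the communication cost discussed in the abstract arises.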

Telomere length de novo assembly of all 7 chromosomes and mitogenome sequencing of the model entomopathogenic fungus, Metarhizium brunneum, by means of a novel assembly pipeline

BMC Genomics ◽

10.1186/s12864-021-07390-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Zack Saud ◽

Alexandra M. Kortsinoglou ◽

Vassili N. Kouvelis ◽

Tariq M. Butt

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Gene Prediction ◽

Fungal Species ◽

Orthologous Protein ◽

Metarhizium Brunneum ◽

Sequencing Technologies ◽

Protein Clusters ◽

Assembly Pipeline ◽

Generation Sequencing

Abstract Background More accurate and complete reference genomes have improved understanding of gene function, biology, and evolutionary mechanisms. Hybrid genome assembly approaches leverage benefits of both long, relatively error-prone reads from third-generation sequencing technologies and short, accurate reads from second-generation sequencing technologies, to produce more accurate and contiguous de novo genome assemblies in comparison to using either technology independently. In this study, we present a novel hybrid assembly pipeline that allowed for both mitogenome de novo assembly and telomere length de novo assembly of all 7 chromosomes of the model entomopathogenic fungus, Metarhizium brunneum. Results The improved assembly allowed for better ab initio gene prediction and a more BUSCO complete proteome set has been generated in comparison to the eight current NCBI reference Metarhizium spp. genomes. Remarkably, we note that including the mitogenome in ab initio gene prediction training improved overall gene prediction. The assembly was further validated by comparing contig assembly agreement across various assemblers, assessing the assembly performance of each tool. Genomic synteny and orthologous protein clusters were compared between Metarhizium brunneum and three other Hypocreales species with complete genomes, identifying core proteins, and listing orthologous protein clusters shared uniquely between the two entomopathogenic fungal species, so as to further facilitate the understanding of molecular mechanisms underpinning fungal-insect pathogenesis. Conclusions The novel assembly pipeline may be used for other haploid fungal species, facilitating the need to produce high-quality reference fungal genomes, leading to better understanding of fungal genomic evolution, chromosome structuring and gene regulation.

De novo assembly and delivery to mouse cells of a 101 kb functional human gene

Genetics ◽

10.1093/genetics/iyab038 ◽

2021 ◽

Author(s):

Leslie A Mitchell ◽

Laura H McCulloch ◽

Sudarshan Pinglay ◽

Henri Berger ◽

Nazario Bosco ◽

...

Keyword(s):

Dna Sequences ◽

Large Scale ◽

De Novo ◽

Human Gene ◽

Embryonic Stem ◽

Building Blocks ◽

Functional Study ◽

Dna Assembly ◽

Functional Evaluation ◽

Mouse Cells

Abstract Design and large-scale synthesis of DNA has been applied to the functional study of viral and microbial genomes. New and expanded technology development is required to unlock the transformative potential of such bottom-up approaches to the study of larger mammalian genomes. Two major challenges include assembling and delivering long DNA sequences. Here we describe a workflow for de novo DNA assembly and delivery that enables functional evaluation of mammalian genes on the length scale of 100 kilobase pairs (kb). The DNA assembly step is supported by an integrated robotic workcell. We demonstrate assembly of the 101 kb human HPRT1 gene in yeast from 3 kb building blocks, precision delivery of the resulting construct to mouse embryonic stem cells, and subsequent expression of the human protein from its full-length human gene in mouse cells. This workflow provides a framework for mammalian genome writing. We envision utility in producing designer variants of human genes linked to disease and their delivery and functional analysis in cell culture or animal models.

Hardware accelerated novel optical de novo assembly for large-scale genomes

2014 24th International Conference on Field Programmable Logic and Applications (FPL) ◽

10.1109/fpl.2014.6927499 ◽

2014 ◽

Cited By ~ 6

Author(s):

Pingfan Meng ◽

Matthew Jacobsen ◽

Motoki Kimura ◽

Vladimir Dergachev ◽

Thomas Anantharaman ◽

...

Keyword(s):

De Novo Assembly ◽

Large Scale ◽

De Novo

Telomere length de novo assembly of all 7 chromosomes and mitogenome sequencing of the model entomopathogenic fungus, Metarhizium brunneum, by means of a novel assembly pipeline

10.21203/rs.3.rs-60098/v3 ◽

2020 ◽

Author(s):

Zack Saud ◽

Alexandra M. Kortsinoglou ◽

Vassili N. Kouvelis ◽

Tariq M. Butt

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Gene Prediction ◽

Fungal Species ◽

Orthologous Protein ◽

Metarhizium Brunneum ◽

Sequencing Technologies ◽

Protein Clusters ◽

Assembly Pipeline ◽

Generation Sequencing
