LongStitch: high-quality genome assembly correction and scaffolding using long reads

Abstract Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which include initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short- and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
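
The contiguity gains above are reported as fold changes in NGA50 length. As a rough illustration of the underlying statistic, the sketch below computes N50 and NG50 from a list of sequence lengths. The function names and the toy lengths are hypothetical and are not taken from LongStitch. NGA50 is commonly computed the same way as NG50, but over blocks aligned to a reference after contigs are broken at misassemblies, so it would need alignment data that this sketch does not model.

```python
def nx50(lengths, total=None):
    """Return the largest length L such that sequences of length >= L
    together cover at least half of `total`.

    With total=None the sum of the lengths is used (N50).
    Passing an estimated genome size as `total` gives NG50.
    Returns 0 if the sequences never reach half of `total`.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def fold_improvement(before, after):
    """Fold change of a contiguity statistic, e.g. NGA50 after vs. before."""
    return after / before


# Toy contig lengths (hypothetical values, arbitrary units).
contigs = [80, 70, 50, 40, 30, 20, 10]
n50 = nx50(contigs)                # half of 300 is reached at 80 + 70
ng50 = nx50(contigs, total=400)    # half of 400 is reached at 80 + 70 + 50
print(n50, ng50)
```

Because NG50 and NGA50 are normalised by genome size rather than assembly size, they stay comparable across assemblies of the same genome that differ in total length.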

LongStitch: High-quality genome assembly correction and scaffolding using long reads

10.1101/2021.06.17.448848 ◽

2021 ◽

Author(s):

Lauren Coombe ◽

Janet X Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which include initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short- and long-read assemblies of three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 2.0-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently runs in under five hours using less than 23 GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
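
The abstract states that ntLink joins contigs using lightweight minimizer mappings. The following is a minimal sketch of the general window-minimizer idea only, and it is not ntLink's implementation: the function names are hypothetical, and k-mers are ordered lexicographically for readability, whereas practical tools typically order them by a hash value. Sequences that overlap tend to select the same minimizers, so comparing these small sketches can stand in for a full alignment.

```python
def minimizers(seq, k, w):
    """Return the sorted (position, k-mer) minimizers of `seq`.

    For every window of `w` consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimal index, i.e. the leftmost tie.
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


def shared_minimizers(a, b, k, w):
    """K-mers selected as minimizers in both sequences (positions ignored)."""
    kmers_a = {kmer for _, kmer in minimizers(a, k, w)}
    kmers_b = {kmer for _, kmer in minimizers(b, k, w)}
    return kmers_a & kmers_b


# Toy example with hypothetical sequences.
print(minimizers("ACGTACGT", k=3, w=3))
```

In a scaffolding setting, a long read that shares minimizers with two different contigs is evidence that those contigs are adjacent; the order and orientation logic needed to act on that evidence is outside the scope of this sketch.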

A high-quality genome assembly from a single, field-collected spotted lanternfly (Lycorma delicatula) using the PacBio Sequel II system

GigaScience ◽

10.1093/gigascience/giz122 ◽

2019 ◽

Vol 8 (10) ◽

Cited By ~ 12

Author(s):

Sarah B Kingan ◽

Julie Urban ◽

Christine C Lambert ◽

Primo Baybayan ◽

Anna K Childers ◽

...

Keyword(s):

Invasive Species ◽

Genome Assembly ◽

De Novo ◽

Fragment Size ◽

High Quality ◽

De Novo Genome Assembly ◽

Lycorma Delicatula ◽

Long Read ◽

Genome Assemblies ◽

High Quality Genome

ABSTRACT Background A high-quality reference genome is an essential tool for applied and basic research on arthropods. Long-read sequencing technologies may be used to generate more complete and contiguous genome assemblies than alternate technologies; however, long-read methods have historically had greater input DNA requirements and higher costs than next-generation sequencing, which are barriers to their use on many samples. Here, we present a 2.3 Gb de novo genome assembly of a field-collected adult female spotted lanternfly (Lycorma delicatula) using a single Pacific Biosciences SMRT Cell. The spotted lanternfly is an invasive species recently discovered in the northeastern United States that threatens to damage economically important crop plants in the region. Results The DNA from 1 individual was used to make 1 standard, size-selected library with an average DNA fragment size of ∼20 kb. The library was run on 1 Sequel II SMRT Cell 8M, generating a total of 132 Gb of long-read sequences, of which 82 Gb were from unique library molecules, representing ∼36× coverage of the genome. The assembly had high contiguity (contig N50 length = 1.5 Mb), completeness, and sequence level accuracy as estimated by conserved gene set analysis (96.8% of conserved genes both complete and without frame shift errors). Furthermore, it was possible to segregate more than half of the diploid genome into the 2 separate haplotypes. The assembly also recovered 2 microbial symbiont genomes known to be associated with L. delicatula, each microbial genome being assembled into a single contig. Conclusions We demonstrate that field-collected arthropods can be used for the rapid generation of high-quality genome assemblies, an attractive approach for projects on emerging invasive species, disease vectors, or conservation efforts of endangered species.
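
The coverage figure in the Results can be checked with simple arithmetic: fold coverage is the number of sequenced bases divided by the genome size. The sketch below uses only the numbers quoted in the abstract and assumes that the 2.3 Gb assembly size is a reasonable stand-in for the genome size; the function name is hypothetical.

```python
def fold_coverage(sequenced_bases, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return sequenced_bases / genome_size


GB = 1e9
total_bases = 132 * GB    # all long-read sequence from the SMRT Cell
unique_bases = 82 * GB    # sequence from unique library molecules
genome_size = 2.3 * GB    # assumption: assembly size used as genome size

coverage = fold_coverage(unique_bases, genome_size)
unique_fraction = unique_bases / total_bases
print(round(coverage, 1), round(unique_fraction, 2))
```

The result, roughly 35.7, is consistent with the reported ∼36× coverage, and about 62% of the raw yield came from unique library molecules.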

Extensive genomic and transcriptomic variation defines the chromosome-scale assembly of Haemonchus contortus, a model gastrointestinal worm

10.1101/2020.02.18.945246 ◽

2020 ◽

Cited By ~ 2

Author(s):

Stephen R. Doyle ◽

Alan Tracey ◽

Roz Laing ◽

Nancy Holroyd ◽

David Bartley ◽

...

Keyword(s):

Genome Assembly ◽

Haemonchus Contortus ◽

Vaccine Development ◽

De Novo ◽

Anthelmintic Resistance ◽

Draft Genome ◽

Small Ruminants ◽

High Quality ◽

Long Read ◽

Genome Assemblies

Abstract Background Haemonchus contortus is a globally distributed and economically important gastrointestinal pathogen of small ruminants, and has become the key nematode model for studying anthelmintic resistance and other parasite-specific traits among a wider group of parasites including major human pathogens. Two draft genome assemblies for H. contortus were reported in 2013; however, both were highly fragmented, incomplete, and differed from one another in important respects. While the introduction of long-read sequencing has significantly increased the rate of production and contiguity of de novo genome assemblies broadly, achieving high-quality genome assemblies for small, genetically diverse, outcrossing eukaryotic organisms such as H. contortus remains a significant challenge. Results Here, we report using PacBio long-read and OpGen and 10X Genomics long-molecule methods to generate a highly contiguous 283.4 Mbp chromosome-scale genome assembly including a resolved sex chromosome. We show a remarkable pattern of almost complete conservation of chromosome content (synteny) with Caenorhabditis elegans, but almost no conservation of gene order. Long-read transcriptome sequence data has allowed us to define coordinated transcriptional regulation throughout the life cycle of the parasite, and refine our understanding of cis- and trans-splicing relative to that observed in C. elegans. Finally, we use this assembly to give a comprehensive picture of chromosome-wide genetic diversity both within a single isolate and globally. Conclusions The H. contortus MHco3(ISE).N1 genome assembly presented here represents the most contiguous and resolved nematode assembly outside of the Caenorhabditis genus to date, together with one of the highest-quality sets of predicted gene features. These data provide a high-quality comparison for understanding the evolution and genomics of Caenorhabditis and other nematodes, and extend the experimental tractability of this model parasitic nematode in understanding pathogen biology, drug discovery and vaccine development, and important adaptive traits such as drug resistance.

A high-quality, long-read de novo genome assembly to aid conservation of Hawaii’s last remaining crow species

10.1101/349035 ◽

2018 ◽

Author(s):

Jolene T. Sutton ◽

Martin Helmkampf ◽

Cynthia C. Steiner ◽

M. Renee Bellinger ◽

Jonas Korlach ◽

...

Keyword(s):

Genome Assembly ◽

Captive Breeding ◽

De Novo ◽

Bird Species ◽

Population Level ◽

Model Systems ◽

Population Declines ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Read

Abstract: Genome-level data can provide researchers with unprecedented precision to examine the causes and genetic consequences of population declines, and to apply these results to conservation management. Here we present a high-quality, long-read, de novo genome assembly for one of the world’s most endangered bird species, the Alala. As the only remaining native crow species in Hawaii, the Alala survived solely in a captive breeding program from 2002 until 2016, at which point a long-term reintroduction program was initiated. The high-quality genome assembly was generated to lay the foundation for both comparative genomics studies, and the development of population-level genomic tools that will aid conservation and recovery efforts. We illustrate how the quality of this assembly places it amongst the very best avian genomes assembled to date, comparable to intensively studied model systems. We describe the genome architecture in terms of repetitive elements and runs of homozygosity, and we show that compared with more outbred species, the Alala genome is substantially more homozygous. We also provide annotations for a subset of immunity genes that are likely to be important for conservation applications, and we discuss how this genome is currently being used as a roadmap for downstream conservation applications.

Pushing the limits of de novo genome assembly for complex prokaryotic genomes harboring very long, near identical repeats

10.1101/300186 ◽

2018 ◽

Cited By ~ 3

Author(s):

Michael Schmid ◽

Daniel Frei ◽

Andrea Patrignani ◽

Ralph Schlapbach ◽

Jürg E. Frey ◽

...

Keyword(s):

Dark Matter ◽

Genome Assembly ◽

De Novo ◽

Bacterial Genomes ◽

De Novo Genome Assembly ◽

Assembly Algorithm ◽

Long Reads ◽

Oxford Nanopore ◽

Prokaryotic Genomes ◽

Genome Assemblies

Abstract: Generating a complete, de novo genome assembly for prokaryotes is often considered a solved problem. However, we here show that Pseudomonas koreensis P19E3 harbors multiple, near identical repeat pairs up to 70 kilobase pairs in length. Beyond long repeats, the P19E3 assembly was further complicated by a shufflon region. Its complex genome could not be de novo assembled with long reads produced by Pacific Biosciences’ technology, but required very long reads from the Oxford Nanopore Technology. Another important factor for a full genomic resolution was the choice of assembly algorithm. Importantly, a repeat analysis indicated that very complex bacterial genomes represent a general phenomenon beyond Pseudomonas. Roughly 10% of 9331 complete bacterial and a handful of 293 complete archaeal genomes represented this dark matter for de novo genome assembly of prokaryotes. Several of these dark matter genome assemblies contained repeats far beyond the resolution of the sequencing technology employed and likely contain errors; other genomes were closed employing labor-intensive steps like cosmid libraries, primer walking or optical mapping. Using very long sequencing reads in combination with assemblers capable of resolving long, near identical repeats will bring most prokaryotic genomes within reach of fast and complete de novo genome assembly.

Raven: a de novo genome assembler for long reads

10.1101/2020.08.07.242461 ◽

2020 ◽

Cited By ~ 5

Author(s):

Robert Vaser ◽

Mile Šikić

Keyword(s):

Human Genome ◽

Genome Assembly ◽

De Novo ◽

De Novo Genome Assembly ◽

New Methods ◽

Long Reads ◽

Long Read ◽

Comparable Accuracy ◽

Genome Assembler ◽

Genome Dataset

We present new methods for the improvement of long-read de novo genome assembly incorporated into a straightforward tool called Raven (https://github.com/lbcb-sci/raven). Compared with other assemblers, Raven is one of the two fastest, reconstructs the sequenced genome in the fewest fragments, has better or comparable accuracy, and maintains similar performance for various genomes. Raven takes 500 CPU hours to assemble a 44x human genome dataset in only 259 fragments.

A high-quality de novo genome assembly based on nanopore sequencing of a wild-caught coconut rhinoceros beetle (Oryctes rhinoceros)

10.1101/2021.09.12.459717 ◽

2021 ◽

Author(s):

Igor Filipović ◽

Gordana Rašić ◽

James Hereward ◽

Maria Gharuka ◽

Gregor J Devine ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Nuclear Genome ◽

Assembly Process ◽

Structural Annotation ◽

High Quality ◽

Oryctes Rhinoceros ◽

Rhinoceros Beetle ◽

Long Read ◽

Genome Assemblies

Background: An optimal starting point for relating genome function to organismal biology is a high-quality nuclear genome assembly, and long-read sequencing is revolutionizing the production of this genomic resource in insects. Despite this, nuclear genome assemblies have been under-represented for agricultural insect pests, particularly from the order Coleoptera. Here we present a de novo genome assembly and structural annotation for the coconut rhinoceros beetle, Oryctes rhinoceros (Coleoptera: Scarabaeidae), based on Oxford Nanopore Technologies (ONT) long-read data generated from a wild-caught female, as well as the assembly process that also led to the recovery of the complete circular genome assemblies of the beetle's mitochondrial genome and that of the biocontrol agent, Oryctes rhinoceros nudivirus (OrNV). As an invasive pest of palm trees, O. rhinoceros is undergoing an expansion in its range across the Pacific Islands, requiring new approaches to management that may include strategies facilitated by genome assembly and annotation. Results: High-quality DNA isolated from an adult female was used to create four ONT libraries that were sequenced using four MinION flow cells, producing a total of 27.2 Gb of high-quality long-read sequences. We employed an iterative assembly process and polishing with one lane of high-accuracy Illumina reads, obtaining a final size of the assembly of 377.36 Mb that had high contiguity (fragment N50 length = 12 Mb) and accuracy, as evidenced by the exceptionally high completeness of the benchmarked set of conserved single-copy orthologous genes (BUSCO completeness = 99.11%). These quality metrics place our assembly as the most complete of the published Coleopteran genomes. 
The structural annotation of the nuclear genome assembly contained a highly-accurate set of 16,371 protein-coding genes showing BUSCO completeness of 92.09%, as well as the expected number of non-coding RNAs and the number and structure of paralogous genes in a gene family like Sigma GST. Conclusions: The genomic resources produced in this study form a foundation for further functional genetic research and management programs that may inform the control and surveillance of O. rhinoceros populations, and we demonstrate the efficacy of de novo genome assembly using long-read ONT data from a single field-caught insect.

WENGAN: Efficient and high quality hybrid de novo assembly of human genomes

10.1101/840447 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alex Di Genova ◽

Elena Buena-Atienza ◽

Stephan Ossowski ◽

Marie-France Sagot

Keyword(s):

De Novo ◽

Computational Cost ◽

Sequence Information ◽

Sequencing Data ◽

High Quality ◽

Sequencing Technologies ◽

Human Genomes ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

The continuous improvement of long-read sequencing technologies along with the development of ad hoc algorithms has launched a new de novo assembly era that promises high-quality genomes. However, it has proven difficult to use only long reads to generate accurate genome assemblies of large, repeat-rich human genomes. To date, most of the human genomes assembled from long error-prone reads add accurate short reads to further polish the consensus quality. Here, we report the development of a novel algorithm for hybrid assembly, WENGAN, and the de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms that exploit the sequence information of short and long reads to tackle assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 16.67-62.06 Mb), few assembly errors (contig NGA50: 10.9-45.91 Mb), good consensus quality (QV: 27.79-33.61), and high gene completeness (BUSCO complete: 94.6-95.1%), while consuming low computational resources (CPU hours: 153-1027). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 62.06 Mb (NGA50: 45.91 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb). Providing the highest quality at low computational cost, WENGAN is an important step towards the democratization of the de novo assembly of human genomes. The WENGAN assembler is available at https://github.com/adigenova/wengan
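
To put the consensus quality range in more familiar terms, the sketch below converts QV values to per-base error probabilities. It assumes that QV here is Phred-scaled, which is the usual convention for consensus quality but is not stated in the abstract; the function names are hypothetical.

```python
import math


def qv_to_error(qv):
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_to_qv(error):
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(error)


# QV range quoted in the abstract (Phred scaling is an assumption).
for qv in (27.79, 33.61):
    err = qv_to_error(qv)
    print(f"QV {qv}: about 1 error per {round(1 / err)} bases")
```

Under that assumption, the reported range corresponds to roughly one consensus error every 600 bases at the low end and one every 2,300 bases at the high end.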

A High-Quality, Long-Read De Novo Genome Assembly to Aid Conservation of Hawaii’s Last Remaining Crow Species

Genes ◽

10.3390/genes9080393 ◽

2018 ◽

Vol 9 (8) ◽

pp. 393 ◽

Cited By ~ 7

Author(s):

Jolene T. Sutton ◽

Martin Helmkampf ◽

Cynthia C. Steiner ◽

M. Renee Bellinger ◽

Jonas Korlach ◽

...

Keyword(s):

Genome Assembly ◽

Captive Breeding ◽

Conservation Management ◽

De Novo ◽

Bird Species ◽

Model Systems ◽

Population Declines ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Read

Abstract: Genome-level data can provide researchers with unprecedented precision to examine the causes and genetic consequences of population declines, which can inform conservation management. Here, we present a high-quality, long-read, de novo genome assembly for one of the world’s most endangered bird species, the ʻAlalā (Corvus hawaiiensis; Hawaiian crow). As the only remaining native crow species in Hawaiʻi, the ʻAlalā survived solely in a captive-breeding program from 2002 until 2016, at which point a long-term reintroduction program was initiated. The high-quality genome assembly was generated to lay the foundation for both comparative genomics studies and the development of population-level genomic tools that will aid conservation and recovery efforts. We illustrate how the quality of this assembly places it amongst the very best avian genomes assembled to date, comparable to intensively studied model systems. We describe the genome architecture in terms of repetitive elements and runs of homozygosity, and we show that compared with more outbred species, the ʻAlalā genome is substantially more homozygous. We also provide annotations for a subset of immunity genes that are likely to be important in conservation management, and we discuss how this genome is currently being used as a roadmap for downstream conservation applications.

Chromosome-level hybrid de novo genome assemblies as an attainable option for non-model organisms

10.1101/748228 ◽

2019 ◽

Cited By ~ 2

Author(s):

Coline C. Jaworski ◽

Carson W. Allan ◽

Luciano M. Matzkin

Keyword(s):

Genome Assembly ◽

De Novo ◽

Model Organism ◽

Model Organisms ◽

Sequencing Error ◽

Long Reads ◽

Hybrid Genome ◽

Genome Assemblies ◽

Hybrid Assemblies ◽

Chromosome Level

Abstract: The emergence of third-generation sequencing (3GS; long reads) is bringing closer the goal of chromosome-size fragments in de novo genome assemblies. This allows the exploration of new and broader questions on genome evolution for a number of non-model organisms. However, long-read technologies result in higher sequencing error rates and therefore impose an elevated cost of sufficient coverage to achieve high enough quality. In this context, hybrid assemblies, which combine short reads and long reads, provide an efficient and cost-effective alternative approach to generate de novo, chromosome-level genome assemblies. The array of available software programs for hybrid genome assembly, sequence correction and manipulation is constantly being expanded and improved. This makes it difficult for non-experts to find efficient, fast and tractable computational solutions for genome assembly, especially in the case of non-model organisms lacking a reference genome or one from a closely related species. In this study, we review and test the most recent pipelines for hybrid assemblies, comparing the model organism Drosophila melanogaster to a non-model cactophilic Drosophila, D. mojavensis. We show that it is possible to achieve excellent contiguity on this non-model organism using the DBG2OLC pipeline.
