Raven: a de novo genome assembler for long reads

We present new methods for the improvement of long-read de novo genome assembly incorporated into a straightforward tool called Raven (https://github.com/lbcb-sci/raven). Compared with other assemblers, Raven is one of the two fastest, reconstructs the sequenced genome in the fewest fragments, has better or comparable accuracy, and maintains similar performance across various genomes. Raven takes 500 CPU hours to assemble a 44x human genome dataset in only 259 fragments.

LongStitch: High-quality genome assembly correction and scaffolding using long reads

10.1101/2021.06.17.448848 ◽

2021 ◽

Author(s):

Lauren Coombe ◽

Janet X Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions than short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which include initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 2.0-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies than the state-of-the-art long-read scaffolder LRScaf in most tests, and consistently runs in under five hours using less than 23 GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.

LongStitch: high-quality genome assembly correction and scaffolding using long reads

BMC Bioinformatics ◽

10.1186/s12859-021-04451-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lauren Coombe ◽

Janet X. Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Abstract Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions than short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which include initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies than the state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
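
The abstract above says that ntLink joins contigs using lightweight minimizer mappings. As a rough illustration of that general idea only, the following Python sketch computes window minimizers for two sequences and reports the ones they share. It is not ntLink's implementation: the function names, the choice of hash, and the parameters k and w are assumptions made for this example, and strand handling (canonical k-mers) is left out for brevity.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Stable 64-bit hash, so results do not depend on Python's hash seed."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) for the smallest-hash k-mer in each window
    of w consecutive k-mers, skipping repeats from overlapping windows."""
    hashed = [(kmer_hash(seq[i:i + k]), i) for i in range(len(seq) - k + 1)]
    picked: list[tuple[int, str]] = []
    for start in range(len(hashed) - w + 1):
        _, pos = min(hashed[start:start + w])
        if not picked or picked[-1][0] != pos:
            picked.append((pos, seq[pos:pos + k]))
    return picked


def shared_minimizers(read: str, contig: str, k: int = 5, w: int = 3):
    """Contig minimizers (position, k-mer) that also occur in the read."""
    read_set = {kmer for _, kmer in minimizers(read, k, w)}
    return [(pos, kmer) for pos, kmer in minimizers(contig, k, w) if kmer in read_set]


# Hypothetical toy data: the "read" is an exact substring of the "contig".
contig = "ACGTTGCATGGCATTACGGA"
read = contig[4:16]
hits = shared_minimizers(read, contig)
```

Because only one k-mer per window is kept, the sequences are compared through a small subset of their k-mers, which is what makes this kind of mapping lightweight. A long read sharing minimizers with two different contigs is the sort of evidence a scaffolder could use to order and orient them.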

Accurate long-read de novo assembly evaluation with Inspector

Genome Biology ◽

10.1186/s13059-021-02527-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yu Chen ◽

Yixin Zhang ◽

Amy Y. Wang ◽

Min Gao ◽

Zechen Chong

Keyword(s):

Genome Assembly ◽

De Novo Assembly ◽

In Silico ◽

Large Scale ◽

De Novo ◽

Small Scale ◽

De Novo Genome Assembly ◽

Consensus Sequences ◽

Assembly Evaluation ◽

Long Read

Abstract Long-read de novo genome assembly continues to advance rapidly. However, there is a lack of effective tools to accurately evaluate the assembly results, especially for structural errors. We present Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions. Based on in silico and long-read assembly results from multiple long-read data and assemblers, we demonstrate that in addition to providing generic metrics, Inspector can accurately identify both large-scale and small-scale assembly errors.

A haplotype-resolved, de novo genome assembly for the wood tiger moth (Arctia plantaginis) through trio binning

GigaScience ◽

10.1093/gigascience/giaa088 ◽

2020 ◽

Vol 9 (8) ◽

Cited By ~ 2

Author(s):

Eugenie C Yen ◽

Shane A McCarthy ◽

Juan A Galarza ◽

Tomas N Generalovic ◽

Sarah Pelan ◽

...

Keyword(s):

Population Structure ◽

Genome Assembly ◽

De Novo ◽

Consensus Sequence ◽

Parental Origin ◽

De Novo Genome Assembly ◽

Geographic Population ◽

Long Reads ◽

Tiger Moth ◽

Innovative Solution

ABSTRACT Background Diploid genome assembly is typically impeded by heterozygosity because it introduces errors when haplotypes are collapsed into a consensus sequence. Trio binning offers an innovative solution that exploits heterozygosity for assembly. Short, parental reads are used to assign parental origin to long reads from their F1 offspring before assembly, enabling complete haplotype resolution. Trio binning could therefore provide an effective strategy for assembling highly heterozygous genomes, which are traditionally problematic, such as insect genomes. This includes the wood tiger moth (Arctia plantaginis), which is an evolutionary study system for warning colour polymorphism. Findings We produced a high-quality, haplotype-resolved assembly for Arctia plantaginis through trio binning. We sequenced a same-species family (F1 heterozygosity ∼1.9%) and used parental Illumina reads to bin 99.98% of offspring Pacific Biosciences reads by parental origin, before assembling each haplotype separately and scaffolding with 10X linked reads. Both assemblies are contiguous (mean scaffold N50: 8.2 Mb) and complete (mean BUSCO completeness: 97.3%), with annotations and 31 chromosomes identified through karyotyping. We used the assembly to analyse genome-wide population structure and relationships between 40 wild resequenced individuals from 5 populations across Europe, revealing the Georgian population as the most genetically differentiated with the lowest genetic diversity. Conclusions We present the first invertebrate genome to be assembled via trio binning. This assembly is one of the highest quality genomes available for Lepidoptera, supporting trio binning as a potent strategy for assembling heterozygous genomes. Using our assembly, we provide genomic insights into the geographic population structure of A. plantaginis.
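
The trio-binning step described above (short parental reads assign parental origin to the offspring's long reads) can be pictured with a toy k-mer classifier. This is a simplified sketch under stated assumptions, not the pipeline used in the study: it assumes binning is done by counting k-mers unique to each parent, and the function names, the value of k, and the example sequences are all hypothetical.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """K-mers seen in only one parent's short reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read: str, mat_only: set[str], pat_only: set[str], k: int) -> str:
    """Assign an offspring read to the parent whose unique k-mers it carries more of."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"


# Hypothetical toy data: the parents differ at a single base.
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], k=4)
labels = [bin_read(r, mat_only, pat_only, k=4) for r in ("GTACGTA", "GTTCGTA", "ACGT")]
```

Higher heterozygosity means more parent-specific k-mers per read, which is why the abstract describes trio binning as exploiting heterozygosity. A practical implementation would also need to handle sequencing errors and normalise the counts, which this sketch ignores.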

Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads

Nature Biotechnology ◽

10.1038/s41587-020-0719-5 ◽

2020 ◽

Author(s):

David Porubsky ◽

Peter Ebert ◽

Peter A. Audano ◽

Mitchell R. Vollger ◽

...

Keyword(s):

Single Cell ◽

Genome Assembly ◽

De Novo ◽

Error Rates ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

De Novo Genome Assembly ◽

Parental Data ◽

Human Genome Assembly ◽

Long Read

Abstract Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing1,2 with continuous long-read or high-fidelity3 sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.
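
To put the reported accuracy in concrete terms, the short Python sketch below converts between a quality value and a per-base error rate. It assumes the quality value is on the usual Phred scale, QV = -10 * log10(error rate); the abstract does not spell out the definition, and the function names are illustrative.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(error_rate)


# Under this assumption, QV 40 corresponds to about one error per 10,000 bases.
errors_per_base = qv_to_error_rate(40)
```

On this scale every 10 points is a tenfold change, so an assembly with quality value > 40 has fewer than one expected error per 10,000 bases.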

Long-read assembly of the Chinese rhesus macaque genome and identification of ape-specific structural variants

Nature Communications ◽

10.1038/s41467-019-12174-w ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 12

Author(s):

Yaoxi He ◽

Xin Luo ◽

Bin Zhou ◽

Ting Hu ◽

Xiaoyu Meng ◽

...

Keyword(s):

Rhesus Macaque ◽

Genome Assembly ◽

De Novo ◽

Gene Annotation ◽

Large Body ◽

Phenotypic Traits ◽

Structural Variants ◽

De Novo Genome Assembly ◽

Chinese Rhesus Macaque ◽

Long Read

Abstract We present a high-quality de novo genome assembly (rheMacS) of the Chinese rhesus macaque (Macaca mulatta) using long-read sequencing and multiplatform scaffolding approaches. Compared to the current Indian rhesus macaque reference genome (rheMac8), rheMacS increases sequence contiguity 75-fold, closing 21,940 of the remaining assembly gaps (60.8 Mbp). We improve gene annotation by generating more than two million full-length transcripts from ten different tissues by long-read RNA sequencing. We sequence resolve 53,916 structural variants (96% novel) and identify 17,000 ape-specific structural variants (ASSVs) based on comparison to ape genomes. Many ASSVs map within ChIP-seq predicted enhancer regions where apes and macaque show diverged enhancer activity and gene expression. We further characterize a subset that may contribute to ape- or great-ape-specific phenotypic traits, including taillessness, brain volume expansion, improved manual dexterity, and large body size. The rheMacS genome assembly serves as an ideal reference for future biomedical and evolutionary studies.

A high-quality genome assembly from a single, field-collected spotted lanternfly (Lycorma delicatula) using the PacBio Sequel II system

GigaScience ◽

10.1093/gigascience/giz122 ◽

2019 ◽

Vol 8 (10) ◽

Cited By ~ 12

Author(s):

Sarah B Kingan ◽

Julie Urban ◽

Christine C Lambert ◽

Primo Baybayan ◽

Anna K Childers ◽

...

Keyword(s):

Invasive Species ◽

Genome Assembly ◽

De Novo ◽

Fragment Size ◽

High Quality ◽

De Novo Genome Assembly ◽

Lycorma Delicatula ◽

Long Read ◽

Genome Assemblies ◽

High Quality Genome

ABSTRACT Background A high-quality reference genome is an essential tool for applied and basic research on arthropods. Long-read sequencing technologies may be used to generate more complete and contiguous genome assemblies than alternate technologies; however, long-read methods have historically had greater input DNA requirements and higher costs than next-generation sequencing, which are barriers to their use on many samples. Here, we present a 2.3 Gb de novo genome assembly of a field-collected adult female spotted lanternfly (Lycorma delicatula) using a single Pacific Biosciences SMRT Cell. The spotted lanternfly is an invasive species recently discovered in the northeastern United States that threatens to damage economically important crop plants in the region. Results The DNA from 1 individual was used to make 1 standard, size-selected library with an average DNA fragment size of ∼20 kb. The library was run on 1 Sequel II SMRT Cell 8M, generating a total of 132 Gb of long-read sequences, of which 82 Gb were from unique library molecules, representing ∼36× coverage of the genome. The assembly had high contiguity (contig N50 length = 1.5 Mb), completeness, and sequence level accuracy as estimated by conserved gene set analysis (96.8% of conserved genes both complete and without frame shift errors). Furthermore, it was possible to segregate more than half of the diploid genome into the 2 separate haplotypes. The assembly also recovered 2 microbial symbiont genomes known to be associated with L. delicatula, each microbial genome being assembled into a single contig. Conclusions We demonstrate that field-collected arthropods can be used for the rapid generation of high-quality genome assemblies, an attractive approach for projects on emerging invasive species, disease vectors, or conservation efforts of endangered species.
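
Two figures in this abstract are simple to reproduce by hand: the approximate coverage (unique sequence divided by assembly size) and the contig N50. The Python sketch below shows both calculations. The function names and the toy contig lengths are illustrative, and N50 is taken in its common sense: the length of the contig at which the cumulative total, counting from the longest contig down, first reaches half of the assembly.

```python
def coverage(sequenced_bases: float, genome_size: float) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return sequenced_bases / genome_size


def n50(lengths: list[int]) -> int:
    """Length of the contig at which the running total, from longest to
    shortest, first reaches at least half of the summed contig length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one contig")


# 82 Gb of unique reads over a 2.3 Gb assembly, as reported above.
depth = coverage(82e9, 2.3e9)

# Hypothetical toy contig lengths, for illustration only.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

The coverage calculation gives roughly 35.7, consistent with the ∼36× stated in the abstract. N50 summarises contiguity in a single number, which is why it appears across several of the abstracts on this page.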

Whole-genome sequencing of 182 Bursaphelenchus xylophilus strains generates first long read based de novo genome assembly and reveals temperature associated population structure

10.22541/au.159352211.19983305 ◽

2020 ◽

Author(s):

Xiaolei Ding ◽

Yunfei Guo ◽

Jianren Ye ◽

Xiaoqin Wu ◽

Sixi Lin ◽

...

Keyword(s):

Population Structure ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Genome Assembly ◽

De Novo ◽

Bursaphelenchus Xylophilus ◽

Whole Genome ◽

De Novo Genome Assembly ◽

Long Read

A high-quality, long-read de novo genome assembly to aid conservation of Hawaii’s last remaining crow species

10.1101/349035 ◽

2018 ◽

Author(s):

Jolene T. Sutton ◽

Martin Helmkampf ◽

Cynthia C. Steiner ◽

M. Renee Bellinger ◽

Jonas Korlach ◽

...

Keyword(s):

Genome Assembly ◽

Captive Breeding ◽

De Novo ◽

Bird Species ◽

Population Level ◽

Model Systems ◽

Population Declines ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Read

Abstract Genome-level data can provide researchers with unprecedented precision to examine the causes and genetic consequences of population declines, and to apply these results to conservation management. Here we present a high-quality, long-read, de novo genome assembly for one of the world’s most endangered bird species, the Alala. As the only remaining native crow species in Hawaii, the Alala survived solely in a captive breeding program from 2002 until 2016, at which point a long-term reintroduction program was initiated. The high-quality genome assembly was generated to lay the foundation for both comparative genomics studies, and the development of population-level genomic tools that will aid conservation and recovery efforts. We illustrate how the quality of this assembly places it amongst the very best avian genomes assembled to date, comparable to intensively studied model systems. We describe the genome architecture in terms of repetitive elements and runs of homozygosity, and we show that compared with more outbred species, the Alala genome is substantially more homozygous. We also provide annotations for a subset of immunity genes that are likely to be important for conservation applications, and we discuss how this genome is currently being used as a roadmap for downstream conservation applications.

A near complete genome for goat genetic and genomic research

Genetics Selection Evolution ◽

10.1186/s12711-021-00668-5 ◽

2021 ◽

Vol 53 (1) ◽

Author(s):

Ran Li ◽

Peng Yang ◽

Xuelei Dai ◽

Hojjat Asadollahpour Nanaei ◽

Wenwen Fang ◽

...

Keyword(s):

Y Chromosome ◽

Genome Assembly ◽

De Novo ◽

Genetic Research ◽

Genomic Research ◽

Dairy Goat ◽

De Novo Genome Assembly ◽

Important Species ◽

Long Read ◽

High Level

Abstract Background Goat, one of the first domesticated livestock, is a worldwide important species both culturally and economically. The current goat reference genome, known as ARS1, is reported as the first nonhuman genome assembly using 69× PacBio sequencing. However, ARS1 suffers from incomplete X chromosome and highly fragmented Y chromosome scaffolds. Results Here, we present a very high-quality de novo genome assembly, Saanen_v1, from a male Saanen dairy goat, with the first goat Y chromosome scaffold based on 117× PacBio long-read sequencing and 118× Hi-C data. Saanen_v1 displays a high level of completeness thanks to the presence of centromeric and telomeric repeats at the proximal and distal ends of two-thirds of the autosomes, and a much reduced number of gaps (169 vs. 773). The completeness and accuracy of the Saanen_v1 genome assembly are also evidenced by more assembled sequences on the chromosomes (2.63 Gb for Saanen_v1 vs. 2.58 Gb for ARS1), a slightly increased mapping ratio for transcriptomic data, and more genes anchored to chromosomes. The eight putative large assembly errors (1 to ~ 7 Mb each) found in ARS1 were amended, and for the first time, the substitution rate of this ruminant Y chromosome was estimated. Furthermore, sequence improvement in Saanen_v1, compared with ARS1, enables us to assign the likely correct positions for 4.4% of the single nucleotide polymorphism (SNP) probes in the widely used GoatSNP50 chip. Conclusions The updated goat genome assembly including both sex chromosomes (X and Y) and the autosomes with high-resolution quality will serve as a valuable resource for goat genetic research and applications.
