Purge Haplotigs: Synteny Reduction for Third-gen Diploid Genome Assemblies

Abstract Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembly of highly heterozygous genomes still faces a major problem: where the two haplotypes for a region are highly polymorphic, the synteny is not recognised during assembly. This causes issues with downstream analysis, for example variant discovery using the haploid assembly, or haplotype reconstruction using the diploid assembly. A new pipeline, Purge Haplotigs, was developed specifically for third-gen assemblies to identify and reassign the duplicate contigs. The pipeline takes a draft haplotype-fused assembly or a diploid assembly, together with read alignments, and produces an improved assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing. All assemblies after processing with Purge Haplotigs were less duplicated with minimal impact on genome completeness. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence.
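
One signal that read alignments can provide is depth: when both haplotypes of a region are assembled as separate contigs, each copy tends to collect roughly half of the reads. The sketch below illustrates only that idea. It is not the Purge Haplotigs algorithm, which the abstract does not describe in detail, and the function name, the tolerance and the depth values are all hypothetical.

```python
def flag_suspect_contigs(mean_depth, diploid_depth, tolerance=0.25):
    """Return names of contigs whose mean read depth is near half the diploid depth.

    mean_depth: dict mapping contig name -> mean read depth (hypothetical input).
    diploid_depth: depth expected where both haplotypes are collapsed into one contig.
    tolerance: accepted relative deviation from the half-depth value (assumed).
    """
    half = diploid_depth / 2.0
    low, high = half * (1 - tolerance), half * (1 + tolerance)
    return sorted(name for name, depth in mean_depth.items() if low <= depth <= high)


# Made-up depths for four contigs; ctg2 and ctg3 sit near half of 60x.
depths = {"ctg1": 61.0, "ctg2": 29.5, "ctg3": 33.0, "ctg4": 58.2}
print(flag_suspect_contigs(depths, diploid_depth=60.0))
```

A depth flag like this would only nominate candidates; confirming that a flagged contig really duplicates another one would still need a sequence comparison.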

Contiguity: Contig adjacency graph construction and visualisation

10.7287/peerj.preprints.1037v1 ◽

2015 ◽

Cited By ~ 8

Author(s):

Mitchell J Sullivan ◽

Nouri L Ben Zakour ◽

Brian M Forde ◽

Mitchell Stanton-Cook ◽

Scott A Beatson

Keyword(s):

De Novo ◽

Reference Sequence ◽

De Bruijn Graph ◽

Interactive Software ◽

Graph Exploration ◽

Adjacency Graph ◽

Highly Sensitive ◽

Long Read ◽

Genome Assemblies ◽

Adjacency Graphs

Contiguity is interactive software for the visualization and manipulation of de novo genome assemblies. Contiguity creates and displays information on contig adjacency, which is contextualized by the simultaneous display of a comparison between assembled contigs and reference sequence. Where scaffolders allow unambiguous connections between contigs to be resolved into a single scaffold, Contiguity allows the user to create all potential scaffolds in ambiguous regions of the genome. This enables the resolution of novel sequence or structural variants from the assembly. In addition, Contiguity provides a sequencing- and assembly-agnostic approach for the creation of contig adjacency graphs. To maximize the number of contig adjacencies determined, Contiguity combines information from read pair mappings, sequence overlap and De Bruijn graph exploration. We demonstrate how highly sensitive graphs can be achieved using this method. Contig adjacency graphs allow the user to visualize potential arrangements of contigs in unresolvable areas of the genome. By combining adjacency information with comparative genomics, Contiguity provides an intuitive approach for exploring and improving sequence assemblies. It is also useful in guiding manual closure of long read sequence assemblies. Contiguity is an open source application, implemented using Python and the Tkinter GUI package, that can run on any Unix, OSX or Windows operating system. It has been designed and optimized for bacterial assemblies. Contiguity is available at http://mjsull.github.io/Contiguity.

Identifying the causes and consequences of assembly gaps using a multiplatform genome assembly of a bird-of-paradise

10.1101/2019.12.19.882399 ◽

2019 ◽

Cited By ~ 5

Author(s):

Valentina Peona ◽

Mozes P.K. Blom ◽

Luohao Xu ◽

Reto Burri ◽

Shawn Sullivan ◽

...

Keyword(s):

Dark Matter ◽

Genome Assembly ◽

Sex Chromosome ◽

De Novo ◽

Model Organism ◽

Technology Choice ◽

High Quality ◽

Sequencing Technologies ◽

Downstream Analysis ◽

Genome Assemblies

Abstract Genome assemblies are currently being produced at an impressive rate by consortia and individual laboratories. The low costs and increasing efficiency of sequencing technologies have opened up a whole new world of genomic biodiversity. Although these technologies generate high-quality genome assemblies, there are still genomic regions difficult to assemble, like repetitive elements and GC-rich regions (genomic “dark matter”). In this study, we compare the efficiency of currently used sequencing technologies (short/linked/long reads and proximity ligation maps) and combinations thereof in assembling genomic dark matter starting from the same sample. By adopting different de-novo assembly strategies, we were able to compare each individual draft assembly to a curated multiplatform one and identify the nature of the previously missing dark matter with a particular focus on transposable elements, multi-copy MHC genes, and GC-rich regions. Thanks to this multiplatform approach, we demonstrate the feasibility of producing a high-quality chromosome-level assembly for a non-model organism (paradise crow) for which only suboptimal samples are available. Our approach was able to reconstruct complex chromosomes like the repeat-rich W sex chromosome and several GC-rich microchromosomes. Telomere-to-telomere assemblies are not a reality yet for most organisms, but by leveraging technology choice it is possible to minimize genome assembly gaps for downstream analysis. We provide a roadmap to tailor sequencing projects around the completeness of both the coding and non-coding parts of the genomes.

A high-quality genome assembly from a single, field-collected spotted lanternfly (Lycorma delicatula) using the PacBio Sequel II system

GigaScience ◽

10.1093/gigascience/giz122 ◽

2019 ◽

Vol 8 (10) ◽

Cited By ~ 12

Author(s):

Sarah B Kingan ◽

Julie Urban ◽

Christine C Lambert ◽

Primo Baybayan ◽

Anna K Childers ◽

...

Keyword(s):

Invasive Species ◽

Genome Assembly ◽

De Novo ◽

Fragment Size ◽

High Quality ◽

De Novo Genome Assembly ◽

Lycorma Delicatula ◽

Long Read ◽

Genome Assemblies ◽

High Quality Genome

ABSTRACT Background A high-quality reference genome is an essential tool for applied and basic research on arthropods. Long-read sequencing technologies may be used to generate more complete and contiguous genome assemblies than alternate technologies; however, long-read methods have historically had greater input DNA requirements and higher costs than next-generation sequencing, which are barriers to their use on many samples. Here, we present a 2.3 Gb de novo genome assembly of a field-collected adult female spotted lanternfly (Lycorma delicatula) using a single Pacific Biosciences SMRT Cell. The spotted lanternfly is an invasive species recently discovered in the northeastern United States that threatens to damage economically important crop plants in the region. Results The DNA from 1 individual was used to make 1 standard, size-selected library with an average DNA fragment size of ∼20 kb. The library was run on 1 Sequel II SMRT Cell 8M, generating a total of 132 Gb of long-read sequences, of which 82 Gb were from unique library molecules, representing ∼36× coverage of the genome. The assembly had high contiguity (contig N50 length = 1.5 Mb), completeness, and sequence level accuracy as estimated by conserved gene set analysis (96.8% of conserved genes both complete and without frame shift errors). Furthermore, it was possible to segregate more than half of the diploid genome into the 2 separate haplotypes. The assembly also recovered 2 microbial symbiont genomes known to be associated with L. delicatula, each microbial genome being assembled into a single contig. Conclusions We demonstrate that field-collected arthropods can be used for the rapid generation of high-quality genome assemblies, an attractive approach for projects on emerging invasive species, disease vectors, or conservation efforts of endangered species.
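
The coverage figure quoted in the Results follows from simple division. The short check below assumes the 2.3 Gb assembly size is used as the genome size; the function name is illustrative only.

```python
def fold_coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


unique_bases = 82e9   # 82 Gb from unique library molecules (from the abstract)
genome_size = 2.3e9   # 2.3 Gb assembly, taken here as the genome size (assumption)

print(f"{fold_coverage(unique_bases, genome_size):.1f}x")  # 35.7x, i.e. roughly 36x
```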

High contiguity long read assembly of Brassica nigra allows localization of active centromeres and provides insights into the ancestral Brassica genome

10.1101/2020.02.03.932665 ◽

2020 ◽

Cited By ~ 5

Author(s):

Sampath Perumal ◽

Chu Shin Koh ◽

Lingling Jin ◽

Miles Buchwaldt ◽

Erin Higgins ◽

...

Keyword(s):

De Novo ◽

Low Complexity ◽

Error Rates ◽

Brassica Nigra ◽

Genome Integrity ◽

Ancestral Genome ◽

Genomic Distance ◽

Long Read ◽

Genome Assemblies ◽

Technology Comparison

Abstract High-quality nanopore genome assemblies were generated for two genotypes (Ni100 and CN115125) of Brassica nigra, a member of the agronomically important Brassica species. The N50 contig lengths for the two assemblies were 17.1 Mb (58 contigs) and 0.29 Mb (963 contigs), respectively, reflecting recent improvements in the technology. Comparison with a de novo short read assembly for Ni100 corroborated genome integrity and quantified sequence-related error rates (0.002%). The contiguity and coverage allowed unprecedented access to low complexity regions of the genome. Pericentromeric regions and coincidence of hypo-methylation enabled localization of active centromeres and identified a novel centromere-associated ALE class I element which appears to have proliferated through relatively recent nested transposition events (<1 million years ago). Computational abstraction was used to define a post-triplication Brassica specific ancestral genome and to calculate the extensive rearrangements that define the genomic distance separating B. nigra from its diploid relatives.
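
For readers unfamiliar with the N50 statistic quoted above, it is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. A minimal sketch follows, with made-up contig lengths rather than data from the paper.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Toy assembly: 290 bases in total, so the halfway point is 145.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # 70, because 80 + 70 = 150 >= 145
```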

Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies

10.1101/2020.03.15.992941 ◽

2020 ◽

Cited By ~ 15

Author(s):

Arang Rhie ◽

Brian P. Walenz ◽

Sergey Koren ◽

Adam M. Phillippy

Keyword(s):

De Novo ◽

High Accuracy ◽

Link Type ◽

Base Level ◽

Project Home Page ◽

Set Operations ◽

Assembly Evaluation ◽

Long Read ◽

Genome Assemblies ◽

Reference Genomes

Abstract Recent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation.

Availability of data and material
Project name: Merqury
Project home page: https://github.com/marbl/merqury, https://github.com/marbl/meryl
Archived version: https://github.com/marbl/merqury/releases/tag/v1.0
Operating system(s): Platform independent
Programming language: C++, Java, Perl
Other requirements: gcc 4.8 or higher, java 1.6 or higher
License: Public domain (see https://github.com/marbl/merqury/blob/master/README.license)
Any restrictions to use by non-academics: No restrictions applied
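
The k-mer comparison described in the Merqury abstract can be illustrated with plain Python sets. This is a toy sketch, not Merqury's implementation: it uses distinct k-mers only, applies no count-based filtering of read k-mers, and the function names are hypothetical. The quality value uses a k-mer survival model, QV = -10 log10(1 - (shared / total)^(1/k)), which is stated here as an assumption rather than taken from the abstract.

```python
import math


def kmers(seq, k):
    """Set of all distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def read_kmer_set(reads, k):
    """Union of the k-mer sets of all reads."""
    return set().union(*(kmers(r, k) for r in reads))


def kmer_qv(assembly, reads, k):
    """Phred-scaled accuracy from the share of assembly k-mers also seen in reads."""
    asm = kmers(assembly, k)
    shared = len(asm & read_kmer_set(reads, k))
    if shared == len(asm):
        return math.inf  # no unsupported k-mers, so the estimated error rate is zero
    per_base_ok = (shared / len(asm)) ** (1 / k)
    return -10 * math.log10(1 - per_base_ok)


def kmer_completeness(assembly, reads, k):
    """Fraction of read k-mers that are present in the assembly."""
    read_set = read_kmer_set(reads, k)
    return len(kmers(assembly, k) & read_set) / len(read_set)


reads = ["ACGTT", "GTTGC", "TGCAA"]
print(kmer_completeness("ACGTTGCA", reads, k=3))  # 6 of the 7 read k-mers are found
print(kmer_qv("ACGTAGCA", reads, k=3))            # finite QV: half its k-mers lack read support
```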

Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit

10.1101/715722 ◽

2019 ◽

Cited By ~ 21

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Ryan Lorig-Roach ◽

Marina Haukness ◽

Hugh E. Olsen ◽

...

Keyword(s):

Human Genome ◽

De Novo ◽

Proximity Ligation ◽

Current State ◽

Human Genomes ◽

Sequencing Method ◽

Human Genome Assembly ◽

Long Read ◽

Genome Assemblies ◽

Assembly Performance

Abstract Present workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms. On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid and trio-binned human samples in terms of accuracy, cost, and time and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.
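
The pairing of 99.9% identity with QV30 in the abstract follows from the Phred scale, where QV = -10 log10(error rate). A small sketch of the conversion, with an illustrative function name:

```python
import math


def identity_to_qv(identity):
    """Convert sequence identity (a fraction in [0, 1)) to a Phred-scaled quality value."""
    if not 0 <= identity < 1:
        raise ValueError("identity must be in [0, 1)")
    return -10 * math.log10(1 - identity)


print(round(identity_to_qv(0.999)))  # 30: one error per 1,000 bases
print(round(identity_to_qv(0.90)))   # 10: the 90% median read identity quoted above
```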

The evaluation of RNA-Seq de novo assembly by PacBio long read sequencing

10.1101/735621 ◽

2019 ◽

Author(s):

Yifan Yang ◽

Michael Gribskov

Keyword(s):

Real Time ◽

De Novo ◽

Critical Issue ◽

Evaluation Methods ◽

Model Organisms ◽

Rna Seq ◽

Long Reads ◽

Long Read ◽

Set Up ◽

Downstream Analysis

Abstract RNA-Seq de novo assembly is an important method for generating transcriptomes of non-model organisms before any downstream analysis. Although many de novo assembly methods have been developed, there is still no consensus on how to evaluate them. Setting up a benchmark for evaluating the quality of de novo assemblies is therefore critical. Addressing this challenge will deepen our insight into the properties of different de novo assemblers and their evaluation methods, and provide hints on choosing the best assembly sets as transcriptomes of non-model organisms for further functional analysis. In this article, we generate a “real time” transcriptome using PacBio long reads as a benchmark for evaluating five de novo assemblers and two model-based de novo assembly evaluation methods. By comparing the de novo assemblies generated from RNA-Seq short reads with the “real time” transcriptome from the same biological sample, we find that Trinity is best at completeness, generating more assemblies than the alternative assemblers, but is less continuous and has more misassemblies; Oases is best at continuity and specificity, but is less complete; and the performance of SOAPdenovo-Trans, Trans-ABySS and IDBA-Tran falls in between. For evaluation methods, DETONATE leverages multiple aspects of the assembly set and ranks the assembly set with an average performance as the best, while the contig score can serve as a good metric for selecting assemblies with high completeness, specificity and continuity but is not sensitive to misassemblies; the TransRate contig score is useful for removing misassemblies, yet the optimal set often contains too few assemblies to be used as a transcriptome.

Draft genome assemblies using sequencing reads from Oxford Nanopore Technology and Illumina platforms for four species of North American Fundulus killifish

GigaScience ◽

10.1093/gigascience/giaa067 ◽

2020 ◽

Vol 9 (6) ◽

Cited By ~ 3

Author(s):

Lisa K Johnson ◽

Ruta Sahasrabudhe ◽

James Anthony Gill ◽

Jennifer L Roach ◽

Lutz Froenicke ◽

...

Keyword(s):

North American ◽

De Novo ◽

Draft Genome ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Sequence Coverage ◽

Short Read ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Background Whole-genome sequencing data from wild-caught individuals of closely related North American killifish species (Fundulus xenicus, Fundulus catenatus, Fundulus nottii, and Fundulus olivaceus) were obtained using long-read Oxford Nanopore Technology (ONT) PromethION and short-read Illumina platforms. Findings Draft de novo reference genome assemblies were generated using a combination of long and short sequencing reads. For each species, the PromethION platform was used to generate 30–45× sequence coverage, and the Illumina platform was used to generate 50–160× sequence coverage. Illumina-only assemblies were fragmented with high numbers of contigs, while ONT-only assemblies were error prone with low BUSCO scores. The highest N50 values, ranging from 0.4 to 2.7 Mb, were from assemblies generated using a combination of short- and long-read data. BUSCO scores were consistently >90% complete using the Eukaryota database. Conclusions High-quality genomes can be obtained from a combination of using short-read Illumina data to polish assemblies generated with long-read ONT data. Draft assemblies and raw sequencing data are available for public use. We encourage use and reuse of these data for assembly benchmarking and other analyses.

Reconstruction of proto-vertebrate, proto-cyclostome and proto-gnathostome genomes provides new insights into early vertebrate evolution

Nature Communications ◽

10.1038/s41467-021-24573-z ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Yoichiro Nakatani ◽

Prashant Shingate ◽

Vydianathan Ravi ◽

Nisha E. Pillai ◽

Aravind Prasad ◽

...

Keyword(s):

Evolutionary History ◽

Gene Loss ◽

De Novo ◽

Genome Structure ◽

Origin And Evolution ◽

Long Read ◽

History Of ◽

And Function ◽

Genome Assemblies ◽

Key Questions

Abstract Ancient polyploidization events have had a lasting impact on vertebrate genome structure, organization and function. Some key questions regarding the number of ancient polyploidization events and their timing in relation to the cyclostome-gnathostome divergence have remained contentious. Here we generate de novo long-read-based chromosome-scale genome assemblies for the Japanese lamprey and elephant shark. Using these and other representative genomes and developing algorithms for the probabilistic macrosynteny model, we reconstruct high-resolution proto-vertebrate, proto-cyclostome and proto-gnathostome genomes. Our reconstructions resolve key questions regarding the early evolutionary history of vertebrates. First, cyclostomes diverged from the lineage leading to gnathostomes after a shared tetraploidization (1R) but before a gnathostome-specific tetraploidization (2R). Second, the cyclostome lineage experienced an additional hexaploidization. Third, 2R in the gnathostome lineage was an allotetraploidization event, and biased gene loss from one of the subgenomes shaped the gnathostome genome by giving rise to remarkably conserved microchromosomes. Thus, our reconstructions reveal the major evolutionary events and offer new insights into the origin and evolution of vertebrate genomes.

LongStitch: High-quality genome assembly correction and scaffolding using long reads

10.1101/2021.06.17.448848 ◽

2021 ◽

Author(s):

Lauren Coombe ◽

Janet X Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages: initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 2.0-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to the state-of-the-art long-read scaffolder LRScaf in most tests, and consistently runs in under five hours using less than 23 GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
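
The abstract mentions minimizer mappings without defining them. In general terms, a minimizer is the smallest k-mer within each window of w consecutive k-mers, so two sequences that share a long enough stretch will tend to pick the same minimizers there. The sketch below is a generic illustration, not ntLink's implementation: it orders k-mers lexicographically for readability (practical tools usually order by a hash), and the function name and parameters are hypothetical.

```python
def minimizers(seq, k, w):
    """Return the sorted (position, k-mer) minimizers of seq.

    Each window of w consecutive k-mers contributes its lexicographically
    smallest k-mer; ties go to the leftmost position.
    """
    kmer_list = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmer_list) - w + 1):
        kmer, pos = min(kmer_list[start:start + w])
        picked.add((pos, kmer))
    return sorted(picked)


print(minimizers("ACGTACGT", k=3, w=2))
# [(0, 'ACG'), (1, 'CGT'), (2, 'GTA'), (4, 'ACG')]
```

Because only a subset of k-mers is kept, comparing minimizer sets is much cheaper than comparing all k-mers, which fits the "lightweight" description in the abstract.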
