Major improvements to the Heliconius melpomene genome assembly used to confirm 10 chromosome fusion events in 6 million years of butterfly evolution

The Heliconius butterflies are a widely studied adaptive radiation of 46 species spread across Central and South America, several of which are known to hybridise in the wild. Here, we present a substantially improved assembly of the Heliconius melpomene genome, developed using novel methods that should be applicable to improving other genome assemblies produced using short-read sequencing. Firstly, we whole-genome sequenced a pedigree to produce a linkage map incorporating 99% of the genome. Secondly, we incorporated haplotype scaffolds extensively to produce a more complete haploid version of the draft genome. Thirdly, we incorporated ~20x coverage of Pacific Biosciences sequencing and scaffolded the haploid genome using an assembly of this long-read sequence. These improvements result in a genome of 795 scaffolds, 275 Mb in length, with an N50 of 2.1 Mb, an L50 of 34, and with 99% of the genome placed and 84% anchored on chromosomes. We use the new genome assembly to confirm that the Heliconius genome underwent 10 chromosome fusions since the split with its sister genus Eueides, over a period of about 6 million years.
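The contiguity statistics quoted in this abstract can be illustrated with a short sketch. It assumes the usual convention that N50 is a length (the length of the scaffold at which the running total, taken from longest to shortest, first reaches half the assembly size) and L50 is a count (how many scaffolds that takes). The function name `n50_l50` and the toy lengths are illustrative only and are not taken from the paper.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths."""
    ordered = sorted(lengths, reverse=True)  # longest scaffold first
    half = sum(ordered) / 2                  # half of the total assembly size
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the N50; 'count' scaffolds were needed, so it is the L50
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy assembly of four scaffolds (arbitrary units), not real data
toy_lengths = [40, 30, 20, 10]
print(n50_l50(toy_lengths))
```

Read this way, the figures in the abstract would mean that the 34 longest scaffolds cover at least half of the 275 Mb assembly, and the shortest of those 34 is about 2.1 Mb long.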


LongStitch: High-quality genome assembly correction and scaffolding using long reads

10.1101/2021.06.17.448848 ◽

2021 ◽

Author(s):

Lauren Coombe ◽

Janet X Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 2.0-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently runs in under five hours using less than 23GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
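The idea behind minimizer mappings can be sketched briefly. This is a generic illustration of (w, k) minimizers, not ntLink's implementation: it assumes lexicographic ordering of k-mers (real tools typically order by hash value), and the function names and default parameters are hypothetical.

```python
def minimizers(seq, k=5, w=4):
    """Return the set of (k-mer, position) minimizers of seq.

    For every window of w consecutive k-mers, keep the smallest k-mer
    (lexicographic order is an assumption made for this sketch).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Ties are resolved in favour of the leftmost k-mer in the window
        offset = min(range(w), key=lambda j: window[j])
        picked.add((window[offset], start + offset))
    return picked


def shared_minimizers(a, b, k=5, w=4):
    """Return the minimizer k-mers that two sequences have in common."""
    kmers_a = {kmer for kmer, _ in minimizers(a, k, w)}
    kmers_b = {kmer for kmer, _ in minimizers(b, k, w)}
    return kmers_a & kmers_b


print(sorted(minimizers("ACGT", k=2, w=2)))
```

Because only one k-mer per window is kept, the sketch of a sequence is much smaller than its full k-mer set, which is what makes such mappings lightweight. A long read whose minimizers match two different contigs is evidence that the contigs are adjacent.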


Extensive genomic and transcriptomic variation defines the chromosome-scale assembly of Haemonchus contortus, a model gastrointestinal worm

10.1101/2020.02.18.945246 ◽

2020 ◽

Cited By ~ 2

Author(s):

Stephen R. Doyle ◽

Alan Tracey ◽

Roz Laing ◽

Nancy Holroyd ◽

David Bartley ◽

...

Keyword(s):

Genome Assembly ◽

Haemonchus Contortus ◽

Vaccine Development ◽

De Novo ◽

Anthelmintic Resistance ◽

Draft Genome ◽

Small Ruminants ◽

High Quality ◽

Long Read ◽

Genome Assemblies

Abstract Background Haemonchus contortus is a globally distributed and economically important gastrointestinal pathogen of small ruminants, and has become the key nematode model for studying anthelmintic resistance and other parasite-specific traits among a wider group of parasites including major human pathogens. Two draft genome assemblies for H. contortus were reported in 2013; however, both were highly fragmented, incomplete, and differed from one another in important respects. While the introduction of long-read sequencing has significantly increased the rate of production and contiguity of de novo genome assemblies broadly, achieving high-quality genome assemblies for small, genetically diverse, outcrossing eukaryotic organisms such as H. contortus remains a significant challenge. Results Here, we report using PacBio long-read and OpGen and 10X Genomics long-molecule methods to generate a highly contiguous 283.4 Mbp chromosome-scale genome assembly including a resolved sex chromosome. We show a remarkable pattern of almost complete conservation of chromosome content (synteny) with Caenorhabditis elegans, but almost no conservation of gene order. Long-read transcriptome sequence data has allowed us to define coordinated transcriptional regulation throughout the life cycle of the parasite, and refine our understanding of cis- and trans-splicing relative to that observed in C. elegans. Finally, we use this assembly to give a comprehensive picture of chromosome-wide genetic diversity both within a single isolate and globally. Conclusions The H. contortus MHco3(ISE).N1 genome assembly presented here represents the most contiguous and resolved nematode assembly outside of the Caenorhabditis genus to date, together with one of the highest-quality sets of predicted gene features. These data provide a high-quality comparison for understanding the evolution and genomics of Caenorhabditis and other nematodes, and extend the experimental tractability of this model parasitic nematode in understanding pathogen biology, drug discovery and vaccine development, and important adaptive traits such as drug resistance.


LongStitch: high-quality genome assembly correction and scaffolding using long reads

BMC Bioinformatics ◽

10.1186/s12859-021-04451-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lauren Coombe ◽

Janet X. Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Abstract Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. 
The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.


MaGuS: a tool for map-guided scaffolding and quality assessment of genome assemblies

10.1101/032045 ◽

2015 ◽

Author(s):

Mohammed-Amin Madoui ◽

Carole Dossat ◽

Leo d'Agata ◽

Edwin van der Vossen ◽

Jan van Oeveren ◽

...

Keyword(s):

High Throughput ◽

Genome Assembly ◽

High Throughput Sequencing ◽

Draft Genome ◽

Genetic Maps ◽

Sequencing Data ◽

A Genome ◽

Genome Map ◽

Genome Assemblies ◽

Complex Genome

Background Scaffolding is a crucial step in the genome assembly process. Current methods based on large fragment paired-end reads or long reads allow an increase in continuity but often lack consistency in repetitive regions, resulting in fragmented assemblies. Here, we describe a novel tool to link assemblies to a genome map to aid complex genome reconstruction by detecting assembly errors and allowing scaffold ordering and anchoring. Results We present MaGuS (map-guided scaffolding), a modular tool that uses a draft genome assembly, a genome map, and high-throughput paired-end sequencing data to estimate the quality and to enhance the continuity of an assembly. We generated several assemblies of the Arabidopsis genome using different scaffolding programs and applied MaGuS to select the best assembly using quality metrics. Then, we used MaGuS to perform map-guided scaffolding to increase continuity by creating new scaffold links in low-covered and highly repetitive regions where other commonly used scaffolding methods lack consistency. Conclusions MaGuS is a powerful reference-free evaluator of assembly quality and a map-guided scaffolder that is freely available at https://github.com/institut-de-genomique/MaGuS. Its use can be extended to other high-throughput sequencing data (e.g., long-read data) and also to other map data (e.g., genetic maps) to improve the quality and the continuity of large and complex genome assemblies.


ARBitR: An overlap-aware genome assembly scaffolder for linked reads

10.1101/2020.04.29.065847 ◽

2020 ◽

Author(s):

Markus Hiltunen ◽

Martin Ryberg ◽

Hanna Johannesson

Keyword(s):

Genome Assembly ◽

General Public ◽

Source Code ◽

Draft Genome ◽

Supplementary Information ◽

Ltr Retrotransposons ◽

Sequencing Data ◽

Long Read ◽

Genome Assemblies ◽

General Public License

Abstract 10X Genomics Chromium linked reads contain information that can be used to link sequences together into scaffolds in draft genome assemblies. Existing software for this purpose performs the scaffolding by joining sequences together with a gap between them, not considering potential contig overlaps. Such overlaps can be particularly prominent in genome drafts assembled from long-read sequencing data where an overlap-layout-consensus (OLC) algorithm has been used. Ignoring overlapping contig ends may result in genes and other features being incomplete or fragmented in the resulting scaffolds. We developed the application ARBitR to generate scaffolds from genome drafts using 10X Chromium data, with a focus on minimizing the number of gaps in resulting scaffolds by incorporating an OLC step to resolve junctions between linked contigs. We tested the performance of ARBitR on three published and simulated datasets and compared it to the previously published tools ARCS and ARKS. The results revealed that ARBitR performed similarly in terms of contiguity statistics, and the advantage of the overlapping step was revealed by fewer long and short variants in ARBitR-produced scaffolds, in addition to a higher proportion of completely assembled LTR retrotransposons. We expect ARBitR to have broad applicability in genome assembly projects that utilize 10X Chromium linked reads. Availability and implementation ARBitR is written and implemented in Python3 for Unix-like operating systems. All source code is available at https://github.com/markhilt/ARBitR under the GNU General Public License. Supplementary information is available online.
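The difference between gap-joining and overlap-aware joining can be shown with a toy sketch. It is not ARBitR's algorithm: it assumes exact suffix-prefix matches, whereas a real OLC step aligns contig ends and tolerates mismatches. The function names, the `min_len` threshold, and the 10-N gap are all hypothetical choices for illustration.

```python
def longest_overlap(left, right, min_len=3):
    """Length of the longest exact match between the end of left and the start of right."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):  # try the longest overlap first
        if left[-size:] == right[:size]:
            return size
    return 0


def join_contigs(left, right, min_len=3, gap="N" * 10):
    """Merge two linked contigs through their overlap, or fall back to a gap."""
    size = longest_overlap(left, right, min_len)
    if size:
        # Overlap found: keep the shared bases once, with no gap in the scaffold
        return left + right[size:]
    # No overlap found: join with a run of Ns, as gap-based scaffolders do
    return left + gap + right


print(join_contigs("ACGTTGCA", "TGCAGGT"))
print(join_contigs("AAAA", "CCCC"))
```

In the first call the contigs share the four bases `TGCA`, so the merged scaffold contains them once and has no gap. A gap-only join would instead duplicate those bases on either side of a run of Ns, which is how features spanning the junction end up fragmented.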


Synteny-based genome assembly for 16 species of Heliconius butterflies, and an assessment of structural variation across the genus

10.1101/2020.10.29.359505 ◽

2020 ◽

Author(s):

Fernando A. Seixas ◽

Nathaniel B. Edelman ◽

James Mallet

Keyword(s):

Genome Assembly ◽

Copy Number ◽

Structural Variation ◽

Draft Genome ◽

Genomic Data ◽

Neotropical Species ◽

Heliconius Melpomene ◽

Evolutionary Studies ◽

Genome Assemblies ◽

Reference Genomes

Abstract Heliconius butterflies (Lepidoptera: Nymphalidae) are a group of 48 neotropical species widely studied in evolutionary research. Despite the wealth of genomic data generated in past years, chromosomal level genome assemblies currently exist for only two species, Heliconius melpomene and H. erato, each a representative of one of the two major clades of the genus. Here, we use these reference genomes to improve the contiguity of previously published draft genome assemblies of 16 Heliconius species. Using a reference-assisted scaffolding approach, we place and order the scaffolds of these genomes onto chromosomes, resulting in 95.7-99.9% of their genomes anchored to chromosomes. Genome sizes are somewhat variable among species (270-422 Mb) and in one small group of species (H. hecale, H. elevatus and H. pardalinus) differences in genome size are mainly driven by a few restricted repetitive regions. Genes within these repeat regions show an increase in exon copy number, an absence of internal stop codons, evidence of constraint on non-synonymous changes, and increased expression, all of which suggest that the extra copies are functional. Finally, we conducted a systematic search for inversions and identified five moderately large inversions fixed between the two major Heliconius clades. We infer that one of these inversions was transferred by introgression between the lineages leading to the erato/sara and burneyi/doris clades. These reference-guided assemblies represent a major improvement in Heliconius genomic resources that should aid further genetic and evolutionary studies in this genus.


Quinoa genome assembly employing genomic variation for guided scaffolding

Theoretical and Applied Genetics ◽

10.1007/s00122-021-03915-x ◽

2021 ◽

Author(s):

Alexandrina Bodrug-Schepers ◽

Nancy Stralis-Pavese ◽

Hermann Buerstmayr ◽

Juliane C. Dohm ◽

Heinz Himmelbauer

Keyword(s):

Genome Assembly ◽

Chenopodium Quinoa ◽

Genomic Variation ◽

Valuable Resource ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Long Read ◽

Genome Assemblies ◽

Haplotype Information

Abstract Key message We propose to use the natural variation between individuals of a population for genome assembly scaffolding. In today’s genome projects, multiple accessions get sequenced, leading to variant catalogs. Using such information to improve genome assemblies is attractive both cost-wise as well as scientifically, because the value of an assembly increases with its contiguity. We conclude that haplotype information is a valuable resource to group and order contigs toward the generation of pseudomolecules. Abstract Quinoa (Chenopodium quinoa) has been under cultivation in Latin America for more than 7500 years. Recently, quinoa has gained increasing attention due to its stress resistance and its nutritional value. We generated a novel quinoa genome assembly for the Bolivian accession CHEN125 using PacBio long-read sequencing data (assembly size 1.32 Gbp, initial N50 size 608 kbp). Next, we re-sequenced 50 quinoa accessions from Peru and Bolivia. This set of accessions differed at 4.4 million single-nucleotide variant (SNV) positions compared to CHEN125 (1.4 million SNV positions on average per accession). We show how to exploit variation in accessions that are distantly related to establish a genome-wide ordered set of contigs for guided scaffolding of a reference assembly. The method is based on detecting shared haplotypes and their expected continuity throughout the genome (i.e., the effect of linkage disequilibrium), as an extension of what is expected in mapping populations where only a few haplotypes are present. We test the approach using Arabidopsis thaliana data from different populations. After applying the method on our CHEN125 quinoa assembly we validated the results with mate-pairs, genetic markers, and another quinoa assembly originating from a Chilean cultivar. 
We show consistency between these information sources and the haplotype-based relations as determined by us and obtain an improved assembly with an N50 size of 1079 kbp and ordered contig groups of up to 39.7 Mbp. We conclude that haplotype information in distantly related individuals of the same species is a valuable resource to group and order contigs according to their adjacency in the genome toward the generation of pseudomolecules.


Genome sequence resource of Phomopsis longicolla strain YC2-1, a fungal pathogen causing Phomopsis stem blight in soybean

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi-12-20-0340-a ◽

2021 ◽

Author(s):

Xiaolin Zhao ◽

Zhichao Zhang ◽

Sujiao Zheng ◽

Wenwu Ye ◽

Xiaobo Zheng ◽

...

Keyword(s):

Genome Assembly ◽

Stem Canker ◽

Quality Data ◽

Phomopsis Longicolla ◽

Protein Coding ◽

Stem Blight ◽

A Genome ◽

Long Read ◽

Genomic Resource ◽

Blight Disease

Diaporthe-Phomopsis disease complex causes considerable yield losses in soybean production worldwide. As one of the major pathogens, Phomopsis longicolla T. W. Hobbs (syn. Diaporthe longicolla) is not only the primary agent of Phomopsis seed decay, but also one of the agents of Phomopsis pod and stem blight, and Phomopsis stem canker. We performed both PacBio long read sequencing and Illumina short read sequencing, and obtained a genome assembly for the P. longicolla strain YC2-1, which was isolated from soybean stem with Phomopsis stem blight disease. The 63.1 Mb genome assembly contains 87 scaffolds, with a minimum, maximum, and N50 scaffold length of 20 kb, 4.6 Mb, and 1.5 Mb respectively, and a total of 17,407 protein-coding genes. The high-quality data expand the genomic resource of P. longicolla species and will provide a solid foundation for a better understanding of their genetic diversity and pathogenic mechanisms.


A high-quality genome assembly from a single, field-collected spotted lanternfly (Lycorma delicatula) using the PacBio Sequel II system

GigaScience ◽

10.1093/gigascience/giz122 ◽

2019 ◽

Vol 8 (10) ◽

Cited By ~ 12

Author(s):

Sarah B Kingan ◽

Julie Urban ◽

Christine C Lambert ◽

Primo Baybayan ◽

Anna K Childers ◽

...

Keyword(s):

Invasive Species ◽

Genome Assembly ◽

De Novo ◽

Fragment Size ◽

High Quality ◽

De Novo Genome Assembly ◽

Lycorma Delicatula ◽

Long Read ◽

Genome Assemblies ◽

High Quality Genome

ABSTRACT Background A high-quality reference genome is an essential tool for applied and basic research on arthropods. Long-read sequencing technologies may be used to generate more complete and contiguous genome assemblies than alternate technologies; however, long-read methods have historically had greater input DNA requirements and higher costs than next-generation sequencing, which are barriers to their use on many samples. Here, we present a 2.3 Gb de novo genome assembly of a field-collected adult female spotted lanternfly (Lycorma delicatula) using a single Pacific Biosciences SMRT Cell. The spotted lanternfly is an invasive species recently discovered in the northeastern United States that threatens to damage economically important crop plants in the region. Results The DNA from 1 individual was used to make 1 standard, size-selected library with an average DNA fragment size of ∼20 kb. The library was run on 1 Sequel II SMRT Cell 8M, generating a total of 132 Gb of long-read sequences, of which 82 Gb were from unique library molecules, representing ∼36× coverage of the genome. The assembly had high contiguity (contig N50 length = 1.5 Mb), completeness, and sequence level accuracy as estimated by conserved gene set analysis (96.8% of conserved genes both complete and without frame shift errors). Furthermore, it was possible to segregate more than half of the diploid genome into the 2 separate haplotypes. The assembly also recovered 2 microbial symbiont genomes known to be associated with L. delicatula, each microbial genome being assembled into a single contig. Conclusions We demonstrate that field-collected arthropods can be used for the rapid generation of high-quality genome assemblies, an attractive approach for projects on emerging invasive species, disease vectors, or conservation efforts of endangered species.
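The coverage figure in the Results can be checked with simple arithmetic: depth of coverage is the amount of sequence divided by the genome size. The sketch below uses only the numbers quoted in the abstract (82 Gb of unique-molecule sequence, a 2.3 Gb assembly); the helper name `coverage` is illustrative.

```python
def coverage(sequenced_gb, genome_gb):
    """Approximate depth of coverage: sequenced bases divided by genome size."""
    return sequenced_gb / genome_gb


unique_depth = coverage(82, 2.3)   # sequence from unique library molecules
total_depth = coverage(132, 2.3)   # all long-read sequence from the SMRT Cell
print(f"unique: {unique_depth:.1f}x, total: {total_depth:.1f}x")
```

Rounding the first value gives the ∼36× stated in the abstract. The second value is what the raw 132 Gb would correspond to if every read counted independently toward depth.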


Easy identification of insertion sequence mobilization events in related bacterial strains with ISCompare

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab181 ◽

2021 ◽

Author(s):

Ezequiel G Mogro ◽

Nicolás M Ambrosis ◽

Mauricio J Lozano

Keyword(s):

Insertion Sequence ◽

Bacterial Genome ◽

Draft Genome ◽

Genome Plasticity ◽

Bacterial Strains ◽

Bacterial Genomes ◽

A Genome ◽

Phenotypic Variations ◽

Genome Assemblies ◽

Straightforward Approach

Abstract Bacterial genomes are composed of core and accessory genomes. The first is composed of housekeeping and essential genes, while the second is highly enriched in mobile genetic elements, including transposable elements (TEs). Insertion sequences (ISs), the smallest TEs, have an important role in genome evolution, and contribute to bacterial genome plasticity and adaptability. ISs can spread in a genome, presenting different locations in nearly related strains, and producing phenotypic variations. Few tools are available which can identify differentially located ISs (DLISs) on assembled genomes. Here, we introduce ISCompare, a new program to profile IS mobilization events in related bacterial strains using complete or draft genome assemblies. ISCompare was validated using artificial genomes with simulated random IS insertions and real sequences, achieving the same or better results than other available tools, with the advantage that ISCompare can analyze multiple ISs at the same time and outputs a list of candidate DLISs. ISCompare provides an easy and straightforward approach to look for differentially located ISs on bacterial genomes.
