A haplotype-resolved, de novo genome assembly for the wood tiger moth (Arctia plantaginis) through trio binning

Eugenie C Yen; Shane A McCarthy; Juan A Galarza; Tomas N Generalovic; Sarah Pelan; Petr Nguyen; Joana I Meier; Ian A Warren; Johanna Mappes; Richard Durbin; Chris D Jiggins

doi:10.1093/gigascience/giaa088

A haplotype-resolved, de novo genome assembly for the wood tiger moth (Arctia plantaginis) through trio binning

GigaScience ◽

10.1093/gigascience/giaa088 ◽

2020 ◽

Vol 9 (8) ◽

Cited By ~ 2

Author(s):

Eugenie C Yen ◽

Shane A McCarthy ◽

Juan A Galarza ◽

Tomas N Generalovic ◽

Sarah Pelan ◽

...

Keyword(s):

Population Structure ◽

Genome Assembly ◽

De Novo ◽

Consensus Sequence ◽

Parental Origin ◽

De Novo Genome Assembly ◽

Geographic Population ◽

Long Reads ◽

Tiger Moth ◽

Innovative Solution

ABSTRACT Background Diploid genome assembly is typically impeded by heterozygosity because it introduces errors when haplotypes are collapsed into a consensus sequence. Trio binning offers an innovative solution that exploits heterozygosity for assembly. Short, parental reads are used to assign parental origin to long reads from their F1 offspring before assembly, enabling complete haplotype resolution. Trio binning could therefore provide an effective strategy for assembling highly heterozygous genomes, which are traditionally problematic, such as insect genomes. This includes the wood tiger moth (Arctia plantaginis), which is an evolutionary study system for warning colour polymorphism. Findings We produced a high-quality, haplotype-resolved assembly for Arctia plantaginis through trio binning. We sequenced a same-species family (F1 heterozygosity ∼1.9%) and used parental Illumina reads to bin 99.98% of offspring Pacific Biosciences reads by parental origin, before assembling each haplotype separately and scaffolding with 10X linked reads. Both assemblies are contiguous (mean scaffold N50: 8.2 Mb) and complete (mean BUSCO completeness: 97.3%), with annotations and 31 chromosomes identified through karyotyping. We used the assembly to analyse genome-wide population structure and relationships between 40 wild resequenced individuals from 5 populations across Europe, revealing the Georgian population as the most genetically differentiated with the lowest genetic diversity. Conclusions We present the first invertebrate genome to be assembled via trio binning. This assembly is one of the highest quality genomes available for Lepidoptera, supporting trio binning as a potent strategy for assembling heterozygous genomes. Using our assembly, we provide genomic insights into the geographic population structure of A. plantaginis.
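
The binning step described in this abstract can be illustrated with a minimal sketch. It assumes a simple k-mer rule: collect the k-mers seen only in the maternal short reads and only in the paternal short reads, then assign each offspring long read to the parent whose private k-mers it shares more of. The function names, the toy sequences and the value of k are hypothetical, and the sketch omits reverse complements, error filtering and count normalisation. It is not the authors' pipeline.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental short reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign an offspring long read to a parental bin by counting private k-mers."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy data (hypothetical): the parents differ in the second half of the sequence.
mat_only, pat_only = parent_specific_kmers(["AAAACCCC"], ["AAAAGGGG"], k=4)
bins = [bin_read(r, mat_only, pat_only, k=4)
        for r in ["AAACCCCT", "TAAAGGGG", "AAAA"]]
```

Reads that carry no parent-specific k-mers stay unassigned. Higher heterozygosity means more private k-mers per read, so more reads can be binned, which is why the approach benefits from heterozygosity rather than being hindered by it.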


A haplotype-resolved, de novo genome assembly for the wood tiger moth (Arctia plantaginis) through trio binning

10.1101/2020.02.28.970020 ◽

2020 ◽

Author(s):

Eugenie C. Yen ◽

Shane A. McCarthy ◽

Juan A. Galarza ◽

Tomas N. Generalovic ◽

Sarah Pelan ◽

...

Keyword(s):

Population Structure ◽

Genome Assembly ◽

De Novo ◽

Consensus Sequence ◽

Parental Origin ◽

De Novo Genome Assembly ◽

Geographic Population ◽

Long Reads ◽

Tiger Moth ◽

Innovative Solution

ABSTRACT Background Diploid genome assembly is typically impeded by heterozygosity, as it introduces errors when haplotypes are collapsed into a consensus sequence. Trio binning offers an innovative solution which exploits heterozygosity for assembly. Short, parental reads are used to assign parental origin to long reads from their F1 offspring before assembly, enabling complete haplotype resolution. Trio binning could therefore provide an effective strategy for assembling highly heterozygous genomes which are traditionally problematic, such as insect genomes. This includes the wood tiger moth (Arctia plantaginis), which is an evolutionary study system for warning colour polymorphism. Findings We produced a high-quality, haplotype-resolved assembly for Arctia plantaginis through trio binning. We sequenced a same-species family (F1 heterozygosity ∼1.9%) and used parental Illumina reads to bin 99.98% of offspring Pacific Biosciences reads by parental origin, before assembling each haplotype separately and scaffolding with 10X linked-reads. Both assemblies are highly contiguous (mean scaffold N50: 8.2 Mb) and complete (mean BUSCO completeness: 97.3%), with complete annotations and 31 chromosomes identified through karyotyping. We employed the assembly to analyse genome-wide population structure and relationships between 40 wild resequenced individuals from five populations across Europe, revealing the Georgian population as the most genetically differentiated with the lowest genetic diversity. Conclusions We present the first invertebrate genome to be assembled via trio binning. This assembly is one of the highest quality genomes available for Lepidoptera, supporting trio binning as a potent strategy for assembling highly heterozygous genomes. Using this assembly, we provide genomic insights into geographic population structure of Arctia plantaginis.


Whole-genome sequencing of 182 Bursaphelenchus xylophilus strains generates first long read based de novo genome assembly and reveals temperature associated population structure

10.22541/au.159352211.19983305 ◽

2020 ◽

Author(s):

Xiaolei Ding ◽

Yunfei Guo ◽

Jianren Ye ◽

Xiaoqin Wu ◽

Sixi Lin ◽

...

Keyword(s):

Population Structure ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Genome Assembly ◽

De Novo ◽

Bursaphelenchus Xylophilus ◽

Whole Genome ◽

De Novo Genome Assembly ◽

Long Read


LongStitch: High-quality genome assembly correction and scaffolding using long reads

10.1101/2021.06.17.448848 ◽

2021 ◽

Author(s):

Lauren Coombe ◽

Janet X Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which include initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 2.0-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently runs in under five hours using less than 23 GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
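
The "lightweight minimizer mappings" mentioned above can be sketched in general terms. The example below assumes the textbook (w, k)-minimizer scheme: from every window of w consecutive k-mers, keep the smallest one. Lexicographic order is used here purely for readability, whereas real tools typically order k-mers by a hash. All function names and sequences are hypothetical and this is not ntLink's actual implementation. A long read that shares minimizers with two contigs is evidence that those contigs lie near each other.

```python
def minimizers(seq, k, w):
    """Return the set of (kmer, position) minimizers of seq.

    For each window of w consecutive k-mers, the lexicographically smallest
    k-mer is selected; ties are broken by the leftmost position.
    """
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kms) - w + 1):
        window = kms[start:start + w]
        j = min(range(w), key=lambda i: (window[i], i))
        picked.add((window[j], start + j))
    return picked


def contigs_hit_by_read(read, contigs, k, w):
    """Count the minimizer sequences a read shares with each contig."""
    read_mins = {m for m, _ in minimizers(read, k, w)}
    hits = {}
    for name, seq in contigs.items():
        shared = read_mins & {m for m, _ in minimizers(seq, k, w)}
        if shared:
            hits[name] = len(shared)
    return hits


# Toy data (hypothetical): one long read spans the junction of two contigs.
contigs = {"c1": "AAACCC", "c2": "GGGTTT"}
hits = contigs_hit_by_read("AAACCCGGGTTT", contigs, k=3, w=2)
```

Because only a subset of k-mers is kept, the index is much smaller than a full k-mer index, while two sequences that overlap still tend to select the same minimizers in the overlapping region.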


Pushing the limits of de novo genome assembly for complex prokaryotic genomes harboring very long, near identical repeats

10.1101/300186 ◽

2018 ◽

Cited By ~ 3

Author(s):

Michael Schmid ◽

Daniel Frei ◽

Andrea Patrignani ◽

Ralph Schlapbach ◽

Jürg E. Frey ◽

...

Keyword(s):

Dark Matter ◽

Genome Assembly ◽

De Novo ◽

Bacterial Genomes ◽

De Novo Genome Assembly ◽

Assembly Algorithm ◽

Long Reads ◽

Oxford Nanopore ◽

Prokaryotic Genomes ◽

Genome Assemblies

Abstract Generating a complete, de novo genome assembly for prokaryotes is often considered a solved problem. However, here we show that Pseudomonas koreensis P19E3 harbors multiple, near identical repeat pairs up to 70 kilobase pairs in length. Beyond long repeats, the P19E3 assembly was further complicated by a shufflon region. Its complex genome could not be de novo assembled with long reads produced by Pacific Biosciences' technology, but required very long reads from the Oxford Nanopore Technology. Another important factor for a full genomic resolution was the choice of assembly algorithm. Importantly, a repeat analysis indicated that very complex bacterial genomes represent a general phenomenon beyond Pseudomonas. Roughly 10% of 9331 complete bacterial and a handful of 293 complete archaeal genomes represented this dark matter for de novo genome assembly of prokaryotes. Several of these dark matter genome assemblies contained repeats far beyond the resolution of the sequencing technology employed and likely contain errors; other genomes were closed employing labor-intensive steps like cosmid libraries, primer walking or optical mapping. Using very long sequencing reads in combination with assemblers capable of resolving long, near identical repeats will bring most prokaryotic genomes within reach of fast and complete de novo genome assembly.


Raven: a de novo genome assembler for long reads

10.1101/2020.08.07.242461 ◽

2020 ◽

Cited By ~ 5

Author(s):

Robert Vaser ◽

Mile Šikić

Keyword(s):

Human Genome ◽

Genome Assembly ◽

De Novo ◽

De Novo Genome Assembly ◽

New Methods ◽

Long Reads ◽

Long Read ◽

Comparable Accuracy ◽

Genome Assembler ◽

Genome Dataset

We present new methods for the improvement of long-read de novo genome assembly incorporated into a straightforward tool called Raven (https://github.com/lbcb-sci/raven). Compared with other assemblers, Raven is one of the two fastest, reconstructs the sequenced genome in the fewest fragments, has better or comparable accuracy, and maintains similar performance across various genomes. Raven takes 500 CPU hours to assemble a 44x human genome dataset in only 259 fragments.


LongStitch: high-quality genome assembly correction and scaffolding using long reads

BMC Bioinformatics ◽

10.1186/s12859-021-04451-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lauren Coombe ◽

Janet X. Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Abstract Background Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which include initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. Conclusions Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.


Chromosome‐level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long‐reads, linked‐reads and Hi‐C

Molecular Ecology Resources ◽

10.1111/1755-0998.13574 ◽

2022 ◽

Author(s):

Stephanie H. Chen ◽

Maurizio Rossetto ◽

Marlien Merwe ◽

Patricia Lu‐Irving ◽

Jia‐Yee S. Yap ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

New South ◽

New South Wales ◽

De Novo Genome Assembly ◽

South Wales ◽

Long Reads ◽

Chromosome Level


High-quality de novo genome assembly of Kappaphycus alvarezii based on both PacBio and HiSeq sequencing

10.1101/2020.02.15.950402 ◽

2020 ◽

Author(s):

Shangang Jia ◽

Guoliang Wang ◽

Guiming Liu ◽

Jiangyong Qu ◽

Beilun Zhao ◽

...

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

De Novo ◽

Kappaphycus Alvarezii ◽

Draft Genome ◽

Production Traits ◽

Illumina Hiseq ◽

De Novo Genome Assembly ◽

Protein Coding ◽

Long Reads

ABSTRACT The red alga Kappaphycus alvarezii is the most important aquaculture species in Kappaphycus. It is widely distributed in tropical waters and has become the main crop for carrageenan production. Its mechanisms of adaptation to high-temperature, high-salinity environments and its carbohydrate metabolism may provide important inspiration for the study of marine algae. Genomic data will also be essential for improving disease resistance and production traits of K. alvarezii. We generated 43.28 Gb of short paired-end reads on the Illumina HiSeq platform and 18.52 Gb of single-molecule long reads on the PacBio RSII platform. The de novo genome assembly was performed using Falcon_unzip and Canu, and then improved with Pilon. The final assembled genome (336 Mb) consists of 888 scaffolds with a contig N50 of 849 kb. Further annotation analyses predicted 21,422 protein-coding genes, of which 61.28% were functionally annotated. Here we report the draft genome and annotations of K. alvarezii, which are valuable resources for future genomic and genetic studies in Kappaphycus and other algae.
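
Several abstracts on this page report contiguity as an N50 value. As a reminder of what that statistic means, here is a minimal sketch using the standard definition: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name and the example lengths are hypothetical and are not taken from any of the listed assemblies.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths (0 if empty).

    N50 is the length L such that sequences of length >= L together cover
    at least half of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy assembly (hypothetical lengths in bp): total 290, half is 145.
example = n50([80, 70, 50, 40, 30, 20])
```

In the toy case the two longest sequences (80 + 70 = 150) already cover half of the 290 bp total, so the N50 is 70. A higher N50 for the same total length means the assembly is split into fewer, longer pieces.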


A long reads-based de-novo assembly of the genome of the Arlee homozygous line reveals chromosomal rearrangements in rainbow trout

G3 Genes|Genomes|Genetics ◽

10.1093/g3journal/jkab052 ◽

2021 ◽

Author(s):

Guangtu Gao ◽

Susana Magadan ◽

Geoffrey C Waldbieser ◽

Ramey C Youngblood ◽

Paul A Wheeler ◽

...

Keyword(s):

Rainbow Trout ◽

Chromosome Number ◽

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Structural Variations ◽

High Coverage ◽

Haploid Chromosome Number ◽

Long Reads

Abstract Currently, there is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that represent the genetic diversity of this species. The Arlee doubled haploid line originated from a domesticated hatchery strain that was originally collected from the northern California coast. The Canu pipeline was used to generate a de novo assembly of the Arlee line genome from high-coverage PacBio long-read sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2N = 64). The assembly is composed of 938 scaffolds with an N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% is in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout the haploid chromosome number can vary from 29 to 32. In the Arlee karyotype the haploid chromosome number is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations identified in the Arlee genome include major inversions on chromosomes Omy05 and Omy20 and 15 additional smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.


Improved hybrid de novo genome assembly of domesticated apple (Malus x domestica)

GigaScience ◽

10.1186/s13742-016-0139-0 ◽

2016 ◽

Vol 5 (1) ◽

Cited By ~ 28

Author(s):

Xuewei Li ◽

Ling Kui ◽

Jing Zhang ◽

Yinpeng Xie ◽

Liping Wang ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Malus X Domestica ◽

De Novo Genome Assembly
