Assembly of complete diploid phased chromosomes from draft genome sequences

De novo genome assembly is essential for genomic research. High-quality genomes assembled into phased pseudomolecules are challenging to produce and often contain assembly errors caused by repeats, heterozygosity, or the chosen assembly strategy. Although algorithms exist that produce partially phased assemblies, haploid draft assemblies that may lack biological information remain favored because they are easier to generate and use. We developed HaploSync, a suite of tools that produces fully phased, chromosome-scale diploid genome assemblies and performs extensive quality control to limit assembly artifacts. HaploSync uses a genetic map and/or the genome of a closely related species to guide the scaffolding of a diploid assembly into phased pseudomolecules for each chromosome. It compares alternative haplotypes to identify and correct misassemblies independent of a reference, fills assembly gaps with unplaced sequences, and resolves collapsed homozygous regions. In a series of plant, fungal, and animal kingdom case studies, we demonstrate that HaploSync increases the assembly contiguity of phased chromosomes, improves completeness by filling gaps, corrects scaffolding, and correctly phases highly heterozygous, complex regions.

LongStitch: High-quality genome assembly correction and scaffolding using long reads

10.1101/2021.06.17.448848 ◽

2021 ◽

Author(s):

Lauren Coombe ◽

Janet X Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Background: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions than short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results: LongStitch incorporates multiple tools developed by our group and runs in up to three stages: initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 2.0-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies than the state-of-the-art long-read scaffolder LRScaf in most tests, and consistently runs in under five hours using less than 23 GB of RAM. Conclusions: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.

LongStitch: high-quality genome assembly correction and scaffolding using long reads

BMC Bioinformatics ◽

10.1186/s12859-021-04451-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lauren Coombe ◽

Janet X. Li ◽

Theodora Lo ◽

Johnathan Wong ◽

Vladimir Nikolic ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Draft Genome ◽

Model Organisms ◽

High Quality ◽

De Novo Genome Assembly ◽

Long Reads ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies

Abstract Background: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions than short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. Results: LongStitch incorporates multiple tools developed by our group and runs in up to three stages: initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies than the state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. Conclusions: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
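
The "minimizer mappings" mentioned in the abstract can be illustrated with a generic window-minimizer sketch. This is not ntLink's implementation: the function names `minimizers` and `shared_minimizers` are hypothetical, and ordering k-mers lexicographically is an assumption made for readability (production tools typically order k-mers by a hash value instead). The idea shown is only that two sequences sampled with the same k and w tend to select the same k-mers wherever they overlap, which gives a cheap signal that a long read touches a given contig.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer (leftmost on ties). Lexicographic order is an
    illustrative assumption, not what any particular tool uses.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Index of the smallest k-mer in this window, leftmost on ties.
        offset = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + offset, window[offset]))
    return sorted(picked)


def shared_minimizers(a, b, k, w):
    """Return the set of minimizer k-mers that sequences a and b share."""
    kmers_a = {kmer for _, kmer in minimizers(a, k, w)}
    kmers_b = {kmer for _, kmer in minimizers(b, k, w)}
    return kmers_a & kmers_b


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    print(minimizers("ACGTACGT", 3, 2))
    print(shared_minimizers("ACGTACGT", "TTACGTAC", 3, 2))
```

Because only one k-mer per window is kept, the index is much smaller than a full k-mer table, which is the sense in which such mappings are "lightweight".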

A de novo genome assembly of the dwarfing pear rootstock Zhongai 1

Scientific Data ◽

10.1038/s41597-019-0291-3 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 1

Author(s):

Chunqing Ou ◽

Fei Wang ◽

Jiahong Wang ◽

Song Li ◽

Yanjie Zhang ◽

...

Keyword(s):

De Novo ◽

Repetitive Sequences ◽

Draft Genome ◽

Genome Sequences ◽

Fruit Characteristics ◽

De Novo Genome Assembly ◽

Protein Coding ◽

Protein Coding Genes ◽

Long Reads ◽

Cultivated Species

Abstract ‘Zhongai 1’ [(Pyrus ussuriensis × communis) × spp.] is an excellent dwarfing pear rootstock common in China. It is itself dwarf and has high dwarfing efficiency on most of the main cultivated Pyrus species when used as an interstock. Here we describe the draft genome sequence of ‘Zhongai 1’, which was assembled using PacBio long reads, Illumina short reads, and Hi-C technology. We estimated the genome size at approximately 511.33 Mb by k-mer analysis and obtained a final assembly of 510.59 Mb with a contig N50 of 1.28 Mb. Next, 506.31 Mb (99.16%) of the contigs were clustered into 17 chromosomes with a scaffold N50 of 23.45 Mb. We further predicted 309.86 Mb (60.68%) of repetitive sequences and 43,120 protein-coding genes. The assembled genome will be a valuable resource and reference for future pear breeding, genetic improvement, and comparative genomics among related species. Moreover, it will help identify genes involved in dwarfism, early flowering, stress tolerance, and commercially desirable fruit characteristics.
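
The contig and scaffold N50 values quoted in this abstract follow the standard definition: the length L such that sequences of length L or longer together contain at least half of the assembled bases. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total assembly size.
    """
    total = sum(lengths)
    if total <= 0:
        raise ValueError("need at least one sequence of positive length")
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths, invented for illustration.
    print(n50([80, 70, 50, 40, 30, 20]))
```

The same function applies to contigs and scaffolds alike; only the input list of lengths changes.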

De novo genome assemblies of butterflies

GigaScience ◽

10.1093/gigascience/giab041 ◽

2021 ◽

Vol 10 (6) ◽

Author(s):

Emily A Ellis ◽

Caroline G Storer ◽

Akito Y Kawahara

Keyword(s):

De Novo ◽

Draft Genome ◽

Quality Metrics ◽

Data Reuse ◽

Post Processing ◽

Genome Sequences ◽

High Utility ◽

Computational Resources ◽

Genome Assemblies ◽

Aphantopus Hyperantus

Abstract Background: The availability of thousands of genomes has enabled new advancements in biology. However, many genomes have not been investigated for their quality. Here we examine quality trends in a taxonomically diverse and well-known group, butterflies (Papilionoidea), and provide draft, de novo assemblies for all available butterfly genomes. Owing to massive genome sequencing investment and taxonomic curation, this is an excellent group to explore genome quality. Findings: We provide de novo assemblies for all 822 available butterfly genomes and interpret their quality in terms of completeness and continuity. We identify the 50 highest quality genomes across butterflies and conclude that the ringlet, Aphantopus hyperantus, has the highest quality genome. Our post-processing of draft genome assemblies identified 118 butterfly genomes that should not be reused owing to contamination or extremely low quality. However, many draft genomes are of high utility, especially because permissibility of low-quality genomes is dependent on the objective of the study. Our assemblies will serve as a key resource for papilionid genomics, especially for researchers without computational resources. Conclusions: Quality metrics and assemblies are typically presented with annotated genome accessions but rarely with de novo genomes. We recommend that studies presenting genome sequences provide the assembly and some metrics of quality because quality will significantly affect downstream results. Transparency in quality metrics is needed to improve the field of genome science and encourage data reuse.

De Novo Genome Assembly of the Meadow Brown Butterfly, Maniola jurtina

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401071 ◽

2020 ◽

Vol 10 (5) ◽

pp. 1477-1484

Author(s):

Kumar Saurabh Singh ◽

David J. Hosken ◽

Nina Wedell ◽

Richard ffrench-Constant ◽

Chris Bass ◽

...

Keyword(s):

De Novo ◽

Draft Genome ◽

Single Copy ◽

De Novo Genome Assembly ◽

Modern Biology ◽

Final Assembly ◽

Gene Sets ◽

Gene Count ◽

In The Wild ◽

Maniola Jurtina

Meadow brown butterflies (Maniola jurtina) on the Isles of Scilly represent an ideal model in which to dissect the links between genotype, phenotype, and long-term patterns of selection in the wild, a largely unfulfilled but fundamental aim of modern biology. To meet this aim, a clear description of genotype is required. Here we present the draft genome sequence of M. jurtina to serve as a founding genetic resource for this species. Seven libraries were constructed using pooled DNA from five wild-caught spotted females and sequenced using Illumina, PacBio RSII, and MinION technology. A novel hybrid assembly approach was employed to generate a final assembly with an N50 of 214 kb (longest scaffold 2.9 Mb). The sequence assembly described here predicts a gene count of 36,294 and includes variants and gene duplicates from five genotypes. Core BUSCO (Benchmarking Universal Single-Copy Orthologs) gene sets of Arthropoda and Insecta recovered 90.5% and 88.7% complete and single-copy genes, respectively. Comparisons with 17 other Lepidopteran species placed 86.5% of the assembled genes in orthogroups. Our results provide the first high-quality draft genome and annotation of the butterfly M. jurtina.

Draft Genome Sequences of Two Benthic Cyanobacteria, Oscillatoriales USR 001 and Nostoc sp. MBR 210, Isolated from Tropical Freshwater Lakes

Genome Announcements ◽

10.1128/genomea.01115-16 ◽

2016 ◽

Vol 4 (5) ◽

Cited By ~ 1

Author(s):

Shu Harn Te ◽

Boon Fei Tan ◽

Janelle R. Thompson ◽

Karina Yew-Hoong Gin

Keyword(s):

Solid Medium ◽

De Novo ◽

Draft Genome ◽

Shotgun Sequencing ◽

Freshwater Lakes ◽

Genome Sequences ◽

Benthic Cyanobacteria ◽

Complete Genomes ◽

Tropical Freshwater

Genomes of two filamentous benthic cyanobacteria were recovered from cocultures derived from two freshwater lakes. The cultures were established by first growing cyanobacterial trichomes on solid medium, followed by subculturing in freshwater media. Subsequent shotgun sequencing, de novo assembly, and genomic binning yielded almost complete genomes of Oscillatoriales USR 001 and Nostoc sp. MBR 210.

Draft Genome Assemblies of Xylose-Utilizing Candida tropicalis and Candida boidinii with Potential Application in Biochemical and Biofuel Production

Genome Announcements ◽

10.1128/genomea.01594-17 ◽

2018 ◽

Vol 6 (7) ◽

Cited By ~ 1

Author(s):

Abhishek Somani ◽

Daniel Smith ◽

Matthew Hegarty ◽

Narcis Fernandez-Fuentes ◽

Sreenivas R. Ravella ◽

...

Keyword(s):

Candida Tropicalis ◽

Potential Application ◽

Draft Genome ◽

Biofuel Production ◽

Candida Boidinii ◽

Industrial Biotechnology ◽

Genome Sequences ◽

Genome Assemblies

ABSTRACT Non-albicans Candida species are growing in prominence in industrial biotechnology due to their ability to utilize hemicellulose. Here, we present the draft genome sequences of an inhibitor-tolerant Candida tropicalis strain (Y6604) and Candida boidinii NCAIM Y01308T.

Draft Genome of the Macadamia Husk Spot Pathogen, Pseudocercospora macadamiae

Phytopathology ◽

10.1094/phyto-12-19-0460-a ◽

2020 ◽

Vol 110 (9) ◽

pp. 1503-1506

Author(s):

Olufemi A. Akinsanmi ◽

Lilia C. Carvalhais

Keyword(s):

Plant Disease Resistance ◽

Plant Disease ◽

De Novo ◽

Draft Genome ◽

Gc Content ◽

Disease Development ◽

Closely Related Species ◽

Protein Coding ◽

Protein Coding Genes ◽

The Family

Pseudocercospora macadamiae causes husk spot in macadamia in Australia. A lack of genomic resources for this pathogen has limited understanding of the mechanisms of disease development and spread, and of the pathogen's role in fruit abscission. To address this gap, we sequenced the genome of P. macadamiae. The sequence was assembled de novo into a draft genome of 40 Mb, which is comparable to those of closely related species in the family Mycosphaerellaceae. The draft genome comprises 212 scaffolds, of which 99 are over 50 kb. The genome has a 49% GC content and is predicted to contain 15,430 protein-coding genes. This draft genome sequence is the first for P. macadamiae and represents a valuable resource for understanding genome evolution and plant disease resistance.

Draft genome of Bugula neritina, a colonial animal packing powerful symbionts and potential medicines

Scientific Data ◽

10.1038/s41597-020-00684-y ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Mikhail Rayko ◽

Aleksey Komissarov ◽

Jason C. Kwan ◽

Grace Lim-Fong ◽

Adelaide C. Rhodes ◽

...

Keyword(s):

De Novo ◽

Draft Genome ◽

Transcriptome Data ◽

Biomedical Sciences ◽

Bugula Neritina ◽

Genome Sequences ◽

Illumina Hiseq ◽

Draft Assembly ◽

Metazoan Genome ◽

Animal Phyla

Abstract Many animal phyla have no representatives within the catalog of whole metazoan genome sequences. This dataset fills in one gap in the genome knowledge of animal phyla with a draft genome of Bugula neritina (phylum Bryozoa). Interest in this species spans ecology and biomedical sciences because B. neritina is the natural source of bioactive compounds called bryostatins. Here we present a draft assembly of the B. neritina genome obtained from PacBio and Illumina HiSeq data, as well as genes and proteins predicted de novo and verified using transcriptome data, along with the functional annotation. These sequences will permit a better understanding of host-symbiont interactions at the genomic level, and also contribute additional phylogenomic markers to evaluate Lophophorate or Lophotrochozoa phylogenetic relationships. The effort also fits well with plans to ultimately sequence all orders of the Metazoa.

dnAQET: a framework to compute a consolidated metric for benchmarking quality of de novo assemblies

BMC Genomics ◽

10.1186/s12864-019-6070-x ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Gokhan Yavas ◽

Huixiao Hong ◽

Wenming Xiao

Keyword(s):

Quality Assessment ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Quality Score ◽

De Novo Genome Assembly ◽

Genome Assemblies ◽

Reference Genomes ◽

Better Than

Abstract Background: Accurate de novo genome assembly has become a reality with the advancements in sequencing technology. With the ever-increasing number of de novo genome assembly tools, assessing the quality of assemblies has become of great importance in genome research. Although many quality metrics have been proposed and software tools for calculating those metrics have been developed, the existing tools do not produce a unified measure to reflect the overall quality of an assembly. Results: To address this issue, we developed the de novo Assembly Quality Evaluation Tool (dnAQET), which generates a unified metric for benchmarking the quality assessment of assemblies. Our framework first calculates individual quality scores for the scaffolds/contigs of an assembly by aligning them to a reference genome. Next, it computes a quality score for the assembly using its overall reference genome coverage, the quality score distribution of its scaffolds, and the redundancy identified in it. Using synthetic assemblies randomly generated from the latest human genome build, various builds of the reference genomes for five organisms, and six de novo assemblies for sample NA24385, we tested dnAQET to assess its capability for benchmarking quality evaluation of genome assemblies. For synthetic data, our quality score increased with decreasing number of misassemblies and redundancy and with increasing average contig length and coverage, as expected. For genome builds, the dnAQET quality score calculated for a more recent reference genome was better than the score for an older version. To compare with some of the most frequently used measures, 13 other quality measures were calculated. The quality score from dnAQET was found to be better than all other measures in terms of consistency with the known quality of the reference genomes, indicating that dnAQET is reliable for benchmarking quality assessment of de novo genome assemblies. Conclusions: dnAQET is a scalable framework designed to evaluate a de novo genome assembly based on the aggregated quality of its scaffolds (or contigs). Our results demonstrated that the dnAQET quality score is reliable for benchmarking quality assessment of genome assemblies. dnAQET can help researchers identify the most suitable assembly tools and select the high-quality assemblies generated.
