Transcript- and annotation-guided genome assembly of the European starling

AbstractThe European starling, Sturnus vulgaris, is an ecologically significant, globally invasive avian species that is also suffering from a major decline in its native range. Here, we present the genome assembly and long-read transcriptome of an Australian-sourced European starling (S. vulgaris vAU), and a second North American genome (S. vulgaris vNA), as complementary reference genomes for population genetic and evolutionary characterisation. S. vulgaris vAU combined 10x Genomics linked-reads, low-coverage Nanopore sequencing, and PacBio Iso-Seq full-length transcript scaffolding to generate a 1050 Mb assembly on 1,628 scaffolds (72.5 Mb scaffold N50). Species-specific transcript mapping and gene annotation revealed high structural and functional completeness (94.6% BUSCO completeness). Further scaffolding against the high-quality zebra finch (Taeniopygia guttata) genome assigned 98.6% of the assembly to 32 putative nuclear chromosome scaffolds. Rapid, recent advances in sequencing technologies and bioinformatics software have highlighted the need for evidence-based assessment of assembly decisions on a case-by-case basis. Using S. vulgaris vAU, we demonstrate how the multifunctional use of PacBio Iso-Seq transcript data and complementary homology-based annotation of sequential assembly steps (assessed using a new tool, SAAGA) can be used to assess, inform, and validate assembly workflow decisions. We also highlight some counter-intuitive behaviour in traditional BUSCO metrics, and present BUSCOMP, a complementary tool for assembly comparison designed to be robust to differences in assembly size and base-calling quality. Finally, we present a second starling assembly, S. vulgaris vNA, to facilitate comparative analysis and global genomic research on this ecologically important species.

Download Full-text

A near complete genome for goat genetic and genomic research

Genetics Selection Evolution ◽

10.1186/s12711-021-00668-5 ◽

2021 ◽

Vol 53 (1) ◽

Author(s):

Ran Li ◽

Peng Yang ◽

Xuelei Dai ◽

Hojjat Asadollahpour Nanaei ◽

Wenwen Fang ◽

...

Keyword(s):

Y Chromosome ◽

Genome Assembly ◽

De Novo ◽

Genetic Research ◽

Genomic Research ◽

Dairy Goat ◽

De Novo Genome Assembly ◽

Important Species ◽

Long Read ◽

High Level

Abstract Background Goat, one of the first domesticated livestock, is a worldwide important species both culturally and economically. The current goat reference genome, known as ARS1, is reported as the first nonhuman genome assembly using 69× PacBio sequencing. However, ARS1 suffers from incomplete X chromosome and highly fragmented Y chromosome scaffolds. Results Here, we present a very high-quality de novo genome assembly, Saanen_v1, from a male Saanen dairy goat, with the first goat Y chromosome scaffold based on 117× PacBio long-read sequencing and 118× Hi-C data. Saanen_v1 displays a high level of completeness thanks to the presence of centromeric and telomeric repeats at the proximal and distal ends of two-thirds of the autosomes, and a much reduced number of gaps (169 vs. 773). The completeness and accuracy of the Saanen_v1 genome assembly are also evidenced by more assembled sequences on the chromosomes (2.63 Gb for Saanen_v1 vs. 2.58 Gb for ARS1), a slightly increased mapping ratio for transcriptomic data, and more genes anchored to chromosomes. The eight putative large assembly errors (1 to ~ 7 Mb each) found in ARS1 were amended, and for the first time, the substitution rate of this ruminant Y chromosome was estimated. Furthermore, sequence improvement in Saanen_v1, compared with ARS1, enables us to assign the likely correct positions for 4.4% of the single nucleotide polymorphism (SNP) probes in the widely used GoatSNP50 chip. Conclusions The updated goat genome assembly including both sex chromosomes (X and Y) and the autosomes with high-resolution quality will serve as a valuable resource for goat genetic research and applications.

Download Full-text

A chromosome-anchored genome assembly for Lake Trout (Salvelinus namaycush)

10.22541/au.161792605.50398568/v1 ◽

2021 ◽

Author(s):

Seth Smith ◽

Eric Normandeau ◽

Haig Djambazian ◽

Pubudu Nawarathna ◽

Pierre Berube ◽

...

Keyword(s):

Genome Assembly ◽

Lake Trout ◽

Single Copy ◽

Salvelinus Namaycush ◽

Genomic Research ◽

Total Size ◽

Salmonid Species ◽

Long Read ◽

Potential Applications ◽

Conservation Concern

Here we present an annotated, chromosome-anchored, genome assembly for Lake Trout (Salvelinus namaycush) – a highly diverse salmonid species of notable conservation concern and an excellent model for research on adaptation and speciation. We leveraged Pacific Biosciences long-read sequencing, paired-end Illumina sequencing, proximity ligation (Hi-C), and a previously published linkage map to produce a highly contiguous assembly composed of 7,378 contigs (contig N50 = 1.8 mb) assigned to 4,120 scaffolds (scaffold N50 = 44.975 mb). 84.7% of the genome was assigned to 42 chromosome-sized scaffolds and 93.2% of Benchmarking Universal Single Copy Orthologs were recovered, putting this assembly on par with the best currently available salmonid genomes. Estimates of genome size based on k-mer frequency analysis were highly similar to the total size of the finished genome, suggesting that the entirety of the genome was recovered. A mitome assembly was also produced. Self-vs-self synteny analysis allowed us to identify homeologs resulting from the Salmonid specific autotetraploid event (Ss4R) and alignment with three other salmonid species allowed us to identify homologous chromosomes in other species. We also generated multiple resources useful for future genomic research on Lake Trout including a repeat library and a sex averaged recombination map. A novel RNA sequencing dataset was also used to produce a publicly available set of gene annotations using the National Center for Biotechnology Information Eukaryotic Genome Annotation Pipeline. Potential applications of these resources to population genetics and the conservation of native populations are discussed.

Download Full-text

Improving the Chromosome-Level Genome Assembly of the Siamese Fighting Fish (Betta splendens) in a University Master’s Course

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401205 ◽

2020 ◽

Vol 10 (7) ◽

pp. 2179-2183 ◽

Cited By ~ 1

Author(s):

Stefan Prost ◽

Malte Petersen ◽

Martin Grethlein ◽

Sarah Joy Hahn ◽

Nina Kuschik-Maczollek ◽

...

Keyword(s):

Genome Assembly ◽

High Throughput Sequencing ◽

Siamese Fighting Fish ◽

Betta Splendens ◽

High Quality ◽

Sequencing Platform ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Chromosome Level

Ever decreasing costs along with advances in sequencing and library preparation technologies enable even small research groups to generate chromosome-level assemblies today. Here we report the generation of an improved chromosome-level assembly for the Siamese fighting fish (Betta splendens) that was carried out during a practical university master’s course. The Siamese fighting fish is a popular aquarium fish and an emerging model species for research on aggressive behavior. We updated the current genome assembly by generating a new long-read nanopore-based assembly with subsequent scaffolding to chromosome-level using previously published Hi-C data. The use of ∼35x nanopore-based long-read data sequenced on a MinION platform (Oxford Nanopore Technologies) allowed us to generate a baseline assembly of only 1,276 contigs with a contig N50 of 2.1 Mbp, and a total length of 441 Mbp. Scaffolding using the Hi-C data resulted in 109 scaffolds with a scaffold N50 of 20.7 Mbp. More than 99% of the assembly is comprised in 21 scaffolds. The assembly showed the presence of 96.1% complete BUSCO genes from the Actinopterygii dataset indicating a high quality of the assembly. We present an improved full chromosome-level assembly of the Siamese fighting fish generated during a university master’s course. The use of ∼35× long-read nanopore data drastically improved the baseline assembly in terms of continuity. We show that relatively in-expensive high-throughput sequencing technologies such as the long-read MinION sequencing platform can be used in educational settings allowing the students to gain practical skills in modern genomics and generate high quality results that benefit downstream research projects.

Download Full-text

Long-read assembly of the Chinese rhesus macaque genome and identification of ape-specific structural variants

Nature Communications ◽

10.1038/s41467-019-12174-w ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 12

Author(s):

Yaoxi He ◽

Xin Luo ◽

Bin Zhou ◽

Ting Hu ◽

Xiaoyu Meng ◽

...

Keyword(s):

Rhesus Macaque ◽

Genome Assembly ◽

De Novo ◽

Gene Annotation ◽

Large Body ◽

Phenotypic Traits ◽

Structural Variants ◽

De Novo Genome Assembly ◽

Chinese Rhesus Macaque ◽

Long Read

Abstract We present a high-quality de novo genome assembly (rheMacS) of the Chinese rhesus macaque (Macaca mulatta) using long-read sequencing and multiplatform scaffolding approaches. Compared to the current Indian rhesus macaque reference genome (rheMac8), rheMacS increases sequence contiguity 75-fold, closing 21,940 of the remaining assembly gaps (60.8 Mbp). We improve gene annotation by generating more than two million full-length transcripts from ten different tissues by long-read RNA sequencing. We sequence resolve 53,916 structural variants (96% novel) and identify 17,000 ape-specific structural variants (ASSVs) based on comparison to ape genomes. Many ASSVs map within ChIP-seq predicted enhancer regions where apes and macaque show diverged enhancer activity and gene expression. We further characterize a subset that may contribute to ape- or great-ape-specific phenotypic traits, including taillessness, brain volume expansion, improved manual dexterity, and large body size. The rheMacS genome assembly serves as an ideal reference for future biomedical and evolutionary studies.

Download Full-text

LRSDAY: Long-read Sequencing Data Analysis for Yeasts

10.1101/184572 ◽

2017 ◽

Author(s):

Jia-Xing Yue ◽

Gianni Liti

Keyword(s):

Genome Assembly ◽

Model Organism ◽

Sequencing Data ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Downstream Analysis ◽

Eukaryotic Organisms ◽

Genomic Regions

AbstractLong-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organisms. Applying LRSDAY to a S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.

Download Full-text

Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

10.21203/rs.3.rs-712747/v1 ◽

2021 ◽

Author(s):

Arang Rhie ◽

Ann Mc Cartney ◽

Kishwar Shafin ◽

Michael Alonge ◽

Andrey Bzikadze ◽

...

Keyword(s):

Genome Assembly ◽

Tandem Repeats ◽

Hydatidiform Mole ◽

Segmental Duplications ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Human Genome Assembly ◽

Long Read ◽

Genome Assemblies ◽

Oxford Nanopore Technologies

Abstract Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies

Download Full-text

Improved Genome Sequence and Gene Annotation Resource for the Potato Late Blight Pathogen Phytophthora infestans

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi-02-20-0023-a ◽

2020 ◽

Vol 33 (8) ◽

pp. 1025-1028

Author(s):

Yoonyoung Lee ◽

Kwang-Soo Cho ◽

Jin-Hee Seo ◽

Kee Hoon Sohn ◽

Maxim Prokchorchik

Keyword(s):

Late Blight ◽

Phytophthora Infestans ◽

Molecular Mechanisms ◽

Gene Annotation ◽

Potato Late Blight ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Blight Disease ◽

Late Blight Pathogen

Phytophthora infestans is a devastating pathogen causing potato late blight (Solanum tuberosum). Here we report the sequencing, assembly and genome annotation for two Phytophthora infestans isolates sampled in Republic of Korea. Genome sequencing was carried out using long read (Oxford Nanopore) and short read (Illumina Nextseq) sequencing technologies that significantly improved the contiguity and quality of P. infestans genome assembly. Our resources would help researchers better understand the molecular mechanisms by which P. infestans causes late blight disease in the future.

Download Full-text

Long-Read Genome Sequence Resource of Ascochyta versabilis Causing Leaf Spot Disease in Pseudostellaria heterophylla

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi-05-20-0137-a ◽

2020 ◽

Vol 33 (12) ◽

pp. 1438-1440

Author(s):

Yunbo Kuang ◽

Zhe Wang ◽

Felix Abah ◽

Hongli Hu ◽

Baohua Wang ◽

...

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

Leaf Spot ◽

Gene Annotation ◽

Control Strategies ◽

Effective Control ◽

Leaf Spot Disease ◽

Spot Disease ◽

Pseudostellaria Heterophylla ◽

Long Read

Ascochyta versabilis is the fungal pathogen that causes the severe leaf spot disease of Pseudostellaria heterophylla (Miq.) Pax, a vital Chinese herbal plant. Here, we deployed PacBio single-molecule real-time long-read sequencing technology to generate a near-complete genome assembly for the A. versabilis KC1 strain and obtained a total of 9.80 Gb raw reads. These reads were processed into a 41.05 Mb genome assembly containing 95 contigs with N50 of 1.70 Mb and a maximum length of 3.93 Mb. A total of 10,457 gene models, of which 1,004 encode putatively secreted proteins, were identified in the genome. This high-quality genome assembly and gene annotation resource will facilitate the institution of functional genetic studies aimed at providing a better insight into the infection mechanisms of A. versabilis to support the development of effective control strategies for leaf spot disease of P. heterophylla.

Download Full-text

ISOdb: A Comprehensive Database of Full-Length Isoforms Generated by Iso-Seq

International Journal of Genomics ◽

10.1155/2018/9207637 ◽

2018 ◽

Vol 2018 ◽

pp. 1-6 ◽

Cited By ~ 1

Author(s):

Shang-Qian Xie ◽

Yue Han ◽

Xiao-Zhou Chen ◽

Tai-Yu Cao ◽

Kai-Kai Ji ◽

...

Keyword(s):

Single Molecule ◽

Full Length ◽

Public Access ◽

Transcript Isoforms ◽

Sequencing Technologies ◽

Long Reads ◽

Depth Analysis ◽

Gene Level ◽

Long Read ◽

Full Length Transcript

The accurate landscape of transcript isoforms plays an important role in the understanding of gene function and gene regulation. However, building complete transcripts is very challenging for short reads generated using next-generation sequencing. Fortunately, isoform sequencing (Iso-Seq) using single-molecule sequencing technologies, such as PacBio SMRT, provides long reads spanning entire transcript isoforms which do not require assembly. Therefore, we have developed ISOdb, a comprehensive resource database for hosting and carrying out an in-depth analysis of Iso-Seq datasets and visualising the full-length transcript isoforms. The current version of ISOdb has collected 93 publicly available Iso-Seq samples from eight species and presents the samples in two levels: (1) sample level, including metainformation, long read distribution, isoform numbers, and alternative splicing (AS) events of each sample; (2) gene level, including the total isoforms, novel isoform number, novel AS number, and isoform visualisation of each gene. In addition, ISOdb provides a user interface in the website for uploading sample information to facilitate the collection and analysis of researchers’ datasets. Currently, ISOdb is the first repository that offers comprehensive resources and convenient public access for hosting, analysing, and visualising Iso-Seq data, which is freely available.

Download Full-text

Long-read based assembly and annotation of a Drosophila simulans genome

10.1101/425710 ◽

2018 ◽

Author(s):

Pierre Nouhaud

Keyword(s):

Drosophila Simulans ◽

Read Length ◽

Sequencing Technologies ◽

Final Assembly ◽

Rnaseq Data ◽

Long Read ◽

Gene Structures ◽

Full Length Transcript ◽

Genome Assemblies ◽

Nested Gene

AbstractLong-read sequencing technologies enable high-quality, contiguous genome assemblies. Here we used SMRT sequencing to assemble the genome of a Drosophila simulans strain originating from Madagascar, the ancestral range of the species. We generated 8 Gb of raw data (~50× coverage) with a mean read length of 6,410 bp, a NR50 of 9,125 bp and the longest subread at 49 kb. We benchmarked six different assemblers and merged the best two assemblies from Canu and Falcon. Our final assembly was 127.41 Mb with a N50 of 5.38 Mb and 305 contigs. We anchored more than 4 Mb of novel sequence to the major chromosome arms, and significantly improved the assembly of peri-centromeric and telomeric regions. Finally, we performed full-length transcript sequencing and used this data in conjunction with short-read RNAseq data to annotate 13,422 genes in the genome, improving the annotation in regions with complex, nested gene structures.

Download Full-text