An improved de novo assembly and annotation of the tomato reference genome using single-molecule sequencing, Hi-C proximity ligation and optical maps

AbstractThe original Heinz 1706 reference genome was produced by a large team of scientists from across the globe from a variety of input sources that included 454 sequences in addition to full-length BACs, BAC and fosmid ends sequenced with Sanger technology. We present here the latest tomato reference genome (SL4.0) assembled de novo from PacBio long reads and scaffolded using Hi-C contact maps. The assembly was validated using Bionano optical maps and 10X linked-read sequences. This assembly is highly contiguous with fewer gaps compared to previous genome builds and almost all scaffolds have been anchored and oriented to the 12 tomato chromosomes. We have found more repeats compared to the previous versions and one of the largest repeat classes identified are the LTR retrotransposons. We also describe updates to the reference genome and annotation since the last publication. The corresponding ITAG4.0 annotation has 4,794 novel genes along with 29,281 genes preserved from ITAG2.4. Most of the updated genes have extensions in the 5’ and 3’ UTRs resulting in doubling of annotated UTRs per gene. The genome and annotation can be accessed using SGN through BLAST database, Pathway database (SolCyc), Apollo, JBrowse genome browser and FTP available at https://solgenomics.net.

Download Full-text

De novo Assembly of the Brugia malayi Genome Using Long Reads from a Single MinION Flowcell

Scientific Reports ◽

10.1038/s41598-019-55908-y ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 3

Author(s):

Joseph R. Fauver ◽

John Martin ◽

Gary J. Weil ◽

Makedonka Mitreva ◽

Peter U. Fischer

Keyword(s):

Single Molecule ◽

New Technologies ◽

Reference Genome ◽

De Novo ◽

Complete Mitochondrial Genome ◽

Nuclear Genome ◽

Brugia Malayi ◽

Field Isolates ◽

Sequencing Technologies ◽

Long Reads

AbstractFilarial nematode infections cause a substantial global disease burden. Genomic studies of filarial worms can improve our understanding of their biology and epidemiology. However, genomic information from field isolates is limited and available reference genomes are often discontinuous. Single molecule sequencing technologies can reduce the cost of genome sequencing and long reads produced from these devices can improve the contiguity and completeness of genome assemblies. In addition, these new technologies can make generation and analysis of large numbers of field isolates feasible. In this study, we assessed the performance of the Oxford Nanopore Technologies MinION for sequencing and assembling the genome of Brugia malayi, a human parasite widely used in filariasis research. Using data from a single MinION flowcell, a 90.3 Mb nuclear genome was assembled into 202 contigs with an N50 of 2.4 Mb. This assembly covered 96.9% of the well-defined B. malayi reference genome with 99.2% identity. The complete mitochondrial genome was obtained with individual reads and the nearly complete genome of the endosymbiotic bacteria Wolbachia was assembled alongside the nuclear genome. Long-read data from the MinION produced an assembly that approached the quality of a well-established reference genome using comparably fewer resources.

Download Full-text

AStrap: identification of alternative splicing from transcript sequences without a reference genome

Bioinformatics ◽

10.1093/bioinformatics/bty1008 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2654-2656 ◽

Cited By ~ 5

Author(s):

Guoli Ji ◽

Wenbin Ye ◽

Yaru Su ◽

Moliang Chen ◽

Guangzao Huang ◽

...

Keyword(s):

Machine Learning ◽

Alternative Splicing ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Supplementary Information ◽

Model Organisms ◽

Sequencing Data ◽

Extensive Evaluation ◽

Reference Genomes

Abstract Summary Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity, however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify much more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

The sequencing and de novo assembly of the Larimichthys crocea genome using PacBio and Hi-C technologies

Scientific Data ◽

10.1038/s41597-019-0194-3 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 8

Author(s):

Baohua Chen ◽

Zhixiong Zhou ◽

Qiaozhen Ke ◽

Yidi Wu ◽

Huaqiang Bai ◽

...

Keyword(s):

Marine Fish ◽

Single Molecule ◽

Large Scale ◽

Reference Genome ◽

De Novo ◽

Larimichthys Crocea ◽

Chromosome Conformation ◽

Protein Coding ◽

Total Length ◽

Chromosome Level

Abstract Larimichthys crocea is an endemic marine fish in East Asia that belongs to Sciaenidae in Perciformes. L. crocea has now been recognized as an “iconic” marine fish species in China because not only is it a popular food fish in China, it is a representative victim of overfishing and still provides high value fish products supported by the modern large-scale mariculture industry. Here, we report a chromosome-level reference genome of L. crocea generated by employing the PacBio single molecule sequencing technique (SMRT) and high-throughput chromosome conformation capture (Hi-C) technologies. The genome sequences were assembled into 1,591 contigs with a total length of 723.86 Mb and a contig N50 length of 2.83 Mb. After chromosome-level scaffolding, 24 scaffolds were constructed with a total length of 668.67 Mb (92.48% of the total length). Genome annotation identified 23,657 protein-coding genes and 7262 ncRNAs. This highly accurate, chromosome-level reference genome of L. crocea provides an essential genome resource to support the development of genome-scale selective breeding and restocking strategies of L. crocea.

Download Full-text

De novo assembly of the cattle reference genome with single-molecule sequencing

GigaScience ◽

10.1093/gigascience/giaa021 ◽

2020 ◽

Vol 9 (3) ◽

Cited By ~ 35

Author(s):

Benjamin D Rosen ◽

Derek M Bickhart ◽

Robert D Schnabel ◽

Sergey Koren ◽

Christine G Elsik ◽

...

Keyword(s):

Single Molecule ◽

De Novo Assembly ◽

Reference Genome ◽

De Novo ◽

Bos Taurus ◽

Future Research ◽

Protein Coding ◽

Single Molecule Sequencing ◽

Assembly Accuracy ◽

Genomic Tools

Abstract Background Major advances in selection progress for cattle have been made following the introduction of genomic tools over the past 10–12 years. These tools depend upon the Bos taurus reference genome (UMD3.1.1), which was created using now-outdated technologies and is hindered by a variety of deficiencies and inaccuracies. Results We present the new reference genome for cattle, ARS-UCD1.2, based on the same animal as the original to facilitate transfer and interpretation of results obtained from the earlier version, but applying a combination of modern technologies in a de novo assembly to increase continuity, accuracy, and completeness. The assembly includes 2.7 Gb and is >250× more continuous than the original assembly, with contig N50 >25 Mb and L50 of 32. We also greatly expanded supporting RNA-based data for annotation that identifies 30,396 total genes (21,039 protein coding). The new reference assembly is accessible in annotated form for public use. Conclusions We demonstrate that improved continuity of assembled sequence warrants the adoption of ARS-UCD1.2 as the new cattle reference genome and that increased assembly accuracy will benefit future research on this species.

Download Full-text

Illumina TruSeq synthetic long-reads empowerde novoassembly and resolve complex, highly repetitive transposable elements

10.1101/001834 ◽

2014 ◽

Cited By ~ 1

Author(s):

Rajiv C McCoy ◽

Ryan W Taylor ◽

Timothy A Blauwkamp ◽

Joanna L Kelley ◽

Michael Kertesz ◽

...

Keyword(s):

Transposable Elements ◽

Reference Genome ◽

De Novo ◽

Model Organism ◽

Genomic Analysis ◽

High Sequence Identity ◽

Current Reference ◽

Sequencing Technologies ◽

Long Reads ◽

Whole Genomes

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including thede novoassembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or present in complex genomic arrangements. While TEs strongly affect genome function and evolution, most currentde novoassembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and achieve lengths of 1.5-18.5 Kbp with an extremely low error rate (∼0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organismDrosophila melanogaster(reference genome strainy;cn,bw,sp) achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 of annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long- reads, and likely other methods that generate long reads, offer a powerful approach to improvede novoassemblies of whole genomes.

Download Full-text

A whole genome atlas of 81 Psilocybe genomes as a resource for psilocybin production.

F1000Research ◽

10.12688/f1000research.55301.2 ◽

2021 ◽

Vol 10 ◽

pp. 961

Author(s):

Kevin McKernan ◽

Liam Kane ◽

Yvonne Helbert ◽

Lei Zhang ◽

Nathan Houde ◽

...

Keyword(s):

Gene Cluster ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Genomic Diversity ◽

Sequence Coverage ◽

Single Molecule Sequencing ◽

Contiguous Gene ◽

Long Read ◽

Interesting Variation

The Psilocybe genus is well known for the synthesis of valuable psychoactive compounds such as Psilocybin, Psilocin, Baeocystin and Aeruginascin. The ubiquity of Psilocybin synthesis in Psilocybe has been attributed to a horizontal gene transfer mechanism of a ~20Kb gene cluster. A recently published highly contiguous reference genome derived from long read single molecule sequencing has underscored interesting variation in this Psilocybin synthesis gene cluster. This reference genome has also enabled the shotgun sequencing of spores from many Psilocybe strains to better catalog the genomic diversity in the Psilocybin synthesis pathway. Here we present the de novo assembly of 81 Psilocybe genomes compared to the P.envy reference genome. Surprisingly, the genomes of Psilocybe galindoi, Psilocybe tampanensis and Psilocybe azurescens lack sequence coverage over the previously described Psilocybin synthesis pathway but do demonstrate amino acid sequence homology to a less contiguous gene cluster and may illuminate the previously proposed evolution of psilocybin synthesis.

Download Full-text

Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads

10.1101/345983 ◽

2018 ◽

Cited By ~ 2

Author(s):

Huilong Du ◽

Chengzhi Liang

Keyword(s):

Single Molecule ◽

High Efficiency ◽

Reference Genome ◽

Repetitive Sequences ◽

Sequencing Data ◽

High Quality ◽

Single Molecule Sequencing ◽

Genome Maps ◽

Long Reads ◽

Novel Method

AbstractDue to the large number of repetitive sequences in complex eukaryotic genomes, fragmented and incompletely assembled genomes lose value as reference sequences, often due to short contigs that cannot be anchored or mispositioned onto chromosomes. Here we report a novel method Highly Efficient Repeat Assembly (HERA), which includes a new concept called a connection graph as well as algorithms for constructing the graph. HERA resolves repeats at high efficiency with single-molecule sequencing data, and enables the assembly of chromosome-scale contigs by further integrating genome maps and Hi-C data. We tested HERA with the genomes of rice R498, maize B73, human HX1 and Tartary buckwheat Pinku1. HERA can correctly assemble most of the tandemly repetitive sequences in rice using single-molecule sequencing data only. Using the same maize and human sequencing data published by Jiao et al. (2017) and Shi et al. (2016), respectively, we dramatically improved on the sequence contiguity compared with the published assemblies, increasing the contig N50 from 1.3 Mb to 61.2 Mb in maize B73 assembly and from 8.3 Mb to 54.4 Mb in human HX1 assembly with HERA. We provided a high-quality maize reference genome with 96.9% of the gaps filled (only 76 gaps left) and several incorrectly positioned sequences fixed compared with the B73 RefGen_v4 assembly. Comparisons between the HERA assembly of HX1 and the human GRCh38 reference genome showed that many gaps in GRCh38 could be filled, and that GRCh38 contained some potential errors that could be fixed. We assembled the Pinku1 genome into 12 scaffolds with a contig N50 size of 27.85 Mb. HERA serves as a new genome assembly/phasing method to generate high quality sequences for complex genomes and as a curation tool to improve the contiguity and completeness of existing reference genomes, including the correction of assembly errors in repetitive regions.

Download Full-text

Single-molecule sequencing and conformational capture enable de novo mammalian reference genomes

10.1101/064352 ◽

2016 ◽

Cited By ~ 10

Author(s):

Derek M. Bickhart ◽

Benjamin D. Rosen ◽

Sergey Koren ◽

Brian L. Sayre ◽

Alex R. Hastie ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Chromosome Length ◽

Chromatin Interaction ◽

Capra Hircus ◽

Immune Gene ◽

Ruminant Species ◽

Long Reads ◽

Assembly Algorithms ◽

Genome Assemblies

AbstractThe decrease in sequencing cost and increased sophistication of assembly algorithms for short-read platforms has resulted in a sharp increase in the number of species with genome assemblies. However, these assemblies are highly fragmented, with many gaps, ambiguities, and errors, impeding downstream applications. We demonstrate current state of the art for de novo assembly using the domestic goat (Capra hircus), based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced the most contiguous de novo mammalian assembly to date, with chromosome-length scaffolds and only 663 gaps. Our assembly represents a >250-fold improvement in contiguity compared to the previously published C. hircus assembly, and better resolves repetitive structures longer than 1 kb, supporting the most complete repeat family and immune gene complex representation ever produced for a ruminant species.

Download Full-text

A high-quality reference genome for a parasitic bivalve with doubly uniparental inheritance (Bivalvia: Unionida)

Genome Biology and Evolution ◽

10.1093/gbe/evab029 ◽

2021 ◽

Author(s):

Chase H Smith

Keyword(s):

Single Molecule ◽

Reference Genome ◽

Economic Value ◽

Freshwater Mussel ◽

Uniparental Inheritance ◽

Doubly Uniparental Inheritance ◽

High Quality ◽

Long Reads ◽

A Genome ◽

Genomic Studies

Abstract From a genomics perspective, bivalves (Mollusca: Bivalvia) have been poorly explored with the exception for those of high economic value. The bivalve order Unionida, or freshwater mussels, has been of interest in recent genomic studies due to their unique mitochondrial biology and peculiar life cycle. However, genomic studies have been hindered by the lack of a high-quality reference genome. Here, I present a genome assembly of Potamilus streckersoni using Pacific Bioscience single-molecule real-time long reads and 10X Genomics linked read sequencing. Further, I use RNA sequencing from multiple tissue types and life stages to annotate the reference genome. The final assembly was far superior to any previously published freshwater mussel genome and was represented by 2,368 scaffolds (2,472 contigs) and 1,776,755,624 bp, with a scaffold N50 of 2,051,244 bp. A high proportion of the assembly was comprised of repetitive elements (51.03%), aligning with genomic characteristics of other bivalves. The functional annotation returned 52,407 gene models (41,065 protein, 11,342 tRNAs), which was concordant with the estimated number of genes in other freshwater mussel species. This genetic resource, along with future studies developing high-quality genome assemblies and annotations, will be integral toward unraveling the genomic bases of ecologically and evolutionarily important traits in this hyper-diverse group.

Download Full-text

De novo design of transmembrane β-barrels

10.1101/2020.10.22.346965 ◽

2020 ◽

Author(s):

Anastassia A. Vorobieva ◽

Paul White ◽

Binyong Liang ◽

Jim E Horne ◽

Asim K. Bera ◽

...

Keyword(s):

Lipid Bilayers ◽

Single Molecule ◽

First Principles ◽

Lipid Membranes ◽

De Novo ◽

Computational Design ◽

Protein Sequencing ◽

Naturally Occurring ◽

Transmembrane Pores ◽

Almost All

AbstractThe ability of naturally occurring transmembrane β-barrel proteins (TMBs) to spontaneously insert into lipid bilayers and form stable transmembrane pores is a remarkable feat of protein evolution and has been exploited in biotechnology for applications ranging from single molecule DNA and protein sequencing to biomimetic filtration membranes. Because it has not been possible to design TMBs from first principles, these efforts have relied on re-engineering of naturally occurring TMBs that generally have a biological function very different from that desired. Here we leverage the power of de novo computational design coupled with a “hypothesis, design and test” approach to determine principles underlying TMB structure and folding, and find that, unlike almost all other classes of protein, locally destabilizing sequences in both the β-turns and β-strands facilitate TMB expression and global folding by modulating the kinetics of folding and the competition between soluble misfolding and proper folding into the lipid bilayer. We use these principles to design new eight stranded TMBs with sequences unrelated to any known TMB and show that they insert and fold into detergent micelles and synthetic lipid membranes. The designed proteins fold more rapidly and reversibly in lipid membranes than the TMB domain of the model native protein OmpA, and high resolution NMR and X-ray crystal structures of one of the designs are very close to the computational model. The ability to design TMBs from first principles opens the door to custom design of TMBs for biotechnology and demonstrates the value of de novo design to investigate basic protein folding problems that are otherwise hidden by evolutionary history.One sentence summarySuccess in de novo design of transmembrane β-barrels reveals geometric and sequence constraints on the fold and paves the way to design of custom pores for sequencing and other single-molecule analytical applications.

Download Full-text