A reference genome for the critically endangered woylie, Bettongia penicillata ogilbyi

2021 ◽  
Author(s):  
Emma Peel ◽  
Luke Silver ◽  
Parice Brandies ◽  
Carolyn J Hogg ◽  
Katherine Belov

Biodiversity is declining globally, and Australia has one of the worst extinction records for mammals. The development of sequencing technologies means that genomic approaches are now available as important tools for wildlife conservation and management. Despite this, genome sequences are available for only 5% of threatened Australian species. Here we report the first reference genome for the woylie (Bettongia penicillata ogilbyi), a critically endangered marsupial from Western Australia, and the first genome within the Potoroidae family. The woylie reference genome was generated using Pacific Biosciences HiFi long-reads, resulting in a 3.39 Gbp assembly with a scaffold N50 of 6.49 Mbp and 86.5% complete mammalian BUSCOs. Assembly of a global transcriptome from pouch skin, tongue, heart and blood RNA-seq reads was used to guide annotation with Fgenesh++, resulting in the annotation of 24,655 genes. The woylie reference genome is a valuable resource for conservation, management and investigations into disease-induced decline of this critically endangered marsupial.

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Joseph R. Fauver ◽  
John Martin ◽  
Gary J. Weil ◽  
Makedonka Mitreva ◽  
Peter U. Fischer

Abstract Filarial nematode infections cause a substantial global disease burden. Genomic studies of filarial worms can improve our understanding of their biology and epidemiology. However, genomic information from field isolates is limited and available reference genomes are often discontinuous. Single molecule sequencing technologies can reduce the cost of genome sequencing and long reads produced from these devices can improve the contiguity and completeness of genome assemblies. In addition, these new technologies can make generation and analysis of large numbers of field isolates feasible. In this study, we assessed the performance of the Oxford Nanopore Technologies MinION for sequencing and assembling the genome of Brugia malayi, a human parasite widely used in filariasis research. Using data from a single MinION flowcell, a 90.3 Mb nuclear genome was assembled into 202 contigs with an N50 of 2.4 Mb. This assembly covered 96.9% of the well-defined B. malayi reference genome with 99.2% identity. The complete mitochondrial genome was obtained with individual reads and the nearly complete genome of the endosymbiotic bacterium Wolbachia was assembled alongside the nuclear genome. Long-read data from the MinION produced an assembly that approached the quality of a well-established reference genome using comparatively fewer resources.
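Several of the abstracts on this page report contiguity as an N50 value. As a minimal sketch of what that statistic means (the function name and the toy contig lengths are illustrative assumptions, not data from any of the studies), N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size:

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly: five contigs totalling 20 units.
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))
```

For the toy input the two longest contigs (6 and 5) already hold 11 of the 20 units, so the sketch reports 5. A larger N50 therefore indicates that more of the assembly sits in long pieces.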


2014 ◽  
Author(s):  
Rajiv C McCoy ◽  
Ryan W Taylor ◽  
Timothy A Blauwkamp ◽  
Joanna L Kelley ◽  
Michael Kertesz ◽  
...  

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or presence in complex genomic arrangements. While TEs strongly affect genome function and evolution, most current de novo assembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and achieve lengths of 1.5-18.5 Kbp with an extremely low error rate (∼0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain y; cn, bw, sp), achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long-reads, and likely other methods that generate long reads, offer a powerful approach to improve de novo assemblies of whole genomes.


2020 ◽  
Vol 2 (3) ◽  
Author(s):  
Cheng He ◽  
Guifang Lin ◽  
Hairong Wei ◽  
Haibao Tang ◽  
Frank F White ◽  
...  

Abstract Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue is common, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics enable the classification of k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.
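The comparison described in this abstract, between the copy number of a k-mer implied by short reads and the copy number seen in the assembly, can be illustrated with a small sketch. Everything here is an assumption for illustration: the function names, the toy sequences, and the particular score log2((c + m) / (m (n + 1))), where c is the read count of a k-mer, m is the read depth expected for a single-copy k-mer, and n is its count in the assembly. The abstract itself does not state a formula, the KAD software may differ in detail, and reverse complements are ignored for brevity.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kad_like_scores(reads, assembly, k, single_copy_depth):
    """Score each k-mer by comparing read abundance to assembly copy number.

    Illustrative score: log2((c + m) / (m * (n + 1))).
    It is 0 when reads and assembly agree (c == m * n), negative when the
    assembly holds a k-mer the reads do not support, and positive when the
    reads hold a k-mer the assembly lacks.
    """
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    asm_counts = Counter(kmers(assembly, k))
    m = single_copy_depth
    scores = {}
    for km in set(read_counts) | set(asm_counts):
        c = read_counts[km]   # Counter returns 0 for unseen k-mers
        n = asm_counts[km]
        scores[km] = math.log2((c + m) / (m * (n + 1)))
    return scores


# Hypothetical toy data: depth 2 for single-copy k-mers.
toy_reads = ["ACGTA", "ACGTA", "GGG", "GGG"]
toy_assembly = "ACGTAC"
scores = kad_like_scores(toy_reads, toy_assembly, k=3, single_copy_depth=2)
for km in sorted(scores):
    print(km, scores[km])
```

In the toy data, k-mers supported equally by reads and assembly score 0, the assembly-only k-mer TAC scores -1 (a candidate assembly error), and the read-only k-mer GGG scores 1 (sequence possibly missing from the assembly).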


Author(s):  
Lucile Broseus ◽  
Aubin Thomas ◽  
Andrew J. Oldfield ◽  
Dany Severac ◽  
Emeric Dubois ◽  
...  

ABSTRACT Motivation Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous “hybrid correction” algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data. Results We have created a novel reference-free algorithm called TALC (Transcription Aware Long Read Correction) which models changes in RNA expression and isoform representation in a weighted De-Bruijn graph to correct long reads from transcriptome studies. We show that transcription aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology. Availability and Implementation TALC is implemented in C++ and available at https://gitlab.igh.cnrs.fr/lbroseus/[email protected]


2021 ◽  
Author(s):  
Dongxue Zhao ◽  
Yan Zhang ◽  
Yizeng Lu ◽  
Mao Chai ◽  
Liqiang Fan ◽  
...  

Sorbus pohuashanensis is a potential horticultural and medicinal plant, but its genomic and genetic background remains unknown. Here, we de novo sequenced and assembled the S. pohuashanensis (Hance) Hedl. reference genome using PacBio long reads. Based on the new reference genome, we resequenced a core collection of 22 Sorbus spp. samples, which were divided into two groups (G1 and G2) based on phylogenetic and PCA analysis. These phylogenetic clusters were highly consistent with the classification based on leaf shape. Natural hybridization between the G1 and G2 groups was evidenced by a sample (R21) with a highly heterozygous genotype. Nucleotide diversity (π) analysis showed that G1 has a higher diversity than G2, and that G2 originated from G1. During the evolution process, the gene families involved in photosynthesis pathways expanded and gene families involved in energy consumption contracted. Comparative genome analysis showed that S. pohuashanensis has a high level of chromosomal synteny with Malus domestica and Pyrus communis. RNA-seq data suggested that flavonol biosynthesis and heat-shock protein (HSP)-heat-shock factor (HSF) pathways play important roles in protection against sunburn. This research provides new insight into the evolution of Sorbus spp. genomes. In addition, the genomic resources and the identified genetic variations, especially those genes related to stress resistance, will help future efforts to introduce and breed Sorbus spp.


GigaScience ◽  
2021 ◽  
Vol 10 (4) ◽  
Author(s):  
Tiantian Zhao ◽  
Wenxu Ma ◽  
Zhen Yang ◽  
Lisong Liang ◽  
Xin Chen ◽  
...  

Abstract Background Corylus heterophylla Fisch. is a species of the Betulaceae family native to China. As an economically and ecologically important nut tree, C. heterophylla can survive in extremely low temperatures (–30 to –40 °C). To deepen our knowledge of the Betulaceae species and facilitate the use of C. heterophylla for breeding and its genetic improvement, we have sequenced the whole genome of C. heterophylla. Findings Based on >64.99 Gb (∼175.30×) of Nanopore long reads, we assembled a 370.75-Mb C. heterophylla genome with contig N50 and scaffold N50 sizes of 2.07 and 31.33 Mb, respectively, accounting for 99.23% of the estimated genome size (373.61 Mb). Furthermore, 361.90 Mb contigs were anchored to 11 chromosomes using Hi-C link data, representing 97.61% of the assembled genome sequences. Transcriptomes representing 4 different tissues were sequenced to assist protein-coding gene prediction. A total of 27,591 protein-coding genes were identified, of which 92.02% (25,389) were functionally annotated. The phylogenetic analysis showed that C. heterophylla is close to Ostrya japonica, and they diverged from their common ancestor ∼52.79 million years ago. Conclusions We generated a high-quality chromosome-level genome of C. heterophylla. This genome resource will promote research on the molecular mechanisms of how the hazelnut responds to environmental stresses and serve as an important resource for genome-assisted improvement in cold and drought resistance of the Corylus genus.
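The summary figures in this abstract are related by simple ratios, and a short sketch makes those relationships explicit. The helper names below are illustrative assumptions; the input numbers are taken directly from the abstract, and read depth is assumed here to be total read bases divided by assembly size.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def read_depth(read_bases_mb, genome_size_mb):
    """Return fold coverage: sequenced bases divided by genome size."""
    return read_bases_mb / genome_size_mb


# Values reported in the abstract (sizes in Mb).
read_bases_mb = 64_990.0      # 64.99 Gb of Nanopore reads
assembly_mb = 370.75
estimated_genome_mb = 373.61
anchored_mb = 361.90
genes_total = 27_591
genes_annotated = 25_389

print(f"read depth: {read_depth(read_bases_mb, assembly_mb):.2f}x")
print(f"assembly vs. estimate: {percent(assembly_mb, estimated_genome_mb):.2f}%")
print(f"anchored to chromosomes: {percent(anchored_mb, assembly_mb):.2f}%")
print(f"functionally annotated: {percent(genes_annotated, genes_total):.2f}%")
```

The three percentages reproduce the reported 99.23%, 97.61% and 92.02%. The depth comes out marginally below the reported ∼175.30×, which is consistent with the abstract giving the read total as a lower bound (>64.99 Gb).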


Author(s):  
Cheng He ◽  
Guifang Lin ◽  
Hairong Wei ◽  
Haibao Tang ◽  
Frank F White ◽  
...  

ABSTRACT Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue is common, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as K-mer Abundance Difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics enable the classification of k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Therefore, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.


2018 ◽  
Author(s):  
Grzegorz M Boratyn ◽  
Jean Thierry-Mieg ◽  
Danielle Thierry-Mieg ◽  
Ben Busby ◽  
Thomas L Madden

ABSTRACT Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA on the genome and accurately discover introns, especially with long reads. We introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline. It uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate the performance of Magic-BLAST to accurately map short or long sequences and its ability to discover introns on real RNA-seq data sets from PacBio, Roche and Illumina runs, and on six benchmarks, and compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome. We show that Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform. It is versatile and robust to high levels of mismatches or extreme base composition, and reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI.


2021 ◽  
Author(s):  
Guodong Zhang ◽  
Xin Jin ◽  
Xiubao Li ◽  
Ning Zhang ◽  
Shaoqian Li ◽  
...  

Abstract The phosphatidylethanolamine-binding protein (PEBP) genes are involved in regulating plant flowering and tuberization. We analyzed both the recently updated, long-reads-based reference genome (DM v6.1) and the previous short-reads-based annotation (PGSC DM v3.4) of the potato reference genome and characterized heat-induced gene expression using RT-PCR and RNA-Seq. Fifteen PEBP genes were identified from DM v6.1 and named StPEBP1 to StPEBP15 based on their chromosomal locations. Six of these genes were not found in the previous annotation (DM v3.4). The 15 genes could be classified into FT, TFL, MFT, and PEBP-like subfamilies and were located on 6 chromosomes. Most of the StPEBP genes were found to have conserved motifs 1 to 5 similar to those of Arabidopsis and other plants. We found that heat stress induced opposite expression patterns of certain FT and TFL members in a tissue-specific way: StPEBP14 and StPEBP15 versus StPEBP3 and StPEBP10 in leaves, StPEBP4 versus StPEBP10 in roots, and StPEBP9 versus StPEBP3 in tubers (FT versus TFL, respectively). This maintenance of the opposite FT/TFL expression pattern, but involving tissue-specific PEBP members, may partly explain why different potato organs have different sensitivities to heat stress. Our study provides an important multiuse genomic resource, along with relevant information and candidate genes for genetic improvement of heat tolerance in potato. It clearly supports the view that the long-reads-based genome assembly and annotation provide a better genomic resource for identification of PEBP and perhaps other genes.


2020 ◽  
Vol 36 (20) ◽  
pp. 5000-5006 ◽  
Author(s):  
Lucile Broseus ◽  
Aubin Thomas ◽  
Andrew J Oldfield ◽  
Dany Severac ◽  
Emeric Dubois ◽  
...  

Abstract Motivation Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous ‘hybrid correction’ algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data. Results We have created a novel reference-free algorithm called Transcript-level Aware Long-Read Correction (TALC) which models changes in RNA expression and isoform representation in a weighted De Bruijn graph to correct long reads from transcriptome studies. We show that transcript-level aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology. Availability and implementation TALC is implemented in C++ and available at https://github.com/lbroseus/TALC. Supplementary information Supplementary data are available at Bioinformatics online.

