Novel splicing and open reading frames revealed by long-read direct RNA sequencing of adenovirus transcripts

Abstract. Adenovirus is a common human pathogen that relies on host cell processes for production and processing of viral RNA. Although adenoviral promoters, splice junctions, and cleavage and polyadenylation sites have been characterized using low-throughput biochemical techniques or short-read cDNA-based sequencing, these technologies do not fully capture the complexity of the adenoviral transcriptome. By combining Illumina short-read and nanopore long-read direct RNA sequencing approaches, we mapped transcription start sites and cleavage and polyadenylation sites across the adenovirus genome. The canonical viral early and late RNA cassettes were confirmed, but analysis of splice junctions within long RNA reads revealed an additional 20 novel viral transcripts. These RNAs include seven new splice junctions that lead to expression of canonical open reading frames (ORFs), as well as 13 transcripts encoding messages that potentially alter protein functions through truncations or the fusion of canonical ORFs. In addition, we detect RNAs that bypass canonical cleavage sites and generate potential chimeric proteins by linking separate gene transcription units. Our work highlights how long-read sequencing technologies can reveal further complexity within viral transcriptomes.

Comparative assessment of long-read error-correction software applied to RNA-sequencing data

10.1101/476622 ◽

2018 ◽

Cited By ~ 2

Author(s):

Leandro Lima ◽

Camille Marchet ◽

Ségolène Caboche ◽

Corinne Da Silva ◽

Benjamin Istace ◽

...

Keyword(s):

Error Correction ◽

Rna Sequencing ◽

Gene Families ◽

Error Rates ◽

Open Reading Frames ◽

Sequencing Data ◽

Sequencing Technologies ◽

Isoform Diversity ◽

Long Read ◽

Read Error Correction

Abstract. Motivation: Long-read sequencing technologies offer promising alternatives to high-throughput short-read sequencing, especially in the context of RNA-sequencing. However, these technologies are currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames, and the creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error-correction of RNA-sequencing long reads remain limited. Results: In this article, we evaluate the extent to which existing long-read DNA error-correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical error-correction metrics but also the effect of correction on gene families, isoform diversity, bias towards the major isoform, and splice site detection. We find that long-read error-correction tools that were originally developed for DNA are also suitable for the correction of RNA-sequencing data, especially in terms of increasing base-pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error-correction tools should be used, depending on the application type. Benchmarking software: https://gitlab.com/leoisl/LR_EC_analyser

Decoding the architecture of the varicella-zoster virus transcriptome

10.1101/2020.05.25.110965 ◽

2020 ◽

Author(s):

Shirley E. Braspenning ◽

Tomohiko Sadaoka ◽

Judith Breuer ◽

Georges M.G.M Verjans ◽

Werner J.D. Ouwendijk ◽

...

Keyword(s):

Varicella Zoster Virus ◽

Rna Sequencing ◽

Open Reading Frames ◽

Varicella Zoster ◽

Transcription Start Sites ◽

Functional Studies ◽

Double Stranded Dna ◽

Long Read ◽

Infected Cells ◽

Splice Junctions

Summary. Varicella-zoster virus (VZV), a double-stranded DNA virus, causes varicella, establishes lifelong latency in ganglionic neurons, and reactivates later in life to cause herpes zoster, commonly associated with chronic pain. The VZV genome is densely packed and produces multitudes of overlapping transcripts deriving from both strands. While 71 distinct open reading frames (ORFs) have thus far been experimentally defined, the full coding potential of VZV remains unknown. Here, we integrated multiple short-read RNA sequencing approaches with long-read direct RNA sequencing on RNA isolated from VZV-infected cells to provide a comprehensive reannotation of the lytic VZV transcriptome architecture. Through precise mapping of transcription start sites, splice junctions, and polyadenylation sites, we identified 136 distinct polyadenylated VZV RNAs that encode canonical ORFs, non-canonical ORFs, and ORF fusions, as well as putative non-coding RNAs (ncRNAs). Furthermore, we determined the kinetic class of all VZV transcripts and observed, unexpectedly, that transcripts encoding the ORF62 protein, previously designated as immediate-early, were expressed with late kinetics. Our work showcases the complexity of the VZV transcriptome and provides a comprehensive resource that will facilitate future functional studies of coding RNAs, ncRNAs, and the biological mechanisms underlying the regulation of viral transcription and translation during lytic VZV infection.

Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome

Genome Biology ◽

10.1186/s13059-021-02369-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Robin-Lee Troskie ◽

Yohaann Jafrani ◽

Tim R. Mercer ◽

Adam D. Ewing ◽

Geoffrey J. Faulkner ◽

...

Keyword(s):

Cultured Cells ◽

Open Reading Frames ◽

Cdna Sequencing ◽

Protein Coding ◽

Dynamic Component ◽

Gene Copies ◽

Long Read ◽

Normal Human ◽

Reading Frames ◽

Transcriptional Landscape

Abstract. Pseudogenes are gene copies presumed to be mainly functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns. Some pseudogene transcripts have intact open reading frames and are translated in cultured cells, representing unannotated protein-coding genes. To assess the biological impact of noncoding pseudogenes, we delete the nucleus-enriched pseudogene PDCL3P4 using CRISPR-Cas9 and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the human transcriptional landscape.

Comprehensive identification of transposable element insertions using multiple sequencing technologies

Nature Communications ◽

10.1038/s41467-021-24041-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Chong Chu ◽

Rebeca Borges-Monroy ◽

Vinayak V. Viswanadham ◽

Soohyun Lee ◽

Heng Li ◽

...

Keyword(s):

Transposable Element ◽

Structure And Function ◽

Endogenous Retroviruses ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

And Function

Abstract. Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.

Mapping and phasing of structural variation in patient genomes using nanopore sequencing

10.1101/129379 ◽

2017 ◽

Cited By ~ 4

Author(s):

Mircea Cretu Stancu ◽

Markus J. van Roosmalen ◽

Ivo Renkens ◽

Marleen Nieboer ◽

Sjors Middelkamp ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Structural Variants ◽

Human Genetic Disease ◽

Structural Genomic ◽

Short Read ◽

Sequencing Technologies ◽

Genome Wide ◽

Long Read ◽

Complex Structural

Abstract. Structural genomic variants form a common type of genetic alteration underlying human genetic disease and phenotypic variation. Despite major improvements in genome sequencing technology and data analysis, the detection of structural variants still poses challenges, particularly when variants are of high complexity. Emerging long-read single-molecule sequencing technologies provide new opportunities for detection of structural variants. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the ONT MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline, NanoSV, to efficiently map genomic structural variants (SVs) from the long-read data. We demonstrate that the nanopore data are superior to corresponding short-read data with regard to detection of de novo rearrangements originating from complex chromothripsis events in the patients. Additionally, genome-wide surveillance of SVs revealed 3,253 (33%) novel variants that were missed in short-read data of the same sample, the majority of which are duplications < 200 bp in size. Long sequencing reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs and SNVs. We employed read-based phasing to show that all de novo chromothripsis breakpoints occurred on paternal chromosomes and we resolved the long-range structure of the chromothripsis. This work demonstrates the value of long-read sequencing for screening whole genomes of patients for complex structural variants.

Complete Genome Resequencing of Thermus thermophilus Strain TMY by Hybrid Assembly of Long- and Short-Read Sequencing Technologies

Microbiology Resource Announcements ◽

10.1128/mra.00979-21 ◽

2021 ◽

Vol 10 (46) ◽

Author(s):

Kentaro Miyazaki ◽

Natsuko Tokito

Keyword(s):

Complete Genome ◽

Thermus Thermophilus ◽

Genomic Analysis ◽

Comparative Genomic ◽

Hybrid Assembly ◽

Genome Resequencing ◽

Short Read ◽

Content Type ◽

Sequencing Technologies ◽

Long Read

Complete genome resequencing was conducted for Thermus thermophilus strain TMY by hybrid assembly of Oxford Nanopore Technologies long-read and MGI short-read data. Errors in the previously reported genome sequence determined by PacBio technology alone were corrected, allowing for high-quality comparative genomic analysis of closely related T. thermophilus genomes.

Abstract 2724: A combination of short-read and long-read RNA sequencing reveals NOVA1’s role in telomere biology

10.1158/1538-7445.sabcs18-2724 ◽

2019 ◽

Author(s):

Andrew T. Ludlow ◽

Mohammed E. Sayed ◽

Aaron L. Slusher ◽

Mark Ribick ◽

Anisha Pancholi ◽

...

Keyword(s):

Rna Sequencing ◽

Short Read ◽

Long Read ◽

Telomere Biology

Swan: a library for the analysis and visualization of long-read transcriptomes

Bioinformatics ◽

10.1093/bioinformatics/btaa836 ◽

2020 ◽

Author(s):

Fairlie Reese ◽

Ali Mortazavi

Keyword(s):

Rna Sequencing ◽

Cell Lines ◽

Intron Retention ◽

Exon Skipping ◽

Differentially Expressed ◽

Transcript Isoforms ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read

Abstract. Motivation: Long-read RNA-sequencing technologies such as PacBio and Oxford Nanopore have uncovered an explosion of new transcript isoforms that are difficult to visually analyze using currently available tools. We introduce the Swan Python library, which is designed to analyze and visualize transcript models. Results: Swan finds 4909 differentially expressed transcripts between cell lines HepG2 and HFFc6, including 279 that are differentially expressed even though the parent gene is not. Additionally, Swan discovers 285 reproducible exon skipping and 47 intron retention events not recorded in the GENCODE v29 annotation. Availability and implementation: The Swan library for Python 3 is available on PyPI at https://pypi.org/project/swan-vis/ and on GitHub at https://github.com/mortazavilab/swan_vis.

Analysis of the genome of Spodoptera frugiperda nucleopolyhedrovirus (SfMNPV-19) and of the high genomic heterogeneity in group II nucleopolyhedroviruses

Journal of General Virology ◽

10.1099/vir.0.83581-0 ◽

2008 ◽

Vol 89 (5) ◽

pp. 1202-1211 ◽

Cited By ~ 32

Author(s):

José Luiz Caldas Wolff ◽

Fernando Hercos Valicente ◽

Renata Martins ◽

Juliana Velasco de Castro Oliveira ◽

Paolo Marinho de Andrade Zanotto

Keyword(s):

Spodoptera Frugiperda ◽

Point Mutations ◽

Open Reading Frames ◽

Temporal Organization ◽

Agrotis Segetum ◽

Gene Gain ◽

Physical Maps ◽

Polyadenylation Sites ◽

Group Ii ◽

Reading Frames

The genome of the most virulent among 22 Brazilian geographical isolates of Spodoptera frugiperda nucleopolyhedrovirus, isolate 19 (SfMNPV-19), was completely sequenced and shown to comprise 132 565 bp and 141 open reading frames (ORFs). A total of 11 ORFs with no homology to genes in the GenBank database were found. Of those, four had typical baculovirus promoter motifs and polyadenylation sites. Computer-simulated restriction enzyme cleavage patterns of SfMNPV-19 were compared with published physical maps of other SfMNPV isolates. Differences were observed in terms of the restriction profiles and genome size. Comparison of SfMNPV-19 with the sequence of the SfMNPV isolate 3AP2 indicated that they differed due to a 1427 bp deletion, as well as by a series of smaller deletions and point mutations. The majority of genes of SfMNPV-19 were conserved in the closely related Spodoptera exigua NPV (SeMNPV) and Agrotis segetum NPV (AgseMNPV-A), but a few regions experienced major changes and rearrangements. Syntenic maps for the genomes of group II NPVs revealed that gene collinearity was observed only within certain clusters. Analysis of the dynamics of gene gain and loss along the phylogenetic tree of the NPVs showed that group II had only five defining genes and supported the hypothesis that these viruses form ten highly divergent ancient lineages. Crucially, more than 60 % of the gene gain events followed a power-law relation to genetic distance among baculoviruses, indicative of temporal organization in the gene accretion process.
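
The computer-simulated restriction cleavage mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' software: the function name `circular_digest`, the toy sequences, and the choice to record each cut at the first base of the recognition site are all assumptions made for illustration. The sketch treats the genome as a circular molecule, so a recognition site that spans the origin is still found, and a constant cut offset within the site would not change the fragment lengths.

```python
def circular_digest(genome: str, site: str) -> list[int]:
    """Fragment lengths from cutting a circular sequence at every
    occurrence of `site` (hypothetical helper, for illustration only)."""
    n = len(genome)
    # Append the first len(site)-1 bases so sites spanning the origin match.
    extended = genome + genome[: len(site) - 1]
    cuts = [i for i in range(n) if extended.startswith(site, i)]
    if not cuts:
        return [n]  # no site: the circle stays intact
    fragments = []
    for j, start in enumerate(cuts):
        nxt = cuts[(j + 1) % len(cuts)]
        # Distance to the next cut going around the circle; a single cut
        # linearises the molecule into one full-length fragment.
        fragments.append((nxt - start) % n or n)
    return sorted(fragments, reverse=True)


if __name__ == "__main__":
    # Toy 18 bp "genome" with two GAATTC (EcoRI) sites.
    print(circular_digest("GAATTCAAAAGAATTCCC", "GAATTC"))
```

Comparing the sorted fragment lengths of two isolates is one simple way to spot differences in restriction profile or genome size, which is the kind of comparison the abstract describes.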

ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs

10.1101/2020.01.13.905240 ◽

2020 ◽

Author(s):

Lauren Coombe ◽

Vladimir Nikolić ◽

Justin Chu ◽

Inanc Birol ◽

René L. Warren

Keyword(s):

Reference Sequence ◽

Biological Research ◽

Closely Related Species ◽

Draft Assembly ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

Genome Assemblies ◽

High Quality Genome ◽

Reference Human Genome

Abstract. Summary: The ability to generate high-quality genome sequences is a cornerstone of modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade quality. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short-read assembly with a draft long-read assembly, and a draft assembly with an assembly from a closely related species. When scaffolding a human short-read assembly using the reference human genome or a long-read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 min, using less than 11 GB of RAM. Compared to existing reference-guided assemblers, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation: ntJoin is written in C++ and Python, and is freely available at https://github.com/bcgsc/[email protected]
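
To make the idea of an ordered minimizer sketch concrete, here is a minimal Python sketch. It is not ntJoin's implementation: the names `kmer_hash`, `ordered_minimizers`, and `shared_minimizer_order`, the toy polynomial hash, and the parameters `k` and `w` are assumptions for illustration only. For every window of `w` consecutive k-mers, the k-mer with the smallest hash is kept, and the selected k-mers are reported in sequence order. Minimizers that occur exactly once in both sketches can then serve as anchors linking a draft sequence to a reference without alignment.

```python
from collections import Counter


def kmer_hash(kmer: str) -> int:
    """Toy deterministic hash (an assumption; real tools use faster hashes)."""
    h = 0
    for ch in kmer:
        h = (h * 131 + ord(ch)) % (2**61 - 1)
    return h


def ordered_minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) minimizers in left-to-right order."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    sketch: list[tuple[int, str]] = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # Smallest hash wins; ties go to the leftmost k-mer in the window.
        offset = min(range(w), key=lambda j: (window[j], j))
        pos = start + offset
        if not sketch or sketch[-1][0] != pos:
            sketch.append((pos, seq[pos:pos + k]))
    return sketch


def shared_minimizer_order(draft: str, ref: str, k: int, w: int) -> list[tuple[int, int]]:
    """Pair (draft position, reference position) for minimizers that are
    unique within each sketch and present in both."""
    d_sketch = ordered_minimizers(draft, k, w)
    r_sketch = ordered_minimizers(ref, k, w)
    d_count = Counter(km for _, km in d_sketch)
    r_count = Counter(km for _, km in r_sketch)
    r_pos = {km: p for p, km in r_sketch if r_count[km] == 1}
    return [(p, r_pos[km]) for p, km in d_sketch
            if d_count[km] == 1 and km in r_pos]


if __name__ == "__main__":
    seq = "ACGTTGCATGCCGATAGCTAGCTAGGATCC"
    print(ordered_minimizers(seq, k=5, w=4))
```

In this toy version, the order of the shared anchors along the reference suggests how draft segments could be ordered and oriented; building and traversing a graph over those anchors, as the abstract describes, is beyond the scope of the sketch.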
