Precise Transcript Reconstruction with End-Guided Assembly

2022 ◽  
Author(s):  
Michael A Schon ◽  
Stefan Lutzmayer ◽  
Falko Hofmann ◽  
Michael D Nodine

Accurate annotation of transcript isoforms is crucial for functional genomics research, but automated methods for reconstructing full-length transcripts from RNA sequencing (RNA-seq) data are imprecise. We developed a generalized transcript assembly framework called Bookend that incorporates data from multiple modes of RNA-seq, with a focus on identifying, labeling, and deconvoluting RNA 5′ and 3′ ends. Through end-guided assembly with Bookend we demonstrate that correctly modeling transcript start and end sites is essential for precise transcript assembly. Furthermore, we discover that reads from full-length single-cell RNA-seq (scRNA-seq) methods are sparsely end-labeled, and that these ends are sufficient to dramatically improve precision of assembly in single cells. Finally, we show that hybrid assembly across short-read, long-read, and end-capture RNA-seq in the model plant Arabidopsis and meta-assembly of single mouse embryonic stem cells (mESCs) are both capable of producing tissue-specific end-to-end transcript annotations of comparable or superior quality to existing reference isoforms.

2019 ◽  
Vol 20 (24) ◽  
pp. 6350 ◽  
Author(s):  
Nan Deng ◽  
Chen Hou ◽  
Fengfeng Ma ◽  
Caixia Liu ◽  
Yuxin Tian

The limitations of RNA sequencing make it difficult to accurately predict alternative splicing (AS) and alternative polyadenylation (APA) events and long non-coding RNAs (lncRNAs), all of which reveal transcriptomic diversity and the complexity of gene regulation. Gnetum, a genus with ambiguous phylogenetic placement in seed plants, has a distinct stomatal structure and photosynthetic characteristics. In this study, a full-length transcriptome of Gnetum luofuense leaves at different developmental stages was sequenced with the latest PacBio Sequel platform. After correction by short reads generated by Illumina RNA-Seq, 80,496 full-length transcripts were obtained, of which 5269 reads were identified as isoforms of novel genes. Additionally, 1660 lncRNAs and 12,998 AS events were detected. In total, 5647 genes in the G. luofuense leaves had APA, characterized by at least one poly(A) site. Moreover, 67 and 30 genes from the bHLH gene family, which play an important role in stomatal development and photosynthesis, were identified from the G. luofuense genome and leaf transcripts, respectively. This leaf transcriptome supplements the reference genome of G. luofuense, and the AS events and lncRNAs detected provide valuable resources for future studies investigating the low photosynthetic capacity of Gnetum.


2020 ◽  
Author(s):  
Alina Isakova ◽  
Norma Neff ◽  
Stephen R. Quake

The ability to interrogate total RNA content of single cells would enable better mapping of the transcriptional logic behind emerging cell types and states. However, current RNA-seq methods are unable to simultaneously monitor both short and long, poly(A)+ and poly(A)− transcripts at the single-cell level, and thus deliver only a partial snapshot of the cellular RNAome. Here, we describe Smart-seq-total, a method capable of assaying a broad spectrum of coding and non-coding RNA from a single cell. Built upon the template-switch mechanism, Smart-seq-total bears the key feature of its predecessor, Smart-seq2, namely, the ability to capture full-length transcripts with high yield and quality. It also outperforms current poly(A)-independent total RNA-seq protocols by capturing transcripts of a broad size range, thus allowing us to simultaneously analyze protein-coding, long non-coding, microRNA and other non-coding RNA transcripts from single cells. We used Smart-seq-total to analyze the total RNAome of human primary fibroblasts, HEK293T and MCF7 cells as well as that of induced murine embryonic stem cells differentiated into embryoid bodies. We show that simultaneous measurement of non-coding RNA and mRNA from the same cell enables elucidation of new roles of non-coding RNA throughout essential processes such as cell cycle or lineage commitment. Moreover, we show that cell types can be distinguished based on the abundance of non-coding transcripts alone.


2021 ◽  
Vol 118 (51) ◽  
pp. e2113568118
Author(s):  
Alina Isakova ◽  
Norma Neff ◽  
Stephen R. Quake

The ability to interrogate total RNA content of single cells would enable better mapping of the transcriptional logic behind emerging cell types and states. However, current single-cell RNA-sequencing (RNA-seq) methods are unable to simultaneously monitor all forms of RNA transcripts at the single-cell level, and thus deliver only a partial snapshot of the cellular RNAome. Here we describe Smart-seq-total, a method capable of assaying a broad spectrum of coding and noncoding RNA from a single cell. Smart-seq-total does not require splitting the RNA content of a cell and allows the incorporation of unique molecular identifiers into short and long RNA molecules for absolute quantification. It outperforms current poly(A)-independent total RNA-seq protocols by capturing transcripts of a broad size range, thus enabling simultaneous analysis of protein-coding, long-noncoding, microRNA, and other noncoding RNA transcripts from single cells. We used Smart-seq-total to analyze the total RNAome of human primary fibroblasts, HEK293T, and MCF7 cells, as well as that of induced murine embryonic stem cells differentiated into embryoid bodies. By analyzing the coexpression patterns of both noncoding RNA and mRNA from the same cell, we were able to discover new roles of noncoding RNA throughout essential processes, such as cell cycle and lineage commitment during embryonic development. Moreover, we show that independent classes of short-noncoding RNA can be used to determine cell-type identity.


2020 ◽  
Author(s):  
Luyi Tian ◽  
Jafar S. Jabbari ◽  
Rachel Thijssen ◽  
Quentin Gouil ◽  
Shanika L. Amarasinghe ◽  
...  

Alternative splicing shapes the phenotype of cells in development and disease. Long-read RNA-sequencing recovers full-length transcripts but has limited throughput at the single-cell level. Here we developed single-cell full-length transcript sequencing by sampling (FLT-seq), together with the computational pipeline FLAMES, to overcome these issues and perform isoform discovery and quantification, splicing analysis and mutation detection in single cells. With FLT-seq and FLAMES, we performed the first comprehensive characterization of the full-length isoform landscape in single cells of different types and species and identified thousands of unannotated isoforms. We found conserved functional modules that were enriched for alternative transcript usage in different cell populations, including ribosome biogenesis and mRNA splicing. Analysis at the transcript level allowed data integration with scATAC-seq on individual promoters, improved correlation with protein expression data and linked mutations known to confer drug resistance to transcriptome heterogeneity. Our methods reveal previously unseen isoform complexity and provide a better framework for multi-omics data integration.


2016 ◽  
Author(s):  
Ruolin Liu ◽  
Julie Dickerson

We propose a novel method and computational tool, Strawberry, for transcript reconstruction and quantification from paired-end RNA-seq data under the guidance of genome alignment and independent of gene annotation. Strawberry achieves this through disentangling assembly and quantification in a sequential manner. The application of a fast flow network algorithm for assembly speeds up the construction of a parsimonious set of transcripts. The resulting reduced data representation improves the efficiency of expression-level quantification. Strawberry leverages the speed and accuracy of transcript assembly and quantification in such a way that processing 10 million simulated reads (after alignment) requires only 90 seconds using a single thread while achieving over 92% correlation with the ground truth, making it the state-of-the-art method. Strawberry outperforms Cufflinks and StringTie, the two other leading methods, in many aspects, including the number of corrected assembled transcripts and the correlation with the ground truth of simulated RNA-seq data. Availability: Strawberry is written in C++11, and is available as open source software at https://github.com/ruolin/Strawberry under the GPLv3 license.


Author(s):  
Akihito Otsuki ◽  
Yasunobu Okamura ◽  
Yuichi Aoki ◽  
Noriko Ishida ◽  
Kazuki Kumada ◽  
...  

Our body responds to environmental stress by changing the expression levels of a series of cytoprotective enzymes/proteins through multilayered regulatory mechanisms, including the KEAP1-NRF2 system. While NRF2 upregulates the expression of many cytoprotective genes, there are fundamental limitations in short-read RNA sequencing (RNA-Seq), which complicate interpretation of the effectiveness of cytoprotective gene induction at the transcript level. To precisely delineate isoform usage in the stress response, we conducted independent full-length transcriptome profiling (isoform sequencing; Iso-Seq) analyses of lymphoblastoid cells from three volunteers under normal and electrophilic stress-induced conditions. We first determined the first exon usage in KEAP1 and NFE2L2 (encoding NRF2) and found transcript diversity. We then examined changes in isoform usage of NRF2 target genes under stress conditions and identified a few isoforms dominantly expressed in the majority of NRF2 target genes. The expression levels of isoforms determined by Iso-Seq analyses showed striking differences from those determined by short-read RNA-Seq; the latter could be misleading with regard to the abundance of transcripts. These results support the view that transcript usage is tightly regulated to produce functional proteins under electrophilic stress. Our present study strongly argues that there are important benefits that can be achieved by long-read transcriptome sequencing.


2021 ◽  
Vol 12 ◽  
Author(s):  
Michelle M. Halstead ◽  
Alma Islas-Trejo ◽  
Daniel E. Goszczynski ◽  
Juan F. Medrano ◽  
Huaijun Zhou ◽  
...  

A comprehensive annotation of transcript isoforms in domesticated species is lacking. Especially considering that transcriptome complexity and splicing patterns are not well-conserved between species, this presents a substantial obstacle to genomic selection programs that seek to improve production, disease resistance, and reproduction. Recent advances in long-read sequencing technology have made it possible to directly extrapolate the structure of full-length transcripts without the need for transcript reconstruction. In this study, we demonstrate the power of long-read sequencing for transcriptome annotation by coupling Oxford Nanopore Technology (ONT) with large-scale multiplexing of 93 samples, comprising 32 tissues collected from adult male and female Hereford cattle. More than 30 million uniquely mapping full-length reads were obtained from a single ONT flow cell, and used to identify and characterize the expression dynamics of 99,044 transcript isoforms at 31,824 loci. Of these predicted transcripts, 21% exactly matched a reference transcript, and 61% were novel isoforms of reference genes, substantially increasing the ratio of transcript variants per gene, and suggesting that the complexity of the bovine transcriptome is comparable to that in humans. Over 7,000 transcript isoforms were extremely tissue-specific, and 61% of these were attributed to testis, which exhibited the most complex transcriptome of all interrogated tissues. Despite profiling over 30 tissues, transcription was only detected at about 60% of reference loci. Consequently, additional studies will be necessary to continue characterizing the bovine transcriptome in additional cell types, developmental stages, and physiological conditions. However, by here demonstrating the power of ONT sequencing coupled with large-scale multiplexing, the task of exhaustively annotating the bovine transcriptome – or any mammalian transcriptome – appears significantly more feasible.


2018 ◽  
Author(s):  
Ishaan Gupta ◽  
Paul G Collier ◽  
Bettina Haase ◽  
Ahmed Mahfouz ◽  
Anoushka Joglekar ◽  
...  

Full-length isoform sequencing has advanced our knowledge of isoform biology [1–11]. However, apart from applying full-length isoform sequencing to very few single cells [12,13], isoform sequencing has been limited to bulk tissue, cell lines, or sorted cells. Single splicing events have been described for ≤200 single cells with great statistical success [14,15], but these methods do not describe full-length mRNAs. Single cell short-read 3’ sequencing has allowed identification of many cell sub-types [16–23], but full-length isoforms for these cell types have not been profiled. Using our new method of single-cell-isoform-RNA-sequencing (ScISOr-Seq) we determine isoform-expression in thousands of individual cells from a heterogeneous bulk tissue (cerebellum), without specific antibody-fluorescence activated cell sorting. We elucidate isoform usage in high-level cell types such as neurons, astrocytes and microglia and finer sub-types, such as Purkinje cells and Granule cells, including the combination patterns of distant splice sites [6–9,24,25], which for individual molecules requires long reads. We produce an enhanced genome annotation revealing cell-type specific expression of known and 16,872 novel (with respect to mouse Gencode version 10) isoforms (see isoformatlas.com).

ScISOr-Seq describes isoforms from >1,000 single cells from bulk tissue without cell sorting by leveraging two technologies in three steps: In step one, we employ microfluidics to produce amplified full-length cDNAs barcoded for their cell of origin. This cDNA is split into two pools: one pool for 3’ sequencing to measure gene expression (step 2) and another pool for long-read sequencing and isoform expression (step 3). In step two, short-read 3’-sequencing provides molecular counts for each gene and cell, which allows clustering cells and assigning a cell type using cell-type specific markers. In step three, an aliquot of the same cDNAs (each barcoded for the individual cell of origin) is sequenced using Pacific Biosciences (“PacBio”) [1,2,4,5,26] or Oxford Nanopore [3]. Since these long reads carry the single-cell barcodes identified in step two, one can determine the individual cell from which each long read originates. Since most single cells are assigned to a named cluster, we can also assign the cell’s cluster name (e.g. “Purkinje cell” or “astrocyte”) to the long read in question (Fig 1A), without losing the cell of origin of each long read.
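
The barcode lookup described in step three can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the barcodes, read IDs, cluster labels, and the helper `annotate_long_reads` are all hypothetical, and the sketch assumes barcodes have already been extracted from the long reads and match exactly.

```python
from collections import Counter

# Hypothetical result of step two: cell barcode -> cluster name
# (derived from short-read 3' clustering; values are made up).
barcode_to_cluster = {
    "AAACCTGA": "Purkinje cell",
    "TTTGGTCA": "astrocyte",
}

# Hypothetical long reads from step three: (read ID, cell barcode in the read).
long_reads = [
    ("read_1", "AAACCTGA"),
    ("read_2", "TTTGGTCA"),
    ("read_3", "AAACCTGA"),
    ("read_4", "GGGGCCCC"),  # barcode not assigned to any cluster in step two
]


def annotate_long_reads(reads, barcode_to_cluster):
    """Attach a cluster name to each long read while keeping its cell barcode."""
    annotated = []
    for read_id, barcode in reads:
        # Reads whose barcode has no cluster keep cluster=None.
        cluster = barcode_to_cluster.get(barcode)
        annotated.append({"read": read_id, "cell": barcode, "cluster": cluster})
    return annotated


annotated = annotate_long_reads(long_reads, barcode_to_cluster)

# Count long reads per named cluster, ignoring unassigned reads.
reads_per_cluster = Counter(
    a["cluster"] for a in annotated if a["cluster"] is not None
)
```

Each annotated read keeps both its cell barcode and its cluster name, so per-cell and per-cell-type isoform summaries can be built from the same records. A real workflow would also need to tolerate sequencing errors in the barcodes, which this exact-match lookup does not attempt.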


2019 ◽  
Author(s):  
Youjin Hu ◽  
Jiawei Zhong ◽  
Yuhua Xiao ◽  
Zheng Xing ◽  
Katherine Sheu ◽  
...  

The differences in transcription start sites (TSS) and transcription end sites (TES) among gene isoforms can affect the stability, localization, and translation efficiency of mRNA. Isoforms also allow a single gene to perform different functions across various tissues and cells. However, methods for efficient genome-wide identification and quantification of RNA isoforms in single cells are still lacking. Here, we introduce single cell Cap And Tail sequencing (scCAT-seq). In conjunction with a novel machine learning algorithm developed for TSS/TES characterization, scCAT-seq can demarcate the boundaries of RNA transcripts, providing an unprecedented way to identify and quantify single-cell full-length RNA isoforms based on short-read sequencing. Compared with existing long-read sequencing methods, scCAT-seq has higher efficiency with lower cost. Using scCAT-seq, we identified hundreds of previously uncharacterized full-length transcripts and thousands of alternative transcripts for known genes, quantitatively revealed cell-type specific isoforms with alternative TSSs/TESs in dorsal root ganglion (DRG) neurons, mature oocytes and ageing oocytes, and generated the first atlas of the non-human primate cornea. The approach described here can be widely adapted to other short-read or long-read methods to improve accuracy and efficiency in assessing RNA isoform dynamics among single cells.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Ruijiao Xin ◽  
Yan Gao ◽  
Yuan Gao ◽  
Robert Wang ◽  
Kathryn E. Kadash-Edmondson ◽  
...  

Circular RNAs (circRNAs) have emerged as an important class of functional RNA molecules. Short-read RNA sequencing (RNA-seq) is a widely used strategy to identify circRNAs. However, an inherent limitation of short-read RNA-seq is that it does not experimentally determine the full-length sequences and exact exonic compositions of circRNAs. Here, we report isoCirc, a strategy for sequencing full-length circRNA isoforms, using rolling circle amplification followed by nanopore long-read sequencing. We describe an integrated computational pipeline to reliably characterize full-length circRNA isoforms using isoCirc data. Using isoCirc, we generate a comprehensive catalog of 107,147 full-length circRNA isoforms across 12 human tissues and one human cell line (HEK293), including 40,628 isoforms ≥500 nt in length. We identify widespread alternative splicing events within the internal part of circRNAs, including 720 retained intron events corresponding to a class of exon-intron circRNAs (EIciRNAs). Collectively, isoCirc and the companion dataset provide a useful strategy and resource for studying circRNAs in human transcriptomes.

