Benchmarking sequencing methods and tools that facilitate the study of alternative polyadenylation

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ankeeta Shah ◽  
Briana E. Mittleman ◽  
Yoav Gilad ◽  
Yang I. Li

Abstract. Background: Alternative cleavage and polyadenylation (APA), an RNA processing event, occurs in over 70% of human protein-coding genes. APA results in mRNA transcripts with distinct 3′ ends. Most APA occurs within 3′ UTRs, which harbor regulatory elements that can impact mRNA stability, translation, and localization. Results: APA can be profiled using a number of established computational tools that infer polyadenylation sites from standard, short-read RNA-seq datasets. Here, we benchmarked a number of such tools (TAPAS, QAPA, DaPars2, GETUTR, and APATrap) against 3′-Seq, a specialized RNA-seq protocol that enriches for reads at the 3′ ends of genes, and Iso-Seq, a Pacific Biosciences (PacBio) single-molecule full-length RNA-seq method, in their ability to identify polyadenylation sites and quantify polyadenylation site usage. We demonstrate that 3′-Seq and Iso-Seq are able to identify and quantify the usage of polyadenylation sites more reliably than computational tools that take short-read RNA-seq as input. However, we find that running one such tool, QAPA, with a set of polyadenylation site annotations derived from small quantities of 3′-Seq or Iso-Seq can reliably quantify variation in APA across conditions, such as across genotypes, as demonstrated by the successful mapping of alternative polyadenylation quantitative trait loci (apaQTL). Conclusions: We envisage that our analyses will shed light on the advantages of studying APA with more specialized sequencing protocols, such as 3′-Seq or Iso-Seq, and the limitations of studying APA with short-read RNA-seq. We provide a computational pipeline to aid in the identification of polyadenylation sites and quantification of polyadenylation site usage using Iso-Seq data as input.
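
As a rough illustration of what "polyadenylation site usage" refers to in the abstract above, usage is commonly expressed as the fraction of a gene's 3′-end reads that fall at each of its polyadenylation sites. The sketch below assumes that definition and uses made-up read counts; the function name `site_usage` and the gene and site labels are hypothetical, and this is not the authors' pipeline.

```python
from collections import defaultdict


def site_usage(counts):
    """Return per-site usage fractions within each gene.

    counts: dict mapping (gene, site) -> read count (hypothetical input format).
    Result: dict mapping (gene, site) -> fraction of that gene's reads at the site.
    """
    # Sum the reads over all sites belonging to the same gene.
    totals = defaultdict(int)
    for (gene, _site), n in counts.items():
        totals[gene] += n

    # Divide each site's count by its gene total; genes with zero reads get 0.0.
    return {
        (gene, site): (n / totals[gene] if totals[gene] else 0.0)
        for (gene, site), n in counts.items()
    }


# Illustrative counts only, not data from the study.
counts = {
    ("GENE_A", "proximal"): 30,
    ("GENE_A", "distal"): 70,
    ("GENE_B", "distal"): 12,
}
usage = site_usage(counts)
```

Comparing such fractions between conditions or genotypes is one simple way to describe a shift in APA; the tools named in the abstract each use their own, more elaborate models.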

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Ryan Lusk ◽  
Evan Stene ◽  
Farnoush Banaei-Kashani ◽  
Boris Tabakoff ◽  
Katerina Kechris ◽  
...  

Abstract. Annotation of polyadenylation sites from short-read RNA sequencing alone is a challenging computational task. Other algorithms rooted in DNA sequence predict potential polyadenylation sites; however, in vivo expression of a particular site varies based on a myriad of conditions. Here, we introduce aptardi (alternative polyadenylation transcriptome analysis from RNA-Seq data and DNA sequence information), which leverages both DNA sequence and RNA sequencing in a machine learning paradigm to predict expressed polyadenylation sites. Specifically, as input aptardi takes DNA nucleotide sequence, genome-aligned RNA-Seq data, and an initial transcriptome. The program evaluates these initial transcripts to identify expressed polyadenylation sites in the biological sample and refines transcript 3′-ends accordingly. The average precision of the aptardi model is twice that of a standard transcriptome assembler. In particular, the recall of the aptardi model (the proportion of true polyadenylation sites detected by the algorithm) is improved by over three-fold. Also, the model, trained using the Human Brain Reference RNA commercial standard, performs well when applied to RNA-sequencing samples from different tissues and different mammalian species. Finally, aptardi’s input is simple to compile and its output is easily amenable to downstream analyses such as quantitation and differential expression.
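
The recall defined in the abstract above (the proportion of true polyadenylation sites detected) can be sketched as follows, alongside plain precision. This is a minimal illustration under stated assumptions: sites are integer genomic coordinates on one strand of one chromosome, a prediction counts as a hit if it lies within a fixed `tolerance` of a true site (the 50-nt default is an arbitrary choice, not taken from the paper), and the function name is hypothetical. Note that the "average precision" reported in the abstract is a different, threshold-free metric that this sketch does not compute.

```python
def precision_recall(predicted, truth, tolerance=50):
    """Compute precision and recall for predicted polyadenylation sites.

    predicted, truth: lists of integer coordinates (assumed same chromosome and strand).
    tolerance: maximum distance in nucleotides for a match (assumed value).
    """
    # True sites that have at least one prediction within the tolerance window.
    detected_truth = [
        t for t in truth if any(abs(t - p) <= tolerance for p in predicted)
    ]
    # Predictions that lie within the tolerance window of at least one true site.
    correct_pred = [
        p for p in predicted if any(abs(p - t) <= tolerance for t in truth)
    ]
    precision = len(correct_pred) / len(predicted) if predicted else 0.0
    recall = len(detected_truth) / len(truth) if truth else 0.0
    return precision, recall


# Illustrative coordinates only.
precision, recall = precision_recall(
    predicted=[100, 500, 900],
    truth=[120, 880, 2000, 3000],
)
```

Here two of the three predictions fall near a true site and two of the four true sites are detected, so precision is 2/3 and recall is 1/2.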


2019 ◽  
Vol 20 (24) ◽  
pp. 6350 ◽  
Author(s):  
Nan Deng ◽  
Chen Hou ◽  
Fengfeng Ma ◽  
Caixia Liu ◽  
Yuxin Tian

The limitations of RNA sequencing make it difficult to accurately predict alternative splicing (AS) and alternative polyadenylation (APA) events and long non-coding RNAs (lncRNAs), all of which reveal transcriptomic diversity and the complexity of gene regulation. Gnetum, a genus with ambiguous phylogenetic placement in seed plants, has a distinct stomatal structure and photosynthetic characteristics. In this study, a full-length transcriptome of Gnetum luofuense leaves at different developmental stages was sequenced with the latest PacBio Sequel platform. After correction by short reads generated by Illumina RNA-Seq, 80,496 full-length transcripts were obtained, of which 5269 reads were identified as isoforms of novel genes. Additionally, 1660 lncRNAs and 12,998 AS events were detected. In total, 5647 genes in the G. luofuense leaves had APA featured by at least one poly(A) site. Moreover, 67 and 30 genes from the bHLH gene family, which play an important role in stomatal development and photosynthesis, were identified from the G. luofuense genome and leaf transcripts, respectively. This leaf transcriptome supplements the reference genome of G. luofuense, and the AS events and lncRNAs detected provide valuable resources for future studies investigating the low photosynthetic capacity of Gnetum.


2020 ◽  
Author(s):  
Naima Ahmed Fahmi ◽  
Jae-Woong Chang ◽  
Heba Nassereddeen ◽  
Khandakar Tanvir Ahmed ◽  
Deliang Fan ◽  
...  

Abstract. The eukaryotic genome is capable of producing multiple isoforms from a gene by alternative polyadenylation (APA) during pre-mRNA processing. APA in the 3’-untranslated region (3’-UTR) of mRNA produces transcripts with shorter 3’-UTR. Often, 3’-UTR serves as a binding platform for microRNAs and RNA-binding proteins, which affect the fate of the mRNA transcript. Thus, 3’-UTR APA provides a means to regulate gene expression at the post-transcriptional level and is known to promote translation. Current bioinformatics pipelines have limited capability in profiling 3’-UTR APA events due to incomplete annotations and a low-resolution analyzing power: widely available bioinformatics pipelines do not reference actionable polyadenylation (cleavage) sites but simulate 3’-UTR APA only using RNA-seq read coverage, causing false positive identifications. To overcome these limitations, we developed APA-Scan, a robust program that identifies 3’-UTR APA events and visualizes the RNA-seq short-read coverage with gene annotations. APA-Scan utilizes either predicted or experimentally validated actionable polyadenylation signals as a reference for polyadenylation sites and calculates the quantity of long and short 3’-UTR transcripts in the RNA-seq data. The performance of APA-Scan was validated by qPCR. Implementation: APA-Scan is implemented in Python. Source code and a comprehensive user’s manual are freely available at https://github.com/compbiolabucf/APA-Scan


2020 ◽  
Vol 36 (12) ◽  
pp. 3907-3909 ◽  
Author(s):  
Ruijia Wang ◽  
Bin Tian

Abstract. Summary: Most eukaryotic genes produce alternative polyadenylation (APA) isoforms. APA is dynamically regulated under different growth and differentiation conditions. Here, we present a bioinformatics package, named APAlyzer, for examining 3′UTR APA, intronic APA and gene expression changes using RNA-seq data and annotated polyadenylation sites in the PolyA_DB database. Using APAlyzer and data from the GTEx database, we present APA profiles across human tissues. Availability and implementation: APAlyzer is freely available at https://bioconductor.org/packages/release/bioc/html/APAlyzer.html as an R/Bioconductor package. Supplementary information: Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Sam Kovaka ◽  
Aleksey V. Zimin ◽  
Geo M. Pertea ◽  
Roham Razaghi ◽  
Steven L. Salzberg ◽  
...  

Abstract. RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.


2020 ◽  
Vol 48 (7) ◽  
pp. 3806-3815 ◽  
Author(s):  
Daniel del Valle Morales ◽  
Jackson B Trotman ◽  
Ralf Bundschuh ◽  
Daniel R Schoenberg

Abstract. Cap homeostasis is the cyclical process of decapping and recapping that maintains the translation and stability of a subset of the transcriptome. Previous work showed levels of some recapping targets decline following transient expression of an inactive form of RNMT (ΔN-RNMT), likely due to degradation of mRNAs with improperly methylated caps. The current study examined transcriptome-wide changes following inhibition of cytoplasmic cap methylation. This identified mRNAs with 5′-terminal oligopyrimidine (TOP) sequences as the largest single class of recapping targets. Cap end mapping of several TOP mRNAs identified recapping events at native 5′ ends and downstream of the TOP sequence of EIF3K and EIF3D. This provides the first direct evidence for downstream recapping. Inhibition of cytoplasmic cap methylation was also associated with mRNA abundance increases for a number of transcription, splicing, and 3′ processing factors. Previous work suggested a role for alternative polyadenylation in target selection, but this proved not to be the case. However, inhibition of cytoplasmic cap methylation resulted in a shift of upstream polyadenylation sites to annotated 3′ ends. Together, these results solidify cap homeostasis as a fundamental process of gene expression control and show cytoplasmic recapping can impact regulatory elements present at the ends of mRNA molecules.


2017 ◽  
Author(s):  
Michael K. K. Leung ◽  
Andrew Delong ◽  
Brendan J. Frey

Abstract. Processing of transcripts at the 3’-end involves cleavage at a polyadenylation site followed by the addition of a poly(A)-tail. By selecting which polyadenylation site is cleaved, alternative polyadenylation enables genes to produce transcript isoforms with different 3’-ends. To facilitate the identification and treatment of disease-causing mutations that affect polyadenylation and to understand the underlying regulatory processes, a computational model that can accurately predict polyadenylation patterns based on genomic features is desirable. Previous works have focused on identifying candidate polyadenylation sites and classifying sites which may be tissue-specific. What is lacking is a predictive model of the underlying mechanism of site selection, competition, and processing efficiency in a tissue-specific manner. We develop a deep learning model that trains on 3’-end sequencing data and predicts tissue-specific site selection among competing polyadenylation sites in the 3’ untranslated region of the human genome. Two neural network architectures are evaluated: one built on hand-engineered features, and another that directly learns from the genomic sequence. The hand-engineered features include polyadenylation signals, cis-regulatory elements, n-mer counts, nucleosome occupancy, and RNA-binding protein motifs. The direct-from-sequence model is inferred without prior knowledge on polyadenylation, based on a convolutional neural network trained with genomic sequences surrounding each polyadenylation site as input. Both models are trained using the TensorFlow library. The proposed polyadenylation code can predict site selection among competing polyadenylation sites in different tissues. Importantly, it does so without relying on evolutionary conservation. The model can distinguish pathogenic from benign variants that appear near annotated polyadenylation sites in ClinVar and inspect the genome to find candidate polyadenylation sites. We also provide an analysis on how different features affect the model’s performance.


2018 ◽  
Vol 34 (11) ◽  
pp. 1841-1849 ◽  
Author(s):  
Congting Ye ◽  
Yuqi Long ◽  
Guoli Ji ◽  
Qingshun Quinn Li ◽  
Xiaohui Wu

Author(s):  
Chaitanya Erady ◽  
Shraddha Puntambekar ◽  
Sudhakaran Prabakaran

Abstract. Identification of as-yet unannotated or undefined novel open reading frames (nORFs) and exploration of their functions in multiple organisms has revealed that vast regions of the genome have remained unexplored or ‘hidden’. Present within both protein-coding and noncoding regions, these nORFs signify the presence of a much more diverse proteome than previously expected. Given the need to study nORFs further, proper identification strategies must be in place, especially because they cannot be identified using conventional gene signatures. Although Ribo-Seq and proteogenomics are frequently used to identify and investigate nORFs, in this study, we propose a workflow for identifying nORF-containing transcripts using our precompiled database of nORFs with translational evidence, using sample transcript information. Further, we discuss the potential uses of this identification, the caveats involved in such a transcript identification and finally present a few representative results from our analysis of naive mouse B and T cells, human post-mortem brain and cichlid fish transcriptome. Our proposed workflow can identify noncoding transcripts that can potentially translate intronic, intergenic and several other classes of nORFs. One-line summary: A systematic workflow to identify nORF-containing transcripts using sample transcript information.


Viruses ◽  
2019 ◽  
Vol 11 (1) ◽  
pp. 33 ◽  
Author(s):  
Wei Zou ◽  
Min Xiong ◽  
Xuefeng Deng ◽  
John Engelhardt ◽  
Ziying Yan ◽  
...  

Human bocavirus 1 (HBoV1) infects well-differentiated (polarized) human airway epithelium (HAE) cultured at an air-liquid interface (ALI). In the present study, we applied next-generation RNA sequencing to investigate the genome-wide transcription profile of HBoV1, including viral mRNA and small RNA transcripts, in HBoV1-infected HAE cells. We identified novel transcription start and termination sites and confirmed the previously identified splicing events. Importantly, an additional proximal polyadenylation site (pA)p2 and a new distal polyadenylation site (pA)dREH lying on the right-hand hairpin (REH) of the HBoV1 genome were identified in processing viral pre-mRNA. Of note, all viral nonstructural proteins-encoding mRNA transcripts use both the proximal polyadenylation sites [(pA)p1 and (pA)p2] and distal polyadenylation sites [(pA)d1 and (pA)dREH] for termination. However, capsid proteins-encoding transcripts only use the distal polyadenylation sites. While the (pA)p1 and (pA)p2 sites were utilized at roughly equal efficiency for proximal polyadenylation of HBoV1 mRNA transcripts, the (pA)d1 site was more preferred for distal polyadenylation. Additionally, small RNA-seq analysis confirmed there is only one viral noncoding RNA (BocaSR) transcribed from nt 5199–5340 of the HBoV1 genome. Thus, our study provides a systematic and unbiased transcription profile, including both mRNA and small RNA transcripts, of HBoV1 in HBoV1-infected HAE-ALI cultures.

