scAPAtrap: identification and quantification of alternative polyadenylation sites from single-cell RNA-seq data

Author(s):  
Xiaohui Wu ◽  
Tao Liu ◽  
Congting Ye ◽  
Wenbin Ye ◽  
Guoli Ji

Abstract Alternative polyadenylation (APA) generates diverse mRNA isoforms, which contributes to transcriptome diversity and gene expression regulation by affecting mRNA stability, translation and localization in cells. The rapid development of 3′ tag-based single-cell RNA-sequencing (scRNA-seq) technologies, such as CEL-seq and 10x Genomics, has led to the emergence of computational methods for identifying APA sites and profiling APA dynamics at single-cell resolution. However, existing methods fail to detect the precise location of poly(A) sites or sites with low read coverage. Moreover, they rely on prior genome annotation and can only detect poly(A) sites located within or near annotated genes. Here we propose a tool called scAPAtrap for detecting poly(A) sites at the whole-genome level in individual cells from 3′ tag-based scRNA-seq data. scAPAtrap incorporates peak identification and poly(A) read anchoring, enabling the identification of the precise location of poly(A) sites, even for sites with low read coverage. Moreover, scAPAtrap can identify poly(A) sites without using prior genome annotation, which helps locate novel poly(A) sites in previously overlooked regions and improve genome annotation. We compared scAPAtrap with two recent methods, scAPA and Sierra, using scRNA-seq data from different experimental technologies and species. Results show that scAPAtrap identified poly(A) sites with higher accuracy and sensitivity than competing methods and could be used to explore APA dynamics among cell types or the heterogeneous APA isoform expression in individual cells. scAPAtrap is available at https://github.com/BMILAB/scAPAtrap.
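
The idea of combining peak identification with poly(A) read anchoring can be illustrated with a minimal Python sketch. This is not scAPAtrap's implementation: the function names `call_peaks` and `anchor_site`, the coverage threshold, the search window and the toy numbers are all hypothetical, and a forward-strand gene is assumed.

```python
from collections import Counter

def call_peaks(coverage, min_cov=3):
    """Return half-open (start, end) intervals where coverage >= min_cov."""
    peaks, start = [], None
    for pos, cov in enumerate(coverage):
        if cov >= min_cov and start is None:
            start = pos                      # a peak opens here
        elif cov < min_cov and start is not None:
            peaks.append((start, pos))       # the peak closes before pos
            start = None
    if start is not None:                    # peak runs to the end of the region
        peaks.append((start, len(coverage)))
    return peaks

def anchor_site(peak, polya_read_ends, window=5):
    """Refine a peak to a single position.

    Uses the most frequent end position of poly(A)-containing reads that
    fall inside the peak or within `window` bases downstream of it.
    Falls back to the last covered base of the peak if no such read exists.
    """
    start, end = peak
    nearby = [p for p in polya_read_ends if start <= p < end + window]
    if not nearby:
        return end - 1
    return Counter(nearby).most_common(1)[0][0]

# Toy per-base coverage (hypothetical) and end positions of reads
# that carry a poly(A) tail (hypothetical).
coverage = [0, 0, 1, 4, 6, 7, 5, 3, 2, 2, 0, 0, 3, 4, 3, 0]
polya_read_ends = [9, 9]

peaks = call_peaks(coverage)
sites = [anchor_site(p, polya_read_ends) for p in peaks]
print(peaks, sites)
```

In this toy case the first peak is anchored by the poly(A) reads to a position downstream of the peak body, while the second peak has no supporting poly(A) read and falls back to its 3′ edge.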

2021 ◽  
Author(s):  
Mervin M Fansler ◽  
Gang Zhen ◽  
Christine Mayr

Although half of human genes use alternative polyadenylation (APA) to generate mRNA isoforms that encode the same protein but differ in their 3′UTRs, most single cell RNA-sequencing (scRNA-seq) pipelines only measure gene expression. Here, we describe an open-access pipeline, called scUTRquant (https://github.com/Mayrlab/scUTRquant), that measures gene and 3′UTR isoform expression from scRNA-seq data obtained from known cell types in any species. scUTRquant-derived gene and 3′UTR transcript counts were validated against standard methods which demonstrated their accuracy. 3′UTR isoform quantification was substantially more reproducible than previous methods. scUTRquant provides an atlas of high-confidence 3′ end cleavage sites at single-nucleotide resolution to allow APA comparison across mouse datasets. Analysis of 120 mouse cell types revealed that during differentiation genes either change their expression or they change their 3′UTR isoform usage. Therefore, we identified thousands of genes with 3′UTR isoform changes that have previously not been implicated in specific biological processes.


2020 ◽  
Vol 29 (R1) ◽  
pp. R89-R99
Author(s):  
Deivid Carvalho Rodrigues ◽  
Marat Mufteev ◽  
James Ellis

Abstract The methyl-CpG-binding protein 2 (MECP2) is a critical global regulator of gene expression. Mutations in MECP2 cause neurodevelopmental disorders including Rett syndrome (RTT). MECP2 exon 2 is spliced into two alternative messenger ribonucleic acid (mRNA) isoforms encoding MECP2-E1 or MECP2-E2 protein isoforms that differ in their N-termini. MECP2-E2, isolated first, was used to define the general roles of MECP2 in methyl-deoxyribonucleic acid (DNA) binding, targeting of transcriptional regulatory complexes, and its disease-causing impact in RTT. It was later found that MECP2-E1 is the most abundant isoform in the brain and its exon 1 is also mutated in RTT. MECP2 transcripts undergo alternative polyadenylation generating mRNAs with four possible 3′untranslated region (UTR) lengths ranging from 130 to 8600 nt. Together, the exon and 3′UTR isoforms display remarkable abundance disparity across cell types and tissues during development. These findings indicate discrete means of regulation and suggest that protein isoforms perform non-overlapping roles. Multiple regulatory programs have been explored to explain these disparities. DNA methylation patterns of the MECP2 promoter and first intron impact MECP2-E1 and E2 isoform levels. Networks of microRNAs and RNA-binding proteins also post-transcriptionally regulate the stability and translation efficiency of MECP2 3′UTR isoforms. Finally, distinctions in biophysical properties in the N-termini between MECP2-E1 and E2 lead to variable protein stabilities and DNA binding dynamics. This review describes the steps taken from the discovery of MECP2, the description of its key functions, and its association with RTT, to the emergence of evidence revealing how MECP2 isoforms are differentially regulated at the transcriptional, post-transcriptional and post-translational levels.


2019 ◽  
Vol 47 (19) ◽  
pp. 10027-10039 ◽  
Author(s):  
Eldad David Shulman ◽  
Ran Elkon

Abstract Alternative polyadenylation (APA) is emerging as an important layer of gene regulation because the majority of mammalian protein-coding genes contain multiple polyadenylation (pA) sites in their 3′ UTR. By alteration of 3′ UTR length, APA can considerably affect post-transcriptional gene regulation. Yet, our understanding of APA remains rudimentary. Novel single-cell RNA sequencing (scRNA-seq) techniques allow molecular characterization of different cell types to an unprecedented degree. Notably, the most popular scRNA-seq protocols specifically sequence the 3′ end of transcripts. Building on this property, we implemented a method for analysing patterns of APA regulation from such data. Analyzing multiple datasets from diverse tissues, we identified widespread modulation of APA in different cell types resulting in global 3′ UTR shortening/lengthening and enhanced cleavage at intronic pA sites. Our results provide a proof-of-concept demonstration that the huge volume of scRNA-seq data that accumulates in the public domain offers a unique resource for the exploration of APA based on a very broad collection of cell types and biological conditions.
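
As a rough illustration of how 3′ UTR shortening or lengthening between cell types can be quantified from 3′ end read counts, the Python sketch below computes the fraction of reads assigned to the proximal pA site of one gene. It is not the method implemented by the authors: the function name `proximal_usage`, the cell-type labels and the counts are hypothetical, and a gene with exactly two pA sites is assumed.

```python
def proximal_usage(counts):
    """Fraction of reads at the proximal pA site, per cell type.

    counts: {cell_type: (proximal_reads, distal_reads)} for a single gene.
    Cell types with no reads at either site get None.
    """
    usage = {}
    for cell_type, (prox, dist) in counts.items():
        total = prox + dist
        usage[cell_type] = prox / total if total else None
    return usage

# Hypothetical pooled read counts for one gene in two cell types.
counts = {"type_A": (20, 80), "type_B": (90, 10)}
usage = proximal_usage(counts)

# A positive shift means relatively more proximal-site usage
# (a shorter 3' UTR on average) in type_B than in type_A.
shift = usage["type_B"] - usage["type_A"]
print(usage, shift)
```

Pooling reads per cell type before taking the ratio is one simple way to cope with the sparsity of single-cell counts; a real analysis would also need a statistical test across genes and cell types.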


2021 ◽  
Author(s):  
Sheng Zhu ◽  
Qiwei Lian ◽  
Wenbin Ye ◽  
Wei Qin ◽  
Zhe Wu ◽  
...  

Abstract Alternative polyadenylation (APA) is a widespread regulatory mechanism of transcript diversification in eukaryotes, which is increasingly recognized as an important layer of eukaryotic gene expression. Recent studies based on single-cell RNA-seq (scRNA-seq) have revealed cell-to-cell heterogeneity in APA usage and APA dynamics across different cell types in various tissues, biological processes and diseases. However, currently available APA databases were all collected from bulk 3′-seq and/or RNA-seq data, and no existing database has provided APA information at single-cell resolution. Here, we present a user-friendly database called scAPAdb (http://www.bmibig.cn/scAPAdb), which provides a comprehensive and manually curated atlas of poly(A) sites, APA events and poly(A) signals at the single-cell level. Currently, scAPAdb collects APA information from > 360 scRNA-seq experiments, covering six species including human, mouse and several plant species. scAPAdb also provides batch download of data, and users can query the database through a variety of keywords such as gene identifier, gene function and accession number. scAPAdb would be a valuable and extendable resource for the study of cell-to-cell heterogeneity in APA isoform usage and APA-mediated gene regulation at the single-cell level across diverse cell types, tissues and species.


2019 ◽  
Author(s):  
Xiaoyang Chen ◽  
Shengquan Chen ◽  
Rui Jiang

Abstract Background In recent years, the rapid development of single-cell RNA-sequencing (scRNA-seq) techniques enables the quantitative characterization of cell types at a single-cell resolution. With the explosive growth of the number of cells profiled in individual scRNA-seq experiments, there is a demand for novel computational methods for classifying newly-generated scRNA-seq data onto annotated labels. Although several methods have recently been proposed for the cell-type classification of single-cell transcriptomic data, such limitations as inadequate accuracy, inferior robustness, and low stability greatly limit their wide applications. Results We propose a novel ensemble approach, named EnClaSC, for accurate and robust cell-type classification of single-cell transcriptomic data. Through comprehensive validation experiments, we demonstrate that EnClaSC can not only be applied to the self-projection within a specific dataset and the cell-type classification across different datasets, but also scale up well to various data dimensionality and different data sparsity. We further illustrate the ability of EnClaSC to effectively make cross-species classification, which may shed light on the studies in correlation of different species. EnClaSC is freely available at https://github.com/xy-chen16/EnClaSC. Conclusions EnClaSC enables highly accurate and robust cell-type classification of single-cell transcriptomic data via an ensemble learning method. We expect to see wide applications of our method to not only transcriptome studies, but also the classification of more general data.


2019 ◽  
Vol 23 (5) ◽  
pp. 508-518
Author(s):  
E. A. Vodiasova ◽  
E. S. Chelebieva ◽  
O. N. Kuleshova

A wealth of genome and transcriptome data obtained using next-generation sequencing (NGS) technologies for whole organisms could not answer many questions in oncology, immunology, physiology, neurobiology, zoology and other fields of science and medicine. Since the cell is the basic unit of life in all unicellular and multicellular organisms, biological processes need to be studied at the cellular level. This understanding gave impetus to the development of a new direction – the creation of technologies that allow working with individual cells (single-cell technology). The rapid development of not only instruments, but also various advanced protocols for working with single cells is due to the relevance of these studies in many fields of science and medicine. Single-cell technologies allow comprehensive research to be conducted at a new level: studying the features of various stages of ontogenesis, identifying patterns of cell differentiation and subsequent tissue development, conducting genomic and transcriptome analyses in various areas of medicine (especially in demand in immunology and oncology), and identifying cell types and states as well as patterns of biochemical and physiological processes. The first RNA-sequencing technologies for individual cell transcriptomes (scRNA-seq) captured no more than one hundred cells at a time, which proved insufficient given the high cell heterogeneity, the existence of minor cell types (which were not detected by morphology) and complex regulatory pathways. Unique techniques for isolating, capturing and sequencing the transcripts of tens of thousands of cells at a time are now evolving. However, the new technologies differ both at the sample preparation stage and during the bioinformatics analysis.
In this paper we consider the most effective methods of massively parallel scRNA-seq, using 10x Genomics as an example, as well as the specifics of such an experiment, the subsequent bioinformatics analysis of the data, and the future outlook and applications of new high-throughput technologies.


Author(s):  
Bin Yu ◽  
Chen Chen ◽  
Ren Qi ◽  
Ruiqing Zheng ◽  
Patrick J Skillman-Lawrence ◽  
...  

Abstract The rapid development of single-cell RNA sequencing (scRNA-Seq) technology provides strong technical support for the accurate and efficient analysis of single-cell gene expression data. However, the analysis of scRNA-Seq data is accompanied by many obstacles, including dropout events and the curse of dimensionality. Here, we propose scGMAI, a new single-cell Gaussian mixture clustering method based on autoencoder networks and fast independent component analysis (FastICA). Specifically, scGMAI utilizes autoencoder networks to reconstruct gene expression values from scRNA-Seq data, and FastICA is used to reduce the dimensions of the reconstructed data. The integration of these computational techniques in scGMAI leads to results that outperform existing tools, including Seurat, in clustering cells from 17 public scRNA-Seq datasets. In summary, scGMAI is an effective tool for accurately clustering and identifying cell types from scRNA-Seq data and shows great potential for application in scRNA-Seq data analysis. The source code is available at https://github.com/QUST-AIBBDRC/scGMAI/.


2020 ◽  
Vol 21 (S13) ◽  
Author(s):  
Xiaoyang Chen ◽  
Shengquan Chen ◽  
Rui Jiang

Abstract Background In recent years, the rapid development of single-cell RNA-sequencing (scRNA-seq) techniques enables the quantitative characterization of cell types at a single-cell resolution. With the explosive growth of the number of cells profiled in individual scRNA-seq experiments, there is a demand for novel computational methods for classifying newly-generated scRNA-seq data onto annotated labels. Although several methods have recently been proposed for the cell-type classification of single-cell transcriptomic data, such limitations as inadequate accuracy, inferior robustness, and low stability greatly limit their wide applications. Results We propose a novel ensemble approach, named EnClaSC, for accurate and robust cell-type classification of single-cell transcriptomic data. Through comprehensive validation experiments, we demonstrate that EnClaSC can not only be applied to the self-projection within a specific dataset and the cell-type classification across different datasets, but also scale up well to various data dimensionality and different data sparsity. We further illustrate the ability of EnClaSC to effectively make cross-species classification, which may shed light on the studies in correlation of different species. EnClaSC is freely available at https://github.com/xy-chen16/EnClaSC. Conclusions EnClaSC enables highly accurate and robust cell-type classification of single-cell transcriptomic data via an ensemble learning method. We expect to see wide applications of our method to not only transcriptome studies, but also the classification of more general data.


2021 ◽  
Author(s):  
Kangning Dong ◽  
Shihua Zhang

Abstract With the rapid development of single-cell ATAC-seq technology, it has become possible to profile the chromatin accessibility of massive numbers of individual cells. However, it remains challenging to characterize their regulatory heterogeneity due to the high-dimensional, sparse and near-binary nature of the data. Most existing data representation methods were designed based on correlation, which may be ill-defined for sparse data. Moreover, these methods do not adequately address the issue of excessive zeros. Thus, a simple, fast and scalable approach is needed to analyze single-cell ATAC-seq data from massive numbers of cells, address the “missingness” and accurately categorize cell types. To this end, we developed a network diffusion method for scalable embedding of massive single-cell ATAC-seq data (named scAND). Specifically, we considered the near-binary single-cell ATAC-seq data as a bipartite network that reflects the accessibility relationship between cells and accessible regions, and further adopted a simple and scalable network diffusion method to embed it. scAND can take information from similar cells to alleviate the sparsity and improve cell type identification. Extensive tests and comparison with existing methods using synthetic and real data as benchmarks demonstrated its clear advantages in terms of clustering accuracy, robustness, scalability and data integration. Availability The Python-based scAND tool is freely available at http://page.amss.ac.cn/shihua.zhang/software.html.
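
To make the idea of diffusion on a cell-by-region bipartite network concrete, here is a minimal pure-Python sketch of a generic random-walk-with-restart style smoothing. It is an assumption-laden illustration, not scAND's actual algorithm, which the abstract does not specify in detail; the function name `diffuse`, the parameters `alpha` and `steps`, and the toy matrix are hypothetical.

```python
def diffuse(X, alpha=0.5, steps=2):
    """Smooth a binary cell-by-region matrix by diffusion on the bipartite graph.

    X: list of rows (cells), each a list of 0/1 accessibility values.
    Each step mixes the original profile (weight 1 - alpha) with the profiles
    of cells reached by a two-hop walk cell -> region -> cell (weight alpha).
    """
    n_cells, n_regions = len(X), len(X[0])
    cell_deg = [sum(row) or 1 for row in X]
    region_deg = [sum(X[i][j] for i in range(n_cells)) or 1
                  for j in range(n_regions)]
    # Two-hop transition probabilities between cells via shared regions.
    P = [[sum(X[i][j] * X[k][j] / (cell_deg[i] * region_deg[j])
              for j in range(n_regions))
          for k in range(n_cells)]
         for i in range(n_cells)]
    S = [[float(v) for v in row] for row in X]
    for _ in range(steps):
        S = [[alpha * sum(P[i][k] * S[k][j] for k in range(n_cells))
              + (1 - alpha) * X[i][j]
              for j in range(n_regions)]
             for i in range(n_cells)]
    return S

# Toy data (hypothetical): cells 0 and 1 share region 0; cell 2 is unrelated.
X = [[1, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 1]]
S = diffuse(X)
for row in S:
    print([round(v, 3) for v in row])
```

In the toy output, cell 1 receives a non-zero score for region 1 borrowed from the similar cell 0, while it stays at zero for the regions that only the unrelated cell 2 opens, which is the sense in which diffusion can alleviate sparsity.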


2020 ◽  
Author(s):  
Dylan Farnsworth ◽  
Mason Posner ◽  
Adam Miller

Abstract The vertebrate lens is a valuable model system for investigating the gene expression changes that coordinate tissue differentiation due to its inclusion of two spatially separated cell types, the outer epithelial cells and the deeper denucleated fiber cells that they support. Zebrafish are a useful model system for studying lens development given the organ’s rapid development in the first several days of life in an accessible, transparent embryo. While we have strong foundational knowledge of the diverse lens crystallin proteins and the basic gene regulatory networks controlling lens development, no study has detailed gene expression in a vertebrate lens at single cell resolution. Here we report an atlas of lens gene expression in zebrafish embryos at single cell resolution through five days of development, identifying a number of novel regulators of lens development as potential targets for future functional studies. Our temporospatial expression data address open questions about the function of α-crystallins during lens development and provide the first detailed view of β- and γ-crystallin expression in and outside the lens. We describe subfunctionalization in transcription factor genes that occur as paralog pairs in the zebrafish. Finally, we examine the expression dynamics of cytoskeletal, RNA-binding, and transcription factor genes, identifying a number of novel patterns. Overall these data provide a foundation for identifying and characterizing lens developmental regulatory mechanisms and revealing targets for future functional studies with potential therapeutic impact.

