Probabilistic estimation of short sequence expression using RNA-Seq data and the “positional bootstrap”

2016 ◽  
Author(s):  
Hui Y. Xiong ◽  
Leo J. Lee ◽  
Hannes Bretschneider ◽  
Jiexin Gao ◽  
Nebojsa Jojic ◽  
...  

Abstract When estimating expression of a transcript or part of a transcript using RNA-seq data, it is commonly assumed that reads are generated uniformly from positions within the transcript. While this assumption is acceptable for long transcript sequences where reads from many positions are averaged, it frequently leads to large errors for short sequences, e.g., less than 100 bp. Analysis of short sequences, such as when studying splice junctions and microRNAs, is increasingly important and necessitates addressing errors in short-sequence expression estimation. Indeed, when we examined RNA-seq data from diverse studies, we found that large errors are introduced by variations in RNA-seq coverage due to sequence content, experimental conditions and sample preparation. We developed a technique that we call the positional bootstrap, which quantifies the level of uncertainty in expression induced by non-uniform coverage. Unlike methods that attempt to correct for biases in coverage, but do so by making strong assumptions about the form of those biases, the positional bootstrap can quantify the noise induced by all types of bias, including unknown ones. Results obtained using independently generated RNA-seq datasets show that the positional bootstrap increases the accuracy of estimates of alternative splicing levels, tissue-differential alternative splicing and tissue-differential expression, by a factor of up to 10. A Python implementation of the algorithm to quantify splicing levels is freely available from github.com/PSI-Lab/BENTO-Seq.
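The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, assuming that a positional bootstrap resamples positions (rather than individual reads) with replacement and recomputes the expression estimate each time. The function names `positional_bootstrap` and `percentile_interval`, the mean-coverage estimator, and the example counts are all hypothetical and are not taken from BENTO-Seq.

```python
import random
from statistics import mean


def positional_bootstrap(position_counts, n_boot=1000, seed=0):
    """Resample positions with replacement and return the bootstrap means.

    position_counts: read counts observed at each position of a short
    sequence (hypothetical input format, assumed for this sketch).
    """
    if not position_counts:
        raise ValueError("position_counts must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(position_counts)
    estimates = []
    for _ in range(n_boot):
        # Draw n positions with replacement; each draw carries its count.
        sample = rng.choices(position_counts, k=n)
        estimates.append(mean(sample))
    return estimates


def percentile_interval(values, lower=0.025, upper=0.975):
    """Return a simple percentile interval from a list of bootstrap estimates."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


# Hypothetical counts at 8 positions; one position dominates the coverage.
counts = [0, 1, 0, 2, 40, 1, 0, 3]
boot = positional_bootstrap(counts)
low, high = percentile_interval(boot)
print(f"naive mean coverage: {mean(counts):.2f}")
print(f"95% bootstrap interval: [{low:.2f}, {high:.2f}]")
```

With strongly non-uniform coverage such as this, the interval is wide, which is the kind of uncertainty the abstract says a uniform-coverage assumption hides for short sequences.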

2020 ◽  
Author(s):  
Benjamin Kellman ◽  
Hratch Baghdassarian ◽  
Tiziano Pramparo ◽  
Isaac Shamie ◽  
Vahid Gazestani ◽  
...  

Abstract Background: Both RNA-Seq and sample freeze-thaw are ubiquitous. However, knowledge about the impact of freeze-thaw on downstream analyses is limited. The lack of common quality metrics that are sufficiently sensitive to freeze-thaw and RNA degradation, e.g. the RNA Integrity Score, makes such assessments challenging. Results: Here we quantify the impact of repeated freeze-thaw cycles on the reliability of RNA-Seq by examining poly(A)-enriched and ribosomal RNA depleted RNA-seq from frozen leukocytes drawn from a toddler Autism cohort. To do so, we estimate the relative noise, or percentage of random counts, separating technical replicates. Using this approach we measured noise associated with RIN and freeze-thaw cycles. As expected, RIN does not fully capture sample degradation due to freeze-thaw. We further examined differential expression results and found that three freeze-thaws should extinguish the differential expression reproducibility of similar experiments. Freeze-thaw also resulted in a 3’ shift in the read coverage distribution along the gene body of poly(A)-enriched samples compared to ribosomal RNA depleted samples, suggesting that library preparation may exacerbate freeze-thaw-induced sample degradation. Conclusion: The use of poly(A)-enrichment for RNA sequencing is pervasive in library preparation of frozen tissue, and thus, it is important during experimental design and data analysis to consider the impact of repeated freeze-thaw cycles on reproducibility.


RNA Biology ◽  
2020 ◽  
pp. 1-14
Author(s):  
Wenbin Guo ◽  
Nikoleta A Tzioutziou ◽  
Gordon Stephen ◽  
Iain Milne ◽  
Cristiane PG Calixto ◽  
...  

2021 ◽  
Author(s):  
Benjamin Kellman ◽  
Hratch Baghdassarian ◽  
Tiziano Pramparo ◽  
Isaac Shamie ◽  
Vahid Gazestani ◽  
...  



2016 ◽  
Author(s):  
Huijuan Feng ◽  
Tingting Li ◽  
Xuegong Zhang

Abstract Background: Alternative splicing is a ubiquitous post-transcriptional process in most eukaryotic genes. Aberrant splicing isoforms and abnormal isoform ratios can contribute to cancer development. Kinase genes are key regulators of various cellular processes. Many kinases are found to be oncogenic and have been intensively investigated in the study of cancer and drugs. RNA-Seq provides a powerful technology for genome-wide study of alternative splicing in cancer besides the conventional gene expression profiling. But this potential has not been fully demonstrated yet. Methods: Here we characterized the transcriptome profile of prostate cancer using RNA-Seq data from viewpoints of both differential expression and differential splicing, with an emphasis on kinase genes and their splicing variations. We built up a pipeline to conduct differential expression and differential splicing analysis. Further functional enrichment analysis was performed to explore functional interpretation of the genes. With focus on kinase genes, we performed kinase domain analysis to identify the functionally important candidate kinase gene in prostate cancer. We further calculated the expression level of isoforms to explore the function of isoform switching of kinase genes in prostate cancer. Results: We identified distinct gene groups from differential expression and splicing analysis, which suggested that alternative splicing adds another level to gene expression regulation. Enriched GO terms of differentially expressed and spliced kinase genes were found to play different roles in regulation of cellular metabolism. Function analysis on differentially spliced kinase genes showed that differentially spliced exons of these genes are significantly enriched in protein kinase domains. Among them, we found that gene CDK5 has isoform switching between prostate cancer and benign tissues, which may affect cancer development by changing androgen receptor (AR) phosphorylation. The observation was validated in another RNA-Seq dataset of prostate cancer cell lines. Conclusions: Our work characterized the expression and splicing profile of kinase genes in prostate cancer and proposed a hypothetical model on isoform switching of CDK5 and AR phosphorylation in prostate cancer. These findings bring new understanding to the role of alternatively spliced kinases in prostate cancer and demonstrate the use of RNA-Seq data in studying alternative splicing in cancer.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Benjamin P. Kellman ◽  
Hratch M. Baghdassarian ◽  
Tiziano Pramparo ◽  
Isaac Shamie ◽  
Vahid Gazestani ◽  
...  

Abstract Background Both RNA-Seq and sample freeze-thaw are ubiquitous. However, knowledge about the impact of freeze-thaw on downstream analyses is limited. The lack of common quality metrics that are sufficiently sensitive to freeze-thaw and RNA degradation, e.g. the RNA Integrity Score, makes such assessments challenging. Results Here we quantify the impact of repeated freeze-thaw cycles on the reliability of RNA-Seq by examining poly(A)-enriched and ribosomal RNA depleted RNA-seq from frozen leukocytes drawn from a toddler Autism cohort. To do so, we estimate the relative noise, or percentage of random counts, separating technical replicates. Using this approach we measured noise associated with RIN and freeze-thaw cycles. As expected, RIN does not fully capture sample degradation due to freeze-thaw. We further examined differential expression results and found that three freeze-thaws should extinguish the differential expression reproducibility of similar experiments. Freeze-thaw also resulted in a 3′ shift in the read coverage distribution along the gene body of poly(A)-enriched samples compared to ribosomal RNA depleted samples, suggesting that library preparation may exacerbate freeze-thaw-induced sample degradation. Conclusion The use of poly(A)-enrichment for RNA sequencing is pervasive in library preparation of frozen tissue, and thus, it is important during experimental design and data analysis to consider the impact of repeated freeze-thaw cycles on reproducibility.


2019 ◽  
Author(s):  
Wenbin Guo ◽  
Nikoleta Tzioutziou ◽  
Gordon Stephen ◽  
Iain Milne ◽  
Cristiane Calixto ◽  
...  

Abstract RNA-seq analysis of gene expression and alternative splicing should be routine and robust but is often a bottleneck for biologists because of reliance on specialized bioinformatics skills. Thus, we have developed “3D RNA-seq”, an R shiny App and web-based service which provides an easy-to-use, flexible and powerful tool for three-component analysis of RNA-seq data: Differential Expression, Differential Alternative Splicing and Differential Transcript Usage. 3D RNA-seq integrates state-of-the-art, highly rated differential expression analysis tools and adopts best practice for RNA-seq analysis. It operates through a user-friendly graphical interface, can handle complex experimental designs, allows setting of statistical parameters, tracks results through graphics and tables, and generates figures and a comprehensive report that will guarantee reproducibility. 3D RNA-seq can be applied to any species and is designed to be run by biologists with no programming skills (or by bioinformaticians) allowing lab scientists to perform rapid and accurate analysis of RNA-seq data.


2021 ◽  
Author(s):  
Runxuan Zhang ◽  
Richard Kuo ◽  
Max Coulter ◽  
Cristiane P.G. Calixto ◽  
Juan Carlos Entizne ◽  
...  

Background Accurate and comprehensive annotation of transcript sequences is essential for transcript quantification and differential gene and transcript expression analysis. Single molecule long read sequencing technologies provide improved integrity of transcript structures including alternative splicing, and transcription start and polyadenylation sites. However, accuracy is significantly affected by sequencing errors, mRNA degradation or incomplete cDNA synthesis. Results We present a new and comprehensive Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3). AtRTD3 contains over 160k transcripts - twice that of the best current Arabidopsis transcriptome and including over 1,500 novel genes. 79% of transcripts are from Iso-seq with accurately defined splice junctions and transcription start and end sites. We developed novel methods to determine splice junctions and transcription start and end sites accurately. Mis-match profiles around splice junctions provided a powerful feature to distinguish correct splice junctions and remove false splice junctions. Stratified approaches identified high confidence transcription start/end sites and removed fragmentary transcripts due to degradation. AtRTD3 is a major improvement over existing transcriptomes as demonstrated by analysis of an Arabidopsis cold response RNA-seq time-series. AtRTD3 provided higher resolution of transcript expression profiling and identified cold- and light-induced differential transcription start and polyadenylation site usage. Conclusions AtRTD3 is the most comprehensive Arabidopsis transcriptome currently available. It improves the precision of differential gene and transcript expression, differential alternative splicing, and transcription start/end site usage from RNA-seq data. The novel methods for identifying accurate splice junctions and transcription start/end sites are widely applicable and will improve single molecule sequencing analysis from any species.


2019 ◽  
Vol 20 (S16) ◽  
Author(s):  
Kefei Liu ◽  
Li Shen ◽  
Hui Jiang

Abstract Background A fundamental problem in RNA-seq data analysis is to identify genes or exons that are differentially expressed with varying experimental conditions based on the read counts. The relative nature of RNA-seq measurements makes the between-sample normalization of read counts an essential step in differential expression (DE) analysis. In most existing methods, the normalization step is performed prior to the DE analysis. Recently, Jiang and Zhan proposed a statistical method which introduces sample-specific normalization parameters into a joint model, which allows for simultaneous normalization and differential expression analysis from log-transformed RNA-seq data. Furthermore, an ℓ0 penalty is used to yield a sparse solution which selects a subset of DE genes. The experimental conditions are restricted to be categorical in their work. Results In this paper, we generalize Jiang and Zhan’s method to handle experimental conditions that are measured in continuous variables. As a result, genes with expression levels associated with a single or multiple covariates can be detected. Because the problem is high-dimensional, non-differentiable and non-convex, we develop an efficient algorithm for model fitting. Conclusions Experiments on synthetic data demonstrate that the proposed method outperforms existing methods in terms of detection accuracy when a large fraction of genes are differentially expressed in an asymmetric manner, and the performance gain becomes more substantial for larger sample sizes. We also apply our method to a real prostate cancer RNA-seq dataset to identify genes associated with pre-operative prostate-specific antigen (PSA) levels in patients.
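As an illustration only, a joint model of the kind described above might be written as follows. The notation is an assumption made for this sketch and is not taken from the paper: $y_{ij}$ is the log-transformed count of gene $i$ in sample $j$, $\alpha_i$ a gene-specific intercept, $d_j$ the sample-specific normalization parameter, $x_j$ the vector of continuous covariates for sample $j$, and $\beta_i$ the gene's coefficients. The squared-error loss is likewise assumed.

```latex
\begin{align}
y_{ij} &= \alpha_i + d_j + x_j^{\top}\beta_i + \varepsilon_{ij}, \\
(\hat{\alpha}, \hat{d}, \hat{\beta}) &= \arg\min_{\alpha,\, d,\, \beta}
  \sum_{i=1}^{m}\sum_{j=1}^{n}
  \bigl(y_{ij} - \alpha_i - d_j - x_j^{\top}\beta_i\bigr)^2
  + \lambda \sum_{i=1}^{m} \mathbf{1}\{\beta_i \neq 0\}
\end{align}
```

Under these assumptions the ℓ0 term counts the genes with non-zero $\beta_i$, so increasing $\lambda$ selects fewer DE genes, and estimating $d_j$ inside the same objective is what makes normalization and DE detection simultaneous.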

