Estimation of isoform expression in RNA-seq data using a hierarchical Bayesian model

2015 ◽  
Vol 13 (06) ◽  
pp. 1542001 ◽  
Author(s):  
Zengmiao Wang ◽  
Jun Wang ◽  
Changjing Wu ◽  
Minghua Deng

Estimation of gene or isoform expression is a fundamental step in many transcriptome analysis tasks, such as differential expression analysis, eQTL (or sQTL) studies, and biological network construction. RNA-seq technology enables us to monitor expression on a genome-wide scale at single-base-pair resolution and offers the possibility of accurately measuring expression at the isoform level. However, challenges remain because of non-uniform read sampling and the presence of various biases in RNA-seq data. In this paper, we present a novel hierarchical Bayesian method to estimate isoform expression. While most existing methods treat gene expression as a by-product, we incorporate it into our model and explicitly describe its relationship with the corresponding isoform expression using a Multinomial distribution. In this way, gene and isoform expression are included in a unified framework, which helps us achieve better performance than other state-of-the-art algorithms for isoform expression estimation. The effectiveness of the proposed method is demonstrated using both simulated data with known ground truth and two real RNA-seq datasets from the MAQC project. The code is available at http://www.math.pku.edu.cn/teachers/dengmh/GIExp/ .
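As a rough illustration of the Multinomial link between gene-level and isoform-level expression described above, the following sketch distributes a gene's reads among its isoforms according to fixed proportions. It is not the authors' model: the function name `sample_isoform_counts`, the proportions in `theta`, and the read count are all hypothetical, and the sketch ignores read-sampling non-uniformity and the other biases the abstract mentions.

```python
import random


def sample_isoform_counts(gene_count, proportions, rng):
    """Draw isoform read counts from Multinomial(gene_count, proportions)."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("isoform proportions must sum to 1")
    counts = [0] * len(proportions)
    # Assign each of the gene's reads to one isoform, weighted by proportion.
    picks = rng.choices(range(len(proportions)), weights=proportions, k=gene_count)
    for idx in picks:
        counts[idx] += 1
    return counts


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
theta = [0.6, 0.3, 0.1]                # hypothetical isoform proportions
counts = sample_isoform_counts(1000, theta, rng)

# Naive plug-in estimate of the proportions from the sampled counts.
estimated = [c / sum(counts) for c in counts]
print(counts, estimated)
```

The isoform counts always sum to the gene-level count, which is the sense in which the two levels sit in one framework; a full hierarchical model would additionally place priors on the proportions and the gene-level abundance.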

2019 ◽  
Author(s):  
Avi Srivastava ◽  
Laraib Malik ◽  
Hirak Sarkar ◽  
Mohsen Zakeri ◽  
Fatemeh Almodaresi ◽  
...  

Abstract Background The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. Results We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large, and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Conclusion We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.


2019 ◽  
Author(s):  
David Gerard

Abstract With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations of theoretical models, this can result in unsubstantiated claims of a method's performance. Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.
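One way to add a known signal to real counts, sketched below, is binomial thinning: reads in one group are randomly discarded so that the other group appears more highly expressed by a chosen fold change. The abstract itself does not name the technique, so treating thinning as the mechanism is an assumption here. The function names and the toy counts are hypothetical, and nothing below uses or imitates the seqgendiff API.

```python
import random


def thin_count(count, keep_prob, rng):
    """Binomially thin one count: each read is kept with probability keep_prob."""
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must lie in [0, 1]")
    return sum(rng.random() < keep_prob for _ in range(count))


def add_two_group_signal(counts, groups, fold_change, rng):
    """Make group 1 look `fold_change` times higher than group 0 for one gene.

    Group 0 samples are thinned by 1 / fold_change; group 1 samples are left
    untouched, so every other feature of the real data is preserved.
    """
    if fold_change < 1.0:
        raise ValueError("this sketch only handles fold_change >= 1")
    keep_prob = 1.0 / fold_change
    return [
        c if g == 1 else thin_count(c, keep_prob, rng)
        for c, g in zip(counts, groups)
    ]


rng = random.Random(42)
real_counts = [120, 80, 200, 150]      # hypothetical counts for one gene
groups = [0, 0, 1, 1]                  # hypothetical group labels per sample
signal_counts = add_two_group_signal(real_counts, groups, 2.0, rng)
print(signal_counts)
```

Because the added effect is known by construction, a differential expression method can then be scored against it while the counts retain whatever model violations the original dataset had.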


2014 ◽  
Author(s):  
Nuno A Fonseca ◽  
John A Marioni ◽  
Alvis Brazma

Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found, with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And, how close are the expression levels inferred to the 'true' expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g., read length and sequencing depth). In the absence of knowledge of the 'ground truth' in real RNA-seq data sets, we used simulated data to assess the differences between the true expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNA-seq data, it still allows us to assess the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.
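The distinction between findings i) and ii) can be illustrated with a small sketch: a pipeline can correlate well with the truth overall while still being badly wrong for individual genes. The metrics chosen here (Pearson correlation on log-transformed values and a pseudocount-stabilised relative error) and the toy numbers are assumptions for illustration, not the evaluation protocol of the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def relative_errors(truth, estimate, pseudocount=1.0):
    """Per-gene |estimate - truth| / (truth + pseudocount)."""
    return [abs(e - t) / (t + pseudocount) for t, e in zip(truth, estimate)]


# Hypothetical simulated truth and one pipeline's estimates for four genes.
truth = [10, 100, 1000, 50]
estimate = [12, 95, 1100, 5]

# Global agreement, computed on log(1 + x) to tame the dynamic range.
r = pearson([math.log1p(t) for t in truth], [math.log1p(e) for e in estimate])

# Gene-by-gene error, which exposes the outlier the correlation hides.
errors = relative_errors(truth, estimate)
worst_gene = max(range(len(errors)), key=errors.__getitem__)
print(round(r, 3), worst_gene, round(errors[worst_gene], 3))
```

Reporting both a global and a per-gene metric is one way to avoid concluding that a pipeline is accurate everywhere just because its overall correlation is high.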


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Avi Srivastava ◽  
Laraib Malik ◽  
Hirak Sarkar ◽  
Mohsen Zakeri ◽  
Fatemeh Almodaresi ◽  
...  

Abstract Background The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. Results We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Conclusion We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly, and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Matthew Chung ◽  
Vincent M. Bruno ◽  
David A. Rasko ◽  
Christina A. Cuomo ◽  
José F. Muñoz ◽  
...  

Abstract Advances in transcriptome sequencing allow for simultaneous interrogation of differentially expressed genes from multiple species originating from a single RNA sample, termed dual or multi-species transcriptomics. Compared to single-species differential expression analysis, the design of multi-species differential expression experiments must account for the relative abundances of each organism of interest within the sample, often requiring enrichment methods and yielding differences in total read counts across samples. The analysis of multi-species transcriptomics datasets requires modifications to the alignment, quantification, and downstream analysis steps compared to the single-species analysis pipelines. We describe best practices for multi-species transcriptomics and differential gene expression.


Animals ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. 1745
Author(s):  
Ben-Ben Miao ◽  
Su-Fang Niu ◽  
Ren-Xie Wu ◽  
Zhen-Bang Liang ◽  
Bao-Gui Tang ◽  
...  

Pearl gentian grouper (Epinephelus fuscoguttatus ♀ × Epinephelus lanceolatus ♂) is a fish of high commercial value in the aquaculture industry in Asia. However, this hybrid fish is not cold-tolerant, and the molecular regulation mechanism underlying its response to cold stress remains largely elusive. This study thus investigated the liver transcriptomic responses of pearl gentian grouper by comparing the gene expression of cold stress groups (20, 15, 12, and 12 °C for 6 h) with that of the control group (25 °C) using PacBio SMRT-Seq and Illumina RNA-Seq technologies. In SMRT-Seq analysis, a total of 11,033 full-length transcripts were generated and used as reference sequences for further RNA-Seq analysis. In RNA-Seq analysis, 3271 differentially expressed genes (DEGs), two low-temperature specific modules (tan and blue modules), and two significantly expressed gene sets (profiles 0 and 19) were screened by differential expression analysis, weighted gene co-expression network analysis (WGCNA), and short time-series expression miner (STEM), respectively. The intersection of the above analyses further revealed some key genes, such as PCK, ALDOB, FBP, G6pC, CPT1A, PPARα, SOCS3, PPP1CC, CYP2J, HMGCR, CDKN1B, and GADD45Bc. These genes were significantly enriched in carbohydrate metabolism, lipid metabolism, signal transduction, and endocrine system pathways. All these pathways were linked to biological functions relevant to cold adaptation, such as energy metabolism, stress-induced cell membrane changes, and transduction of stress signals. Taken together, our study explores an overall and complex regulation network of the functional genes in the liver of pearl gentian grouper, which could benefit the species in preventing damage caused by cold stress.


2021 ◽  
Vol 11 (8) ◽  
pp. 3562
Author(s):  
Yong Jin Lee ◽  
Sang Yong Park ◽  
Dae Yeon Kim ◽  
Jae Yoon Kim

Preharvest sprouting (PHS) is a key global issue in the production and end-use quality of cereals, particularly in regions where the rainfall season overlaps the harvest. To investigate transcriptomic changes in genes affected by PHS induction and ABA treatment, RNA-seq analysis was performed in two wheat cultivars that differ in PHS tolerance. A total of 123 unigenes related to hormone metabolism and signaling for abscisic acid (ABA), gibberellic acid (GA), indole-3-acetic acid (IAA), and cytokinin were identified, and 1862 differentially expressed genes (DEGs) were identified and divided into 8 groups by transcriptomic analysis. DEG analysis showed that the majority of genes were categorized in sugar-related processes, which interact with ABA signaling in the PHS-tolerant cultivar under PHS induction. Thus, genes related to ABA are key regulators of dormancy and germination. Our results give insight into global changes in the expression of plant hormone-related genes in response to PHS.


2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Xueyi Dong ◽  
Luyi Tian ◽  
Quentin Gouil ◽  
Hasaru Kariyawasam ◽  
Shian Su ◽  
...  

Abstract Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error rate and small library sizes, which decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.

