Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification

F1000Research, 2018, Vol 7, pp. 952. DOI: 10.12688/f1000research.15398.3. Cited By ~ 8

Author(s): Michael I. Love, Charlotte Soneson, Rob Patro

Keyword(s): Software Package, Gene Expression Analysis, Real Data, Transcript Level, Bioinformatic Analysis, RNA-Seq, Statistical Framework, Gene Level, Show Evidence, Differential Gene

Detection of differential transcript usage (DTU) from RNA-seq data is an important bioinformatic analysis that complements differential gene expression analysis. Here we present a simple workflow using a set of existing R/Bioconductor packages for analysis of DTU. We show how these packages can be used downstream of RNA-seq quantification using the Salmon software package. The entire pipeline is fast, benefiting from inference steps by Salmon to quantify expression at the transcript level. The workflow includes live, runnable code chunks for analysis using DRIMSeq and DEXSeq, as well as for performing two-stage testing of DTU using the stageR package, a statistical framework to screen at the gene level and then confirm which transcripts within the significant genes show evidence of DTU. We evaluate these packages and other related packages on a simulated dataset with parameters estimated from real data.
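To make the term concrete: transcript usage is the share of a gene's total expression contributed by each of its transcripts, and DTU is a change in those shares between conditions. The sketch below is a minimal illustration only, not the workflow's code; the gene and transcript names, the counts, and the helper `usage_proportions` are all hypothetical.

```python
# Hypothetical toy data: estimated counts per transcript,
# samples 1-2 in condition 1 and samples 3-4 in condition 2.
counts = {
    "geneA": {"txA1": [90, 110, 20, 30], "txA2": [10, 10, 80, 70]},
    "geneB": {"txB1": [50, 60, 100, 120], "txB2": [50, 60, 100, 120]},
}


def usage_proportions(gene_counts):
    """Return each transcript's share of its gene's total count, per sample."""
    n_samples = len(next(iter(gene_counts.values())))
    totals = [sum(tx[i] for tx in gene_counts.values()) for i in range(n_samples)]
    return {
        tx_id: [c / t if t > 0 else float("nan") for c, t in zip(tx_counts, totals)]
        for tx_id, tx_counts in gene_counts.items()
    }


props = {gene: usage_proportions(txs) for gene, txs in counts.items()}
for gene, txs in props.items():
    for tx_id, shares in txs.items():
        print(gene, tx_id, [round(s, 2) for s in shares])
```

In this toy example geneA switches its dominant transcript between conditions (a DTU pattern), whereas geneB doubles its total expression while its proportions stay constant (differential expression without DTU).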

Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences

F1000Research, 2016, Vol 4, pp. 1521. DOI: 10.12688/f1000research.7563.2. Cited By ~ 268

Author(s): Charlotte Soneson, Michael I. Love, Mark D. Robinson

Keyword(s): Statistical Inference, High Throughput Sequencing, Real Data, Transcript Level, R Package, Data Sets, RNA-Seq, Abundance Estimates, Gene Level, Genomic Regions

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
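The idea of carrying transcript-level estimates up to the gene level can be sketched as follows. This is a simplified illustration for a single sample and is not the tximport API: the function `summarize_to_gene`, the input layout, and the toy values are hypothetical, and the abundance-weighted average transcript length used as the basis for an offset is an assumption of this sketch.

```python
def summarize_to_gene(tx_rows, tx2gene):
    """Aggregate per-transcript (count, tpm, effective_length) rows to genes.

    Counts and TPMs are summed; the gene-level length is the TPM-weighted
    average of the effective lengths of the gene's transcripts.
    """
    genes = {}
    for tx_id, (count, tpm, eff_len) in tx_rows.items():
        g = genes.setdefault(
            tx2gene[tx_id], {"count": 0.0, "tpm": 0.0, "weighted_len": 0.0}
        )
        g["count"] += count
        g["tpm"] += tpm
        g["weighted_len"] += tpm * eff_len
    return {
        gene: {
            "count": g["count"],
            "tpm": g["tpm"],
            "avg_length": g["weighted_len"] / g["tpm"] if g["tpm"] > 0 else float("nan"),
        }
        for gene, g in genes.items()
    }


# Hypothetical quantification output for one sample.
tx_rows = {
    "tx1": (100.0, 30.0, 1000.0),
    "tx2": (50.0, 10.0, 2000.0),
    "tx3": (20.0, 5.0, 500.0),
}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_level = summarize_to_gene(tx_rows, tx2gene)
print(gene_level["geneA"])
```

Because the average length depends on which isoforms are expressed, it shifts between samples when isoform usage shifts, which is why a per-sample, per-gene offset can compensate for differential isoform usage in a count-based model.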

RNA-Seq workflow: gene-level exploratory analysis and differential expression

F1000Research, 2016, Vol 4, pp. 1070. DOI: 10.12688/f1000research.7035.2. Cited By ~ 24

Author(s): Michael I. Love, Simon Anders, Vladislav Kim, Wolfgang Huber

Keyword(s): Differential Expression, Gene Expression Analysis, Exploratory Data Analysis, Reference Genome, RNA-Seq, Differential Gene Expression Analysis, Gene Level, Exploratory Data, Differential Gene, The Relationship

Here we walk through an end-to-end gene-level RNA-Seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.

RNA-Seq workflow: gene-level exploratory analysis and differential expression

F1000Research, 2015, Vol 4, pp. 1070. DOI: 10.12688/f1000research.7035.1. Cited By ~ 121

Author(s): Michael I. Love, Simon Anders, Vladislav Kim, Wolfgang Huber

Keyword(s): Differential Expression, Gene Expression Analysis, Exploratory Data Analysis, Reference Genome, RNA-Seq, Differential Gene Expression Analysis, Gene Level, Exploratory Data, Differential Gene, The Relationship

Here we walk through an end-to-end gene-level RNA-Seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.

Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences

F1000Research, 2015, Vol 4, pp. 1521. DOI: 10.12688/f1000research.7563.1. Cited By ~ 704

Author(s): Charlotte Soneson, Michael I. Love, Mark D. Robinson

Keyword(s): Statistical Inference, High Throughput Sequencing, Simulated Data, Real Data, Transcript Level, R Package, RNA-Seq, Abundance Estimates, Gene Level, The Difference

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Several different quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that while the presence of differential isoform usage can lead to inflated false discovery rates in differential expression analyses on simple count matrices and transcript-level abundance estimates improve the performance in simulated data, the difference is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.

Faculty of 1000 evaluation for Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.

F1000 - Post-publication peer review of the biomedical literature, 2016. DOI: 10.3410/f.726079641.793513319

Author(s): Wolfgang Huber

Keyword(s): Transcript Level, RNA-Seq, Gene Level

No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data

Statistical Applications in Genetics and Molecular Biology, 2017, Vol 16 (2). DOI: 10.1515/sagmb-2017-0010. Cited By ~ 1

Author(s): Aaron T. L. Lun, Gordon K. Smyth

Keyword(s): Software Package, Error Control, Degrees Of Freedom, Linear Models, Type I Error, Real Data, Type I, RNA-Seq, Study Gene Expression, Complex Models

RNA sequencing (RNA-seq) is widely used to study gene expression changes associated with treatments or biological conditions. Many popular methods for detecting differential expression (DE) from RNA-seq data use generalized linear models (GLMs) fitted to the read counts across independent replicate samples for each gene. This article shows that the standard formula for the residual degrees of freedom (d.f.) in a linear model is overstated when the model contains fitted values that are exactly zero. Such fitted values occur whenever all the counts in a treatment group are zero as well as in more complex models such as those involving paired comparisons. This misspecification results in underestimation of the genewise variances and loss of type I error control. This article proposes a formula for the reduced residual d.f. that restores error control in simulated RNA-seq data and improves detection of DE genes in a real data analysis. The new approach is implemented in the quasi-likelihood framework of the edgeR software package. The results of this article also apply to RNA-seq analyses that apply linear models to log-transformed counts, such as those in the limma software package, and more generally to any count-based GLM where exactly zero fitted values are possible.
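The d.f. argument can be illustrated for the simplest case, a one-way layout with one coefficient per treatment group. The sketch below is a hedged illustration rather than the article's general formula or the edgeR implementation: the function `residual_df_one_way` and the toy counts are hypothetical, and the rule shown (drop an all-zero group's observations together with its coefficient) is assumed to cover only this simple design.

```python
def residual_df_one_way(groups):
    """Return (standard, reduced) residual d.f. for one gene in a one-way layout.

    groups maps a group label to the list of replicate counts in that group.
    """
    n_obs = sum(len(v) for v in groups.values())
    n_coef = len(groups)
    standard = n_obs - n_coef

    # A group whose counts are all zero has fitted values of exactly zero,
    # so its replicates carry no information about variability. Remove both
    # its observations and its coefficient before counting d.f.
    informative = {k: v for k, v in groups.items() if any(c > 0 for c in v)}
    reduced = sum(len(v) for v in informative.values()) - len(informative)
    return standard, reduced


# Hypothetical gene with no reads in any control replicate.
standard, reduced = residual_df_one_way(
    {"control": [0, 0, 0], "treated": [12, 7, 15]}
)
print(standard, reduced)
```

Here the standard formula reports 4 residual d.f., but only the 3 treated replicates inform the variance estimate, leaving 2.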

Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data

Genome Biology, 2013, Vol 14 (9), pp. R95. DOI: 10.1186/gb-2013-14-9-r95. Cited By ~ 408

Author(s): Franck Rapaport, Raya Khanin, Yupu Liang, Mono Pirun, Azra Krek, ...

Keyword(s): Gene Expression, Differential Gene Expression, Expression Analysis, Gene Expression Analysis, Comprehensive Evaluation, RNA-Seq, Differential Gene Expression Analysis, Analysis Methods, Differential Gene

A general and powerful stage-wise testing procedure for differential expression and differential transcript usage

2017. DOI: 10.1101/109082. Cited By ~ 1

Author(s): Koen Van den Berge, Charlotte Soneson, Mark D. Robinson, Lieven Clement

Keyword(s): Multiple Testing, Statistical Power, Error Control, Transcript Level, Testing Procedure, Cancer Case, RNA-Seq, Post Hoc Analysis, Gene Level, Post Hoc

Background: Reductions in sequencing cost and innovations in expression quantification have prompted an emergence of RNA-seq studies with complex designs and data analysis at transcript resolution. These applications involve multiple hypotheses per gene, leading to challenging multiple testing problems. Conventional approaches provide separate top-lists for every contrast and false discovery rate (FDR) control at individual hypothesis level. Hence, they fail to establish proper gene-level error control, which compromises downstream validation experiments. Tests that aggregate individual hypotheses are more powerful and provide gene-level FDR control, but in the RNA-seq literature no methods are available for post-hoc analysis of individual hypotheses. Results: We introduce a two-stage procedure that leverages the increased power of aggregated hypothesis tests while maintaining high biological resolution by post-hoc analysis of genes passing the screening hypothesis. Our method is evaluated on simulated and real RNA-seq experiments. It provides gene-level FDR control in studies with complex designs while boosting power for interaction effects without compromising the discovery of main effects. In a differential transcript usage/expression context, stage-wise testing gains power by aggregating hypotheses at the gene level, while providing transcript-level assessment of genes passing the screening stage. Finally, a prostate cancer case study highlights the relevance of combining gene with transcript level results. Conclusion: Stage-wise testing is a general paradigm that can be adopted whenever individual hypotheses can be aggregated. In our context, it achieves an optimal middle ground between biological resolution and statistical power while providing gene-level FDR control, which is beneficial for downstream biological interpretation and validation.
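The screening-then-confirmation logic can be sketched in a few lines. This is a simplified illustration under stated assumptions and not the authors' implementation: the screening stage applies Benjamini-Hochberg to gene-level p-values, the confirmation stage tests at a level scaled by the fraction of genes that passed, and a plain Bonferroni correction within each gene stands in for whatever within-gene correction the actual method uses. The function names and all p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def stagewise_test(gene_p, tx_p, alpha=0.05):
    """Screen genes with BH, then confirm transcripts within passing genes."""
    genes = list(gene_p)
    adjusted = bh_adjust([gene_p[g] for g in genes])
    passed = [g for g, q in zip(genes, adjusted) if q <= alpha]

    # Confirmation level scaled by the fraction of genes passing the screen.
    alpha_stage2 = alpha * len(passed) / len(genes)
    confirmed = {}
    for g in passed:
        n_tx = len(tx_p[g])
        # Bonferroni within the gene (a simplifying assumption of this sketch).
        confirmed[g] = [tx for tx, p in tx_p[g].items() if p * n_tx <= alpha_stage2]
    return passed, confirmed


# Hypothetical gene-level (screening) and transcript-level p-values.
gene_p = {"geneA": 0.001, "geneB": 0.60, "geneC": 0.02, "geneD": 0.90}
tx_p = {
    "geneA": {"txA1": 0.0005, "txA2": 0.004, "txA3": 0.5},
    "geneB": {"txB1": 0.4, "txB2": 0.7},
    "geneC": {"txC1": 0.01, "txC2": 0.3},
    "geneD": {"txD1": 0.8, "txD2": 0.95},
}

passed, confirmed = stagewise_test(gene_p, tx_p)
print(passed)
print(confirmed)
```

Transcripts are only examined for genes that pass the screen, so transcripts of geneB and geneD are never tested individually.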
