A tool for the comparison of transcript differential expression analysis pipelines

Author(s):  
Stefano Beretta ◽  
Yuri Pirola ◽  
Valeria Ranzani ◽  
Grazisa Rossetti ◽  
Raoul Bonnal ◽  
...  

MOTIVATION Long non-coding RNAs (lncRNAs) have recently gained interest, especially for their involvement in controlling several cell processes, but a full understanding of their role is still lacking. Differential Expression (DE) analysis is one of the most important tasks in the analysis of RNA-seq data, since it can point out genes involved in the regulation of the condition under study. However, a classical gene-level analysis may disregard the role of Alternative Splicing (AS) in regulating cell conditions. This is the case, for example, when a given gene is expressed in all conditions but the expressed isoform differs significantly between them (an isoform switch). A transcript-level analysis may shed better light on this case, especially in studies whose goal is, for example, a better understanding of the behavior of lncRNAs in T lymphocytes, which are fundamental in studies of specific diseases such as cancer. After Cufflinks/Cuffdiff, several approaches for DE analysis at the isoform/transcript level have been proposed. However, their results are often sensitive to the upstream analysis, such as read mapping, transcript reconstruction, and quantification, and it is often hard to choose a priori the most appropriate combination of tools. This work presents a tool for assisting the user in this choice and lays the basis for a study devoted to the characterization of lncRNAs and the identification of isoform switch events. Our tool includes a framework for describing and executing a set of DE pipelines over the same input dataset, as well as a set of tools for reconciling and comparing the results.

METHOD We designed an automated and easily customizable tool that executes a set of existing pipelines for transcript-level DE analysis starting from RNA-seq data. Our method is built upon Snakemake, a workflow management system, with the specific goal of reducing the complexity of creating workflows. This approach makes the experimentation fully replicable and easy to customize. Each considered pipeline is structured in three steps: (i) transcript assembly, (ii) quantification, and (iii) DE analysis. By default, our tool builds and compares 9 different pipelines, each taking the same set of RNA-seq reads as input. The pipelines are obtained by combining different state-of-the-art methods for transcript assembly (TA step) with different state-of-the-art methods for quantification and differential expression analysis (Q+DE step). More precisely, the 9 pipelines combine three options for the TA step (two tools, Cufflinks and StringTie, and a Reference Annotation, the Ensembl annotated transcripts) with three tools for the Q+DE step (Cuffquant+Cuffdiff, StringTie-B+Ballgown, and Kallisto+Sleuth).

Abstract truncated at 3,000 characters - the full version is available in the pdf file
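The count of 9 pipelines follows from pairing each of the three TA options with each of the three Q+DE options. A minimal sketch of that enumeration is given below; the tool labels come from the abstract, while the function name `enumerate_pipelines` and the dictionary layout are illustrative assumptions and not the tool's actual configuration format.

```python
from itertools import product

# Labels taken from the abstract; the data layout is an assumption for illustration.
TA_OPTIONS = ["Cufflinks", "StringTie", "Ensembl reference annotation"]
Q_DE_OPTIONS = ["Cuffquant+Cuffdiff", "StringTie-B+Ballgown", "Kallisto+Sleuth"]


def enumerate_pipelines(ta_options, q_de_options):
    """Return every (TA, Q+DE) pairing as a list of small dicts."""
    return [
        {"ta": ta, "q_de": q_de}
        for ta, q_de in product(ta_options, q_de_options)
    ]


pipelines = enumerate_pipelines(TA_OPTIONS, Q_DE_OPTIONS)
for index, pipeline in enumerate(pipelines, start=1):
    print(f"pipeline {index}: {pipeline['ta']} -> {pipeline['q_de']}")
```

Keeping the two option lists separate means that adding a fourth assembler or a fourth Q+DE method only requires extending one list.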

2016 ◽  


2021 ◽  
Author(s):  
Marine Gauthier ◽  
Denis Agniel ◽  
Rodolphe Thiébaut ◽  
Boris P. Hejblum

State-of-the-art methods for single-cell RNA-seq (scRNA-seq) Differential Expression Analysis (DEA) often rely on strong distributional assumptions that are difficult to verify in practice. Furthermore, while the increasing complexity of clinical and biological single-cell studies calls for greater tool versatility, the majority of existing methods only tackle the comparison between two conditions. We propose a novel, distribution-free, and flexible approach to DEA for single-cell RNA-seq data. This new method, called ccdf, tests the association of each gene expression with one or many variables of interest (that can be either continuous or discrete), while potentially adjusting for additional covariates. To test such complex hypotheses, ccdf uses a conditional independence test relying on the conditional cumulative distribution function, estimated through multiple regressions. We provide the asymptotic distribution of the ccdf test statistic as well as a permutation test (for when the number of observed cells is not sufficiently large). ccdf substantially expands the possibilities for scRNA-seq DEA studies: it obtains good statistical performance in various simulation scenarios considering complex experimental designs (i.e. beyond the two-condition comparison), while retaining competitive performance with state-of-the-art methods in a two-condition benchmark.


2019 ◽  
Author(s):  
Anqi Zhu ◽  
Avi Srivastava ◽  
Joseph G. Ibrahim ◽  
Rob Patro ◽  
Michael I. Love

Abstract A primary challenge in the analysis of RNA-seq data is to identify differentially expressed genes or transcripts while controlling for technical biases present in the observations. Ideally, a statistical testing procedure should incorporate information about the inherent uncertainty of the abundance estimates, whether at the gene or transcript level, that arises from quantification of abundance. Most popular methods for RNA-seq differential expression analysis fit a parametric model to the counts or scaled counts for each gene or transcript, and a subset of methods can incorporate information about the uncertainty of the counts. Previous work has shown that nonparametric models for RNA-seq differential expression may in some cases have better control of the false discovery rate, and adapt well to new data types without requiring reformulation of a parametric model. Existing nonparametric models do not take into account the inferential uncertainty of the observations, leading to an inflated false discovery rate, in particular at the transcript level. Here we propose a nonparametric model for differential expression analysis using inferential replicate counts, extending the existing SAMseq method to account for inferential uncertainty, batch effects, and sample pairing. We compare our method, “SAMseq With Inferential Samples Helps”, or Swish, with popular differential expression analysis methods. Swish has improved control of the false discovery rate, in particular for transcripts with high inferential uncertainty. We apply Swish to a single-cell RNA-seq dataset, assessing sensitivity to recover DE genes between sub-populations of cells, and compare its performance to the Wilcoxon rank sum test.
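The abstract does not spell out the test statistic, so the following is only a sketch of one simple way a rank-based test could use inferential replicates: compute a Wilcoxon rank-sum statistic on each replicate, average across replicates, and calibrate with label permutations. All function names and the toy counts are hypothetical, and this is not presented as the Swish implementation (batch effects and sample pairing are not handled).

```python
import random
from statistics import mean


def rank_sum(values, labels):
    """Wilcoxon rank-sum statistic: sum of midranks of the samples labelled 1."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return sum(r for r, lab in zip(ranks, labels) if lab == 1)


def mean_rank_sum(inf_reps, labels):
    """Average the rank-sum statistic over all inferential replicates."""
    return mean(rank_sum(rep, labels) for rep in inf_reps)


def permutation_pvalue(inf_reps, labels, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the replicate-averaged statistic."""
    rng = random.Random(seed)
    centre = sum(labels) * (len(labels) + 1) / 2  # null expectation of the rank sum
    observed = abs(mean_rank_sum(inf_reps, labels) - centre)
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if abs(mean_rank_sum(inf_reps, permuted) - centre) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data, invented for illustration: 3 inferential replicates of one
# transcript across 6 samples (3 per condition).
labels = [0, 0, 0, 1, 1, 1]
inf_reps = [
    [10, 12, 11, 30, 28, 33],
    [9, 13, 12, 29, 31, 30],
    [11, 11, 10, 32, 27, 35],
]
stat = mean_rank_sum(inf_reps, labels)
p_value = permutation_pvalue(inf_reps, labels)
print(f"mean rank-sum = {stat}, permutation p = {p_value:.3f}")
```

Averaging over replicates pulls the statistic toward its null expectation when the replicates disagree about the sample ranking, which is how uncertainty in quantification can reduce the evidence for differential expression in a sketch like this.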


2018 ◽  
Author(s):  
Jesse M. Zhang ◽  
Govinda M. Kamath ◽  
David N. Tse

Summary Single-cell computational pipelines involve two critical steps: organizing cells (clustering) and identifying the markers driving this organization (differential expression analysis). State-of-the-art pipelines perform differential analysis after clustering on the same dataset. We observe that because clustering forces separation, reusing the same dataset generates artificially low p-values and hence false discoveries. We introduce a valid post-clustering differential analysis framework which corrects for this problem. We provide software at https://github.com/jessemzhang/tn_test.
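The problem described above can be reproduced with a small simulation. The sketch below illustrates only the problem, not the authors' correction: it draws one homogeneous population, "clusters" it by splitting at the median (an assumed stand-in for a real clustering step), and compares the Welch t statistic against the one obtained from labels assigned independently of expression. All names and parameters are illustrative.

```python
import random
from math import sqrt
from statistics import mean, median, variance


def welch_t(a, b):
    """Welch two-sample t statistic (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def simulate(n_cells=200, seed=0):
    rng = random.Random(seed)
    # A single homogeneous population: there is no true marker gene.
    expression = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]

    # "Clustering" on the same data: split at the median, which forces separation.
    cut = median(expression)
    low = [x for x in expression if x <= cut]
    high = [x for x in expression if x > cut]
    t_same_data = welch_t(high, low)

    # Baseline: group labels assigned independently of expression.
    shuffled = expression[:]
    rng.shuffle(shuffled)
    half = n_cells // 2
    t_random_labels = welch_t(shuffled[:half], shuffled[half:])
    return t_same_data, t_random_labels


t_same_data, t_random_labels = simulate()
print(f"t after clustering on the same data: {t_same_data:.1f}")
print(f"t with expression-independent labels: {t_random_labels:.1f}")
```

The first statistic is very large even though no real difference exists, because the groups were defined by the same values being tested; the second behaves like an ordinary null statistic.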


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Matthew Chung ◽  
Vincent M. Bruno ◽  
David A. Rasko ◽  
Christina A. Cuomo ◽  
José F. Muñoz ◽  
...  

Abstract Advances in transcriptome sequencing allow for simultaneous interrogation of differentially expressed genes from multiple species originating from a single RNA sample, termed dual or multi-species transcriptomics. Compared to single-species differential expression analysis, the design of multi-species differential expression experiments must account for the relative abundances of each organism of interest within the sample, often requiring enrichment methods and yielding differences in total read counts across samples. The analysis of multi-species transcriptomics datasets requires modifications to the alignment, quantification, and downstream analysis steps compared to the single-species analysis pipelines. We describe best practices for multi-species transcriptomics and differential gene expression.


2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Xueyi Dong ◽  
Luyi Tian ◽  
Quentin Gouil ◽  
Hasaru Kariyawasam ◽  
Shian Su ◽  
...  

Abstract Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error and small library sizes, which decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.


2019 ◽  
Author(s):  
Avi Srivastava ◽  
Laraib Malik ◽  
Hirak Sarkar ◽  
Mohsen Zakeri ◽  
Fatemeh Almodaresi ◽  
...  

Abstract Background: The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. Results: We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large, and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally-acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Conclusion: We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.

