Detection of aberrant splicing events in RNA-seq data using FRASER

Abstract: Aberrant splicing is a major cause of rare diseases. However, its prediction from genome sequence alone remains in most cases inconclusive. Recently, RNA sequencing has proven to be an effective complementary avenue to detect aberrant splicing. Here, we develop FRASER, an algorithm to detect aberrant splicing from RNA sequencing data. Unlike existing methods, FRASER captures not only alternative splicing but also intron retention events. This typically doubles the number of detected aberrant events and identified a pathogenic intron retention in MCOLN1 causing mucolipidosis. FRASER automatically controls for latent confounders, which are widespread and affect sensitivity substantially. Moreover, FRASER is based on a count distribution and multiple testing correction, thus reducing the number of calls by two orders of magnitude over commonly applied z score cutoffs, with a minor loss of sensitivity. Applying FRASER to rare disease diagnostics is demonstrated by reprioritizing a pathogenic aberrant exon truncation in TAZ from a published dataset. FRASER is easy to use and freely available.
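The idea of calling outliers from a count distribution followed by multiple testing correction can be sketched as below. This is a minimal illustration and not FRASER's implementation: the plain binomial tail test, the function names, the expected ratio of 0.5, and the toy counts are all assumptions made for the example. Only the general recipe (count-based p-values, then multiple testing correction) comes from the abstract.

```python
from math import comb

def binom_two_sided_p(k, n, p):
    """Two-sided binomial p-value: twice the smaller tail, capped at 1."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    lower = sum(pmf[:k + 1])   # P(X <= k)
    upper = sum(pmf[k:])       # P(X >= k)
    return min(1.0, 2 * min(lower, upper))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical toy data: (reads supporting a junction, total reads at the site)
observations = [(48, 100), (52, 100), (50, 100), (5, 100), (47, 100)]
expected_ratio = 0.5  # assumed expected splicing ratio for every sample

pvals = [binom_two_sided_p(k, n, expected_ratio) for k, n in observations]
padj = benjamini_hochberg(pvals)
outliers = [i for i, q in enumerate(padj) if q < 0.05]
print(outliers)
```

Because the adjusted p-value accounts for the number of tests, only the sample whose counts are strongly inconsistent with the expected ratio is flagged, whereas a fixed z score cutoff applied to many junctions would not control the number of false calls.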


Detection of aberrant splicing events in RNA-seq data with FRASER

10.1101/2019.12.18.866830 ◽ 2019 ◽ Cited By ~ 1

Author(s): Christian Mertes ◽ Ines Scheller ◽ Vicente A. Yépez ◽ Muhammed H. Çelik ◽ Yingjiqiong Liang ◽ ...

Keyword(s): RNA Sequencing ◽ Multiple Testing ◽ Intron Retention ◽ Sequencing Data ◽ Multiple Testing Correction ◽ Aberrant Splicing ◽ Sensitivity Loss ◽ Count Distribution ◽ Disease Diagnostics ◽ A Minor

Abstract: Aberrant splicing is a major cause of rare diseases, yet its prediction from genome sequence remains in most cases inconclusive. Recently, RNA sequencing has proven to be an effective complementary avenue to detect aberrant splicing. Here, we developed FRASER, an algorithm to detect aberrant splicing from RNA sequencing data. Unlike existing methods, FRASER captures not only alternative splicing but also intron retention events. This typically doubles the number of detected aberrant events and identified a pathogenic intron retention in MCOLN1. FRASER automatically controls for latent confounders, which are widespread and substantially affect sensitivity. Moreover, FRASER is based on a count distribution and multiple testing correction, reducing the number of calls by two orders of magnitude over commonly applied z score cutoffs, with a minor sensitivity loss. The application to rare disease diagnostics is demonstrated by reprioritizing a pathogenic aberrant exon truncation in TAZ from a published dataset. FRASER is easy to use and freely available.


Detection of aberrant events in RNA sequencing data

10.21203/rs.2.19080/v1 ◽ 2020

Author(s): Vicente A. Yépez ◽ Christian Mertes ◽ Michaela F. Mueller ◽ Daniela S. Andrade ◽ Leonhard Wachutka ◽ ...

Keyword(s): RNA Sequencing ◽ Rare Allele ◽ RNA Seq ◽ Sequencing Data ◽ Aberrant Splicing ◽ Aberrant Expression ◽ Powerful Approach ◽ Disease Entities ◽ Computational Workflow ◽ Assess Quality

Abstract: RNA sequencing (RNA-seq) has emerged as a powerful approach to discover disease-causing gene regulatory defects for individuals affected with a genetically undiagnosed rare disorder. Pioneer studies have shown that RNA-seq could increase diagnostic rates over DNA sequencing alone by 8% to 36%, depending on disease entities and probed tissues. To accelerate the adoption of RNA-seq among human genetic centers, detailed analysis protocols are now needed. Here, we present a step-by-step protocol that instructs how to robustly detect aberrant expression, aberrant splicing, and mono-allelic expression of a rare allele in RNA-seq data using dedicated statistical methods. We describe how to generate and assess quality control plots and interpret the analysis results. The protocol is based on DROP (Detection of RNA Outliers Pipeline), a modular computational workflow that integrates all analysis steps, can leverage parallel computing infrastructures, and generates browsable web page reports.


A Two-Stage Poisson Model for Testing RNA-Seq Data

Statistical Applications in Genetics and Molecular Biology ◽ 10.2202/1544-6115.1627 ◽ 2011 ◽ Vol 10 (1) ◽ Cited By ~ 39

Author(s): Paul L. Auer ◽ Rebecca W Doerge

Keyword(s): RNA Sequencing ◽ Statistical Approach ◽ Poisson Model ◽ Real Data ◽ RNA Seq ◽ Sequencing Data ◽ Sequencing Technology ◽ Two Stage ◽ Individual Gene ◽ Unique Nature

RNA sequencing technology is providing data of unprecedented throughput, resolution, and accuracy. Although there are many different computational tools for processing these data, there are a limited number of statistical methods for analyzing them, and even fewer that acknowledge the unique nature of individual gene transcription. We introduce a simple and powerful statistical approach, based on a two-stage Poisson model, for modeling RNA sequencing data and testing for biologically important changes in gene expression. The advantages of this approach are demonstrated through simulations and real data applications.
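As a rough illustration of Poisson-based testing for a change in gene expression, the sketch below runs a likelihood ratio test on one gene's read counts in two libraries. It is not the authors' two-stage procedure, which the abstract does not spell out: the function name `poisson_lrt`, the use of library depth as an offset, and the chi-square approximation with one degree of freedom are assumptions made for this example.

```python
from math import log, erfc, sqrt

def poisson_lrt(count_a, count_b, depth_a, depth_b):
    """Likelihood ratio test for equal Poisson rates in two libraries.

    count_*: reads mapped to the gene; depth_*: total reads per library.
    Returns (statistic, p_value) using a chi-square(1) approximation.
    """
    total = count_a + count_b
    # Expected counts under the null hypothesis of one shared per-read rate.
    exp_a = total * depth_a / (depth_a + depth_b)
    exp_b = total * depth_b / (depth_a + depth_b)

    def term(obs, exp):
        # obs * log(obs / exp), with the convention 0 * log(0) = 0
        return obs * log(obs / exp) if obs > 0 else 0.0

    stat = 2.0 * (term(count_a, exp_a) + term(count_b, exp_b))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # Survival function of a chi-square with 1 df is erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2.0))

# Hypothetical counts for one gene in two equally deep libraries.
stat, p_value = poisson_lrt(200, 100, 1_000_000, 1_000_000)
print(stat, p_value)
```

A pure Poisson test like this assumes no overdispersion between replicates; when counts vary more than Poisson sampling allows, it will be anti-conservative.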


Next-generation RNA sequencing reveals transcriptomic changes after brief exposure to preoperative nab-paclitaxel, bevacizumab, and trastuzumab.

Journal of Clinical Oncology ◽ 10.1200/jco.2012.30.15_suppl.10508 ◽ 2012 ◽ Vol 30 (15_suppl) ◽ pp. 10508-10508

Author(s): Vinay Varadan ◽ Sitharthan Kamalakaran ◽ Angel Janevski ◽ Nila Banerjee ◽ Kimberly Lezon-Geyda ◽ ...

Keyword(s): Multiple Testing ◽ Day Treatment ◽ Transcriptional Response ◽ Enrichment Analysis ◽ Transcript Abundance ◽ Differentially Expressed ◽ Read Length ◽ Breast Cancers ◽ RNA Seq ◽ Multiple Testing Correction

Background: Identification of differentially expressed transcripts after brief exposure to preoperative therapy can help determine likely response markers. We quantify and compare differential gene and isoform expression using RNA-seq on patient samples with 10-day exposure to one dose of trastuzumab, bevacizumab, or nab-paclitaxel. Methods: We sequenced transcriptomes of 23 pairs of core biopsy RNA from breast cancers pre/post 10-day exposure to therapy. Paired-end sequencing was done on the Illumina GAII platform using amplified total RNA with 74 bp read length, yielding data on transcript abundance for a total of 22,160 genes and 34,449 transcripts. Differential expression of transcripts between pre/post samples was estimated assuming Poisson-distributed read counts, followed by multiple testing correction and enrichment analysis of 185 KEGG pathways. Results: PAM50-based clustering showed that individual samples cluster together, demonstrating that tumor subtypes do not change over the 10-day treatment (SABCS 2011). We identified genes that were significantly differentially expressed (p<0.05; FDR<0.1) in at least 60% of samples within each therapy arm: 780 genes in trastuzumab, 302 in bevacizumab, and 176 in nab-paclitaxel. Surprisingly, only THAP11 and TINF2 were common amongst them. THAP11 is involved in stem cell maintenance and TINF2 is important for regulation of telomere length. Immune system and metabolism-related pathways were commonly affected (p<0.05) across all arms. The bevacizumab arm showed significant down-regulation of the angiogenesis-associated genes ESM1 and VEGFR2 in >80% of samples. The nab-paclitaxel arm exhibited changes in TGF-beta signaling, Nod-like receptor, and Wnt signaling. The trastuzumab arm exhibited consistent alteration of ErbB2 and mTOR pathways, with SOX11 and TOP2B downregulated in every sample. Conclusions: This is the first study to compare gene expression with brief exposure across therapies using RNA-seq technology. The unique aspects of transcriptional response to each treatment underscore the need for specific markers of therapeutic response to nab-paclitaxel, bevacizumab, and trastuzumab.


SimFuse: A Novel Fusion Simulator for RNA Sequencing (RNA-Seq) Data

BioMed Research International ◽ 10.1155/2015/780519 ◽ 2015 ◽ Vol 2015 ◽ pp. 1-5 ◽ Cited By ~ 2

Author(s): Yuxiang Tan ◽ Yann Tambouret ◽ Stefano Monti

Keyword(s): Sample Size ◽ RNA Sequencing ◽ High Throughput Sequencing ◽ Performance Metrics ◽ Simulated Data ◽ Real Data ◽ RNA Seq ◽ Sequencing Data ◽ Detection Algorithms ◽ Fusion Detection

The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generation of an unlimited number of true and false positive events, and the consequent robust estimation of accuracy measures, such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size. This makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the fusions’ background to closely approximate the distribution of reads from a real sequencing library and uses a reference genome as the template from which to simulate fusions’ supporting reads. To assess the supporting read-specific performance, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, based on which the performance metrics needed for the validation and comparison of alternative fusion-detection algorithms can be rigorously estimated.
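The accuracy measures mentioned above, precision and recall, can be computed once the simulated (known) fusions and a caller's predictions are available. The sketch below is a generic illustration, not part of SimFuse: the function name, the representation of a fusion as a gene-pair tuple, and the gene names are all hypothetical.

```python
def precision_recall(predicted, truth):
    """Precision and recall of predicted fusions against the known truth set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical simulated fusions (ground truth) and the output of some caller.
simulated_fusions = {
    ("GENE_A", "GENE_B"),
    ("GENE_C", "GENE_D"),
    ("GENE_E", "GENE_F"),
    ("GENE_G", "GENE_H"),
}
called_fusions = {
    ("GENE_A", "GENE_B"),
    ("GENE_C", "GENE_D"),
    ("GENE_X", "GENE_Y"),  # a false positive
}

precision, recall = precision_recall(called_fusions, simulated_fusions)
print(precision, recall)
```

Repeating this calculation on datasets simulated with different numbers of supporting reads would show how a caller's recall depends on read support.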


SSCC: a novel computational framework for rapid and accurate clustering large single cell RNA-seq data

10.1101/344242 ◽ 2018 ◽ Cited By ~ 2

Author(s): Xianwen Ren ◽ Liangtao Zheng ◽ Zemin Zhang

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Large Scale ◽ Random Projection ◽ RNA Seq ◽ Sequencing Data ◽ Computational Framework ◽ Human Blood Cells ◽ Single Cell RNA Sequencing ◽ Data Volume

Abstract: Clustering is a prevalent means of analyzing single-cell RNA sequencing data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are urgently needed. Here we propose a new clustering framework based on random projection and feature construction for large-scale single-cell RNA sequencing data, which greatly improves clustering accuracy, robustness, and computational efficiency for various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset with 68,578 human blood cells, our method achieved a 20% improvement in clustering accuracy and a 50-fold acceleration while using only 66% of the memory required by the widely used software package SC3. Compared to k-means, the accuracy improvement can reach 3-fold, depending on the dataset. An R implementation of the framework is available from https://github.com/Japrin/sscClust.
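Random projection, one ingredient named in the abstract, reduces the number of features by multiplying the data with a random matrix before clustering. The sketch below shows a generic Gaussian random projection in plain Python. It is not the sscClust implementation: the function name, the scaling by 1/sqrt(out_dim), and the toy matrix are assumptions for illustration.

```python
import random
from math import sqrt

def random_projection(matrix, out_dim, seed=0):
    """Project rows of `matrix` (cells x genes) onto `out_dim` random directions."""
    rng = random.Random(seed)
    in_dim = len(matrix[0])
    scale = 1.0 / sqrt(out_dim)  # keeps squared distances unbiased in expectation
    projection = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(out_dim)]
        for _ in range(in_dim)
    ]
    return [
        [sum(row[j] * projection[j][k] for j in range(in_dim)) for k in range(out_dim)]
        for row in matrix
    ]

# Hypothetical toy expression matrix: 6 cells x 200 genes.
data_rng = random.Random(42)
cells = [[data_rng.random() for _ in range(200)] for _ in range(6)]

reduced = random_projection(cells, out_dim=20)
print(len(reduced), len(reduced[0]))
```

Any clustering algorithm can then be run on `reduced` instead of the full matrix, which is where the speed and memory savings of this kind of approach come from.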


Splatter: simulation of single-cell RNA sequencing data

10.1101/133173 ◽ 2017 ◽ Cited By ~ 8

Author(s): Luke Zappia ◽ Belinda Phipson ◽ Alicia Oshlack

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Real Data ◽ Cell Types ◽ RNA Seq ◽ Sequencing Data ◽ Sequencing Technologies ◽ Simulation Based ◽ Single Cell RNA Sequencing ◽ Multiple Cell

Abstract: As single-cell RNA sequencing technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here we present the Splatter Bioconductor package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.
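The gamma-Poisson idea can be illustrated with a short simulation: draw a rate from a gamma distribution, then draw the observed count from a Poisson distribution with that rate. This sketch is written in Python rather than R and is not the Splat model, which includes further components not described in the abstract. The function names and the parameter values (`shape`, `mean`) are assumptions chosen for the example.

```python
import random
from math import exp

def sample_poisson(rng, lam):
    """Draw one Poisson variate with Knuth's method (suitable for modest lam)."""
    threshold = exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_counts(n_genes, n_cells, shape=2.0, mean=5.0, seed=0):
    """Simulate a genes x cells count matrix from a gamma-Poisson mixture."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_genes):
        row = []
        for _ in range(n_cells):
            # gammavariate(alpha, beta) has mean alpha * beta, so this rate has mean `mean`.
            rate = rng.gammavariate(shape, mean / shape)
            row.append(sample_poisson(rng, rate))
        counts.append(row)
    return counts

counts = simulate_counts(n_genes=50, n_cells=10)
print(len(counts), len(counts[0]))
```

Mixing the Poisson rate over a gamma distribution yields counts that are more variable than a plain Poisson with the same mean, which is the overdispersion typically seen in RNA-seq data.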


RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283 ◽ 2018

Author(s): Koen Van Den Berge ◽ Katharina Hembach ◽ Charlotte Soneson ◽ Simone Tiberi ◽ Lieven Clement ◽ ...

Keyword(s): Gene Expression ◽ RNA Sequencing ◽ Large Scale ◽ Science Studies ◽ RNA Seq ◽ Sequencing Data ◽ Data Types ◽ The Past ◽ Long Read ◽ Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies as well as agricultural and clinical settings. In the past 10 years or so, much has been learned about the characteristics of RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.


Transcriptome diversity is a systematic source of bias in RNA-sequencing data

10.1101/2021.04.27.441712 ◽ 2021

Author(s): Pablo E. García-Nieto ◽ Ban Wang ◽ Hunter B. Fraser

Keyword(s): Gene Expression ◽ RNA Sequencing ◽ Systematic Bias ◽ Simple Explanation ◽ RNA Seq ◽ Sequencing Data ◽ Biological Variables ◽ Systematic Effects ◽ Standard Practices ◽ Transcriptome Diversity

Background: RNA sequencing has been widely used as an essential tool to probe gene expression. While standard practices have been established to analyze RNA-seq data, it is still challenging to detect and remove artifactual signals. Several factors such as sex, age, and sequencing technology have been found to bias these estimates. Probabilistic estimation of expression residuals (PEER) has been used to account for some systematic effects, but it has remained challenging to interpret these PEER factors. Results: Here we show that transcriptome diversity – a simple metric based on Shannon entropy – explains a large portion of variability in gene expression, and is a major factor detected by PEER. We then show that transcriptome diversity has significant associations with multiple technical and biological variables across diverse organisms and datasets. This prevalent confounding factor provides a simple explanation for a major source of systematic biases in gene expression estimates. Conclusions: Our results show that transcriptome diversity is a metric that captures a systematic bias in RNA-seq and is the strongest known factor encoded in PEER covariates.
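A Shannon-entropy-based diversity metric can be computed per sample from its expression values. The sketch below is a plain Shannon entropy over expression proportions and may differ from the authors' exact definition (for example in log base or normalization). The function name and toy values are assumptions for illustration.

```python
from math import log

def transcriptome_diversity(counts):
    """Shannon entropy (natural log) of a sample's expression proportions."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Genes with zero counts contribute nothing (0 * log 0 is taken as 0).
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

# Hypothetical samples: one with evenly spread expression, one dominated by a single gene.
even_sample = [10, 10, 10, 10]
skewed_sample = [37, 1, 1, 1]

print(transcriptome_diversity(even_sample))
print(transcriptome_diversity(skewed_sample))
```

A sample whose reads are spread evenly across genes has high entropy, while a sample dominated by a few highly expressed genes has low entropy.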


Circall: fast and accurate methodology for discovery of circular RNAs from paired-end RNA-sequencing data

BMC Bioinformatics ◽ 10.1186/s12859-021-04418-8 ◽ 2021 ◽ Vol 22 (1)

Author(s): Dat Thanh Nguyen ◽ Quang Thinh Trac ◽ Thi-Hau Nguyen ◽ Ha-Nam Nguyen ◽ Nir Ohad ◽ ...

Keyword(s): RNA Sequencing ◽ Simulated Data ◽ High Sensitivity ◽ Circular RNA ◽ Computational Time ◽ Circular RNAs ◽ RNA Seq ◽ Sequencing Data ◽ Mapping Algorithm ◽ False Discovery Rate Method

Background: Circular RNA (circRNA) is an emerging class of RNA molecules attracting researchers due to its potential for serving as markers for diagnosis, prognosis, or therapeutic targets of cancer, cardiovascular, and autoimmune diseases. Current methods for detection of circRNA from RNA sequencing (RNA-seq) focus mostly on improving mapping quality of reads supporting the back-splicing junction (BSJ) of a circRNA to eliminate false positives (FPs). We show that mapping information alone often cannot predict whether a BSJ-supporting read is derived from a true circRNA, thus increasing the rate of FP circRNAs. Results: We have developed Circall, a novel circRNA detection method from RNA-seq. Circall controls the FPs using a robust multidimensional local false discovery rate method based on the length and expression of circRNAs. It is computationally highly efficient by using a quasi-mapping algorithm for fast and accurate RNA read alignments. We applied Circall to two simulated datasets and three experimental datasets of human cell lines. The results show that Circall achieves high sensitivity and precision in the simulated data. In the experimental datasets it performs well against current leading methods. Circall is also substantially faster than the other methods, particularly for large datasets. Conclusions: With its improved performance in circRNA detection and in computational time, Circall facilitates the analysis of circRNAs in large numbers of samples. Circall is implemented in C++ and R, and available for use at https://www.meb.ki.se/sites/biostatwiki/circall and https://github.com/datngu/Circall.
