Probabilistic outlier identification for RNA sequencing generalized linear models

2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Stefano Mangiola ◽  
Evan A Thomas ◽  
Martin Modrák ◽  
Aki Vehtari ◽  
Anthony T Papenfuss

Abstract Relative transcript abundance has proven to be a valuable tool for understanding the function of genes in biological systems. For the differential analysis of transcript abundance using RNA sequencing data, the negative binomial model is by far the most frequently adopted. However, common methods that are based on a negative binomial model are not robust to extreme outliers, which we found to be abundant in public datasets. So far, no rigorous and probabilistic methods for detection of outliers have been developed for RNA sequencing data, leaving the identification mostly to visual inspection. Recent advances in Bayesian computation allow large-scale comparison of observed data against its theoretical distribution given in a statistical model. Here we propose ppcseq, a key quality-control tool for identifying transcripts that include outlier data points in differential expression analysis, which do not follow a negative binomial distribution. Applying ppcseq to analyse several publicly available datasets using popular tools, we show that from 3 to 10 percent of differentially abundant transcripts across algorithms and datasets had statistics inflated by the presence of outliers.
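
As a rough illustration of comparing observed counts against a negative binomial distribution, the sketch below flags counts that fall in the extreme tails of a negative binomial with fixed parameters. It is not the ppcseq implementation, which the abstract describes as relying on Bayesian computation. The function names, the plug-in values of `mu` and `size`, and the `alpha` threshold are all assumptions made for illustration.

```python
import math


def nb_log_pmf(y, mu, size):
    """Log pmf of a negative binomial with mean `mu` and dispersion `size`
    (variance = mu + mu**2 / size). This parameterisation is an assumption."""
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))


def nb_cdf(y, mu, size):
    """P(Y <= y), computed by direct summation of the pmf."""
    total = sum(math.exp(nb_log_pmf(k, mu, size)) for k in range(y + 1))
    return min(1.0, total)


def flag_outliers(counts, mu, size, alpha=0.05):
    """Flag counts lying outside the central (1 - alpha) mass of the
    negative binomial. `mu` and `size` are treated as known here; a full
    analysis would estimate them and propagate their uncertainty."""
    flags = []
    for y in counts:
        lower_tail = nb_cdf(y, mu, size)                      # P(Y <= y)
        upper_tail = 1.0 - (nb_cdf(y - 1, mu, size) if y > 0 else 0.0)  # P(Y >= y)
        flags.append(lower_tail < alpha / 2 or upper_tail < alpha / 2)
    return flags


# Hypothetical counts for one transcript across four samples.
flags = flag_outliers([95, 110, 102, 5000], mu=100.0, size=10.0)
```

With these hypothetical values only the last count is flagged, because 5000 lies far in the upper tail of a distribution whose mean is 100.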

2020 ◽  
Author(s):  
Hunyong Cho ◽  
Chuwen Liu ◽  
John S. Preisser ◽  
Di Wu

Summary Measuring gene-gene dependence in single cell RNA sequencing (scRNA-seq) count data is often of interest and remains challenging, because an unidentified portion of the zero counts represent non-detected RNA due to technical reasons. Conventional statistical methods that fail to account for technical zeros incorrectly measure the dependence among genes. To address this problem, we propose a bivariate zero-inflated negative binomial (BZINB) model constructed using a bivariate Poisson-gamma mixture with dropout indicators for the technical (excess) zeros. Parameters are estimated based on the EM algorithm and are used to measure the underlying dependence by decomposing the two sources of zeros. Compared to existing models, the proposed BZINB model is specifically designed for estimating dependence and is more flexible, while preserving the marginal zero-inflated negative binomial distributions. Additionally, it has a simple latent variable framework, allowing parameters to have clear and intuitive interpretations, and its computation is feasible with large-scale data. Using a recent scRNA-seq dataset, we illustrate model fitting and how the model-based measures can be different from naive measures. The inferential ability of the proposed model is evaluated in a simulation study. An R package ‘bzinb’ is available on CRAN.


2017 ◽  
Author(s):  
Koen Van den Berge ◽  
Charlotte Soneson ◽  
Michael I. Love ◽  
Mark D. Robinson ◽  
Lieven Clement

Abstract Dropout in single cell RNA-seq (scRNA-seq) applications causes many transcripts to go undetected. It induces excess zero counts, which leads to power issues in differential expression (DE) analysis and has triggered the development of bespoke scRNA-seq DE tools that cope with zero-inflation. Recent evaluations, however, have shown that dedicated scRNA-seq tools provide no advantage compared to traditional bulk RNA-seq tools. We introduce zingeR, a zero-inflated negative binomial model that identifies excess zero counts and generates observation weights to unlock bulk RNA-seq pipelines for zero-inflation, boosting performance in scRNA-seq differential expression analysis.
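
One common way to turn a zero-inflated negative binomial model into observation weights is to use the posterior probability that a count came from the negative binomial component rather than from the excess-zero component. The sketch below shows that calculation under stated assumptions. It is not taken from the zingeR code, and the names `zinb_weight`, `mu`, `size`, and `pi` are hypothetical.

```python
import math


def nb_pmf(y, mu, size):
    """Negative binomial pmf with mean `mu` and dispersion `size`
    (an assumed parameterisation: variance = mu + mu**2 / size)."""
    log_p = (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
             + size * math.log(size / (size + mu))
             + y * math.log(mu / (size + mu)))
    return math.exp(log_p)


def zinb_weight(y, mu, size, pi):
    """Posterior probability that count `y` belongs to the negative
    binomial component, where `pi` is the assumed excess-zero probability."""
    count_part = (1.0 - pi) * nb_pmf(y, mu, size)
    excess_part = pi if y == 0 else 0.0   # excess zeros can only produce y == 0
    return count_part / (excess_part + count_part)


# Hypothetical parameters: a zero is down-weighted, a positive count is not.
w_zero = zinb_weight(0, mu=2.0, size=1.0, pi=0.3)
w_five = zinb_weight(5, mu=2.0, size=1.0, pi=0.3)
```

Positive counts always receive weight 1 in this scheme, while zeros receive a weight below 1 that shrinks as the estimated excess-zero probability grows. Weights of this kind could then be passed to a bulk RNA-seq tool that accepts per-observation weights.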


2021 ◽  
Vol 2 (1) ◽  
pp. 43-61
Author(s):  
Aanchal Malhotra ◽  
Samarendra Das ◽  
Shesh N. Rai

Single-cell RNA-sequencing (scRNA-seq) technology provides an excellent platform for measuring the expression profiles of genes in heterogeneous cell populations. Multiple tools for the analysis of scRNA-seq data have been developed over the years. The tools require complicated commands and steps to analyze the underlying data, which are not easy for genome researchers and experimental biologists to follow. Therefore, we describe a step-by-step workflow for processing and analyzing scRNA-seq unique molecular identifier (UMI) data from Human Lung Adenocarcinoma cell lines. We demonstrate the basic analyses, including quality check, mapping and quantification of transcript abundance, through a suitable real-data example to obtain UMI count data. Further, we performed basic statistical analyses, such as zero-inflation, differential expression and clustering analyses, on the obtained count data. We studied the effects of the excess zeros present in scRNA-seq data on the downstream analyses. Our findings indicate that the zero-inflation associated with UMI data had no or minimal role in clustering, while it had a significant effect on identifying differentially expressed genes. We also provide insight into the comparative analysis of differential expression tools based on zero-inflated negative binomial and negative binomial models on scRNA-seq data. The sensitivity analysis supported our findings in that the negative binomial model-based tool did not provide an accurate and efficient way to analyze the scRNA-seq data. This study provides a set of guidelines for users to handle and analyze real scRNA-seq data more easily.


2015 ◽  
Author(s):  
Hung-I Harry Chen ◽  
Yuanhang Liu ◽  
Yi Zou ◽  
Zhao Lai ◽  
Devanand Sarkar ◽  
...  

Background RNA sequencing (RNA-seq) is a powerful tool for genome-wide expression profiling of biological samples, with the advantages of high throughput and high resolution. Many algorithms exist for quantifying expression levels and detecting differential gene expression, but none of them takes into account the misaligned reads that are mapped to non-exonic regions. We developed a novel algorithm, XBSeq, in which a statistical model is established based on the assumption that observed signals are the convolution of true expression signals and sequencing noise. The mapped reads in non-exonic regions are considered sequencing noise, which follows a Poisson distribution. Given measurable observed and noise signals from RNA-seq data, the true expression signals, assumed to follow a negative binomial distribution, can be delineated, enabling accurate detection of differentially expressed genes. Results We implemented our novel XBSeq algorithm and evaluated it using a set of simulated expression datasets under different conditions, generated from a combination of negative binomial and Poisson distributions with parameters derived from real RNA-seq data. We compared the performance of our method with other commonly used differential expression analysis algorithms. We also evaluated the changes in true and false positive rates with variations in biological replicates, differential fold changes, and expression levels in non-exonic regions. We also tested the algorithm on a set of real RNA-seq data, where the common and different detection results from different algorithms were reported. Conclusions In this paper, we proposed XBSeq, a differential expression analysis algorithm for RNA-seq data that takes non-exonic mapped reads into consideration. When background noise is at baseline level, the performance of XBSeq and DESeq is mostly equivalent. However, our method surpasses DESeq and other algorithms as the number of non-exonic mapped reads increases. Only in the very low read count condition did XBSeq have a slightly higher false discovery rate, which may be improved by adjusting the background noise effect in this situation. Taken together, by considering non-exonic mapped reads, XBSeq can provide accurate expression measurement and thus detect differentially expressed genes even in noisy conditions.
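
The idea of separating a true signal from Poisson background noise can be sketched with a simple method-of-moments calculation. If the observed count is the sum of an independent true signal and Poisson noise, the means and variances add, and a Poisson variable has variance equal to its mean. The code below is an illustrative sketch under those assumptions, not the XBSeq implementation. The function name and the example counts are hypothetical.

```python
from statistics import mean, variance


def true_signal_moments(observed, background):
    """Estimate the mean and variance of the true expression signal for one
    gene, assuming observed = true + noise with independent Poisson noise.

    `observed`   : read counts in exonic regions, one per replicate
    `background` : read counts in non-exonic regions, one per replicate
    """
    noise_rate = mean(background)            # Poisson: mean == variance
    signal_mean = mean(observed) - noise_rate
    signal_var = variance(observed) - noise_rate
    return signal_mean, signal_var


# Hypothetical counts for a single gene across four replicates.
signal_mean, signal_var = true_signal_moments(
    observed=[120, 95, 143, 110],
    background=[10, 12, 8, 10],
)
```

The resulting moments could then be used to fit a negative binomial distribution to the true signal. With few replicates or very low counts the variance estimate can become unstable or even negative, which is consistent with the abstract's note that the low read count condition is the hardest case.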


2019 ◽  
Vol 47 (2) ◽  
pp. 287-305 ◽  
Author(s):  
Eghbal Zandkarimi ◽  
Abbas Moghimbeigi ◽  
Hossein Mahjub ◽  
Reza Majdzadeh

2021 ◽  
Vol 300 ◽  
pp. 113919
Author(s):  
Narimasa Kumagai ◽  
Aran Tajika ◽  
Akio Hasegawa ◽  
Nao Kawanishi ◽  
Hirokazu Fujita ◽  
...  
