ZIAQ: a quantile regression method for differential expression analysis of single-cell RNA-seq data

2020 ◽  
Vol 36 (10) ◽  
pp. 3124-3130
Author(s):  
Wenfei Zhang ◽  
Ying Wei ◽  
Donghui Zhang ◽  
Ethan Y Xu

Abstract. Motivation: Single-cell RNA sequencing (scRNA-seq) has enabled the simultaneous transcriptomic profiling of individual cells under different biological conditions. scRNA-seq data have two unique challenges that can affect the sensitivity and specificity of single-cell differential expression analysis: a large proportion of expressed genes with zero or low read counts ('dropout' events) and multimodal data distributions. Results: We have developed a zero-inflation-adjusted quantile (ZIAQ) algorithm, which is the first method to account for both dropout rates and complex scRNA-seq data distributions in the same model. ZIAQ demonstrates superior performance over several existing methods on simulated scRNA-seq datasets by finding more differentially expressed genes. When ZIAQ was applied to the comparison of neoplastic and non-neoplastic cells from a human glioblastoma dataset, the ranking of biologically relevant genes and pathways showed clear improvement over existing methods. Availability and implementation: ZIAQ is implemented in the R language and available at https://github.com/gefeizhang/ZIAQ. Supplementary information: Supplementary data are available at Bioinformatics online.
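The two data features named in this abstract, dropout rate and distribution shape, can be illustrated with a small sketch. This is not the ZIAQ algorithm or its R interface: the function names and the toy counts below are hypothetical, and the code only shows why a comparison of means can miss a difference that zero fractions and quantiles reveal.

```python
from statistics import mean, quantiles


def zero_fraction(counts):
    """Fraction of cells in which the gene has a zero count."""
    return sum(1 for c in counts if c == 0) / len(counts)


def nonzero_quartiles(counts):
    """25th, 50th and 75th percentiles of the nonzero counts."""
    nonzero = [c for c in counts if c > 0]
    return quantiles(nonzero, n=4)


def summarize(counts):
    """Collect the dropout rate and shape summaries for one group of cells."""
    nonzero = [c for c in counts if c > 0]
    return {
        "zero_fraction": zero_fraction(counts),
        "nonzero_mean": mean(nonzero),
        "nonzero_quartiles": nonzero_quartiles(counts),
    }


# Hypothetical counts for one gene in two conditions (made-up numbers).
# Group A is unimodal; group B is bimodal and has more zeros.
group_a = [0, 0, 5, 5, 5, 5, 5, 5, 5, 5]
group_b = [0, 0, 0, 0, 2, 2, 2, 8, 8, 8]

summary_a = summarize(group_a)
summary_b = summarize(group_b)
print(summary_a)
print(summary_b)
```

In this toy case the nonzero means of the two groups are identical, while the zero fractions and the outer quartiles differ, which is the kind of signal a mean-based test would not see.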

2020 ◽  
Vol 36 (10) ◽  
pp. 3156-3161 ◽  
Author(s):  
Chong Chen ◽  
Changjing Wu ◽  
Linjie Wu ◽  
Xiaochen Wang ◽  
Minghua Deng ◽  
...  

Abstract. Motivation: Single-cell RNA sequencing (scRNA-seq) technology enables whole-transcriptome profiling at single-cell resolution and holds great promise in many biological and medical applications. Nevertheless, scRNA-seq often fails to capture expressed genes, leading to the prominent dropout problem. These dropouts cause many problems in downstream analysis, such as a significant increase in noise, power loss in differential expression analysis and obscured gene-to-gene or cell-to-cell relationships. Imputation of these dropout values can be beneficial in scRNA-seq data analysis. Results: In this article, we model the dropout imputation problem as robust matrix decomposition. This model has minimal assumptions and allows us to develop a computationally efficient imputation method called scRMD. Extensive data analysis shows that scRMD can accurately recover the dropout values and help to improve downstream analysis such as differential expression analysis and clustering analysis. Availability and implementation: The R package scRMD is available at https://github.com/XiDsLab/scRMD. Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Vol 45 (19) ◽  
pp. 10978-10988 ◽  
Author(s):  
Cheng Jia ◽  
Yu Hu ◽  
Derek Kelly ◽  
Junhyong Kim ◽  
Mingyao Li ◽  
...  

2018 ◽  
Vol 34 (19) ◽  
pp. 3340-3348 ◽  
Author(s):  
Zhijin Wu ◽  
Yi Zhang ◽  
Michael L Stitzel ◽  
Hao Wu

2018 ◽  
Vol 19 (1) ◽  
Author(s):  
Wenan Chen ◽  
Yan Li ◽  
John Easton ◽  
David Finkelstein ◽  
Gang Wu ◽  
...  

2017 ◽  
Author(s):  
Charlotte Soneson ◽  
Mark D. Robinson

Abstract. Background: As single-cell RNA-seq (scRNA-seq) is becoming increasingly common, the amount of publicly available data grows rapidly, generating a useful resource for computational method development and extension of published results. Although processed data matrices are typically made available in public repositories, the procedure to obtain these varies widely between data sets, which may complicate reuse and cross-data set comparison. Moreover, while many statistical methods for performing differential expression analysis of scRNA-seq data are becoming available, their relative merits and the performance compared to methods developed for bulk RNA-seq data are not sufficiently well understood. Results: We present conquer, a collection of consistently processed, analysis-ready public single-cell RNA-seq data sets. Each data set has count and transcripts per million (TPM) estimates for genes and transcripts, as well as quality control and exploratory analysis reports. We use a subset of the data sets available in conquer to perform an extensive evaluation of the performance and characteristics of statistical methods for differential gene expression analysis, evaluating a total of 30 statistical approaches on both experimental and simulated scRNA-seq data. Conclusions: Considerable differences are found between the methods in terms of the number and characteristics of the genes that are called differentially expressed. Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data. Generally, however, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Melvyn Yap ◽  
Rebecca L. Johnston ◽  
Helena Foley ◽  
Samual MacDonald ◽  
Olga Kondrashova ◽  
...  

Abstract. For complex machine learning (ML) algorithms to gain widespread acceptance in decision making, we must be able to identify the features driving the predictions. Explainability models allow transparency of ML algorithms; however, their reliability within high-dimensional data is unclear. To test the reliability of the explainability model SHapley Additive exPlanations (SHAP), we developed a convolutional neural network to predict tissue classification from Genotype-Tissue Expression (GTEx) RNA-seq data representing 16,651 samples from 47 tissues. Our classifier achieved an average F1 score of 96.1% on held-out GTEx samples. Using SHAP values, we identified the 2423 most discriminatory genes, of which 98.6% were also identified by differential expression analysis across all tissues. The SHAP genes reflected expected biological processes involved in tissue differentiation and function. Moreover, SHAP genes clustered tissue types with superior performance when compared to all genes, genes detected by differential expression analysis, or random genes. We demonstrate the utility and reliability of SHAP to explain a deep learning model and highlight the strengths of applying ML to transcriptome data.
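SHAP attributions are built on Shapley values, which can be computed exactly for a very small model by enumerating feature subsets. The sketch below is an illustration of that underlying idea only. It does not use the SHAP library or the authors' network, its cost grows exponentially with the number of features, and `toy_model`, the input, and the all-zero baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their observed value; the rest stay at baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_model(features):
    """Hypothetical stand-in for a trained model: two linear terms plus one interaction."""
    a, b, c = features
    return 2.0 * a + 1.0 * b + 3.0 * a * c


x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split equally between the two features it involves.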


2019 ◽  
Author(s):  
Mahmoud M Ibrahim ◽  
Rafael Kramann

Abstract. Marker genes identified in single cell experiments are expected to be highly specific to a certain cell type and highly expressed in that cell type. Detecting a gene by differential expression analysis does not necessarily satisfy those two conditions and is typically computationally expensive for large cell numbers. Here we present genesorteR, an R package that ranks features in single cell data in a manner consistent with the expected definition of marker genes in experimental biology research. We benchmark genesorteR using various data sets and show that it is distinctly more accurate in large single cell data sets compared to other methods. genesorteR is orders of magnitude faster than current implementations of differential expression analysis methods, can operate on data containing millions of cells and is applicable to both single cell RNA-Seq and single cell ATAC-Seq data. genesorteR is available at https://github.com/mahmoudibrahim/genesorteR.


2019 ◽  
Vol 35 (22) ◽  
pp. 4671-4678
Author(s):  
Justin D Finkle ◽  
Neda Bagheri

Abstract. Motivation: To understand the regulatory pathways underlying diseases, studies often investigate the differential gene expression between genetically or chemically differing cell populations. Differential expression analysis identifies global changes in transcription and enables the inference of functional roles of applied perturbations. This approach has transformed the discovery of genetic drivers of disease and possible therapies. However, differential expression analysis does not provide quantitative predictions of gene expression in untested conditions. We present a hybrid approach, termed Differential Expression in Python (DiffExPy), that uniquely combines discrete, differential expression analysis with in silico differential equation simulations to yield accurate, quantitative predictions of gene expression from time-series data. Results: To demonstrate the distinct insight provided by DiffExPy, we applied it to published, in vitro, time-series RNA-seq data from several genetic PI3K/PTEN variants of MCF10a cells stimulated with epidermal growth factor. DiffExPy proposed ensembles of several minimal differential equation systems for each differentially expressed gene. These systems provide quantitative models of expression for several previously uncharacterized genes and uncover new regulation by the PI3K/PTEN pathways. We validated model predictions on expression data from conditions that were not used for model training. Our discrete, differential expression analysis also identified SUZ12 and FOXA1 as possible regulators of specific groups of genes that exhibit late changes in expression. Our work reveals how DiffExPy generates quantitatively predictive models with testable, biological hypotheses from time-series expression data. Availability and implementation: DiffExPy is available on GitHub (https://github.com/bagherilab/diffexpy). Supplementary information: Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 16 (2) ◽  
pp. 163-166 ◽  
Author(s):  
Vasilis Ntranos ◽  
Lynn Yi ◽  
Páll Melsted ◽  
Lior Pachter
