Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data

Abstract An increasing number of studies are using single-cell RNA-sequencing (scRNA-seq) to characterize the gene expression profiles of individual cells. One common analysis applied to scRNA-seq data involves detecting differentially expressed (DE) genes between cells in different biological groups. However, many experiments are designed such that the cells to be compared are processed in separate plates or chips, meaning that the groupings are confounded with systematic plate effects. This confounding aspect is frequently ignored in DE analyses of scRNA-seq data. In this article, we demonstrate that failing to consider plate effects in the statistical model results in loss of type I error control. A solution is proposed whereby counts are summed from all cells in each plate and the count sums for all plates are used in the DE analysis. This restores type I error control in the presence of plate effects without compromising detection power in simulated data. Summation is also robust to varying numbers and library sizes of cells on each plate. Similar results are observed in DE analyses of real data where the use of count sums instead of single-cell counts improves specificity and the ranking of relevant genes. This suggests that summation can assist in maintaining statistical rigour in DE analyses of scRNA-seq data with plate effects.
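The summation step described in the abstract can be illustrated with a minimal sketch. The function name `sum_counts_by_plate`, the dictionary-based data layout, and the toy cell and plate labels are assumptions made for illustration; they are not taken from the article or from any software package.

```python
def sum_counts_by_plate(counts, plate_of_cell):
    """Sum per-gene counts over all cells that share a plate.

    counts:        dict mapping cell id -> list of per-gene counts
    plate_of_cell: dict mapping cell id -> plate id
    Returns a dict mapping plate id -> list of summed per-gene counts.
    """
    sums = {}
    for cell, gene_counts in counts.items():
        plate = plate_of_cell[cell]
        if plate not in sums:
            # Start each plate with a zero vector of the right length.
            sums[plate] = [0] * len(gene_counts)
        for gene_index, count in enumerate(gene_counts):
            sums[plate][gene_index] += count
    return sums


# Hypothetical toy data: three cells, three genes, two plates.
cells = {"c1": [3, 0, 5], "c2": [1, 2, 0], "c3": [0, 4, 4]}
plates = {"c1": "plateA", "c2": "plateA", "c3": "plateB"}
plate_sums = sum_counts_by_plate(cells, plates)
print(plate_sums)
```

Each plate then contributes one column of count sums, so the DE analysis treats plates, rather than individual cells, as the replicate units.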


No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2017-0010 ◽

2017 ◽

Vol 16 (2) ◽

Cited By ~ 1

Author(s):

Aaron T. L. Lun ◽

Gordon K. Smyth

Keyword(s):

Software Package ◽

Error Control ◽

Degrees Of Freedom ◽

Linear Models ◽

Type I Error ◽

Real Data ◽

Type I ◽

Rna Seq ◽

Study Gene Expression ◽

Complex Models

Abstract RNA sequencing (RNA-seq) is widely used to study gene expression changes associated with treatments or biological conditions. Many popular methods for detecting differential expression (DE) from RNA-seq data use generalized linear models (GLMs) fitted to the read counts across independent replicate samples for each gene. This article shows that the standard formula for the residual degrees of freedom (d.f.) in a linear model is overstated when the model contains fitted values that are exactly zero. Such fitted values occur whenever all the counts in a treatment group are zero as well as in more complex models such as those involving paired comparisons. This misspecification results in underestimation of the genewise variances and loss of type I error control. This article proposes a formula for the reduced residual d.f. that restores error control in simulated RNA-seq data and improves detection of DE genes in a real data analysis. The new approach is implemented in the quasi-likelihood framework of the edgeR software package. The results of this article also apply to RNA-seq analyses that apply linear models to log-transformed counts, such as those in the limma software package, and more generally to any count-based GLM where exactly zero fitted values are possible.
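To make the idea of lost degrees of freedom concrete, here is a rough sketch for the simplest case only: a one-way layout in which each treatment group has its own mean. It assumes that a group whose counts are all zero has a fitted value of exactly zero and therefore contributes neither observations nor a coefficient to the residual d.f. The function name and this simplified counting rule are illustrative assumptions; the article's general formula and the edgeR implementation are not reproduced here.

```python
def residual_df_one_way(counts, groups):
    """Return (standard_df, reduced_df) for one gene in a one-way layout.

    counts: list of read counts, one per sample
    groups: list of group labels, one per sample
    """
    by_group = {}
    for count, group in zip(counts, groups):
        by_group.setdefault(group, []).append(count)

    n_obs = len(counts)
    n_groups = len(by_group)
    # Standard formula: observations minus fitted coefficients.
    standard_df = n_obs - n_groups

    # Assumption: all-zero groups have fitted values of exactly zero,
    # so their samples and their coefficient are both discounted.
    zero_groups = [g for g, vals in by_group.items() if all(v == 0 for v in vals)]
    n_zero_obs = sum(len(by_group[g]) for g in zero_groups)
    reduced_df = (n_obs - n_zero_obs) - (n_groups - len(zero_groups))
    return standard_df, reduced_df


# Hypothetical gene: group "a" is entirely zero, group "b" is expressed.
standard, reduced = residual_df_one_way([0, 0, 0, 5, 7, 6], ["a", "a", "a", "b", "b", "b"])
print(standard, reduced)
```

In this toy case the standard formula reports 4 residual d.f., while only the three samples in the expressed group carry information about variability, leaving 2.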


Bulk and single-cell RNA-seq reveal dmrtb1 gene expression profiles during sex change in zig-zag eel (Mastacembelus armatus)

Aquaculture ◽

10.1016/j.aquaculture.2021.737194 ◽

2021 ◽

pp. 737194

Author(s):

Lingzhan Xue ◽

Dan Jia ◽

Luohao Xu ◽

Zhen Huang ◽

Haiping Fan ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Sex Change ◽

Rna Seq ◽

Mastacembelus Armatus


DECENT: differential expression with capture efficiency adjustmeNT for single-cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz453 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5155-5162 ◽

Cited By ~ 10

Author(s):

Chengzhong Ye ◽

Terence P Speed ◽

Agus Salim

Keyword(s):

Single Cell ◽

Differential Expression ◽

Type I Error ◽

R Package ◽

Supplementary Information ◽

Type I ◽

Common Phenomenon ◽

Rna Seq ◽

Capture Process ◽

Technological Platforms

Abstract Motivation: Dropout is a common phenomenon in single-cell RNA-seq (scRNA-seq) data, and when left unaddressed it affects the validity of the statistical analyses. Despite this, few current methods for differential expression (DE) analysis of scRNA-seq data explicitly model the process that gives rise to the dropout events. We develop DECENT, a method for DE analysis of scRNA-seq data that explicitly and accurately models the molecule capture process in scRNA-seq experiments. Results: We show that DECENT demonstrates improved DE performance over existing DE methods that do not explicitly model dropout. This improvement is consistently observed across several public scRNA-seq datasets generated using different technological platforms. The improvement is especially large when the capture process is overdispersed. DECENT maintains type I error well while achieving better sensitivity. Its performance without spike-ins is almost as good as when spike-ins are used to calibrate the capture model. Availability and implementation: The method is implemented as an R package publicly available from https://github.com/cz-ye/DECENT. Supplementary information: Supplementary data are available at Bioinformatics online.


SCDC: bulk gene expression deconvolution by multiple single-cell RNA sequencing references

Briefings in Bioinformatics ◽

10.1093/bib/bbz166 ◽

2020 ◽

Cited By ~ 13

Author(s):

Meichen Dong ◽

Aatish Thennavan ◽

Eugene Urrutia ◽

Yun Li ◽

Charles M Perou ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Specific Gene ◽

Rna Seq ◽

Cell Type ◽

Mixed Cell ◽

Single Cell Rna Sequencing

Abstract Recent advances in single-cell RNA sequencing (scRNA-seq) enable characterization of transcriptomic profiles with single-cell resolution and circumvent averaging artifacts associated with traditional bulk RNA sequencing (RNA-seq) data. Here, we propose SCDC, a deconvolution method for bulk RNA-seq that leverages cell-type specific gene expression profiles from multiple scRNA-seq reference datasets. SCDC adopts an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets that are produced in different laboratories and at different times, implicitly addressing the problem of batch-effect confounding. SCDC is benchmarked against existing methods using both in silico generated pseudo-bulk samples and experimentally mixed cell lines, whose known cell-type compositions serve as ground truths. We show that SCDC outperforms existing methods with improved accuracy of cell-type decomposition under both settings. To illustrate how the ENSEMBLE framework performs in complex tissues under different scenarios, we further apply our method to a human pancreatic islet dataset and a mouse mammary gland dataset. SCDC returns results that are more consistent with experimental designs and that reproduce more significant associations between cell-type proportions and measured phenotypes.


SPARSim single cell: a count data simulator for scRNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz752 ◽

2019 ◽

Cited By ~ 2

Author(s):

Giacomo Baruzzo ◽

Ilaria Patuzzi ◽

Barbara Di Camillo

Keyword(s):

Single Cell ◽

Count Data ◽

Simulated Data ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Rna Seq ◽

Distribution Of Zeros ◽

New Methods ◽

Research Fields

Abstract Motivation: Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, often showing poor or undemonstrated similarity with real data. Results: In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most used scRNA-seq simulators, Splat. In particular, SPARSim simulated count matrices closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation: The SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information: Supplementary data are available at Bioinformatics online.


MARS: leveraging allelic heterogeneity to increase power of association testing

Genome Biology ◽

10.1186/s13059-021-02353-8 ◽

2021 ◽

Vol 22 (1) ◽

Cited By ~ 1

Author(s):

Farhad Hormozdiari ◽

Junghyun Jung ◽

Eleazar Eskin ◽

Jong Wha J. Joo

Keyword(s):

Type I Error ◽

Association Studies ◽

Simulated Data ◽

Real Data ◽

Association Test ◽

Type I ◽

Genome Wide Association Studies ◽

Association Testing ◽

Causal Status ◽

Causal Variants

Abstract In standard genome-wide association studies (GWAS), the standard association test is underpowered to detect associations between loci with multiple causal variants with small effect sizes. We propose a statistical method, Model-based Association test Reflecting causal Status (MARS), that finds associations between variants in risk loci and a phenotype, considering the causal status of variants, only requiring the existing summary statistics to detect associated risk loci. Utilizing extensive simulated data and real data, we show that MARS increases the power of detecting true associated risk loci compared to previous approaches that consider multiple variants, while controlling the type I error.


Data-based RNA-seq Simulations by Binomial Thinning

10.1101/758524 ◽

2019 ◽

Cited By ~ 1

Author(s):

David Gerard

Keyword(s):

Theoretical Model ◽

Single Cell ◽

Differential Expression Analysis ◽

Simulated Data ◽

Real Data ◽

Theoretical Models ◽

Simulation Method ◽

R Package ◽

Rna Seq ◽

Ideal Model

Abstract With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations from theoretical models, this can result in unsubstantiated claims of a method’s performance. Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.
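A minimal sketch of how binomial thinning can add a known fold change to real counts is given below. It relies on the fact that if a count is Poisson with mean λ, keeping each read independently with probability p gives a Poisson count with mean pλ. The function names, the two-group design, and the choice to thin only one group are assumptions for illustration; this is not the seqgendiff API.

```python
import random


def thin_count(count, prob, rng):
    """Keep each of `count` reads independently with probability `prob`."""
    return sum(1 for _ in range(count) if rng.random() < prob)


def add_signal(counts, group, log2_fc, rng):
    """Thin one group's counts so that group 1 vs group 0 has roughly log2_fc.

    counts:  list of observed counts for one gene, one per sample
    group:   list of 0/1 group labels, one per sample
    log2_fc: target log2 fold change of group 1 relative to group 0
    """
    prob = 2.0 ** (-abs(log2_fc))
    # Thinning can only lower counts, so the group that should end up
    # with the smaller mean is the one that gets thinned.
    thinned_group = 0 if log2_fc > 0 else 1
    return [
        thin_count(c, prob, rng) if g == thinned_group else c
        for c, g in zip(counts, group)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
simulated = add_signal([100, 100, 100, 100], [0, 0, 1, 1], 1.0, rng)
print(simulated)
```

Because the starting counts are real, any unmodelled structure in them is carried into the simulated data, while the added effect size is known by construction.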


Besca, a single-cell transcriptomics analysis toolkit to accelerate translational research

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab102 ◽

2021 ◽

Vol 3 (4) ◽

Author(s):

Sophia Clara Mädler ◽

Alice Julien-Laferriere ◽

Luis Wyss ◽

Miroslav Phan ◽

Anthony Sonrel ◽

...

Keyword(s):

Best Practices ◽

Translational Research ◽

Single Cell ◽

Tumor Tissue ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Supervised Machine Learning ◽

Rna Seq ◽

Disease Biology ◽

Transcriptomics Data

Abstract Single-cell RNA sequencing (scRNA-seq) revolutionized our understanding of disease biology. The promise it presents to also transform translational research requires highly standardized and robust software workflows. Here, we present the toolkit Besca, which streamlines scRNA-seq analyses and their use to deconvolute bulk RNA-seq data according to current best practices. Beyond a standard workflow covering quality control, filtering, and clustering, two complementary Besca modules, utilizing hierarchical cell signatures and supervised machine learning, automate cell annotation and provide harmonized nomenclatures. Subsequently, the gene expression profiles can be employed to estimate cell type proportions in bulk transcriptomics data. Using multiple, diverse scRNA-seq datasets, some stemming from highly heterogeneous tumor tissue, we show how Besca aids acceleration, interoperability, reusability and interpretability of scRNA-seq data analyses, meeting crucial demands in translational research and beyond.


RNA-seq analyses of molecular abundance (RoMA) for detecting differential gene expression

10.1101/410985 ◽

2018 ◽

Author(s):

Guoshuai Cai ◽

Jennifer M. Franks ◽

Michael L. Whitfield

Keyword(s):

Type I Error ◽

Simulated Data ◽

Mrna Abundance ◽

Type I ◽

Rna Seq ◽

Improved Method ◽

Efficient Control ◽

Differential Gene ◽

Abundance Modeling ◽

Accurate Quantification

Abstract Motivation: Various methods have been proposed, each with its own limitations. Some naive normal-based tests have low testing power with invalid normal distribution assumptions for RNA-seq read counts, whereas count-based methods lack a biologically meaningful interpretation and have limited capability for integration with other analysis packages for mRNA abundance. In this study, we propose an improved method, RoMA, to accurately detect differential expression and unlock the integration with upstream and downstream analyses on mRNA abundance in RNA-seq studies. Results: RoMA incorporates information from both mRNA abundance and raw counts. Studies on simulated data and two real datasets showed that RoMA provides an accurate quantification of mRNA abundance and a data adjustment-tolerant DE analysis with high AUC, low FDR, and an efficient control of type I error rate. This study provides a valid strategy for mRNA abundance modeling and data analysis integration for RNA-seq studies, which will greatly facilitate the identification and interpretation of DE genes. Availability and implementation: RoMA is available at https://github.com/GuoshuaiCai/[email protected] or [email protected]


SCDC: Bulk Gene Expression Deconvolution by Multiple Single-Cell RNA Sequencing References

10.1101/743591 ◽

2019 ◽

Cited By ~ 1

Author(s):

Meichen Dong ◽

Aatish Thennavan ◽

Eugene Urrutia ◽

Yun Li ◽

Charles M. Perou ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Specific Gene ◽

Rna Seq ◽

Cell Type ◽

Mixed Cell ◽

Single Cell Rna Sequencing

Abstract Recent advances in single-cell RNA sequencing (scRNA-seq) enable characterization of transcriptomic profiles with single-cell resolution and circumvent averaging artifacts associated with traditional bulk RNA sequencing (RNA-seq) data. Here, we propose SCDC, a deconvolution method for bulk RNA-seq that leverages cell-type specific gene expression profiles from multiple scRNA-seq reference datasets. SCDC adopts an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets that are produced in different laboratories and at different times, implicitly addressing the problem of batch-effect confounding. SCDC is benchmarked against existing methods using both in silico generated pseudo-bulk samples and experimentally mixed cell lines, whose known cell-type compositions serve as ground truths. We show that SCDC outperforms existing methods with improved accuracy of cell-type decomposition under both settings. To illustrate how the ENSEMBLE framework performs in complex tissues under different scenarios, we further apply our method to a human pancreatic islet dataset and a mouse mammary gland dataset. SCDC returns results that are more consistent with experimental designs and that reproduce more significant associations between cell-type proportions and measured phenotypes.
