scBatch: batch-effect correction of RNA-seq data through sample distance matrix adjustment

Teng Fei; Tianwei Yu

doi:10.1093/bioinformatics/btaa097

Bioinformatics ◽

10.1093/bioinformatics/btaa097 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3115-3123 ◽

Cited By ~ 3

Author(s):

Teng Fei ◽

Tianwei Yu

Keyword(s):

Single Cell ◽

Differential Expression Analysis ◽

Distance Matrix ◽

Real Data ◽

R Package ◽

Batch Effect ◽

Supplementary Information ◽

RNA-Seq ◽

Sequencing Data ◽

Gene Differential Expression

Abstract. Motivation: Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results: We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation: The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information: Supplementary data are available at Bioinformatics online.

NewWave: a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA-seq data

10.1101/2021.08.02.453487 ◽

2021 ◽

Author(s):

Federico Agostinis ◽

Chiara Romualdi ◽

Gabriele Sales ◽

Davide Risso

Keyword(s):

Dimensionality Reduction ◽

Single Cell ◽

R Package ◽

Batch Effect ◽

Supplementary Information ◽

Bioconductor Package ◽

RNA-Seq ◽

Sequencing Data ◽

Bioconductor Project ◽

Single Cell RNA Sequencing

Summary: We present NewWave, a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA sequencing data. To achieve scalability, NewWave uses mini-batch optimization and can work with out-of-memory data, enabling users to analyze datasets with millions of cells. Availability and implementation: NewWave is implemented as an open-source R package available through the Bioconductor project at https://bioconductor.org/packages/NewWave/ Supplementary information: Supplementary data are available at Bioinformatics online.

SPARSim single cell: a count data simulator for scRNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz752 ◽

2019 ◽

Cited By ~ 2

Author(s):

Giacomo Baruzzo ◽

Ilaria Patuzzi ◽

Barbara Di Camillo

Keyword(s):

Single Cell ◽

Count Data ◽

Simulated Data ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

RNA-Seq ◽

Distribution Of Zeros ◽

New Methods ◽

Research Fields

Abstract. Motivation: Single-cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, often showing poor or undemonstrated similarity with real data. Results: In this article we present SPARSim, an scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most widely used scRNA-seq simulators, Splat. In particular, SPARSim simulated count matrices closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation: The SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information: Supplementary data are available at Bioinformatics online.

Data-based RNA-seq Simulations by Binomial Thinning

10.1101/758524 ◽

2019 ◽

Cited By ~ 1

Author(s):

David Gerard

Keyword(s):

Theoretical Model ◽

Single Cell ◽

Differential Expression Analysis ◽

Simulated Data ◽

Real Data ◽

Theoretical Models ◽

Simulation Method ◽

R Package ◽

RNA-Seq ◽

Ideal Model

Abstract. With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations of theoretical models, this can result in unsubstantiated claims of a method’s performance. Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.
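
The idea of adding signal to real counts by binomial thinning, as named in the title of this entry, can be illustrated with a minimal Python sketch. It relies on the standard fact that thinning a count y to Binomial(y, p) scales its expected value by p. This is not the seqgendiff API: the function names (`binomial_draw`, `thin_counts`, `add_group_signal`) and the two-group design are assumptions made for illustration only.

```python
import random


def binomial_draw(n, p, rng):
    """Draw from Binomial(n, p) as a sum of n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def thin_counts(counts, keep_prob, rng):
    """Thin each count independently: y -> Binomial(y, keep_prob)."""
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must lie in [0, 1]")
    return [binomial_draw(y, keep_prob, rng) for y in counts]


def add_group_signal(counts, groups, log2_fc, rng):
    """Impose a log2 fold change of group 1 over group 0 on one gene.

    Hypothetical helper: counts can only be reduced, so a positive effect
    thins group 0 and a negative effect thins group 1, each with
    keep probability 2 ** -|log2_fc|.
    """
    keep_prob = 2.0 ** (-abs(log2_fc))
    thinned_group = 0 if log2_fc > 0 else 1
    return [
        binomial_draw(y, keep_prob, rng) if g == thinned_group else y
        for y, g in zip(counts, groups)
    ]


# Example: one gene measured in four samples, two per group.
rng = random.Random(0)
gene_counts = [100, 100, 100, 100]
groups = [0, 0, 1, 1]
with_signal = add_group_signal(gene_counts, groups, 1.0, rng)
```

With a log2 fold change of 1, the group 0 counts are halved in expectation while the group 1 counts are left untouched, so the remaining structure of the real data is preserved.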

Batch Effect Correction of RNA-seq Data through Sample Distance Matrix Adjustment

10.1101/669739 ◽

2019 ◽

Author(s):

Teng Fei ◽

Tianwei Yu

Keyword(s):

Simulated Data ◽

Distance Matrix ◽

Batch Effect ◽

RNA-Seq ◽

Sequencing Data ◽

Gene Detection ◽

Gene Differential Expression ◽

Optimal Linear ◽

Sample Pattern ◽

Sample Distance

Abstract. Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. We present scBatch, a numerical algorithm that conducts batch effect correction on the count matrix of RNA sequencing (RNA-seq) data. Unlike traditional methods, scBatch starts by establishing an ideal correction of the sample distance matrix that reflects the underlying biological subgroups, without considering the actual correction of the raw count matrix itself. It then seeks an optimal linear transformation of the count matrix to approximate the established sample pattern. The benefit of such an approach is that the final result is not restricted by assumptions on the mechanism of the batch effect. As a result, the method yields good clustering and gene differential expression (DE) results. We compared the new method, scBatch, with leading batch effect removal methods ComBat and mnnCorrect on simulated data, real bulk RNA-seq data, and real single-cell RNA-seq data. The comparisons demonstrated that scBatch achieved better sample clustering and DE gene detection results.
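
The two-step procedure described in this abstract can be written schematically as an optimization problem. The notation is an assumption introduced here, not taken from the abstract: X is the p × n count matrix (p genes, n samples), \hat{D} is the corrected n × n sample distance matrix from the first step, W is an n × n linear transformation, and d(·) maps a count matrix to its sample distance matrix. The abstract does not state the distance measure or the loss, so the squared Frobenius norm below is only one plausible choice.

```latex
% Step 1 yields the corrected sample distance matrix \hat{D}.
% Step 2 seeks a linear transformation of the counts whose sample
% distances approximate \hat{D}:
\hat{W} \;=\; \arg\min_{W \in \mathbb{R}^{n \times n}}
  \bigl\lVert\, d(XW) - \hat{D} \,\bigr\rVert_F^{2},
\qquad
\tilde{X} \;=\; X\hat{W}.
```

Here \tilde{X} denotes the corrected count matrix passed on to clustering and differential expression analysis.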

Automatic identification of relevant genes from low-dimensional embeddings of single-cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btaa198 ◽

2020 ◽

Vol 36 (15) ◽

pp. 4291-4295

Author(s):

Philipp Angerer ◽

David S Fischer ◽

Fabian J Theis ◽

Antonio Scialdone ◽

Carsten Marr

Keyword(s):

Single Cell ◽

Principal Component ◽

R Package ◽

Ease Of Use ◽

Supplementary Information ◽

Automatic Identification ◽

Biological Processes ◽

RNA-Seq ◽

Sequencing Data ◽

Low Dimensional

Abstract. Motivation: Dimensionality reduction is a key step in the analysis of single-cell RNA-sequencing data. It produces a low-dimensional embedding for visualization and as a calculation base for downstream analysis. Nonlinear techniques are most suitable to handle the intrinsic complexity of large, heterogeneous single-cell data. However, with no linear relation between gene and embedding coordinate, there is no way to extract the identity of genes driving any cell’s position in the low-dimensional embedding, making it difficult to characterize the underlying biological processes. Results: In this article, we introduce the concepts of local and global gene relevance to compute an equivalent of principal component analysis loadings for non-linear low-dimensional embeddings. Global gene relevance identifies drivers of the overall embedding, while local gene relevance identifies those of a defined sub-region. We apply our method to single-cell RNA-seq datasets from different experimental protocols and to different low-dimensional embedding techniques. This shows our method’s versatility to identify key genes for a variety of biological processes. Availability and implementation: To ensure reproducibility and ease of use, our method is released as part of destiny 3.0, a popular R package for building diffusion maps from single-cell transcriptomic data. It is readily available through Bioconductor. Supplementary information: Supplementary data are available at Bioinformatics online.

SPsimSeq: semi-parametric simulation of bulk and single-cell RNA-sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa105 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3276-3278 ◽

Cited By ~ 2

Author(s):

Alemu Takele Assefa ◽

Jo Vandesompele ◽

Olivier Thas

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Real Data ◽

Simulation Method ◽

R Package ◽

Supplementary Information ◽

Expression Data ◽

Sequencing Data ◽

Wide Range ◽

Single Cell RNA Sequencing

Abstract. Summary: SPsimSeq is a semi-parametric simulation method to generate bulk and single-cell RNA-sequencing data. It is designed to simulate gene expression data with maximal retention of the characteristics of real data. It is reasonably flexible to accommodate a wide range of experimental scenarios, including different sample sizes, biological signals (differential expression) and confounding batch effects. Availability and implementation: The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information: Supplementary data are available at Bioinformatics online.

MarcoPolo: a clustering-free approach to the exploration of differentially expressed genes along with group information in single-cell RNA-seq data

10.1101/2020.11.23.393900 ◽

2020 ◽

Author(s):

Chanwoo Kim ◽

Hanbin Lee ◽

Juhee Jeong ◽

Keehoon Jung ◽

Buhm Han

Keyword(s):

Single Cell ◽

Differentially Expressed Genes ◽

Differential Expression Analysis ◽

Feature Selection Method ◽

Real Data ◽

Cell Types ◽

Differentially Expressed ◽

RNA-Seq ◽

Sequencing Data ◽

Group Information

Abstract. A common approach to analyzing single-cell RNA-sequencing data is to cluster cells first and then identify differentially expressed genes based on the clustering result. However, clustering has an innate uncertainty and can be imperfect, undermining the reliability of differential expression analysis results. To overcome this challenge, we present MarcoPolo, a clustering-free approach to exploring differentially expressed genes. To find informative genes without clustering, MarcoPolo exploits the bimodality of gene expression to learn the group information of the cells with respect to the expression level directly from given data. Using simulations and real data analyses, we showed that our method puts biologically informative genes at higher ranks more accurately and robustly than other existing methods. As our method provides information on how cells can be grouped for each gene, it can help identify cell types that are not separated well in the standard clustering process. Our method can also be used as a feature selection method to improve the robustness against changes in the number of genes used in clustering.

ExperimentSubset: an R package to manage subsets of Bioconductor Experiment objects

Bioinformatics ◽

10.1093/bioinformatics/btab179 ◽

2021 ◽

Author(s):

Irzam Sarfraz ◽

Muhammad Asif ◽

Joshua D Campbell

Keyword(s):

Single Cell ◽

R Package ◽

Poor Quality ◽

Data Matrix ◽

Supplementary Information ◽

Data Provenance ◽

RNA-Seq ◽

Efficient Management ◽

The Matrix ◽

The Relationship

Abstract. Motivation: R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying downstream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of the assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance. Results: To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, ExperimentSubset is a flexible container for the efficient management of subsets. Availability and implementation: The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset. Supplementary information: Supplementary data are available at Bioinformatics online.
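
The storage idea in this abstract, keeping a subset as row and column indices into the parent assay rather than as a copied matrix, can be sketched in a few lines of Python. This is a conceptual illustration only and does not reproduce the ExperimentSubset classes or their R interface: the class `SubsettableAssay` and its methods are hypothetical names, and the matrix is assumed to be a non-empty list of equal-length rows.

```python
class SubsettableAssay:
    """A parent matrix plus named subsets stored as index lists, not copies."""

    def __init__(self, name, matrix):
        self.name = name
        self.matrix = matrix      # list of rows, assumed non-empty
        self._subsets = {}        # subset name -> parent name and indices

    def create_subset(self, subset_name, rows, cols):
        """Register a subset by recording which parent rows/columns it uses."""
        n_rows, n_cols = len(self.matrix), len(self.matrix[0])
        if any(not 0 <= r < n_rows for r in rows) or any(
            not 0 <= c < n_cols for c in cols
        ):
            raise IndexError("subset indices fall outside the parent assay")
        self._subsets[subset_name] = {
            "parent": self.name,
            "rows": list(rows),
            "cols": list(cols),
        }

    def get_subset(self, subset_name):
        """Materialize the subset on demand from the parent matrix."""
        spec = self._subsets[subset_name]
        return [[self.matrix[r][c] for c in spec["cols"]] for r in spec["rows"]]

    def provenance(self, subset_name):
        """Return the name of the parent assay the subset was taken from."""
        return self._subsets[subset_name]["parent"]


# Example: drop the second sample (column) and the second feature (row).
counts = SubsettableAssay("counts", [[5, 0, 3], [1, 2, 0], [0, 7, 4]])
counts.create_subset("filtered", rows=[0, 2], cols=[0, 2])
filtered = counts.get_subset("filtered")
```

Only the index lists are stored per subset, so memory use grows with the number of selected rows and columns rather than with the size of the copied data, and each subset keeps a link to the assay it came from.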

DEsingle for detecting three types of differential expression in single-cell RNA-seq data

10.1101/173997 ◽

2017 ◽

Cited By ~ 1

Author(s):

Zhun Miao ◽

Ke Deng ◽

Xiaowo Wang ◽

Xuegong Zhang

Keyword(s):

Single Cell ◽

Differential Expression ◽

Negative Binomial ◽

Single Cells ◽

R Package ◽

Supplementary Information ◽

Binomial Model ◽

Supplementary Data ◽

RNA-Seq ◽

Real Zeros

Abstract. Summary: The excess zeros in single-cell RNA-seq data include “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package, DEsingle, which employs a zero-inflated negative binomial model to estimate the proportion of real and dropout zeros and to define and detect three types of DE genes in single-cell RNA-seq data with higher accuracy. Availability and implementation: The R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration [email protected] Supplementary information: Supplementary data are available at bioRxiv online.
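
For readers unfamiliar with the zero-inflated negative binomial (ZINB) model mentioned in this abstract, the following Python sketch shows its probability mass function and how the zeros it produces split between the point mass at zero and the negative binomial component. It is a generic textbook formulation, not DEsingle's code: the parameter names (`theta` for the zero-inflation weight, `size` and `prob` for the negative binomial) are assumptions, and which component corresponds to "real" versus "dropout" zeros is left to the paper.

```python
import math


def nb_pmf(k, size, prob):
    """Negative binomial pmf with dispersion `size` > 0 and 0 < `prob` < 1."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(prob) + k * math.log1p(-prob)
    )
    return math.exp(log_p)


def zinb_pmf(k, theta, size, prob):
    """ZINB pmf: a point mass at zero with weight theta, mixed with an NB."""
    nb_part = (1.0 - theta) * nb_pmf(k, size, prob)
    return theta + nb_part if k == 0 else nb_part


def zero_split(theta, size, prob):
    """Fractions of all zeros due to the point mass and to the NB component."""
    nb_zero = (1.0 - theta) * nb_pmf(0, size, prob)
    total = theta + nb_zero
    return theta / total, nb_zero / total


# Example parameters chosen arbitrarily for illustration.
p_zero = zinb_pmf(0, theta=0.2, size=2.0, prob=0.3)
from_point_mass, from_nb = zero_split(theta=0.2, size=2.0, prob=0.3)
```

Fitting `theta`, `size` and `prob` to observed counts (for example by maximum likelihood) is outside the scope of this sketch.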

DECENT: differential expression with capture efficiency adjustmeNT for single-cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz453 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5155-5162 ◽

Cited By ~ 10

Author(s):

Chengzhong Ye ◽

Terence P Speed ◽

Agus Salim

Keyword(s):

Single Cell ◽

Differential Expression ◽

Type I Error ◽

R Package ◽

Supplementary Information ◽

Type I ◽

Common Phenomenon ◽

RNA-Seq ◽

Capture Process ◽

Technological Platforms

Abstract. Motivation: Dropout is a common phenomenon in single-cell RNA-seq (scRNA-seq) data, and when left unaddressed it affects the validity of the statistical analyses. Despite this, few current methods for differential expression (DE) analysis of scRNA-seq data explicitly model the process that gives rise to the dropout events. We develop DECENT, a method for DE analysis of scRNA-seq data that explicitly and accurately models the molecule capture process in scRNA-seq experiments. Results: We show that DECENT demonstrates improved DE performance over existing DE methods that do not explicitly model dropout. This improvement is consistently observed across several public scRNA-seq datasets generated using different technological platforms. The improvement is especially large when the capture process is overdispersed. DECENT maintains type I error well while achieving better sensitivity. Its performance without spike-ins is almost as good as when spike-ins are used to calibrate the capture model. Availability and implementation: The method is implemented as a publicly available R package at https://github.com/cz-ye/DECENT. Supplementary information: Supplementary data are available at Bioinformatics online.
