Assessment of batch-correction methods for scRNA-seq data with a new test metric

2017 ◽  
Author(s):  
Maren Büttner ◽  
Zhichao Miao ◽  
F Alexander Wolf ◽  
Sarah A Teichmann ◽  
Fabian J Theis

Abstract
Single-cell transcriptomics is a versatile tool for exploring heterogeneous cell populations. As with all genomics experiments, batch effects can hamper data integration and interpretation. The success of batch effect correction is often evaluated by visual inspection of dimension-reduced representations such as principal component analysis. This is inherently imprecise due to the high number of genes and the non-normal distribution of gene expression. Here, we present a k-nearest neighbour batch effect test (kBET, https://github.com/theislab/kBET) to quantitatively measure batch effects. kBET is easier to interpret, more sensitive and more robust than visual evaluation and other measures of batch effects. We use kBET to assess commonly used batch regression and normalisation approaches, and quantify the extent to which they remove batch effects while preserving biological variability. Our results illustrate that batch correction based on log-transformation or scran pooling followed by ComBat reduced the batch effect while preserving structure across data sets. Finally, we show that kBET can pinpoint successful data integration methods across multiple data sets, in this case from different publications all charting mouse embryonic development. This has important implications for future data integration efforts, which will be central to projects such as the Human Cell Atlas, where data for the same tissue may be generated in multiple locations around the world. [Before final publication, we will upload the R package to Bioconductor]
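To make the idea above concrete, here is a minimal sketch of a kNN-based batch-mixing test in the spirit of kBET, not the authors' implementation: for every cell, the batch composition of its k nearest neighbours is compared against the global batch frequencies with a chi-squared test, and the rejection rate summarises the batch effect. All names and parameter choices below are illustrative.

```python
# Minimal sketch of a kNN batch-mixing test (illustrative, not the kBET package).
import numpy as np
from scipy.stats import chisquare
from sklearn.neighbors import NearestNeighbors

def knn_batch_test(X, batch, k=25, alpha=0.05):
    """Fraction of cells whose k-nearest-neighbour batch composition deviates
    significantly from the global batch frequencies."""
    batches, global_counts = np.unique(batch, return_counts=True)
    global_freq = global_counts / global_counts.sum()
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(X)   # neighbourhood includes the cell itself; fine for a sketch
    rejected = 0
    for neighbours in idx:
        local_counts = np.array([(batch[neighbours] == b).sum() for b in batches])
        # chi-squared test: observed neighbourhood counts vs globally expected counts
        _, p = chisquare(local_counts, f_exp=global_freq * k)
        rejected += p < alpha
    return rejected / len(X)

# Example: two well-mixed Gaussian batches should give a low rejection rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
batch = np.repeat(["A", "B"], 200)
print(knn_batch_test(X, batch))
```

In a well-mixed data set the rejection rate stays near the significance level; a rate close to one indicates that neighbourhoods are dominated by single batches.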

2021 ◽  
Author(s):  
Mathias N Stokholm ◽  
Maria B Rabaglino ◽  
Haja N Kadarmideen

Transcriptomic data are often expensive and difficult to generate in large cohorts compared with genomic data, so it is often important to integrate multiple transcriptomic datasets, from both microarray and next-generation sequencing (NGS) platforms, across similar experiments or clinical trials to improve analytical power and the discovery of novel transcripts and genes. However, transcriptomic data integration presents a few challenges, including re-annotation and batch effect removal. We developed the Gene Expression Data Integration (GEDI) R package to enable transcriptomic data integration by combining already existing R packages. With just four functions, the GEDI R package makes constructing a transcriptomic data integration pipeline straightforward. Together, the functions overcome the complications in transcriptomic data integration by automatically re-annotating the data and removing the batch effect. The removal of the batch effect is verified with principal component analysis, and the data integration is verified using a logistic regression model with forward stepwise feature selection. To demonstrate the functionalities of the GEDI package, we integrated five bovine endometrial transcriptomic datasets from the NCBI Gene Expression Omnibus. The datasets included Affymetrix, Agilent and RNA-sequencing data. Furthermore, we compared the GEDI package to already existing tools and found that GEDI is the only tool that provides a full transcriptomic data integration pipeline, including verification of both batch effect removal and data integration.
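As a rough illustration of the two verification steps described above (a batch check via PCA and an integration check via a classifier), the following Python sketch uses simulated data; it is not the GEDI R API, and all variable names are placeholders.

```python
# Sketch of the two verification steps: (1) PCA to check that samples no longer
# separate by batch/platform, (2) a logistic-regression classifier to check that
# the biological signal of interest survives integration. Conceptual only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 500))          # integrated expression matrix (samples x genes), simulated
batch = np.repeat([0, 1, 2], 20)        # platform / dataset of origin
condition = np.tile([0, 1], 30)         # biological outcome of interest

# 1) Batch check: do the leading PCs still track batch membership?
pcs = PCA(n_components=2).fit_transform(X)
for b in np.unique(batch):
    print("batch", b, "PC1 mean:", pcs[batch == b, 0].mean())

# 2) Integration check: can the condition be predicted from the integrated data?
# (With pure simulated noise, accuracy stays near chance; real data would differ.)
acc = cross_val_score(LogisticRegression(max_iter=1000), pcs, condition, cv=5).mean()
print("cross-validated accuracy on condition:", acc)
```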


Author(s):  
Shofiqul Islam ◽  
Sonia Anand ◽  
Jemila Hamid ◽  
Lehana Thabane ◽  
Joseph Beyene

Abstract
Linear principal component analysis (PCA) is a widely used approach to reduce the dimension of gene or miRNA expression data sets. This method relies on the linearity assumption, which often fails to capture the patterns and relationships inherent in the data. Thus, a nonlinear approach such as kernel PCA might be optimal. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel principal component analysis methods for data integration and death classification. We also compare these methods using a real data set with gene and miRNA expression of lung cancer patients. On this occasion, the first few kernel principal components show poor performance compared to the linear principal components. Reducing dimensions using linear PCA and a logistic regression model for classification seems to be adequate for this purpose. Integrating information from multiple data sets using either of these two approaches leads to an improved classification accuracy for the outcome.
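The comparison described above can be sketched with scikit-learn as follows, using simulated data in place of the gene/miRNA expression sets; the number of components, kernel and classifier settings are illustrative only.

```python
# Sketch: reduce expression-like data with linear PCA and with kernel PCA,
# then classify an outcome from the retained components and compare accuracy.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1000))                  # samples x features, simulated
y = (X[:, :5].sum(axis=1) > 0).astype(int)        # outcome driven by a few features

for name, reducer in [("linear PCA", PCA(n_components=10)),
                      ("kernel PCA", KernelPCA(n_components=10, kernel="rbf"))]:
    model = make_pipeline(StandardScaler(), reducer, LogisticRegression(max_iter=1000))
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy = {acc:.3f}")
```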


2017 ◽  
Vol 18 (1) ◽  
Author(s):  
Carles Hernandez-Ferrer ◽  
Carlos Ruiz-Arenas ◽  
Alba Beltran-Gomila ◽  
Juan R. González

1995 ◽  
Vol 7 (3) ◽  
pp. 507-517 ◽  
Author(s):  
Marco Idiart ◽  
Barry Berk ◽  
L. F. Abbott

Model neural networks can perform dimensional reductions of input data sets using correlation-based learning rules to adjust their weights. Simple Hebbian learning rules lead to an optimal reduction at the single unit level but result in highly redundant network representations. More complex rules designed to reduce or remove this redundancy can develop optimal principal component representations, but they are not very compelling from a biological perspective. Neurons in biological networks have restricted receptive fields limiting their access to the input data space. We find that, within this restricted receptive field architecture, simple correlation-based learning rules can produce surprisingly efficient reduced representations. When noise is present, the size of the receptive fields can be optimally tuned to maximize the accuracy of reconstructions of input data from a reduced representation.
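A minimal sketch of the kind of learning rule discussed above: a single unit with a restricted receptive field trained with Oja's normalized Hebbian rule, whose weight vector converges to the leading principal component of the inputs it can see. The network sizes and noise levels below are arbitrary choices for illustration.

```python
# Single unit with a restricted receptive field trained by Oja's normalized
# Hebbian rule; the weights converge to the leading principal component of the
# inputs visible to the unit. All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n_inputs, field = 50, 10                      # full input dimension, receptive-field width
rf = slice(0, field)                          # this unit only sees inputs 0..9
w = rng.normal(scale=0.1, size=field)
eta = 0.01

for _ in range(5000):
    x = rng.normal(size=n_inputs)
    x[rf] += 0.5 * rng.normal() * np.ones(field)   # shared fluctuation inside the field
    v = w @ x[rf]                                  # unit activity
    w += eta * v * (x[rf] - v * w)                 # Oja's rule: Hebbian term with decay

print(w / np.linalg.norm(w))   # approximately uniform: the leading PC of the receptive field
```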


2018 ◽  
Vol 17 ◽  
pp. 117693511877108 ◽  
Author(s):  
Min Wang ◽  
Steven M Kornblau ◽  
Kevin R Coombes

Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally extensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs by 6 “biological components,” 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.
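For context, the snippet below sketches one simple, classical criterion for choosing the number of PCs, the broken-stick rule; it is not the automated Bayesian procedure implemented in PCDimension, and the simulated data are only for illustration.

```python
# Broken-stick rule: keep PCs whose explained-variance ratio exceeds the share
# expected from a random partition of unit variance into p pieces.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
signal = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 40))   # rank-3 structure
X = signal + 0.3 * rng.normal(size=(100, 40))                   # plus noise

var_ratio = PCA().fit(X).explained_variance_ratio_
p = len(var_ratio)
broken_stick = np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])
n_pcs = int(np.argmax(var_ratio < broken_stick))   # first PC below the stick = number kept
print("number of significant PCs:", n_pcs)         # recovers 3 for this simulation
```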


2020 ◽  
Author(s):  
Tiansheng Zhu ◽  
Guo-Bo Chen ◽  
Chunhui Yuan ◽  
Rui Sun ◽  
Fangfei Zhang ◽  
...  

Abstract
Batch effects are unwanted data variations that may obscure biological signals, leading to bias or errors in subsequent data analyses. Effective evaluation and elimination of batch effects are necessary for omics data analysis. To facilitate the evaluation and correction of batch effects, here we present BatchServer, an open-source, R/Shiny-based, user-friendly interactive graphical web platform for batch effect analysis. In BatchServer we introduced autoComBat, a modified version of ComBat, which is the most widely adopted tool for batch effect correction. BatchServer uses PVCA (Principal Variance Component Analysis) and UMAP (Uniform Manifold Approximation and Projection) for evaluation and visualization of batch effects. We demonstrate its application to multiple proteomics and transcriptomic data sets. BatchServer is provided at https://lifeinfo.shinyapps.io/batchserver/ as a web server. The source codes are freely available at https://github.com/guomics-lab/batch_server.
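The core location/scale idea behind ComBat-style correction can be sketched as below, standardizing each batch to the pooled mean and variance per feature; the empirical-Bayes shrinkage that ComBat and autoComBat add is omitted, and the data are simulated.

```python
# Simplified location/scale batch adjustment (illustrative, not BatchServer/ComBat).
import numpy as np

def simple_batch_adjust(X, batch):
    """Standardize each batch, per feature, to the overall mean and variance.
    X: samples x genes matrix; batch: per-sample batch labels."""
    X_adj = X.astype(float).copy()
    grand_mean = X.mean(axis=0)
    grand_std = X.std(axis=0) + 1e-8
    for b in np.unique(batch):
        rows = batch == b
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0) + 1e-8
        X_adj[rows] = (X[rows] - mu) / sd * grand_std + grand_mean
    return X_adj

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 100))
X[:20] += 2.0                                   # batch 1 carries a location shift
batch = np.repeat([1, 2], 20)
corrected = simple_batch_adjust(X, batch)
print(corrected[:20].mean(), corrected[20:].mean())   # batch means now agree
```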


2021 ◽  
Vol 12 ◽  
Author(s):  
Bin Zou ◽  
Tongda Zhang ◽  
Ruilong Zhou ◽  
Xiaosen Jiang ◽  
Huanming Yang ◽  
...  

It is well recognized that batch effect in single-cell RNA sequencing (scRNA-seq) data remains a big challenge when integrating different datasets. Here, we proposed deepMNN, a novel deep learning-based method to correct batch effect in scRNA-seq data. We first searched mutual nearest neighbor (MNN) pairs across different batches in a principal component analysis (PCA) subspace. Subsequently, a batch correction network was constructed by stacking two residual blocks and further applied for the removal of batch effects. The loss function of deepMNN was defined as the sum of a batch loss and a weighted regularization loss. The batch loss was used to compute the distance between cells in MNN pairs in the PCA subspace, while the regularization loss encouraged the output of the network to remain similar to the input. The experimental results showed that deepMNN can successfully remove batch effects across datasets with identical cell types, datasets with non-identical cell types, datasets with multiple batches, and large-scale datasets as well. We compared the performance of deepMNN with state-of-the-art batch correction methods, including the widely used methods of Harmony, Scanorama, and Seurat V4 as well as the recently developed deep learning-based methods of MMD-ResNet and scGen. The results demonstrated that deepMNN achieved a better or comparable performance in terms of both qualitative analysis using uniform manifold approximation and projection (UMAP) plots and quantitative metrics such as batch and cell entropies, ARI F1 score, and ASW F1 score under various scenarios. Additionally, deepMNN allowed for integrating scRNA-seq datasets with multiple batches in one step. Furthermore, deepMNN ran much faster than the other methods for large-scale datasets. These characteristics make deepMNN a promising choice for large-scale single-cell gene expression data analysis.
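The first step described above, finding MNN pairs between two batches in a PCA subspace, can be sketched as follows; the residual-block correction network and the deepMNN loss are not reproduced here, and all sizes are illustrative.

```python
# Sketch: mutual nearest neighbour (MNN) pairs between two batches in a shared
# PCA subspace (illustrative; not the deepMNN network or training loop).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def mnn_pairs(A, B, k=5):
    """Return index pairs (i, j) such that A[i] and B[j] are mutual k-nearest neighbours."""
    nn_b = NearestNeighbors(n_neighbors=k).fit(B)
    nn_a = NearestNeighbors(n_neighbors=k).fit(A)
    a_to_b = nn_b.kneighbors(A, return_distance=False)   # for each A cell, its k NNs in B
    b_to_a = nn_a.kneighbors(B, return_distance=False)   # for each B cell, its k NNs in A
    return [(i, j) for i in range(len(A)) for j in a_to_b[i] if i in b_to_a[j]]

rng = np.random.default_rng(6)
batch1 = rng.normal(size=(100, 50))
batch2 = rng.normal(size=(100, 50)) + 0.5       # shifted batch
pcs = PCA(n_components=20).fit_transform(np.vstack([batch1, batch2]))
pairs = mnn_pairs(pcs[:100], pcs[100:])
print(len(pairs), "MNN pairs found")
```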


2021 ◽  
Author(s):  
Huan Chen ◽  
Brian Caffo ◽  
Genevieve Stein-O’Brien ◽  
Jinrui Liu ◽  
Ben Langmead ◽  
...  

Summary
Integrative analysis of multiple data sets has the potential to fully leverage the vast amount of high-throughput biological data being generated. In particular, such analysis will be powerful in making inferences from publicly available collections of genetic, transcriptomic and epigenetic data sets that are designed to study shared biological processes but vary in their target measurements, biological variation, unwanted noise, and batch variation. Thus, methods that enable the joint analysis of multiple data sets are needed to gain insights into shared biological processes that would otherwise be hidden by unwanted intra-data set variation. Here, we propose a method called two-stage linked component analysis (2s-LCA) to jointly decompose multiple biologically related experimental data sets with biological and technological relationships that can be structured into the decomposition. The consistency of the proposed method is established and its empirical performance is evaluated via simulation studies. We apply 2s-LCA to jointly analyze four data sets focused on human brain development and identify meaningful patterns of gene expression in human neurogenesis that have shared structure across these data sets. The code to conduct 2s-LCA has been compiled into an R package, “PJD”, which is available at https://github.com/CHuanSite/PJD.
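Purely to illustrate the flavour of a two-stage joint decomposition (a separate low-rank decomposition per data set, followed by a second decomposition over the pooled gene-space loadings), here is a rough numpy sketch; it is not the 2s-LCA model or the PJD package, and the simulated dimensions are arbitrary.

```python
# Rough sketch of a generic two-stage joint decomposition (not 2s-LCA/PJD).
import numpy as np

rng = np.random.default_rng(7)
shared = rng.normal(size=(1000, 3))                       # gene-space directions shared by all data sets
datasets = [shared @ rng.normal(size=(3, n)) + 0.5 * rng.normal(size=(1000, n))
            for n in (30, 40, 50)]                        # genes x samples, varying sample sizes

# Stage 1: a separate low-rank decomposition of each data set (keep gene-space loadings).
loadings = []
for X in datasets:
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    loadings.append(U[:, :5])                             # top 5 components per data set

# Stage 2: decompose the pooled loadings; directions shared across data sets dominate.
pooled = np.hstack(loadings)                              # genes x (5 * number of data sets)
U2, s2, _ = np.linalg.svd(pooled, full_matrices=False)
print("singular values of pooled loadings:", np.round(s2[:6], 2))  # first ~3 stand out
```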


2019 ◽  
Author(s):  
Pavlin G. Poličar ◽  
Martin Stražar ◽  
Blaž Zupan

Abstract
Dimensionality reduction techniques, such as t-SNE, can construct informative visualizations of high-dimensional data. When working with multiple data sets, a straightforward application of these methods often fails; instead of revealing underlying classes, the resulting visualizations expose data set-specific clusters. To circumvent these batch effects, we propose an embedding procedure that takes a t-SNE visualization constructed on a reference data set and uses it as a scaffold for embedding new data. The new, secondary data are embedded one data point at a time. This prevents any interactions between instances in the secondary data and implicitly mitigates batch effects. We demonstrate the utility of this approach with an analysis of six recently published single-cell gene expression data sets containing up to tens of thousands of cells and thousands of genes. In these data sets, the batch effects are particularly strong because the data come from different institutions and were obtained using different experimental protocols. The visualizations constructed by our proposed approach are cleared of batch effects, and the cells from secondary data sets correctly co-cluster with cells from the primary data sharing the same cell type.
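A much-simplified caricature of the scaffold idea: each secondary cell is placed independently at the distance-weighted average position of its nearest reference neighbours, so new points never interact with each other. The actual procedure optimizes t-SNE positions rather than averaging, and the function below is hypothetical.

```python
# Simplified placement of secondary cells into a fixed reference embedding
# (a caricature of the optimization-based procedure; illustrative only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embed_into_reference(X_ref, emb_ref, X_new, k=10):
    """X_ref: reference expression matrix; emb_ref: its 2-D embedding (e.g. a t-SNE layout);
    X_new: secondary cells to place into the existing embedding."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_ref)
    dist, idx = nn.kneighbors(X_new)
    weights = 1.0 / (dist + 1e-8)
    weights /= weights.sum(axis=1, keepdims=True)
    # each new point depends only on the reference, never on other new points
    return np.einsum("nk,nkd->nd", weights, emb_ref[idx])

rng = np.random.default_rng(8)
X_ref = rng.normal(size=(300, 50))
emb_ref = rng.normal(size=(300, 2))          # stand-in for a precomputed t-SNE layout
X_new = rng.normal(size=(20, 50))
print(embed_into_reference(X_ref, emb_ref, X_new).shape)   # (20, 2)
```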


2021 ◽  
Author(s):  
Tenglong Li ◽  
Yuqing Zhang ◽  
Prasad Patil ◽  
W. Evan Johnson

Abstract
Non-ignorable technical variation is commonly observed across data from multiple experimental runs, platforms, or studies. These so-called batch effects can lead to difficulty in merging data from multiple sources, as they can severely bias the outcome of the analysis. Many groups have developed approaches for removing batch effects from data, usually by accommodating batch variables in the analysis (one-step correction) or by preprocessing the data prior to the formal or final analysis (two-step correction). One-step correction is often desirable due to its simplicity, but its flexibility is limited and it can be difficult to include batch variables uniformly when an analysis has multiple stages. Two-step correction allows for richer models of batch mean and variance. However, prior investigation has indicated that two-step correction can lead to incorrect statistical inference in downstream analysis. Generally speaking, two-step approaches introduce a correlation structure in the corrected data, which, if ignored, may lead to either exaggerated or diminished significance in downstream applications such as differential expression analysis. Here, we provide more intuitive and more formal evaluations of the impacts of two-step batch correction compared to the existing literature. We demonstrate that the undesired impacts of two-step correction (exaggerated or diminished significance) depend on both the nature of the study design and the batch effects. We also provide strategies for overcoming these negative impacts in downstream analyses using the estimated correlation matrix of the corrected data. We compare the results of our proposed workflow with the results from other published one-step and two-step methods and show that our methods lead to more consistent false discovery control and power of detection across a variety of batch effect scenarios. Software for our method is available through GitHub (https://github.com/jtleek/sva-devel) and will be available in future versions of the sva R package in the Bioconductor project (https://bioconductor.org/packages/release/bioc/html/sva.html).
Keywords: Batch effect; Two-step batch adjustment; ComBat; Sample correlation adjustment; Generalized least squares
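The claim that two-step correction induces a correlation structure can be checked with a small simulation: centering each batch at its own mean gives corrected values whose covariance is proportional to I − P, where P projects onto the batch indicators, so samples within a batch become negatively correlated. The sketch below verifies the within-batch correlation of −1/(n_b − 1) numerically; it is an illustration, not the paper's workflow, and downstream inference could account for this structure via, e.g., generalized least squares.

```python
# Numerical check: within-batch mean centering (a simple two-step correction)
# induces negative correlation among corrected samples in the same batch.
import numpy as np

rng = np.random.default_rng(9)
n_per_batch, n_batches, reps = 5, 2, 20000
n = n_per_batch * n_batches
batch = np.repeat(np.arange(n_batches), n_per_batch)

corrected = np.empty((reps, n))
for r in range(reps):
    x = rng.normal(size=n)                               # pure noise, no signal
    for b in range(n_batches):
        rows = batch == b
        corrected[r, rows] = x[rows] - x[rows].mean()    # two-step: remove batch means

emp_corr = np.corrcoef(corrected, rowvar=False)
print("empirical correlation within a batch:", round(emp_corr[0, 1], 3))
print("theoretical value -1/(n_b - 1):      ", round(-1 / (n_per_batch - 1), 3))
```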

