BERMUDA: A novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes

Abstract: To fully utilize the power of single-cell RNA sequencing (scRNA-seq) technologies for cell lineation and identifying bona fide transcriptional signals, it is necessary to combine data from multiple experiments. We present BERMUDA (Batch-Effect ReMoval Using Deep Autoencoders), a novel transfer-learning-based method for batch-effect correction in scRNA-seq data. BERMUDA effectively combines different batches of scRNA-seq data with vastly different cell population compositions and amplifies biological signals by transferring information among batches. We demonstrate that BERMUDA outperforms existing methods for removing batch effects and distinguishing cell types in multiple simulated and real scRNA-seq datasets.

deepMNN: Deep Learning-Based Single-Cell RNA Sequencing Data Batch Correction Using Mutual Nearest Neighbors

Frontiers in Genetics ◽ 10.3389/fgene.2021.708981 ◽ 2021 ◽ Vol 12

Author(s): Bin Zou ◽ Tongda Zhang ◽ Ruilong Zhou ◽ Xiaosen Jiang ◽ Huanming Yang ◽ ...

Keyword(s): Deep Learning ◽ Single Cell ◽ RNA Sequencing ◽ Large Scale ◽ Cell Types ◽ Batch Effect ◽ Batch Effects ◽ Batch Correction ◽ Single Cell RNA Sequencing ◽ Identical Cell

It is well recognized that batch effect in single-cell RNA sequencing (scRNA-seq) data remains a major challenge when integrating different datasets. Here, we propose deepMNN, a novel deep learning-based method to correct batch effect in scRNA-seq data. We first searched for mutual nearest neighbor (MNN) pairs across different batches in a principal component analysis (PCA) subspace. Subsequently, a batch correction network was constructed by stacking two residual blocks and applied to remove batch effects. The loss function of deepMNN was defined as the sum of a batch loss and a weighted regularization loss. The batch loss measured the distance between cells in MNN pairs in the PCA subspace, while the regularization loss kept the output of the network similar to the input. The experimental results showed that deepMNN can successfully remove batch effects across datasets with identical cell types, datasets with non-identical cell types, datasets with multiple batches, and large-scale datasets. We compared the performance of deepMNN with state-of-the-art batch correction methods, including the widely used Harmony, Scanorama, and Seurat V4 as well as the recently developed deep learning-based methods MMD-ResNet and scGen. The results demonstrated that deepMNN achieved better or comparable performance in terms of both qualitative analysis using uniform manifold approximation and projection (UMAP) plots and quantitative metrics such as batch and cell entropies, ARI F1 score, and ASW F1 score under various scenarios. Additionally, deepMNN allowed scRNA-seq datasets with multiple batches to be integrated in one step. Furthermore, deepMNN ran much faster than the other methods on large-scale datasets. These characteristics make deepMNN a promising option for large-scale single-cell gene expression data analysis.
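The loss described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `deepmnn_style_loss`, the `reg_weight` parameter, the use of Euclidean distance, and the averaging over pairs and cells are all hypothetical choices, since the abstract only states that the loss is the sum of a batch loss and a weighted regularization loss.

```python
import math

def deepmnn_style_loss(inputs, outputs, pairs, reg_weight=1.0):
    """Sum of a batch loss and a weighted regularization loss (illustrative only).

    inputs  -- PCA coordinates of all cells before correction (assumed layout)
    outputs -- coordinates of the same cells after the correction network
    pairs   -- (i, j) index pairs of cells that are mutual nearest neighbours
    """
    # Batch loss: mean distance between the corrected cells of each MNN pair.
    batch_loss = sum(math.dist(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)
    # Regularization loss: mean distance between each cell's input and output,
    # which discourages the network from moving cells further than necessary.
    reg_loss = sum(math.dist(x, y) for x, y in zip(inputs, outputs)) / len(inputs)
    return batch_loss + reg_weight * reg_loss

# Toy data: two cells from different batches that form one MNN pair.
cells = [(0.0, 0.0), (3.0, 4.0)]
mnn_pairs = [(0, 1)]

# No correction: the pair stays 5 units apart and nothing moves.
loss_identity = deepmnn_style_loss(cells, cells, mnn_pairs)

# Full merge at the midpoint: the pair distance is 0, but each cell moved 2.5 units.
merged = [(1.5, 2.0), (1.5, 2.0)]
loss_merged = deepmnn_style_loss(cells, merged, mnn_pairs)
```

In this toy case the uncorrected loss is 5.0 and the fully merged loss is 2.5 with `reg_weight=1.0`; raising `reg_weight` makes moving cells more expensive, which is the trade-off the weighted regularization term controls.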

A Comprehensive Multi-Center Cross-platform Benchmarking Study of Single-cell RNA Sequencing Using Reference Samples

10.1101/2020.03.27.010249 ◽ 2020

Author(s): Wanqiu Chen ◽ Yongmei Zhao ◽ Xin Chen ◽ Xiaojiang Xu ◽ Zhaowei Yang ◽ ...

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Cell Types ◽ Batch Effect ◽ Batch Effects ◽ Data Set ◽ Batch Correction ◽ Single Cell RNA Sequencing ◽ Cross Platform ◽ Reference Samples

Abstract: Single-cell RNA sequencing (scRNA-seq) has become a very powerful technology for biomedical research and is becoming much more affordable as methods continue to evolve, but it is unknown how reproducible different platforms are when used with different bioinformatics pipelines, particularly the recently developed scRNA-seq batch correction algorithms. We carried out a comprehensive multi-center cross-platform comparison of different scRNA-seq platforms using standard reference samples. We compared six pre-processing pipelines, seven bioinformatics normalization procedures, and seven batch effect correction methods (CCA, MNN, Scanorama, BBKNN, Harmony, limma and ComBat) to evaluate the performance and reproducibility of 20 scRNA-seq data sets derived from four different platforms and centers. We benchmarked scRNA-seq performance across different platforms and testing sites using global gene expression profiles as well as some cell-type specific marker genes. We showed that there were large batch effects, and that the reproducibility of scRNA-seq across platforms was dictated both by the expression level of the genes selected and by the batch correction methods used. We found that CCA, MNN, and BBKNN all corrected the batch variations fairly well for the scRNA-seq data derived from biologically similar samples across platforms/sites. However, for the scRNA-seq data derived from or consisting of biologically distinct samples, limma and ComBat failed to correct batch effects, whereas CCA over-corrected the batch effect and misclassified the cell types and samples. In contrast, MNN, Harmony and BBKNN separated biologically different samples/cell types into correspondingly distinct dimensional subspaces; however, consistent with this algorithm’s logic, MNN required that the samples evaluated each contain a shared portion of highly similar cells. In summary, we found a great cross-platform consistency in separating two distinct samples when an appropriate batch correction method was used. We hope this large cross-platform/site scRNA-seq data set will provide a valuable resource, and that our findings will offer useful advice for the single-cell sequencing community.

Comparison of Scanpy-based algorithms to remove the batch effect from single-cell RNA-seq data

Cell Regeneration ◽ 10.1186/s13619-020-00041-9 ◽ 2020 ◽ Vol 9 (1)

Author(s): Jiaqi Li ◽ Chengxuan Yu ◽ Lifeng Ma ◽ Jingjing Wang ◽ Guoji Guo

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Large Scale ◽ Batch Effect ◽ RNA Seq ◽ Batch Effects ◽ Integration Methods ◽ Batch Correction ◽ Single Cell RNA Sequencing ◽ Algorithm Level

Abstract: With the development of single-cell RNA sequencing (scRNA-seq) technology, analysts need to integrate hundreds of thousands of cells with multiple experimental batches. It is becoming increasingly difficult for users to select the best integration methods to remove batch effects. Here, we compared the advantages and limitations of four commonly used Scanpy-based batch-correction methods using two representative and large-scale scRNA-seq datasets. We quantitatively evaluated batch-correction performance and efficiency. Furthermore, we discussed the performance differences among the evaluated methods at the algorithm level.

Transfer learning efficiently maps bone marrow cell types from mouse to human using single-cell RNA sequencing

Communications Biology ◽ 10.1038/s42003-020-01463-6 ◽ 2020 ◽ Vol 3 (1)

Author(s): Patrick S. Stumpf ◽ Xin Du ◽ Haruka Imanishi ◽ Yuya Kunisaki ◽ Yuichiro Semba ◽ ...

Keyword(s): Machine Learning ◽ Bone Marrow ◽ Single Cell ◽ RNA Sequencing ◽ Transfer Learning ◽ Biomedical Research ◽ Human Cell ◽ Cell Types ◽ Single Cell RNA Sequencing ◽ Using Data

Abstract: Biomedical research often involves conducting experiments on model organisms in the anticipation that the biology learnt will transfer to humans. Previous comparative studies of mouse and human tissues were limited by the use of bulk-cell material. Here we show that transfer learning, the branch of machine learning that concerns passing information from one domain to another, can be used to efficiently map bone marrow biology between species, using data obtained from single-cell RNA sequencing. We first trained a multiclass logistic regression model to recognize different cell types in mouse bone marrow, achieving performance equivalent to that of more complex artificial neural networks. Furthermore, the model was able to identify individual human bone marrow cells with 83% overall accuracy. However, some human cell types were not easily identified, indicating important differences in biology. When re-training the mouse classifier using data from human, fewer than 10 human cells of a given type were needed to accurately learn its representation. In some cases, human cell identities could be inferred directly from the mouse classifier via zero-shot learning. These results show how simple machine learning models can be used to reconstruct complex biology from limited data, with broad implications for biomedical research.

Flexible Experimental Designs for Valid Single-cell RNA-sequencing Experiments Allowing Batch Effects Correction

10.1101/533372 ◽ 2019 ◽ Cited By ~ 1

Author(s): Fangda Song ◽ Ga Ming Chan ◽ Yingying Wei

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Real Data ◽ Cell Types ◽ Experimental Designs ◽ Batch Effects ◽ Bayesian Hierarchical ◽ Single Cell RNA Sequencing ◽ Randomized Experimental Design ◽ Chain Type

Abstract: Despite their widespread applications, single-cell RNA-sequencing (scRNA-seq) experiments are still plagued by batch effects and dropout events. Although the completely randomized experimental design has frequently been advocated to control for batch effects, it is rarely implemented in real applications due to time and budget constraints. Here, we mathematically prove that under two more flexible and realistic experimental designs, the “reference panel” and the “chain-type” designs, true biological variability can also be separated from batch effects. We develop Batch effects correction with Unknown Subtypes for scRNA-seq data (BUSseq), which is an interpretable Bayesian hierarchical model that closely follows the data-generating mechanism of scRNA-seq experiments. BUSseq can simultaneously correct batch effects, cluster cell types, impute missing data caused by dropout events, and detect differentially expressed genes without requiring a preliminary normalization step. We demonstrate that BUSseq outperforms existing methods with simulated and real data.

iMAP: integration of multiple single-cell datasets by adversarial paired transfer networks

Genome Biology ◽ 10.1186/s13059-021-02280-8 ◽ 2021 ◽ Vol 22 (1)

Author(s): Dongfang Wang ◽ Siyu Hou ◽ Lei Zhang ◽ Xiliang Wang ◽ Baolin Liu ◽ ...

Keyword(s): Tumor Microenvironment ◽ Single Cell ◽ RNA Sequencing ◽ Cell Types ◽ Batch Effect ◽ Generative Adversarial Networks ◽ Multiple Sources ◽ Adversarial Networks ◽ Mixing Distributions ◽ Single Cell RNA Sequencing

Abstract: The integration of single-cell RNA-sequencing datasets from multiple sources is critical for deciphering cell-to-cell heterogeneities and interactions in complex biological systems. We present a novel unsupervised batch effect removal framework, called iMAP, based on both deep autoencoders and generative adversarial networks. Compared with current methods, iMAP shows superior, robust, and scalable performance in terms of both reliably detecting the batch-specific cells and effectively mixing distributions of the batch-shared cell types. Applying iMAP to tumor microenvironment datasets from two platforms, Smart-seq2 and 10x Genomics, we find that iMAP can leverage the powers of both platforms to discover novel cell-cell interactions.

BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes

Genome Biology ◽ 10.1186/s13059-019-1764-6 ◽ 2019 ◽ Vol 20 (1) ◽ Cited By ~ 25

Author(s): Tongxin Wang ◽ Travis S. Johnson ◽ Wei Shao ◽ Zixiao Lu ◽ Bryan R. Helm ◽ ...

Keyword(s): High Resolution ◽ Single Cell ◽ RNA Sequencing ◽ Transfer Learning ◽ Learning Method ◽ Batch Correction ◽ Single Cell RNA Sequencing

Correcting batch effects in single-cell RNA sequencing data by matching mutual nearest neighbours

10.1101/165118 ◽ 2017 ◽ Cited By ~ 15

Author(s): Laleh Haghverdi ◽ Aaron T. L. Lun ◽ Michael D. Morgan ◽ John C. Marioni

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Large Scale ◽ Data Sets ◽ Batch Effects ◽ Sequencing Data ◽ Batch Correction ◽ Nearest Neighbours ◽ Single Cell RNA Sequencing ◽ New Strategy

Abstract: The presence of batch effects is a well-known problem in experimental data analysis, and single-cell RNA sequencing (scRNA-seq) is no exception. Large-scale scRNA-seq projects that generate data from different laboratories and at different times are rife with batch effects that can fatally compromise integration and interpretation of the data. In such cases, computational batch correction is critical for eliminating uninteresting technical factors and obtaining valid biological conclusions. However, existing methods assume that the composition of cell populations is either known or the same across batches. Here, we present a new strategy for batch correction based on the detection of mutual nearest neighbours in the high-dimensional expression space. Our approach does not rely on pre-defined or equal population compositions across batches, only requiring that a subset of the population be shared between batches. We demonstrate the superiority of our approach over existing methods on a range of simulated and real scRNA-seq data sets. We also show how our method can be applied to integrate scRNA-seq data from two separate studies of early embryonic development.
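To make the idea of mutual nearest neighbours concrete, the following is a small Python sketch of how such pairs might be detected between two batches. It is an illustration under assumptions, not the published algorithm: the helper names `k_nearest` and `mutual_nearest_neighbours`, the brute-force search, the Euclidean metric, and the toy coordinates are all hypothetical, and the correction step that follows pair detection is not shown.

```python
import math

def k_nearest(query, reference, k):
    """For each point in `query`, return the set of indices of its k nearest points in `reference`."""
    neighbours = []
    for q in query:
        # Brute-force ranking by Euclidean distance; fine for a toy example only.
        order = sorted(range(len(reference)), key=lambda j: math.dist(q, reference[j]))
        neighbours.append(set(order[:k]))
    return neighbours

def mutual_nearest_neighbours(batch_a, batch_b, k=1):
    """Return (i, j) pairs where cell i of batch A and cell j of batch B
    are each among the other's k nearest neighbours in the opposite batch."""
    a_to_b = k_nearest(batch_a, batch_b, k)
    b_to_a = k_nearest(batch_b, batch_a, k)
    return [
        (i, j)
        for i in range(len(batch_a))
        for j in sorted(a_to_b[i])
        if i in b_to_a[j]
    ]

# Toy batches: two shared populations, offset by a batch effect of (+1, +1),
# plus one population at (20, 20) that exists only in batch A.
batch_a = [(0.0, 0.0), (5.0, 5.0), (20.0, 20.0)]
batch_b = [(1.0, 1.0), (6.0, 6.0)]

pairs = mutual_nearest_neighbours(batch_a, batch_b, k=1)
print(pairs)  # [(0, 0), (1, 1)]
```

The cell at (20, 20) has a nearest neighbour in batch B, but that neighbour's own nearest cell in batch A is a different one, so no pair is formed. This is why the mutual requirement tolerates populations that are present in only one batch, as long as some subset of the population is shared.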

Analysis of single-cell RNA sequencing data based on autoencoders

BMC Bioinformatics ◽ 10.1186/s12859-021-04150-3 ◽ 2021 ◽ Vol 22 (1)

Author(s): Andrea Tangherloni ◽ Federico Ricciuti ◽ Daniela Besozzi ◽ Pietro Liò ◽ Ana Cvejic

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Cell Types ◽ Batch Effect ◽ Learning Approaches ◽ Sequencing Data ◽ Starting Point ◽ Single Cell RNA Sequencing ◽ Downstream Analysis ◽ Low Dimensional

Abstract: Background. Single-cell RNA sequencing (scRNA-Seq) experiments are gaining ground to study the molecular processes that drive normal development as well as the onset of different pathologies. Finding an effective and efficient low-dimensional representation of the data is one of the most important steps in the downstream analysis of scRNA-Seq data, as it could provide a better identification of known or putatively novel cell-types. Another step that still poses a challenge is the integration of different scRNA-Seq datasets. Though standard computational pipelines to gain knowledge from scRNA-Seq data exist, a further improvement could be achieved by means of machine learning approaches. Results. Autoencoders (AEs) have been effectively used to capture the non-linearities among gene interactions of scRNA-Seq data, so that the deployment of AE-based tools might represent the way forward in this context. We introduce here scAEspy, a unifying tool that embodies: (1) four of the most advanced AEs, (2) two novel AEs that we developed on purpose, (3) different loss functions. We show that scAEspy can be coupled with various batch-effect removal tools to integrate data from different scRNA-Seq platforms, in order to better identify the cell-types. We benchmarked scAEspy against the most used batch-effect removal tools, showing that our AE-based strategies outperform the existing solutions. Conclusions. scAEspy is a user-friendly tool that enables using the most recent and promising AEs to analyse scRNA-Seq data by only setting up two user-defined parameters. Thanks to its modularity, scAEspy can be easily extended to accommodate new AEs to further improve the downstream analysis of scRNA-Seq data. Considering the relevant results we achieved, scAEspy can be considered as a starting point to build a more comprehensive toolkit designed to integrate multi single-cell omics.

Single-cell data clustering based on sparse optimization and low-rank matrix factorization

G3 Genes|Genome|Genetics ◽ 10.1093/g3journal/jkab098 ◽ 2021

Author(s): Yinlei Hu ◽ Bin Li ◽ Falai Chen ◽ Kun Qu

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Matrix Factorization ◽ Data Clustering ◽ Cell Types ◽ Low Rank ◽ Sequencing Data ◽ Rank Matrix ◽ Single Cell RNA Sequencing ◽ Low Rank Matrix

Abstract: Unsupervised clustering is a fundamental step of single-cell RNA sequencing data analysis. This issue has inspired several clustering methods to classify cells in single-cell RNA sequencing data. However, accurate prediction of the cell clusters remains a substantial challenge. In this study, we propose a new algorithm for single-cell RNA sequencing data clustering based on Sparse Optimization and low-rank matrix factorization (scSO). We applied our scSO algorithm to analyze multiple benchmark datasets and showed that the cluster number predicted by scSO was close to the number of reference cell types and that most cells were correctly classified. Our scSO algorithm is available at https://github.com/QuKunLab/scSO. Overall, this study demonstrates a potent cell clustering approach that can help researchers distinguish cell types in single-cell RNA sequencing data.
