scholarly journals SMNN: batch effect correction for single-cell RNA-seq data via supervised mutual nearest neighbor detection

Author(s):  
Yuchen Yang ◽  
Gang Li ◽  
Huijun Qian ◽  
Kirk C Wilhelmsen ◽  
Yin Shen ◽  
...  

Abstract Batch effect correction has been recognized to be indispensable when integrating single-cell RNA sequencing (scRNA-seq) data from multiple batches. State-of-the-art methods ignore single-cell cluster label information, but such information can improve the effectiveness of batch effect correction, particularly under realistic scenarios where biological differences are not orthogonal to batch effects. To address this issue, we propose SMNN for batch effect correction of scRNA-seq data via supervised mutual nearest neighbor detection. Our extensive evaluations in simulated and real datasets show that SMNN provides improved merging within the corresponding cell types across batches, leading to reduced differentiation across batches over MNN, Seurat v3 and LIGER. Furthermore, SMNN retains more cell-type-specific features, partially manifested by differentially expressed genes identified between cell types after SMNN correction being biologically more relevant, with precision improving by up to 841.0%.

2019 ◽  
Author(s):  
Yuchen Yang ◽  
Gang Li ◽  
Huijun Qian ◽  
Kirk C. Wilhelmsen ◽  
Yin Shen ◽  
...  

AbstractBatch effect correction has been recognized to be indispensable when integrating single-cell RNA sequencing (scRNA-seq) data from multiple batches. State-of-the-art methods ignore single-cell cluster label information, but such information can improve effectiveness of batch effect correction, particularly under realistic scenarios where biological differences are not orthogonal to batch effects. To address this issue, we propose SMNN for batch effect correction of scRNA-seq data via supervised mutual nearest neighbor detection. Our extensive evaluations in simulated and real datasets show that SMNN provides improved merging within the corresponding cell types across batches, leading to reduced differentiation across batches over MNN, Seurat v3, and LIGER. Furthermore, SMNN retains more cell type-specific features, partially manifested by differentially expressed genes identified between cell types after SMNN correction being biologically more relevant, with precision improving by up to 841%.Key PointsBatch effect correction has been recognized to be critical when integrating scRNA-seq data from multiple batches due to systematic differences in time points, generating laboratory and/or handling technician(s), experimental protocol, and/or sequencing platform.Existing batch effect correction methods that leverages information from mutual nearest neighbors across batches (for example, implemented in SC3 or Seurat) ignore cell type information and suffer from potentially mismatching single cells from different cell types across batches, which would lead to undesired correction results, especially under the scenario where variation from batch effects is non-negligible compared with biological effects.To address this critical issue, here we present SMNN, a supervised machine learning method that first takes cluster/cell-type label information from users or inferred from scRNA-seq clustering, and then searches mutual nearest neighbors within each cell type instead of global searching.Our SMNN method shows clear advantages over three state-of-the-art batch effect correction methods and can better mix cells of the same cell type across batches and more effectively recover cell-type specific features, in both simulations and real datasets.


2020 ◽  
Author(s):  
Yuchen Yang ◽  
Gang Li ◽  
Yifang Xie ◽  
Li Wang ◽  
Yingxi Yang ◽  
...  

ABSTRACTBatch effect correction is an essential step in the integrative analysis of multiple single cell RNA-seq (scRNA-seq) data. One state-of-the-art strategy for batch effect correction is via unsupervised or supervised detection of mutual nearest neighbors (MNNs). However, both two kinds of methods only detect MNNs across batches on the top of uncorrected data, where the large batch effect may affect the MNN search. To address this issue, we presented iSMNN, a batch effect correction approach via iterative supervised MNN refinement across data after correction. Our benchmarking on both simulation and real datasets showed the advantages of the iterative refinement of MNNs on the performance of correction. Compared to the popular methods MNNcorrect and Seurat v3, our iSMNN is able to better mix the cells of the same cell type across batches. In addition, iSMNN can also facilitate the identification of DEGs relevant to the biological function of certain cell types. These results indicated that iSMNN will be a valuable method for integrating multiple scRNA-seq datasets that can facilitate biological and medical studies at single-cell level.


Author(s):  
Yuchen Yang ◽  
Gang Li ◽  
Yifang Xie ◽  
Li Wang ◽  
Taylor M Lagler ◽  
...  

Abstract Batch effect correction is an essential step in the integrative analysis of multiple single-cell RNA-sequencing (scRNA-seq) data. One state-of-the-art strategy for batch effect correction is via unsupervised or supervised detection of mutual nearest neighbors (MNNs). However, both types of methods only detect MNNs across batches of uncorrected data, where the large batch effects may affect the MNN search. To address this issue, we presented a batch effect correction approach via iterative supervised MNN (iSMNN) refinement across data after correction. Our benchmarking on both simulation and real datasets showed the advantages of the iterative refinement of MNNs on the performance of correction. Compared to popular alternative methods, our iSMNN is able to better mix the cells of the same cell type across batches. In addition, iSMNN can also facilitate the identification of differentially expressed genes (DEGs) that are relevant to the biological function of certain cell types. These results indicated that iSMNN will be a valuable method for integrating multiple scRNA-seq datasets that can facilitate biological and medical studies at single-cell level.


2020 ◽  
Author(s):  
Mohit Goyal ◽  
Guillermo Serrano ◽  
Ilan Shomorony ◽  
Mikel Hernaez ◽  
Idoia Ochoa

AbstractSingle-cell RNA-seq is a powerful tool in the study of the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell-types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell-types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell-types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need of retraining the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND batch alignment is parallelizable, being more than five or six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.


2019 ◽  
Author(s):  
Matthew N. Bernstein ◽  
Zhongjie Ma ◽  
Michael Gleicher ◽  
Colin N. Dewey

SummaryCell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. In this work, we present CellO, a machine learning-based tool for annotating human RNA-seq data with the Cell Ontology. CellO enables accurate and standardized cell type classification by considering the rich hierarchical structure of known cell types, a source of prior knowledge that is not utilized by existing methods. Furthemore, CellO comes pre-trained on a novel, comprehensive dataset of human, healthy, untreated primary samples in the Sequence Read Archive, which to the best of our knowledge, is the most diverse curated collection of primary cell data to date. CellO’s comprehensive training set enables it to run out-of-the-box on diverse cell types and achieves superior or competitive performance when compared to existing state-of-the-art methods. Lastly, CellO’s linear models are easily interpreted, thereby enabling exploration of cell type-specific expression signatures across the ontology. To this end, we also present the CellO Viewer: a web application for exploring CellO’s models across the ontology.HighlightWe present CellO, a tool for hierarchically classifying cell type from single-cell RNA-seq data against the graph-structured Cell OntologyCellO is pre-trained on a comprehensive dataset comprising nearly all bulk RNA-seq primary cell samples in the Sequence Read ArchiveCellO achieves superior or comparable performance with existing methods while featuring a more comprehensive pre-packaged training setCellO is built with easily interpretable models which we expose through a novel web application, the CellO Viewer, for exploring cell type-specific signatures across the Cell OntologyGraphical Abstract


2019 ◽  
Author(s):  
Pawel F. Przytycki ◽  
Katherine S. Pollard

Single-cell and bulk genomics assays have complementary strengths and weaknesses, and alone neither strategy can fully capture regulatory elements across the diversity of cells in complex tissues. We present CellWalker, a method that integrates single-cell open chromatin (scATAC-seq) data with gene expression (RNA-seq) and other data types using a network model that simultaneously improves cell labeling in noisy scATAC-seq and annotates cell-type specific regulatory elements in bulk data. We demonstrate CellWalker’s robustness to sparse annotations and noise using simulations and combined RNA-seq and ATAC-seq in individual cells. We then apply CellWalker to the developing brain. We identify cells transitioning between transcriptional states, resolve enhancers to specific cell types, and observe that autism and other neurological traits can be mapped to specific cell types through their enhancers.


2018 ◽  
Author(s):  
Xuran Wang ◽  
Jihwan Park ◽  
Katalin Susztak ◽  
Nancy R. Zhang ◽  
Mingyao Li

AbstractWe present MuSiC, a method that utilizes cell-type specific gene expression from single-cell RNA sequencing (RNA-seq) data to characterize cell type compositions from bulk RNA-seq data in complex tissues. When applied to pancreatic islet and whole kidney expression data in human, mouse, and rats, MuSiC outperformed existing methods, especially for tissues with closely related cell types. MuSiC enables characterization of cellular heterogeneity of complex tissues for identification of disease mechanisms.


2021 ◽  
Author(s):  
Hanbyeol Kim ◽  
Joongho Lee ◽  
Keunsoo Kang ◽  
Seokhyun Yoon

Abstract Cell type identification is a key step to downstream analysis of single cell RNA-seq experiments. Indispensible information for this is gene expression, which is used to cluster cells, train the model and set rejection thresholds. Problem is they are subject to batch effect arising from different platforms and preprocessing. We present MarkerCount, which uses the number of markers expressed regardless of their expression level to initially identify cell types and, then, reassign cell type in cluster-basis. MarkerCount works both in reference and marker-based mode, where the latter utilizes only the existing lists of markers, while the former required pre-annotated dataset to train the model. The performance was evaluated and compared with the existing identifiers, both marker and reference-based, that can be customized with publicly available datasets and marker DB. The results show that MarkerCount provides a stable performance when comparing with other reference-based and marker-based cell type identifiers.


2021 ◽  
Vol 14 ◽  
Author(s):  
Ian A. Taukulis ◽  
Rafal T. Olszewski ◽  
Soumya Korrapati ◽  
Katharine A. Fernandez ◽  
Erich T. Boger ◽  
...  

The endocochlear potential (EP) generated by the stria vascularis (SV) is necessary for hair cell mechanotransduction in the mammalian cochlea. We sought to create a model of EP dysfunction for the purposes of transcriptional analysis and treatment testing. By administering a single dose of cisplatin, a commonly prescribed cancer treatment drug with ototoxic side effects, to the adult mouse, we acutely disrupt EP generation. By combining these data with single cell RNA-sequencing findings, we identify transcriptional changes induced by cisplatin exposure, and by extension transcriptional changes accompanying EP reduction, in the major cell types of the SV. We use these data to identify gene regulatory networks unique to cisplatin treated SV, as well as the differentially expressed and druggable gene targets within those networks. Our results reconstruct transcriptional responses that occur in gene expression on the cellular level while identifying possible targets for interventions not only in cisplatin ototoxicity but also in EP dysfunction.


2021 ◽  
Author(s):  
Gulden Olgun ◽  
Vishaka Gopalan ◽  
Sridhar Hannenhalli

Micro-RNAs (miRNA) are critical in development, homeostasis, and diseases, including cancer. However, our understanding of miRNA function at cellular resolution is thwarted by the inability of the standard single cell RNA-seq protocols to capture miRNAs. Here we introduce a machine learning tool -- miRSCAPE -- to infer miRNA expression in a sample from its RNA-seq profile. We establish miRSCAPE's accuracy separately in 10 tissues comprising ~10,000 tumor and normal bulk samples and demonstrate that miRSCAPE accurately infers cell type-specific miRNA activities (predicted vs observed fold-difference correlation ~ 0.81) in two independent datasets where miRNA profiles of specific cell types are available (HEK-GBM, Kidney-Breast-Skin). When trained on human hematopoietic cancers, miRSCAPE can identify active miRNAs in 8 hematopoietic cell lines in mouse with a reasonable accuracy (auROC = 0.67). Finally, we apply miRSCAPE to infer miRNA activities in scRNA clusters in Pancreatic and Lung cancers, as well as in 56 cell types in the Human Cell Landscape (HCL). Across the board, miRSCAPE recapitulates and provides a refined view of known miRNA biology. miRSCAPE is freely available and promises to substantially expand our understanding of gene regulatory networks at cellular resolution.


Sign in / Sign up

Export Citation Format

Share Document