Integrating Deep Supervised, Self-Supervised and Unsupervised Learning for Single-Cell RNA-seq Clustering and Annotation

As single-cell RNA sequencing technologies mature, massive gene expression profiles can be obtained. Consequently, cell clustering and annotation become two crucial and fundamental procedures affecting other specific downstream analyses. Most existing single-cell RNA-seq (scRNA-seq) data clustering algorithms do not take into account the available cell annotation results on the same tissues or organisms from other laboratories. Nonetheless, such data could assist and guide the clustering process on the target dataset. Identifying marker genes through differential expression analysis to manually annotate large amounts of cells also costs labor and resources. Therefore, in this paper, we propose a novel end-to-end cell supervised clustering and annotation framework called scAnCluster, which fully utilizes the cell type labels available from reference data to facilitate the cell clustering and annotation on the unlabeled target data. Our algorithm integrates deep supervised learning, self-supervised learning and unsupervised learning techniques together, and it outperforms other customized scRNA-seq supervised clustering methods in both simulation and real data. It is particularly worth noting that our method performs well on the challenging task of discovering novel cell types that are absent in the reference data.

SC3 - consensus clustering of single-cell RNA-Seq data

10.1101/036558 ◽

2016 ◽

Cited By ~ 29

Author(s):

Vladimir Yu. Kiselev ◽

Kristina Kirschner ◽

Michael T. Schaub ◽

Tallulah Andrews ◽

Andrew Yiu ◽

...

Keyword(s):

Single Cell ◽

Expression Profiles ◽

Cell Types ◽

Marker Genes ◽

Consensus Clustering ◽

Rna Seq ◽

Large Dataset ◽

Wide Audience ◽

Large Variability ◽

Biological Interpretation

Abstract: Using single-cell RNA-seq (scRNA-seq), the full transcriptome of individual cells can be acquired, enabling a quantitative cell-type characterisation based on expression profiles. However, due to the large variability in gene expression, identifying cell types based on the transcriptome remains challenging. We present Single-Cell Consensus Clustering (SC3), a tool for unsupervised clustering of scRNA-seq data. SC3 achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach. Tests on twelve published datasets show that SC3 outperforms five existing methods while remaining scalable, as shown by the analysis of a large dataset containing 44,808 cells. Moreover, an interactive graphical implementation makes SC3 accessible to a wide audience of users, and SC3 aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells. We illustrate the capabilities of SC3 by characterising newly obtained transcriptomes from subclones of neoplastic cells collected from patients.
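
The consensus idea in this abstract can be illustrated with a co-association matrix: for each pair of cells, record the fraction of clustering solutions that put them in the same cluster. The following is a minimal sketch of that general technique, not SC3's actual implementation; the function name `consensus_matrix` and the toy label vectors are hypothetical, and the step that would cluster the resulting matrix is left out.

```python
from itertools import combinations


def consensus_matrix(solutions):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    clustering solutions that place cells i and j in the same cluster."""
    if not solutions:
        raise ValueError("need at least one clustering solution")
    n = len(solutions[0])
    if any(len(labels) != n for labels in solutions):
        raise ValueError("all solutions must label the same cells")

    # Every cell always co-clusters with itself.
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        agree = sum(1 for labels in solutions if labels[i] == labels[j])
        matrix[i][j] = matrix[j][i] = agree / len(solutions)
    return matrix


# Three hypothetical clusterings of the same three cells (illustrative only).
solutions = [[0, 0, 1], [1, 1, 0], [0, 1, 1]]
consensus = consensus_matrix(solutions)
```

Because only co-membership is compared, the matrix does not depend on how each individual solution happens to number its clusters, which is what makes solutions from different algorithms or parameter settings comparable.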

Single-cell RNA-seq data semi-supervised clustering and annotation via structural regularized domain adaptation

Bioinformatics ◽

10.1093/bioinformatics/btaa908 ◽

2020 ◽

Cited By ~ 1

Author(s):

Liang Chen ◽

Qiuyan He ◽

Yuyao Zhai ◽

Minghua Deng

Keyword(s):

Single Cell ◽

Supervised Classification ◽

Reference Data ◽

Domain Adaptation ◽

Rapid Development ◽

Differential Expression Analysis ◽

Batch Effect ◽

Supplementary Information ◽

Supervised Clustering ◽

Target Data

Abstract Motivation: The rapid development of single-cell RNA sequencing (scRNA-seq) technologies allows us to explore tissue heterogeneity at the cellular level. The identification of cell types plays an essential role in the analysis of scRNA-seq data, which, in turn, influences the discovery of regulatory genes that induce heterogeneity. As the scale of sequencing data increases, the classical method of combining clustering and differential expression analysis to annotate cells becomes more costly in terms of both labor and resources. Existing scRNA-seq supervised classification methods can alleviate this issue by training a classifier on the labeled reference data and then making predictions on the unlabeled target data. However, such a label transfer strategy carries risks, such as susceptibility to batch effects and further compromise of the inherent discrimination of the target data. Results: In this article, inspired by unsupervised domain adaptation, we propose a flexible single cell semi-supervised clustering and annotation framework, scSemiCluster, which integrates the reference data and target data for training. We utilize structure similarity regularization on the reference domain to restrict the clustering solutions of the target domain. We also incorporate pairwise constraints in the feature learning process such that cells belonging to the same cluster are close to each other, and cells belonging to different clusters are far from each other in the latent space. Notably, without explicit domain alignment and batch effect correction, scSemiCluster outperforms other state-of-the-art single-cell supervised classification and semi-supervised clustering annotation algorithms in both simulation and real data. To the best of our knowledge, we are the first to use both deep discriminative clustering and deep generative clustering techniques in the single-cell field.
Availability and implementation: An implementation of scSemiCluster is available from https://github.com/xuebaliang/scSemiCluster. Supplementary information: Supplementary data are available at Bioinformatics online.

Dynamic changes in the regulatory T-cell heterogeneity and function by murine IL-2 mutein

Life Science Alliance ◽

10.26508/lsa.201900520 ◽

2020 ◽

Vol 3 (5) ◽

pp. e201900520 ◽

Cited By ~ 1

Author(s):

Daniel R Lu ◽

Hao Wu ◽

Ian Driver ◽

Sarah Ingersoll ◽

Sue Sohn ◽

...

Keyword(s):

Single Cell ◽

Single Cell Analysis ◽

Expression Profiles ◽

Control Cell ◽

Mutant Form ◽

Rna Seq ◽

Dynamic Changes ◽

Cell Clustering ◽

Transcriptional Changes ◽

And Function

The therapeutic expansion of Foxp3+ regulatory T cells (Tregs) shows promise for treating autoimmune and inflammatory disorders. Yet, how this treatment affects the heterogeneity and function of Tregs is not clear. Using single-cell RNA-seq analysis, we characterized 31,908 Tregs from mice treated with either a half-life extended mutant form of murine IL-2 (IL-2 mutein, IL-2M) that preferentially expanded Tregs, or mouse IgG Fc as a control. Cell clustering analysis revealed that IL-2M specifically expands multiple sub-states of Tregs with distinct expression profiles. TCR profiling with single-cell analysis uncovered Treg migration across tissues and transcriptional changes between clonally related Tregs after IL-2M treatment. Finally, we identified IL-2M–expanded Tnfrsf9+Il1rl1+ Tregs with superior suppressive function, highlighting the potential of IL-2M to expand highly suppressive Foxp3+ Tregs.

Bulk and single-cell RNA-seq reveal dmrtb1 gene expression profiles during sex change in zig-zag eel (Mastacembelus armatus)

Aquaculture ◽

10.1016/j.aquaculture.2021.737194 ◽

2021 ◽

pp. 737194

Author(s):

Lingzhan Xue ◽

Dan Jia ◽

Luohao Xu ◽

Zhen Huang ◽

Haiping Fan ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Sex Change ◽

Rna Seq ◽

Mastacembelus Armatus

sc-REnF: An entropy guided robust feature selection for clustering of single-cell rna-seq data

10.1101/2020.10.10.334573 ◽

2020 ◽

Author(s):

Snehalika Lall ◽

Abhik Ghosh ◽

Sumanta Ray ◽

Sanghamitra Bandyopadhyay

Keyword(s):

Single Cell ◽

Gene Selection ◽

Rna Seq ◽

Technical Noise ◽

Marker Selection ◽

Cell Clustering ◽

Typing Methods ◽

Original Application ◽

Downstream Analysis ◽

Cell Typing

Abstract: Many single-cell typing methods require pure clustering of cells, which is susceptible to technical noise and heavily dependent on high-quality informative genes selected in the preliminary steps of downstream analysis. Techniques for gene selection in single-cell RNA sequencing (scRNA-seq) data are seemingly simple, which causes problems for the resolution of (sub-)type detection and marker selection, and ultimately affects cell annotation. We introduce sc-REnF, a novel and robust entropy-based feature (gene) selection method, which leverages the landmark advantage of ‘Renyi’ and ‘Tsallis’ entropy achieved in their original application, in single cell clustering. Thereby, gene selection is robust and less sensitive to the technical noise present in the data, producing a pure clustering of cells, beyond classifying independent and unknown samples with utmost accuracy. The corresponding software is available at: https://github.com/Snehalikalall/sc-REnF
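
For readers unfamiliar with the two entropies named in this abstract, the sketch below computes their standard textbook definitions for a discrete distribution: the Renyi entropy of order alpha, log(sum of p_i^alpha) / (1 - alpha), and the Tsallis entropy of order q, (1 - sum of p_i^q) / (q - 1). It does not reproduce how sc-REnF bins expression values or ranks genes, which the abstract does not describe; the helper `to_distribution` and the example counts are assumptions made for illustration.

```python
import math


def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), natural log."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)


def tsallis_entropy(probs, q):
    """Tsallis entropy of order q (q != 1)."""
    if q == 1:
        raise ValueError("q must be different from 1")
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)


def to_distribution(counts):
    """Hypothetical helper: normalise non-negative counts to probabilities."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


# Illustrative counts only: a gene's values spread evenly over four bins.
uniform = to_distribution([5, 5, 5, 5])
renyi_uniform = renyi_entropy(uniform, alpha=2)
tsallis_uniform = tsallis_entropy(uniform, q=2)
```

Both quantities approach the Shannon entropy as the order parameter approaches 1, which is why that value is excluded here rather than handled as a special case.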

Leveraging high-powered RNA-Seq datasets to improve inference of regulatory activity in single-cell RNA-Seq data

10.1101/553040 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ning Wang ◽

Andrew E. Teschendorff

Keyword(s):

Transcription Factors ◽

Single Cell ◽

Cell Fate ◽

Regulatory Networks ◽

Large Scale ◽

Single Cells ◽

Differential Expression Analysis ◽

Dropout Rate ◽

Rna Seq ◽

Regulatory Activity

Abstract: Inferring the activity of transcription factors in single cells is a key task to improve our understanding of development and complex genetic diseases. This task is, however, challenging due to the relatively large dropout rate and noisy nature of single-cell RNA-Seq data. Here we present a novel statistical inference framework called SCIRA (Single Cell Inference of Regulatory Activity), which leverages the power of large-scale bulk RNA-Seq datasets to infer high-quality tissue-specific regulatory networks, from which regulatory activity estimates in single cells can be subsequently obtained. We show that SCIRA can correctly infer regulatory activity of transcription factors affected by high technical dropouts. In particular, SCIRA can improve sensitivity by as much as 70% compared to differential expression analysis and current state-of-the-art methods. Importantly, SCIRA can reveal novel regulators of cell-fate in tissue-development, even for cell-types that only make up 5% of the tissue, and can identify key novel tumor suppressor genes in cancer at single cell resolution. In summary, SCIRA will be an invaluable tool for single-cell studies aiming to accurately map activity patterns of key transcription factors during development, and how these are altered in disease.

JIND: Joint Integration and Discrimination for Automated Single-Cell Annotation

10.1101/2020.10.06.327601 ◽

2020 ◽

Author(s):

Mohit Goyal ◽

Guillermo Serrano ◽

Ilan Shomorony ◽

Mikel Hernaez ◽

Idoia Ochoa

Keyword(s):

Single Cell ◽

Cell Types ◽

Marker Genes ◽

Specific Marker ◽

Rna Seq ◽

Batch Effects ◽

Cell Type ◽

Latent Space ◽

Cell Type Specific ◽

Low Dimensional

Abstract: Single-cell RNA-seq is a powerful tool in the study of the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell-types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell-types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell-types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need to retrain the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND batch alignment is parallelizable, being more than five or six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.
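
The rejection step described in this abstract, where a cell is left unlabeled when the classifier's confidence falls below a cell-type-specific threshold, can be pictured as follows. This is a minimal sketch of the general idea and not JIND's code: the function name, the `"Unassigned"` label, and the probability and threshold values are all hypothetical, and the thresholds are supplied by hand here, whereas JIND learns them.

```python
def classify_with_rejection(probabilities, thresholds, unassigned="Unassigned"):
    """Pick the most probable cell type for one cell, or reject the cell.

    probabilities: dict mapping cell type -> predicted probability.
    thresholds: dict mapping cell type -> minimum confidence to accept.
    """
    best_type = max(probabilities, key=probabilities.get)
    if probabilities[best_type] < thresholds[best_type]:
        return unassigned
    return best_type


# Hypothetical per-type thresholds and classifier outputs for three cells.
thresholds = {"T cell": 0.8, "B cell": 0.6}
cells = [
    {"T cell": 0.70, "B cell": 0.30},  # best guess is below the T cell threshold
    {"T cell": 0.35, "B cell": 0.65},  # clears the B cell threshold
    {"T cell": 0.90, "B cell": 0.10},  # clears the T cell threshold
]
labels = [classify_with_rejection(p, thresholds) for p in cells]
```

Using a separate threshold per cell type lets easily confused types demand more confidence than well-separated ones, at the cost of rejecting more of their cells.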

Integrated Single Cell Atlas of Endothelial Cells of the Human Lung

Circulation ◽

10.1161/circulationaha.120.052318 ◽

2021 ◽

Author(s):

Jonas C. Schupp ◽

Taylor S. Adams ◽

Carlos Cosme Jr. ◽

Micha Sam Brickman Raredon ◽

Yifan Yuan ◽

...

Keyword(s):

Endothelial Cells ◽

Pulmonary Hypertension ◽

Single Cell ◽

Differential Expression ◽

Human Lung ◽

Differential Expression Analysis ◽

Cell Types ◽

Marker Genes ◽

Lung Endothelium ◽

Lung Endothelial Cells

Background: The cellular diversity of the lung endothelium has not been systematically characterized in humans. Here, we provide a reference atlas of human lung endothelial cells (ECs) to facilitate a better understanding of the phenotypic diversity and composition of cells comprising the lung endothelium. Methods: We reprocessed human control single cell RNA sequencing (scRNAseq) data from six datasets. EC populations were characterized through iterative clustering with subsequent differential expression analysis. Marker genes were validated by fluorescent microscopy and in situ hybridization. scRNAseq of primary lung ECs cultured in-vitro was performed. The signaling network between different lung cell types was studied. For cross species analysis or disease relevance, we applied the same methods to scRNAseq data obtained from mouse lungs or from human lungs with pulmonary hypertension. Results: Six lung scRNAseq datasets were reanalyzed and annotated to identify over 15,000 vascular EC cells from 73 individuals. Differential expression analysis of EC revealed signatures corresponding to endothelial lineage, including pan-endothelial, pan-vascular and subpopulation-specific marker gene sets. Beyond the broad cellular categories of lymphatic, capillary, arterial and venous ECs, we found previously indistinguishable subpopulations: among venous EC, we identified two previously indistinguishable populations, pulmonary-venous ECs (COL15A1neg) localized to the lung parenchyma and systemic-venous ECs (COL15A1pos) localized to the airways and the visceral pleura; among capillary EC, we confirmed their subclassification into recently discovered aerocytes characterized by EDNRB, SOSTDC1 and TBX2 and general capillary EC. We confirmed that all six endothelial cell types, including the systemic-venous EC and aerocytes, are present in mice and identified endothelial marker genes conserved in humans and mice. 
Ligand-receptor connectome analysis revealed important homeostatic crosstalk of EC with other lung resident cell types. scRNAseq of commercially available primary lung ECs demonstrated a loss of their native lung phenotype in culture. scRNAseq revealed that the endothelial diversity is maintained in pulmonary hypertension. Our manuscript is accompanied by an online data mining tool (www.LungEndothelialCellAtlas.com). Conclusions: Our integrated analysis provides a comprehensive and well-crafted reference atlas of lung endothelial cells in the normal lung, and confirms and describes in detail previously unrecognized endothelial populations across a large number of humans and mice.

Evaluation of single-cell classifiers for single-cell RNA sequencing data sets

Briefings in Bioinformatics ◽

10.1093/bib/bbz096 ◽

2019 ◽

Vol 21 (5) ◽

pp. 1581-1595 ◽

Cited By ~ 6

Author(s):

Xinlei Zhao ◽

Shuang Wu ◽

Nan Fang ◽

Xiao Sun ◽

Jue Fan

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Reference Data ◽

Predictive Accuracy ◽

Cell Types ◽

Superior Performance ◽

Marker Genes ◽

Data Sets ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract: Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
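
The "simple ensemble voting" mentioned in this abstract can be read as a per-cell majority vote over the labels returned by each tool. The sketch below shows one such scheme under stated assumptions: the tool outputs are made-up examples, and the rule that a tie yields `"unassigned"` is a choice made here for determinism, not something the article specifies.

```python
from collections import Counter


def majority_vote(votes, unassigned="unassigned"):
    """Return the most common label among votes, or `unassigned` on a tie."""
    if not votes:
        raise ValueError("need at least one vote")
    top = Counter(votes).most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return unassigned  # assumed tie rule, see lead-in
    return top[0][0]


def ensemble_predict(predictions_by_tool):
    """predictions_by_tool: dict mapping tool name -> list of per-cell labels."""
    per_cell = zip(*predictions_by_tool.values())
    return [majority_vote(list(votes)) for votes in per_cell]


# Hypothetical outputs of three classifiers on the same three cells.
predictions = {
    "tool_a": ["T cell", "B cell", "NK cell"],
    "tool_b": ["T cell", "T cell", "B cell"],
    "tool_c": ["T cell", "B cell", "T cell"],
}
ensemble_labels = ensemble_predict(predictions)
```

In this toy input the third cell receives three different labels, so the vote is tied and the cell stays unassigned rather than being given an arbitrary type.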

Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data

BMC Bioinformatics ◽

10.1186/s12859-020-03797-8 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Chunxiang Wang ◽

Xin Gao ◽

Juntao Liu

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Data Preprocessing ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Preprocessing Method ◽

Cell Clustering ◽

Cell Gene Expression

Abstract Background: Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. Results: We designed a graph-based algorithm, SC3-e, specifically for discriminating the best data preprocessing method for SC3, which is currently the most widely used clustering algorithm for single cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. Conclusion: The SC3-e algorithm is practically powerful for discriminating the best data preprocessing method, and therefore largely enhances the performance of cell-type clustering of SC3. It is expected to play a crucial role in the related studies of single-cell clustering, such as the studies of human complex diseases and discoveries of new cell types.
