An Interpretable Framework for Clustering Single-Cell RNA-Seq Datasets

ABSTRACT Background: With the recent proliferation of single-cell RNA-Seq experiments, several methods have been developed for unsupervised analysis of the resulting datasets. These methods often rely on unintuitive hyperparameters and do not explicitly address the subjectivity associated with clustering. Results: In this work, we present DendroSplit, an interpretable framework for analyzing single-cell RNA-Seq datasets that addresses both the clustering interpretability and clustering subjectivity issues. DendroSplit offers a novel perspective on the single-cell RNA-Seq clustering problem motivated by the definition of “cell type,” allowing us to cluster using feature selection to uncover multiple levels of biologically meaningful populations in the data. We analyze several landmark single-cell datasets, demonstrating both the method’s efficacy and computational efficiency. Conclusion: DendroSplit offers a clustering framework that is comparable to existing methods in terms of accuracy and speed but is novel in its emphasis on interpretability. We provide the full DendroSplit software package at https://github.com/jessemzhang/dendrosplit.

MetaCell: analysis of single cell RNA-seq data using k-NN graph partitions

10.1101/437665 ◽

2018 ◽

Cited By ~ 10

Author(s):

Yael Baran ◽

Arnau Sebe-Pedros ◽

Yaniv Lubling ◽

Amir Giladi ◽

Elad Chomsky ◽

...

Keyword(s):

Single Cell ◽

Software Package ◽

Building Blocks ◽

Cell Populations ◽

Compact Groups ◽

Sampling Variance ◽

Statistical Control ◽

Rna Seq ◽

Cell Type ◽

Graph Partitions

ABSTRACT Single cell RNA-seq (scRNA-seq) has become the method of choice for analyzing mRNA distributions in heterogeneous cell populations. scRNA-seq only partially samples the cells in a tissue and the RNA in each cell, resulting in sparse data that challenge analysis. We develop a methodology that addresses scRNA-seq’s sparsity through partitioning the data into metacells: disjoint, homogenous and highly compact groups of cells, each exhibiting only sampling variance. Metacells constitute local building blocks for clustering and quantitative analysis of gene expression, while not enforcing any global structure on the data, thereby maintaining statistical control and minimizing biases. We illustrate the MetaCell framework by re-analyzing cell type and transcriptional gradients in peripheral blood and whole organism scRNA-seq maps. Our algorithms are implemented in the new MetaCell R/C++ software package.

Information-theory-based benchmarking and feature selection algorithm improve cell type annotation and reproducibility of single cell RNA-seq data analysis pipelines

10.1101/2020.11.02.365510 ◽

2020 ◽

Author(s):

Ziyou Ren ◽

Martin Gerlach ◽

Hanyu Shi ◽

GR Scott Budinger ◽

Luís A. Nunes Amaral

Keyword(s):

Information Theory ◽

Feature Selection ◽

Data Analysis ◽

Single Cell ◽

Clustering Algorithms ◽

Rna Seq ◽

Cell Type ◽

Cluster Resolution ◽

Parameter Values ◽

The Impact

Abstract Single cell RNA sequencing (scRNA-seq) data are now routinely generated in experimental practice because of their promise to enable the quantitative study of biological processes at the single cell level. However, cell type and cell state annotations remain an important computational challenge in analyzing scRNA-seq data. Here, we report on the development of a benchmark dataset where reference annotations are generated independently from transcriptomic measurements. We used this benchmark to systematically investigate the impact on labelling accuracy of different approaches to feature selection, of different clustering algorithms, and of different sets of parameter values. We show that an approach grounded on information theory can provide a general, reliable, and accurate process for discarding uninformative features and optimizing cluster resolution in single cell RNA-seq data analysis.
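
As a generic illustration of an information-theoretic measure of labelling agreement, the sketch below computes the normalized mutual information between cluster labels and independent reference annotations. The abstract does not state which quantities the authors actually use, so the choice of measure and all function names here are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in bits) between two labelings of the same cells."""
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    # Sum over observed (x, y) pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * log2(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def normalized_mutual_information(a, b):
    """MI divided by the mean of the two entropies (hypothetical helper)."""
    denom = (entropy(a) + entropy(b)) / 2
    # Two trivial one-cluster labelings are treated as identical partitions.
    return mutual_information(a, b) / denom if denom > 0 else 1.0
```

A score near 1 indicates that the clusters reproduce the reference annotation up to renaming of labels, while a score near 0 indicates that the clusters carry almost no information about the reference.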

Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data

Genomics Proteomics & Bioinformatics ◽

10.1016/j.gpb.2020.07.004 ◽

2020 ◽

Author(s):

Qianhui Huang ◽

Yu Liu ◽

Yuheng Du ◽

Lana X. Garmire

Keyword(s):

Single Cell ◽

Rna Seq ◽

Cell Type ◽

R Packages

Cell type diversity in scallop adductor muscles revealed by single-cell RNA-Seq

Genomics ◽

10.1016/j.ygeno.2021.08.015 ◽

2021 ◽

Vol 113 (6) ◽

pp. 3582-3598

Author(s):

Xiujun Sun ◽

Li Li ◽

Biao Wu ◽

Jianlong Ge ◽

Yanxin Zheng ◽

...

Keyword(s):

Single Cell ◽

Rna Seq ◽

Cell Type ◽

Adductor Muscles

JIND: Joint Integration and Discrimination for Automated Single-Cell Annotation

10.1101/2020.10.06.327601 ◽

2020 ◽

Author(s):

Mohit Goyal ◽

Guillermo Serrano ◽

Ilan Shomorony ◽

Mikel Hernaez ◽

Idoia Ochoa

Keyword(s):

Single Cell ◽

Cell Types ◽

Marker Genes ◽

Specific Marker ◽

Rna Seq ◽

Batch Effects ◽

Cell Type ◽

Latent Space ◽

Cell Type Specific ◽

Low Dimensional

Abstract Single-cell RNA-seq is a powerful tool in the study of the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell-types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell-types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell-types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need to retrain the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND batch alignment is parallelizable and is more than five or six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.
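
The idea of rejecting cells with cell-type-specific confidence thresholds can be sketched as follows. This is a minimal illustration and not the JIND implementation: the function name, the "Unassigned" label, the default threshold, and the hand-picked threshold values are all assumptions, whereas JIND learns its thresholds from data.

```python
def classify_with_rejection(probabilities, thresholds, reject_label="Unassigned"):
    """Return the top-scoring cell type, or reject_label when its probability
    falls below the threshold for that type.

    probabilities: dict mapping cell type -> predicted probability for one cell
    thresholds: dict mapping cell type -> minimum probability to accept
                (types missing from the dict fall back to an assumed 0.5)
    """
    best_type = max(probabilities, key=probabilities.get)
    if probabilities[best_type] < thresholds.get(best_type, 0.5):
        return reject_label
    return best_type


# Hypothetical thresholds: T cells need a stricter cutoff than B cells.
thresholds = {"T cell": 0.9, "B cell": 0.6}
confident = classify_with_rejection({"T cell": 0.95, "B cell": 0.05}, thresholds)
ambiguous = classify_with_rejection({"T cell": 0.70, "B cell": 0.30}, thresholds)
```

Using one threshold per cell type, rather than a single global cutoff, allows a stricter cutoff for types that are easily confused with their neighbours without rejecting cells of well-separated types.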

CellO: Comprehensive and hierarchical cell type classification of human cells with the Cell Ontology

10.1101/634097 ◽

2019 ◽

Cited By ~ 1

Author(s):

Matthew N. Bernstein ◽

Zhongjie Ma ◽

Michael Gleicher ◽

Colin N. Dewey

Keyword(s):

Single Cell ◽

Web Application ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Training Set ◽

Sequence Read Archive ◽

Cell Ontology ◽

Cell Type Specific ◽

Type Classification

Summary: Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. In this work, we present CellO, a machine learning-based tool for annotating human RNA-seq data with the Cell Ontology. CellO enables accurate and standardized cell type classification by considering the rich hierarchical structure of known cell types, a source of prior knowledge that is not utilized by existing methods. Furthermore, CellO comes pre-trained on a novel, comprehensive dataset of human, healthy, untreated primary samples in the Sequence Read Archive, which, to the best of our knowledge, is the most diverse curated collection of primary cell data to date. CellO’s comprehensive training set enables it to run out-of-the-box on diverse cell types and achieve superior or competitive performance when compared to existing state-of-the-art methods. Lastly, CellO’s linear models are easily interpreted, thereby enabling exploration of cell type-specific expression signatures across the ontology. To this end, we also present the CellO Viewer: a web application for exploring CellO’s models across the ontology.

Highlights:

We present CellO, a tool for hierarchically classifying cell type from single-cell RNA-seq data against the graph-structured Cell Ontology.

CellO is pre-trained on a comprehensive dataset comprising nearly all bulk RNA-seq primary cell samples in the Sequence Read Archive.

CellO achieves superior or comparable performance with existing methods while featuring a more comprehensive pre-packaged training set.

CellO is built with easily interpretable models which we expose through a novel web application, the CellO Viewer, for exploring cell type-specific signatures across the Cell Ontology.

CellMap: Characterizing the types and composition of iPSC-derived cells from RNA-seq data

10.1101/2021.05.24.445360 ◽

2021 ◽

Author(s):

Zhengyu Ouyang ◽

Nathanael Bourgeois ◽

Eugenia Lyashenko ◽

Paige Cundiff ◽

Patrick F Cullen ◽

...

Keyword(s):

Single Cell ◽

Induced Pluripotent Stem Cell ◽

Cell Types ◽

Model Systems ◽

Rna Seq ◽

Cell Type ◽

Fine Grained ◽

Single Nucleus ◽

Induced Pluripotent

Induced pluripotent stem cell (iPSC) derived cell types are increasingly employed as in vitro model systems for drug discovery. For these studies to be meaningful, it is important to understand the reproducibility of the iPSC-derived cultures and their similarity to equivalent endogenous cell types. Single-cell and single-nucleus RNA sequencing (RNA-seq) are useful to gain such understanding, but they are expensive and time consuming, while bulk RNA-seq data can be generated more quickly and at lower cost. In silico cell type decomposition is an efficient, inexpensive, and convenient alternative that can leverage bulk RNA-seq to derive more fine-grained information about these cultures. We developed CellMap, a computational tool that derives cell type profiles from publicly available single-cell and single-nucleus datasets to infer cell types in bulk RNA-seq data from iPSC-derived cell lines.
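
One common generic way to decompose a bulk profile into cell type proportions is non-negative least squares against reference profiles. The sketch below illustrates that idea with projected gradient descent in plain Python. It is not CellMap's algorithm, which the abstract does not describe; the function name, the toy signatures, and the gene values are hypothetical.

```python
def deconvolve(signatures, bulk, n_iter=2000):
    """Estimate non-negative cell-type weights w minimising ||S w - bulk||^2.

    signatures: dict mapping cell type -> list of per-gene reference expression
    bulk: list of per-gene expression for the bulk sample (same gene order)
    Returns a dict of cell type -> estimated proportion (values sum to 1).
    """
    types = list(signatures)
    n_genes = len(bulk)
    frob_sq = sum(v * v for t in types for v in signatures[t])
    if frob_sq == 0:
        raise ValueError("signatures must contain at least one non-zero value")
    # 2 * ||S||_F^2 bounds the gradient's Lipschitz constant, so this step is safe.
    step = 1.0 / (2.0 * frob_sq)
    w = {t: 0.0 for t in types}
    for _ in range(n_iter):
        # Residual of the current fit, computed once per iteration.
        residual = [
            sum(w[t] * signatures[t][g] for t in types) - bulk[g]
            for g in range(n_genes)
        ]
        for t in types:
            grad = 2.0 * sum(signatures[t][g] * residual[g] for g in range(n_genes))
            w[t] = max(0.0, w[t] - step * grad)  # project onto w >= 0
    total = sum(w.values())
    return {t: (w[t] / total if total > 0 else 0.0) for t in types}


# Toy example: a bulk sample mixed as 75% "neuron" and 25% "astrocyte".
toy_signatures = {"neuron": [10.0, 0.0, 5.0], "astrocyte": [0.0, 8.0, 5.0]}
toy_bulk = [7.5, 2.0, 5.0]
proportions = deconvolve(toy_signatures, toy_bulk)
```

In practice a library solver would replace the hand-written loop, and the reference signatures would be averaged from annotated single-cell or single-nucleus data rather than typed in by hand.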

Evaluating statistical learning methods for cell type classification and feature selection using RNA-seq data

BMC Bioinformatics ◽

10.1186/1471-2105-15-s10-p26 ◽

2014 ◽

Vol 15 (S10) ◽

Author(s):

Hao Chen

Keyword(s):

Feature Selection ◽

Statistical Learning ◽

Rna Seq ◽

Cell Type ◽

Learning Methods ◽

Type Classification

Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling

Nature Methods ◽

10.1038/s41592-019-0529-1 ◽

2019 ◽

Vol 16 (10) ◽

pp. 1007-1015 ◽

Cited By ~ 46

Author(s):

Allen W. Zhang ◽

Ciara O’Flanagan ◽

Elizabeth A. Chavez ◽

Jamie L. P. Lim ◽

Nicholas Ceglia ◽

...

Keyword(s):

Tumor Microenvironment ◽

Single Cell ◽

Rna Seq ◽

Cell Type ◽

Type Assignment

Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data

BMC Bioinformatics ◽

10.1186/s12859-020-03797-8 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Chunxiang Wang ◽

Xin Gao ◽

Juntao Liu

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Data Preprocessing ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Preprocessing Method ◽

Cell Clustering ◽

Cell Gene Expression

Abstract Background: Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. Results: We designed a graph-based algorithm, SC3-e, specifically for identifying the best data preprocessing method for SC3, which is currently the most widely used clustering algorithm for single cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. Conclusion: The SC3-e algorithm is effective in practice at identifying the best data preprocessing method, and therefore substantially enhances the cell-type clustering performance of SC3. It is expected to play a crucial role in related studies of single-cell clustering, such as studies of complex human diseases and the discovery of new cell types.
