Evaluating statistical learning methods for cell type classification and feature selection using RNA-seq data

Summary

Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. In this work, we present CellO, a machine learning-based tool for annotating human RNA-seq data with the Cell Ontology. CellO enables accurate and standardized cell type classification by considering the rich hierarchical structure of known cell types, a source of prior knowledge that is not utilized by existing methods. Furthermore, CellO comes pre-trained on a novel, comprehensive dataset of human, healthy, untreated primary samples in the Sequence Read Archive, which, to the best of our knowledge, is the most diverse curated collection of primary cell data to date. CellO’s comprehensive training set enables it to run out-of-the-box on diverse cell types and to achieve superior or competitive performance when compared to existing state-of-the-art methods. Lastly, CellO’s linear models are easily interpreted, thereby enabling exploration of cell type-specific expression signatures across the ontology. To this end, we also present the CellO Viewer: a web application for exploring CellO’s models across the ontology.

Highlights

We present CellO, a tool for hierarchically classifying cell type from single-cell RNA-seq data against the graph-structured Cell Ontology.

CellO is pre-trained on a comprehensive dataset comprising nearly all bulk RNA-seq primary cell samples in the Sequence Read Archive.

CellO achieves superior or comparable performance with existing methods while featuring a more comprehensive pre-packaged training set.

CellO is built with easily interpretable models which we expose through a novel web application, the CellO Viewer, for exploring cell type-specific signatures across the Cell Ontology.
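To illustrate what classifying against a graph-structured ontology can involve, the sketch below keeps a predicted cell type only when every one of its ancestors is also predicted. This is not CellO's actual model: the toy hierarchy, the scores, the 0.5 threshold, and the function names `ancestors` and `consistent_labels` are all assumptions made for the example.

```python
def ancestors(term, parents):
    """Collect every term reachable by following parent links in a DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def consistent_labels(scores, parents, threshold=0.5):
    """Keep a term only if it and all of its ancestors reach the threshold."""
    kept = set()
    for term, score in scores.items():
        if score < threshold:
            continue
        if all(scores.get(a, 0.0) >= threshold for a in ancestors(term, parents)):
            kept.add(term)
    return kept


# Toy hierarchy (assumed, heavily simplified; not the real Cell Ontology).
toy_parents = {
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "monocyte": ["leukocyte"],
    "leukocyte": ["cell"],
}

# Hypothetical per-term classifier scores for two samples.
sample_a = {"cell": 0.99, "leukocyte": 0.9, "lymphocyte": 0.8,
            "T cell": 0.7, "monocyte": 0.2}
sample_b = {"cell": 0.99, "leukocyte": 0.9, "lymphocyte": 0.3,
            "T cell": 0.9, "monocyte": 0.2}

labels_a = consistent_labels(sample_a, toy_parents)
labels_b = consistent_labels(sample_b, toy_parents)
```

In the second sample the "T cell" score is high but its parent "lymphocyte" falls below the threshold, so the output stops at "leukocyte" rather than reporting a label that contradicts the hierarchy.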

A Reference-free Approach for Cell Type Classification with scRNA-seq

10.1101/2021.05.29.446268 ◽

2021 ◽

Author(s):

Qi Sun ◽

Yifan Peng ◽

Jinze Liu

Keyword(s):

State Of The Art ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Current Version ◽

Expression Of Genes ◽

Distinct Cell ◽

Alignment Process ◽

Efficient Alternative ◽

Type Classification

Single-cell RNA sequencing (scRNA-seq) has become a revolutionary technology for detecting and characterizing distinct cell populations under different biological conditions. Unlike bulk RNA-seq, gene expression measured by scRNA-seq is highly sparse because of the limited sequencing depth per cell. The sparsity is worsened when a significant portion of reads that cannot be mapped is discarded during gene quantification. To overcome data sparsity and fully utilize the original sequences, we propose scSimClassify, a reference-free and alignment-free approach that classifies cell types with k-mer level features derived from the raw reads of a scRNA-seq experiment. The major contribution of scSimClassify is a simhash method that compresses k-mers with similar abundance profiles into groups. The compressed k-mer groups (CKGs) serve as the aggregated k-mer level features for cell type classification. We evaluate the performance of CKG features for predicting cell types in four scRNA-seq datasets, comparing four state-of-the-art classification methods as well as two scRNA-seq specific algorithms. Our experiments demonstrate that, in the majority of cases, CKG features yield better scRNA-seq classification accuracy than traditional gene expression features. Because CKG features can be efficiently derived from raw reads without a resource-intensive alignment process, scSimClassify offers an efficient alternative to help scientists rapidly classify cell types without relying on reference sequences. The current version of scSimClassify is implemented in Python and can be found at https://github.com/digi2002/scSimClassify.
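The idea of grouping k-mers by similar abundance profiles can be sketched with a generic random-hyperplane simhash. This is a minimal illustration and not the scSimClassify implementation: the centring step, the 8-bit signature, the summing of member abundances, and the names `simhash_signature`, `group_kmers`, and `group_features` are assumptions, and the k-mer counts are made up.

```python
import random


def simhash_signature(profile, hyperplanes):
    """Return one sign bit per random hyperplane for an abundance profile."""
    bits = []
    for plane in hyperplanes:
        dot = sum(x * w for x, w in zip(profile, plane))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def group_kmers(kmer_profiles, n_bits=8, seed=0):
    """Bucket k-mers whose centred abundance profiles share a signature."""
    rng = random.Random(seed)
    n_cells = len(next(iter(kmer_profiles.values())))
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(n_cells)]
                   for _ in range(n_bits)]
    groups = {}
    for kmer, profile in kmer_profiles.items():
        # Centre each profile so the signature reflects its shape across cells.
        mean = sum(profile) / len(profile)
        centred = [x - mean for x in profile]
        signature = simhash_signature(centred, hyperplanes)
        groups.setdefault(signature, []).append(kmer)
    return groups


def group_features(groups, kmer_profiles):
    """Sum member abundances per cell to get one feature vector per group."""
    features = {}
    for signature, kmers in groups.items():
        columns = zip(*(kmer_profiles[k] for k in kmers))
        features[signature] = [sum(col) for col in columns]
    return features


# Made-up counts of three k-mers across four cells.
toy_profiles = {
    "ACGT": [5, 0, 3, 1],
    "CGTA": [10, 0, 6, 2],  # same shape as ACGT, twice the depth
    "TTAG": [0, 5, 2, 4],   # mirror image of ACGT
}
toy_groups = group_kmers(toy_profiles)
toy_features = group_features(toy_groups, toy_profiles)
```

Profiles that differ only by a positive scale factor get identical signatures, so "ACGT" and "CGTA" collapse into one feature, while the mirrored "TTAG" profile flips the sign bits and lands in a separate group.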

An Interpretable Framework for Clustering Single-Cell RNA-Seq Datasets

10.1101/191254 ◽

2017 ◽

Author(s):

Jesse M. Zhang ◽

Jue Fan ◽

H. Christina Fan ◽

David Rosenfeld ◽

David N. Tse

Keyword(s):

Feature Selection ◽

Single Cell ◽

Computational Efficiency ◽

Software Package ◽

Rna Seq ◽

Cell Type ◽

Clustering Problem ◽

Unsupervised Analysis ◽

Multiple Levels ◽

Definition Of

Background: With the recent proliferation of single-cell RNA-Seq experiments, several methods have been developed for unsupervised analysis of the resulting datasets. These methods often rely on unintuitive hyperparameters and do not explicitly address the subjectivity associated with clustering.

Results: In this work, we present DendroSplit, an interpretable framework for analyzing single-cell RNA-Seq datasets that addresses both the clustering interpretability and clustering subjectivity issues. DendroSplit offers a novel perspective on the single-cell RNA-Seq clustering problem motivated by the definition of “cell type,” allowing us to cluster using feature selection to uncover multiple levels of biologically meaningful populations in the data. We analyze several landmark single-cell datasets, demonstrating both the method’s efficacy and computational efficiency.

Conclusion: DendroSplit offers a clustering framework that is comparable to existing methods in terms of accuracy and speed but is novel in its emphasis on interpretability. We provide the full DendroSplit software package at https://github.com/jessemzhang/dendrosplit.

Standardizing RNA-seq Data Comparison and Unambiguous Cell Type Classification

10.17918/00000418 ◽

2021 ◽

Author(s):

Roze Alzabey

Keyword(s):

Rna Seq ◽

Cell Type ◽

Data Comparison ◽

Type Classification

Single-cell entropy to quantify the cellular transcriptome from single-cell RNA-seq data

10.1101/678557 ◽

2019 ◽

Author(s):

Jingxin Liu ◽

You Song ◽

Jinzhi Lei

Keyword(s):

Single Cell ◽

Gaussian Mixture ◽

Rna Seq ◽

Cell Type ◽

Transcriptome Profile ◽

Biological Interpretation ◽

A Cell ◽

Transcriptional State ◽

Type Classification ◽

Cellular Transcriptome

We present the use of single-cell entropy (scEntropy) to measure the order of the cellular transcriptome profile from single-cell RNA-seq data, which leads to a method of unsupervised cell type classification through scEntropy followed by the Gaussian mixture model (scEGMM). scEntropy is straightforward in defining an intrinsic transcriptional state of a cell. scEGMM is a coherent method of cell type classification that includes no parameters and no clustering; however, it is comparable to existing machine learning-based methods in benchmarking studies and facilitates biological interpretation.

Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis

Nature Machine Intelligence ◽

10.1038/s42256-020-00233-7 ◽

2020 ◽

Vol 2 (10) ◽

pp. 607-618 ◽

Cited By ~ 3

Author(s):

Jian Hu ◽

Xiangjie Li ◽

Gang Hu ◽

Yafei Lyu ◽

Katalin Susztak ◽

...

Keyword(s):

Neural Network ◽

Single Cell ◽

Transfer Learning ◽

Rna Seq ◽

Cell Type ◽

Type Classification

Sfaira accelerates data and model reuse in single cell genomics

Genome Biology ◽

10.1186/s13059-021-02452-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

David S. Fischer ◽

Leander Dony ◽

Martin König ◽

Abdul Moeed ◽

Luke Zappia ◽

...

Keyword(s):

Single Cell ◽

Data Sets ◽

Rna Seq ◽

Cell Type ◽

Training Models ◽

Public Data ◽

Data Partitions ◽

Cell Data ◽

Type Classification ◽

Different Levels

Single-cell RNA-seq datasets are often first analyzed independently without harnessing model fits from previous studies, and are then contextualized with public data sets, requiring time-consuming data wrangling. We address these issues with sfaira, a single-cell data zoo for public data sets paired with a model zoo for executable pre-trained models. The data zoo is designed to facilitate contribution of data sets using ontologies for metadata. We propose an adaptation of cross-entropy loss for cell type classification tailored to datasets annotated at different levels of coarseness. We demonstrate the utility of sfaira by training models across anatomic data partitions on 8 million cells.
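One plausible way to adapt cross-entropy to labels of varying coarseness is to score a coarse label by the total probability the model assigns to all fine-grained classes it covers. The sketch below illustrates that reading only; it is not taken from the sfaira code base, and the class names, logits, and the function `coarse_cross_entropy` are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_cross_entropy(logits, allowed_leaves):
    """Negative log of the probability mass on the leaves a label covers.

    With a single allowed leaf this reduces to ordinary cross-entropy.
    """
    probs = softmax(logits)
    mass = sum(probs[i] for i in allowed_leaves)
    return -math.log(mass)


# Hypothetical leaf classes: 0 = CD4 T cell, 1 = CD8 T cell, 2 = B cell.
logits = [2.0, 1.0, 0.0]

fine_loss = coarse_cross_entropy(logits, [0])       # cell labelled "CD4 T cell"
coarse_loss = coarse_cross_entropy(logits, [0, 1])  # cell labelled only "T cell"
```

Under this formulation a coarsely annotated cell still contributes a training signal, but the model is not penalized for how it splits probability among the leaves the coarse label allows.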

Information-theory-based benchmarking and feature selection algorithm improve cell type annotation and reproducibility of single cell RNA-seq data analysis pipelines

10.1101/2020.11.02.365510 ◽

2020 ◽

Author(s):

Ziyou Ren ◽

Martin Gerlach ◽

Hanyu Shi ◽

GR Scott Budinger ◽

Luís A. Nunes Amaral

Keyword(s):

Information Theory ◽

Feature Selection ◽

Data Analysis ◽

Single Cell ◽

Clustering Algorithms ◽

Rna Seq ◽

Cell Type ◽

Cluster Resolution ◽

Parameter Values ◽

The Impact

Single cell RNA sequencing (scRNA-seq) data are now routinely generated in experimental practice because of their promise to enable the quantitative study of biological processes at the single cell level. However, cell type and cell state annotations remain an important computational challenge in analyzing scRNA-seq data. Here, we report on the development of a benchmark dataset where reference annotations are generated independently from transcriptomic measurements. We used this benchmark to systematically investigate how labelling accuracy is affected by different approaches to feature selection, different clustering algorithms, and different sets of parameter values. We show that an approach grounded in information theory can provide a general, reliable, and accurate process for discarding uninformative features and optimizing cluster resolution in single cell RNA-seq data analysis.

Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis

10.1101/2020.02.02.931139 ◽

2020 ◽

Cited By ~ 1

Author(s):

Jian Hu ◽

Xiangjie Li ◽

Gang Hu ◽

Yafei Lyu ◽

Katalin Susztak ◽

...

Keyword(s):

Neural Network ◽

Single Cell ◽

Learning Algorithm ◽

Classification Algorithms ◽

Rna Seq ◽

Cell Type ◽

Source Data ◽

Different Populations ◽

Type Classification ◽

Target Data

An important step in single-cell RNA-seq (scRNA-seq) analysis is to cluster cells into different populations or types. Here we describe ItClust, an Iterative Transfer learning algorithm with neural network for scRNA-seq Clustering. ItClust learns cell type knowledge from well-annotated source data, but also leverages information in the target data to make it less dependent on the source data quality. Through extensive evaluations using datasets from different species and tissues generated with diverse scRNA-seq protocols, we show that ItClust significantly improves clustering and cell type classification accuracy compared to popular unsupervised clustering and supervised cell type classification algorithms.

Active feature selection discovers minimal gene-sets for classifying cell-types and disease states in single-cell mRNA-seq data

10.1101/2021.06.15.448478 ◽

2021 ◽

Author(s):

Xiaoqiao Chen ◽

Sisi Chen ◽

Matt Thomson

Keyword(s):

Feature Selection ◽

Single Cell ◽

Cell Types ◽

Mouse Tissue ◽

Support Vector ◽

Cell Type ◽

Multiple Myeloma Patient ◽

Gene Sets ◽

Type Classification

Sequencing costs currently prohibit the application of single cell mRNA-seq for many biological and clinical tasks of interest. Here, we introduce an active learning framework that constructs compressed gene sets that enable high accuracy classification of cell-types and physiological states while analyzing a minimal number of gene transcripts. Our active feature selection procedure constructs gene sets through an iterative cell-type classification task in which misclassified cells are examined at each round to identify maximally informative genes through an 'active' support vector machine (SVM) classifier. Our active SVM procedure automatically identifies gene sets that enable >90% cell-type classification accuracy in the Tabula Muris mouse tissue survey, as well as a ~40 gene set that enables classification of multiple myeloma patient samples with >95% accuracy. Broadly, the discovery of compact but highly informative gene sets might enable drastic reductions in sequencing requirements for applications of single-cell mRNA-seq.
