Information Theoretic Feature Selection Methods for Single Cell RNA-Sequencing

Abstract Single cell RNA-sequencing (scRNA-seq) technologies have generated an expansive amount of new biological information, revealing new cellular populations and hierarchical relationships. A number of technologies complementary to scRNA-seq rely on the selection of a smaller number of marker genes (or features) to accurately differentiate cell types within a complex mixture of cells. In this paper, we benchmark differential expression methods against information-theoretic feature selection methods to evaluate the ability of these algorithms to identify small and efficient sets of genes that are informative about cell types. Unlike differential expression methods, which are strictly binary and univariate, information-theoretic methods can be used in any combination of binary or multiclass and univariate or multivariate settings. We show that, for some datasets, information-theoretic methods can reveal genes that are both distinct from those selected by traditional algorithms and as informative, if not more so, about the class labels. We also present detailed and principled theoretical analyses of these algorithms. All information-theoretic methods in this paper are implemented in our PicturedRocks Python package, which is compatible with the widely used scanpy package.
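To make the idea of information-theoretic marker selection concrete, the following is a minimal sketch of the simplest (univariate) case: each gene is binarized and scored by its empirical mutual information with the cell-type labels. It is not the PicturedRocks API; the function names, the binarization threshold, the dictionary layout, and the toy genes are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(feature, labels):
    """Empirical mutual information I(X; Y), in bits, of two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def rank_genes(expression, labels, threshold=0):
    """Rank genes by I(binarized expression; label), highest first.

    `expression` maps a gene name to a list of per-cell counts
    (a hypothetical layout chosen for this sketch).
    """
    scores = {
        gene: mutual_information([int(c > threshold) for c in counts], labels)
        for gene, counts in expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: eight cells, two cell types, three made-up genes.
labels = ["A"] * 4 + ["B"] * 4
expression = {
    "gene_marker": [5, 7, 3, 2, 0, 0, 0, 0],   # expressed only in type A
    "gene_partial": [4, 2, 6, 0, 1, 0, 0, 0],  # enriched in type A, not exclusive
    "gene_noise": [1, 0, 1, 0, 1, 0, 1, 0],    # independent of cell type
}
ranking = rank_genes(expression, labels)
for gene, score in ranking:
    print(f"{gene}: {score:.3f} bits")
```

Because the labels are passed as an arbitrary discrete sequence, the same scoring works unchanged for more than two classes. Multivariate selection, which scores a gene conditional on genes already chosen, is not shown here.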


Evaluation of single-cell classifiers for single-cell RNA sequencing data sets

Briefings in Bioinformatics ◽ 10.1093/bib/bbz096 ◽ 2019 ◽ Vol 21 (5) ◽ pp. 1581-1595 ◽ Cited By ~ 6

Author(s): Xinlei Zhao ◽ Shuang Wu ◽ Nan Fang ◽ Xiao Sun ◽ Jue Fan

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Reference Data ◽ Predictive Accuracy ◽ Cell Types ◽ Superior Performance ◽ Marker Genes ◽ Data Sets ◽ Sequencing Data ◽ Single Cell RNA Sequencing

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable to an increasingly large number of scRNA-seq data sets because of its complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
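The "simple ensemble voting" mentioned in the abstract can be sketched as a per-cell plurality vote over the labels returned by each tool. The abstract does not say how ties are handled, so mapping ties to 'unassigned' is an assumption of this sketch, as are the function name, the tool names, and the toy predictions.

```python
from collections import Counter


def majority_vote(predictions, unassigned="unassigned"):
    """Combine per-tool label lists into one label per cell by plurality vote.

    `predictions` maps a tool name to a list of predicted labels, one per cell.
    A tie for first place yields `unassigned` (an assumption of this sketch).
    """
    tools = list(predictions)
    n_cells = len(predictions[tools[0]])
    consensus = []
    for i in range(n_cells):
        votes = Counter(predictions[tool][i] for tool in tools)
        ranked = votes.most_common()
        top_label, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        consensus.append(unassigned if tied else top_label)
    return consensus


# Hypothetical outputs of three classifiers for four cells.
predictions = {
    "tool_1": ["T cell", "B cell", "NK cell", "unassigned"],
    "tool_2": ["T cell", "B cell", "T cell", "unassigned"],
    "tool_3": ["T cell", "monocyte", "B cell", "T cell"],
}
consensus = majority_vote(predictions)
print(consensus)
```

In this toy case the third cell receives three different labels and the fourth is left unassigned by two of three tools, so both end up 'unassigned' in the consensus.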


Deciphering cell lineage specification of human lung adenocarcinoma with single-cell RNA sequencing

10.21203/rs.3.rs-127270/v1 ◽ 2021

Author(s): Zhoufeng Wang ◽ Zhe Li ◽ Kun Zhou ◽ Li Zhang ◽ Ying Yang ◽ ...

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Early Stage ◽ Precancerous Lesions ◽ Cell Types ◽ Marker Genes ◽ Adenocarcinoma In Situ ◽ Invasive Adenocarcinoma ◽ Single Cell RNA Sequencing ◽ Clinical Stages

Abstract Lung adenocarcinomas (LUAD) start as precancerous lesions such as atypical adenomatous hyperplasia (AAH), develop stepwise into adenocarcinoma in situ (AIS) and minimally invasive adenocarcinoma (MIA), then eventually progress toward invasive adenocarcinoma (IA). To date, the cellular heterogeneity across these distinct clinical stages and the underlying molecular events driving tumor progression remain largely unclear. In this study, we performed single-cell RNA sequencing on 52 specimens from 25 patients spanning the four clinical stages. By assessing the expression pattern of marker genes among 268,471 cells, we identified 16 major cell types. We demonstrated that AT2 feature cell types (AT2-like cells) were associated with malignant composition. An AT2-like subcluster emerged first in AAH and partially lost AT2 cell transcriptional identity, accompanied by a gain of stemness during cell transition. In addition, genes related to energy metabolism and ribosome synthesis were upregulated in the early stage of LUAD, leading us to identify new markers including miRNA10 and β-hydroxybutyric acid to diagnose early-stage LUAD noninvasively in the blood. We also identified MDK and TIMP1 as potential biomarkers to facilitate our understanding of LUAD pathogenesis. Taken together, our data identified a new mechanism in LUAD evolution and provided a robust basis for diagnosis and treatment of LUAD.


Automated identification of Cell Types in Single Cell RNA Sequencing

10.1101/532093 ◽ 2019 ◽ Cited By ~ 3

Author(s): Feiyang Ma ◽ Matteo Pellegrini

Keyword(s): Neural Network ◽ Single Cell ◽ RNA Sequencing ◽ Immune Cell ◽ Cell Types ◽ Marker Genes ◽ Complex Data ◽ Cell Type ◽ Human T Cell ◽ Single Cell RNA Sequencing

Abstract Cell type identification is one of the major goals in single cell RNA sequencing (scRNA-seq). Current methods for assigning cell types typically involve the use of unsupervised clustering, the identification of signature genes in each cluster, followed by a manual lookup of these genes in the literature and databases to assign cell types. However, there are several limitations associated with these approaches, such as unwanted sources of variation that influence clustering and a lack of canonical markers for certain cell types. Here, we present ACTINN (Automated Cell Type Identification using Neural Networks), which employs a neural network with 3 hidden layers, trains on datasets with predefined cell types, and predicts cell types for other datasets based on the trained parameters. We trained the neural network on a mouse cell type atlas (Tabula Muris Atlas) and a human immune cell dataset, and used it to predict cell types for mouse leukocytes, human PBMCs and human T cell subtypes. The results showed that our neural network is fast and accurate, and should therefore be a useful tool to complement existing scRNA-seq pipelines.

Author Summary Single cell RNA sequencing (scRNA-seq) provides high resolution profiling of the transcriptomes of individual cells, which inevitably results in high volumes of data that require complex data processing pipelines. Usually, one of the first steps in the analysis of scRNA-seq is to assign individual cells to known cell types. To accomplish this, traditional methods first group the cells into different clusters, then find marker genes, and finally use these to manually assign cell types for each cluster. Thus these methods require prior knowledge of cell type canonical markers, and some level of subjectivity to make the cell type assignments. As a result, the process is often laborious and requires domain specific expertise, which is a barrier for inexperienced users. By contrast, our neural network ACTINN automatically learns the features for each predefined cell type and uses these features to predict cell types for individual cells. This approach is computationally efficient and requires no domain expertise of the tissues being studied. We believe ACTINN allows users to rapidly identify cell types in their datasets, thus rendering the analysis of their scRNA-seq datasets more efficient.
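As a rough illustration of what "a neural network with 3 hidden layers" computes at prediction time, here is a sketch of a forward pass from one cell's expression vector to a cell-type label. This is not the ACTINN code: the layer sizes, the ReLU and softmax activations, the label set, and the random untrained weights are all assumptions, so the predicted label here is meaningless beyond showing the data flow.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random placeholder weights; a real model would learn these in training."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(expression, layers, cell_types):
    """Forward pass: ReLU on the hidden layers, softmax on the output layer."""
    activations = expression
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    out_weights, out_biases = layers[-1]
    probabilities = softmax(dense(activations, out_weights, out_biases))
    best = max(range(len(cell_types)), key=probabilities.__getitem__)
    return cell_types[best], probabilities


rng = random.Random(0)
cell_types = ["T cell", "B cell", "monocyte"]  # hypothetical label set
sizes = [6, 8, 5, 4, len(cell_types)]  # 6 genes in, three hidden layers, 3 classes out
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
label, probabilities = predict([1.2, 0.0, 3.4, 0.5, 0.0, 2.1], layers, cell_types)
print(label, [round(p, 3) for p in probabilities])
```

Training on a labeled reference dataset would replace the random weights; only the trained parameters are then needed to label cells from a new dataset.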


Analysis of Single-Cell RNA-seq Data by Clustering Approaches

Current Bioinformatics ◽ 10.2174/1574893614666181120095038 ◽ 2019 ◽ Vol 14 (4) ◽ pp. 314-322 ◽ Cited By ~ 3

Author(s): Xiaoshu Zhu ◽ Hong-Dong Li ◽ Lilu Guo ◽ Fang-Xiang Wu ◽ Jianxin Wang

Keyword(s): Feature Selection ◽ Single Cell ◽ Cell Types ◽ Semisupervised Learning ◽ Similarity Measurement ◽ Marker Genes ◽ RNA Seq ◽ Selection Methods ◽ Clustering Methods ◽ Similarity Calculation

Background: The recently developed single-cell RNA sequencing (scRNA-seq) has attracted a great amount of attention due to its capability to interrogate expression of individual cells, which is superior to traditional bulk cell sequencing that can only measure mean gene expression of a population of cells. scRNA-seq has been successfully applied in finding new cell subtypes. New computational challenges exist in the analysis of scRNA-seq data. Objective: We provide an overview of the features of different similarity calculation and clustering methods, to help users select methods that are suitable for their scRNA-seq data. We would also like to show that feature selection methods are important for improving clustering performance. Results: We first described similarity measurement methods, followed by a review of some new clustering methods and their algorithmic details. This analysis revealed several new questions, including how to automatically estimate the number of clustering categories, how to discover novel subpopulations, and how to search for new marker genes by using feature selection methods. Conclusion: Without prior knowledge about the number of cell types, clustering or semisupervised learning methods are important tools for exploratory analysis of scRNA-seq data.


Application of HDBSCAN Method for Clustering scRNA-seq Data

Proceedings of the Institute for System Programming of RAS ◽ 10.15514/ispras-2020-32(5)-8 ◽ 2020 ◽ Vol 32 (5) ◽ pp. 111-120

Author(s): Maria Andreevna Akimenkova ◽ Anna Anatolyevna Maznina ◽ Anton Yurievich Naumov ◽ Evgeny Andreevich Karpulevich

Keyword(s): Feature Selection ◽ Dimensionality Reduction ◽ Single Cell ◽ RNA Sequencing ◽ Cell Types ◽ Adjusted Rand Index ◽ Clustering Method ◽ Clustering Problem ◽ Single Cell RNA Sequencing

One of the main tasks in the analysis of single cell RNA sequencing (scRNA-seq) data is the identification of cell types and subtypes, which is usually based on some method of clustering. There are a number of generally accepted approaches to solving the clustering problem, one of which is implemented in the Seurat package. In addition, the quality of clustering is influenced by the use of preprocessing algorithms, such as imputation, dimensionality reduction, and feature selection. In the article, the HDBSCAN hierarchical clustering method is used to cluster scRNA-seq data. For a more complete comparison, experiments were run on two labeled datasets: Zeisel (3005 cells) and Romanov (2881 cells). To compare the quality of clustering, two external metrics were used: Adjusted Rand index and V-measure. The experiments demonstrated a higher quality of clustering by the HDBSCAN method on the Zeisel dataset and a poorer quality on the Romanov dataset.
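The two external metrics named above compare a clustering against known labels. The sketch below computes both from their standard definitions using only the Python standard library. It is an illustration with made-up toy labels, not the evaluation code used in the article, and the function names are assumptions.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index: pair-counting agreement, corrected for chance."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(truth, pred):
    """Harmonic mean of homogeneity and completeness."""
    h_truth, h_pred = entropy(truth), entropy(pred)
    homogeneity = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, pred) / h_truth
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred, truth) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy example: the second true class is split across two predicted clusters.
truth = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
ari = adjusted_rand_index(truth, pred)
vm = v_measure(truth, pred)
print(f"ARI = {ari:.3f}, V-measure = {vm:.3f}")
```

Both scores equal 1 for a clustering that matches the labels up to renaming of clusters. In the toy example every predicted cluster is pure, but one true class is split, so the clustering is fully homogeneous and only partly complete.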


Single-cell data clustering based on sparse optimization and low-rank matrix factorization

G3 Genes|Genome|Genetics ◽ 10.1093/g3journal/jkab098 ◽ 2021

Author(s): Yinlei Hu ◽ Bin Li ◽ Falai Chen ◽ Kun Qu

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Matrix Factorization ◽ Data Clustering ◽ Cell Types ◽ Low Rank ◽ Sequencing Data ◽ Rank Matrix ◽ Single Cell RNA Sequencing ◽ Low Rank Matrix

Abstract Unsupervised clustering is a fundamental step of single-cell RNA sequencing data analysis. This issue has inspired several clustering methods to classify cells in single-cell RNA sequencing data. However, accurate prediction of the cell clusters remains a substantial challenge. In this study, we propose a new algorithm for single-cell RNA sequencing data clustering based on Sparse Optimization and low-rank matrix factorization (scSO). We applied our scSO algorithm to analyze multiple benchmark datasets and showed that the cluster number predicted by scSO was close to the number of reference cell types and that most cells were correctly classified. Our scSO algorithm is available at https://github.com/QuKunLab/scSO. Overall, this study demonstrates a potent cell clustering approach that can help researchers distinguish cell types in single-cell RNA sequencing data.


Molecular characteristics and spatial distribution of adult human corneal cell subtypes

Scientific Reports ◽ 10.1038/s41598-021-94933-8 ◽ 2021 ◽ Vol 11 (1)

Author(s): Ann J. Ligocki ◽ Wen Fury ◽ Christian Gutierrez ◽ Christina Adler ◽ Tao Yang ◽ ...

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Cross Sections ◽ Cell Types ◽ Marker Genes ◽ Molecular Characteristics ◽ Transcriptional Level ◽ Human Cornea ◽ Adult Human ◽ And Migration

Abstract Bulk RNA sequencing of a tissue captures the gene expression profile from all cell types combined. Single-cell RNA sequencing identifies discrete cell-signatures based on transcriptomic identities. Six adult human corneas were processed for single-cell RNAseq and 16 cell clusters were bioinformatically identified. Based on their transcriptomic signatures and RNAscope results using representative cluster marker genes on human cornea cross-sections, these clusters were confirmed to be stromal keratocytes, endothelium, several subtypes of corneal epithelium, conjunctival epithelium, and supportive cells in the limbal stem cell niche. The complexity of the epithelial cell layer was captured by eight distinct corneal clusters and three conjunctival clusters. These were further characterized by enriched biological pathways and molecular characteristics which revealed novel groupings related to development, function, and location within the epithelial layer. Moreover, epithelial subtypes were found to reflect their initial generation in the limbal region, differentiation, and migration through to mature epithelial cells. The single-cell map of the human cornea deepens the knowledge of the cellular subsets of the cornea on a whole genome transcriptional level. This information can be applied to better understand normal corneal biology, serve as a reference to understand corneal disease pathology, and provide potential insights into therapeutic approaches.


Defining the Cell Types That Drive Idiopathic Pulmonary Fibrosis Using Single-Cell RNA Sequencing

American Journal of Respiratory and Critical Care Medicine ◽ 10.1164/rccm.201901-0197ed ◽ 2019 ◽ Vol 199 (12) ◽ pp. 1454-1456 ◽ Cited By ~ 1

Author(s): Joanna M. Poczobutt ◽ Oliver Eickelberg

Keyword(s): Idiopathic Pulmonary Fibrosis ◽ Pulmonary Fibrosis ◽ Single Cell ◽ RNA Sequencing ◽ Cell Types ◽ Single Cell RNA Sequencing


Single cell RNA sequencing of the Strongylocentrotus purpuratus larva reveals the blueprint of major cell types and nervous system of a non-chordate deuterostome

eLife ◽ 10.7554/elife.70416 ◽ 2021 ◽ Vol 10

Author(s): Periklis Paganos ◽ Danila Voronov ◽ Jacob M Musser ◽ Detlev Arendt ◽ Maria Ina Arnone

Keyword(s): Single Cell ◽ RNA Sequencing ◽ Sea Urchin ◽ Regulatory Networks ◽ Sea Urchins ◽ Neuronal Cell ◽ Cell Types ◽ Molecular Fingerprint ◽ Larval Body ◽ Single Cell RNA Sequencing

Identifying the molecular fingerprint of organismal cell types is key for understanding their function and evolution. Here, we use single cell RNA sequencing (scRNA-seq) to survey the cell types of the sea urchin early pluteus larva, representing an important developmental transition from non-feeding to feeding larva. We identify 21 distinct cell clusters, representing cells of the digestive, skeletal, immune, and nervous systems. Further subclustering of these reveals a highly detailed portrait of cell diversity across the larva, including the identification of neuronal cell types. We then validate important gene regulatory networks driving sea urchin development and reveal new domains of activity within the larval body. Focusing on neurons that co-express Pdx-1 and Brn1/2/4, we identify an unprecedented number of genes shared by this population of neurons in sea urchin and vertebrate endocrine pancreatic cells. Using differential expression results from Pdx-1 knockdown experiments, we show that Pdx1 is necessary for the acquisition of the neuronal identity of these cells. We hypothesize that a network similar to the one orchestrated by Pdx1 in the sea urchin neurons was active in an ancestral cell type and then inherited by neuronal and pancreatic developmental lineages in sea urchins and vertebrates.


Localization of migraine susceptibility genes in human brain by single-cell RNA sequencing

Cephalalgia ◽ 10.1177/0333102418762476 ◽ 2018 ◽ Vol 38 (13) ◽ pp. 1976-1983 ◽ Cited By ~ 5

Author(s): William Renthal

Keyword(s): Human Brain ◽ Single Cell ◽ RNA Sequencing ◽ Expression Profiles ◽ Cell Types ◽ Susceptibility Genes ◽ Brain Cell ◽ Cell Type ◽ Single Cell RNA Sequencing ◽ Brain Cell Types

Background Migraine is a debilitating disorder characterized by severe headaches and associated neurological symptoms. A key challenge to understanding migraine has been the cellular complexity of the human brain and the multiple cell types implicated in its pathophysiology. The present study leverages recent advances in single-cell transcriptomics to localize the specific human brain cell types in which putative migraine susceptibility genes are expressed. Methods The cell-type specific expression of both familial and common migraine-associated genes was determined bioinformatically using data from 2,039 individual human brain cells across two published single-cell RNA sequencing datasets. Enrichment of migraine-associated genes was determined for each brain cell type. Results Analysis of single-brain cell RNA sequencing data from five major subtypes of cells in the human cortex (neurons, oligodendrocytes, astrocytes, microglia, and endothelial cells) indicates that over 40% of known migraine-associated genes are enriched in the expression profiles of a specific brain cell type. Further analysis of neuronal migraine-associated genes demonstrated that approximately 70% were significantly enriched in inhibitory neurons and 30% in excitatory neurons. Conclusions This study takes the next step in understanding the human brain cell types in which putative migraine susceptibility genes are expressed. Both familial and common migraine may arise from dysfunction of discrete cell types within the neurovascular unit, and localization of the affected cell type(s) in an individual patient may provide insight into their susceptibility to migraine.
