Information Theoretic Feature Selection Methods for Single Cell RNA-Sequencing

2019 ◽  
Author(s):  
Umang Varma ◽  
Justin Colacino ◽  
Anna Gilbert

Abstract Single cell RNA-sequencing (scRNA-seq) technologies have generated an expansive amount of new biological information, revealing new cellular populations and hierarchical relationships. A number of technologies complementary to scRNA-seq rely on the selection of a smaller number of marker genes (or features) to accurately differentiate cell types within a complex mixture of cells. In this paper, we benchmark differential expression methods against information-theoretic feature selection methods to evaluate the ability of these algorithms to identify small and efficient sets of genes that are informative about cell types. Unlike differential methods, which are strictly binary and univariate, information-theoretic methods can be used in any combination of binary or multiclass and univariate or multivariate. We show that, for some datasets, information-theoretic methods can reveal genes that are both distinct from those selected by traditional algorithms and as informative about the class labels, if not more so. We also present detailed and principled theoretical analyses of these algorithms. All information-theoretic methods in this paper are implemented in our PicturedRocks Python package, which is compatible with the widely used scanpy package.
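
To make the idea of information-theoretic gene selection concrete, the following is a minimal sketch of the simplest case: univariate, multiclass ranking of genes by their mutual information with the cell-type labels. It is not the PicturedRocks API and does not reproduce the multivariate methods the abstract mentions; the function names, the dictionary layout of `expression`, and the toy gene names are all assumptions made for illustration, and expression values are assumed to be already discretized.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def rank_genes(expression, labels, k):
    """Return the names of the k genes with the highest univariate MI.

    `expression` maps gene name -> list of discretized expression values,
    one per cell. This layout is hypothetical, chosen for illustration.
    """
    scores = {gene: mutual_information(values, labels)
              for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (made up): 6 cells, 3 cell types, discretized expression.
labels = ["A", "A", "B", "B", "C", "C"]
expression = {
    "gene_marker_A": [1, 1, 0, 0, 0, 0],  # separates type A from the rest
    "gene_noise":    [1, 0, 1, 0, 1, 0],  # independent of cell type
    "gene_ordinal":  [0, 0, 1, 1, 2, 2],  # distinguishes all three types
}
top = rank_genes(expression, labels, k=2)
```

A univariate ranking like this scores each gene in isolation, so it can pick redundant genes; multivariate criteria address that by scoring a candidate gene jointly with the genes already selected.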

2019 ◽  
Vol 21 (5) ◽  
pp. 1581-1595 ◽  
Author(s):  
Xinlei Zhao ◽  
Shuang Wu ◽  
Nan Fang ◽  
Xiao Sun ◽  
Jue Fan

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
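
The ensemble voting mentioned above can be sketched as a per-cell majority vote over the labels returned by each tool. This is an illustrative sketch only, not the procedure used in the article: the tool names, the `min_agreement` threshold, and the fallback to an 'unassigned' label when no label wins a clear majority are assumptions.

```python
from collections import Counter


def majority_vote(predictions, min_agreement=0.5):
    """Combine the labels that several tools gave to one cell.

    Returns 'unassigned' unless the most common label is supported by
    more than `min_agreement` of the tools (an assumed rule).
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    if votes / len(predictions) <= min_agreement:
        return "unassigned"
    return label


def ensemble(predictions_by_tool):
    """predictions_by_tool: dict of tool name -> list of labels, one per cell."""
    per_cell = zip(*predictions_by_tool.values())
    return [majority_vote(cell_votes) for cell_votes in per_cell]


# Hypothetical predictions from three placeholder tools for three cells.
predictions = {
    "tool_a": ["T cell", "B cell", "NK cell"],
    "tool_b": ["T cell", "B cell", "T cell"],
    "tool_c": ["T cell", "monocyte", "B cell"],
}
consensus = ensemble(predictions)
```

In this toy input the third cell receives three different labels, so the vote returns 'unassigned' rather than picking one arbitrarily.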


2021 ◽  
Author(s):  
Zhoufeng Wang ◽  
Zhe Li ◽  
Kun Zhou ◽  
Li Zhang ◽  
Ying Yang ◽  
...  

Abstract Lung adenocarcinomas (LUAD) start as precancerous lesions such as atypical adenomatous hyperplasia (AAH), develop stepwise into adenocarcinoma in situ (AIS) and minimally invasive adenocarcinoma (MIA), then eventually progress toward invasive adenocarcinoma (IA). To date the cellular heterogeneity across these distinct clinical stages and the underlying molecular events driving tumor progression remain largely unclear. In this study, we performed single-cell RNA sequencing on 52 specimens from 25 patients spanning the four clinical stages. By assessing the expression pattern of marker genes among 268,471 cells, we identified 16 major cell types. We demonstrated that AT2 feature cell types (AT2-like cells) were associated with malignant composition. An AT2-like subcluster emerged first in AAH and partially lost AT2 cell transcriptional identity, accompanied by a gain of stemness during cell transition. In addition, genes related to energy metabolism and ribosome synthesis were upregulated in the early stage of LUAD, leading us to identify new markers including miRNA10 and β-hydroxybutyric acid to diagnose early-stage LUAD noninvasively in the blood. We also identified MDK and TIMP1 as potential biomarkers to facilitate our understanding of LUAD pathogenesis. Taken together, our data identified a new mechanism in LUAD evolution, and provided a robust basis for diagnosis and treatment of LUAD.


2019 ◽  
Author(s):  
Feiyang Ma ◽  
Matteo Pellegrini

Abstract Cell type identification is one of the major goals in single cell RNA sequencing (scRNA-seq). Current methods for assigning cell types typically involve the use of unsupervised clustering, the identification of signature genes in each cluster, followed by a manual lookup of these genes in the literature and databases to assign cell types. However, there are several limitations associated with these approaches, such as unwanted sources of variation that influence clustering and a lack of canonical markers for certain cell types. Here, we present ACTINN (Automated Cell Type Identification using Neural Networks), which employs a neural network with 3 hidden layers, trains on datasets with predefined cell types, and predicts cell types for other datasets based on the trained parameters. We trained the neural network on a mouse cell type atlas (Tabula Muris Atlas) and a human immune cell dataset, and used it to predict cell types for mouse leukocytes, human PBMCs and human T cell subtypes. The results showed that our neural network is fast and accurate, and should therefore be a useful tool to complement existing scRNA-seq pipelines.
Author Summary Single cell RNA sequencing (scRNA-seq) provides high resolution profiling of the transcriptomes of individual cells, which inevitably results in high volumes of data that require complex data processing pipelines. Usually, one of the first steps in the analysis of scRNA-seq is to assign individual cells to known cell types. To accomplish this, traditional methods first group the cells into different clusters, then find marker genes, and finally use these to manually assign cell types for each cluster. Thus these methods require prior knowledge of cell type canonical markers, and some level of subjectivity to make the cell type assignments. As a result, the process is often laborious and requires domain specific expertise, which is a barrier for inexperienced users.
By contrast, our neural network ACTINN automatically learns the features for each predefined cell type and uses these features to predict cell types for individual cells. This approach is computationally efficient and requires no domain expertise of the tissues being studied. We believe ACTINN allows users to rapidly identify cell types in their datasets, thus rendering the analysis of their scRNA-seq datasets more efficient.


2019 ◽  
Vol 14 (4) ◽  
pp. 314-322 ◽  
Author(s):  
Xiaoshu Zhu ◽  
Hong-Dong Li ◽  
Lilu Guo ◽  
Fang-Xiang Wu ◽  
Jianxin Wang

Background: The recently developed single-cell RNA sequencing (scRNA-seq) has attracted a great amount of attention due to its capability to interrogate expression of individual cells, which is superior to traditional bulk cell sequencing that can only measure mean gene expression of a population of cells. scRNA-seq has been successfully applied in finding new cell subtypes. New computational challenges exist in the analysis of scRNA-seq data. Objective: We provide an overview of the features of different similarity calculation and clustering methods, in order to help users select methods that are suitable for their scRNA-seq data. We would also like to show that feature selection methods are important to improve clustering performance. Results: We first described similarity measurement methods, followed by reviewing some new clustering methods, as well as their algorithmic details. This analysis revealed several new questions, including how to automatically estimate the number of clustering categories, how to discover novel subpopulations, and how to search for new marker genes by using feature selection methods. Conclusion: Without prior knowledge about the number of cell types, clustering or semisupervised learning methods are important tools for exploratory analysis of scRNA-seq data.


2020 ◽  
Vol 32 (5) ◽  
pp. 111-120
Author(s):  
Maria Andreevna Akimenkova ◽  
Anna Anatolyevna Maznina ◽  
Anton Yurievich Naumov ◽  
Evgeny Andreevich Karpulevich

One of the main tasks in the analysis of single cell RNA sequencing (scRNA-seq) data is the identification of cell types and subtypes, which is usually based on some method of clustering. There are a number of generally accepted approaches to solving the clustering problem, one of which is implemented in the Seurat package. In addition, the quality of clustering is influenced by the use of preprocessing algorithms, such as imputation, dimensionality reduction, feature selection, etc. In the article, the HDBSCAN hierarchical clustering method is used to cluster scRNA-seq data. For a more complete comparison, experiments were run on two labeled datasets: Zeisel (3005 cells) and Romanov (2881 cells). To compare the quality of clustering, two external metrics were used: Adjusted Rand index and V-measure. The experiments demonstrated a higher quality of clustering by the HDBSCAN method on the Zeisel dataset and a poorer quality on the Romanov dataset.
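
As an illustration of the first of the two external metrics, the Adjusted Rand index can be computed directly from the contingency table of reference labels against predicted clusters. The sketch below uses the standard pair-counting formula in plain Python; it is not the article's code, the function name is an assumption, V-measure is not covered, and the input is assumed to contain at least two cells.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells.

    Assumes both sequences have the same length and at least two items.
    """
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    # Pairs of cells that share both the reference label and the cluster.
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    # Pairs of cells that share a reference label / a predicted cluster.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every cell in one group.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions with renamed clusters score 1.0.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that splits every reference type scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, it is invariant to how clusters are named and can be negative when agreement is worse than expected at random.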


Author(s):  
Yinlei Hu ◽  
Bin Li ◽  
Falai Chen ◽  
Kun Qu

Abstract Unsupervised clustering is a fundamental step of single-cell RNA sequencing data analysis. This issue has inspired several clustering methods to classify cells in single-cell RNA sequencing data. However, accurate prediction of the cell clusters remains a substantial challenge. In this study, we propose a new algorithm for single-cell RNA sequencing data clustering based on Sparse Optimization and low-rank matrix factorization (scSO). We applied our scSO algorithm to analyze multiple benchmark datasets and showed that the cluster number predicted by scSO was close to the number of reference cell types and that most cells were correctly classified. Our scSO algorithm is available at https://github.com/QuKunLab/scSO. Overall, this study demonstrates a potent cell clustering approach that can help researchers distinguish cell types in single-cell RNA sequencing data.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ann J. Ligocki ◽  
Wen Fury ◽  
Christian Gutierrez ◽  
Christina Adler ◽  
Tao Yang ◽  
...  

Abstract Bulk RNA sequencing of a tissue captures the gene expression profile from all cell types combined. Single-cell RNA sequencing identifies discrete cell-signatures based on transcriptomic identities. Six adult human corneas were processed for single-cell RNAseq and 16 cell clusters were bioinformatically identified. Based on their transcriptomic signatures and RNAscope results using representative cluster marker genes on human cornea cross-sections, these clusters were confirmed to be stromal keratocytes, endothelium, several subtypes of corneal epithelium, conjunctival epithelium, and supportive cells in the limbal stem cell niche. The complexity of the epithelial cell layer was captured by eight distinct corneal clusters and three conjunctival clusters. These were further characterized by enriched biological pathways and molecular characteristics which revealed novel groupings related to development, function, and location within the epithelial layer. Moreover, epithelial subtypes were found to reflect their initial generation in the limbal region, differentiation, and migration through to mature epithelial cells. The single-cell map of the human cornea deepens the knowledge of the cellular subsets of the cornea on a whole genome transcriptional level. This information can be applied to better understand normal corneal biology, serve as a reference to understand corneal disease pathology, and provide potential insights into therapeutic approaches.


eLife ◽  
2021 ◽  
Vol 10 ◽  
Author(s):  
Periklis Paganos ◽  
Danila Voronov ◽  
Jacob M Musser ◽  
Detlev Arendt ◽  
Maria Ina Arnone

Identifying the molecular fingerprint of organismal cell types is key for understanding their function and evolution. Here, we use single cell RNA sequencing (scRNA-seq) to survey the cell types of the sea urchin early pluteus larva, representing an important developmental transition from non-feeding to feeding larva. We identify 21 distinct cell clusters, representing cells of the digestive, skeletal, immune, and nervous systems. Further subclustering of these reveals a highly detailed portrait of cell diversity across the larva, including the identification of neuronal cell types. We then validate important gene regulatory networks driving sea urchin development and reveal new domains of activity within the larval body. Focusing on neurons that co-express Pdx-1 and Brn1/2/4, we identify an unprecedented number of genes shared by this population of neurons in sea urchin and vertebrate endocrine pancreatic cells. Using differential expression results from Pdx-1 knockdown experiments, we show that Pdx1 is necessary for the acquisition of the neuronal identity of these cells. We hypothesize that a network similar to the one orchestrated by Pdx1 in the sea urchin neurons was active in an ancestral cell type and then inherited by neuronal and pancreatic developmental lineages in sea urchins and vertebrates.


Cephalalgia ◽  
2018 ◽  
Vol 38 (13) ◽  
pp. 1976-1983 ◽  
Author(s):  
William Renthal

Background Migraine is a debilitating disorder characterized by severe headaches and associated neurological symptoms. A key challenge to understanding migraine has been the cellular complexity of the human brain and the multiple cell types implicated in its pathophysiology. The present study leverages recent advances in single-cell transcriptomics to localize the specific human brain cell types in which putative migraine susceptibility genes are expressed. Methods The cell-type specific expression of both familial and common migraine-associated genes was determined bioinformatically using data from 2,039 individual human brain cells across two published single-cell RNA sequencing datasets. Enrichment of migraine-associated genes was determined for each brain cell type. Results Analysis of single-brain cell RNA sequencing data from five major subtypes of cells in the human cortex (neurons, oligodendrocytes, astrocytes, microglia, and endothelial cells) indicates that over 40% of known migraine-associated genes are enriched in the expression profiles of a specific brain cell type. Further analysis of neuronal migraine-associated genes demonstrated that approximately 70% were significantly enriched in inhibitory neurons and 30% in excitatory neurons. Conclusions This study takes the next step in understanding the human brain cell types in which putative migraine susceptibility genes are expressed. Both familial and common migraine may arise from dysfunction of discrete cell types within the neurovascular unit, and localization of the affected cell type(s) in an individual patient may provide insight into their susceptibility to migraine.

