GeneMarkeR: A Database and User Interface for scRNA-seq Marker Genes

Single-cell RNA sequencing (scRNA-seq) has enabled researchers to study cellular heterogeneity. Accurate cell type identification is crucial for scRNA-seq analysis to be valid and robust. Marker genes, genes specific for one or a few cell types, can improve cell type classification; however, their specificity varies across species, samples, and cell subtypes. Current marker gene databases lack standardization, cell hierarchy consideration, sample diversity, and/or the flexibility for updates as new data become available. Most of these databases are derived from a single statistical analysis, even though many such analyses for identifying marker genes from scRNA-seq data and pure cell populations are scattered across the literature. An R Shiny web tool called GeneMarkeR was developed for researchers to retrieve marker genes demonstrating cell type specificity across species, methodology and sample types based on a novel algorithm. The web tool facilitates online submission and interfaces with MySQL to ensure updatability. Furthermore, the tool incorporates reactive programming to enable researchers to retrieve standardized public data supporting the marker genes. GeneMarkeR currently hosts over 261,000 rows of standardized marker gene results from 25 studies across 21,012 unique genomic entities and 99 unique cell types mapped to hierarchical ontologies.


Exploiting marker genes for robust classification and characterization of single-cell chromatin accessibility

10.1101/2021.04.01.438068 ◽

2021 ◽

Author(s):

Risa Karakida Kawaguchi ◽

Ziqi Tang ◽

Stephan Fischer ◽

Rohit Tripathy ◽

Peter K. Koo ◽

...

Keyword(s):

Single Cell ◽

Marker Gene ◽

Cell Types ◽

Chromatin Accessibility ◽

Marker Genes ◽

Cell Type ◽

Gene Sets ◽

Typing Methods ◽

Cell Type Specific ◽

Cell Typing

Background: Single-cell Assay for Transposase Accessible Chromatin using sequencing (scATAC-seq) measures genome-wide chromatin accessibility for the discovery of cell-type specific regulatory networks. ScATAC-seq combined with single-cell RNA sequencing (scRNA-seq) offers important avenues for ongoing research, such as novel cell-type specific activation of enhancer and transcription factor binding sites as well as chromatin changes specific to cell states. On the other hand, scATAC-seq data is known to be challenging to interpret due to its high number of zeros as well as the heterogeneity derived from different protocols. Because of the stochastic lack of marker gene activities, cell type identification by scATAC-seq remains difficult even at a cluster level. Results: In this study, we exploit reference knowledge obtained from external scATAC-seq or scRNA-seq datasets to define existing cell types and uncover the genomic regions which drive cell-type specific gene regulation. To investigate the robustness of existing cell-typing methods, we collected 7 scATAC-seq datasets targeting mouse brain for a meta-analytic comparison of neuronal cell-type annotation, including a reference atlas generated by the BRAIN Initiative Cell Census Network (BICCN). By comparing the area under the receiver operating characteristics curves (AUROCs) for the three major cell types (inhibitory, excitatory, and non-neuronal cells), cell-typing performance by single markers is found to be highly variable even for known marker genes due to study-specific biases. However, the signal aggregation of a large and redundant marker gene set, optimized via multiple scRNA-seq data, achieves the highest cell-typing performances among 5 existing marker gene sets, from the individual cell to cluster level. That gene set also shows a high consistency with the cluster-specific genes from inhibitory subtypes in two well-annotated datasets, suggesting applicability to rare cell types. 
Next, we demonstrate a comprehensive assessment of scATAC-seq cell typing using exhaustive combinations of the marker gene sets with supervised learning methods including machine learning classifiers and joint clustering methods. Our results show that the combinations using robust marker gene sets systematically ranked at the top, not only with model based prediction using a large reference data but also with a simple summation of expression strengths across markers. To demonstrate the utility of this robust cell typing approach, we trained a deep neural network to predict chromatin accessibility in each subtype using only DNA sequence. Through model interpretation methods, we identify key motifs enriched about robust gene sets for each neuronal subtype. Conclusions: Through the meta-analytic evaluation of scATAC-seq cell-typing methods, we develop a novel method set to exploit the BICCN reference atlas. Our study strongly supports the value of robust marker gene selection as a feature selection tool and cross-dataset comparison between scATAC-seq datasets to improve alignment of scATAC-seq to known biology. With this novel, high quality epigenetic data, genomic analysis of regulatory regions can reveal sequence motifs that drive cell type-specific regulatory programs.
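The contrast between single-marker performance and aggregated marker signal can be illustrated with a minimal sketch. It assumes a toy gene-activity table and a rank-based AUROC; the gene names, values, and function names are illustrative assumptions and are not taken from the study or its datasets.

```python
def auroc(scores, labels):
    """AUROC via the rank-sum formulation; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def aggregate_marker_score(cells, marker_genes):
    """Sum the activity of every marker gene in each cell."""
    return [sum(cell[g] for g in marker_genes) for cell in cells]


# Toy data (hypothetical): one dict of gene activities per cell.
cells = [
    {"Gad1": 1, "Gad2": 0, "Slc32a1": 1},  # inhibitory
    {"Gad1": 0, "Gad2": 1, "Slc32a1": 1},  # inhibitory
    {"Gad1": 0, "Gad2": 0, "Slc32a1": 1},  # inhibitory, Gad1 and Gad2 not detected
    {"Gad1": 1, "Gad2": 0, "Slc32a1": 0},  # other
    {"Gad1": 0, "Gad2": 0, "Slc32a1": 0},  # other
    {"Gad1": 0, "Gad2": 0, "Slc32a1": 0},  # other
]
is_inhibitory = [1, 1, 1, 0, 0, 0]

single = auroc([c["Gad1"] for c in cells], is_inhibitory)
combined = auroc(
    aggregate_marker_score(cells, ["Gad1", "Gad2", "Slc32a1"]), is_inhibitory
)
print(f"single marker AUROC:   {single:.3f}")
print(f"aggregated set AUROC:  {combined:.3f}")
```

In this toy table the single marker is no better than chance because of dropout, while the summed score separates the two groups well; real data would need normalization and a much larger marker set.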


scQuery: a web server for comparative analysis of single-cell RNA-seq data

10.1101/323238 ◽

2018 ◽

Author(s):

Amir Alavi ◽

Matthew Ruffalo ◽

Aiyappa Parvangada ◽

Zhilin Huang ◽

Ziv Bar-Joseph

Keyword(s):

Single Cell ◽

Large Scale ◽

Web Server ◽

Cell Types ◽

Marker Genes ◽

Heterogeneous Environments ◽

Rna Seq ◽

Cell Type ◽

Small Set ◽

Unique Cell

Summary: Single cell RNA-Seq (scRNA-seq) studies often profile upward of thousands of cells in heterogeneous environments. Current methods for characterizing cells perform unsupervised analysis followed by assignment using a small set of known marker genes. Such approaches are limited to a few, well characterized cell types. To enable large scale supervised characterization we developed an automated pipeline to download, process, and annotate publicly available scRNA-seq datasets. We extended supervised neural networks to obtain efficient and accurate representations for scRNA-seq data. We applied our pipeline to analyze data from over 500 different studies with over 300 unique cell types and show that supervised methods greatly outperform unsupervised methods for cell type identification. A case study of neural degeneration data highlights the ability of these methods to identify differences between cell type distributions in healthy and diseased mice. We implemented a web server that compares new datasets to collected data employing fast matching methods in order to determine cell types, key genes, similar prior studies, and more.


AutoGeneS: Automatic gene selection using multi-objective optimization for RNA-seq deconvolution

10.1101/2020.02.21.940650 ◽

2020 ◽

Cited By ~ 5

Author(s):

Hananeh Aliee ◽

Fabian Theis

Keyword(s):

Single Cell ◽

Prior Knowledge ◽

Gene Selection ◽

Ground Truth ◽

Cell Types ◽

Cellular Heterogeneity ◽

Marker Genes ◽

Rna Seq ◽

Cell Type ◽

The Impact

Abstract: Tissues are complex systems of interacting cell types. Knowing cell-type proportions in a tissue is very important to identify which cells or cell types are targeted by a disease or perturbation. When measuring such responses using RNA-seq, bulk RNA-seq masks cellular heterogeneity. Hence, several computational methods have been proposed to infer cell-type proportions from bulk RNA samples. Their performance with noisy reference profiles highly depends on the set of genes undergoing deconvolution. These genes are often selected based on prior knowledge or a single-criterion test that might not be useful to dissect closely correlated cell types. In this work, we introduce AutoGeneS, a tool that automatically extracts informative genes and reveals the cellular heterogeneity of bulk RNA samples. AutoGeneS requires no prior knowledge about marker genes and selects genes by simultaneously optimizing multiple criteria: minimizing the correlation and maximizing the distance between cell types. It can be applied to reference profiles from various sources like single-cell experiments or sorted cell populations. Results from human samples of peripheral blood illustrate that AutoGeneS outperforms other methods. Our results also highlight the impact of our approach on analyzing bulk RNA samples with noisy single-cell reference profiles and closely correlated cell types. Ground truth cell proportions analyzed by flow cytometry confirmed the accuracy of the predictions of AutoGeneS in identifying cell-type proportions. AutoGeneS is available for use via a standalone Python package (https://github.com/theislab/AutoGeneS).
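The two criteria named in the abstract (low correlation and high distance between cell types) can be sketched for a toy reference matrix. This is not the AutoGeneS API or its search strategy: the sketch simply enumerates every gene subset of a fixed size and keeps the Pareto-optimal ones, and all names and values below are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def objectives(profiles, genes):
    """Total pairwise correlation and distance between cell types on the chosen genes."""
    sub = [[p[g] for g in genes] for p in profiles]
    pairs = list(combinations(sub, 2))
    return (
        sum(pearson(a, b) for a, b in pairs),
        sum(euclidean(a, b) for a, b in pairs),
    )


def pareto_front(candidates):
    """Keep (genes, corr, dist) entries that no other entry dominates.

    An entry dominates another if its correlation is no higher, its distance
    is no lower, and at least one of the two is strictly better.
    """
    front = []
    for c in candidates:
        dominated = any(
            o[1] <= c[1] and o[2] >= c[2] and (o[1] < c[1] or o[2] > c[2])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


# Toy reference profiles (hypothetical): one dict of mean expression per cell type.
profiles = [
    {"g1": 9, "g2": 1, "g3": 1, "g4": 5, "g5": 5},
    {"g1": 1, "g2": 9, "g3": 1, "g4": 5, "g5": 6},
    {"g1": 1, "g2": 1, "g3": 9, "g4": 5, "g5": 5},
]
all_genes = ["g1", "g2", "g3", "g4", "g5"]

candidates = [(genes, *objectives(profiles, genes)) for genes in combinations(all_genes, 3)]
front = pareto_front(candidates)
for genes, corr, dist in front:
    print(genes, round(corr, 3), round(dist, 3))
```

Exhaustive enumeration is only feasible for a handful of genes; a realistic gene pool would call for a heuristic multi-objective search instead.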


NS-Forest: A machine learning method for the objective identification of minimum marker gene combinations for cell type determination from single cell RNA sequencing

10.1101/2020.09.23.308932 ◽

2020 ◽

Author(s):

Brian Aevermann ◽

Yun Zhang ◽

Mark Novotny ◽

Trygve Bakken ◽

Jeremy Miller ◽

...

Keyword(s):

Machine Learning ◽

Single Cell ◽

Rna Sequencing ◽

Marker Gene ◽

Cell Types ◽

Biological Research ◽

Marker Genes ◽

Cell Type ◽

Type Identity ◽

Wide Range

Abstract: Single cell genomics is rapidly advancing our knowledge of cell phenotypic types and states. Driven by single cell/nucleus RNA sequencing (scRNA-seq) data, comprehensive atlas projects covering a wide range of organisms and tissues are currently underway. As a result, it is critical that the cell transcriptional phenotypes discovered are defined and disseminated in a consistent and concise manner. Molecular biomarkers have historically played an important role in biological research, from defining immune cell-types by surface protein expression to defining diseases by molecular drivers. Here we describe a machine learning-based marker gene selection algorithm, NS-Forest version 2.0, which leverages the non-linear attributes of random forest feature selection and a binary expression scoring approach to discover the minimal marker gene expression combinations that precisely capture the cell type identity represented in the complete scRNA-seq transcriptional profiles. The marker genes selected provide a barcode of the necessary and sufficient characteristics for semantic cell type definition and serve as useful tools for downstream biological investigation. The use of NS-Forest to identify marker genes for human brain middle temporal gyrus cell types reveals the importance of cell signaling and non-coding RNAs in neuronal cell type identity.


Accurate and fast cell marker gene identification with COSG

10.1101/2021.06.15.448484 ◽

2021 ◽

Author(s):

Min Dai ◽

Xiaobing Pei ◽

Xiu-Jie Wang

Keyword(s):

Single Cell ◽

Marker Gene ◽

Cell Types ◽

Superior Performance ◽

Gene Identification ◽

Marker Genes ◽

Sequencing Data ◽

Cell Type Specificity ◽

Spatially Resolved ◽

Downstream Analysis

Accurate cell classification is the groundwork for downstream analysis of single-cell sequencing data, yet identifying marker genes that distinguish different cell types remains a major challenge. We developed COSG, a cosine similarity-based method for more accurate and scalable marker gene identification. COSG is applicable to single-cell RNA sequencing data, single-cell ATAC sequencing data and spatially resolved transcriptome data. COSG is fast and scales to ultra-large datasets of millions of cells. Application to both simulated and real experimental datasets demonstrates the superior performance of COSG in both accuracy and efficiency compared with other available methods. Marker genes or genomic regions identified by COSG are more indicative and have greater cell-type specificity.
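The general idea of scoring markers by cosine similarity can be sketched as follows, assuming each gene's expression vector across cells is compared with a 0/1 indicator vector for the target cluster. This shows only that basic comparison and is not COSG's actual scoring formula or API; the gene names, cluster labels, and values are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_markers(expr, cluster_labels, target):
    """Rank genes by similarity between their expression and the cluster indicator."""
    indicator = [1.0 if c == target else 0.0 for c in cluster_labels]
    scores = {gene: cosine(values, indicator) for gene, values in expr.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy expression table (hypothetical): gene -> expression in six cells.
expr = {
    "geneA": [5, 4, 6, 0, 0, 0],  # restricted to the first cluster
    "geneB": [3, 3, 3, 3, 3, 3],  # expressed everywhere
    "geneC": [0, 0, 1, 4, 5, 6],  # mostly in the second cluster
}
cluster_labels = ["T", "T", "T", "B", "B", "B"]

ranked_t = rank_markers(expr, cluster_labels, "T")
ranked_b = rank_markers(expr, cluster_labels, "B")
print(ranked_t)
print(ranked_b)
```

A uniformly expressed gene still gets a moderate similarity to every cluster under this plain comparison, which is why a practical method needs an additional specificity penalty.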


Specificity of gene expression in adipocytes

Molecular and Cellular Biology ◽

10.1128/mcb.5.2.419-421.1985 ◽

1985 ◽

Vol 5 (2) ◽

pp. 419-421

Author(s):

K M Zezulak ◽

H Green

Keyword(s):

Gene Expression ◽

Cell Types ◽

Cell Type ◽

3T3 Cells ◽

Cell Type Specificity ◽

Number Of Genes ◽

Enhanced Expression ◽

Adipose Cells ◽

Distinctive Phenotype

During the differentiation of preadipose 3T3 cells into adipose cells, the mRNAs for three proteins increase strikingly in abundance. To determine the degree of cell-type specificity in the expression of these mRNAs, we estimated their abundances in several nonadipose tissues of the mouse. None of these mRNAs was strictly confined to adipocytes, but the ensemble of three mRNAs was rather specific to adipocytes. Insofar as is revealed by these three markers, the distinctive phenotype of adipocytes is the result of the enhanced expression of a number of genes, none of which is completely silent in all other cell types.


A scalable platform for the development of cell-type-specific viral drivers

eLife ◽

10.7554/elife.48089 ◽

2019 ◽

Vol 8 ◽

Cited By ~ 12

Author(s):

Sinisa Hrvatin ◽

Christopher P Tzeng ◽

M Aurel Nagy ◽

Hume Stroud ◽

Charalampia Koutsioumpa ◽

...

Keyword(s):

Gene Expression ◽

Heterologous Gene Expression ◽

High Specificity ◽

Cell Types ◽

Regulatory Elements ◽

Cell Type ◽

Cell Type Specificity ◽

Cell Type Specific ◽

The Many ◽

Dna Regulatory Elements

Enhancers are the primary DNA regulatory elements that confer cell type specificity of gene expression. Recent studies characterizing individual enhancers have revealed their potential to direct heterologous gene expression in a highly cell-type-specific manner. However, it has not yet been possible to systematically identify and test the function of enhancers for each of the many cell types in an organism. We have developed PESCA, a scalable and generalizable method that leverages ATAC- and single-cell RNA-sequencing protocols, to characterize cell-type-specific enhancers that should enable genetic access and perturbation of gene function across mammalian cell types. Focusing on the highly heterogeneous mammalian cerebral cortex, we apply PESCA to find enhancers and generate viral reagents capable of accessing and manipulating a subset of somatostatin-expressing cortical interneurons with high specificity. This study demonstrates the utility of this platform for developing new cell-type-specific viral reagents, with significant implications for both basic and translational research.


JIND: Joint Integration and Discrimination for Automated Single-Cell Annotation

10.1101/2020.10.06.327601 ◽

2020 ◽

Author(s):

Mohit Goyal ◽

Guillermo Serrano ◽

Ilan Shomorony ◽

Mikel Hernaez ◽

Idoia Ochoa

Keyword(s):

Single Cell ◽

Cell Types ◽

Marker Genes ◽

Specific Marker ◽

Rna Seq ◽

Batch Effects ◽

Cell Type ◽

Latent Space ◽

Cell Type Specific ◽

Low Dimensional

Abstract: Single-cell RNA-seq is a powerful tool in the study of the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell-types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell-types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell-types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need to retrain the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND batch alignment is parallelizable, being more than five or six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.
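Rejection with cell-type-specific confidence thresholds can be illustrated with a minimal sketch. It assumes a classifier has already produced per-class probabilities for each cell; the threshold values, labels, and function name are hypothetical and do not reproduce JIND's code or the way it learns its thresholds.

```python
def classify_with_rejection(probs, thresholds, reject_label="Unassigned"):
    """Return the most probable cell type, or reject_label if it is not confident enough.

    probs maps cell type -> predicted probability for one cell.
    thresholds maps cell type -> minimum probability required to accept that type.
    """
    best = max(probs, key=probs.get)
    return best if probs[best] >= thresholds[best] else reject_label


# Hypothetical thresholds: the stricter value stands for a type that is easily confused.
thresholds = {"T cell": 0.9, "B cell": 0.6}

predictions = [
    classify_with_rejection({"T cell": 0.95, "B cell": 0.05}, thresholds),
    classify_with_rejection({"T cell": 0.80, "B cell": 0.20}, thresholds),
    classify_with_rejection({"T cell": 0.35, "B cell": 0.65}, thresholds),
]
print(predictions)
```

Using one threshold per cell type rather than a single global cutoff lets the classifier stay strict for easily confused types without rejecting cells of well-separated ones.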


A nervous system-specific subnuclear organelle in Caenorhabditis elegans

Genetics ◽

10.1093/genetics/iyaa016 ◽

2021 ◽

Vol 217 (1) ◽

Author(s):

Kenneth Pham ◽

Neda Masoudi ◽

Eduardo Leyva-Díaz ◽

Oliver Hobert

Keyword(s):

Nervous System ◽

Caenorhabditis Elegans ◽

Homeobox Gene ◽

Cell Types ◽

Nuclear Bodies ◽

Loss Of Function ◽

Cell Type ◽

Cell Type Specificity ◽

Splicing Speckles ◽

Polycomb Bodies

Abstract: We describe here phase-separated subnuclear organelles in the nematode Caenorhabditis elegans, which we term NUN (NUclear Nervous system-specific) bodies. Unlike other previously described subnuclear organelles, NUN bodies are highly cell type specific. In fully mature animals, 4–10 NUN bodies are observed exclusively in the nucleus of neuronal, glial and neuron-like cells, but not in other somatic cell types. Based on co-localization and genetic loss of function studies, NUN bodies are not related to other previously described subnuclear organelles, such as nucleoli, splicing speckles, paraspeckles, Polycomb bodies, promyelocytic leukemia bodies, gems, stress-induced nuclear bodies, or clastosomes. NUN bodies form immediately after cell cycle exit, before other signs of overt neuronal differentiation and are unaffected by the genetic elimination of transcription factors that control many other aspects of neuronal identity. In one unusual neuron class, the canal-associated neurons, NUN bodies remodel during larval development, and this remodeling depends on the Prd-type homeobox gene ceh-10. In conclusion, we have characterized here a novel subnuclear organelle whose cell type specificity poses the intriguing question of what biochemical process in the nucleus makes all nervous system-associated cells different from cells outside the nervous system.


Comparative cellular analysis of motor cortex in human, marmoset and mouse

Nature ◽

10.1038/s41586-021-03465-8 ◽

2021 ◽

Vol 598 (7879) ◽

pp. 111-119 ◽

Cited By ~ 1

Author(s):

Trygve E. Bakken ◽

Nikolas L. Jorstad ◽

Qiwen Hu ◽

Blue B. Lake ◽

Wei Tian ◽

...

Keyword(s):

Motor Cortex ◽

Primary Motor Cortex ◽

Neuronal Cell ◽

Cell Types ◽

Morphological Characterization ◽

Fine Motor ◽

Marker Genes ◽

Cell Type ◽

Regulatory Pathways ◽

Cellular Analysis

Abstract: The primary motor cortex (M1) is essential for voluntary fine-motor control and is functionally conserved across mammals. Here, using high-throughput transcriptomic and epigenomic profiling of more than 450,000 single nuclei in humans, marmoset monkeys and mice, we demonstrate a broadly conserved cellular makeup of this region, with similarities that mirror evolutionary distance and are consistent between the transcriptome and epigenome. The core conserved molecular identities of neuronal and non-neuronal cell types allow us to generate a cross-species consensus classification of cell types, and to infer conserved properties of cell types across species. Despite the overall conservation, however, many species-dependent specializations are apparent, including differences in cell-type proportions, gene expression, DNA methylation and chromatin state. Few cell-type marker genes are conserved across species, revealing a short list of candidate genes and regulatory mechanisms that are responsible for conserved features of homologous cell types, such as the GABAergic chandelier cells. This consensus transcriptomic classification allows us to use patch–seq (a combination of whole-cell patch-clamp recordings, RNA sequencing and morphological characterization) to identify corticospinal Betz cells from layer 5 in non-human primates and humans, and to characterize their highly specialized physiology and anatomy. These findings highlight the robust molecular underpinnings of cell-type diversity in M1 across mammals, and point to the genes and regulatory pathways responsible for the functional identity of cell types and their species-specific adaptations.
