SSMD: a semi-supervised approach for a robust cell type identification and deconvolution of mouse transcriptomics data

Author(s):  
Xiaoyu Lu ◽  
Szu-Wei Tu ◽  
Wennan Chang ◽  
Changlin Wan ◽  
Jiashi Wang ◽  
...  

Abstract: Deconvolution of mouse transcriptomic data is challenged by the fact that mouse models carry various genetic and physiological perturbations, making it questionable to assume fixed cell types and cell type marker genes across data set scenarios. We developed a Semi-Supervised Mouse data Deconvolution (SSMD) method to study the mouse tissue microenvironment. SSMD features (i) a novel nonparametric method to discover data set-specific cell type signature genes; (ii) a community detection approach for fixing cell types and their marker genes; and (iii) a constrained matrix decomposition method to solve cell type relative proportions that is robust to diverse experimental platforms. In summary, SSMD addresses several key challenges in the deconvolution of mouse tissue data, including (i) varied cell types and marker genes caused by highly divergent genotypic and phenotypic conditions of mouse experiments; (ii) diverse experimental platforms of mouse transcriptomics data; and (iii) small sample sizes and limited training data sources. It is also capable of estimating the proportions of 35 cell types in blood, inflammatory, central nervous or hematopoietic systems. In silico and experimental validation of SSMD demonstrated its high sensitivity and accuracy in identifying (sub) cell types and predicting cell proportions compared with state-of-the-art methods. A user-friendly R package and a web server of SSMD are released via https://github.com/xiaoyulu95/SSMD.
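
The idea of solving cell type proportions by a constrained matrix decomposition can be illustrated on toy data. The sketch below is not the SSMD algorithm: it assumes the signature matrix S (genes by cell types) is already known, whereas SSMD discovers data set-specific markers first, and it swaps in plain projected gradient descent for non-negative least squares. The function names `estimate_proportions` and `make_toy_signature`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_toy_signature(n_types=3, markers_per_type=10, high=10.0, low=1.0):
    """Build a toy signature matrix S (genes x cell types).

    Each cell type gets its own block of marker genes that are highly
    expressed in that type and lowly expressed in every other type.
    """
    S = np.full((n_types * markers_per_type, n_types), low)
    for k in range(n_types):
        S[k * markers_per_type:(k + 1) * markers_per_type, k] = high
    return S


def estimate_proportions(X, S, n_iter=500):
    """Minimise 0.5 * ||X - S @ P||_F^2 subject to P >= 0.

    X: bulk expression (genes x samples), S: signatures (genes x types).
    Uses projected gradient descent, then rescales each sample's
    column so the estimated proportions sum to 1.
    """
    gram = S.T @ S
    StX = S.T @ X
    # Step size 1/L, where L is the largest eigenvalue of S^T S.
    step = 1.0 / np.linalg.eigvalsh(gram)[-1]
    P = np.full((S.shape[1], X.shape[1]), 1.0 / S.shape[1])
    for _ in range(n_iter):
        grad = gram @ P - StX
        P = np.maximum(P - step * grad, 0.0)  # project onto P >= 0
    totals = np.maximum(P.sum(axis=0, keepdims=True), 1e-12)
    return P / totals


# Simulate noiseless bulk samples from known proportions and recover them.
rng = np.random.default_rng(0)
S = make_toy_signature()
P_true = rng.dirichlet(np.ones(S.shape[1]), size=5).T  # (types x samples)
X = S @ P_true
P_est = estimate_proportions(X, S)
```

On this noiseless toy input the recovered proportions match the simulated ones closely; with real data, noise, platform effects and unknown cell types make the problem much harder, which is the setting the abstract describes.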

2020 ◽  
Author(s):  
Xiaoyu Lu ◽  
Szu-Wei Tu ◽  
Wennan Chang ◽  
Changlin Wan ◽  
Jiashi Wang ◽  
...  

Abstract: Deconvolution of mouse transcriptomic data is challenged by the fact that mouse models carry various genetic and physiological perturbations, making it questionable to assume fixed cell types and cell type marker genes across data set scenarios. We developed a Semi-Supervised Mouse data Deconvolution (SSMD) method to study the mouse tissue microenvironment (TME). SSMD features (i) a novel non-parametric method to discover data set-specific cell type signature genes; (ii) a community detection approach for fixing cell types and their marker genes; and (iii) a constrained matrix decomposition method to solve cell type relative proportions that is robust to diverse experimental platforms. In summary, SSMD addresses several key challenges in the deconvolution of mouse tissue data, including (1) varied cell types and marker genes caused by highly divergent genotypic and phenotypic conditions of mouse experiments, (2) diverse experimental platforms of mouse transcriptomics data, and (3) small sample sizes and limited training data sources. It is also capable of estimating the proportions of 35 cell types in blood, inflammatory, central nervous or hematopoietic systems. In silico and experimental validation of SSMD demonstrated its high sensitivity and accuracy in identifying (sub) cell types and predicting cell proportions compared with state-of-the-art methods. A user-friendly R package and a web server of SSMD are released via https://github.com/xiaoyulu95/SSMD.

Key points:
We provide a novel tissue deconvolution method, namely SSMD, which is specifically designed for mouse data to handle the variations caused by different mouse strains, genetic and phenotypic backgrounds, and experimental platforms.
SSMD is capable of detecting data set-specific and tissue microenvironment-specific cell markers for more than 30 cell types in mouse blood, inflammatory tissue, cancer, and the central nervous system.
SSMD achieves much improved performance in estimating the relative proportions of cell types compared with state-of-the-art methods.
The semi-supervised setting enables the application of SSMD to transcriptomics, DNA methylation and ATAC-seq data.
A user-friendly R package and an R Shiny-based web server for SSMD are also provided.


2018 ◽  
Author(s):  
Wennan Chang ◽  
Changlin Wan ◽  
Xiaoyu Lu ◽  
Szu-wei Tu ◽  
Yifan Sun ◽  
...  

Abstract: We developed a novel deconvolution method, namely Inference of Cell Types and Deconvolution (ICTD), that addresses the fundamental issues of identifiability and robustness in the current tissue data deconvolution problem. ICTD provides substantially new capabilities for omics data-based characterization of a tissue microenvironment, including (1) maximizing the resolution in identifying resident cell types and subtypes that truly exist in a tissue, (2) identifying the most reliable marker genes for each cell type, which are tissue and data set specific, (3) handling the stability problem with co-linear cell types, (4) co-deconvoluting with available matched multi-omics data, and (5) inferring functional variations specific to one or several cell types. ICTD is empowered by (i) rigorously derived mathematical conditions of identifiable cell types and cell type-specific functions in tissue transcriptomics data, (ii) a semi-supervised approach to maximize the knowledge transfer of cell type and functional marker genes identified in single cell or bulk cell data in the analysis of tissue data, and (iii) a novel unsupervised approach to minimize the bias brought by training data. Application of ICTD to real and single cell simulated tissue data validated that the method has consistently good performance for tissue data coming from different species, tissue microenvironments, and experimental platforms. Beyond these new capabilities, ICTD outperformed other state-of-the-art deconvolution methods on prediction accuracy, the resolution of identifiable cell types, detection of unknown sub cell types, and assessment of cell type-specific functions. ICTD also holds promise for characterizing cell-cell interactions and discovering cell types and prognostic markers that are predictive of clinical outcomes.


2019 ◽  
Author(s):  
Chenling Xu ◽  
Romain Lopez ◽  
Edouard Mehlman ◽  
Jeffrey Regier ◽  
Michael I. Jordan ◽  
...  

Abstract: As single-cell transcriptomics becomes a mainstream technology, the natural next step is to integrate the accumulating data in order to achieve a common ontology of cell types and states. However, owing to various nuisance factors of variation, it is not straightforward to compare gene expression levels across data sets or to automatically assign cell type labels in a new data set based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of cohorts of single-cell RNA-seq data sets, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage any available cell state annotations, for instance when only one data set in a cohort is annotated, or when only a few cells in a single data set can be labeled using marker genes. We demonstrate that scVI and scANVI compare favorably to existing methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings such as a hierarchical structure of cell state labels. We further show that, unlike existing methods, scVI and scANVI represent the integrated data sets with a single generative model that can be directly used for any probabilistic decision-making task, using differential expression as our case study. scVI and scANVI are available as open source software and can be readily used to facilitate cell state annotation and help ensure consistency and reproducibility across studies.


2019 ◽  
Author(s):  
Alexander J. Cammack ◽  
Arnav Moudgil ◽  
Tomas Lagunas ◽  
Michael J. Vasek ◽  
Mark Shabsovich ◽  
...  

Abstract: Transcription factors (TFs) play a central role in the regulation of gene expression, controlling everything from cell fate decisions to activity-dependent gene expression. However, widely used methods for TF profiling in vivo (e.g. ChIP-seq) yield only an aggregated picture of TF binding across all cell types present within the harvested tissue; thus, it is challenging or impossible to determine how the same TF might bind different portions of the genome in different cell types, or even to identify its binding events at all in rare cell types in a complex tissue such as the brain. Here we present a versatile methodology, FLEX Calling Cards, for the mapping of TF occupancy in specific cell types from heterogeneous tissues. In this method, the TF of interest is fused to a hyperactive piggyBac transposase (hypPB), and this bipartite gene is delivered, along with donor transposons, to mouse tissue via a Cre-dependent adeno-associated virus (AAV). The fusion protein is expressed in Cre-expressing cells, where it inserts transposon “Calling Cards” near TF binding sites. These transposons permanently mark TF binding events and can be mapped using high-throughput sequencing. Alternatively, unfused hypPB interacts with and records the binding of the super enhancer (SE)-associated bromodomain protein, Brd4. To demonstrate the FLEX Calling Cards method, we first show that donor transposon and transposase constructs can be efficiently delivered to the postnatal day 1 (P1) mouse brain with AAV and that insertion profiles report TF occupancy. Then, using a Cre-dependent hypPB virus, we show the utility of this tool in defining cell type-specific TF profiles in multiple cell types of the brain. This approach will enable important cell type-specific studies of TF-mediated gene regulation in the brain and will provide valuable insights into brain development, homeostasis, and disease.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Bianca Dumitrascu ◽  
Soledad Villar ◽  
Dustin G. Mixon ◽  
Barbara E. Engelhardt

Abstract: Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of dissociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a data set given a hierarchy of cell types as labels, the markers found by our method improve the recovery of the cell type hierarchy with fewer markers than existing methods, using a computationally efficient and principled optimization.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Sheng Wang ◽  
Angela Oliveira Pisco ◽  
Aaron McGeever ◽  
Maria Brbic ◽  
Marinka Zitnik ◽  
...  

Abstract: Single cell technologies are rapidly generating large amounts of data that enable us to understand biological systems at single-cell resolution. However, joint analysis of datasets generated by independent labs remains challenging due to a lack of consistent terminology to describe cell types. Here, we present OnClass, an algorithm and accompanying software for automatically classifying cells into cell types that are part of the controlled vocabulary that forms the Cell Ontology. A key advantage of OnClass is its capability to classify cells into cell types not present in the training data, because it uses the Cell Ontology graph to infer cell type relationships. Furthermore, OnClass can be used to identify marker genes for all the Cell Ontology categories, regardless of whether the cell types are present or absent in the training data. This suggests that OnClass goes beyond a simple annotation tool for single cell datasets: it is the first algorithm capable of identifying marker genes specific to all terms of the Cell Ontology, and it offers the possibility of refining the Cell Ontology using a data-centric approach.


2019 ◽  
Author(s):  
Sheng Wang ◽  
Angela Oliveira Pisco ◽  
Aaron McGeever ◽  
Maria Brbic ◽  
Marinka Zitnik ◽  
...  

Abstract: Single cell technologies have rapidly generated an unprecedented amount of data that enables us to understand biological systems at single-cell resolution. However, joint analysis of datasets generated by independent labs remains challenging due to a lack of consistent terminology to describe cell types. Here, we present OnClass, an algorithm and accompanying software for automatically classifying cells into cell types that are part of the controlled vocabulary that forms the Cell Ontology. A key advantage of OnClass is its capability to classify cells into cell types not present in the training data, because it uses the Cell Ontology graph to infer cell type relationships. Furthermore, OnClass can be used to identify marker genes for all the Cell Ontology categories, independently of whether the cell types are present or absent in the training data. This suggests that OnClass can be used not only as an annotation tool for single cell datasets but also as an algorithm to identify marker genes specific to each term of the Cell Ontology, offering the possibility of refining the Cell Ontology using a data-centric approach.


2019 ◽  
Author(s):  
Bianca Dumitrascu ◽  
Soledad Villar ◽  
Dustin G. Mixon ◽  
Barbara E. Engelhardt

Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of dissociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers to identify and differentiate specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene transcript markers that jointly optimize cell label recovery using label-aware compressive classification methods, resulting in a substantially more robust and less redundant set of markers than existing methods. When applied to a data set given a hierarchy of cell type labels, the markers found by our method enable the recovery of the label hierarchy through a computationally efficient and principled optimization.


2020 ◽  
Author(s):  
Haidong Yan ◽  
Qi Song ◽  
Jiyoung Lee ◽  
John Schiefelbein ◽  
Song Li

Abstract: An essential step of single-cell RNA sequencing analysis is to classify specific cell types with marker genes in order to dissect the biological functions of each individual cell. In this study, we integrated five published scRNA-seq datasets from the Arabidopsis root containing over 25,000 cells and 17 cell clusters. We compared the performance of seven machine learning methods in classifying these cell types and determined that the random forest and support vector machine methods performed best. Using feature selection with these two methods and a correlation method, we identified 600 new marker genes for 10 root cell types; more than 70% of these machine learning-derived marker genes had not been identified before. We found that these new markers not only assign cell types as consistently as the previously known cell markers, but also perform better than existing markers on several evaluation metrics, including accuracy and sensitivity. Markers derived by the random forest method, in particular, were expressed in 89-98% of cells in endodermis, trichoblast, and cortex clusters, which is a 29-67% improvement over known markers. Finally, we found 111 new orthologous marker genes for the trichoblast in five plant species, which expands the number of marker genes by 58-170% in non-Arabidopsis plants. Our results represent a new approach to identifying cell-type marker genes from scRNA-seq data and pave the way for cross-species mapping of scRNA-seq data in plants.
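
The abstract mentions a correlation method for feature selection without giving its details. As a rough illustration only, the sketch below assumes one simple variant: rank genes by the Pearson correlation between their expression and a binary indicator of membership in the target cell type. The function name `rank_markers_by_correlation` and the tiny expression matrix are hypothetical, and this is not claimed to be the authors' procedure.

```python
import numpy as np


def rank_markers_by_correlation(expr, labels, target):
    """Rank genes as candidate markers for one cell type.

    expr:   (cells x genes) expression matrix
    labels: (cells,) cell type label of each cell
    target: the cell type to find markers for

    Returns (order, corr), where corr[j] is the Pearson correlation of
    gene j with membership in `target`, and order lists gene indices
    from most to least positively correlated.
    """
    expr = np.asarray(expr, dtype=float)
    y = (np.asarray(labels) == target).astype(float)
    xc = expr - expr.mean(axis=0)  # centre each gene
    yc = y - y.mean()              # centre the membership indicator
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Genes with zero variance get a correlation of 0 instead of NaN.
    corr = np.divide(xc.T @ yc, denom,
                     out=np.zeros(expr.shape[1]), where=denom > 0)
    return np.argsort(-corr), corr


# Toy data: gene 0 is high in type "A", gene 1 is flat, gene 2 is high in "B".
expr = np.array([
    [5, 2, 0],
    [6, 2, 1],
    [5, 2, 0],
    [0, 2, 5],
    [1, 2, 6],
    [0, 2, 5],
])
labels = np.array(["A", "A", "A", "B", "B", "B"])
order, corr = rank_markers_by_correlation(expr, labels, "A")
```

Here gene 0 ranks first for type "A" and gene 2 ranks last, while the uninformative flat gene receives a correlation of zero. A model-based selection such as random forest feature importance would replace the correlation score with a learned one.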


Nature ◽  
2021 ◽  
Vol 598 (7879) ◽  
pp. 103-110 ◽  
Author(s):  
Zizhen Yao ◽  
Hanqing Liu ◽  
Fangming Xie ◽  
Stephan Fischer ◽  
Ricky S. Adkins ◽  
...  

Abstract: Single-cell transcriptomics can provide quantitative molecular signatures for large, unbiased samples of the diverse cell types in the brain [1–3]. With the proliferation of multi-omics datasets, a major challenge is to validate and integrate results into a biological understanding of cell-type organization. Here we generated transcriptomes and epigenomes from more than 500,000 individual cells in the mouse primary motor cortex, a structure that has an evolutionarily conserved role in locomotion. We developed computational and statistical methods to integrate multimodal data and quantitatively validate cell-type reproducibility. The resulting reference atlas, containing over 56 neuronal cell types that are highly replicable across analysis methods, sequencing technologies and modalities, is a comprehensive molecular and genomic account of the diverse neuronal and non-neuronal cell types in the mouse primary motor cortex. The atlas includes a population of excitatory neurons that resemble pyramidal cells in layer 4 in other cortical regions [4]. We further discovered thousands of concordant marker genes and gene regulatory elements for these cell types. Our results highlight the complex molecular regulation of cell types in the brain and will directly enable the design of reagents to target specific cell types in the mouse primary motor cortex for functional analysis.

