Integration of Diverse Transcriptomics Datasets using Random Forest to Predict Universal Functional Pathways in Tfr Cells

Motivation: T follicular regulatory (Tfr) cells are a specialized cell subset that controls humoral immunity. Despite a number of individual transcriptomic studies on these cells, core functional pathways have been difficult to uncover because of the substantial transcriptional overlap of these cells with other effector cell types, as well as transcriptional changes that occur in disease settings. Developing a core transcriptional module for Tfr cells that integrates multiple cell type comparisons as well as diverse disease settings will allow more accurate prediction of functional pathways. Researchers studying allergic reactions, immune responses to vaccines, autoimmunity and cancer could use this gene set to better understand the roles of Tfr cells in controlling disease progression. Other cell types with similar transcriptomic complexity across diverse disease settings may also be studied using similar approaches. High-throughput sequencing technologies generate large datasets that require specific tools for interpretation. A core transcriptional module for Tfr cells will allow investigators with little knowledge of Tfr biology to determine whether Tfr cells may have functional roles within their biological systems. With this work, we address the need for core gene modules that define specific subsets of immune cells.

Results: We introduce an integrated "core Tfr cell gene module" that can be incorporated into GSEA using various input sizes. The integrated core Tfr gene module was built using transcriptomic studies in Tfr cells from several different tissues, disease settings, and cell type comparisons. Random forest was used to integrate the transcriptomic studies to generate the core gene module. A GSEA gene set was formulated from the integrated core Tfr gene module for use in end-user-friendly GSEA tools. The gene sets are provided as GMT files, along with random genes taken from the GTEx data set. The user can upload the gene set to the GSEA website or to any gene set tool that accepts GMT files. We also present the full results of the model, including p-values calculated by random forest. This allows the user to choose the p-value cutoff that is most appropriate for the experimental setting.
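As a rough illustration of how a user-chosen p-value cutoff could turn the full model results into a GMT record, here is a minimal sketch. It assumes the results are available as a mapping from gene identifier to p-value; the gene names, p-values, set name, and helper functions are hypothetical placeholders, not the authors' data or code. The only format fact relied on is that a GMT line holds a set name, a description, and then the member genes, all tab-separated.

```python
def select_genes(gene_pvalues, cutoff):
    """Return genes whose p-value is at or below the cutoff, most significant first."""
    kept = [(p, gene) for gene, p in gene_pvalues.items() if p <= cutoff]
    return [gene for p, gene in sorted(kept)]


def gmt_line(set_name, description, genes):
    """Format one gene set as a GMT record: name, description, then genes (tab-separated)."""
    return "\t".join([set_name, description, *genes])


# Hypothetical model output: gene identifier -> p-value (placeholders, not real results).
example_pvalues = {"GENE_A": 0.001, "GENE_B": 0.20, "GENE_C": 0.03, "GENE_D": 0.049}

core = select_genes(example_pvalues, cutoff=0.05)
record = gmt_line("CORE_TFR_MODULE_P05", "example cutoff 0.05", core)
print(record)
```

A stricter or looser cutoff simply changes the number of genes written to the record, which is one way to obtain gene sets of different input sizes.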

Panoramic stitching of heterogeneous single-cell transcriptomic data

10.1101/371179 ◽

2018 ◽

Cited By ~ 17

Author(s):

Brian Hie ◽

Bryan Bryson ◽

Bonnie Berger

Keyword(s):

Single Cell ◽

Cell Types ◽

Data Sets ◽

Cell Type ◽

Data Set ◽

Wide Range ◽

Data Set Integration ◽

Biological Patterns ◽

Insight Into ◽

Comprehensive Reference

Abstract: Researchers are generating single-cell RNA sequencing (scRNA-seq) profiles of diverse biological systems [1–4] and every cell type in the human body [5]. Leveraging this data to gain unprecedented insight into biology and disease will require assembling heterogeneous cell populations across multiple experiments, laboratories, and technologies. Although methods for scRNA-seq data integration exist [6,7], they often naively merge data sets together even when the data sets have no cell types in common, leading to results that do not correspond to real biological patterns. Here we present Scanorama, a method inspired by algorithms for panorama stitching that overcomes the limitations of existing methods to enable accurate, heterogeneous scRNA-seq data set integration. Our strategy identifies and merges the shared cell types among all pairs of data sets and is orders of magnitude faster than existing techniques. We use Scanorama to combine 105,476 cells from 26 diverse scRNA-seq experiments across 9 different technologies into a single comprehensive reference, demonstrating how Scanorama can be used to obtain a more complete picture of cellular function across a wide range of scRNA-seq experiments.
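The abstract does not spell out how shared cell types are matched between a pair of data sets. One common way to find such correspondences is mutual nearest-neighbor matching, sketched below on toy two-dimensional points. This is an illustrative assumption, not Scanorama's implementation; the function names and data are hypothetical.

```python
import math


def nearest(query, reference):
    """Index of the reference point closest to `query` (Euclidean distance)."""
    return min(range(len(reference)), key=lambda j: math.dist(query, reference[j]))


def mutual_nearest_pairs(a, b):
    """Pairs (i, j) where a[i] and b[j] are each other's nearest neighbor."""
    a_to_b = [nearest(x, b) for x in a]
    b_to_a = [nearest(y, a) for y in b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Toy "expression profiles": dataset_a has a third cell far from everything in dataset_b.
dataset_a = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
dataset_b = [(0.1, 0.0), (0.3, 0.1)]

pairs = mutual_nearest_pairs(dataset_a, dataset_b)
print(pairs)
```

In this toy case the distant cell in `dataset_a` receives no match, because its nearest neighbor in `dataset_b` prefers a closer partner. That is the behavior one wants when a cell type is present in only one data set.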

ICTD: A semi-supervised cell type identification and deconvolution method for multi-omics data

10.1101/426593 ◽

2018 ◽

Cited By ~ 2

Author(s):

Wennan Chang ◽

Changlin Wan ◽

Xiaoyu Lu ◽

Szu-wei Tu ◽

Yifan Sun ◽

...

Keyword(s):

Single Cell ◽

Cell Types ◽

Training Data ◽

Marker Genes ◽

Cell Detection ◽

Omics Data ◽

Deconvolution Method ◽

Cell Type ◽

Data Set ◽

Cell Type Specific

Abstract: We developed a novel deconvolution method, Inference of Cell Types and Deconvolution (ICTD), that addresses the fundamental issues of identifiability and robustness in the current tissue data deconvolution problem. ICTD provides substantially new capabilities for omics data-based characterization of a tissue microenvironment, including (1) maximizing the resolution in identifying resident cell types and subtypes that truly exist in a tissue, (2) identifying the most reliable marker genes for each cell type, which are tissue and data set specific, (3) handling the stability problem with co-linear cell types, (4) co-deconvoluting with available matched multi-omics data, and (5) inferring functional variations specific to one or several cell types. ICTD is empowered by (i) rigorously derived mathematical conditions of identifiable cell types and cell type-specific functions in tissue transcriptomics data, (ii) a semi-supervised approach to maximize the knowledge transfer of cell type and functional marker genes identified in single-cell or bulk cell data in the analysis of tissue data, and (iii) a novel unsupervised approach to minimize the bias brought by training data. Application of ICTD to real and single-cell simulated tissue data validated that the method performs consistently well for tissue data coming from different species, tissue microenvironments, and experimental platforms. Beyond these new capabilities, ICTD outperformed other state-of-the-art deconvolution methods on prediction accuracy, the resolution of identifiable cell types, detection of unknown cell subtypes, and assessment of cell type-specific functions. The premise of ICTD also lies in characterizing cell-cell interactions and discovering cell types and prognostic markers that are predictive of clinical outcomes.

SSMD: a semi-supervised approach for a robust cell type identification and deconvolution of mouse transcriptomics data

Briefings in Bioinformatics ◽

10.1093/bib/bbaa307 ◽

2020 ◽

Author(s):

Xiaoyu Lu ◽

Szu-Wei Tu ◽

Wennan Chang ◽

Changlin Wan ◽

Jiashi Wang ◽

...

Keyword(s):

Small Sample Size ◽

Cell Types ◽

Small Sample ◽

Training Data ◽

Mouse Tissue ◽

Marker Genes ◽

Specific Cell ◽

Cell Type ◽

Data Set ◽

Transcriptomics Data

Abstract: Deconvolution of mouse transcriptomic data is challenged by the fact that mouse models carry various genetic and physiological perturbations, making it questionable to assume fixed cell types and cell type marker genes for different data set scenarios. We developed a Semi-Supervised Mouse data Deconvolution (SSMD) method to study the mouse tissue microenvironment. SSMD features (i) a novel nonparametric method to discover data set-specific cell type signature genes; (ii) a community detection approach for fixing cell types and their marker genes; and (iii) a constrained matrix decomposition method to solve for cell type relative proportions that is robust to diverse experimental platforms. In summary, SSMD addresses several key challenges in the deconvolution of mouse tissue data, including (i) varied cell types and marker genes caused by highly divergent genotypic and phenotypic conditions of mouse experiments; (ii) diverse experimental platforms of mouse transcriptomics data; and (iii) small sample sizes and limited training data sources. SSMD can also estimate the proportions of 35 cell types in the blood, inflammatory, central nervous or hematopoietic systems. In silico and experimental validation of SSMD demonstrated its high sensitivity and accuracy in identifying cell types and subtypes and in predicting cell proportions compared with state-of-the-art methods. A user-friendly R package and a web server of SSMD are released via https://github.com/xiaoyulu95/SSMD.

Integration of GWAS Summary Statistics and Gene Expression Reveals Target Cell Types Underlying Kidney Function Traits

Journal of the American Society of Nephrology ◽

10.1681/asn.2020010051 ◽

2020 ◽

Vol 31 (10) ◽

pp. 2326-2340 ◽

Cited By ~ 2

Author(s):

Yong Li ◽

Stefan Haug ◽

Pascal Schlosser ◽

Alexander Teumer ◽

Adrienne Tin ◽

...

Keyword(s):

Gene Expression ◽

Kidney Function ◽

Cell Types ◽

Gene Set Enrichment Analysis ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Rna Seq ◽

Cell Type ◽

Gene Set Enrichment ◽

Gene Set

Background: Genetic variants identified in genome-wide association studies (GWAS) are often not specific enough to reveal complex underlying physiology. By integrating RNA-seq data and GWAS summary statistics, novel computational methods allow unbiased identification of trait-relevant tissues and cell types. Methods: The CKDGen consortium provided GWAS summary data for eGFR, urinary albumin-creatinine ratio (UACR), BUN, and serum urate. Genotype-Tissue Expression Project (GTEx) RNA-seq data were used to construct the top 10% specifically expressed genes for each of 53 tissues followed by linkage disequilibrium (LD) score–based enrichment testing for each trait. Similar procedures were performed for five kidney single-cell RNA-seq datasets from humans and mice and for a microdissected tubule RNA-seq dataset from rat. Gene set enrichment analyses were also conducted for genes implicated in Mendelian kidney diseases. Results: Across 53 tissues, genes in kidney function–associated GWAS loci were enriched in kidney (P=9.1E-8 for eGFR; P=1.2E-5 for urate) and liver (P=6.8E-5 for eGFR). In the kidney, proximal tubule was enriched in humans (P=8.5E-5 for eGFR; P=7.8E-6 for urate) and mice (P=0.0003 for eGFR; P=0.0002 for urate) and confirmed as the primary cell type in microdissected tubules and organoids. Gene set enrichment analysis supported this and showed enrichment of genes implicated in monogenic glomerular diseases in podocytes. A systematic approach generated a comprehensive list of GWAS genes prioritized by cell type–specific expression. Conclusions: Integration of GWAS statistics of kidney function traits and gene expression data identified relevant tissues and cell types, as a basis for further mechanistic studies to understand GWAS loci.

Single-nuclei chromatin profiling of ventral midbrain reveals cell identity transcription factors and cell-type-specific gene regulatory variation

Epigenetics & Chromatin ◽

10.1186/s13072-021-00418-3 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Yujuan Gui ◽

Kamil Grzyb ◽

Mélanie H. Thomas ◽

Jochen Ohnmacht ◽

Pierre Garcia ◽

...

Keyword(s):

Gene Expression ◽

Transcription Factors ◽

Cell Types ◽

Mouse Strains ◽

Cell Type ◽

Data Set ◽

Cell Identity ◽

Regulatory Variation ◽

Ventral Midbrain ◽

Cell Type Specific

Background: Cell types in ventral midbrain are involved in diseases with variable genetic susceptibility, such as Parkinson’s disease and schizophrenia. Many genetic variants affect regulatory regions and alter gene expression in a cell-type-specific manner depending on the chromatin structure and accessibility. Results: We report 20,658 single-nuclei chromatin accessibility profiles of ventral midbrain from two genetically and phenotypically distinct mouse strains. We distinguish ten cell types based on chromatin profiles, and analysis of accessible regions controlling cell identity genes highlights cell-type-specific key transcription factors. Regulatory variation segregating the mouse strains manifests more at the transcriptome level than at the chromatin level. However, cell-type-level data reveal changes not captured at tissue level. To discover the scope and cell-type specificity of cis-acting variation in midbrain gene expression, we identify putative regulatory variants and show them to be enriched at differentially expressed loci. Finally, we find TCF7L2 to mediate trans-acting variation selectively in midbrain neurons. Conclusions: Our data set provides an extensive resource to study gene regulation in mesencephalon, offers insights into control of cell identity in the midbrain, and identifies cell-type-specific regulatory variation possibly underlying phenotypic and behavioural differences between mouse strains.

Automated identification of cell-type–specific genes and alternative promoters

10.1101/2021.12.01.470587 ◽

2021 ◽

Author(s):

Mickaël Mendez ◽

Jayson Harshbarger ◽

Michael M. Hoffman

Keyword(s):

Random Forest ◽

Individual Cell ◽

Cell Types ◽

Differentially Expressed ◽

Pairwise Comparisons ◽

Alternative Promoters ◽

Cell Type ◽

Bootstrap Approach ◽

Cell Type Specific ◽

Different Cell Types

Background: Identifying key transcriptional features, such as genes or transcripts, involved in cellular differentiation remains a challenging problem. Current methods for identifying key transcriptional features predominantly rely on pairwise comparisons among different cell types. These methods also identify long lists of differentially expressed transcriptional features. Combining the results from many such pairwise comparisons to find the transcriptional features specific only to one cell type is not straightforward. Thus, one must have a principled method for amalgamating pairwise cell type comparisons that makes full use of prior knowledge about the developmental relationships between cell types. Method: We developed Cell Lineage Analysis (CLA), a computational method which identifies transcriptional features with expression patterns that discriminate cell types, incorporating Cell Ontology knowledge on the relationship between different cell types. CLA uses random forest classification with a stratified bootstrap to increase the accuracy of binary classifiers when each cell type has a different number of samples. Regularized random forest results in a classifier that selects few but important transcriptional features. For each cell type pair, CLA runs multiple instances of regularized random forest and reports the transcriptional features consistently selected. CLA not only discriminates individual cell types but can also discriminate lineages of cell types related in the developmental hierarchy. Results: We applied CLA to Functional Annotation of the Mammalian Genome 5 (FANTOM5) data and identified discriminative transcription factor and long non-coding RNA (lncRNA) genes for 71 human cell types. With capped analysis of gene expression (CAGE) data, CLA identified individual cell-type–specific alternative promoters for cell surface markers. Compared to random forest with a standard bootstrap approach, CLA's stratified bootstrap approach improved the accuracy of gene expression classification models for more than 95% of 2060 cell type pairs examined. Applied to 10X Genomics single-cell RNA-seq data for CD14+ monocytes and FCGR3A+ monocytes, CLA selected only 13 discriminative genes. These genes included the top 9 out of 370 significantly differentially expressed genes obtained from conventional differential expression analysis methods. Discussion: Our CLA method combines tools to simplify the interpretation of transcriptome datasets from many cell types. It automates the identification of the most differentially expressed genes for each cell type pair. CLA's lineage score allows easy identification of the best transcriptional markers for each cell type and lineage in both bulk and single-cell transcriptomic data. Availability: CLA is available at https://cla.hoffmanlab.org. We deposited the version of the CLA source with which we ran our experiments at https://doi.org/10.5281/zenodo.3630670. We deposited other analysis code and results at https://doi.org/10.5281/zenodo.5735636.
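To make the idea of a stratified bootstrap concrete, the sketch below draws the same number of samples, with replacement, from each class, so that a cell type with few samples is not swamped by one with many. The abstract does not describe CLA's exact sampling scheme, so equal draws per class is an assumption, and the function name, labels, and sizes are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_bootstrap(labels, per_class, rng):
    """Draw `per_class` sample indices with replacement from each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    drawn = []
    for label in sorted(by_class):
        drawn.extend(rng.choices(by_class[label], k=per_class))
    return drawn


# Imbalanced toy labels: 20 samples of one cell type, 3 of another.
labels = ["monocyte"] * 20 + ["rare_type"] * 3
rng = random.Random(0)  # fixed seed so the draw is repeatable

sample = stratified_bootstrap(labels, per_class=3, rng=rng)
counts = Counter(labels[i] for i in sample)
print(counts)
```

Each tree of a random forest could then be trained on one such balanced draw, whereas a standard bootstrap over all 23 samples would mostly contain the majority class.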

Optimal marker gene selection for cell type discrimination in single cell analyses

Nature Communications ◽

10.1038/s41467-021-21453-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Bianca Dumitrascu ◽

Soledad Villar ◽

Dustin G. Mixon ◽

Barbara E. Engelhardt

Keyword(s):

Single Cell ◽

Gene Selection ◽

Marker Gene ◽

Cell Types ◽

Specific Cell ◽

Cell Type ◽

Computationally Efficient ◽

Data Set ◽

Gene Markers

Abstract: Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of dissociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a data set given a hierarchy of cell types as labels, the markers found by our method improve the recovery of the cell type hierarchy with fewer markers than existing methods using a computationally efficient and principled optimization.

ADAPTS: Automated Deconvolution Augmentation of Profiles for Tissue Specific cells

10.1101/633958 ◽

2019 ◽

Author(s):

Samuel A Danziger ◽

David L Gibbs ◽

Ilya Shmulevich ◽

Mark McConnell ◽

Matthew WB Trotter ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Gene Expression Data ◽

Immune Cell ◽

De Novo ◽

Cell Types ◽

Expression Data ◽

Cell Type ◽

Data Set ◽

Rnaseq Data

Abstract: Immune cell infiltration of tumors can be an important factor in determining patient outcomes, and immune cell presence can be inferred by deconvolving gene expression data drawn from a heterogeneous mix of cell types. One particularly powerful family of deconvolution techniques uses signature matrices of genes that uniquely identify each cell type as determined from cell type purified gene expression data. Many methods of this type have been recently published, often including new signature matrices appropriate for a single purpose, such as investigating a specific type of tumor. The package ADAPTS helps users make the most of this expanding knowledge base by introducing a framework for cell type deconvolution. ADAPTS implements modular tools for customizing signature matrices for new tissue types by adding custom cell types or building new matrices de novo, including from single cell RNAseq data. It includes a common interface to several popular deconvolution algorithms that use a signature matrix to estimate the proportion of cell types present in heterogeneous samples. ADAPTS also implements a novel method for clustering cell types into groups that are hard to distinguish by deconvolution and then re-splitting those clusters using hierarchical deconvolution. We demonstrate that the techniques implemented in ADAPTS improve the ability to reconstruct the cell types present in a single cell RNAseq data set in a blind predictive analysis. ADAPTS is currently available for use in R on CRAN and GitHub.

Cell Type–Specific Decomposition of Gingival Tissue Transcriptomes

Journal of Dental Research ◽

10.1177/0022034520979614 ◽

2021 ◽

pp. 002203452097961

Author(s):

F. Momen-Heravi ◽

R.A. Friedman ◽

S. Albeshri ◽

A. Sawle ◽

M. Kebschull ◽

...

Keyword(s):

Gene Expression ◽

B Cells ◽

Epithelial Cells ◽

Differentially Expressed Genes ◽

Cell Types ◽

Differentially Expressed ◽

Gingival Tissue ◽

Cell Type ◽

Data Set ◽

Cell Type Specific

Genome-wide transcriptomic analyses in whole tissues reflect the aggregate gene expression in heterogeneous cell populations comprising resident and migratory cells, and they are unable to identify cell type–specific information. We used a computational method (population-specific expression analysis [PSEA]) to decompose gene expression in gingival tissues into cell type–specific signatures for 8 cell types (epithelial cells, fibroblasts, endothelial cells, neutrophils, monocytes/macrophages, plasma cells, T cells, and B cells). We used a gene expression data set generated using microarrays from 120 persons (310 tissue samples; 241 periodontitis affected and 69 healthy). Decomposition of the whole-tissue transcriptomes identified differentially expressed genes in each of the cell types, which mapped to biologically relevant pathways, including dysregulation of Th17 cell differentiation, AGE-RAGE signaling, and epithelial-mesenchymal transition in epithelial cells. We validated selected PSEA-predicted, differentially expressed genes in purified gingival epithelial cells and B cells from an unrelated cohort (n = 15 persons), each of whom contributed 1 periodontitis-affected and 1 healthy gingival tissue sample. Differential expression of these genes by quantitative reverse transcription polymerase chain reaction corroborated the PSEA predictions and pointed to dysregulation of biologically important pathways in periodontitis. Collectively, our results demonstrate the robustness of PSEA in the decomposition of gingival tissue transcriptomes and its ability to identify differentially regulated transcripts in particular cellular constituents. These genes may serve as candidates for further investigation with respect to their roles in the pathogenesis of periodontitis.

Computational inference of H3K4me3 and H3K27ac domain length

10.7287/peerj.preprints.1748 ◽

2016 ◽

Author(s):

Julian Zubek ◽

Michael L Stitzel ◽

Duygu Ucar ◽

Dariusz M Plewczynski

Keyword(s):

Random Forest ◽

Dna Sequences ◽

Regression Models ◽

Cell Types ◽

Regulatory Elements ◽

Epigenetic Mark ◽

Cell Type ◽

Epigenetic Marks ◽

Genomic Signatures ◽

The Individual

Background. Recent epigenomic studies have shown that the length of a DNA region covered by an epigenetic mark is not just a byproduct of the assaying technologies and has functional implications for that locus. For example, expanded regions of DNA sequence that are marked by enhancer-specific histone modifications, such as acetylation of histone H3 lysine 27 (H3K27ac), coincide with cell-specific enhancers, known as super or stretch enhancers. Similarly, promoters of genes critical for cell-specific functions are marked by expanded H3K4me3 domains in the cognate cell type, and these can span DNA regions from 4-5kb up to 40-50kb in length. These expanded H3K4me3 domains are known as buffer domains or super promoters. Methods. To ask what correlates with, and potentially regulates, the length of loci marked with these two important histone marks, H3K4me3 and H3K27ac, we built Random Forest regression models. With these models we computationally identified genomic and epigenomic patterns that are predictive of the length of these marks in seven ENCODE cell lines. Results. We found that certain epigenetic marks and transcription factors explain the variability of the length of H3K4me3 and H3K27ac marks across different cell types, which implies that the lengths of these two epigenetic marks are tightly regulated in a given cell type. Our source code for the regression models and data can be found at our GitHub page: https://github.com/zubekj/broad_peaks. Discussion. Our Random Forest based regression models enabled us to estimate the individual contribution of different epigenetic marks and protein binding patterns to the length of H3K4me3 and H3K27ac deposition patterns, thereby potentially revealing genomic signatures at cell-specific regulatory elements.
