Pervasive correlated evolution in gene expression shapes cell type transcriptomes

AbstractThe evolution and diversification of cell types is a key means by which animal complexity evolves. Recently, hierarchical clustering and phylogenetic methods have been applied to RNA-seq data to infer cell type evolutionary history and homology. A major challenge for interpreting this data is that cell type transcriptomes may not evolve independently due to correlated changes in gene expression. This non-independence can arise for several reasons, such as when different tissues share common regulatory sequences for regulating genes expressed in multiple tissues, i.e. pleiotropic effects of mutations. We develop a model to estimate the level of correlated transcriptome evolution (LCE) and apply it to different datasets. The results reveal pervasive correlated transcriptome evolution among different cell and tissue types. In general, tissues related by morphology or developmental lineage exhibit higher LCE than more distantly related tissues. Analyzing new data collected from bird skin appendages suggests that LCE decreases with the phylogenetic age of tissues compared, with recently evolved tissues exhibiting the highest LCE. Furthermore, we show correlated evolution can alter patterns of hierarchical clustering, causing different tissue types from the same species to cluster together. Using a dataset with sufficient taxon sampling, we performed a gene-wise estimation of LCE, identifying genes that most strongly contribute to the correlated evolution signal. Removing genes with high LCE allows for accurate reconstruction of evolutionary relationships among tissue types. Our study provides a statistical method to measure and account for correlated gene expression evolution when interpreting comparative transcriptome data.

Download Full-text

Integration of GWAS Summary Statistics and Gene Expression Reveals Target Cell Types Underlying Kidney Function Traits

Journal of the American Society of Nephrology ◽

10.1681/asn.2020010051 ◽

2020 ◽

Vol 31 (10) ◽

pp. 2326-2340 ◽

Cited By ~ 2

Author(s):

Yong Li ◽

Stefan Haug ◽

Pascal Schlosser ◽

Alexander Teumer ◽

Adrienne Tin ◽

...

Keyword(s):

Gene Expression ◽

Kidney Function ◽

Cell Types ◽

Gene Set Enrichment Analysis ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Rna Seq ◽

Cell Type ◽

Gene Set Enrichment ◽

Gene Set

BackgroundGenetic variants identified in genome-wide association studies (GWAS) are often not specific enough to reveal complex underlying physiology. By integrating RNA-seq data and GWAS summary statistics, novel computational methods allow unbiased identification of trait-relevant tissues and cell types.MethodsThe CKDGen consortium provided GWAS summary data for eGFR, urinary albumin-creatinine ratio (UACR), BUN, and serum urate. Genotype-Tissue Expression Project (GTEx) RNA-seq data were used to construct the top 10% specifically expressed genes for each of 53 tissues followed by linkage disequilibrium (LD) score–based enrichment testing for each trait. Similar procedures were performed for five kidney single-cell RNA-seq datasets from humans and mice and for a microdissected tubule RNA-seq dataset from rat. Gene set enrichment analyses were also conducted for genes implicated in Mendelian kidney diseases.ResultsAcross 53 tissues, genes in kidney function–associated GWAS loci were enriched in kidney (P=9.1E-8 for eGFR; P=1.2E-5 for urate) and liver (P=6.8·10-5 for eGFR). In the kidney, proximal tubule was enriched in humans (P=8.5E-5 for eGFR; P=7.8E-6 for urate) and mice (P=0.0003 for eGFR; P=0.0002 for urate) and confirmed as the primary cell type in microdissected tubules and organoids. Gene set enrichment analysis supported this and showed enrichment of genes implicated in monogenic glomerular diseases in podocytes. A systematic approach generated a comprehensive list of GWAS genes prioritized by cell type–specific expression.ConclusionsIntegration of GWAS statistics of kidney function traits and gene expression data identified relevant tissues and cell types, as a basis for further mechanistic studies to understand GWAS loci.

Download Full-text

Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq

eLife ◽

10.7554/elife.43803 ◽

2019 ◽

Vol 8 ◽

Cited By ~ 37

Author(s):

Dylan Kotliar ◽

Adrian Veres ◽

M Aurel Nagy ◽

Shervin Tabrizi ◽

Eran Hodis ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Matrix Factorization ◽

Cell Types ◽

Environmental Cues ◽

Rna Seq ◽

Cell Type ◽

Type Identity ◽

Brain Organoid ◽

Non Negative Matrix Factorization

Identifying gene expression programs underlying both cell-type identity and cellular activities (e.g. life-cycle processes, responses to environmental cues) is crucial for understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell’s expression profile may be a mixture of both types of programs, making them difficult to disentangle. Here, we benchmark and enhance the use of matrix factorization to solve this problem. We show with simulations that a method we call consensus non-negative matrix factorization (cNMF) accurately infers identity and activity programs, including their relative contributions in each cell. To illustrate the insights this approach enables, we apply it to published brain organoid and visual cortex scRNA-Seq datasets; cNMF refines cell types and identifies both expected (e.g. cell cycle and hypoxia) and novel activity programs, including programs that may underlie a neurosecretory phenotype and synaptogenesis.

Download Full-text

MarkerCount: A stable, count-based cell type identifier for single cell RNA-Seq experiments

10.21203/rs.3.rs-418249/v1 ◽

2021 ◽

Author(s):

Hanbyeol Kim ◽

Joongho Lee ◽

Keunsoo Kang ◽

Seokhyun Yoon

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Batch Effect ◽

Expression Level ◽

Rna Seq ◽

Cell Type ◽

Stable Performance ◽

Downstream Analysis

Abstract Cell type identification is a key step to downstream analysis of single cell RNA-seq experiments. Indispensible information for this is gene expression, which is used to cluster cells, train the model and set rejection thresholds. Problem is they are subject to batch effect arising from different platforms and preprocessing. We present MarkerCount, which uses the number of markers expressed regardless of their expression level to initially identify cell types and, then, reassign cell type in cluster-basis. MarkerCount works both in reference and marker-based mode, where the latter utilizes only the existing lists of markers, while the former required pre-annotated dataset to train the model. The performance was evaluated and compared with the existing identifiers, both marker and reference-based, that can be customized with publicly available datasets and marker DB. The results show that MarkerCount provides a stable performance when comparing with other reference-based and marker-based cell type identifiers.

Download Full-text

Identifying Gene Expression Programs of Cell-type Identity and Cellular Activity with Single-Cell RNA-Seq

10.1101/310599 ◽

2018 ◽

Cited By ~ 7

Author(s):

Dylan Kotliar ◽

Adrian Veres ◽

M. Aurel Nagy ◽

Shervin Tabrizi ◽

Eran Hodis ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Matrix Factorization ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Relative Contribution ◽

Neuronal Synapses ◽

Type Identity ◽

Brain Organoid

AbstractIdentifying gene expression programs underlying both cell-type identity and cellular activities (e.g. life-cycle processes, responses to environmental cues) is crucial for understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell’s expression profile may be a mixture of both types of programs, making them difficult to disentangle. Here we illustrate and enhance the use of matrix factorization as a solution to this problem. We show with simulations that a method that we call consensus non-negative matrix factorization (cNMF) accurately infers identity and activity programs, including the relative contribution of programs in each cell. Applied to published brain organoid and visual cortex scRNA-Seq datasets, cNMF refines the hierarchy of cell types and identifies both expected (e.g. cell cycle and hypoxia) and intriguing novel activity programs. We propose that one of the novel programs may reflect a neurosecretory phenotype and a second may underlie the formation of neuronal synapses. We make cNMF available to the community and illustrate how this approach can provide key insights into gene expression variation within and between cell types.

Download Full-text

Comprehensive characterization of tissue-specific chromatin accessibility in L2 Caenorhabditis elegans nematodes

10.1101/2020.09.15.299123 ◽

2020 ◽

Author(s):

Timothy J. Durham ◽

Riza M. Daza ◽

Louis Gevirtzman ◽

Darren A. Cusanovich ◽

William Stafford Noble ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Expression Patterns ◽

Cell Types ◽

Chromatin Accessibility ◽

Gene Expression Patterns ◽

Rna Seq ◽

Cell Type ◽

Tissue Specific ◽

C Elegans

AbstractRecently developed single cell technologies allow researchers to characterize cell states at ever greater resolution and scale. C. elegans is a particularly tractable system for studying development, and recent single cell RNA-seq studies characterized the gene expression patterns for nearly every cell type in the embryo and at the second larval stage (L2). Gene expression patterns are useful for learning about gene function and give insight into the biochemical state of different cell types; however, in order to understand these cell types, we must also determine how these gene expression levels are regulated. We present the first single cell ATAC-seq study in C. elegans. We collected data in L2 larvae to match the available single cell RNA-seq data set, and we identify tissue-specific chromatin accessibility patterns that align well with existing data, including the L2 single cell RNA-seq results. Using a novel implementation of the latent Dirichlet allocation algorithm, we leverage the single-cell resolution of the sci-ATAC-seq data to identify accessible loci at the level of individual cell types, providing new maps of putative cell type-specific gene regulatory sites, with promise for better understanding of cellular differentiation and gene regulation in the worm.

Download Full-text

Cell type-specific endometrial transcriptome changes during initial recognition of pregnancy in the mare

Reproduction Fertility and Development ◽

10.1071/rd18144 ◽

2019 ◽

Vol 31 (3) ◽

pp. 496 ◽

Cited By ~ 2

Author(s):

Iside Scaravaggi ◽

Nicole Borel ◽

Rebekka Romer ◽

Isabel Imboden ◽

Susanne E. Ulbrich ◽

...

Keyword(s):

Gene Expression ◽

Spatial Information ◽

Gene Expression Regulation ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Expression Of Genes ◽

Prostaglandin F ◽

Cell Type Specific ◽

Gene Expression Studies

Previous endometrial gene expression studies during the time of conceptus migration did not provide final conclusions on the mechanisms of maternal recognition of pregnancy (MRP) in the mare. This called for a cell type-specific endometrial gene expression analysis in response to embryo signals to improve the understanding of gene expression regulation in the context of MRP. Laser capture microdissection was used to collect luminal epithelium (LE), glandular epithelium and stroma from endometrial biopsies from Day 12 of pregnancy and Day 12 of the oestrous cycle. RNA sequencing (RNA-Seq) showed greater expression differences between cell types than between pregnant and cyclic states; differences between the pregnant and cyclic states were mainly found in LE. Comparison with a previous RNA-Seq dataset for whole biopsy samples revealed the specific origin of gene expression differences. Furthermore, genes specifically differentially expressed (DE) in one cell type were found that were not detectable as DE in biopsies. Overall, this study revealed spatial information about endometrial gene expression during the phase of initial MRP. The conceptus induced changes in the expression of genes involved in blood vessel development, specific spatial regulation of the immune system, growth factors, regulation of prostaglandin synthesis, transport prostaglandin receptors, specifically prostaglandin F receptor (PTGFR) in the context of prevention of luteolysis.

Download Full-text

Amalgamated cross-species transcriptomes reveal organ-specific propensity in gene expression evolution

Nature Communications ◽

10.1038/s41467-020-18090-8 ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Kenji Fukushima ◽

David D. Pollock

Keyword(s):

Gene Expression ◽

Gene Duplication ◽

Evolutionary Rate ◽

Expression Patterns ◽

Vertebrate Species ◽

Rna Seq ◽

Organ Specific ◽

Tightly Coupled ◽

Expression Evolution ◽

Gene Expression Evolution

Abstract The origins of multicellular physiology are tied to evolution of gene expression. Genes can shift expression as organisms evolve, but how ancestral expression influences altered descendant expression is not well understood. To examine this, we amalgamate 1,903 RNA-seq datasets from 182 research projects, including 6 organs in 21 vertebrate species. Quality control eliminates project-specific biases, and expression shifts are reconstructed using gene-family-wise phylogenetic Ornstein–Uhlenbeck models. Expression shifts following gene duplication result in more drastic changes in expression properties than shifts without gene duplication. The expression properties are tightly coupled with protein evolutionary rate, depending on whether and how gene duplication occurred. Fluxes in expression patterns among organs are nonrandom, forming modular connections that are reshaped by gene duplication. Thus, if expression shifts, ancestral expression in some organs induces a strong propensity for expression in particular organs in descendants. Regardless of whether the shifts are adaptive or not, this supports a major role for what might be termed preadaptive pathways of gene expression evolution.

Download Full-text

CDSeqR: fast complete deconvolution for gene expression data from bulk tissues

BMC Bioinformatics ◽

10.1186/s12859-021-04186-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Kai Kang ◽

Caizhi Huang ◽

Yuanyuan Li ◽

David M. Umbach ◽

Leping Li

Keyword(s):

Gene Expression ◽

Cell Types ◽

Biological Tissues ◽

Specific Gene ◽

Specific Cell ◽

Specific Information ◽

Expression Data ◽

Rna Seq ◽

Cell Type ◽

Cell Type Specific

Abstract Background Biological tissues consist of heterogenous populations of cells. Because gene expression patterns from bulk tissue samples reflect the contributions from all cells in the tissue, understanding the contribution of individual cell types to the overall gene expression in the tissue is fundamentally important. We recently developed a computational method, CDSeq, that can simultaneously estimate both sample-specific cell-type proportions and cell-type-specific gene expression profiles using only bulk RNA-Seq counts from multiple samples. Here we present an R implementation of CDSeq (CDSeqR) with significant performance improvement over the original implementation in MATLAB and an added new function to aid cell type annotation. The R package would be of interest for the broader R community. Result We developed a novel strategy to substantially improve computational efficiency in both speed and memory usage. In addition, we designed and implemented a new function for annotating the CDSeq estimated cell types using single-cell RNA sequencing (scRNA-seq) data. This function allows users to readily interpret and visualize the CDSeq estimated cell types. In addition, this new function further allows the users to annotate CDSeq-estimated cell types using marker genes. We carried out additional validations of the CDSeqR software using synthetic, real cell mixtures, and real bulk RNA-seq data from the Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project. Conclusions The existing bulk RNA-seq repositories, such as TCGA and GTEx, provide enormous resources for better understanding changes in transcriptomics and human diseases. They are also potentially useful for studying cell–cell interactions in the tissue microenvironment. Bulk level analyses neglect tissue heterogeneity, however, and hinder investigation of a cell-type-specific expression. The CDSeqR package may aid in silico dissection of bulk expression data, enabling researchers to recover cell-type-specific information.

Download Full-text

DSAVE: Detection of misclassified cells in single-cell RNA-Seq data

PLoS ONE ◽

10.1371/journal.pone.0243360 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0243360

Author(s):

Johan Gustafsson ◽

Jonathan Robinson ◽

Juan S. Inda-Díaz ◽

Elias Björnson ◽

Rebecka Jörnsten ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Log Likelihood ◽

Single Cell Rna Sequencing ◽

Cell Transcriptome ◽

Average Gene ◽

Single Cell Transcriptome

Single-cell RNA sequencing has become a valuable tool for investigating cell types in complex tissues, where clustering of cells enables the identification and comparison of cell populations. Although many studies have sought to develop and compare different clustering approaches, a deeper investigation into the properties of the resulting populations is lacking. Specifically, the presence of misclassified cells can influence downstream analyses, highlighting the need to assess subpopulation purity and to detect such cells. We developed DSAVE (Down-SAmpling based Variation Estimation), a method to evaluate the purity of single-cell transcriptome clusters and to identify misclassified cells. The method utilizes down-sampling to eliminate differences in sampling noise and uses a log-likelihood based metric to help identify misclassified cells. In addition, DSAVE estimates the number of cells needed in a population to achieve a stable average gene expression profile within a certain gene expression range. We show that DSAVE can be used to find potentially misclassified cells that are not detectable by similar tools and reveal the cause of their divergence from the other cells, such as differing cell state or cell type. With the growing use of single-cell RNA-seq, we foresee that DSAVE will be an increasingly useful tool for comparing and purifying subpopulations in single-cell RNA-Seq datasets.

Download Full-text

Single-cell deconvolution of 3,000 post-mortem brain samples for eQTL and GWAS dissection in mental disorders

10.1101/2021.01.21.426000 ◽

2021 ◽

Author(s):

Yongjin Park ◽

Liang He ◽

Jose Davila-Velderrain ◽

Lei Hou ◽

Shahin Mohammadi ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Genetic Variants ◽

Cell Types ◽

Brain Regions ◽

Post Mortem ◽

Rna Seq ◽

Cell Type ◽

Post Mortem Brain ◽

Cell Type Specific

AbstractThousands of genetic variants acting in multiple cell types underlie complex disorders, yet most gene expression studies profile only bulk tissues, making it hard to resolve where genetic and non-genetic contributors act. This is particularly important for psychiatric and neurodegenerative disorders that impact multiple brain cell types with highly-distinct gene expression patterns and proportions. To address this challenge, we develop a new framework, SPLITR, that integrates single-nucleus and bulk RNA-seq data, enabling phenotype-aware deconvolution and correcting for systematic discrepancies between bulk and single-cell data. We deconvolved 3,387 post-mortem brain samples across 1,127 individuals and in multiple brain regions. We find that cell proportion varies across brain regions, individuals, disease status, and genotype, including genetic variants in TMEM106B that impact inhibitory neuron fraction and 4,757 cell-type-specific eQTLs. Our results demonstrate the power of jointly analyzing bulk and single-cell RNA-seq to provide insights into cell-type-specific mechanisms for complex brain disorders.

Download Full-text