Visualizing the Structure of RNA-seq Expression Data using Grade of Membership Models

2016 ◽  
Author(s):  
Kushal K Dey ◽  
Chiaowen Joyce Hsiao ◽  
Matthew Stephens

Abstract Grade of membership models, also known as “admixture models”, “topic models” or “Latent Dirichlet Allocation”, are a generalization of cluster models that allow each sample to have membership in multiple clusters. These models are widely used in population genetics to model admixed individuals who have ancestry from multiple “populations”, and in natural language processing to model documents having words from multiple “topics”. Here we illustrate the potential for these models to cluster samples of RNA-seq gene expression data, measured on either bulk samples or single cells. We also provide methods to help interpret the clusters, by identifying genes that are distinctively expressed in each cluster. By applying these methods to several example RNA-seq applications we demonstrate their utility in identifying and summarizing structure and heterogeneity. Applied to data from the GTEx project on 53 human tissues, the approach highlights similarities among biologically-related tissues and identifies distinctively-expressed genes that recapitulate known biology. Applied to single-cell expression data from mouse preimplantation embryos, the approach highlights both discrete and continuous variation through early embryonic development stages, and highlights genes involved in a variety of relevant processes – from germ cell development, through compaction and morula formation, to the formation of inner cell mass and trophoblast at the blastocyst stage. The methods are implemented in the Bioconductor package CountClust.

Author Summary
The gene expression profile of a biological sample (either from single cells or pooled cells) results from a complex interplay of multiple related biological processes. Consequently, for example, distal tissue samples may share a similar gene expression profile through some common underlying biological processes.
Our goal here is to illustrate that grade of membership (GoM) models – an approach widely used in population genetics to cluster admixed individuals who have ancestry from multiple populations – provide an attractive approach for clustering biological samples of RNA sequencing data. The GoM model allows each biological sample to have partial memberships in multiple biologically-distinct clusters, in contrast to traditional clustering methods that partition samples into distinct subgroups. We also provide methods for identifying genes that are distinctively expressed in each cluster to help biologically interpret the results. Applied to a dataset of 53 human tissues, the GoM approach highlights similarities among biologically-related tissues and identifies distinctively-expressed genes that recapitulate known biology. Applied to gene expression data of single cells from mouse preimplantation embryos, the approach highlights both discrete and continuous variation through early embryonic development stages, and genes involved in a variety of relevant processes. Our study highlights the potential of GoM models for elucidating biological structure in RNA-seq gene expression data.
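
The idea of partial membership can be made concrete with a small sketch. It assumes the standard admixture/topic-model formulation, in which a sample's expected relative expression for each gene is a membership-weighted average of per-cluster gene profiles. The function name `expected_profile` and all numbers are hypothetical illustrations; this is not code from CountClust.

```python
def expected_profile(memberships, cluster_profiles):
    """Mix per-cluster gene profiles using one sample's membership proportions.

    memberships: one proportion per cluster, summing to 1.
    cluster_profiles: one list per cluster of relative expression per gene,
    each summing to 1.
    """
    n_genes = len(cluster_profiles[0])
    return [
        sum(q * profile[g] for q, profile in zip(memberships, cluster_profiles))
        for g in range(n_genes)
    ]


# Hypothetical toy values: 2 clusters, 3 genes.
cluster_profiles = [
    [0.5, 0.3, 0.2],
    [0.1, 0.1, 0.8],
]

# A sample with partial membership in both clusters (grade of membership).
mixed_sample = expected_profile([0.7, 0.3], cluster_profiles)

# Traditional hard clustering is the special case of membership 1 in one cluster.
hard_sample = expected_profile([1.0, 0.0], cluster_profiles)
```

Under this formulation the observed read counts of a sample would be modeled as multinomial draws from the mixed profile, so a hard partition of samples is simply the special case in which every membership vector has a single entry equal to 1.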

2020 ◽  
Vol 61 (10) ◽  
pp. 1818-1827
Author(s):  
Kuan-Chieh Tseng ◽  
Guan-Zhen Li ◽  
Yu-Cheng Hung ◽  
Chi-Nga Chow ◽  
Nai-Yun Wu ◽  
...  

Abstract Co-expressed genes tend to have regulatory relationships and participate in similar biological processes. Construction of gene correlation networks from microarray or RNA-seq expression data has been widely applied to study transcriptional regulatory mechanisms and metabolic pathways under specific conditions. Furthermore, since transcription factors (TFs) are critical regulators of gene expression, it is worth investigating TFs on the promoters of co-expressed genes. Although co-expressed genes and their related metabolic pathways can be easily identified from previous resources, such as EXPath and EXPath Tool, this information is not simultaneously available to identify their regulatory TFs. EXPath 2.0 is an updated database for the investigation of regulatory mechanisms in various plant metabolic pathways with 1,881 microarray and 978 RNA-seq samples. There are six significant improvements in EXPath 2.0: (i) the number of species has been extended from three to six to include Arabidopsis, rice, maize, Medicago, soybean and tomato; (ii) gene expression at various developmental stages has been added; (iii) construction of correlation networks according to a group of genes is available; (iv) hierarchical figures of the enriched Gene Ontology (GO) terms are accessible; (v) promoter analysis of genes in a metabolic pathway or correlation network is provided; and (vi) user’s gene expression data can be uploaded and analyzed. Thus, EXPath 2.0 is an updated platform for investigating gene expression profiles and metabolic pathways under specific conditions. It helps users access the regulatory mechanisms of plant biological processes. The new version is available at http://EXPath.itps.ncku.edu.tw.
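
As a rough illustration of how a gene correlation network can be built from expression data, the sketch below links two genes whenever the Pearson correlation of their expression across samples reaches a threshold. The abstract does not state which correlation measure or cutoff EXPath 2.0 uses, so the Pearson coefficient, the 0.9 threshold, the function names and the toy values are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors.

    Assumes neither vector is constant (otherwise the variance is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene_a, gene_b, r) for every gene pair with r >= threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if r >= threshold:
            edges.append((gene_a, gene_b, r))
    return edges


# Hypothetical expression values for four genes across four samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
    "geneD": [1.0, 3.0, 2.0, 4.0],   # only loosely follows geneA
}
edges = coexpression_edges(toy_expression)
```

Only positively correlated pairs are kept here; a network that also tracks anti-correlated genes would compare the absolute value of `r` against the threshold instead.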


2017 ◽  
Author(s):  
Tao Peng ◽  
Qing Nie

Abstract Measurement of gene expression levels for multiple genes in single cells provides a powerful approach to study heterogeneity of cell populations and cellular plasticity. While the expression levels of multiple genes in each cell are available in such data, the potential connections among the cells (e.g. the cellular state transition relationship) are not directly evident from the measurement. Classifying the cellular states, identifying the transitions among those states, and extracting the pseudotime ordering of cells are challenging due to the noise in the data and its high dimensionality in the number of genes. In this paper we adapt the classical self-organizing-map (SOM) approach for single-cell gene expression data (SOMSC), such as those based on single cell qPCR and single cell RNA-seq. In SOMSC, a cellular state map (CSM) is derived and employed to identify cellular states inherited in the population of the measured single cells. Cells located in the same basin of the CSM are considered to be in one cellular state, while barriers among the basins in the CSM provide information on transitions among the cellular states. A cellular state transition path (e.g. differentiation) and a temporal ordering of the measured single cells are consequently obtained. In addition, SOMSC could estimate the cellular state replication probability and transition probabilities. Applied to a set of synthetic data, one single-cell qPCR data set on mouse early embryonic development and two single-cell RNA-seq data sets, SOMSC shows effectiveness in capturing cellular states and their transitions presented in the high-dimensional single-cell data. This approach will have broader applications to analyzing cellular fate specification and cell lineages using single cell gene expression data.
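
For readers unfamiliar with the classical SOM that this work adapts, the following is a minimal sketch of a one-dimensional self-organizing map trained on toy two-gene "cells". It illustrates the generic SOM update only, not SOMSC or its cellular state map; the linear initialization, the decay schedules, the parameter values and the toy data are all assumptions chosen to keep the example small and deterministic.

```python
import math


def best_matching_unit(weights, sample):
    """Index of the map node whose weight vector is closest to the sample."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], sample))


def train_som(samples, n_nodes=4, epochs=20, lr0=0.5, sigma0=1.0):
    """Train a 1-D SOM (n_nodes >= 2) and return the node weight vectors."""
    dim = len(samples[0])
    lo = [min(s[d] for s in samples) for d in range(dim)]
    hi = [max(s[d] for s in samples) for d in range(dim)]
    # Linear initialization: spread nodes evenly between the data min and max.
    weights = [
        [lo[d] + (hi[d] - lo[d]) * i / (n_nodes - 1) for d in range(dim)]
        for i in range(n_nodes)
    ]
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        for s in samples:
            frac = 1.0 - step / total_steps      # decays from 1 towards 0
            lr = lr0 * frac                      # learning rate
            sigma = max(sigma0 * frac, 0.1)      # neighborhood width
            bmu = best_matching_unit(weights, s)
            for i, w in enumerate(weights):
                # Nodes near the best-matching unit on the map move the most.
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (s[d] - w[d])
            step += 1
    return weights


# Hypothetical cells measured on two genes, forming two separated groups.
cells = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
         (10.0, 10.0), (9.0, 10.0), (10.0, 9.0)]
som = train_som(cells)
state_low = best_matching_unit(som, cells[0])
state_high = best_matching_unit(som, cells[3])
```

After training, cells from the two groups map to different nodes, which is the sense in which a SOM can separate candidate cellular states; the ordering of nodes along the map is what approaches of this kind can exploit when looking for transitions between states.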


2019 ◽  
Vol 15 (2) ◽  
pp. e1006792 ◽  
Author(s):  
Brandon Monier ◽  
Adam McDermaid ◽  
Cankun Wang ◽  
Jing Zhao ◽  
Allison Miller ◽  
...  

2019 ◽  
Vol 17 (04) ◽  
pp. 1950024 ◽  
Author(s):  
Tinghua Huang ◽  
Xiali Huang ◽  
Bomei Shi ◽  
Min Yao

Understanding how genes are expressed and regulated in different biological processes is a fundamental and challenging issue. Considerable progress has been made in studying the relationship between the expression and regulation of human genes. However, it is difficult to use these resources productively to analyze gene expression data. GEREDB ( www.thua45.cn/geredb ) has been developed to facilitate analyses that will provide insights into the regulation of genes that govern specific biological responses. GEREDB is a publicly available, manually curated biological database that stores data on the relationships between expression and regulation of human genes. To date, more than 39,000 links have been contextually annotated by reviewing more than 53,000 abstracts. GEREDB can be searched using the official NCBI gene symbol as a query, and it can be downloaded along with the GEREA software package. GEREDB can analyze user-supplied gene expression data in a causal-analysis-oriented manner using the GEREA bioinformatics tool.


Author(s):  
D Fumagalli ◽  
B Haibe-Kains ◽  
S Michiels ◽  
DN Brown ◽  
D Gacquer ◽  
...  

2019 ◽  
Author(s):  
Ashkaun Razmara ◽  
Shannon E. Ellis ◽  
Dustin J. Sokolowski ◽  
Sean Davis ◽  
Michael D. Wilson ◽  
...  

Abstract The usability of publicly-available gene expression data is often limited by the availability of high-quality, standardized biological phenotype and experimental condition information (“metadata”). We released the recount2 project, which involved re-processing ∼70,000 samples in the Sequence Read Archive (SRA), Genotype-Tissue Expression (GTEx), and The Cancer Genome Atlas (TCGA) projects. While samples from the latter two projects are well-characterized with extensive metadata, the ∼50,000 RNA-seq samples from SRA in recount2 are inconsistently annotated with metadata. Tissue type, sex, and library type can be estimated from the RNA sequencing (RNA-seq) data itself. However, more detailed and harder to predict metadata, like age and diagnosis, must ideally be provided by labs that deposit the data.
To facilitate more analyses within human brain tissue data, we have complemented phenotype predictions by manually constructing a uniformly-curated database of public RNA-seq samples present in SRA and recount2. We describe the reproducible curation process for constructing recount-brain that involves systematic review of the primary manuscript, which can serve as a guide to annotate other studies and tissues. We further expanded recount-brain by merging it with GTEx and TCGA brain samples as well as linking to controlled vocabulary terms for tissue, Brodmann area and disease. Furthermore, we illustrate how to integrate the sample metadata in recount-brain with the gene expression data in recount2 to perform differential expression analysis. We then provide three analysis examples involving modeling postmortem interval, glioblastoma, and meta-analyses across GTEx and TCGA. Overall, recount-brain facilitates expression analyses and improves their reproducibility as individual researchers do not have to manually curate the sample metadata.
recount-brain is available via the add_metadata() function from the recount Bioconductor package at bioconductor.org/packages/recount.

