Model-based clustering of multi-tissue gene expression data

Author(s):  
Pau Erola ◽  
Johan L M Björkegren ◽  
Tom Michoel

Abstract. Motivation: Recently, it has become feasible to generate large-scale, multi-tissue gene expression data, where expression profiles are obtained from multiple tissues or organs sampled from dozens to hundreds of individuals. When traditional clustering methods are applied to this type of data, important information is lost, because they require either that all tissues be analyzed independently, ignoring dependencies and similarities between tissues, or that all tissues be merged into a single, monolithic dataset, ignoring individual characteristics of tissues. Results: We developed a Bayesian model-based multi-tissue clustering algorithm, revamp, which can incorporate prior information on physiological tissue similarity, and which results in a set of clusters, each consisting of a core set of genes conserved across tissues as well as differential sets of genes specific to one or more subsets of tissues. Using data from seven vascular and metabolic tissues from over 100 individuals in the STockholm Atherosclerosis Gene Expression (STAGE) study, we demonstrate that multi-tissue clusters inferred by revamp are more enriched for tissue-dependent protein-protein interactions than clusters from alternative approaches. We further demonstrate that revamp results in easily interpretable multi-tissue gene expression associations to key coronary artery disease processes and clinical phenotypes in the STAGE individuals. Availability and implementation: Revamp is implemented in the Lemon-Tree software, available at https://github.com/eb00/lemon-tree. Supplementary information: Supplementary data are available at Bioinformatics online.

Processes ◽  
2019 ◽  
Vol 7 (5) ◽  
pp. 301
Author(s):  
Muying Wang ◽  
Satoshi Fukuyama ◽  
Yoshihiro Kawaoka ◽  
Jason E. Shoemaker

Motivation: Immune cell dynamics is a critical factor of disease-associated pathology (immunopathology) that also impacts the levels of mRNAs in diseased tissue. Deconvolution algorithms attempt to infer cell quantities in a tissue/organ sample based on gene expression profiles and are often evaluated using artificial, non-complex samples. Their accuracy in estimating cell counts from temporal tissue gene expression data remains poorly characterized and has never been characterized using diseased lung. Further, how to remove the effects of cell migration on transcript counts to improve discovery of disease factors is an open question. Results: Four cell count inference (i.e., deconvolution) tools are evaluated using microarray data from influenza-infected lung sampled at several time points post-infection. The analysis finds that inferred cell quantities are accurate only for select cell types and that algorithms tend to have a good relative fit (R²) but a poor absolute fit (normalized mean squared error; NMSE), which suggests that systematic biases exist. Nonetheless, using cell fraction estimates to adjust gene expression data, we show that genes associated with influenza virus replication and increased infection pathology are more likely to be identified as significant than when applying traditional statistical tests.
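
The contrast between a good relative fit and a poor absolute fit can be illustrated with a small sketch. It assumes that the relative fit is the squared Pearson correlation and that NMSE is the mean squared error divided by the variance of the true values; the abstract does not give the exact formulas, and the cell counts below are invented purely for illustration.

```python
def r_squared(true, est):
    """Squared Pearson correlation between true and estimated values."""
    n = len(true)
    mean_t = sum(true) / n
    mean_e = sum(est) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true, est))
    var_t = sum((t - mean_t) ** 2 for t in true)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return cov * cov / (var_t * var_e)


def nmse(true, est):
    """Mean squared error divided by the variance of the true values
    (one common normalization; an assumption, not the paper's definition)."""
    n = len(true)
    mean_t = sum(true) / n
    mse = sum((t - e) ** 2 for t, e in zip(true, est)) / n
    var_t = sum((t - mean_t) ** 2 for t in true) / n
    return mse / var_t


# Hypothetical cell counts at five time points.
true_counts = [10.0, 20.0, 30.0, 40.0, 50.0]
# An estimator that tracks the trend perfectly but is scaled and shifted.
estimated = [2.0 * t + 5.0 for t in true_counts]

r2 = r_squared(true_counts, estimated)
err = nmse(true_counts, estimated)
print(f"R^2 = {r2:.3f}, NMSE = {err:.3f}")
```

Here the estimates follow the true trajectory exactly up to scale and offset, so the correlation-based score is perfect while the normalized error is large, which is the kind of pattern a consistent bias would produce.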


2015 ◽  
Vol 11 (1) ◽  
pp. 86-96 ◽  
Author(s):  
Aakash Chavan Ravindranath ◽  
Nolen Perualila-Tan ◽  
Adetayo Kasim ◽  
Georgios Drakakis ◽  
Sonia Liggi ◽  
...  

Integrating gene expression profiles with certain proteins can improve our understanding of the fundamental mechanisms in protein–ligand binding.


2012 ◽  
Vol 10 (05) ◽  
pp. 1250011
Author(s):  
NATALIA NOVOSELOVA ◽  
IGOR TOM

Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not consider the stability of the groupings produced by a clustering algorithm. Based on an approach that assesses the predictive power or stability of a partitioning, we propose a new measure of cluster validation and a selection procedure to determine a suitable number of clusters. The validity measure is based on an estimate of the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme, or consensus clustering. According to the proposed selection procedure, the stable clustering result is determined with reference to the validity measure for the null hypothesis, which encodes the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to estimate the clustering results on several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, which is in agreement with biological knowledge and gold standards of cluster quality.
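
A minimal sketch of how a consensus matrix can be built from resampled clusterings is given below. The `clearness` score is a hypothetical stand-in (how far the off-diagonal entries sit from the ambiguous value 0.5), not the measure proposed by the authors, and the resampled labelings are invented for illustration.

```python
def consensus_matrix(labelings, n_items):
    """Fraction of resamples in which items i and j land in the same cluster,
    counted only over resamples that contain both items.

    labelings: list of dicts {item_index: cluster_label}, one per resample.
    """
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for labels in labelings:
        for i in labels:
            for j in labels:
                sampled[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
         for j in range(n_items)]
        for i in range(n_items)
    ]


def clearness(matrix):
    """Illustrative score in [0, 1]: 1 when every off-diagonal entry is
    exactly 0 or 1, 0 when every entry equals 0.5 (hypothetical definition)."""
    n = len(matrix)
    values = [abs(matrix[i][j] - 0.5) * 2
              for i in range(n) for j in range(n) if i != j]
    return sum(values) / len(values)


# Three hypothetical resamples of four items; each omits some items.
runs = [
    {0: 0, 1: 0, 2: 1, 3: 1},
    {0: 1, 1: 1, 2: 0},
    {1: 0, 2: 1, 3: 1},
]
M = consensus_matrix(runs, 4)
score = clearness(M)
print(M[0][1], M[0][2], score)
```

Cluster labels are arbitrary within each resample, so only co-membership is compared. In a procedure like the one described, a score of this kind would be computed for several candidate cluster numbers on both the original and the permuted data.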


Author(s):  
Crescenzio Gallo

The possible applications of modeling and simulation in the field of bioinformatics are very extensive, ranging from understanding basic metabolic paths to exploring genetic variability. Experimental results carried out with DNA microarrays allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. In this chapter, the authors examine various methods for analyzing gene expression data, addressing the important topics of (1) selecting the most differentially expressed genes, (2) grouping them by means of their relationships, and (3) classifying samples based on gene expressions.


eLife ◽  
2017 ◽  
Vol 6 ◽  
Author(s):  
Julien Racle ◽  
Kaat de Jonge ◽  
Petra Baumgaertner ◽  
Daniel E Speiser ◽  
David Gfeller

Immune cells infiltrating tumors can have an important impact on tumor progression and response to therapy. We present an efficient algorithm to simultaneously estimate the fraction of cancer and immune cell types from bulk tumor gene expression data. Our method integrates novel gene expression profiles from each major non-malignant cell type found in tumors, renormalization based on cell-type-specific mRNA content, and the ability to consider uncharacterized and possibly highly variable cell types. Feasibility is demonstrated by validation with flow cytometry, immunohistochemistry and single-cell RNA-Seq analyses of human melanoma and colorectal tumor specimens. Altogether, our work not only improves accuracy but also broadens the scope of absolute cell fraction predictions from tumor gene expression data, and provides a unique experimental benchmark for immunogenomics analyses in cancer research (http://epic.gfellerlab.org).
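
The renormalization step can be sketched as follows, under the assumption that it amounts to dividing each cell type's estimated mRNA proportion by its relative mRNA content per cell and rescaling so the results sum to one. The function name and all numbers are hypothetical and are not taken from the published tool.

```python
def mrna_to_cell_fractions(mrna_props, mrna_per_cell):
    """Convert mRNA proportions into cell fractions.

    mrna_props:    {cell_type: share of total mRNA attributed to that type}
    mrna_per_cell: {cell_type: relative mRNA content of one cell of that type}
    """
    # A cell type with little mRNA per cell needs more cells to explain
    # the same share of the bulk signal.
    scaled = {c: mrna_props[c] / mrna_per_cell[c] for c in mrna_props}
    total = sum(scaled.values())
    return {c: value / total for c, value in scaled.items()}


# Hypothetical inputs: T cells carry half as much mRNA as cancer cells.
mrna_props = {"T cell": 0.3, "cancer cell": 0.7}
mrna_per_cell = {"T cell": 0.5, "cancer cell": 1.0}

fractions = mrna_to_cell_fractions(mrna_props, mrna_per_cell)
print(fractions)
```

In this toy case the T-cell share rises from 30% of the mRNA to roughly 46% of the cells, which shows why skipping the correction would underestimate cell types with low mRNA content.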


2019 ◽  
Vol 20 (S22) ◽  
Author(s):  
Juan Wang ◽  
Cong-Hai Lu ◽  
Jin-Xing Liu ◽  
Ling-Yun Dai ◽  
Xiang-Zhen Kong

Abstract. Background: Identifying different types of cancer based on gene expression data has become a hotspot in bioinformatics research. Clustering gene expression data from multiple cancers into their own classes is a significant solution. However, the high dimensionality and small sample sizes of gene expression data, together with the noise in the data, make data mining and research difficult. Although there are many effective and feasible methods to deal with this problem, the possibility remains that these methods are flawed. Results: In this paper, we propose the graph regularized low-rank representation under symmetric and sparse constraints (sgLRR) method, in which we introduce graph regularization based on manifold learning and symmetric sparse constraints into the traditional low-rank representation (LRR). In the sgLRR method, the symmetric and sparse constraints alleviate the effect of raw data noise on the low-rank representation. Further, the sgLRR method preserves the important intrinsic local geometrical structures of the raw data by introducing graph regularization. We apply this method to cluster multi-cancer samples based on gene expression data, which improves the clustering quality. First, the gene expression data are decomposed by the sgLRR method, yielding a lowest-rank representation matrix that is symmetric and sparse. Then, an affinity matrix is constructed and used to cluster the multi-cancer samples with a spectral clustering algorithm, i.e., normalized cuts (Ncuts). Conclusions: A series of comparative experiments demonstrates that the sgLRR method based on low-rank representation has a great advantage and remarkable performance in the clustering of multi-cancer samples.
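
The final step, turning a representation matrix into an affinity matrix and cutting it, can be sketched as below. This is not the sgLRR implementation: the affinity construction (symmetrized absolute values) is an assumed, commonly used choice, the function `two_way_ncut` is hypothetical, and only a single two-way normalized cut is shown, computed by plain power iteration.

```python
import math


def two_way_ncut(Z, iters=200):
    """Split samples into two groups with one relaxed normalized cut.

    Z: square representation matrix (list of lists); every sample must
       have at least one non-zero affinity so its degree is positive.
    """
    n = len(Z)
    # Assumed affinity construction: W = (|Z| + |Z|^T) / 2.
    W = [[(abs(Z[i][j]) + abs(Z[j][i])) / 2 for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in W]
    d_sqrt = [math.sqrt(d) for d in deg]
    # S = D^{-1/2} W D^{-1/2} has eigenvalues in [-1, 1]; adding the
    # identity shifts them to [0, 2] so power iteration is well behaved.
    M = [[W[i][j] / (d_sqrt[i] * d_sqrt[j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # The leading eigenvector of S is D^{1/2} 1; it carries no cut
    # information, so it is projected out at every step.
    norm = math.sqrt(sum(deg))
    top = [s / norm for s in d_sqrt]
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # Map back to the generalized eigenvector y = D^{-1/2} v; split by sign.
    return [1 if v[i] / d_sqrt[i] > 0 else 0 for i in range(n)]


# Hypothetical representation matrix with two weakly linked sample pairs.
Z = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
]
labels = two_way_ncut(Z)
print(labels)
```

A full pipeline would use a proper eigensolver and more than two clusters; the sketch only shows how the affinity structure drives the partition.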


2015 ◽  
Vol 13 (06) ◽  
pp. 1550019 ◽  
Author(s):  
Alexei A. Sharov ◽  
David Schlessinger ◽  
Minoru S. H. Ko

We have developed ExAtlas, an on-line software tool for meta-analysis and visualization of gene expression data. In contrast to existing software tools, ExAtlas compares multi-component data sets and generates results for all combinations (e.g. all gene expression profiles versus all Gene Ontology annotations). ExAtlas handles both users’ own data and data extracted semi-automatically from the public repository (GEO/NCBI database). ExAtlas provides a variety of tools for meta-analyses: (1) standard meta-analysis (fixed effects, random effects, z-score, and Fisher’s methods); (2) analyses of global correlations between gene expression data sets; (3) gene set enrichment; (4) gene set overlap; (5) gene association by expression profile; (6) gene specificity; and (7) statistical analysis (ANOVA, pairwise comparison, and PCA). ExAtlas produces graphical outputs, including heatmaps, scatter-plots, bar-charts, and three-dimensional images. Some of the most widely used public data sets (e.g. GNF/BioGPS, Gene Ontology, KEGG, GAD phenotypes, BrainScan, ENCODE ChIP-seq, and protein–protein interaction) are pre-loaded and can be used for functional annotations.
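
Of the standard meta-analysis options listed, Fisher's method is the simplest to write out. The sketch below is a generic textbook version, not ExAtlas code: it combines independent p-values through the statistic X = -2 Σ ln p, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The function name and input p-values are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's method."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    # For an even number of degrees of freedom (2k), the chi-square
    # survival function has the closed form
    #   exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one gene from three independent data sets.
combined = fisher_combine([0.01, 0.04, 0.10])
print(f"combined p = {combined:.5f}")
```

With a single input the function returns that p-value unchanged, which is a quick sanity check on the closed form.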


Cells ◽  
2019 ◽  
Vol 8 (7) ◽  
pp. 675 ◽  
Author(s):  
Xia ◽  
Liu ◽  
Zhang ◽  
Guo

High-throughput technologies generate a tremendous amount of expression data on mRNA, miRNA and protein levels. Mining and visualizing this large amount of expression data requires sophisticated computational skills. A user-friendly web server for the visualization of gene expression profiles could greatly facilitate data exploration and hypothesis generation for biologists. Here, we curated and normalized the gene expression data on mRNA, miRNA and protein levels in 23315, 9009 and 9244 samples, respectively, from 40 tissues (The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx)) and 1594 cell lines (Cancer Cell Line Encyclopedia (CCLE) and MD Anderson Cell Lines Project (MCLP)). Then, we constructed the Gene Expression Display Server (GEDS), a web-based tool for quantification, comparison and visualization of gene expression data. GEDS integrates multiscale expression data and provides multiple types of figures and tables to satisfy several kinds of user requirements. The comprehensive expression profiles plotted in the one-stop GEDS platform help experimental biologists use big data for better experimental design and analysis. GEDS is freely available at http://bioinfo.life.hust.edu.cn/web/GEDS/.

