Gene Set-Based Module Discovery Decodes cis-Regulatory Codes Governing Diverse Gene Expression across Human Multiple Tissues

Aim and Objective: The number of anticancer drugs available currently is limited, and some of them have low treatment response rates. Moreover, developing a new drug for cancer therapy is labor intensive and sometimes cost prohibitive. Therefore, “repositioning” of known cancer treatment compounds can speed up the development time and potentially increase the response rate of cancer therapy. This study proposes a systems biology method for identifying new compound candidates for cancer treatment in two separate procedures. Materials and Methods: First, a “gene set–compound” network was constructed by conducting gene set enrichment analysis on the expression profile of responses to a compound. Second, survival analyses were applied to gene expression profiles derived from four breast cancer patient cohorts to identify gene sets that are associated with cancer survival. A “cancer–functional gene set–compound” network was constructed, and candidate anticancer compounds were identified. Through the use of breast cancer as an example, 162 breast cancer survival-associated gene sets and 172 putative compounds were obtained. Results: We demonstrated how to utilize the clinical relevance of previous studies through gene sets and then connect it to candidate compounds by using gene expression data from the Connectivity Map. Specifically, we chose a gene set derived from a stem cell study to demonstrate its association with breast cancer prognosis and discussed six new compounds that can increase the expression of the gene set after the treatment. Conclusion: Our method can effectively identify compounds with a potential to be “repositioned” for cancer treatment according to their active mechanisms and their association with patients’ survival time.


Gene Set Correlation Analysis and Visualization Using Gene Expression Data

Current Bioinformatics ◽

10.2174/1574893615999200629124444 ◽

2020 ◽

Vol 15 ◽

Author(s):

Chen-An Tsai ◽

James J. Chen

Keyword(s):

Gene Expression ◽

Correlation Analysis ◽

Gene Expression Data ◽

Differentially Expressed Gene ◽

Differentially Expressed ◽

Superior Performance ◽

Expression Data ◽

Gene Set ◽

Gene Sets ◽

Set Correlation

Background: Gene set enrichment analyses (GSEA) provide a useful and powerful approach to identify differentially expressed gene sets with prior biological knowledge. Several GSEA algorithms have been proposed to perform enrichment analyses on groups of genes. However, many of these algorithms have focused on identification of differentially expressed gene sets in a given phenotype. Objective: In this paper, we propose a gene set analytic framework, Gene Set Correlation Analysis (GSCoA), that simultaneously measures within- and between-gene-set variation to identify sets of genes enriched for differential expression and highly correlated pathways. Methods: We apply co-inertia analysis to cross-gene-set comparisons in gene expression data to measure the co-structure of expression profiles in pairs of gene sets. Co-inertia analysis (CIA) is a multivariate method for identifying trends or co-relationships in multiple datasets that contain the same samples. The objective of CIA is to seek ordinations (dimension reduction diagrams) of two gene sets such that the squared covariance between the projections of the gene sets on successive axes is maximized. Simulation studies illustrate that CIA offers superior performance in identifying co-relationships between gene sets in all simulation settings when compared to correlation-based gene set methods. Result and Conclusion: We also combine between-gene-set CIA and GSEA to discover the relationships between gene sets significantly associated with phenotypes. In addition, we provide a graphical technique for visualizing and simultaneously exploring the associations between and within gene sets and their interaction and network. We then demonstrate integration of within- and between-gene-set variation using CIA and GSEA, applied to the p53 gene expression data using the c2 curated gene sets. Ultimately, the GSCoA approach provides an attractive tool for identification and visualization of novel associations between pairs of gene sets by integrating co-relationships between gene sets into gene set analysis.
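
The CIA objective described in this abstract (maximizing the squared covariance between the projections of two gene sets) can be illustrated with a minimal sketch. It assumes uniform sample weights and identity metrics, in which case the first pair of axes is the leading singular-vector pair of the cross-covariance matrix between the two centered expression tables. The function names and the power-iteration approach are illustrative choices, not the GSCoA implementation.

```python
import math

def center(matrix):
    """Center each gene (column) of a samples x genes matrix."""
    n, p = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in matrix]

def cross_covariance(xc, yc):
    """Return the p x q matrix X^T Y / n for two centered tables sharing samples."""
    n, p, q = len(xc), len(xc[0]), len(yc[0])
    return [[sum(xc[i][a] * yc[i][b] for i in range(n)) / n for b in range(q)]
            for a in range(p)]

def first_coinertia_axis(x, y, iters=200):
    """Approximate the first pair of axes (u, v) maximizing cov(Xu, Yv)^2.

    Uses alternating power iteration on the cross-covariance matrix.
    Returns (u, v, sigma), where sigma is the covariance between the projections.
    """
    c = cross_covariance(center(x), center(y))
    p, q = len(c), len(c[0])
    # Non-uniform start vector; an unlucky start orthogonal to the leading
    # axis would need a different initial guess.
    u = [1.0 / (a + 1) for a in range(p)]
    norm_u = math.sqrt(sum(t * t for t in u))
    u = [t / norm_u for t in u]
    v, sigma = [0.0] * q, 0.0
    for _ in range(iters):
        v = [sum(c[a][b] * u[a] for a in range(p)) for b in range(q)]
        norm_v = math.sqrt(sum(t * t for t in v))
        if norm_v == 0.0:
            return u, v, 0.0
        v = [t / norm_v for t in v]
        u = [sum(c[a][b] * v[b] for b in range(q)) for a in range(p)]
        sigma = math.sqrt(sum(t * t for t in u))
        if sigma == 0.0:
            return u, v, 0.0
        u = [t / sigma for t in u]
    return u, v, sigma
```

Here `x` and `y` are samples-by-genes tables for two gene sets measured on the same samples; `sigma ** 2` is the squared covariance along the first axis. Later axes would require deflation or a full singular value decomposition.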


Analysis of brain atrophy and local gene expression in genetic frontotemporal dementia

Brain Communications ◽

10.1093/braincomms/fcaa122 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 1

Author(s):

Andre Altmann ◽

David M Cash ◽

Martina Bocchetta ◽

Carolin Heller ◽

Regina Reynolds ◽

...

Keyword(s):

Gene Expression ◽

Frontotemporal Dementia ◽

Cell Function ◽

Neurodegenerative Disorder ◽

Positive Association ◽

Negative Association ◽

Marker Genes ◽

Gene Set ◽

Genetic Groups ◽

Cortical Regions

Abstract Frontotemporal dementia is a heterogeneous neurodegenerative disorder characterized by neuronal loss in the frontal and temporal lobes. Despite progress in understanding which genes are associated with the aetiology of frontotemporal dementia, the biological basis of how mutations in these genes lead to cell loss in specific cortical regions remains unclear. In this work, we combined gene expression data for 16 772 genes from the Allen Institute for Brain Science atlas with brain maps of grey matter atrophy in symptomatic C9orf72, GRN and MAPT mutation carriers obtained from the Genetic Frontotemporal dementia Initiative study. No significant association was seen between C9orf72, GRN and MAPT expression and the atrophy patterns in the respective genetic groups. After adjusting for spatial autocorrelation, between 1000 and 5000 genes showed a negative or positive association with the atrophy pattern within each individual genetic group, with the most significantly associated genes being TREM2, SSBP3 and GPR158 (negative association in C9orf72, GRN and MAPT respectively) and RELN, MXRA8 and LPA (positive association in C9orf72, GRN and MAPT respectively). An overrepresentation analysis identified a negative association with genes involved in mitochondrial function, and a positive association with genes involved in vascular and glial cell function in each of the genetic groups. A set of 423 and 700 genes showed significant positive and negative association, respectively, with atrophy patterns in all three maps. The gene set with increased expression in spared cortical regions was enriched for neuronal and microglial genes, while the gene set with increased expression in atrophied regions was enriched for astrocyte and endothelial cell genes. Our analysis suggests that these cell types may play a more active role in the onset of neurodegeneration in frontotemporal dementia than previously assumed, and in the case of the positively associated cell marker genes, potentially through emergence of neurotoxic astrocytes and alteration in the blood–brain barrier, respectively.
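
The abstract does not say which test its overrepresentation analysis used. A common choice is the one-sided hypergeometric test, sketched below under that assumption. The function and parameter names are hypothetical.

```python
from math import comb

def overrepresentation_p(universe, annotated, selected, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    universe:  number of genes tested in total
    annotated: number of those genes that belong to the category of interest
    selected:  size of the associated gene list
    overlap:   number of selected genes that are also annotated
    """
    upper = min(annotated, selected)
    total = comb(universe, selected)
    tail = sum(comb(annotated, k) * comb(universe - annotated, selected - k)
               for k in range(overlap, upper + 1))
    return tail / total
```

A call such as `overrepresentation_p(16772, annotated, 423, overlap)` would score one category against a 423-gene list. The `annotated` and `overlap` counts are not given in the abstract and would come from the annotation source.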


A Method to Detect Differential Gene Expression in Cross-Species Hybridization Experiments at Gene and Probe Level

Biomedical Informatics Insights ◽

10.4137/bii.s3846 ◽

2010 ◽

Vol 3 ◽

pp. BII.S3846 ◽

Cited By ~ 1

Author(s):

Ying Chen ◽

Rebekah Wu ◽

James Felton ◽

David M. Rocke ◽

Anu Chakicherla

Keyword(s):

Gene Expression ◽

Gene Set Analysis ◽

Model Organisms ◽

Whole Genome ◽

Supplementary Data ◽

Genome Sequences ◽

Data Set ◽

Gene Set ◽

Level Data ◽

Species Hybridization

Motivation: Whole genome microarrays are increasingly becoming the method of choice to study responses in model organisms to disease, stressors or other stimuli. However, whole genome sequences are available for only some model organisms, and there are still many species whose genome sequences are not yet available. Cross-species studies, where arrays developed for one species are used to study gene expression in a closely related species, have been used to address this gap, with some promising results. Current analytical methods have included filtration of some probes or genes that showed low hybridization activities, but consensus filtration schemes are still not available. Results: We propose a novel masking procedure, based on currently available target species sequences, to filter out probes, and we study a cross-species data set using this masking procedure and gene-set analysis. Gene-set analysis evaluates the association of a priori defined gene groups with a phenotype of interest. Two methods, Gene Set Enrichment Analysis (GSEA) and Test of Test Statistics (ToTS), were investigated. The results showed that the masking procedure together with the ToTS method worked well in our data set. The results from an alternative way to study cross-species hybridization experiments without masking are also presented. We hypothesize that the multi-probe structure of Affymetrix microarrays makes it possible to aggregate the effects of both well-hybridized and poorly-hybridized probes to study a group of genes. The principles of gene-set analysis were applied to the probe-level data instead of gene-level data. The results showed that ToTS can give valuable information and thus can be used as a powerful technique for analyzing cross-species hybridization experiments. Availability: Software in the form of R code is available at http://anson.ucdavis.edu/~ychen/cross-species.html Supplementary Data: Supplementary data are available at http://anson.ucdavis.edu/~ychen/cross-species.html


Putative biomarkers for predicting tumor sample purity based on gene expression data

BMC Genomics ◽

10.1186/s12864-019-6412-8 ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Yuanyuan Li ◽

David M. Umbach ◽

Adrienna Bingham ◽

Qi-Jing Li ◽

Yuan Zhuang ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Supervised Machine Learning ◽

Tumor Type ◽

Expression Data ◽

Expression Levels ◽

Gene Set ◽

Tumor Purity ◽

Tumor Types ◽

Cancerous Cells

Abstract Background: Tumor purity is the percentage of cancer cells present in a sample of tumor tissue. The non-cancerous cells (immune cells, fibroblasts, etc.) have an important role in tumor biology. The ability to determine tumor purity is important to understand the roles of cancerous and non-cancerous cells in a tumor. Methods: We applied a supervised machine learning method, XGBoost, to data from 33 TCGA tumor types to predict tumor purity using RNA-seq gene expression data. Results: Across the 33 tumor types, the median correlation between observed and predicted tumor purity ranged from 0.75 to 0.87 with small root mean square errors, suggesting that tumor purity can be accurately predicted using expression data. We further confirmed that expression levels of a ten-gene set (CSF2RB, RHOH, C1S, CCDC69, CCL22, CYTIP, POU2AF1, FGR, CCL21, and IL7R) were predictive of tumor purity regardless of tumor type. We tested whether our set of ten genes could accurately predict tumor purity of a TCGA-independent data set. We showed that expression levels from our set of ten genes were highly correlated (ρ = 0.88) with the actual observed tumor purity. Conclusions: Our analyses suggested that the ten-gene set may serve as a biomarker for tumor purity prediction using gene expression data.
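
The two evaluation quantities reported here, a correlation between observed and predicted purity and a root mean square error, can be computed as in the sketch below. The abstract does not state which correlation coefficient was used. Spearman rank correlation is an assumption suggested only by the symbol ρ, and all function names are illustrative.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted purity values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))
```

Both functions take per-sample purity fractions, so `spearman(observed, predicted)` and `rmse(observed, predicted)` can be reported per tumor type.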


Genome Improvement and Core Gene Set Refinement of Fugacium kawagutii

Microorganisms ◽

10.3390/microorganisms8010102 ◽

2020 ◽

Vol 8 (1) ◽

pp. 102 ◽

Cited By ~ 3

Author(s):

Tangcheng Li ◽

Liying Yu ◽

Bo Song ◽

Yue Song ◽

Ling Li ◽

...

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Gene Prediction ◽

Acid Synthesis ◽

Core Gene ◽

Chromosome Conformation ◽

Gene Set ◽

Kegg Pathways ◽

Expression Studies ◽

Gene Expression Studies

Cataloging an accurate functional gene set for the Symbiodiniaceae species is crucial for addressing biological questions of dinoflagellate symbiosis with corals and other invertebrates. To improve the gene models of Fugacium kawagutii, we conducted high-throughput chromosome conformation capture (Hi-C) for the genome and Illumina combined with PacBio sequencing for the transcriptome to achieve a new genome assembly and gene prediction. A 0.937-Gbp assembly of F. kawagutii was obtained, with an N50 > 13 Mbp and the longest scaffold of 121 Mbp capped with telomere motif at both ends. Gene annotation produced 45,192 protein-coding genes, of which 11,984 are new compared to previous versions of the genome. The newly identified genes are mainly enriched in 38 KEGG pathways including N-Glycan biosynthesis, mRNA surveillance pathway, cell cycle, autophagy, mitophagy, and fatty acid synthesis, which are important for symbiosis, nutrition, and reproduction. The newly identified genes also included those encoding O-methyltransferase (O-MT), 3-dehydroquinate synthase, homologous-pairing protein 2-like (HOP2) and meiosis protein 2 (MEI2), which function in mycosporine-like amino acids (MAAs) biosynthesis and sexual reproduction, respectively. The improved version of the gene set (Fugka_Geneset_V3) raised the transcriptomic read mapping rate from 33% to 54% and the BUSCO match from 29% to 55%. Further differential gene expression analysis yielded a set of stably expressed genes under variable trace metal conditions, of which 115 with annotated functions have recently been found to be stably expressed under three other conditions, thus further developing the “core gene set” of F. kawagutii. This improved genome will prove useful for future Symbiodiniaceae transcriptomic, gene structure, and gene expression studies, and the refined “core gene set” will be a valuable resource from which to develop reference genes for gene expression studies.


ExAtlas: An interactive online tool for meta-analysis of gene expression data

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720015500195 ◽

2015 ◽

Vol 13 (06) ◽

pp. 1550019 ◽

Cited By ~ 37

Author(s):

Alexei A. Sharov ◽

David Schlessinger ◽

Minoru S. H. Ko

Keyword(s):

Gene Expression ◽

Gene Ontology ◽

Gene Expression Data ◽

Fixed Effects ◽

Expression Profiles ◽

Meta Analysis ◽

Data Sets ◽

Expression Data ◽

Gene Set ◽

Public Data

We have developed ExAtlas, an on-line software tool for meta-analysis and visualization of gene expression data. In contrast to existing software tools, ExAtlas compares multi-component data sets and generates results for all combinations (e.g. all gene expression profiles versus all Gene Ontology annotations). ExAtlas handles both users’ own data and data extracted semi-automatically from the public repository (GEO/NCBI database). ExAtlas provides a variety of tools for meta-analyses: (1) standard meta-analysis (fixed effects, random effects, z-score, and Fisher’s methods); (2) analyses of global correlations between gene expression data sets; (3) gene set enrichment; (4) gene set overlap; (5) gene association by expression profile; (6) gene specificity; and (7) statistical analysis (ANOVA, pairwise comparison, and PCA). ExAtlas produces graphical outputs, including heatmaps, scatter-plots, bar-charts, and three-dimensional images. Some of the most widely used public data sets (e.g. GNF/BioGPS, Gene Ontology, KEGG, GAD phenotypes, BrainScan, ENCODE ChIP-seq, and protein–protein interaction) are pre-loaded and can be used for functional annotations.
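
Three of the standard meta-analysis methods named in item (1) can be sketched in their textbook forms as follows. This is not ExAtlas code; the function names are illustrative, and the equal-weight form of the z-score method is an assumption.

```python
import math

def fixed_effects(effects, std_errors):
    """Inverse-variance weighted pooled effect and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    return pooled, math.sqrt(1.0 / total)

def combined_z(z_scores):
    """Equal-weight z-score combination: sum of z divided by sqrt(k)."""
    return sum(z_scores) / math.sqrt(len(z_scores))

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k df under the null.

    For an even number of degrees of freedom the chi-square survival
    function has a closed form, so no statistics library is required.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    term, series = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    return statistic, math.exp(-half) * series
```

Each function takes one value per data set for a single gene, for example `fisher_combined_p([p1, p2, p3])` for p-values from three studies.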


Gene set based systematic analysis of prostate cancer and its subtypes

Future Oncology ◽

10.2217/fon-2019-0459 ◽

2020 ◽

Vol 16 (2) ◽

pp. 4381-4393

Author(s):

Senlin Ye ◽

Haohui Wang ◽

Kancheng He ◽

Hongwei Shen ◽

Mou Peng ◽

...

Keyword(s):

Gene Expression ◽

Prostate Cancer ◽

Expression Patterns ◽

Methylation Status ◽

Gene Expression Patterns ◽

Biological Functions ◽

Systematic Analysis ◽

Gene Set ◽

Analysis Strategy ◽

Similarities And Differences

Aim: A gene set based systematic analysis strategy is used to investigate prostate tumors and their subclusters, with a focus on similarities and differences in biological functions. Results: Dysregulation of methylation status, as well as of the RAS/RAF/ERK and PI3K-AKT signaling pathways, was found to be the most dramatic change during prostate cancer tumorigenesis. In addition, the neural and inflammatory microenvironments diverge significantly between tumor and adjacent tissues. Analysis of subclasses within prostate tumor cohorts revealed four different clusters with distinct gene expression patterns. We found that samples are mainly clustered by immune environments and proliferation traits. Conclusion: The findings of this article may help to advance the identification of better diagnostic biomarkers and therapeutic targets.


Optimizing gene set annotations combining GO structure and gene expression data

BMC Systems Biology ◽

10.1186/s12918-018-0659-6 ◽

2018 ◽

Vol 12 (S9) ◽

Author(s):

Dong Wang ◽

Jie Li ◽

Rui Liu ◽

Yadong Wang

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Data ◽

Gene Set


Integration of GWAS Summary Statistics and Gene Expression Reveals Target Cell Types Underlying Kidney Function Traits

Journal of the American Society of Nephrology ◽

10.1681/asn.2020010051 ◽

2020 ◽

Vol 31 (10) ◽

pp. 2326-2340 ◽

Cited By ~ 2

Author(s):

Yong Li ◽

Stefan Haug ◽

Pascal Schlosser ◽

Alexander Teumer ◽

Adrienne Tin ◽

...

Keyword(s):

Gene Expression ◽

Kidney Function ◽

Cell Types ◽

Gene Set Enrichment Analysis ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Rna Seq ◽

Cell Type ◽

Gene Set Enrichment ◽

Gene Set

Background: Genetic variants identified in genome-wide association studies (GWAS) are often not specific enough to reveal complex underlying physiology. By integrating RNA-seq data and GWAS summary statistics, novel computational methods allow unbiased identification of trait-relevant tissues and cell types. Methods: The CKDGen consortium provided GWAS summary data for eGFR, urinary albumin-creatinine ratio (UACR), BUN, and serum urate. Genotype-Tissue Expression Project (GTEx) RNA-seq data were used to construct the top 10% specifically expressed genes for each of 53 tissues followed by linkage disequilibrium (LD) score–based enrichment testing for each trait. Similar procedures were performed for five kidney single-cell RNA-seq datasets from humans and mice and for a microdissected tubule RNA-seq dataset from rat. Gene set enrichment analyses were also conducted for genes implicated in Mendelian kidney diseases. Results: Across 53 tissues, genes in kidney function–associated GWAS loci were enriched in kidney (P=9.1E-8 for eGFR; P=1.2E-5 for urate) and liver (P=6.8E-5 for eGFR). In the kidney, proximal tubule was enriched in humans (P=8.5E-5 for eGFR; P=7.8E-6 for urate) and mice (P=0.0003 for eGFR; P=0.0002 for urate) and confirmed as the primary cell type in microdissected tubules and organoids. Gene set enrichment analysis supported this and showed enrichment of genes implicated in monogenic glomerular diseases in podocytes. A systematic approach generated a comprehensive list of GWAS genes prioritized by cell type–specific expression. Conclusions: Integration of GWAS statistics of kidney function traits and gene expression data identified relevant tissues and cell types, as a basis for further mechanistic studies to understand GWAS loci.
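
The step of constructing the top 10% specifically expressed genes for a tissue can be sketched as follows. The abstract does not define the specificity statistic, so the score used here (expression in the focal tissue minus the mean across the other tissues) is a simplifying assumption. The function name, tissue labels, and data layout are hypothetical.

```python
import math

def top_specific_genes(expression, tissue, fraction=0.10):
    """Return the top `fraction` of genes ranked by specificity for `tissue`.

    expression: dict mapping gene -> dict mapping tissue -> expression level.
    The specificity score is an assumed stand-in: focal-tissue expression
    minus the mean expression in all other tissues.
    """
    scores = {}
    for gene, by_tissue in expression.items():
        others = [level for name, level in by_tissue.items() if name != tissue]
        scores[gene] = by_tissue[tissue] - sum(others) / len(others)
    n_top = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=lambda g: scores[g], reverse=True)
    return ranked[:n_top]
```

The resulting gene list per tissue would then serve as the annotation passed to the LD score–based enrichment test, which is not reproduced here.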
