Enriching the gene set analysis of genome-wide data by incorporating directionality of gene expression and combining statistical hypotheses and methods

Over the last decade, gene set analysis has become the first choice for gaining insights into the underlying complex biology of diseases through gene expression and gene association studies. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results. Although gene set analysis approaches are extensively used in gene expression and genome-wide association data analysis, the statistical structure and steps common to these approaches have not yet been comprehensively discussed, which limits their utility. In this article, we provide a comprehensive overview of the statistical structure and steps of gene set analysis approaches used for microarray, RNA-sequencing and genome-wide association data analysis. We also classify the gene set analysis approaches and tools by the type of genomic study, null hypothesis, sampling model and nature of the test statistic, among other criteria. Rather than reviewing the gene set analysis approaches individually, we trace the generation-wise evolution of such approaches for microarrays, RNA-sequencing and genome-wide association studies and discuss their relative merits and limitations. We identify the key biological and statistical challenges in current gene set analysis, which statisticians and biologists will need to address collectively in order to develop the next generation of gene set analysis approaches. Finally, this study serves as a catalog and provides guidelines to genome researchers and experimental biologists for choosing the proper gene set analysis approach based on several factors.

Genome-wide association study and gene set analysis for understanding candidate genes involved in salt tolerance at the rice seedling stage

Molecular Genetics and Genomics ◽

10.1007/s00438-017-1354-9 ◽

2017 ◽

Vol 292 (6) ◽

pp. 1391-1403 ◽

Cited By ~ 13

Author(s):

Jie Yu ◽

Weiguo Zao ◽

Qiang He ◽

Tae-Sung Kim ◽

Yong-Jin Park

Keyword(s):

Salt Tolerance ◽

Association Study ◽

Candidate Genes ◽

Genome Wide Association Study ◽

Rice Seedling ◽

Seedling Stage ◽

Genome Wide Association ◽

Gene Set Analysis ◽

Gene Set ◽

Genome Wide

Performance Comparison of Two Gene Set Analysis Methods for Genome-wide Association Study Results: GSA-SNP vs i-GSEA4GWAS

Genomics & Informatics ◽

10.5808/gi.2012.10.2.123 ◽

2012 ◽

Vol 10 (2) ◽

pp. 123 ◽

Cited By ~ 3

Author(s):

Ji-sun Kwon ◽

Jihye Kim ◽

Dougu Nam ◽

Sangsoo Kim

Keyword(s):

Association Study ◽

Genome Wide Association Study ◽

Performance Comparison ◽

Genome Wide Association ◽

Gene Set Analysis ◽

Gene Set ◽

Analysis Methods ◽

Genome Wide ◽

Study Results

A Method to Detect Differential Gene Expression in Cross-Species Hybridization Experiments at Gene and Probe Level

Biomedical Informatics Insights ◽

10.4137/bii.s3846 ◽

2010 ◽

Vol 3 ◽

pp. BII.S3846 ◽

Cited By ~ 1

Author(s):

Ying Chen ◽

Rebekah Wu ◽

James Felton ◽

David M. Rocke ◽

Anu Chakicherla

Keyword(s):

Gene Expression ◽

Gene Set Analysis ◽

Model Organisms ◽

Whole Genome ◽

Supplementary Data ◽

Genome Sequences ◽

Data Set ◽

Gene Set ◽

Level Data ◽

Species Hybridization

Motivation: Whole genome microarrays are increasingly becoming the method of choice to study responses in model organisms to disease, stressors or other stimuli. However, whole genome sequences are available for only some model organisms, and there are still many species whose genome sequences are not yet available. Cross-species studies, where arrays developed for one species are used to study gene expression in a closely related species, have been used to address this gap, with some promising results. Current analytical methods have included filtration of some probes or genes that showed low hybridization activities, but consensus filtration schemes are still not available.

Results: A novel masking procedure, based on currently available target species sequences, is proposed to filter out probes, and a cross-species data set is studied using this masking procedure and gene-set analysis. Gene-set analysis evaluates the association of a priori defined gene groups with a phenotype of interest. Two methods, Gene Set Enrichment Analysis (GSEA) and Test of Test Statistics (ToTS), were investigated. The results showed that the masking procedure together with the ToTS method worked well in our data set. The results from an alternative way to study cross-species hybridization experiments without masking are also presented. We hypothesize that the multi-probe structure of Affymetrix microarrays makes it possible to aggregate the effects of both well-hybridized and poorly-hybridized probes to study a group of genes. The principles of gene-set analysis were applied to the probe-level data instead of gene-level data. The results showed that ToTS can give valuable information and thus can be used as a powerful technique for analyzing cross-species hybridization experiments.

Availability: Software in the form of R code is available at http://anson.ucdavis.edu/~ychen/cross-species.html

Supplementary Data: Supplementary data are available at http://anson.ucdavis.edu/~ychen/cross-species.html

A COMPRESSED SENSING BASED APPROACH FOR SUBTYPING OF LEUKEMIA FROM GENE EXPRESSION DATA

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720011005689 ◽

2011 ◽

Vol 09 (05) ◽

pp. 631-645 ◽

Cited By ~ 12

Author(s):

WENLONG TANG ◽

HONGBAO CAO ◽

JUNBO DUAN ◽

YU-PING WANG

Keyword(s):

Gene Expression ◽

Compressed Sensing ◽

Gene Expression Analysis ◽

Expression Data ◽

Data Set ◽

New Methods ◽

Genome Wide ◽

Different Types ◽

Genome Wide Data ◽

Improved Accuracy

With the development of genomic techniques, the demand for new methods that can handle high-throughput genome-wide data effectively is becoming stronger than ever before. Compressed sensing (CS) is an emerging approach in statistics and signal processing. With the CS theory, a signal can be uniquely reconstructed or approximated from its sparse representations, which can therefore better distinguish different types of signals. However, the application of the CS approach to genome-wide data analysis has rarely been investigated. We propose a novel CS-based approach for genomic data classification and test its performance in the subtyping of leukemia through gene expression analysis. Detecting subtypes of cancers such as leukemia according to their different genetic makeups is significant because it holds promise for the individualization of therapies and the improvement of treatments. In our work, four statistical features were employed to select significant genes for the classification. Using genes selected from a pool of 7,129, the proposed CS method achieved a classification accuracy of 97.4% when evaluated with cross-validation and 94.3% when evaluated with another independent data set. The robustness of the method to noise was also tested, giving good performance. Therefore, this work demonstrates that the CS method can effectively detect subtypes of leukemia, implying improved accuracy of diagnosis of leukemia.

Non-Homologous End-Joining Pathway Associated with Occurrence of Myocardial Infarction: Gene Set Analysis of Genome-Wide Association Study Data

PLoS ONE ◽

10.1371/journal.pone.0056262 ◽

2013 ◽

Vol 8 (2) ◽

pp. e56262 ◽

Cited By ~ 12

Author(s):

Jeffrey J. W. Verschuren ◽

Stella Trompet ◽

Joris Deelen ◽

David J. Stott ◽

Naveed Sattar ◽

...

Keyword(s):

Myocardial Infarction ◽

Association Study ◽

Genome Wide Association Study ◽

Study Data ◽

Genome Wide Association ◽

Gene Set Analysis ◽

End Joining ◽

Gene Set ◽

Genome Wide ◽

Non Homologous End Joining

PathCORE-T: identifying and visualizing globally co-occurring pathways in large transcriptomic compendia

10.1101/147645 ◽

2017 ◽

Author(s):

Kathleen M. Chen ◽

Jie Tan ◽

Gregory P. Way ◽

Georgia Doing ◽

Deborah A. Hogan ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Original Data ◽

Knowledge Bases ◽

Expression Data ◽

Expression Levels ◽

Genome Wide ◽

Genome Wide Data ◽

Tcga Dataset ◽

Construction Algorithms

Background: Investigators often interpret genome-wide data by analyzing the expression levels of genes within pathways. While this within-pathway analysis is routine, the products of any one pathway can affect the activity of other pathways. Past efforts to identify relationships between biological processes have evaluated overlap in knowledge bases or evaluated changes that occur after specific treatments. Individual experiments can highlight condition-specific pathway-pathway relationships; however, constructing a complete network of such relationships across many conditions requires analyzing results from many studies.

Results: We developed the PathCORE-T framework by implementing existing methods to identify pathway-pathway transcriptional relationships evident across a broad data compendium. PathCORE-T is applied to the output of feature construction algorithms; it identifies pairs of pathways observed in features more than expected by chance as functionally co-occurring. We demonstrate PathCORE-T by analyzing an existing eADAGE model of a microbial compendium and by building and analyzing NMF features from the TCGA dataset of 33 cancer types. The PathCORE-T framework includes a demonstration web interface, with source code, that users can launch to (1) visualize the network and (2) review the expression levels of associated genes in the original data. PathCORE-T creates and displays the network of globally co-occurring pathways based on features observed in a machine learning analysis of gene expression data.

Conclusions: The PathCORE-T framework identifies transcriptionally co-occurring pathways from the results of unsupervised analysis of gene expression data and visualizes the relationships between pathways as a network. PathCORE-T recapitulated previously described pathway-pathway relationships and suggested experimentally testable additional hypotheses that remain to be explored.

Importance of SNP Dependency Correction and Association Integration for Gene Set Analysis in Genome-Wide Association Studies

Frontiers in Genetics ◽

10.3389/fgene.2021.767358 ◽

2021 ◽

Vol 12 ◽

Author(s):

Michal Marczyk ◽

Agnieszka Macioszek ◽

Joanna Tobiasz ◽

Joanna Polanska ◽

Joanna Zyla

Keyword(s):

Association Studies ◽

Enrichment Analysis ◽

Gene Set Enrichment Analysis ◽

Genome Wide Association ◽

Gene Set Analysis ◽

Genome Wide Association Studies ◽

Gene Set Enrichment ◽

Gene Set ◽

Genome Wide ◽

The Impact

A typical genome-wide association study (GWAS) analyzes millions of single-nucleotide polymorphisms (SNPs), several of which are in a region of the same gene. To conduct gene set analysis (GSA), information from SNPs needs to be unified at the gene level. A widely used practice is to use only the most relevant SNP per gene; however, there are other methods of integration that could be applied here. Also, the problem of nonrandom association of alleles at two or more loci is often neglected. Here, we tested the impact of different integration methods and linkage disequilibrium (LD) correction on the performance of several GSA methods. Matched normal and breast cancer samples from The Cancer Genome Atlas database were used to evaluate the performance of six GSA algorithms: Coincident Extreme Ranks in Numerical Observations (CERNO), Gene Set Enrichment Analysis (GSEA), GSEA-SNP, improved GSEA for GWAS (i-GSEA4GWAS), Meta-Analysis Gene-set Enrichment of variaNT Associations (MAGENTA), and Over-Representation Analysis (ORA). Association of SNPs with phenotype was calculated using a modified McNemar’s test. Results for SNPs mapped to the same gene were integrated using the Fisher and Stouffer methods and compared with the minimum p-value method. Four common measures were used to quantify the performance of all combinations of methods. Results of GSA on GWAS data were compared with those obtained on gene expression data. Comparing all evaluation metrics across different GSA algorithms, integrations, and LD correction, we highlighted CERNO, and MAGENTA with Stouffer, as the most efficient. Applying LD correction increased prioritization and specificity of enrichment outcomes for all tested algorithms. When Fisher or Stouffer were used with LD correction, sensitivity and reproducibility were also better. Using any integration method was beneficial in comparison with the minimum p-value method in specific combinations. The correlation between GSA results at the genomic and transcriptomic levels was highest when Stouffer integration was combined with LD correction. We thoroughly evaluated different approaches to GSA in GWAS in terms of performance to guide others in selecting the most effective combinations. We showed that LD correction and Stouffer integration could increase the performance of enrichment analysis, and we encourage the usage of these techniques.
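The gene-level integration step named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fisher_combine`, `stouffer_combine` and `min_p` are hypothetical, the input p-values are made up for illustration, and the SNP p-values are assumed to be independent, so the LD correction that the study evaluates is deliberately left out.

```python
import math
from statistics import NormalDist


def _check(pvalues):
    """Require a non-empty list of p-values strictly between 0 and 1."""
    if not pvalues or any(not 0.0 < p < 1.0 for p in pvalues):
        raise ValueError("p-values must be a non-empty list in (0, 1)")


def fisher_combine(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-square with 2k df.

    With an even number of degrees of freedom, the upper tail has the
    closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!, so no SciPy is needed.
    Assumes independent p-values (no LD correction).
    """
    _check(pvalues)
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def stouffer_combine(pvalues):
    """Stouffer's method (unweighted): Z = sum(z_i) / sqrt(k).

    Each one-sided p-value is mapped to z_i = -inv_cdf(p_i), and the
    combined p-value is the upper tail of the standard normal at Z.
    Assumes independent p-values (no LD correction).
    """
    _check(pvalues)
    nd = NormalDist()
    z = sum(-nd.inv_cdf(p) for p in pvalues) / math.sqrt(len(pvalues))
    return nd.cdf(-z)


def min_p(pvalues):
    """Minimum p-value method: keep only the most significant SNP."""
    _check(pvalues)
    return min(pvalues)


# Hypothetical p-values for three SNPs mapped to one gene.
snp_pvalues = [0.01, 0.20, 0.03]
for name, method in [("Fisher", fisher_combine),
                     ("Stouffer", stouffer_combine),
                     ("min-p", min_p)]:
    print(f"{name}: {method(snp_pvalues):.4g}")
```

Fisher and Stouffer use every SNP in the gene, whereas the minimum p-value method discards all but one. That is the contrast the abstract draws between integration methods and the most-relevant-SNP practice. With correlated SNPs, both combined p-values would be too small unless a dependency correction is applied.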

Genome-wide gene-set analysis for identification of pathways associated with alcohol dependence

The International Journal of Neuropsychopharmacology ◽

10.1017/s1461145712000375 ◽

2012 ◽

Vol 16 (2) ◽

pp. 271-278 ◽

Cited By ~ 24

Author(s):

Joanna M. Biernacka ◽

Jennifer Geske ◽

Gregory D. Jenkins ◽

Colin Colby ◽

David N. Rider ◽

...

Keyword(s):

Alcohol Dependence ◽

Association Studies ◽

Gene Set Analysis ◽

Receptor Interaction ◽

Genome Wide Association Studies ◽

Addictive Disorders ◽

Gene Set ◽

Genome Wide ◽

Individual Snps ◽

A Genome

It is believed that multiple genetic variants with small individual effects contribute to the risk of alcohol dependence. Such polygenic effects are difficult to detect in genome-wide association studies that test for association of the phenotype with each single nucleotide polymorphism (SNP) individually. To overcome this challenge, gene-set analysis (GSA) methods that jointly test for the effects of pre-defined groups of genes have been proposed. Rather than testing for association between the phenotype and individual SNPs, these analyses evaluate the global evidence of association with a set of related genes, enabling the identification of cellular or molecular pathways or biological processes that play a role in development of the disease. It is hoped that by aggregating the evidence of association for all available SNPs in a group of related genes, these approaches will have enhanced power to detect genetic associations with complex traits. We performed GSA using data from a genome-wide study of 1165 alcohol-dependent cases and 1379 controls from the Study of Addiction: Genetics and Environment (SAGE), for all 200 pathways listed in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Results demonstrated a potential role of the ‘synthesis and degradation of ketone bodies’ pathway. Our results also support the potential involvement of the ‘neuroactive ligand–receptor interaction’ pathway, which has previously been implicated in addictive disorders. These findings demonstrate the utility of GSA in the study of complex disease, and suggest specific directions for further research into the genetic architecture of alcohol dependence.

Correction: Time-Course Gene Set Analysis for Longitudinal Gene Expression Data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1004446 ◽

2015 ◽

Vol 11 (8) ◽

pp. e1004446

Author(s):

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Time Course ◽

Gene Set Analysis ◽

Expression Data ◽

Gene Set
