RASMA: A Reverse Search Algorithm for Mining Maximal Frequent Subgraphs

Abstract Background: Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Maximal frequent subgraphs are a representative set of frequent subgraphs; a frequent subgraph is maximal if it does not have a super-graph that is frequent. In the bioinformatics discipline, methodologies for mining frequent and/or maximal frequent subgraphs can be used to discover interesting network motifs that elucidate complex interactions among genes, reflected through the edges of the frequent subnetworks. Further study of frequent coexpression subnetworks enhances the discovery of biological modules and biological signatures for gene expression and disease classification. Results: We propose a reverse search algorithm, called RASMA, for mining frequent and maximal frequent subgraphs in a given collection of graphs. A key innovation in RASMA is a connected subgraph enumerator that uses a reverse-search strategy to enumerate connected subgraphs of an undirected graph. Using this enumeration strategy, RASMA obtains all maximal frequent subgraphs very efficiently. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for the maximal frequent subgraphs, RASMA employs several pruning strategies that substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs. Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are frequent are highly enriched with known biological ontologies.
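
To make the definitions in the abstract concrete, the following is a minimal brute-force sketch, not RASMA itself. It assumes that every network is given as a set of edges over the same gene labels, so a subgraph is simply an edge set and "appears in a network" means set containment. The function names, the toy networks, and the `min_support` parameter are illustrative assumptions, and the exhaustive enumeration is only practical for very small inputs.

```python
from itertools import combinations


def is_connected(edges):
    """Return True if the edges form one connected (edge-induced) subgraph."""
    edges = list(edges)
    if not edges:
        return False
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(adjacency)


def support(subgraph, networks):
    """Number of networks that contain every edge of the subgraph."""
    return sum(1 for network in networks if subgraph <= network)


def maximal_frequent_subgraphs(networks, min_support):
    """Brute force: list connected frequent edge sets, keep those with no frequent superset."""
    all_edges = sorted(set().union(*networks))
    frequent = [
        frozenset(candidate)
        for size in range(1, len(all_edges) + 1)
        for candidate in combinations(all_edges, size)
        if is_connected(candidate)
        and support(frozenset(candidate), networks) >= min_support
    ]
    return [s for s in frequent if not any(s < t for t in frequent)]


# Hypothetical toy networks over genes A..E (edges are sorted name pairs).
toy_networks = [
    {("A", "B"), ("B", "C"), ("C", "D")},
    {("A", "B"), ("B", "C"), ("D", "E")},
    {("A", "B"), ("C", "D"), ("D", "E")},
]
toy_result = maximal_frequent_subgraphs(toy_networks, min_support=2)
```

With a support threshold of 2, the edge A-B is frequent but not maximal, because the frequent subgraph {A-B, B-C} contains it; the single edges C-D and D-E are maximal because none of their connected extensions reaches the threshold.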

RASMA: A Reverse Search Algorithm for Mining Maximal Frequent Subgraphs

10.21203/rs.3.rs-46148/v2 ◽

2020 ◽

Author(s):

Saeed Salem ◽

Mohammed Alokshiya ◽

Mohammad Al Hasan

Keyword(s):

Gene Expression ◽

Search Algorithm ◽

Enrichment Analysis ◽

Research Problem ◽

Disease Classification ◽

Connected Subgraph ◽

Gene Coexpression ◽

Key Innovation ◽

Frequent Subgraphs ◽

Coexpression Networks

RASMA: a reverse search algorithm for mining maximal frequent subgraphs

BioData Mining ◽

10.1186/s13040-021-00250-1 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Saeed Salem ◽

Mohammed Alokshiya ◽

Mohammad Al Hasan

Keyword(s):

Gene Expression ◽

Search Algorithm ◽

Enrichment Analysis ◽

Research Problem ◽

Disease Classification ◽

Connected Subgraph ◽

Gene Coexpression ◽

Key Innovation ◽

Frequent Subgraphs ◽

Coexpression Networks

Abstract Background: Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Maximal frequent subgraphs are a representative set of frequent subgraphs; a frequent subgraph is maximal if it does not have a super-graph that is frequent. In the bioinformatics discipline, methodologies for mining frequent and/or maximal frequent subgraphs can be used to discover interesting network motifs that elucidate complex interactions among genes, reflected through the edges of the frequent subnetworks. Further study of frequent coexpression subnetworks enhances the discovery of biological modules and biological signatures for gene expression and disease classification. Results: We propose a reverse search algorithm, called RASMA, for mining frequent and maximal frequent subgraphs in a given collection of graphs. A key innovation in RASMA is a connected subgraph enumerator that uses a reverse-search strategy to enumerate connected subgraphs of an undirected graph. Using this enumeration strategy, RASMA obtains all maximal frequent subgraphs very efficiently. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for the maximal frequent subgraphs, RASMA employs several pruning strategies that substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs. Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are frequent are highly enriched with known biological ontologies.
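
The reverse-search idea behind a connected subgraph enumerator can be sketched as follows. This is an illustrative sketch only: the abstract does not specify RASMA's parent function, so the rule used here (drop the largest edge whose removal leaves the rest connected) is an assumption, as are all function names. Because every connected edge set with two or more edges has exactly one parent under this rule, each set is reached exactly once and no visited-set is needed.

```python
def connected(edge_set):
    """Return True if the edges form a single connected edge-induced subgraph."""
    edges = list(edge_set)
    if not edges:
        return False
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(adjacency)


def parent(edge_set):
    """Assumed parent rule: remove the largest edge that keeps the rest connected."""
    for edge in sorted(edge_set, reverse=True):
        rest = edge_set - {edge}
        if connected(rest):
            return rest
    raise ValueError("a connected edge set with >= 2 edges always has a parent")


def enumerate_connected_subgraphs(graph_edges):
    """Yield every connected edge-induced subgraph exactly once."""
    graph_edges = frozenset(graph_edges)
    # Single edges are the roots of the reverse-search forest.
    stack = [frozenset([edge]) for edge in sorted(graph_edges)]
    while stack:
        current = stack.pop()
        yield current
        vertices = {v for edge in current for v in edge}
        for edge in graph_edges - current:
            if edge[0] in vertices or edge[1] in vertices:
                child = current | {edge}
                # Descend only if `current` is the unique parent of `child`.
                if parent(child) == current:
                    stack.append(child)


triangle = [("a", "b"), ("a", "c"), ("b", "c")]
path = [("a", "b"), ("b", "c"), ("c", "d")]
triangle_subgraphs = list(enumerate_connected_subgraphs(triangle))
path_subgraphs = list(enumerate_connected_subgraphs(path))
```

A frequent-subgraph miner built on such an enumerator would additionally stop descending as soon as a child falls below the support threshold, since adding edges can never increase support.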

RASMA: A Reverse Search Algorithm for Mining Frequent Subgraphs

10.21203/rs.3.rs-46148/v1 ◽

2020 ◽

Author(s):

Saeed Salem ◽

Mohammed Alokshiya ◽

Mohammad Al Hasan

Keyword(s):

Gene Expression ◽

Search Algorithm ◽

Enrichment Analysis ◽

Disease Classification ◽

Induced Subgraphs ◽

Biologically Relevant ◽

Large Gene ◽

Biological Ontologies ◽

Gene Coexpression ◽

Frequent Subgraphs

Abstract Background: Mining frequent co-expression networks enables the discovery of interesting network motifs that elucidate important interactions among genes. Such interaction subnetworks have been shown to enhance the discovery of biological modules and subnetwork signatures for gene expression and disease classification. Results: We propose a reverse search algorithm for mining frequent and maximal subgraphs over a collection of graphs. We develop an approach for enumerating connected edge-induced subgraphs of an undirected graph by using a reverse-search algorithm, and then use this enumeration strategy for mining all maximal frequent subgraphs. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for maximal subgraphs, the proposed algorithm employs several pruning strategies, which substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs. Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are more frequent are more likely to be enriched with biological ontologies.

Integration of Gene Coexpression Network, GO Enrichment Analysis for Identification Gene Expression Signature of Invasive Bladder Carcinoma

Transcriptomics Open Access ◽

10.4172/2329-8936.1000126 ◽

2016 ◽

Vol 04 (01) ◽

Cited By ~ 2

Author(s):

Hanaa Hibishy Gaballah

Keyword(s):

Gene Expression ◽

Bladder Carcinoma ◽

Gene Expression Signature ◽

Enrichment Analysis ◽

Coexpression Network ◽

Gene Coexpression Network ◽

Expression Signature ◽

Gene Coexpression ◽

Go Enrichment ◽

Go Enrichment Analysis

Mining approximate frequent dense modules from multiple gene expression datasets

10.29007/d87q ◽

2020 ◽

Author(s):

San Ha Seo ◽

Saeed Salem

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Expression Data ◽

Frequent Subgraph Mining ◽

Multiple Gene ◽

Real Gene ◽

Frequent Subgraph ◽

Frequent Subgraphs ◽

Coexpression Networks ◽

Functional Gene Annotation

A large amount of gene expression data has been collected for various environmental and biological conditions. Extracting subnetworks that recur across multiple co-expression networks has shown promise for functional gene annotation and biomarker discovery. Frequent subgraph mining, however, reports a large number of subnetworks. In this work, we propose to mine approximate dense frequent subgraphs. Our proposed approach reports representative frequent subgraphs that are also dense. Our experiments on real gene coexpression networks show that frequent subgraphs are biologically interesting, as evidenced by the large percentage of biologically enriched frequent dense subgraphs.

Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data

Genome Biology ◽

10.1186/s13059-021-02568-9 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Kayla A. Johnson ◽

Arjun Krishnan

Keyword(s):

Gene Expression ◽

Expression Data ◽

Rna Seq ◽

Functional Relationships ◽

Gene Coexpression ◽

Transformation Methods ◽

Network Transformation ◽

Almost All ◽

Coexpression Networks ◽

The Impact

Abstract Background: Constructing gene coexpression networks is a powerful approach for analyzing high-throughput gene expression data towards module identification, gene function prediction, and disease-gene prioritization. While optimal workflows for constructing coexpression networks, including good choices for data pre-processing, normalization, and network transformation, have been developed for microarray-based expression data, such well-tested choices do not exist for RNA-seq data. Almost all studies that compare data processing and normalization methods for RNA-seq focus on the end goal of determining differential gene expression. Results: Here, we present a comprehensive benchmarking and analysis of 36 different workflows, each with a unique set of normalization and network transformation methods, for constructing coexpression networks from RNA-seq datasets. We test these workflows on both large, homogenous datasets and small, heterogeneous datasets from various labs. We analyze the workflows in terms of aggregate performance, individual method choices, and the impact of multiple dataset experimental factors. Our results demonstrate that between-sample normalization has the biggest impact, with counts adjusted by size factors producing networks that most accurately recapitulate known tissue-naive and tissue-aware gene functional relationships. Conclusions: Based on this work, we provide concrete recommendations on robust procedures for building an accurate coexpression network from an RNA-seq dataset. In addition, researchers can examine all the results in great detail at https://krishnanlab.github.io/RNAseq_coexpression to make appropriate choices for coexpression analysis based on the experimental factors of their RNA-seq dataset.
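
As a rough illustration of what adjusting counts by per-sample size factors means, the sketch below uses the median-of-ratios estimator. This is an assumption made for the example: the abstract does not say which size factors the recommended workflow uses, and the function names and the toy count matrix are hypothetical. The input is a genes-by-samples list of raw counts.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factor per sample; counts[g][s] is gene g in sample s."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # genes with a zero count have no finite log geometric mean
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for sample, count in enumerate(row):
            log_ratios[sample].append(math.log(count) - log_geo_mean)
    # Assumes at least one gene is nonzero in every sample.
    return [math.exp(median(ratios)) for ratios in log_ratios]


def normalize(counts):
    """Divide each sample's counts by that sample's size factor."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Hypothetical toy matrix: sample 2 was sequenced exactly twice as deeply as sample 1.
toy_counts = [[10, 20], [30, 60], [5, 10]]
toy_factors = size_factors(toy_counts)
toy_normalized = normalize(toy_counts)
```

In the toy matrix the second sample is a twofold-deeper copy of the first, so after normalization the two columns coincide; a coexpression measure such as Pearson correlation would then be computed on the adjusted values rather than on the raw counts.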

Constructing local cell-specific networks from single-cell data

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2113178118 ◽

2021 ◽

Vol 118 (51) ◽

pp. e2113178118

Author(s):

Xuran Wang ◽

David Choi ◽

Kathryn Roeder

Keyword(s):

Gene Expression ◽

Single Cell ◽

Gene Networks ◽

Fetal Brain ◽

Autism Spectrum ◽

Cellular Level ◽

Cellular Heterogeneity ◽

Gene Coexpression ◽

Downstream Analysis ◽

Coexpression Networks

Gene coexpression networks yield critical insights into biological processes, and single-cell RNA sequencing provides an opportunity to target inquiries at the cellular level. However, due to the sparsity and heterogeneity of transcript counts, it is challenging to construct accurate gene networks. We develop an approach, locCSN, that estimates cell-specific networks (CSNs) for each cell, preserving information about cellular heterogeneity that is lost with other approaches. LocCSN is based on a nonparametric investigation of the joint distribution of gene expression; hence it can readily detect nonlinear correlations, and it is more robust to distributional challenges. Although individual CSNs are estimated with considerable noise, average CSNs provide stable estimates of networks, which reveal gene communities better than traditional measures. Additionally, we propose downstream analysis methods using CSNs to utilize more fully the information contained within them. Repeated estimates of gene networks facilitate testing for differences in network structure between cell groups. Notably, with this approach, we can identify differential network genes, which typically do not differ in gene expression, but do differ in terms of the coexpression networks. These genes might help explain the etiology of disease. Finally, to further our understanding of autism spectrum disorder, we examine the evolution of gene networks in fetal brain cells and compare the CSNs of cells sampled from case and control subjects to reveal intriguing patterns in gene coexpression.

Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data

10.1101/2020.09.22.308577 ◽

2020 ◽

Author(s):

Kayla A Johnson ◽

Arjun Krishnan

Keyword(s):

Gene Expression ◽

Tissue Expression ◽

Specific Gene ◽

Expression Data ◽

Rna Seq ◽

Functional Relationships ◽

Gene Coexpression ◽

Network Transformation ◽

Coexpression Networks ◽

The Impact

Abstract Background: Constructing gene coexpression networks is a powerful approach for analyzing high-throughput gene expression data towards module identification, gene function prediction, and disease-gene prioritization. While optimal workflows for constructing coexpression networks – including good choices for data pre-processing, normalization, and network transformation – have been developed for microarray-based expression data, such well-tested choices do not exist for RNA-seq data. Almost all studies that compare data processing/normalization methods for RNA-seq focus on the end goal of determining differential gene expression. Results: Here, we present a comprehensive benchmarking and analysis of 30 different workflows, each with a unique set of normalization and network transformation methods, for constructing coexpression networks from RNA-seq datasets. We tested these workflows on both large, homogenous datasets (Genotype-Tissue Expression project) and small, heterogeneous datasets from various labs (submitted to the Sequence Read Archive). We analyzed the workflows in terms of aggregate performance, individual method choices, and the impact of multiple dataset experimental factors. Our results demonstrate that between-sample normalization has the biggest impact, with trimmed mean of M-values or upper quartile normalization producing networks that most accurately recapitulate known tissue-naive and tissue-specific gene functional relationships. Conclusions: Based on this work, we provide concrete recommendations on robust procedures for building an accurate coexpression network from an RNA-seq dataset. In addition, researchers can examine all the results in great detail at https://krishnanlab.github.io/norm_for_RNAseq_coexp to make appropriate choices for coexpression analysis based on the experimental factors of their RNA-seq dataset.

Integrative Network-Based Analysis Reveals Gene Networks and Novel Drug Repositioning Candidates for Alzheimer Disease

Neurology Genetics ◽

10.1212/nxg.0000000000000622 ◽

2021 ◽

Vol 7 (5) ◽

pp. e622

Author(s):

Zachary F. Gerring ◽

Eric R. Gamazon ◽

Anthony White ◽

Eske M. Derks

Keyword(s):

Gene Expression ◽

Alzheimer Disease ◽

Drug Repositioning ◽

Genome Wide Association ◽

Specific Gene ◽

Tissue Specific ◽

Tissue Specific Gene ◽

Gene Coexpression ◽

Genome Wide ◽

Coexpression Networks

Background and Objectives: To integrate genome-wide association study data with tissue-specific gene expression information to identify coexpression networks, biological pathways, and drug repositioning candidates for Alzheimer disease. Methods: We integrated genome-wide association summary statistics for Alzheimer disease with tissue-specific gene coexpression networks from brain tissue samples in the Genotype-Tissue Expression study. We identified gene coexpression networks enriched with genetic signals for Alzheimer disease and characterized the associated networks using biological pathway analysis. The disease-implicated modules were subsequently used as a molecular substrate for a computational drug repositioning analysis, in which we (1) imputed genetically regulated gene expression within Alzheimer disease implicated modules; (2) integrated the imputed gene expression levels with drug-gene signatures from the connectivity map to identify compounds that normalize dysregulated gene expression underlying Alzheimer disease; and (3) prioritized drug compounds and mechanisms of action based on the extent to which they normalize dysregulated expression signatures. Results: Genetic factors for Alzheimer disease are enriched in brain gene coexpression networks involved in the immune response. Computational drug repositioning analyses of expression changes within the disease-associated networks retrieved known Alzheimer disease drugs (e.g., memantine) as well as biologically meaningful drug categories (e.g., glutamate receptor antagonists). Discussion: Our results improve the biological interpretation of genetic data for Alzheimer disease and provide a list of potential antidementia drug repositioning candidates for which the efficacy should be investigated in functional validation studies.

Pathway Activity Score Learning for Dimensionality Reduction of Gene Expression Data

Discovery Science - Lecture Notes in Computer Science ◽

10.1007/978-3-030-61527-7_17 ◽

2020 ◽

pp. 246-261

Author(s):

Ioulia Karagiannaki ◽

Yannis Pantazis ◽

Ekaterini Chatzaki ◽

Ioannis Tsamardinos

Keyword(s):

Gene Expression ◽

Dimensionality Reduction ◽

Enrichment Analysis ◽

Disease Classification ◽

Gene Set Enrichment Analysis ◽

Activity Score ◽

Use Case ◽

Pathway Activity ◽

Latent Space ◽

Biological Interpretation

Abstract Molecular gene-expression datasets consist of samples with tens of thousands of measured quantities (i.e., high-dimensional data). However, there exist lower-dimensional representations that retain the useful information. We present a novel algorithm for such dimensionality reduction called Pathway Activity Score Learning (PASL). The major novelty of PASL is that the constructed features directly correspond to known molecular pathways and can be interpreted as pathway activity scores. Hence, unlike PCA and similar methods, PASL’s latent space has a relatively straightforward biological interpretation. As a use case, PASL is applied to two collections of breast cancer and leukemia gene expression datasets. We show that PASL retains the predictive information for disease classification on new, unseen datasets, and that it outperforms PLIER, a recently proposed competitive method. We also show that differential activation pathway analysis provides complementary information to standard gene set enrichment analysis. The code is available at https://github.com/mensxmachina/PASL.
