RASMA: A Reverse Search Algorithm for Mining Frequent Subgraphs

Abstract Background: Mining frequent co-expression networks enables the discovery of interesting network motifs that elucidate important interactions among genes. Such interaction subnetworks have been shown to enhance the discovery of biological modules and subnetwork signatures for gene expression and disease classification. Results: We propose a reverse search algorithm for mining frequent and maximal subgraphs over a collection of graphs. We develop an approach for enumerating connected edge-induced subgraphs of an undirected graph by using a reverse-search algorithm, and then use this enumeration strategy for mining all maximal frequent subgraphs. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for maximal subgraphs, the proposed algorithm employs several pruning strategies, which substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs. Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are more frequent are more likely to be enriched with biological ontologies.
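
To make the enumeration task in this abstract concrete, here is a minimal Python sketch that lists every connected edge-induced subgraph of a small undirected graph. It is not the reverse-search enumerator the authors describe: it grows subgraphs one adjacent edge at a time and removes duplicates with a visited set, which is exactly the bookkeeping a reverse-search strategy is designed to avoid. The function name and the toy graphs are assumptions made for illustration.

```python
def connected_edge_subgraphs(edges):
    """Return all non-empty connected edge-induced subgraphs.

    Each subgraph is a frozenset of edges; each edge is a sorted tuple.
    Brute-force illustration only (hypothetical helper, not RASMA).
    """
    edges = [tuple(sorted(e)) for e in edges]
    seen = set()
    # Every single edge is a connected subgraph and a starting point.
    stack = [frozenset([e]) for e in edges]
    while stack:
        sub = stack.pop()
        if sub in seen:
            continue  # already reached through another growth order
        seen.add(sub)
        nodes = {v for e in sub for v in e}
        # Extend by any edge that touches the current subgraph.
        for e in edges:
            if e not in sub and (e[0] in nodes or e[1] in nodes):
                stack.append(sub | {e})
    return seen


triangle = [("a", "b"), ("b", "c"), ("a", "c")]
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(len(connected_edge_subgraphs(triangle)))  # 7
print(len(connected_edge_subgraphs(path)))      # 6
```

The visited set grows with the number of subgraphs, so this approach is only practical on toy inputs.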


RASMA: A Reverse Search Algorithm for Mining Maximal Frequent Subgraphs

10.21203/rs.3.rs-46148/v3 ◽

2021 ◽

Author(s):

Saeed Salem ◽

Mohammed Alokshiya ◽

Mohammad Al Hasan

Keyword(s):

Gene Expression ◽

Search Algorithm ◽

Enrichment Analysis ◽

Research Problem ◽

Disease Classification ◽

Connected Subgraph ◽

Gene Coexpression ◽

Key Innovation ◽

Frequent Subgraphs ◽

Coexpression Networks

Abstract Background: Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Maximal frequent subgraphs are a representative set of frequent subgraphs; a frequent subgraph is maximal if it does not have a super-graph that is frequent. In the bioinformatics discipline, methodologies for mining frequent and/or maximal frequent subgraphs can be used to discover interesting network motifs that elucidate complex interactions among genes, reflected through the edges of the frequent subnetworks. Further study of frequent coexpression subnetworks enhances the discovery of biological modules and biological signatures for gene expression and disease classification. Results: We propose a reverse search algorithm, called RASMA, for mining frequent and maximal frequent subgraphs in a given collection of graphs. A key innovation in RASMA is a connected subgraph enumerator that uses a reverse-search strategy to enumerate connected subgraphs of an undirected graph. Using this enumeration strategy, RASMA obtains all maximal frequent subgraphs very efficiently. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for the maximal frequent subgraphs, RASMA employs several pruning strategies that substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs. Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are frequent are highly enriched with known biological ontologies.


RASMA: A Reverse Search Algorithm for Mining Maximal Frequent Subgraphs

10.21203/rs.3.rs-46148/v2 ◽

2020 ◽

Author(s):

Saeed Salem ◽

Mohammed Alokshiya ◽

Mohammad Al Hasan

Keyword(s):

Gene Expression ◽

Search Algorithm ◽

Enrichment Analysis ◽

Research Problem ◽

Disease Classification ◽

Connected Subgraph ◽

Gene Coexpression ◽

Key Innovation ◽

Frequent Subgraphs ◽

Coexpression Networks

Abstract Background: Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Maximal frequent subgraphs are a representative set of frequent subgraphs; a frequent subgraph is maximal if it does not have a super-graph that is frequent. In the bioinformatics discipline, methodologies for mining frequent and/or maximal frequent subgraphs can be used to discover interesting network motifs that elucidate complex interactions among genes, reflected through the edges of the frequent subnetworks. Further study of frequent coexpression subnetworks enhances the discovery of biological modules and biological signatures for gene expression and disease classification. Results: We propose a reverse search algorithm, called RASMA, for mining frequent and maximal frequent subgraphs in a given collection of graphs. A key innovation in RASMA is a connected subgraph enumerator that uses a reverse-search strategy to enumerate connected subgraphs of an undirected graph. Using this enumeration strategy, RASMA obtains all maximal frequent subgraphs very efficiently. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for the maximal frequent subgraphs, RASMA employs several pruning strategies that substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs. Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are frequent are highly enriched with known biological ontologies.


RASMA: a reverse search algorithm for mining maximal frequent subgraphs

BioData Mining ◽

10.1186/s13040-021-00250-1 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Saeed Salem ◽

Mohammed Alokshiya ◽

Mohammad Al Hasan

Keyword(s):

Gene Expression ◽

Search Algorithm ◽

Enrichment Analysis ◽

Research Problem ◽

Disease Classification ◽

Connected Subgraph ◽

Gene Coexpression ◽

Key Innovation ◽

Frequent Subgraphs ◽

Coexpression Networks

Abstract Background: Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Maximal frequent subgraphs are a representative set of frequent subgraphs; a frequent subgraph is maximal if it does not have a super-graph that is frequent. In the bioinformatics discipline, methodologies for mining frequent and/or maximal frequent subgraphs can be used to discover interesting network motifs that elucidate complex interactions among genes, reflected through the edges of the frequent subnetworks. Further study of frequent coexpression subnetworks enhances the discovery of biological modules and biological signatures for gene expression and disease classification. Results: We propose a reverse search algorithm, called RASMA, for mining frequent and maximal frequent subgraphs in a given collection of graphs. A key innovation in RASMA is a connected subgraph enumerator that uses a reverse-search strategy to enumerate connected subgraphs of an undirected graph. Using this enumeration strategy, RASMA obtains all maximal frequent subgraphs very efficiently. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for the maximal frequent subgraphs, RASMA employs several pruning strategies that substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs. Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are frequent are highly enriched with known biological ontologies.
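
The definition of a maximal frequent subgraph can be illustrated with a small brute-force Python sketch. It assumes that all networks share the same gene labels, so a subgraph is simply a set of gene pairs and its support is the number of networks that contain every one of those pairs. The gene names, the `min_support` threshold, and both function names are hypothetical; this is not RASMA and uses neither reverse search nor its pruning strategies.

```python
from itertools import combinations


def is_connected(edge_set):
    """Check that the edges form a single connected component."""
    edge_list = list(edge_set)
    reached = set(edge_list[0])
    changed = True
    while changed:
        changed = False
        for u, v in edge_list:
            # Absorb an edge that has exactly one endpoint reached so far.
            if (u in reached) != (v in reached):
                reached |= {u, v}
                changed = True
    return all(u in reached and v in reached for u, v in edge_list)


def maximal_frequent_subgraphs(graphs, min_support):
    """Brute-force search over edge subsets (exponential; toy inputs only)."""
    graphs = [{tuple(sorted(e)) for e in g} for g in graphs]
    all_edges = sorted(set().union(*graphs))
    frequent = []
    for k in range(1, len(all_edges) + 1):
        for combo in combinations(all_edges, k):
            sub = frozenset(combo)
            support = sum(sub <= g for g in graphs)
            if support >= min_support and is_connected(sub):
                frequent.append(sub)
    # Maximal: no frequent connected subgraph strictly contains it.
    return [s for s in frequent if not any(s < t for t in frequent)]


networks = [
    [("A", "B"), ("B", "C"), ("C", "D")],
    [("A", "B"), ("B", "C"), ("D", "E")],
    [("A", "B"), ("C", "D"), ("D", "E")],
]
for sub in maximal_frequent_subgraphs(networks, min_support=2):
    print(sorted(sub))
```

With a support threshold of 2, the single edges A-B and B-C are frequent but not maximal, because the two-edge subgraph containing both is also frequent; the edges C-D and D-E are maximal on their own.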


Integration of Gene Coexpression Network, GO Enrichment Analysis for Identification Gene Expression Signature of Invasive Bladder Carcinoma

Transcriptomics Open Access ◽

10.4172/2329-8936.1000126 ◽

2016 ◽

Vol 04 (01) ◽

Cited By ~ 2

Author(s):

Hanaa Hibishy Gaballah

Keyword(s):

Gene Expression ◽

Bladder Carcinoma ◽

Gene Expression Signature ◽

Enrichment Analysis ◽

Coexpression Network ◽

Gene Coexpression Network ◽

Expression Signature ◽

Gene Coexpression ◽

Go Enrichment ◽

Go Enrichment Analysis


Analyzing Large Gene Expression Data Sets

Computational Text Analysis ◽

10.1093/oso/9780198567400.003.0014 ◽

2006 ◽

Author(s):

Soumya Raychaudhuri

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Analysis ◽

Gene Expression Analysis ◽

Data Sets ◽

Expression Data ◽

Clustering Methods ◽

Biologically Relevant ◽

Large Gene ◽

Functional Coherence

The most interesting and challenging gene expression data sets to analyze are large multidimensional data sets that contain expression values for many genes across multiple conditions. In these data sets the use of scientific text can be particularly useful, since there are a myriad of genes examined under vastly different conditions, each of which may induce or repress expression of the same gene for different reasons. There is an enormous complexity to the data that we are examining: each gene is associated with dozens if not hundreds of expression values as well as multiple documents built up from vocabularies consisting of thousands of words. In Section 2.4 we reviewed common gene expression strategies, most of which revolve around defining groups of genes based on common profiles. A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present computational methods that leverage the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in gene expression data analysis offers an opportunity to incorporate background functional information about the genes when defining expression clusters. In Chapter 5 we saw how literature-based approaches could help in the analysis of single condition experiments. Here we will apply the strategies introduced in Chapter 6 to assess the coherence of groups of genes to enhance gene expression analysis approaches. The methods proposed here could, in fact, be applied to any multivariate genomics data type. The key concepts discussed in this chapter are listed in the frame box. We begin with a discussion of gene groups and their role in expression analysis; we briefly discuss strategies to assign keywords to groups and strategies to assess their functional coherence. We apply functional coherence measures to gene expression analysis; for examples we focus on a yeast expression data set. We first demonstrate how functional coherence can be used to focus in on the key biologically relevant gene groups derived by clustering methods such as self-organizing maps and k-means clustering.


Identification of Biologically Relevant Biclusters from Gene Expression Dataset of Duchenne Muscular Dystrophy (DMD) Disease Using Elephant Swarm Water Search Algorithm

Advances in Intelligent Systems and Computing - Emerging Technologies in Data Mining and Information Security ◽

10.1007/978-981-15-9927-9_15 ◽

2021 ◽

pp. 147-157

Author(s):

Joy Adhikary ◽

Sriyankar Acharyya

Keyword(s):

Gene Expression ◽

Duchenne Muscular Dystrophy ◽

Muscular Dystrophy ◽

Search Algorithm ◽

Gene Expression Dataset ◽

Biologically Relevant


Identifying gene-specific subgroups: an alternative to biclustering

BMC Bioinformatics ◽

10.1186/s12859-019-3289-0 ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Vincent Branders ◽

Pierre Schaus ◽

Pierre Dupont

Keyword(s):

Gene Expression ◽

Expression Patterns ◽

Enrichment Analysis ◽

R Package ◽

Additional Contribution ◽

Computationally Efficient ◽

Statistical Validation ◽

Experimental Conditions ◽

Large Gene ◽

Significant Gene

Abstract Background: Transcriptome analysis aims at gaining insight into cellular processes through discovering gene expression patterns across various experimental conditions. Biclustering is a standard approach to discover gene subsets with similar expression across subgroups of samples to be identified. The result is a set of biclusters, each forming a specific submatrix of rows (e.g. genes) and columns (e.g. samples). Relevant biclusters can, however, be missed when, due to the presence of a few outliers, they lack the assumed homogeneity of expression values among a few gene/sample combinations. The Max-Sum SubMatrix problem addresses this issue by looking at highly expressed subsets of genes and of samples, without enforcing such homogeneity. Results: We present here the algorithm to identify K relevant submatrices. Our main contribution is to show that this approach outperforms biclustering algorithms to identify several gene subsets representative of specific subgroups of samples. Experiments are conducted on 35 gene expression datasets from human tissues and yeast samples. We report comparative results with those obtained by several biclustering algorithms, including , , , , and . Gene enrichment analysis demonstrates the benefits of the proposed approach to identify more statistically significant gene subsets. The most significant Gene Ontology terms identified with are shown consistent with the controlled conditions of each dataset. This analysis supports the biological relevance of the identified gene subsets. An additional contribution is the statistical validation protocol proposed here to assess the relative performances of biclustering algorithms and of the proposed method. It relies on a Friedman test and Hochberg's sequential procedure to report critical differences of ranks among all algorithms. Conclusions: We propose here the method, a computationally efficient algorithm to identify K max-sum submatrices in a large gene expression matrix. Comparisons show that it identifies more significantly enriched subsets of genes and specific subgroups of samples which are easily interpretable by biologists. Experiments also show its ability to identify more reliable GO terms. These results illustrate the benefits of the proposed approach in terms of interpretability and of biological enrichment quality. Open implementation of this algorithm is available as an R package.
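
As a rough illustration of the Max-Sum SubMatrix idea mentioned in this abstract, the Python sketch below searches a tiny matrix for the subset of rows and columns with the largest total. It assumes the common formulation in which rows and columns need not be contiguous and entries have been centered so that low expression is negative. It is an exhaustive search over column subsets, not the authors' algorithm, and it finds a single submatrix rather than K of them; the function name and the toy matrix are hypothetical.

```python
from itertools import combinations


def max_sum_submatrix(matrix):
    """Return (best_sum, rows, cols) by trying every column subset.

    For a fixed set of columns, the best rows are exactly those whose
    restricted sum is positive, so only column subsets are enumerated.
    Exponential in the number of columns; toy inputs only.
    """
    n_cols = len(matrix[0])
    best = (float("-inf"), (), ())
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            row_sums = [sum(row[c] for c in cols) for row in matrix]
            rows = tuple(i for i, s in enumerate(row_sums) if s > 0)
            total = sum(row_sums[i] for i in rows)
            if total > best[0]:
                best = (total, rows, cols)
    return best


# Rows play the role of genes, columns the role of samples.
expression = [
    [2, 3, -1],
    [1, 2, -4],
    [-3, -2, 5],
]
print(max_sum_submatrix(expression))  # (8, (0, 1), (0, 1))
```

The selected block does not have to be homogeneous: a few low values inside it are tolerated as long as the total stays high, which is the property the abstract contrasts with biclustering.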


Pathway Activity Score Learning for Dimensionality Reduction of Gene Expression Data

Discovery Science - Lecture Notes in Computer Science ◽

10.1007/978-3-030-61527-7_17 ◽

2020 ◽

pp. 246-261

Author(s):

Ioulia Karagiannaki ◽

Yannis Pantazis ◽

Ekaterini Chatzaki ◽

Ioannis Tsamardinos

Keyword(s):

Gene Expression ◽

Dimensionality Reduction ◽

Enrichment Analysis ◽

Disease Classification ◽

Gene Set Enrichment Analysis ◽

Activity Score ◽

Use Case ◽

Pathway Activity ◽

Latent Space ◽

Biological Interpretation

Abstract Molecular gene-expression datasets consist of samples with tens of thousands of measured quantities (i.e., high-dimensional data). However, there exist lower-dimensional representations that retain the useful information. We present a novel algorithm for such dimensionality reduction called Pathway Activity Score Learning (PASL). The major novelty of PASL is that the constructed features directly correspond to known molecular pathways and can be interpreted as pathway activity scores. Hence, unlike PCA and similar methods, PASL's latent space has a relatively straightforward biological interpretation. As a use case, PASL is applied to two collections of breast cancer and leukemia gene expression datasets. We show that PASL retains the predictive information for disease classification on new, unseen datasets, and that it outperforms PLIER, a recently proposed competitive method. We also show that differential activation pathway analysis provides complementary information to standard gene set enrichment analysis. The code is available at https://github.com/mensxmachina/PASL.


CogNet: classification of gene expression data based on ranked active-subnetwork-oriented KEGG pathway enrichment analysis

PeerJ Computer Science ◽

10.7717/peerj-cs.336 ◽

2021 ◽

Vol 7 ◽

pp. e336

Author(s):

Malik Yousef ◽

Ege Ülgen ◽

Osman Uğur Sezerman

Keyword(s):

Gene Expression ◽

Gene Selection ◽

Kegg Pathway ◽

Enrichment Analysis ◽

Biological Knowledge ◽

Pathway Enrichment Analysis ◽

Biologically Relevant ◽

Pathway Enrichment ◽

Kegg Pathways

Most traditional gene selection approaches are borrowed from other fields such as statistics and computer science. However, they do not prioritize biologically relevant genes, since the ultimate goal is to determine features that optimize model performance metrics, not to build a biologically meaningful model. Therefore, there is an imminent need for new computational tools that integrate biological knowledge about the data into the process of gene selection and machine learning. Integrative gene selection enables incorporation of biological domain knowledge from external biological resources. In this study, we propose a new computational approach named CogNet, an integrative gene selection tool that exploits biological knowledge for grouping the genes for the computational modeling tasks of ranking and classification. In CogNet, pathfindR serves as the biological grouping tool, allowing the main algorithm to rank active-subnetwork-oriented KEGG pathway enrichment analysis results to build a biologically relevant model. CogNet provides a list of significant KEGG pathways that can classify the data with very high accuracy. The list also provides the differentially expressed genes belonging to these pathways, which are used as features in the classification problem. The list facilitates deep analysis and better interpretability of the role of KEGG pathways in classification of the data, thus better establishing the biological relevance of these differentially expressed genes. Even though the main aim of our study is not to improve the accuracy of any existing tool, CogNet outperforms a similar approach called maTE while performing comparably to other similar tools, including SVM-RCE. CogNet was tested on 13 gene expression datasets concerning a variety of diseases.
