Evolutionary Local Search Algorithm for the biclustering of gene expression data based on biological knowledge

Abstract: In network and systems medicine, active module identification methods (AMIMs) are widely used for discovering candidate molecular disease mechanisms. To this end, AMIMs combine network analysis algorithms with molecular profiling data, most commonly by projecting gene expression data onto generic protein–protein interaction (PPI) networks. Although active module identification has led to various novel insights into complex diseases, there is increasing awareness in the field that the combination of gene expression data and PPI networks is problematic because up-to-date PPI networks have a very small diameter and are subject to both technical and literature bias. In this paper, we report the results of an extensive study where we analyzed for the first time whether widely used AMIMs really benefit from using PPI networks. Our results clearly show that, except for the recently proposed AMIM DOMINO, the tested AMIMs do not produce biologically more meaningful candidate disease modules on widely used PPI networks than on random networks with the same node degrees. AMIMs hence mainly learn from the node degrees and mostly fail to exploit the biological knowledge encoded in the edges of the PPI networks. This has far-reaching consequences for the field of active module identification. In particular, we suggest that novel algorithms are needed which overcome the degree bias of most existing AMIMs and/or work with customized, context-specific networks instead of generic PPI networks.
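The comparison against "random networks with the same node degrees" can be illustrated with a minimal sketch. The abstract does not say how the random networks were generated; the double edge swap below is one common way to do it and is an assumption here, as are the function names and the toy network.

```python
import random
from collections import Counter


def degrees(edges):
    """Count how many edges touch each node."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph by double edge swaps.

    Each accepted swap turns edges (a, b), (c, d) into (a, d), (c, b),
    so every node keeps its degree. Swaps that would create a self-loop
    or a duplicate edge are rejected. Hypothetical helper, not the
    procedure used in the study.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    if len(edges) < 2:
        return edges
    edge_set = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # give up eventually on dense graphs
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:  # pick one of the two possible pairings
            c, d = d, c
        if a == d or c == b:  # would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:  # would duplicate an edge
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


# Toy network (made up): a ring of 8 nodes plus two chords.
edges = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (2, 6)]
rewired = degree_preserving_rewire(edges, n_swaps=20)
# Node degrees are unchanged by construction; only the wiring differs.
print(degrees(rewired) == degrees(edges))
```

Running an AMIM on both `edges` and `rewired` and comparing the resulting modules is the kind of control the abstract describes: if the results are equally meaningful, the method is learning from degrees rather than from the specific edges.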


ENTROPY-BASED CLUSTER VALIDATION AND ESTIMATION OF THE NUMBER OF CLUSTERS IN GENE EXPRESSION DATA

Journal of Bioinformatics and Computational Biology, 2012, Vol 10 (05), pp. 1250011. DOI: 10.1142/s0219720012500114

Author(s): Natalia Novoselova, Igor Tom

Keyword(s): Gene Expression, Gene Expression Data, Clustering Algorithm, Selection Procedure, Biological Knowledge, Consensus Clustering, Expression Data, Cluster Validation, Number Of Clusters, Validity Measure

Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not consider the stability of the groupings produced by a clustering algorithm. Building on an approach that assesses the predictive power or stability of a partitioning, we propose a new cluster validation measure and a selection procedure to determine a suitable number of clusters. The validity measure is based on estimating the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme (consensus clustering). According to the proposed selection procedure, a stable clustering result is determined with reference to the validity measure under the null hypothesis that encodes the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to evaluate the clustering results on several datasets. The proposed procedure produced accurate and robust estimates of the number of clusters, which agree with biological knowledge and gold standards of cluster quality.
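A minimal sketch of a consensus matrix built from resampled clustering runs may help make the idea concrete. The abstract does not define its "clearness" measure, so the `clearness` function below is a hypothetical proxy (how close the entries are to 0 or 1), not the authors' measure; the function names and toy runs are likewise assumptions.

```python
from itertools import combinations


def consensus_matrix(runs, n_items):
    """Build a consensus matrix from resampled clustering runs.

    `runs` is a list of dicts mapping item index -> cluster label for the
    items drawn in that run. Entry (i, j) is the fraction of runs that
    contain both i and j in which the two share a cluster. The diagonal
    and pairs that were never co-sampled are left at 0.0.
    """
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for labels in runs:
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
         for j in range(n_items)]
        for i in range(n_items)
    ]


def clearness(matrix):
    """Hypothetical proxy: mean of |2m - 1| over all off-diagonal pairs.

    Equals 1.0 when every pair is always together or always apart and
    drops toward 0.0 as entries approach 0.5 (unstable assignments).
    """
    n = len(matrix)
    values = [abs(2 * matrix[i][j] - 1) for i, j in combinations(range(n), 2)]
    return sum(values) / len(values)


# Toy runs (made up). Labels may be permuted between runs; only
# co-membership matters.
stable_runs = [
    {0: 0, 1: 0, 2: 1, 3: 1},
    {0: 1, 1: 1, 2: 0, 3: 0},
    {0: 0, 1: 0, 3: 1},  # item 2 not drawn in this resample
]
unstable_runs = [
    {0: 0, 1: 0, 2: 1, 3: 1},
    {0: 0, 1: 1, 2: 0, 3: 1},
]
print(clearness(consensus_matrix(stable_runs, 4)))
print(clearness(consensus_matrix(unstable_runs, 4)))
```

In the spirit of the selection procedure described above, such a score would be computed for each candidate number of clusters on both the original and the permuted data, and the number of clusters chosen where the two curves are furthest apart.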


Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures

BioMed Research International, 2019, Vol 2019, pp. 1-12. DOI: 10.1155/2019/2497509. Cited By ~ 1

Author(s): Suyan Tian, Chi Wang, Bing Wang

Keyword(s): Gene Expression, Feature Selection, Gene Expression Data, Gene Selection, Selection Process, Biological Knowledge, Expression Data, Selection Methods, Its Gene, Active Research

To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. A method that takes the biological knowledge contained within these pathways into account is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart, in which no biological knowledge is considered. In this article, pathway-based feature selection methods are first divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. Regarding bilevel selection methods as a special case of pathway-guided gene selection, we discuss pathway-guided gene selection methods in detail, along with the importance of penalization in such methods. Last, we point out potential uses of pathway-guided gene selection in one active research avenue, namely, the analysis of longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.


Gene Expression Data Clustering Using Variance-based Harmony Search Algorithm

IETE Journal of Research, 2018, Vol 65 (5), pp. 641-652. DOI: 10.1080/03772063.2018.1452641. Cited By ~ 3

Author(s): Vijay Kumar, Dinesh Kumar

Keyword(s): Gene Expression, Gene Expression Data, Data Clustering, Search Algorithm, Harmony Search, Harmony Search Algorithm, Expression Data, Gene Expression Data Clustering


Incorporating Biological Knowledge into Density-Based Clustering Analysis of Gene Expression Data

2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 2009. DOI: 10.1109/fskd.2009.191. Cited By ~ 1

Author(s): Sun Hang, Zhou You, Liang Yan Chun

Keyword(s): Gene Expression, Gene Expression Data, Clustering Analysis, Biological Knowledge, Expression Data, Density Based Clustering


Pathway based factor analysis of gene expression data produces highly heritable phenotypes that associate with age

2015. DOI: 10.1101/016154. Cited By ~ 1

Author(s): Andrew Anand Brown, Zhihao Ding, Ana Viñuela, Dan Glass, Leopold Parts, ...

Keyword(s): Gene Expression, Factor Analysis, Gene Expression Data, Biological Knowledge, Expression Data, Expression Levels, Biologically Relevant, Kegg Pathways, Analysis Methods, Gene Expression Levels

Statistical factor analysis methods have previously been used to remove noise components from high-dimensional data prior to genetic association mapping, and in a guided fashion to summarise biologically relevant sources of variation. Here we show how the derived factors summarising pathway expression can be used to analyse the relationships between expression, heritability and ageing. We used skin gene expression data from 647 twins from the MuTHER Consortium and applied factor analysis to concisely summarise patterns of gene expression, both to remove broad confounding influences and to produce concise pathway-level phenotypes. We derived 930 "pathway phenotypes" which summarised patterns of variation across 186 KEGG pathways (five phenotypes per pathway). We identified 69 significant associations of age with phenotype from 57 distinct KEGG pathways at a stringent Bonferroni threshold (P<5.38E-5). These phenotypes are more heritable (h^2=0.32) than gene expression levels. On average, expression levels of 16% of genes within these pathways are associated with age. Several significant pathways relate to metabolising sugars and fatty acids, and others to insulin signalling. We have demonstrated that factor analysis methods combined with biological knowledge can produce more reliable phenotypes with less stochastic noise than the individual gene expression levels, which increases our power to discover biologically relevant associations. These phenotypes could also be applied to discover associations with other environmental factors.
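The reported threshold can be checked with a short calculation. The family-wise significance level of 0.05 is an assumption here, since the abstract does not state it; the pathway and phenotype counts come from the abstract.

```python
# Assumption: family-wise significance level of 0.05 (not stated in the abstract).
ALPHA = 0.05
N_PATHWAYS = 186            # KEGG pathways, from the abstract
PHENOTYPES_PER_PATHWAY = 5  # phenotypes per pathway, from the abstract


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test P-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_tests = N_PATHWAYS * PHENOTYPES_PER_PATHWAY
threshold = bonferroni_threshold(ALPHA, n_tests)
print(n_tests, f"{threshold:.2E}")  # 930 5.38E-05
```

Under that assumption, 0.05 divided by the 930 pathway phenotypes reproduces the P<5.38E-5 cutoff quoted above.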


A Joint Optimization Framework Integrated with Biological Knowledge for Clustering Incomplete Gene Expression Data

2021. DOI: 10.21203/rs.3.rs-1087790/v1

Author(s): Dan Li, Hong Gu, Qiaozhen Chang, Jia Wang, Pan Qin

Keyword(s): Gene Expression, Gene Expression Data, Missing Values, Clustering Algorithms, Joint Optimization, Gene Clustering, Biological Knowledge, Data Sets, Expression Data, Optimization Framework

Abstract: Clustering algorithms have been successfully applied to identify co-expressed gene groups from gene expression data. Missing values often occur in gene expression data, which presents a challenge for gene clustering. When partitioning incomplete gene expression data into co-expressed gene groups, missing value imputation and clustering are generally performed as two separate processes. These two-stage methods are likely to produce imputed values that are unsuitable for the clustering task and, as a result, unsatisfactory clustering performance. This paper proposes a multi-objective joint optimization framework for clustering incomplete gene expression data that addresses this problem. The proposed framework imputes the missing expression values under the guidance of clustering, so that imputation and clustering improve each other. In addition, gene expression similarity and gene semantic similarity extracted from the Gene Ontology are combined, in the form of a functional neighbor interval for each missing expression value, to provide reasonable constraints for the joint optimization framework. Experiments on several benchmark data sets confirm the effectiveness of the proposed framework.


Smoothing Gene Expression Data with Network Information Improves Consistency of Regulated Genes

Statistical Applications in Genetics and Molecular Biology, 2011, Vol 10 (1). DOI: 10.2202/1544-6115.1618. Cited By ~ 6

Author(s): Guro Dørum, Lars Snipen, Margrete Solheim, Solve Saebo

Keyword(s): Gene Expression, Gene Expression Data, Gene Networks, Simulated Data, Real Data, Biological Knowledge, Expression Data, Data Set, Gene Set, Network Information

Gene set analysis methods have become a widely used tool for including prior biological knowledge in the statistical analysis of gene expression data. Advantages of these methods include increased sensitivity, easier interpretation and more conformity in the results. However, gene set methods do not employ all the available information about gene relations. Genes are arranged in complex networks where the network distances contain detailed information about inter-gene dependencies. We propose a method that uses gene networks to smooth gene expression data with the aim of reducing the number of false positives and identifying important subnetworks. Gene dependencies are extracted from the network topology and are used to smooth genewise test statistics. To find the optimal degree of smoothing, we propose using a criterion that considers the correlation between the network and the data. The network smoothing is shown to improve the ability to identify important genes in simulated data. Applied to a real data set, the smoothing accentuates parts of the network with a high density of differentially expressed genes.
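To illustrate what smoothing genewise test statistics over a network can look like, here is a minimal sketch. The abstract does not give the smoothing formula, so the neighbor-averaging rule, the `alpha` parameter, and the toy gene names below are assumptions for illustration only, not the authors' method.

```python
def smooth_statistics(stats, neighbors, alpha):
    """Blend each gene's statistic with the mean over its network neighbors.

    stats:     dict gene -> test statistic
    neighbors: dict gene -> iterable of adjacent genes
    alpha:     degree of smoothing in [0, 1]; 0 leaves the data unchanged.
    Genes with no neighbors in `stats` keep their original value.
    Hypothetical rule, assumed for illustration.
    """
    smoothed = {}
    for gene, value in stats.items():
        nbr_values = [stats[n] for n in neighbors.get(gene, ()) if n in stats]
        if nbr_values:
            nbr_mean = sum(nbr_values) / len(nbr_values)
            smoothed[gene] = (1 - alpha) * value + alpha * nbr_mean
        else:
            smoothed[gene] = value
    return smoothed


# Toy data (made up): g1 is connected to g2 and g3; g4 is isolated.
stats = {"g1": 4.0, "g2": 0.0, "g3": 2.0, "g4": 3.0}
neighbors = {"g1": ["g2", "g3"], "g2": ["g1"], "g3": ["g1"]}
smoothed = smooth_statistics(stats, neighbors, alpha=0.5)
print(smoothed)
```

In this sketch, `alpha` plays the role of the degree of smoothing that the abstract proposes to tune with a criterion based on the correlation between the network and the data; how that criterion is computed is not specified there.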


A New Hybrid Cuckoo Search Algorithm for Biclustering of Microarray Gene-Expression Data

Applied Artificial Intelligence, 2018, Vol 32 (7-8), pp. 644-659. DOI: 10.1080/08839514.2018.1501918. Cited By ~ 1

Author(s): R. Balamurugan, A.M. Natarajan, K. Premalatha

Keyword(s): Gene Expression, Gene Expression Data, Search Algorithm, Cuckoo Search, Microarray Gene Expression Data, Cuckoo Search Algorithm, Expression Data, Microarray Gene Expression, Microarray Gene
