Comprehensive analysis integrating both clinicopathological and gene expression data in more than 1,500 samples: Proliferation captured by gene expression grade index appears to be the strongest prognostic factor in breast cancer (BC)

507 Background: Although the development of high-throughput gene expression technologies has allowed the identification of several “molecular signatures” predicting clinical outcome, no attempt has yet been made to perform a comprehensive analysis integrating both clinicopathological and gene expression data. Here, we aim to elucidate the relationship of clinical parameters and tumor markers with gene expression patterns, and their interaction with prognosis. Methods: We analyzed gene expression and clinical data from several published studies, including more than 1500 BC patients. We developed several gene expression indices associated with different biological stages of disease characterized by the expression of hormone receptors, HER2 amplification, p53 mutation, angiogenesis, tumor invasion and proliferation. Multivariable analyses were used to characterize the dependency patterns between these indices and their impact on survival. Results: Estrogen receptor (ER) and HER2 indices were the most prominent discriminators, dichotomizing tumor samples into two main subsets in agreement with the previously proposed BC subtypes. Tumor proliferation, assessed by our previously reported gene expression grade index (GGI), was the factor most strongly associated with prognosis (HR 2.29, CI 1.88–2.78, p<0.0001). Almost all ER- and HER2+ tumors were associated with high GGI scores. In contrast, ER+ and HER2- tumors showed a whole range of GGI values. Within the high proliferation subset, ER- and HER2+ indices did not have any prognostic value. Similar results were found for the p53 mutation index. Nodal status and tumor size, which essentially measure the duration of disease, retained prognostic value in addition to proliferation. Conclusions: Proliferation captured by the GGI appears to be a key biological factor, downstream of ER, HER2 and p53. Although understanding the upstream factors is important for advancing biological knowledge and therapeutic interventions, GGI seems to be the most important factor predicting clinical outcome in BC and deserves consideration as a stratification factor in clinical trials. No significant financial relationships to disclose.
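As a rough consistency check on the reported hazard ratio, the standard error of the log hazard ratio can be recovered from the confidence interval. The sketch below assumes the interval is a 95% Wald interval that is symmetric on the log scale; the abstract does not state the confidence level, so this is an assumption, and the function name is illustrative only.

```python
import math

def z_from_hr_ci(hr, lower, upper, z_crit=1.96):
    """Approximate Wald z statistic and two-sided p-value from an HR and its CI.

    Assumes a 95% Wald interval symmetric on the log scale (z_crit = 1.96);
    this confidence level is an assumption, not stated in the abstract.
    """
    # The CI half-width on the log scale equals z_crit * SE(log HR).
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p

se, z, p = z_from_hr_ci(2.29, 1.88, 2.78)
print(f"SE(log HR) = {se:.3f}, z = {z:.2f}, p = {p:.1e}")
```

Under that assumption the z statistic is well above the value needed for p<0.0001, so the reported interval and p-value are mutually consistent.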


On the limits of active module identification

Briefings in Bioinformatics ◽

10.1093/bib/bbab066 ◽

2021 ◽

Author(s):

Olga Lazareva ◽

Jan Baumbach ◽

Markus List ◽

David B Blumenthal

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Small Diameter ◽

Extensive Study ◽

Biological Knowledge ◽

Expression Data ◽

Module Identification ◽

Ppi Networks ◽

Novel Algorithms ◽

Context Specific

Abstract In network and systems medicine, active module identification methods (AMIMs) are widely used for discovering candidate molecular disease mechanisms. To this end, AMIMs combine network analysis algorithms with molecular profiling data, most commonly, by projecting gene expression data onto generic protein–protein interaction (PPI) networks. Although active module identification has led to various novel insights into complex diseases, there is increasing awareness in the field that the combination of gene expression data and PPI network is problematic because up-to-date PPI networks have a very small diameter and are subject to both technical and literature bias. In this paper, we report the results of an extensive study where we analyzed for the first time whether widely used AMIMs really benefit from using PPI networks. Our results clearly show that, except for the recently proposed AMIM DOMINO, the tested AMIMs do not produce biologically more meaningful candidate disease modules on widely used PPI networks than on random networks with the same node degrees. AMIMs hence mainly learn from the node degrees and mostly fail to exploit the biological knowledge encoded in the edges of the PPI networks. This has far-reaching consequences for the field of active module identification. In particular, we suggest that novel algorithms are needed which overcome the degree bias of most existing AMIMs and/or work with customized, context-specific networks instead of generic PPI networks.
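The baseline described above, a random network with the same node degrees, can be illustrated with degree-preserving edge swaps. The following is a minimal sketch under stated assumptions: the input is an undirected simple graph given as an edge list, the toy network and function names are hypothetical, and this is not necessarily the randomization procedure used in the study.

```python
import random

def degree_preserving_rewire(edges, n_swaps, seed=0, max_tries=None):
    """Randomize an undirected simple graph while keeping every node degree fixed.

    Repeatedly picks two edges (a, b) and (c, d) and replaces them with
    (a, d) and (c, b), skipping swaps that would break simplicity.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    max_tries = max_tries if max_tries is not None else 100 * n_swaps
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if len({a, b, c, d}) < 4:
            continue  # conservative: skip edge pairs that share a node
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # swap would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

def degrees(edges):
    """Count how many edges touch each node."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

# Hypothetical toy network standing in for a PPI network.
toy_ppi = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
           ("E", "F"), ("F", "A"), ("A", "D")]
randomized = degree_preserving_rewire(toy_ppi, n_swaps=20)
print(degrees(randomized) == degrees(toy_ppi))
```

Each accepted swap removes one edge and adds one edge at each of the four nodes involved, so node degrees are unchanged while the specific interactions are scrambled.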


Evolutionary Local Search Algorithm for the biclustering of gene expression data based on biological knowledge

Applied Soft Computing ◽

10.1016/j.asoc.2021.107177 ◽

2021 ◽

Vol 104 ◽

pp. 107177

Author(s):

Ons Maâtouk ◽

Wassim Ayadi ◽

Hend Bouziri ◽

Béatrice Duval

Keyword(s):

Gene Expression ◽

Local Search ◽

Gene Expression Data ◽

Search Algorithm ◽

Biological Knowledge ◽

Expression Data ◽

Local Search Algorithm


ENTROPY-BASED CLUSTER VALIDATION AND ESTIMATION OF THE NUMBER OF CLUSTERS IN GENE EXPRESSION DATA

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720012500114 ◽

2012 ◽

Vol 10 (05) ◽

pp. 1250011

Author(s):

NATALIA NOVOSELOVA ◽

IGOR TOM

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Algorithm ◽

Selection Procedure ◽

Biological Knowledge ◽

Consensus Clustering ◽

Expression Data ◽

Cluster Validation ◽

Number Of Clusters ◽

Validity Measure

Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not consider the stability of the groupings produced by a clustering algorithm. Based on an approach that assesses the predictive power or stability of a partitioning, we propose a new cluster validation measure and a selection procedure to determine a suitable number of clusters. The validity measure is based on an estimate of the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme, or consensus clustering. In the proposed selection procedure, a stable clustering result is determined with reference to the validity measure under the null hypothesis that encodes the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to estimate the clustering results on several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, in agreement with biological knowledge and gold standards of cluster quality.
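To make the idea of a consensus matrix concrete, the sketch below builds one from a handful of resampled clustering runs and scores how close its entries are to 0 or 1. The abstract does not define its "clearness" measure, so the `crispness` score here is a hypothetical stand-in chosen for illustration, and the gene labels and runs are made up.

```python
from itertools import combinations

def consensus_matrix(items, runs):
    """Pairwise co-clustering frequencies over resampled clustering runs.

    `runs` is a list of dicts mapping each sampled item to its cluster label.
    Entry (i, j) is the fraction of runs containing both i and j in which
    they received the same label; pairs never sampled together are omitted.
    """
    together = {}
    sampled = {}
    for labels in runs:
        present = [x for x in items if x in labels]
        for i, j in combinations(present, 2):
            sampled[(i, j)] = sampled.get((i, j), 0) + 1
            if labels[i] == labels[j]:
                together[(i, j)] = together.get((i, j), 0) + 1
    return {pair: together.get(pair, 0) / n for pair, n in sampled.items()}

def crispness(consensus):
    """Illustrative score in [0, 1]: equals 1 when every entry is 0 or 1.

    Hypothetical stand-in; NOT the validity measure defined in the paper.
    """
    values = list(consensus.values())
    return sum((2 * m - 1) ** 2 for m in values) / len(values)

genes = ["g1", "g2", "g3", "g4"]
stable_runs = [
    {"g1": 0, "g2": 0, "g3": 1, "g4": 1},
    {"g1": 1, "g2": 1, "g3": 0},  # g4 left out of this subsample
    {"g1": 0, "g2": 0, "g3": 1, "g4": 1},
]
unstable_runs = [
    {"g1": 0, "g2": 0, "g3": 1, "g4": 1},
    {"g1": 0, "g2": 1, "g3": 0, "g4": 1},
]
stable_score = crispness(consensus_matrix(genes, stable_runs))
unstable_score = crispness(consensus_matrix(genes, unstable_runs))
print(stable_score, unstable_score)
```

A partition that is reproduced across resamples gives a consensus matrix of zeros and ones and a high score, while groupings that change between runs pull entries toward 0.5 and lower it. Comparing such scores against those from permuted data, as the abstract describes, is not shown here.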


Building Gene Networks by Analyzing Gene Expression Profiles

Advanced Methodologies and Technologies in Medicine and Healthcare - Advances in Medical Diagnosis, Treatment, and Care ◽

10.4018/978-1-5225-7489-7.ch003 ◽

2019 ◽

pp. 27-44

Author(s):

Crescenzio Gallo

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Networks ◽

Dna Microarrays ◽

Expression Profiles ◽

Expression Patterns ◽

Gene Expression Profiles ◽

Expression Data ◽

Gene Expressions ◽

Over Time

The possible applications of modeling and simulation in the field of bioinformatics are very extensive, ranging from understanding basic metabolic pathways to exploring genetic variability. Experiments carried out with DNA microarrays allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. In this chapter, the authors examine various methods for analyzing gene expression data, addressing the important topics of (1) selecting the most differentially expressed genes, (2) grouping them by means of their relationships, and (3) classifying samples based on gene expressions.


Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures

BioMed Research International ◽

10.1155/2019/2497509 ◽

2019 ◽

Vol 2019 ◽

pp. 1-12 ◽

Cited By ~ 1

Author(s):

Suyan Tian ◽

Chi Wang ◽

Bing Wang

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Gene Selection ◽

Selection Process ◽

Biological Knowledge ◽

Expression Data ◽

Selection Methods ◽

Its Gene ◽

Active Research

To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart in which no biological knowledge is considered. In this article, pathway-based feature selection is first divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. Regarding bilevel selection methods as a special case of the pathway-guided gene selection process, we discuss pathway-guided gene selection methods in detail and the importance of penalization in such methods. Last, we point out the potential uses of pathway-guided gene selection in one active research avenue, namely, the analysis of longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.


CURVE-BASED CLUSTERING OF TIME COURSE GENE EXPRESSION DATA USING SELF-ORGANIZING MAPS

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720009004291 ◽

2009 ◽

Vol 07 (04) ◽

pp. 645-661 ◽

Cited By ~ 11

Author(s):

XIN CHEN

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Regulatory Networks ◽

Time Course ◽

Clustering Algorithm ◽

Expression Patterns ◽

Self Organizing Map ◽

Expression Data ◽

Wide Range ◽

Self Organizing

There is increasing interest in clustering time course gene expression data to investigate a wide range of biological processes. However, developing a clustering algorithm ideal for time course gene expression data is still challenging. As timing is an important factor in defining true clusters, a clustering algorithm should explore expression correlations between time points in order to achieve high clustering accuracy. Moreover, inter-cluster gene relationships are often desired in order to facilitate the computational inference of biological pathways and regulatory networks. In this paper, a new clustering algorithm called CurveSOM is developed to offer both of the features above. It first represents each gene by a cubic smoothing spline fitted to the time course expression profile, and then groups genes into clusters by applying self-organizing map-based clustering to the resulting splines. CurveSOM has been tested on three well-studied yeast cell cycle datasets and compared with four popular programs: Cluster 3.0, GENECLUSTER, MCLUST, and SSClust. The results show that CurveSOM is a very promising tool for the exploratory analysis of time course expression data, as it is able not only to group genes into clusters with high accuracy but also to find true time-shifted correlations of expression patterns across clusters.


Incorporating Biological Knowledge into Density-Based Clustering Analysis of Gene Expression Data

2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery ◽

10.1109/fskd.2009.191 ◽

2009 ◽

Cited By ~ 1

Author(s):

Sun Hang ◽

Zhou You ◽

Liang Yan Chun

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Clustering Analysis ◽

Biological Knowledge ◽

Expression Data ◽

Density Based Clustering


Pathway based factor analysis of gene expression data produces highly heritable phenotypes that associate with age

10.1101/016154 ◽

2015 ◽

Cited By ~ 1

Author(s):

Andrew Anand Brown ◽

Zhihao Ding ◽

Ana Viñuela ◽

Dan Glass ◽

Leopold Parts ◽

...

Keyword(s):

Gene Expression ◽

Factor Analysis ◽

Gene Expression Data ◽

Biological Knowledge ◽

Expression Data ◽

Expression Levels ◽

Biologically Relevant ◽

Kegg Pathways ◽

Analysis Methods ◽

Gene Expression Levels

Statistical factor analysis methods have previously been used to remove noise components from high dimensional data prior to genetic association mapping, and in a guided fashion to summarise biologically relevant sources of variation. Here we show how the derived factors summarising pathway expression can be used to analyse the relationships between expression, heritability and ageing. We used skin gene expression data from 647 twins from the MuTHER Consortium and applied factor analysis to concisely summarise patterns of gene expression, both to remove broad confounding influences and to produce concise pathway-level phenotypes. We derived 930 "pathway phenotypes" which summarised patterns of variation across 186 KEGG pathways (five phenotypes per pathway). We identified 69 significant associations of age with phenotype from 57 distinct KEGG pathways at a stringent Bonferroni threshold (P<5.38E-5). These phenotypes are more heritable (h^2=0.32) than gene expression levels. On average, expression levels of 16% of genes within these pathways are associated with age. Several significant pathways relate to metabolising sugars and fatty acids, others with insulin signalling. We have demonstrated that factor analysis methods combined with biological knowledge can produce more reliable phenotypes with less stochastic noise than the individual gene expression levels, which increases our power to discover biologically relevant associations. These phenotypes could also be applied to discover associations with other environmental factors.
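The counts quoted above fit together arithmetically: 186 pathways with five phenotypes each give 930 tests, and a Bonferroni correction divides the family-wise error rate by that number. The short check below assumes a family-wise rate of 0.05, which the abstract does not state explicitly.

```python
n_pathways = 186
phenotypes_per_pathway = 5
alpha = 0.05  # assumed family-wise error rate; not stated in the abstract

# Bonferroni: per-test threshold = alpha / number of tests.
n_tests = n_pathways * phenotypes_per_pathway
threshold = alpha / n_tests
print(n_tests, f"{threshold:.2e}")
```

With these inputs the per-test threshold rounds to the reported P<5.38E-5.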


Gene Expression Studies to Identify Significant Genes in AR, MTOR, MAPK Pathways and their Overlapping Regulatory Role in Prostate Cancer

Journal of Integrative Bioinformatics ◽

10.1515/jib-2018-0080 ◽

2019 ◽

Vol 16 (3) ◽

Author(s):

Nimisha Asati ◽

Abhinav Mishra ◽

Ankita Shukla ◽

Tiratha Raj Singh

Keyword(s):

Gene Expression ◽

Prostate Cancer ◽

Gene Expression Data ◽

Meta Analysis ◽

Expression Patterns ◽

Mitogen Activated Protein Kinase ◽

Expression Data ◽

Mapk Pathways ◽

Expression Studies ◽

Gene Expression Studies

Abstract Gene expression studies have revealed a large degree of variability in gene expression patterns, particularly in tissues, even in genetically identical individuals. Such studies help to reveal the components that fluctuate most during the disease condition. With the advent of gene expression studies, many microarray studies have been conducted in prostate cancer, but the results have varied across studies. To better understand the genetic and biological regulatory mechanisms of prostate cancer, we conducted a meta-analysis of three major pathways, i.e. androgen receptor (AR), mechanistic target of rapamycin (mTOR) and mitogen-activated protein kinase (MAPK), in prostate cancer. The meta-analysis was performed on human prostate cancer gene expression data. Twelve datasets covering the AR, mTOR, and MAPK pathways were taken for analysis, from which thirteen potential biomarkers were identified through meta-analysis. These findings were compiled from quantitative data analysis using different tools. In addition, various interconnections were found among the pathways under study. Our study suggests that microarray analysis of gene expression data and their pathway-level connections allows detection of potential predictors that may prove to be putative therapeutic targets with biological and functional significance in the progression of prostate cancer.


A Joint Optimization Framework Integrated with Biological Knowledge for Clustering Incomplete Gene Expression Data

10.21203/rs.3.rs-1087790/v1 ◽

2021 ◽

Author(s):

Dan Li ◽

Hong Gu ◽

Qiaozhen Chang ◽

Jia Wang ◽

Pan Qin

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Missing Values ◽

Clustering Algorithms ◽

Joint Optimization ◽

Gene Clustering ◽

Biological Knowledge ◽

Data Sets ◽

Expression Data ◽

Optimization Framework

Abstract Clustering algorithms have been successfully applied to identify co-expressed gene groups from gene expression data. Missing values often occur in gene expression data, which presents a challenge for gene clustering. When partitioning incomplete gene expression data into co-expressed gene groups, missing value imputation and clustering are generally performed as two separate processes. These two-stage methods are likely to result in imputed values that are unsuitable for the clustering task and in unsatisfactory clustering performance. This paper proposes a multi-objective joint optimization framework for clustering incomplete gene expression data that addresses this problem. The proposed framework can impute the missing expression values under the guidance of clustering, and thereby achieve a synergistic improvement of imputation and clustering. In addition, gene expression similarity and gene semantic similarity extracted from the Gene Ontology are combined, in the form of a functional neighbor interval for each missing expression value, to provide reasonable constraints for the joint optimization framework. Experiments on several benchmark data sets confirm the effectiveness of the proposed framework.
