On the robustness of graph-based clustering to random network alterations

Biological functions emerge from complex and dynamic networks of protein-protein interactions. Because these protein-protein interaction networks, or interactomes, represent pairwise connections within a hierarchically organized system, it is often useful to identify higher-order associations embedded within them, such as multi-member protein complexes. Graph-based clustering techniques are widely used to accomplish this goal, and dozens of field-specific and general clustering algorithms exist. However, interactomes can be prone to errors, especially interactomes that infer interactions using high-throughput biochemical assays. Therefore, robustness to network-level variability is an important criterion for any clustering algorithm that aims to generate robust, reproducible clusters. Here, we tested the robustness of a range of graph-based clustering algorithms in the presence of network-level noise, including algorithms common across domains and those specific to protein networks. We found that the results of all clustering algorithms measured were profoundly sensitive to injected network noise. Randomly rewiring 1% of network edges yielded up to a 57% change in clustering results, indicating that clustering markedly amplified network-level noise. However, the impact of network noise on individual clusters was not uniform. We found that some clusters were consistently robust to injected network noise while others were not. Therefore, we developed the clust.perturb R package and Shiny web application, which measure the reproducibility of clusters by randomly perturbing the network. We show that clust.perturb results are predictive of real-world cluster stability: poorly reproducible clusters as identified by clust.perturb are significantly less likely to be reclustered across experiments.
We conclude that quantifying the robustness of a cluster to network noise, as implemented in clust.perturb, provides a powerful tool for ranking the reproducibility of clusters and separating stable protein complexes from spurious associations.


On the robustness of graph-based clustering to random network alterations

Molecular & Cellular Proteomics ◽

10.1074/mcp.ra120.002275 ◽

2020 ◽

pp. mcp.RA120.002275

Author(s):

R. Greg Stacey ◽

Michael A. Skinnider ◽

Leonard J. Foster

Keyword(s):

Protein Interaction ◽

Web Application ◽

Clustering Algorithm ◽

Protein Complexes ◽

Random Network ◽

Clustering Algorithms ◽

Protein Interaction Networks ◽

Interaction Networks ◽

The Impact ◽

Graph Based Clustering

Biological functions emerge from complex and dynamic networks of protein-protein interactions. Because these protein-protein interaction networks, or interactomes, represent pairwise connections within a hierarchically organized system, it is often useful to identify higher-order associations embedded within them, such as multi-member protein complexes. Graph-based clustering techniques are widely used to accomplish this goal, and dozens of field-specific and general clustering algorithms exist. However, interactomes can be prone to errors, especially when inferred from high-throughput biochemical assays. Therefore, robustness to network-level noise is an important criterion for any clustering algorithm that aims to generate robust, reproducible clusters. Here, we tested the robustness of a range of graph-based clustering algorithms in the presence of noise, including algorithms common across domains and those specific to protein networks. Strikingly, we found that all of the clustering algorithms tested here markedly amplified noise within the underlying protein interaction network. Randomly rewiring only 1% of network edges yielded more than a 50% change in clustering results, indicating that clustering markedly amplified network-level noise. Moreover, we found the impact of network noise on individual clusters was not uniform: some clusters were consistently robust to injected noise while others were not. To assist in assessing this, we developed the clust.perturb R package and Shiny web application to measure the reproducibility of clusters by randomly perturbing the network. We show that clust.perturb results are predictive of real-world cluster stability: poorly reproducible clusters as identified by clust.perturb are significantly less likely to be reclustered across experiments. 
We conclude that graph-based clustering amplifies noise in protein interaction networks, but quantifying the robustness of a cluster to network noise can separate stable protein complexes from spurious associations.
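The perturb-and-recluster idea described in this abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the clust.perturb implementation or its API: the function names (`rewire`, `reproducibility`), the use of connected components as a stand-in clustering algorithm, and the choice of the best-match Jaccard index as the reproducibility score are all assumptions made for the example.

```python
import random
from itertools import combinations


def connected_components(nodes, edges):
    """Stand-in clusterer (an assumption): each connected component is a cluster."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            comp.add(n)
            for m in adj[n] - seen:
                seen.add(m)
                stack.append(m)
        clusters.append(frozenset(comp))
    return clusters


def rewire(nodes, edges, fraction, rng):
    """Replace a fraction of edges with random node pairs that were not edges before."""
    nodes = list(nodes)
    original = {tuple(sorted(e)) for e in edges}
    current = set(original)
    n_swap = round(fraction * len(original))
    for old in rng.sample(sorted(original), n_swap):
        current.remove(old)
        while True:  # assumes the graph is sparse enough that a free pair exists
            new = tuple(sorted(rng.sample(nodes, 2)))
            if new not in current and new not in original:
                current.add(new)
                break
    return sorted(current)


def jaccard(a, b):
    return len(a & b) / len(a | b)


def reproducibility(nodes, edges, cluster_fn, fraction=0.01, n_iter=50, seed=0):
    """Average best-match Jaccard of each original cluster across noisy reclusterings."""
    rng = random.Random(seed)
    original = cluster_fn(nodes, edges)
    scores = {c: 0.0 for c in original}
    for _ in range(n_iter):
        noisy = cluster_fn(nodes, rewire(nodes, edges, fraction, rng))
        for c in original:
            scores[c] += max(jaccard(c, d) for d in noisy) / n_iter
    return scores


if __name__ == "__main__":
    # Toy network: two 4-node cliques with no edges between them.
    nodes = list(range(8))
    edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2))
    for cluster, score in reproducibility(nodes, edges, connected_components, 0.1).items():
        print(sorted(cluster), round(score, 2))
```

A score near 1 means the cluster reappears almost unchanged after rewiring; a low score marks a cluster whose membership depends on a handful of edges. Any clustering function with the same `(nodes, edges)` signature could be substituted for the stand-in.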


Efficient weighted univariate clustering maps outstanding dysregulated genomic zones in human cancers

Bioinformatics ◽

10.1093/bioinformatics/btaa613 ◽

2020 ◽

Vol 36 (20) ◽

pp. 5027-5036 ◽

Cited By ~ 3

Author(s):

Mingzhou Song ◽

Hua Zhong

Keyword(s):

Clustering Algorithm ◽

Human Cancer ◽

Clustering Algorithms ◽

Search Space ◽

R Package ◽

Supplementary Information ◽

Diagnostic Biomarkers ◽

Cancer Types ◽

Molecular Patterns ◽

Pan Cancer

Motivation: Chromosomal patterning of gene expression in cancer can arise from aneuploidy, genome disorganization or abnormal DNA methylation. To map such patterns, we introduce a weighted univariate clustering algorithm to guarantee linear runtime, optimality and reproducibility. Results: We present the chromosome clustering method, establish its optimality and runtime and evaluate its performance. It uses dynamic programming enhanced with an algorithm to reduce search-space in-place to decrease runtime overhead. Using the method, we delineated outstanding genomic zones in 17 human cancer types. We identified strong continuity in dysregulation polarity—dominance by either up- or downregulated genes in a zone—along chromosomes in all cancer types. Significantly polarized dysregulation zones specific to cancer types are found, offering potential diagnostic biomarkers. Unreported previously, a total of 109 loci with conserved dysregulation polarity across cancer types give insights into pan-cancer mechanisms. Efficient chromosomal clustering opens a window to characterize molecular patterns in cancer genome and beyond. Availability and implementation: Weighted univariate clustering algorithms are implemented within the R package ‘Ckmeans.1d.dp’ (4.0.0 or above), freely available at https://cran.r-project.org/package=Ckmeans.1d.dp. Supplementary information: Supplementary data are available at Bioinformatics online.
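To make the dynamic-programming idea concrete, here is a minimal Python sketch of optimal weighted k-means on one-dimensional data. It is a plain O(k n²) recurrence written for illustration; it does not include the in-place search-space reduction that the abstract credits for the faster runtime, and it is not the Ckmeans.1d.dp API. The function name `weighted_1d_kmeans` and the objective (weighted within-cluster sum of squares) are assumptions of the example.

```python
def weighted_1d_kmeans(x, w, k):
    """Optimal weighted k-means on 1-D data by dynamic programming (requires k <= len(x)).

    Returns (labels, cost), where labels[i] is the cluster of x[i] (clusters are
    numbered left to right) and cost is the weighted within-cluster sum of squares.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ws = [w[i] for i in order]
    n = len(xs)

    # Prefix sums of w, w*x and w*x^2 give any segment cost in O(1).
    W, S, Q = [0.0], [0.0], [0.0]
    for xi, wi in zip(xs, ws):
        W.append(W[-1] + wi)
        S.append(S[-1] + wi * xi)
        Q.append(Q[-1] + wi * xi * xi)

    def cost(i, j):
        """Weighted squared error of sorted points i..j-1 about their weighted mean."""
        wt = W[j] - W[i]
        if wt <= 0:
            return 0.0
        s = S[j] - S[i]
        return max(Q[j] - Q[i] - s * s / wt, 0.0)

    inf = float("inf")
    # D[m][j]: best cost of splitting the first j sorted points into m clusters.
    D = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    D[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):  # the last cluster covers sorted points i..j-1
                c = D[m - 1][i] + cost(i, j)
                if c < D[m][j]:
                    D[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers to recover cluster boundaries.
    labels_sorted = [0] * n
    j = n
    for m in range(k, 0, -1):
        i = back[m][j]
        for t in range(i, j):
            labels_sorted[t] = m - 1
        j = i

    labels = [0] * n
    for pos, idx in enumerate(order):
        labels[idx] = labels_sorted[pos]
    return labels, D[k][n]
```

Because points are sorted first, every optimal cluster is a contiguous interval, which is what lets a one-dimensional problem be solved exactly and reproducibly, unlike heuristic k-means with random starts.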


Expression attenuation as a mechanism of robustness against gene duplication

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2014345118 ◽

2021 ◽

Vol 118 (6) ◽

pp. e2014345118

Author(s):

Diana Ascencio ◽

Guillaume Diss ◽

Isabelle Gagnon-Arsenault ◽

Alexandre K. Dubé ◽

Alexander DeLuna ◽

...

Keyword(s):

Gene Duplication ◽

Protein Interactions ◽

Protein Complexes ◽

Phenotypic Diversity ◽

26S Proteasome ◽

Gene Duplications ◽

Protein Protein Interactions ◽

Individual Gene ◽

The Impact

Gene duplication is ubiquitous and a major driver of phenotypic diversity across the tree of life, but its immediate consequences are not fully understood. Deleterious effects would decrease the probability of retention of duplicates and prevent their contribution to long-term evolution. One possible detrimental effect of duplication is the perturbation of the stoichiometry of protein complexes. Here, we measured the fitness effects of the duplication of 899 essential genes in the budding yeast using high-resolution competition assays. At least 10% of genes caused a fitness disadvantage when duplicated. Intriguingly, the duplication of most protein complex subunits had small to nondetectable effects on fitness, with few exceptions. We selected four complexes with subunits that had an impact on fitness when duplicated and measured the impact of individual gene duplications on their protein–protein interactions. We found that very few duplications affect both fitness and interactions. Furthermore, large complexes such as the 26S proteasome are protected from gene duplication by attenuation of protein abundance. Regulatory mechanisms that maintain the stoichiometric balance of protein complexes may protect from the immediate effects of gene duplication. Our results show that a better understanding of protein regulation and assembly in complexes is required for the refinement of current models of gene duplication.


GRAPH BASED CLUSTERING WITH CONSTRAINTS AND ACTIVE LEARNING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/37/1/15773 ◽

2021 ◽

Vol 37 (1) ◽

pp. 71-89

Author(s):

Vu-Tuan Dang ◽

Viet-Vu Vu ◽

Hong-Quan Do ◽

Thi Kieu Oanh Le

Keyword(s):

Active Learning ◽

Clustering Algorithm ◽

Side Information ◽

Clustering Algorithms ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Supervised Clustering ◽

Class Labels ◽

Graph Based Clustering

During the past few years, semi-supervised clustering has emerged as a new and interesting direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information, which is either already available or collected from users. There are two main kinds of side information that can be learned in semi-supervised clustering algorithms: class labels (called seeds) and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature. However, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that can use both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI but also on a document data set used in information extraction from Vietnamese documents. The results show that the proposed algorithm can significantly improve the clustering process compared to some recent algorithms.


Entropy-Based Graph Clustering of PPI Networks for Predicting Overlapping Functional Modules of Proteins

Entropy ◽

10.3390/e23101271 ◽

2021 ◽

Vol 23 (10) ◽

pp. 1271

Author(s):

Hoyeon Jeong ◽

Yoonbee Kim ◽

Yi-Sue Jung ◽

Dae Ryong Kang ◽

Young-Rae Cho

Keyword(s):

Protein Interactions ◽

Protein Complexes ◽

Clustering Algorithms ◽

Graph Clustering ◽

Functional Modules ◽

Protein Protein Interactions ◽

Overlapping Clusters ◽

Novel Proteins ◽

Ppi Networks ◽

Function Modules

Functional modules can be predicted using genome-wide protein–protein interactions (PPIs) from a systematic perspective. Various graph clustering algorithms have been applied to PPI networks for this task. In particular, the detection of overlapping clusters is necessary because a protein is involved in multiple functions under different conditions. Graph entropy (GE) is a novel metric to assess the quality of clusters in a large, complex network. In this study, the unweighted and weighted GE algorithms are evaluated to establish their validity for predicting functional modules. To measure clustering accuracy, the clustering results are compared to protein complexes and Gene Ontology (GO) annotations as references. We demonstrate that the GE algorithm is more accurate in detecting overlapping clusters than other competitive methods. Moreover, we confirm the biological feasibility of the proteins that occur most frequently in the set of identified clusters. Finally, novel proteins for the additional annotation of GO terms are revealed.
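The abstract does not spell out how graph entropy is computed. The Python sketch below uses one commonly cited unweighted formulation, which is an assumption here and may differ from the paper's exact definition: for each vertex, take the fraction of its neighbours inside the candidate cluster, compute the binary entropy of that fraction, and sum over all vertices. The function names are hypothetical.

```python
import math


def vertex_entropy(v, cluster, adj):
    """Binary entropy of the split of v's neighbours into inside/outside the cluster."""
    neighbours = adj[v]
    if not neighbours:
        return 0.0
    p_in = sum(1 for u in neighbours if u in cluster) / len(neighbours)
    p_out = 1.0 - p_in
    return -sum(p * math.log2(p) for p in (p_in, p_out) if p > 0)


def graph_entropy(cluster, adj):
    """Sum of vertex entropies over the whole graph; lower suggests a cleaner cluster."""
    return sum(vertex_entropy(v, cluster, adj) for v in adj)


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
full_triangle = graph_entropy({0, 1, 2}, adj)
partial = graph_entropy({0, 1}, adj)
```

Under this definition a vertex contributes nothing when its neighbours lie entirely inside or entirely outside the cluster, so the complete triangle scores lower (better) than a cluster that leaves one of its members out.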


Using a Genetic Algorithm and Markov Clustering on Protein–Protein Interaction Graphs

International Journal of Systems Biology and Biomedical Technologies ◽

10.4018/ijsbbt.2012040103 ◽

2012 ◽

Vol 1 (2) ◽

pp. 35-47

Author(s):

Charalampos Moschopoulos ◽

Grigorios Beligiannis ◽

Spiridon Likothanassis ◽

Sophia Kossida

Keyword(s):

Genetic Algorithm ◽

Protein Interaction ◽

Clustering Algorithm ◽

Protein Complexes ◽

Clustering Algorithms ◽

Interaction Graphs ◽

Protein Protein Interaction ◽

Protein Complex Detection ◽

Successful Prediction ◽

Markov Clustering

In this paper, a Genetic Algorithm is applied on the filter of the Enhanced Markov Clustering algorithm to optimize the selection of clusters having a high probability to represent protein complexes. The filter was applied on the results (obtained by experiments made on five different yeast datasets) of three different algorithms known for their efficiency on protein complex detection through protein interaction graphs. The results are compared with three popular clustering algorithms, proving the efficiency of the proposed method according to metrics such as successful prediction rate and geometrical accuracy.


Using a Genetic Algorithm and Markov Clustering on Protein–Protein Interaction Graphs

Bioinformatics ◽

10.4018/978-1-4666-3604-0.ch043 ◽

2013 ◽

pp. 805-816

Author(s):

Charalampos Moschopoulos ◽

Grigorios Beligiannis ◽

Spiridon Likothanassis ◽

Sophia Kossida

Keyword(s):

Genetic Algorithm ◽

Protein Interaction ◽

Clustering Algorithm ◽

Protein Complexes ◽

Clustering Algorithms ◽

Interaction Graphs ◽

Protein Protein Interaction ◽

Protein Complex Detection ◽

Markov Clustering ◽

Complex Detection


From the static interactome to dynamic protein complexes: Three challenges

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720015710018 ◽

2015 ◽

Vol 13 (02) ◽

pp. 1571001 ◽

Cited By ~ 14

Author(s):

Chern Han Yong ◽

Limsoon Wong

Keyword(s):

Protein Interactions ◽

Protein Complexes ◽

Clustering Algorithms ◽

Ppi Network ◽

Protein Protein Interaction ◽

Static Interaction ◽

Ppi Networks ◽

Interaction Screening ◽

Discovery Algorithms ◽

Insight Into

Protein interactions and complexes behave in a dynamic fashion, but this dynamism is not captured by interaction screening technologies, and not preserved in protein–protein interaction (PPI) networks. The analysis of static interaction data to derive dynamic protein complexes leads to several challenges, of which we identify three. First, many proteins participate in multiple complexes, leading to overlapping complexes embedded within highly-connected regions of the PPI network. This makes it difficult to accurately delimit the boundaries of such complexes. Second, many condition- and location-specific PPIs are not detected, leading to sparsely-connected complexes that cannot be picked out by clustering algorithms. Third, the majority of complexes are small complexes (made up of two or three proteins), which are extra sensitive to the effects of extraneous edges and missing co-complex edges. We show that many existing complex-discovery algorithms have trouble predicting such complexes, and show that our insight into the disparity between the static interactome and dynamic protein complexes can be used to improve the performance of complex discovery.


Expression attenuation as a mechanism of robustness to gene duplication in protein complexes

10.1101/2020.07.09.195990 ◽

2020 ◽

Author(s):

Diana Ascencio ◽

Guillaume Diss ◽

Isabelle Gagnon-Arsenault ◽

Alexandre K Dubé ◽

Alexander DeLuna ◽

...

Keyword(s):

Gene Duplication ◽

Protein Interactions ◽

Protein Complexes ◽

Phenotypic Diversity ◽

26S Proteasome ◽

Gene Duplications ◽

Protein Protein Interactions ◽

Individual Gene ◽

The Impact

Gene duplication is ubiquitous and a major driver of phenotypic diversity across the tree of life, but its immediate consequences are not fully understood. Deleterious effects would decrease the probability of retention of duplicates and prevent their contribution to long term evolution. One possible detrimental effect of duplication is the perturbation of the stoichiometry of protein complexes. Here, we measured the fitness effects of the duplication of 899 essential genes in the budding yeast using high-resolution competition assays. At least ten percent of genes caused a fitness disadvantage when duplicated. Intriguingly, the duplication of most protein complex subunits had small to non-detectable effects on fitness, with few exceptions. We selected four complexes with subunits that had an impact on fitness when duplicated and measured the impact of individual gene duplications on their protein-protein interactions. We found that very few duplications affect both fitness and interactions. Furthermore, large complexes such as the 26S proteasome are protected from gene duplication by attenuation of protein abundance. Regulatory mechanisms that maintain the stoichiometric balance of protein complexes may protect from the immediate effects of gene duplication. Our results show that a better understanding of protein regulation and assembly in complexes is required for the refinement of current models of gene duplication.


Fuzzy C-Means Clustering Algorithm Based on Coefficient of Variation

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.998-999.873 ◽

2014 ◽

Vol 998-999 ◽

pp. 873-877

Author(s):

Zhen Bo Wang ◽

Bao Zhi Qiu

Keyword(s):

Coefficient Of Variation ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Data ◽

Cluster Center ◽

Data Set ◽

Fuzzy C Means ◽

Initial Cluster ◽

Fuzzy C Means Clustering ◽

The Impact

To reduce the impact of irrelevant attributes on clustering results and increase the influence of relevant attributes, this paper proposes a fuzzy C-means clustering algorithm based on the coefficient of variation (CV-FCM). In the algorithm, the coefficient of variation is used to assign a different weight to each attribute in the data set, and the magnitude of the weight expresses the importance of that attribute to the clusters. In addition, because fuzzy C-means clustering is sensitive to the initial cluster centers, a method for selecting initial cluster centers based on maximum distance is introduced on the basis of the weighted coefficient of variation. Experiments on real data sets show that this algorithm selects cluster centers effectively, with clustering results superior to those of general fuzzy C-means clustering algorithms.
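The two ingredients named in this abstract, attribute weights derived from the coefficient of variation and initial centers chosen by maximum distance, can be illustrated with a short Python sketch. The abstract gives no formulas, so everything below is an assumption: weights are taken as each attribute's coefficient of variation normalised to sum to 1, distances are weighted Euclidean, and centers are picked farthest-first starting from the point farthest from the data centroid. The function names are hypothetical, and the fuzzy membership updates of CV-FCM itself are not shown.

```python
import math
from statistics import mean, pstdev


def cv_weights(data):
    """Attribute weights proportional to the coefficient of variation (std / |mean|).

    Assumes every attribute has a non-zero mean and at least one attribute varies.
    """
    cvs = [pstdev(col) / abs(mean(col)) for col in zip(*data)]
    total = sum(cvs)
    return [cv / total for cv in cvs]


def weighted_distance(a, b, weights):
    """Euclidean distance with one weight per attribute."""
    return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, a, b)))


def max_distance_centers(data, k, weights):
    """Farthest-first selection of k initial centers under the weighted distance."""
    centroid = [mean(col) for col in zip(*data)]
    centers = [max(data, key=lambda p: weighted_distance(p, centroid, weights))]
    while len(centers) < k:
        # Next center: the point whose nearest chosen center is farthest away.
        centers.append(
            max(data, key=lambda p: min(weighted_distance(p, c, weights) for c in centers))
        )
    return centers


# Toy data: the first attribute separates two groups, the second barely varies.
data = [[1.0, 5.0], [1.2, 5.1], [9.0, 5.0], [9.1, 4.9]]
weights = cv_weights(data)
centers = max_distance_centers(data, 2, weights)
```

In the toy data the nearly constant second attribute receives a weight close to zero, so both the distance and the choice of initial centers are driven by the attribute that actually separates the groups.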
