On the robustness of graph-based clustering to random network alterations

2020 ◽  
Author(s):  
R. Greg Stacey ◽  
Michael A. Skinnider ◽  
Leonard J. Foster

Biological functions emerge from complex and dynamic networks of protein-protein interactions. Because these protein-protein interaction networks, or interactomes, represent pairwise connections within a hierarchically organized system, it is often useful to identify higher-order associations embedded within them, such as multi-member protein complexes. Graph-based clustering techniques are widely used to accomplish this goal, and dozens of field-specific and general clustering algorithms exist. However, interactomes can be prone to errors, especially interactomes that infer interactions using high-throughput biochemical assays. Therefore, robustness to network-level variability is an important criterion for any clustering algorithm that aims to generate robust, reproducible clusters. Here, we tested the robustness of a range of graph-based clustering algorithms in the presence of network-level noise, including algorithms common across domains and those specific to protein networks. We found that the results of all clustering algorithms measured were profoundly sensitive to injected network noise. Randomly rewiring 1% of network edges yielded up to a 57% change in clustering results, indicating that clustering markedly amplified network-level noise. However, the impact of network noise on individual clusters was not uniform. We found that some clusters were consistently robust to injected network noise while others were not. Therefore, we developed the clust.perturb R package and Shiny web application, which measures the reproducibility of clusters by randomly perturbing the network. We show that clust.perturb results are predictive of real-world cluster stability: poorly reproducible clusters as identified by clust.perturb are significantly less likely to be reclustered across experiments. 
We conclude that quantifying the robustness of a cluster to network noise, as implemented in clust.perturb, provides a powerful tool for ranking the reproducibility of clusters, and separating stable protein complexes from spurious associations.

2020 ◽  
pp. mcp.RA120.002275
Author(s):  
R. Greg Stacey ◽  
Michael A. Skinnider ◽  
Leonard J. Foster

Biological functions emerge from complex and dynamic networks of protein-protein interactions. Because these protein-protein interaction networks, or interactomes, represent pairwise connections within a hierarchically organized system, it is often useful to identify higher-order associations embedded within them, such as multi-member protein complexes. Graph-based clustering techniques are widely used to accomplish this goal, and dozens of field-specific and general clustering algorithms exist. However, interactomes can be prone to errors, especially when inferred from high-throughput biochemical assays. Therefore, robustness to network-level noise is an important criterion for any clustering algorithm that aims to generate robust, reproducible clusters. Here, we tested the robustness of a range of graph-based clustering algorithms in the presence of noise, including algorithms common across domains and those specific to protein networks. Strikingly, we found that all of the clustering algorithms tested here markedly amplified noise within the underlying protein interaction network. Randomly rewiring only 1% of network edges yielded more than a 50% change in clustering results, indicating that clustering markedly amplified network-level noise. Moreover, we found the impact of network noise on individual clusters was not uniform: some clusters were consistently robust to injected noise while others were not. To assist in assessing this, we developed the clust.perturb R package and Shiny web application to measure the reproducibility of clusters by randomly perturbing the network. We show that clust.perturb results are predictive of real-world cluster stability: poorly reproducible clusters as identified by clust.perturb are significantly less likely to be reclustered across experiments. 
We conclude that graph-based clustering amplifies noise in protein interaction networks, but quantifying the robustness of a cluster to network noise can separate stable protein complexes from spurious associations.
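
The perturbation idea described in this abstract can be illustrated with a small Python sketch. It is not the clust.perturb implementation: the function names are hypothetical, connected components stand in for a real clustering algorithm, and scoring each original cluster by its best Jaccard match in the noisy clustering is an assumption about how reproducibility might be measured.

```python
import random

def rewire(edges, nodes, fraction, rng):
    """Remove a fraction of edges and add the same number of new random edges."""
    edges = set(edges)
    n_swap = max(1, round(fraction * len(edges)))  # always rewire at least one edge
    removed = set(rng.sample(sorted(edges), n_swap))
    added = set()
    while len(added) < n_swap:  # assumes the network is sparse enough to have free pairs
        u, v = rng.sample(nodes, 2)
        e = (min(u, v), max(u, v))
        if e not in edges:
            added.add(e)
    return (edges - removed) | added

def connected_components(edges, nodes):
    """Stand-in clustering algorithm: each connected component is one cluster."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        clusters.append(frozenset(comp))
    return clusters

def jaccard(a, b):
    return len(a & b) / len(a | b)

def reproducibility(edges, nodes, cluster_fn, fraction=0.01, n_iter=50, seed=0):
    """Mean best-match Jaccard of each original cluster across noisy reclusterings."""
    rng = random.Random(seed)
    original = cluster_fn(edges, nodes)
    scores = {c: 0.0 for c in original}
    for _ in range(n_iter):
        noisy = cluster_fn(rewire(edges, nodes, fraction, rng), nodes)
        for c in original:
            scores[c] += max(jaccard(c, d) for d in noisy) / n_iter
    return scores
```

Edges are expected as `(u, v)` tuples with `u < v`. A score near 1 means the cluster reappears almost unchanged under noise; a low score flags a cluster that depends on a few specific edges.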


2020 ◽  
Vol 36 (20) ◽  
pp. 5027-5036 ◽  
Author(s):  
Mingzhou Song ◽  
Hua Zhong

Motivation: Chromosomal patterning of gene expression in cancer can arise from aneuploidy, genome disorganization or abnormal DNA methylation. To map such patterns, we introduce a weighted univariate clustering algorithm to guarantee linear runtime, optimality and reproducibility.
Results: We present the chromosome clustering method, establish its optimality and runtime and evaluate its performance. It uses dynamic programming enhanced with an algorithm to reduce search-space in-place to decrease runtime overhead. Using the method, we delineated outstanding genomic zones in 17 human cancer types. We identified strong continuity in dysregulation polarity—dominance by either up- or downregulated genes in a zone—along chromosomes in all cancer types. Significantly polarized dysregulation zones specific to cancer types are found, offering potential diagnostic biomarkers. Unreported previously, a total of 109 loci with conserved dysregulation polarity across cancer types give insights into pan-cancer mechanisms. Efficient chromosomal clustering opens a window to characterize molecular patterns in cancer genome and beyond.
Availability and implementation: Weighted univariate clustering algorithms are implemented within the R package ‘Ckmeans.1d.dp’ (4.0.0 or above), freely available at https://cran.r-project.org/package=Ckmeans.1d.dp.
Supplementary information: Supplementary data are available at Bioinformatics online.
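
To make the dynamic-programming idea concrete, here is a minimal Python sketch of optimal weighted one-dimensional k-means. It is a plain quadratic-time recurrence, not the linear-time search-space reduction the abstract describes and not the Ckmeans.1d.dp code; the function name and the assumption of strictly positive weights are illustrative only.

```python
def weighted_kmeans_1d(xs, ws, k):
    """Optimal weighted 1-D k-means by dynamic programming, O(k * n^2) time.

    Returns (labels, cost): labels follow the input order and are numbered
    0..k-1 from the smallest cluster centre upward. Assumes all weights > 0
    and k <= len(xs).
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    x = [xs[i] for i in order]
    w = [ws[i] for i in order]
    n = len(x)

    # Prefix sums of w, w*x and w*x^2 give any segment's cost in O(1).
    W, S, Q = [0.0], [0.0], [0.0]
    for xi, wi in zip(x, w):
        W.append(W[-1] + wi)
        S.append(S[-1] + wi * xi)
        Q.append(Q[-1] + wi * xi * xi)

    def cost(i, j):
        """Weighted sum of squared deviations of x[i:j] from its weighted mean."""
        s = S[j] - S[i]
        return (Q[j] - Q[i]) - s * s / (W[j] - W[i])

    inf = float("inf")
    # D[m][j]: best cost of splitting the first j sorted points into m clusters.
    D = [[inf] * (n + 1) for _ in range(k + 1)]
    B = [[0] * (n + 1) for _ in range(k + 1)]  # start index of the last cluster
    D[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = D[m - 1][i] + cost(i, j)
                if c < D[m][j]:
                    D[m][j], B[m][j] = c, i

    # Backtrack the cluster boundaries, then map labels to the input order.
    labels_sorted = [0] * n
    j = n
    for m in range(k, 0, -1):
        i = B[m][j]
        for t in range(i, j):
            labels_sorted[t] = m - 1
        j = i
    labels = [0] * n
    for pos, idx in enumerate(order):
        labels[idx] = labels_sorted[pos]
    return labels, D[k][n]
```

Because sorted one-dimensional clusters are always contiguous, the recurrence only has to choose where the last cluster starts, which is what makes an exact and reproducible optimum tractable.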


2021 ◽  
Vol 118 (6) ◽  
pp. e2014345118
Author(s):  
Diana Ascencio ◽  
Guillaume Diss ◽  
Isabelle Gagnon-Arsenault ◽  
Alexandre K. Dubé ◽  
Alexander DeLuna ◽  
...  

Gene duplication is ubiquitous and a major driver of phenotypic diversity across the tree of life, but its immediate consequences are not fully understood. Deleterious effects would decrease the probability of retention of duplicates and prevent their contribution to long-term evolution. One possible detrimental effect of duplication is the perturbation of the stoichiometry of protein complexes. Here, we measured the fitness effects of the duplication of 899 essential genes in the budding yeast using high-resolution competition assays. At least 10% of genes caused a fitness disadvantage when duplicated. Intriguingly, the duplication of most protein complex subunits had small to nondetectable effects on fitness, with few exceptions. We selected four complexes with subunits that had an impact on fitness when duplicated and measured the impact of individual gene duplications on their protein–protein interactions. We found that very few duplications affect both fitness and interactions. Furthermore, large complexes such as the 26S proteasome are protected from gene duplication by attenuation of protein abundance. Regulatory mechanisms that maintain the stoichiometric balance of protein complexes may protect from the immediate effects of gene duplication. Our results show that a better understanding of protein regulation and assembly in complexes is required for the refinement of current models of gene duplication.


2021 ◽  
Vol 37 (1) ◽  
pp. 71-89
Author(s):  
Vu-Tuan Dang ◽  
Viet-Vu Vu ◽  
Hong-Quan Do ◽  
Thi Kieu Oanh Le

During the past few years, semi-supervised clustering has emerged as an interesting new direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information, which is either already available or collected from users. Two main kinds of side information can be used in semi-supervised clustering algorithms: class labels, called seeds, and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature. However, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that uses both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI, but also on a document data set used for information extraction from Vietnamese documents. The results show that the proposed algorithm can significantly improve the clustering process compared to some recent algorithms.


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1271
Author(s):  
Hoyeon Jeong ◽  
Yoonbee Kim ◽  
Yi-Sue Jung ◽  
Dae Ryong Kang ◽  
Young-Rae Cho

Functional modules can be predicted using genome-wide protein–protein interactions (PPIs) from a systematic perspective. Various graph clustering algorithms have been applied to PPI networks for this task. In particular, the detection of overlapping clusters is necessary because a protein is involved in multiple functions under different conditions. Graph entropy (GE) is a novel metric to assess the quality of clusters in a large, complex network. In this study, the unweighted and weighted GE algorithms are evaluated to assess their validity for predicting functional modules. To measure clustering accuracy, the clustering results are compared to protein complexes and Gene Ontology (GO) annotations as references. We demonstrate that the GE algorithm is more accurate in detecting overlapping clusters than other competing methods. Moreover, we confirm the biological feasibility of the proteins that occur most frequently in the set of identified clusters. Finally, novel proteins for the additional annotation of GO terms are revealed.
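
The abstract does not define graph entropy, so the following Python sketch rests on an assumption: it uses the unweighted formulation from earlier graph-entropy clustering work, in which each vertex contributes the binary entropy of the fraction of its neighbours that lie inside the cluster, and the graph entropy is the sum over all vertices. Function names are hypothetical.

```python
import math

def vertex_entropy(v, cluster, adj):
    """Binary entropy of the share of v's neighbours that fall inside the cluster."""
    nbrs = adj[v]
    if not nbrs:
        return 0.0  # an isolated vertex carries no uncertainty
    p_in = len(nbrs & cluster) / len(nbrs)
    return -sum(p * math.log2(p) for p in (p_in, 1.0 - p_in) if p > 0)

def graph_entropy(cluster, adj):
    """Sum of vertex entropies over every vertex in the network (assumed definition)."""
    return sum(vertex_entropy(v, cluster, adj) for v in adj)

# Toy network: a triangle a-b-c with one extra protein d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
triangle_entropy = graph_entropy({"a", "b", "c"}, adj)
```

Under this definition a lower value indicates a cleaner cluster boundary: vertices whose neighbours are all inside or all outside the cluster contribute zero, and only vertices straddling the boundary add entropy.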


2013 ◽  
pp. 805-816
Author(s):  
Charalampos Moschopoulos ◽  
Grigorios Beligiannis ◽  
Spiridon Likothanassis ◽  
Sophia Kossida

In this paper, a Genetic Algorithm is applied on the filter of the Enhanced Markov Clustering algorithm to optimize the selection of clusters having a high probability to represent protein complexes. The filter was applied on the results (obtained by experiments made on five different yeast datasets) of three different algorithms known for their efficiency on protein complex detection through protein interaction graphs. The results are compared with three popular clustering algorithms, proving the efficiency of the proposed method according to metrics such as successful prediction rate and geometrical accuracy.


2015 ◽  
Vol 13 (02) ◽  
pp. 1571001 ◽  
Author(s):  
Chern Han Yong ◽  
Limsoon Wong

Protein interactions and complexes behave in a dynamic fashion, but this dynamism is not captured by interaction screening technologies, and not preserved in protein–protein interaction (PPI) networks. The analysis of static interaction data to derive dynamic protein complexes leads to several challenges, of which we identify three. First, many proteins participate in multiple complexes, leading to overlapping complexes embedded within highly-connected regions of the PPI network. This makes it difficult to accurately delimit the boundaries of such complexes. Second, many condition- and location-specific PPIs are not detected, leading to sparsely-connected complexes that cannot be picked out by clustering algorithms. Third, the majority of complexes are small complexes (made up of two or three proteins), which are extra sensitive to the effects of extraneous edges and missing co-complex edges. We show that many existing complex-discovery algorithms have trouble predicting such complexes, and show that our insight into the disparity between the static interactome and dynamic protein complexes can be used to improve the performance of complex discovery.


2020 ◽  
Author(s):  
Diana Ascencio ◽  
Guillaume Diss ◽  
Isabelle Gagnon-Arsenault ◽  
Alexandre K Dubé ◽  
Alexander DeLuna ◽  
...  

Gene duplication is ubiquitous and a major driver of phenotypic diversity across the tree of life, but its immediate consequences are not fully understood. Deleterious effects would decrease the probability of retention of duplicates and prevent their contribution to long-term evolution. One possible detrimental effect of duplication is the perturbation of the stoichiometry of protein complexes. Here, we measured the fitness effects of the duplication of 899 essential genes in the budding yeast using high-resolution competition assays. At least ten percent of genes caused a fitness disadvantage when duplicated. Intriguingly, the duplication of most protein complex subunits had small to non-detectable effects on fitness, with few exceptions. We selected four complexes with subunits that had an impact on fitness when duplicated and measured the impact of individual gene duplications on their protein-protein interactions. We found that very few duplications affect both fitness and interactions. Furthermore, large complexes such as the 26S proteasome are protected from gene duplication by attenuation of protein abundance. Regulatory mechanisms that maintain the stoichiometric balance of protein complexes may protect from the immediate effects of gene duplication. Our results show that a better understanding of protein regulation and assembly in complexes is required for the refinement of current models of gene duplication.


2014 ◽  
Vol 998-999 ◽  
pp. 873-877
Author(s):  
Zhen Bo Wang ◽  
Bao Zhi Qiu

To reduce the impact of irrelevant attributes on clustering results and increase the influence of relevant attributes, this paper proposes a fuzzy C-means clustering algorithm based on the coefficient of variation (CV-FCM). In the algorithm, the coefficient of variation is used to assign a different weight to each attribute in the data set, and the magnitude of the weight expresses the importance of that attribute to the clusters. In addition, because the fuzzy C-means algorithm is sensitive to the initial cluster centers, a method for selecting initial cluster centers based on maximum distance is introduced on the basis of the weighted coefficient of variation. Experiments on real data sets show that the algorithm selects cluster centers effectively and that its clustering results are superior to those of general fuzzy C-means clustering algorithms.
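
The abstract gives no formulas, so the Python sketch below is only one plausible reading of its two ingredients. It assumes each attribute weight is that attribute's coefficient of variation (standard deviation divided by absolute mean) normalised to sum to 1, and that initial centers are chosen by farthest-first selection under the weighted Euclidean distance. All function names are hypothetical, and the fuzzy C-means iterations themselves are not shown.

```python
import math
import statistics

def cv_weights(data):
    """Attribute weights proportional to each column's coefficient of variation.

    Assumes every column has a non-zero mean and at least one column varies.
    """
    cols = list(zip(*data))
    cvs = [statistics.pstdev(c) / abs(statistics.fmean(c)) for c in cols]
    total = sum(cvs)
    return [cv / total for cv in cvs]

def weighted_dist(a, b, w):
    """Euclidean distance with one weight per attribute."""
    return math.sqrt(sum(wj * (aj - bj) ** 2 for aj, bj, wj in zip(a, b, w)))

def max_distance_centers(data, w, k):
    """Pick k initial centers by repeatedly taking the point farthest from those chosen."""
    mean = [statistics.fmean(c) for c in zip(*data)]
    # Seed with the point farthest from the overall mean (an illustrative choice).
    centers = [max(data, key=lambda p: weighted_dist(p, mean, w))]
    while len(centers) < k:
        centers.append(
            max(data, key=lambda p: min(weighted_dist(p, c, w) for c in centers))
        )
    return centers

points = [(1.0, 10.0), (1.0, 20.0), (1.0, 30.0)]
weights = cv_weights(points)
initial_centers = max_distance_centers(points, weights, 2)
```

In the toy data the first attribute is constant, so it receives zero weight and the initial centers are spread along the second attribute only, which is the kind of behaviour the weighting is meant to produce.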

