A Survey on Innovative Graph-Based Clustering Algorithms

Author(s):  
Mark Hloch ◽  
Mario Kubek ◽  
Herwig Unger
2021 ◽  
Vol 37 (1) ◽  
pp. 71-89
Author(s):  
Vu-Tuan Dang ◽  
Viet-Vu Vu ◽  
Hong-Quan Do ◽  
Thi Kieu Oanh Le

During the past few years, semi-supervised clustering has emerged as an interesting new direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information, which is either already available or collected from users. There are two main kinds of side information that semi-supervised clustering algorithms can exploit: class labels, called seeds, and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature. However, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that aims to use both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI but also on a document data set used for information extraction from Vietnamese documents. The results show that the proposed algorithm can significantly improve clustering performance compared to some recent algorithms.
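
The two kinds of side information described above are related: labeled seeds imply pairwise constraints, since two seeds with the same label must be linked and two seeds with different labels cannot be. The following is a minimal sketch of that relationship only, not the MCSSGC algorithm; the function names `constraints_from_seeds` and `count_violations` are hypothetical and chosen for illustration.

```python
from itertools import combinations


def constraints_from_seeds(seeds):
    """Derive pairwise constraints from labeled seeds.

    seeds maps a point id to its class label (hypothetical input format).
    Returns (must_link, cannot_link) as sets of (id_a, id_b) pairs.
    """
    must_link, cannot_link = set(), set()
    for (a, label_a), (b, label_b) in combinations(sorted(seeds.items()), 2):
        if label_a == label_b:
            must_link.add((a, b))
        else:
            cannot_link.add((a, b))
    return must_link, cannot_link


def count_violations(assignment, must_link, cannot_link):
    """Count constraints that a clustering (point id -> cluster id) breaks."""
    broken_must = sum(assignment[a] != assignment[b] for a, b in must_link)
    broken_cannot = sum(assignment[a] == assignment[b] for a, b in cannot_link)
    return broken_must + broken_cannot
```

A clustering that respects every seed label yields zero violations, so a count like this could serve as a rough consistency check when seeds and constraints are mixed.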


2009 ◽  
Vol 27 (7) ◽  
pp. 979-988 ◽  
Author(s):  
P. Foggia ◽  
G. Percannella ◽  
C. Sansone ◽  
M. Vento

2020 ◽  
Vol 38 (1) ◽  
pp. 52
Author(s):  
Felipe Vasconcelos dos Passos ◽  
Marco Antonio Braga ◽  
Thiago Gonçalves Carelli ◽  
Josiane Branco Plantz

ABSTRACT. In the Ponta Grossa Formation, the Devonian interval of the Paraná Basin, Brazil, sampling restrictions are frequent, and lithological interpretations from gamma ray logs are common. However, no single log can discriminate lithology unambiguously. An alternative that reduces the uncertainty of these assessments is to perform multivariate analysis of well logs using data clustering methods. To this end, this study applies two different clustering algorithms, trained with gamma ray, sonic and resistivity logs. Five electrofacies were differentiated and validated by core data. One of the electrofacies identified by the model was not distinguished by macroscopic descriptions. However, the model developed is sufficiently accurate for lithological predictions.
Keywords: geophysical well logging, lithology prediction, Paraná Basin.
ELECTROFACIES CLASSIFICATION OF THE PONTA GROSSA FORMATION USING THE MULTI-RESOLUTION GRAPH-BASED CLUSTERING (MRGC) AND SELF-ORGANIZING MAPS (SOM) METHODS
RESUMO (translated from Portuguese). In the Ponta Grossa Formation, the Devonian interval of the Paraná Basin, Brazil, sampling restrictions are frequent and lithological interpretations from gamma ray logs are common. However, no single geophysical log can discriminate lithologies unambiguously. An alternative that reduces the uncertainty of these assessments is to run a multivariate analysis combining several geophysical well logs by means of data clustering methods. To this end, this study applies two clustering algorithms to gamma ray, sonic and resistivity logs for lithological prediction. Five electrofacies were differentiated and validated by core data. One class identified by the model was not identified by macroscopic descriptions. Nevertheless, the model is sufficiently accurate for lithological predictions.
Keywords (translated from Portuguese): well geophysics, lithological prediction, rock-log correlation, Paraná Basin.


2020 ◽  
Author(s):  
R. Greg Stacey ◽  
Michael A. Skinnider ◽  
Leonard J. Foster

ABSTRACT. Biological functions emerge from complex and dynamic networks of protein-protein interactions. Because these protein-protein interaction networks, or interactomes, represent pairwise connections within a hierarchically organized system, it is often useful to identify higher-order associations embedded within them, such as multi-member protein complexes. Graph-based clustering techniques are widely used to accomplish this goal, and dozens of field-specific and general clustering algorithms exist. However, interactomes can be prone to errors, especially interactomes that infer interactions using high-throughput biochemical assays. Therefore, robustness to network-level variability is an important criterion for any clustering algorithm that aims to generate robust, reproducible clusters. Here, we tested the robustness of a range of graph-based clustering algorithms in the presence of network-level noise, including algorithms common across domains and those specific to protein networks. We found that the results of all clustering algorithms measured were profoundly sensitive to injected network noise. Randomly rewiring 1% of network edges yielded up to a 57% change in clustering results, indicating that clustering markedly amplified network-level noise. However, the impact of network noise on individual clusters was not uniform. We found that some clusters were consistently robust to injected network noise while others were not. Therefore, we developed the clust.perturb R package and Shiny web application, which measures the reproducibility of clusters by randomly perturbing the network. We show that clust.perturb results are predictive of real-world cluster stability: poorly reproducible clusters as identified by clust.perturb are significantly less likely to be reclustered across experiments.
We conclude that quantifying the robustness of a cluster to network noise, as implemented in clust.perturb, provides a powerful tool for ranking the reproducibility of clusters, and separating stable protein complexes from spurious associations.


2020 ◽  
pp. mcp.RA120.002275
Author(s):  
R. Greg Stacey ◽  
Michael A. Skinnider ◽  
Leonard J. Foster

Biological functions emerge from complex and dynamic networks of protein-protein interactions. Because these protein-protein interaction networks, or interactomes, represent pairwise connections within a hierarchically organized system, it is often useful to identify higher-order associations embedded within them, such as multi-member protein complexes. Graph-based clustering techniques are widely used to accomplish this goal, and dozens of field-specific and general clustering algorithms exist. However, interactomes can be prone to errors, especially when inferred from high-throughput biochemical assays. Therefore, robustness to network-level noise is an important criterion for any clustering algorithm that aims to generate robust, reproducible clusters. Here, we tested the robustness of a range of graph-based clustering algorithms in the presence of noise, including algorithms common across domains and those specific to protein networks. Strikingly, we found that all of the clustering algorithms tested here markedly amplified noise within the underlying protein interaction network. Randomly rewiring only 1% of network edges yielded more than a 50% change in clustering results, indicating that clustering markedly amplified network-level noise. Moreover, we found the impact of network noise on individual clusters was not uniform: some clusters were consistently robust to injected noise while others were not. To assist in assessing this, we developed the clust.perturb R package and Shiny web application to measure the reproducibility of clusters by randomly perturbing the network. We show that clust.perturb results are predictive of real-world cluster stability: poorly reproducible clusters as identified by clust.perturb are significantly less likely to be reclustered across experiments. 
We conclude that graph-based clustering amplifies noise in protein interaction networks, but quantifying the robustness of a cluster to network noise can separate stable protein complexes from spurious associations.
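
The perturbation idea described in this abstract (rewire a small fraction of edges, recluster, and see which clusters survive) can be sketched roughly as follows. This is a loose illustration under stated assumptions, not the clust.perturb implementation or its API: connected components stand in for a real clustering algorithm, the best-match Jaccard index is an assumed reproducibility score, and all function names are hypothetical.

```python
import random


def connected_components(nodes, edges):
    """Stand-in clustering algorithm: each connected component is a cluster."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(frozenset(component))
    return clusters


def rewire(nodes, edges, fraction, rng):
    """Replace a fraction of edges with random node pairs not yet in the graph.

    Assumes the graph is sparse enough that free node pairs exist.
    """
    edge_set = {tuple(sorted(e)) for e in edges}
    n_swap = round(fraction * len(edge_set))
    for old in rng.sample(sorted(edge_set), n_swap):
        edge_set.remove(old)
        while True:
            new = tuple(sorted(rng.sample(nodes, 2)))
            if new != old and new not in edge_set:
                edge_set.add(new)
                break
    return sorted(edge_set)


def jaccard(a, b):
    return len(a & b) / len(a | b)


def reproducibility(nodes, edges, cluster_fn, fraction=0.01, n_iter=20, seed=0):
    """Average best-match Jaccard of each original cluster over noisy reruns."""
    rng = random.Random(seed)
    original = cluster_fn(nodes, edges)
    scores = {cluster: 0.0 for cluster in original}
    for _ in range(n_iter):
        noisy = cluster_fn(nodes, rewire(nodes, edges, fraction, rng))
        for cluster in original:
            scores[cluster] += max(jaccard(cluster, d) for d in noisy) / n_iter
    return scores
```

Calling `reproducibility(nodes, edges, connected_components)` with `nodes` as a list and `edges` as a list of node pairs returns one score per original cluster; scores near 1 suggest a cluster that survives rewiring, and lower scores suggest one that does not.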


2020 ◽  
Vol 5 (1) ◽  
Author(s):  
Zijing Liu ◽  
Mauricio Barahona

Abstract. We present a graph-theoretical approach to data clustering, which combines the creation of a graph from the data with Markov Stability, a multiscale community detection framework. We show how the multiscale capabilities of the method allow the estimation of the number of clusters, as well as alleviating the sensitivity to the parameters in graph construction. We use both synthetic and benchmark real datasets to compare and evaluate several graph construction methods and clustering algorithms, and show that multiscale graph-based clustering achieves improved performance compared to popular clustering methods without the need to set the number of clusters externally.


2019 ◽  
Vol 35 (4) ◽  
pp. 373-384
Author(s):  
Cuong Le ◽  
Viet Vu Vu ◽  
Le Thi Kieu Oanh ◽  
Nguyen Thi Hai Yen

Although clustering algorithms have a long history, clustering still attracts a lot of attention because efficient data analysis tools are needed in many applications such as social networks, electronic commerce, and GIS. Recently, semi-supervised clustering, which uses side information, has received a great deal of attention; examples include semi-supervised K-Means, semi-supervised DBSCAN, and semi-supervised graph-based clustering (SSGC). Generally, there are two forms of side information: seeds (labeled data) and constraints (must-link, cannot-link). By integrating information provided by the user or a domain expert, semi-supervised clustering can produce the expected results. However, clustering results usually depend on the side information provided, so different side information will produce different clustering results. In some cases, clustering performance may decrease if the side information is not carefully chosen. This paper addresses the problem of efficiently collecting seeds for semi-supervised clustering, especially for semi-supervised graph-based clustering by seeding (SSGC). Properly collected seeds can boost the quality of clustering and minimize the number of queries put to the user. For this purpose, we have developed an active learning algorithm (called SKMMM) for the seed collection task, which identifies candidates to query users about by using the K-Means and min-max algorithms. Experiments conducted on real data sets from UCI and on a real collected document data set show the effectiveness of our approach compared with other methods.
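
To make the min-max step mentioned above more concrete, here is a minimal sketch that assumes "min-max" refers to farthest-first traversal: each new candidate is the point whose distance to its nearest already-chosen point is largest. It covers only this selection step, not the K-Means stage or the SKMMM algorithm as published; the function name `min_max_candidates` and its parameters are hypothetical.

```python
import math


def min_max_candidates(points, k, start=0):
    """Pick k well-spread point indices by farthest-first traversal.

    points is a list of equal-length coordinate tuples; assumes k <= len(points).
    """
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Maximize the minimum distance to the chosen set.
        candidate = max(range(len(points)), key=lambda i: nearest[i])
        chosen.append(candidate)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[candidate]))
    return chosen
```

Because the selected points are spread across the data, asking the user to label them tends to cover distinct regions with few queries, which is the motivation given in the abstract for collecting seeds actively.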

