Graph-based data clustering via multiscale community detection

2020 ◽  
Vol 5 (1) ◽  
Author(s):  
Zijing Liu ◽  
Mauricio Barahona

We present a graph-theoretical approach to data clustering, which combines the creation of a graph from the data with Markov Stability, a multiscale community detection framework. We show how the multiscale capabilities of the method allow the estimation of the number of clusters, as well as alleviating the sensitivity to the parameters in graph construction. We use both synthetic and benchmark real datasets to compare and evaluate several graph construction methods and clustering algorithms, and show that multiscale graph-based clustering achieves improved performance compared to popular clustering methods without the need to set the number of clusters externally.
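As an illustration of the general pipeline described here (a graph built from the data followed by community detection at several scales), the following minimal sketch builds a k-nearest-neighbour graph with scikit-learn and scans a resolution parameter with networkx's Louvain communities. Louvain is used only as a stand-in for Markov Stability, and all parameter values are assumptions for the example, not settings from the paper.

```python
import numpy as np
import networkx as nx
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

# Toy data standing in for a real dataset.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Symmetric k-nearest-neighbour graph with Gaussian similarity weights.
A = kneighbors_graph(X, n_neighbors=10, mode="distance", include_self=False)
A = A.maximum(A.T)                                  # symmetrize the graph
A.data = np.exp(-(A.data / A.data.mean()) ** 2)     # distances -> similarities
G = nx.from_scipy_sparse_array(A)

# Community detection at several scales; Louvain's resolution plays the role
# of the scale parameter (a stand-in for Markov time in Markov Stability).
for resolution in (0.5, 1.0, 2.0, 4.0):
    parts = nx.community.louvain_communities(G, weight="weight",
                                             resolution=resolution, seed=0)
    print(f"resolution={resolution}: {len(parts)} communities")
```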

Author(s):  
B.K. Tripathy ◽  
Adhir Ghosh

The development of data clustering algorithms has been pursued by researchers since the introduction of the k-means algorithm (Macqueen 1967; Lloyd 1982). These algorithms were subsequently modified to handle categorical data. In order to handle situations where objects can have memberships in multiple clusters, fuzzy clustering and rough clustering methods were introduced (Lingras et al 2003, 2004a). There are many extensions of these initial algorithms (Lingras et al 2004b; Lingras 2007; Mitra 2004; Peters 2006, 2007). The MMR algorithm (Parmar et al 2007), its extensions (Tripathy et al 2009, 2011a, 2011b) and the MADE algorithm (Herawan et al 2010) use rough set techniques for clustering. In this chapter, the authors focus on rough set based clustering algorithms and provide a comparative study of all the fuzzy set based and rough set based clustering algorithms in terms of their efficiency. They also present open problems for future study in the directions covered.


2020 ◽  
Vol 18 (04) ◽  
pp. 2040005
Author(s):  
Ruiyi Li ◽  
Jihong Guan ◽  
Shuigeng Zhou

Clustering analysis has been widely applied to single-cell RNA-sequencing (scRNA-seq) data to discover cell types and cell states. Algorithms developed in recent years have greatly helped the understanding of cellular heterogeneity and the underlying mechanisms of biological processes. However, these algorithms often use different techniques, were evaluated on different datasets, and were compared with only some of their counterparts, usually using different performance metrics. Consequently, an accurate and complete picture of their merits and demerits is lacking, which makes it difficult for users to select proper algorithms for analyzing their data. To fill this gap, we first review the major existing scRNA-seq data clustering methods, and then conduct a comprehensive performance comparison among them from multiple perspectives. We consider 13 state-of-the-art scRNA-seq data clustering algorithms, and collect 12 publicly available real scRNA-seq datasets from the existing works to evaluate and compare these algorithms. Our comparative study shows that the existing methods are very diverse in performance. Even the top-performing algorithms do not perform well on all datasets, especially those with complex structures. This suggests that further research is required to explore more stable, accurate, and efficient clustering algorithms for scRNA-seq data.
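For context on how such comparisons are scored, the snippet below computes two external metrics commonly used to compare scRNA-seq clusterings against reference cell-type labels, the adjusted Rand index and normalized mutual information; the label vectors are synthetic placeholders rather than data from the surveyed studies.

```python
# Illustrative evaluation of clustering output against reference cell-type
# labels using two standard external metrics. Labels are placeholders.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels      = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # reference cell types
predicted_labels = [0, 0, 1, 1, 1, 1, 2, 2, 0]   # output of a clustering method

print("ARI:", adjusted_rand_score(true_labels, predicted_labels))
print("NMI:", normalized_mutual_info_score(true_labels, predicted_labels))
```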


Algorithms ◽  
2018 ◽  
Vol 11 (11) ◽  
pp. 177 ◽  
Author(s):  
Xuedong Gao ◽  
Minghan Yang

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions to determine the locally optimal clustering results in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first studied several well-known internal CVIs for categorical data clustering, and proved that evaluating partitions with different numbers of clusters is ineffective without any inter-cluster separation measure or assumption; the accuracy of the separation measure, along with its coordination with the intra-cluster compactness measure, can notably affect performance. Then, aiming to enhance internal clustering validation measurement, we proposed a new internal CVI—clustering utility based on the averaged information gain of isolating each cluster (CUBAGE)—which measures both the compactness and the separation of the partition. The experimental results supported our findings with regard to the existing internal CVIs, and showed that the proposed CUBAGE outperforms other internal CVIs whether or not the number of clusters is known in advance.
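The exact definition of CUBAGE is given in the paper; the sketch below only illustrates the general entropy/information-gain style of internal validation for categorical data, scoring a partition by its expected within-cluster attribute entropy (lower means more compact). It should not be read as the CUBAGE formula.

```python
# Illustrative entropy-style compactness measure for categorical partitions.
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy of one categorical attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def within_cluster_entropy(data, labels):
    """Expected per-attribute entropy inside clusters, weighted by cluster size."""
    n, n_attrs = len(data), len(data[0])
    total = 0.0
    for k in set(labels):
        rows = [data[i] for i in range(n) if labels[i] == k]
        weight = len(rows) / n
        total += weight * sum(attribute_entropy([r[j] for r in rows])
                              for j in range(n_attrs)) / n_attrs
    return total

data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y"), ("b", "y"), ("b", "z")]
print(within_cluster_entropy(data, [0, 0, 0, 1, 1, 1]))  # compact partition
print(within_cluster_entropy(data, [0, 1, 0, 1, 0, 1]))  # mixed partition
```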


2019 ◽  
Vol 2019 ◽  
pp. 1-20 ◽  
Author(s):  
Ameera M. Almasoud ◽  
Hend S. Al-Khalifa ◽  
Abdulmalik S. Al-Salman

In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSM). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need the ability to cope with the size of big data. We used parallel and distributed processing by splitting data into multiple partitions and applied SSM measures to each partition; this approach helped manage big data scalability and computational problems. Our solution involves three steps: split gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology (Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA) are enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time of calculating semantic similarity between gene pairs and improved performance of the three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with using split GO and data clustering algorithms to split input data based on their similarity, reduced the average time more than did the approach of equally dividing input data. Time reduction increased with increasing number of splits. Time reduction percentage was 24.1%, 39.2%, and 66.6% for Threaded SSDD; 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively; and 92.04% for Threaded Resnik in the case of four slaves.
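A minimal sketch of the threading idea in the third step is shown below, assuming Resnik similarity between two GO terms is taken as the information content of their most informative common ancestor; the IC values, ancestor sets, and term identifiers are hypothetical placeholders, and the thread pool simply scores independent term pairs in parallel.

```python
# Threaded Resnik-style scoring over independent term pairs.
from concurrent.futures import ThreadPoolExecutor

ic = {"GO:A": 1.2, "GO:B": 2.5, "GO:C": 3.1, "GO:root": 0.0}   # hypothetical IC values
ancestors = {                                                   # hypothetical ancestor sets
    "GO:B": {"GO:B", "GO:A", "GO:root"},
    "GO:C": {"GO:C", "GO:A", "GO:root"},
}

def resnik(term1, term2):
    """IC of the most informative common ancestor of the two terms."""
    common = ancestors[term1] & ancestors[term2]
    return max(ic[t] for t in common) if common else 0.0

pairs = [("GO:B", "GO:C"), ("GO:C", "GO:B"), ("GO:B", "GO:B")]
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(lambda p: resnik(*p), pairs))
print(scores)
```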


10.12737/7483 ◽  
2014 ◽  
Vol 8 (7) ◽  
pp. 0-0
Author(s):  
Oleg Sdvizhkov

Cluster analysis [3] is a relatively new branch of mathematics that studies methods for partitioning a set of objects, described by a finite set of attributes, into homogeneous groups (clusters). Cluster analysis is widely used in psychology, sociology, economics (market segmentation), and many other areas in which objects must be classified according to their characteristics. The clustering methods implemented in the packages STATISTICA [1] and SPSS [2] return the partition into clusters, clustering and dispersion statistics, and the dendrograms of hierarchical clustering algorithms. MS Excel macros for the main clustering methods, together with application examples, are given in the monograph [5]. One of the central problems of cluster analysis is to define a criterion for the number of clusters, denoted here by K, into which a given set of objects is to be partitioned. There are several dozen approaches [4] to determining K. In particular, according to [6], K is the minimum number satisfying a condition that relates the minimum value of the total dispersion for a partition into K clusters to the number of objects N. The number of clusters can also be obtained automatically by successively extracting anomalous clusters [4]. In 2010, a method for obtaining K by applying a density function was proposed and experimentally validated [4]. The article offers two simple approaches to determining K when each cluster contains at least two objects: in the first, K is determined via shortest Hamiltonian cycles; in the second, via the minimum spanning tree. Examples of clustering with detailed step-by-step solutions and graphic illustrations are provided. The use of a VBA macro for Excel that returns the minimum spanning tree is demonstrated on clustering problems. The article contains the macro code, with comments on its main unit.
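One plausible reading of the minimum-spanning-tree approach is sketched below in Python rather than VBA: build the MST of the objects, cut edges that are much longer than the typical edge, and read K off as the number of remaining connected components. The cut-off heuristic is an assumption for the example and is not the article's criterion or macro.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Six points forming two visually obvious groups.
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]])

dist = squareform(pdist(points))                 # full pairwise distance matrix
mst = minimum_spanning_tree(dist).toarray()      # the n-1 MST edges as a dense matrix

# Cut edges much longer than the typical MST edge (simple heuristic: 3x the median).
edge_lengths = mst[mst > 0]
mst[mst > 3 * np.median(edge_lengths)] = 0

k, labels = connected_components(mst, directed=False)
print("Estimated number of clusters K =", k)     # 2 for this toy example
print("Cluster labels:", labels)
```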


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Yufang Min ◽  
Yaonan Zhang

The performance of graph-based clustering methods highly depends on the quality of the data affinity graph as a good affinity graph can approximate well the pairwise similarity between data samples. To a large extent, existing graph-based clustering methods construct the affinity graph based on a fixed distance metric, which is often not an accurate representation of the underlying data structure. Also, they require postprocessing on the affinity graph to obtain clustering results. Thus, the results are sensitive to the particular graph construction methods. To address these two drawbacks, we propose a k-component graph clustering (k-GC) approach to learn an intrinsic affinity graph and to obtain clustering results simultaneously. Specifically, k-GC learns the data affinity graph by assigning the adaptive and optimal neighbors for each data point based on the local distances. Efficient iterative updating algorithms are derived for k-GC, along with proofs of convergence. Experiments on several benchmark datasets have demonstrated the effectiveness of k-GC.
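The closed-form weights below follow the well-known adaptive-neighbour construction from related work (each point distributes its affinity over its k closest neighbours based on local distances); they are shown only to illustrate how an affinity graph can be learned from local distances and are not claimed to reproduce k-GC's exact update rules.

```python
# Adaptive-neighbour affinity graph learned from local distances (illustrative).
import numpy as np
from scipy.spatial.distance import cdist

def adaptive_neighbor_affinity(X, k=5):
    """Affinity matrix with k adaptively weighted neighbours per point."""
    n = X.shape[0]
    d = cdist(X, X) ** 2                       # squared Euclidean distances
    S = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(d[i])
        idx = order[1:k + 2]                   # k nearest plus one extra, excluding self
        di = d[i, idx]
        # Closed-form weights: larger for closer neighbours, zero beyond the k-th.
        num = di[k] - di[:k]
        den = k * di[k] - di[:k].sum() + 1e-12
        S[i, idx[:k]] = num / den
    return 0.5 * (S + S.T)                     # symmetrize

X = np.random.RandomState(0).randn(20, 2)
S = adaptive_neighbor_affinity(X, k=5)
print(S.shape, S.sum(axis=1)[:3])
```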


Complexity ◽  
2017 ◽  
Vol 2017 ◽  
pp. 1-10 ◽  
Author(s):  
Ulzii-Utas Narantsatsralt ◽  
Sanggil Kang

Community detection has become an increasingly popular tool for analyzing and researching complex networks. Many methods have been proposed for accurate community detection, and one of them is spectral clustering. Most spectral clustering algorithms have been evaluated on artificial networks, and the accuracy of community detection is still unsatisfactory. Therefore, this paper proposes an agglomerative spectral clustering method with conductance and edge weights. In this method, the most similar nodes are agglomerated based on eigenvector space and edge weights. In addition, the conductance is used to identify densely connected clusters while agglomerating. In experiments, the proposed method shows improved performance over related works and proves to be efficient for real-life complex networks.
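Conductance here is the standard cut-based quantity, the cut weight leaving a node set divided by the smaller of the two volumes; the helper below computes it for a candidate cluster in a weighted graph. It illustrates only this ingredient, not the paper's agglomeration procedure.

```python
# Conductance of a candidate cluster in a (possibly weighted) graph.
import networkx as nx

def conductance(G, nodes):
    """phi(S) = cut(S, S_bar) / min(vol(S), vol(S_bar))."""
    S = set(nodes)
    cut = sum(d.get("weight", 1.0) for u, v, d in G.edges(data=True)
              if (u in S) != (v in S))
    vol_S = sum(dict(G.degree(S, weight="weight")).values())
    vol_rest = sum(dict(G.degree(set(G) - S, weight="weight")).values())
    return cut / min(vol_S, vol_rest)

G = nx.karate_club_graph()
cluster = [n for n, d in G.nodes(data=True) if d["club"] == "Mr. Hi"]
print("conductance:", round(conductance(G, cluster), 3))
```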


Author(s):  
Volodymyr Mosorov ◽  
Taras Panskyi ◽  
Sebastian Biedron

This paper analyzes the performance of a termination criterion for the k-specified (namely, k-means) crisp data-partitioning pre-clustering algorithm. The results are analyzed using clustering validity indices. The termination criterion allows data with any number of clusters to be analyzed. Moreover, in contrast to the known validity indices, the introduced criterion makes it possible to analyze data that form a single cluster.


2020 ◽  
Vol 38 (1) ◽  
pp. 52
Author(s):  
Felipe Vasconcelos dos Passos ◽  
Marco Antonio Braga ◽  
Thiago Gonçalves Carelli ◽  
Josiane Branco Plantz

In the Ponta Grossa Formation, a Devonian interval of the Paraná Basin, Brazil, sampling restrictions are frequent and lithological interpretations from gamma-ray logs are common. However, no single log can discriminate lithology unambiguously. An alternative for reducing the uncertainty of these assessments is to perform multivariate analysis of well logs using data clustering methods. In this sense, this study applies two clustering algorithms, Multi-Resolution Graph-Based Clustering (MRGC) and Self-Organizing Maps (SOM), trained with gamma-ray, sonic and resistivity logs. Five electrofacies were differentiated and validated against core data. It was found that one of the electrofacies identified by the model was not distinguished by macroscopic descriptions. However, the model developed is sufficiently accurate for lithological predictions.
Keywords: geophysical well logging, lithology prediction, rock-log correlation, Paraná Basin.
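A minimal sketch of the SOM half of such a workflow, using the MiniSom library on synthetic placeholder values for the gamma-ray, sonic and resistivity logs, is given below; grid size and training parameters are assumptions for the example, not the study's settings.

```python
# SOM-based grouping of well-log samples into candidate electrofacies (illustrative).
import numpy as np
from minisom import MiniSom

rng = np.random.RandomState(0)
logs = rng.rand(500, 3)                                 # columns: gamma ray, sonic, resistivity
logs = (logs - logs.mean(axis=0)) / logs.std(axis=0)    # standardize each log

som = MiniSom(3, 2, input_len=3, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(logs, num_iteration=1000)

# Best-matching unit per sample -> up to 3 x 2 = 6 candidate electrofacies.
labels = [som.winner(sample) for sample in logs]
print(labels[:5])
```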


2019 ◽  
Vol 04 (01) ◽  
pp. 1850017 ◽  
Author(s):  
Weiru Chen ◽  
Jared Oliverio ◽  
Jin Ho Kim ◽  
Jiayue Shen

Big Data is a popular cutting-edge technology nowadays. Its techniques and algorithms are expanding into different areas, including engineering, biomedicine, and business. Due to the high volume and complexity of Big Data, it is necessary to apply data pre-processing methods before data mining. The pre-processing methods include data cleaning, data integration, data reduction, and data transformation. Data clustering is the most important step of data reduction. With data clustering, mining on the reduced data set should be more efficient yet still produce quality analytical results. This paper presents the different data clustering methods and related algorithms for data mining with Big Data. Data clustering can increase the efficiency and accuracy of data mining.
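As a small illustration of clustering as a data-reduction step, the sketch below replaces a large synthetic dataset with its cluster centroids (plus cluster sizes) using scikit-learn's MiniBatchKMeans, so that downstream mining can run on the much smaller reduced set; all sizes and parameters are arbitrary for the example.

```python
# Data reduction by clustering: keep centroids and per-centroid counts.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(0).rand(100_000, 8)        # stand-in for a big dataset

km = MiniBatchKMeans(n_clusters=200, batch_size=2048, random_state=0)
labels = km.fit_predict(X)

centroids = km.cluster_centers_                       # 200 x 8 reduced dataset
sizes = np.bincount(labels, minlength=200)            # weight of each representative
print(X.shape, "->", centroids.shape, "with per-centroid counts", sizes[:5])
```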

