An Improved Pheromone-Based Kohonen Self-Organising Map in Clustering and Visualising Balanced and Imbalanced Datasets

2020 ◽  
Vol 20 (No.4) ◽  
pp. 651-676
Author(s):  
Rubiyah Yusof ◽  
Azlin Ahmad ◽  
Nor Saradatul Akmar Zulkifli ◽  
Mohd Najib Ismail

The data distribution issue remains an unsolved clustering problem in data mining, especially when dealing with imbalanced datasets. The Kohonen Self-Organising Map (KSOM) is a well-known clustering algorithm that can solve various problems without a pre-defined number of clusters. However, like other clustering algorithms, it requires sufficient data for its unsupervised learning process. An inadequate amount of class-labelled data in a dataset significantly affects the learning process, leading to inefficient and unreliable results. Numerous studies have hybridised and optimised the KSOM algorithm with various optimisation techniques. Unfortunately, some problems remain unsolved, especially separation boundaries and overlapping clusters. Therefore, this research proposes an improved pheromone-based KSOM (PKSOM) algorithm, known as iPKSOM, to solve these problems. Six datasets, i.e. Iris, Seed, Glass, Titanic, WDBC, and Tropical Wood, were chosen to investigate the effectiveness of the iPKSOM algorithm, and all results were compared with those of the original KSOM. The modification significantly improved the clustering process by refining the scatter of clustered data and reducing overlapping clusters. The proposed algorithm can therefore be applied to clustering other complex datasets, such as high-dimensional and streaming data.
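The KSOM baseline that iPKSOM improves upon can be sketched in a few lines. The following is a minimal standard Kohonen SOM on a 1-D node lattice, not the authors' pheromone-based variant; the node count, decay schedules, and toy data are illustrative assumptions:

```python
import math
import random

def train_ksom(data, n_nodes=4, epochs=100, lr0=0.5, sigma0=2.0, seed=1):
    """Train a minimal 1-D Kohonen SOM on 2-D points (plain KSOM, not iPKSOM)."""
    rng = random.Random(seed)
    # initialise node weights to randomly chosen data points
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                   # decaying learning rate
        sigma = max(sigma0 * (1 - t / epochs), 0.5)   # shrinking neighbourhood
        for x in data:
            # best-matching unit = node whose weights lie closest to the sample
            bmu = min(range(n_nodes),
                      key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))
            for i in range(n_nodes):
                # Gaussian neighbourhood on the 1-D node lattice
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                weights[i] = [w + lr * h * (v - w) for w, v in zip(weights[i], x)]
    return weights

def assign(data, weights):
    return [min(range(len(weights)),
                key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))
            for x in data]

# two well-separated blobs should map to disjoint sets of nodes
blob_a = [(i * 0.01, 0.0) for i in range(10)]
blob_b = [(10.0 + i * 0.01, 10.0) for i in range(10)]
weights = train_ksom(blob_a + blob_b)
labels = assign(blob_a + blob_b, weights)
```

With well-separated data the trained nodes specialise, so the two blobs are served by different best-matching units.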

Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 786
Author(s):  
Yenny Villuendas-Rey ◽  
Eley Barroso-Cubas ◽  
Oscar Camacho-Nieto ◽  
Cornelio Yáñez-Márquez

Swarm intelligence has emerged as an active field for solving numerous machine-learning tasks. In this paper, we address the problem of clustering data with missing values, where the patterns are described by mixed (or hybrid) features. We introduce a generic modification to three swarm intelligence algorithms (Artificial Bee Colony, Firefly Algorithm, and Novel Bat Algorithm) and experimentally determine adequate parameter values for the three modified algorithms, with the purpose of applying them to the clustering task. We also provide an unbiased comparison among several metaheuristic-based clustering algorithms, concluding that the clusters obtained by our proposals are highly representative of the “natural structure” of the data.
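The shared skeleton of such swarm-based clustering methods can be sketched as follows: a population of candidate centroid sets is refined against a dissimilarity that tolerates mixed features and missing values (marked `None`). This is a generic population search, a simplified stand-in rather than the actual ABC/Firefly/Bat update rules, and the HEOM-style dissimilarity and toy data are assumptions:

```python
import random

def mixed_dissim(a, b):
    """HEOM-style dissimilarity for mixed features; None marks a missing value."""
    total = 0.0
    for x, y in zip(a, b):
        if x is None or y is None:
            total += 1.0                      # maximal penalty for a missing value
        elif isinstance(x, str) or isinstance(y, str):
            total += 0.0 if x == y else 1.0   # categorical: simple overlap
        else:
            total += abs(x - y)               # numeric: assumed pre-scaled to [0, 1]
    return total

def fitness(centroids, data):
    # total dissimilarity of each object to its nearest centroid
    return sum(min(mixed_dissim(x, c) for c in centroids) for x in data)

def swarm_cluster(data, k=2, agents=10, iters=100, seed=3):
    """Generic population-based search over centroid sets."""
    rng = random.Random(seed)
    pop = [rng.sample(data, k) for _ in range(agents)]
    best = min(pop, key=lambda c: fitness(c, data))
    for _ in range(iters):
        for i, cand in enumerate(pop):
            # local perturbation: move one centroid onto a random data object
            new = list(cand)
            new[rng.randrange(k)] = rng.choice(data)
            if fitness(new, data) < fitness(cand, data):
                pop[i] = new
        cur = min(pop, key=lambda c: fitness(c, data))
        if fitness(cur, data) < fitness(best, data):
            best = cur
    return best

data = [(0.1, "red", 0.2), (0.0, "red", None), (0.9, "blue", 0.8), (1.0, None, 0.9)]
best = swarm_cluster(data, k=2)
labels = [min(range(2), key=lambda j: mixed_dissim(x, best[j])) for x in data]
```

Despite the missing entries, the search recovers the two natural groups of objects.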


Clustering mixed and incomplete data has been a frequent goal of recent approaches because such data commonly appear in soft-science problems. However, there is a lack of studies evaluating the performance of clustering algorithms on this kind of data. In this paper we present an experimental study of the performance of seven clustering algorithms, each based on one of three techniques: partitional, hierarchical, or metaheuristic. All methods were run on 15 databases from the UCI Machine Learning Repository with mixed and incomplete data descriptions. In external cluster validation using the Entropy and V-Measure indices, the metaheuristic-based algorithms showed the best results. We therefore recommend metaheuristic-based clustering algorithms for clustering data with mixed and incomplete descriptions.
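The external validation indices mentioned can be computed directly from label entropies. Below is a minimal V-Measure implementation following Rosenberg and Hirschberg's definition (harmonic mean of homogeneity and completeness, here with natural logarithms):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def cond_entropy(labels, given):
    """Conditional entropy H(labels | given)."""
    n = len(labels)
    h = 0.0
    for g in set(given):
        sub = [l for l, v in zip(labels, given) if v == g]
        h += (len(sub) / n) * entropy(sub)
    return h

def v_measure(truth, pred):
    """Harmonic mean of homogeneity and completeness."""
    h_c, h_k = entropy(truth), entropy(pred)
    homogeneity = 1.0 if h_c == 0 else 1.0 - cond_entropy(truth, pred) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - cond_entropy(pred, truth) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# a perfect clustering scores 1.0 regardless of how the cluster ids are permuted
assert v_measure([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
```

Because the score compares partitions rather than label names, it is suitable for external validation of any of the seven algorithms against the reference classes.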


2016 ◽  
Vol 69 (5) ◽  
pp. 1143-1153 ◽  
Author(s):  
Marta Wlodarczyk–Sielicka ◽  
Andrzej Stateczny

An electronic navigational chart is a major source of information for the navigator. The component that contributes most significantly to the safety of navigation on water is the information on the depth of an area. For the purposes of this article, the authors use data obtained by the interferometric sonar GeoSwath Plus, collected in the area of the Port of Szczecin. The samples constitute large sets of data, and data reduction is a procedure for reducing the size of a data set to make it easier and more effective to analyse. The main objective of the authors is the development of a new reduction algorithm for bathymetric data. Clustering of the data is the first part of the algorithm; the next step is generalisation of the bathymetric data. This article presents a comparison and analysis of the results of clustering bathymetric data using the following selected methods: the K-means clustering algorithm, traditional hierarchical clustering algorithms, and the self-organising map (using artificial neural networks).
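The clustering-then-generalisation pipeline can be illustrated on synthetic soundings. The sketch below clusters (x, y) positions with plain k-means (one of the three methods compared; hierarchical clustering and the SOM are omitted) and then keeps the shallowest depth per cluster as a shoal-biased reduction rule. The rule, grid, and parameters are illustrative assumptions, not the authors' exact algorithm:

```python
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=7):
    """Plain k-means over (x, y) sounding positions."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def reduce_soundings(soundings, k):
    """Cluster positions, then keep the shallowest sounding per cluster
    (a shoal-biased generalisation step, assumed here for illustration)."""
    labels = kmeans([(x, y) for x, y, _ in soundings], k)
    kept = {}
    for s, l in zip(soundings, labels):
        if l not in kept or s[2] < kept[l][2]:
            kept[l] = s
    return list(kept.values())

# a dense grid of (x, y, depth) samples reduced to at most 4 representatives
grid = [(x * 0.1, y * 0.1, 10.0 + x + y) for x in range(10) for y in range(10)]
reduced = reduce_soundings(grid, k=4)
```

Keeping the minimum depth per cluster preserves the navigationally critical shoal information while shrinking the data set.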


Author(s):  
Slawomir T. Wierzchon

Standard clustering algorithms employ fixed assumptions about data structure. For instance, the k-means algorithm is applicable to spherical and linearly separable data clouds, and when the data come from a multidimensional normal distribution, the so-called EM algorithm can be applied. In practice, however, the structure underlying a given set of observations is too complex to fit a single assumption. We can split these assumptions into manageable hypotheses justifying the use of particular clustering algorithms, and then aggregate the partial results into a meaningful description of the data. Consensus clustering performs this task. In this article we clarify the idea of consensus clustering and present a conceptual frame for such a compound analysis. Next, the basic approaches to implementing the consensus procedure are given. Finally, some new directions in this field are mentioned.
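One standard way to implement the aggregation step is evidence accumulation over a co-association matrix: run a base clusterer several times, count how often each pair of objects lands in the same cluster, then link pairs that agree in a majority of runs. This is one common consensus procedure among those the article surveys; the base clusterer (k-means) and toy data are assumptions:

```python
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, iters=20):
    """Base partitioner: plain k-means with seed-dependent initialisation."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def consensus(points, k, runs=10):
    """Co-association consensus: link pairs that co-cluster in a majority of runs."""
    n = len(points)
    co = [[0] * n for _ in range(n)]
    for r in range(runs):
        labels = kmeans(points, k, seed=r)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1
    # union-find over majority co-associations yields the consensus partition
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(n):
            if co[i][j] > runs / 2:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

blob_a = [(i * 0.01, 0.0) for i in range(10)]
blob_b = [(5.0 + i * 0.01, 5.0) for i in range(10)]
final = consensus(blob_a + blob_b, k=2)
```

The consensus partition is stable even though each base run starts from a different random initialisation.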


Author(s):  
Deepali Virmani ◽  
Nikita Jain ◽  
Ketan Parikh ◽  
Shefali Upadhyaya ◽  
Abhishek Srivastav

This article describes how data become relevant when they can be organized, linked with other data, and grouped into clusters. Clustering is the process of organizing a given set of objects into disjoint groups called clusters. There are a number of clustering algorithms, such as k-means, k-medoids, and normalized k-means, so the focus remains on the efficiency and accuracy of these algorithms, on the time clustering takes, and on reducing overlap between clusters. K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. It partitions data into K clusters around randomly chosen initial centroids, and its reliance on numeric values prevents it from clustering real-world data containing categorical attributes. Poor selection of initial centroids can also result in poor clustering. This article proposes a variant of k-means that achieves better clustering, reduced overlap, and lower clustering time by selecting the initial centres deliberately and normalizing the data.
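The two modifications named in the abstract, initial-centre selection and normalization, can be sketched as follows. The seeding rule here is a deterministic farthest-first traversal, one plausible reading of "selecting initial centres"; the article's exact rule may differ:

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalise(data):
    """Min-max scale each feature to [0, 1] (the 'normalizing the data' step)."""
    cols = list(zip(*data))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi)) for row in data]

def farthest_first_centres(points, k):
    """Deterministic seeding: start at the point nearest the overall mean and
    repeatedly add the point farthest from the centres chosen so far."""
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    centres = [min(points, key=lambda p: sqdist(p, mean))]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sqdist(p, c) for c in centres)))
    return centres

def kmeans(points, k, iters=20):
    centres = farthest_first_centres(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# raw features on wildly different scales; normalisation lets both contribute
raw = [(float(i), 1000.0) for i in range(5)] + [(100.0 + i, 1010.0) for i in range(5)]
labels = kmeans(normalise(raw), k=2)
```

Deterministic seeding removes the run-to-run variance that random centroid choice causes, and the min-max step keeps a large-magnitude feature from dominating the distance.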


2019 ◽  
Vol 33 (10) ◽  
pp. 1950086
Author(s):  
Qi Wang ◽  
Yinhe Wang ◽  
Zilin Gao ◽  
Lili Zhang ◽  
Wenli Wang

This paper investigates the clustering problem for generalized signed networks. By rigorous derivation, a necessary and sufficient condition for clustering the nodes of generalized signed networks is proposed. To obtain this condition, the concept of a friends group is first introduced for the nodes, based on the signs of their links. The unprivileged network is then defined by employing the concepts of the structural hole and the broker. Compared with existing clustering algorithms, the outstanding advantage of this approach is that only the positive, negative, or zero sign of the links is required, regardless of their density or sparsity. We prove mathematically that a generalized signed network is classifiable if and only if it is an unprivileged network. Finally, two examples with associated numerical simulations are presented to generate unprivileged networks.
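The friends-group idea can be illustrated with a simplified clusterability check: merge nodes joined by positive links into groups, then require that no negative link falls inside a group. This mirrors weak structural balance and is only a toy reading of the sign-based condition, not the paper's full unprivileged-network criterion:

```python
def friends_groups(n, pos_edges):
    """Merge nodes linked by a positive sign into friends groups (union-find)."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for u, v in pos_edges:
        parent[find(u)] = find(v)
    return [find(i) for i in range(n)]

def clusterable(n, pos_edges, neg_edges):
    """Simplified check: the network is classifiable when no negative link
    joins two members of the same friends group."""
    g = friends_groups(n, pos_edges)
    return all(g[u] != g[v] for u, v in neg_edges)

# nodes 0-1-2 are mutual friends, 3-4 are friends; the groups are enemies
assert clusterable(5, [(0, 1), (1, 2), (3, 4)], [(0, 3), (2, 4)])
# a negative link inside a friends group breaks clusterability
assert not clusterable(3, [(0, 1), (1, 2)], [(0, 2)])
```

Note that only the signs of the links enter the check, never their density, which is the property the abstract highlights.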


2013 ◽  
Vol 2013 ◽  
pp. 1-10 ◽  
Author(s):  
Mingwei Leng ◽  
Jianjun Cheng ◽  
Jinjin Wang ◽  
Zhengquan Zhang ◽  
Hanhai Zhou ◽  
...  

The accuracy of most existing semisupervised clustering algorithms based on a small labeled dataset is low when dealing with multidensity and imbalanced datasets, and labeling data is quite expensive and time consuming in many real-world applications. This paper focuses on active data selection and semisupervised clustering in multidensity and imbalanced datasets and proposes an active semisupervised clustering algorithm. The proposed algorithm uses an active mechanism for data selection to minimize the amount of labeled data, and it utilizes multiple thresholds to expand the labeled datasets on multidensity and imbalanced datasets. Three standard datasets and one synthetic dataset are used to demonstrate the proposed algorithm, and the experimental results show that it achieves higher accuracy and more stable performance than other clustering and semisupervised clustering algorithms, especially when the datasets are multidensity and imbalanced.
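The multithreshold expansion step can be sketched with per-class link thresholds, so a sparse cluster may grow through looser links than a dense one. The seeds, thresholds, and chain-expansion rule below are illustrative assumptions, not the paper's exact mechanism:

```python
def expand(points, seeds, thresholds):
    """Grow labels outward from labeled seeds.

    seeds: {point index: class}; thresholds: {class: max link distance}.
    A point adopts a class when it lies within that class's threshold of an
    already-labeled point, so each density regime uses its own link length."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    labels = dict(seeds)
    changed = True
    while changed:
        changed = False
        for i, p in enumerate(points):
            if i in labels:
                continue
            for j, c in list(labels.items()):
                if dist(p, points[j]) <= thresholds[c]:
                    labels[i] = c
                    changed = True
                    break
    return labels

dense = [(i * 0.1, 0.0) for i in range(10)]        # tightly packed cluster
sparse = [(5.0 + i * 1.0, 0.0) for i in range(5)]  # loosely packed cluster
pts = dense + sparse
labels = expand(pts, {0: "dense", 10: "sparse"}, {"dense": 0.15, "sparse": 1.2})
```

A single global threshold would either fail to connect the sparse cluster or bleed the dense one into it; the per-class thresholds label both correctly from one seed each.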


Author(s):  
Hongkang Yang ◽  
Esteban G Tabak

The clustering problem, and more generally latent factor discovery or latent space inference, is formulated in terms of the Wasserstein barycenter problem from optimal transport. The objective proposed is the maximization of the variability attributable to class, further characterized as the minimization of the variance of the Wasserstein barycenter. Existing theory, which constrains the transport maps to rigid translations, is extended to affine transformations. The resulting non-parametric clustering algorithms include $k$-means as a special case and exhibit more robust performance. A continuous version of these algorithms discovers continuous latent variables and generalizes principal curves. The strength of these algorithms is demonstrated by tests on both artificial and real-world data sets.
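The claim that $k$-means arises as a special case can be sketched with the usual variance decomposition (a reconstruction of the idea, not the authors' derivation). When every class-conditional distribution is a rigid translate of a common shape, the squared Wasserstein distance from class $k$ to the barycenter reduces to $\lVert m_k-\bar m\rVert^2$, the squared distance between the class mean $m_k$ and the overall mean $\bar m$, so with hard assignments $C_1,\dots,C_K$ of the $n$ points ($n_k=\lvert C_k\rvert$):

```latex
\[
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\lVert x_i-\bar m\rVert^2}_{\text{total variance (fixed)}}
= \underbrace{\frac{1}{n}\sum_{k=1}^{K}\sum_{i\in C_k}\lVert x_i-m_k\rVert^2}_{\text{within class}}
+ \underbrace{\sum_{k=1}^{K}\frac{n_k}{n}\lVert m_k-\bar m\rVert^2}_{\text{attributable to class}}
\]
% The total is independent of the assignment, so maximizing the class-attributable
% (barycenter-variance) term is equivalent to
\[
\min_{C_1,\dots,C_K}\;\sum_{k=1}^{K}\sum_{i\in C_k}\lVert x_i-m_k\rVert^2 ,
\]
% which is exactly the k-means objective.
```

Relaxing the translations to general affine maps is what yields the more robust variants described above.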


2021 ◽  
Author(s):  
Christian Nordahl ◽  
Veselka Boeva ◽  
Håkan Grahn ◽  
Marie Persson Netz

Data has become an integral part of our society in recent years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements cannot be met in the data-stream clustering scenario, where data arrive and need to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities in clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.
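The core constraint, clustering without ever storing the full stream, can be illustrated with an online centroid updater: each arriving point joins the nearest centroid within a radius or starts a new cluster, and centroids are updated incrementally. This is a toy stand-in for EvolveCluster, which additionally splits and merges clusters across stream segments; the radius and data are assumptions:

```python
class StreamClusterer:
    """Minimal online clustering over a data stream.

    A point joins the nearest centroid within `radius`, otherwise it founds a
    new cluster. Centroids are running means updated in O(1) per point, so the
    stream itself is never retained."""

    def __init__(self, radius):
        self.radius = radius
        self.centroids = []   # running mean of each cluster
        self.counts = []      # number of points absorbed by each cluster

    def add(self, p):
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        if self.centroids:
            j = min(range(len(self.centroids)),
                    key=lambda i: dist(p, self.centroids[i]))
            if dist(p, self.centroids[j]) <= self.radius:
                n = self.counts[j] + 1
                # incremental mean update: c += (x - c) / n
                self.centroids[j] = tuple(c + (x - c) / n
                                          for c, x in zip(self.centroids[j], p))
                self.counts[j] = n
                return j
        self.centroids.append(tuple(p))
        self.counts.append(1)
        return len(self.centroids) - 1

sc = StreamClusterer(radius=1.0)
stream = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.2)]
assignments = [sc.add(p) for p in stream]
```

A second cluster appears only when the stream drifts outside the radius of the first, which is the minimal form of the adaptation the paper evaluates.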


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1271
Author(s):  
Hoyeon Jeong ◽  
Yoonbee Kim ◽  
Yi-Sue Jung ◽  
Dae Ryong Kang ◽  
Young-Rae Cho

Functional modules can be predicted using genome-wide protein–protein interactions (PPIs) from a systematic perspective. Various graph clustering algorithms have been applied to PPI networks for this task. In particular, the detection of overlapping clusters is necessary because a protein is involved in multiple functions under different conditions. Graph entropy (GE) is a novel metric for assessing the quality of clusters in a large, complex network. In this study, the unweighted and weighted GE algorithms are evaluated to validate their ability to predict functional modules. To measure clustering accuracy, the clustering results are compared to protein complexes and Gene Ontology (GO) annotations as references. We demonstrate that the GE algorithm is more accurate for overlapping clusters than the other competitive methods. Moreover, we confirm the biological feasibility of the proteins that occur most frequently in the set of identified clusters. Finally, novel proteins for the additional annotation of GO terms are revealed.
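Unweighted graph entropy can be computed from the fraction of each node's neighbours that lie inside a candidate cluster. The sketch below follows the common GE formulation for cluster quality, where lower entropy indicates a denser, better-separated cluster; the weighted variant would weight the link counts. The tiny graph is an illustrative assumption:

```python
import math

def node_entropy(p):
    """Binary entropy of the probability that a neighbour lies inside the cluster."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def graph_entropy(adj, cluster):
    """Sum of node entropies for a candidate cluster; lower is better."""
    total = 0.0
    for v, nbrs in adj.items():
        if not nbrs:
            continue
        inner = sum(1 for u in nbrs if u in cluster)
        total += node_entropy(inner / len(nbrs))
    return total

# a triangle {a, b, c} weakly tied to d: the triangle is the cleaner cluster
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
good = graph_entropy(adj, {"a", "b", "c"})   # well-chosen cluster
worse = graph_entropy(adj, {"b", "c", "d"})  # off-by-one candidate
```

A seed-and-grow clusterer can use this score directly: add or remove boundary proteins whenever the move lowers the cluster's graph entropy.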

