An Improved Pheromone-Based Kohonen Self-Organising Map in Clustering and Visualising Balanced and Imbalanced Datasets

2020 ◽  
Vol 20 (No.4) ◽  
pp. 651-676
Author(s):  
Rubiyah Yusof ◽  
Azlin Ahmad ◽  
Nor Saradatul Akmar Zulkifli ◽  
Mohd Najib Ismail

The data distribution issue remains an unsolved clustering problem in data mining, especially when dealing with imbalanced datasets. The Kohonen Self-Organising Map (KSOM) is a well-known clustering algorithm that can solve various problems without a pre-defined number of clusters. However, like other clustering algorithms, it requires sufficient data for its unsupervised learning process. An inadequate amount of class-labelled data in a dataset significantly affects the learning process, leading to inefficient and unreliable results. Numerous studies have hybridised and optimised the KSOM algorithm with various optimisation techniques. Unfortunately, some problems remain unsolved, especially separation boundaries and overlapping clusters. Therefore, this research proposes an improved pheromone-based KSOM (PKSOM) algorithm, known as iPKSOM, to solve these problems. Six datasets, i.e. Iris, Seed, Glass, Titanic, WDBC, and Tropical Wood, were chosen to investigate the effectiveness of the iPKSOM algorithm, and all results were compared with those of the original KSOM. The modification significantly improved the clustering process by refining the scatter of clustered data and reducing overlapping clusters. The proposed algorithm can therefore be applied to clustering other complex datasets, such as high-dimensional and streaming data.
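The KSOM baseline that iPKSOM improves upon can be sketched in a few lines. The following is a minimal standard Kohonen SOM on a 1-D node lattice, not the authors' pheromone-based variant; the node count, decay schedules, and toy data are illustrative assumptions:

```python
import math
import random

def train_ksom(data, n_nodes=4, epochs=100, lr0=0.5, sigma0=2.0, seed=1):
    """Train a minimal 1-D Kohonen SOM on 2-D points (plain KSOM, not iPKSOM)."""
    rng = random.Random(seed)
    # initialise node weights to randomly chosen data points
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                   # decaying learning rate
        sigma = max(sigma0 * (1 - t / epochs), 0.5)   # shrinking neighbourhood
        for x in data:
            # best-matching unit = node whose weights lie closest to the sample
            bmu = min(range(n_nodes),
                      key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))
            for i in range(n_nodes):
                # Gaussian neighbourhood on the 1-D node lattice
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                weights[i] = [w + lr * h * (v - w) for w, v in zip(weights[i], x)]
    return weights

def assign(data, weights):
    return [min(range(len(weights)),
                key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))
            for x in data]

# two well-separated blobs should map to disjoint sets of nodes
blob_a = [(i * 0.01, 0.0) for i in range(10)]
blob_b = [(10.0 + i * 0.01, 10.0) for i in range(10)]
weights = train_ksom(blob_a + blob_b)
labels = assign(blob_a + blob_b, weights)
```

With well-separated data the trained nodes specialise, so the two blobs are served by different best-matching units.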

Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 786
Author(s):  
Yenny Villuendas-Rey ◽  
Eley Barroso-Cubas ◽  
Oscar Camacho-Nieto ◽  
Cornelio Yáñez-Márquez

Swarm intelligence has emerged as an active field for solving numerous machine-learning tasks. In this paper, we address the problem of clustering data with missing values, where the patterns are described by mixed (or hybrid) features. We introduce a generic modification to three swarm intelligence algorithms (Artificial Bee Colony, Firefly Algorithm, and Novel Bat Algorithm) and experimentally determine adequate parameter values for the three modified algorithms, with the purpose of applying them to the clustering task. We also provide an unbiased comparison among several metaheuristic-based clustering algorithms, concluding that the clusters obtained by our proposals are highly representative of the “natural structure” of the data.
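The shared skeleton of such swarm-based clustering methods can be sketched as follows: a population of candidate centroid sets is refined against a dissimilarity that tolerates mixed features and missing values (marked `None`). This is a generic population search, a simplified stand-in rather than the actual ABC/Firefly/Bat update rules, and the HEOM-style dissimilarity and toy data are assumptions:

```python
import random

def mixed_dissim(a, b):
    """HEOM-style dissimilarity for mixed features; None marks a missing value."""
    total = 0.0
    for x, y in zip(a, b):
        if x is None or y is None:
            total += 1.0                      # maximal penalty for a missing value
        elif isinstance(x, str) or isinstance(y, str):
            total += 0.0 if x == y else 1.0   # categorical: simple overlap
        else:
            total += abs(x - y)               # numeric: assumed pre-scaled to [0, 1]
    return total

def fitness(centroids, data):
    # total dissimilarity of each object to its nearest centroid
    return sum(min(mixed_dissim(x, c) for c in centroids) for x in data)

def swarm_cluster(data, k=2, agents=10, iters=100, seed=3):
    """Generic population-based search over centroid sets."""
    rng = random.Random(seed)
    pop = [rng.sample(data, k) for _ in range(agents)]
    best = min(pop, key=lambda c: fitness(c, data))
    for _ in range(iters):
        for i, cand in enumerate(pop):
            # local perturbation: move one centroid onto a random data object
            new = list(cand)
            new[rng.randrange(k)] = rng.choice(data)
            if fitness(new, data) < fitness(cand, data):
                pop[i] = new
        cur = min(pop, key=lambda c: fitness(c, data))
        if fitness(cur, data) < fitness(best, data):
            best = cur
    return best

data = [(0.1, "red", 0.2), (0.0, "red", None), (0.9, "blue", 0.8), (1.0, None, 0.9)]
best = swarm_cluster(data, k=2)
labels = [min(range(2), key=lambda j: mixed_dissim(x, best[j])) for x in data]
```

Despite the missing entries, the search recovers the two natural groups of objects.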


Clustering mixed and incomplete data has been a frequent goal of recent approaches because such data commonly appear in soft-science problems. However, there is a lack of studies evaluating the performance of clustering algorithms on this kind of data. In this paper we present an experimental study of the performance of seven clustering algorithms, each based on one of three techniques: partitional, hierarchical, or metaheuristic. All methods were run on 15 databases from the UCI Machine Learning Repository with mixed and incomplete data descriptions. In external cluster validation using the Entropy and V-Measure indices, the metaheuristic-based algorithms showed the best results. We therefore recommend metaheuristic-based clustering algorithms for clustering data with mixed and incomplete descriptions.
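The external validation indices mentioned can be computed directly from label entropies. Below is a minimal V-Measure implementation following Rosenberg and Hirschberg's definition (harmonic mean of homogeneity and completeness, here with natural logarithms):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def cond_entropy(labels, given):
    """Conditional entropy H(labels | given)."""
    n = len(labels)
    h = 0.0
    for g in set(given):
        sub = [l for l, v in zip(labels, given) if v == g]
        h += (len(sub) / n) * entropy(sub)
    return h

def v_measure(truth, pred):
    """Harmonic mean of homogeneity and completeness."""
    h_c, h_k = entropy(truth), entropy(pred)
    homogeneity = 1.0 if h_c == 0 else 1.0 - cond_entropy(truth, pred) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - cond_entropy(pred, truth) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# a perfect clustering scores 1.0 regardless of how the cluster ids are permuted
assert v_measure([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
```

Because the score compares partitions rather than label names, it is suitable for external validation of any of the seven algorithms against the reference classes.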


2016 ◽  
Vol 69 (5) ◽  
pp. 1143-1153 ◽  
Author(s):  
Marta Wlodarczyk–Sielicka ◽  
Andrzej Stateczny

An electronic navigational chart is a major source of information for the navigator. The component that contributes most significantly to the safety of navigation on water is the information on the depth of an area. For the purposes of this article, the authors use data obtained by the interferometric sonar GeoSwath Plus, collected in the area of the Port of Szczecin. The samples constitute large sets of data, and data reduction is a procedure for reducing the size of a data set to make it easier and more effective to analyse. The main objective of the authors is the development of a new reduction algorithm for bathymetric data. Clustering of the data is the first part of the algorithm; the next step is generalisation of the bathymetric data. This article presents a comparison and analysis of the results of clustering bathymetric data using the following selected methods: the K-means clustering algorithm, traditional hierarchical clustering algorithms, and the self-organising map (using artificial neural networks).
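The clustering-then-generalisation pipeline can be illustrated on synthetic soundings. The sketch below clusters (x, y) positions with plain k-means (one of the three methods compared; hierarchical clustering and the SOM are omitted) and then keeps the shallowest depth per cluster as a shoal-biased reduction rule. The rule, grid, and parameters are illustrative assumptions, not the authors' exact algorithm:

```python
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=7):
    """Plain k-means over (x, y) sounding positions."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def reduce_soundings(soundings, k):
    """Cluster positions, then keep the shallowest sounding per cluster
    (a shoal-biased generalisation step, assumed here for illustration)."""
    labels = kmeans([(x, y) for x, y, _ in soundings], k)
    kept = {}
    for s, l in zip(soundings, labels):
        if l not in kept or s[2] < kept[l][2]:
            kept[l] = s
    return list(kept.values())

# a dense grid of (x, y, depth) samples reduced to at most 4 representatives
grid = [(x * 0.1, y * 0.1, 10.0 + x + y) for x in range(10) for y in range(10)]
reduced = reduce_soundings(grid, k=4)
```

Keeping the minimum depth per cluster preserves the navigationally critical shoal information while shrinking the data set.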


Author(s):  
Slawomir T. Wierzchon

Standard clustering algorithms employ fixed assumptions about data structure. For instance, the k-means algorithm is applicable to spherical and linearly separable data clouds, and when the data come from a multidimensional normal distribution, the so-called EM algorithm can be applied. In practice, however, the structure underlying a given set of observations is too complex to fit a single assumption. We can split these assumptions into manageable hypotheses justifying the use of particular clustering algorithms, and then aggregate the partial results into a meaningful description of the data. Consensus clustering performs this task. In this article we clarify the idea of consensus clustering and present a conceptual frame for such a compound analysis. Next, the basic approaches to implementing the consensus procedure are given. Finally, some new directions in this field are mentioned.
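One standard way to implement the aggregation step is evidence accumulation over a co-association matrix: run a base clusterer several times, count how often each pair of objects lands in the same cluster, then link pairs that agree in a majority of runs. This is one common consensus procedure among those the article surveys; the base clusterer (k-means) and toy data are assumptions:

```python
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, iters=20):
    """Base partitioner: plain k-means with seed-dependent initialisation."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def consensus(points, k, runs=10):
    """Co-association consensus: link pairs that co-cluster in a majority of runs."""
    n = len(points)
    co = [[0] * n for _ in range(n)]
    for r in range(runs):
        labels = kmeans(points, k, seed=r)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1
    # union-find over majority co-associations yields the consensus partition
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(n):
            if co[i][j] > runs / 2:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

blob_a = [(i * 0.01, 0.0) for i in range(10)]
blob_b = [(5.0 + i * 0.01, 5.0) for i in range(10)]
final = consensus(blob_a + blob_b, k=2)
```

The consensus partition is stable even though each base run starts from a different random initialisation.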


Author(s):  
Deepali Virmani ◽  
Nikita Jain ◽  
Ketan Parikh ◽  
Shefali Upadhyaya ◽  
Abhishek Srivastav

This article describes how data become relevant when they can be organized, linked with other data, and grouped into clusters. Clustering is the process of organizing a given set of objects into disjoint groups called clusters. There are a number of clustering algorithms, such as k-means, k-medoids, and normalized k-means, so the focus remains on the efficiency and accuracy of these algorithms, on the time clustering takes, and on reducing overlap between clusters. K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. It partitions data into K clusters around randomly chosen initial centroids, and its reliance on numeric values prevents it from clustering real-world data containing categorical attributes. Poor selection of initial centroids can also result in poor clustering. This article proposes a variant of k-means that achieves better clustering, reduced overlap, and lower clustering time by selecting the initial centres deliberately and normalizing the data.
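The two modifications named in the abstract, initial-centre selection and normalization, can be sketched as follows. The seeding rule here is a deterministic farthest-first traversal, one plausible reading of "selecting initial centres"; the article's exact rule may differ:

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalise(data):
    """Min-max scale each feature to [0, 1] (the 'normalizing the data' step)."""
    cols = list(zip(*data))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi)) for row in data]

def farthest_first_centres(points, k):
    """Deterministic seeding: start at the point nearest the overall mean and
    repeatedly add the point farthest from the centres chosen so far."""
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    centres = [min(points, key=lambda p: sqdist(p, mean))]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sqdist(p, c) for c in centres)))
    return centres

def kmeans(points, k, iters=20):
    centres = farthest_first_centres(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# raw features on wildly different scales; normalisation lets both contribute
raw = [(float(i), 1000.0) for i in range(5)] + [(100.0 + i, 1010.0) for i in range(5)]
labels = kmeans(normalise(raw), k=2)
```

Deterministic seeding removes the run-to-run variance that random centroid choice causes, and the min-max step keeps a large-magnitude feature from dominating the distance.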


2019 ◽  
Vol 33 (10) ◽  
pp. 1950086
Author(s):  
Qi Wang ◽  
Yinhe Wang ◽  
Zilin Gao ◽  
Lili Zhang ◽  
Wenli Wang

This paper investigates the clustering problem for generalized signed networks. By rigorous derivation, a necessary and sufficient condition for clustering the nodes of generalized signed networks is proposed. To obtain this condition, the concept of a friends group is first introduced for the nodes, based on the signs of their links. The unprivileged network is then defined by employing the concepts of the structural hole and the broker. Compared with existing clustering algorithms, the outstanding advantage of this approach is that only the positive, negative, or zero sign of the links is required, regardless of their density or sparsity. We prove mathematically that a generalized signed network is classifiable if and only if it is an unprivileged network. Finally, two examples with associated numerical simulations are presented to generate unprivileged networks.
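The friends-group idea can be illustrated with a simplified clusterability check: merge nodes joined by positive links into groups, then require that no negative link falls inside a group. This mirrors weak structural balance and is only a toy reading of the sign-based condition, not the paper's full unprivileged-network criterion:

```python
def friends_groups(n, pos_edges):
    """Merge nodes linked by a positive sign into friends groups (union-find)."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for u, v in pos_edges:
        parent[find(u)] = find(v)
    return [find(i) for i in range(n)]

def clusterable(n, pos_edges, neg_edges):
    """Simplified check: the network is classifiable when no negative link
    joins two members of the same friends group."""
    g = friends_groups(n, pos_edges)
    return all(g[u] != g[v] for u, v in neg_edges)

# nodes 0-1-2 are mutual friends, 3-4 are friends; the groups are enemies
assert clusterable(5, [(0, 1), (1, 2), (3, 4)], [(0, 3), (2, 4)])
# a negative link inside a friends group breaks clusterability
assert not clusterable(3, [(0, 1), (1, 2)], [(0, 2)])
```

Note that only the signs of the links enter the check, never their density, which is the property the abstract highlights.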


2013 ◽  
Vol 2013 ◽  
pp. 1-10 ◽  
Author(s):  
Mingwei Leng ◽  
Jianjun Cheng ◽  
Jinjin Wang ◽  
Zhengquan Zhang ◽  
Hanhai Zhou ◽  
...  

The accuracy of most existing semisupervised clustering algorithms based on a small labeled dataset is low when dealing with multidensity and imbalanced datasets, and labeling data is quite expensive and time consuming in many real-world applications. This paper focuses on active data selection and semisupervised clustering in multidensity and imbalanced datasets and proposes an active semisupervised clustering algorithm. The proposed algorithm uses an active mechanism for data selection to minimize the amount of labeled data, and it utilizes multiple thresholds to expand the labeled datasets on multidensity and imbalanced datasets. Three standard datasets and one synthetic dataset are used to demonstrate the proposed algorithm, and the experimental results show that it achieves higher accuracy and more stable performance than other clustering and semisupervised clustering algorithms, especially when the datasets are multidensity and imbalanced.
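The multithreshold expansion step can be sketched with per-class link thresholds, so a sparse cluster may grow through looser links than a dense one. The seeds, thresholds, and chain-expansion rule below are illustrative assumptions, not the paper's exact mechanism:

```python
def expand(points, seeds, thresholds):
    """Grow labels outward from labeled seeds.

    seeds: {point index: class}; thresholds: {class: max link distance}.
    A point adopts a class when it lies within that class's threshold of an
    already-labeled point, so each density regime uses its own link length."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    labels = dict(seeds)
    changed = True
    while changed:
        changed = False
        for i, p in enumerate(points):
            if i in labels:
                continue
            for j, c in list(labels.items()):
                if dist(p, points[j]) <= thresholds[c]:
                    labels[i] = c
                    changed = True
                    break
    return labels

dense = [(i * 0.1, 0.0) for i in range(10)]        # tightly packed cluster
sparse = [(5.0 + i * 1.0, 0.0) for i in range(5)]  # loosely packed cluster
pts = dense + sparse
labels = expand(pts, {0: "dense", 10: "sparse"}, {"dense": 0.15, "sparse": 1.2})
```

A single global threshold would either fail to connect the sparse cluster or bleed the dense one into it; the per-class thresholds label both correctly from one seed each.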


Author(s):  
Hongkang Yang ◽  
Esteban G Tabak

The clustering problem, and more generally latent factor discovery or latent space inference, is formulated in terms of the Wasserstein barycenter problem from optimal transport. The objective proposed is the maximization of the variability attributable to class, further characterized as the minimization of the variance of the Wasserstein barycenter. Existing theory, which constrains the transport maps to rigid translations, is extended to affine transformations. The resulting non-parametric clustering algorithms include $k$-means as a special case and exhibit more robust performance. A continuous version of these algorithms discovers continuous latent variables and generalizes principal curves. The strength of these algorithms is demonstrated by tests on both artificial and real-world data sets.
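The claim that $k$-means arises as a special case can be sketched with the usual variance decomposition (a reconstruction of the idea, not the authors' derivation). When every class-conditional distribution is a rigid translate of a common shape, the squared Wasserstein distance from class $k$ to the barycenter reduces to $\lVert m_k-\bar m\rVert^2$, the squared distance between the class mean $m_k$ and the overall mean $\bar m$, so with hard assignments $C_1,\dots,C_K$ of the $n$ points ($n_k=\lvert C_k\rvert$):

```latex
\[
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\lVert x_i-\bar m\rVert^2}_{\text{total variance (fixed)}}
= \underbrace{\frac{1}{n}\sum_{k=1}^{K}\sum_{i\in C_k}\lVert x_i-m_k\rVert^2}_{\text{within class}}
+ \underbrace{\sum_{k=1}^{K}\frac{n_k}{n}\lVert m_k-\bar m\rVert^2}_{\text{attributable to class}}
\]
% The total is independent of the assignment, so maximizing the class-attributable
% (barycenter-variance) term is equivalent to
\[
\min_{C_1,\dots,C_K}\;\sum_{k=1}^{K}\sum_{i\in C_k}\lVert x_i-m_k\rVert^2 ,
\]
% which is exactly the k-means objective.
```

Relaxing the translations to general affine maps is what yields the more robust variants described above.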


2021 ◽  
Author(s):  
Christian Nordahl ◽  
Veselka Boeva ◽  
Håkan Grahn ◽  
Marie Persson Netz

Data has become an integral part of our society in recent years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements cannot be met in the data-stream clustering scenario, where data arrive and need to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities in clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.
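The core constraint, clustering without ever storing the full stream, can be illustrated with an online centroid updater: each arriving point joins the nearest centroid within a radius or starts a new cluster, and centroids are updated incrementally. This is a toy stand-in for EvolveCluster, which additionally splits and merges clusters across stream segments; the radius and data are assumptions:

```python
class StreamClusterer:
    """Minimal online clustering over a data stream.

    A point joins the nearest centroid within `radius`, otherwise it founds a
    new cluster. Centroids are running means updated in O(1) per point, so the
    stream itself is never retained."""

    def __init__(self, radius):
        self.radius = radius
        self.centroids = []   # running mean of each cluster
        self.counts = []      # number of points absorbed by each cluster

    def add(self, p):
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        if self.centroids:
            j = min(range(len(self.centroids)),
                    key=lambda i: dist(p, self.centroids[i]))
            if dist(p, self.centroids[j]) <= self.radius:
                n = self.counts[j] + 1
                # incremental mean update: c += (x - c) / n
                self.centroids[j] = tuple(c + (x - c) / n
                                          for c, x in zip(self.centroids[j], p))
                self.counts[j] = n
                return j
        self.centroids.append(tuple(p))
        self.counts.append(1)
        return len(self.centroids) - 1

sc = StreamClusterer(radius=1.0)
stream = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.2)]
assignments = [sc.add(p) for p in stream]
```

A second cluster appears only when the stream drifts outside the radius of the first, which is the minimal form of the adaptation the paper evaluates.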


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1271
Author(s):  
Hoyeon Jeong ◽  
Yoonbee Kim ◽  
Yi-Sue Jung ◽  
Dae Ryong Kang ◽  
Young-Rae Cho

Functional modules can be predicted using genome-wide protein–protein interactions (PPIs) from a systematic perspective. Various graph clustering algorithms have been applied to PPI networks for this task. In particular, the detection of overlapping clusters is necessary because a protein is involved in multiple functions under different conditions. Graph entropy (GE) is a novel metric for assessing the quality of clusters in a large, complex network. In this study, the unweighted and weighted GE algorithms are evaluated to validate their ability to predict functional modules. To measure clustering accuracy, the clustering results are compared to protein complexes and Gene Ontology (GO) annotations as references. We demonstrate that the GE algorithm is more accurate for overlapping clusters than the other competitive methods. Moreover, we confirm the biological feasibility of the proteins that occur most frequently in the set of identified clusters. Finally, novel proteins for the additional annotation of GO terms are revealed.
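Unweighted graph entropy can be computed from the fraction of each node's neighbours that lie inside a candidate cluster. The sketch below follows the common GE formulation for cluster quality, where lower entropy indicates a denser, better-separated cluster; the weighted variant would weight the link counts. The tiny graph is an illustrative assumption:

```python
import math

def node_entropy(p):
    """Binary entropy of the probability that a neighbour lies inside the cluster."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def graph_entropy(adj, cluster):
    """Sum of node entropies for a candidate cluster; lower is better."""
    total = 0.0
    for v, nbrs in adj.items():
        if not nbrs:
            continue
        inner = sum(1 for u in nbrs if u in cluster)
        total += node_entropy(inner / len(nbrs))
    return total

# a triangle {a, b, c} weakly tied to d: the triangle is the cleaner cluster
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
good = graph_entropy(adj, {"a", "b", "c"})   # well-chosen cluster
worse = graph_entropy(adj, {"b", "c", "d"})  # off-by-one candidate
```

A seed-and-grow clusterer can use this score directly: add or remove boundary proteins whenever the move lowers the cluster's graph entropy.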

