Clustering with Missing Features: A Density-Based Approach

Clustering is an important unsupervised knowledge acquisition method that divides unlabeled data into different groups \cite{atilgan2021efficient,d2021automatic}. Different clustering algorithms make different assumptions about cluster formation; thus, most clustering algorithms handle at least one particular type of data distribution well but may not handle other types of distributions well. For example, K-means identifies convex clusters well \cite{bai2017fast}, and DBSCAN is able to find clusters with similar densities \cite{DBSCAN}.

Therefore, most clustering methods may not work well on data distribution patterns that differ from the assumptions being made, or on a mixture of different distribution patterns. DBSCAN, for example, is sensitive to loosely connected points between dense natural clusters, as illustrated in Figure~\ref{figconnect}. The density of the connecting points shown in Figure~\ref{figconnect} differs from that of the natural clusters on both ends; however, DBSCAN with fixed global parameter values may wrongly assign these connecting points and consider all the data points in Figure~\ref{figconnect} as one big cluster.
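The bridging effect described above can be reproduced with a small sketch. The code below is a plain textbook DBSCAN written for illustration only; the toy points, the `eps` values, and `min_pts = 3` are assumptions chosen for this example and do not come from the paper. Two dense grids are joined by a sparse chain of three points: one global `eps` merges everything, while a slightly smaller one separates the grids but labels the chain as noise.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN with one global (eps, min_pts). Returns a label per point, -1 = noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
    return labels

# Hypothetical toy data: two dense 3x3 grids joined by a sparse chain.
left = [(x * 0.5, y * 0.5) for x in range(3) for y in range(3)]
right = [(4 + x * 0.5, y * 0.5) for x in range(3) for y in range(3)]
bridge = [(1.75, 0.5), (2.5, 0.5), (3.25, 0.5)]
points = left + bridge + right

merged = dbscan(points, eps=0.8, min_pts=3)     # chain links the grids: one cluster
separated = dbscan(points, eps=0.6, min_pts=3)  # grids split, chain becomes noise
```

Neither setting treats the dense grids and the sparse chain appropriately at the same time, which is the limitation of fixed global parameters that the abstract points to.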

VDPC: Variational Density Peak Clustering Algorithm

doi:10.36227/techrxiv.17597669, 2021

Author(s): Yizhang Wang, Di Wang, You Zhou, Chai Quek, Xiaofeng Zhang

Keyword(s): Clustering Algorithm, Cluster Formation, Clustering Algorithms, Data Distribution, Distribution Patterns, Clustering Methods, Density Peak, Global Parameter, Density Peak Clustering, Parameter Values

An Affinity Propagation Clustering Algorithm for Mixed Numeric and Categorical Datasets

Mathematical Problems in Engineering, doi:10.1155/2014/486075, 2014, Vol 2014, pp. 1-8, Cited By ~ 7

Author(s): Kang Zhang, Xingsheng Gu

Keyword(s): Real World, Clustering Algorithm, Clustering Algorithms, Affinity Propagation, Mixed Data, Clustering Methods, Affinity Propagation Clustering, Real World Datasets, Data Objects, Clustering Problems

Clustering has been widely used in different fields of science, technology, social science, and so forth. In the real world, numeric as well as categorical features are usually used to describe data objects. Accordingly, many clustering methods can process datasets that are either numeric or categorical. Recently, algorithms that can handle mixed data clustering problems have been developed. The affinity propagation (AP) algorithm is an exemplar-based clustering method that has demonstrated good performance on a wide variety of datasets. However, it has limitations in processing mixed datasets. In this paper, we propose a novel similarity measure for mixed-type datasets and an adaptive AP clustering algorithm to cluster them. Several real-world datasets are studied to evaluate the performance of the proposed algorithm. Comparisons with other clustering algorithms demonstrate that the proposed method works well not only on mixed datasets but also on purely numeric and purely categorical datasets.
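The abstract does not state the proposed similarity measure, so the sketch below swaps in a well-known stand-in: a Gower-style dissimilarity that averages range-normalized differences for numeric features and simple mismatches for categorical ones. The function name, the toy records, and the feature ranges are all hypothetical. Negating the dissimilarity gives a similarity matrix of the kind an exemplar-based method such as AP takes as input.

```python
def mixed_dissimilarity(a, b, numeric_ranges):
    """Gower-style dissimilarity in [0, 1] between two records.

    numeric_ranges maps the index of each numeric feature to its (min, max)
    over the dataset; every other feature is treated as categorical.
    """
    total = 0.0
    for k, (x, y) in enumerate(zip(a, b)):
        if k in numeric_ranges:
            lo, hi = numeric_ranges[k]
            # Range-normalized absolute difference; a constant feature contributes 0.
            total += abs(x - y) / (hi - lo) if hi > lo else 0.0
        else:
            # Simple matching for categorical values.
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Hypothetical records: (age, income, occupation, region).
records = [
    (25, 40_000, "student", "urban"),
    (27, 42_000, "student", "urban"),
    (60, 90_000, "retired", "rural"),
]
ranges = {0: (25, 60), 1: (40_000, 90_000)}

# Pairwise similarities (negated dissimilarities), larger means more alike.
similarity = [[-mixed_dissimilarity(a, b, ranges) for b in records] for a in records]
```

Here the first two records end up far more similar to each other than either is to the third, even though no single numeric or categorical metric covers all four features.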

A Density Peak Clustering Algorithm Based on the K-Nearest Shannon Entropy and Tissue-Like P System

Mathematical Problems in Engineering, doi:10.1155/2019/1713801, 2019, Vol 2019, pp. 1-13, Cited By ~ 3

Author(s): Zhenni Jiang, Xiyu Liu, Minghe Sun

Keyword(s): Shannon Entropy, Clustering Algorithm, P Systems, P System, Clustering Methods, K Nearest Neighbors, Density Peak, New Variant, Real World Datasets, Density Peak Clustering

This study proposes a novel method to calculate the density of data points based on K-nearest neighbors and Shannon entropy. A variant of tissue-like P systems with active membranes is introduced to realize the clustering process. The new variant of tissue-like P systems can improve the efficiency of the algorithm and reduce its computational complexity. Finally, experimental results on synthetic and real-world datasets show that the new method is more effective than other state-of-the-art clustering methods.

Adaptive density peak clustering based on dimensional-free and reverse k-nearest neighbors

Information Technology And Control, doi:10.5755/j01.itc.49.3.23405, 2020, Vol 49 (3), pp. 395-411

Author(s): Qiannan Wu, Qianqian Zhang, Ruizhi Sun, Li Li, Huiyu Mu, ...

Keyword(s): High Dimension, Clustering Algorithm, Local Density, Nearest Neighbors, Allocation Strategy, K Nearest Neighbor, K Nearest Neighbors, Density Peak, Real World Datasets, Density Peak Clustering

Cluster analysis is a crucial component of consumer behavior segmentation. The density peak clustering algorithm (DPC) is a novel density-based clustering method. However, it performs poorly on high-dimensional datasets and in estimating the local density of boundary points. In addition, its fault tolerance is limited by its one-step allocation strategy. To overcome these disadvantages, an adaptive density peak clustering algorithm based on dimensional-free and reverse k-nearest neighbors (ERK-DPC) is proposed in this paper. First, we compute the Euler cosine distance to obtain the similarity of sample points in high-dimensional datasets. Then, an adaptive local density formula is used to measure the local density of each point. Finally, the reverse k-nearest neighbor idea is added to a two-step allocation strategy, which assigns the remaining points accurately and effectively. The proposed clustering algorithm is evaluated on several benchmark datasets and real-world datasets. Comparisons on these benchmarks demonstrate that the ERK-DPC algorithm is superior to some state-of-the-art methods.

Density Peak Clustering Algorithm Considering Topological Features

Electronics, doi:10.3390/electronics9030459, 2020, Vol 9 (3), pp. 459

Author(s): Shuyi Lu, Yuanjie Zheng, Rong Luo, Weikuan Jia, Jian Lian, ...

Keyword(s): Clustering Algorithm, Clustering Algorithms, Original Data, Power Law Distribution, Density Peak, Topological Features, Density Peaks, Topological Characteristics, Density Peak Clustering, Clustering Data

Clustering algorithms play an important role in data mining and image processing. Breakthroughs in algorithm precision and method directly affect the direction and progress of subsequent research. At present, clustering algorithms are mainly divided into hierarchical, density-based, grid-based and model-based types. This paper mainly studies the Clustering by Fast Search and Find of Density Peaks (CFSFDP) algorithm, which is a new clustering method based on density. The algorithm has no iterative process, few parameters and high precision. However, we found that the clustering algorithm does not consider the original topological characteristics of the data. We also found that the clustering data are similar to the social network nodes mentioned in DeepWalk, which follow a power-law distribution. In this study, we consider the topological characteristics of the graph in the clustering algorithm. Based on previous studies, we propose a clustering algorithm that adds the topological characteristics of the original data to the CFSFDP algorithm. Our experimental results show that the clustering algorithm with topological features significantly improves the clustering effect, which indicates that the addition of topological features is effective and feasible.
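For readers unfamiliar with CFSFDP, the following sketch computes its two per-point quantities in the standard cutoff-kernel form: the local density rho (number of other points closer than a cutoff `d_c`) and delta (distance to the nearest point of higher density, or the largest distance for the densest point). It covers only the base algorithm, not the topological extension proposed in the paper; the toy points and `d_c = 0.6` are assumptions, and ranking by the product rho * delta is one common heuristic for picking centers.

```python
from math import dist

def density_peak_scores(points, d_c):
    """Return (rho, delta) for each point using the cutoff kernel of CFSFDP."""
    n = len(points)
    # rho_i: number of other points strictly closer than the cutoff distance d_c.
    rho = [
        sum(1 for j in range(n) if j != i and dist(points[i], points[j]) < d_c)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        # Distances to every point that is denser than point i.
        higher = [dist(points[i], points[j]) for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:
            # Densest point(s): by convention, use the largest distance to any point.
            delta.append(max(dist(points[i], points[j]) for j in range(n)))
    return rho, delta

# Hypothetical toy data: a 5-point group around (0, 0) and a 4-point group around (5, 5).
points = [
    (0.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5),
    (5.0, 5.0), (5.5, 5.0), (4.5, 5.0), (5.0, 5.5),
]
rho, delta = density_peak_scores(points, d_c=0.6)

# Candidate centers: points where both rho and delta are large.
ranked = sorted(range(len(points)), key=lambda i: rho[i] * delta[i], reverse=True)
centers = ranked[:2]
```

Both quantities come from a single pass over pairwise distances, which matches the abstract's description of an algorithm with no iterative process and few parameters.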

An Improved Version of K-medoid Algorithm using CRO

Modern Applied Science, doi:10.5539/mas.v12n2p116, 2018, Vol 12 (2), pp. 116, Cited By ~ 2

Author(s): Amjad Hudaib, Mohammad Khanafseh, Ola Surakhi

Keyword(s): Breast Cancer, Lung Cancer, Hybrid Algorithm, Clustering Algorithm, Clustering Algorithms, Data Repository, The Mean, Real World Datasets, Actual Point, Learning Data

Clustering is the process of grouping a set of patterns into disjoint clusters, where each cluster contains similar patterns. Many clustering algorithms have been proposed. K-medoid is a variant of k-mean that represents each cluster by an actual point in the cluster instead of the mean used in the k-mean algorithm, in order to handle outliers and reduce noise in the cluster. To enhance the performance of the k-medoid algorithm and obtain more accurate clusters, a hybrid algorithm is proposed that uses the CRO algorithm along with k-medoid. In this method, CRO is used to expand the search for the optimal medoid and to enhance clustering by producing more precise results. The performance of the new algorithm is evaluated by comparing its results with five clustering algorithms (k-mean, k-medoid, DB/rand/1/bin, a CRO-based clustering algorithm and hybrid CRO-k-mean) on four real-world datasets from the UCI machine learning data repository: Lung cancer, Iris, Breast cancer Wisconsin and Haberman’s survival. The results were compared based on different metrics and show that the proposed algorithm enhances clustering by giving more accurate results.
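The difference between a mean and a medoid as a cluster representative can be seen in a few lines. This sketch only illustrates that one idea; it does not implement the CRO hybrid or the full k-medoid iteration, and the toy cluster with its single outlier is hypothetical.

```python
from math import dist

def mean_point(cluster):
    """Coordinate-wise mean; usually not a member of the cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def medoid(cluster):
    """The cluster member with the smallest total distance to all members."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

# Hypothetical cluster: five nearby points plus one distant outlier.
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (1.5, 1.5), (50.0, 50.0)]

center_by_mean = mean_point(cluster)   # dragged toward the outlier
center_by_medoid = medoid(cluster)     # stays inside the dense group
```

The mean lands far from every real observation because the outlier pulls it away, while the medoid remains an actual point inside the dense group.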

Density Peak Clustering algorithm using knowledge learning-based fruit fly optimization

International Journal of Computers and Applications, doi:10.1080/1206212x.2018.1440340, 2018, Vol 40 (3), pp. 1-10

Author(s): Ruihong Zhou, Qiaoming Liu, Xuming Han, Limin Wang

Keyword(s): Clustering Algorithm, Fruit Fly, Density Peak, Fruit Fly Optimization, Density Peak Clustering, Knowledge Learning

A Fast Density Peak Clustering Algorithm Optimized by Uncertain Number Neighbors for Breast MR Image

Journal of Physics Conference Series, doi:10.1088/1742-6596/1229/1/012024, 2019, Vol 1229, pp. 012024, Cited By ~ 1

Author(s): Fan Hong, Yang Jing, Hou Cun-cun, Zhang Ke-zhen, Yao Ruo-xia

Keyword(s): Clustering Algorithm, MR Image, Density Peak, Breast MR, Density Peak Clustering

Nearest-Neighbour-Induced Isolation Similarity and Its Impact on Density-Based Clustering

Proceedings of the AAAI Conference on Artificial Intelligence, doi:10.1609/aaai.v33i01.33014755, 2019, Vol 33, pp. 4755-4762, Cited By ~ 3

Author(s): Xiaoyu Qin, Kai Ming Ting, Ye Zhu, Vincent CS Lee

Keyword(s): Clustering Algorithm, Distance Measure, Nearest Neighbour, Density Peak, Density Based Clustering, New Type, Density Peak Clustering, The Impact, First Time, Tree Method

A recent proposal of data-dependent similarity called Isolation Kernel/Similarity has enabled SVM to produce better classification accuracy. We identify shortcomings of using a tree method to implement Isolation Similarity and propose a nearest-neighbour method instead. We formally prove the characteristic of Isolation Similarity with the use of the proposed method. The impact of Isolation Similarity on density-based clustering is studied here. We show for the first time that the clustering performance of the classic density-based clustering algorithm DBSCAN can be significantly uplifted to surpass that of the recent density-peak clustering algorithm DP. This is achieved by simply replacing the distance measure with the proposed nearest-neighbour-induced Isolation Similarity in DBSCAN, leaving the rest of the procedure unchanged. A new type of cluster, called mass-connected clusters, is formally defined. We show that DBSCAN, which detects density-connected clusters, becomes one which detects mass-connected clusters when the distance measure is replaced with the proposed similarity. We also provide the condition under which mass-connected clusters can be detected while density-connected clusters cannot.
