Density Peak Clustering Based on Relative Density Optimization

Among numerous clustering algorithms, clustering by fast search and find of density peaks (DPC) is favoured because it is less affected by the shapes and density structures of the data set. However, DPC still shows limitations when clustering data sets with heterogeneous clusters, and it easily makes mistakes when assigning the remaining points. A new algorithm, density peak clustering based on relative density optimization (RDO-DPC), is proposed to address these problems and obtain better results. Using the neighborhood information of sample points, the proposed algorithm defines a relative density for the sample data and then searches for and recognizes density peaks of the nonhomogeneous distribution as cluster centers. A new assignment strategy is proposed to solve the abundance classification problem. Experiments on synthetic and real data sets show good performance of the proposed algorithm.
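
The abstract does not give the RDO-DPC formula, so the following is only a rough sketch of what a neighborhood-based relative density could look like. It assumes a k-nearest-neighbor density (inverse mean neighbor distance) divided by the mean density of the same neighbors; the function name `knn_relative_density`, the parameter `k`, and this particular definition are illustrative assumptions, not the paper's method.

```python
import math

def knn_relative_density(points, k):
    """Illustrative relative density (an assumption, not the RDO-DPC formula).

    Requires k < len(points).
    """
    eps = 1e-12  # guards against division by zero for duplicate points
    neighbors, density = [], []
    for i, p in enumerate(points):
        # k nearest neighbors of point i, as (distance, index) pairs
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        neighbors.append([j for _, j in ranked])
        mean_d = sum(d for d, _ in ranked) / k
        density.append(1.0 / (mean_d + eps))
    # Relative density: own density divided by the neighbors' mean density.
    return [
        density[i] / (sum(density[j] for j in neighbors[i]) / k)
        for i in range(len(points))
    ]
```

Under this assumed definition, the center of a sparse cluster and the center of a dense cluster of the same shape receive nearly the same relative density, which is the kind of behaviour that helps when clusters have different densities.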

Self-Training Algorithm Based on Density Peaks Combining Globally Adaptive Multi-Local Noise Filter

10.21203/rs.3.rs-1048816/v1 ◽

2021 ◽

Author(s):

Shuaijun Li ◽

Jia Lu

Keyword(s):

Data Sets ◽

Training Algorithm ◽

Real World Data ◽

Density Peak ◽

Data Set ◽

Density Peaks ◽

Noise Filter ◽

Density Peak Clustering ◽

A Current ◽

Noise Filters

Self-training algorithms can quickly train a supervised classifier from a few labeled samples and many unlabeled samples. However, self-training is often affected by mislabeled samples, and local noise filters have been proposed to detect them. Nevertheless, current local noise filters have two problems: (a) they ignore the spatial distribution of the nearest neighbors in different classes; (b) they cannot perform well when mislabeled samples are located in the overlapping areas of different classes. To solve these challenges, a new self-training algorithm based on density peaks combining globally adaptive multi-local noise filter (STDP-GAMNF) is proposed. First, the spatial structure of the data set is revealed by density peak clustering and is used to help self-training label unlabeled samples. Meanwhile, after each epoch of labeling, GAMLNF comprehensively judges from multiple classes whether a sample is mislabeled, which effectively reduces the influence of edge samples. Experimental results on eighteen real-world data sets demonstrate that GAMLNF is not sensitive to the value of the neighbor parameter k and that it can adaptively find the appropriate number of neighbors for each class.

Improving Density Peak Clustering by Automatic Peak Selection and Single Linkage Clustering

Symmetry ◽

10.3390/sym12071168 ◽

2020 ◽

Vol 12 (7) ◽

pp. 1168

Author(s):

Jun-Lin Lin ◽

Jen-Chieh Kuo ◽

Hsing-Wang Chuang

Keyword(s):

Clustering Algorithm ◽

Academic Community ◽

Performance Study ◽

Potential Density ◽

Cluster Assignment ◽

Density Peak ◽

Single Linkage ◽

Density Peaks ◽

Assignment Strategy ◽

Density Peak Clustering

Density peak clustering (DPC) is a density-based clustering method that has attracted much attention in the academic community. DPC works by first searching density peaks in the dataset, and then assigning each data point to the same cluster as its nearest higher-density point. One problem with DPC is the determination of the density peaks, where poor selection of the density peaks could yield poor clustering results. Another problem with DPC is its cluster assignment strategy, which often makes incorrect cluster assignments for data points that are far from their nearest higher-density points. This study modifies DPC and proposes a new clustering algorithm to resolve the above problems. The proposed algorithm uses the radius of the neighborhood to automatically select a set of the likely density peaks, which are far from their nearest higher-density points. Using the potential density peaks as the density peaks, it then applies DPC to yield the preliminary clustering results. Finally, it uses single-linkage clustering on the preliminary clustering results to reduce the number of clusters, if necessary. The proposed algorithm avoids the cluster assignment problem in DPC because the cluster assignments for the potential density peaks are based on single-linkage clustering, not based on DPC. Our performance study shows that the proposed algorithm outperforms DPC for datasets with irregularly shaped clusters.
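
The baseline DPC procedure described here (find density peaks, then assign each point to the cluster of its nearest higher-density point) can be sketched as follows. This is a minimal illustration of plain DPC, not the modified algorithm proposed in the paper; it assumes a cutoff-kernel density with a user-chosen cutoff `d_c`, and it picks peaks as the `n_clusters` points with the largest product of density and distance to the nearest higher-density point. The function and parameter names are hypothetical.

```python
import math

def density_peak_clustering(points, d_c, n_clusters):
    """Plain DPC sketch: cutoff density, peak selection, label propagation."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n   # distance to the nearest higher-density point
    parent = [-1] * n   # index of that nearest higher-density point
    delta[order[0]] = max(dist[order[0]])  # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda h: dist[i][h])
        delta[i], parent[i] = dist[i][j], j
    # Peaks: largest rho * delta. The stable sort keeps the densest point
    # first, so every non-peak has an already-labeled parent below.
    peaks = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for c, p in enumerate(peaks):
        labels[p] = c
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels
```

Because each point simply inherits the label of its nearest higher-density point, one early mistake propagates down the chain, which is the assignment weakness the abstract refers to.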

GRAPH BASED CLUSTERING WITH CONSTRAINTS AND ACTIVE LEARNING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/37/1/15773 ◽

2021 ◽

Vol 37 (1) ◽

pp. 71-89

Author(s):

Vu-Tuan Dang ◽

Viet-Vu Vu ◽

Hong-Quan Do ◽

Thi Kieu Oanh Le

Keyword(s):

Active Learning ◽

Clustering Algorithm ◽

Side Information ◽

Clustering Algorithms ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Supervised Clustering ◽

Class Labels ◽

Graph Based Clustering

During the past few years, semi-supervised clustering has emerged as a new and interesting direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information, which is available or collected from users. There are two main kinds of side information that can be learned in semi-supervised clustering algorithms: class labels, called seeds, and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature. However, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that uses both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect the constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI, but also on a document data set applied in information extraction of Vietnamese documents. The results show that the proposed algorithm can significantly improve the clustering process compared to some recent algorithms.

CHOOSING SEEDS FOR SEMI-SUPERVISED GRAPH BASED CLUSTERING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/35/4/14123 ◽

2019 ◽

Vol 35 (4) ◽

pp. 373-384

Author(s):

Cuong Le ◽

Viet Vu Vu ◽

Le Thi Kieu Oanh ◽

Nguyen Thi Hai Yen

Keyword(s):

Learning Algorithm ◽

Side Information ◽

Clustering Algorithms ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Supervised Clustering ◽

Efficient Data ◽

Graph Based Clustering

Though clustering algorithms have a long history, clustering still attracts a lot of attention because of the need for efficient data analysis tools in many applications such as social networks, electronic commerce, GIS, etc. Recently, semi-supervised clustering, which uses side information, has received a great deal of attention; examples include semi-supervised K-Means, semi-supervised DBSCAN, and semi-supervised graph-based clustering (SSGC). Generally, there are two forms of side information: seed form (labeled data) and constraint form (must-link, cannot-link). By integrating information provided by the user or domain expert, semi-supervised clustering can produce the expected results. In fact, clustering results usually depend on the side information provided, so different side information will produce different clustering results. In some cases, the performance of clustering may decrease if the side information is not carefully chosen. This paper addresses the problem of efficient collection of seeds for semi-supervised clustering, especially for graph-based clustering by seeding (SSGC). Properly collected seeds can boost the quality of clustering and minimize the number of queries solicited from the user. For this purpose, we have developed an active learning algorithm (called SKMMM) for the seed collection task, which identifies candidates to present to users by using the K-Means and min-max algorithms. Experiments conducted on real data sets from UCI and a real collected document data set show the effectiveness of our approach compared with other methods.

Interval Type II Fuzzy Number Generation From Data Set Applied to Sedation Stage Classification

Advances in Healthcare Information Systems and Administration - Computational Intelligence and Soft Computing Applications in Healthcare Management Science ◽

10.4018/978-1-7998-2581-4.ch008 ◽

2020 ◽

pp. 158-194

Author(s):

Efendi Nasibov ◽

Sinem Peker

Keyword(s):

Membership Function ◽

Fuzzy Number ◽

Real Data ◽

Classification Problem ◽

Data Sets ◽

Data Set ◽

Number Generation ◽

Interval Type ◽

Function Construction

There are several ways to summarize a data set, such as measures of location and dispersion, charts, and so on. But how can the data set be represented or shown when uncertainty exists in the environment process? Using a fuzzy number can be a way to handle the uncertainty in the representation of the data set. This chapter focuses on membership function construction from the data set and introduces formulas for generating an interval Type-2 generalized bell-shaped fuzzy number from the data set. Bispectral index scores (BIS) are processed to assess the ability of the proposed methods to construct the interval Type-2 generalized bell-shaped membership function on a real data set. The obtained membership functions are used for a classification problem of sedation stages according to BIS data sets. Classification accuracies are calculated.
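
For orientation, the standard generalized bell-shaped membership function is mu(x) = 1 / (1 + |(x - c) / a|^(2b)), with width a, slope b, and center c. The sketch below evaluates it and builds one possible interval Type-2 version by letting the width vary between a lower and an upper value. That uncertain-width construction and the names `gbell` and `interval_type2_gbell` are assumptions for illustration; the chapter's own formulas for deriving the parameters from data are not reproduced here.

```python
def gbell(x, a, b, c):
    """Generalized bell membership: 1 / (1 + |(x - c) / a| ** (2 * b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def interval_type2_gbell(x, a_lower, a_upper, b, c):
    """Return (lower, upper) membership grades for an uncertain width.

    Assumes 0 < a_lower <= a_upper, so the narrower bell lies below the wider.
    """
    return gbell(x, a_lower, b, c), gbell(x, a_upper, b, c)
```

For example, `interval_type2_gbell(60, 8, 12, 2, 50)` gives a lower grade from the narrow bell and an upper grade from the wide bell; the gap between them is the footprint of uncertainty at that input.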

Density Peak Clustering Algorithm Considering Topological Features

Electronics ◽

10.3390/electronics9030459 ◽

2020 ◽

Vol 9 (3) ◽

pp. 459

Author(s):

Shuyi Lu ◽

Yuanjie Zheng ◽

Rong Luo ◽

Weikuan Jia ◽

Jian Lian ◽

...

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Original Data ◽

Power Law Distribution ◽

Density Peak ◽

Topological Features ◽

Density Peaks ◽

Topological Characteristics ◽

Density Peak Clustering ◽

Clustering Data

Clustering algorithms play an important role in data mining and image processing. Breakthroughs in algorithm precision and method directly affect the direction and progress of subsequent research. At present, clustering algorithms are mainly divided into hierarchical, density-based, grid-based and model-based ones. This paper mainly studies the Clustering by Fast Search and Find of Density Peaks (CFSFDP) algorithm, which is a new density-based clustering method. The algorithm has no iterative process, few parameters and high precision. However, we found that the clustering algorithm did not consider the original topological characteristics of the data. We also found that the clustering data is similar to the social network nodes mentioned in DeepWalk, which follow a power-law distribution. In this study, we tried to consider the topological characteristics of the graph in the clustering algorithm. Based on previous studies, we propose a clustering algorithm that adds the topological characteristics of the original data to the CFSFDP algorithm. Our experimental results show that the clustering algorithm with topological features significantly improves the clustering effect, indicating that the addition of topological features is effective and feasible.

Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Journal Of Big Data ◽

10.1186/s40537-021-00488-w ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Yahya Albalawi ◽

Jim Buckley ◽

Nikola S. Nikolov

Keyword(s):

Social Media ◽

Deep Learning ◽

Comprehensive Evaluation ◽

Classification Problem ◽

Data Sets ◽

Word Embeddings ◽

Data Set ◽

Lower Accuracy ◽

Health Related ◽

The Impact

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processings applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier with F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model with F1 score of 75.2% and accuracy of 90.7% compared to F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first dataset, but significantly worse on the second dataset.

A Support Based Initialization Algorithm for Categorical Data Clustering

Journal of Information Technology Research ◽

10.4018/jitr.2018040104 ◽

2018 ◽

Vol 11 (2) ◽

pp. 53-67

Author(s):

Ajay Kumar ◽

Shishir Kumar

Keyword(s):

Categorical Data ◽

Selection Process ◽

Numerical Data ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Data Object ◽

Data Points ◽

Wu Method ◽

Selection Algorithms

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to a categorical data set. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to get the support of every row. Further, the data object having the largest support is chosen as an initial center, followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
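
The steps described above can be sketched roughly as follows. This is a literal reading of the abstract under several assumptions that the abstract does not confirm: support is taken as the relative frequency of a value within its attribute, row support is the sum of those frequencies, and distance is the Hamming distance (number of mismatching attributes). The function name `support_based_centers` is hypothetical.

```python
from collections import Counter

def support_based_centers(rows, k):
    """Pick k initial centers for categorical data (illustrative sketch)."""
    n, m = len(rows), len(rows[0])
    # Support of each value = its relative frequency within its attribute.
    counts = [Counter(row[a] for row in rows) for a in range(m)]
    row_support = [sum(counts[a][row[a]] / n for a in range(m))
                   for row in rows]
    # First center: the row with the largest total support.
    first = max(range(n), key=lambda i: row_support[i])

    def hamming(r, s):
        # Number of attributes on which two rows disagree.
        return sum(x != y for x, y in zip(r, s))

    # Remaining centers: rows farthest from the first center.
    others = sorted((i for i in range(n) if i != first),
                    key=lambda i: -hamming(rows[i], rows[first]))
    return [rows[first]] + [rows[i] for i in others[:k - 1]]
```

Ranking the remaining rows only by distance from the first center follows the wording of the abstract; for more than two centers, a farthest-first variant that measures distance to all centers chosen so far would avoid picking near-duplicates.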

A SELF-ORGANIZING MAP FOR MIXED CONTINUOUS AND CATEGORICAL DATA

International Journal of Computing ◽

10.47839/ijc.10.1.733 ◽

2011 ◽

pp. 24-32 ◽

Cited By ~ 1

Author(s):

Nicoleta Rogovschi ◽

Mustapha Lebbah ◽

Younès Bennani

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Mixed Data ◽

Categorical Variables ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Public Data ◽

Self Organizing

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for the clustering, analysis and visualization of mixed data (continuous/binary). The weights and prototypes are learned simultaneously, ensuring an optimized data clustering. The higher the weight of a variable, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process for the different variables, which computes weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set and three other mixed data sets. The results show a good quality of the topological ordering and homogeneous clustering.

A Bi-directional Fuzzy C-Means Clustering Ensemble Algorithm Considering Local Information

International Journal of Computational Intelligence Systems ◽

10.1007/s44196-021-00014-z ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Chunhua Ren ◽

Linfu Sun

Keyword(s):

Clustering Algorithms ◽

Real Data ◽

Local Information ◽

Data Sets ◽

Clustering Ensemble ◽

K Nearest Neighbors ◽

Fuzzy C Means ◽

Clustering Quality ◽

Fuzzy C Means Clustering ◽

Fcm Clustering

The classic Fuzzy C-means (FCM) algorithm has limited clustering performance and is prone to misclassification of border points. This study offers a bi-directional FCM clustering ensemble approach that takes local information into account (LI_BIFCM) to overcome these challenges and increase clustering quality. First, various membership matrices are created after running FCM multiple times, based on the randomization of the initial cluster centers, and a vertical ensemble is performed using the maximum membership principle. Second, after each execution of FCM, multiple local membership matrices of the sample points are created using multiple K-nearest neighbors, and a horizontal ensemble is performed. Multiple horizontal ensembles can be created using multiple FCM clustering. Finally, the final clustering results are obtained by combining the vertical and horizontal clustering ensembles. Twelve data sets were chosen for testing from both synthetic and real data sources. The LI_BIFCM clustering performance outperformed four traditional clustering algorithms and three clustering ensemble algorithms in the experiments. Furthermore, the final clustering results have a weak correlation with the bi-directional cluster ensemble parameters, indicating that the suggested technique is robust.
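
As background for the membership matrices and the maximum membership principle mentioned above, here is a minimal sketch of the standard FCM membership update for fixed centers, followed by hardening each row to its highest-membership cluster. It covers only this one building block of classic FCM, not the LI_BIFCM ensemble; the function names and the default fuzzifier `m=2.0` are illustrative assumptions.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Standard FCM update: u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1))."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        if 0.0 in d:
            # Point coincides with a center: split membership among those.
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            row = [h / sum(hits) for h in hits]
        else:
            row = [1.0 / sum((d[i] / d[k]) ** power for k in range(len(d)))
                   for i in range(len(d))]
        memberships.append(row)
    return memberships

def max_membership_labels(memberships):
    """Maximum membership principle: pick the cluster with the largest grade."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]
```

Each row of the returned matrix sums to 1, so a border point between two centers ends up with grades close to 0.5 each, which is where a hard assignment is least reliable.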
