A Novel K-Means Clustering Algorithm with a Noise Algorithm for Capturing Urban Hotspots

2021 ◽  
Vol 11 (23) ◽  
pp. 11202
Author(s):  
Xiaojuan Ran ◽  
Xiangbing Zhou ◽  
Mu Lei ◽  
Worawit Tepsan ◽  
Wu Deng

With the development of cities, urban congestion has become a nearly unavoidable problem for almost every large-scale city. Road planning is an effective means to alleviate urban congestion; it is a classical non-deterministic polynomial time (NP) hard problem and has become an important research hotspot in recent years. The K-means clustering algorithm is an iterative clustering analysis algorithm that has been regarded by scholars as an effective means of solving urban road planning problems for the past several decades; however, it is difficult to determine the number of clusters, and the algorithm is sensitive to the initialization of the cluster centers. To solve these problems, this paper develops a novel K-means clustering algorithm based on a noise algorithm for capturing urban hotspots. The noise algorithm is employed to randomly enhance the attribution of data points and, by adding a noise judgment to the clustering output, to automatically obtain the number of clusters for the given data and initialize the cluster centers. Four unsupervised evaluation indexes, namely DB, PBM, SC, and SSE, are directly used to evaluate and analyze the clustering results, and the nonparametric Wilcoxon statistical test is employed to verify the distribution states of and differences between clustering results. Finally, five taxi GPS datasets from Aracaju (Brazil), San Francisco (USA), Rome (Italy), Chongqing (China), and Beijing (China) are selected to test and verify the effectiveness of the proposed noise K-means clustering algorithm by comparing it with the fuzzy C-means, K-means, and K-means++ approaches. The comparative experimental results show that the noise algorithm can reasonably obtain the number of clusters and initialize the cluster centers, and that the proposed noise K-means clustering algorithm demonstrates better clustering performance, accurately obtains clustering results, and effectively captures urban hotspots.
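As a hedged illustration of the evaluation protocol described above, not the paper's noise algorithm itself, the sketch below computes the four named indexes for a K-means run and applies the Wilcoxon test to two illustrative score lists. `pbm_index` is our own implementation of the standard Pakhira–Bandyopadhyay–Maulik definition; the synthetic data and score values are placeholders.

```python
# Hedged sketch: computing DB, PBM, SC, and SSE for a clustering result,
# plus a Wilcoxon signed-rank test between two algorithms' paired scores.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

def pbm_index(X, labels, centers):
    grand = X.mean(axis=0)
    e1 = np.linalg.norm(X - grand, axis=1).sum()          # scatter to global mean
    ek = sum(np.linalg.norm(X[labels == k] - c, axis=1).sum()
             for k, c in enumerate(centers))              # within-cluster scatter
    dk = max(np.linalg.norm(a - b) for a in centers for b in centers)
    return ((e1 / ek) * dk / len(centers)) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                             # stand-in for taxi GPS points
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("DB :", davies_bouldin_score(X, km.labels_))
print("SC :", silhouette_score(X, km.labels_))
print("SSE:", km.inertia_)
print("PBM:", pbm_index(X, km.labels_, km.cluster_centers_))

# Wilcoxon test on paired index values of two algorithms (illustrative numbers)
scores_a = [0.51, 0.47, 0.55, 0.49, 0.52]
scores_b = [0.45, 0.44, 0.50, 0.48, 0.46]
print(wilcoxon(scores_a, scores_b))
```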

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Baicheng Lyu ◽  
Wenhua Wu ◽  
Zhiqiang Hu

With the wide application of cluster analysis, the number of clusters encountered is gradually increasing, as is the difficulty of selecting judgment indicators for the number of clusters. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes the connection between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the adjustable parameters to a minimum. On the basis of the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city-light satellite images.
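The abstract does not spell out BCALoD's update rules, but the local-density quantity it builds on is standard in density-based clustering. A minimal sketch, assuming the common Gaussian-kernel density with a cutoff distance d_c:

```python
# Hedged sketch: per-point local density with a cutoff distance d_c, the
# quantity BCALoD-style methods connect data points with. The exact BCALoD
# rules are not given in the abstract; this only illustrates the density.
import numpy as np
from scipy.spatial.distance import cdist

def local_density(X, d_c):
    d = cdist(X, X)                                   # pairwise Euclidean distances
    return np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0  # exclude self-contribution

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (80, 2)),   # one large cluster
               rng.normal(3, 0.1, (8, 2))])   # one small cluster
rho = local_density(X, d_c=0.5)
print("densest point:", X[np.argmax(rho)])
```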


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the cluster number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a "blind" feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real datasets show that the accuracy and efficiency of the C-K-means algorithm outperform those of existing algorithms under both sequential and parallel conditions.
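As a rough illustration of the two-phase structure, not the paper's actual CA, the sketch below uses a simple greedy covering heuristic (an assumption) to discover k and the initial centers "blindly", then hands them to Lloyd iteration:

```python
# Hedged sketch: phase 1 greedily covers the data with radius-neighbourhoods
# so k emerges from the data (a stand-in for the paper's covering algorithm);
# phase 2 refines the discovered centers with standard Lloyd iterations.
import numpy as np
from sklearn.cluster import KMeans

def covering_init(X, radius):
    uncovered = np.ones(len(X), dtype=bool)
    centers = []
    while uncovered.any():
        i = np.flatnonzero(uncovered)[0]           # first still-uncovered point
        centers.append(X[i])
        dist = np.linalg.norm(X - X[i], axis=1)
        uncovered &= dist > radius                 # cover its neighbourhood
    return np.array(centers)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.2, (50, 2)) for c in [(0, 0), (2, 2), (4, 0)]])
centers = covering_init(X, radius=1.0)             # "blind": k is not preselected
km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
print("discovered k =", len(centers), "SSE =", km.inertia_)
```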


Author(s):  
Md. Zakir Hossain ◽  
Md. Nasim Akhtar ◽  
R.B. Ahmad ◽  
Mostafijur Rahman

Data mining is the process of finding structure in large data sets. With this process, decision makers can make particular decisions for further development of real-world problems. Several data clustering techniques are used in data mining for finding specific patterns in data. The K-means method is one of the familiar clustering techniques for clustering large data sets. The K-means clustering method partitions the data set based on the assumption that the number of clusters is fixed. The main problem of this method is that if the number of clusters chosen is too small, there is a higher probability of adding dissimilar items to the same group; on the other hand, if the number of clusters chosen is too high, there is a higher chance of adding similar items to different groups. In this paper, we address this issue by proposing a new K-means clustering algorithm. The proposed method performs data clustering dynamically: it initially calculates a threshold value as a centroid for K-means, and based on this value the number of clusters is formed. At each iteration of K-means, if the Euclidean distance between two points is less than or equal to the threshold value, then those two data points will be in the same group; otherwise, the proposed method will create a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-means method.
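A minimal sketch of the dynamic rule just described, assuming a single-pass, leader-style interpretation; how the threshold is derived from the data is not specified in the abstract, so it is left as a parameter here:

```python
# Hedged sketch: a point joins an existing cluster if it lies within
# `threshold` of that cluster's centroid, otherwise it seeds a new cluster.
import numpy as np

def dynamic_threshold_clustering(X, threshold):
    centroids, members = [X[0]], [[0]]
    for i, x in enumerate(X[1:], start=1):
        dists = [np.linalg.norm(x - c) for c in centroids]
        j = int(np.argmin(dists))
        if dists[j] <= threshold:
            members[j].append(i)
            centroids[j] = X[members[j]].mean(axis=0)   # update running centroid
        else:
            centroids.append(x)                          # dissimilar point: new cluster
            members.append([i])
    labels = np.empty(len(X), dtype=int)
    for k, idx in enumerate(members):
        labels[idx] = k
    return labels, np.array(centroids)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
labels, centroids = dynamic_threshold_clustering(X, threshold=1.5)
print("clusters found:", len(centroids))
```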


2020 ◽  
Vol 16 (4) ◽  
pp. 84-94
Author(s):  
Xiaodi Huang ◽  
Minglun Ren ◽  
Zhongfeng Hu

The K-medoids algorithm first selects data points at random as initial centers to form initial clusters. Then, based on the PAM (partitioning around medoids) algorithm, each center is sequentially replaced by each of the remaining data points to find the result with the best inherent convergence. Since the PAM algorithm is an iterative ergodic strategy, when the data size or the number of clusters is huge, its expensive computational overhead hinders its feasibility. The authors use fixed-point iteration to search for the optimal clustering centers and build an FPK-medoids (fixed-point-based K-medoids) algorithm. By constructing a fixed-point equation for each cluster, the problem of searching for optimal centers is converted into solving a set of equations in parallel. Experiments are carried out on six standard datasets, and the results show that the clustering efficiency of the proposed algorithm is significantly improved compared with the conventional algorithm. In addition, the clustering quality is markedly enhanced when handling problems with large-scale datasets or a large number of clusters.
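The paper's fixed-point equations are not reproduced in the abstract; the sketch below only illustrates the general idea of treating the medoid update as a per-cluster fixed point, alternating assignments and medoid updates until neither changes:

```python
# Hedged sketch: each cluster's medoid is the member minimising total distance
# to the other members; iterating this update with nearest-medoid assignment
# is a fixed-point-style search for the optimal centers.
import numpy as np
from scipy.spatial.distance import cdist

def fp_kmedoids(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = cdist(X, medoids).argmin(axis=1)        # nearest-medoid assignment
        new = []
        for j in range(k):
            pts = X[labels == j]
            if len(pts) == 0:
                new.append(medoids[j])                    # keep an empty cluster's medoid
                continue
            within = cdist(pts, pts).sum(axis=1)          # total distance inside cluster
            new.append(pts[within.argmin()])              # fixed point: best member
        new = np.array(new)
        if np.allclose(new, medoids):                     # converged
            break
        medoids = new
    return labels, medoids

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in [(0, 0), (4, 4)]])
labels, medoids = fp_kmedoids(X, k=2)
print("medoids:\n", medoids)
```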


2020 ◽  
Vol 10 (18) ◽  
pp. 6566
Author(s):  
Wenbing Chang ◽  
Xinpeng Ji ◽  
Yinglai Liu ◽  
Yiyong Xiao ◽  
Bang Chen ◽  
...  

With the development of big data technology, creating the 'Digital Campus' is a hot issue. For an increasing amount of data, traditional data mining algorithms are not suitable. The clustering algorithm is becoming more and more important in the field of data mining, but traditional clustering algorithms do not take both clustering efficiency and clustering quality into consideration. In this paper, an algorithm based on K-Means and clustering by fast search and find of density peaks (K-CFSFDP) is proposed, which improves on the distance and density of data points. This method is used to cluster students from four universities. The experiments show that the K-CFSFDP algorithm achieves better clustering results and running efficiency than the traditional K-Means clustering algorithm and that it performs well on large-scale campus data. Additionally, the results of the cluster analysis show that students of different categories in the four universities performed differently in living habits and learning performance, so a university can learn about the behavior of different categories of students and provide corresponding personalized services, which has certain practical significance.
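The abstract does not detail how K-Means and CFSFDP are combined; one plausible reading, sketched below under that assumption, seeds K-Means with density peaks, i.e. points with high local density rho and a large distance delta to any denser point:

```python
# Hedged sketch: CFSFDP-style peak scores (gamma = rho * delta) pick the
# initial centers, which K-Means then refines. The actual K-CFSFDP
# formulation may differ from this combination.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def density_peak_centers(X, d_c, k):
    d = cdist(X, X)
    rho = np.exp(-(d / d_c) ** 2).sum(axis=1)            # local density
    delta = np.empty(len(X))
    for i in range(len(X)):
        denser = rho > rho[i]
        delta[i] = d[i, denser].min() if denser.any() else d[i].max()
    gamma = rho * delta                                   # peak score
    return X[np.argsort(gamma)[-k:]]                      # top-k peaks as centers

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.3, (70, 2)) for c in [(0, 0), (3, 0), (1.5, 3)]])
centers = density_peak_centers(X, d_c=0.6, k=3)
km = KMeans(n_clusters=3, init=centers, n_init=1).fit(X)
print("refined centers:\n", km.cluster_centers_)
```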


2011 ◽  
Vol 1 (3) ◽  
pp. 1-14 ◽  
Author(s):  
Wan Maseri Binti Wan Mohd ◽  
A.H. Beg ◽  
Tutut Herawan ◽  
A. Noraziah ◽  
K. F. Rabbi

K-means is an unsupervised learning and partitioning clustering algorithm. It is popular and widely used for its simplicity and speed. K-means clustering produces a number of separate, flat (non-hierarchical) clusters and is suitable for generating globular clusters. The main drawback of the K-means algorithm is that the user must specify the number of clusters in advance. This paper presents an improved version of the K-means algorithm that auto-generates an initial number of clusters (k), together with a new approach to defining the initial centroids, for an effective and efficient clustering process. The underlying mechanism has been analyzed and experimented with. The experimental results show that the number of iterations is reduced by 50% and that the run time is lower and remains constant, depending on the maximum distance between data points regardless of how many data points there are.
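One hedged reading of the auto-generation idea above (the paper's exact rule is not given in the abstract) grows the centroid set with the farthest remaining point until the maximum distance to any centroid falls below a fraction of the data's overall spread, so k and the initial centroids come out together:

```python
# Hedged sketch: farthest-point growth driven by the maximum distance between
# data points, an assumed interpretation of the abstract's description.
import numpy as np
from scipy.spatial.distance import cdist

def auto_k_init(X, frac=0.3):
    spread = cdist(X, X).max()                   # maximum pairwise distance
    centroids = [X.mean(axis=0)]
    while True:
        d = cdist(X, np.array(centroids)).min(axis=1)
        if d.max() < frac * spread:              # everything is "close enough"
            return np.array(centroids)
        centroids.append(X[d.argmax()])          # farthest point becomes a centroid

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in [(0, 0), (4, 0), (2, 3)]])
print("auto-generated k =", len(auto_k_init(X)))
```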


2021 ◽  
Vol 2 (2) ◽  
pp. 193-199
Author(s):  
Irwandi ◽  
Opim Salim Sitompul ◽  
Rahmat Widia Sembiring

The basic concept of the subtractive clustering algorithm is to choose the data point with the highest density (potential) in a space (variable) as the center of a cluster. The number and positions of the cluster centers formed are influenced by the given radius (r) parameter value. If the radius value is very small, potential data points around the center of the cluster will be neglected; if the radius parameter value is too large, the contribution of all potential data points increases, thereby canceling the effect of cluster density. The number of cluster centers in the subtractive clustering algorithm is determined by an iterative process that finds the data points with the highest number of neighbors. This study uses the clustering partition as a parameter value for deciding whether a data point (a candidate cluster center) will be selected, in order to determine the effect of the radius (r) parameter value on the clustering generated by the subtractive clustering algorithm. Experiments on four datasets produced the following results: for dataset 1, the highest average fuzzy silhouette value is 0.9088, obtained with a radius (r) parameter value of 0.35 and 2 clusters; for dataset 2, it is 0.6742, with r = 0.40 and 3 clusters; for dataset 3, it is 0.7434, with r = 0.50 and 3 clusters; and for dataset 4, it is 0.6630, with r = 0.50 and 2 clusters. The subtractive clustering algorithm is widely applied in transportation, GIS, big data, control of electric voltages, electrical energy demand, and mapping population density, as well as in health applications such as breast cancer diagnosis, all of which relate to the needs of human life.
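For reference, a minimal sketch of classical subtractive clustering (Chiu's formulation), on which the study's radius experiments build; the single stopping ratio used here is a simplification of the usual accept/reject thresholds:

```python
# Hedged sketch: each point's potential is a sum of Gaussian contributions
# governed by radius r_a; the highest-potential point becomes a center, its
# potential is subtracted from its neighbourhood (squash radius r_b = 1.5*r_a),
# and the loop stops when the remaining peak potential is small.
import numpy as np
from scipy.spatial.distance import cdist

def subtractive_clustering(X, r_a=0.4, stop_ratio=0.15):
    d2 = cdist(X, X) ** 2
    p = np.exp(-4.0 * d2 / r_a ** 2).sum(axis=1)     # initial potentials
    r_b = 1.5 * r_a
    centers, p_first = [], p.max()
    while p.max() > stop_ratio * p_first:
        i = p.argmax()
        centers.append(X[i])
        p -= p[i] * np.exp(-4.0 * d2[i] / r_b ** 2)  # revise neighbouring potentials
    return np.array(centers)

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(c, 0.1, (60, 2)) for c in [(0, 0), (1, 1)]])
X = (X - X.min(axis=0)) / np.ptp(X, axis=0)          # normalise to [0, 1]
print("centers:\n", subtractive_clustering(X, r_a=0.35))
```

Varying `r_a` here reproduces the trade-off the abstract describes: a very small radius ignores nearby points, while a very large one flattens the density differences between candidate centers.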


Author(s):  
Chinmay Bepery ◽  
Shaneworn Bhadra ◽  
Md. Mahbubur Rahman ◽  
Mihir Kanti Sarkar ◽  
Mohammad Jamal Hossain

Clustering is a machine learning method that can group similar data points. Mean Shift (MS) is a fixed-window-based clustering algorithm that calculates the number of clusters automatically but cannot guarantee the convergence of the algorithm. The main drawback of the Mean Shift algorithm is that it requires a stopping criterion (threshold point) to be set, otherwise all clusters move towards one cluster, and it uses a fixed bandwidth. It cannot define an upper bound on the number of iterations, which must instead be set manually. This paper proposes a new Mean Shift algorithm, called the Improved Mean Shift (IMS) algorithm, which overcomes all of the above pitfalls of the Mean Shift algorithm. IMS uses a KD-tree data structure to sort the dataset and takes all data points as initial cluster centroids, without a random selection of initial centroids. In each iteration, it shifts a variable-bandwidth sliding window to the actual data point nearest to the mean using the k-nearest neighbours (kNN) algorithm, and it finds the number of clusters automatically. This paper also handles missing values using mean imputation (MI). The IMS algorithm produces better results than the Mean Shift algorithm on both synthetic and real datasets.
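A hedged sketch of the shifting step described above: each centroid moves to the mean of its k nearest neighbours (found via a KD-tree) and is snapped to the nearest actual data point, so the shift stays bounded by the dataset; missing values are mean-imputed first. The full IMS bookkeeping (e.g., merging coincident centroids into final clusters) is simplified here:

```python
# Hedged sketch: kNN-based variable-bandwidth mean shift with KD-tree lookup,
# mean imputation, and snapping to real data points.
import numpy as np
from scipy.spatial import cKDTree

def ims_like_shift(X, k=10, max_iter=50):
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)            # mean imputation (MI)
    tree = cKDTree(X)
    centroids = X.copy()                              # every point starts as a centroid
    for _ in range(max_iter):
        _, idx = tree.query(centroids, k=k)           # k nearest neighbours
        means = X[idx].mean(axis=1)                   # variable-bandwidth window mean
        _, snap = tree.query(means, k=1)
        new = X[snap]                                 # snap to nearest real point
        if np.array_equal(new, centroids):            # converged: nothing moved
            break
        centroids = new
    return np.unique(centroids, axis=0)               # distinct points = cluster modes

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(c, 0.2, (60, 2)) for c in [(0, 0), (3, 3)]])
X[0, 1] = np.nan                                      # demonstrate imputation
print("modes found:", len(ims_like_shift(X, k=15)))
```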

