Density Peaks Clustering Based on Feature Reduction and Quasi-Monte Carlo

2022
Vol 2022
pp. 1-17
Author(s):  
Zhihui Hu
Xiaoran Wei
Xiaoxu Han
Guang Kou
Haoyu Zhang
...  

Density peaks clustering (DPC) is a well-known density-based clustering algorithm that handles nonspherical clusters well. However, DPC has high computational and space complexity when calculating the local density ρ and the distance δ, which makes it suitable only for small-scale data sets. In addition, for clustering high-dimensional data, the performance of DPC still needs to be improved: high-dimensional data not only make the data distribution more complex but also incur more computational overhead. To address these issues, we propose an improved density peaks clustering algorithm that combines feature reduction with a data sampling strategy. Specifically, features of the high-dimensional data are automatically extracted by principal component analysis (PCA), auto-encoder (AE), and t-distributed stochastic neighbor embedding (t-SNE). Next, to reduce the computational overhead, we propose a novel data sampling method for the low-dimensional feature data. First, the data distribution in the low-dimensional feature space is estimated by a Quasi-Monte Carlo (QMC) sequence with low-discrepancy characteristics. Then, representative QMC points are selected according to their cell densities, and the selected QMC points are used to calculate ρ and δ in place of the original data points. In general, the number of selected QMC points is much smaller than the size of the initial data set. Finally, a two-stage classification strategy based on the QMC point clustering results is proposed to classify the original data set. Compared with current works, our proposed algorithm reduces the computational complexity from O(n²) to O(Nn), where N denotes the number of selected QMC points and n is the size of the original data set, typically with N ≪ n. Experimental results demonstrate that the proposed algorithm can effectively reduce the computational overhead and improve model performance.
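
As a minimal sketch of this pipeline (not the authors' implementation), the code below assumes a scrambled Sobol sequence as the low-discrepancy QMC generator, PCA standing in for the PCA/AE/t-SNE feature reducers, and nearest-QMC-point assignment to define cell densities; function and parameter names such as qmc_dpc_sketch and n_qmc are illustrative.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.decomposition import PCA

def qmc_dpc_sketch(X, n_qmc=256, n_components=2):
    # 1) Feature reduction (PCA standing in for the PCA/AE/t-SNE choices).
    Z = PCA(n_components=n_components).fit_transform(X)

    # 2) Low-discrepancy Sobol points scaled to the feature bounding box
    #    (n_qmc kept a power of 2, as Sobol sequences prefer).
    sampler = qmc.Sobol(d=n_components, scramble=True)
    Q = qmc.scale(sampler.random(n_qmc), Z.min(axis=0), Z.max(axis=0))

    # 3) Cell density: how many data points fall nearest to each QMC point.
    d2q = np.linalg.norm(Z[:, None, :] - Q[None, :, :], axis=2)  # (n, N)
    nearest = d2q.argmin(axis=1)
    cell_density = np.bincount(nearest, minlength=n_qmc)

    # Keep only representative QMC points (non-empty cells) and remap
    # each data point's cell index onto the kept points.
    keep = np.flatnonzero(cell_density > 0)
    remap = np.full(n_qmc, -1)
    remap[keep] = np.arange(keep.size)
    Q, rho, nearest = Q[keep], cell_density[keep].astype(float), remap[nearest]

    # 4) DPC's delta on the N kept QMC points instead of all n points:
    #    distance to the nearest point of strictly higher density.
    dqq = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=2)
    delta = np.empty(len(Q))
    for i in range(len(Q)):
        higher = rho > rho[i]
        delta[i] = dqq[i, higher].min() if higher.any() else dqq[i].max()

    # rho and delta feed the usual DPC decision graph; `nearest` maps every
    # original point to its QMC cell for the two-stage classification.
    return Q, rho, delta, nearest
```

Because ρ and δ are computed only on the N retained QMC points, the pairwise work shrinks from O(n²) to the O(Nn) assignment step plus an O(N²) term with N ≪ n, matching the complexity claim above.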

2020
pp. 1-12
Author(s):  
Xiaoguang Gao

An unbalanced development strategy leads to uneven regional development. Therefore, during development, resources must be used effectively according to the level and characteristics of each region. Considering resource and environmental constraints, this paper measures and analyzes China's green economic efficiency and green total factor productivity. Moreover, by expounding the characteristics of high-dimensional data, this paper points out the problems traditional clustering algorithms face when clustering high-dimensional data, and it proposes a density peak clustering algorithm based on sampling and residual squares that is suitable for high-dimensional, large data sets. The algorithm finds abnormal points and boundary points by identifying halo points, and finally determines the clusters. Experimental comparison on the data sets shows that the improved algorithm is better than the DPC algorithm in both time complexity and clustering results. Finally, the article analyzes data from actual cases; the results show that the proposed method is effective.
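
The halo-point step is described only at a high level; the sketch below reconstructs the original DPC notion of halo points (boundary and abnormal points whose density falls below the densest point of their cluster's border region), which this paper builds on. The cutoff distance dc and the helper name halo_points are assumptions for illustration.

```python
import numpy as np

def halo_points(X, labels, rho, dc):
    """Flag halo (boundary/abnormal) points in the spirit of DPC.

    X: (n, d) data; labels: (n,) cluster ids; rho: (n,) local
    densities; dc: cutoff distance. O(n^2) memory, sketch only.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    halo = np.zeros(len(X), dtype=bool)
    for c in np.unique(labels):
        in_c = labels == c
        # Border region: members of c within dc of another cluster's point.
        near_other = (D[in_c][:, ~in_c] < dc).any(axis=1)
        if not near_other.any():
            continue
        rho_b = rho[in_c][near_other].max()   # border density threshold
        halo[in_c] = rho[in_c] < rho_b        # low-density members become halo
    return halo
```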


2013
Vol 2013
pp. 1-12
Author(s):  
Singh Vijendra
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effect of the first phase (dense-region detection) on the results of subspace clustering; our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality, and its efficiency outperforms PROCLUS.
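
The abstract does not spell out how dense and sparse regions are detected, so the following is one plausible, hedged reading of the first phase: per-dimension histograms in which a bin counts as dense when it holds noticeably more points than a uniform spread would put there; dimensions containing dense bins are treated as relevant to some subspace. The bin count and threshold factor are illustrative assumptions.

```python
import numpy as np

def relevant_dimensions(X, n_bins=20, factor=1.5):
    """Flag dimensions with dense 1-D regions (one reading of phase one).

    A bin is 'dense' if it holds more than factor * (n / n_bins) points,
    i.e. noticeably more than a uniform spread would place there.
    """
    n, d = X.shape
    expected = n / n_bins
    relevant, dense_bins = [], {}
    for j in range(d):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        dense = np.flatnonzero(counts > factor * expected)
        if dense.size:
            relevant.append(j)
            dense_bins[j] = [(edges[b], edges[b + 1]) for b in dense]
    return relevant, dense_bins
```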


Author(s):  
Wenke Zang
Liyan Ren
Wenqian Zhang
Xiyu Liu

Clustering by fast search and find of density peaks (DPC), introduced by Alex Rodríguez and Alessandro Laio, has attracted much attention in the fields of pattern recognition and artificial intelligence. However, DPC still has several unresolved defects. First, the local density ρ_i of a point i is affected by the cutoff distance d_c, which can influence the clustering result, especially for small real-world cases. Second, the number of clusters is still found intuitively, by using the decision diagram to select the cluster centers. To overcome these defects, this paper proposes an automatic density peaks clustering approach using a DNA genetic algorithm optimized data field and a Gaussian process (referred to as ADPC-DNAGA). ADPC-DNAGA can extract the optimal threshold value via the potential entropy of the data field and automatically determine the cluster centers by the Gaussian method. For any data set to be clustered, the threshold can be calculated from the data set objectively rather than estimated empirically. The proposed clustering algorithm is benchmarked on publicly available synthetic and real-world datasets commonly used for testing the performance of clustering algorithms. The clustering results are compared not only with those of DPC but also with those of several well-known clustering algorithms such as Affinity Propagation, DBSCAN and spectral clustering. The experimental results demonstrate that the proposed algorithm can find the optimal cutoff distance d_c, automatically identify clusters regardless of their shape and the dimension of the embedded space, and often outperform the comparison methods.
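
As a simplified stand-in for the automatic center selection (the paper's Gaussian-method criterion is not reproduced here), the sketch below ranks the standard DPC decision-graph quantity γ_i = ρ_i · δ_i and cuts at the largest gap in the sorted values, so the number of centers is chosen from the data rather than by eye.

```python
import numpy as np

def auto_centers(rho, delta):
    """Pick cluster centers from the decision graph automatically.

    Simplified stand-in for ADPC-DNAGA's Gaussian criterion: sort
    gamma = rho * delta and cut at the largest drop between consecutive
    values, treating everything above the cut as a center.
    """
    gamma = rho * delta
    order = np.argsort(gamma)[::-1]   # indices by descending gamma
    g = gamma[order]
    gaps = g[:-1] - g[1:]             # drops between sorted neighbors
    k = int(np.argmax(gaps)) + 1      # cut just after the largest drop
    return order[:k]                  # indices of the chosen centers
```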


2018
Vol 7 (2.11)
pp. 27
Author(s):  
Kahkashan Kouser
Amrita Priyam

One of the open problems of modern data mining is clustering high-dimensional data. To address it, this paper proposes a new technique called GA-HDClustering, which works in two steps. First, a GA-based feature selection algorithm is designed to determine the optimal feature subset, consisting of the important features of the entire data set. Next, a K-means algorithm is applied using the optimal feature subset to find the clusters. For comparison, the traditional K-means algorithm is applied on the full-dimensional feature space, and the result of GA-HDClustering is compared with that of the traditional clustering algorithm using validity metrics such as the sum of squared errors (SSE), within-group average distance (WGAD), between-group distance (BGD) and the Davies-Bouldin index (DBI). GA-HDClustering uses a genetic algorithm to search for an effective feature subspace within the large feature space made up of all dimensions of the data set. Experiments performed on standard data sets reveal that GA-HDClustering is superior to the traditional clustering algorithm.
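
A toy version of the two-step GA-HDClustering idea might look as follows; the GA operators (one-point crossover, bit-flip mutation) and the fitness choice of negative K-means SSE are assumptions, since the paper reports SSE only as a validity metric.

```python
import numpy as np
from sklearn.cluster import KMeans

def ga_feature_selection(X, k, pop=20, gens=30, p_mut=0.1, seed=0):
    """Toy GA search: binary masks over features, fitness = negative
    SSE (inertia) of K-means run on the selected feature subset."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    P = rng.integers(0, 2, size=(pop, d))      # random binary population

    def fitness(mask):
        if mask.sum() == 0:                    # empty subsets are invalid
            return -np.inf
        km = KMeans(n_clusters=k, n_init=5, random_state=0)
        km.fit(X[:, mask.astype(bool)])
        return -km.inertia_                    # lower SSE -> higher fitness

    for _ in range(gens):
        scores = np.array([fitness(m) for m in P])
        parents = P[np.argsort(scores)[-pop // 2:]]   # keep the best half
        kids = []
        while len(kids) < pop - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, d)                  # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(d) < p_mut              # bit-flip mutation
            child[flip] ^= 1
            kids.append(child)
        P = np.vstack([parents, kids])
    scores = np.array([fitness(m) for m in P])
    return P[scores.argmax()].astype(bool)            # best feature mask
```

A call like mask = ga_feature_selection(X, k=3) followed by KMeans(n_clusters=3).fit_predict(X[:, mask]) would then yield the final clustering on the selected subspace.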


2009
Vol 35 (7)
pp. 859-866
Author(s):  
Ming LIU
Xiao-Long WANG
Yuan-Chao LIU

2020
Author(s):  
Xiao Lai
Pu Tian

Supervised machine learning, especially deep learning based on a wide variety of neural network architectures, has contributed tremendously to fields such as marketing, computer vision and natural language processing. However, the development of unsupervised machine learning algorithms has been a bottleneck for artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering high-dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and a tree structure, and we demonstrate its utility by clustering protein structural domains. No record comparison, an expensive step common and essential to all present clustering algorithms, is involved. Consequently, it achieves hierarchical clustering with linear time and space complexity and is thus applicable to arbitrarily large datasets. The key factor in this algorithm is the definition of composition, which depends on the physical nature of the target data and therefore needs to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high-dimensional data with strong nonlinear correlations. We hope this algorithm will inspire a rich research field of encoding-based clustering well beyond composition rank vector trees.
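
Since the composition encoding is defined case by case, the following toy sketch only illustrates the general shape of the idea: each record is mapped to an assumed "composition rank vector" (here, the descending rank order of its largest components) and hashed down a prefix tree, so no record is ever compared with another and the cost stays linear in the number of records. Both the encoding and the helper names are hypothetical.

```python
import numpy as np

def rank_vector(x, top=3):
    """Illustrative 'composition rank vector': indices of the top-`top`
    components of a record, in descending order of magnitude."""
    return tuple(np.argsort(-np.asarray(x))[:top])

def build_tree(records, top=3):
    """Hierarchical clustering by encoding alone: each record descends
    a prefix tree keyed by its rank vector, with no record-record
    comparison (linear time and space in the number of records)."""
    root = {}
    for i, x in enumerate(records):
        node = root
        for r in rank_vector(x, top):
            node = node.setdefault(r, {})      # descend/extend the trie
        node.setdefault("_members", []).append(i)
    return root
```

Cutting the tree at depth p groups the records that agree on their first p ranks, which yields the hierarchy essentially for free.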


2021
Vol 50 (1)
pp. 138-152
Author(s):  
Mujeeb Ur Rehman
Dost Muhammad Khan

Recently, anomaly detection has attracted considerable attention from data mining researchers, as its reputation has grown steadily across practical domains such as product marketing, fraud detection, medical diagnosis and fault detection. High-dimensional data subjected to outlier detection poses exceptional challenges because of the curse of dimensionality and the growing resemblance between distant and adjoining points. Traditional algorithms and techniques have been applied to the full feature space for outlier detection. Such customary methodologies concentrate largely on low-dimensional data and hence prove ineffective at discovering anomalies in data sets with a high number of dimensions; digging out the anomalies becomes difficult and tiresome when all subsets of projections must be explored. Moreover, all points in high-dimensional data come to behave like similar observations, because the distance between observations approaches zero as the number of dimensions grows toward infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It opens a new breadth of research toward resolving the inherent problems of high-dimensional data, where outliers reside within clusters of different densities. A high-dimensional dataset from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
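
The deviation measure itself is not given in the abstract, so the sketch below is an assumption-laden illustration of the general recipe: a robust per-point deviation score (distance from feature-wise medians, scaled by the median absolute deviation) blended with a density-based score from scikit-learn's LocalOutlierFactor.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def deviation_lof_scores(X, n_neighbors=20):
    """Blend a per-point deviation score with a density-based (LOF) score.

    The deviation term is a robust z-score distance from feature-wise
    medians (an assumption; the paper does not publish its formula).
    """
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) + 1e-12   # avoid divide-by-zero
    deviation = np.linalg.norm((X - med) / mad, axis=1)

    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(X)
    density_score = -lof.negative_outlier_factor_      # higher = more anomalous

    # Normalize both scores to [0, 1] and average them.
    def norm(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return 0.5 * norm(deviation) + 0.5 * norm(density_score)
```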

