A New Method for Dimensionality Reduction Using KMeans Clustering Algorithm for High Dimensional Data Set

Clustering of variables is the task of grouping similar variables together. It may be useful in several situations such as dimensionality reduction, feature selection, and redundancy detection. In the present study, we combine two methods of feature clustering: the clustering of variables around latent variables (CLV) algorithm and the k-means based co-clustering algorithm (kCC). Indeed, classical CLV cannot be applied to high-dimensional data because this approach becomes tedious as the number of features increases.
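
As a rough illustration of what clustering variables (rather than observations) means, the sketch below standardizes each column of a data table and runs a plain k-means on the columns. It is not the CLV or kCC procedure from the abstract; the function names (`standardize`, `kmeans`, `cluster_variables`), the farthest-first seeding, and the toy table are assumptions made for this example.

```python
import math


def standardize(column):
    """Center a variable and scale it to unit variance (constant columns are only centered)."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n) or 1.0
    return [(v - mean) / std for v in column]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-first seeding (an assumption of this sketch)."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels


def cluster_variables(rows, k):
    """Group the columns (variables) of a row-major table into k clusters."""
    columns = [standardize(list(col)) for col in zip(*rows)]
    return kmeans(columns, k)


# Hypothetical toy table: variables 0 and 1 are linearly related, as are variables 2 and 3.
rows = [
    [1, 3, 1, 3],
    [2, 5, -1, -3],
    [3, 7, -1, -3],
    [4, 9, 1, 3],
    [5, 11, 1, 3],
    [6, 13, -1, -3],
]
labels = cluster_variables(rows, k=2)
print(labels)  # prints [0, 0, 1, 1]
```

Each cluster of columns could then be summarized by a single representative variable, which is how variable clustering serves dimensionality reduction. Note that Euclidean distance on standardized columns treats negatively correlated variables as far apart, so this sketch only groups positively related variables.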

Urban green economic development indicators based on spatial clustering algorithm and blockchain

Journal of Intelligent & Fuzzy Systems ◽ 10.3233/jifs-189535 ◽ 2020 ◽ pp. 1-12

Author(s): Xiaoguang Gao

Keyword(s): Development Strategy ◽ Clustering Algorithm ◽ Spatial Clustering ◽ Clustering Algorithms ◽ High Dimensional Data ◽ Large Data ◽ Experimental Comparison ◽ High Dimensional ◽ Density Peak ◽ Data Set

The unbalanced development strategy makes the regional development unbalanced. Therefore, in the development process, resources must be effectively utilized according to the level and characteristics of each region. Considering the resource and environmental constraints, this paper measures and analyzes China’s green economic efficiency and green total factor productivity. Moreover, by expounding the characteristics of high-dimensional data, this paper points out the problems of traditional clustering algorithms in high-dimensional data clustering. This paper proposes a density peak clustering algorithm based on sampling and residual squares, which is suitable for high-dimensional large data sets. The algorithm finds abnormal points and boundary points by identifying halo points, and finally determines clusters. In addition, from the experimental comparison on the data set, it can be seen that the improved algorithm is better than the DPC algorithm in both time complexity and clustering results. Finally, this article analyzes data based on actual cases. The research results show that the method proposed in this paper is effective.

Density Peaks Clustering Based on Feature Reduction and Quasi-Monte Carlo

Scientific Programming ◽ 10.1155/2022/8046620 ◽ 2022 ◽ Vol 2022 ◽ pp. 1-17

Author(s): Zhihui Hu ◽ Xiaoran Wei ◽ Xiaoxu Han ◽ Guang Kou ◽ Haoyu Zhang ◽ ...

Keyword(s): Clustering Algorithm ◽ High Dimensional Data ◽ Original Data ◽ Feature Reduction ◽ High Dimensional ◽ Data Sampling ◽ Data Set ◽ Quasi Monte Carlo ◽ Density Peaks ◽ Density Peaks Clustering

Density peaks clustering (DPC) is a well-known density-based clustering algorithm that can deal with nonspherical clusters well. However, DPC has high computational complexity and space complexity in calculating local density ρ and distance δ, which makes it suitable only for small-scale data sets. In addition, for clustering high-dimensional data, the performance of DPC still needs to be improved. High-dimensional data not only make the data distribution more complex but also lead to more computational overheads. To address the above issues, we propose an improved density peaks clustering algorithm, which combines feature reduction and data sampling strategy. Specifically, features of the high-dimensional data are automatically extracted by principal component analysis (PCA), auto-encoder (AE), and t-distributed stochastic neighbor embedding (t-SNE). Next, in order to reduce the computational overhead, we propose a novel data sampling method for the low-dimensional feature data. Firstly, the data distribution in the low-dimensional feature space is estimated by the Quasi-Monte Carlo (QMC) sequence with low-discrepancy characteristics. Then, the representative QMC points are selected according to their cell densities. Next, the selected QMC points are used to calculate ρ and δ instead of the original data points. In general, the number of the selected QMC points is much smaller than that of the initial data set. Finally, a two-stage classification strategy based on the QMC points clustering results is proposed to classify the original data set. Compared with current works, our proposed algorithm can reduce the computational complexity from O(n²) to O(Nn), where N denotes the number of selected QMC points and n is the size of the original data set, with typically N ≪ n. Experimental results demonstrate that the proposed algorithm can effectively reduce the computational overhead and improve the model performance.
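
The quadratic cost mentioned in the abstract comes from the pairwise distances behind ρ and δ. The sketch below computes both quantities with the cutoff-kernel definitions commonly used for DPC: ρ for a point counts the other points closer than a cutoff distance, and δ is the distance to the nearest point of higher density (or the largest distance, for a point with no denser neighbour). It covers plain DPC only, not the feature-reduction and QMC sampling pipeline the abstract proposes; the names `density_peaks` and `d_c` and the toy points are assumptions of this example.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(points):
    """Full n x n distance matrix: the O(n^2) time and memory cost of plain DPC."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = euclid(points[i], points[j])
    return dist


def density_peaks(points, d_c):
    """Return (rho, delta) for every point, using a hard cutoff d_c."""
    dist = pairwise_distances(points)
    n = len(points)
    # rho[i]: number of other points strictly closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to all points that are denser than point i.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbour gets its largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


# Hypothetical toy data: a loose clump near the origin and a tight clump near (10, 10).
points = [(0, 0), (0, 1), (1, 0), (0.5, 0.5), (10, 10), (10, 10.5), (10.5, 10)]
rho, delta = density_peaks(points, d_c=0.8)
for p, r, d in zip(points, rho, delta):
    print(p, r, round(d, 3))
```

Cluster centers are the points where both ρ and δ are large. Ties in ρ, as among the last three points here, give several points a large δ, which is one reason practical implementations often break ties or use a smooth kernel instead of a hard cutoff.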

Subspace Clustering of High-Dimensional Data: An Evolutionary Approach

Applied Computational Intelligence and Soft Computing ◽ 10.1155/2013/863146 ◽ 2013 ◽ Vol 2013 ◽ pp. 1-12 ◽ Cited By ~ 3

Author(s): Singh Vijendra ◽ Sahoo Laxman

Keyword(s): Clustering Algorithm ◽ Dimensional Space ◽ Clustering Algorithms ◽ High Dimensional Data ◽ Subspace Clustering ◽ High Dimensional ◽ Data Sets ◽ Real World Data ◽ Data Set ◽ Data Points

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting the dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effect of the first phase, which detects dense regions, on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by clustering error (CE) distance on various data sets. MOSCL can discover the clusters in all subspaces with high quality, and the efficiency of MOSCL outperforms PROCLUS.

Feature Selection using Genetic Algorithm for Clustering high Dimensional Data

International Journal of Engineering & Technology ◽ 10.14419/ijet.v7i2.11.11001 ◽ 2018 ◽ Vol 7 (2.11) ◽ pp. 27 ◽ Cited By ~ 1

Author(s): Kahkashan Kouser ◽ Amrita Priyam

Keyword(s): Genetic Algorithm ◽ Feature Selection ◽ Clustering Algorithm ◽ High Dimensional Data ◽ Feature Space ◽ High Dimensional ◽ Feature Subset ◽ Data Set ◽ Optimal Feature Subset ◽ Optimal Feature

One of the open problems of modern data mining is clustering high-dimensional data. This paper proposes a new technique called GA-HDClustering, which works in two steps. First, a GA-based feature selection algorithm is designed to determine the optimal feature subset, which consists of the important features of the entire data set. Next, a K-means algorithm is applied using the optimal feature subset to find the clusters. For comparison, the traditional K-means algorithm is applied to the full-dimensional feature space. Finally, the result of GA-HDClustering is compared with that of the traditional clustering algorithm. Different validity metrics are used for the comparison, such as sum of squared error (SSE), within-group average distance (WGAD), between-group distance (BGD), and Davies-Bouldin index (DBI). GA-HDClustering uses a genetic algorithm to search for an effective feature subspace in a large feature space made of all dimensions of the data set. The experiment performed on the standard data set revealed that GA-HDClustering is superior to the traditional clustering algorithm.
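
To make two of the validity measures named in the abstract concrete, the sketch below computes SSE and the Davies-Bouldin index for a fixed labelling, once on the full feature space and once on a feature subset. It uses the textbook definitions of both measures and is not the GA-HDClustering procedure: there is no genetic search, the labels are given rather than produced by K-means, and the helper names (`sse`, `davies_bouldin`, `project`) and the toy data are assumptions of this example.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group(points, labels):
    """Map each cluster label to the list of its member points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(vals) / len(members) for vals in zip(*members)]


def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in group(points, labels).values():
        c = centroid(members)
        total += sum(euclid(p, c) ** 2 for p in members)
    return total


def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower is better. Needs >= 2 clusters with distinct centroids."""
    clusters = group(points, labels)
    cents = {lab: centroid(m) for lab, m in clusters.items()}
    # Scatter: mean distance of the members of a cluster to its centroid.
    scatter = {lab: sum(euclid(p, cents[lab]) for p in m) / len(m) for lab, m in clusters.items()}
    # For each cluster, keep its worst (largest) similarity ratio to any other cluster.
    ratios = [
        max((scatter[i] + scatter[j]) / euclid(cents[i], cents[j]) for j in cents if j != i)
        for i in cents
    ]
    return sum(ratios) / len(ratios)


def project(points, subset):
    """Keep only the features whose indices are listed in subset."""
    return [[p[f] for f in subset] for p in points]


# Hypothetical toy data: features 0 and 1 separate the clusters, feature 2 is noise.
points = [[0, 0, 5], [0, 2, -5], [10, 0, -5], [10, 2, 5]]
labels = [0, 0, 1, 1]
reduced = project(points, [0, 1])
print("full:   SSE=%.3f DBI=%.3f" % (sse(points, labels), davies_bouldin(points, labels)))
print("subset: SSE=%.3f DBI=%.3f" % (sse(reduced, labels), davies_bouldin(reduced, labels)))
```

SSE can only shrink or stay equal when features are dropped under a fixed labelling, so it is not a fair comparison across feature spaces on its own; a ratio-based index such as DBI is less sensitive to the number of dimensions.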

Quality-based guidance for exploratory dimensionality reduction

Information Visualization ◽ 10.1177/1473871612460526 ◽ 2012 ◽ Vol 12 (1) ◽ pp. 44-64 ◽ Cited By ~ 12

Author(s): Sara Johansson Fernstad ◽ Jane Shaw ◽ Jimmy Johansson

Keyword(s): Dimensionality Reduction ◽ High Dimensional Data ◽ Quality Metrics ◽ Exploratory Analysis ◽ High Dimensional ◽ Data Sets ◽ Data Set ◽ Bacterial Populations ◽ Reduction Methods ◽ Individual Variables

High-dimensional data sets containing hundreds of variables are difficult to explore, as traditional visualization methods often are unable to represent such data effectively. This is commonly addressed by employing dimensionality reduction prior to visualization. Numerous dimensionality reduction methods are available. However, few reduction approaches take the importance of several structures into account and few provide an overview of structures existing in the full high-dimensional data set. For exploratory analysis, as well as for many other tasks, several structures may be of interest. Exploration of the full high-dimensional data set without reduction may also be desirable. This paper presents flexible methods for exploratory analysis and interactive dimensionality reduction. Automated methods are employed to analyse the variables, using a range of quality metrics, providing one or more measures of ‘interestingness’ for individual variables. Through ranking, a single value of interestingness is obtained, based on several quality metrics, that is usable as a threshold for the most interesting variables. An interactive environment is presented in which the user is provided with many possibilities to explore and gain understanding of the high-dimensional data set. Guided by this, the analyst can explore the high-dimensional data set and interactively select a subset of the potentially most interesting variables, employing various methods for dimensionality reduction. The system is demonstrated through a use-case analysing data from a DNA sequence-based study of bacterial populations.

Outlier Detection in the Framework of Dimensionality Reduction

International Journal of Pattern Recognition and Artificial Intelligence ◽ 10.1142/s0218001415500172 ◽ 2015 ◽ Vol 29 (04) ◽ pp. 1550017 ◽ Cited By ~ 3

Author(s): Qiang Ye ◽ Weifeng Zhi

Keyword(s): Dimensionality Reduction ◽ Outlier Detection ◽ Nonlinear Models ◽ High Dimensional Data ◽ Detection Algorithm ◽ High Dimensional ◽ Dimensional Manifold ◽ Data Set ◽ Manifold Models ◽ Low Dimensional

We propose an effective outlier detection algorithm for high-dimensional data. We consider manifold models of data as is typically assumed in dimensionality reduction/manifold learning. Namely, we consider a noisy data set sampled from a low-dimensional manifold in a high-dimensional data space. Our algorithm uses local geometric structure to determine inliers, from which the outliers are identified. The algorithm is applicable to both linear and nonlinear models of data. We also discuss various implementation issues and we present several examples to demonstrate the effectiveness of the new approach.

A Fast Clustering Algorithm for Large-scale and High Dimensional Data

ACTA AUTOMATICA SINICA ◽ 10.3724/sp.j.1004.2009.00859 ◽ 2009 ◽ Vol 35 (7) ◽ pp. 859-866

Author(s): Ming LIU ◽ Xiao-Long WANG ◽ Yuan-Chao LIU

Keyword(s): Large Scale ◽ Clustering Algorithm ◽ High Dimensional Data ◽ High Dimensional

A meta-heuristic density-based subspace clustering algorithm for high-dimensional data

Soft Computing ◽ 10.1007/s00500-021-05973-1 ◽ 2021

Author(s): Parul Agarwal ◽ Shikha Mehta ◽ Ajith Abraham

Keyword(s): Clustering Algorithm ◽ High Dimensional Data ◽ Subspace Clustering ◽ High Dimensional

Parallel Framework for Dimensionality Reduction of Large-Scale Datasets

Scientific Programming ◽ 10.1155/2015/180214 ◽ 2015 ◽ Vol 2015 ◽ pp. 1-12 ◽ Cited By ~ 3

Author(s): Sai Kiranmayee Samudrala ◽ Jaroslaw Zola ◽ Srinivas Aluru ◽ Baskar Ganapathysubramanian

Keyword(s): Dimensionality Reduction ◽ Organic Solar Cells ◽ Large Scale ◽ Parallel Implementation ◽ High Dimensional Data ◽ Real Life ◽ Processing Parameters ◽ High Dimensional ◽ Morphology Evolution ◽ Reduction Techniques

Dimensionality reduction refers to a set of mathematical techniques used to reduce complexity of the original high-dimensional data, while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying the spectral dimensionality reduction techniques, and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate applicability of our framework we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.
