scholarly journals A New Method for Dimensionality Reduction Using KMeans Clustering Algorithm for High Dimensional Data Set

2011 ◽  
Vol 13 (7) ◽  
pp. 41-46 ◽  
Author(s):  
D. Napoleon ◽  
S. Pavalakodi
2021 ◽  
Vol 297 ◽  
pp. 01070
Author(s):  
Ez-Zarrad Ghizlane ◽  
Sabbar Wafae ◽  
Bekkhoucha Abdelkrim

Clustering of variables is the task of grouping similar variables into different groups. It may be useful in several situations such as dimensionality reduction, feature selection, and detect redundancies. In the present study, we combine two methods of features clustering the clustering of variables around latent variables (CLV) algorithm and the k-means based co-clustering algorithm (kCC). Indeed, classical CLV cannot be applied to high dimensional data because this approach becomes tedious when the number of features increases.


2020 ◽  
pp. 1-12
Author(s):  
Xiaoguang Gao

The unbalanced development strategy makes the regional development unbalanced. Therefore, in the development process, resources must be effectively utilized according to the level and characteristics of each region. Considering the resource and environmental constraints, this paper measures and analyzes China’s green economic efficiency and green total factor productivity. Moreover, by expounding the characteristics of high-dimensional data, this paper points out the problems of traditional clustering algorithms in high-dimensional data clustering. This paper proposes a density peak clustering algorithm based on sampling and residual squares, which is suitable for high-dimensional large data sets. The algorithm finds abnormal points and boundary points by identifying halo points, and finally determines clusters. In addition, from the experimental comparison on the data set, it can be seen that the improved algorithm is better than the DPC algorithm in both time complexity and clustering results. Finally, this article analyzes data based on actual cases. The research results show that the method proposed in this paper is effective.


2022 ◽  
Vol 2022 ◽  
pp. 1-17
Author(s):  
Zhihui Hu ◽  
Xiaoran Wei ◽  
Xiaoxu Han ◽  
Guang Kou ◽  
Haoyu Zhang ◽  
...  

Density peaks clustering (DPC) is a well-known density-based clustering algorithm that can deal with nonspherical clusters well. However, DPC has high computational complexity and space complexity in calculating local density ρ and distance δ , which makes it suitable only for small-scale data sets. In addition, for clustering high-dimensional data, the performance of DPC still needs to be improved. High-dimensional data not only make the data distribution more complex but also lead to more computational overheads. To address the above issues, we propose an improved density peaks clustering algorithm, which combines feature reduction and data sampling strategy. Specifically, features of the high-dimensional data are automatically extracted by principal component analysis (PCA), auto-encoder (AE), and t-distributed stochastic neighbor embedding (t-SNE). Next, in order to reduce the computational overhead, we propose a novel data sampling method for the low-dimensional feature data. Firstly, the data distribution in the low-dimensional feature space is estimated by the Quasi-Monte Carlo (QMC) sequence with low-discrepancy characteristics. Then, the representative QMC points are selected according to their cell densities. Next, the selected QMC points are used to calculate ρ and δ instead of the original data points. In general, the number of the selected QMC points is much smaller than that of the initial data set. Finally, a two-stage classification strategy based on the QMC points clustering results is proposed to classify the original data set. Compared with current works, our proposed algorithm can reduce the computational complexity from O n 2 to O N n , where N denotes the number of selected QMC points and n is the size of original data set, typically N ≪ n . Experimental results demonstrate that the proposed algorithm can effectively reduce the computational overhead and improve the model performance.


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we have presented a robust multi objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions with their locations in data set. After detection of dense regions it eliminates outliers. MOSCL discovers subspaces in dense regions of data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL for subspace clustering is superior to PROCLUS clustering algorithm. Additionally we investigate the effects of first phase for detecting dense regions on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by clustering error (CE) distance on various data sets. MOSCL can discover the clusters in all subspaces with high quality, and the efficiency of MOSCL outperforms PROCLUS.


2018 ◽  
Vol 7 (2.11) ◽  
pp. 27 ◽  
Author(s):  
Kahkashan Kouser ◽  
Amrita Priyam

One of the open problems of modern data mining is clustering high dimensional data. For this in the paper a new technique called GA-HDClustering is proposed, which works in two steps. First a GA-based feature selection algorithm is designed to determine the optimal feature subset; an optimal feature subset is consisting of important features of the entire data set next, a K-means algorithm is applied using the optimal feature subset to find the clusters. On the other hand, traditional K-means algorithm is applied on the full dimensional feature space.    Finally, the result of GA-HDClustering  is  compared  with  the  traditional  clustering  algorithm.  For comparison different validity  matrices  such  as  Sum  of  squared  error  (SSE),  Within  Group average distance (WGAD), Between group distance (BGD), Davies-Bouldin index(DBI),   are used .The GA-HDClustering uses genetic algorithm for searching an effective feature subspace in a large feature space. This large feature space is made of all dimensions of the data set. The experiment performed on the standard data set revealed that the GA-HDClustering is superior to traditional clustering algorithm. 


2012 ◽  
Vol 12 (1) ◽  
pp. 44-64 ◽  
Author(s):  
Sara Johansson Fernstad ◽  
Jane Shaw ◽  
Jimmy Johansson

High-dimensional data sets containing hundreds of variables are difficult to explore, as traditional visualization methods often are unable to represent such data effectively. This is commonly addressed by employing dimensionality reduction prior to visualization. Numerous dimensionality reduction methods are available. However, few reduction approaches take the importance of several structures into account and few provide an overview of structures existing in the full high-dimensional data set. For exploratory analysis, as well as for many other tasks, several structures may be of interest. Exploration of the full high-dimensional data set without reduction may also be desirable. This paper presents flexible methods for exploratory analysis and interactive dimensionality reduction. Automated methods are employed to analyse the variables, using a range of quality metrics, providing one or more measures of ‘interestingness’ for individual variables. Through ranking, a single value of interestingness is obtained, based on several quality metrics, that is usable as a threshold for the most interesting variables. An interactive environment is presented in which the user is provided with many possibilities to explore and gain understanding of the high-dimensional data set. Guided by this, the analyst can explore the high-dimensional data set and interactively select a subset of the potentially most interesting variables, employing various methods for dimensionality reduction. The system is demonstrated through a use-case analysing data from a DNA sequence-based study of bacterial populations.


Author(s):  
Qiang Ye ◽  
Weifeng Zhi

We propose an effective outlier detection algorithm for high-dimensional data. We consider manifold models of data as is typically assumed in dimensionality reduction/manifold learning. Namely, we consider a noisy data set sampled from a low-dimensional manifold in a high-dimensional data space. Our algorithm uses local geometric structure to determine inliers, from which the outliers are identified. The algorithm is applicable to both linear and nonlinear models of data. We also discuss various implementation issues and we present several examples to demonstrate the effectiveness of the new approach.


2009 ◽  
Vol 35 (7) ◽  
pp. 859-866
Author(s):  
Ming LIU ◽  
Xiao-Long WANG ◽  
Yuan-Chao LIU

2015 ◽  
Vol 2015 ◽  
pp. 1-12 ◽  
Author(s):  
Sai Kiranmayee Samudrala ◽  
Jaroslaw Zola ◽  
Srinivas Aluru ◽  
Baskar Ganapathysubramanian

Dimensionality reduction refers to a set of mathematical techniques used to reduce complexity of the original high-dimensional data, while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying the spectral dimensionality reduction techniques, and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate applicability of our framework we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.


Sign in / Sign up

Export Citation Format

Share Document