Efficiency and Effectiveness of Clustering Algorithms for High Dimensional Data

Subspace clustering approaches cluster high dimensional data in different subspaces, that is, they group the data using different relevant subsets of dimensions. This technique has become very effective because distance measures become ineffective in a high dimensional space. This chapter presents a novel evolutionary approach to bottom-up subspace clustering, SUBSPACE_DE, which is scalable to high dimensional data. SUBSPACE_DE uses a self-adaptive DBSCAN algorithm to cluster the data instances of each attribute and of the maximal subspaces. The self-adaptive DBSCAN algorithm accepts its input from a differential evolution algorithm. The proposed SUBSPACE_DE algorithm is tested on 14 datasets, both real and synthetic, and compared with 11 existing subspace clustering algorithms using evaluation metrics such as F1_Measure and accuracy. The proposed algorithm ranks considerably better under success rate ratio ranking in both accuracy and F1_Measure. SUBSPACE_DE also shows potential scalability on high dimensional datasets.
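
To make the per-attribute clustering step concrete, the following is a minimal sketch of plain DBSCAN applied to the values of a single attribute. It is not the SUBSPACE_DE implementation: the function name `dbscan_1d` is hypothetical, and the sketch assumes that the input supplied by differential evolution is the pair `eps` and `min_pts`, which are fixed by hand here.

```python
def dbscan_1d(values, eps, min_pts):
    """Label each value with a cluster id (0, 1, ...) or -1 for noise.

    Plain DBSCAN on one attribute. eps and min_pts are hand-picked here;
    tuning them by differential evolution is NOT shown (assumption).
    """
    n = len(values)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points (including i itself) within eps of values[i].
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


attribute = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0]
print(dbscan_1d(attribute, eps=0.3, min_pts=2))
```

In a bottom-up scheme, dense one-dimensional regions found this way would be the starting units that are later combined into higher-dimensional subspaces.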

M-Denclue for Effective Data Clustering in High Dimensional Non-Linear Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a9109.119119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 2925-2927

Keyword(s):

Clustering Algorithms ◽

High Dimensional Data ◽

Research Work ◽

Curse Of Dimensionality ◽

Distance Measures ◽

High Dimensional ◽

Clustering Methods ◽

Non Linear ◽

Low Dimensional ◽

Automatic Grouping

Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Clustering in high-dimensional spaces is a recurrent problem in many domains. High dimensionality affects the time complexity, space complexity, scalability and accuracy of clustering methods. High-dimensional non-linear data usually live in different low-dimensional subspaces hidden in the original space. As high-dimensional objects appear almost alike, new approaches for clustering are required. This research has focused on developing mathematical models, techniques and clustering algorithms specifically for high-dimensional data. With the immense growth in the fields of communication and technology, there is tremendous growth in high-dimensional data spaces. As the number of dimensions of high-dimensional non-linear data increases, many clustering techniques begin to suffer from the curse of dimensionality, degrading the quality of the results. In high-dimensional non-linear data, the data become very sparse and distance measures become increasingly meaningless. The principal challenge for clustering high-dimensional data is to overcome the “curse of dimensionality”. This research work concentrates on devising an enhanced algorithm for clustering high-dimensional non-linear data.
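
The claim that distance measures become increasingly meaningless can be illustrated with a small experiment. The sketch below is not taken from the paper: the function name `relative_contrast`, the uniform random data and the sample size are all assumptions. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance, as the number of dimensions grows.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Points are drawn uniformly from the unit hypercube (an assumption
    made only for this illustration).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 10, 100, 500):
    print(dim, round(relative_contrast(dim), 3))
```

On uniform data the contrast typically shrinks as the dimension grows, so the nearest and farthest neighbours become hard to tell apart, which is one reason full-space distances lose discriminative power.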

Clustering for High Dimensional Data: Density based Subspace Clustering Algorithms

International Journal of Computer Applications ◽

10.5120/10584-5732 ◽

2013 ◽

Vol 63 (20) ◽

pp. 29-35 ◽

Cited By ~ 1

Author(s):

Sunita Jahirabadkar ◽

Parag Kulkarni

Keyword(s):

Clustering Algorithms ◽

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional ◽

Data Density

Detecting Outliers in High Dimensional Data Sets using Z-Score Methodology

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a3910.119119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 48-53

Keyword(s):

Outlier Detection ◽

Credit Card ◽

High Dimensional Data ◽

Research Area ◽

High Dimensional ◽

Data Sets ◽

Z Score ◽

Wide Range ◽

Efficiency And Effectiveness ◽

Projected Methods

Outlier detection is an interesting research area in machine learning. With recently emerging tools and varied applications, interest in outlier detection is growing significantly. A significant number of outlier detection approaches have recently been proposed and effectively applied in a wide range of fields, including medical health, credit card fraud and intrusion detection. They can be utilized for conservative data analysis. Outlier detection aims to discover patterns in data that do not conform to expected behavior. In this paper, we present a statistical approach, the Z-score method, for outlier detection in high-dimensional data. Z-score is a novel method for identifying distant data points based on their positions on charts. The proposed method is computationally fast and robust for outlier detection. A comparative analysis with existing methods is carried out on high dimensional datasets. Experimental results show improved performance, efficiency and effectiveness of the proposed method.
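
As a rough illustration of Z-score based detection, the sketch below standardizes each feature and flags a row when any feature lies more than `threshold` standard deviations from its mean. The function name `zscore_outliers`, the per-feature rule and the default threshold of 3.0 are assumptions for this example, not details confirmed by the abstract.

```python
import statistics


def zscore_outliers(rows, threshold=3.0):
    """Return the indices of rows with at least one |z| above threshold.

    z = (x - mean) / std is computed per feature (column) using the
    population standard deviation. Constant columns are skipped.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    flagged = []
    for idx, row in enumerate(rows):
        for x, mu, sigma in zip(row, means, stdevs):
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                flagged.append(idx)
                break  # one extreme feature is enough to flag the row
    return flagged


# 19 ordinary rows plus one row with an extreme second feature.
data = [(1.0 + 0.01 * i, 2.0) for i in range(19)] + [(1.1, 50.0)]
print(zscore_outliers(data))
```

Note that with a population standard deviation the largest possible |z| in a sample of n values is sqrt(n - 1), so a threshold of 3.0 can never fire on samples of ten or fewer rows.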

Instance-Wise Denoising Autoencoder for High Dimensional Data

Mathematical Problems in Engineering ◽

10.1155/2016/4365372 ◽

2016 ◽

Vol 2016 ◽

pp. 1-13 ◽

Cited By ~ 2

Author(s):

Lin Chen ◽

Wan-Yu Deng

Keyword(s):

Neural Network ◽

High Dimensional Data ◽

Information Loss ◽

Feature Representation ◽

High Dimensional ◽

Denoising Autoencoder ◽

Text Data ◽

Efficiency And Effectiveness ◽

Zero Vector ◽

Heterogeneous Feature

Denoising Autoencoder (DAE) is one of the most popular approaches that has reported significant success in recent neural network research. To be specific, DAE randomly corrupts some features of the data to zero so as to utilize the cooccurrence information while avoiding overfitting. However, existing DAE approaches do not fare well on sparse and high dimensional data. In this paper, we present a Denoising Autoencoder labeled here as Instance-Wise Denoising Autoencoder (IDA), which is designed to work with high dimensional and sparse data by utilizing the instance-wise cooccurrence relation instead of the feature-wise one. IDA is based on the following corruption rule: if an instance vector of nonzero features is selected, it is forced to become a zero vector. To avoid serious information loss in the event that too many instances are discarded, an ensemble of multiple independent autoencoders built on different corrupted versions of the data is considered. Extensive experimental results on high dimensional and sparse text data show the superiority of IDA in efficiency and effectiveness. IDA is also evaluated in the heterogeneous transfer learning setting and on cross-modal retrieval to study its generality on heterogeneous feature representation.
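
The difference between the two corruption rules can be sketched on a small batch. The code below covers only the corruption step, not the autoencoder or the ensemble, and is not the authors' implementation: the function names, the `rate` parameter and the use of independent random selection per feature or per instance are assumptions.

```python
import random


def corrupt_features(batch, rate, rng):
    """Feature-wise masking noise: zero each feature independently."""
    return [[0.0 if rng.random() < rate else x for x in row] for row in batch]


def corrupt_instances(batch, rate, rng):
    """Instance-wise corruption: a selected instance becomes a zero vector."""
    return [
        [0.0] * len(row) if rng.random() < rate else list(row)
        for row in batch
    ]


batch = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [4.0, 0.0, 0.0, 5.0],
]
print(corrupt_features(batch, rate=0.5, rng=random.Random(0)))
print(corrupt_instances(batch, rate=0.5, rng=random.Random(0)))
```

Running `corrupt_instances` several times with different seeds yields different corrupted versions of the same data, which is the kind of input an ensemble of independent autoencoders would be trained on.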

Clustering High Dimensional Data Using Subspace and Projected Clustering Algorithms

International Journal of Computer Science and Information Technology ◽

10.5121/ijcsit.2010.2414 ◽

2010 ◽

Vol 2 (4) ◽

pp. 162-170 ◽

Cited By ~ 7

Author(s):

Rahmat Widia Sembiring ◽

Jasni Mohamad Zain ◽

Abdullah Embong

Keyword(s):

Clustering Algorithms ◽

High Dimensional Data ◽

High Dimensional ◽

Projected Clustering

Clustering Algorithms For High Dimensional Data – A Survey Of Issues And Existing Approaches

International Journal of Computer Science and Informatics ◽

10.47893/ijcsi.2013.1108 ◽

2013 ◽

pp. 293-299

Author(s):

B.Hari Babu ◽

N.Subash Chandra ◽

T. Venu Gopal

Keyword(s):

Dimensional Space ◽

Clustering Algorithms ◽

High Dimensional Data ◽

Microarray Gene Expression Data ◽

Distance Measures ◽

High Dimensional ◽

Data Mining Technique ◽

Microarray Gene Expression ◽

Redundancy Elimination ◽

Different Types

Clustering is the most prominent data mining technique used for grouping data into clusters based on distance measures. With the growth of high dimensional data such as microarray gene expression data, grouping such data into clusters runs into the problem that the similarity between objects in the full dimensional space is often invalid because the space contains different types of data. The process of grouping high dimensional data into clusters is not accurate and perhaps not up to the level of expectation when the dimension of the dataset is high. The problem is now attracting tremendous attention in research and development. To address the performance issues of clustering high dimensional data, it is necessary to study, analyze and improve issues such as dimensionality reduction, redundancy elimination, subspace clustering, co-clustering and data labeling for clusters. In this paper, we present a brief comparison of the existing algorithms that focus mainly on clustering high dimensional data.

A Survey on Various Clustering Algorithms in High Dimensional Data

Proceedings of the International Conference on Emerging Trends in Engineering & Technology (ICETET-2015) ◽

10.3850/978-981-09-5346-1_cse-557 ◽

2015 ◽

Author(s):

M. Amina ◽

K. Syed Farook

Keyword(s):

Clustering Algorithms ◽

High Dimensional Data ◽

High Dimensional

Subspace Clustering for High-Dimensional Data Using Cluster Structure Similarity

International Journal of Intelligent Information Technologies ◽

10.4018/ijiit.2018070103 ◽

2018 ◽

Vol 14 (3) ◽

pp. 38-55 ◽

Cited By ~ 2

Author(s):

Kavan Fatehi ◽

Mohsen Rezvani ◽

Mansoor Fateh ◽

Mohammad-Reza Pajoohan

Keyword(s):

Similarity Measure ◽

State Of The Art ◽

Clustering Algorithms ◽

Cluster Structure ◽

High Dimensional Data ◽

Subspace Clustering ◽

The State ◽

High Dimensional ◽

Running Time ◽

Structure Similarity

This article describes how, because of the curse of dimensionality in high dimensional data, a significant amount of research has recently been conducted on subspace clustering, which aims at discovering clusters embedded in any possible combination of attributes. The main goal of subspace clustering algorithms is to find all clusters in all subspaces. Previous studies have mostly generated redundant subspace clusters, leading to a loss of clustering accuracy and an increase in the running time of the algorithms. A bottom-up density-based approach is suggested in this article, in which the cluster structure serves as a similarity measure to generate the optimal subspaces, which raises the accuracy of the subspace clustering. Based on this idea, the algorithm discovers similar subspaces by considering the similarity in their cluster structure, then combines them and clusters the data in the new subspaces again. Finally, the algorithm determines all the subspaces and finds all clusters within them. Experiments on various synthetic and real datasets show that the results of the proposed approach are significantly better in quality and runtime than the state-of-the-art on clustering high-dimensional data.
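
One way to compare the cluster structure of two subspaces is to check how many pairs of points are grouped together in both. The sketch below uses a pair-counting Jaccard index as a stand-in; the abstract does not specify the article's actual similarity measure, so this choice, the function names and the convention that label -1 marks noise are all assumptions.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Set of index pairs (i, j) that share a cluster; -1 is noise."""
    pairs = set()
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] != -1 and labels[i] == labels[j]:
            pairs.add((i, j))
    return pairs


def structure_similarity(labels_a, labels_b):
    """Jaccard index over co-clustered pairs of two clusterings."""
    a = co_clustered_pairs(labels_a)
    b = co_clustered_pairs(labels_b)
    if not a and not b:
        return 0.0  # neither subspace contains any cluster
    return len(a & b) / len(a | b)


# Cluster labels of the same six points in three one-dimensional subspaces.
subspace_x = [0, 0, 0, 1, 1, -1]
subspace_y = [5, 5, 5, 7, 7, -1]  # same grouping, different label names
subspace_z = [0, 1, 0, 1, 0, 1]   # very different grouping

print(structure_similarity(subspace_x, subspace_y))
print(structure_similarity(subspace_x, subspace_z))
```

Under such a measure, subspaces like `subspace_x` and `subspace_y` would be candidates for merging into a two-dimensional subspace, while `subspace_z` would not.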

Urban green economic development indicators based on spatial clustering algorithm and blockchain

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189535 ◽

2020 ◽

pp. 1-12

Author(s):

Xiaoguang Gao

Keyword(s):

Development Strategy ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Clustering Algorithms ◽

High Dimensional Data ◽

Large Data ◽

Experimental Comparison ◽

High Dimensional ◽

Density Peak ◽

Data Set

The unbalanced development strategy makes the regional development unbalanced. Therefore, in the development process, resources must be effectively utilized according to the level and characteristics of each region. Considering the resource and environmental constraints, this paper measures and analyzes China’s green economic efficiency and green total factor productivity. Moreover, by expounding the characteristics of high-dimensional data, this paper points out the problems of traditional clustering algorithms in high-dimensional data clustering. This paper proposes a density peak clustering algorithm based on sampling and residual squares, which is suitable for high-dimensional large data sets. The algorithm finds abnormal points and boundary points by identifying halo points, and finally determines clusters. In addition, from the experimental comparison on the data set, it can be seen that the improved algorithm is better than the DPC algorithm in both time complexity and clustering results. Finally, this article analyzes data based on actual cases. The research results show that the method proposed in this paper is effective.
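
For context, the baseline DPC (density peak clustering) algorithm that the paper compares against scores each point by two quantities: its local density and its distance to the nearest point of higher density. The sketch below computes only these two baseline quantities. It does not implement the paper's sampling or residual-squares improvements, and the function name `density_peaks` and the cutoff distance `d_c` are illustrative assumptions.

```python
import math


def density_peaks(points, d_c):
    """Return (rho, delta) as used by baseline density peak clustering.

    rho[i]   = number of other points closer than the cutoff d_c.
    delta[i] = distance to the nearest point with strictly higher rho,
               or the largest distance from i if no such point exists.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.5, 0.5),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (10.5, 10.5),
]
rho, delta = density_peaks(points, d_c=0.8)
# Candidate centers have both high density and high delta.
scores = sorted(range(len(points)), key=lambda i: rho[i] * delta[i], reverse=True)
print(rho)
print(scores[:2])
```

Building the full distance matrix costs quadratic time and memory, which is the scalability bottleneck on large data sets that sampling-based variants aim to reduce.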
