A Hierarchical Gamma Mixture Model-Based Method for Classification of High-Dimensional Data

Data classification is an important research topic in data mining. With the rapid growth of social media sites and IoT devices, data have grown tremendously in volume and complexity, producing many large, complex, high-dimensional data sets. Classifying such data when the number of classes is large remains a great challenge for current state-of-the-art methods. This paper presents a novel hierarchical gamma mixture model-based unsupervised method for classifying high-dimensional data with a large number of classes. In this method, we first partition the features of the data set into feature strata by using k-means. Then, a set of subspace data sets is generated from the feature strata by using the stratified subspace sampling method. After that, the GMM Tree algorithm is used to identify the number of clusters and the initial clusters in each subspace data set, and the initial cluster centers are passed to k-means to generate the base subspace clustering results. The subspace clustering results are then integrated into an object cluster association (OCA) matrix by using the link-based method. The ensemble clustering result is generated from the OCA matrix by the k-means algorithm, with the number of clusters identified by the GMM Tree algorithm. After the ensemble clustering result is produced, the purity of each cluster is computed and the dominant class label is assigned to it. A new object is classified by computing the distance between the object and the center of each cluster in the classifier and assigning it the class label of the cluster at the shortest distance. A series of experiments was conducted on twelve synthetic and eight real-world data sets with different numbers of classes, features, and objects. The experimental results show that the new method outperforms other state-of-the-art techniques in classifying data on most of the data sets.
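
The last two steps of the abstract above (dominant-class labelling by purity, then nearest-center classification) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `label_clusters` and `classify`, the toy data, and the choice of Euclidean distance are assumptions made for illustration, and the cluster assignments and centers are taken as given rather than produced by the GMM Tree and ensemble steps.

```python
from collections import Counter
import math


def label_clusters(assignments, labels):
    """Map each cluster id to (dominant class, purity).

    Purity here is the share of the cluster's objects that carry
    the dominant class label.
    """
    by_cluster = {}
    for cluster_id, label in zip(assignments, labels):
        by_cluster.setdefault(cluster_id, []).append(label)
    result = {}
    for cluster_id, members in by_cluster.items():
        dominant, count = Counter(members).most_common(1)[0]
        result[cluster_id] = (dominant, count / len(members))
    return result


def classify(x, centers, cluster_labels):
    """Give x the class of the cluster whose center is nearest.

    Euclidean distance is an assumption; the abstract does not name
    the distance measure.
    """
    nearest = min(centers, key=lambda cid: math.dist(x, centers[cid]))
    return cluster_labels[nearest][0]


# Toy data (hypothetical): five labelled objects in two clusters.
assignments = [0, 0, 0, 1, 1]
labels = ["a", "a", "b", "b", "b"]
centers = {0: (0.0, 0.0), 1: (5.0, 5.0)}

cluster_labels = label_clusters(assignments, labels)
prediction = classify((4.0, 4.5), centers, cluster_labels)
```

In this toy case cluster 0 is labelled "a" with purity 2/3, cluster 1 is labelled "b" with purity 1, and the new object falls nearest to cluster 1.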

Sparse Kernel Clustering of Massive High-Dimensional Data sets with Large Number of Clusters

Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management ◽

10.1145/2809890.2809896 ◽

2015 ◽

Author(s):

Radha Chitta ◽

Anil K. Jain ◽

Rong Jin

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Data Sets ◽

Number Of Clusters ◽

Kernel Clustering ◽

Sparse Kernel

An entropy weighting mixture model for subspace clustering of high-dimensional data

Pattern Recognition Letters ◽

10.1016/j.patrec.2011.03.003 ◽

2011 ◽

Vol 32 (8) ◽

pp. 1154-1161 ◽

Cited By ~ 11

Author(s):

Liuqing Peng ◽

Junying Zhang

Keyword(s):

Mixture Model ◽

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional

Subspace Clustering for High-Dimensional Data Using Cluster Structure Similarity

International Journal of Intelligent Information Technologies ◽

10.4018/ijiit.2018070103 ◽

2018 ◽

Vol 14 (3) ◽

pp. 38-55 ◽

Cited By ~ 2

Author(s):

Kavan Fatehi ◽

Mohsen Rezvani ◽

Mansoor Fateh ◽

Mohammad-Reza Pajoohan

Keyword(s):

Similarity Measure ◽

State Of The Art ◽

Clustering Algorithms ◽

Cluster Structure ◽

High Dimensional Data ◽

Subspace Clustering ◽

The State ◽

High Dimensional ◽

Running Time ◽

Structure Similarity

Because of the curse of dimensionality in high-dimensional data, a significant amount of recent research has addressed subspace clustering, which aims to discover clusters embedded in any possible combination of attributes. The main goal of subspace clustering algorithms is to find all clusters in all subspaces. Previous approaches have mostly generated redundant subspace clusters, which reduces clustering accuracy and increases running time. This article proposes a bottom-up density-based approach in which cluster structure serves as a similarity measure for generating the optimal subspaces, which raises the accuracy of the subspace clustering. Based on this idea, the algorithm discovers similar subspaces by comparing their cluster structure, combines them, and clusters the data again in the new subspaces. Finally, the algorithm determines all the subspaces and finds all clusters within them. Experiments on various synthetic and real data sets show that the proposed approach is significantly better in quality and runtime than the state of the art in clustering high-dimensional data.

Incomplete high dimensional data streams clustering

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-200297 ◽

2020 ◽

Vol 39 (3) ◽

pp. 4227-4243

Author(s):

Fatma M. Najib ◽

Rasha M. Ismail ◽

Nagwa L. Badr ◽

Tarek F. Gharib

Keyword(s):

Data Streams ◽

Missing Values ◽

High Dimensional Data ◽

Subspace Clustering ◽

Streaming Data ◽

High Dimensionality ◽

High Dimensional ◽

Data Sets ◽

Multiple Data ◽

Sensitivity Specificity

Many recent applications, such as sensor networks, generate continuous, time-varying data streams that are often gathered from multiple data sources and exhibit some incompleteness and high dimensionality. Clustering such incomplete high-dimensional streaming data faces four constraints: 1) data incompleteness, 2) high dimensionality of the data, 3) data distribution, and 4) the continuous nature of data streams. In this paper, we propose the Subspace clustering for Incomplete High dimensional Data streams (SIHD) framework, which addresses these clustering issues. SIHD provides continuous missing-value imputation for incomplete streams based on the corresponding nearest-neighbors’ intervals. An adaptive subspace clustering mechanism is proposed to deal with such incomplete high-dimensional data streams. Our experimental results on two different data sets demonstrate the efficiency of the SIHD framework in clustering such streams in terms of accuracy, precision, sensitivity, specificity, and F-score compared to five algorithms: GFCM, GBDC-P2P, DS, Ensemble, and DMSC. On average, SIHD improved on these five algorithms, in the order listed, as follows: 1) accuracy by 11.3%, 10.8%, 6.5%, 4.1%, and 3.6%; 2) precision by 15%, 10.6%, 6.4%, 4%, and 3.5%; 3) sensitivity by 16.6%, 10.6%, 5.8%, 4.2%, and 3.6%; 4) specificity by 16.8%, 10.9%, 6.5%, 4%, and 3.5%; and 5) F-score by 16.6%, 10.7%, 6.6%, 4.1%, and 3.6%.
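
For readers unfamiliar with the five evaluation measures named above, the following Python sketch computes them from confusion-matrix counts using their standard definitions. It assumes the binary case for simplicity; how the SIHD paper aggregates the measures across clusters is not stated in the abstract. The function name `binary_metrics` and the counts are hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary evaluation measures from confusion-matrix counts."""

    def safe_div(numerator, denominator):
        # Return 0.0 rather than raising when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)  # also called recall
    specificity = safe_div(tn, tn + fp)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    # F-score: harmonic mean of precision and sensitivity.
    f_score = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
    }


# Hypothetical counts, not taken from the paper.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
```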

Subspace Clustering of High-Dimensional Data: An Evolutionary Approach

Applied Computational Intelligence and Soft Computing ◽

10.1155/2013/863146 ◽

2013 ◽

Vol 2013 ◽

pp. 1-12 ◽

Cited By ~ 3

Author(s):

Singh Vijendra ◽

Sahoo Laxman

Keyword(s):

Clustering Algorithm ◽

Dimensional Space ◽

Clustering Algorithms ◽

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional ◽

Data Sets ◽

Real World Data ◽

Data Set ◽

Data Points

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting the dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effect of the first phase, the detection of dense regions, on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by clustering error (CE) distance on various data sets. MOSCL can discover the clusters in all subspaces with high quality, and it is more efficient than PROCLUS.

An Advanced Mining Services in Predicting and Ranking User Vitality across Dynamic and High Dimensional Data Sets

SSRN Electronic Journal ◽

10.2139/ssrn.3395242 ◽

2019 ◽

Author(s):

Ch. Durga Bhavani ◽

Dr. A. Daveedu Raju ◽

Dr. V. Surya Narayana

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Data Sets

The Generalized Bayes Method for High-Dimensional Data Recognition with Applications to Audio Signal Recognition

Symmetry ◽

10.3390/sym13010019 ◽

2020 ◽

Vol 13 (1) ◽

pp. 19

Author(s):

Hsiuying Wang

Keyword(s):

Gaussian Mixture Model ◽

Mixture Model ◽

Conventional Method ◽

High Dimensional Data ◽

Audio Signal ◽

Gaussian Mixture ◽

High Dimensional ◽

Signal Recognition ◽

Bayes Method ◽

Generalized Bayes

The high-dimensional data recognition problem based on the Gaussian mixture model (GMM) has useful applications in many areas, such as audio signal recognition, image analysis, and biological evolution. The expectation-maximization algorithm is a popular approach to deriving the maximum likelihood estimators of the GMM. An alternative is to adopt a generalized Bayes estimator for parameter estimation. In this study, an estimator based on the generalized Bayes approach is established. A simulation study shows that the proposed approach performs competitively with the conventional method in high-dimensional GMM recognition. We use a musical data example to illustrate the recognition problem. Suppose that we have audio data of a piece of music and know that the music is from one of four compositions, but we do not know exactly which composition it comes from. The generalized Bayes method shows a higher average recognition rate than the conventional method. This result shows that the generalized Bayes method is a competitor to the conventional method in this real application.
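
The recognition task described above (deciding which of several candidate sources a sample came from) can be sketched as picking the mixture model with the highest log-likelihood. The Python example below is a one-dimensional illustration only: the mixture parameters are hand-picked rather than estimated by EM or by the generalized Bayes estimator, the paper uses four compositions where this toy uses two, and the names `mixture_log_likelihood`, `recognize`, `composition_A`, and `composition_B` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_log_likelihood(samples, components):
    """Log-likelihood of samples under a mixture.

    components is a list of (weight, mean, variance) tuples whose
    weights sum to 1.
    """
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v) for w, m, v in components))
        for x in samples
    )


def recognize(samples, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: mixture_log_likelihood(samples, models[name]))


# Hypothetical, hand-picked mixture parameters for two candidate sources.
models = {
    "composition_A": [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)],
    "composition_B": [(0.7, 5.0, 1.0), (0.3, 8.0, 1.0)],
}
observed = [4.8, 5.3, 7.9, 5.1]
best = recognize(observed, models)
```

Any estimator of the mixture parameters can be plugged in ahead of `recognize`; the comparison of methods in the abstract concerns that estimation step, which this sketch does not reproduce.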

gbt-HIPS: Explaining the Classifications of Gradient Boosted Tree Ensembles

Applied Sciences ◽

10.3390/app11062511 ◽

2021 ◽

Vol 11 (6) ◽

pp. 2511

Author(s):

Julian Hatwell ◽

Mohamed Medhat Gaber ◽

R. Muhammad Atif Azad

Keyword(s):

State Of The Art ◽

Heuristic Method ◽

Good Explanation ◽

Classification Rule ◽

Data Sets ◽

Classification Models ◽

Boundary Values ◽

Class Label ◽

Input Space ◽

Boosted Tree

This research presents Gradient Boosted Tree High Importance Path Snippets (gbt-HIPS), a novel, heuristic method for explaining gradient boosted tree (GBT) classification models by extracting a single classification rule (CR) from the ensemble of decision trees that make up the GBT model. This CR contains the most statistically important boundary values of the input space as antecedent terms. The CR represents a hyper-rectangle of the input space inside which the GBT model is, very reliably, classifying all instances with the same class label as the explanandum instance. In a benchmark test using nine data sets and five competing state-of-the-art methods, gbt-HIPS offered the best trade-off between coverage (0.16–0.75) and precision (0.85–0.98). Unlike competing methods, gbt-HIPS is also demonstrably guarded against under- and over-fitting. A further distinguishing feature of our method is that, unlike much prior work, our explanations also provide counterfactual detail in accordance with widely accepted recommendations for what makes a good explanation.
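
The coverage and precision figures quoted above can be made concrete with a small Python sketch. It assumes the definitions commonly used for rule-based explanations: coverage is the share of instances that fall inside the rule's hyper-rectangle, and precision is the share of those covered instances that the model assigns the target class. It does not reproduce how gbt-HIPS extracts the rule; the rule format, the helper names `rule_covers` and `coverage_and_precision`, and the data are hypothetical.

```python
def rule_covers(x, rule):
    """Check whether instance x lies inside the rule's hyper-rectangle.

    rule maps a feature index to a (low, high) interval; low is
    inclusive and high is exclusive (an assumption of this sketch).
    """
    return all(low <= x[i] < high for i, (low, high) in rule.items())


def coverage_and_precision(X, predicted, rule, target_class):
    """Coverage and precision of a rule against model predictions."""
    covered = [p for x, p in zip(X, predicted) if rule_covers(x, rule)]
    coverage = len(covered) / len(X)
    if covered:
        precision = sum(p == target_class for p in covered) / len(covered)
    else:
        precision = 0.0
    return coverage, precision


# Hypothetical instances and model predictions.
X = [(0.2, 1.0), (0.4, 3.0), (0.9, 1.5), (0.3, 0.5), (0.1, 2.5)]
predicted = ["pos", "pos", "neg", "neg", "pos"]
rule = {0: (0.0, 0.5)}  # feature 0 in [0.0, 0.5)

cov, prec = coverage_and_precision(X, predicted, rule, "pos")
```

Here the rule covers four of the five instances, and the model predicts "pos" for three of those four.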
