high dimensional clustering Latest Research Papers

Two new initialization methods for K-means clustering are proposed. Both proposals are based on applying a divide-and-conquer approach for the K-means‖ type of an initialization strategy. The second proposal also uses multiple lower-dimensional subspaces produced by the random projection method for the initialization. The proposed methods are scalable and can be run in parallel, which make them suitable for initializing large-scale problems. In the experiments, comparison of the proposed methods to the K-means++ and K-means‖ methods is conducted using an extensive set of reference and synthetic large-scale datasets. Concerning the latter, a novel high-dimensional clustering data generation algorithm is given. The experiments show that the proposed methods compare favorably to the state-of-the-art by improving clustering accuracy and the speed of convergence. We also observe that the currently most popular K-means++ initialization behaves like the random one in the very high-dimensional cases.

Download Full-text

The Sparse MinMax k-Means Algorithm for High-Dimensional Clustering

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/291 ◽

2020 ◽

Author(s):

Sayak Dey ◽

Swagatam Das ◽

Rammohan Mallipeddi

Keyword(s):

Real World ◽

Dimensional Space ◽

Sum Of Squares ◽

High Dimensional ◽

Clustering Methods ◽

High Dimensional Space ◽

Sparse Regularization ◽

Clustering Approach ◽

Real World Datasets ◽

High Dimensional Clustering

Classical clustering methods usually face tough challenges when we have a larger set of features compared to the number of items to be partitioned. We propose a Sparse MinMax k-Means Clustering approach by reformulating the objective of the MinMax k-Means algorithm (a variation of classical k-Means that minimizes the maximum intra-cluster variance instead of the sum of intra-cluster variances), into a new weighted between-cluster sum of squares (BCSS) form. We impose sparse regularization on these weights to make it suitable for high-dimensional clustering. We seek to use the advantages of the MinMax k-Means algorithm in the high-dimensional space to generate good quality clusters. The efficacy of the proposal is showcased through comparison against a few representative clustering methods over several real world datasets.

Download Full-text

High dimensional clustering for mixture models

10.32657/10356/142941 ◽

2020 ◽

Author(s):

◽

Yiming Liu

Keyword(s):

Mixture Models ◽

High Dimensional ◽

High Dimensional Clustering

Download Full-text

High-Dimensional Clustering for Incomplete Mixed Dataset Using Artificial Intelligence

IEEE Access ◽

10.1109/access.2020.2986813 ◽

2020 ◽

Vol 8 ◽

pp. 69629-69638

Author(s):

Meishan Li ◽

Xiaofeng Li ◽

Jing Li

Keyword(s):

Artificial Intelligence ◽

High Dimensional ◽

High Dimensional Clustering

Download Full-text

Regularized Gaussian Mixture Model for High-Dimensional Clustering

IEEE Transactions on Cybernetics ◽

10.1109/tcyb.2018.2846404 ◽

2019 ◽

Vol 49 (10) ◽

pp. 3677-3688 ◽

Cited By ~ 3

Author(s):

Yang Zhao ◽

Abhishek K. Shrivastava ◽

Kwok Leung Tsui

Keyword(s):

Gaussian Mixture Model ◽

Mixture Model ◽

Gaussian Mixture ◽

High Dimensional ◽

High Dimensional Clustering

Download Full-text

High Dimensional Clustering with r-nets

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013207 ◽

2019 ◽

Vol 33 ◽

pp. 3207-3214 ◽

Cited By ~ 1

Author(s):

Georgia Avarikioti ◽

Alain Ryser ◽

Yuyi Wang ◽

Roger Wattenhofer

Keyword(s):

Machine Learning ◽

Time Complexity ◽

Data Science ◽

Nearest Neighbor ◽

Approximate Solutions ◽

High Dimensional ◽

Neighbor Distance ◽

Learning Groups ◽

Nearest Neighbor Distance ◽

High Dimensional Clustering

Clustering, a fundamental task in data science and machine learning, groups a set of objects in such a way that objects in the same cluster are closer to each other than to those in other clusters. In this paper, we consider a well-known structure, so-called r-nets, which rigorously captures the properties of clustering. We devise algorithms that improve the runtime of approximating r-nets in high-dimensional spaces with1 and `2 metrics from, where . These algorithms are also used to improve a framework that provides approximate solutions to other high dimensional distance problems. Using this framework, several important related problems can also be solved efficiently, e.g.,pproximate kth-nearest neighbor distance-approximate Min-Max clustering,-approximate k-center clustering. In addition, we build an algorithm that-approximates greedy permutations in time O˜((dn+n2−α)·logΦ) where Φ is the spread of the input. This algorithm is used to -approximate k-center with the same time complexity.

Download Full-text

Clustering Approach toward Large Truck Crash Analysis

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/0361198119839347 ◽

2019 ◽

Vol 2673 (8) ◽

pp. 73-85 ◽

Cited By ~ 7

Author(s):

Alireza Rahimi ◽

Ghazaleh Azimi ◽

Hamidreza Asgari ◽

Xia Jin

Keyword(s):

Goodness Of Fit ◽

Crash Analysis ◽

Clustering Methods ◽

Crash Data ◽

Pseudo Likelihood ◽

Single Vehicle ◽

Clustering Approach ◽

Block Clustering ◽

Attribute Clustering ◽

High Dimensional Clustering

Heterogeneity of crash data masks the underlying crash patterns and perplexes crash analysis. This paper aims to explore an advanced high-dimensional clustering approach to investigate heterogeneity in large datasets. Detailed records of crashes involving large trucks occurring in the state of Florida between 2007 and 2016 were examined to identify truck crash patterns and significant conditions contributing to the patterns. The block clustering method was applied to more than 220,000 crash records with nearly 200 attributes. The analysis showed promising results in segmenting a large heterogeneous dataset into meaningful subgroups (with 95.72% average degree of homogeneity for selected blocks). The goodness of fit for clustering methods is evaluated and both integrated completed likelihood (ICL) and pseudo-likelihood values improved significantly (20.8% and 21.1% respectively). Attribute clustering showed distinct characteristics for each cluster. Crash clustering revealed significant differences among the clusters and suggested that this crash dataset could be portioned as same-direction, opposing-direction, and single-vehicle crashes. Individual blocks defined by both row and column clustering were further investigated to better understand the contribution set of conditions that lead to large truck crashes. Major features for each of the three major types of crashes were analyzed, which may provide additional insights to develop potential countermeasures and strategies that target specific segments. The clustering approach could be used as a preanalysis method to identify homogeneous subgroups for further analysis, which will help enhance the effectiveness of safety programs.

Download Full-text