Resampling-Based Similarity Measures for High-Dimensional Data

In literature studies, high-dimensional data reduces the efficiency of clustering algorithms and maximises execution time. Therefore, in this paper, we propose an approach called a BV-kmeans (Bayesian Vectorisation along with k-means) that aims to improve document representation models for text clustering. This approach consists of integrating the k-means document clustering with the Bayesian Vectoriser that is used to compute the probability distribution of the documents in the vector space in order to overcome the problems of high-dimensional data and lower the consumption time. We have used various similarity measures which are namely: K divergence, Squared Euclidean distance and Squared χ2 distance in order to determine the effective metrics for modelling the similarity between documents with the proposed approach. We have evaluated the proposed approach on a set of common newspaper websites that have highly dimensional data. Experimental results show that the proposed approach can increase the degree to which a cluster encases documents from a specific category by 85%. This is in comparison with the standard k-means algorithm and it has succeeded in lowering the runtime using the proposed approach by 95% compared to the standard k-means algorithm.

Download Full-text

Large Sample Covariance Matrices and High-Dimensional Data Analysis

10.1017/cbo9781107588080 ◽

2015 ◽

Cited By ~ 26

Author(s):

Jianfeng Yao ◽

Shurong Zheng ◽

Zhidong Bai

Keyword(s):

Data Analysis ◽

High Dimensional Data ◽

Covariance Matrices ◽

High Dimensional ◽

Large Sample ◽

Sample Covariance Matrices ◽

Sample Covariance ◽

High Dimensional Data Analysis

Download Full-text

Fractal-Based Methods as a Technique for Estimating the Intrinsic Dimensionality of High-Dimensional Data: A Survey

Informatica ◽

10.15388/informatica.2016.84 ◽

2016 ◽

Vol 27 (2) ◽

pp. 257-281 ◽

Cited By ~ 5

Author(s):

Rasa Karbauskaitė ◽

Gintautas Dzemyda

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Intrinsic Dimensionality

Download Full-text

A Fast Clustering Algorithm for Large-scale and High Dimensional Data

ACTA AUTOMATICA SINICA ◽

10.3724/sp.j.1004.2009.00859 ◽

2009 ◽

Vol 35 (7) ◽

pp. 859-866

Author(s):

Ming LIU ◽

Xiao-Long WANG ◽

Yuan-Chao LIU

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

High Dimensional Data ◽

High Dimensional

Download Full-text

Improved negative selection algorithm for network anomaly detection on high-dimensional data

Journal of Computer Applications ◽

10.3724/sp.j.1087.2009.00805 ◽

2009 ◽

Vol 29 (3) ◽

pp. 805-807 ◽

Cited By ~ 1

Author(s):

Wen-zhong GUO ◽

Guo-long CHEN ◽

Qing-liang CHEN

Keyword(s):

Anomaly Detection ◽

Negative Selection ◽

High Dimensional Data ◽

High Dimensional ◽

Selection Algorithm ◽

Negative Selection Algorithm ◽

Network Anomaly Detection

Download Full-text

An Advanced Mining Services in Predicting and Ranking User Vitality across Dynamic and High Dimensional Data Sets

SSRN Electronic Journal ◽

10.2139/ssrn.3395242 ◽

2019 ◽

Author(s):

Ch. Durga Bhavani ◽

Dr. A. Daveedu Raju ◽

Dr. V. Surya Narayana

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Data Sets

Download Full-text

Outlier Detection in High Dimensional Data Based on the Anti-Hub and Regression Technique

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2017.8219 ◽

2017 ◽

Vol V (VIII) ◽

pp. 1543-1551

Author(s):

Golla Hemalatha

Keyword(s):

Outlier Detection ◽

High Dimensional Data ◽

Regression Technique ◽

High Dimensional

Download Full-text

Approximate Cluster Heat Maps of Large High-Dimensional Data

2018 24th International Conference on Pattern Recognition (ICPR) ◽

10.1109/icpr.2018.8545519 ◽

2018 ◽

Cited By ~ 1

Author(s):

Punit Rathore ◽

James C. Bezdek ◽

Dheeraj Kumar ◽

Sutharshan Rajasegarar ◽

Marimuthu Palaniswami

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Heat Maps

Download Full-text

The Generalized Bayes Method for High-Dimensional Data Recognition with Applications to Audio Signal Recognition

Symmetry ◽

10.3390/sym13010019 ◽

2020 ◽

Vol 13 (1) ◽

pp. 19

Author(s):

Hsiuying Wang

Keyword(s):

Gaussian Mixture Model ◽

Mixture Model ◽

Conventional Method ◽

High Dimensional Data ◽

Audio Signal ◽

Gaussian Mixture ◽

High Dimensional ◽

Signal Recognition ◽

Bayes Method ◽

Generalized Bayes

High-dimensional data recognition problem based on the Gaussian Mixture model has useful applications in many area, such as audio signal recognition, image analysis, and biological evolution. The expectation-maximization algorithm is a popular approach to the derivation of the maximum likelihood estimators of the Gaussian mixture model (GMM). An alternative solution is to adopt a generalized Bayes estimator for parameter estimation. In this study, an estimator based on the generalized Bayes approach is established. A simulation study shows that the proposed approach has a performance competitive to that of the conventional method in high-dimensional Gaussian mixture model recognition. We use a musical data example to illustrate this recognition problem. Suppose that we have audio data of a piece of music and know that the music is from one of four compositions, but we do not know exactly which composition it comes from. The generalized Bayes method shows a higher average recognition rate than the conventional method. This result shows that the generalized Bayes method is a competitor to the conventional method in this real application.

Download Full-text