Improved Text Clustering Using k-Mean Bayesian Vectoriser

High-dimensional data reduces the efficiency of clustering algorithms and increases their execution time. In this paper, we therefore propose BV-kmeans (Bayesian Vectorisation with k-means), an approach that aims to improve document representation models for text clustering. It integrates k-means document clustering with a Bayesian Vectoriser, which computes the probability distribution of the documents in the vector space, in order to overcome the problems of high-dimensional data and reduce running time. To determine which metrics are effective for modelling the similarity between documents under the proposed approach, we used three similarity measures: K divergence, squared Euclidean distance and squared χ2 distance. We evaluated the approach on a set of common newspaper websites with high-dimensional data. Experimental results show that, compared with the standard k-means algorithm, the proposed approach increases the degree to which a cluster encases documents from a specific category by 85% and lowers the runtime by 95%.
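The three similarity measures named in the abstract can be sketched as follows for two documents represented as probability distributions. The abstract does not give the formulas, so the definitions below are the commonly used textbook forms and are an assumption about what the paper uses; the function names and the toy vectors `doc_a` and `doc_b` are hypothetical.

```python
import math

def squared_euclidean(p, q):
    """Sum of squared component differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def squared_chi2(p, q):
    """Squared chi-squared distance: sum of (p_i - q_i)^2 / (p_i + q_i).

    Components where both values are zero are skipped to avoid 0/0.
    """
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def k_divergence(p, q):
    """K divergence: sum of p_i * log(2 * p_i / (p_i + q_i)).

    Terms with p_i == 0 contribute nothing and are skipped.
    Note that this measure is not symmetric in p and q.
    """
    return sum(a * math.log(2 * a / (a + b)) for a, b in zip(p, q) if a > 0)

# Hypothetical two-term documents, each normalised to sum to 1.
doc_a = [0.5, 0.5]
doc_b = [0.25, 0.75]

for name, fn in [("squared Euclidean", squared_euclidean),
                 ("squared chi2", squared_chi2),
                 ("K divergence", k_divergence)]:
    print(f"{name}: {fn(doc_a, doc_b):.6f}")
```

All three return 0 for identical distributions, so any of them could in principle replace the distance used in the k-means assignment step; which one works best is the empirical question the paper investigates.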

Robust models and novel similarity measures for high-dimensional data clustering

10.32657/10356/48657 ◽ 2012

Author(s): Duc Thang Nguyen

Keyword(s): Data Clustering ◽ High Dimensional Data ◽ Similarity Measures ◽ High Dimensional

Subspace Clustering of High Dimensional Data Using Differential Evolution

Nature-Inspired Algorithms for Big Data Frameworks - Advances in Computational Intelligence and Robotics ◽ 10.4018/978-1-5225-5852-1.ch003 ◽ 2019 ◽ pp. 47-74 ◽ Cited By ~ 1

Author(s): Parul Agarwal ◽ Shikha Mehta

Keyword(s): Differential Evolution ◽ Distance Measure ◽ Dimensional Space ◽ Clustering Algorithms ◽ High Dimensional Data ◽ Subspace Clustering ◽ High Dimensional ◽ Dbscan Clustering ◽ Evolution Algorithms ◽ Self Adaptive

Subspace clustering approaches cluster high dimensional data in different subspaces, that is, they group the data using different relevant subsets of dimensions. This technique is effective because distance measures become ineffective in a high dimensional space. This chapter presents SUBSPACE_DE, a novel evolutionary approach to bottom-up subspace clustering that is scalable to high dimensional data. SUBSPACE_DE uses a self-adaptive DBSCAN algorithm to perform clustering on the data instances of each attribute and of the maximal subspaces. The self-adaptive DBSCAN clustering algorithms accept input from differential evolution algorithms. The proposed SUBSPACE_DE algorithm is tested on 14 datasets, both real and synthetic, and compared with 11 existing subspace clustering algorithms using evaluation metrics such as F1_Measure and accuracy. On a success rate ratio ranking, the proposed algorithm performs considerably better in both accuracy and F1_Measure. SUBSPACE_DE also has potential scalability on high dimensional datasets.
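The first bottom-up step described above, clustering the data instances of each attribute separately, can be illustrated with a minimal sketch. This is not the SUBSPACE_DE algorithm: it uses a simplified gap-based density grouping as a stand-in for DBSCAN on one-dimensional data, and the parameters `eps` and `min_pts` are fixed by hand here, whereas the chapter states that they are supplied by differential evolution. All function names and the toy data are hypothetical.

```python
def dense_runs_1d(values, eps, min_pts):
    """Group indices of 1-D values into dense runs.

    Consecutive sorted values that are at most `eps` apart are chained
    into one run; runs with fewer than `min_pts` members are treated
    as noise and discarded.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    clusters, run = [], []
    for i in order:
        # A gap larger than eps closes the current run.
        if run and values[i] - values[run[-1]] > eps:
            if len(run) >= min_pts:
                clusters.append(run)
            run = []
        run.append(i)
    if len(run) >= min_pts:
        clusters.append(run)
    return clusters

def cluster_each_attribute(rows, eps, min_pts):
    """Run the 1-D grouping independently on every attribute (column)."""
    n_attrs = len(rows[0])
    return {a: dense_runs_1d([r[a] for r in rows], eps, min_pts)
            for a in range(n_attrs)}

# Hypothetical data: five instances with two attributes.
rows = [(0.1, 5.0), (0.2, 9.0), (0.15, 1.0), (0.9, 5.1), (0.95, 5.2)]
per_attribute = cluster_each_attribute(rows, eps=0.2, min_pts=2)
print(per_attribute)
```

In a bottom-up scheme, the dense groups found per attribute would then be combined into candidate higher-dimensional subspaces; that combination step is omitted from this sketch.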

Resampling-Based Similarity Measures for High-Dimensional Data

Journal of Computational Biology ◽ 10.1089/cmb.2014.0195 ◽ 2015 ◽ Vol 22 (1) ◽ pp. 54-62 ◽ Cited By ~ 3

Author(s): Dhammika Amaratunga ◽ Javier Cabrera ◽ Yung-Seop Lee

Keyword(s): High Dimensional Data ◽ Similarity Measures ◽ High Dimensional

Efficiency and Effectiveness of Clustering Algorithms for High Dimensional Data

International Journal of Computer Applications ◽ 10.5120/ijca2015906144 ◽ 2015 ◽ Vol 125 (11) ◽ pp. 35-40

Author(s): Smita Chormunge ◽ Sudarson Jena

Keyword(s): Clustering Algorithms ◽ High Dimensional Data ◽ High Dimensional ◽ Efficiency And Effectiveness

M-Denclue for Effective Data Clustering in High Dimensional Non-Linear Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽ 10.35940/ijitee.a9109.119119 ◽ 2019 ◽ Vol 9 (1) ◽ pp. 2925-2927

Keyword(s): Clustering Algorithms ◽ High Dimensional Data ◽ Research Work ◽ Curse Of Dimensionality ◽ Distance Measures ◽ High Dimensional ◽ Clustering Methods ◽ Non Linear ◽ Low Dimensional ◽ Automatic Grouping

Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Clustering in high-dimensional spaces is a recurrent problem in many domains. High dimensionality affects the time complexity, space complexity, scalability and accuracy of clustering methods. High-dimensional non-linear data usually live in different low-dimensional subspaces hidden in the original space. As high-dimensional objects appear almost alike, new approaches for clustering are required. This research has focused on developing mathematical models, techniques and clustering algorithms specifically for high-dimensional data. With the growth in the fields of communication and technology, there is tremendous growth in high-dimensional data spaces. As the number of dimensions of high-dimensional non-linear data increases, many clustering techniques begin to suffer from the curse of dimensionality, degrading the quality of the results. In high-dimensional non-linear data, the data become very sparse and distance measures become increasingly meaningless. The principal challenge for clustering high-dimensional data is to overcome the “curse of dimensionality”. This research work concentrates on devising an enhanced algorithm for clustering high-dimensional non-linear data.
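The claim that high-dimensional objects appear almost alike, so that distance measures lose meaning, can be checked with a small experiment. The sketch below is an illustration under stated assumptions and is not taken from the paper: it draws points uniformly at random from the unit hypercube and reports the relative contrast, (farthest − nearest) / nearest, of Euclidean distances from one query point. The function name, sample size and seed are hypothetical choices.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from a
    random query point to n_points random points in [0, 1]^n_dims."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest

# The contrast between nearest and farthest neighbour shrinks as the
# number of dimensions grows.
for dims in (2, 20, 500):
    print(dims, round(relative_contrast(200, dims), 3))
```

A contrast close to zero means the nearest and farthest points are almost equally far away, which is why full-space distances become a weak basis for grouping and why subspace-oriented methods are proposed instead.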

Clustering for High Dimensional Data: Density based Subspace Clustering Algorithms

International Journal of Computer Applications ◽ 10.5120/10584-5732 ◽ 2013 ◽ Vol 63 (20) ◽ pp. 29-35 ◽ Cited By ~ 1

Author(s): Sunita Jahirabadkar ◽ Parag Kulkarni

Keyword(s): Clustering Algorithms ◽ High Dimensional Data ◽ Subspace Clustering ◽ High Dimensional ◽ Data Density

Asymptotic properties of the misclassification rates for Euclidean Distance Discriminant rule in high-dimensional data

Journal of Multivariate Analysis ◽ 10.1016/j.jmva.2015.05.008 ◽ 2015 ◽ Vol 140 ◽ pp. 234-244 ◽ Cited By ~ 1

Author(s): Hiroki Watanabe ◽ Masashi Hyodo ◽ Takashi Seo ◽ Tatjana Pavlenko

Keyword(s): Euclidean Distance ◽ Asymptotic Properties ◽ High Dimensional Data ◽ High Dimensional ◽ Misclassification Rates

Clustering High Dimensional Data Using Subspace and Projected Clustering Algorithms

International Journal of Computer Science and Information Technology ◽ 10.5121/ijcsit.2010.2414 ◽ 2010 ◽ Vol 2 (4) ◽ pp. 162-170 ◽ Cited By ~ 7

Author(s): Rahmat Widia Sembiring ◽ Jasni Mohamad Zain ◽ Abdullah Embong

Keyword(s): Clustering Algorithms ◽ High Dimensional Data ◽ High Dimensional ◽ Projected Clustering

Clustering Algorithms For High Dimensional Data – A Survey Of Issues And Existing Approaches

International Journal of Computer Science and Informatics ◽ 10.47893/ijcsi.2013.1108 ◽ 2013 ◽ pp. 293-299

Author(s): B.Hari Babu ◽ N.Subash Chandra ◽ T. Venu Gopal

Keyword(s): Dimensional Space ◽ Clustering Algorithms ◽ High Dimensional Data ◽ Microarray Gene Expression Data ◽ Distance Measures ◽ High Dimensional ◽ Data Mining Technique ◽ Microarray Gene Expression ◽ Redundancy Elimination ◽ Different Types

Clustering is the most prominent data mining technique used for grouping data into clusters based on distance measures. With the growth of high dimensional data such as microarray gene expression data, grouping such data into clusters encounters the problem that similarity between objects in the full dimensional space is often invalid, because the space contains different types of data. The process of grouping high dimensional data into clusters is not accurate, and perhaps not up to the level of expectation, when the dimension of the dataset is high. The problem is now attracting tremendous attention in research and development. To address the performance issues of clustering high dimensional data, it is necessary to study and improve issues such as dimensionality reduction, redundancy elimination, subspace clustering, co-clustering and data labeling for clusters. In this paper, we present a brief comparison of existing algorithms that focus mainly on clustering high dimensional data.

A Survey on Various Clustering Algorithms in High Dimensional Data

Proceedings of the International Conference on Emerging Trends in Engineering & Technology (ICETET-2015) ◽ 10.3850/978-981-09-5346-1_cse-557 ◽ 2015

Author(s): M. Amina ◽ K. Syed Farook

Keyword(s): Clustering Algorithms ◽ High Dimensional Data ◽ High Dimensional
