Assessment of Twitter Data Clusters with Cosine-Based Validation Metrics Using Hybrid Topic Models

Text data clustering organizes a set of text documents into a desired number of coherent and meaningful sub-clusters. Modeling text documents in terms of derived topics is a vital task in text data clustering. Each tweet is treated as a text document, and various topic models are used to model the tweets. In existing topic models, the clustering tendency of tweets is initially assessed using Euclidean dissimilarity features. The cosine metric is better suited to an informative assessment, especially for text clustering. This paper therefore develops novel cosine-based external and internal validity assessments of cluster tendency to improve the computational efficiency of tweet data clustering. In the experiments, tweet clustering results are evaluated using cluster validity indices. The experiments show that the cosine-based internal and external validity metrics outperform the alternatives on benchmark and Twitter-based datasets.
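The contrast between Euclidean and cosine dissimilarity in this abstract can be illustrated with a small sketch. This is not the paper's method: it computes the standard silhouette coefficient (one common internal validity index) under both distances, and the toy term-count vectors, labels, and function names are assumptions made up for the illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def silhouette(points, labels, dist):
    """Mean silhouette coefficient under an arbitrary distance function."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical term-count vectors: two topics, documents of very different length.
docs = [
    [1, 0, 0], [10, 1, 0],   # topic A: a short and a long document
    [0, 1, 0], [1, 10, 0],   # topic B: a short and a long document
]
labels = [0, 0, 1, 1]

cos_score = silhouette(docs, labels, cosine_distance)
euc_score = silhouette(docs, labels, euclidean_distance)
print(f"cosine silhouette:    {cos_score:.3f}")
print(f"euclidean silhouette: {euc_score:.3f}")
```

On this toy data the same labeling scores much higher under cosine distance, because cosine distance ignores vector length, while Euclidean distance places the two short documents close together regardless of topic.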

Online cluster validity indices for performance monitoring of streaming data clustering

International Journal of Intelligent Systems ◽ 10.1002/int.22064 ◽ 2018 ◽ Vol 34 (4) ◽ pp. 541-563 ◽ Cited By ~ 6

Author(s): Masud Moshtaghi ◽ James C. Bezdek ◽ Sarah M. Erfani ◽ Christopher Leckie ◽ James Bailey

Keyword(s): Data Clustering ◽ Performance Monitoring ◽ Streaming Data ◽ Cluster Validity ◽ Cluster Validity Indices ◽ Validity Indices

A Data Clustering Tool with Cluster Validity Indices

2009 International Conference on Computing, Engineering and Information ◽ 10.1109/icc.2009.76 ◽ 2009 ◽ Cited By ~ 4

Author(s): Haiyan Qiao ◽ Brandon Edwards

Keyword(s): Data Clustering ◽ Cluster Validity ◽ Cluster Validity Indices ◽ Validity Indices

A survey of cluster validity indices for automatic data clustering using differential evolution

Proceedings of the Genetic and Evolutionary Computation Conference ◽ 10.1145/3449639.3459341 ◽ 2021

Author(s): Adán José-García ◽ Wilfrido Gómez-Flores

Keyword(s): Differential Evolution ◽ Data Clustering ◽ Cluster Validity ◽ Automatic Data ◽ Cluster Validity Indices ◽ Validity Indices

Performance Evaluation of the Data Clustering Techniques and Cluster Validity Indices for Efficient Toolpath Development for Incremental Sheet Forming

Journal of Computing and Information Science in Engineering ◽ 10.1115/1.4048914 ◽ 2020 ◽ pp. 1-32

Author(s): Aniket Nagargoje ◽ Pavan K. Kankar ◽ Prashant K. Jain ◽ Puneet Tandon

Keyword(s): Hierarchical Clustering ◽ Data Clustering ◽ Spectral Clustering ◽ Sheet Forming ◽ Incremental Sheet Forming ◽ Cluster Validity ◽ Clustering Techniques ◽ Cluster Validity Indices ◽ Validity Indices ◽ Feature Based

Abstract: The goal of this research is to compare data clustering techniques and cluster validity indices for feature-based toolpath development in the incremental sheet forming process. The work compares four popular families of clustering techniques: partition-based (K-means), density-based (DBSCAN), variants of hierarchical clustering, and graph-based (spectral) clustering. In addition, to assess the quality of the clustering solutions and identify the most suitable validity indices, the Calinski-Harabasz, Ball-Hall, Davies-Bouldin, Dunn, Det Ratio, Silhouette, Trace WiB, and Log Det Ratio indices are compared. Single-linkage hierarchical clustering is preferred over the other hierarchical variants because it detects arbitrarily shaped clusters. After comparing it with DBSCAN, K-means, and spectral clustering, DBSCAN is found to be the most suitable technique for the proposed application. Of the eight internal validity indices compared, four (Ball-Hall, Dunn, Det Ratio, and Log Det Ratio) are selected because they support the application. The outcome of this research should help in building algorithms for feature-based toolpath development strategies for the manufacturing industry using data science and machine learning techniques.
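As a rough illustration of one of the internal validity indices named in this abstract, the sketch below computes the Dunn index in its classic form: the smallest distance between points in different clusters divided by the largest cluster diameter, with higher values indicating compact, well-separated clusters. The sample points, labelings, and function name are assumptions for the example and are not taken from the paper.

```python
import math
from itertools import combinations

def dunn_index(points, labels):
    """Classic Dunn index: min inter-cluster distance / max cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("Dunn index needs at least two clusters")

    # Largest pairwise distance inside any single cluster (its diameter).
    max_diameter = max(
        (math.dist(p, q)
         for members in clusters.values()
         for p, q in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points that lie in different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters.values(), 2)
        for p in a
        for q in b
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point or coincident points
    return min_separation / max_diameter

# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible grouping
bad_labels = [0, 0, 1, 1, 1, 0]    # mixes points across the two groups

good_dunn = dunn_index(points, good_labels)
bad_dunn = dunn_index(points, bad_labels)
print(f"Dunn (good labeling): {good_dunn:.2f}")
print(f"Dunn (bad labeling):  {bad_dunn:.2f}")
```

Because the index depends on a single minimum and a single maximum, it is sensitive to outliers; that is a general property of this formulation, not a finding of the paper.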

AutoClust: A Framework for Automated Clustering Based on Cluster Validity Indices

2020 IEEE International Conference on Data Mining (ICDM) ◽ 10.1109/icdm50108.2020.00153 ◽ 2020

Author(s): Yannis Poulakis ◽ Christos Doulkeridis ◽ Dimosthenis Kyriazis

Keyword(s): Cluster Validity ◽ Cluster Validity Indices ◽ Validity Indices

On fuzzy cluster validity indices for the objects of mixed features

2009 IEEE International Conference on Fuzzy Systems ◽ 10.1109/fuzzy.2009.5277190 ◽ 2009 ◽ Cited By ~ 2

Author(s): Mahnhoon Lee

Keyword(s): Fuzzy Cluster ◽ Cluster Validity ◽ Cluster Validity Indices ◽ Validity Indices ◽ Mixed Features

Number of Clusters and the Quality of Hybrid Predictive Models in Analytical CRM

Studies in Logic, Grammar and Rhetoric ◽ 10.2478/slgr-2014-0022 ◽ 2014 ◽ Vol 37 (1) ◽ pp. 141-157 ◽ Cited By ~ 1

Author(s): Mariusz Łapczyński ◽ Bartłomiej Jefmański

Keyword(s): Predictive Models ◽ Cluster Validity ◽ Number Of Clusters ◽ Model Combining ◽ Cluster Validity Indices ◽ Validity Indices ◽ And Cluster Analysis ◽ Analytical Tools ◽ F Measure

Abstract: Making more accurate marketing decisions requires managers to build effective predictive models. Typically, these models estimate the probability that a customer belongs to a particular category, group, or segment. In analytical CRM, the categories refer to customers interested in starting cooperation with the company (acquisition models), customers who purchase additional products (cross-sell and up-sell models), or customers intending to end the cooperation (churn models). When building predictive models, researchers use analytical tools from various disciplines, with an emphasis on achieving the best performance. This article attempts to build a hybrid predictive model that combines decision trees (the C&RT algorithm) and cluster analysis (k-means). Five cluster validity indices and eight datasets were used in the experiments. Model performance was evaluated using popular measures: accuracy, precision, recall, G-mean, F-measure, and lift in the first and second deciles. The authors looked for a connection between the number of clusters and model quality.
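The evaluation measures listed in this abstract can be sketched from a binary confusion matrix as follows. The abstract does not give formulas, so two definitions here are assumptions: G-mean is taken as the geometric mean of recall and specificity, and lift is computed per decile (not cumulatively) as the positive rate in that decile divided by the overall positive rate. The counts, scores, and function names are hypothetical.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Common classification measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        # Assumed definition: sqrt(recall * specificity).
        "g_mean": math.sqrt(recall * specificity),
        "f_measure": f_measure,
    }

def decile_lift(scores, labels, decile=1):
    """Positive rate in the given score decile divided by the overall rate.

    Assumes at least 10 samples and at least one positive label.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    size = len(order) // 10
    chunk = order[(decile - 1) * size : decile * size]
    overall_rate = sum(labels) / len(labels)
    chunk_rate = sum(labels[i] for i in chunk) / len(chunk)
    return chunk_rate / overall_rate

# Hypothetical churn model: 1000 customers, 50 of whom actually churn.
m = binary_metrics(tp=30, fp=10, fn=20, tn=940)
for name, value in m.items():
    print(f"{name}: {value:.3f}")

# Hypothetical scores for 20 customers, highest score first; 1 = churned.
scores = list(range(20, 0, -1))
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
lift_1 = decile_lift(scores, labels, decile=1)
lift_2 = decile_lift(scores, labels, decile=2)
print(f"lift, first decile:  {lift_1:.2f}")
print(f"lift, second decile: {lift_2:.2f}")
```

With heavily imbalanced classes, as in this hypothetical churn data, accuracy stays high even when recall is modest, which is one reason measures such as G-mean, F-measure, and lift are reported alongside it.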
