Which Feature is Better? TF*IDF Feature or Topic Feature in Text Clustering

Background: As the volume of text documents on Internet pages grows, managing so much information becomes increasingly difficult. Text clustering is commonly formulated as an optimization problem: partitioning a large collection of text documents into a set of coherent clusters of similar documents. Aims: This paper presents a novel local search technique, β-hill climbing, for the text document clustering problem; the technique is modeled so that similar documents are partitioned into the same cluster. Methods: The β parameter is the primary innovation of the β-hill climbing technique. It was introduced to balance local and global search. Local search methods such as k-medoids and k-means have been applied successfully to text document clustering. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics, taken from the Laboratory of Computational Intelligence (LABIC). The results showed that the proposed β-hill climbing achieved better results than the original hill climbing technique on the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
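To make the role of the β parameter concrete, here is a minimal Python sketch of a β-hill climbing loop applied to clustering. It is not the authors' implementation. The abstract does not specify the objective function, the neighbourhood move, or the parameter values, so the squared-distance cost, the single-reassignment move, and the defaults `beta=0.05` and `iters=2000` are all assumptions chosen for illustration. Documents are assumed to be already represented as small dense numeric vectors.

```python
import random

def cost(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid (assumed objective)."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for j, v in enumerate(p):
            sums[c][j] += v
    total = 0.0
    for p, c in zip(points, labels):
        centroid = [s / counts[c] for s in sums[c]]
        total += sum((v - m) ** 2 for v, m in zip(p, centroid))
    return total

def beta_hill_climbing(points, k, beta=0.05, iters=2000, seed=0):
    """Return (labels, cost) after a fixed number of beta-hill climbing iterations."""
    rng = random.Random(seed)
    n = len(points)
    best = [rng.randrange(k) for _ in range(n)]  # random initial assignment
    best_cost = cost(points, best, k)
    for _ in range(iters):
        cand = best[:]
        # Local move (assumed): reassign one randomly chosen document.
        cand[rng.randrange(n)] = rng.randrange(k)
        # Beta step: each assignment is reset at random with probability beta,
        # which adds exploration on top of the local move.
        for i in range(n):
            if rng.random() < beta:
                cand[i] = rng.randrange(k)
        # Selection: keep the candidate only if it lowers the cost.
        cand_cost = cost(points, cand, k)
        if cand_cost < best_cost:
            best, best_cost = cand, cand_cost
    return best, best_cost
```

With `beta=0` the loop reduces to plain hill climbing with a single-reassignment move, which is one way to compare the two variants on the same data.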


Extraction of Product Defects and Opinions from Customer Reviews by Using Text Clustering and Sentiment Analysis

2020 IEEE International Conference on Big Data (Big Data), 2020
DOI: 10.1109/bigdata50022.2020.9377851

Author(s): Mustafa CATALTAS, Sevcan DOGRAMACI, Semih YUMUSAK, Kasim OZTOPRAK

Keyword(s): Sentiment Analysis, Text Clustering, Customer Reviews


Confronting Sparseness and High Dimensionality in Short Text Clustering via Feature Vector Projections

2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), 2020
DOI: 10.1109/ictai50040.2020.00129

Author(s): Leonidas Akritidis, Miltiadis Alamaniotis, Athanasios Fevgas, Panayiotis Bozanis

Keyword(s): Feature Vector, Text Clustering, High Dimensionality, Short Text, Short Text Clustering


Text Clustering

Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, 2016, pp. 275
DOI: 10.1145/2915031.2915046

Keyword(s): Text Clustering


A set theory based similarity measure for text clustering and classification

Journal Of Big Data, 2020, Vol 7 (1)
DOI: 10.1186/s40537-020-00344-3
Cited By ~ 1

Author(s): Ali A. Amer, Hassan I. Abdalla

Keyword(s): Set Theory, Similarity Measure, Similarity Measures, Text Clustering, Plagiarism Detection, K Nearest Neighbor, Single Measure, Highly Effective, Clustering And Classification, Effectiveness And Efficiency

Abstract: Similarity measures have long been used in information retrieval and machine learning for multiple purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, until recently, no single measure has been reported to be both highly effective and efficient. The search for such a measure therefore remains an open challenge. This study introduces a new, highly effective and time-efficient similarity measure for text clustering and classification. It also provides a comprehensive examination of seven of the most widely used similarity measures, focusing on their effectiveness and efficiency. All similarity measures are examined in detail using the K-nearest neighbor (KNN) algorithm for classification, the K-means algorithm for clustering, and the bag-of-words (BoW) model for feature selection. The experimental evaluation was carried out on two of the most popular datasets, namely Reuters-21 and Web-KB. The results confirm that the proposed set theory-based similarity measure (STB-SM) significantly outperforms all state-of-the-art measures in both effectiveness and efficiency.
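The evaluation setup described here, a similarity measure plugged into KNN over bag-of-words features, can be sketched as follows. The abstract does not define STB-SM, so this sketch substitutes the Jaccard coefficient, a well-known set-based similarity, purely as a stand-in. The function names and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts (stand-in, not STB-SM)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def knn_predict(train, query, k=3, sim=jaccard):
    """Classify `query` by majority vote among its k most similar training texts.

    `train` is a list of (text, label) pairs; `sim` is any similarity function
    taking two texts and returning a number where larger means more similar.
    """
    ranked = sorted(train, key=lambda pair: sim(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because `sim` is a parameter, different similarity measures can be swapped in and compared under the same classifier, which mirrors the kind of comparison the abstract describes.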


Word2Cluster: A New Multi-Label Text Clustering Algorithm with an Adaptive Clusters Number

2019 IEEE Global Communications Conference (GLOBECOM), 2019
DOI: 10.1109/globecom38437.2019.9013266

Author(s): Kaili Mao, Jianwei Niu, Xuefeng Liu, Shui Yu, Longbo Zhao

Keyword(s): Clustering Algorithm, Text Clustering


Text Clustering via Constrained Nonnegative Matrix Factorization

2011 IEEE 11th International Conference on Data Mining, 2011
DOI: 10.1109/icdm.2011.143
Cited By ~ 5

Author(s): Yan Zhu, Liping Jing, Jian Yu

Keyword(s): Matrix Factorization, Nonnegative Matrix Factorization, Nonnegative Matrix, Text Clustering


Research of text clustering based on fuzzy granular computing

2009 2nd IEEE International Conference on Computer Science and Information Technology, 2009
DOI: 10.1109/iccsit.2009.5234519
Cited By ~ 1

Author(s): Zhang Xia, Yin Yixin, Xu Mingzhu, Zhao Hailong

Keyword(s): Granular Computing, Text Clustering


Fuzzy Set Based Clustering Algorithm of Web Text

Applied Mechanics and Materials, 2014, Vol 678, pp. 19-22
DOI: 10.4028/www.scientific.net/amm.678.19

Author(s): Hong Xin Wan, Yun Peng

Keyword(s): Key Words, Fuzzy Set, Clustering Algorithm, Text Clustering, Classification Methods, Comparative Experiment, Fuzzy Algorithm, Pattern Clustering, The Web, Computing Accuracy

Web text contains uncertain and unstructured content, which makes it difficult to cluster with conventional classification methods. We propose a web text clustering algorithm based on fuzzy sets to increase computing accuracy on web text. After extracting the key words of a text, we treat them as attributes and design a fuzzy algorithm to determine the membership of the words. Compared with the conventional algorithm, the proposed algorithm improves time and space complexity and increases robustness. To test its accuracy and efficiency, we run a comparative experiment between pattern clustering and our algorithm. The experiment shows that our method gives better results.
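The idea of treating extracted key words as attributes and assigning graded membership can be illustrated with a small Python sketch. The abstract does not give the membership function, so the normalized keyword-overlap rule below is an assumption, and the function name and cluster keyword lists are hypothetical.

```python
def memberships(doc_keywords, cluster_keywords):
    """Return a fuzzy membership degree in [0, 1] for each cluster.

    Assumed rule: membership is proportional to the number of key words the
    document shares with each cluster, normalized so the degrees sum to 1.
    """
    doc = set(doc_keywords)
    overlap = {name: len(doc & set(words)) for name, words in cluster_keywords.items()}
    total = sum(overlap.values())
    if total == 0:
        # No shared key words: spread membership evenly across clusters.
        return {name: 1.0 / len(overlap) for name in overlap}
    return {name: n / total for name, n in overlap.items()}
```

Unlike a hard assignment, a document that mentions both sports and finance terms receives partial membership in both clusters rather than being forced into one.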
