Metaheuristics Based Clustering Algorithms on Document Clustering

Journal of Intelligent Systems with Applications ◽

10.54856/jiswa.201905059 ◽

2019 ◽

pp. 39-45

Author(s):

Aytug Onan

Keyword(s):

Cluster Analysis ◽

Optimization Problems ◽

Clustering Algorithms ◽

Document Clustering ◽

Cuckoo Search ◽

Text Documents ◽

Text Document ◽

Analysis Technique ◽

Clustering Quality ◽

Exploratory Data

Cluster analysis is an important exploratory data analysis technique which divides data into groups based on their similarity. Document clustering is the process of employing clustering algorithms on textual data so that text documents can be retrieved, organized, navigated and summarized in an efficient way. Document clustering can be utilized in the organization, summarization and classification of text documents. Metaheuristic algorithms have been successfully utilized to deal with complex optimization problems, including cluster analysis. In this paper, we analyze the clustering quality of five metaheuristic clustering algorithms (namely, particle swarm optimization, genetic algorithm, cuckoo search, firefly algorithm and bat algorithm) on fifteen text collections in terms of F-measure. In the empirical analysis, two conventional clustering algorithms (K-means and bisecting K-means) are also considered. The experimental analysis indicates that swarm-based clustering algorithms outperform conventional clustering algorithms on text document clustering.
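The abstract above reports clustering quality as an F-measure without giving the formula. A minimal sketch of one common definition in document clustering follows: for each true class, take the best F1 score over all clusters, then average weighted by class size. This definition and the function name clustering_f_measure are assumptions for illustration, not necessarily the exact measure used in the paper.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-weighted F-measure of a clustering against known class labels.

    For every true class, the best F1 over all clusters is taken, and the
    results are averaged with weights proportional to class size.
    """
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    # overlap[(cls, clu)] = number of documents of class cls in cluster clu
    overlap = Counter(zip(true_labels, cluster_labels))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Toy example: six documents, two classes, one document placed in the wrong cluster.
truth = ["a", "a", "a", "b", "b", "b"]
found = [0, 0, 1, 1, 1, 1]
print(round(clustering_f_measure(truth, found), 4))  # 0.8286
```

A perfect clustering scores 1.0 under this definition, regardless of how the cluster identifiers are numbered.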


A semantic approach for text document clustering using frequent itemsets and WordNet

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.9.10220 ◽

2018 ◽

Vol 7 (2.18) ◽

pp. 102

Author(s):

Harsha Patil ◽

Ramjeevan Singh Thakur

Keyword(s):

Clustering Algorithms ◽

Document Clustering ◽

Knowledge Bases ◽

Experimental Result ◽

Semantic Approach ◽

Text Document ◽

Clustering Quality ◽

Membership Function ◽

Membership Score ◽

Specific Cluster

Document clustering is an unsupervised method for grouping documents into clusters on the basis of their similarity. A document is assigned to a specific cluster on the basis of its membership score, which is calculated through a membership function. However, many traditional clustering algorithms are based only on the bag-of-words (BOW) model, which ignores the semantic similarity between a document and a cluster. In this research, we consider the semantic association between cluster and text document when calculating the membership score of a document for a specific cluster. Several researchers are working on semantic aspects of document clustering to improve clustering performance. Many external knowledge bases, such as WordNet, Wikipedia and Lucene, are utilized for this purpose. The proposed approach exploits WordNet to improve the cluster membership function. The experimental results show that clustering quality improves significantly with the proposed semantic framework.


Nature-Inspired Optimization Algorithms for Text Document Clustering—A Comprehensive Analysis

Algorithms ◽

10.3390/a13120345 ◽

2020 ◽

Vol 13 (12) ◽

pp. 345

Author(s):

Laith Abualigah ◽

Amir H. Gandomi ◽

Mohamed Abd Elaziz ◽

Abdelazim G. Hussien ◽

Ahmad M. Khasawneh ◽

...

Keyword(s):

Optimization Problems ◽

Optimization Algorithms ◽

Harmony Search ◽

Document Clustering ◽

Text Clustering ◽

Gray Wolf ◽

Text Documents ◽

Text Document ◽

Krill Herd ◽

Clustering Problems

Text clustering is one of the efficient unsupervised learning techniques used to partition a huge number of text documents into a subset of clusters, in which each cluster contains similar documents and different clusters contain dissimilar text documents. Nature-inspired optimization algorithms have been successfully used to solve various optimization problems, including text document clustering problems. In this paper, a comprehensive review is presented to show the most related nature-inspired algorithms that have been used in solving the text clustering problem. Moreover, comprehensive experiments are conducted and analyzed to show the performance of the common well-known nature-inspired optimization algorithms in solving the text document clustering problems, including Harmony Search (HS) Algorithm, Genetic Algorithm (GA), Particle Swarm Optimization (PSO) Algorithm, Ant Colony Optimization (ACO), Krill Herd Algorithm (KHA), Cuckoo Search (CS) Algorithm, Gray Wolf Optimizer (GWO), and Bat-inspired Algorithm (BA). Seven text benchmark datasets are used to validate the performance of the tested algorithms. The results showed that the performance of the well-known nature-inspired optimization algorithms is almost the same, with slight differences. For improvement purposes, new modified versions of the tested algorithms can be proposed and tested to tackle the text clustering problems.


An improved ant algorithm with LDA-based representation for text document clustering

Journal of Information Science ◽

10.1177/0165551516638784 ◽

2016 ◽

Vol 43 (2) ◽

pp. 275-292 ◽

Cited By ~ 24

Author(s):

Aytug Onan ◽

Hasan Bulut ◽

Serdar Korukoglu

Keyword(s):

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Clustering Algorithms ◽

Document Clustering ◽

Clustering Methods ◽

Initial Value ◽

Text Document ◽

Clustering Quality ◽

Text Features

Document clustering can be applied in document organisation and browsing, document summarisation and classification. The identification of an appropriate representation for textual documents is extremely important for the performance of clustering or classification algorithms. Textual documents suffer from the high dimensionality and irrelevancy of text features. Besides, conventional clustering algorithms suffer from several shortcomings, such as slow convergence and sensitivity to the initial value. To tackle the problems of conventional clustering algorithms, metaheuristic algorithms are frequently applied to clustering. In this paper, an improved ant clustering algorithm is presented, where two novel heuristic methods are proposed to enhance the clustering quality of ant-based clustering. In addition, the latent Dirichlet allocation (LDA) is used to represent textual documents in a compact and efficient way. The clustering quality of the proposed ant clustering algorithm is compared to the conventional clustering algorithms using 25 text benchmarks in terms of F-measure values. The experimental results indicate that the proposed clustering scheme outperforms the compared conventional and metaheuristic clustering methods for textual documents.


An Improved β-hill Climbing Optimization Technique for Solving the Text Documents Clustering Problem

Current Medical Imaging Formerly Current Medical Imaging Reviews ◽

10.2174/1573405614666180903112541 ◽

2020 ◽

Vol 16 (4) ◽

pp. 296-306 ◽

Cited By ~ 3

Author(s):

Laith Mohammad Abualigah ◽

Essam Said Hanandeh ◽

Ahamad Tajudin Khader ◽

Mohammed Abdallh Otair ◽

Shishir Kumar Shandilya

Keyword(s):

Optimization Technique ◽

Document Clustering ◽

Text Clustering ◽

Hill Climbing ◽

Text Documents ◽

Clustering Problem ◽

Text Document ◽

Text Information ◽

Amount Of Knowledge ◽

The Hill

Background: Considering the increasing volume of text document information on Internet pages, dealing with such a tremendous amount of knowledge becomes complex due to its large size. Text clustering is a common optimization problem used to organize a large amount of text information into a subset of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, namely β-hill climbing, to solve the text document clustering problem by modeling the β-hill climbing technique to partition similar documents into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique. It has been introduced in order to balance local and global search. Local search methods, such as k-medoids and k-means, have been successfully applied to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results showed that the proposed β-hill climbing achieved better results than the original hill climbing technique in solving the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
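To make the role of the β operator concrete, here is a minimal sketch of β-hill climbing applied to a cluster-assignment vector. Several parts are assumptions for illustration rather than details from the paper: the objective (sum of squared Euclidean distances to cluster centroids), the neighbourhood move (reassigning one random document), the default β value, and the names sse and beta_hill_climbing. Documents are represented as plain numeric vectors.

```python
import random


def sse(points, assignment, k):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, assignment):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    total = 0.0
    for p, c in zip(points, assignment):
        for d in range(dim):
            total += (p[d] - sums[c][d] / counts[c]) ** 2
    return total


def beta_hill_climbing(points, k, beta=0.05, iterations=2000, seed=0):
    """Hill climbing over cluster assignments with a beta (random reset) operator."""
    rng = random.Random(seed)
    n = len(points)
    current = [rng.randrange(k) for _ in range(n)]
    current_cost = sse(points, current, k)
    for _ in range(iterations):
        candidate = current[:]
        # Neighbourhood move (local search): reassign one randomly chosen document.
        candidate[rng.randrange(n)] = rng.randrange(k)
        # Beta operator (global search): each position is re-randomised
        # with probability beta.
        for i in range(n):
            if rng.random() < beta:
                candidate[i] = rng.randrange(k)
        candidate_cost = sse(points, candidate, k)
        # Greedy acceptance: keep the candidate only if it is not worse.
        if candidate_cost <= current_cost:
            current, current_cost = candidate, candidate_cost
    return current, current_cost


# Toy usage: two well-separated groups of 2-D "document vectors".
docs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, cost = beta_hill_climbing(docs, k=2)
print(labels, round(cost, 3))
```

With beta set to 0 the loop reduces to plain hill climbing; a larger beta injects more random jumps, which is the local versus global balance the abstract describes.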


Document Clustering

Pattern and Data Analysis in Healthcare Settings - Advances in Medical Technologies and Clinical Practice ◽

10.4018/978-1-5225-0536-5.ch013 ◽

2017 ◽

pp. 264-281

Author(s):

Harsha Patil ◽

R. S. Thakur

Keyword(s):

Text Mining ◽

Clustering Algorithms ◽

Document Clustering ◽

Web Pages ◽

Digital Form ◽

Search Query ◽

Text Documents ◽

Keen Interest ◽

Use Of Internet

Use of the Internet is growing rapidly in all dimensions. The enormous availability of text documents in digital form (email, web pages, blog posts, news articles, ebooks and other text files) on the Internet makes it challenging to retrieve the appropriate documents in response to a search query. As a result, there has been an eruption of interest in mining these vast resources and classifying them properly. This has encouraged researchers and developers to work on numerous approaches to document clustering, and researchers have taken a keen interest in this text mining problem. The aim of this chapter is to summarize different document clustering algorithms used by researchers.


Enriching Documents by Linking Salient Entities and Lexical-Semantic Expansion

Journal of Intelligent Systems ◽

10.1515/jisys-2018-0098 ◽

2018 ◽

Vol 29 (1) ◽

pp. 1109-1121

Author(s):

Mohsen Pourvali ◽

Salvatore Orlando

Keyword(s):

Clustering Algorithms ◽

Ensemble Clustering ◽

British Broadcasting Corporation ◽

Text Documents ◽

Classical Text ◽

Text Corpora ◽

Clustering Quality ◽

Semantic Expansion ◽

Document Representations

This paper explores a multi-strategy technique that aims at enriching text documents for improving clustering quality. We use a combination of entity linking and document summarization in order to determine the identity of the most salient entities mentioned in texts. To effectively enrich documents without introducing noise, we limit ourselves to the text fragments mentioning the salient entities, in turn, belonging to a knowledge base like Wikipedia, while the actual enrichment of text fragments is carried out using WordNet. To feed clustering algorithms, we investigate different document representations obtained using several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, by combining multiple clustering results obtained using different document representations. Our experiments indicate that our novel enriching strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora like The British Broadcasting Corporation (BBC) NEWS.
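The abstract does not say how the multiple clustering results are combined. One generic way to build a consensus, shown here purely as an illustrative assumption and not as the authors' method, is a co-association matrix: count how often each pair of documents lands in the same cluster across the input partitions, then group documents that co-occur in more than a chosen fraction of them. The function names and the 0.5 threshold are hypothetical.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of documents shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_clusters(partitions, threshold=0.5):
    """Connected components of the graph linking pairs with co-association > threshold."""
    matrix = co_association(partitions)
    n = len(matrix)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three clusterings of four documents, e.g. from three document representations.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same grouping as the first run, different label names
    [0, 0, 0, 1],
]
print(consensus_clusters(runs))  # [0, 0, 1, 1]
```

Because only co-membership is counted, the scheme is insensitive to how each individual run numbers its clusters.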


Text documents clustering using modified multi-verse optimizer

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i6.pp6361-6369 ◽

2020 ◽

Vol 10 (6) ◽

pp. 6361

Author(s):

Ammar Kamal Abasi ◽

Ahamad Tajudin Khader ◽

Mohammed Azmi Al-Betar ◽

Syibrah Naim ◽

Mohammed A. Awadallah ◽

...

Keyword(s):

Euclidean Distance ◽

Optimization Problems ◽

Continuous Optimization ◽

Search Space ◽

Discrete Optimization Problem ◽

Text Documents ◽

Text Document ◽

Continuous Optimization Problems ◽

Measure Entropy

In this study, a multi-verse optimizer (MVO) is utilised for the text document clustering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as similarity measure. TDC is tackled by the division of the documents into clusters; documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, which is a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas in the search space and search deeply in each area using a particular learning mechanism. The proposed algorithm is called MVOTDC, and it adopts the convergence behaviour of MVO operators to deal with discrete, rather than continuous, optimization problems. For evaluating MVOTDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy, accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVOTDC can produce significant results in comparison with three well-established methods.
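Two of the external measures named above, purity and entropy, can be sketched in a few lines. The version below uses widely used textbook definitions; normalising each cluster's entropy by the logarithm of the number of classes is one common convention and is an assumption here, as is the function name purity_and_entropy. The paper may define the measures differently.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(true_labels, cluster_labels):
    """Return (purity, entropy) of a clustering against known class labels.

    Purity: fraction of documents that belong to the majority class of their cluster.
    Entropy: size-weighted average of per-cluster class entropy, scaled to [0, 1].
    """
    n = len(true_labels)
    members = defaultdict(list)
    for truth, cluster in zip(true_labels, cluster_labels):
        members[cluster].append(truth)
    n_classes = len(set(true_labels))

    purity = 0.0
    entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        size = len(labels)
        purity += max(counts.values()) / n
        h = -sum((m / size) * math.log(m / size) for m in counts.values())
        if n_classes > 1:
            h /= math.log(n_classes)  # normalise so the maximum is 1
        entropy += (size / n) * h
    return purity, entropy


truth = ["a", "a", "a", "b", "b", "b"]
found = [0, 0, 1, 1, 1, 1]
p, e = purity_and_entropy(truth, found)
print(round(p, 4), round(e, 4))  # 0.8333 0.5409
```

Higher purity and lower entropy both indicate clusters that are closer to the reference classes.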


Optimizing Text Categorization for Indonesian Text Using Clustering Label Technique

Turkish Journal of Computer and Mathematics Education (TURCOMAT) ◽

10.17762/turcomat.v12i3.947 ◽

2021 ◽

Vol 12 (3) ◽

pp. 1483-1491

Author(s):

Syopiansyah Jaya Putra et al.

Keyword(s):

Text Categorization ◽

Intelligent System ◽

Text Processing ◽

Cluster Formation ◽

Clustering Algorithms ◽

Experimental Result ◽

Text Documents ◽

A Value ◽

Digital Format ◽

Clustering Quality

Text categorization plays an important role in clustering the rapidly growing, yet unstructured, Indonesian text in digital format. Furthermore, it is deemed even more important since access to digital format text has become more necessary and widespread. There are many clustering algorithms used for text categorization. Unfortunately, clustering algorithms for text categorization cannot easily cluster the texts due to the imperfect stemming and stopword removal processes for the Indonesian language. This paper presents an intelligent system that categorizes Indonesian text documents into meaningful cluster labels. Label Induction Grouping Algorithm (LINGO) and Bisecting K-means are applied to process it through five phases, namely pre-processing, frequent phrase extraction, cluster label induction, content discovery and final cluster formation. The experimental result showed that the system could categorize Indonesian text and reached 93%. Furthermore, clustering quality evaluation indicates that text categorization using LINGO has high Precision and Recall, with values of 0.85 and 1, respectively, compared to Bisecting K-means, which has values of 0.78 and 0.99. Therefore, the result shows that LINGO is suitable for categorizing Indonesian text. The main contribution of this study is to optimize the clustering results by applying and maximizing text processing using an Indonesian stemmer and stopword list.


Document Clustering

Information Retrieval and Management ◽

10.4018/978-1-5225-5191-1.ch003 ◽

2018 ◽

pp. 47-64 ◽

Cited By ~ 2

Author(s):

Harsha Patil ◽

R. S. Thakur

Keyword(s):

Text Mining ◽

Clustering Algorithms ◽

Document Clustering ◽

Web Pages ◽

Digital Form ◽

Search Query ◽

Text Documents ◽

Keen Interest ◽

Use Of Internet


Swarm Intelligence in Text Document Clustering

Handbook of Research on Text and Web Mining Technologies ◽

10.4018/978-1-59904-990-8.ch010 ◽

2010 ◽

pp. 165-180 ◽

Cited By ~ 2

Author(s):

Xiaohui Cui

Keyword(s):

Swarm Intelligence ◽

Clustering Analysis ◽

Information Society ◽

Clustering Algorithms ◽

Document Clustering ◽

High Quality ◽

Fish Schools ◽

Text Document ◽

Self Organized ◽

Swarm Algorithms

In this chapter, we introduce three nature-inspired swarm intelligence clustering approaches for document clustering analysis. A major challenge in today’s information society is that users are overwhelmed with information on any topic they search for. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize this overwhelming information. The swarm intelligence clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools, and ant foraging. Compared to the traditional clustering algorithms, the swarm algorithms are usually flexible, robust, decentralized, and self-organized. These characteristics make the swarm algorithms suitable for solving complex problems, such as document clustering.
