Nature-Inspired Optimization Algorithms for Text Document Clustering—A Comprehensive Analysis

Text clustering is an efficient unsupervised learning technique used to partition a large number of text documents into a set of clusters, such that each cluster contains similar documents and different clusters contain dissimilar documents. Nature-inspired optimization algorithms have been successfully used to solve various optimization problems, including text document clustering. In this paper, a comprehensive review is presented of the most relevant nature-inspired algorithms that have been used to solve the text clustering problem. Moreover, comprehensive experiments are conducted and analyzed to show the performance of common, well-known nature-inspired optimization algorithms on text document clustering problems, including the Harmony Search (HS) Algorithm, Genetic Algorithm (GA), Particle Swarm Optimization (PSO) Algorithm, Ant Colony Optimization (ACO), Krill Herd Algorithm (KHA), Cuckoo Search (CS) Algorithm, Gray Wolf Optimizer (GWO), and Bat-inspired Algorithm (BA). Seven text benchmark datasets are used to validate the performance of the tested algorithms. The results show that the well-known nature-inspired optimization algorithms perform almost the same, with only slight differences. For improvement purposes, new modified versions of the tested algorithms can be proposed and tested on text clustering problems.

An Improved B-hill Climbing Optimization Technique for Solving the Text Documents Clustering Problem

Current Medical Imaging (formerly Current Medical Imaging Reviews) ◽

10.2174/1573405614666180903112541 ◽

2020 ◽

Vol 16 (4) ◽

pp. 296-306 ◽

Cited By ~ 3

Author(s):

Laith Mohammad Abualigah ◽

Essam Said Hanandeh ◽

Ahamad Tajudin Khader ◽

Mohammed Abdallh Otair ◽

Shishir Kumar Shandilya

Keyword(s):

Optimization Technique ◽

Document Clustering ◽

Text Clustering ◽

Hill Climbing ◽

Text Documents ◽

Clustering Problem ◽

Text Document ◽

Text Information ◽

Amount Of Knowledge ◽

The Hill

Background: Given the increasing volume of text documents on Internet pages, dealing with such a tremendous amount of information becomes highly complex. Text clustering is a common optimization problem in which a large amount of text information is organized into a set of comparable and coherent clusters. Aims: This paper presents a novel local search clustering technique, namely β-hill climbing, which solves the text document clustering problem by partitioning similar documents into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique. It was introduced to balance local and global search. Local search methods, such as the k-medoids and k-means techniques, have been successfully applied to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results showed that the proposed β-hill climbing achieved better results than the original hill climbing technique in solving the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
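
The abstract does not give the update rules, so the following is only a minimal sketch of how a β operator might be layered on plain hill climbing for clustering. It assumes a solution is a list of cluster labels (one per document), that the objective is the average cosine similarity between each document and its cluster centroid, and that the β operator resets each label at random with probability `beta`. The function names, the objective, and the default parameter values are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def fitness(labels, docs, k):
    """Assumed objective: mean cosine similarity of each document to its centroid."""
    dim = len(docs[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for label, doc in zip(labels, docs):
        counts[label] += 1
        for j, x in enumerate(doc):
            sums[label][j] += x
    # Empty clusters keep a zero centroid; no document refers to them.
    centroids = [[s / c for s in row] if c else row for row, c in zip(sums, counts)]
    return sum(cosine(d, centroids[l]) for l, d in zip(labels, docs)) / len(docs)


def beta_hill_climbing(docs, k, beta=0.05, iterations=500, seed=0):
    """Hill climbing with an assumed beta operator; returns (labels, fitness)."""
    rng = random.Random(seed)
    current = [rng.randrange(k) for _ in docs]
    best_fit = fitness(current, docs, k)
    for _ in range(iterations):
        candidate = current[:]
        # Neighbourhood move (local search): relabel one random document.
        candidate[rng.randrange(len(docs))] = rng.randrange(k)
        # Beta operator (global search): reset each label with probability beta.
        for i in range(len(candidate)):
            if rng.random() < beta:
                candidate[i] = rng.randrange(k)
        cand_fit = fitness(candidate, docs, k)
        if cand_fit >= best_fit:  # keep only non-worsening candidates
            current, best_fit = candidate, cand_fit
    return current, best_fit


# Toy term-weight vectors for four documents (hypothetical data).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, score = beta_hill_climbing(docs, k=2)
print(labels, round(score, 3))
```

With `beta=0` the loop reduces to ordinary hill climbing; larger values make each candidate more random, which is the local/global trade-off the abstract attributes to β.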

Metaheuristics Based Clustering Algorithms on Document Clustering

Journal of Intelligent Systems with Applications ◽

10.54856/jiswa.201905059 ◽

2019 ◽

pp. 39-45

Author(s):

Aytug Onan

Keyword(s):

Cluster Analysis ◽

Optimization Problems ◽

Clustering Algorithms ◽

Document Clustering ◽

Cuckoo Search ◽

Text Documents ◽

Text Document ◽

Analysis Technique ◽

Clustering Quality ◽

Exploratory Data

Cluster analysis is an important exploratory data analysis technique that divides data into groups based on their similarity. Document clustering is the process of applying clustering algorithms to textual data so that text documents can be retrieved, organized, navigated, and summarized efficiently. Document clustering can be used in the organization, summarization, and classification of text documents. Metaheuristic algorithms have been successfully used to deal with complex optimization problems, including cluster analysis. In this paper, we analyze the clustering quality of five metaheuristic clustering algorithms (namely, particle swarm optimization, genetic algorithm, cuckoo search, firefly algorithm, and bat algorithm) on fifteen text collections in terms of F-measure. In the empirical analysis, two conventional clustering algorithms (k-means and bisecting k-means) are also considered. The experimental analysis indicates that swarm-based clustering algorithms outperform conventional clustering algorithms on text document clustering.
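
The abstract does not say which F-measure variant is used. One common definition for clustering matches each true class to the cluster that gives it the highest F-score and then averages those scores weighted by class size. The sketch below assumes that definition; the function name and the toy labels are illustrative.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-measure (one common clustering variant)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / clu_size  # share of the cluster that is this class
            recall = n_ij / cls_size     # share of the class found in this cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (cls_size / n) * best
    return total


truth = ["sports", "sports", "politics", "politics"]
print(clustering_f_measure(truth, [0, 0, 1, 1]))  # perfect clustering
print(clustering_f_measure(truth, [0, 0, 0, 0]))  # everything in one cluster
```

A perfect clustering scores 1.0, and merging all documents into a single cluster lowers the score because precision drops for every class.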

A New “Good and Bad Groups-Based Optimizer” for Solving Various Optimization Problems

Applied Sciences ◽

10.3390/app11104382 ◽

2021 ◽

Vol 11 (10) ◽

pp. 4382

Author(s):

Ali Sadeghi ◽

Sajjad Amiri Doumari ◽

Mohammad Dehghani ◽

Zeinab Montazeri ◽

Pavel Trojovský ◽

...

Keyword(s):

Optimization Algorithm ◽

Optimization Problems ◽

Search Algorithm ◽

Fitness Function ◽

Optimization Algorithms ◽

Gravitational Search Algorithm ◽

Gray Wolf ◽

Good Group ◽

Good Ability ◽

Whale Optimization

Optimization is the science of selecting a solution from among the available solutions while respecting the constraints of an optimization problem. Optimization algorithms have been introduced as efficient tools for solving optimization problems. These algorithms are designed based on various natural phenomena, behaviors, the lifestyles of living beings, physical laws, rules of games, etc. In this paper, a new optimization algorithm called the good and bad groups-based optimizer (GBGBO) is introduced to solve various optimization problems. In GBGBO, population members are updated under the influence of two groups, named the good group and the bad group. The good group consists of a certain number of population members with better fitness function values than the other members, and the bad group consists of a number of population members with worse fitness function values than the other members of the population. GBGBO is mathematically modeled, and its performance in solving optimization problems is tested on a set of twenty-three different objective functions. In addition, for further analysis, the results obtained from the proposed algorithm are compared with eight optimization algorithms: genetic algorithm (GA), particle swarm optimization (PSO), gravitational search algorithm (GSA), teaching–learning-based optimization (TLBO), gray wolf optimizer (GWO), whale optimization algorithm (WOA), tunicate swarm algorithm (TSA), and marine predators algorithm (MPA). The results show that the proposed GBGBO algorithm has a good ability to solve various optimization problems and is more competitive than other similar algorithms.
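
The grouping step described above can be sketched as follows. Only the split into a good group and a bad group comes from the abstract; the function name, the `group_size` parameter, the assumption of a minimization problem, and the sphere test function are illustrative, and the rule by which members are then updated is not given in the abstract and is not reproduced here.

```python
def split_groups(population, fitness_values, group_size, minimize=True):
    """Return (good_group, bad_group), each holding group_size members.

    group_size is a hypothetical parameter; the abstract only says each
    group contains "a certain number" of population members.
    """
    if not 1 <= group_size <= len(population):
        raise ValueError("group_size must be between 1 and the population size")
    # Sort member indices from best to worst fitness.
    order = sorted(range(len(population)),
                   key=lambda i: fitness_values[i],
                   reverse=not minimize)
    good = [population[i] for i in order[:group_size]]
    bad = [population[i] for i in order[-group_size:]]
    return good, bad


def sphere(x):
    """Simple test objective (sum of squares), minimized at the origin."""
    return sum(v * v for v in x)


population = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [2.0, 2.0]]
scores = [sphere(member) for member in population]
good, bad = split_groups(population, scores, group_size=2)
print(good)  # the two members with the lowest objective values
print(bad)   # the two members with the highest objective values
```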

Text documents clustering using modified multi-verse optimizer

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i6.pp6361-6369 ◽

2020 ◽

Vol 10 (6) ◽

pp. 6361

Author(s):

Ammar Kamal Abasi ◽

Ahamad Tajudin Khader ◽

Mohammed Azmi Al-Betar ◽

Syibrah Naim ◽

Mohammed A. Awadallah ◽

...

Keyword(s):

Euclidean Distance ◽

Optimization Problems ◽

Continuous Optimization ◽

Search Space ◽

Discrete Optimization Problem ◽

Text Documents ◽

Text Document ◽

Continuous Optimization Problems ◽

Measure Entropy

In this study, a multi-verse optimizer (MVO) is utilised for the text document clustering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as the similarity measure. TDC is tackled by dividing the documents into clusters; documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, which is a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas in the search space and search deeply in each area using a particular learning mechanism. The proposed algorithm is called MVOTDC, and it adapts the convergence behaviour of MVO operators to deal with discrete, rather than continuous, optimization problems. For evaluating MVOTDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy, accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVOTDC can produce significant results in comparison with three well-established methods.
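
Two of the evaluation measures named above, purity and entropy, can be computed from the true class labels and the predicted cluster labels. The sketch below uses common textbook definitions and normalizes entropy by the logarithm of the number of classes; the abstract does not state the exact formulas used in the study, so the normalization and the function name are assumptions.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(true_labels, cluster_labels):
    """Return (purity, entropy); higher purity and lower entropy are better."""
    n = len(true_labels)
    clusters = defaultdict(Counter)
    for truth, cluster in zip(true_labels, cluster_labels):
        clusters[cluster][truth] += 1
    n_classes = len(set(true_labels))

    # Purity: fraction of documents that belong to their cluster's majority class.
    purity = sum(max(counts.values()) for counts in clusters.values()) / n

    # Entropy: size-weighted class entropy of each cluster, scaled to [0, 1].
    entropy = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        h = -sum((m / size) * math.log(m / size) for m in counts.values())
        if n_classes > 1:
            h /= math.log(n_classes)
        entropy += (size / n) * h
    return purity, entropy


truth = ["a", "a", "b", "b"]
print(purity_and_entropy(truth, [0, 0, 1, 1]))  # clusters match the classes
print(purity_and_entropy(truth, [0, 0, 0, 0]))  # one mixed cluster
```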

Improving the Decision Value of Hierarchical Text Clustering Using Term Overlap Detection

Australasian Journal of Information Systems ◽

10.3127/ajis.v19i0.1180 ◽

2015 ◽

Vol 19 ◽

Cited By ~ 2

Author(s):

Nilupulee Nathawitharana ◽

Damminda Alahakoon ◽

Sumith Matharage

Keyword(s):

Hierarchical Clustering ◽

Categorical Data ◽

Text Clustering ◽

Written Language ◽

Text Documents ◽

Text Data ◽

Text Document ◽

Cluster Accuracy ◽

Document Collection ◽

A New Technique

Humans are used to expressing themselves with written language, and language provides a medium with which we can describe our experiences in detail, incorporating individuality. Even though documents provide a rich source of information, it becomes very difficult to identify, extract, summarize, and search that information when vast numbers of documents are collected, especially over time. Document clustering is a technique that has been widely used to group documents based on similarity of content, as represented by the words used. Once key groups are identified, further drill-down into sub-groupings is facilitated by the use of hierarchical clustering. Clustering and hierarchical clustering are very useful when applied to numerical and categorical data, and cluster accuracy and purity measures exist to evaluate the outcomes of a clustering exercise. Although the same measures have been applied to text clustering, text clusters are based on words or terms, which can be repeated across documents associated with different topics. Therefore, in contrast to numerical and categorical data, text data cannot be considered a direct ‘coding’ of a particular experience or situation, and term overlap is a very common characteristic in text clustering. In this paper, we propose a new technique and methodology for capturing term overlap from text documents, highlight the different situations such overlap could signify, and discuss why such understanding is important for obtaining value from text clustering. Experiments were conducted using a widely used text document collection, where the proposed methodology allowed exploring the term diversity of a given document collection and obtaining clusters with minimum term overlap.

Krill Herd (KH) algorithm for text document clustering using TF–IDF features

Smart Computing ◽

10.1201/9781003167488-60 ◽

2021 ◽

pp. 502-512

Author(s):

Priyanka Shivaprasad More ◽

Baljit Singh Saini ◽

Kamaljit Singh Bhatia

Keyword(s):

Document Clustering ◽

Text Document ◽

Krill Herd

Feature Selection and Enhanced Krill Herd Algorithm for Text Document Clustering

10.1007/978-3-030-10674-4 ◽

2019 ◽

Cited By ~ 147

Author(s):

Laith Mohammad Qasim Abualigah

Keyword(s):

Feature Selection ◽

Document Clustering ◽

Krill Herd Algorithm ◽

Text Document ◽

Krill Herd

Using Sequences of Words for Non-Disjoint Grouping of Documents

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001415500135 ◽

2015 ◽

Vol 29 (03) ◽

pp. 1550013 ◽

Cited By ~ 6

Author(s):

Chiheb-Eddine Ben N'Cir ◽

Nadia Essoussi

Keyword(s):

Learning Algorithm ◽

Text Clustering ◽

Unstructured Data ◽

Text Documents ◽

Space Model ◽

Word Sequence ◽

Text Document ◽

Text Collections ◽

Textual Data ◽

Textual Content

Grouping documents based on their textual content is an important application of clustering, referred to as text clustering. This paper deals with two issues in text clustering: the detection of non-disjoint groups and the representation of textual data. A text document can discuss several topics and should then belong to several groups, so the learning algorithm must be able to produce non-disjoint clusters and assign documents to several clusters. Given that text documents are unstructured data, applying a learning algorithm requires preparing the set of documents for numerical analysis, typically by using the vector space model (VSM). This representation of text ignores correlation between terms and gives no importance to the order of words in the text. Therefore, we present in this paper an unsupervised learning method, based on the word sequence kernel, that takes into account both the correlation between adjacent words in the text and the possibility that a document belongs to more than one cluster. In addition, to facilitate the use of this method in text-analytic practice, we present the "DocCO" software, which is publicly available. Experiments performed on several text collections show that the proposed method outperforms existing overlapping methods using the VSM representation in terms of clustering accuracy.
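
The limitation of the vector space model noted above can be seen in a small example. The sketch below builds a plain bag-of-words count and, as a simple stand-in for order-aware features, a count of adjacent word pairs. It does not implement the word sequence kernel itself; the helper names and the two sample sentences are illustrative.

```python
from collections import Counter


def bag_of_words(text):
    """Term counts only; word order is discarded, as in a basic VSM."""
    return Counter(text.lower().split())


def word_bigrams(text):
    """Counts of adjacent word pairs, which retain local word order."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


a = "dog bites man"
b = "man bites dog"
print(bag_of_words(a) == bag_of_words(b))  # True: identical term counts
print(word_bigrams(a) == word_bigrams(b))  # False: adjacent pairs differ
```

The two sentences have opposite meanings yet identical bag-of-words vectors, which is why a representation that looks at adjacent words can separate documents the VSM treats as the same.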

Particle Grey Wolf Optimizer (PGWO) Algorithm and Semantic Word Processing for Automatic Text Clustering

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s0218488519500090 ◽

2019 ◽

Vol 27 (02) ◽

pp. 201-223 ◽

Cited By ~ 2

Author(s):

Ch. Vidyadhari ◽

N. Sandhya ◽

P. Premchand

Keyword(s):

Word Processing ◽

Text Categorization ◽

Text Clustering ◽

Significant Feature ◽

Grey Wolf Optimizer ◽

Grey Wolf ◽

Text Documents ◽

Text Document ◽

Text Feature ◽

Automatic Text

Text mining refers to the process of extracting high-quality information from text. It is broadly used in applications such as text clustering, text categorization, and text classification. Text clustering has recently become a useful but challenging way to group text documents. Irrelevant terms and high dimensionality reduce the accuracy of text clustering. In this paper, semantic word processing and a novel Particle Grey Wolf Optimizer (PGWO) are proposed for automatic text clustering. Initially, the text documents are given as input to the pre-processing step, which supplies the useful keywords for feature extraction and clustering. Then, each resulting keyword is looked up in the WordNet ontology to find its synonyms and hyponyms. Subsequently, the frequency of every keyword is determined and used to build the text feature library. Since the text feature library has a large dimension, entropy is used to select the most significant features. Finally, the new Particle Grey Wolf Optimizer (PGWO) algorithm is developed by integrating particle swarm optimization (PSO) into the grey wolf optimizer (GWO). The proposed algorithm is then used to assign class labels and generate the different clusters of text documents. A simulation is performed to analyze the performance of the proposed algorithm, which is compared with existing algorithms. The proposed method attains a clustering accuracy of 80.36% on the 20 Newsgroups dataset and 79.63% on the Reuters dataset.

Hybrid Fruit-Fly Optimization Algorithm with K-Means for Text Document Clustering

Mathematics ◽

10.3390/math9161929 ◽

2021 ◽

Vol 9 (16) ◽

pp. 1929

Author(s):

Timea Bezdan ◽

Catalin Stoean ◽

Ahmed Al Naamany ◽

Nebojsa Bacanin ◽

Tarik A. Rashid ◽

...

Keyword(s):

Optimization Algorithm ◽

Document Clustering ◽

Fruit Fly ◽

Text Clustering ◽

Relevant Information ◽

Fruit Fly Optimization Algorithm ◽

Hybrid Swarm ◽

Text Data ◽

Fruit Fly Optimization ◽

Text Document

The fast-growing Internet produces massive amounts of text data. Due to the large volume and unstructured format of text data, extracting and analyzing relevant information becomes very challenging. Text document clustering is a text-mining process that partitions a set of text-based documents into mutually exclusive clusters in such a way that documents within the same cluster are similar to each other, while documents from different clusters differ in content. One of the biggest challenges in text clustering is partitioning the collection of text data by measuring the relevance of the content in the documents. To address this issue, a hybrid swarm intelligence algorithm combined with the K-means algorithm is proposed for text clustering in this work. First, the hybrid fruit-fly optimization algorithm is tested on ten unconstrained CEC2019 benchmark functions. Next, the proposed method is evaluated on six standard benchmark text datasets. The experimental evaluation on the unconstrained functions, as well as on the text-based documents, indicates that the proposed approach is robust and superior to other state-of-the-art methods.
