An Improved B-hill Climbing Optimization Technique for Solving the Text Documents Clustering Problem

Background: The volume of text documents on Internet pages keeps growing, and managing such a large amount of information is difficult. Text clustering is a common optimization problem in which a large collection of text is organized into a set of comparable, coherent clusters. Aims: This paper presents a novel local clustering technique, β-hill climbing, for the text document clustering problem; the technique is modeled so that similar documents are partitioned into the same cluster. Methods: The β parameter is the primary innovation of the β-hill climbing technique. It was introduced to balance local and global search. Local search methods such as k-medoid and k-means have been applied successfully to text document clustering. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics, taken from the Laboratory of Computational Intelligence (LABIC). The results showed that the proposed β-hill climbing achieved better results than the original hill climbing technique on the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
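The abstract does not give the details of the search, so the following is only a minimal sketch of how a β operator can be added to hill climbing for clustering. It assumes a solution is a vector of cluster labels, the cost is the within-cluster sum of squared distances, and documents are already small numeric vectors; the names `within_cluster_cost` and `beta_hill_climbing` and all parameter values are illustrative, not taken from the paper.

```python
import random

def within_cluster_cost(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members)
    return total

def beta_hill_climbing(points, k, beta=0.05, iterations=500, seed=0):
    """Hill climbing over label vectors with an extra random-reset (beta) step."""
    rng = random.Random(seed)
    current = [rng.randrange(k) for _ in points]
    current_cost = within_cluster_cost(points, current, k)
    for _ in range(iterations):
        candidate = current[:]
        # Neighbourhood move (local search): reassign one random document.
        candidate[rng.randrange(len(points))] = rng.randrange(k)
        # Beta operator (assumed form): each label is re-drawn with probability beta,
        # which lets the search jump away from the current neighbourhood.
        for i in range(len(candidate)):
            if rng.random() < beta:
                candidate[i] = rng.randrange(k)
        candidate_cost = within_cluster_cost(points, candidate, k)
        # Accept only solutions that are no worse than the current one.
        if candidate_cost <= current_cost:
            current, current_cost = candidate, candidate_cost
    return current, current_cost

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, final_cost = beta_hill_climbing(points, k=2)
print(labels, final_cost)
```

With `beta=0` the loop reduces to plain hill climbing; larger values trade local refinement for wider exploration, which is the balance the abstract attributes to β.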

Nature-Inspired Optimization Algorithms for Text Document Clustering—A Comprehensive Analysis

Algorithms ◽

10.3390/a13120345 ◽

2020 ◽

Vol 13 (12) ◽

pp. 345

Author(s):

Laith Abualigah ◽

Amir H. Gandomi ◽

Mohamed Abd Elaziz ◽

Abdelazim G. Hussien ◽

Ahmad M. Khasawneh ◽

...

Keyword(s):

Optimization Problems ◽

Optimization Algorithms ◽

Harmony Search ◽

Document Clustering ◽

Text Clustering ◽

Gray Wolf ◽

Text Documents ◽

Text Document ◽

Krill Herd ◽

Clustering Problems

Text clustering is an efficient unsupervised learning technique used to partition a large number of text documents into a set of clusters, such that each cluster contains similar documents and different clusters contain dissimilar documents. Nature-inspired optimization algorithms have been successfully used to solve various optimization problems, including text document clustering problems. In this paper, a comprehensive review is presented of the most relevant nature-inspired algorithms that have been used to solve the text clustering problem. Moreover, comprehensive experiments are conducted and analyzed to show the performance of common, well-known nature-inspired optimization algorithms on text document clustering problems, including the Harmony Search (HS) Algorithm, Genetic Algorithm (GA), Particle Swarm Optimization (PSO) Algorithm, Ant Colony Optimization (ACO), Krill Herd Algorithm (KHA), Cuckoo Search (CS) Algorithm, Gray Wolf Optimizer (GWO), and Bat-inspired Algorithm (BA). Seven text benchmark datasets are used to validate the performance of the tested algorithms. The results showed that the performance of the well-known nature-inspired optimization algorithms is almost the same, with slight differences. For improvement purposes, new modified versions of the tested algorithms can be proposed and tested to tackle text clustering problems.

Dual Scaling in Data Mining from Text Databases

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2006.p0451 ◽

2006 ◽

Vol 10 (4) ◽

pp. 451-457 ◽

Cited By ~ 3

Author(s):

Junzo Watada ◽

Keisuke Aoki ◽

Masahiro Kawano ◽

Muhammad Suzuri Hitam ◽

...

Keyword(s):

Multivariate Analysis ◽

Text Mining ◽

Kansei Engineering ◽

Semantic Meaning ◽

Dual Scaling ◽

Text Documents ◽

Text Data ◽

Text Document ◽

Text Information ◽

Quantification Model

The availability of multimedia text document information has spread interest in text mining among researchers. Text documents integrate numerical and linguistic data, making text mining interesting and challenging. We propose text mining based on a fuzzy quantification model and a fuzzy thesaurus. In text mining, we focus on: 1) breaking the sentences of Japanese text down into words; 2) a fuzzy thesaurus for finding words in the text that match keywords; 3) fuzzy multivariate analysis to analyze semantic meaning in predefined case studies. We use a fuzzy thesaurus to translate words written in Chinese and Japanese characters into keywords. This speeds up processing without requiring a dictionary to separate words. Fuzzy multivariate analysis is used to analyze the processed data and to extract latent, mutually related structures in text data, i.e., to extract otherwise obscured knowledge. We apply dual scaling to mining library and Web page text information, and propose integrating the result in Kansei engineering for possible application in sales, marketing, and production.

Improving the Decision Value of Hierarchical Text Clustering Using Term Overlap Detection

Australasian Journal of Information Systems ◽

10.3127/ajis.v19i0.1180 ◽

2015 ◽

Vol 19 ◽

Cited By ~ 2

Author(s):

Nilupulee Nathawitharana ◽

Damminda Alahakoon ◽

Sumith Matharage

Keyword(s):

Hierarchical Clustering ◽

Categorical Data ◽

Text Clustering ◽

Written Language ◽

Text Documents ◽

Text Data ◽

Text Document ◽

Cluster Accuracy ◽

Document Collection ◽

A New Technique

Humans are used to expressing themselves in written language, and language provides a medium with which we can describe our experiences in detail, incorporating individuality. Even though documents provide a rich source of information, it becomes very difficult to identify, extract, summarize and search that information when vast amounts of documents are collected, especially over time. Document clustering is a technique that has been widely used to group documents based on similarity of content, as represented by the words used. Once key groups are identified, further drill-down into sub-groupings is facilitated by hierarchical clustering. Clustering and hierarchical clustering are very useful when applied to numerical and categorical data, and cluster accuracy and purity measures exist to evaluate the outcomes of a clustering exercise. Although the same measures have been applied to text clustering, text clusters are based on words or terms, which can be repeated across documents associated with different topics. Therefore, in contrast to numerical and categorical data, text data cannot be considered a direct ‘coding’ of a particular experience or situation, and term overlap is a very common characteristic in text clustering. In this paper we propose a new technique and methodology for capturing term overlap from text documents, highlighting the different situations such overlap could signify, and discuss why such understanding is important for obtaining value from text clustering. Experiments were conducted using a widely used text document collection, where the proposed methodology allowed exploring the term diversity of a given document collection and obtaining clusters with minimum term overlap.

Using Sequences of Words for Non-Disjoint Grouping of Documents

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001415500135 ◽

2015 ◽

Vol 29 (03) ◽

pp. 1550013 ◽

Cited By ~ 6

Author(s):

Chiheb-Eddine Ben N'Cir ◽

Nadia Essoussi

Keyword(s):

Learning Algorithm ◽

Text Clustering ◽

Unstructured Data ◽

Text Documents ◽

Space Model ◽

Word Sequence ◽

Text Document ◽

Text Collections ◽

Textual Data ◽

Textual Content

Grouping documents based on their textual content is an important application of clustering, referred to as text clustering. This paper deals with two issues in text clustering: the detection of non-disjoint groups and the representation of textual data. A text document can discuss several topics and must then belong to several groups. The learning algorithm must be able to produce non-disjoint clusters and assign documents to several clusters. Given that text documents are considered unstructured data, applying a learning algorithm requires preparing the set of documents for numerical analysis, typically with the vector space model (VSM). This representation of text ignores correlation between terms and gives no importance to the order of words in the text. Therefore, we present in this paper an unsupervised learning method, based on the word sequence kernel, that takes into account both the correlation between adjacent words in text and the possibility that a document belongs to more than one cluster. In addition, to facilitate the use of this method in text-analytic practice, we present the "DocCO" software, which is publicly available. Experiments performed on several text collections show that the proposed method outperforms existing overlapping methods using the VSM representation in terms of clustering accuracy.
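The limitation of the VSM noted above can be illustrated with a small sketch. It compares two sentences that use the same words in a different order, first as bags of words and then as bags of adjacent word pairs. The bigram representation is a deliberately simplified stand-in chosen for illustration; it is not the word sequence kernel used in the paper, and the helper names are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def bag_of_words(text):
    """VSM-style representation: word counts, word order discarded."""
    return Counter(text.lower().split())

def bag_of_bigrams(text):
    """Counts of adjacent word pairs, so local word order is kept."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

d1 = "dog bites man"
d2 = "man bites dog"

# Same words, so the bag-of-words vectors coincide (similarity close to 1).
print(cosine(bag_of_words(d1), bag_of_words(d2)))
# No adjacent pair is shared, so the bigram similarity is 0.
print(cosine(bag_of_bigrams(d1), bag_of_bigrams(d2)))
```

The two documents are indistinguishable under the bag-of-words view but share no adjacent pair, which is the kind of order information a sequence-based representation is meant to retain.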

Particle Grey Wolf Optimizer (PGWO) Algorithm and Semantic Word Processing for Automatic Text Clustering

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s0218488519500090 ◽

2019 ◽

Vol 27 (02) ◽

pp. 201-223 ◽

Cited By ~ 2

Author(s):

Ch. Vidyadhari ◽

N. Sandhya ◽

P. Premchand

Keyword(s):

Word Processing ◽

Text Categorization ◽

Text Clustering ◽

Significant Feature ◽

Grey Wolf Optimizer ◽

Grey Wolf ◽

Text Documents ◽

Text Document ◽

Text Feature ◽

Automatic Text

Text mining refers to the process of extracting high-quality information from text. It is broadly used in applications such as text clustering, text categorization, and text classification. Recently, text clustering has become a useful but challenging task for grouping text documents. Irrelevant terms and high dimensionality reduce the accuracy of text clustering. In this paper, semantic word processing and a novel Particle Grey Wolf Optimizer (PGWO) are proposed for automatic text clustering. Initially, the text documents are given as input to the pre-processing step, which provides the useful keywords for feature extraction and clustering. Then, the resulting keywords are applied to the WordNet ontology to find the synonyms and hyponyms of every keyword. Subsequently, the frequency of every keyword is determined and used to build the text feature library. Since the text feature library has a large dimension, entropy is used to select the most significant features. Finally, the new Particle Grey Wolf Optimizer (PGWO) algorithm is developed by integrating particle swarm optimization (PSO) into the grey wolf optimizer (GWO). The proposed algorithm is then used to assign class labels, generating the different clusters of text documents. A simulation is performed to analyze the performance of the proposed algorithm, which is compared with existing algorithms. The proposed method attains a clustering accuracy of 80.36% on the 20 Newsgroup dataset and 79.63% on the Reuters dataset, indicating better automatic text clustering.
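The abstract does not say how entropy is computed or in which direction it is used for selection. One plausible reading, sketched below purely as an assumption, scores each term by the entropy of its frequency distribution over documents and keeps the terms with the lowest entropy, since a term concentrated in few documents tends to discriminate between them. The function names are hypothetical.

```python
from collections import Counter
from math import log2

def term_entropy(term, documents):
    """Entropy (in bits) of a term's frequency distribution over the documents."""
    counts = [Counter(doc.lower().split())[term] for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in probs)

def select_features(documents, keep):
    """Keep the `keep` terms with the lowest entropy (assumed selection rule)."""
    vocabulary = sorted({w for doc in documents for w in doc.lower().split()})
    ranked = sorted(vocabulary, key=lambda t: term_entropy(t, documents))
    return ranked[:keep]

docs = ["the wolf hunts", "the market falls"]
print(term_entropy("the", docs))   # spread evenly over both documents
print(term_entropy("wolf", docs))  # concentrated in one document
print(select_features(docs, keep=4))
```

A known weakness of this simple rule is that very rare terms also get low entropy, so in practice it would likely be combined with a minimum-frequency filter.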

A Brief Review of Metaheuristics for Document or Text Clustering

Intelligent Techniques for Data Analysis in Diverse Settings - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-0075-9.ch012 ◽

2016 ◽

pp. 252-264 ◽

Cited By ~ 1

Author(s):

Sinem Büyüksaatçı ◽

Alp Baray

Keyword(s):

Language Processing ◽

Clustering Algorithms ◽

Document Clustering ◽

Text Clustering ◽

Metaheuristic Algorithms ◽

Research Papers ◽

High Quality ◽

Topic Extraction ◽

Clustering Problem ◽

Research Areas

Document clustering, which involves concepts from the fields of information retrieval, automatic topic extraction, natural language processing, and machine learning, is one of the most popular research areas in data mining. Due to the large amount of information in electronic form, fast and high-quality cluster analysis plays an important role in helping users to effectively navigate, summarize and organise this information for useful data. There are a number of techniques in the literature, which efficiently provide solutions for document clustering. However, during the last decade, researchers started to use metaheuristic algorithms for the document clustering problem because of the limitations of the existing traditional clustering algorithms. In this chapter, the authors will give a brief review of various research papers that present the area of document or text clustering approaches with different metaheuristic algorithms.

Hierarchical Document Clustering

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch150 ◽

2011 ◽

pp. 970-975 ◽

Cited By ~ 1

Author(s):

Benjamin C.M. Fung ◽

Ke Wang ◽

Martin Ester

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithms ◽

Document Clustering ◽

High Volume ◽

Frequent Itemset ◽

Parent Child Relationship ◽

Text Documents ◽

Clustering Problem ◽

Child Relationship ◽

Divisive Hierarchical Clustering

Document clustering is an automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters. Unlike document classification (Wang, Zhou, & He, 2001), no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Hierarchical document clustering organizes clusters into a tree or a hierarchy that facilitates browsing. The parent-child relationship among the nodes in the tree can be viewed as a topic-subtopic relationship in a subject hierarchy such as the Yahoo! directory. This chapter discusses several special challenges in hierarchical document clustering: high dimensionality, high volume of data, ease of browsing, and meaningful cluster labels. State-of-the-art document clustering algorithms are reviewed: the partitioning method (Steinbach, Karypis, & Kumar, 2000), agglomerative and divisive hierarchical clustering (Kaufman & Rousseeuw, 1990), and frequent itemset-based hierarchical clustering (Fung, Wang, & Ester, 2003). The last one, which was developed by the authors, is further elaborated since it has been specially designed to address the hierarchical document clustering problem.
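To make the tree structure concrete, here is a minimal sketch of agglomerative (bottom-up) hierarchical clustering. It assumes single linkage and one-dimensional numeric points as a toy stand-in for documents; it is not the frequent itemset-based method developed by the authors, and the function name is illustrative.

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering; returns the hierarchy as nested tuples."""
    # Each cluster is a pair: (subtree, list of member points).
    clusters = [(p, [p]) for p in points]
    while len(clusters) > 1:
        best = None
        # Find the two clusters whose closest members are nearest to each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge them: the new parent node has the two subtrees as children.
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters[0][0]

print(agglomerative([1, 2, 10, 12]))  # ((1, 2), (10, 12))
```

Each nested tuple is a parent node whose two elements are its children, mirroring the topic-subtopic relationship described above; a divisive method would build the same kind of tree from the root downwards.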

Hybrid Fruit-Fly Optimization Algorithm with K-Means for Text Document Clustering

Mathematics ◽

10.3390/math9161929 ◽

2021 ◽

Vol 9 (16) ◽

pp. 1929

Author(s):

Timea Bezdan ◽

Catalin Stoean ◽

Ahmed Al Naamany ◽

Nebojsa Bacanin ◽

Tarik A. Rashid ◽

...

Keyword(s):

Optimization Algorithm ◽

Document Clustering ◽

Fruit Fly ◽

Text Clustering ◽

Relevant Information ◽

Fruit Fly Optimization Algorithm ◽

Hybrid Swarm ◽

Text Data ◽

Fruit Fly Optimization ◽

Text Document

The fast-growing Internet produces massive amounts of text data. Because of the large volume and unstructured format of this data, extracting and analyzing relevant information becomes very challenging. Text document clustering is a text-mining process that partitions the set of text-based documents into mutually exclusive clusters in such a way that documents within the same group are similar to each other, while documents from different clusters differ based on the content. One of the biggest challenges in text clustering is partitioning the collection of text data by measuring the relevance of the content in the documents. Addressing this issue, in this work a hybrid swarm intelligence algorithm with a K-means algorithm is proposed for text clustering. First, the hybrid fruit-fly optimization algorithm is tested on ten unconstrained CEC2019 benchmark functions. Next, the proposed method is evaluated on six standard benchmark text datasets. The experimental evaluation on the unconstrained functions, as well as on text-based documents, indicated that the proposed approach is robust and superior to other state-of-the-art methods.

Metaheuristics Based Clustering Algorithms on Document Clustering

Journal of Intelligent Systems with Applications ◽

10.54856/jiswa.201905059 ◽

2019 ◽

pp. 39-45

Author(s):

Aytug Onan

Keyword(s):

Cluster Analysis ◽

Optimization Problems ◽

Clustering Algorithms ◽

Document Clustering ◽

Cuckoo Search ◽

Text Documents ◽

Text Document ◽

Analysis Technique ◽

Clustering Quality ◽

Exploratory Data

Cluster analysis is an important exploratory data analysis technique which divides data into groups based on their similarity. Document clustering is the process of employing clustering algorithms on textual data so that text documents can be retrieved, organized, navigated and summarized in an efficient way. Document clustering can be utilized in the organization, summarization and classification of text documents. Metaheuristic algorithms have been successfully utilized to deal with complex optimization problems, including cluster analysis. In this paper, we analyze the clustering quality of five metaheuristic clustering algorithms (namely, particle swarm optimization, genetic algorithm, cuckoo search, firefly algorithm and bat algorithm) on fifteen text collections in terms of F-measure. In the empirical analysis, two conventional clustering algorithms (K-means and bisecting K-means) are also considered. The experimental analysis indicates that swarm-based clustering algorithms outperform conventional clustering algorithms on text document clustering.
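The abstract does not state which F-measure variant is used. A common definition for clustering, assumed here, matches each true class to the cluster that gives it the best F-score and averages those scores weighted by class size. The function name is illustrative.

```python
from collections import Counter

def clustering_f_measure(true_classes, cluster_labels):
    """Class-size-weighted average of the best per-class F-score over all clusters."""
    n = len(true_classes)
    class_sizes = Counter(true_classes)
    cluster_sizes = Counter(cluster_labels)
    joint = Counter(zip(true_classes, cluster_labels))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j  # share of the cluster that belongs to the class
            recall = n_ij / n_i     # share of the class captured by the cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total

# A clustering that matches the classes exactly scores 1.0.
print(clustering_f_measure(["a", "a", "b", "b"], [1, 1, 0, 0]))
# Putting everything into one cluster keeps recall high but halves precision.
print(clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 0]))
```

Cluster identifiers are arbitrary, so the score depends only on how documents are grouped, not on the label values.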

OAPS: An Optimization Algorithm for Part Separation in Assembly Design for Additive Manufacturing

Volume 4: 23rd Design for Manufacturing and the Life Cycle Conference; 12th International Conference on Micro- and Nanosystems ◽

10.1115/detc2018-85662 ◽

2018 ◽

Author(s):

Angshuman Deka ◽

Sara Behdad

Keyword(s):

Additive Manufacturing ◽

Optimization Algorithm ◽

Cutting Planes ◽

Processing Time ◽

Optimization Technique ◽

Optimal Number ◽

Hill Climbing ◽

Productivity Improvement ◽

Cutting Processes ◽

The Hill

Additive Manufacturing (AM) provides the advantage of producing complex shapes that are not possible through traditional cutting processes. Along these lines, assembly-based part design in AM creates opportunities for productivity improvement. This paper proposes an improved optimization algorithm for part separation (OAPS) in assembly-based part design in additive manufacturing. For a given object, previous studies often provide the optimal number of parts resulting from cutting processes and their corresponding orientation to obtain the minimum processing time. During part separation, the cutting plane direction to generate subparts for assembly was often selected randomly in previous studies. The current work addresses the use of random cutting planes for part separation and instead uses the hill climbing optimization technique to generate the cutting planes that separate the parts. The OAPS provides the optimal number of assemblies and the build orientation of the parts for the minimum processing time. Two examples are provided to demonstrate the application of the OAPS algorithm.
