Solving document clustering problem through meta heuristic algorithm

Background: As the volume of text documents on Internet pages grows, managing such a large amount of information becomes increasingly complex. Text clustering is a common optimization problem in which a large collection of text documents is partitioned into a set of comparable and coherent clusters. Aims: This paper presents a novel local search technique, β-hill climbing, for the text document clustering problem; the technique is modeled so that similar documents are placed in the same cluster. Methods: The β parameter is the primary innovation of the β-hill climbing technique. It is introduced to balance local and global search. Local search methods such as k-medoids and k-means have been applied successfully to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics, taken from the Laboratory of Computational Intelligence (LABIC). The results showed that the proposed β-hill climbing achieved better results than the original hill climbing technique in solving the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
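The role of β described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the solution encoding (one cluster label per document), the objective (sum of squared distances to cluster centroids), the single-reassignment neighbourhood move, and the names `sse` and `beta_hill_climbing` are all assumptions made for illustration. Setting `beta=0` reduces the loop to plain hill climbing.

```python
import random


def sse(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    # Empty clusters have no centroid; no point refers to them.
    centroids = [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(k)
    ]
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[c]))
        for p, c in zip(points, labels)
    )


def beta_hill_climbing(points, k, beta=0.05, iterations=2000, seed=0):
    """Hill climbing with an extra beta step (assumed form, for illustration)."""
    rng = random.Random(seed)
    n = len(points)
    current = [rng.randrange(k) for _ in range(n)]
    current_cost = sse(points, current, k)
    for _ in range(iterations):
        candidate = current[:]
        # Local search step: reassign one randomly chosen document.
        candidate[rng.randrange(n)] = rng.randrange(k)
        # Beta step: each label is re-drawn at random with probability beta,
        # which lets the search jump away from the current neighbourhood.
        for i in range(n):
            if rng.random() < beta:
                candidate[i] = rng.randrange(k)
        cost = sse(points, candidate, k)
        # Keep the candidate only if it is at least as good.
        if cost <= current_cost:
            current, current_cost = candidate, cost
    return current, current_cost
```

Because a candidate is accepted only when it does not increase the objective, the returned cost is never worse than that of the random starting assignment; a larger β trades local refinement for wider exploration.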


An Efficient Robust Hyper-Heuristic Algorithm to Clustering Problem

Advances in Intelligent Systems and Computing - Computational Intelligence in Information Systems ◽

10.1007/978-3-030-03302-6_5 ◽

2018 ◽

pp. 48-60

Author(s):

Mohammad Babrdel Bonab ◽

Yong Haur Tay ◽

Siti Zaiton Mohd Hashim ◽

Khoo Thau Soon

Keyword(s):

Heuristic Algorithm ◽

Clustering Problem


A New Genetic-Based Hyper-Heuristic Algorithm for Clustering Problem

Advances in Intelligent Systems and Computing - Proceedings of the 12th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2020) ◽

10.1007/978-3-030-73689-7_15 ◽

2021 ◽

pp. 145-155

Author(s):

Mohammad Babrdel Bonab ◽

Goi Bok-Min ◽

Madhavan a/l Balan Nair ◽

Chua Kein Huat ◽

Wong Chim Chwee

Keyword(s):

Heuristic Algorithm ◽

Clustering Problem


A Brief Review of Metaheuristics for Document or Text Clustering

Intelligent Techniques for Data Analysis in Diverse Settings - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-0075-9.ch012 ◽

2016 ◽

pp. 252-264 ◽

Cited By ~ 1

Author(s):

Sinem Büyüksaatçı ◽

Alp Baray

Keyword(s):

Language Processing ◽

Clustering Algorithms ◽

Document Clustering ◽

Text Clustering ◽

Metaheuristic Algorithms ◽

Research Papers ◽

High Quality ◽

Topic Extraction ◽

Clustering Problem ◽

Research Areas

Document clustering, which involves concepts from the fields of information retrieval, automatic topic extraction, natural language processing, and machine learning, is one of the most popular research areas in data mining. Because so much information is available in electronic form, fast and high-quality cluster analysis plays an important role in helping users navigate, summarize, and organize this information effectively. A number of techniques in the literature provide efficient solutions for document clustering. However, over the last decade, researchers have started to use metaheuristic algorithms for the document clustering problem because of the limitations of traditional clustering algorithms. In this chapter, the authors give a brief review of research papers that apply different metaheuristic algorithms to document or text clustering.


Hierarchical Document Clustering

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch150 ◽

2011 ◽

pp. 970-975 ◽

Cited By ~ 1

Author(s):

Benjamin C.M. Fung ◽

Ke Wang ◽

Martin Ester

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithms ◽

Document Clustering ◽

High Volume ◽

Frequent Itemset ◽

Parent Child Relationship ◽

Text Documents ◽

Clustering Problem ◽

Child Relationship ◽

Divisive Hierarchical Clustering

Document clustering is an automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters. Unlike document classification (Wang, Zhou, & He, 2001), no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Hierarchical document clustering organizes clusters into a tree or a hierarchy that facilitates browsing. The parent-child relationship among the nodes in the tree can be viewed as a topic-subtopic relationship in a subject hierarchy such as the Yahoo! directory. This chapter discusses several special challenges in hierarchical document clustering: high dimensionality, high volume of data, ease of browsing, and meaningful cluster labels. State-of-the-art document clustering algorithms are reviewed: the partitioning method (Steinbach, Karypis, & Kumar, 2000), agglomerative and divisive hierarchical clustering (Kaufman & Rousseeuw, 1990), and frequent itemset-based hierarchical clustering (Fung, Wang, & Ester, 2003). The last one, which was developed by the authors, is further elaborated since it has been specially designed to address the hierarchical document clustering problem.
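The agglomerative approach mentioned in this abstract can be sketched as follows. This is an illustrative sketch only, not the chapter's frequent itemset-based method: it assumes whitespace-tokenized term-count vectors, cosine similarity, and average-link merging, and the function names are hypothetical. The result is a nested tuple of document indices that mirrors the parent-child tree.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def agglomerate(docs):
    """Average-link agglomerative clustering over a non-empty list of texts.

    Returns a nested tuple of document indices representing the merge tree.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    # Each cluster is (subtree, list of member document indices).
    clusters = [(i, [i]) for i in range(len(docs))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [
                    cosine(vectors[a], vectors[b])
                    for a in clusters[i][1]
                    for b in clusters[j][1]
                ]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        # Merge the most similar pair into a new parent node.
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

For four short texts, two about fruit and two about cars, the tree groups each topic under its own parent before joining the two parents at the root. The pairwise search is quadratic per merge, which is one reason the high-volume challenge named in the abstract matters in practice.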


Efficient Retrieval Of Html Documents Using Hybrid Meta-Heuristic Approaches In Web Document Clustering

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e1192.0585c19 ◽

2019 ◽

Vol 8 (5C) ◽

pp. 1350-1354

Keyword(s):

Web Search ◽

Heuristic Algorithms ◽

Document Clustering ◽

Web Documents ◽

Clustering Problem ◽

Web Document ◽

Efficient Retrieval ◽

Web Document Clustering ◽

The Given ◽

The Web

With the rapid growth of web documents on the WWW, it is becoming difficult to organize, analyze, and present these documents efficiently. For a given query, web search engines return many documents to the user, some relevant to the topic and some not. Web search is usually performed using only features extracted from the web page text. HTML tags with particular meanings have been found to improve the efficiency of information retrieval systems. However, organizing documents in a way that improves search without additional cost or complexity is still a great challenge. Clustering can play an important role in organizing such a large number of documents into several groups. However, due to limitations in existing clustering techniques, researchers have begun using metaheuristic algorithms for the document clustering problem. In this paper, we present a document clustering method that uses HTML tags and metaheuristic approaches. The hybrid PSO+ACO+K-means algorithm is used for clustering the documents. In the proposed approach, results are analyzed on the WEBKB dataset.


A New Swarm-Based Simulated Annealing Hyper-Heuristic Algorithm for Clustering Problem

Procedia Computer Science ◽

10.1016/j.procs.2019.12.104 ◽

2019 ◽

Vol 163 ◽

pp. 228-236 ◽

Cited By ~ 1

Author(s):

Mohammad Babrdel Bonab ◽

Siti Zaiton Mohd Hashim ◽

Tay Yong Haur ◽

Goh Yong Kheng

Keyword(s):

Simulated Annealing ◽

Heuristic Algorithm ◽

Clustering Problem


PDC-Transitive: An Enhanced Heuristic for Document Clustering Based on Relational Analysis Approach and Iterative MapReduce

Journal of Information & Knowledge Management ◽

10.1142/s0219649218500211 ◽

2018 ◽

Vol 17 (02) ◽

pp. 1850021

Author(s):

Yasmine Lamari ◽

Said Chah Slaoui

Keyword(s):

Clustering Algorithms ◽

Computing Time ◽

Document Clustering ◽

Large Data ◽

Original Method ◽

Data Partitioning ◽

Data Dependencies ◽

Clustering Problem ◽

Relational Analysis ◽

Benchmark Datasets

Recently, MapReduce-based implementations of clustering algorithms have been developed to cope with the Big Data phenomenon, and they show promising results, particularly for the document clustering problem. In this paper, we extend PDC-Transitive, an efficient data partitioning method based on the relational analysis (RA) approach and applied to the document clustering problem. The introduced heuristic is parallelised iteratively using the MapReduce model, but it was designed with a single reducer, which represents a bottleneck when processing large data. We improve the design of the PDC-Transitive method to avoid the data dependencies and reduce the computation cost. Experimental results on benchmark datasets demonstrate that the enhanced heuristic yields better-quality results and requires less computing time than the original method.


XML Document Clustering

Handbook of Research on Innovations in Database Technologies and Applications ◽

10.4018/978-1-60566-242-8.ch071 ◽

2009 ◽

pp. 665-673 ◽

Cited By ~ 1

Author(s):

Andrea Tagarelli

Keyword(s):

Document Clustering ◽

Hierarchical Structures ◽

Research Area ◽

Clustering Problem ◽

Xml Document ◽

Extensible Markup ◽

Language Data ◽

Document Structures ◽

Fruitful Research ◽

The Web

The ability to provide a “standardized, extensible means of coupling semantic information within documents describing semistructured data” (Chaudhri, Rashid, & Zicari, 2003) has led to a steady growth of XML (extensible markup language) data sources, so that XML is touted as the driving force for representing and exchanging data on the Web. The motivation behind any clustering problem is to find an inherent structure of relationships in the data and expose this structure as a set of clusters, where the objects within the same cluster are highly similar to each other but very dissimilar from objects in different clusters. Text databases are a fruitful research area for the clustering problem. Since semistructured text data has become more prevalent on the Web, and XML is the de facto standard for such data, clustering XML documents has attracted increasing attention. Any application domain that needs to organize complex document structures (e.g., hierarchical structures with unbounded nesting, object-oriented hierarchies), or data containing a few structured fields together with some largely unstructured text components, can be profitably assisted by an XML document clustering task.


Max stable set problem to found the initial centroids in clustering problem

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v25.i1.pp569-579 ◽

2022 ◽

Vol 25 (1) ◽

pp. 569-579

Author(s):

Awatif Karim ◽

Chakir Loqman ◽

Youssef Hami ◽

Jaouad Boumhidi

Keyword(s):

Document Clustering ◽

Large Data ◽

Hopfield Network ◽

Large Data Sets ◽

Stable Set ◽

Data Sets ◽

Clustering Problem ◽

Text Document ◽

Stable Set Problem

In this paper, we propose a new approach to document clustering using the K-Means algorithm, which is sensitive to the random selection of the k cluster centroids in the initialization phase. To address this, we propose to model the text document clustering problem as the max stable set problem (MSSP) and to use a continuous Hopfield network to solve the MSSP and obtain the initial centroids. The idea is inspired by the fact that MSSP and clustering share the same principle: MSSP consists of finding the largest set of completely disconnected nodes in a graph, and in clustering all objects are divided into disjoint clusters. Simulation results demonstrate that the proposed K-Means improved by MSSP (KM_MSSP) is efficient on large data sets, requires much less time, and provides better clustering quality than other methods.
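The link between a stable set and well-separated initial centroids can be illustrated with a small sketch. It is not the paper's method: a simple greedy heuristic stands in for the continuous Hopfield network, and the graph construction (an edge joins two points whose squared Euclidean distance is below a threshold) is an assumption, as are the function names. Nodes in a stable set share no edge, so the chosen points are mutually far apart and can serve as seeds.

```python
def similarity_edges(points, threshold):
    """Connect two points when their squared distance is below the threshold."""
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if dist2 < threshold:
                edges.append((i, j))
    return edges


def greedy_stable_set(n, edges):
    """Greedy heuristic for a stable (independent) set.

    Repeatedly picks the remaining node with the fewest remaining
    neighbours, then discards that node and all of its neighbours.
    The result is a maximal stable set, not necessarily a maximum one.
    """
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    remaining = set(range(n))
    chosen = []
    while remaining:
        node = min(remaining, key=lambda v: (len(adj[v] & remaining), v))
        chosen.append(node)
        remaining -= adj[node] | {node}
    return chosen


def initial_centroids(points, threshold):
    """Use the points selected by the stable set as K-Means seeds."""
    seeds = greedy_stable_set(len(points), similarity_edges(points, threshold))
    return [points[i] for i in seeds]
```

With two tight groups of points and a threshold smaller than the gap between them, the sketch returns one seed per group. In this construction the number of seeds, and hence k, follows from the threshold rather than being fixed in advance.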
