Solving document clustering problem through meta heuristic algorithm

Author(s):  
Muhammad Rafi ◽  
Bilal Aamer ◽  
Mubashir Naseem ◽  
Muhammad Osama
Author(s):  
Laith Mohammad Abualigah ◽  
Essam Said Hanandeh ◽  
Ahamad Tajudin Khader ◽  
Mohammed Abdallh Otair ◽  
Shishir Kumar Shandilya

Background: Considering the increasing volume of text document information on Internet pages, dealing with such a tremendous amount of knowledge becomes totally complex due to its large size. Text clustering is a common optimization problem used to manage a large amount of text information into a subset of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, namely, β-hill climbing, to solve the problem of the text document clustering through modeling the β-hill climbing technique for partitioning the similar documents into the same cluster. Methods: The β parameter is the primary innovation in β-hill climbing technique. It has been introduced in order to perform a balance between local and global search. Local search methods are successfully applied to solve the problem of the text document clustering such as; k-medoid and kmean techniques. Results: Experiments were conducted on eight benchmark standard text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results proved that the proposed β-hill climbing achieved better results in comparison with the original hill climbing technique in solving the text clustering problem. Conclusion: The performance of the text clustering is useful by adding the β operator to the hill climbing.


Author(s):  
Sinem Büyüksaatçı ◽  
Alp Baray

Document clustering, which involves concepts from the fields of information retrieval, automatic topic extraction, natural language processing, and machine learning, is one of the most popular research areas in data mining. Due to the large amount of information in electronic form, fast and high-quality cluster analysis plays an important role in helping users to effectively navigate, summarize and organise this information for useful data. There are a number of techniques in the literature, which efficiently provide solutions for document clustering. However, during the last decade, researchers started to use metaheuristic algorithms for the document clustering problem because of the limitations of the existing traditional clustering algorithms. In this chapter, the authors will give a brief review of various research papers that present the area of document or text clustering approaches with different metaheuristic algorithms.


Author(s):  
Benjamin C.M. Fung ◽  
Ke Wang ◽  
Martin Ester

Document clustering is an automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters. Unlike document classification (Wang, Zhou, & He, 2001), no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Hierarchical document clustering organizes clusters into a tree or a hierarchy that facilitates browsing. The parent-child relationship among the nodes in the tree can be viewed as a topic-subtopic relationship in a subject hierarchy such as the Yahoo! directory. This chapter discusses several special challenges in hierarchical document clustering: high dimensionality, high volume of data, ease of browsing, and meaningful cluster labels. State-of-the-art document clustering algorithms are reviewed: the partitioning method (Steinbach, Karypis, & Kumar, 2000), agglomerative and divisive hierarchical clustering (Kaufman & Rousseeuw, 1990), and frequent itemset-based hierarchical clustering (Fung, Wang, & Ester, 2003). The last one, which was developed by the authors, is further elaborated since it has been specially designed to address the hierarchical document clustering problem.


With the rapid growth of web documents on WWW, it is becoming difficult to organize, analyze and present these documents efficiently. Web search engines return many documents to the web user, out of which some are relevant and some irrelevant documents to the topic, for the given query. Web search is usually performed using only features extracted from the web page text. HTML tags with particular meanings have been found to improve the efficiency of the information retrieval System. However, organizing documents in a way that will improve search without additional cost or complexity is still a great challenge. Clustering can play an important role to organize such a large number of documents into several groups. However due to limitations in existing techniques of clustering, scientists have begun using Meta-heuristic algorithms for the clustering problem of documents. In this paper, we presented a document clustering method that uses HTML tags and Metaheuristic approaches. The hybrid PSO+ACO+K-means algorithm is used for clustering the documents. In the proposed approach, results are analyzed on WEBKB dataset


2019 ◽  
Vol 163 ◽  
pp. 228-236 ◽  
Author(s):  
Mohammad Babrdel Bonab ◽  
Siti Zaiton Mohd Hashim ◽  
Tay Yong Haur ◽  
Goh Yong Kheng

2018 ◽  
Vol 17 (02) ◽  
pp. 1850021
Author(s):  
Yasmine Lamari ◽  
Said Chah Slaoui

Recently, MapReduce-based implementations of clustering algorithms have been developed to cope with the Big Data phenomenon, and they show promising results particularly for the document clustering problem. In this paper, we extend an efficient data partitioning method based on the relational analysis (RA) approach and applied to the document clustering problem, called PDC-Transitive. The introduced heuristic is parallelised using the MapReduce model iteratively and designed with a single reducer which represents a bottleneck when processing large data, we improved the design of the PDC-Transitive method to avoid the data dependencies and reduce the computation cost. Experiment results on benchmark datasets demonstrate that the enhanced heuristic yields better quality results and requires less computing time compared to the original method.


Author(s):  
Andrea Tagarelli

The ability of providing a “standardized, extensible means of coupling semantic information within documents describing semistructured data” (Chaudhri, Rashid, & Zicari, 2003) has led to a steady growth of XML (extensible markup language) data sources, so that XML is touted as the driving force for representing and exchanging data on the Web. The motivation behind any clustering problem is to find an inherent structure of relationships in the data and expose this structure as a set of clusters where the objects within the same cluster are each to other highly similar but very dissimilar from objects in different clusters. The clustering problem finds in text databases a fruitful research area. Since today semistructured text data has become more prevalent on the Web, and XML is the de facto standard for such data, clustering XML documents has increasingly attracted great attention. Any application domain that needs organization of complex document structures (e.g., hierarchical structures with unbounded nesting, object-oriented hierarchies) as well as data containing a few structured fields together with some largely unstructured text components can be profitably assisted by an XML document clustering task.


Author(s):  
Awatif Karim ◽  
Chakir Loqman ◽  
Youssef Hami ◽  
Jaouad Boumhidi

In this paper, we propose a new approach to solve the document-clustering using the K-Means algorithm. The latter is sensitive to the random selection of the k cluster centroids in the initialization phase. To evaluate the quality of K-Means clustering we propose to model the text document clustering problem as the max stable set problem (MSSP) and use continuous Hopfield network to solve the MSSP problem to have initial centroids. The idea is inspired by the fact that MSSP and clustering share the same principle, MSSP consists to find the largest set of nodes completely disconnected in a graph, and in clustering, all objects are divided into disjoint clusters. Simulation results demonstrate that the proposed K-Means improved by MSSP (KM_MSSP) is efficient of large data sets, is much optimized in terms of time, and provides better quality of clustering than other methods.


Sign in / Sign up

Export Citation Format

Share Document