A Brief Review of Metaheuristics for Document or Text Clustering

Author(s):  
Sinem Büyüksaatçı ◽  
Alp Baray

Document clustering, which involves concepts from the fields of information retrieval, automatic topic extraction, natural language processing, and machine learning, is one of the most popular research areas in data mining. Due to the large amount of information in electronic form, fast and high-quality cluster analysis plays an important role in helping users to effectively navigate, summarize and organise this information into useful data. The literature offers a number of techniques that efficiently solve the document clustering problem. However, during the last decade, researchers started to use metaheuristic algorithms for this problem because of the limitations of the traditional clustering algorithms. In this chapter, the authors give a brief review of research papers that apply different metaheuristic algorithms to document or text clustering.

Author(s):  
Laith Mohammad Abualigah ◽  
Essam Said Hanandeh ◽  
Ahamad Tajudin Khader ◽  
Mohammed Abdallh Otair ◽  
Shishir Kumar Shandilya

Background: As the volume of text documents on Internet pages grows, handling such a large amount of information becomes increasingly complex. Text clustering is a common optimization problem in which a large collection of text documents is organized into a set of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, β-hill climbing, which solves the text document clustering problem by partitioning similar documents into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique. It is introduced to strike a balance between local and global search. Local search methods such as k-medoid and k-means have been applied successfully to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results showed that the proposed β-hill climbing achieved better results than the original hill climbing technique on the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
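The search loop described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes a solution is a list of cluster labels (one per document), that the objective is the sum of squared distances to cluster centroids, and that the β operator resets each label to a random cluster with probability β. The names `sse` and `beta_hill_climbing`, the default β of 0.05, and the toy objective are all hypothetical choices made for the example.

```python
import random


def sse(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    dim = len(points[0])
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        for p in members:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    return total


def beta_hill_climbing(points, k, beta=0.05, iterations=2000, seed=0):
    """Hill climbing with an extra random-reset step (assumed beta operator)."""
    rng = random.Random(seed)
    current = [rng.randrange(k) for _ in points]
    current_cost = sse(points, current, k)
    for _ in range(iterations):
        candidate = current[:]
        # Local move: reassign one randomly chosen document.
        candidate[rng.randrange(len(points))] = rng.randrange(k)
        # Assumed beta operator: reset each label with probability beta,
        # which adds a small amount of global exploration.
        for i in range(len(candidate)):
            if rng.random() < beta:
                candidate[i] = rng.randrange(k)
        cost = sse(points, candidate, k)
        # Greedy selection: keep the candidate only if it is not worse.
        if cost <= current_cost:
            current, current_cost = candidate, cost
    return current, current_cost
```

Calling `beta_hill_climbing(vectors, k=2)` on a list of equal-length numeric vectors returns the final labels and their cost. Setting `beta=0` reduces the sketch to plain hill climbing, which is the baseline the abstract compares against.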


2021 ◽  
pp. 1063293X2098297
Author(s):  
Ivar Örn Arnarsson ◽  
Otto Frost ◽  
Emil Gustavsson ◽  
Mats Jirstrand ◽  
Johan Malmqvist

Product development companies collect data in the form of Engineering Change Requests for logged design issues, tests, and product iterations. These documents are rich in unstructured data (e.g. free text). Previous research affirms that product developers find that current IT systems lack the capability to accurately retrieve relevant documents containing unstructured data. In this research, we demonstrate a method that uses Natural Language Processing and document clustering algorithms to find structurally or contextually related documents in databases containing Engineering Change Request documents. The aim is to radically decrease the time needed to effectively search for related engineering documents, organize search results, and create labeled clusters from these documents by utilizing Natural Language Processing algorithms. A domain knowledge expert at the case company evaluated the results and confirmed that the algorithms we applied managed to find relevant document clusters given the queries tested.


2018 ◽  
Vol 7 (4.11) ◽  
pp. 246
Author(s):  
N. M. Ariff ◽  
M. A. A. Bakar ◽  
M. I. Rahmad

Text clustering is a data mining technique that is becoming more important in present studies. Document clustering makes use of text clustering to divide documents according to their various topics. The choice of words in document clustering is important to ensure that the documents can be classified correctly. Three clustering methods (hierarchical clustering, k-means and k-medoids) are compared in this study in order to identify which produces the best result in document clustering. The three methods are applied to 60 sports articles involving four different types of sports. The k-medoids clustering produced the worst result, while k-means clustering was found to be more sensitive to general words. Therefore, hierarchical clustering is deemed more stable in producing a meaningful result in document clustering analysis.


Author(s):  
Xiaohui Cui

In this chapter, we introduce three nature-inspired swarm intelligence clustering approaches for document clustering analysis. The major challenge for members of today's information society is being overwhelmed with information on any topic they search for. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize this overwhelming amount of information. The swarm intelligence clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools, and ant food foraging. Compared to the traditional clustering algorithms, the swarm algorithms are usually flexible, robust, decentralized, and self-organized. These characteristics make the swarm algorithms suitable for solving complex problems, such as document clustering.


Author(s):  
Benjamin C.M. Fung ◽  
Ke Wang ◽  
Martin Ester

Document clustering is an automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters. Unlike document classification (Wang, Zhou, & He, 2001), no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Hierarchical document clustering organizes clusters into a tree or a hierarchy that facilitates browsing. The parent-child relationship among the nodes in the tree can be viewed as a topic-subtopic relationship in a subject hierarchy such as the Yahoo! directory. This chapter discusses several special challenges in hierarchical document clustering: high dimensionality, high volume of data, ease of browsing, and meaningful cluster labels. State-of-the-art document clustering algorithms are reviewed: the partitioning method (Steinbach, Karypis, & Kumar, 2000), agglomerative and divisive hierarchical clustering (Kaufman & Rousseeuw, 1990), and frequent itemset-based hierarchical clustering (Fung, Wang, & Ester, 2003). The last one, which was developed by the authors, is further elaborated since it has been specially designed to address the hierarchical document clustering problem.
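To make the idea of an agglomerative hierarchy concrete, here is a minimal sketch of bottom-up clustering. It is not the frequent itemset-based method the chapter elaborates. It assumes documents are already numeric vectors, and it uses single linkage with Euclidean distance purely for illustration. The function names `euclidean` and `agglomerative` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, distance=euclidean):
    """Merge the two closest clusters repeatedly (single linkage).

    Returns the merge history as (left, right, distance) tuples, where
    left and right are tuples of point indices. Read bottom-up, the
    history describes the tree of parent-child cluster relationships.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(distance(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges
```

Each entry in the returned history corresponds to one internal node of the hierarchy, so the last entry is the root. The pairwise search is quadratic per merge, which is one reason the high data volume mentioned in the abstract is a challenge for this family of methods.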


2011 ◽  
Vol 57 (3) ◽  
pp. 271-277 ◽  
Author(s):  
Tomasz Tarczynski

Document Clustering - Concepts, Metrics and Algorithms

Document clustering, which is also referred to as text clustering, is a technique of unsupervised document organisation. Text clustering is used to group documents into subsets that consist of texts that are similar to each other. These subsets are called clusters. Document clustering algorithms are widely used in web search engines to produce results relevant to a query. One example of the practical use of these techniques is the Yahoo! hierarchy of documents [1]. Another application of document clustering is browsing, which is defined as a search session without a well-specified goal. Browsing techniques rely heavily on document clustering. In this article we examine the most important concepts related to document clustering. Besides the algorithms, we present a comprehensive discussion of the representation of documents, the calculation of similarity between documents, and the evaluation of cluster quality.
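As a small illustration of document representation and similarity calculation, the sketch below builds bag-of-words term-frequency vectors and compares them with cosine similarity. This is one common choice and is assumed here for illustration. The article may discuss other representations and metrics. The function names and the whitespace tokenisation are hypothetical simplifications.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Represent a document as a bag of lowercase whitespace-separated terms."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity(term_frequencies(doc1), term_frequencies(doc2))` gives a value between 0 and 1 for any two non-empty texts, with higher values meaning more shared vocabulary. A clustering algorithm can then group documents whose pairwise similarity is high.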


2018 ◽  
Vol 17 (02) ◽  
pp. 1850021
Author(s):  
Yasmine Lamari ◽  
Said Chah Slaoui

Recently, MapReduce-based implementations of clustering algorithms have been developed to cope with the Big Data phenomenon, and they show promising results, particularly for the document clustering problem. In this paper, we extend PDC-Transitive, an efficient data partitioning method based on the relational analysis (RA) approach and applied to the document clustering problem. The introduced heuristic is parallelised iteratively using the MapReduce model and designed with a single reducer, which becomes a bottleneck when processing large data. We improve the design of the PDC-Transitive method to avoid the data dependencies and reduce the computation cost. Experimental results on benchmark datasets demonstrate that the enhanced heuristic yields better quality results and requires less computing time compared to the original method.


2019 ◽  
Author(s):  
Inc. OEAPS ◽  
Михаил Владимирович Кармаза ◽  
Роман Владимирович Мотылев ◽  
Вероника Александровна Одрузова ◽  
Нишчхал ◽  
...  

Authoritative and critical reviews of the latest achievements of natural and technical disciplines are published by Journal of Technical and Natural Sciences. Journal of Technical and Natural Sciences, an international peer-reviewed journal, publishes both theoretical and experimental high-quality documents of constant interest, previously unpublished in journals, in the field of technical and natural sciences, whose purpose is to promote theory and practice. In addition to the peer-reviewed original research papers, the Editorial Board welcomes original research reports, modern surveys and communications in a broadly defined field of technical and natural sciences.


2020 ◽  
Vol 12 (17) ◽  
pp. 6846
Author(s):  
Jinyuan Ma ◽  
Fan Jiang ◽  
Liujian Gu ◽  
Xiang Zheng ◽  
Xiao Lin ◽  
...  

This study analyzes the patterns of university co-authorship networks in the Guangdong-Hong Kong-Macau Greater Bay Area. It also examines the quality and subject distribution of co-authored articles within these networks. Social network analysis is used to outline the structure and evolution of the networks that have produced co-authored articles at universities in the Greater Bay Area from 2014 to 2018, at both regional and institutional levels. Field-weighted citation impact (FWCI) is used to analyze the quality and citation impact of co-authored articles in different subject fields. The findings of the study reveal that university co-authorship networks in the Greater Bay Area are still dispersed, and their disciplinary development is unbalanced. The study also finds that, while the research areas covered by high-quality co-authored articles fit the strategic needs of technological innovation and industrial distribution in the Greater Bay Area, high-quality research collaboration in the humanities and social sciences is insufficient.


Sensors ◽  
2021 ◽  
Vol 21 (13) ◽  
pp. 4496
Author(s):  
Vlad Pandelea ◽  
Edoardo Ragusa ◽  
Tommaso Apicella ◽  
Paolo Gastaldo ◽  
Erik Cambria

Emotion recognition, among other natural language processing tasks, has greatly benefited from the use of large transformer models. Deploying these models on resource-constrained devices, however, is a major challenge due to their computational cost. In this paper, we show that the combination of large transformers, as high-quality feature extractors, and simple hardware-friendly classifiers based on linear separators can achieve competitive performance while allowing real-time inference and fast training. Various solutions including batch and Online Sequential Learning are analyzed. Additionally, our experiments show that latency and performance can be further improved via dimensionality reduction and pre-training, respectively. The resulting system is implemented on two types of edge device, namely an edge accelerator and two smartphones.

