A Brief Review of Metaheuristics for Document or Text Clustering

Document clustering, which involves concepts from the fields of information retrieval, automatic topic extraction, natural language processing, and machine learning, is one of the most popular research areas in data mining. Due to the large amount of information in electronic form, fast and high-quality cluster analysis plays an important role in helping users to effectively navigate, summarise and organise this information into useful data. A number of techniques in the literature provide efficient solutions for document clustering. However, during the last decade, researchers have turned to metaheuristic algorithms for the document clustering problem because of the limitations of traditional clustering algorithms. In this chapter, the authors give a brief review of research papers that apply different metaheuristic algorithms to document or text clustering.


An Improved β-hill Climbing Optimization Technique for Solving the Text Documents Clustering Problem

Current Medical Imaging (formerly Current Medical Imaging Reviews) ◽

10.2174/1573405614666180903112541 ◽

2020 ◽

Vol 16 (4) ◽

pp. 296-306 ◽

Cited By ~ 3

Author(s):

Laith Mohammad Abualigah ◽

Essam Said Hanandeh ◽

Ahamad Tajudin Khader ◽

Mohammed Abdallh Otair ◽

Shishir Kumar Shandilya

Keyword(s):

Optimization Technique ◽

Document Clustering ◽

Text Clustering ◽

Hill Climbing ◽

Text Documents ◽

Clustering Problem ◽

Text Document ◽

Text Information ◽

Amount Of Knowledge ◽

The Hill

Background: Given the increasing volume of text documents on Internet pages, managing such a large amount of information has become complex. Text clustering is a common optimization problem in which a large amount of text information is organised into a set of comparable and coherent clusters. Aims: This paper presents a novel local search technique, namely β-hill climbing, which addresses the text document clustering problem by partitioning similar documents into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique. It was introduced to balance local and global search. Local search methods, such as the k-medoid and k-means techniques, have been successfully applied to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results showed that the proposed β-hill climbing achieved better results than the original hill climbing technique on the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
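The idea described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the objective (sum of squared distances to cluster centroids), the single-move neighbourhood step, and the names `sse` and `beta_hill_climbing` are all assumptions made for illustration. The only element taken from the abstract is the β step, which randomly perturbs the candidate solution to add global exploration to an otherwise local search.

```python
import random


def sse(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid (assumed objective)."""
    total = 0.0
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            continue
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members)
    return total


def beta_hill_climbing(points, k, beta=0.05, iterations=2000, seed=0):
    """Hill climbing over cluster labels with an extra random 'beta' perturbation."""
    rng = random.Random(seed)
    current = [rng.randrange(k) for _ in points]
    current_cost = sse(points, current, k)
    for _ in range(iterations):
        candidate = current[:]
        # Neighbourhood step (local search): move one random document to a random cluster.
        i = rng.randrange(len(points))
        candidate[i] = rng.randrange(k)
        # Beta step (global search): each label is re-drawn at random with probability beta.
        for j in range(len(points)):
            if rng.random() < beta:
                candidate[j] = rng.randrange(k)
        # Selection step: keep the candidate only if it is no worse than the current solution.
        cost = sse(points, candidate, k)
        if cost <= current_cost:
            current, current_cost = candidate, cost
    return current, current_cost
```

Setting `beta=0` reduces the sketch to plain hill climbing, which is one way to compare the two variants on the same toy data.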


Natural language processing methods for knowledge management—Applying document clustering for fast search and grouping of engineering documents

Concurrent Engineering ◽

10.1177/1063293x20982973 ◽

2021 ◽

pp. 1063293X2098297

Author(s):

Ivar Örn Arnarsson ◽

Otto Frost ◽

Emil Gustavsson ◽

Mats Jirstrand ◽

Johan Malmqvist

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Domain Knowledge ◽

Clustering Algorithms ◽

Document Clustering ◽

Unstructured Data ◽

Free Text ◽

Engineering Change ◽

Engineering Documents

Product development companies collect data in the form of Engineering Change Requests for logged design issues, tests, and product iterations. These documents are rich in unstructured data (e.g. free text). Previous research affirms that product developers find that current IT systems lack the capability to accurately retrieve relevant documents containing unstructured data. In this research, we demonstrate a method that uses Natural Language Processing and document clustering algorithms to find structurally or contextually related documents in databases containing Engineering Change Request documents. The aim is to radically decrease the time needed to effectively search for related engineering documents, organize search results, and create labeled clusters from these documents by utilizing Natural Language Processing algorithms. A domain knowledge expert at the case company evaluated the results and confirmed that the algorithms we applied managed to find relevant document clusters given the queries tested.


Comparative Study of Document Clustering Algorithms

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i4.11.20816 ◽

2018 ◽

Vol 7 (4.11) ◽

pp. 246

Author(s):

N. M. Ariff ◽

M. A. A. Bakar ◽

M. I. Rahmad

Keyword(s):

Data Mining ◽

Hierarchical Clustering ◽

Clustering Analysis ◽

Clustering Algorithms ◽

Document Clustering ◽

Text Clustering ◽

Data Mining Technique ◽

Mining Technique ◽

Meaningful Result ◽

Different Types

Text clustering is a data mining technique that is becoming more important in present studies. Document clustering makes use of text clustering to divide documents according to their topics. The choice of words in document clustering is important to ensure that each document can be classified correctly. Three clustering methods, namely hierarchical clustering, k-means and k-medoids, are used and compared in this study in order to identify the method that produces the best result in document clustering. The three methods are applied to 60 sports articles involving four different types of sports. The k-medoids clustering produced the worst result, while k-means clustering was found to be more sensitive towards general words. Therefore, hierarchical clustering is deemed the more stable method for producing a meaningful result in document clustering analysis.
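To make the difference between the compared methods more concrete, here is a minimal k-medoids sketch. It is an illustration only, not the procedure used in the study: the alternating assign-and-update scheme, the Euclidean distance, and the names `assign` and `k_medoids` are assumptions. Unlike k-means, where a cluster centre is an averaged vector, each cluster here is represented by one of its actual members (the medoid).

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, medoids):
    """Map each medoid index to the indices of the points closest to it."""
    clusters = {m: [] for m in medoids}
    for idx, p in enumerate(points):
        nearest = min(medoids, key=lambda m: euclidean(p, points[m]))
        clusters[nearest].append(idx)
    return clusters


def k_medoids(points, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        # New medoid: the member with the smallest total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(euclidean(points[c], points[o]) for o in members))
            if members else m
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)
```

For documents, `points` would typically be term-weight vectors, and the Euclidean distance could be swapped for a cosine-based one.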


Swarm Intelligence in Text Document Clustering

Handbook of Research on Text and Web Mining Technologies ◽

10.4018/978-1-59904-990-8.ch010 ◽

2010 ◽

pp. 165-180 ◽

Cited By ~ 2

Author(s):

Xiaohui Cui

Keyword(s):

Swarm Intelligence ◽

Clustering Analysis ◽

Information Society ◽

Clustering Algorithms ◽

Document Clustering ◽

High Quality ◽

Fish Schools ◽

Text Document ◽

Self Organized ◽

Swarm Algorithms

In this chapter, we introduce three nature-inspired swarm intelligence clustering approaches for document clustering analysis. The major challenge facing today's information society is that people are overwhelmed with information on any topic they search for. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize this overwhelming amount of information. The swarm intelligence clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools, and ant foraging. Compared to the traditional clustering algorithms, the swarm algorithms are usually flexible, robust, decentralized, and self-organized. These characteristics make the swarm algorithms suitable for solving complex problems, such as document clustering.


Hierarchical Document Clustering

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch150 ◽

2011 ◽

pp. 970-975 ◽

Cited By ~ 1

Author(s):

Benjamin C.M. Fung ◽

Ke Wang ◽

Martin Ester

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithms ◽

Document Clustering ◽

High Volume ◽

Frequent Itemset ◽

Parent Child Relationship ◽

Text Documents ◽

Clustering Problem ◽

Child Relationship ◽

Divisive Hierarchical Clustering

Document clustering is an automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters. Unlike document classification (Wang, Zhou, & He, 2001), no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Hierarchical document clustering organizes clusters into a tree or a hierarchy that facilitates browsing. The parent-child relationship among the nodes in the tree can be viewed as a topic-subtopic relationship in a subject hierarchy such as the Yahoo! directory. This chapter discusses several special challenges in hierarchical document clustering: high dimensionality, high volume of data, ease of browsing, and meaningful cluster labels. State-of-the-art document clustering algorithms are reviewed: the partitioning method (Steinbach, Karypis, & Kumar, 2000), agglomerative and divisive hierarchical clustering (Kaufman & Rousseeuw, 1990), and frequent itemset-based hierarchical clustering (Fung, Wang, & Ester, 2003). The last one, which was developed by the authors, is further elaborated since it has been specially designed to address the hierarchical document clustering problem.
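The agglomerative variant mentioned in the abstract above can be sketched in a few lines. This is a generic illustration under stated assumptions (single linkage, Euclidean distance via `math.dist`, and the hypothetical function name `agglomerative`); it is not the frequent itemset-based method developed by the authors. The returned merge history corresponds to the tree: each merge creates a parent node whose two children are the merged clusters.

```python
import math


def agglomerative(points, distance=math.dist):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Record the two children and the distance at which they were joined.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges
```

The pairwise search makes this sketch far too slow for the high-volume setting the chapter discusses; it is meant only to show the bottom-up construction of the hierarchy.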


Document Clustering - Concepts, Metrics and Algorithms

International Journal of Electronics and Telecommunications ◽

10.2478/v10177-011-0036-5 ◽

2011 ◽

Vol 57 (3) ◽

pp. 271-277 ◽

Cited By ~ 4

Author(s):

Tomasz Tarczynski

Keyword(s):

Clustering Algorithms ◽

Document Clustering ◽

Text Clustering ◽

Web Searching ◽

Specific Goal ◽

Comprehensive Discussion

Document clustering, which is also referred to as text clustering, is a technique of unsupervised document organisation. Text clustering is used to group documents into subsets that consist of texts that are similar to each other. These subsets are called clusters. Document clustering algorithms are widely used in web search engines to produce results relevant to a query. An example of the practical use of these techniques is the Yahoo! hierarchy of documents [1]. Another application of document clustering is browsing, which is defined as a searching session without a well-specified goal. Browsing techniques rely heavily on document clustering. In this article we examine the most important concepts related to document clustering. Besides the algorithms, we present a comprehensive discussion of the representation of documents, the calculation of similarity between documents, and the evaluation of cluster quality.
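As a concrete illustration of document representation and similarity calculation, the sketch below builds TF-IDF vectors and compares them with cosine similarity. This is one common choice rather than necessarily the scheme examined in the article; the whitespace tokenisation, the exact TF-IDF weighting, and the function names `tf_idf_vectors` and `cosine` are assumptions.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Represent each document as a sparse {term: weight} dictionary."""
    tokenised = [doc.lower().split() for doc in documents]
    n = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this weighting, a term that occurs in every document receives weight zero, so two documents that share only such terms are treated as dissimilar.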


PDC-Transitive: An Enhanced Heuristic for Document Clustering Based on Relational Analysis Approach and Iterative MapReduce

Journal of Information & Knowledge Management ◽

10.1142/s0219649218500211 ◽

2018 ◽

Vol 17 (02) ◽

pp. 1850021

Author(s):

Yasmine Lamari ◽

Said Chah Slaoui

Keyword(s):

Clustering Algorithms ◽

Computing Time ◽

Document Clustering ◽

Large Data ◽

Original Method ◽

Data Partitioning ◽

Data Dependencies ◽

Clustering Problem ◽

Relational Analysis ◽

Benchmark Datasets

Recently, MapReduce-based implementations of clustering algorithms have been developed to cope with the Big Data phenomenon, and they show promising results, particularly for the document clustering problem. In this paper, we extend PDC-Transitive, an efficient data partitioning method based on the relational analysis (RA) approach and applied to the document clustering problem. The introduced heuristic is parallelised iteratively using the MapReduce model and designed with a single reducer, which represents a bottleneck when processing large data. We improved the design of the PDC-Transitive method to avoid the data dependencies and reduce the computation cost. Experimental results on benchmark datasets demonstrate that the enhanced heuristic yields better quality results and requires less computing time compared to the original method.


Journal of Technical and Natural Sciences 9, 2018/ OEAPS Inc.(Open European Academy of Public Sciences); Chief Editor Vitaly Yakovlev- Berlin, Germany. 18.12.2018: OEAPS Inc., 2019. - 35 P.

10.31219/osf.io/kems3 ◽

2019 ◽

Author(s):

Inc. OEAPS ◽

Михаил Владимирович Кармаза ◽

Роман Владимирович Мотылев ◽

Вероника Александровна Одрузова ◽

Нишчхал ◽

...

Keyword(s):

Chief Editor ◽

Natural Sciences ◽

Theory And Practice ◽

Original Research ◽

Research Papers ◽

High Quality ◽

European Academy ◽

Constant Interest ◽

Critical Reviews

Authoritative and critical reviews of the latest achievements of natural and technical disciplines are published by Journal of Technical and Natural Sciences. Journal of Technical and Natural Sciences, an international peer-reviewed journal, publishes both theoretical and experimental high-quality documents of constant interest, previously unpublished in journals, in the field of technical and natural sciences, whose purpose is to promote theory and practice. In addition to the peer-reviewed original research papers, the Editorial Board welcomes original research reports, modern surveys and communications in a broadly defined field of technical and natural sciences.


Patterns of the Network of Cross-Border University Research Collaboration in the Guangdong-Hong Kong-Macau Greater Bay Area

Sustainability ◽

10.3390/su12176846 ◽

2020 ◽

Vol 12 (17) ◽

pp. 6846

Author(s):

Jinyuan Ma ◽

Fan Jiang ◽

Liujian Gu ◽

Xiang Zheng ◽

Xiao Lin ◽

...

Keyword(s):

Social Sciences ◽

Hong Kong ◽

Research Collaboration ◽

Citation Impact ◽

High Quality ◽

High Quality Research ◽

Cross Border ◽

Bay Area ◽

Research Areas ◽

Industrial Distribution

This study analyzes the patterns of university co-authorship networks in the Guangdong-Hong Kong-Macau Greater Bay Area. It also examines the quality and subject distribution of co-authored articles within these networks. Social network analysis is used to outline the structure and evolution of the networks that have produced co-authored articles at universities in the Greater Bay Area from 2014 to 2018, at both regional and institutional levels. Field-weighted citation impact (FWCI) is used to analyze the quality and citation impact of co-authored articles in different subject fields. The findings of the study reveal that university co-authorship networks in the Greater Bay Area are still dispersed, and their disciplinary development is unbalanced. The study also finds that, while the research areas covered by high-quality co-authored articles fit the strategic needs of technological innovation and industrial distribution in the Greater Bay Area, high-quality research collaboration in the humanities and social sciences is insufficient.


Emotion Recognition on Edge Devices: Training and Deployment

Sensors ◽

10.3390/s21134496 ◽

2021 ◽

Vol 21 (13) ◽

pp. 4496

Author(s):

Vlad Pandelea ◽

Edoardo Ragusa ◽

Tommaso Apicella ◽

Paolo Gastaldo ◽

Erik Cambria

Keyword(s):

Emotion Recognition ◽

Language Processing ◽

Computational Cost ◽

Sequential Learning ◽

High Quality ◽

Fast Training ◽

Online Sequential Learning ◽

And Performance ◽

Resource Constrained Devices ◽

Constrained Devices

Emotion recognition, among other natural language processing tasks, has greatly benefited from the use of large transformer models. Deploying these models on resource-constrained devices, however, is a major challenge due to their computational cost. In this paper, we show that the combination of large transformers, as high-quality feature extractors, and simple hardware-friendly classifiers based on linear separators can achieve competitive performance while allowing real-time inference and fast training. Various solutions including batch and Online Sequential Learning are analyzed. Additionally, our experiments show that latency and performance can be further improved via dimensionality reduction and pre-training, respectively. The resulting system is implemented on two types of edge device, namely an edge accelerator and two smartphones.
