A hybrid approach for text document clustering using Jaya optimization algorithm

The fast-growing Internet results in massive amounts of text data. Due to the large volume of the unstructured format of text data, extracting relevant information and its analysis becomes very challenging. Text document clustering is a text-mining process that partitions the set of text-based documents into mutually exclusive clusters in such a way that documents within the same group are similar to each other, while documents from different clusters differ based on the content. One of the biggest challenges in text clustering is partitioning the collection of text data by measuring the relevance of the content in the documents. Addressing this issue, in this work a hybrid swarm intelligence algorithm with a K-means algorithm is proposed for text clustering. First, the hybrid fruit-fly optimization algorithm is tested on ten unconstrained CEC2019 benchmark functions. Next, the proposed method is evaluated on six standard benchmark text datasets. The experimental evaluation on the unconstrained functions, as well as on text-based documents, indicated that the proposed approach is robust and superior to other state-of-the-art methods.

Download Full-text

Optimal Text Document Clustering Enabled by Weighed Similarity Oriented Jaya With Grey Wolf Optimization Algorithm

The Computer Journal ◽

10.1093/comjnl/bxab013 ◽

2021 ◽

Author(s):

Gugulothu Venkanna ◽

Dr K F Bharati

Keyword(s):

Optimization Algorithm ◽

Document Clustering ◽

Similarity Index ◽

Scientific Development ◽

Similarity Function ◽

Grey Wolf ◽

Text Document ◽

Clustering Model ◽

Proposed Model ◽

Better Than

Abstract Owing to scientific development, a variety of challenges present in the field of information retrieval. These challenges are because of the increased usage of large volumes of data. These huge amounts of data are presented from large-scale distributed networks. Centralization of these data to carry out analysis is tricky. There exists a requirement for novel text document clustering algorithms, which overcomes challenges in clustering. The two most important challenges in clustering are clustering accuracy and quality. For this reason, this paper intends to present an ideal clustering model for text document using term frequency–inverse document frequency, which is considered as feature sets. Here, the initial centroid selection is much concentrated which can automatically cluster the text using weighted similarity measure in the proposed clustering process. In fact, the weighted similarity function involves the inter-cluster, and intra-cluster similarity of both ordered and unordered documents, which is used to minimize weighted similarity among the documents. An advanced model for clustering is proposed by the hybrid optimization algorithm, which is the combination of the Jaya Algorithm (JA) and Grey Wolf Algorithm (GWO), and so the proposed algorithm is termed as JA-based GWO. Finally, the performance of the proposed model is verified through a comparative analysis with the state-of-the-art models. The performance analysis exhibits that the proposed model is 96.56% better than genetic algorithm, 99.46% better than particle swarm optimization, 97.09% superior to Dragonfly algorithm, and 96.21% better than JA for the similarity index. Therefore, the proposed model has confirmed its efficiency through valuable analysis.

Download Full-text

Evaluation of text document clustering approach based on particle swarm optimization

Open Computer Science ◽

10.2478/s13537-013-0104-2 ◽

2013 ◽

Vol 3 (2) ◽

Cited By ~ 18

Author(s):

Stuti Karol ◽

Veenu Mangat

Keyword(s):

Data Mining ◽

Particle Swarm Optimization ◽

Clustering Algorithms ◽

Hybrid Approach ◽

Document Clustering ◽

Particle Swarm ◽

Pso Algorithm ◽

Swarm Optimization ◽

Fuzzy C Means ◽

Text Document

AbstractClustering, an extremely important technique in Data Mining is an automatic learning technique aimed at grouping a set of objects into subsets or clusters. The goal is to create clusters that are coherent internally, but substantially different from each other. Text Document Clustering refers to the clustering of related text documents into groups based upon their content. It is a fundamental operation used in unsupervised document organization, text data mining, automatic topic extraction, and information retrieval. Fast and high-quality document clustering algorithms play an important role in effectively navigating, summarizing, and organizing information. The documents to be clustered can be web news articles, abstracts of research papers etc. This paper proposes two techniques for efficient document clustering involving the application of soft computing approach as an intelligent hybrid approach PSO algorithm. The proposed approach involves partitioning Fuzzy C-Means algorithm and K-Means algorithm each hybridized with Particle Swarm Optimization (PSO). The performance of these hybrid algorithms has been evaluated against traditional partitioning techniques (K-Means and Fuzzy C Means).

Download Full-text

Comparision of Different Distance Measure Methods in Text Document Clustering

INTERNATIONAL JOURNAL OF RESEARCH AND ENGINEERING ◽

10.21276/ijre.2018.5.7.2 ◽

2018 ◽

Vol 5 (7) ◽

Author(s):

Yin Min Tun ◽

Keyword(s):

Distance Measure ◽

Document Clustering ◽

Text Document ◽

Measure Methods

Download Full-text

An Improved B-hill Climbing Optimization Technique for Solving the Text Documents Clustering Problem

Current Medical Imaging Formerly Current Medical Imaging Reviews ◽

10.2174/1573405614666180903112541 ◽

2020 ◽

Vol 16 (4) ◽

pp. 296-306 ◽

Cited By ~ 3

Author(s):

Laith Mohammad Abualigah ◽

Essam Said Hanandeh ◽

Ahamad Tajudin Khader ◽

Mohammed Abdallh Otair ◽

Shishir Kumar Shandilya

Keyword(s):

Optimization Technique ◽

Document Clustering ◽

Text Clustering ◽

Hill Climbing ◽

Text Documents ◽

Clustering Problem ◽

Text Document ◽

Text Information ◽

Amount Of Knowledge ◽

The Hill

Background: Considering the increasing volume of text document information on Internet pages, dealing with such a tremendous amount of knowledge becomes totally complex due to its large size. Text clustering is a common optimization problem used to manage a large amount of text information into a subset of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, namely, β-hill climbing, to solve the problem of the text document clustering through modeling the β-hill climbing technique for partitioning the similar documents into the same cluster. Methods: The β parameter is the primary innovation in β-hill climbing technique. It has been introduced in order to perform a balance between local and global search. Local search methods are successfully applied to solve the problem of the text document clustering such as; k-medoid and kmean techniques. Results: Experiments were conducted on eight benchmark standard text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results proved that the proposed β-hill climbing achieved better results in comparison with the original hill climbing technique in solving the text clustering problem. Conclusion: The performance of the text clustering is useful by adding the β operator to the hill climbing.

Download Full-text

MultiStep Ahead Forecasting for Hourly PM10 and PM2.5 Based on Two-Stage Decomposition Embedded Sample Entropy and Group Teacher Optimization Algorithm

Atmosphere ◽

10.3390/atmos12010064 ◽

2021 ◽

Vol 12 (1) ◽

pp. 64

Author(s):

Feng Jiang ◽

Yaqian Qiao ◽

Xuchu Jiang ◽

Tianhai Tian

Keyword(s):

Optimization Algorithm ◽

Hybrid Approach ◽

Forecast Accuracy ◽

Ensemble Empirical Mode Decomposition ◽

Sample Entropy ◽

Air Pollutant ◽

Intrinsic Mode Functions ◽

Concentration Data ◽

Two Stage ◽

Pm10 And Pm2.5

The randomness, nonstationarity and irregularity of air pollutant data bring difficulties to forecasting. To improve the forecast accuracy, we propose a novel hybrid approach based on two-stage decomposition embedded sample entropy, group teaching optimization algorithm (GTOA), and extreme learning machine (ELM) to forecast the concentration of particulate matter (PM10 and PM2.5). First, the improvement complementary ensemble empirical mode decomposition with adaptive noise (ICEEMDAN) is employed to decompose the concentration data of PM10 and PM2.5 into a set of intrinsic mode functions (IMFs) with different frequencies. In addition, wavelet transform (WT) is utilized to decompose the IMFs with high frequency based on sample entropy values. Then the GTOA algorithm is used to optimize ELM. Furthermore, the GTOA-ELM is utilized to predict all the subseries. The final forecast result is obtained by ensemble of the forecast results of all subseries. To further prove the predictable performance of the hybrid approach on air pollutants, the hourly concentration data of PM2.5 and PM10 are used to make one-step-, two-step- and three-step-ahead predictions. The empirical results demonstrate that the hybrid ICEEMDAN-WT-GTOA-ELM approach has superior forecasting performance and stability over other methods. This novel method also provides an effective and efficient approach to make predictions for nonlinear, nonstationary and irregular data.

Download Full-text