Particle Grey Wolf Optimizer (PGWO) Algorithm and Semantic Word Processing for Automatic Text Clustering

Author(s):  
Ch. Vidyadhari ◽  
N. Sandhya ◽  
P. Premchand

Text mining refers to the process of extracting high-quality information from text. It is broadly used in applications such as text clustering, text categorization, and text classification. Text clustering has recently become a useful yet challenging task for grouping text documents, and irrelevant terms together with high dimensionality reduce its accuracy. In this paper, semantic word processing and a novel Particle Grey Wolf Optimizer (PGWO) are proposed for automatic text clustering. Initially, the text documents are passed through a pre-processing step that supplies useful keywords for feature extraction and clustering. The resulting keywords are then looked up in the WordNet ontology to find the synonyms and hyponyms of every keyword. Subsequently, the frequency of every keyword is determined and used to build the text feature library. Since the text feature library is high-dimensional, entropy is used to select the most significant features. Finally, the PGWO algorithm is developed by integrating particle swarm optimization (PSO) into the grey wolf optimizer (GWO); it assigns class labels to generate the different clusters of text documents. Simulations are performed to analyze the performance of the proposed algorithm and compare it with existing algorithms. The proposed method attains a clustering accuracy of 80.36% on the 20 Newsgroups dataset and 79.63% on the Reuters dataset, confirming improved automatic text clustering.
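
The abstract does not give the update equations, so the sketch below is only a minimal, assumed illustration of how a PSO-style velocity term might be blended with the GWO pull toward the three best wolves when each wolf encodes a set of cluster centroids; the fitness function, encoding, and parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fitness(centroids, X):
    """Sum of distances from each document vector to its nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1).sum()

def pgwo_step(positions, velocities, pbest, X, a, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One hybrid PSO/GWO update; positions has shape (n_wolves, k, d)."""
    rng = rng or np.random.default_rng()
    scores = np.array([fitness(p, X) for p in positions])
    alpha, beta, delta = positions[np.argsort(scores)[:3]]     # three best wolves
    for i in range(len(positions)):
        pos, vel = positions[i], velocities[i]
        # GWO component: average pull toward alpha, beta and delta.
        target = np.zeros_like(pos)
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
            A, C = 2 * a * r1 - a, 2 * r2
            target += leader - A * np.abs(C * leader - pos)
        target /= 3.0
        # PSO component: inertia plus attraction to personal best and GWO target.
        r3, r4 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r3 * (pbest[i] - pos) + c2 * r4 * (target - pos)
        positions[i], velocities[i] = pos + vel, vel
    return positions, velocities
```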

Author(s):  
Laith Mohammad Abualigah ◽  
Essam Said Hanandeh ◽  
Ahamad Tajudin Khader ◽  
Mohammed Abdallh Otair ◽  
Shishir Kumar Shandilya

Background: Considering the increasing volume of text document information on Internet pages, dealing with such a tremendous amount of information becomes complex due to its sheer size. Text clustering is a common optimization problem used to organize a large amount of text into a set of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, namely β-hill climbing, to solve the text document clustering problem by modeling β-hill climbing to partition similar documents into the same cluster. Methods: The β parameter is the primary innovation of the β-hill climbing technique; it is introduced to balance local and global search. Local search methods such as k-medoid and k-means have been successfully applied to the text document clustering problem. Results: Experiments were conducted on eight benchmark standard text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results show that the proposed β-hill climbing achieves better results than the original hill climbing technique in solving the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
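
As a rough, assumed illustration of the β operator described above, the following Python sketch encodes a solution as one cluster label per document, applies a local move (the usual hill-climbing neighbour) followed by a β-probability random reset of each label, and accepts the candidate greedily; the objective and parameter values are placeholders, not the paper's exact setup.

```python
import numpy as np

def cluster_cost(labels, X, k):
    """Assumed objective: average within-cluster distance to the centroid."""
    total = 0.0
    for c in range(k):
        members = X[labels == c]
        if len(members):
            total += np.linalg.norm(members - members.mean(axis=0), axis=1).mean()
    return total / k

def beta_hill_climbing(X, k, beta=0.05, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    best = cluster_cost(labels, X, k)
    for _ in range(iters):
        cand = labels.copy()
        cand[rng.integers(len(X))] = rng.integers(k)      # N-operator: local move
        mutate = rng.random(len(X)) < beta                # beta-operator: random reset
        cand[mutate] = rng.integers(0, k, size=mutate.sum())
        score = cluster_cost(cand, X, k)
        if score < best:                                  # greedy acceptance
            labels, best = cand, score
    return labels, best
```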


2020 ◽  
pp. 3397-3407
Author(s):  
Nur Syafiqah Mohd Nafis ◽  
Suryanti Awang

Text documents are unstructured and high-dimensional. Effective feature selection is required to pick the most important and significant features from the sparse feature space. This paper therefore proposes an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high-dimensional text classification. The technique is able to measure feature importance in a high-dimensional text document and aims to increase the efficiency of feature selection, thereby obtaining a promising text classification accuracy. In the first stage, TF-IDF acts as a filter that measures the importance of features in the text documents. In the second stage, SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subset. The experiments use a text dataset retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing is applied to extract relevant features, and the pre-processed features are divided into training and testing datasets. Feature selection is then performed on the training dataset by calculating the TF-IDF score of each feature, after which SVM-RFE is applied for feature ranking. Only the top-ranked features are selected for text classification with the SVM classifier. The experiments show that the proposed technique achieves 98% accuracy, outperforming other existing techniques. In conclusion, the proposed technique is able to select the significant features in unstructured and high-dimensional text documents.
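
A minimal sketch of this two-stage idea, under assumed toy data and parameters rather than the authors' exact pipeline, can be put together with scikit-learn: TF-IDF weighting first, then recursive feature elimination driven by a linear SVM, then the final SVM classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

docs = ["great phone battery", "terrible service today", "love this update"]
labels = [1, 0, 1]  # toy sentiment-style labels for illustration only

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),    # stage 1: TF-IDF weighting
    ("rfe", RFE(LinearSVC(), n_features_to_select=2)),   # stage 2: recursive elimination
    ("svm", LinearSVC()),                                 # final SVM classifier
])
pipeline.fit(docs, labels)
print(pipeline.predict(["battery is great"]))
```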


2020 ◽  
Vol 11 (4) ◽  
pp. 72-92
Author(s):  
Ch. Vidyadhari ◽  
N. Sandhya ◽  
P. Premchand

In this research paper, an incremental clustering approach built on the MapReduce framework is implemented; it includes two phases, a mapper phase and a reducer phase. The mapper phase comprises two processes, pre-processing and feature extraction. Once the input data are pre-processed, feature extraction is performed using WordNet features. The features are then fed to the reducer phase, where they are selected using an entropy function. Automatic incremental clustering is then performed using the bat-grey wolf optimizer (BAGWO), which integrates the bat algorithm (BA) into grey wolf optimization (GWO) to generate the clusters of text documents. When incremental data arrive, the new data are mapped against the existing centroids to obtain the appropriate cluster; a kernel-based deep point distance is used for the mapping, and a fuzzy concept is used for the centroid update. The proposed framework outperformed existing techniques in terms of rand coefficient, Jaccard coefficient, and clustering accuracy, with maximal values of 0.921, 0.920, and 0.95, respectively.
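
The paper's exact entropy formulation is not given in the abstract, so the following sketch merely assumes one common choice: score each term by the entropy of its frequency distribution across documents and keep the lowest-entropy (most concentrated, hence most discriminative) terms.

```python
import numpy as np

def select_by_entropy(tf_matrix, n_keep):
    """Keep the n_keep terms whose distribution over documents has the lowest entropy."""
    # Normalise each term's counts across documents into a probability vector.
    probs = tf_matrix / np.clip(tf_matrix.sum(axis=0, keepdims=True), 1e-12, None)
    entropy = -(probs * np.log2(np.clip(probs, 1e-12, None))).sum(axis=0)
    keep = np.argsort(entropy)[:n_keep]
    return tf_matrix[:, keep], keep

tf = np.array([[3, 0, 1], [0, 4, 1], [2, 0, 1]], dtype=float)  # documents x terms
reduced, kept = select_by_entropy(tf, n_keep=2)
print(kept)          # indices of the retained terms
```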


Author(s):  
Nilupulee Nathawitharana ◽  
Damminda Alahakoon ◽  
Sumith Matharage

Humans are used to expressing themselves in written language, which provides a medium for describing our experiences in detail and with individuality. Although documents provide a rich source of information, it becomes very difficult to identify, extract, summarize, and search when vast numbers of documents are collected, especially over time. Document clustering is a technique that has been widely used to group documents based on the similarity of their content, represented by the words used. Once key groups are identified, further drill-down into sub-groupings is facilitated by hierarchical clustering. Clustering and hierarchical clustering are very useful when applied to numerical and categorical data, and accuracy and purity measures exist to evaluate the outcomes of a clustering exercise. Although the same measures have been applied to text clustering, text clusters are based on words or terms that can be repeated across documents associated with different topics. Therefore, in contrast to numerical and categorical data, text data cannot be considered a direct 'coding' of a particular experience or situation, and term overlap is a very common characteristic of text clustering. In this paper, we propose a new technique and methodology for capturing term overlap from text documents, highlight the different situations such overlap could signify, and discuss why such understanding is important for obtaining value from text clustering. Experiments conducted on a widely used text document collection show that the proposed methodology allows exploring the term diversity of a given document collection and obtaining clusters with minimal term overlap.
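
As an assumed, simplified illustration of measuring such overlap (not the authors' methodology), the sketch below compares two toy clusters by the Jaccard overlap of their top TF-IDF terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def top_terms(tfidf_rows, vocab, n=5):
    """Return the n highest-weighted terms averaged over a cluster's documents."""
    scores = np.asarray(tfidf_rows.mean(axis=0)).ravel()
    return set(vocab[i] for i in scores.argsort()[::-1][:n])

docs = ["stock market trading shares", "market prices and trading volume",
        "football match score", "championship football season"]
cluster_of = [0, 0, 1, 1]                       # toy cluster labels
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

terms = {c: top_terms(X[[i for i, l in enumerate(cluster_of) if l == c]], vocab)
         for c in set(cluster_of)}
overlap = len(terms[0] & terms[1]) / len(terms[0] | terms[1])  # Jaccard term overlap
print(terms, overlap)
```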


Algorithms ◽  
2020 ◽  
Vol 13 (12) ◽  
pp. 345
Author(s):  
Laith Abualigah ◽  
Amir H. Gandomi ◽  
Mohamed Abd Elaziz ◽  
Abdelazim G. Hussien ◽  
Ahmad M. Khasawneh ◽  
...  

Text clustering is one of the efficient unsupervised learning techniques used to partition a huge number of text documents into a subset of clusters, in which each cluster contains similar documents while documents in different clusters are dissimilar. Nature-inspired optimization algorithms have been successfully used to solve various optimization problems, including text document clustering. In this paper, a comprehensive review is presented of the most relevant nature-inspired algorithms that have been used to solve the text clustering problem. Moreover, comprehensive experiments are conducted and analyzed to show the performance of the common well-known nature-inspired optimization algorithms in solving text document clustering, including the Harmony Search (HS) Algorithm, Genetic Algorithm (GA), Particle Swarm Optimization (PSO) Algorithm, Ant Colony Optimization (ACO), Krill Herd Algorithm (KHA), Cuckoo Search (CS) Algorithm, Grey Wolf Optimizer (GWO), and Bat-inspired Algorithm (BA). Seven text benchmark datasets are used to validate the performance of the tested algorithms. The results show that the well-known nature-inspired optimization algorithms perform almost the same, with only slight differences. For improvement purposes, new modified versions of the tested algorithms can be proposed and tested to tackle the text clustering problem.
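
Although the algorithms above differ in their search operators, they typically share the same solution encoding and evaluation step; the sketch below shows one assumed version of that shared skeleton, where a candidate solution is a set of centroids scored by mean cosine similarity to the documents.

```python
import numpy as np

def cosine_fitness(centroids, X, eps=1e-12):
    """Mean cosine similarity of each document to its closest centroid (higher is better)."""
    Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), eps, None)
    Cn = centroids / np.clip(np.linalg.norm(centroids, axis=1, keepdims=True), eps, None)
    sims = Xn @ Cn.T                       # documents x clusters similarity matrix
    return sims.max(axis=1).mean()

# Toy usage: one random candidate solution of 3 centroids in a 50-term space.
rng = np.random.default_rng(0)
X = rng.random((20, 50))                   # stand-in document-term matrix
candidate = rng.random((3, 50))
print(cosine_fitness(candidate, X))
```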


Author(s):  
C. Vidyadhari ◽  
N. Sandhya ◽  
P. Premchand

Technical advances in information systems have contributed to the massive availability of documents stored in electronic databases such as e-mail archives and web pages, so arranging and browsing for a required document has become a complex task. This paper proposes an approach for incremental clustering using the Bat-Grey Wolf Optimizer (BAGWO). The input documents are initially subjected to a pre-processing module to obtain useful keywords, feature extraction is performed based on WordNet features, and feature selection is then carried out using an entropy function. Clustering is subsequently performed using the proposed BAGWO algorithm, which integrates the Bat Algorithm (BA) and the Grey Wolf Optimizer (GWO) to generate the different clusters of text documents. When a new document arrives, the same pre-processing and feature extraction steps are applied, and the features of the test document are mapped to the clusters obtained by the BAGWO approach. The mapping uses a kernel-based deep point distance, and once the mapping terminates, the cluster representatives are updated through a fuzzy-based representative update. The developed BAGWO outperformed existing techniques in terms of clustering accuracy, Jaccard coefficient, and rand coefficient, with maximal values of 0.948, 0.968, and 0.969, respectively.
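
The kernel-based deep point distance and the fuzzy representative update are not specified in the abstract, so the sketch below substitutes an ordinary RBF kernel-induced distance and a simple fuzzy-membership-weighted nudge of the representatives purely for illustration.

```python
import numpy as np

def rbf_kernel_distance(x, y, gamma=0.5):
    """Kernel-induced distance sqrt(k(x,x) - 2k(x,y) + k(y,y)) with an RBF kernel."""
    k = lambda a, b: np.exp(-gamma * np.sum((a - b) ** 2))
    return np.sqrt(max(k(x, x) - 2 * k(x, y) + k(y, y), 0.0))

def assign_and_update(doc, representatives, m=2.0, step=0.1):
    """Map a new document to the closest representative and nudge all
    representatives with fuzzy (inverse-distance) membership weights."""
    d = np.array([rbf_kernel_distance(doc, r) for r in representatives]) + 1e-12
    u = 1.0 / d ** (2 / (m - 1))
    u /= u.sum()                                   # fuzzy memberships summing to 1
    for j, rep in enumerate(representatives):
        representatives[j] = rep + u[j] * step * (doc - rep)
    return int(np.argmin(d)), representatives

reps = [np.zeros(4), np.ones(4)]                   # toy cluster representatives
cluster, reps = assign_and_update(np.array([0.9, 1.0, 0.8, 1.1]), reps)
print(cluster)
```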


Author(s):  
Amal M. Al-Numai ◽  
Aqil M. Azmi

As the number of electronic text documents grows, so does the need for automatic text summarization. A summary can be extractive, compression-based, or abstractive: the first retains the more important sentences, more or less in their original form; the second reduces the length of each sentence; the third requires fusing multiple sentences and/or paraphrasing. This chapter focuses on abstractive text summarization (ATS) of a single text document. The study explores what ATS is and surveys the literature of the field. Different datasets and evaluation techniques used in assessing summarizers are discussed. ATS is much more challenging than its extractive counterpart, and as such there are only a few works in this area across all languages.


Author(s):  
Chiheb-Eddine Ben N'Cir ◽  
Nadia Essoussi

Grouping documents based on their textual content is an important application of clustering referred to as text clustering. This paper deals with two issues in text clustering: the detection of non-disjoint groups and the representation of textual data. In fact, a text document can discuss several topics and should then belong to several groups, so the learning algorithm must be able to produce non-disjoint clusters and assign documents to several clusters. Given that text documents are unstructured data, applying a learning algorithm requires preparing the documents for numerical analysis, usually with the vector space model (VSM). This representation ignores the correlation between terms and gives no importance to the order of words in the text. Therefore, we present in this paper an unsupervised learning method based on the word sequence kernel, in which neither the correlation between adjacent words nor the possibility that a document belongs to more than one cluster is ignored. In addition, to facilitate the use of this method in text-analytic practice, we present the publicly available "DocCO" software. Experiments performed on several text collections show that the proposed method outperforms existing overlapping methods that use the VSM representation in terms of clustering accuracy.
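
The full word sequence kernel also matches non-contiguous subsequences with a decay factor; the sketch below is a deliberately simplified, assumed variant that only counts shared contiguous word bigrams and normalizes the result, to show how word order enters the similarity where a bag-of-words VSM would ignore it.

```python
from collections import Counter
import math

def ngram_counts(text, n=2):
    """Count contiguous word n-grams in a document."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def seq_kernel(a, b, n=2):
    """Normalized inner product over shared contiguous word n-grams."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

print(seq_kernel("the stock market fell sharply", "the stock market rose today"))
```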


2018 ◽  
Vol 29 (1) ◽  
pp. 814-830 ◽  
Author(s):  
Hasan Rashaideh ◽  
Ahmad Sawaie ◽  
Mohammed Azmi Al-Betar ◽  
Laith Mohammad Abualigah ◽  
Mohammed M. Al-laham ◽  
...  

The text clustering problem (TCP) is a leading process in many key areas such as information retrieval, text mining, and natural language processing, creating the need for a potent document clustering algorithm that can effectively navigate, summarize, and arrange the information in large data sets. This paper presents an adaptation of the grey wolf optimizer (GWO) for the TCP, referred to as TCP-GWO. The TCP demands a degree of accuracy beyond that of traditional partitioning techniques, motivating the use of metaheuristic swarm-based algorithms. The main issue to be addressed is how to split text documents, on the basis of GWO, into homogeneous clusters that are sufficiently precise and functional. Specifically, TCP-GWO, referred to here as the document clustering algorithm, uses the average distance of documents to the cluster centroid (ADDC) as an objective function that is repeatedly optimized to minimize the distance between documents and their cluster centroids. The accuracy and efficiency of the proposed TCP-GWO were demonstrated on a sufficiently large number of documents of varying sizes, randomly selected from a set of six publicly available data sets. Documents of high complexity were also included in the evaluation process to assess the recall detection rate of the document clustering algorithm. The experimental results for a test set of over 1300 documents showed that failure to correctly cluster a document occurred in less than 20% of cases, with a recall rate of more than 65% on a highly complex data set. The high F-measure rate and the ability to cluster documents effectively are important advances resulting from this research. The proposed TCP-GWO method was also compared with other well-established text clustering methods on randomly selected data sets. Interestingly, TCP-GWO outperforms the comparative methods in terms of precision, recall, and F-measure. In a nutshell, the results illustrate that the proposed TCP-GWO excels compared with the other clustering methods on these measurement criteria, with more than 55% of the documents correctly clustered with a high level of accuracy.
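
The ADDC objective itself is straightforward to reproduce; the sketch below is an assumed formulation (average Euclidean distance of documents to the centroid of their assigned cluster), with toy data standing in for TF-IDF vectors.

```python
import numpy as np

def addc(centroids, X, labels):
    """Average distance of documents to their cluster centroid (lower is better)."""
    per_cluster = []
    for c in range(len(centroids)):
        members = X[labels == c]
        if len(members):
            per_cluster.append(np.linalg.norm(members - centroids[c], axis=1).mean())
    return float(np.mean(per_cluster))

# Toy usage: 6 documents in a 4-term space, scored against 2 candidate centroids.
X = np.array([[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1],
              [0, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]], float)
centroids = np.array([[1, 0.5, 0, 0], [0, 0, 1, 0.5]])
labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
print(addc(centroids, X, labels))
```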


Author(s):  
A. P. Tawdar ◽  
M. S. Bewoor ◽  
S. H. Patil

Text Classification, also called Text Categorization (TC), is the task of automatically classifying a set of text documents into categories from a predefined set. If a text document belongs to exactly one category, the task is called single-label classification; otherwise, it is called multi-label classification. TC draws on several tools from Information Retrieval (IR) and Machine Learning (ML) and has received much attention in recent decades. This paper first classifies the text documents using an MLP-based machine learning approach (BPP) and then returns the most relevant documents. It also describes a proposed back-propagation neural network classifier that performs cross-validation on the original neural network in order to optimize classification accuracy and training time. A web content mining methodology is proposed and explored with the aid of BPP. The main objective of this investigation is web document extraction using different grouping algorithms; the work extracts the data from web URLs.
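
A minimal sketch of this kind of back-propagation text classifier, under assumed toy data and parameters rather than the authors' BPP implementation, can be written with scikit-learn's MLPClassifier on TF-IDF features, with a simple cross-validation step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

docs = ["cheap flights and hotel deals", "football results and league table",
        "book a holiday package online", "latest match highlights and scores"]
labels = ["travel", "sport", "travel", "sport"]        # toy categories

clf = make_pipeline(
    TfidfVectorizer(),                                  # documents -> TF-IDF features
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
scores = cross_val_score(clf, docs, labels, cv=2)       # simple cross-validation
print(scores.mean())
clf.fit(docs, labels)
print(clf.predict(["premier league match report"]))
```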

