A Grey Wolf Optimizer for Text Document Clustering

2018 ◽  
Vol 29 (1) ◽  
pp. 814-830 ◽  
Author(s):  
Hasan Rashaideh ◽  
Ahmad Sawaie ◽  
Mohammed Azmi Al-Betar ◽  
Laith Mohammad Abualigah ◽  
Mohammed M. Al-laham ◽  
...  

Abstract: The text clustering problem (TCP) is central to many key areas such as information retrieval, text mining, and natural language processing. This creates the need for a potent document clustering algorithm that can be used effectively to navigate, summarize, and arrange information across large data sets. This paper presents an adaptation of the grey wolf optimizer (GWO) for TCP, referred to as TCP-GWO. The TCP demands a degree of accuracy beyond that which is possible with metaheuristic swarm-based algorithms. The main issue to be addressed is how to split text documents on the basis of GWO into homogeneous clusters that are sufficiently precise and functional. Specifically, TCP-GWO, also referred to as the document clustering algorithm, uses the average distance of documents to the cluster centroid (ADDC) as an objective function to repeatedly optimize the distance between the clusters of the documents. The accuracy and efficiency of the proposed TCP-GWO were demonstrated on a sufficiently large number of documents of variable sizes, randomly selected from a set of six publicly available data sets. Documents of high complexity were also included in the evaluation process to assess the recall detection rate of the document clustering algorithm. The experimental results for a test set of over 1300 documents showed that failure to correctly cluster a document occurred in less than 20% of cases, with a recall rate of more than 65% for a highly complex data set. The high F-measure rate and the ability to cluster documents effectively are important advances resulting from this research. The proposed TCP-GWO method was compared with other well-established text clustering methods using randomly selected data sets. TCP-GWO outperforms the comparative methods in terms of precision, recall, and F-measure rates.
In a nutshell, the results illustrate that the proposed TCP-GWO outperforms the comparative clustering methods on the measurement criteria, with more than 55% of the documents correctly clustered with a high level of accuracy.
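To make the ADDC objective concrete, the following is a minimal sketch rather than the authors' implementation. It assumes documents are dense term-weight vectors, that distance is cosine distance, and that the per-cluster averages are themselves averaged over the non-empty clusters; the names `cosine_distance` and `addc` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def addc(docs, labels, k, distance=cosine_distance):
    """Average distance of documents to their cluster centroid.

    docs   : list of equal-length numeric vectors (assumed representation)
    labels : cluster index in range(k) for each document
    """
    clusters = [[] for _ in range(k)]
    for doc, label in zip(docs, labels):
        clusters[label].append(doc)

    total = 0.0
    non_empty = 0
    for members in clusters:
        if not members:
            continue  # assumption: empty clusters are skipped
        dim = len(members[0])
        centroid = [sum(d[i] for d in members) / len(members) for i in range(dim)]
        total += sum(distance(centroid, d) for d in members) / len(members)
        non_empty += 1
    return total / non_empty if non_empty else 0.0


docs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
good = addc(docs, [0, 0, 1, 1], k=2)  # identical documents grouped together
bad = addc(docs, [0, 1, 0, 1], k=2)   # unrelated documents mixed
```

A lower value indicates tighter clusters, so an optimizer such as GWO would treat each candidate assignment (or set of centroids) as a solution and minimize this quantity.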

Author(s):  
Amolkumar Narayan Jadhav ◽  
Gomathi N.

The widespread application of clustering in various fields has led to the development of different clustering techniques for partitioning multidimensional data into separable clusters. Although various clustering approaches are used in the literature, optimized clustering techniques with multi-objective consideration are rare. This paper proposes a novel data clustering algorithm, Enhanced Kernel-based Exponential Grey Wolf Optimization (EKEGWO), that handles two objectives. EKEGWO, an extension of KEGWO, adopts weighted exponential functions to improve the search process of clustering. Moreover, the fitness function of the algorithm includes the intra-cluster distance and the inter-cluster distance as objectives to provide an optimum selection of cluster centroids. The performance of the proposed technique is evaluated by comparison with the existing approaches PSC, mPSC, GWO, and EGWO on two datasets: banknote authentication and iris. Four metrics, Mean Square Error (MSE), F-measure, Rand coefficient, and Jaccard coefficient, estimate the clustering efficiency of the algorithm. The proposed EKEGWO algorithm attains an MSE of 837, an F-measure of 0.9657, a Rand coefficient of 0.8472, and a Jaccard coefficient of 0.7812 for the banknote dataset.
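The Rand and Jaccard coefficients mentioned above are commonly computed by counting pairs of items. The sketch below uses that standard pair-counting definition; it is an illustration and not taken from the paper, and the function names are hypothetical.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Count item pairs by whether they share a label in truth and in pred."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            ss += 1      # together in both partitions
        elif same_truth:
            sd += 1      # together in truth only
        elif same_pred:
            ds += 1      # together in prediction only
        else:
            dd += 1      # separated in both partitions
    return ss, sd, ds, dd


def rand_coefficient(truth, pred):
    ss, sd, ds, dd = pair_counts(truth, pred)
    total = ss + sd + ds + dd
    return (ss + dd) / total if total else 1.0


def jaccard_coefficient(truth, pred):
    ss, sd, ds, _ = pair_counts(truth, pred)
    denom = ss + sd + ds
    return ss / denom if denom else 1.0


truth = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
rand_value = rand_coefficient(truth, pred)
jaccard_value = jaccard_coefficient(truth, pred)
```

Both scores lie in [0, 1], with 1 meaning the predicted partition matches the reference partition exactly; the Jaccard coefficient ignores pairs that are separated in both partitions, which is why it is usually the lower of the two.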


Author(s):  
Ayad Mohammed Jabbar ◽  
Ku Ruhana Ku-Mahamud

In data mining, the grey wolf optimization (GWO) algorithm has been used in several learning approaches because of its simplicity in adapting to different application domains. Most recent works that concern unsupervised learning have focused on text clustering, where the GWO algorithm shows promising results. Although GWO has great potential in performing text clustering, it has limitations in dealing with outlier documents and noisy data. This research introduces the medoid GWO (M-GWO) algorithm, which incorporates a medoid recalculation process to share the information of medoids among the three best wolves and the rest of the population. This improvement aims to find the best set of medoids during the algorithm run and increases the exploitation search to find more local regions in the search space. Experimental results were obtained by comparison with well-known algorithms, such as genetic, firefly, GWO, and k-means algorithms, on four benchmarks. The results of external evaluation metrics, such as Rand, purity, F-measure, and entropy, indicate that the proposed M-GWO algorithm achieves better document clustering than all other algorithms (i.e., 75% better when using the Rand metric, 50% better than all algorithms based on the purity metric, 75% better than all algorithms using the F-measure metric, and 100% based on the entropy metric).
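As an illustration of why medoids help with outliers, here is a minimal sketch of a generic medoid recalculation step. It is not the M-GWO procedure itself, which also shares medoid information between wolves; the Euclidean distance and the name `recompute_medoid` are assumptions made for this example.

```python
import math


def recompute_medoid(points, member_ids):
    """Return the index of the member with the smallest total distance
    to all other members of the same cluster."""
    return min(
        member_ids,
        key=lambda c: sum(math.dist(points[c], points[o]) for o in member_ids),
    )


# Four nearby points and one outlier, all in a single cluster.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (50.0, 0.0)]
cluster = [0, 1, 2, 3, 4]

medoid_id = recompute_medoid(points, cluster)
mean_x = sum(p[0] for p in points) / len(points)
```

In this toy cluster the medoid stays at an actual data point near the dense group, while the arithmetic mean is pulled far toward the outlier.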


Sensors ◽  
2020 ◽  
Vol 20 (7) ◽  
pp. 1845 ◽  
Author(s):  
Xiaohui Gu ◽  
Shaopu Yang ◽  
Yongqiang Liu ◽  
Rujiang Hao ◽  
Zechao Liu

Informative frequency band (IFB) selection is a challenging task in envelope analysis for the localized fault detection of rolling element bearings. In previous studies, the automatic selection was often guided by a single indicator, such as kurtosis. However, in some cases, it is difficult for a single indicator to fully depict and balance the fault characteristics arising from the impulsiveness and cyclostationarity of the repetitive transients. To solve this problem, a novel negentropy-induced multi-objective optimized wavelet filter is proposed in this paper. The wavelet parameters are determined by a grey wolf optimizer with two independent objective functions, i.e., maximizing the negentropy of the squared envelope and of the squared envelope spectrum to capture impulsiveness and cyclostationarity, respectively. Subsequently, the average negentropy is used to identify the IFB from the obtained Pareto set, whose members are not dominated by other solutions, to balance the impulsive and cyclostationary features and eliminate the background noise. Two cases of real vibration signals with slight bearing faults are used to evaluate the performance of the proposed methodology, and the results demonstrate its effectiveness over some fast and optimal filtering methods. In addition, its stability in tracking the IFB is also tested on a case of condition monitoring data sets.
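The selection step described above, building a Pareto set under two maximized objectives and then choosing by average, can be sketched as follows. The objective values are made-up placeholders standing in for the two negentropy measures, and the function names are hypothetical; this is not the authors' code.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (both objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def pick_by_average(front):
    """Choose the non-dominated point with the highest mean objective value."""
    return max(front, key=lambda p: sum(p) / len(p))


# Placeholder (objective_1, objective_2) scores for five candidate filters.
candidates = [(0.2, 0.9), (0.6, 0.7), (0.9, 0.1), (0.5, 0.5), (0.1, 0.1)]

front = pareto_set(candidates)
chosen = pick_by_average(front)
```

Averaging is only one way to pick a compromise from the front; it favors candidates that score reasonably on both objectives over those that excel at just one.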


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 820 ◽  
Author(s):  
Xiaoqiang Zhao ◽  
Shaoya Ren ◽  
Heng Quan ◽  
Qiang Gao

Wireless sensor network (WSN) nodes are devices with limited power, and rational utilization of node energy and prolonging the network lifetime are the main objectives of the WSN’s routing protocol. However, inadequate consideration of the heterogeneity of node energy leads to an energy imbalance between nodes in heterogeneous WSNs (HWSNs). Therefore, in this paper, a routing protocol for HWSNs based on the modified grey wolf optimizer (HMGWO) is proposed. First, the protocol selects the appropriate initial clusters by defining different fitness functions for heterogeneous energy nodes; the nodes’ fitness values are then calculated and treated as initial weights in the GWO. At the same time, the weights are dynamically updated according to the distance between the wolves and their prey and the coefficient vectors to improve the GWO’s optimization ability and ensure the selection of the optimal cluster heads (CHs). The experimental results indicate that the network lifecycle of the HMGWO protocol improves by 55.7%, 31.9%, 46.3%, and 27.0%, respectively, compared with the stable election protocol (SEP), distributed energy-efficient clustering algorithm (DEEC), modified SEP (M-SEP), and fitness-value-based improved GWO (FIGWO) protocols. In terms of power consumption and network throughput, the HMGWO is also superior to the other protocols.
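For readers unfamiliar with the underlying optimizer, the sketch below shows the standard GWO position update with its coefficient vectors and wolf-to-leader distances. It averages the three leader estimates with equal weights, so it is the baseline algorithm and not the dynamically weighted HMGWO variant described above; the function name, parameters, and the sphere test function are assumptions for illustration.

```python
import random


def gwo_minimize(objective, dim, bounds, n_wolves=12, n_iter=50, seed=0):
    """Minimal standard grey wolf optimizer (equal leader weights)."""
    rng = random.Random(seed)
    lo, hi = bounds
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    best = min(wolves, key=objective)[:]  # best-so-far solution

    for t in range(n_iter):
        # Alpha, beta, delta: the three best wolves (copied before updates).
        leaders = [w[:] for w in sorted(wolves, key=objective)[:3]]
        a = 2.0 * (1.0 - t / n_iter)  # decreases linearly from 2 toward 0

        for wolf in wolves:
            for d in range(dim):
                estimate = 0.0
                for leader in leaders:
                    coef_a = 2.0 * a * rng.random() - a   # A = 2a*r1 - a
                    coef_c = 2.0 * rng.random()           # C = 2*r2
                    dist = abs(coef_c * leader[d] - wolf[d])
                    estimate += leader[d] - coef_a * dist
                # Average of the three estimates, clipped to the bounds.
                wolf[d] = min(hi, max(lo, estimate / 3.0))

        candidate = min(wolves, key=objective)
        if objective(candidate) < objective(best):
            best = candidate[:]
    return best


def sphere(x):
    return sum(v * v for v in x)


best = gwo_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

A weighted variant would replace the equal-weight average `estimate / 3.0` with a weighted sum of the three leader estimates, which is where fitness-derived weights such as those described in the abstract would enter.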


Author(s):  
Yasunori Endo ◽  
Tomoyuki Suzuki ◽  
Naohiko Kinoshita ◽  
Yukihiro Hamasuna ◽  
...  

The fuzzy non-metric model (FNM) is a representative non-hierarchical clustering method. It is very useful because the belongingness, or membership degree, of each datum to each cluster can be calculated directly from the dissimilarities between data, without using cluster centers. However, the original FNM cannot handle data with uncertainty. In this study, we refer to data with uncertainty as “uncertain data,” e.g., incomplete data or data that have errors. Previously, a method was proposed based on the concept of a tolerance vector for handling uncertain data, and some clustering methods were constructed according to this concept, e.g., fuzzy c-means for data with tolerance. These methods can handle uncertain data in the framework of optimization. Thus, in the present study, we apply the concept to FNM. First, we propose a new clustering algorithm based on FNM using the concept of tolerance, which we refer to as the fuzzy non-metric model for data with tolerance. Second, we show that the proposed algorithm can handle incomplete data sets. Third, we verify the effectiveness of the proposed algorithm based on comparisons with conventional methods for incomplete data sets in some numerical examples.


2018 ◽  
Vol 66 (6) ◽  
pp. 1215-1226 ◽  
Author(s):  
Aayush Agarwal ◽  
Akash Chandra ◽  
Shalivahan Shalivahan ◽  
Roshan K Singh

2007 ◽  
Vol 17 (01) ◽  
pp. 71-103 ◽  
Author(s):  
NARGESS MEMARSADEGHI ◽  
DAVID M. MOUNT ◽  
NATHAN S. NETANYAHU ◽  
JACQUELINE LE MOIGNE

Clustering is central to many image processing and remote sensing applications. ISODATA is one of the most popular and widely used clustering methods in geoscience applications, but it can run slowly, particularly with large data sets. We present a more efficient approach to ISODATA clustering, which achieves better running times by storing the points in a kd-tree and through a modification of the way in which the algorithm estimates the dispersion of each cluster. We also present an approximate version of the algorithm which allows the user to further improve the running time, at the expense of lower fidelity in computing the nearest cluster center to each point. We provide both theoretical and empirical justification that our modified approach produces clusterings that are very similar to those produced by the standard ISODATA approach. We also provide empirical studies on both synthetic data and remotely sensed Landsat and MODIS images that show that our approach has significantly lower running times.


Author(s):  
P. Viswanth

Clustering is a process of finding the natural groupings present in a dataset. Various clustering methods have been proposed to work with various types of data. The quality of the solution as well as the time taken to derive it is important when dealing with large datasets such as those in a typical document database. Recently, hybrid and ensemble-based clustering methods have been shown to yield better results than conventional methods. The chapter proposes two clustering methods: one is based on a hybrid scheme and the other on an ensemble scheme. Both are experimentally verified and are shown to yield better and faster results.


2016 ◽  
Vol 43 (2) ◽  
pp. 275-292 ◽  
Author(s):  
Aytug Onan ◽  
Hasan Bulut ◽  
Serdar Korukoglu

Document clustering can be applied in document organisation and browsing, document summarisation and classification. The identification of an appropriate representation for textual documents is extremely important for the performance of clustering or classification algorithms. Textual documents suffer from the high dimensionality and irrelevancy of text features. Besides, conventional clustering algorithms suffer from several shortcomings, such as slow convergence and sensitivity to the initial value. To tackle the problems of conventional clustering algorithms, metaheuristic algorithms are frequently applied to clustering. In this paper, an improved ant clustering algorithm is presented, where two novel heuristic methods are proposed to enhance the clustering quality of ant-based clustering. In addition, the latent Dirichlet allocation (LDA) is used to represent textual documents in a compact and efficient way. The clustering quality of the proposed ant clustering algorithm is compared to the conventional clustering algorithms using 25 text benchmarks in terms of F-measure values. The experimental results indicate that the proposed clustering scheme outperforms the compared conventional and metaheuristic clustering methods for textual documents.


Author(s):  
Aayush Agarwal ◽  
Akash Chandra ◽  
Shalivahan Srivastava ◽  
Roshan K Singh
