Comparison of Document Clustering Methods Based on Bees Algorithm and Firefly Algorithm Using Thai Documents

Author(s):  
Pokpong Songmuang ◽  
Vorapon Luantangsrisuk
2018 ◽  
Vol 7 (3.3) ◽  
pp. 90
Author(s):  
Sumathi Rani Manukonda ◽  
Asst.Prof Kmit ◽  
Narayanguda . ◽  
Hyderabad . ◽  
Nomula Divya ◽  
...  

Document clustering is a traditional data mining approach in which similar, mutually relevant documents are grouped together. It contributes to the accuracy of information retrieval systems that identify the nearest neighbors of a document. Massive quantities of data are generated every day and clustered according to a particular sequence. Although different clustering methods have been introduced to improve cluster quality, many challenges remain for the improvement of document clustering. For web search, documents are arranged into groups so that results can be retrieved efficiently and users can work through the results of a search query in an organized way. Hierarchical clustering can be attained through document clustering. Most clustering algorithms do not concentrate on a semantic approach, which results in unsatisfactory clustering output. The automatic organization of web documents by services such as Google and Yahoo is often taken as a reference. A distinct method identifies existing groups of similar items in previously organized documents and derives an effective document classifier for new documents. This paper concentrates on hierarchical clustering and k-means algorithms, aims to show that k-means and its variants are more efficient than hierarchical clustering, and considers an implementation of the greedy fast k-means algorithm (GFA) for clustering documents efficiently.
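As a rough illustration of the k-means family discussed in this abstract, the following minimal sketch clusters a few toy documents represented as term-frequency vectors. It is plain k-means, not the greedy fast k-means algorithm (GFA) named above. The helper names `tf_vector` and `kmeans`, the toy documents, the Euclidean distance, and the choice of seeding centroids with the first k documents are all assumptions made for the example.

```python
from collections import Counter

def tf_vector(text, vocab):
    """Represent a document as raw term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    # Assumption for reproducibility: seed centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy corpus, invented for this example.
docs = [
    "bees forage for food",
    "fireflies flash at night",
    "bees forage near flowers",
    "fireflies flash in swarms",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
points = [tf_vector(d, vocab) for d in docs]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In practice documents are usually weighted (for example with TF-IDF) and compared with cosine similarity; raw counts and Euclidean distance are used here only to keep the sketch short.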


2020 ◽  
Vol 11 (2) ◽  
pp. 27-55
Author(s):  
Mohamed Amine Nemmich ◽  
Fatima Debbat ◽  
Mohamed Slimane

In this article, two hybrid schemes using the Bees Algorithm (BA) and the Firefly Algorithm (FA) are presented for solving complex numerical problems. The BA is a recent population-based optimization algorithm that imitates the natural behaviour of honey bees foraging for food. The FA is a swarm intelligence technique based upon the communication behaviour and the idealized flashing features of tropical fireflies. The first approach, called the Hybrid Bee Firefly Algorithm (HBAFA), improves the BA by applying the FA during the local search, thus increasing exploitation in each search zone. The second one, the Hybrid Firefly Bee Algorithm (HFBA), uses the FA in the initialization step for better exploration and detection of promising areas in the search space. The performance of the novel hybrid algorithms was investigated on a set of various benchmarks and compared with the standard BA and other methods found in the literature. The results show that the proposed algorithms perform better than the standard BA and confirm their effectiveness in solving continuous optimization functions.
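To make the firefly movement concrete, here is a minimal sketch of the standard FA update, in which a firefly moves toward any brighter one with an attractiveness that decays with distance, plus a small random step. It does not implement HBAFA or HFBA. The function names, the sphere benchmark, and all parameter values (`alpha`, `beta0`, `gamma`, population size, bounds) are assumptions chosen for illustration.

```python
import math
import random

def sphere(x):
    """Simple continuous benchmark: minimum value 0 at the origin."""
    return sum(v * v for v in x)

def firefly_minimize(f, dim=2, n=8, iters=30,
                     alpha=0.1, beta0=1.0, gamma=1.0, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-2.0, 2.0) for _ in range(dim)] for _ in range(n)]
    best = min(pop, key=f)[:]          # best position seen so far
    initial_best = f(best)
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                # Lower objective value means a "brighter" firefly.
                if f(pop[j]) < f(pop[i]):
                    r2 = sum((a - b) ** 2 for a, b in zip(pop[i], pop[j]))
                    beta = beta0 * math.exp(-gamma * r2)  # attractiveness
                    pop[i] = [a + beta * (b - a) + alpha * (rng.random() - 0.5)
                              for a, b in zip(pop[i], pop[j])]
                    if f(pop[i]) < f(best):
                        best = pop[i][:]
    return best, f(best), initial_best

best, best_val, initial_best = firefly_minimize(sphere)
print(best_val, initial_best)
```

A hybrid in the spirit of the abstract would call a step like this either inside each BA search zone or to generate the initial BA population; how the authors wire the two together is not specified here.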


Author(s):  
P. Viswanth

Clustering is the process of finding natural groupings present in a dataset. Various clustering methods have been proposed to work with various types of data. Both the quality of the solution and the time taken to derive it are important when dealing with large datasets such as a typical document database. Recently, hybrid and ensemble-based clustering methods have been shown to yield better results than conventional methods. The chapter proposes two clustering methods: one based on a hybrid scheme and the other based on an ensemble scheme. Both are experimentally verified and shown to yield better and faster results.


Author(s):  
Jie Ji ◽  
Qiangfu Zhao

Document clustering partitions sets of unlabeled documents so that documents in the same cluster share common concepts. A Naive Bayes Classifier (BC) is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. BC requires only a small amount of training data to estimate the parameters required for classification. Since training data must be labeled, we propose an Iterative Bayes Clustering (IBC) algorithm. To improve IBC performance, we propose combining IBC with a Comparative Advantage-based (CA) initialization method. Experimental results show that our proposal improves performance significantly over classical clustering methods.


2016 ◽  
Vol 43 (2) ◽  
pp. 275-292 ◽  
Author(s):  
Aytug Onan ◽  
Hasan Bulut ◽  
Serdar Korukoglu

Document clustering can be applied in document organisation and browsing, document summarisation and classification. The identification of an appropriate representation for textual documents is extremely important for the performance of clustering or classification algorithms. Textual documents suffer from the high dimensionality and irrelevancy of text features. Besides, conventional clustering algorithms suffer from several shortcomings, such as slow convergence and sensitivity to the initial value. To tackle the problems of conventional clustering algorithms, metaheuristic algorithms are frequently applied to clustering. In this paper, an improved ant clustering algorithm is presented, where two novel heuristic methods are proposed to enhance the clustering quality of ant-based clustering. In addition, the latent Dirichlet allocation (LDA) is used to represent textual documents in a compact and efficient way. The clustering quality of the proposed ant clustering algorithm is compared to the conventional clustering algorithms using 25 text benchmarks in terms of F-measure values. The experimental results indicate that the proposed clustering scheme outperforms the compared conventional and metaheuristic clustering methods for textual documents.


2018 ◽  
Vol 29 (1) ◽  
pp. 814-830 ◽  
Author(s):  
Hasan Rashaideh ◽  
Ahmad Sawaie ◽  
Mohammed Azmi Al-Betar ◽  
Laith Mohammad Abualigah ◽  
Mohammed M. Al-laham ◽  
...  

The text clustering problem (TCP) is a leading process in many key areas such as information retrieval, text mining, and natural language processing. This presents the need for a potent document clustering algorithm that can be used effectively to navigate, summarize, and arrange information to congregate large data sets. This paper presents an adaptation of the grey wolf optimizer (GWO) for TCP, referred to as TCP-GWO. The TCP demands a degree of accuracy beyond that which is possible with metaheuristic swarm-based algorithms. The main issue to be addressed is how to split text documents on the basis of GWO into homogeneous clusters that are sufficiently precise and functional. Specifically, TCP-GWO, also referred to as the document clustering algorithm, uses the average distance of documents to the cluster centroid (ADDC) as an objective function to repeatedly optimize the distance between the clusters of the documents. The accuracy and efficiency of the proposed TCP-GWO were demonstrated on a sufficiently large number of documents of variable sizes, randomly selected from a set of six publicly available data sets. Documents of high complexity were also included in the evaluation process to assess the recall detection rate of the document clustering algorithm. The experimental results for a test set of over a part of 1300 documents showed that failure to correctly cluster a document occurred in less than 20% of cases with a recall rate of more than 65% for a highly complex data set. The high F-measure rate and ability to cluster documents in an effective manner are important advances resulting from this research. The proposed TCP-GWO method was compared to other well-established text clustering methods using randomly selected data sets. Interestingly, TCP-GWO outperforms the comparative methods in terms of precision, recall, and F-measure rates.
In a nutshell, the results illustrate that the proposed TCP-GWO is able to excel compared to the other comparative clustering methods in terms of measurement criteria, whereby more than 55% of the documents were correctly clustered with a high level of accuracy.
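The ADDC objective named in this abstract can be sketched as follows. The abstract only describes it as the average distance of documents to the cluster centroid, so the exact form used here (mean over clusters of the mean document-to-centroid distance), the Euclidean distance, and the names `centroid` and `addc` are assumptions for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def addc(clusters):
    """Assumed form: average, over non-empty clusters, of the mean
    distance between each document vector and its cluster centroid."""
    per_cluster = []
    for docs in clusters:
        if not docs:
            continue  # skip empty clusters rather than divide by zero
        c = centroid(docs)
        per_cluster.append(sum(euclidean(d, c) for d in docs) / len(docs))
    return sum(per_cluster) / len(per_cluster)

# Hypothetical 2-D document vectors, invented for this example.
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],   # centroid (1, 0), each document at distance 1
    [(0.0, 3.0), (0.0, 5.0)],   # centroid (0, 4), each document at distance 1
]
score = addc(clusters)
print(score)
```

A swarm-based optimizer such as GWO would evaluate a score like this for each candidate assignment of documents to clusters and prefer lower values, since tighter clusters give a smaller average distance.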


2012 ◽  
Vol 532-533 ◽  
pp. 939-943
Author(s):  
Yi Ding ◽  
Xian Fu

Text clustering typically involves clustering in a high dimensional space, which appears difficult with regard to virtually all practical settings. In addition, given a particular clustering result, it is typically very hard to come up with a good explanation of why the text clusters have been constructed the way they are. To solve these problems, this paper proposes a method for Chinese document clustering based on topic concept clustering. We introduce a novel topical document clustering method called Document Features Indexing Clustering (DFIC), which can identify topics accurately and cluster documents according to these topics. In DFIC, “topic elements” are defined and extracted for indexing base clusters. Additionally, document features are investigated and exploited. Experimental results show that DFIC can achieve a higher precision (92.76%) than some widely used traditional clustering methods.


Author(s):  
Seiki Ubukata ◽  
Katsuya Koike ◽  
Akira Notsu ◽  
Katsuhiro Honda

In the field of cluster analysis, fuzzy theory including the concept of fuzzy sets has been actively utilized to realize flexible and robust clustering methods. Fuzzy C-means (FCM), which is the most representative fuzzy clustering method, has been extended to achieve more robust clustering. For example, noise FCM (NFCM) performs noise rejection by introducing a noise cluster that absorbs noise objects, and possibilistic C-means (PCM) performs the independent extraction of possibilistic clusters by introducing cluster-wise noise clusters. Similarly, in the field of co-clustering, fuzzy co-clustering induced by multinomial mixture models (FCCMM) was proposed and extended to noise FCCMM (NFCCMM) in an analogous fashion to the NFCM. Ubukata et al. have proposed noise clustering-based possibilistic co-clustering induced by multinomial mixture models (NPCCMM) in an analogous fashion to the PCM. In this study, we develop an NPCCMM scheme considering variable cluster volumes and the fuzziness degree of item memberships to investigate the specific aspects of fuzzy nature rather than probabilistic nature in co-clustering tasks. We investigated the characteristics of the proposed NPCCMM by applying it to an artificial data set and conducted document clustering experiments using real-life data sets. As a result, we found that the proposed method can derive more flexible possibilistic partitions than the probabilistic model by adjusting the fuzziness degrees of object and item memberships. The document clustering experiments also indicated the effectiveness of tuning the fuzziness degree of object and item memberships, and the optimization of cluster volumes to improve classification performance.
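For orientation, the standard FCM formulation that the extensions above start from can be written as below. This is the textbook FCM model only, not the NPCCMM scheme of the abstract. The symbols are assumptions introduced here: $x_i$ are the $n$ objects, $b_c$ the $C$ cluster centres, $u_{ci}$ the membership of object $i$ in cluster $c$, and $m > 1$ the fuzziness degree.

```latex
% Objective: membership-weighted within-cluster squared distances
J_m = \sum_{c=1}^{C} \sum_{i=1}^{n} u_{ci}^{\,m} \, \lVert x_i - b_c \rVert^2
\quad \text{subject to} \quad \sum_{c=1}^{C} u_{ci} = 1 \ \text{for all } i

% Alternating updates (valid when x_i does not coincide with any centre)
u_{ci} = \left[ \sum_{k=1}^{C}
  \left( \frac{\lVert x_i - b_c \rVert}{\lVert x_i - b_k \rVert} \right)^{\frac{2}{m-1}}
\right]^{-1},
\qquad
b_c = \frac{\sum_{i=1}^{n} u_{ci}^{\,m} \, x_i}{\sum_{i=1}^{n} u_{ci}^{\,m}}
```

As $m$ approaches 1 the memberships become nearly crisp, while larger $m$ gives softer partitions, which is the kind of fuzziness-degree tuning the abstract refers to.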


Author(s):  
Mohamed Amine Nemmich ◽  
Fatima Debbat ◽  
Mohamed Slimane

The Bees Algorithm (BA) is a recent and powerful foraging algorithm which imitates the natural behaviour of bees. However, it suffers from certain limitations, essentially in the initialization step of the search areas, which is generally random and depends on the number of individuals in the population. In order to solve this problem, this paper proposes a novel hybrid optimisation approach, namely the Hybrid Firefly Bee Algorithm (HFBA), which combines the Bees Algorithm (BA) and the Firefly Algorithm (FA). The FA is a swarm intelligence technique based upon the communication behaviour and the idealized flashing features of tropical fireflies. The proposed approach uses the FA in the initialization step for better exploration and detection of promising areas in the search space. The performance of HFBA was investigated on a set of benchmark functions and compared with BA and other well-known methods. The results show that the HFBA has improved the computational time. It is also very efficient in finding optimal or near-optimal solutions, and outperforms the other algorithms in terms of accuracy and speed.

