FIREFLYCLUST: AN AUTOMATED HIERARCHICAL TEXT CLUSTERING APPROACH

2017 ◽  
Vol 79 (5) ◽  
Author(s):  
Athraa Jasim Mohammed ◽  
Yuhanis Yusof ◽  
Husniza Husni

Text clustering is one of the text mining tasks employed in search engines. Discovering the optimal number of clusters for a dataset or repository is a challenging problem. Various clustering algorithms have been reported in the literature, but most of them rely on a pre-defined number of clusters, k. In this study, a variant of the Firefly algorithm, termed FireflyClust, is proposed to automatically cluster text documents in a hierarchical manner. The proposed clustering method operates in five phases: data pre-processing, clustering, item re-location, cluster selection and cluster refinement. Experiments are undertaken with different selections of the threshold value. Results on the TREC collections TR11, TR12, TR23 and TR45 show that FireflyClust is a better approach than Bisect K-means, hybrid Bisect K-means and the Practical General Stochastic Clustering Method. Such results shed light on directions for developing a better information retrieval engine in this dynamic and fast-growing big data era.
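The firefly mechanic behind this kind of approach can be illustrated with a short sketch in which each firefly encodes one candidate set of centroids and brighter (lower-error) fireflies attract dimmer ones. This is a hedged, flat-clustering toy with a fixed k; the hierarchical splitting, item re-location, cluster selection and refinement phases of FireflyClust are not reproduced, and all names and parameters are illustrative rather than the authors' settings.

```python
import numpy as np

def sse(centroids, X):
    """Within-cluster sum of squared errors (lower is better, i.e. brighter)."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum()

def firefly_cluster(X, k=3, n_fireflies=20, iters=50,
                    beta0=1.0, gamma=0.1, alpha=0.2, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Each firefly is one candidate set of k centroids, seeded from the data.
    flies = X[rng.choice(len(X), size=(n_fireflies, k), replace=True)]
    light = np.array([-sse(f, X) for f in flies])            # brightness
    for _ in range(iters):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if light[j] > light[i]:                      # j is brighter than i
                    r2 = ((flies[i] - flies[j]) ** 2).sum()
                    beta = beta0 * np.exp(-gamma * r2)       # attraction fades with distance
                    flies[i] += beta * (flies[j] - flies[i]) \
                        + alpha * rng.normal(scale=0.1, size=flies[i].shape)
                    light[i] = -sse(flies[i], X)
        alpha *= 0.97                                        # slowly cool the random walk
    best = flies[light.argmax()]
    labels = ((X[:, None, :] - best[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    return best, labels
```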

2020 ◽  
pp. 016555152091159
Author(s):  
Muhammad Qasim Memon ◽  
Yu Lu ◽  
Penghe Chen ◽  
Aasma Memon ◽  
Muhammad Salman Pathan ◽  
...  

Text segmentation (TS) is the process of dividing multi-topic text collections into cohesive segments using topic boundaries. Similarly, text clustering has long been recognised as a major concern for multi-topic text collections, as they are characterised by a sub-topic structure and their contents are not associated with each other. Existing clustering approaches follow TS methods that rely on word frequencies and may not be suitable for clustering multi-topic text collections. In this work, we propose an ensemble clustering approach (ECA), a novel topic-modelling-based clustering approach that combines TS and text clustering. We devise LDA-onto (LDA-ontology), a TS-based model that decomposes a document into segments (i.e. sub-documents), wherein each sub-document is associated with exactly one sub-topic. We deal with the problem of clustering a document that is intrinsically related to various topics and whose topical structure is missing. ECA is tested on well-known datasets in order to provide a comprehensive presentation and validation of clustering algorithms using LDA-onto. ECA exhibits the semantic relations of keywords in sub-documents, and the resultant clusters are linked back to the original documents that contain them. Moreover, the present research sheds light on clustering performance and indicates that there is no difference in performance (in terms of F-measure) when the number of topics changes. Our findings give above-par results for analysing the problem of text clustering in a broader spectrum without applying dimension-reduction techniques to highly sparse data. Specifically, ECA provides a more efficient and effective framework than traditional and segment-based approaches, with results that are statistically significant and show an average improvement of over 10.2%. The proposed framework can be applied wherever meaningful data retrieval is useful, such as document summarization, text retrieval, and novelty and topic detection.
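As a rough, hedged illustration of the segment-then-cluster idea (not the authors' LDA-onto implementation), the sketch below splits each document into paragraph "segments", assigns each segment its dominant LDA topic and clusters the segments on their topic distributions. The paragraph-based segmentation, topic count and cluster count are assumptions made only for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

def ensemble_cluster(documents, n_topics=10, n_clusters=5):
    # Text-segmentation stand-in: treat blank-line-separated paragraphs
    # as sub-documents, keeping track of the parent document.
    segments, parents = [], []
    for doc_id, doc in enumerate(documents):
        for para in filter(None, (p.strip() for p in doc.split("\n\n"))):
            segments.append(para)
            parents.append(doc_id)

    counts = CountVectorizer(stop_words="english").fit_transform(segments)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(counts)          # segment-topic distributions

    dominant_topic = theta.argmax(axis=1)      # one sub-topic per sub-document
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(theta)
    return segments, parents, dominant_topic, labels
```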


2019 ◽  
Vol 2019 ◽  
pp. 1-11 ◽  
Author(s):  
Hui Huang ◽  
Yan Ma

The Bag-of-Words (BoW) model is a well-known image categorization technique. However, in the conventional BoW model, neither the vocabulary size nor the visual words can be determined automatically. To overcome these problems, a hybrid clustering approach that combines improved hierarchical clustering with a K-means algorithm is proposed. We present a cluster validity index for the hierarchical clustering algorithm to adaptively determine when the algorithm should terminate and, with it, the optimal number of clusters. Furthermore, we improve the max-min distance method to optimize the initial cluster centers. The optimal number of clusters and the initial cluster centers are fed into K-means, and finally the vocabulary size and visual words are obtained. The proposed approach is extensively evaluated on two visual datasets. The experimental results show that the proposed method outperforms the conventional BoW model in terms of categorization and demonstrate the feasibility and effectiveness of our approach.
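A simplified sketch of this hybrid scheme follows; the silhouette coefficient stands in for the paper's own cluster-validity index, and the descriptor matrix, candidate range and seeding are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def max_min_centers(X, k, rng):
    """Max-min (farthest-point) selection of k initial centers."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def build_vocabulary(descriptors, k_range=range(2, 30), seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: hierarchical clustering + a validity index chooses the vocabulary size.
    best_k, best_score = None, -np.inf
    for k in k_range:
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(descriptors)
        score = silhouette_score(descriptors, labels)
        if score > best_score:
            best_k, best_score = k, score
    # Step 2: max-min distance picks well-spread initial centers.
    init = max_min_centers(descriptors, best_k, rng)
    # Step 3: k-means refines the centers; the centers are the visual words.
    km = KMeans(n_clusters=best_k, init=init, n_init=1).fit(descriptors)
    return km.cluster_centers_
```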


2021 ◽  
Author(s):  
Congming Shi ◽  
Bingtao Wei ◽  
Shoulin Wei ◽  
Wen Wang ◽  
Hai Liu ◽  
...  

Abstract Clustering, a traditional machine learning method, plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, clusters are usually unpredictable. Although the Elbow method is one of the most commonly used methods to discriminate the optimal cluster number, determining the number of clusters depends on the manual identification of the elbow points on the visualization curve. As a result, even experienced analysts cannot clearly identify the elbow point when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed to yield a statistical metric that estimates an optimal cluster number when clustering on a dataset. First, the average degree of distortion obtained by the Elbow method is normalized to the range of 0 to 10. Second, the normalized results are used to calculate the cosine of the intersection angles between elbow points. Third, the calculated cosine of the intersection angles and the arccosine theorem are used to compute the intersection angles between elbow points. Finally, the index of the minimal intersection angle computed above is used as the estimated potential optimal cluster number. The experimental results based on simulated datasets and a well-known public dataset (the Iris dataset) demonstrated that the estimated optimal cluster number obtained by our newly proposed method is better than that obtained by the widely used Silhouette method.
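Read literally, those four steps can be sketched as follows. This is a hedged reconstruction from the abstract, not the authors' code: the k range, the use of k-means inertia as the "degree of distortion" and the conversion of the minimal angle back to a k value are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_by_angle(X, k_range=range(1, 11)):
    ks = np.array(list(k_range), dtype=float)
    # Average degree of distortion per candidate k (k-means inertia / n).
    distortion = np.array([KMeans(n_clusters=int(k), n_init=10,
                                  random_state=0).fit(X).inertia_ / len(X)
                           for k in ks])
    # Step 1: normalize the average distortion to the range 0..10.
    d = 10 * (distortion - distortion.min()) / (distortion.max() - distortion.min())
    pts = np.column_stack([ks, d])
    angles = {}
    for i in range(1, len(pts) - 1):
        # Steps 2-3: cosine of the angle at each interior point, then arccos.
        a, b = pts[i - 1] - pts[i], pts[i + 1] - pts[i]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        angles[int(ks[i])] = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    # Step 4: the k at the sharpest bend (smallest angle) is the estimate.
    return min(angles, key=angles.get)
```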


2016 ◽  
Vol 5 (1) ◽  
pp. 63-72 ◽  
Author(s):  
Derssie D. Mebratu ◽  
Charles Kim

Abstract. Increasing the lifespan of a group of distributed wireless sensors is one of the major challenges in research. This is especially important for distributed wireless sensor nodes used in harsh environments, since it is not feasible to replace or recharge their batteries. Thus, the popular low-energy adaptive clustering hierarchy (LEACH) algorithm uses the “computation and communication energy model” to increase the lifespan of distributed wireless sensor nodes. As an improved method, we present here that a combination of three clustering algorithms performs better than the LEACH algorithm. The clustering algorithms included in the combination are the k-means++, k-means, and gap statistics algorithms. These three algorithms are used selectively in the following manner: the k-means++ algorithm initializes the centers for the k-means algorithm, the k-means algorithm computes the optimal centers of the clusters, and the gap statistics algorithm selects the optimal number of clusters in a distributed wireless sensor network. Our simulation shows that the approach of using a combination of clustering algorithms increases the lifespan of the wireless sensor nodes by 15 % compared with the LEACH algorithm. This paper reports the details of the clustering algorithms selected for use in the combination approach and, based on the simulation results, compares the performance of the combination approach with that of the LEACH algorithm.
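A condensed sketch of this combination, assuming node coordinates are given as a NumPy array: scikit-learn's init="k-means++" covers the k-means++ seeding, k-means refines the cluster heads, and the gap statistic (with Tibshirani et al.'s decision rule) selects the number of clusters. The reference-set size and k range are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, k):
    return KMeans(n_clusters=k, init="k-means++", n_init=10,
                  random_state=0).fit(X).inertia_

def gap_statistic_k(X, k_max=10, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sks = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(within_dispersion(X, k))
        # Reference datasets drawn uniformly from the bounding box of the nodes.
        ref_log_w = np.array([
            np.log(within_dispersion(rng.uniform(lo, hi, size=X.shape), k))
            for _ in range(n_refs)])
        gaps.append(ref_log_w.mean() - log_w)
        sks.append(ref_log_w.std() * np.sqrt(1 + 1.0 / n_refs))
    # Smallest k with gap(k) >= gap(k+1) - s(k+1).
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - sks[k]:
            return k
    return k_max
```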


Author(s):  
Ali Kaveh ◽  
Mohammad Reza Seddighian ◽  
Pouya Hassani

In this paper, an automatic data clustering approach based on concepts from graph theory is presented. Cluster Validity Indices (CVIs) are discussed, and the DB index is defined as the objective function of the meta-heuristic algorithms. Six finite element meshes, comprising both simple and complex two- and three-dimensional types, are decomposed. Six meta-heuristic algorithms are utilized to determine the optimal number of clusters and to minimize the objective of the decomposition problem. Finally, the corresponding statistical results are compared.
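As a toy illustration of using the DB index as the objective of an automatic-clustering search, the sketch below sweeps candidate cluster counts with k-means and keeps the partition with the lowest score; the plain sweep merely stands in for the six meta-heuristics, and all names and ranges are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def db_objective(X, labels):
    return davies_bouldin_score(X, labels)   # lower is better

def auto_cluster_by_db(X, k_max=12):
    best = (np.inf, None, None)
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = db_objective(X, labels)
        if score < best[0]:
            best = (score, k, labels)
    return best   # (DB value, estimated optimal k, labels)
```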


2013 ◽  
Vol 22 (03) ◽  
pp. 1350009 ◽  
Author(s):  
GEORGE GREKOUSIS

Choosing the optimal number of clusters is a key issue in cluster analysis, and it becomes even more complicated when dealing with spatial clustering. Cluster validation helps to determine the appropriate number of clusters present in a dataset. Furthermore, cluster validation evaluates and assesses the results of clustering algorithms. There are numerous methods and techniques for choosing the optimal number of clusters via crisp and fuzzy clustering. In this paper, we introduce a new index for fuzzy clustering to determine the optimal number of clusters. This index is not another metric for calculating compactness or separation among partitions. Instead, the index uses several existing indices to give a degree, or fuzziness, to the optimal number of clusters. In this way, not only do the objects in a fuzzy cluster get a membership value, but the number of clusters to be partitioned is given a value as well. The new index is used in the fuzzy c-means algorithm for the geodemographic segmentation of 285 postal codes.
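The "degree for each candidate number of clusters" idea can be sketched by normalising several existing validity indices across the candidate k values and averaging them into a single membership-like score per k. In this hedged toy, plain k-means stands in for fuzzy c-means and the particular choice of indices is an assumption, not the paper's construction.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def degree_per_k(X, k_range=range(2, 11)):
    ks = list(k_range)
    sil, ch, db = [], [], []
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        sil.append(silhouette_score(X, labels))
        ch.append(calinski_harabasz_score(X, labels))
        db.append(-davies_bouldin_score(X, labels))   # flip so higher is better

    def norm(v):
        v = np.asarray(v, dtype=float)
        return (v - v.min()) / (v.max() - v.min() + 1e-12)

    degrees = (norm(sil) + norm(ch) + norm(db)) / 3.0
    return dict(zip(ks, degrees))   # fuzzy "degree" that each k is optimal
```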


2014 ◽  
Vol 989-994 ◽  
pp. 1853-1856
Author(s):  
Shi Dong Yu ◽  
Yuan Ding ◽  
Xi Cheng Ma ◽  
Jian Sun

The genetic algorithm (GA) is a self-adaptive probabilistic search method for solving optimization problems and has been applied widely in science and engineering. In this paper, we propose an improved variable string length genetic algorithm (IVGA) for text clustering. Our algorithm automatically evolves the optimal number of clusters while providing a proper clustering of the data set. The chromosome is encoded with special indices that indicate the location of each gene, and a more effective version of the evolutionary steps automatically adjusts the balance between population diversity and selective pressure across generations. The superiority of the improved genetic algorithm over the conventional variable string length genetic algorithm (VGA) is demonstrated by the proper text clustering it provides.
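A deliberately small, mutation-only sketch of variable-string-length evolution for clustering is given below (far simpler than the paper's IVGA encoding and operators): each individual is a variable-length array of centroids, and the silhouette score of the induced partition serves as fitness. All operators, rates and bounds are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def assign(X, centers):
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)

def fitness(X, centers):
    labels = assign(X, centers)
    if len(np.unique(labels)) < 2:
        return -1.0                      # penalize degenerate partitions
    return silhouette_score(X, labels)

def evolve_clustering(X, pop_size=20, generations=60, k_max=10, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Each individual is a variable-length set of centroids drawn from the data.
    pop = [X[rng.choice(len(X), rng.integers(2, k_max + 1), replace=False)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda c: fitness(X, c), reverse=True)
        elite = scored[: pop_size // 2]
        children = []
        for parent in elite:
            child = parent.copy()
            op = rng.random()
            if op < 0.3 and len(child) < k_max:      # grow: add a centroid
                child = np.vstack([child, X[rng.integers(len(X))]])
            elif op < 0.6 and len(child) > 2:        # shrink: drop a centroid
                child = np.delete(child, rng.integers(len(child)), axis=0)
            else:                                    # perturb one centroid
                child[rng.integers(len(child))] += rng.normal(0.0, 0.1, X.shape[1])
            children.append(child)
        pop = elite + children
    best = max(pop, key=lambda c: fitness(X, c))
    return best, assign(X, best)
```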


2020 ◽  
Author(s):  
Congming Shi ◽  
Bingtao Wei ◽  
Shoulin Wei ◽  
Wen Wang ◽  
Hai Liu ◽  
...  

Abstract Clustering, as a traditional machine learning method, still plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, clusters are usually unpredictable. Although the Elbow method is one of the most commonly used methods to discriminate the optimal cluster number, determining the number of clusters depends on manual identification of the elbow points on the visualization curve, so even experienced analysts cannot clearly identify the elbow point when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed to work out a statistical metric that estimates an optimal cluster number when clustering on a dataset. Firstly, the average degree of distortion obtained by the Elbow method is normalized to the range of 0 to 10; secondly, the normalized results are used to calculate the cosine of the intersection angles between elbow points; thirdly, the calculated cosine of the intersection angles and the arccosine theorem are used to compute the intersection angles between elbow points; finally, the index of the minimal intersection angle computed above is used as the estimated potential optimal cluster number. The experimental results based on simulated datasets and a well-known public dataset demonstrated that the estimated optimal cluster number output by our newly proposed method is better than that of the widely used Silhouette method.

