A frequent term based text clustering approach using novel similarity measure

Author(s):  
G. Suresh Reddy ◽  
T.V. Rajinikanth ◽  
A. Ananda Rao
2017 ◽  
Vol 2017 ◽  
pp. 1-13 ◽  
Author(s):  
Junkai Yi ◽  
Yacong Zhang ◽  
Xianghui Zhao ◽  
Jing Wan

Text clustering is an effective approach to collect and organize text documents into meaningful groups for mining valuable information on the Internet. However, some issues remain to be tackled, such as feature extraction and data dimension reduction. To overcome these problems, we present a novel approach named deep-learning vocabulary network. The vocabulary network is constructed from a related-word set, which contains the "co-occurrence" relations of words or terms. We replace term frequency in feature vectors with the "importance" of words, computed from the vocabulary network and PageRank, which generates more precise feature vectors to represent the meaning of text in clustering. Furthermore, a sparse-group deep belief network is proposed to reduce the dimensionality of feature vectors, and we introduce a coverage rate as the similarity measure in Single-Pass clustering. To verify the effectiveness of our work, we compare the approach with representative algorithms, and experimental results show that feature vectors based on the deep-learning vocabulary network achieve better clustering performance.
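As a rough illustration of scoring words by their position in a co-occurrence network rather than by raw frequency, the sketch below builds an undirected word graph and runs a plain power-iteration PageRank over it. It is not the authors' implementation: the function names `build_cooccurrence_graph` and `pagerank`, the whitespace tokenization, the document-level co-occurrence window, and the damping factor of 0.85 are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Link two words whenever they appear in the same document (undirected).

    Assumption: whitespace tokenization and a whole-document window.
    """
    graph = defaultdict(set)
    for doc in documents:
        words = set(doc.lower().split())
        for w in words:
            graph[w]  # touch the key so isolated words still become nodes
        for a, b in combinations(sorted(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; returns a word -> importance mapping."""
    nodes = list(graph)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by words without neighbours is spread evenly over all words.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {}
        for node in nodes:
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


docs = [
    "text clustering groups documents",
    "text clustering uses similarity",
    "text mining",
]
importance = pagerank(build_cooccurrence_graph(docs))
# Words sorted from most to least important in the toy corpus.
ranked_words = sorted(importance, key=importance.get, reverse=True)
```

The resulting scores could stand in for term-frequency weights in a feature vector; how the paper combines them with the deep belief network is not specified in the abstract, so that step is left out here.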


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Ali A. Amer ◽  
Hassan I. Abdalla

Similarity measures have long been used in the information retrieval and machine learning domains for multiple purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, the problem with these measures is that, until recently, no single measure had been recorded as both highly effective and efficient at the same time. Thus, the quest for an efficient and effective similarity measure is still an open-ended challenge. This study therefore introduces a new, highly effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study aims to provide a comprehensive examination of seven of the most widely used similarity measures, mainly concerning their effectiveness and efficiency. Using the K-nearest neighbor algorithm (KNN) for classification, the K-means algorithm for clustering, and the bag-of-words (BoW) model for feature selection, all similarity measures are carefully examined in detail. The experimental evaluation was carried out on two of the most popular datasets, namely Reuters-21 and Web-KB. The obtained results confirm that the proposed set theory-based similarity measure (STB-SM), as a pre-eminent measure, significantly outperforms all state-of-the-art measures with regard to both effectiveness and efficiency.
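To make the notion of a similarity measure over bag-of-words vectors concrete, here is a minimal sketch of two widely used measures, cosine and Jaccard similarity. The abstract does not define STB-SM, so it is not reproduced here; the function names and the whitespace tokenization are assumptions chosen for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to term counts (assumption: whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Shared vocabulary divided by combined vocabulary (counts ignored)."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return len(a.keys() & b.keys()) / len(union)


d1 = bag_of_words("text clustering groups documents")
d2 = bag_of_words("text clustering ranks documents")
cos = cosine_similarity(d1, d2)   # three shared terms out of four in each text
jac = jaccard_similarity(d1, d2)  # three shared terms out of five distinct terms
```

Either function can be plugged into a nearest-neighbor classifier or a clustering loop as the pairwise score, which is the role the compared measures play in the study.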


2020 ◽  
pp. 016555152091159
Author(s):  
Muhammad Qasim Memon ◽  
Yu Lu ◽  
Penghe Chen ◽  
Aasma Memon ◽  
Muhammad Salman Pathan ◽  
...  

Text segmentation (TS) is the process of dividing multi-topic text collections into cohesive segments using topic boundaries. Similarly, text clustering has been recognized as a major concern for multi-topic text collections, as they are distinguished by sub-topic structure and their contents are not associated with each other. Existing clustering approaches follow the TS method, which relies on word frequencies and may not be suitable for clustering multi-topic text collections. In this work, we propose a new ensemble clustering approach (ECA), a novel topic-modelling-based clustering approach that combines TS and text clustering. We devise LDA-onto (LDA-ontology), a TS-based model that decomposes a document into segments (i.e. sub-documents), wherein each sub-document is associated with exactly one sub-topic. We deal with the problem of clustering a document that is intrinsically related to various topics and whose topical structure is missing. ECA is tested on well-known datasets in order to provide a comprehensive presentation and validation of clustering algorithms using LDA-onto. ECA exhibits the semantic relations of keywords in sub-documents, and the resultant clusters belong to the original documents that contain them. Moreover, the present research sheds light on clustering performance and indicates that performance (in terms of F-measure) does not differ when the number of topics changes. Our findings yield above-par results for analysing the problem of text clustering more broadly, without applying dimension reduction techniques to highly sparse data. Specifically, ECA provides a more efficient framework than the traditional and segment-based approaches, and the achieved results are statistically significant, with an average improvement of over 10.2%. For the most part, the proposed framework can be evaluated in applications where meaningful data retrieval is useful, such as document summarization, text retrieval, and novelty and topic detection.


2015 ◽  
Vol 21 (11) ◽  
pp. 3583-3590 ◽  
Author(s):  
G Suresh Reddy ◽  
T. V Rajini Kanth ◽  
A Ananda Rao

2021 ◽  
Vol 20 (3) ◽  
pp. 288-298
Author(s):  
Famila Dwi Winati ◽  
Fauzan Romadlon

Bus Rapid Transit (BRT) is one of the alternative forms of public transportation in urban areas and has begun to be implemented in some cities in Indonesia. To determine the effectiveness of BRT as a mass transportation system, it is necessary to study the expectations of users and non-users of the Trans Jateng Purwokerto-Purbalingga BRT regarding its perceived social, economic, and environmental impacts. This study uses the text clustering method to group public opinion based on similarities so that it can be analyzed further for policymaking. As a result, the majority of the community expressed positive expectations of the perceived social, economic, and environmental benefits of BRT implementation. On the other hand, public opinion on the presence of BRT is not always positive and has a significant impact. Improvements are needed in several aspects that are considered not to meet public expectations, in order to maximize the function of BRT as a public transportation substitute for private vehicles.


2011 ◽  
Vol 6 (10) ◽  
Author(s):  
Chenghui HUANG ◽  
Jian YIN ◽  
Fang HOU
