A frequent term based text clustering approach using novel similarity measure

Text clustering is an effective approach to collect and organize text documents into meaningful groups for mining valuable information on the Internet. However, some issues remain, such as feature extraction and dimensionality reduction. To overcome these problems, we present a novel approach named deep-learning vocabulary network. The vocabulary network is constructed from a related-word set, which contains the co-occurrence relations of words or terms. We replace term frequency in feature vectors with the "importance" of words, computed from the vocabulary network with PageRank, which yields more precise feature vectors for representing the meaning of texts in clustering. Furthermore, a sparse-group deep belief network is proposed to reduce the dimensionality of feature vectors, and we introduce coverage rate as the similarity measure in Single-Pass clustering. To verify the effectiveness of our work, we compare the approach with representative algorithms, and experimental results show that feature vectors built with the deep-learning vocabulary network give better clustering performance.
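As a rough illustration of the idea of weighting words by their importance in a co-occurrence network, the sketch below builds a small undirected word graph and scores each word with PageRank by power iteration. It is not the authors' implementation: the function names `cooccurrence_graph` and `pagerank`, the document-level co-occurrence window, and the damping factor of 0.85 are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(docs):
    """Build an undirected weighted graph (hypothetical helper).

    Words are nodes; an edge weight counts the documents in which
    the two words appear together.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration (assumed parameters)."""
    nodes = list(graph)
    n = len(nodes)
    rank = {w: 1.0 / n for w in nodes}
    # Total edge weight leaving each node, used to normalise its votes.
    out_weight = {w: sum(graph[w].values()) for w in nodes}
    for _ in range(iterations):
        new_rank = {}
        for w in nodes:
            incoming = sum(
                rank[v] * graph[v][w] / out_weight[v] for v in graph[w]
            )
            new_rank[w] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


docs = [
    ["text", "clustering", "similarity"],
    ["text", "clustering"],
    ["text", "measure"],
]
importance = pagerank(cooccurrence_graph(docs))
# A document's feature vector can then use these scores in place of
# raw term frequencies.
vocab = sorted(importance)
vector = [importance[w] if w in docs[0] else 0.0 for w in vocab]
```

In this toy corpus "text" co-occurs with every other word, so it receives the highest score; a real system would also need the dimensionality reduction and Single-Pass clustering steps that the abstract describes.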


A set theory based similarity measure for text clustering and classification

Journal Of Big Data ◽

10.1186/s40537-020-00344-3 ◽

2020 ◽

Vol 7 (1) ◽

Cited By ~ 1

Author(s):

Ali A. Amer ◽

Hassan I. Abdalla

Keyword(s):

Set Theory ◽

Similarity Measure ◽

Similarity Measures ◽

Text Clustering ◽

Plagiarism Detection ◽

K Nearest Neighbor ◽

Single Measure ◽

Highly Effective ◽

Clustering And Classification ◽

Effectiveness And Efficiency

Abstract: Similarity measures have long been used in the information retrieval and machine learning domains for multiple purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, the problem with these measures is that, until recently, no single measure had been recorded as both highly effective and efficient at the same time. Thus, the quest for an efficient and effective similarity measure is still an open-ended challenge. This study, in consequence, introduces a new highly effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study aims to provide a comprehensive examination of seven of the most widely used similarity measures, mainly concerning their effectiveness and efficiency. Using the K-nearest neighbor algorithm (KNN) for classification, the K-means algorithm for clustering, and the bag-of-words (BoW) model for feature selection, all similarity measures are examined in detail. The experimental evaluation was carried out on two of the most popular datasets, namely Reuters-21 and Web-KB. The obtained results confirm that the proposed set theory-based similarity measure (STB-SM) significantly outperforms all state-of-the-art measures with regard to both effectiveness and efficiency.
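To make the evaluation setup concrete, the sketch below pairs a set-based similarity with K-nearest-neighbor classification over bag-of-words term sets. The abstract does not define STB-SM, so the well-known Jaccard coefficient is used here purely as a stand-in; the names `jaccard` and `knn_predict` and the toy training data are assumptions for illustration only.

```python
from collections import Counter


def jaccard(a, b):
    """Set-overlap similarity |A ∩ B| / |A ∪ B| (stand-in, not STB-SM)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_predict(query, labeled_docs, k=3):
    """Label a query by majority vote among its k most similar documents."""
    ranked = sorted(
        labeled_docs,
        key=lambda item: jaccard(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set of (terms, label) pairs.
train = [
    (["stock", "market", "trade"], "finance"),
    (["market", "price", "trade"], "finance"),
    (["football", "goal", "match"], "sport"),
    (["match", "team", "goal"], "sport"),
]
label = knn_predict(["trade", "price", "stock"], train, k=3)
```

Swapping `jaccard` for another similarity function is enough to compare measures under the same classifier, which is the kind of side-by-side comparison the abstract reports.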


A SIMILARITY MEASURE FOR TEXT CLUSTERING AND CLASSIFICATION

International Journal of Advance Engineering and Research Development ◽

10.21090/ijaerd.030133 ◽

2016 ◽

Vol 3 (01) ◽

Keyword(s):

Similarity Measure ◽

Text Clustering ◽

Clustering And Classification


Text clustering approach based on maximal frequent term sets

2009 IEEE International Conference on Systems, Man and Cybernetics ◽

10.1109/icsmc.2009.5346313 ◽

2009 ◽

Cited By ~ 4

Author(s):

Chong Su ◽

Qingcai Chen ◽

Xiaolong Wang ◽

Xianjun Meng

Keyword(s):

Text Clustering ◽

Clustering Approach


An ensemble clustering approach for topic discovery using implicit text segmentation

Journal of Information Science ◽

10.1177/0165551520911590 ◽

2020 ◽

pp. 016555152091159

Author(s):

Muhammad Qasim Memon ◽

Yu Lu ◽

Penghe Chen ◽

Aasma Memon ◽

Muhammad Salman Pathan ◽

...

Keyword(s):

Clustering Algorithms ◽

Data Retrieval ◽

Text Clustering ◽

Text Segmentation ◽

Ensemble Clustering ◽

Reduction Techniques ◽

Average Improvement ◽

Text Collections ◽

Clustering Approach ◽

Topic Structure

Text segmentation (TS) is the process of dividing multi-topic text collections into cohesive segments using topic boundaries. Similarly, text clustering is recognized as a major concern for multi-topic text collections, as they are distinguished by sub-topic structure and their contents are not associated with each other. Existing clustering approaches follow the TS method, which relies on word frequencies and may not be suitable for clustering multi-topic text collections. In this work, we propose a new ensemble clustering approach (ECA), a novel topic-modelling-based clustering approach that combines TS and text clustering. We introduce LDA-onto (LDA-ontology), a TS-based model that decomposes a document into segments (i.e. sub-documents), wherein each sub-document is associated with exactly one sub-topic. We deal with the problem of clustering a document that is intrinsically related to various topics and whose topical structure is missing. ECA is tested on well-known datasets in order to provide a comprehensive presentation and validation of clustering algorithms using LDA-onto. ECA exhibits the semantic relations of keywords in sub-documents, and the resulting clusters belong to the original documents that contain them. Moreover, the present research sheds light on clustering performance and indicates that performance (in terms of F-measure) does not differ when the number of topics changes. Our findings give above-par results for analysing the problem of text clustering more broadly, without applying dimension reduction techniques to highly sparse data. Specifically, ECA provides a more efficient framework than the traditional and segment-based approaches, with statistically significant results and an average improvement of over 10.2%. For the most part, the proposed framework can be evaluated in applications where meaningful data retrieval is useful, such as document summarization, text retrieval, and novelty and topic detection.


An Improved Similarity Measure for Text Clustering and Classification

Advanced Science Letters ◽

10.1166/asl.2015.6603 ◽

2015 ◽

Vol 21 (11) ◽

pp. 3583-3590 ◽

Cited By ~ 4

Author(s):

G Suresh Reddy ◽

T. V Rajini Kanth ◽

A Ananda Rao

Keyword(s):

Similarity Measure ◽

Text Clustering ◽

Clustering And Classification


An Efficient Text Clustering Approach using Biased Affinity Propagation

International Journal of Computer Applications ◽

10.5120/16755-6273 ◽

2014 ◽

Vol 96 (1) ◽

pp. 1-4 ◽

Cited By ~ 1

Author(s):

Isha Sharma ◽

Mahak Motwani

Keyword(s):

Text Clustering ◽

Affinity Propagation ◽

Clustering Approach


TEXT CLUSTERING APPROACH TOWARD COMMUNITY EXPECTATIONS TO THE BUS RAPID TRANSIT (BRT) TRANSJATENG PURWOKERTO-PURBALINGGA OPERATIONS

Jurnal Sosioteknologi ◽

10.5614/sostek.itbj.2021.20.3.1 ◽

2021 ◽

Vol 20 (3) ◽

pp. 288-298

Author(s):

Famila Dwi Winati ◽

Fauzan Romadlon

Keyword(s):

Public Opinion ◽

Public Transportation ◽

Urban Areas ◽

Social Economic ◽

Text Clustering ◽

Environmental Benefits ◽

Bus Rapid Transit ◽

Rapid Transit ◽

Public Expectations ◽

Clustering Approach

Bus Rapid Transit (BRT) is one of the alternative forms of public transportation in urban areas and has begun to be implemented in some cities in Indonesia. To determine the effectiveness of BRT as a mass transportation system, it is necessary to study the expectations of users and non-users of the Trans Jateng Purwokerto-Purbalingga BRT regarding its perceived social, economic, and environmental impacts. This study uses the text clustering method to group public opinion based on similarities so that it can be analyzed further for policymaking. The results show that the majority of the community expressed positive expectations of the perceived social, economic, and environmental benefits of BRT implementation. On the other hand, public opinion on the presence of BRT is not always positive and has a significant impact. Improvements are needed in several aspects that are considered not to meet public expectations, in order to maximize the function of BRT as a public transportation substitute for private vehicles.
