The Performance of BERT as Data Representation of Text Clustering

Author(s):  
Alvin Subakti ◽  
Hendri Murfi ◽  
Nora Hariadi

Text clustering is the task of grouping a set of texts so that texts in the same group are more similar to one another than to those in different groups. Grouping text manually requires a significant amount of time and labor, so automation using machine learning is necessary. The standard method for representing textual data is Term Frequency Inverse Document Frequency (TF-IDF). However, TF-IDF cannot consider the position and context of a word in a sentence. The Bidirectional Encoder Representations from Transformers (BERT) model can produce text representations that incorporate the position and context of a word in a sentence. This research analyzes the performance of the BERT model as a data representation for text. Moreover, various feature extraction and normalization methods are applied to the BERT representations. To examine the performance of BERT, we use four clustering algorithms, i.e., k-means clustering, eigenspace-based fuzzy c-means, deep embedded clustering, and improved deep embedded clustering. Our simulations show that BERT outperforms the standard TF-IDF method in 28 out of 36 metrics. Furthermore, different feature extraction and normalization methods produce varied performance, and their choice should be adapted to the text clustering algorithm used.
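
A minimal sketch of the kind of pipeline the abstract describes, using the Hugging Face transformers library and scikit-learn; the placeholder corpus, the mean-pooling feature extraction, the L2 normalization, and the number of clusters are illustrative assumptions, since the paper compares several feature extraction and normalization strategies:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# placeholder corpus; the paper clusters real text datasets
texts = [
    "the stock market rallied after the earnings report",
    "investors sold shares amid recession fears",
    "the team won the championship game last night",
    "the striker scored twice in the final match",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc)
    # mean-pool the last-layer token embeddings as one possible
    # feature-extraction strategy (the paper compares several)
    mask = enc["attention_mask"].unsqueeze(-1)
    emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)

# optional L2 normalization before clustering
features = normalize(emb.numpy())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)
```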

2018 ◽  
Vol 7 (1) ◽  
pp. 55-62
Author(s):  
Mohammad Alaqtash ◽  
Moayad A.Fadhil ◽  
Ali F. Al-Azzawi

Clustering enables the grouping of unlabeled data by partitioning it into clusters of similar patterns. Over the past decades, many clustering algorithms have been developed for various clustering problems. The overlapping partitioning clustering (OPC) algorithm, however, can only handle numerical data, and novel clustering algorithms have been studied extensively to overcome this issue. By increasing the number of objects belonging to one cluster and the distance between cluster centers, this study aimed to cluster textual data without losing the algorithm's main functions. The study used the 20 Newsgroups dataset, which consists of approximately 20,000 textual documents. By introducing some modifications to the traditional algorithm, an acceptable level of homogeneity and completeness of the clusters was obtained. The modifications concerned the pre-processing phase and the data representation, along with the numerical methods that influence the primary function of the algorithm. Subsequently, the results were evaluated and compared with the k-means algorithm on the training and test datasets. The results indicated that the modified algorithm can successfully handle categorical data and produce satisfactory clusters.
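
As a point of reference for the comparison described above, here is a hedged sketch of the k-means baseline on the 20 Newsgroups corpus with TF-IDF features, evaluated by homogeneity and completeness (scikit-learn; the parameter choices are illustrative, not the paper's settings):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, completeness_score

# the 20 Newsgroups training split, with headers/footers/quotes stripped
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(data.data)

# k-means baseline with one cluster per newsgroup
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)
print("homogeneity :", homogeneity_score(data.target, labels))
print("completeness:", completeness_score(data.target, labels))
```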


2011 ◽  
Vol 135-136 ◽  
pp. 1155-1158
Author(s):  
Wei Li ◽  
Mei An Li

A clustering algorithm based on a probability model constructs a model for each cluster and calculates the probability that each text falls under each model to decide which cluster the text belongs to, which conveniently represents the abstract structure of the clusters from a global perspective. This paper combines the hidden Markov model with the k-means clustering algorithm to realize text clustering: the k-means algorithm first produces an initial clustering result, which serves as the initial probability model of a hidden Markov model; a probability transition matrix is then constructed to predict each step of the clustering iteration, and clustering ends when the difference between two successive probability transition matrices is zero. This algorithm can view every cluster in the document clustering process from a global perspective, avoids repeated clustering, and effectively improves the clustering algorithm.
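
The stopping rule described above, ending when the transition matrix between successive cluster assignments stops changing, could be sketched roughly as follows; this is a loose illustration with scikit-learn's k-means standing in for the probability model, not the authors' HMM formulation, and the data are synthetic placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def transition_matrix(prev, curr, k):
    """Empirical transition matrix between two successive cluster assignments."""
    T = np.zeros((k, k))
    for p, c in zip(prev, curr):
        T[p, c] += 1
    rows = T.sum(axis=1, keepdims=True)
    return np.divide(T, rows, out=np.zeros_like(T), where=rows > 0)

def cluster_until_stable(X, k, max_steps=50):
    # initial clustering from a standard k-means pass
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels, centers, prev_T = km.labels_, km.cluster_centers_, None
    for _ in range(max_steps):
        # one refinement step, warm-started from the previous centers
        km = KMeans(n_clusters=k, init=centers, n_init=1, max_iter=1).fit(X)
        new_labels, centers = km.labels_, km.cluster_centers_
        T = transition_matrix(labels, new_labels, k)
        labels = new_labels
        # stop once the transition matrix between iterations no longer changes
        if prev_T is not None and np.allclose(T, prev_T):
            break
        prev_T = T
    return labels

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # toy data
print(np.bincount(cluster_until_stable(X, k=3)))
```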


2018 ◽  
Vol 7 (2.14) ◽  
pp. 32
Author(s):  
Siti Sakira Kamaruddin ◽  
Yuhanis Yusof ◽  
Nur Azzah Abu Bakar ◽  
Mohamed Ahmed Tayie ◽  
Ghaith Abdulsattar A.Jabbar Alkubaisi

Textual data are a rich source of knowledge; hence, sentence comparison has become one of the important tasks in text mining. Most previous work on text comparison is performed at the document level, and research suggests that comparing text at the sentence level is a non-trivial problem. One reason is that two sentences can convey the same meaning with totally dissimilar words. This paper presents the results of a comparative analysis of three representation schemes, i.e., term frequency inverse document frequency, Latent Semantic Analysis, and graph-based representation, using three similarity measures, i.e., cosine, Dice coefficient, and Jaccard similarity, to compare the similarity of sentences. Results reveal that the graph-based representation and the Jaccard similarity measure outperform the others in terms of precision, recall, and F-measure.
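
A small sketch of three of the sentence-similarity measures compared above: TF-IDF with cosine similarity, plus set-based Jaccard and Dice coefficients. The example sentences are placeholders, and the graph-based representation and LSA are omitted:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a, b):
    """Jaccard similarity over the word sets of two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dice(a, b):
    """Dice coefficient over the word sets of two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

s1 = "the cat sat on the mat"
s2 = "a cat was sitting on a mat"

tfidf = TfidfVectorizer().fit_transform([s1, s2])
print("cosine :", cosine_similarity(tfidf[0], tfidf[1])[0, 0])
print("jaccard:", jaccard(s1, s2))
print("dice   :", dice(s1, s2))
```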


2013 ◽  
Vol 760-762 ◽  
pp. 1925-1929 ◽  
Author(s):  
Hai Dong Meng ◽  
Peng Fei Wu ◽  
Yu Chen Song ◽  
Guan Dong Xu

The data field clustering algorithm possesses dynamic characteristics compared with other clustering algorithms. By changing the parameters of the data field model, the results can be adjusted dynamically to meet the targets of feature extraction and knowledge discovery at different scales, but the selection and construction of the data field model can give rise to different clustering results. This paper presents the clustering effectiveness obtained with various data field models and their parameters, provides a scheme for choosing the data field model that best fits the characteristics of the data radiation, and verifies that the best clustering effectiveness is achieved when the value of the radial energy lies at the golden section.
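
Data field methods typically describe each object's influence with a Gaussian-form potential whose impact factor controls the radiation range; a minimal sketch of such a potential function is shown below. The functional form and parameter names are assumptions for illustration only, since the paper itself compares several data field models:

```python
import numpy as np

def data_field_potential(points, locations, sigma):
    """Gaussian-form potential of a data field evaluated at given locations.

    points    : (n, d) array of data objects radiating the field
    locations : (m, d) array of positions where the potential is evaluated
    sigma     : impact factor; larger values smooth the field and merge
                clusters, smaller values reveal finer-grained structure
    """
    d = np.linalg.norm(locations[:, None, :] - points[None, :, :], axis=-1)
    return np.exp(-(d / sigma) ** 2).sum(axis=1)

# toy usage: evaluate the field generated by three points at their own positions
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(data_field_potential(pts, pts, sigma=1.0))
```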


2021 ◽  
Vol 02 (02) ◽  
Author(s):  
Mohammed A. Ahmed ◽  
Hanif Baharin ◽  
Puteri N. E. Nohuddin ◽  
...  

Al-Quran is the primary text of Muslims' religion and practice. Millions of Muslims around the world use al-Quran as their reference guide, and knowledge can therefore be obtained from it by Muslims and Islamic scholars in general. Al-Quran has been translated into various languages of the world, for example English, by several translators, and each translator brings his own ideas, comments, and statements to the translation of the verses (Tafseer). Therefore, this paper attempts to cluster the translations of the Tafseer using text clustering, the text mining method of grouping related documents into the same cluster. The study adapted two clustering algorithms (mini-batch k-means and k-means) to explain and define the links between keywords, known as features or concepts, for the Al-Baqarah chapter of 286 verses. For this dataset, data preprocessing and feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF) and Principal Component Analysis (PCA) were applied. Results showed two- and three-dimensional clustering plots assigning seven cluster categories (k = 7) for the Tafseer. The running time of the mini-batch k-means algorithm (0.05485 s) outperformed that of the k-means algorithm (0.23334 s). Finally, the features 'god', 'people', and 'believe' were the most frequent features.
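
A hedged sketch of the timing comparison described above, with randomly generated placeholder documents standing in for the 286 verse translations (the real study uses the Tafseer text; the vocabulary, cluster count behavior, and timings here are illustrative only):

```python
import random
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, MiniBatchKMeans

# placeholder documents standing in for the 286 translated verses of Al-Baqarah
random.seed(0)
vocab = ["god", "people", "believe", "mercy", "guidance", "prayer", "charity", "earth"]
docs = [" ".join(random.choices(vocab, k=12)) for _ in range(286)]

X = TfidfVectorizer().fit_transform(docs)
X2 = PCA(n_components=2).fit_transform(X.toarray())   # 2-D projection for plotting

# compare the running time of the two algorithms with k = 7
for Algo in (KMeans, MiniBatchKMeans):
    start = time.time()
    labels = Algo(n_clusters=7, n_init=10, random_state=0).fit_predict(X2)
    print(f"{Algo.__name__}: {time.time() - start:.5f}s")
```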


2015 ◽  
Vol 22 (5) ◽  
pp. 687-726 ◽  
Author(s):  
MARCELO L. ERRECALDE ◽  
LETICIA C. CAGNINA ◽  
PAOLO ROSSO

This article presents silhouette-attraction (Sil-Att), a simple and effective method for text clustering, which is based on two main concepts: the silhouette coefficient and the idea of attraction. The combination of both principles yields a general technique that can be used either as a boosting method, which improves the results of other clustering algorithms, or as an independent clustering algorithm. The experimental work shows that Sil-Att obtains high-quality results on text corpora with very different characteristics. Furthermore, its stable performance on all the considered corpora indicates that it is a very robust method. This is a positive aspect of Sil-Att with respect to the other algorithms used in the experiments, whose performance depends heavily on specific characteristics of the corpora being considered.
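
The silhouette coefficient at the core of Sil-Att can be computed per document with scikit-learn; the sketch below shows only this ingredient (on synthetic data standing in for a vectorized corpus), not the attraction step or the full boosting procedure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# toy data standing in for a vectorized text corpus
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("mean silhouette:", silhouette_score(X, labels))
per_doc = silhouette_samples(X, labels)   # one coefficient per document
weak = np.where(per_doc < 0)[0]           # documents that fit their cluster poorly
print("poorly placed documents:", len(weak))
```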


2021 ◽  
Vol 13 (2) ◽  
pp. 647
Author(s):  
Ruomu Miao ◽  
Yuxia Wang ◽  
Shuang Li

With the development of Web 2.0 and the mobile Internet, urban residents, as a new type of "sensor", provide us with massive amounts of volunteered geographic information (VGI). Quantifying the spatial patterns of VGI plays an increasingly important role in understanding and developing urban spatial functions. Using VGI and social media activity data, this article develops a method to automatically extract and identify urban spatial patterns and functional zones. The method is demonstrated on the case of Beijing, China, and includes the following three steps: (1) obtain multi-source urban spatial data, such as Weibo data (a Chinese equivalent of Twitter), OpenStreetMap, population data, etc.; (2) use a hierarchical clustering algorithm, the term frequency-inverse document frequency (TF-IDF) method, and improved k-means clustering algorithms to identify functional zones; (3) compare the identified results with the actual urban land uses and verify their accuracy. The experimental results show that our method can effectively identify urban functional zones; the results provide new ideas for the study of urban spatial patterns and are of great significance for optimizing urban spatial planning.
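
A rough sketch of step (2) above, with a handful of hypothetical zone "documents" built from activity/POI category tags (the paper derives these from Weibo and OpenStreetMap data); plain k-means stands in for the paper's improved variant:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# each "document" is the bag of activity/POI category tags observed in one zone
# (hypothetical tags for illustration only)
zones = [
    "restaurant cafe shopping mall cinema",
    "office bank company office office",
    "park museum temple park",
    "apartment school kindergarten apartment",
]

X = TfidfVectorizer().fit_transform(zones).toarray()

# hierarchical clustering to explore the number of functional-zone types
h_labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")

# k-means (standing in for the improved variant) for the final partition
k_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(h_labels, k_labels)
```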


2021 ◽  
Vol 18 (4) ◽  
pp. 1306-1311
Author(s):  
S. Sarannya ◽  
M. Venkatesan ◽  
Prabhavathy Panner

Text clustering has nowadays become a major technique in many fields, including data mining and natural language processing. It is also broadly used for information retrieval and the assimilation of textual data. The majority of previous work focuses on clustering algorithms in which feature extraction is performed without considering the semantic meaning of a word in its context. In this work, we introduce a double clustering algorithm using k-means, in conjunction with a Bidirectional Long Short-Term Memory (BiLSTM) network and a Convolutional Neural Network (CNN) for feature extraction, so that semantic meaning is also considered. Recurrent neural networks (RNNs) are able to learn the long-term dependencies present in the input, whereas CNN models have long been known to be effective at extracting local features from input data. Unlike previous work, the proposed approach carries out feature extraction and document clustering as one combined mechanism: the clustering result is sent back to the model as feedback, dynamically optimizing the parameters of the network. Clustering is implemented in a double-clustering manner, which increases time efficiency.
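
A minimal PyTorch sketch of the kind of CNN + BiLSTM feature extractor described above, whose output is then clustered with k-means; the layer sizes, vocabulary, and random token ids are placeholders, and the feedback loop that retunes the network from the clustering result is omitted:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class ConvBiLSTMEncoder(nn.Module):
    """Toy feature extractor: a 1-D convolution for local features,
    followed by a bidirectional LSTM for long-range dependencies."""
    def __init__(self, vocab_size=5000, emb=64, conv=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.conv = nn.Conv1d(emb, conv, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)      # (batch, emb, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2)   # (batch, seq_len, conv)
        _, (h, _) = self.lstm(x)                       # h: (2, batch, hidden)
        return torch.cat([h[0], h[1]], dim=1)          # (batch, 2 * hidden)

# random token ids standing in for a tokenized document corpus
tokens = torch.randint(0, 5000, (32, 40))
with torch.no_grad():
    feats = ConvBiLSTMEncoder()(tokens).numpy()
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(feats)
print(labels)
```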

