The Performance of BERT as Data Representation of Text Clustering

Author(s):  
Alvin Subakti ◽  
Hendri Murfi ◽  
Nora Hariadi

Text clustering is the task of grouping a set of texts so that texts in the same group are more similar to one another than to those in different groups. Grouping text manually requires a significant amount of time and labor, so automation using machine learning is necessary. The standard method for representing textual data is Term Frequency Inverse Document Frequency (TF-IDF). However, TF-IDF cannot consider the position and context of a word in a sentence. The Bidirectional Encoder Representations from Transformers (BERT) model can produce text representations that incorporate the position and context of a word in a sentence. This research analyzes the performance of the BERT model as a data representation for text. Moreover, various feature extraction and normalization methods are applied to the BERT representations. To examine the performance of BERT, we use four clustering algorithms, i.e., k-means clustering, eigenspace-based fuzzy c-means, deep embedded clustering, and improved deep embedded clustering. Our simulations show that BERT outperforms the standard TF-IDF method in 28 out of 36 metrics. Furthermore, different feature extraction and normalization methods produce varied performance, and their choice should be adapted to the text clustering algorithm used.
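
A minimal sketch of the kind of pipeline the abstract describes, using the Hugging Face transformers library and scikit-learn; the placeholder corpus, the mean-pooling feature extraction, the L2 normalization, and the number of clusters are illustrative assumptions, since the paper compares several feature extraction and normalization strategies:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# placeholder corpus; the paper clusters real text datasets
texts = [
    "the stock market rallied after the earnings report",
    "investors sold shares amid recession fears",
    "the team won the championship game last night",
    "the striker scored twice in the final match",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc)
    # mean-pool the last-layer token embeddings as one possible
    # feature-extraction strategy (the paper compares several)
    mask = enc["attention_mask"].unsqueeze(-1)
    emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)

# optional L2 normalization before clustering
features = normalize(emb.numpy())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)
```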

2018 ◽  
Vol 7 (1) ◽  
pp. 55-62
Author(s):  
Mohammad Alaqtash ◽  
Moayad A.Fadhil ◽  
Ali F. Al-Azzawi

Clustering enables the grouping of unlabeled data by partitioning it into clusters of similar patterns. Over the past decades, many clustering algorithms have been developed for various clustering problems. The overlapping partitioning clustering (OPC) algorithm, however, can only handle numerical data, and novel clustering algorithms have been studied extensively to overcome this issue. By increasing the number of objects belonging to one cluster and the distance between cluster centers, this study aimed to cluster textual data without losing the algorithm's main functions. The study used the 20 Newsgroups dataset, which consists of approximately 20,000 textual documents. By introducing some modifications to the traditional algorithm, an acceptable level of homogeneity and completeness of the clusters was obtained. The modifications concerned the pre-processing phase and the data representation, along with the numerical methods that influence the primary function of the algorithm. Subsequently, the results were evaluated and compared with the k-means algorithm on the training and test datasets. The results indicated that the modified algorithm can successfully handle categorical data and produce satisfactory clusters.
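
As a point of reference for the comparison described above, here is a hedged sketch of the k-means baseline on the 20 Newsgroups corpus with TF-IDF features, evaluated by homogeneity and completeness (scikit-learn; the parameter choices are illustrative, not the paper's settings):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, completeness_score

# the 20 Newsgroups training split, with headers/footers/quotes stripped
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(data.data)

# k-means baseline with one cluster per newsgroup
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)
print("homogeneity :", homogeneity_score(data.target, labels))
print("completeness:", completeness_score(data.target, labels))
```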


2011 ◽  
Vol 135-136 ◽  
pp. 1155-1158
Author(s):  
Wei Li ◽  
Mei An Li

A clustering algorithm based on a probability model constructs a model for each cluster and calculates the probability that each text falls under each model to decide which cluster the text belongs to, which conveniently represents the abstract structure of the clusters from a global perspective. This paper combines the hidden Markov model with the k-means clustering algorithm to realize text clustering: the k-means algorithm first produces an initial clustering result, which serves as the initial probability model of a hidden Markov model; a probability transition matrix is then constructed to predict each step of the clustering iteration, and clustering ends when the difference between two successive probability transition matrices is zero. This algorithm can view every cluster in the document clustering process from a global perspective, avoids repeated clustering, and effectively improves the clustering algorithm.
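
The stopping rule described above, ending when the transition matrix between successive cluster assignments stops changing, could be sketched roughly as follows; this is a loose illustration with scikit-learn's k-means standing in for the probability model, not the authors' HMM formulation, and the data are synthetic placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def transition_matrix(prev, curr, k):
    """Empirical transition matrix between two successive cluster assignments."""
    T = np.zeros((k, k))
    for p, c in zip(prev, curr):
        T[p, c] += 1
    rows = T.sum(axis=1, keepdims=True)
    return np.divide(T, rows, out=np.zeros_like(T), where=rows > 0)

def cluster_until_stable(X, k, max_steps=50):
    # initial clustering from a standard k-means pass
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels, centers, prev_T = km.labels_, km.cluster_centers_, None
    for _ in range(max_steps):
        # one refinement step, warm-started from the previous centers
        km = KMeans(n_clusters=k, init=centers, n_init=1, max_iter=1).fit(X)
        new_labels, centers = km.labels_, km.cluster_centers_
        T = transition_matrix(labels, new_labels, k)
        labels = new_labels
        # stop once the transition matrix between iterations no longer changes
        if prev_T is not None and np.allclose(T, prev_T):
            break
        prev_T = T
    return labels

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # toy data
print(np.bincount(cluster_until_stable(X, k=3)))
```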


2018 ◽  
Vol 7 (2.14) ◽  
pp. 32
Author(s):  
Siti Sakira Kamaruddin ◽  
Yuhanis Yusof ◽  
Nur Azzah Abu Bakar ◽  
Mohamed Ahmed Tayie ◽  
Ghaith Abdulsattar A.Jabbar Alkubaisi

Textual data are a rich source of knowledge; hence, sentence comparison has become one of the important tasks in text mining. Most previous work on text comparison is performed at the document level, and research suggests that comparing text at the sentence level is a non-trivial problem. One reason is that two sentences can convey the same meaning with totally dissimilar words. This paper presents the results of a comparative analysis of three representation schemes, i.e., term frequency inverse document frequency, Latent Semantic Analysis, and graph-based representation, using three similarity measures, i.e., cosine, Dice coefficient, and Jaccard similarity, to compare the similarity of sentences. Results reveal that the graph-based representation and the Jaccard similarity measure outperform the others in terms of precision, recall, and F-measure.
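
A small sketch of three of the sentence-similarity measures compared above: TF-IDF with cosine similarity, plus set-based Jaccard and Dice coefficients. The example sentences are placeholders, and the graph-based representation and LSA are omitted:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a, b):
    """Jaccard similarity over the word sets of two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dice(a, b):
    """Dice coefficient over the word sets of two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

s1 = "the cat sat on the mat"
s2 = "a cat was sitting on a mat"

tfidf = TfidfVectorizer().fit_transform([s1, s2])
print("cosine :", cosine_similarity(tfidf[0], tfidf[1])[0, 0])
print("jaccard:", jaccard(s1, s2))
print("dice   :", dice(s1, s2))
```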


2013 ◽  
Vol 760-762 ◽  
pp. 1925-1929 ◽  
Author(s):  
Hai Dong Meng ◽  
Peng Fei Wu ◽  
Yu Chen Song ◽  
Guan Dong Xu

The data field clustering algorithm possesses dynamic characteristics compared with other clustering algorithms. By changing the parameters of the data field model, the results can be adjusted dynamically to meet the targets of feature extraction and knowledge discovery at different scales, but the selection and construction of the data field model can give rise to different clustering results. This paper presents the clustering effectiveness obtained with various data field models and their parameters, provides a scheme for choosing the data field model that best fits the characteristics of the data radiation, and verifies that the best clustering effectiveness is achieved when the value of the radial energy lies at the golden section.
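
Data field methods typically describe each object's influence with a Gaussian-form potential whose impact factor controls the radiation range; a minimal sketch of such a potential function is shown below. The functional form and parameter names are assumptions for illustration only, since the paper itself compares several data field models:

```python
import numpy as np

def data_field_potential(points, locations, sigma):
    """Gaussian-form potential of a data field evaluated at given locations.

    points    : (n, d) array of data objects radiating the field
    locations : (m, d) array of positions where the potential is evaluated
    sigma     : impact factor; larger values smooth the field and merge
                clusters, smaller values reveal finer-grained structure
    """
    d = np.linalg.norm(locations[:, None, :] - points[None, :, :], axis=-1)
    return np.exp(-(d / sigma) ** 2).sum(axis=1)

# toy usage: evaluate the field generated by three points at their own positions
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(data_field_potential(pts, pts, sigma=1.0))
```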


2021 ◽  
Vol 02 (02) ◽  
Author(s):  
Mohammed A. Ahmed ◽  
Hanif Baharin ◽  
Puteri N. E. Nohuddin ◽  
...  

Al-Quran is the primary text of Muslims' religion and practice. Millions of Muslims around the world use al-Quran as their reference guide, and knowledge can therefore be obtained from it by Muslims and Islamic scholars in general. Al-Quran has been translated into various languages of the world, for example English, by several translators, and each translator brings his own ideas, comments, and statements to the translation of the verses (Tafseer). Therefore, this paper attempts to cluster the translations of the Tafseer using text clustering, the text mining method of grouping related documents into the same cluster. The study adapted two clustering algorithms (mini-batch k-means and k-means) to explain and define the links between keywords, known as features or concepts, for the Al-Baqarah chapter of 286 verses. For this dataset, data preprocessing and feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF) and Principal Component Analysis (PCA) were applied. Results showed two- and three-dimensional clustering plots assigning seven cluster categories (k = 7) for the Tafseer. The running time of the mini-batch k-means algorithm (0.05485 s) outperformed that of the k-means algorithm (0.23334 s). Finally, the features 'god', 'people', and 'believe' were the most frequent features.
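
A hedged sketch of the timing comparison described above, with randomly generated placeholder documents standing in for the 286 verse translations (the real study uses the Tafseer text; the vocabulary, cluster count behavior, and timings here are illustrative only):

```python
import random
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, MiniBatchKMeans

# placeholder documents standing in for the 286 translated verses of Al-Baqarah
random.seed(0)
vocab = ["god", "people", "believe", "mercy", "guidance", "prayer", "charity", "earth"]
docs = [" ".join(random.choices(vocab, k=12)) for _ in range(286)]

X = TfidfVectorizer().fit_transform(docs)
X2 = PCA(n_components=2).fit_transform(X.toarray())   # 2-D projection for plotting

# compare the running time of the two algorithms with k = 7
for Algo in (KMeans, MiniBatchKMeans):
    start = time.time()
    labels = Algo(n_clusters=7, n_init=10, random_state=0).fit_predict(X2)
    print(f"{Algo.__name__}: {time.time() - start:.5f}s")
```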


2015 ◽  
Vol 22 (5) ◽  
pp. 687-726 ◽  
Author(s):  
MARCELO L. ERRECALDE ◽  
LETICIA C. CAGNINA ◽  
PAOLO ROSSO

This article presents silhouette-attraction (Sil-Att), a simple and effective method for text clustering, which is based on two main concepts: the silhouette coefficient and the idea of attraction. The combination of both principles yields a general technique that can be used either as a boosting method, which improves the results of other clustering algorithms, or as an independent clustering algorithm. The experimental work shows that Sil-Att obtains high-quality results on text corpora with very different characteristics. Furthermore, its stable performance on all the considered corpora indicates that it is a very robust method. This is a positive aspect of Sil-Att with respect to the other algorithms used in the experiments, whose performance depends heavily on specific characteristics of the corpora being considered.
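
The silhouette coefficient at the core of Sil-Att can be computed per document with scikit-learn; the sketch below shows only this ingredient (on synthetic data standing in for a vectorized corpus), not the attraction step or the full boosting procedure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# toy data standing in for a vectorized text corpus
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("mean silhouette:", silhouette_score(X, labels))
per_doc = silhouette_samples(X, labels)   # one coefficient per document
weak = np.where(per_doc < 0)[0]           # documents that fit their cluster poorly
print("poorly placed documents:", len(weak))
```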


2021 ◽  
Vol 13 (2) ◽  
pp. 647
Author(s):  
Ruomu Miao ◽  
Yuxia Wang ◽  
Shuang Li

With the development of Web 2.0 and the mobile Internet, urban residents, as a new type of "sensor", provide us with massive amounts of volunteered geographic information (VGI). Quantifying the spatial patterns of VGI plays an increasingly important role in understanding and developing urban spatial functions. Using VGI and social media activity data, this article develops a method to automatically extract and identify urban spatial patterns and functional zones. The method is demonstrated on the case of Beijing, China, and includes the following three steps: (1) obtain multi-source urban spatial data, such as Weibo data (a Chinese equivalent of Twitter), OpenStreetMap, population data, etc.; (2) use a hierarchical clustering algorithm, the term frequency-inverse document frequency (TF-IDF) method, and improved k-means clustering algorithms to identify functional zones; (3) compare the identified results with the actual urban land uses and verify their accuracy. The experimental results show that our method can effectively identify urban functional zones; the results provide new ideas for the study of urban spatial patterns and are of great significance for optimizing urban spatial planning.
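
A rough sketch of step (2) above, with a handful of hypothetical zone "documents" built from activity/POI category tags (the paper derives these from Weibo and OpenStreetMap data); plain k-means stands in for the paper's improved variant:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# each "document" is the bag of activity/POI category tags observed in one zone
# (hypothetical tags for illustration only)
zones = [
    "restaurant cafe shopping mall cinema",
    "office bank company office office",
    "park museum temple park",
    "apartment school kindergarten apartment",
]

X = TfidfVectorizer().fit_transform(zones).toarray()

# hierarchical clustering to explore the number of functional-zone types
h_labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")

# k-means (standing in for the improved variant) for the final partition
k_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(h_labels, k_labels)
```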


2021 ◽  
Vol 18 (4) ◽  
pp. 1306-1311
Author(s):  
S. Sarannya ◽  
M. Venkatesan ◽  
Prabhavathy Panner

Text clustering has nowadays become a major technique in many fields, including data mining and natural language processing. It is also broadly used for information retrieval and the assimilation of textual data. The majority of previous work focuses on clustering algorithms in which feature extraction is performed without considering the semantic meaning of a word in its context. In this work, we introduce a double clustering algorithm using k-means, in conjunction with a Bidirectional Long Short-Term Memory (BiLSTM) network and a Convolutional Neural Network (CNN) for feature extraction, so that semantic meaning is also considered. Recurrent neural networks (RNNs) are able to learn the long-term dependencies present in the input, whereas CNN models have long been known to be effective at extracting local features from input data. Unlike previous work, the proposed approach carries out feature extraction and document clustering as one combined mechanism: the clustering result is sent back to the model as feedback, dynamically optimizing the parameters of the network. Clustering is implemented in a double-clustering manner, which increases time efficiency.
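
A minimal PyTorch sketch of the kind of CNN + BiLSTM feature extractor described above, whose output is then clustered with k-means; the layer sizes, vocabulary, and random token ids are placeholders, and the feedback loop that retunes the network from the clustering result is omitted:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class ConvBiLSTMEncoder(nn.Module):
    """Toy feature extractor: a 1-D convolution for local features,
    followed by a bidirectional LSTM for long-range dependencies."""
    def __init__(self, vocab_size=5000, emb=64, conv=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.conv = nn.Conv1d(emb, conv, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)      # (batch, emb, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2)   # (batch, seq_len, conv)
        _, (h, _) = self.lstm(x)                       # h: (2, batch, hidden)
        return torch.cat([h[0], h[1]], dim=1)          # (batch, 2 * hidden)

# random token ids standing in for a tokenized document corpus
tokens = torch.randint(0, 5000, (32, 40))
with torch.no_grad():
    feats = ConvBiLSTMEncoder()(tokens).numpy()
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(feats)
print(labels)
```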

