A Document Clustering Algorithm Based on Semi-constrained Hierarchical Latent Dirichlet Allocation

An improved ant algorithm with LDA-based representation for text document clustering

Journal of Information Science ◽

10.1177/0165551516638784 ◽

2016 ◽

Vol 43 (2) ◽

pp. 275-292 ◽

Cited By ~ 24

Author(s):

Aytug Onan ◽

Hasan Bulut ◽

Serdar Korukoglu

Keyword(s):

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Clustering Algorithms ◽

Document Clustering ◽

Clustering Methods ◽

Initial Value ◽

Text Document ◽

Clustering Quality ◽

Text Features

Document clustering can be applied in document organisation and browsing, document summarisation and classification. The identification of an appropriate representation for textual documents is extremely important for the performance of clustering or classification algorithms. Textual documents suffer from the high dimensionality and irrelevancy of text features. Besides, conventional clustering algorithms suffer from several shortcomings, such as slow convergence and sensitivity to the initial value. To tackle the problems of conventional clustering algorithms, metaheuristic algorithms are frequently applied to clustering. In this paper, an improved ant clustering algorithm is presented, where two novel heuristic methods are proposed to enhance the clustering quality of ant-based clustering. In addition, the latent Dirichlet allocation (LDA) is used to represent textual documents in a compact and efficient way. The clustering quality of the proposed ant clustering algorithm is compared to the conventional clustering algorithms using 25 text benchmarks in terms of F-measure values. The experimental results indicate that the proposed clustering scheme outperforms the compared conventional and metaheuristic clustering methods for textual documents.

Download Full-text

Document Clustering Dengan Latent Dirichlet Allocation dan Ward Hierarichal Clustering

Pseudocode ◽

10.33369/pseudocode.5.2.29-37 ◽

2018 ◽

Vol 5 (2) ◽

pp. 29-37

Author(s):

Guntur Budi Herwanto

Keyword(s):

Latent Dirichlet Allocation ◽

Document Clustering ◽

Dirichlet Allocation

Download Full-text

Using Latent Dirichlet Allocation for Topic Modeling and Document Clustering of Dumaguete City Twitter Dataset

Proceedings of the 2018 International Conference on Computing and Data Engineering - ICCDE 2018 ◽

10.1145/3219788.3219799 ◽

2018 ◽

Author(s):

Chuchi Montenegro ◽

Cerino Ligutom ◽

Jay Vincent Orio ◽

Dyannah Alexa Marie Ramacho

Keyword(s):

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Document Clustering ◽

Dirichlet Allocation

Download Full-text

Document Clustering and Topic Classification Using Latent Dirichlet Allocation

10.1109/icses52305.2021.9633830 ◽

2021 ◽

Author(s):

Meenu Gupta ◽

Abdul Wasi ◽

Ankit Verma ◽

Somesh Awasthi

Keyword(s):

Latent Dirichlet Allocation ◽

Document Clustering ◽

Dirichlet Allocation

Download Full-text

Topic Modelling and Clustering of Disaster-Related Tweets using Bilingual Latent Dirichlet Allocation and Incremental Clustering Algorithm with Support Vector Machines for Need Assessment

10.1109/icsecs52883.2021.00041 ◽

2021 ◽

Author(s):

Lady Angelica Buen Guerzo ◽

Hans Aaron O. Kilkenny ◽

Raphael Noel D. Osorio ◽

Andrei Hart E. Villegas ◽

Charmaine S. Ponay

Keyword(s):

Support Vector Machines ◽

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Support Vector ◽

Topic Modelling ◽

Incremental Clustering ◽

Need Assessment ◽

Vector Machines ◽

Dirichlet Allocation

Download Full-text

A Hybrid Model for Topic Modeling Using Latent Dirichlet Allocation and Feature Selection Method

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2019.8234 ◽

2019 ◽

Vol 16 (8) ◽

pp. 3367-3371

Author(s):

A. Christy ◽

Anto Praveena ◽

Jany Shabu

Keyword(s):

Feature Selection ◽

Hybrid Model ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Document Clustering ◽

Feature Selection Method ◽

Latent Semantic Indexing ◽

Selection Method ◽

Feature Reduction ◽

Dirichlet Allocation

In this information age, Knowledge discovery and pattern matching plays a significant role. Topic Modeling, an area of Text mining is used detecting hidden patterns in a document collection. Topic Modeling and Document Clustering are two important key terms which are similar in concepts and functionality. In this paper, topic modeling is carried out using Latent Dirichlet Allocation-Brute Force Method (LDA-BF), Latent Dirichlet Allocation-Back Tracking (LDA-BT), Latent Semantic Indexing (LSI) method and Nonnegative Matrix Factorization (NMF) method. A hybrid model is proposed which uses Latent Dirichlet Allocation (LDA) for extracting feature terms and Feature Selection (FS) method for feature reduction. The efficiency of document clustering depends upon the selection of good features. Topic modeling is performed by enriching the good features obtained through feature selection method. The proposed hybrid model produces improved accuracy than K-Means clustering method.

Download Full-text

Cluster Analysis for Internet Public Sentiment in Universities by Combining Methods

International Journal of Recent Contributions from Engineering Science & IT (iJES) ◽

10.3991/ijes.v6i3.9670 ◽

2018 ◽

Vol 6 (3) ◽

pp. 60

Author(s):

Na Zheng ◽

Jie Yu Wu

Keyword(s):

Cluster Analysis ◽

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Semantic Information ◽

Information Leakage ◽

Text Clustering ◽

Text Similarity ◽

Public Sentiment ◽

Combining Methods ◽

Dirichlet Allocation

A clustering method based on the Latent Dirichlet Allocation and the VSM model to compute the text similarity is presented. The Latent Dirichlet Allocation subject models and the VSM vector space model weights strategy are used respectively to calculate the text similarity. The linear combination of the two results is used to get the text similarity. Then the k-means clustering algorithm is chosen for cluster analysis. It can not only solve the deep semantic information leakage problems of traditional text clustering, but also solve the problem of the LDA that could not distinguish the texts because of too much dimension reduction. So the deep semantic information is mined from the text, and the clustering efficiency is improved. Through the comparisons with the traditional methods, the result shows that this algorithm can improve the performance of text clustering.

Download Full-text

Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling

Information ◽

10.3390/info11110518 ◽

2020 ◽

Vol 11 (11) ◽

pp. 518

Author(s):

Mubashar Mustafa ◽

Feng Zeng ◽

Hussain Ghulam ◽

Hafiz Muhammad Arslan

Keyword(s):

English Language ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Document Clustering ◽

Semantic Features ◽

Text Documents ◽

Proposed Model ◽

Probabilistic Topic Modeling ◽

Processing Techniques ◽

Dirichlet Allocation

Document clustering is to group documents according to certain semantic features. Topic model has a richer semantic structure and considerable potential for helping users to know document corpora. Unfortunately, this potential is stymied on text documents which have overlapping nature, due to their purely unsupervised nature. To solve this problem, some semi-supervised models have been proposed for English language. However, no such work is available for poor resource language Urdu. Therefore, document clustering has become a challenging task in Urdu language, which has its own morphology, syntax and semantics. In this study, we proposed a semi-supervised framework for Urdu documents clustering to deal with the Urdu morphology challenges. The proposed model is a combination of pre-processing techniques, seeded-LDA model and Gibbs sampling, we named it seeded-Urdu Latent Dirichlet Allocation (seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorizing. For the datasets, two conditions are considered for document clustering, one is “Dataset without overlapping” in which all classes have distinct nature. The other is “Dataset with overlapping” in which the categories are overlapping and the classes are connected to each other. The aim of this study is threefold: it first shows that unsupervised models (Latent Dirichlet Allocation (LDA), Non-negative matrix factorization (NMF) and K-means) are giving satisfying results on the dataset without overlapping. Second, it shows that these unsupervised models are not performing well on the dataset with overlapping, because, on this dataset, these algorithms find some topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, our proposed semi-supervised model Seeded-ULDA performs well on both datasets because this model is straightforward and effective to instruct topic models to find topics of specific interest. It is shown in this paper that the semi-supervised model, Seeded-ULDA, provides significant results as compared to unsupervised algorithms.

Download Full-text

An object-oriented clustering algorithm for VHR panchromatic images using nonparametric latent Dirichlet allocation

2012 IEEE International Geoscience and Remote Sensing Symposium ◽

10.1109/igarss.2012.6351028 ◽

2012 ◽

Cited By ~ 1

Author(s):

Yinfeng Qi ◽

Hong Tang ◽

Yang Shu ◽

Li Shen ◽

Jianwei Yue ◽

...

Keyword(s):

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Object Oriented ◽

Dirichlet Allocation

Download Full-text

A MRF-Based Clustering Algorithm for Remote Sensing Images by Using the Latent Dirichlet Allocation Model

Procedia Earth and Planetary Science ◽

10.1016/j.proeps.2011.09.056 ◽

2011 ◽

Vol 2 ◽

pp. 358-363 ◽

Cited By ~ 3

Author(s):

Hong Tang ◽

Li Shen ◽

Xin Yang ◽

Yinfeng Qi ◽

Weiguo Jiang ◽

...

Keyword(s):

Remote Sensing ◽

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Remote Sensing Images ◽

Allocation Model ◽

Latent Dirichlet Allocation Model ◽

Dirichlet Allocation

Download Full-text