scholarly journals Document Clustering Dengan Latent Dirichlet Allocation dan Ward Hierarichal Clustering

Pseudocode ◽  
2018 ◽  
Vol 5 (2) ◽  
pp. 29-37
Author(s):  
Guntur Budi Herwanto
2021 ◽  
Author(s):  
Meenu Gupta ◽  
Abdul Wasi ◽  
Ankit Verma ◽  
Somesh Awasthi

2019 ◽  
Vol 16 (8) ◽  
pp. 3367-3371
Author(s):  
A. Christy ◽  
Anto Praveena ◽  
Jany Shabu

In this information age, Knowledge discovery and pattern matching plays a significant role. Topic Modeling, an area of Text mining is used detecting hidden patterns in a document collection. Topic Modeling and Document Clustering are two important key terms which are similar in concepts and functionality. In this paper, topic modeling is carried out using Latent Dirichlet Allocation-Brute Force Method (LDA-BF), Latent Dirichlet Allocation-Back Tracking (LDA-BT), Latent Semantic Indexing (LSI) method and Nonnegative Matrix Factorization (NMF) method. A hybrid model is proposed which uses Latent Dirichlet Allocation (LDA) for extracting feature terms and Feature Selection (FS) method for feature reduction. The efficiency of document clustering depends upon the selection of good features. Topic modeling is performed by enriching the good features obtained through feature selection method. The proposed hybrid model produces improved accuracy than K-Means clustering method.


Information ◽  
2020 ◽  
Vol 11 (11) ◽  
pp. 518
Author(s):  
Mubashar Mustafa ◽  
Feng Zeng ◽  
Hussain Ghulam ◽  
Hafiz Muhammad Arslan

Document clustering is to group documents according to certain semantic features. Topic model has a richer semantic structure and considerable potential for helping users to know document corpora. Unfortunately, this potential is stymied on text documents which have overlapping nature, due to their purely unsupervised nature. To solve this problem, some semi-supervised models have been proposed for English language. However, no such work is available for poor resource language Urdu. Therefore, document clustering has become a challenging task in Urdu language, which has its own morphology, syntax and semantics. In this study, we proposed a semi-supervised framework for Urdu documents clustering to deal with the Urdu morphology challenges. The proposed model is a combination of pre-processing techniques, seeded-LDA model and Gibbs sampling, we named it seeded-Urdu Latent Dirichlet Allocation (seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorizing. For the datasets, two conditions are considered for document clustering, one is “Dataset without overlapping” in which all classes have distinct nature. The other is “Dataset with overlapping” in which the categories are overlapping and the classes are connected to each other. The aim of this study is threefold: it first shows that unsupervised models (Latent Dirichlet Allocation (LDA), Non-negative matrix factorization (NMF) and K-means) are giving satisfying results on the dataset without overlapping. Second, it shows that these unsupervised models are not performing well on the dataset with overlapping, because, on this dataset, these algorithms find some topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, our proposed semi-supervised model Seeded-ULDA performs well on both datasets because this model is straightforward and effective to instruct topic models to find topics of specific interest. It is shown in this paper that the semi-supervised model, Seeded-ULDA, provides significant results as compared to unsupervised algorithms.


Author(s):  
Priyanka R. Patil ◽  
Shital A. Patil

Similarity View is an application for visually comparing and exploring multiple models of text and collection of document. Friendbook finds ways of life of clients from client driven sensor information, measures the closeness of ways of life amongst clients, and prescribes companions to clients if their ways of life have high likeness. Roused by demonstrate a clients day by day life as life records, from their ways of life are separated by utilizing the Latent Dirichlet Allocation Algorithm. Manual techniques can't be utilized for checking research papers, as the doled out commentator may have lacking learning in the exploration disciplines. For different subjective views, causing possible misinterpretations. An urgent need for an effective and feasible approach to check the submitted research papers with support of automated software. A method like text mining method come to solve the problem of automatically checking the research papers semantically. The proposed method to finding the proper similarity of text from the collection of documents by using Latent Dirichlet Allocation (LDA) algorithm and Latent Semantic Analysis (LSA) with synonym algorithm which is used to find synonyms of text index wise by using the English wordnet dictionary, another algorithm is LSA without synonym used to find the similarity of text based on index. LSA with synonym rate of accuracy is greater when the synonym are consider for matching.


2021 ◽  
Vol 920 ◽  
Author(s):  
Mohamed Frihat ◽  
Bérengère Podvin ◽  
Lionel Mathelin ◽  
Yann Fraigneau ◽  
François Yvon

Abstract


Sign in / Sign up

Export Citation Format

Share Document