Implementasi Latent Dirichlet Allocation (LDA) untuk Klasterisasi Cerita Berbahasa Bali [Implementation of Latent Dirichlet Allocation (LDA) for Clustering Balinese-Language Stories]

2021 ◽  
Vol 8 (1) ◽  
pp. 127
Author(s):  
Ngurah Agus Sanjaya ER

<p class="Abstrak"><em><strong>Abstract</strong></em></p><p class="Abstrak"><em>Balinese-language stories cover diverse topics but carry local wisdom that deserves to be preserved. Grouping the stories by topic would make it much easier for readers to choose what to read. Latent Dirichlet Allocation (LDA) assumes that a document is generated from a mixture of hidden topics. By applying LDA to a collection of documents (corpus), the distribution of hidden topics can be identified both across the corpus as a whole and within each individual document. In this research, the per-document topic distributions that LDA finds in a collection of Balinese-language stories are used to group the stories automatically. The research stages comprise story digitization, tokenization, case-folding, stemming, topic discovery with LDA, document representation, and agglomerative hierarchical clustering. Evaluation was carried out on 100 Balinese-language stories obtained from online sites and the Bali Provincial Cultural Office, measuring the accuracy of the clustering results as well as the effect of the number of words and the similarity measure used. The highest clustering accuracy obtained was 62%, achieved when 3000 words were used to represent each document. The results also show that clustering accuracy is strongly influenced by the similarity measure used when merging documents and by the number of words in the document representation.</em></p>
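The pipeline described in this abstract can be sketched in a few lines: LDA turns each document into a topic distribution, which then serves as the feature vector for agglomerative clustering. This is an illustrative sketch, not the paper's implementation; the tiny corpus and all parameter values are invented placeholders, and the paper additionally applies stemming and compares several similarity measures.

```python
# Sketch of the pipeline above: per-document LDA topic distributions
# used as features for agglomerative hierarchical clustering.
# Corpus and parameters are illustrative placeholders only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import AgglomerativeClustering

docs = [
    "raja kerajaan cerita satua bali",
    "satua bali raja kerajaan putri",
    "padi sawah petani subak air",
    "petani subak sawah padi tani",
]

# Term-document counts (tokenization and case-folding are handled by the
# vectorizer; stemming would precede this step in the paper's pipeline).
counts = CountVectorizer(lowercase=True).fit_transform(docs)

# Per-document topic distributions from LDA.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # shape: (n_docs, n_topics)

# Agglomerative clustering on the topic distributions (default Ward
# linkage here; the paper evaluates several similarity measures).
labels = AgglomerativeClustering(n_clusters=2).fit_predict(doc_topics)
print(labels)
```

Each row of `doc_topics` sums to one, so any distribution-aware distance (cosine, Hellinger, Jensen-Shannon) is a reasonable alternative to the Euclidean default when merging documents.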

2021 ◽  
Vol 13 (19) ◽  
pp. 10856
Author(s):  
I-Cheng Chang ◽  
Tai-Kuei Yu ◽  
Yu-Jie Chang ◽  
Tai-Yi Yu

Facing the big data wave, this study applied artificial intelligence to extract knowledge and identify a feasible process for supplying innovative value in environmental education. Intelligent agents and natural language processing (NLP) are two key areas leading the trend in artificial intelligence; this research adopted NLP to analyze the research topics of environmental education research journals in the Web of Science (WoS) database during 2011–2020 and to interpret the categories and characteristics of abstracts of environmental education papers. The corpus was drawn from the abstracts and keywords of research journal papers, which were analyzed with text mining, cluster analysis, latent Dirichlet allocation (LDA), and co-word analysis. The classification of feature words was determined and reviewed by domain experts, and the associated TF-IDF weights were calculated for the subsequent cluster analysis, which combined hierarchical clustering and K-means analysis. Hierarchical clustering and LDA both indicated seven as the required number of categories, and K-means cluster analysis classified the overall documents into seven categories. This study used co-word analysis to check the suitability of the K-means classification, analyzed the terms with high TF-IDF weights for distinct K-means groups, and examined the terms for different topics with the LDA technique. A comparison of the results showed that most categories recognized by the K-means and LDA methods were the same and shared similar words; however, two categories differed slightly. The involvement of field experts supported the consistency and correctness of the classified topics and documents.
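The clustering stage described above follows a common recipe: TF-IDF weighting, hierarchical clustering to suggest the number of categories, then K-means for the final assignment. The sketch below illustrates that recipe under stated assumptions: the abstracts are invented stand-ins for the WoS corpus, and the dendrogram is cut into 2 groups rather than the paper's 7.

```python
# Sketch of the clustering stage: TF-IDF features, hierarchical
# clustering to suggest k, then K-means for the final grouping.
# Data and k are illustrative, not the paper's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

abstracts = [
    "environmental education curriculum school students",
    "school curriculum teaching environmental literacy",
    "climate change attitudes behavior survey",
    "survey climate behavior intention attitudes",
]

X = TfidfVectorizer().fit_transform(abstracts).toarray()

# Hierarchical clustering (Ward linkage); cut the dendrogram into 2
# groups here instead of the paper's 7 categories.
tree = linkage(X, method="ward")
hier_labels = fcluster(tree, t=2, criterion="maxclust")

# K-means with the number of clusters suggested by the hierarchy.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hier_labels, km_labels)
```

In practice the two label sets are compared (as the paper does between K-means and LDA categories) to check that the partitions agree.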


PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0243208
Author(s):  
Leacky Muchene ◽  
Wende Safari

Unsupervised statistical analysis of unstructured data has gained wide acceptance, especially in the natural language processing and text mining domains. Topic modelling with Latent Dirichlet Allocation is one such statistical tool that has been successfully applied to synthesize collections of legal and biomedical documents as well as journalistic topics. We applied a novel two-stage topic modelling approach and illustrated the methodology with data from a collection of published abstracts from the University of Nairobi, Kenya. In the first stage, topic modelling with Latent Dirichlet Allocation was applied to derive the per-document topic probabilities. To present the topics more succinctly, in the second stage, hierarchical clustering with the Hellinger distance was applied to derive the final clusters of topics. The analysis showed that the dominant research themes in the university include HIV and malaria research, research on agricultural and veterinary services, and cross-cutting themes in the humanities and social sciences. Further, the use of hierarchical clustering in the second stage reduces the discovered latent topics to clusters of homogeneous topics.


Author(s):  
P. M. Prihatini ◽  
I. K. G. D. Putra ◽  
I. A. D. Giriantari ◽  
M. Sudarma

Agglomerative hierarchical clustering is a bottom-up clustering method in which the distances between documents can be obtained by extracting feature values with a topic-based latent Dirichlet allocation method. To reduce the number of features, term selection can be performed using Luhn's Idea. Together, these methods can build better document clusters, but little research has discussed them. Therefore, in this research, the term-weighting calculation uses Luhn's Idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs-sampling latent Dirichlet allocation combined with term frequency and the fuzzy Sugeno method. The feature values serve as the distances between documents, which are clustered with the single-, complete-, and average-link algorithms. The evaluations show that feature extraction with and without the lower cut-off differs little, but topic assignment per term based on term frequency and the fuzzy Sugeno method outperforms the Tsukamoto method in finding relevant documents. Using the lower cut-off and fuzzy Sugeno Gibbs latent Dirichlet allocation with complete-link agglomerative hierarchical clustering yields consistent metric values. This clustering method is suggested as a better approach for clustering documents in closer agreement with the gold standard.
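Luhn's Idea, mentioned above, selects terms by frequency rank: the most frequent terms are too common to discriminate and the rarest are too infrequent to matter, so only terms between an upper and a lower cut-off are kept. A small sketch follows; the cut-off values are illustrative, not the paper's.

```python
# Sketch of term selection via Luhn's Idea: rank terms by collection
# frequency and keep only those between an upper and a lower cut-off.
# The cut-off values here are illustrative placeholders.
from collections import Counter

docs = [
    ["the", "king", "ruled", "the", "land"],
    ["the", "farmer", "tilled", "the", "land"],
    ["the", "king", "met", "the", "farmer"],
]

freq = Counter(term for doc in docs for term in doc)
ranked = [t for t, _ in freq.most_common()]

# Drop the single most frequent term (upper cut-off) and any term
# occurring fewer than 2 times in the collection (lower cut-off).
upper_cut, lower_cut = 1, 2
selected = {t for t in ranked[upper_cut:] if freq[t] >= lower_cut}
print(sorted(selected))
```

In the tiny corpus above, "the" is removed by the upper cut-off and the hapax terms ("ruled", "tilled", "met") by the lower cut-off, leaving the mid-frequency content words.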


Author(s):  
Mohana Priya K ◽  
Pooja Ragavi S ◽  
Krishna Priya G

Clustering is the process of grouping objects into subsets that are meaningful in the context of a particular problem. It does not rely on predefined classes and is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used in different applications; sentence clustering is one of the most useful techniques. A hierarchical clustering algorithm is applied at multiple levels for accuracy. For tagging, a POS tagger and the Porter stemmer are used. The WordNet dictionary is used to determine similarity via the Jiang-Conrath and cosine similarity measures. Grouping is performed with respect to the highest similarity value against a mean threshold. This paper incorporates many parameters for finding similarity between words. To identify disambiguated words, sense identification is performed for the adjectives and the results are compared. The SemCor and machine-learning datasets are employed. Compared with previous results for word sense disambiguation (WSD), our work improves considerably, achieving 91.2%.
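The grouping rule above (merge when pairwise similarity exceeds a mean-based threshold) can be sketched as follows. Note the hedge: plain cosine similarity over bags of words stands in here for the WordNet-based Jiang-Conrath measure used in the paper, and the sentences are invented.

```python
# Sketch of threshold-based sentence grouping: pairs whose similarity
# exceeds the mean pairwise similarity are grouped. Cosine over bags of
# words stands in for the paper's WordNet-based Jiang-Conrath measure.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sentences as bags of words."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

sentences = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
]

# All pairwise similarities, then a mean threshold as the grouping rule.
pairs = [(i, j, cosine(sentences[i], sentences[j]))
         for i in range(len(sentences)) for j in range(i + 1, len(sentences))]
threshold = sum(s for _, _, s in pairs) / len(pairs)
groups = [(i, j) for i, j, s in pairs if s > threshold]
print(groups)
```

With a sense-aware measure such as Jiang-Conrath, the same thresholding logic applies; only the `cosine` function would be swapped for a WordNet-based similarity.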


Author(s):  
Priyanka R. Patil ◽  
Shital A. Patil

Similarity View is an application for visually comparing and exploring multiple models of text across a collection of documents. Friendbook discovers users' lifestyles from user-centric sensor data, measures the similarity of lifestyles among users, and recommends friends to users whose lifestyles are highly similar; motivated by this, a user's daily life is modeled as life documents, from which lifestyles are extracted using the Latent Dirichlet Allocation algorithm. Manual techniques cannot be relied on for checking research papers, since the assigned reviewer may have insufficient knowledge of the research discipline, and differing subjective views can cause misinterpretations. There is an urgent need for an effective and feasible approach to checking submitted research papers with the support of automated software. Text mining methods can solve the problem of automatically checking research papers semantically. The proposed method finds the similarity of text across the collection of documents using the Latent Dirichlet Allocation (LDA) algorithm and Latent Semantic Analysis (LSA): an LSA-with-synonyms variant finds synonyms of each indexed term using the English WordNet dictionary, while an LSA-without-synonyms variant measures the similarity of text on the index alone. The accuracy of LSA with synonyms is higher when synonyms are considered for matching.
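The LSA comparison described above can be illustrated briefly: documents are projected into a low-rank latent space with truncated SVD, optionally after expanding tokens with synonyms. This is a sketch under stated assumptions; the tiny hand-made synonym table stands in for the English WordNet lookup, and the documents are invented.

```python
# Illustrative sketch of LSA with synonym expansion: canonicalize tokens
# via a synonym table, vectorize with TF-IDF, reduce with truncated SVD,
# then compare documents in the latent space. The synonym table is a
# hypothetical stand-in for the English WordNet lookup.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

SYNONYMS = {"car": "automobile", "quick": "fast"}   # hypothetical lookup

def expand(doc):
    # Map each token to a canonical synonym before vectorizing.
    return " ".join(SYNONYMS.get(t, t) for t in doc.lower().split())

docs = [
    "the quick car passed the truck",
    "a fast automobile overtook a lorry",
    "latent semantic analysis finds hidden structure",
]

X = TfidfVectorizer().fit_transform([expand(d) for d in docs])
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
sims = cosine_similarity(Z)
print(sims[0, 1] > sims[0, 2])   # synonym expansion makes docs 0 and 1 similar
```

Without the `expand` step, documents 0 and 1 share no surface tokens, which is exactly the gap the LSA-with-synonyms variant is meant to close.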


Author(s):  
Ana Belén Ramos-Guajardo

A new clustering method for random intervals that are measured in the same units over the same group of individuals is provided. It takes into account the degree of similarity between the expected values of the random intervals, which can be analyzed by means of a two-sample similarity bootstrap test. The expectations of each pair of random intervals are compared through that test, and a p-value matrix is obtained. The suggested clustering algorithm operates on this matrix, where each p-value can be seen as a kind of similarity between the random intervals. The algorithm is iterative and includes an objective stopping criterion that leads to statistically similar clusters that are different from each other. Simulations are developed to show the empirical performance of the proposal, and the approach is applied to two real-life situations.
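The core idea above, clustering from a p-value matrix with a significance-based stopping rule, can be roughly sketched as follows. This is a loose stand-in, not the paper's algorithm: greedy complete-link agglomeration replaces the actual iterative procedure, and the p-values are invented rather than computed from a bootstrap test.

```python
# Rough sketch: treat a symmetric matrix of pairwise-test p-values as a
# similarity matrix and merge groups greedily while the best remaining
# link exceeds a significance level alpha. Values are invented; the
# paper's iterative algorithm and stopping rule are more elaborate.
import numpy as np

# p[i, j]: p-value of the two-sample similarity test between intervals
# i and j (a high p-value means no evidence the expectations differ).
p = np.array([
    [1.00, 0.90, 0.02, 0.01],
    [0.90, 1.00, 0.03, 0.02],
    [0.02, 0.03, 1.00, 0.80],
    [0.01, 0.02, 0.80, 1.00],
])
alpha = 0.05

# Greedy agglomeration: the similarity between two groups is the minimum
# p-value across their members (complete-link style); stop when no pair
# of groups remains with similarity above alpha.
clusters = [{i} for i in range(len(p))]
while True:
    best, pair = alpha, None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            link = min(p[i, j] for i in clusters[a] for j in clusters[b])
            if link > best:
                best, pair = link, (a, b)
    if pair is None:        # objective stopping criterion reached
        break
    a, b = pair
    clusters[a] |= clusters.pop(b)

print(clusters)
```

Stopping when every cross-group p-value falls below alpha mirrors the stated goal: clusters that are statistically similar internally and different from each other.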

