Implementasi Latent Dirichlet Allocation (LDA) untuk Klasterisasi Cerita Berbahasa Bali [Implementation of Latent Dirichlet Allocation (LDA) for Clustering Balinese-Language Stories]

2021 ◽  
Vol 8 (1) ◽  
pp. 127
Author(s):  
Ngurah Agus Sanjaya ER

<p class="Abstrak"><em><strong>Abstract</strong></em></p><p class="Abstrak"><em>Balinese-language stories cover diverse topics but carry local wisdom that deserves to be preserved. Grouping the stories by topic would make it much easier for readers to choose what to read. Latent Dirichlet Allocation (LDA) assumes that a document is generated from a mixture of hidden topics. By applying LDA to a collection of documents (corpus), the distribution of hidden topics can be identified both across the corpus as a whole and within each individual document. In this research, the per-document topic distributions that LDA finds in a collection of Balinese-language stories are used to group the stories automatically. The research stages comprise story digitization, tokenization, case-folding, stemming, topic discovery with LDA, document representation, and agglomerative hierarchical clustering. Evaluation was carried out on 100 Balinese-language stories obtained from online sites and the Bali Provincial Cultural Office, measuring the accuracy of the clustering results as well as the effect of the number of words and the similarity measure used. The highest clustering accuracy obtained was 62%, achieved when 3000 words were used to represent each document. The results also show that clustering accuracy is strongly influenced by the similarity measure used when merging documents and by the number of words in the document representation.</em></p>
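The pipeline described in this abstract can be sketched in a few lines: LDA turns each document into a topic distribution, which then serves as the feature vector for agglomerative clustering. This is an illustrative sketch, not the paper's implementation; the tiny corpus and all parameter values are invented placeholders, and the paper additionally applies stemming and compares several similarity measures.

```python
# Sketch of the pipeline above: per-document LDA topic distributions
# used as features for agglomerative hierarchical clustering.
# Corpus and parameters are illustrative placeholders only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import AgglomerativeClustering

docs = [
    "raja kerajaan cerita satua bali",
    "satua bali raja kerajaan putri",
    "padi sawah petani subak air",
    "petani subak sawah padi tani",
]

# Term-document counts (tokenization and case-folding are handled by the
# vectorizer; stemming would precede this step in the paper's pipeline).
counts = CountVectorizer(lowercase=True).fit_transform(docs)

# Per-document topic distributions from LDA.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # shape: (n_docs, n_topics)

# Agglomerative clustering on the topic distributions (default Ward
# linkage here; the paper evaluates several similarity measures).
labels = AgglomerativeClustering(n_clusters=2).fit_predict(doc_topics)
print(labels)
```

Each row of `doc_topics` sums to one, so any distribution-aware distance (cosine, Hellinger, Jensen-Shannon) is a reasonable alternative to the Euclidean default when merging documents.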

2021 ◽  
Vol 13 (19) ◽  
pp. 10856
Author(s):  
I-Cheng Chang ◽  
Tai-Kuei Yu ◽  
Yu-Jie Chang ◽  
Tai-Yi Yu

Facing the big data wave, this study applied artificial intelligence to extract knowledge and identify a feasible process for supplying innovative value in environmental education. Intelligent agents and natural language processing (NLP) are two key areas leading the trend in artificial intelligence; this research adopted NLP to analyze the research topics of environmental education research journals in the Web of Science (WoS) database during 2011–2020 and to interpret the categories and characteristics of abstracts of environmental education papers. The corpus was drawn from the abstracts and keywords of research journal papers, which were analyzed with text mining, cluster analysis, latent Dirichlet allocation (LDA), and co-word analysis. The classification of feature words was determined and reviewed by domain experts, and the associated TF-IDF weights were calculated for the subsequent cluster analysis, which combined hierarchical clustering and K-means analysis. Hierarchical clustering and LDA both indicated seven as the required number of categories, and K-means cluster analysis classified the overall documents into seven categories. This study used co-word analysis to check the suitability of the K-means classification, analyzed the terms with high TF-IDF weights for distinct K-means groups, and examined the terms for different topics with the LDA technique. A comparison of the results showed that most categories recognized by the K-means and LDA methods were the same and shared similar words; however, two categories differed slightly. The involvement of field experts supported the consistency and correctness of the classified topics and documents.
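The clustering stage described above follows a common recipe: TF-IDF weighting, hierarchical clustering to suggest the number of categories, then K-means for the final assignment. The sketch below illustrates that recipe under stated assumptions: the abstracts are invented stand-ins for the WoS corpus, and the dendrogram is cut into 2 groups rather than the paper's 7.

```python
# Sketch of the clustering stage: TF-IDF features, hierarchical
# clustering to suggest k, then K-means for the final grouping.
# Data and k are illustrative, not the paper's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

abstracts = [
    "environmental education curriculum school students",
    "school curriculum teaching environmental literacy",
    "climate change attitudes behavior survey",
    "survey climate behavior intention attitudes",
]

X = TfidfVectorizer().fit_transform(abstracts).toarray()

# Hierarchical clustering (Ward linkage); cut the dendrogram into 2
# groups here instead of the paper's 7 categories.
tree = linkage(X, method="ward")
hier_labels = fcluster(tree, t=2, criterion="maxclust")

# K-means with the number of clusters suggested by the hierarchy.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hier_labels, km_labels)
```

In practice the two label sets are compared (as the paper does between K-means and LDA categories) to check that the partitions agree.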


PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0243208
Author(s):  
Leacky Muchene ◽  
Wende Safari

Unsupervised statistical analysis of unstructured data has gained wide acceptance, especially in the natural language processing and text mining domains. Topic modelling with Latent Dirichlet Allocation is one such statistical tool that has been successfully applied to synthesize collections of legal and biomedical documents as well as journalistic topics. We applied a novel two-stage topic modelling approach and illustrated the methodology with data from a collection of published abstracts from the University of Nairobi, Kenya. In the first stage, topic modelling with Latent Dirichlet Allocation was applied to derive the per-document topic probabilities. To present the topics more succinctly, in the second stage, hierarchical clustering with the Hellinger distance was applied to derive the final clusters of topics. The analysis showed that the dominant research themes in the university include HIV and malaria research, research on agricultural and veterinary services, and cross-cutting themes in the humanities and social sciences. Further, the use of hierarchical clustering in the second stage reduces the discovered latent topics to clusters of homogeneous topics.


Author(s):  
P. M. Prihatini ◽  
I. K. G. D. Putra ◽  
I. A. D. Giriantari ◽  
M. Sudarma

Agglomerative hierarchical clustering is a bottom-up clustering method in which the distances between documents can be obtained by extracting feature values with a topic-based latent Dirichlet allocation method. To reduce the number of features, term selection can be performed using Luhn's Idea. Together, these methods can build better document clusters, but little research has discussed them. Therefore, in this research, the term-weighting calculation uses Luhn's Idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs-sampling latent Dirichlet allocation combined with term frequency and the fuzzy Sugeno method. The feature values serve as the distances between documents, which are clustered with the single-, complete-, and average-link algorithms. The evaluations show that feature extraction with and without the lower cut-off differs little, but topic assignment per term based on term frequency and the fuzzy Sugeno method outperforms the Tsukamoto method in finding relevant documents. Using the lower cut-off and fuzzy Sugeno Gibbs latent Dirichlet allocation with complete-link agglomerative hierarchical clustering yields consistent metric values. This clustering method is suggested as a better approach for clustering documents in closer agreement with the gold standard.
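Luhn's Idea, mentioned above, selects terms by frequency rank: the most frequent terms are too common to discriminate and the rarest are too infrequent to matter, so only terms between an upper and a lower cut-off are kept. A small sketch follows; the cut-off values are illustrative, not the paper's.

```python
# Sketch of term selection via Luhn's Idea: rank terms by collection
# frequency and keep only those between an upper and a lower cut-off.
# The cut-off values here are illustrative placeholders.
from collections import Counter

docs = [
    ["the", "king", "ruled", "the", "land"],
    ["the", "farmer", "tilled", "the", "land"],
    ["the", "king", "met", "the", "farmer"],
]

freq = Counter(term for doc in docs for term in doc)
ranked = [t for t, _ in freq.most_common()]

# Drop the single most frequent term (upper cut-off) and any term
# occurring fewer than 2 times in the collection (lower cut-off).
upper_cut, lower_cut = 1, 2
selected = {t for t in ranked[upper_cut:] if freq[t] >= lower_cut}
print(sorted(selected))
```

In the tiny corpus above, "the" is removed by the upper cut-off and the hapax terms ("ruled", "tilled", "met") by the lower cut-off, leaving the mid-frequency content words.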


Author(s):  
Mohana Priya K ◽  
Pooja Ragavi S ◽  
Krishna Priya G

Clustering is the process of grouping objects into subsets that are meaningful in the context of a particular problem. It does not rely on predefined classes and is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used in different applications; sentence clustering is one of the most useful techniques. A hierarchical clustering algorithm is applied at multiple levels for accuracy. For tagging, a POS tagger and the Porter stemmer are used. The WordNet dictionary is used to determine similarity via the Jiang-Conrath and cosine similarity measures. Grouping is performed with respect to the highest similarity value against a mean threshold. This paper incorporates many parameters for finding similarity between words. To identify disambiguated words, sense identification is performed for the adjectives and the results are compared. The SemCor and machine-learning datasets are employed. Compared with previous results for word sense disambiguation (WSD), our work improves considerably, achieving 91.2%.
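The grouping rule above (merge when pairwise similarity exceeds a mean-based threshold) can be sketched as follows. Note the hedge: plain cosine similarity over bags of words stands in here for the WordNet-based Jiang-Conrath measure used in the paper, and the sentences are invented.

```python
# Sketch of threshold-based sentence grouping: pairs whose similarity
# exceeds the mean pairwise similarity are grouped. Cosine over bags of
# words stands in for the paper's WordNet-based Jiang-Conrath measure.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sentences as bags of words."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

sentences = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
]

# All pairwise similarities, then a mean threshold as the grouping rule.
pairs = [(i, j, cosine(sentences[i], sentences[j]))
         for i in range(len(sentences)) for j in range(i + 1, len(sentences))]
threshold = sum(s for _, _, s in pairs) / len(pairs)
groups = [(i, j) for i, j, s in pairs if s > threshold]
print(groups)
```

With a sense-aware measure such as Jiang-Conrath, the same thresholding logic applies; only the `cosine` function would be swapped for a WordNet-based similarity.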


Author(s):  
Priyanka R. Patil ◽  
Shital A. Patil

Similarity View is an application for visually comparing and exploring multiple models of text across a collection of documents. Friendbook discovers users' lifestyles from user-centric sensor data, measures the similarity of lifestyles among users, and recommends friends to users whose lifestyles are highly similar; motivated by this, a user's daily life is modeled as life documents, from which lifestyles are extracted using the Latent Dirichlet Allocation algorithm. Manual techniques cannot be relied on for checking research papers, since the assigned reviewer may have insufficient knowledge of the research discipline, and differing subjective views can cause misinterpretations. There is an urgent need for an effective and feasible approach to checking submitted research papers with the support of automated software. Text mining methods can solve the problem of automatically checking research papers semantically. The proposed method finds the similarity of text across the collection of documents using the Latent Dirichlet Allocation (LDA) algorithm and Latent Semantic Analysis (LSA): an LSA-with-synonyms variant finds synonyms of each indexed term using the English WordNet dictionary, while an LSA-without-synonyms variant measures the similarity of text on the index alone. The accuracy of LSA with synonyms is higher when synonyms are considered for matching.
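The LSA comparison described above can be illustrated briefly: documents are projected into a low-rank latent space with truncated SVD, optionally after expanding tokens with synonyms. This is a sketch under stated assumptions; the tiny hand-made synonym table stands in for the English WordNet lookup, and the documents are invented.

```python
# Illustrative sketch of LSA with synonym expansion: canonicalize tokens
# via a synonym table, vectorize with TF-IDF, reduce with truncated SVD,
# then compare documents in the latent space. The synonym table is a
# hypothetical stand-in for the English WordNet lookup.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

SYNONYMS = {"car": "automobile", "quick": "fast"}   # hypothetical lookup

def expand(doc):
    # Map each token to a canonical synonym before vectorizing.
    return " ".join(SYNONYMS.get(t, t) for t in doc.lower().split())

docs = [
    "the quick car passed the truck",
    "a fast automobile overtook a lorry",
    "latent semantic analysis finds hidden structure",
]

X = TfidfVectorizer().fit_transform([expand(d) for d in docs])
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
sims = cosine_similarity(Z)
print(sims[0, 1] > sims[0, 2])   # synonym expansion makes docs 0 and 1 similar
```

Without the `expand` step, documents 0 and 1 share no surface tokens, which is exactly the gap the LSA-with-synonyms variant is meant to close.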


Author(s):  
Ana Belén Ramos-Guajardo

A new clustering method for random intervals that are measured in the same units over the same group of individuals is provided. It takes into account the degree of similarity between the expected values of the random intervals, which can be analyzed by means of a two-sample similarity bootstrap test. The expectations of each pair of random intervals are compared through that test, and a p-value matrix is obtained. The suggested clustering algorithm operates on this matrix, where each p-value can be seen as a kind of similarity between the random intervals. The algorithm is iterative and includes an objective stopping criterion that leads to statistically similar clusters that are different from each other. Simulations are developed to show the empirical performance of the proposal, and the approach is applied to two real-life situations.
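The core idea above, clustering from a p-value matrix with a significance-based stopping rule, can be roughly sketched as follows. This is a loose stand-in, not the paper's algorithm: greedy complete-link agglomeration replaces the actual iterative procedure, and the p-values are invented rather than computed from a bootstrap test.

```python
# Rough sketch: treat a symmetric matrix of pairwise-test p-values as a
# similarity matrix and merge groups greedily while the best remaining
# link exceeds a significance level alpha. Values are invented; the
# paper's iterative algorithm and stopping rule are more elaborate.
import numpy as np

# p[i, j]: p-value of the two-sample similarity test between intervals
# i and j (a high p-value means no evidence the expectations differ).
p = np.array([
    [1.00, 0.90, 0.02, 0.01],
    [0.90, 1.00, 0.03, 0.02],
    [0.02, 0.03, 1.00, 0.80],
    [0.01, 0.02, 0.80, 1.00],
])
alpha = 0.05

# Greedy agglomeration: the similarity between two groups is the minimum
# p-value across their members (complete-link style); stop when no pair
# of groups remains with similarity above alpha.
clusters = [{i} for i in range(len(p))]
while True:
    best, pair = alpha, None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            link = min(p[i, j] for i in clusters[a] for j in clusters[b])
            if link > best:
                best, pair = link, (a, b)
    if pair is None:        # objective stopping criterion reached
        break
    a, b = pair
    clusters[a] |= clusters.pop(b)

print(clusters)
```

Stopping when every cross-group p-value falls below alpha mirrors the stated goal: clusters that are statistically similar internally and different from each other.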

