Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collapsed Gibbs Sampling Inference Process

Latent Dirichlet Allocation (LDA) is a probabilistic model that groups the hidden topics in documents into a predefined number of topics. If the number of topics K is chosen incorrectly, words correlate only weakly with their topics: a K that is too large or too small leads to inaccurate topic grouping when the training model is formed. This study aims to determine the optimal number of topics for a corpus in the LDA method using the maximum likelihood and Minimum Description Length (MDL) approaches. The experiments use Indonesian news articles in datasets of 25, 50, 90, and 600 documents, with corresponding word counts of 3898, 7760, 13005, and 4365. The results show that the maximum likelihood and MDL approaches yield the same optimal number of topics. The optimal number of topics is influenced by the alpha and beta parameters. In addition, the number of documents does not affect the computation time, but the number of words does. The computation times for the four datasets are 2.9721, 6.49637, 13.2967, and 3.7152 seconds, respectively. The optimised LDA topics were then used as a classification model. This experiment shows that the highest average accuracy is 61%, obtained with alpha 0.1 and beta 0.001.
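The inference process named in the title can be illustrated with a minimal sketch of collapsed Gibbs sampling for LDA. This is not the authors' implementation: the function names, the toy corpus, and the iteration count are assumptions made for illustration, and only the likelihood side of the comparison is shown (MDL is omitted). The defaults for alpha and beta are taken from the values reported in the abstract. The score is the standard log-likelihood of the words given the topic assignments, log p(w | z), computed from the topic-word counts.

```python
import math
import random


def collapsed_gibbs_lda(docs, vocab_size, num_topics,
                        alpha=0.1, beta=0.001, iters=50, seed=0):
    """Run collapsed Gibbs sampling; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]                 # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]    # topic-word counts
    n_k = [0] * num_topics                                  # tokens per topic
    z = []                                                  # topic of each token

    # Random initial assignment of every token to a topic.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional p(z = t | rest), up to a constant factor.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_dk, n_kw, n_k


def log_likelihood(n_kw, n_k, vocab_size, beta):
    """log p(w | z) with the topic-word distributions integrated out."""
    num_topics = len(n_k)
    ll = num_topics * (math.lgamma(vocab_size * beta)
                       - vocab_size * math.lgamma(beta))
    for k in range(num_topics):
        ll += sum(math.lgamma(c + beta) for c in n_kw[k])
        ll -= math.lgamma(n_k[k] + vocab_size * beta)
    return ll


# Toy corpus (hypothetical): word ids 0-3, three short documents.
toy_docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 2], [0, 1, 2, 3]]
for K in (1, 2, 3):
    _, n_kw, n_k = collapsed_gibbs_lda(toy_docs, vocab_size=4, num_topics=K)
    print(K, round(log_likelihood(n_kw, n_k, 4, 0.001), 3))
```

Comparing the printed scores across candidate values of K gives a rough likelihood-based ranking. In practice the scores would be averaged over several runs and seeds, since a single Gibbs chain is noisy.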


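Analisis Sentimen dan Pemodelan Topik Pariwisata Lombok Menggunakan Algoritma Naive Bayes dan Latent Dirichlet Allocation [Sentiment Analysis and Topic Modelling of Lombok Tourism Using the Naive Bayes Algorithm and Latent Dirichlet Allocation]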

Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) ◽

10.29207/resti.v5i1.2587 ◽

2021 ◽

Vol 5 (1) ◽

pp. 123-131

Author(s):

Ni Luh Putu Merawati Putu ◽

Ahmad Zuli Amrullah ◽

Ismarmiaty

Keyword(s):

Social Media ◽

Latent Dirichlet Allocation ◽

Naive Bayes ◽

Naïve Bayes ◽

Classification Model ◽

Bayes Method ◽

A Value ◽

Negative Class ◽

Naive Bayes Method ◽

Dirichlet Allocation

Lombok Island is one of the favorite tourist destinations. Tourists share many topics and comments about their Lombok tourism experiences through social media accounts, which makes public sentiment and topics difficult to identify manually. The opinions expressed by tourists through social media are therefore interesting for further research. This study aims to classify tourists' opinions into two classes, positive and negative, using the Naive Bayes method, and to model topics using Latent Dirichlet Allocation (LDA). The stages of this research include data collection, data cleaning, data transformation, and data classification. Performance testing of the Naive Bayes classification model shows an accuracy of 92%, precision of 100%, recall of 84%, and specificity of 100%. For LDA topic modelling in each class, the highest coherence value was obtained on the 8th topic for the positive class, with a value of 0.613, and on the 12th topic for the negative class, with a value of 0.528. The use of the Naive Bayes and LDA algorithms is considered effective for analyzing sentiment and modelling topics for Lombok tourism.
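The four reported measures all derive from a binary confusion matrix. The sketch below shows the standard formulas. The counts are hypothetical: the abstract does not give the confusion matrix, so they were chosen only because they reproduce the reported percentages, and the function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, NOT taken from the paper: 100 test items with no
# false positives and 8 positive items misclassified as negative.
metrics = binary_metrics(tp=42, fp=0, fn=8, tn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

With no false positives, precision and specificity are both 100%, while the false negatives lower recall and accuracy. This matches the pattern in the reported results.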


Analisis topik konten channel YouTube K-pop Indonesia menggunakan Latent Dirichlet Allocation [Topic analysis of Indonesian K-pop YouTube channel content using Latent Dirichlet Allocation]

Teknologi ◽

10.26594/teknologi.v11i1.2155 ◽

2021 ◽

Vol 11 (1) ◽

pp. 16-25

Author(s):

Alfrida Rahmawati ◽


Najla Lailin Nikmah ◽

Reynaldi Drajat Ageng Perwira ◽

Nur Aini Rakhmawati ◽

...

Keyword(s):

Text Mining ◽

New Media ◽

Digital Technology ◽

Latent Dirichlet Allocation ◽

Optimal Number ◽

Allocation Method ◽

Internet Users ◽

The World ◽

Dirichlet Allocation

The development of digital technology has brought new media, one of which is YouTube, now one of the most widely used applications among internet users worldwide. The growth of its audience, known as viewers, is also supported by contributions from content creators, also known as YouTubers, from Indonesia. As the number of viewers grows, their demand for trending content also grows at a surprising speed, and one such topic is K-pop. In this study, the authors wanted to identify the dominant topics that K-pop YouTubers often upload, in order to support content creators. The research was conducted using the Latent Dirichlet Allocation method. The analysis was carried out after applying text mining to 2563 videos from 10 K-pop YouTuber accounts with more than 100,000 subscribers. The optimal number of topics was determined by looking at the values of perplexity and topic coherence. The results are the top 5 topics that form the content material of the uploaded videos. These topics include reactions to dance covers, album unboxings and reviews, riddles from K-pop dances, and joint vlogs discussing covers and reactions to sounds in K-pop songs.
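Perplexity, one of the two criteria mentioned for choosing the number of topics, is the exponential of the negative mean log-probability that a model assigns to each token. Lower values indicate a better fit. The sketch below is a minimal illustration of that definition only. The function name and the input values are assumptions, and the abstract does not describe how the study computed either perplexity or coherence.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of per-token natural-log probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical example: a model that gives every token probability 1/4
# is as uncertain as a uniform choice among 4 words.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))  # approximately 4
```

When candidate models with different numbers of topics are scored on the same held-out tokens, the one with the lowest perplexity fits best by this criterion. Perplexity is usually read alongside topic coherence, because a lower perplexity does not guarantee more interpretable topics.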


A Comparative Automated Text Analysis of Airbnb Reviews in Hong Kong and Singapore Using Latent Dirichlet Allocation

Sustainability ◽

10.3390/su12166673 ◽

2020 ◽

Vol 12 (16) ◽

pp. 6673 ◽

Cited By ~ 1

Author(s):

Kiattipoom Kiatkawsin ◽

Ian Sutherland ◽

Jin-Young Kim

Keyword(s):

Hong Kong ◽

Text Analysis ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Optimal Number ◽

Text Data ◽

Customer Reviews ◽

Gaining Insight ◽

Dirichlet Allocation

Airbnb has emerged as a platform where unique accommodation options can be found. Due to the uniqueness of each accommodation unit and host combination, each listing offers a one-of-a-kind experience. As consumers increasingly rely on text reviews of other customers, managers are also increasingly gaining insight from customer reviews. Thus, this present study aimed to extract those insights from reviews using latent Dirichlet allocation, an unsupervised type of topic modeling that extracts latent discussion topics from text data. Findings of Hong Kong’s 185,695 and Singapore’s 93,571 Airbnb reviews, two long-term rival destinations, were compared. Hong Kong produced 12 total topics that can be categorized into four distinct groups whereas Singapore’s optimal number of topics was only five. Topics produced from both destinations covered the same range of attributes, but Hong Kong’s 12 topics provide a greater degree of precision to formulate managerial recommendations. While many topics are similar to established hotel attributes, topics related to the host and listing management are unique to the Airbnb experience. The findings also revealed keywords used when evaluating the experience that provide more insight beyond typical numeric ratings.


Fast collapsed Gibbs sampling for latent Dirichlet allocation

Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD 08 ◽

10.1145/1401890.1401960 ◽

2008 ◽

Cited By ~ 207

Author(s):

Ian Porteous ◽

David Newman ◽

Alexander Ihler ◽

Arthur Asuncion ◽

Padhraic Smyth ◽

...

Keyword(s):

Gibbs Sampling ◽

Latent Dirichlet Allocation ◽

Collapsed Gibbs Sampling ◽

Dirichlet Allocation


Renormalization Analysis of Topic Models

Entropy ◽

10.3390/e22050556 ◽

2020 ◽

Vol 22 (5) ◽

pp. 556

Author(s):

Sergei Koltcov ◽

Vera Ignatenko

Keyword(s):

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Optimal Number ◽

Probabilistic Latent Semantic Analysis ◽

Model Parameters ◽

Grid Search ◽

Renormalization Procedure ◽

Allocation Model ◽

Latent Dirichlet Allocation Model ◽

Dirichlet Allocation

In practice, to build a machine learning model of big data, one needs to tune model parameters. The process of parameter tuning involves extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques allowing us to optimize this process. The paper shows that a function of the output of topic modeling demonstrates self-similar behavior under variation of the number of clusters. Such behavior allows using a renormalization technique. A combination of renormalization procedure with the Renyi entropy approach allows for quick searching of the optimal number of topics. In this paper, the renormalization procedure is developed for the probabilistic Latent Semantic Analysis (pLSA), and the Latent Dirichlet Allocation model with variational Expectation–Maximization algorithm (VLDA) and the Latent Dirichlet Allocation model with granulated Gibbs sampling procedure (GLDA). The experiments were conducted on two test datasets with a known number of topics in two different languages and on one unlabeled test dataset with an unknown number of topics. The paper shows that the renormalization procedure allows for finding an approximation of the optimal number of topics at least 30 times faster than the grid search without significant loss of quality.


A text classification model constructed by Latent Dirichlet Allocation and Deep Learning

Proceedings of the 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering 2015 ◽

10.2991/icmmcce-15.2015.482 ◽

2015 ◽

Author(s):

Yu Liu ◽

Zhengping Jin

Keyword(s):

Deep Learning ◽

Text Classification ◽

Latent Dirichlet Allocation ◽

Classification Model ◽

Dirichlet Allocation


Normalized Approach to Find Optimal Number of Topics in Latent Dirichlet Allocation (LDA)

Advances in Intelligent Systems and Computing - Proceedings of International Conference on Trends in Computational and Cognitive Engineering ◽

10.1007/978-981-33-4673-4_27 ◽

2020 ◽

pp. 341-354

Author(s):

Mahedi Hasan ◽

Anichur Rahman ◽

Md. Razaul Karim ◽

Md. Saikat Islam Khan ◽

Md. Jahidul Islam

Keyword(s):

Latent Dirichlet Allocation ◽

Optimal Number ◽

Dirichlet Allocation


Partially collapsed Gibbs sampling for latent Dirichlet allocation

Expert Systems with Applications ◽

10.1016/j.eswa.2019.04.028 ◽

2019 ◽

Vol 131 ◽

pp. 208-218 ◽

Cited By ~ 4

Author(s):

Hongju Park ◽

Taeyoung Park ◽

Yung-Seop Lee

Keyword(s):

Gibbs Sampling ◽

Latent Dirichlet Allocation ◽

Collapsed Gibbs Sampling ◽

Dirichlet Allocation


Evaluation of Text Semantic Features using Latent Dirichlet Allocation Model

International Journal of Performability Engineering ◽

10.23940/ijpe.20.06.p15.968978 ◽

2020 ◽

Vol 16 (6) ◽

pp. 968

Author(s):

Zhou Chunjie ◽

Li Nao ◽

Zhang Chi ◽

Yang Xiaoyu

Keyword(s):

Latent Dirichlet Allocation ◽

Semantic Features ◽

Allocation Model ◽

Latent Dirichlet Allocation Model ◽

Dirichlet Allocation


Similarity Detection Using Latent Semantic Analysis Algorithm

International Journal of Emerging Research in Management and Technology ◽

10.23956/ijermt.v6i8.124 ◽

2018 ◽

Vol 6 (8) ◽

pp. 102

Author(s):

Priyanka R. Patil ◽

Shital A. Patil

Keyword(s):

Latent Semantic Analysis ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Mining Method ◽

Research Papers ◽

Information Measures ◽

Automated Software ◽

Day By Day ◽

Ways Of Life ◽

Dirichlet Allocation

Similarity View is an application for visually comparing and exploring multiple models of text and document collections. Friendbook discovers users' lifestyles from user-centric sensor data, measures the similarity of lifestyles between users, and recommends friends to users whose lifestyles are highly similar. A user's daily life is modelled as life documents, from which lifestyles are extracted using the Latent Dirichlet Allocation algorithm. Manual techniques cannot be used for checking research papers, because the assigned reviewer may have insufficient knowledge of the research disciplines and may hold different subjective views, causing possible misinterpretations. There is an urgent need for an effective and feasible approach to checking submitted research papers with the support of automated software. Text mining methods address the problem of automatically checking research papers semantically. The proposed method finds the similarity of text within a collection of documents using the Latent Dirichlet Allocation (LDA) algorithm and Latent Semantic Analysis (LSA) with a synonym algorithm, which finds synonyms of text index-wise using the English WordNet dictionary. A second algorithm, LSA without synonyms, finds the similarity of text based on the index alone. LSA with synonyms achieves a higher accuracy rate when synonyms are considered for matching.
