Topic Modeling Using Latent Dirichlet Allocation

2022 ◽  
Vol 54 (7) ◽  
pp. 1-35
Author(s):  
Uttam Chauhan ◽  
Apurva Shah

A mammoth text corpus cannot be grasped without summarizing it into a relatively small subset; computational tools are essential for understanding such gigantic pools of text. Probabilistic topic modeling discovers and explains large collections of documents by reducing them to a topical subspace. In this work, we study the background and advancement of topic modeling techniques. We first introduce the preliminaries of topic modeling and then review its extensions and variations, such as topic modeling over various domains, hierarchical topic modeling, word-embedded topic models, and topic models from multilingual perspectives. We also explore topic modeling in distributed environments and approaches to topic visualization, and briefly cover implementation and evaluation techniques for topic models. Comparison matrices summarize experimental results across the various categories of topic models, and diverse technical challenges and future directions are discussed.
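For readers new to the area, a minimal sketch of the basic workflow the survey covers, here using the gensim library (the toy corpus and parameter choices are illustrative assumptions, not drawn from the article):

```python
# A minimal LDA fit with gensim; corpus and parameters are toy illustrations.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["topic", "model", "discovers", "latent", "themes"],
    ["dirichlet", "prior", "governs", "topic", "proportions"],
    ["word", "embeddings", "extend", "classical", "topic", "models"],
]

dictionary = corpora.Dictionary(docs)               # token <-> integer id mapping
bow = [dictionary.doc2bow(d) for d in docs]         # bag-of-words vectors

lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, words in lda.show_topics(num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])          # top words per topic
```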

2021 ◽  
Vol 26 (6) ◽  
Author(s):  
Camila Costa Silva ◽  
Matthias Galster ◽  
Fabian Gilson

Abstract: Topic modeling using models such as Latent Dirichlet Allocation (LDA) is a text mining technique to extract human-readable semantic “topics” (i.e., word clusters) from a corpus of textual documents. In software engineering, topic modeling has been used to analyze textual data in empirical studies (e.g., to find out what developers talk about online), but also to build new techniques to support software engineering tasks (e.g., to support source code comprehension). Topic modeling needs to be applied carefully (e.g., depending on the type of textual data analyzed and modeling parameters). Our study aims at describing how topic modeling has been applied in software engineering research with a focus on four aspects: (1) which topic models and modeling techniques have been applied, (2) which textual inputs have been used for topic modeling, (3) how textual data was “prepared” (i.e., pre-processed) for topic modeling, and (4) how generated topics (i.e., word clusters) were named to give them a human-understandable meaning. We analyzed topic modeling as applied in 111 papers from ten highly-ranked software engineering venues (five journals and five conferences) published between 2009 and 2020. We found that (1) LDA and LDA-based techniques are the most frequent topic modeling techniques, (2) developer communication and bug reports have been modelled most, (3) data pre-processing and modeling parameters vary quite a bit and are often vaguely reported, and (4) manual topic naming (such as deducting names based on frequent words in a topic) is common.
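As a concrete illustration of the pre-processing the study catalogs, here is a minimal tokenize/stop-word/stem pipeline with NLTK; the exact steps vary from paper to paper, so this particular pipeline is an assumption, not the surveyed papers' method:

```python
# Typical pre-processing before topic modeling: lowercase, tokenize,
# remove stop words, stem. Requires: nltk.download("stopwords")
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # keep alphabetic tokens
    tokens = [t for t in tokens if t not in stop_words]   # drop stop words
    return [stemmer.stem(t) for t in tokens]              # reduce to stems

print(preprocess("Developers discussed the failing build on the issue tracker."))
```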


2020 ◽  
Vol 8 ◽  
pp. 439-453 ◽  
Author(s):  
Adji B. Dieng ◽  
Francisco J. R. Ruiz ◽  
David M. Blei

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
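The topic-word distribution the abstract describes can be sketched directly: each topic's distribution over the vocabulary is a softmax over inner products between the topic embedding and every word embedding. The dimensions and random embeddings below are illustrative stand-ins for trained parameters:

```python
# Topic-word distributions as softmax over embedding inner products.
import numpy as np

V, L, K = 1000, 300, 20                  # vocab size, embedding dim, topics (assumed)
rng = np.random.default_rng(0)
rho = rng.normal(size=(V, L))            # word embeddings, one row per word
alpha = rng.normal(size=(K, L))          # topic embeddings

logits = alpha @ rho.T                                   # (K, V) inner products
beta = np.exp(logits - logits.max(axis=1, keepdims=True))
beta /= beta.sum(axis=1, keepdims=True)                  # each row: p(word | topic)

assert np.allclose(beta.sum(axis=1), 1.0)
```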


Author(s):  
Carlo Schwarz

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
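For reference, the document-over-topics and topic-over-words representation described here is the standard LDA generative process, which can be written as follows (standard notation, not reproduced from the article):

```latex
% Standard LDA generative process (Blei, Ng, and Jordan, 2003):
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic proportions for document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta)
  && \text{word distribution for topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d)
  && \text{topic assignment of the $n$-th word in } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
  && \text{the observed word}
\end{align*}
```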


2021 ◽  
Vol 12 (1) ◽  
pp. 1-22
Author(s):  
Di Jiang ◽  
Yongxin Tong ◽  
Yuanfeng Song ◽  
Xueyang Wu ◽  
Weiwei Zhao ◽  
...  

Probabilistic topic modeling has been applied in a variety of industrial applications. Training a high-quality model usually requires a massive amount of data to provide comprehensive co-occurrence information for the model to learn. However, industrial data such as medical or financial records are often proprietary or sensitive, which precludes uploading to data centers. Hence, training topic models in industrial scenarios using conventional approaches faces a dilemma: A party (i.e., a company or institute) has to either tolerate data scarcity or sacrifice data privacy. In this article, we propose a framework named Industrial Federated Topic Modeling (iFTM), in which multiple parties collaboratively train a high-quality topic model by simultaneously alleviating data scarcity and maintaining immunity to privacy adversaries. iFTM is inspired by federated learning, supports two representative topic models (i.e., Latent Dirichlet Allocation and SentenceLDA) in industrial applications, and consists of novel techniques such as private Metropolis-Hastings, topic-wise normalization, and heterogeneous model integration. We conduct quantitative evaluations to verify the effectiveness of iFTM and deploy iFTM in two real-life applications to demonstrate its utility. Experimental results verify iFTM’s superiority over conventional topic modeling.
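iFTM's actual machinery (private Metropolis-Hastings, heterogeneous model integration) is beyond a short sketch, but the basic federated idea, that parties exchange only aggregate topic statistics which a coordinator renormalizes topic-wise, can be illustrated as follows; all names, shapes, and the absence of any privacy mechanism are simplifying assumptions:

```python
# Parties share only word-topic count matrices; a coordinator sums them and
# renormalizes each topic. No privacy mechanism is shown in this sketch.
import numpy as np

def aggregate(party_counts):
    """Sum per-party (V x K) word-topic counts, then normalize topic-wise so
    each column is a distribution over the vocabulary."""
    total = np.sum(party_counts, axis=0)
    return total / total.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
parties = [rng.integers(0, 50, size=(500, 10)).astype(float) for _ in range(3)]
phi = aggregate(parties)                 # global topic-word distributions
assert np.allclose(phi.sum(axis=0), 1.0)
```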


2015 ◽  
Author(s):  
Ziyun Xu

Despite being a relatively new discipline, Chinese Interpreting Studies (CIS) has witnessed tremendous growth in the number of publications and diversity of topics investigated over the past two decades. The number of doctoral dissertations produced has also increased rapidly since the late 1990s. As CIS continues to mature, it is important to evaluate its dominant topics, trends and institutions, as well as the career development of PhD graduates in the subject. In addition to traditional scientometric techniques, this study’s empirical objectivity is heightened by its use of Probabilistic Topic Modeling (PTM), which uses Latent Dirichlet Allocation (LDA) to analyze the topics covered in a near-exhaustive corpus of CIS dissertations. The analysis reveals that the topics of allocation of cognitive resources, deverbalization, and modeling the interpreting process attracted most attention from doctoral researchers. Additional analyses were used to track the research productivity of institutions and the career trajectories of PhD holders: one school was found to stand out, accounting for more than half of the total dissertations produced, and a PhD in CIS was found to be a highly useful asset for new professional interpreters.


Author(s):  
Julien Velcin ◽  
Antoine Gourru ◽  
Erwan Giry-Fouquet ◽  
Christophe Gravier ◽  
Mathieu Roche ◽  
...  

Readitopics provides a new tool for browsing a textual corpus that showcases several recent works on topic labeling and topic coherence. We demonstrate the potential of these techniques for gaining a deeper understanding of the topics that structure different datasets. The tool is provided as a Web demo, but it can also be installed to experiment with your own dataset and extended to handle more advanced topic modeling techniques.
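Topic coherence, one of the measures Readitopics builds on, can be computed with gensim's CoherenceModel; the tiny corpus below is an illustrative assumption standing in for a real dataset:

```python
# Fit a tiny LDA, then score it with c_v coherence (higher is better).
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

texts = [
    ["topic", "label", "coherence", "browser"],
    ["corpus", "browser", "topic", "label"],
    ["coherence", "measure", "topic", "quality"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("c_v coherence:", cm.get_coherence())
```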


Author(s):  
R. Derbanosov ◽  
M. Bakhanova

Probabilistic topic modeling is a tool for statistical text analysis that can provide information about the inner structure of a large corpus of documents. The most popular models, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation, produce topics in the form of discrete distributions over the set of all words in the corpus. They build topics using an iterative algorithm that starts from a random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to random initialization: different initial points can produce significantly different solutions. Several studies have shown that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use the BigARTM library to perform experiments on several text collections. We show that using side information as an additional modality improves topic stability without significant quality loss in the model.
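A sketch of how side information can be attached as an additional modality in BigARTM; the input file, modality names, and weights are illustrative assumptions rather than the paper's exact configuration:

```python
# Words go in the default modality; side information (e.g., source labels)
# goes in a second, upweighted modality. File and names are assumed. Each
# line of corpus.vw mixes word tokens with |@side_info metadata tokens.
import artm

batches = artm.BatchVectorizer(data_path="corpus.vw",
                               data_format="vowpal_wabbit",
                               target_folder="batches")

model = artm.ARTM(num_topics=20,
                  class_ids={"@default_class": 1.0,  # word tokens
                             "@side_info": 2.0},     # auxiliary modality
                  dictionary=batches.dictionary)
model.fit_offline(batch_vectorizer=batches, num_collection_passes=15)
```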


2012 ◽  
Vol 18 (2) ◽  
pp. 263-289 ◽  
Author(s):  
DINGCHENG LI ◽  
SWAPNA SOMASUNDARAN ◽  
AMIT CHAKRABORTY

Abstract: This paper proposes a novel application of topic models to entity relation detection (ERD). In order to make use of the latent semantics of text, we formulate the task of relation detection as a topic modeling problem. The motivation is to find underlying topics that are indicative of relations between named entities (NEs). Our approach considers pairs of NEs and the features associated with them as mini documents, and aims to utilize the underlying topic distributions as indicators of the types of relations that may exist between an NE pair. Our system, ERD-MedLDA, adapts Maximum Entropy Discriminant Latent Dirichlet Allocation (MedLDA) with mixed membership for relation detection. By using supervision, ERD-MedLDA is able to learn topic distributions indicative of relation types. Further, ERD-MedLDA is a topic model that combines the benefits of both maximum likelihood estimation (MLE) and maximum margin estimation (MME), and its mixed-membership formulation enables the system to incorporate heterogeneous features. We incorporate different features into the system and perform experiments on the ACE 2005 corpus. Our approach achieves better overall performance on precision, recall, and F-measure than baseline SVM-based and LDA-based models. We also find that our system shows larger and more consistent improvements than the baselines as complex informative features are added.
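A sketch of the mini-document construction the abstract describes, pairing two named entities with surrounding context; the specific feature choice here (tokens between the entities) is an illustrative assumption, not necessarily the paper's feature set:

```python
# Build a pseudo-document from an NE pair: the entity tokens plus the tokens
# between them, which often carry the relation-bearing words.
def mini_document(tokens, ne1_span, ne2_span):
    (s1, e1), (s2, e2) = sorted([ne1_span, ne2_span])
    entities = tokens[s1:e1] + tokens[s2:e2]
    between = tokens[e1:s2]              # context between the two entities
    return entities + between

sentence = ["John", "Smith", "works", "for", "Acme", "Corp", "in", "Boston"]
print(mini_document(sentence, (0, 2), (4, 6)))
# -> ['John', 'Smith', 'Acme', 'Corp', 'works', 'for']
```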


2016 ◽  
Vol 24 (3) ◽  
pp. 472-480 ◽  
Author(s):  
Jonathan H Chen ◽  
Mary K Goldstein ◽  
Steven M Asch ◽  
Lester Mackey ◽  
Russ B Altman

Objective: Build probabilistic topic model representations of hospital admissions processes and compare the ability of such models to predict clinical order patterns against preconstructed order sets. Materials and Methods: The authors evaluated the first 24 hours of structured electronic health record data for > 10 K inpatients. Drawing an analogy between structured items (e.g., clinical orders) and words in a text document, the authors performed latent Dirichlet allocation probabilistic topic modeling. These topic models use initial clinical information to predict clinical orders for a separate validation set of > 4 K patients. The authors evaluated these topic model-based predictions vs. existing human-authored order sets by area under the receiver operating characteristic curve, precision, and recall for subsequent clinical orders. Results: Existing order sets predict clinical orders used within 24 hours with area under the receiver operating characteristic curve 0.81, precision 16%, and recall 35%. This can be improved to 0.90, 24%, and 47% (P < 10⁻²⁰) by using probabilistic topic models to summarize clinical data into up to 32 topics. Many of these latent topics yield natural clinical interpretations (e.g., “critical care,” “pneumonia,” “neurologic evaluation”). Discussion: Existing order sets tend to provide nonspecific, process-oriented aid, with usability limitations impairing more precise, patient-focused support. Algorithmic summarization has the potential to breach this usability barrier by automatically inferring patient context, but with potential tradeoffs in interpretability. Conclusion: Probabilistic topic modeling provides an automated approach to detect thematic trends in patient care and generate decision support content. A potential use case finds related clinical orders for decision support.
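The prediction step the study describes can be sketched as follows: treat clinical orders as words, infer a patient's topic mixture from initial orders, and rank candidate orders by their probability under that mixture. The matrices below are random stand-ins for a fitted model:

```python
# Rank candidate orders by p(order) = sum_k theta_k * phi[k, order].
import numpy as np

K, V = 32, 200                               # topics, distinct order codes (assumed)
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(V), size=K)      # topic -> order distributions, (K, V)
theta = rng.dirichlet(np.ones(K))            # patient's inferred topic mixture

scores = theta @ phi                         # p(order | patient) for every order
top5 = np.argsort(scores)[::-1][:5]          # highest-probability next orders
print(top5, scores[top5])
```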

