Topic Modeling Using Latent Dirichlet Allocation

2022 ◽  
Vol 54 (7) ◽  
pp. 1-35
Author(s):  
Uttam Chauhan ◽  
Apurva Shah

A mammoth text corpus cannot be grasped without summarizing it into a relatively small subset; computational tools are essential for understanding such gigantic pools of text. Probabilistic topic modeling discovers and explains large collections of documents by reducing them to a topical subspace. In this work, we study the background and advancement of topic modeling techniques. We first introduce the preliminaries of topic modeling and then review its extensions and variations, such as topic modeling over various domains, hierarchical topic modeling, word-embedded topic models, and topic models from multilingual perspectives. We also explore topic modeling in distributed environments and approaches to topic visualization, and briefly cover implementation and evaluation techniques for topic models. Comparison matrices summarize experimental results across the various categories of topic models, and diverse technical challenges and future directions are discussed.
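For readers new to the area, a minimal sketch of the basic workflow the survey covers, here using the gensim library (the toy corpus and parameter choices are illustrative assumptions, not drawn from the article):

```python
# A minimal LDA fit with gensim; corpus and parameters are toy illustrations.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["topic", "model", "discovers", "latent", "themes"],
    ["dirichlet", "prior", "governs", "topic", "proportions"],
    ["word", "embeddings", "extend", "classical", "topic", "models"],
]

dictionary = corpora.Dictionary(docs)               # token <-> integer id mapping
bow = [dictionary.doc2bow(d) for d in docs]         # bag-of-words vectors

lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, words in lda.show_topics(num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])          # top words per topic
```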

2021 ◽  
Vol 26 (6) ◽  
Author(s):  
Camila Costa Silva ◽  
Matthias Galster ◽  
Fabian Gilson

Abstract: Topic modeling using models such as Latent Dirichlet Allocation (LDA) is a text mining technique to extract human-readable semantic “topics” (i.e., word clusters) from a corpus of textual documents. In software engineering, topic modeling has been used to analyze textual data in empirical studies (e.g., to find out what developers talk about online), but also to build new techniques to support software engineering tasks (e.g., to support source code comprehension). Topic modeling needs to be applied carefully (e.g., depending on the type of textual data analyzed and modeling parameters). Our study aims at describing how topic modeling has been applied in software engineering research with a focus on four aspects: (1) which topic models and modeling techniques have been applied, (2) which textual inputs have been used for topic modeling, (3) how textual data was “prepared” (i.e., pre-processed) for topic modeling, and (4) how generated topics (i.e., word clusters) were named to give them a human-understandable meaning. We analyzed topic modeling as applied in 111 papers from ten highly-ranked software engineering venues (five journals and five conferences) published between 2009 and 2020. We found that (1) LDA and LDA-based techniques are the most frequent topic modeling techniques, (2) developer communication and bug reports have been modelled most, (3) data pre-processing and modeling parameters vary quite a bit and are often vaguely reported, and (4) manual topic naming (such as deducting names based on frequent words in a topic) is common.
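As a concrete illustration of the pre-processing the study catalogs, here is a minimal tokenize/stop-word/stem pipeline with NLTK; the exact steps vary from paper to paper, so this particular pipeline is an assumption, not the surveyed papers' method:

```python
# Typical pre-processing before topic modeling: lowercase, tokenize,
# remove stop words, stem. Requires: nltk.download("stopwords")
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # keep alphabetic tokens
    tokens = [t for t in tokens if t not in stop_words]   # drop stop words
    return [stemmer.stem(t) for t in tokens]              # reduce to stems

print(preprocess("Developers discussed the failing build on the issue tracker."))
```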


2020 ◽  
Vol 8 ◽  
pp. 439-453 ◽  
Author(s):  
Adji B. Dieng ◽  
Francisco J. R. Ruiz ◽  
David M. Blei

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
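The topic-word distribution the abstract describes can be sketched directly: each topic's distribution over the vocabulary is a softmax over inner products between the topic embedding and every word embedding. The dimensions and random embeddings below are illustrative stand-ins for trained parameters:

```python
# Topic-word distributions as softmax over embedding inner products.
import numpy as np

V, L, K = 1000, 300, 20                  # vocab size, embedding dim, topics (assumed)
rng = np.random.default_rng(0)
rho = rng.normal(size=(V, L))            # word embeddings, one row per word
alpha = rng.normal(size=(K, L))          # topic embeddings

logits = alpha @ rho.T                                   # (K, V) inner products
beta = np.exp(logits - logits.max(axis=1, keepdims=True))
beta /= beta.sum(axis=1, keepdims=True)                  # each row: p(word | topic)

assert np.allclose(beta.sum(axis=1), 1.0)
```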


Author(s):  
Carlo Schwarz

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
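For reference, the document-over-topics and topic-over-words representation described here is the standard LDA generative process, which can be written as follows (standard notation, not reproduced from the article):

```latex
% Standard LDA generative process (Blei, Ng, and Jordan, 2003):
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic proportions for document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta)
  && \text{word distribution for topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d)
  && \text{topic assignment of the $n$-th word in } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
  && \text{the observed word}
\end{align*}
```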


2021 ◽  
Vol 12 (1) ◽  
pp. 1-22
Author(s):  
Di Jiang ◽  
Yongxin Tong ◽  
Yuanfeng Song ◽  
Xueyang Wu ◽  
Weiwei Zhao ◽  
...  

Probabilistic topic modeling has been applied in a variety of industrial applications. Training a high-quality model usually requires a massive amount of data to provide comprehensive co-occurrence information for the model to learn. However, industrial data such as medical or financial records are often proprietary or sensitive, which precludes uploading to data centers. Hence, training topic models in industrial scenarios using conventional approaches faces a dilemma: A party (i.e., a company or institute) has to either tolerate data scarcity or sacrifice data privacy. In this article, we propose a framework named Industrial Federated Topic Modeling (iFTM), in which multiple parties collaboratively train a high-quality topic model by simultaneously alleviating data scarcity and maintaining immunity to privacy adversaries. iFTM is inspired by federated learning, supports two representative topic models (i.e., Latent Dirichlet Allocation and SentenceLDA) in industrial applications, and consists of novel techniques such as private Metropolis-Hastings, topic-wise normalization, and heterogeneous model integration. We conduct quantitative evaluations to verify the effectiveness of iFTM and deploy iFTM in two real-life applications to demonstrate its utility. Experimental results verify iFTM’s superiority over conventional topic modeling.
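iFTM's actual machinery (private Metropolis-Hastings, heterogeneous model integration) is beyond a short sketch, but the basic federated idea, that parties exchange only aggregate topic statistics which a coordinator renormalizes topic-wise, can be illustrated as follows; all names, shapes, and the absence of any privacy mechanism are simplifying assumptions:

```python
# Parties share only word-topic count matrices; a coordinator sums them and
# renormalizes each topic. No privacy mechanism is shown in this sketch.
import numpy as np

def aggregate(party_counts):
    """Sum per-party (V x K) word-topic counts, then normalize topic-wise so
    each column is a distribution over the vocabulary."""
    total = np.sum(party_counts, axis=0)
    return total / total.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
parties = [rng.integers(0, 50, size=(500, 10)).astype(float) for _ in range(3)]
phi = aggregate(parties)                 # global topic-word distributions
assert np.allclose(phi.sum(axis=0), 1.0)
```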


2015 ◽  
Author(s):  
Ziyun Xu

Despite being a relatively new discipline, Chinese Interpreting Studies (CIS) has witnessed tremendous growth in the number of publications and diversity of topics investigated over the past two decades. The number of doctoral dissertations produced has also increased rapidly since the late 1990s. As CIS continues to mature, it is important to evaluate its dominant topics, trends and institutions, as well as the career development of PhD graduates in the subject. In addition to traditional scientometric techniques, this study’s empirical objectivity is heightened by its use of Probabilistic Topic Modeling (PTM), which uses Latent Dirichlet Allocation (LDA) to analyze the topics covered in a near-exhaustive corpus of CIS dissertations. The analysis reveals that the topics of allocation of cognitive resources, deverbalization, and modeling the interpreting process attracted most attention from doctoral researchers. Additional analyses were used to track the research productivity of institutions and the career trajectories of PhD holders: one school was found to stand out, accounting for more than half of the total dissertations produced, and a PhD in CIS was found to be a highly useful asset for new professional interpreters.


Author(s):  
Julien Velcin ◽  
Antoine Gourru ◽  
Erwan Giry-Fouquet ◽  
Christophe Gravier ◽  
Mathieu Roche ◽  
...  

Readitopics provides a new tool for browsing a textual corpus that showcases several recent works on topic labeling and topic coherence. We demonstrate the potential of these techniques for gaining a deeper understanding of the topics that structure different datasets. The tool is provided as a Web demo, but it can also be installed to experiment with your own dataset and extended to handle more advanced topic modeling techniques.
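Topic coherence, one of the measures Readitopics builds on, can be computed with gensim's CoherenceModel; the tiny corpus below is an illustrative assumption standing in for a real dataset:

```python
# Fit a tiny LDA, then score it with c_v coherence (higher is better).
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

texts = [
    ["topic", "label", "coherence", "browser"],
    ["corpus", "browser", "topic", "label"],
    ["coherence", "measure", "topic", "quality"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("c_v coherence:", cm.get_coherence())
```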


Author(s):  
R. Derbanosov ◽  
M. Bakhanova

Probabilistic topic modeling is a tool for statistical text analysis that can provide information about the inner structure of a large corpus of documents. The most popular models, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation, produce topics in the form of discrete distributions over the set of all words in the corpus. They build topics using an iterative algorithm that starts from a random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to random initialization: different initial points can produce significantly different solutions. Several studies have shown that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use the BigARTM library to perform experiments on several text collections. We show that using side information as an additional modality improves topic stability without significant quality loss in the model.
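A sketch of how side information can be attached as an additional modality in BigARTM; the input file, modality names, and weights are illustrative assumptions rather than the paper's exact configuration:

```python
# Words go in the default modality; side information (e.g., source labels)
# goes in a second, upweighted modality. File and names are assumed. Each
# line of corpus.vw mixes word tokens with |@side_info metadata tokens.
import artm

batches = artm.BatchVectorizer(data_path="corpus.vw",
                               data_format="vowpal_wabbit",
                               target_folder="batches")

model = artm.ARTM(num_topics=20,
                  class_ids={"@default_class": 1.0,  # word tokens
                             "@side_info": 2.0},     # auxiliary modality
                  dictionary=batches.dictionary)
model.fit_offline(batch_vectorizer=batches, num_collection_passes=15)
```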


2012 ◽  
Vol 18 (2) ◽  
pp. 263-289 ◽  
Author(s):  
DINGCHENG LI ◽  
SWAPNA SOMASUNDARAN ◽  
AMIT CHAKRABORTY

Abstract: This paper proposes a novel application of topic models to entity relation detection (ERD). In order to make use of the latent semantics of text, we formulate the task of relation detection as a topic modeling problem. The motivation is to find underlying topics that are indicative of relations between named entities (NEs). Our approach considers pairs of NEs and the features associated with them as mini documents, and aims to utilize the underlying topic distributions as indicators of the types of relations that may exist between an NE pair. Our system, ERD-MedLDA, adapts Maximum Entropy Discriminant Latent Dirichlet Allocation (MedLDA) with mixed membership for relation detection. By using supervision, ERD-MedLDA is able to learn topic distributions indicative of relation types. Further, ERD-MedLDA is a topic model that combines the benefits of both maximum likelihood estimation (MLE) and maximum margin estimation (MME), and its mixed-membership formulation enables the system to incorporate heterogeneous features. We incorporate different features into the system and perform experiments on the ACE 2005 corpus. Our approach achieves better overall performance on precision, recall, and F-measure than baseline SVM-based and LDA-based models. We also find that our system shows larger and more consistent improvements than the baselines as complex informative features are added.
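A sketch of the mini-document construction the abstract describes, pairing two named entities with surrounding context; the specific feature choice here (tokens between the entities) is an illustrative assumption, not necessarily the paper's feature set:

```python
# Build a pseudo-document from an NE pair: the entity tokens plus the tokens
# between them, which often carry the relation-bearing words.
def mini_document(tokens, ne1_span, ne2_span):
    (s1, e1), (s2, e2) = sorted([ne1_span, ne2_span])
    entities = tokens[s1:e1] + tokens[s2:e2]
    between = tokens[e1:s2]              # context between the two entities
    return entities + between

sentence = ["John", "Smith", "works", "for", "Acme", "Corp", "in", "Boston"]
print(mini_document(sentence, (0, 2), (4, 6)))
# -> ['John', 'Smith', 'Acme', 'Corp', 'works', 'for']
```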


2016 ◽  
Vol 24 (3) ◽  
pp. 472-480 ◽  
Author(s):  
Jonathan H Chen ◽  
Mary K Goldstein ◽  
Steven M Asch ◽  
Lester Mackey ◽  
Russ B Altman

Objective: Build probabilistic topic model representations of hospital admissions processes and compare the ability of such models to predict clinical order patterns against preconstructed order sets. Materials and Methods: The authors evaluated the first 24 hours of structured electronic health record data for > 10 K inpatients. Drawing an analogy between structured items (e.g., clinical orders) and words in a text document, the authors performed latent Dirichlet allocation probabilistic topic modeling. These topic models use initial clinical information to predict clinical orders for a separate validation set of > 4 K patients. The authors evaluated these topic model-based predictions vs. existing human-authored order sets by area under the receiver operating characteristic curve, precision, and recall for subsequent clinical orders. Results: Existing order sets predict clinical orders used within 24 hours with area under the receiver operating characteristic curve 0.81, precision 16%, and recall 35%. This can be improved to 0.90, 24%, and 47% (P < 10⁻²⁰) by using probabilistic topic models to summarize clinical data into up to 32 topics. Many of these latent topics yield natural clinical interpretations (e.g., “critical care,” “pneumonia,” “neurologic evaluation”). Discussion: Existing order sets tend to provide nonspecific, process-oriented aid, with usability limitations impairing more precise, patient-focused support. Algorithmic summarization has the potential to breach this usability barrier by automatically inferring patient context, but with potential tradeoffs in interpretability. Conclusion: Probabilistic topic modeling provides an automated approach to detect thematic trends in patient care and generate decision support content. A potential use case finds related clinical orders for decision support.
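The prediction step the study describes can be sketched as follows: treat clinical orders as words, infer a patient's topic mixture from initial orders, and rank candidate orders by their probability under that mixture. The matrices below are random stand-ins for a fitted model:

```python
# Rank candidate orders by p(order) = sum_k theta_k * phi[k, order].
import numpy as np

K, V = 32, 200                               # topics, distinct order codes (assumed)
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(V), size=K)      # topic -> order distributions, (K, V)
theta = rng.dirichlet(np.ones(K))            # patient's inferred topic mixture

scores = theta @ phi                         # p(order | patient) for every order
top5 = np.argsort(scores)[::-1][:5]          # highest-probability next orders
print(top5, scores[top5])
```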

