User–Topic Modeling for Online Community Analysis

2020 ◽  
Vol 10 (10) ◽  
pp. 3388
Author(s):  
Sung-Hwan Kim ◽  
Hwan-Gue Cho

Analyzing user behavior in online spaces is an important task. This paper is dedicated to analyzing online communities in terms of topics. We present a user–topic model based on latent Dirichlet allocation (LDA), as an application of topic modeling to a domain other than textual data. The model substitutes user participation for word occurrence in the original LDA method. The proposed method addresses several problems in topic modeling and user analysis, including the modeling of dynamic topics, the visualization of user interaction networks, and event detection. We collected datasets from four online communities with different characteristics and conducted experiments that demonstrate the effectiveness of our method, revealing findings that cover numerous aspects of these communities.
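
A minimal sketch of the substitution described above, assuming hypothetical thread/user data and using gensim's standard LdaModel as a stand-in for the paper's model: each discussion thread plays the role of a document and the IDs of participating users play the role of words.

```python
# Sketch: user-topic modeling by feeding user participation into plain LDA.
# Data are hypothetical; gensim's LdaModel stands in for the paper's user-topic model.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Each "document" is a thread; each "word" is a user who posted in it.
threads = [
    ["alice", "bob", "carol"],
    ["alice", "dave"],
    ["bob", "carol", "erin", "erin"],  # repeated participation counts like word frequency
]

dictionary = Dictionary(threads)                   # vocabulary of user IDs
corpus = [dictionary.doc2bow(t) for t in threads]  # bag-of-users per thread

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
for topic_id in range(2):
    print(lda.show_topic(topic_id))                # users most associated with each "topic"
```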

Information ◽  
2020 ◽  
Vol 11 (8) ◽  
pp. 376 ◽  
Author(s):  
Cornelia Ferner ◽  
Clemens Havas ◽  
Elisabeth Birnbacher ◽  
Stefan Wegenkittl ◽  
Bernd Resch

In the event of a natural disaster, geo-tagged Tweets are an immediate source of information for locating casualties and damage and for supporting disaster management. Topic modeling can help detect disaster-related Tweets in the noisy Twitter stream in an unsupervised manner. However, the results of topic models are difficult to interpret and require manual identification of one or more “disaster topics”. Immediate disaster response would benefit from a fully automated process for interpreting the modeled topics and extracting disaster-relevant information. Initializing the topic model with a set of seed words allows the corresponding disaster topic to be identified directly. To enable an automated end-to-end process, we automatically generate the seed words using older Tweets from the same geographic area. The results for two past events (the 2014 Napa Valley earthquake and Hurricane Harvey in 2017) show that the geospatial distribution of Tweets identified as disaster related conforms to the officially released disaster footprints. The suggested approach is applicable when there is a single topic of interest and comparative data are available.
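
One common way to seed a topic, sketched below with placeholder documents and seed words (not the automatically generated seeds from the paper): bias gensim LdaModel's per-topic word prior (eta) toward the seed words so that topic 0 is nudged to become the disaster topic.

```python
# Sketch: seeding one LDA topic with disaster-related words via an asymmetric eta prior.
# Documents and seed words are illustrative placeholders.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["earthquake", "damage", "help"], ["concert", "tickets", "music"],
        ["flood", "rescue", "damage"], ["music", "festival", "fun"]]
seed_words = ["earthquake", "flood", "damage", "rescue"]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

num_topics = 2
eta = np.full((num_topics, len(dictionary)), 0.01)  # weak symmetric prior everywhere
for w in seed_words:
    if w in dictionary.token2id:
        eta[0, dictionary.token2id[w]] = 1.0        # boost seed words in topic 0

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
               eta=eta, passes=20, random_state=0)
print(lda.show_topic(0))                            # topic 0 should lean toward disaster terms
```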


2020 ◽  
Vol 8 ◽  
pp. 439-453 ◽  
Author(s):  
Adji B. Dieng ◽  
Francisco J. R. Ruiz ◽  
David M. Blei

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To address this, we develop the embedded topic model (etm), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the etm models each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the etm, we develop an efficient amortized variational inference algorithm. The etm discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
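
A minimal numpy sketch of the distribution described above, with randomly initialized placeholder embeddings (in the etm they are learned or pre-trained): the probability of word w under topic k is the softmax over the vocabulary of the inner product between the word embedding and the topic embedding.

```python
# Sketch: the ETM's per-topic word distribution as a softmax of embedding inner products.
# Embeddings are random placeholders rather than learned parameters.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_topics, dim = 1000, 20, 300
rho = rng.normal(size=(vocab_size, dim))     # word embeddings, one row per vocabulary word
alpha = rng.normal(size=(num_topics, dim))   # topic embeddings

logits = rho @ alpha.T                       # (vocab_size, num_topics) inner products
logits -= logits.max(axis=0, keepdims=True)  # numerical stability
beta = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

# beta[:, k] is the categorical distribution over the vocabulary for topic k
assert np.allclose(beta.sum(axis=0), 1.0)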


2020 ◽  
Author(s):  
Kai Zhang ◽  
Yuan Zhou ◽  
Zheng Chen ◽  
Yufei Liu ◽  
Zhuo Tang ◽  
...  

The prevalence of short texts on the Web has made mining the latent topic structures of short texts a critical and fundamental task for many applications. However, due to the lack of word co-occurrence information caused by the content sparsity of short texts, it is challenging for traditional topic models such as latent Dirichlet allocation (LDA) to extract coherent topic structures from short texts. Incorporating external semantic knowledge into the topic modeling process is an effective strategy for improving the coherence of inferred topics. In this paper, we develop a novel topic model, called the biterm correlation knowledge-based topic model (BCK-TM), to infer latent topics from short texts. Specifically, the proposed model mines biterm correlation knowledge automatically based on recent progress in word embedding, which can represent the semantic information of words in a continuous vector space. To incorporate this external knowledge, a knowledge incorporation mechanism is designed over the latent topic layer to regularize the topic assignment of each biterm during the topic sampling process. Experimental results on three public benchmark datasets illustrate the superior performance of the proposed approach over several state-of-the-art baseline models.
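
A rough sketch of the kind of biterm correlation knowledge the model mines, under assumed pre-trained embeddings (replaced here by random placeholder vectors): enumerate biterms (unordered word pairs) in a short text and score each pair by the cosine similarity of its word vectors.

```python
# Sketch: scoring biterm correlation with word-embedding cosine similarity.
# The embedding table is a random placeholder for pre-trained word vectors.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
vocab = ["apple", "phone", "screen", "banana"]
emb = {w: rng.normal(size=50) for w in vocab}   # placeholder word vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

short_text = ["apple", "phone", "screen"]
biterms = list(combinations(sorted(set(short_text)), 2))
correlation = {pair: cosine(emb[pair[0]], emb[pair[1]]) for pair in biterms}
print(correlation)   # e.g. {('apple', 'phone'): ..., ('apple', 'screen'): ..., ...}
```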


Author(s):  
Carlo Schwarz

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
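
For readers outside Stata, a minimal Python illustration of the two outputs described above (this is not the ldagibbs syntax, just a sketch of LDA's document-topic and topic-word distributions using gensim on placeholder documents):

```python
# Sketch: LDA's two outputs, per-document topic distributions and per-topic word distributions.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["economy", "growth", "market"], ["health", "vaccine", "virus"],
        ["market", "stocks", "growth"], ["virus", "hospital", "health"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

print(lda.get_document_topics(corpus[0]))  # document as a probability distribution over topics
print(lda.show_topic(0))                   # topic as a probability distribution over words
```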


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Lirong Qiu ◽  
Jia Yu

Against the background of big data, effectively extracting useful information is a pressing problem. The purpose of this study is to construct a more effective method for mining the interest preferences of users in a particular field. We mainly use a large volume of user text data from microblogs. LDA is an effective method of text mining, but applying LDA directly to the large number of short texts on microblogs does not work well. In current topic modeling practice, short texts need to be aggregated into long texts to avoid data sparsity. However, aggregated short texts are mixed with a lot of noise, which reduces the accuracy of mining users' interest preferences. In this paper, we propose Combining Latent Dirichlet Allocation (CLDA), a new topic model that can learn the latent topics of microblog short texts and long texts simultaneously. The data sparsity of short texts is avoided by aggregating long texts to assist in learning the short texts, and the short texts are in turn used to filter the long texts, improving mining accuracy and effectively combining the two. Experimental results on a real microblog data set show that CLDA outperforms many advanced models in mining user interest, and we also confirm that CLDA performs well in recommender systems.
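
A minimal sketch of the aggregation step the paper builds on, with placeholder data: microblog posts are pooled by author into long pseudo-documents before topic modeling (CLDA itself additionally lets the short texts filter the long ones).

```python
# Sketch: aggregating short microblog posts into long pseudo-documents (here, by author).
# Posts are placeholders; this shows only the aggregation step, not the CLDA model.
from collections import defaultdict

posts = [
    ("user1", "new phone battery dies fast"),
    ("user2", "great pasta recipe tonight"),
    ("user1", "phone screen cracked again"),
    ("user2", "trying a new dessert recipe"),
]

long_docs = defaultdict(list)
for author, text in posts:
    long_docs[author].extend(text.split())

for author, tokens in long_docs.items():
    print(author, tokens)   # one aggregated pseudo-document per author
```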


Author(s):  
R. Derbanosov ◽  
M. Bakhanova

Probabilistic topic modeling is a tool for statistical text analysis that can provide information about the inner structure of a large corpus of documents. The most popular models, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation, produce topics in the form of discrete distributions over the set of all words in the corpus. They build topics using an iterative algorithm that starts from some random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to random initialization: different initial points can produce significantly different solutions. Several studies have shown that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use the BigARTM library to perform experiments on several text collections. We show that using side information as an additional modality improves topic stability without significant loss of model quality.
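
A rough sketch of a two-modality model in BigARTM's Python package (artm), assuming a collection already converted to BigARTM's Vowpal Wabbit format with a text modality (@default_class) and a side-information modality (@meta); the file paths and modality names are placeholders, not those used in the paper.

```python
# Rough sketch: an ARTM model with an additional modality for side information.
# "collection.vw" is assumed to be in BigARTM's Vowpal Wabbit format with
# @default_class (text) and @meta (side information) modalities; paths are placeholders.
import artm

bv = artm.BatchVectorizer(data_path="collection.vw",
                          data_format="vowpal_wabbit",
                          target_folder="batches")

model = artm.ARTM(num_topics=20,
                  class_ids={"@default_class": 1.0, "@meta": 1.0})  # modality weights
model.initialize(dictionary=bv.dictionary)
model.fit_offline(batch_vectorizer=bv, num_collection_passes=15)

phi = model.get_phi()   # word-topic matrix covering both modalities
print(phi.head())
```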


2012 ◽  
Vol 18 (2) ◽  
pp. 263-289 ◽  
Author(s):  
Dingcheng Li ◽  
Swapna Somasundaran ◽  
Amit Chakraborty

This paper proposes a novel application of topic models to entity relation detection (ERD). To make use of the latent semantics of text, we formulate the task of relation detection as a topic modeling problem. The motivation is to find underlying topics that are indicative of relations between named entities (NEs). Our approach treats pairs of NEs and the features associated with them as mini documents, and aims to use the underlying topic distributions as indicators of the types of relations that may exist between the NE pair. Our system, ERD-MedLDA, adapts Maximum Entropy Discriminant Latent Dirichlet Allocation (MedLDA) with mixed membership for relation detection. By using supervision, ERD-MedLDA is able to learn topic distributions indicative of relation types. Further, ERD-MedLDA combines the benefits of both maximum likelihood estimation (MLE) and maximum margin estimation (MME), and the mixed-membership formulation enables the system to incorporate heterogeneous features. We incorporate different features into the system and perform experiments on the ACE 2005 corpus. Our approach achieves better overall precision, recall, and F-measure than baseline SVM-based and LDA-based models, and it shows better and more consistent improvements than the baselines as complex informative features are added.
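
A small illustration of the "mini document" construction only (not of MedLDA inference), with made-up entities and features: each NE pair plus its associated features becomes one pseudo-document for the topic model.

```python
# Sketch: turning named-entity pairs plus context features into "mini documents".
# Entities and features are made up; a supervised topic model would be trained on these docs.
sentences = [
    {"entities": ("Barack Obama", "Hawaii"),
     "features": ["born_in", "PERSON", "GPE", "prep_in"]},
    {"entities": ("Google", "Mountain View"),
     "features": ["headquartered", "ORG", "GPE", "prep_in"]},
]

mini_documents = []
for s in sentences:
    e1, e2 = s["entities"]
    # A mini document mixes lexical/syntactic features with entity information.
    mini_documents.append(s["features"] + [f"E1={e1}", f"E2={e2}"])

for doc in mini_documents:
    print(doc)
```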


2019 ◽  
Vol 9 (21) ◽  
pp. 4565 ◽  
Author(s):  
Youngjae Im ◽  
Jaehyun Park ◽  
Minyeong Kim ◽  
Kijung Park

Latent Dirichlet allocation (LDA) is a representative topic model for extracting keywords related to latent topics embedded in a document set. Despite its effectiveness in finding underlying topics in documents, the traditional LDA algorithm has no process for reflecting the sentiment of text during topic extraction. Focusing on this issue, this study investigates the usability of both LDA and sentiment analysis (SA) algorithms depending on the affective level of the text. This study defines the affective level of a given set of paragraphs and analyzes the perceived trust in the two methodologies with regard to usability. In our experiments, text from the college scholastic ability test was selected as the set of evaluation paragraphs, and the affective level of the paragraphs was manipulated at three levels (low, medium, and high) as an independent variable. The LDA algorithm was used to extract the keywords of each paragraph, while SA was used to identify the positive or negative mood of the extracted subject words. In addition, the perceived trust score of each algorithm was rated by the subjects, and this study verifies whether the score differs according to the affective level of the paragraphs. The results show that paragraphs with low affect lead to high perceived trust in LDA from the participants. However, the perceived trust in SA does not show a statistically significant difference between affect levels. The findings indicate that LDA is more effective at finding topics in text that mainly contains objective information.
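
A minimal sketch of the two-step pipeline described above (LDA keywords, then sentiment of the extracted words), using gensim and NLTK's VADER analyzer on placeholder English text rather than the Korean test paragraphs used in the study:

```python
# Sketch: extract topic keywords with LDA, then score their sentiment with VADER.
# English placeholder text; the study itself used Korean college-entrance-exam paragraphs.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from gensim.corpora import Dictionary
from gensim.models import LdaModel

nltk.download("vader_lexicon", quiet=True)

paragraphs = [["wonderful", "holiday", "beach", "sunny"],
              ["terrible", "traffic", "delay", "accident"]]
dictionary = Dictionary(paragraphs)
corpus = [dictionary.doc2bow(p) for p in paragraphs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

sia = SentimentIntensityAnalyzer()
for topic_id in range(2):
    keywords = [word for word, _ in lda.show_topic(topic_id, topn=3)]
    scores = {w: sia.polarity_scores(w)["compound"] for w in keywords}
    print(topic_id, scores)   # positive/negative leaning of each extracted keyword
```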


PLoS ONE ◽  
2021 ◽  
Vol 16 (5) ◽  
pp. e0252502
Author(s):  
Qiqing Wang ◽  
Cunbin Li

This study investigates the evolution of provincial new energy policies and industries in China using a topic modeling approach. To this end, six of China's 31 provinces are first selected as research samples, and central and provincial new energy policies from 2010 to 2019 are collected to establish a text corpus of 23,674 documents. The policy corpus is then fed to two different topic models: latent Dirichlet allocation, for modeling static policy topics, and the Dynamic Topic Model, for extracting topics over time. Finally, the obtained topics are mapped onto policy tools for comparison. The dynamic policy topics are further analyzed with panel data from provincial new energy industries. The results show that provincial new energy policies moved onto different tracks after about 2014 due to regional conditions such as the economy and CO2 emission intensity. Underdeveloped provinces tend to use environment-oriented tools to regulate and control CO2 emissions, while developed regions employ a more balanced policy mix to promote new energy vehicles and other industries. Widespread hysteretic effects are revealed in the correlation analysis of the policy topics and new energy capacity.
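
A compact sketch of the two-model setup, with placeholder documents standing in for the policy corpus: gensim's LdaModel for static topics and its LdaSeqModel (a dynamic topic model) for topics over time, with documents grouped into consecutive time slices.

```python
# Sketch: static LDA plus a dynamic topic model over time slices.
# Documents and time slices are placeholders for the policy corpus.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LdaSeqModel

docs = [["solar", "subsidy", "grid"], ["wind", "subsidy", "tariff"],
        ["vehicle", "charging", "battery"], ["vehicle", "emission", "battery"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

static_lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                      passes=10, random_state=0)

# Two time slices of two documents each (e.g., two consecutive years).
dtm = LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=[2, 2], num_topics=2)
print(dtm.print_topics(time=0))   # topics in the first time slice
```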


2018 ◽  
Author(s):  
Qian Liu ◽  
Qiuyi Chen ◽  
Jiayi Shen ◽  
Huailiang Wu ◽  
Yimeng Sun ◽  
...  

BACKGROUND Thirdhand smoke (THS) has been a growing topic in China for years. THS consists of residual tobacco smoke pollutants that remain on surfaces and in dust. These pollutants are re-emitted as gases or react with oxidants and other compounds in the environment to yield secondary pollutants. OBJECTIVE Collecting media reports on THS from major media outlets and analyzing them with topic modeling can facilitate a better understanding of the role the media plays in communicating this health issue to the public. METHODS The data were retrieved from the Wiser and Factiva news databases. A preliminary investigation focused on articles dated between January 1, 2013, and December 31, 2017. Latent Dirichlet allocation yielded the top 10 topics about THS. The modified LDAvis tool provided an overall view of the topic model, which visualizes different topics as circles. Multidimensional scaling was used to represent the intertopic distances on a two-dimensional plane. RESULTS We found 745 articles dated between January 1, 2013, and December 31, 2017. The United States ranked first in terms of publications (152 articles on THS from 2013-2017). We found 279 news reports about THS from the Chinese media over the same period and 363 news reports from the United States. In our analysis of the percentage of THS-related news in China, Topic 1 (Cancer) was the most popular topic and was mentioned in 31.9% of all news stories. Topic 2 (Control of quitting smoking) accounted for roughly 15% of news items on THS. CONCLUSIONS Data analysis and the visualization of news articles can generate useful information. Our study shows that topic modeling can offer insights into news reports related to THS. This analysis of media trends indicates that related diseases, air and particulate matter (PM2.5), and control and restrictions are the major concerns of the Chinese media reporting on THS. The Chinese press still needs to report on THS more fully, based on scientific evidence and with less focus on sensational headlines. We recommend additional studies on sentiment analysis of news data to verify and measure the influence of THS-related topics.
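
A small sketch of the visualization step with placeholder documents standing in for the THS news corpus, using pyLDAvis (whose default layout also places topics as circles on a two-dimensional plane via multidimensional scaling):

```python
# Sketch: fitting LDA on news text and exporting an interactive topic visualization.
# Documents are placeholders for the THS news corpus.
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["smoke", "residue", "cancer", "risk"], ["smoking", "ban", "policy"],
        ["children", "exposure", "dust", "home"], ["air", "pollution", "particulate"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

vis = gensimvis.prepare(lda, corpus, dictionary)   # topics drawn as circles, laid out by MDS
pyLDAvis.save_html(vis, "ths_topics.html")
```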

