Topic Models: A Tutorial with R

2014 ◽  
Vol 08 (01) ◽  
pp. 85-98 ◽  
Author(s):  
G. Manning Richardson ◽  
Janet Bowers ◽  
A. John Woodill ◽  
Joseph R. Barr ◽  
Jean Mark Gawron ◽  
...  

This tutorial presents topic models for organizing and comparing documents. The technique and corresponding discussion focus on the analysis of short text documents, particularly micro-blogs. However, the base topic model and R implementation are generally applicable to text analytics of document databases.
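
As a rough picture of the workflow such a tutorial covers, the sketch below fits a small LDA model. The tutorial itself uses R; this stand-in uses Python's gensim, and the toy corpus is invented.

```python
# Minimal LDA sketch (Python/gensim stand-in for the tutorial's R workflow).
from gensim import corpora, models

docs = [  # toy pre-tokenized corpus, invented for illustration
    ["topic", "models", "organize", "documents"],
    ["short", "text", "micro", "blogs"],
    ["topic", "models", "compare", "documents"],
]

dictionary = corpora.Dictionary(docs)               # token -> integer id
bow = [dictionary.doc2bow(d) for d in docs]         # bag-of-words per document
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

for topic_id, words in lda.show_topics(num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])          # top words per topic
```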

Author(s):  
Ximing Li ◽  
Jiaojiao Zhang ◽  
Jihong Ouyang

Conventional topic models suffer from a severe sparsity problem when facing extremely short texts such as social media posts. The family of Dirichlet multinomial mixture (DMM) models can handle the sparsity problem; however, they remain very sensitive to ordinary and noisy words, resulting in inaccurate topic representations at the document level. In this paper, we alleviate this problem by preserving the local neighborhood structure of short texts, enabling topical signals to spread among neighboring documents and thereby correct inaccurate topic representations. This is achieved with variational manifold regularization, which constrains close short texts to have similar variational topic representations. Building on this idea, we propose a novel Laplacian DMM (LapDMM) topic model. During document graph construction, we further use the word mover’s distance with word embeddings to measure document similarities at the semantic level. To evaluate LapDMM, we compare it against state-of-the-art short text topic models on several traditional tasks. Experimental results demonstrate that LapDMM achieves very significant performance gains over baseline models, e.g., scores around 0.2 higher on clustering and classification tasks in many cases.
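
The neighborhood idea can be pictured as building a k-nearest-neighbor graph over documents under the word mover's distance (WMD). A minimal sketch, assuming pretrained word embeddings loaded through gensim (the embeddings file is hypothetical; LapDMM's inference itself is not shown):

```python
# Sketch: document graph from word mover's distance, the neighborhood
# structure LapDMM uses for manifold regularization. The embeddings path
# is hypothetical; WMD comes from gensim's KeyedVectors.wmdistance.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.txt")  # hypothetical file

docs = [
    ["cheap", "flights", "to", "paris"],
    ["low", "cost", "airfare", "france"],
    ["football", "scores", "tonight"],
]

k = 1                                     # neighbors kept per document
graph = {}
for i, di in enumerate(docs):
    dists = sorted((vectors.wmdistance(di, dj), j)
                   for j, dj in enumerate(docs) if j != i)
    graph[i] = [j for _, j in dists[:k]]  # smaller WMD = semantically closer
print(graph)                              # adjacency used by the regularizer
```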


Author(s):  
Carlo Schwarz

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
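
The two distributions mentioned here are easy to inspect once a model is fitted. The sketch below uses scikit-learn in Python rather than Stata (the ldagibbs command itself is not reproduced):

```python
# Sketch of LDA's two outputs, shown with scikit-learn instead of Stata:
# each document gets a distribution over topics, and each topic a
# distribution over words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = ["stock market trading", "football match results",
         "market prices fall", "championship match tonight"]

counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

doc_topic = lda.transform(counts)        # rows sum to 1: P(topic | document)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(doc_topic.round(2))                # document-level topic mixtures
print(topic_word.round(2))               # each row: P(word | topic)
```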


Author(s):  
Kristofferson Culmer ◽  
Jeffrey Uhlmann

The short lengths of tweets present a challenge for topic modeling to extend beyond what is provided explicitly by hashtag information. This is particularly true for LDA-based methods, because the amount of information available from per-tweet statistical analysis is severely limited. In this paper we present LDA2Vec paired with temporal tweet pooling (LDA2VecTTP) and assess its performance on this problem relative to traditional LDA and to the Biterm Topic Model (Biterm), which was developed specifically for topic modeling on short text documents. We paired each of the three topic modeling algorithms with three tweet pooling schemes: no pooling, author-based pooling, and temporal pooling. We then conducted topic modeling on two Twitter datasets using each of the algorithms and tweet pooling schemes. Our results on the larger dataset suggest that LDA2VecTTP can produce higher coherence scores and more logically coherent and interpretable topics.
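
The pooling step itself is simple: instead of treating each tweet as one document, tweets are concatenated per author or per time window before modeling. A minimal sketch of the three schemes (the record layout is invented):

```python
# Sketch of the three tweet pooling schemes compared in the paper. Each
# scheme turns a tweet stream into "documents" for the topic model.
from collections import defaultdict

tweets = [  # hypothetical records: (author, unix_time, tokens)
    ("alice", 1000, ["rain", "today"]),
    ("bob",   1010, ["match", "tonight"]),
    ("alice", 1090, ["cold", "rain"]),
]

def no_pooling(stream):
    return [tokens for _, _, tokens in stream]       # one doc per tweet

def author_pooling(stream):
    pools = defaultdict(list)
    for author, _, tokens in stream:
        pools[author].extend(tokens)                 # one doc per author
    return list(pools.values())

def temporal_pooling(stream, window=60):
    pools = defaultdict(list)
    for _, t, tokens in stream:
        pools[t // window].extend(tokens)            # one doc per time window
    return list(pools.values())

print(no_pooling(tweets), author_pooling(tweets), temporal_pooling(tweets))
```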


2020 ◽  
Vol 2020 ◽  
pp. 1-17
Author(s):  
Jocelyn Mazarura ◽  
Alta de Waal ◽  
Pieter de Villiers

Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in documents of fixed length. The Poisson distribution has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. Furthermore, the few Poisson topic models in the literature are admixture models, making the assumption that a document is generated from a mixture of topics. In this study, we focus on short text. Many studies have shown that the simpler assumption of a mixture model fits short text better. With mixture models, as opposed to admixture models, the generative assumption is that a document is generated from a single topic. One topic model, which makes this one-topic-per-document assumption, is the Dirichlet-multinomial mixture model. The main contributions of this work are a new Gamma-Poisson mixture model, as well as a collapsed Gibbs sampler for the model. The benefit of the collapsed Gibbs sampler derivation is that the model is able to automatically select the number of topics contained in the corpus. The results show that the Gamma-Poisson mixture model performs better than the Dirichlet-multinomial mixture model at selecting the number of topics in labelled corpora. Furthermore, the Gamma-Poisson mixture produces better topic coherence scores than the Dirichlet-multinomial mixture model, thus making it a viable option for the challenging task of topic modelling of short text.
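
The one-topic-per-document assumption keeps collapsed Gibbs sampling simple: each sweep reassigns a whole document to a single topic, conditioned on all other assignments. The sketch below implements the Dirichlet-multinomial mixture baseline (GSDMM-style) to show those mechanics; the paper's model replaces the multinomial likelihood with a Gamma-Poisson one, which is not shown here.

```python
# Collapsed Gibbs sampling for a Dirichlet-multinomial MIXTURE model
# (one topic per document), GSDMM-style. Mechanics only; the paper's
# Gamma-Poisson likelihood is not implemented.
import math, random

def gibbs_dmm(docs, V, K, alpha=0.1, beta=0.1, iters=50, seed=0):
    """docs: list of documents, each a list of word ids in [0, V)."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in docs]     # one topic per document
    m = [0] * K                              # documents per topic
    n = [[0] * V for _ in range(K)]          # word counts per topic
    nsum = [0] * K                           # total tokens per topic

    def move(d, k, sign):                    # add/remove doc d to/from topic k
        m[k] += sign
        for w in docs[d]:
            n[k][w] += sign
            nsum[k] += sign

    for d in range(len(docs)):
        move(d, z[d], +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            move(d, z[d], -1)                # take the doc out of its topic
            counts = {}
            for w in doc:
                counts[w] = counts.get(w, 0) + 1
            logp = []
            for k in range(K):               # log P(z_d = k | all other docs)
                lp = math.log(m[k] + alpha)
                for w, c in counts.items():
                    for j in range(c):
                        lp += math.log(n[k][w] + beta + j)
                for j in range(len(doc)):
                    lp -= math.log(nsum[k] + V * beta + j)
                logp.append(lp)
            mx = max(logp)
            weights = [math.exp(l - mx) for l in logp]
            z[d] = rng.choices(range(K), weights=weights)[0]
            move(d, z[d], +1)
    return z

docs = [[0, 1, 1], [0, 1], [2, 3], [2, 3, 3]]    # toy corpus, V = 4
print(gibbs_dmm(docs, V=4, K=2))
```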


Author(s):  
Pankaj Gupta ◽  
Yatin Chaudhary ◽  
Florian Buettner ◽  
Hinrich Schütze

We address two challenges in topic models: (1) Context information around words helps in determining their actual meaning, e.g., “networks” used in the contexts artificial neural networks vs. biological neuron networks. Generative topic models infer topic-word distributions taking no or only little context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. The proposed model is named iDocNADE. (2) Due to the small number of word occurrences (i.e., lack of context) in short texts and data sparsity in a corpus of few documents, applying topic models to such texts is challenging. Therefore, we propose a simple and efficient way of incorporating external knowledge into neural autoregressive topic models: we use word embeddings as a distributional prior. The proposed variants are named DocNADEe and iDocNADEe. We present novel neural autoregressive topic model variants that consistently outperform state-of-the-art generative topic models in terms of generalization, interpretability (topic coherence) and applicability (retrieval and classification) over 7 long-text and 8 short-text datasets from diverse domains.
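
The "embeddings as a distributional prior" idea can be sketched independently of the DocNADE architecture: the model's word logits are augmented with scores from a fixed pretrained embedding matrix, so words with sparse evidence borrow strength from their neighbors in embedding space. A schematic, simplified version (not the exact DocNADEe computation):

```python
# Schematic of word embeddings as a distributional prior: the topic model's
# own logits are mixed with logits from a fixed pretrained embedding matrix E.
# A simplified picture, not the exact DocNADEe architecture.
import numpy as np

rng = np.random.default_rng(0)
V, H = 1000, 50                      # vocabulary size, hidden size
E = rng.normal(size=(V, H))          # pretrained word embeddings (kept fixed)
W = rng.normal(size=(V, H))          # model's learned output word matrix
h = rng.normal(size=H)               # hidden state for the current context

lam = 0.5                            # weight of the embedding prior
logits = W @ h + lam * (E @ h)       # model evidence + embedding-based prior
p = np.exp(logits - logits.max())
p /= p.sum()                         # softmax: distribution over the next word
```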


2018 ◽  
Vol 15 ◽  
pp. 101-112
Author(s):  
So-Hyun Park ◽  
Ae-Rin Song ◽  
Young-Ho Park ◽  
Sun-Young Ihm

2014 ◽  
Vol 4 (1) ◽  
pp. 29-45 ◽  
Author(s):  
Rami Ayadi ◽  
Mohsen Maraoui ◽  
Mounir Zrigui

In this paper, the authors present a latent topic model to index and represent Arabic text documents in a way that reflects more of their semantics. Text representation in a language with highly inflectional morphology such as Arabic is not a trivial task and requires special treatment. The authors describe their approach for analyzing and preprocessing Arabic text, and then the stemming process. Finally, latent Dirichlet allocation (LDA) is adapted to extract Arabic latent topics: significant topics are extracted from all texts, each topic is described by a particular distribution of descriptors, and each text is then represented as a vector over these topics. A classification experiment is conducted on an in-house corpus; latent topics are learned with LDA for different topic numbers K (25, 50, 75, and 100), and the results are compared with classification in the full word space. The results show that, in terms of precision, recall, and F-measure, classification in the reduced topic space outperforms both classification in the full word space and classification with LSI reduction.
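
The experimental design, classification in a reduced topic space versus the full word space, can be pictured as below. This is a generic scikit-learn sketch with invented English toy data; the paper's Arabic preprocessing and stemming pipeline is not reproduced, and the toy K values are smaller than the paper's 25 to 100.

```python
# Sketch: classification in LDA topic space vs. full word space, mirroring
# the paper's experimental design. Toy English data; the Arabic-specific
# preprocessing/stemming is not reproduced.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

texts = ["market prices rise", "stocks fall sharply",
         "team wins final", "player scores goal"] * 10
labels = [0, 0, 1, 1] * 10

X_words = CountVectorizer().fit_transform(texts)     # full word space
for K in (2, 4, 8):                                  # paper uses 25, 50, 75, 100
    X_topics = LatentDirichletAllocation(
        n_components=K, random_state=0).fit_transform(X_words)
    print("topics", K, cross_val_score(LinearSVC(), X_topics, labels, cv=3).mean())
print("words  ", cross_val_score(LinearSVC(), X_words, labels, cv=3).mean())
```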


2018 ◽  
Vol 45 (4) ◽  
pp. 554-570 ◽  
Author(s):  
Jian Jin ◽  
Qian Geng ◽  
Haikun Mou ◽  
Chong Chen

Interdisciplinary studies are becoming increasingly popular, and the research domains of many experts are becoming diverse. This makes it difficult to recommend experts to review interdisciplinary submissions. In this study, an Author–Subject–Topic (AST) model is proposed in two versions. In the model, reviewers’ subject information is embedded to analyse the topic distributions of submissions and reviewers’ publications. The major difference between the AST and Author–Topic models lies in the introduction of a ‘Subject’ layer, which supervises the generation of hierarchical topics and allows subjects to be shared among authors. To evaluate the performance of the AST model, papers in Information System and Management (a typical interdisciplinary domain) in a famous Chinese academic library are investigated. Comparative experiments show the effectiveness of the AST model in topic distribution analysis and reviewer recommendation for interdisciplinary studies.
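
The downstream recommendation step can be pictured simply: once submissions and reviewers' publications are mapped to topic distributions, reviewers are ranked by distributional similarity. A generic sketch with invented profiles (the AST model itself, with its Subject layer, is not shown):

```python
# Sketch: rank reviewers by similarity between a submission's topic
# distribution and each reviewer's aggregate topic distribution.
# Generic cosine ranking; the AST model's Subject layer is not shown.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

submission = np.array([0.6, 0.3, 0.1])    # P(topic | submission)
reviewers = {                             # hypothetical reviewer profiles
    "r1": np.array([0.5, 0.4, 0.1]),
    "r2": np.array([0.1, 0.2, 0.7]),
}
ranking = sorted(reviewers, key=lambda r: cosine(submission, reviewers[r]),
                 reverse=True)
print(ranking)                            # best-matching reviewers first
```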


2011 ◽  
Vol 268-270 ◽  
pp. 697-700
Author(s):  
Rui Xue Duan ◽  
Xiao Jie Wang ◽  
Wen Feng Li

As the volume of short text documents on the Internet grows tremendously, organizing them well has become an urgent task. However, traditional feature selection methods are not suitable for short texts. In this paper, we propose a method that incorporates syntactic information for short texts. It emphasizes features that have more dependency relations with other words. Our experiments use an SVM classifier in the Weka machine learning environment. The results show that by incorporating syntactic information into short texts we obtain more powerful features than traditional feature selection methods such as DF and CHI. The precision of short text classification improved from 86.2% to 90.8%.
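
The core idea, scoring candidate features by how many dependency relations each word participates in, can be sketched with spaCy standing in for the paper's parser (the paper's exact pipeline and its Weka/SVM setup are not reproduced):

```python
# Sketch: rank candidate features for short texts by the number of dependency
# relations each word participates in (link to head + links to dependents),
# the intuition behind the paper's syntactic feature selection.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

scores = Counter()
for text in ["cheap flights to paris", "book a cheap paris flight"]:
    for tok in nlp(text):
        # one relation to the head (unless root) plus one per dependent
        degree = (0 if tok.dep_ == "ROOT" else 1) + len(list(tok.children))
        scores[tok.lemma_] += degree

print(scores.most_common(5))         # highest-degree words become features
```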


Author(s):  
Natalia Vasilievna Salomatina ◽  
Irina Semenovna Kononenko ◽  
Elena Anatolvna Sidorova ◽  
Ivan Sergeevich Pimenov ◽  
...  

This work analyzes the efficiency of a recognition feature: whether argumentative statements fall within the same topical fragment of a text. The study is performed with the aim of using this feature in the automatic recognition of argumentative structures in popular science texts written in Russian. The topic model of a text is constructed from superphrasal units (text fragments united by one topic), which are identified by detecting clusters of words and word combinations using scan statistics. Potential relations extracted from the topic models are verified against texts with manually annotated argumentation structures. The comparison between potential (topic-model-based) and manually constructed relations is performed automatically. Macro-averaged precision and recall are 48.6% and 76.2%, respectively.
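
The reported macro-averaged scores can be computed as below. This is a generic sketch: the predicted and gold relation sets are hypothetical stand-ins for the extracted and manually annotated structures.

```python
# Sketch: macro-averaged precision/recall of predicted vs. gold relations,
# the evaluation the paper reports (48.6% precision, 76.2% recall).
# Inputs are hypothetical sets of (fragment_i, fragment_j) relation pairs.
def prf(pred, gold):
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

per_text = [  # one (predicted, gold) pair of relation sets per text
    ({(0, 1), (1, 2)}, {(0, 1), (2, 3)}),
    ({(0, 2)},         {(0, 2), (1, 2)}),
]
p_scores, r_scores = zip(*(prf(p, g) for p, g in per_text))
print(sum(p_scores) / len(p_scores), sum(r_scores) / len(r_scores))
```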

