Topic Models: A Tutorial with R

2014 ◽  
Vol 08 (01) ◽  
pp. 85-98 ◽  
Author(s):  
G. Manning Richardson ◽  
Janet Bowers ◽  
A. John Woodill ◽  
Joseph R. Barr ◽  
Jean Mark Gawron ◽  
...  

This tutorial presents topic models for organizing and comparing documents. The technique and corresponding discussion focus on the analysis of short text documents, particularly micro-blogs. However, the base topic model and R implementation are generally applicable to text analytics of document databases.
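
As a rough picture of the workflow such a tutorial covers, the sketch below fits a small LDA model. The tutorial itself uses R; this stand-in uses Python's gensim, and the toy corpus is invented.

```python
# Minimal LDA sketch (Python/gensim stand-in for the tutorial's R workflow).
from gensim import corpora, models

docs = [  # toy pre-tokenized corpus, invented for illustration
    ["topic", "models", "organize", "documents"],
    ["short", "text", "micro", "blogs"],
    ["topic", "models", "compare", "documents"],
]

dictionary = corpora.Dictionary(docs)               # token -> integer id
bow = [dictionary.doc2bow(d) for d in docs]         # bag-of-words per document
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

for topic_id, words in lda.show_topics(num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])          # top words per topic
```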

Author(s):  
Ximing Li ◽  
Jiaojiao Zhang ◽  
Jihong Ouyang

Conventional topic models suffer from a severe sparsity problem when facing extremely short texts such as social media posts. The family of Dirichlet multinomial mixture (DMM) models can handle the sparsity problem; however, they remain very sensitive to ordinary and noisy words, resulting in inaccurate topic representations at the document level. In this paper, we alleviate this problem by preserving the local neighborhood structure of short texts, enabling topical signals to spread among neighboring documents and thereby correct inaccurate topic representations. This is achieved with variational manifold regularization, which constrains close short texts to have similar variational topic representations. Building on this idea, we propose a novel Laplacian DMM (LapDMM) topic model. During document graph construction, we further use the word mover’s distance with word embeddings to measure document similarities at the semantic level. To evaluate LapDMM, we compare it against state-of-the-art short text topic models on several traditional tasks. Experimental results demonstrate that LapDMM achieves very significant performance gains over baseline models, e.g., scores around 0.2 higher on clustering and classification tasks in many cases.
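
The neighborhood idea can be pictured as building a k-nearest-neighbor graph over documents under the word mover's distance (WMD). A minimal sketch, assuming pretrained word embeddings loaded through gensim (the embeddings file is hypothetical; LapDMM's inference itself is not shown):

```python
# Sketch: document graph from word mover's distance, the neighborhood
# structure LapDMM uses for manifold regularization. The embeddings path
# is hypothetical; WMD comes from gensim's KeyedVectors.wmdistance.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.txt")  # hypothetical file

docs = [
    ["cheap", "flights", "to", "paris"],
    ["low", "cost", "airfare", "france"],
    ["football", "scores", "tonight"],
]

k = 1                                     # neighbors kept per document
graph = {}
for i, di in enumerate(docs):
    dists = sorted((vectors.wmdistance(di, dj), j)
                   for j, dj in enumerate(docs) if j != i)
    graph[i] = [j for _, j in dists[:k]]  # smaller WMD = semantically closer
print(graph)                              # adjacency used by the regularizer
```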


Author(s):  
Carlo Schwarz

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
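
The two distributions mentioned here are easy to inspect once a model is fitted. The sketch below uses scikit-learn in Python rather than Stata (the ldagibbs command itself is not reproduced):

```python
# Sketch of LDA's two outputs, shown with scikit-learn instead of Stata:
# each document gets a distribution over topics, and each topic a
# distribution over words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = ["stock market trading", "football match results",
         "market prices fall", "championship match tonight"]

counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

doc_topic = lda.transform(counts)        # rows sum to 1: P(topic | document)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(doc_topic.round(2))                # document-level topic mixtures
print(topic_word.round(2))               # each row: P(word | topic)
```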


Author(s):  
Kristofferson Culmer ◽  
Jeffrey Uhlmann

The short lengths of tweets present a challenge for topic modeling to extend beyond what is provided explicitly by hashtag information. This is particularly true for LDA-based methods, because the amount of information available from per-tweet statistical analysis is severely limited. In this paper we present LDA2Vec paired with temporal tweet pooling (LDA2VecTTP) and assess its performance on this problem relative to traditional LDA and to the Biterm Topic Model (Biterm), which was developed specifically for topic modeling on short text documents. We paired each of the three topic modeling algorithms with three tweet pooling schemes: no pooling, author-based pooling, and temporal pooling. We then conducted topic modeling on two Twitter datasets using each of the algorithms and tweet pooling schemes. Our results on the larger dataset suggest that LDA2VecTTP can produce higher coherence scores and more logically coherent and interpretable topics.
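
The pooling step itself is simple: instead of treating each tweet as one document, tweets are concatenated per author or per time window before modeling. A minimal sketch of the three schemes (the record layout is invented):

```python
# Sketch of the three tweet pooling schemes compared in the paper. Each
# scheme turns a tweet stream into "documents" for the topic model.
from collections import defaultdict

tweets = [  # hypothetical records: (author, unix_time, tokens)
    ("alice", 1000, ["rain", "today"]),
    ("bob",   1010, ["match", "tonight"]),
    ("alice", 1090, ["cold", "rain"]),
]

def no_pooling(stream):
    return [tokens for _, _, tokens in stream]       # one doc per tweet

def author_pooling(stream):
    pools = defaultdict(list)
    for author, _, tokens in stream:
        pools[author].extend(tokens)                 # one doc per author
    return list(pools.values())

def temporal_pooling(stream, window=60):
    pools = defaultdict(list)
    for _, t, tokens in stream:
        pools[t // window].extend(tokens)            # one doc per time window
    return list(pools.values())

print(no_pooling(tweets), author_pooling(tweets), temporal_pooling(tweets))
```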


2020 ◽  
Vol 2020 ◽  
pp. 1-17
Author(s):  
Jocelyn Mazarura ◽  
Alta de Waal ◽  
Pieter de Villiers

Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in documents of fixed length. The Poisson distribution has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. Furthermore, the few Poisson topic models in the literature are admixture models, making the assumption that a document is generated from a mixture of topics. In this study, we focus on short text. Many studies have shown that the simpler assumption of a mixture model fits short text better. With mixture models, as opposed to admixture models, the generative assumption is that a document is generated from a single topic. One topic model, which makes this one-topic-per-document assumption, is the Dirichlet-multinomial mixture model. The main contributions of this work are a new Gamma-Poisson mixture model, as well as a collapsed Gibbs sampler for the model. The benefit of the collapsed Gibbs sampler derivation is that the model is able to automatically select the number of topics contained in the corpus. The results show that the Gamma-Poisson mixture model performs better than the Dirichlet-multinomial mixture model at selecting the number of topics in labelled corpora. Furthermore, the Gamma-Poisson mixture produces better topic coherence scores than the Dirichlet-multinomial mixture model, thus making it a viable option for the challenging task of topic modelling of short text.
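
The one-topic-per-document assumption keeps collapsed Gibbs sampling simple: each sweep reassigns a whole document to a single topic, conditioned on all other assignments. The sketch below implements the Dirichlet-multinomial mixture baseline (GSDMM-style) to show those mechanics; the paper's model replaces the multinomial likelihood with a Gamma-Poisson one, which is not shown here.

```python
# Collapsed Gibbs sampling for a Dirichlet-multinomial MIXTURE model
# (one topic per document), GSDMM-style. Mechanics only; the paper's
# Gamma-Poisson likelihood is not implemented.
import math, random

def gibbs_dmm(docs, V, K, alpha=0.1, beta=0.1, iters=50, seed=0):
    """docs: list of documents, each a list of word ids in [0, V)."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in docs]     # one topic per document
    m = [0] * K                              # documents per topic
    n = [[0] * V for _ in range(K)]          # word counts per topic
    nsum = [0] * K                           # total tokens per topic

    def move(d, k, sign):                    # add/remove doc d to/from topic k
        m[k] += sign
        for w in docs[d]:
            n[k][w] += sign
            nsum[k] += sign

    for d in range(len(docs)):
        move(d, z[d], +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            move(d, z[d], -1)                # take the doc out of its topic
            counts = {}
            for w in doc:
                counts[w] = counts.get(w, 0) + 1
            logp = []
            for k in range(K):               # log P(z_d = k | all other docs)
                lp = math.log(m[k] + alpha)
                for w, c in counts.items():
                    for j in range(c):
                        lp += math.log(n[k][w] + beta + j)
                for j in range(len(doc)):
                    lp -= math.log(nsum[k] + V * beta + j)
                logp.append(lp)
            mx = max(logp)
            weights = [math.exp(l - mx) for l in logp]
            z[d] = rng.choices(range(K), weights=weights)[0]
            move(d, z[d], +1)
    return z

docs = [[0, 1, 1], [0, 1], [2, 3], [2, 3, 3]]    # toy corpus, V = 4
print(gibbs_dmm(docs, V=4, K=2))
```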


Author(s):  
Pankaj Gupta ◽  
Yatin Chaudhary ◽  
Florian Buettner ◽  
Hinrich Schütze

We address two challenges in topic models: (1) Context information around words helps in determining their actual meaning, e.g., “networks” used in the contexts artificial neural networks vs. biological neuron networks. Generative topic models infer topic-word distributions taking no or only little context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. The proposed model is named iDocNADE. (2) Due to the small number of word occurrences (i.e., lack of context) in short texts and data sparsity in a corpus of few documents, applying topic models to such texts is challenging. Therefore, we propose a simple and efficient way of incorporating external knowledge into neural autoregressive topic models: we use word embeddings as a distributional prior. The proposed variants are named DocNADEe and iDocNADEe. We present novel neural autoregressive topic model variants that consistently outperform state-of-the-art generative topic models in terms of generalization, interpretability (topic coherence) and applicability (retrieval and classification) over 7 long-text and 8 short-text datasets from diverse domains.
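
The "embeddings as a distributional prior" idea can be sketched independently of the DocNADE architecture: the model's word logits are augmented with scores from a fixed pretrained embedding matrix, so words with sparse evidence borrow strength from their neighbors in embedding space. A schematic, simplified version (not the exact DocNADEe computation):

```python
# Schematic of word embeddings as a distributional prior: the topic model's
# own logits are mixed with logits from a fixed pretrained embedding matrix E.
# A simplified picture, not the exact DocNADEe architecture.
import numpy as np

rng = np.random.default_rng(0)
V, H = 1000, 50                      # vocabulary size, hidden size
E = rng.normal(size=(V, H))          # pretrained word embeddings (kept fixed)
W = rng.normal(size=(V, H))          # model's learned output word matrix
h = rng.normal(size=H)               # hidden state for the current context

lam = 0.5                            # weight of the embedding prior
logits = W @ h + lam * (E @ h)       # model evidence + embedding-based prior
p = np.exp(logits - logits.max())
p /= p.sum()                         # softmax: distribution over the next word
```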


2018 ◽  
Vol 15 ◽  
pp. 101-112
Author(s):  
So-Hyun Park ◽  
Ae-Rin Song ◽  
Young-Ho Park ◽  
Sun-Young Ihm

2014 ◽  
Vol 4 (1) ◽  
pp. 29-45 ◽  
Author(s):  
Rami Ayadi ◽  
Mohsen Maraoui ◽  
Mounir Zrigui

In this paper, the authors present a latent topic model to index and represent Arabic text documents in a way that reflects more of their semantics. Text representation in a language with highly inflectional morphology such as Arabic is not a trivial task and requires special treatment. The authors describe their approach for analyzing and preprocessing Arabic text, and then the stemming process. Finally, latent Dirichlet allocation (LDA) is adapted to extract Arabic latent topics: significant topics are extracted from all texts, each topic is described by a particular distribution of descriptors, and each text is then represented as a vector over these topics. A classification experiment is conducted on an in-house corpus; latent topics are learned with LDA for different topic numbers K (25, 50, 75, and 100), and the results are compared with classification in the full word space. The results show that, in terms of precision, recall, and F-measure, classification in the reduced topic space outperforms both classification in the full word space and classification with LSI reduction.
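
The experimental design, classification in a reduced topic space versus the full word space, can be pictured as below. This is a generic scikit-learn sketch with invented English toy data; the paper's Arabic preprocessing and stemming pipeline is not reproduced, and the toy K values are smaller than the paper's 25 to 100.

```python
# Sketch: classification in LDA topic space vs. full word space, mirroring
# the paper's experimental design. Toy English data; the Arabic-specific
# preprocessing/stemming is not reproduced.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

texts = ["market prices rise", "stocks fall sharply",
         "team wins final", "player scores goal"] * 10
labels = [0, 0, 1, 1] * 10

X_words = CountVectorizer().fit_transform(texts)     # full word space
for K in (2, 4, 8):                                  # paper uses 25, 50, 75, 100
    X_topics = LatentDirichletAllocation(
        n_components=K, random_state=0).fit_transform(X_words)
    print("topics", K, cross_val_score(LinearSVC(), X_topics, labels, cv=3).mean())
print("words  ", cross_val_score(LinearSVC(), X_words, labels, cv=3).mean())
```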


2018 ◽  
Vol 45 (4) ◽  
pp. 554-570 ◽  
Author(s):  
Jian Jin ◽  
Qian Geng ◽  
Haikun Mou ◽  
Chong Chen

Interdisciplinary studies are becoming increasingly popular, and the research domains of many experts are becoming diverse. This makes it difficult to recommend experts to review interdisciplinary submissions. In this study, an Author–Subject–Topic (AST) model is proposed in two versions. In the model, reviewers’ subject information is embedded to analyse the topic distributions of submissions and reviewers’ publications. The major difference between the AST and Author–Topic models lies in the introduction of a ‘Subject’ layer, which supervises the generation of hierarchical topics and allows subjects to be shared among authors. To evaluate the performance of the AST model, papers in Information System and Management (a typical interdisciplinary domain) in a famous Chinese academic library are investigated. Comparative experiments show the effectiveness of the AST model in topic distribution analysis and reviewer recommendation for interdisciplinary studies.
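
The downstream recommendation step can be pictured simply: once submissions and reviewers' publications are mapped to topic distributions, reviewers are ranked by distributional similarity. A generic sketch with invented profiles (the AST model itself, with its Subject layer, is not shown):

```python
# Sketch: rank reviewers by similarity between a submission's topic
# distribution and each reviewer's aggregate topic distribution.
# Generic cosine ranking; the AST model's Subject layer is not shown.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

submission = np.array([0.6, 0.3, 0.1])    # P(topic | submission)
reviewers = {                             # hypothetical reviewer profiles
    "r1": np.array([0.5, 0.4, 0.1]),
    "r2": np.array([0.1, 0.2, 0.7]),
}
ranking = sorted(reviewers, key=lambda r: cosine(submission, reviewers[r]),
                 reverse=True)
print(ranking)                            # best-matching reviewers first
```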


2011 ◽  
Vol 268-270 ◽  
pp. 697-700
Author(s):  
Rui Xue Duan ◽  
Xiao Jie Wang ◽  
Wen Feng Li

As the volume of short text documents on the Internet grows tremendously, organizing them well has become an urgent task. However, traditional feature selection methods are not suitable for short texts. In this paper, we propose a method that incorporates syntactic information for short texts. It emphasizes features that have more dependency relations with other words. Our experiments use an SVM classifier in the Weka machine learning environment. The results show that by incorporating syntactic information into short texts we obtain more powerful features than traditional feature selection methods such as DF and CHI. The precision of short text classification improved from 86.2% to 90.8%.
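
The core idea, scoring candidate features by how many dependency relations each word participates in, can be sketched with spaCy standing in for the paper's parser (the paper's exact pipeline and its Weka/SVM setup are not reproduced):

```python
# Sketch: rank candidate features for short texts by the number of dependency
# relations each word participates in (link to head + links to dependents),
# the intuition behind the paper's syntactic feature selection.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

scores = Counter()
for text in ["cheap flights to paris", "book a cheap paris flight"]:
    for tok in nlp(text):
        # one relation to the head (unless root) plus one per dependent
        degree = (0 if tok.dep_ == "ROOT" else 1) + len(list(tok.children))
        scores[tok.lemma_] += degree

print(scores.most_common(5))         # highest-degree words become features
```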


Author(s):  
Natalia Vasilievna Salomatina ◽  
Irina Semenovna Kononenko ◽  
Elena Anatolvna Sidorova ◽  
Ivan Sergeevich Pimenov ◽  
...  

This work analyzes the efficiency of a recognition feature: whether argumentative statements fall within the same topical fragment of a text. The study is performed with the aim of using this feature in the automatic recognition of argumentative structures in popular science texts written in Russian. The topic model of a text is constructed from superphrasal units (text fragments united by one topic), which are identified by detecting clusters of words and word combinations using scan statistics. Potential relations extracted from the topic models are verified against texts with manually annotated argumentation structures. The comparison between potential (topic-model-based) and manually constructed relations is performed automatically. Macro-averaged precision and recall are 48.6% and 76.2%, respectively.
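
The reported macro-averaged scores can be computed as below. This is a generic sketch: the predicted and gold relation sets are hypothetical stand-ins for the extracted and manually annotated structures.

```python
# Sketch: macro-averaged precision/recall of predicted vs. gold relations,
# the evaluation the paper reports (48.6% precision, 76.2% recall).
# Inputs are hypothetical sets of (fragment_i, fragment_j) relation pairs.
def prf(pred, gold):
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

per_text = [  # one (predicted, gold) pair of relation sets per text
    ({(0, 1), (1, 2)}, {(0, 1), (2, 3)}),
    ({(0, 2)},         {(0, 2), (1, 2)}),
]
p_scores, r_scores = zip(*(prf(p, g) for p, g in per_text))
print(sum(p_scores) / len(p_scores), sum(r_scores) / len(r_scores))
```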

