Examining LDA2Vec and Tweet Pooling for Topic Modeling on Twitter Data

Author(s):  
Kristofferson Culmer ◽  
Jeffrey Uhlmann

The short lengths of tweets present a challenge for topic modeling to extend beyond what is provided explicitly by hashtag information. This is particularly true for LDA-based methods because the amount of information available from per-tweet statistical analysis is severely limited. In this paper we present LDA2Vec paired with temporal tweet pooling (LDA2VecTTP) and assess its performance on this problem relative to traditional LDA and to the Biterm Topic Model (Biterm), which was developed specifically for topic modeling on short text documents. We paired each of the three topic modeling algorithms with three tweet pooling schemes: no pooling, author-based pooling, and temporal pooling. We then conducted topic modeling on two Twitter datasets using each of the algorithms and tweet pooling schemes. Our results on the larger dataset suggest that LDA2VecTTP can produce higher coherence scores and more logically coherent and interpretable topics.
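
As a concrete illustration of the temporal pooling scheme referenced above, the sketch below groups tweets into pseudo-documents by time window. It is a minimal sketch, assuming each tweet carries a text and a timestamp field; the one-hour window and the field names are illustrative assumptions, not the paper's exact settings.

```python
from collections import defaultdict
from datetime import datetime

def temporal_pool(tweets, window_hours=1):
    """Group tweets into pseudo-documents by time window.

    `tweets` is assumed to be a list of dicts with 'text' and
    'created_at' (datetime) fields; the one-hour window is an
    illustrative choice, not the paper's exact setting.
    """
    pools = defaultdict(list)
    for tweet in tweets:
        ts = tweet["created_at"]
        # Bucket key: calendar date plus hour-of-day window index.
        key = (ts.date(), ts.hour // window_hours)
        pools[key].append(tweet["text"])
    # Each pooled document concatenates all tweets in its window.
    return [" ".join(texts) for texts in pools.values()]

tweets = [
    {"text": "LDA on tweets is hard", "created_at": datetime(2021, 5, 1, 9, 10)},
    {"text": "short texts are sparse", "created_at": datetime(2021, 5, 1, 9, 40)},
    {"text": "pooling helps a lot", "created_at": datetime(2021, 5, 1, 12, 5)},
]
print(temporal_pool(tweets))  # two pooled documents: the 9:00 window and the 12:00 window
```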

Author(s):  
Carlo Schwarz

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
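
As a minimal illustration of what the abstract describes (using Python's gensim rather than the Stata ldagibbs command), the sketch below fits LDA on a toy corpus and reads off both distributions; the corpus and parameter values are toy assumptions.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus; real applications use thousands of documents.
docs = [
    ["topic", "model", "text", "cluster"],
    ["stata", "command", "regression", "data"],
    ["text", "document", "topic", "word"],
    ["data", "regression", "estimate", "command"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Two topics, a number chosen by the user, as the abstract describes.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

# Each document is a probability distribution over topics...
print(lda.get_document_topics(corpus[0]))
# ...and each topic is a probability distribution over words.
print(lda.show_topic(0, topn=4))
```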


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Lirong Qiu ◽  
Jia Yu

Against today's big data background, how to effectively mine useful information is a pressing problem. The purpose of this study is to construct a more effective method for mining the interest preferences of users in a particular field in this context. We mainly use a large volume of user text data from microblogs. LDA is an effective text mining method, but applying it directly to the large number of short microblog texts does not work well. In current topic modeling practice, short texts need to be aggregated into long texts to avoid data sparsity. However, the aggregated short texts are mixed with a lot of noise, reducing the accuracy of mining the user's interest preferences. In this paper, we propose Combining Latent Dirichlet Allocation (CLDA), a new topic model that can learn the latent topics of microblog short texts and long texts simultaneously. The data sparsity of short texts is avoided by aggregating long texts to assist in learning short texts, and the short texts are in turn used to filter the long texts to improve mining accuracy, so that long texts and short texts are effectively combined. Experimental results on a real microblog data set show that CLDA outperforms many state-of-the-art models in mining user interest, and we also confirm that CLDA performs well in recommender systems.
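
CLDA's full generative model is not reproduced in the abstract; the sketch below illustrates only the underlying aggregation idea with standard LDA as a stand-in: short texts are pooled by user into long pseudo-documents for training, and topic mixtures are then inferred for the original short texts. The toy posts and user-based pooling are illustrative assumptions.

```python
from collections import defaultdict
from gensim import corpora
from gensim.models import LdaModel

# Toy microblog posts keyed by user; real data has millions of posts.
posts = [
    ("u1", "phone camera review"), ("u1", "new phone battery life"),
    ("u2", "stock market rally"), ("u2", "market interest rates"),
]

# Aggregate each user's short texts into one long pseudo-document.
pools = defaultdict(list)
for user, text in posts:
    pools[user].extend(text.split())
long_docs = list(pools.values())

dictionary = corpora.Dictionary(long_docs)
long_corpus = [dictionary.doc2bow(d) for d in long_docs]

# Learn topics on the aggregated long texts to sidestep sparsity...
lda = LdaModel(long_corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)

# ...then infer topic mixtures for the original short texts.
for _, text in posts:
    bow = dictionary.doc2bow(text.split())
    print(text, "->", lda.get_document_topics(bow))
```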


Author(s):  
Yiming Wang ◽  
Ximing Li ◽  
Jihong Ouyang

Neural topic modeling provides a flexible, efficient, and powerful way to extract topic representations from text documents. Unfortunately, most existing models cannot handle text data with network links, such as web pages with hyperlinks and scientific papers with citations. To resolve this kind of data, we develop a novel neural topic model, namely the Layer-Assisted Neural Topic Model (LANTM), which can be interpreted from the perspective of variational auto-encoders. Our major motivation is to enhance the topic representation encoding by using not only text contents but also the assisting network links. Specifically, LANTM encodes the texts and network links into topic representations by an augmented network with graph convolutional modules, and decodes them by maximizing the likelihood of the generative process. Neural variational inference is adopted for efficient inference. Experimental results validate that LANTM significantly outperforms existing models on topic quality, text classification, and link prediction.
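
LANTM itself augments the encoder with graph convolutional modules over the network links; the sketch below shows only the text-only variational auto-encoder backbone the abstract describes, with toy dimensions. It is a minimal PyTorch sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniNeuralTopicModel(nn.Module):
    """A minimal VAE-style topic model (text-only; LANTM's graph
    convolutional link encoder is omitted for brevity)."""

    def __init__(self, vocab_size, num_topics, hidden=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.to_mu = nn.Linear(hidden, num_topics)
        self.to_logvar = nn.Linear(hidden, num_topics)
        # Decoder weights play the role of topic-word distributions.
        self.decoder = nn.Linear(num_topics, vocab_size, bias=False)

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick for neural variational inference.
        theta = F.softmax(mu + torch.randn_like(mu) * (0.5 * logvar).exp(), dim=-1)
        logits = self.decoder(theta)
        recon = -(bow * F.log_softmax(logits, dim=-1)).sum(-1)       # reconstruction term
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)  # KL to N(0, I) prior
        return (recon + kl).mean()

model = MiniNeuralTopicModel(vocab_size=2000, num_topics=20)
loss = model(torch.rand(8, 2000))  # batch of 8 bag-of-words vectors
loss.backward()
```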


2019 ◽  
Vol 8 (2S8) ◽  
pp. 1366-1371

Topic modeling techniques such as LDA are considered useful tools for the statistical analysis of text document collections and other text-based data. Topic modeling has recently become an attractive research field due to its wide applications. However, traditional topic models such as LDA retain disadvantages stemming from the shortcomings of the bag-of-words (BOW) model, as well as poor performance in handling large text corpora. Therefore, in this paper we present a novel topic model, called LDA-GOW, which combines the word co-occurrence, or graph-of-words (GOW), model with the traditional LDA topic discovery model. The LDA-GOW topic model not only extracts more informative topics from text but is also able to scale the topic discovery process to large text corpora. To demonstrate the effectiveness of our proposed model, we compare it with the traditional LDA topic model on several standard datasets, including WebKB, Reuters-R8, and annotated scientific documents collected from the ACM digital library. Across all experiments, our proposed LDA-GOW model achieves approximately 70.86% accuracy.
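
The abstract does not spell out how GOW is integrated into LDA; the sketch below illustrates only the graph-of-words representation itself: an undirected co-occurrence graph whose edge weights count term pairs appearing within a sliding window. The window size of 3 is an illustrative assumption.

```python
from collections import Counter

def graph_of_words(tokens, window=3):
    """Build an undirected co-occurrence graph: nodes are unique terms,
    edge weights count co-occurrences within a sliding window."""
    edges = Counter()
    for i in range(len(tokens)):
        # Pair the token at i with every token inside its window.
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                edges[tuple(sorted((tokens[i], tokens[j])))] += 1
    return edges

doc = "topic model extracts topics from text corpus text".split()
for (u, v), w in graph_of_words(doc).items():
    print(f"{u} -- {v} (weight {w})")
```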


2018 ◽  
Vol 62 (3) ◽  
pp. 359-372 ◽  
Author(s):  
Ximing Li ◽  
Ang Zhang ◽  
Changchun Li ◽  
Lantian Guo ◽  
Wenting Wang ◽  
...  

2014 ◽  
Vol 08 (01) ◽  
pp. 85-98 ◽  
Author(s):  
G. Manning Richardson ◽  
Janet Bowers ◽  
A. John Woodill ◽  
Joseph R. Barr ◽  
Jean Mark Gawron ◽  
...  

This tutorial presents topic models for organizing and comparing documents. The technique and the corresponding discussion focus on the analysis of short text documents, particularly micro-blogs. However, the base topic model and R implementation are generally applicable to text analytics of document databases.


1966 ◽  
Vol 24 ◽  
pp. 188-189 ◽
Author(s):  
T. J. Deeming

If we make a set of measurements, such as narrow-band or multicolour photo-electric measurements, which are designed to improve a scheme of classification, and in particular if they are designed to extend the number of dimensions of classification, i.e. the number of classification parameters, then some important problems of analytical procedure arise. First, it is important not to reproduce the errors of the classification scheme which we are trying to improve. Second, when trying to extend the number of dimensions of classification we have little or nothing with which to test the validity of the new parameters.

Problems similar to these have occurred in other areas of scientific research (notably psychology and education) and the branch of Statistics called Multivariate Analysis has been developed to deal with them. The techniques of this subject are largely unknown to astronomers, but, if carefully applied, they should at the very least ensure that the astronomer gets the maximum amount of information out of his data and does not waste his time looking for information which is not there. More optimistically, these techniques are potentially capable of indicating the number of classification parameters necessary and giving specific formulas for computing them, as well as pinpointing those particular measurements which are most crucial for determining the classification parameters.
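
The abstract names multivariate analysis only in general terms. Principal component analysis is one such technique that does what the last sentence describes: its explained-variance ratios suggest how many classification parameters the data support, and its component loadings give explicit formulas for computing them. A minimal sketch on synthetic measurements standing in for multicolour photometry follows; the data and dimensions are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for multicolour photometric measurements:
# 200 stars, 6 passbands, driven by 2 underlying physical parameters.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
measurements = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

pca = PCA().fit(measurements)
# Explained-variance ratios indicate how many classification
# parameters the data actually support (here, two dominate).
print(pca.explained_variance_ratio_.round(3))
# The component loadings give explicit formulas for computing the
# parameters as linear combinations of the measurements.
print(pca.components_[:2].round(2))
```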


2018 ◽  
Vol 15 ◽  
pp. 101-112 ◽
Author(s):  
So-Hyun Park ◽  
Ae-Rin Song ◽  
Young-Ho Park ◽  
Sun-Young Ihm
Keyword(s):  

2014 ◽  
Vol 4 (1) ◽  
pp. 29-45 ◽  
Author(s):  
Rami Ayadi ◽  
Mohsen Maraoui ◽  
Mounir Zrigui

In this paper, the authors present a latent topic model to index and represent Arabic text documents with richer semantics. Text representation in a language with highly inflectional morphology such as Arabic is not a trivial task and requires special treatment. The authors describe their approach for analyzing and preprocessing Arabic text and then describe the stemming process. Finally, the latent model (LDA) is adapted to extract Arabic latent topics: the authors extract significant topics from all texts, describe each topic by a particular distribution of descriptors, and then represent each text as a vector over these topics. The classification experiment is conducted on an in-house corpus; latent topics are learned with LDA for different topic numbers K (25, 50, 75, and 100), and the authors compare these results with classification in the full word space. The results show that, in terms of precision, recall, and F-measure, classification in the reduced topic space outperforms classification in the full word space as well as classification using LSI reduction.
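
As a minimal sketch of the evaluation idea above, classification in a reduced LDA topic space rather than the full word space, the example below uses a toy English corpus and K = 2 in place of the stemmed Arabic corpus and the paper's K values (25, 50, 75, 100).

```python
from gensim import corpora
from gensim.models import LdaModel
from sklearn.linear_model import LogisticRegression

# Toy tokenized documents with class labels; the real experiment uses
# a stemmed Arabic corpus and K in {25, 50, 75, 100}.
docs = [["market", "stock", "price"], ["team", "match", "goal"],
        ["trade", "market", "economy"], ["player", "goal", "league"]]
labels = [0, 1, 0, 1]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
K = 2  # toy topic count
lda = LdaModel(corpus, num_topics=K, id2word=dictionary, passes=20, random_state=0)

# Represent each document as a dense vector in the reduced topic space.
def topic_vector(bow):
    vec = [0.0] * K
    for topic, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic] = prob
    return vec

X = [topic_vector(bow) for bow in corpus]
clf = LogisticRegression().fit(X, labels)  # classify in topic space
print(clf.predict(X))
```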

