Relational Biterm Topic Model: Short-Text Topic Modeling using Word Embeddings

PurposeTopic model has been widely applied to discover important information from a vast amount of unstructured data. Traditional long-text topic models such as Latent Dirichlet Allocation may suffer from the sparsity problem when dealing with short texts, which mostly come from the Web. These models also exist the readability problem when displaying the discovered topics. The purpose of this paper is to propose a novel model called the Sense Unit based Phrase Topic Model (SenU-PTM) for both the sparsity and readability problems.Design/methodology/approachSenU-PTM is a novel phrase-based short-text topic model under a two-phase framework. The first phase introduces a phrase-generation algorithm by exploiting word embeddings, which aims to generate phrases with the original corpus. The second phase introduces a new concept of sense unit, which consists of a set of semantically similar tokens for modeling topics with token vectors generated in the first phase. Finally, SenU-PTM infers topics based on the above two phases.FindingsExperimental results on two real-world and publicly available datasets show the effectiveness of SenU-PTM from the perspectives of topical quality and document characterization. It reveals that modeling topics on sense units can solve the sparsity of short texts and improve the readability of topics at the same time.Originality/valueThe originality of SenU-PTM lies in the new procedure of modeling topics on the proposed sense units with word embeddings for short-text topic discovery.

Download Full-text

CLDA: An Effective Topic Model for Mining User Interest Preference under Big Data Background

Complexity ◽

10.1155/2018/2503816 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10 ◽

Cited By ~ 5

Author(s):

Lirong Qiu ◽

Jia Yu

Keyword(s):

Big Data ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

User Interest ◽

Text Data ◽

Data Set ◽

Data Sparsity ◽

Short Text ◽

Text Filtering

In the present big data background, how to effectively excavate useful information is the problem that big data is facing now. The purpose of this study is to construct a more effective method of mining interest preferences of users in a particular field in the context of today’s big data. We mainly use a large number of user text data from microblog to study. LDA is an effective method of text mining, but it will not play a very good role in applying LDA directly to a large number of short texts in microblog. In today’s more effective topic modeling project, short texts need to be aggregated into long texts to avoid data sparsity. However, aggregated short texts are mixed with a lot of noise, reducing the accuracy of mining the user’s interest preferences. In this paper, we propose Combining Latent Dirichlet Allocation (CLDA), a new topic model that can learn the potential topics of microblog short texts and long texts simultaneously. The data sparsity of short texts is avoided by aggregating long texts to assist in learning short texts. Short text filtering long text is reused to improve mining accuracy, making long texts and short texts effectively combined. Experimental results in a real microblog data set show that CLDA outperforms many advanced models in mining user interest, and we also confirm that CLDA also has good performance in recommending systems.

Download Full-text

Examining LDA2Vec and Tweet Pooling for Topic Modeling on Twitter Data

WSEAS TRANSACTIONS ON INFORMATION SCIENCE AND APPLICATIONS ◽

10.37394/23209.2021.18.13 ◽

2021 ◽

Vol 18 ◽

pp. 102-114

Author(s):

Kristofferson Culmer ◽

Jeffrey Uhlmann

Keyword(s):

Statistical Analysis ◽

Topic Modeling ◽

Topic Model ◽

Text Documents ◽

Short Text ◽

Amount Of Information ◽

Twitter Data

The short lengths of tweets present a challenge for topic modeling to extend beyond what is provided explicitly from hashtag information. This is particularly true for LDAbased methods because the amount of information available from pertweet statistical analysis is severely limited. In this paper we present LDA2Vec paired with temporal tweet pooling (LDA2VecTTP) and assess its performance on this problem relative to traditional LDA and to Biterm Topic Model (Biterm), which was developed specifically for topic modeling on short text documents. We paired each of the three topic modeling algorithms with three tweet pooling schemes: no pooling, authorbased pooling, and temporal pooling. We then conducted topic modeling on two Twitter datasets using each of the algorithms and the tweet pooling schemes. Our results on the largest dataset suggest that LDA2VecTTP can produce higher coherence scores and more logically coherent and interpretable topics.

Download Full-text

Short Text Topic Model with Word Embeddings and Context Information

Recent Advances in Information and Communication Technology 2018 - Advances in Intelligent Systems and Computing ◽

10.1007/978-3-319-93692-5_6 ◽

2018 ◽

pp. 55-64

Author(s):

Xianchao Zhang ◽

Ran Feng ◽

Wenxin Liang

Keyword(s):

Topic Model ◽

Context Information ◽

Word Embeddings ◽

Short Text

Download Full-text

Dataless Short Text Classification Based on Biterm Topic Model and Word Embeddings

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/549 ◽

2020 ◽

Author(s):

Yi Yang ◽

Hongan Wang ◽

Jiaqi Zhu ◽

Yunkun Wu ◽

Kailong Jiang ◽

...

Keyword(s):

Text Classification ◽

Classification Accuracy ◽

Topic Model ◽

State Of The Art ◽

The Internet ◽

Word Embeddings ◽

Word Similarity ◽

Short Text ◽

Category Word ◽

Novel Model

Dataless text classification has attracted increasing attentions recently. It only needs very few seed words of each category to classify documents, which is much cheaper than supervised text classification that requires massive labeling efforts. However, most of existing models pay attention to long texts, but get unsatisfactory performance on short texts, which have become increasingly popular on the Internet. In this paper, we at first propose a novel model named Seeded Biterm Topic Model (SeedBTM) extending BTM to solve the problem of dataless short text classification with seed words. It takes advantage of both word co-occurrence information in the topic model and category-word similarity from widely used word embeddings as the prior topic-in-set knowledge. Moreover, with the same approach, we also propose Seeded Twitter Biterm Topic Model (SeedTBTM), which extends Twitter-BTM and utilizes additional user information to achieve higher classification accuracy. Experimental results on five real short-text datasets show that our models outperform the state-of-the-art methods, and especially perform well when the categories are overlapping and interrelated.

Download Full-text

A Study on Bestseller Short Text Semantics Analysis Using Topic Model

The Journal of Image and Cultural Contents ◽

10.24174/jicc.2018.10.15.101 ◽

2018 ◽

Vol 15 ◽

pp. 101-112

Author(s):

So-Hyun Park ◽

Ae-Rin Song ◽

Young-Ho Park ◽

Sun-Young Ihm

Keyword(s):

Topic Model ◽

Short Text

Download Full-text

Contextual Word Embeddings and Topic Modeling in Healthy Dieting and Obesity

Journal of Healthcare Informatics Research ◽

10.1007/s41666-019-00052-5 ◽

2019 ◽

Vol 3 (2) ◽

pp. 159-183 ◽

Cited By ~ 1

Author(s):

Vijaya Kumari Yeruva ◽

Sidrah Junaid ◽

Yugyung Lee

Keyword(s):

Topic Modeling ◽

Word Embeddings

Download Full-text

Learning to Classify Short Text with Topic Model and External Knowledge

Knowledge Science, Engineering and Management - Lecture Notes in Computer Science ◽

10.1007/978-3-642-39787-5_41 ◽

2013 ◽

pp. 493-503 ◽

Cited By ~ 12

Author(s):

Ying Zhu ◽

Li Li ◽

Le Luo

Keyword(s):

Topic Model ◽

Short Text ◽

External Knowledge

Download Full-text

Topic-BERT: Detecting harmful information from social media

Intelligent Decision Technologies ◽

10.3233/idt-200094 ◽

2021 ◽

pp. 1-10

Author(s):

Wang Gao ◽

Hongtao Deng ◽

Xun Zhu ◽

Yuan Fang

Keyword(s):

Social Media ◽

Language Processing ◽

Topic Model ◽

Classification Performance ◽

Critical Research ◽

Short Text ◽

Additional Information ◽

Proposed Model ◽

Weight Calculation ◽

Two Stages

Harmful information identification is a critical research topic in natural language processing. Existing approaches have been focused either on rule-based methods or harmful text identification of normal documents. In this paper, we propose a BERT-based model to identify harmful information from social media, called Topic-BERT. Firstly, Topic-BERT utilizes BERT to take additional information as input to alleviate the sparseness of short texts. The GPU-DMM topic model is used to capture hidden topics of short texts for attention weight calculation. Secondly, the proposed model divides harmful short text identification into two stages, and different granularity labels are identified by two similar sub-models. Finally, we conduct extensive experiments on a real-world social media dataset to evaluate our model. Experimental results demonstrate that our model can significantly improve the classification performance compared with baseline methods.

Download Full-text