Short Text Topic Discovery Based on BTM Topic Model

Bursty topic discovery aims to automatically identify bursty events and to keep track of known events over time. Existing methods focus on topic models. However, the sparsity of short texts challenges traditional topic models, because each text contains too few words to learn from. To tackle this problem, we propose a Sparse Topic Model (STM) for bursty topic discovery. First, we model bursty topics and common topics separately, so that changes in word usage over time can be detected and bursty words discovered. Second, we introduce a “Spike and Slab” prior to decouple the sparsity and smoothness of a distribution. The bursty words are then leveraged to discover bursty topics automatically. Finally, to evaluate the effectiveness of the proposed algorithm, we collect a Sina Weibo dataset and conduct various experiments. Both qualitative and quantitative evaluations demonstrate that the proposed STM algorithm performs favorably against several state-of-the-art methods.
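
The idea of detecting how word usage changes over time can be illustrated with a small sketch. This is not the STM inference procedure, which the abstract does not spell out; it is a simple frequency-ratio heuristic, and the function name `bursty_words`, the threshold `min_ratio` and the add-one `smoothing` are all assumptions made for illustration.

```python
from collections import Counter


def bursty_words(history, current, min_ratio=3.0, smoothing=1.0):
    """Return words whose smoothed relative frequency in `current` is at
    least `min_ratio` times their smoothed relative frequency in `history`.

    Illustrative heuristic only; not the STM model from the abstract.
    Both arguments are lists of tokenized short texts.
    """
    hist_counts = Counter(w for doc in history for w in doc)
    cur_counts = Counter(w for doc in current for w in doc)
    vocab = set(hist_counts) | set(cur_counts)

    # Additive smoothing keeps unseen words from producing a division by zero.
    hist_total = sum(hist_counts.values()) + smoothing * len(vocab)
    cur_total = sum(cur_counts.values()) + smoothing * len(vocab)

    scores = {}
    for w in cur_counts:
        p_hist = (hist_counts[w] + smoothing) / hist_total
        p_cur = (cur_counts[w] + smoothing) / cur_total
        ratio = p_cur / p_hist
        if ratio >= min_ratio:
            scores[w] = ratio
    # Most bursty words first.
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))


# Toy data: an earlier time slice and the current one.
history = [["rain", "today"], ["lunch", "today"], ["rain", "again"], ["lunch", "good"]]
current = [["earthquake", "today"], ["earthquake", "news"], ["earthquake", "rain"]]
result = bursty_words(history, current)
print(result)
```

In this toy data only "earthquake" rises sharply relative to the earlier slice, so it is the only word returned. A probabilistic model such as STM would instead infer burstiness jointly with the topics.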

SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings

Data Technologies and Applications ◽

10.1108/dta-02-2021-0039 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Heng-Yang Lu ◽

Yi Zhang ◽

Yuntao Du

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

Second Phase ◽

Word Embeddings ◽

Two Phase ◽

Content Type ◽

Short Text ◽

Topic Discovery ◽

Two Phases ◽

Sense Unit

Purpose: Topic models have been widely applied to discover important information from vast amounts of unstructured data. Traditional long-text topic models such as Latent Dirichlet Allocation may suffer from the sparsity problem when dealing with short texts, which mostly come from the Web. These models also suffer from a readability problem when displaying the discovered topics. The purpose of this paper is to propose a novel model, called the Sense Unit based Phrase Topic Model (SenU-PTM), that addresses both the sparsity and readability problems.

Design/methodology/approach: SenU-PTM is a novel phrase-based short-text topic model under a two-phase framework. The first phase introduces a phrase-generation algorithm that exploits word embeddings to generate phrases from the original corpus. The second phase introduces a new concept, the sense unit, which consists of a set of semantically similar tokens and is used to model topics with the token vectors generated in the first phase. Finally, SenU-PTM infers topics based on these two phases.

Findings: Experimental results on two real-world, publicly available datasets show the effectiveness of SenU-PTM from the perspectives of topical quality and document characterization. They reveal that modeling topics on sense units can alleviate the sparsity of short texts and improve the readability of topics at the same time.

Originality/value: The originality of SenU-PTM lies in the new procedure of modeling topics on the proposed sense units with word embeddings for short-text topic discovery.
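
A sense unit is described as a set of semantically similar tokens. The sketch below shows one way such sets could be formed from token vectors: a greedy grouping by cosine similarity. The abstract does not give the actual SenU-PTM procedure, so the function `group_sense_units`, the `threshold` value and the toy vectors are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_sense_units(vectors, threshold=0.9):
    """Greedily group tokens: each token joins the first unit whose seed
    (first) token is at least `threshold` similar, else it starts a new unit.

    Hypothetical grouping rule, not the procedure from the paper.
    """
    units = []
    for token, vec in vectors.items():
        for unit in units:
            if cosine(vec, vectors[unit[0]]) >= threshold:
                unit.append(token)
                break
        else:
            units.append([token])
    return units


# Hand-made 3-dimensional "embeddings"; real vectors would come from a
# trained word embedding model.
toy_vectors = {
    "phone": [0.9, 0.1, 0.0],
    "smartphone": [0.85, 0.15, 0.05],
    "mobile_phone": [0.88, 0.12, 0.02],
    "soccer": [0.0, 0.2, 0.95],
    "football": [0.05, 0.25, 0.9],
}
units = group_sense_units(toy_vectors)
print(units)
```

Here the phone-related tokens, including the phrase token "mobile_phone", fall into one unit and the two sports tokens into another. Counting topics over such units rather than over individual words is what gives a short text more evidence per observation.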

A Study on Bestseller Short Text Semantics Analysis Using Topic Model

The Journal of Image and Cultural Contents ◽

10.24174/jicc.2018.10.15.101 ◽

2018 ◽

Vol 15 ◽

pp. 101-112

Author(s):

So-Hyun Park ◽

Ae-Rin Song ◽

Young-Ho Park ◽

Sun-Young Ihm

Keyword(s):

Topic Model ◽

Short Text

Learning to Classify Short Text with Topic Model and External Knowledge

Knowledge Science, Engineering and Management - Lecture Notes in Computer Science ◽

10.1007/978-3-642-39787-5_41 ◽

2013 ◽

pp. 493-503 ◽

Cited By ~ 12

Author(s):

Ying Zhu ◽

Li Li ◽

Le Luo

Keyword(s):

Topic Model ◽

Short Text ◽

External Knowledge

Topic-BERT: Detecting harmful information from social media

Intelligent Decision Technologies ◽

10.3233/idt-200094 ◽

2021 ◽

pp. 1-10

Author(s):

Wang Gao ◽

Hongtao Deng ◽

Xun Zhu ◽

Yuan Fang

Keyword(s):

Social Media ◽

Language Processing ◽

Topic Model ◽

Classification Performance ◽

Critical Research ◽

Short Text ◽

Additional Information ◽

Proposed Model ◽

Weight Calculation ◽

Two Stages

Harmful information identification is a critical research topic in natural language processing. Existing approaches have focused either on rule-based methods or on harmful text identification in normal documents. In this paper, we propose a BERT-based model, called Topic-BERT, to identify harmful information from social media. First, Topic-BERT utilizes BERT to take additional information as input to alleviate the sparseness of short texts. The GPU-DMM topic model is used to capture hidden topics of short texts for attention weight calculation. Second, the proposed model divides harmful short text identification into two stages, and labels of different granularity are identified by two similar sub-models. Finally, we conduct extensive experiments on a real-world social media dataset to evaluate our model. Experimental results demonstrate that our model can significantly improve classification performance compared with baseline methods.

A word embedding topic model for topic detection and summary in social networks

Measurement and Control ◽

10.1177/0020294019865750 ◽

2019 ◽

Vol 52 (9-10) ◽

pp. 1289-1298 ◽

Cited By ~ 1

Author(s):

Lei Shi ◽

Gang Cheng ◽

Shang-ru Xie ◽

Gang Xie

Keyword(s):

Social Networks ◽

Social Network ◽

Latent Dirichlet Allocation ◽

Semantic Analysis ◽

Topic Model ◽

Word Embedding ◽

Probabilistic Latent Semantic Analysis ◽

Topic Detection ◽

Short Text ◽

Internal Relationship

The aim of topic detection is to automatically identify the events and hot topics in social networks and continuously track known topics. Applying traditional methods such as Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis is difficult given the high dimensionality of massive event texts and the short-text sparsity problem of social networks. Sparse topic distributions also lead to unclear topics. To address these challenges, we propose a novel word embedding topic model, named Cbow Topic Model (CTM), that combines the topic model with the continuous bag-of-words (Cbow) word embedding method for topic detection and summary in social networks. We cluster similar words in the target social network text dataset using the classic Cbow word vectorization method, which can effectively learn the internal relationship between words and reduce the dimensionality of the input texts. We then employ the topic model to model the short texts, which effectively weakens the sparsity problem of social network texts. To detect and summarize topics, we propose a topic detection method that leverages similarity computing for social networks. We collected a Sina microblog dataset to conduct various experiments. The experimental results demonstrate that the CTM method is superior to existing topic model methods.
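
A continuous bag-of-words model learns word vectors by predicting each word from the words around it. The sketch below shows only how such training examples are formed from a tokenized text; it does not train a model and is not the CTM pipeline. The function name `cbow_pairs` and the `window` parameter are illustrative assumptions.

```python
def cbow_pairs(tokens, window=2):
    """Build (context words, target word) training pairs as used by a
    continuous bag-of-words model.

    For each position, the context is up to `window` words on each side.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip single-word texts, which have no context
            pairs.append((context, target))
    return pairs


pairs = cbow_pairs(["heavy", "rain", "hits", "city"], window=1)
for context, target in pairs:
    print(context, "->", target)
```

Words that appear in similar contexts end up with similar vectors, which is what makes clustering similar words possible after training.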

Dirichlet Multinomial Mixture with Variational Manifold Regularization: Topic Modeling over Short Texts

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33017884 ◽

2019 ◽

Vol 33 ◽

pp. 7884-7891 ◽

Cited By ~ 2

Author(s):

Ximing Li ◽

Jiaojiao Zhang ◽

Jihong Ouyang

Keyword(s):

Topic Model ◽

Topic Models ◽

Manifold Regularization ◽

Neighborhood Structure ◽

Short Text ◽

Semantic Level ◽

Clustering And Classification ◽

Significant Performance ◽

Sparsity Problem ◽

Performance Gains

Conventional topic models suffer from a severe sparsity problem when facing extremely short texts such as social media posts. The family of Dirichlet multinomial mixture (DMM) models can handle the sparsity problem; however, they are still very sensitive to ordinary and noisy words, resulting in inaccurate topic representations at the document level. In this paper, we alleviate this problem by preserving the local neighborhood structure of short texts, which allows topical signals to spread among neighboring documents and so corrects the inaccurate topic representations. This is achieved by using variational manifold regularization, which constrains close short texts to have similar variational topic representations. Building on this idea, we propose a novel Laplacian DMM (LapDMM) topic model. During document graph construction, we further use the word mover’s distance with word embeddings to measure document similarities at the semantic level. To evaluate LapDMM, we compare it against state-of-the-art short text topic models on several traditional tasks. Experimental results demonstrate that LapDMM achieves very significant performance gains over baseline models, e.g., scores about 0.2 higher on clustering and classification tasks in many cases.
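
The constraint that close short texts should have similar topic representations can be illustrated with the generic graph-based smoothness penalty used in manifold regularization. The sketch below is not the LapDMM objective, whose exact variational form the abstract does not give. The function `manifold_penalty`, the topic vectors and the hand-set neighbor weights are assumptions; in the paper the document graph is built from word mover's distance.

```python
def manifold_penalty(thetas, weights):
    """Generic graph smoothness penalty:
    0.5 * sum over i, j of W[i][j] * ||theta_i - theta_j||^2.

    `thetas` holds one topic-proportion vector per document and `weights`
    is a symmetric neighbor-weight matrix (0 means "not neighbors").
    """
    n = len(thetas)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if weights[i][j]:
                sq_dist = sum((a - b) ** 2 for a, b in zip(thetas[i], thetas[j]))
                total += weights[i][j] * sq_dist
    return 0.5 * total


# Three documents over two topics; documents 0 and 1 have similar topics.
thetas = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

# Graph A links the two similar documents; graph B links dissimilar ones.
graph_a = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
graph_b = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]

print(manifold_penalty(thetas, graph_a))
print(manifold_penalty(thetas, graph_b))
```

The penalty is small when neighbors already agree and large when they disagree, so adding it to a training objective pulls the topic representations of neighboring documents toward each other.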

Filtering and Classifying Relevant Short Text with a Few Seed Words

Data and Information Management ◽

10.2478/dim-2019-0011 ◽

2019 ◽

Vol 3 (3) ◽

pp. 165-186 ◽

Cited By ~ 1

Author(s):

Chenliang Li ◽

Shiqian Chen ◽

Yan Qi

Keyword(s):

Latent Dirichlet Allocation ◽

Topic Model ◽

State Of The Art ◽

Superior Performance ◽

Support Vector ◽

Short Text ◽

Text Filtering ◽

Supervised Classifiers ◽

Real World Datasets ◽

Weakly Supervised

Filtering out irrelevant documents and classifying the relevant ones into topical categories is a de facto task in many applications. However, supervised learning solutions require extravagant human effort for document labeling. In this paper, we propose a novel seed-guided topic model for dataless short text classification and filtering, named SSCF. Without using any labeled documents, SSCF takes a few “seed words” for each category of interest, and conducts short text filtering and classification in a weakly supervised manner. To overcome the issues of data sparsity and imbalance, the short text collection is mapped to a collection of pseudo-documents, one for each word. SSCF infers two kinds of topics on pseudo-documents: category-topics and general-topics. Each category-topic is associated with one category of interest, covering the meaning of the latter. In SSCF, we devise a novel word relevance estimation process based on the seed words, for hidden topic inference. The dominating topic of a short text is identified through post inference and then used for filtering and classification. On two real-world datasets in two languages, experimental results show that our proposed SSCF consistently achieves better classification accuracy than state-of-the-art baselines. We also observe that SSCF can even outperform the supervised classifiers supervised latent Dirichlet allocation (sLDA) and support vector machine (SVM) on some testing tasks.
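
The mapping from short texts to one pseudo-document per word can be sketched as follows. The abstract does not state how SSCF fills each pseudo-document, so this sketch assumes a common construction: the pseudo-document of a word collects every word that co-occurs with it in some short text. The function name `build_pseudo_documents` is hypothetical.

```python
from collections import defaultdict


def build_pseudo_documents(short_texts):
    """Map a collection of tokenized short texts to one pseudo-document
    per word.

    Assumed construction: a word's pseudo-document is the concatenation
    of the other words from every short text in which it appears.
    """
    pseudo = defaultdict(list)
    for text in short_texts:
        for i, word in enumerate(text):
            # Everything in the text except this occurrence of the word.
            pseudo[word].extend(text[:i] + text[i + 1:])
    return dict(pseudo)


texts = [["cheap", "flight", "deal"], ["flight", "delay", "news"]]
pseudo_docs = build_pseudo_documents(texts)
for word, doc in pseudo_docs.items():
    print(word, "->", doc)
```

A word that appears in many short texts gets a long pseudo-document, so a topic model run on these pseudo-documents sees far denser co-occurrence data than it would on the original texts.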

A New Vector Representation of Short Texts for Classification

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/2/12 ◽

2019 ◽

Vol 17 (2) ◽

pp. 241-249

Author(s):

Yangyang Li ◽

Bo Liu

Keyword(s):

Text Classification ◽

Web Search ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Classification Performance ◽

New Method ◽

Data Sets ◽

Text Data ◽

Short Text ◽

Space Model

The short and sparse nature of the texts, together with synonyms and homonyms, is the main obstacle to short-text classification. In recent years, research on short-text classification has focused on expanding short texts but has barely guaranteed the validity of the expanded words. This study proposes a new method to weaken these effects without external knowledge. The proposed method analyses short texts by using a topic model based on Latent Dirichlet Allocation (LDA), represents each short text by using a vector space model and presents a new method to adjust the vector of short texts. In the experiments, two open short-text data sets composed of Google News and web search snippets are utilised to evaluate the classification performance and demonstrate the effectiveness of our method.

Exploiting Global Semantic Similarity Biterms for Short-Text Topic Discovery

2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI) ◽

10.1109/ictai.2018.00151 ◽

2018 ◽

Author(s):

Heng-Yang Lu ◽

Gao-Jian Ge ◽

Yun Li ◽

Chong-Jun Wang ◽

Jun-Yuan Xie

Keyword(s):

Semantic Similarity ◽

Short Text ◽

Topic Discovery
