A Word Embedding Topic Model for Robust Inference of Topics and Visualization

2021 ◽  
Author(s):  
Sanuj Kumar ◽  
Tuan Le


2018 ◽
Vol 6 (3) ◽  
pp. 67-78
Author(s):  
Tian Nie ◽  
Yi Ding ◽  
Chen Zhao ◽  
Youchao Lin ◽  
Takehito Utsuro

This article addresses the issue of how to obtain an overview of the knowledge associated with a given query keyword, focusing in particular on the concerns of those who search for web pages with that keyword. The Web search information needs of a given query keyword are collected through search engine suggests. Given a query keyword, the authors collect up to around 1,000 suggests, many of which are redundant, and classify the redundant search engine suggests based on a topic model. However, one limitation of the topic-model-based classification of search engine suggests is that the granularity of the topics, i.e., the clusters of search engine suggests, is too coarse. To overcome this coarse-grained classification, this article further applies the word embedding technique to the web pages used during the training of the topic model, in addition to the text of the whole Japanese version of Wikipedia. The authors then examine the word-embedding-based similarity between search engine suggests and classify the suggests within a single topic into finer-grained subtopics based on that similarity. Evaluation results show that the proposed approach performs well in the task of subtopic classification of search engine suggests.
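A minimal sketch of the finer-grained step (not the authors' implementation): suggests grouped under one coarse topic are re-clustered by the cosine similarity of their word embeddings. The tiny corpus, suggest list, and distance threshold below are hypothetical placeholders; the paper trains embeddings on Japanese Wikipedia plus the topic model's web pages.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import AgglomerativeClustering

# Toy stand-in for the Wikipedia + web page training corpus.
corpus = [["kyoto", "hotel", "ryokan", "stay"],
          ["kyoto", "bus", "timetable", "access"],
          ["kyoto", "onsen", "ryokan", "bath"]]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=0)

# Suggests that the topic model placed into one coarse topic.
suggests_in_topic = ["hotel", "ryokan", "onsen", "bus", "timetable"]
kept = [s for s in suggests_in_topic if s in model.wv]
vectors = np.array([model.wv[s] for s in kept])

# Suggests whose embeddings are close in cosine distance form one subtopic.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=0.8,
                                     metric="cosine", linkage="average")
print(dict(zip(kept, clustering.fit_predict(vectors))))
```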


2019 ◽  
Vol 52 (9-10) ◽  
pp. 1289-1298 ◽  
Author(s):  
Lei Shi ◽  
Gang Cheng ◽  
Shang-ru Xie ◽  
Gang Xie

The aim of topic detection is to automatically identify events and hot topics in social networks and to continuously track known topics. Traditional methods such as Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis are difficult to apply given the high dimensionality of massive event texts and the short-text sparsity of social networks; the sparse distribution of topics also leads to unclear topics. To address these challenges, we propose a novel word embedding topic model, named the CBOW Topic Model (CTM), which combines a topic model with the continuous bag-of-words (CBOW) word embedding method for topic detection and summarization in social networks. We cluster similar words in the target social network text dataset by introducing the classic CBOW word vectorization method, which effectively learns the internal relationships between words and reduces the dimensionality of the input texts. We employ the topic model to model short texts, effectively weakening the sparsity problem of social network texts. To detect and summarize topics, we propose a topic detection method that leverages similarity computation for social networks. We collected a Sina microblog dataset to conduct various experiments. The experimental results demonstrate that CTM is superior to existing topic model methods.
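A rough sketch of the CBOW step described above, assuming gensim's Word2Vec with sg=0 as the CBOW implementation; `posts` is a hypothetical list of tokenized microblog posts standing in for the Sina dataset.

```python
from gensim.models import Word2Vec

posts = [["earthquake", "rescue", "donate", "relief"],
         ["earthquake", "relief", "supplies", "donate"],
         ["concert", "ticket", "tour", "band"]]

# sg=0 selects the continuous bag-of-words (CBOW) training mode.
cbow = Word2Vec(posts, vector_size=100, window=5, min_count=1, sg=0)

# Grouping each word with its nearest neighbours in the embedding space
# yields the "similar word" clusters used to shrink the input vocabulary.
print(cbow.wv.most_similar("earthquake", topn=3))
```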


2020 ◽  
Vol 125 (3) ◽  
pp. 2091-2108
Author(s):  
Jie Chen ◽  
Jialin Chen ◽  
Shu Zhao ◽  
Yanping Zhang ◽  
Jie Tang

Author(s):  
Mennatallah El-Assady ◽  
Rebecca Kehlbeck ◽  
Christopher Collins ◽  
Daniel Keim ◽  
Oliver Deussen

Author(s):  
Guangxu Xun ◽  
Yaliang Li ◽  
Wayne Xin Zhao ◽  
Jing Gao ◽  
Aidong Zhang

Conventional correlated topic models capture correlation structure among latent topics by replacing the Dirichlet prior with the logistic normal distribution. Word embeddings have been shown to capture semantic regularities in language, so the semantic relatedness and correlations between words can be calculated directly in the word embedding space, for example via cosine similarity. In this paper, we propose a novel correlated topic model using word embeddings. The proposed model exploits the additional word-level correlation information in word embeddings and directly models topic correlation in the continuous word embedding space. In the model, words in documents are replaced with meaningful word embeddings, topics are modeled as multivariate Gaussian distributions over the word embeddings, and topic correlations are learned among the continuous Gaussian topics. A Gibbs sampling solution with data augmentation is given to perform inference. We evaluate our model qualitatively and quantitatively on the 20 Newsgroups and Reuters-21578 datasets, and the experimental results show the effectiveness of the proposed model.
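A simplified sketch of the generative view above: each topic is a multivariate Gaussian over word embeddings, and a word's topic responsibilities come from the topic densities at its embedding. The means and covariances below are toy values, not learned parameters, and the Gibbs sampler with data augmentation is not reproduced here.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

dim, n_topics = 50, 3
rng = np.random.default_rng(0)
topic_means = rng.normal(size=(n_topics, dim))      # toy Gaussian topic means
topic_covs = [np.eye(dim) for _ in range(n_topics)]  # toy topic covariances

word_vec = rng.normal(size=dim)  # embedding of one observed word

# Log-density of the word's embedding under each Gaussian topic.
log_dens = np.array([multivariate_normal.logpdf(word_vec, mean=m, cov=c)
                     for m, c in zip(topic_means, topic_covs)])
topic_probs = np.exp(log_dens - logsumexp(log_dens))  # normalize in log space
print(topic_probs)
```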


2020 ◽  
Author(s):  
Vijayarani J. ◽ 
Geetha T.V.

Social media texts such as tweets and blogs are collaboratively created through human interaction. Fast-changing trends lead to topic drift in social media text; this drift is usually associated with words and hashtags, but geotags also play an important part in determining topic distribution within a location context. The rates of change in the distributions of words, hashtags, and geotags are not uniform and must be handled accordingly. This paper builds a topic model that associates each topic with a mixture of distributions over words, hashtags, and geotags. A stochastic gradient Langevin dynamics model with varying mini-batch sizes is used to capture the changes due to the asynchronous distributions of words and tags. Topical word embeddings with co-occurrence and location contexts are specified as hashtag context vectors and geotag context vectors, respectively; these two vectors are jointly learned to yield topical word embedding vectors related to the tag contexts. Topical word embeddings over time, conditioned on hashtags and geotags, effectively predict location-based topical variations. When evaluated on Chennai and UK geolocated Twitter data, the proposed joint topical word embedding model, enhanced by the social tag context, outperforms other methods.
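A generic stochastic gradient Langevin dynamics (SGLD) update, sketched to illustrate the update style the paper builds on; the toy posterior, step-size schedule, and varying mini-batch sizes are placeholders for the paper's asynchronous word, hashtag, and geotag distributions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgld_step(theta, grad_log_post, step_size):
    """One SGLD update: a half step of gradient ascent on the log
    posterior plus Gaussian noise with variance equal to the step size."""
    noise = rng.normal(scale=np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad_log_post(theta) + noise

theta = np.zeros(10)
for t, batch_size in enumerate([64, 128, 256, 512], start=1):
    # In the full model the gradient would be estimated from a mini-batch
    # of `batch_size` posts; here it is the exact gradient of log N(0, I).
    theta = sgld_step(theta, lambda th: -th, step_size=0.01 / t)
```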


2020 ◽  
Vol 10 (11) ◽  
pp. 3831 ◽  
Author(s):  
Sang-Min Park ◽  
Sung Joon Lee ◽  
Byung-Won On

Detecting the main aspects of a particular product from a collection of review documents is challenging in real applications. To address this problem, we focus on utilizing existing topic models, which can briefly summarize large text documents. Unlike existing approaches, which are limited because they modify a particular topic model or use seed opinion words as prior knowledge, we propose a novel approach of (1) identifying starting points for learning, (2) cleaning dirty topic results through word embedding and unsupervised clustering, and (3) automatically generating the right aspects using topic and head word embeddings. Experimental results show that the proposed methods create cleaner topics, improving ROUGE-1 by about 25% compared to the baseline method. In addition, through the three proposed methods, the main aspects suited to the given data are detected automatically.
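A minimal sketch of step (2) above: "cleaning" a dirty topic by clustering its top words' embeddings and keeping only the dominant cluster. The toy reviews and word list are hypothetical, and k-means stands in for whichever unsupervised clustering the authors used.

```python
import numpy as np
from collections import Counter
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

reviews = [["battery", "charge", "lasts", "screen"],
           ["screen", "bright", "battery", "drains"],
           ["shipping", "late", "box", "damaged"]]
wv = Word2Vec(reviews, vector_size=50, min_count=1, seed=0).wv

def clean_topic(words, wv, n_clusters=2):
    kept = [w for w in words if w in wv]
    vecs = np.array([wv[w] for w in kept])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(vecs)
    dominant = Counter(labels).most_common(1)[0][0]
    # Words falling outside the dominant embedding cluster are dropped as noise.
    return [w for w, lab in zip(kept, labels) if lab == dominant]

print(clean_topic(["battery", "screen", "charge", "shipping"], wv))
```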


Sensors ◽  
2019 ◽  
Vol 19 (17) ◽  
pp. 3728 ◽  
Author(s):  
Zhou ◽  
Wang ◽  
Sun ◽  
Sun

Text representation is one of the key tasks in the field of natural language processing (NLP). Traditional feature extraction and weighting methods often use the bag-of-words (BoW) model, which can lead to a lack of semantic information as well as problems of high dimensionality and high sparsity. A popular way to address these problems is to apply deep learning methods. In this paper, feature weighting, word embedding, and topic models are combined to propose an unsupervised text representation method named the feature, probability, and word embedding method. The main idea is to use the word embedding technique Word2Vec to obtain word vectors and then combine them with TF-IDF feature weighting and the LDA topic model. Compared with traditional feature engineering, the proposed method not only increases the expressive ability of the vector space model but also reduces the dimensionality of the document vectors, addressing the insufficient semantic information, high dimensionality, and high sparsity of BoW. We apply the proposed method to the task of text categorization and verify its validity.
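A sketch of one plausible reading of the combination above (not the paper's exact formulation): a document vector formed by the TF-IDF weighted average of Word2Vec word vectors, concatenated with the document's LDA topic distribution. The toy corpus is a placeholder.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats chase mice", "dogs chase cats", "stocks rise on strong earnings"]

w2v = Word2Vec([d.split() for d in docs], vector_size=50, min_count=1, seed=0)
tfidf = TfidfVectorizer().fit(docs)
count_vec = CountVectorizer().fit(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(count_vec.transform(docs))

def doc_vector(doc):
    weights = tfidf.transform([doc]).toarray()[0]
    vocab = tfidf.get_feature_names_out()
    # TF-IDF weighted average of the embeddings of words in the document.
    pairs = [(w2v.wv[w], weights[i]) for i, w in enumerate(vocab)
             if weights[i] > 0 and w in w2v.wv]
    avg = sum(v * wt for v, wt in pairs) / sum(wt for _, wt in pairs)
    topics = lda.transform(count_vec.transform([doc]))[0]  # LDA topic mixture
    return np.concatenate([avg, topics])

print(doc_vector("cats chase mice").shape)  # (50 + 2,)
```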

