Layer-Assisted Neural Topic Modeling over Document Networks

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/433 ◽

2021 ◽

Author(s):

Yiming Wang ◽

Ximing Li ◽

Jihong Ouyang

Keyword(s):

Text Classification ◽

Topic Modeling ◽

Link Prediction ◽

Topic Model ◽

Web Pages ◽

Text Documents ◽

Text Data ◽

Generative Process ◽

Network Links ◽

Scientific Papers

Neural topic modeling provides a flexible, efficient, and powerful way to extract topic representations from text documents. Unfortunately, most existing models cannot handle text data with network links, such as web pages with hyperlinks and scientific papers with citations. To handle this kind of data, we develop a novel neural topic model, namely Layer-Assisted Neural Topic Model (LANTM), which can be interpreted from the perspective of variational auto-encoders. Our major motivation is to enhance the topic representation encoding by using not only text contents but also the assisting network links. Specifically, LANTM encodes the texts and network links into topic representations with an augmented network containing graph convolutional modules, and decodes them by maximizing the likelihood of the generative process. Neural variational inference is adopted for efficient inference. Experimental results validate that LANTM significantly outperforms existing models on topic quality, text classification, and link prediction.
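
The abstract does not specify the form of LANTM's graph convolutional modules. As a rough illustration of how a graph convolution lets a document's representation draw on its linked neighbors, the sketch below applies the widely used symmetric-normalized propagation rule, D^{-1/2}(A + I)D^{-1/2}X, without learned weights. The function name `gcn_propagate`, the toy adjacency matrix, and the feature vectors are assumptions made for illustration and are not taken from the paper.

```python
import math


def gcn_propagate(adj, features):
    """One weight-free graph convolution step (illustrative, not LANTM itself).

    adj: symmetric 0/1 adjacency matrix (list of lists) of the document network.
    features: one feature vector per document, e.g. bag-of-words counts.
    Returns features smoothed over each document and its linked neighbors.
    """
    n = len(adj)
    # Add self-loops so each document keeps part of its own content.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization: entry (i, j) is divided by sqrt(deg_i * deg_j).
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    dim = len(features[0])
    return [[sum(norm[i][j] * features[j][k] for j in range(n))
             for k in range(dim)]
            for i in range(n)]


# Hypothetical two-document network in which each document cites the other.
toy_adj = [[0, 1], [1, 0]]
toy_features = [[1.0, 0.0], [0.0, 1.0]]
smoothed = gcn_propagate(toy_adj, toy_features)
```

In a full model, the smoothed features would typically pass through learned weights and a nonlinearity before producing the parameters of the variational posterior; those parts are omitted here.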

Ldagibbs: A Command for Topic Modeling in Stata Using Latent Dirichlet Allocation

The Stata Journal Promoting communications on statistics and Stata ◽

10.1177/1536867x1801800107 ◽

2018 ◽

Vol 18 (1) ◽

pp. 101-117 ◽

Cited By ~ 10

Author(s):

Carlo Schwarz

Keyword(s):

Machine Learning ◽

Probability Distribution ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Topic Models ◽

Text Documents ◽

Text Data ◽

Dirichlet Allocation

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
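
To make the two kinds of distributions concrete, here is a minimal sketch of collapsed Gibbs sampling for latent Dirichlet allocation in plain Python. It is not the ldagibbs implementation and does not mirror its options; the function name `lda_gibbs`, the hyperparameter defaults `alpha` and `beta`, and the toy corpus are assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only).

    docs: list of token lists.
    Returns (theta, phi, vocab): theta[d] is the topic distribution of
    document d, phi[k] is the word distribution of topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]   # count of word v in topic k
    topic_total = [0] * num_topics                      # tokens assigned to topic k
    assign = []                                         # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [[(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
             for d, doc in enumerate(docs)]
    phi = [[(n + beta) / (topic_total[k] + V * beta) for n in topic_word[k]]
           for k in range(num_topics)]
    return theta, phi, vocab


# Hypothetical toy corpus; real use would start from tokenized documents.
toy_docs = [["stata", "command", "stata"], ["topic", "model", "topic", "word"]]
theta, phi, vocab = lda_gibbs(toy_docs, num_topics=2, iters=50)
```

Each row of `theta` sums to one across topics and each row of `phi` sums to one across the vocabulary, which is the sense in which documents and topics are both represented as probability distributions.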

Incorporating Text OLAP in Business Intelligence

Business Intelligence Applications and the Web - Advances in Business Information Systems and Analytics ◽

10.4018/978-1-61350-038-5.ch004 ◽

2011 ◽

pp. 77-101 ◽

Cited By ~ 1

Author(s):

Byung-Kwon Park ◽

Il-Yeol Song

Keyword(s):

Information Retrieval ◽

Text Mining ◽

Business Intelligence ◽

Multidimensional Analysis ◽

Web Pages ◽

Data Types ◽

Text Documents ◽

Text Data ◽

Platform Architecture ◽

Unstructured Text

As the amount of data inside and outside an enterprise grows rapidly, analyzing all of it seamlessly is becoming important. The data can be classified into two categories: structured and unstructured. Total business intelligence requires analyzing both seamlessly. Because most business data are unstructured text documents, including Web pages on the Internet, we need a Text OLAP solution to perform multidimensional analysis of text documents in the same way as structured relational data. We first survey representative works that demonstrate how text mining and information retrieval, the major technologies for handling text data, can be applied to multidimensional analysis of text documents. We then survey representative works that demonstrate how to associate and consolidate unstructured text documents and structured relational data to obtain total business intelligence. Finally, we present a future business intelligence platform architecture as well as related research topics. We expect that the proposed total heterogeneous business intelligence architecture, which integrates information retrieval, text mining, and information extraction technologies together with relational OLAP technologies, will make a better platform for total business intelligence.

A New Vector Representation of Short Texts for Classification

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/2/12 ◽

2019 ◽

Vol 17 (2) ◽

pp. 241-249

Author(s):

Yangyang Li ◽

Bo Liu

Keyword(s):

Text Classification ◽

Web Search ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Classification Performance ◽

New Method ◽

Data Sets ◽

Text Data ◽

Short Text ◽

Space Model

Shortness and sparsity, together with synonyms and homonyms, are the main obstacles to short-text classification. In recent years, research on short-text classification has focused on expanding short texts but has barely guaranteed the validity of the expanded words. This study proposes a new method to weaken these effects without external knowledge. The proposed method analyses short texts by using a topic model based on Latent Dirichlet Allocation (LDA), represents each short text by using a vector space model, and presents a new method to adjust the vectors of short texts. In the experiments, two open short-text data sets, composed of Google News and web search snippets, are utilised to evaluate the classification performance and demonstrate the effectiveness of our method.

Topic Modeling on Document Networks with Adjacent-Encoder

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6152 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6737-6745

Author(s):

Ce Zhang ◽

Hady W. Lauw

Keyword(s):

Network Structure ◽

Real World ◽

Topic Modeling ◽

Topic Model ◽

Web Pages ◽

Low Dimensional ◽

Textual Content

Oftentimes documents are linked to one another in a network structure, e.g., academic papers cite other papers and Web pages link to other pages. In this paper we propose a holistic topic model to learn meaningful and unified low-dimensional representations for networked documents that seek to preserve both textual content and network structure. On the basis of reconstructing not only the input document but also its adjacent neighbors, we develop two neural encoder architectures. Adjacent-Encoder, or AdjEnc, induces competition among documents for topic propagation, and reconstruction among neighbors for semantic capture. Adjacent-Encoder-X, or AdjEnc-X, extends this to also encode the network structure in addition to document content. We evaluate our models on real-world document networks quantitatively and qualitatively, outperforming comparable baselines comprehensively.

CLDA: An Effective Topic Model for Mining User Interest Preference under Big Data Background

Complexity ◽

10.1155/2018/2503816 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10 ◽

Cited By ~ 5

Author(s):

Lirong Qiu ◽

Jia Yu

Keyword(s):

Big Data ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

User Interest ◽

Text Data ◽

Data Set ◽

Data Sparsity ◽

Short Text ◽

Text Filtering

In the current big data context, effectively extracting useful information is a central problem. The purpose of this study is to construct a more effective method for mining users’ interest preferences in a particular field. We mainly study a large volume of user text data from microblogs. LDA is an effective text mining method, but it does not perform well when applied directly to the large number of short texts in microblogs. In current topic modeling practice, short texts are aggregated into long texts to avoid data sparsity. However, aggregated short texts contain a lot of noise, which reduces the accuracy of mining users’ interest preferences. In this paper, we propose Combining Latent Dirichlet Allocation (CLDA), a new topic model that can learn the latent topics of microblog short texts and long texts simultaneously. The data sparsity of short texts is avoided by using aggregated long texts to assist in learning short texts. In turn, the short texts are used to filter the long texts to improve mining accuracy, so that long and short texts are effectively combined. Experimental results on a real microblog data set show that CLDA outperforms many advanced models in mining user interest, and we also confirm that CLDA performs well in recommender systems.

Examining LDA2Vec and Tweet Pooling for Topic Modeling on Twitter Data

WSEAS TRANSACTIONS ON INFORMATION SCIENCE AND APPLICATIONS ◽

10.37394/23209.2021.18.13 ◽

2021 ◽

Vol 18 ◽

pp. 102-114

Author(s):

Kristofferson Culmer ◽

Jeffrey Uhlmann

Keyword(s):

Statistical Analysis ◽

Topic Modeling ◽

Topic Model ◽

Text Documents ◽

Short Text ◽

Amount Of Information ◽

Twitter Data

The short lengths of tweets present a challenge for topic modeling to extend beyond what is provided explicitly from hashtag information. This is particularly true for LDA-based methods because the amount of information available from per-tweet statistical analysis is severely limited. In this paper we present LDA2Vec paired with temporal tweet pooling (LDA2VecTTP) and assess its performance on this problem relative to traditional LDA and to Biterm Topic Model (Biterm), which was developed specifically for topic modeling on short text documents. We paired each of the three topic modeling algorithms with three tweet pooling schemes: no pooling, author-based pooling, and temporal pooling. We then conducted topic modeling on two Twitter datasets using each of the algorithms and the tweet pooling schemes. Our results on the largest dataset suggest that LDA2VecTTP can produce higher coherence scores and more logically coherent and interpretable topics.
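
The three pooling schemes can be sketched as a preprocessing step that turns individual tweets into the pseudo-documents handed to a topic model. The abstract does not describe how the temporal windows are defined, so the fixed hourly buckets below are an assumption, as are the function name `pool_tweets`, the dictionary keys `author`, `time`, and `text`, and the sample tweets.

```python
from collections import defaultdict
from datetime import datetime


def pool_tweets(tweets, scheme="none", window_hours=1):
    """Group tweets into pseudo-documents (illustrative sketch).

    tweets: list of dicts with hypothetical keys "author", "time", "text".
    scheme: "none" (one document per tweet), "author", or "temporal".
    window_hours: bucket width for temporal pooling; assumed to divide 24.
    """
    if scheme == "none":
        return [t["text"] for t in tweets]

    groups = defaultdict(list)
    for t in tweets:
        if scheme == "author":
            key = t["author"]
        elif scheme == "temporal":
            ts = t["time"]
            # Round the timestamp down to the start of its time bucket.
            bucket_hour = ts.hour - ts.hour % window_hours
            key = ts.replace(hour=bucket_hour, minute=0, second=0, microsecond=0)
        else:
            raise ValueError(f"unknown pooling scheme: {scheme}")
        groups[key].append(t["text"])

    # Each pool becomes one longer document for the topic model.
    return [" ".join(texts) for texts in groups.values()]


# Hypothetical sample data.
sample = [
    {"author": "alice", "time": datetime(2021, 1, 1, 10, 5), "text": "a"},
    {"author": "bob", "time": datetime(2021, 1, 1, 10, 40), "text": "b"},
    {"author": "alice", "time": datetime(2021, 1, 1, 12, 0), "text": "c"},
]
by_author = pool_tweets(sample, "author")
by_hour = pool_tweets(sample, "temporal")
```

Pooling trades granularity for longer documents: the model sees more word co-occurrences per document, at the cost of mixing tweets that may not share a topic.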

A Method for Constructing Supervised Time Topic Model Based on Variational Autoencoder

Scientific Programming ◽

10.1155/2021/6623689 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Zhinan Gou ◽

Yan Li ◽

Zheng Huo

Keyword(s):

Topic Modeling ◽

Topic Model ◽

State Of The Art ◽

Generation Model ◽

Construction Process ◽

Research Papers ◽

Generative Process ◽

Model Based ◽

Variational Autoencoder ◽

Over Time

A topic model is a probabilistic generative model that finds the representative topics of a document, and topic modeling has been successfully applied to various document-related tasks in recent years. Many methods have achieved success, especially for the supervised topic model and the time topic model. The supervised topic model can learn topics from documents annotated with multiple labels, and the time topic model can learn topics that evolve over time in a sequentially organized corpus. In practice, however, some documents are both multi-labeled and time-stamped, and handling them requires a supervised time topic model. There are few research papers on the supervised time topic model. To solve this problem, we propose a method for constructing a supervised time topic model. By analysing the generative processes of the supervised topic model and the time topic model, respectively, we describe in detail the construction process of the supervised time topic model based on a variational autoencoder and conduct preliminary experiments. Experimental results demonstrate that the supervised time topic model outperforms several state-of-the-art topic models.

The structural topic model for online review analysis

Journal of Hospitality and Tourism Technology ◽

10.1108/jhtt-08-2017-0075 ◽

2018 ◽

Vol 11 (1) ◽

pp. 1-17 ◽

Cited By ~ 1

Author(s):

Eunhye (Olivia) Park ◽

Bongsug (Kevin) Chae ◽

Junehee Kwon

Keyword(s):

Topic Modeling ◽

Design Methodology ◽

Topic Model ◽

Text Data ◽

Systematic Analysis ◽

Sustainable Food ◽

Content Type ◽

Related Information ◽

Review Analysis ◽

Structural Topic Model

Purpose: The purpose of this study was to explore the influences of review-related information on topical proportions and the pattern of word appearances in each topic (topical content) using the structural topic model (STM). Design/methodology/approach: For 173,607 Yelp.com reviews written in 2005-2016, STM-based topic modeling was applied with the inclusion of covariates, in addition to traditional statistical analyses. Findings: Differences in topic prevalence and topical content were found between certified green and non-certified restaurants. Customers’ recognition of sustainable food topics changed over time. Research limitations/implications: This study demonstrates the application of STM for the systematic analysis of a large amount of text data. Originality/value: Few studies in the hospitality literature have examined the influence of review-level metadata on topic and term estimation. Through topic modeling, customers’ natural responses toward green practices were identified.

LSA & LDA topic modeling classification: comparison study on e-books

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v19.i1.pp353-362 ◽

2020 ◽

Vol 19 (1) ◽

pp. 353

Author(s):

Shaymaa H. Mohammed ◽

Salam Al-augby

Keyword(s):

Digital Libraries ◽

Full Text ◽

Topic Modeling ◽

Comparison Study ◽

Text Documents ◽

Text Data ◽

Text Document ◽

Unstructured Text ◽

The One ◽

Text Document Classification

With the rapid growth of information technology, the amount of unstructured text data in digital libraries is increasing rapidly, and analyzing, organizing, and automatically classifying text in e-research repositories so as to benefit from it has become a major challenge. Manual categorization of text documents requires substantial financial and human resources. Topic modeling is therefore used to classify documents. This paper presents a comparison study on the classification of scientific unstructured text documents (e-books) based on their full text, applying the most popular topic modeling approaches (LDA and LSA) to cluster words into sets of topics that serve as important keywords for classification. Our dataset consists of 300 books containing about 23 million words of full text. In both topic models (LSA, LDA), each word in the corpus vocabulary is connected with one or more topics with a probability estimated by the model. Many LDA and LSA models were built with different numbers of topics, and the one producing the highest coherence value was selected. The results show that LDA performs better than LSA: the best coherence value obtained with LDA was 0.592179 with 20 topics, while the LSA coherence value was 0.5773026 with 10 topics.
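
The selection step described above, scoring several candidate models and keeping the one with the highest coherence, can be sketched as follows. The abstract does not say which coherence measure was used; the UMass measure below is swapped in only because it is simple to compute from document co-occurrence counts, and its values are not comparable to the figures reported in the abstract. The function names `umass_coherence` and `best_model` and the toy inputs are assumptions.

```python
import math


def umass_coherence(top_words, docs):
    """UMass coherence of one topic (a stand-in measure, see the note above).

    top_words: the topic's top words, most probable first.
    docs: list of token lists used to count document co-occurrences.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score


def best_model(models, docs):
    """Return the label of the candidate with the highest mean topic coherence.

    models: dict mapping a label (e.g. "LDA, 20 topics") to its list of topics,
    each topic given as a list of top words.
    """
    def mean_coherence(topics):
        return sum(umass_coherence(t, docs) for t in topics) / len(topics)

    return max(models, key=lambda label: mean_coherence(models[label]))


# Hypothetical toy corpus and candidate models.
toy_docs = [["cat", "dog"], ["cat", "dog"], ["fish", "bird"]]
candidates = {"model_a": [["cat", "dog"]], "model_b": [["cat", "fish"]]}
chosen = best_model(candidates, toy_docs)
```

In practice the candidates would be LDA and LSA models fitted with different numbers of topics, and the same loop would be run over all of them.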

Deep Learning for text in limited data settings

10.36227/techrxiv.12100692 ◽

2020 ◽

Author(s):

Pathikkumar Patel ◽

Bhargav Lad ◽

Jinan Fiaidhi

Keyword(s):

Machine Learning ◽

Time Series ◽

Deep Learning ◽

Sentiment Analysis ◽

Transfer Learning ◽

Text Classification ◽

State Of The Art ◽

Time Series Forecasting ◽

Text Data ◽

Performance Levels

During the last few years, RNN models have been extensively used and they have proven to be better for sequence and text data. RNNs have achieved state-of-the-art performance levels in several applications such as text classification, sequence to sequence modelling and time series forecasting. In this article we will review different Machine Learning and Deep Learning based approaches for text data and look at the results obtained from these methods. This work also explores the use of transfer learning in NLP and how it affects the performance of models on a specific application of sentiment analysis.
