A New Sentence-Based Interpretative Topic Modeling and Automatic Topic Labeling

Symmetry ◽  
2021 ◽  
Vol 13 (5) ◽  
pp. 837
Author(s):  
Olzhas Kozbagarov ◽  
Rustam Mussabayev ◽  
Nenad Mladenovic

This article presents a new conceptual approach to the interpretative topic modeling problem. It uses sentences as the basic units of analysis, instead of the words or n-grams commonly used in standard approaches. The proposed approach is characterized by evaluating sentence probabilities within the text corpus and clustering sentence embeddings. The topic model estimates discrete distributions of sentence occurrences within topics and discrete distributions of topic occurrences within the text. Our approach makes explicit interpretation of topics possible, since sentences, unlike words, are more informative and carry complete grammatical and semantic constructions. A method for automatic topic labeling is also provided. Contextual embeddings based on the BERT model are used to obtain the corresponding sentence embeddings for subsequent analysis. Moreover, our approach scales to big data processing and shows how internal and external knowledge sources can be combined in topic modeling. The internal knowledge source is the text corpus itself, which is often the sole knowledge source in traditional topic modeling approaches. The external knowledge source is BERT, a machine learning model pretrained on a huge amount of textual data and used here to generate context-dependent sentence embeddings.
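The estimation of per-document topic distributions from clustered sentences can be sketched as follows. This is a minimal plain-Python illustration, assuming the BERT embedding and clustering steps have already assigned a topic label to each sentence; the paper's actual probability evaluation is elided:

```python
from collections import Counter

def topic_distribution(sentence_topics):
    """Estimate the discrete distribution of topic occurrences within one
    document, given the cluster (topic) label assigned to each sentence."""
    counts = Counter(sentence_topics)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

# Hypothetical document: four sentences, each mapped to a topic cluster.
doc_labels = [0, 0, 1, 2]
dist = topic_distribution(doc_labels)
# dist == {0: 0.5, 1: 0.25, 2: 0.25}
```

Because each "topic term" here is a whole sentence, the most central sentence of a cluster can double as a human-readable topic label, which is the interpretability gain the abstract describes.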

Author(s):  
R. Derbanosov ◽  
M. Bakhanova
Probabilistic topic modeling is a tool for statistical text analysis that can give us information about the inner structure of a large corpus of documents. The most popular models—Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation—produce topics in the form of discrete distributions over the set of all words of the corpus. They build topics using an iterative algorithm that starts from some random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to random initialization, meaning that different initial points can produce significantly different solutions. Several studies have shown that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use the BigARTM library to perform experiments on several text collections. We show that using side information as an additional modality improves topic stability without significant quality loss in the model.
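The stability problem the abstract mentions can be quantified by comparing the topics produced by two runs from different random initializations. A common rough proxy, not necessarily the paper's exact metric, is the average best-match Jaccard similarity between the runs' top-word lists:

```python
def jaccard(a, b):
    """Jaccard similarity between two word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def topic_stability(topics_run1, topics_run2):
    """Average best-match Jaccard similarity between the top-word lists of
    two topic model runs (greedy matching; a rough stability proxy)."""
    scores = [max(jaccard(t1, t2) for t2 in topics_run2)
              for t1 in topics_run1]
    return sum(scores) / len(scores)

# Hypothetical top words from two runs of the same model:
run1 = [["cat", "dog", "pet"], ["stock", "market", "price"]]
run2 = [["market", "price", "trade"], ["dog", "cat", "animal"]]
print(topic_stability(run1, run2))  # 0.5
```

A model whose side-information modality works as claimed would score closer to 1.0 across repeated restarts.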


2019 ◽  
Vol 8 (2S8) ◽  
pp. 1366-1371

Topic modeling, such as LDA, is a useful tool for the statistical analysis of text document collections and other text-based data. Recently, topic modeling has become an attractive research field due to its wide range of applications. However, traditional topic models such as LDA retain disadvantages stemming from the shortcomings of the bag-of-words (BOW) model, as well as poor performance on large text corpora. Therefore, in this paper we present a novel topic model, called LDA-GOW, which combines a word co-occurrence model, also called a graph-of-words (GOW) model, with the traditional LDA topic discovery model. The LDA-GOW topic model not only extracts more informative topics from text but also accelerates topic discovery on large-scale text corpora. We compare our proposed model with the traditional LDA topic model on several standardized datasets, including WebKB, Reuters-R8, and annotated scientific documents collected from the ACM digital library, to demonstrate its effectiveness. Across all experiments, our proposed LDA-GOW model achieves approximately 70.86% accuracy.
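The graph-of-words representation that replaces plain bag-of-words can be built with a sliding co-occurrence window. The sketch below shows the standard construction; the window size is an illustrative choice, not taken from the paper:

```python
from collections import defaultdict

def graph_of_words(tokens, window=2):
    """Build an undirected co-occurrence graph: two distinct words share an
    edge (weighted by count) whenever they appear within `window` tokens
    of each other."""
    edges = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if w != tokens[j]:
                edges[tuple(sorted((w, tokens[j])))] += 1
    return dict(edges)

g = graph_of_words("topic model learns topic structure".split())
# ('model', 'topic') co-occurs twice within the window
```

Unlike BOW counts, the edge weights preserve local word order information, which is what LDA-GOW feeds into topic discovery instead of raw term frequencies.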


Author(s):  
Shafquat Hussain ◽  
Athula Ginige

Chatbots, or conversational agents, are computer programs that interact with users in natural language through artificial intelligence, in such a way that users believe they are having a dialogue with a human. One of the main limitations of chatbot technology is the construction of its local knowledge base. A conventional chatbot knowledge base is typically hand constructed, which is a very time-consuming process and may take years to train a chatbot in a particular field of expertise. This chapter extends the knowledge base of a conventional chatbot beyond its local knowledge base to the external knowledge source Wikipedia. This has been achieved by using the MediaWiki API to retrieve information from Wikipedia when the chatbot's local knowledge base does not contain the answer to a user query. To make the conversation with the chatbot more meaningful with regard to the user's previous chat sessions, a user-specific session ability has been added to the chatbot architecture. An open source AIML web-based chatbot has been modified and programmed for use in the health informatics domain. The chatbot has been named VDMS, the Virtual Diabetes Management System. It is intended to be used by the general community and diabetic patients for diabetes education and management.
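The fallback logic described (answer locally, then consult Wikipedia) can be sketched as follows. The external lookup is passed in as a function so it can be stubbed here; the names and the stub response are illustrative, not the chapter's actual implementation or MediaWiki query:

```python
def answer(query, local_kb, wiki_lookup):
    """Answer from the chatbot's local knowledge base when possible,
    otherwise fall back to an external lookup (e.g., via the MediaWiki
    API). `wiki_lookup` is injected so the network call can be stubbed."""
    if query in local_kb:
        return local_kb[query]
    return wiki_lookup(query) or "Sorry, I don't know."

# Hypothetical local KB and a stand-in for the Wikipedia call:
kb = {"what is insulin": "A hormone that regulates blood glucose."}
stub = lambda q: f"[Wikipedia summary for: {q}]"
print(answer("what is insulin", kb, stub))  # local KB hit
print(answer("what is hba1c", kb, stub))    # falls back to Wikipedia
```

Keeping the external source behind a function boundary is also what makes it easy to swap Wikipedia for another knowledge source, or to cache answers back into the local KB.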


2020 ◽  
Vol 14 (02) ◽  
pp. 273-293
Author(s):  
Yingcheng Sun ◽  
Richard Kolacinski ◽  
Kenneth Loparo

With the explosive growth of online discussions published every day on social media platforms, comprehension and discovery of the most popular topics have become a challenging problem. Conventional topic models have had limited success in online discussions because the corpus is extremely sparse and noisy. To overcome their limitations, we use the discussion thread's tree structure and propose a "popularity" metric, which quantifies the number of replies to a comment, to extend the frequency of word occurrences, and a "transitivity" concept to characterize topic dependency among nodes in a nested discussion thread. We build a Conversational Structure Aware Topic Model (CSATM) based on popularity and transitivity to infer topics and their assignments to comments. Experiments on real forum datasets demonstrate improved performance for topic extraction under six different coherence measures and impressive accuracy for topic assignments.
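The popularity idea, extending raw word frequency by how much a comment is replied to, can be sketched as below. The exact weighting in CSATM may differ; this only illustrates why heavily replied-to comments contribute more to the counts:

```python
from collections import Counter

def popularity_weighted_counts(comments):
    """Extend raw word frequency by a comment's popularity: each word's
    count is scaled by (1 + number of replies), so words in heavily
    replied-to comments carry more weight in topic inference."""
    counts = Counter()
    for text, n_replies in comments:
        for word in text.split():
            counts[word] += 1 + n_replies
    return counts

# Hypothetical thread: (comment text, reply count) pairs.
thread = [("battery drains fast", 3), ("battery fine here", 0)]
c = popularity_weighted_counts(thread)
# c["battery"] == 4 + 1 == 5
```

This directly counters the sparsity problem the abstract names: a short but popular comment no longer looks statistically identical to an ignored one.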


Information ◽  
2020 ◽  
Vol 11 (8) ◽  
pp. 376 ◽  
Author(s):  
Cornelia Ferner ◽  
Clemens Havas ◽  
Elisabeth Birnbacher ◽  
Stefan Wegenkittl ◽  
Bernd Resch

In the event of a natural disaster, geo-tagged Tweets are an immediate source of information for locating casualties and damages, and for supporting disaster management. Topic modeling can help in detecting disaster-related Tweets in the noisy Twitter stream in an unsupervised manner. However, the results of topic models are difficult to interpret and require manual identification of one or more "disaster topics". Immediate disaster response would benefit from a fully automated process for interpreting the modeled topics and extracting disaster-relevant information. Initializing the topic model with a set of seed words makes it possible to identify the corresponding disaster topic directly. To enable an automated end-to-end process, we automatically generate seed words using older Tweets from the same geographic area. The results for two past events (the 2014 Napa Valley earthquake and hurricane Harvey in 2017) show that the geospatial distribution of Tweets identified as disaster-related conforms with the officially released disaster footprints. The suggested approach is applicable when there is a single topic of interest and comparative data is available.
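Seed-word generation from older Tweets of the same area can be sketched as a frequency ranking over a disaster-related vocabulary. The vocabulary filter and cutoff here are hypothetical; the paper's actual generation procedure may differ in detail:

```python
from collections import Counter

def seed_words(historic_tweets, disaster_terms, k=5):
    """Pick the k most frequent disaster-related terms from older tweets
    of the same geographic area, to seed (initialize) the topic model."""
    counts = Counter(
        w for tweet in historic_tweets
        for w in tweet.lower().split()
        if w in disaster_terms
    )
    return [w for w, _ in counts.most_common(k)]

tweets = ["Earthquake shook the valley", "strong earthquake damage reported"]
vocab = {"earthquake", "damage", "flood", "fire"}
print(seed_words(tweets, vocab))  # ['earthquake', 'damage']
```

Seeding the model this way is what removes the manual step: the topic initialized from these words is, by construction, the disaster topic.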


Symmetry ◽  
2019 ◽  
Vol 11 (12) ◽  
pp. 1486
Author(s):  
Zhinan Gou ◽  
Zheng Huo ◽  
Yuanzhen Liu ◽  
Yi Yang

Supervised topic modeling has been successfully applied in the fields of document classification and tag recommendation in recent years. However, most existing models neglect the fact that topic terms differ in their ability to distinguish topics. In this paper, we propose a term frequency-inverse topic frequency (TF-ITF) method for constructing a supervised topic model, in which the weight of each topic term indicates its ability to distinguish topics. We conduct a series of experiments with both symmetric and asymmetric Dirichlet prior parameters. Experimental results demonstrate that introducing TF-ITF into a supervised topic model outperforms several state-of-the-art supervised topic models.
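By analogy with TF-IDF, a TF-ITF weight can be written as term frequency discounted by how many topics contain the term. The formula below is the natural reading of the name; the paper's exact normalization may differ:

```python
import math

def tf_itf(term_freq, n_topics, topic_freq):
    """Term frequency-inverse topic frequency: weight a term by its
    frequency within a topic, discounted by the fraction of topics that
    contain it. A term present in every topic gets weight 0."""
    return term_freq * math.log(n_topics / topic_freq)

# A term appearing 10 times, present in 2 of 20 topics:
w = tf_itf(10, 20, 2)  # 10 * ln(10), about 23.03
```

The effect is exactly the discriminative weighting the abstract describes: ubiquitous terms are downweighted, topic-specific terms are boosted.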


2020 ◽  
Vol 32 (12) ◽  
pp. 2322-2335 ◽  
Author(s):  
Peng Zhang ◽  
Suge Wang ◽  
Deyu Li ◽  
Xiaoli Li ◽  
Zhikang Xu

2018 ◽  
Vol 11 (1) ◽  
pp. 18-27 ◽  
Author(s):  
Micah D. Saxton

Topic modeling is a data mining method which can be used to understand and categorize large corpora of data; as such, it is a tool which theological librarians can use in their professional workflows and scholarly practices. In this article I provide a gentle introduction to topic modeling for those who have no prior knowledge of the topic. I begin with a conceptual overview of topic modeling which does not rely on the complicated mathematics behind the process. Then, I illustrate topic modeling by providing a narrative of building a topic model using the entirety of Theological Librarianship as my example corpus. This narrative ends with an analysis of the success of the model and suggestions for improvement. Finally, I recommend a few resources for those who would like to pursue topic modeling further.

