A New Sentence-Based Interpretative Topic Modeling and Automatic Topic Labeling

Symmetry ◽  
2021 ◽  
Vol 13 (5) ◽  
pp. 837
Author(s):  
Olzhas Kozbagarov ◽  
Rustam Mussabayev ◽  
Nenad Mladenovic

This article presents a new conceptual approach to the interpretative topic modeling problem. It uses sentences as the basic units of analysis, instead of the words or n-grams commonly used in standard approaches. The proposed approach is characterized by evaluating sentence probabilities within the text corpus and clustering sentence embeddings. The topic model estimates discrete distributions of sentence occurrences within topics and discrete distributions of topic occurrences within the text. Our approach makes explicit interpretation of topics possible, since sentences, unlike words, are more informative and carry complete grammatical and semantic constructions. A method for automatic topic labeling is also provided. Contextual embeddings based on the BERT model are used to obtain the corresponding sentence embeddings for subsequent analysis. Moreover, our approach scales to big data processing and shows how internal and external knowledge sources can be combined in topic modeling. The internal knowledge source is the text corpus itself, which is often the sole knowledge source in traditional topic modeling approaches. The external knowledge source is BERT, a machine learning model pretrained on a huge amount of textual data and used here to generate context-dependent sentence embeddings.
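The estimation of per-document topic distributions from clustered sentences can be sketched as follows. This is a minimal plain-Python illustration, assuming the BERT embedding and clustering steps have already assigned a topic label to each sentence; the paper's actual probability evaluation is elided:

```python
from collections import Counter

def topic_distribution(sentence_topics):
    """Estimate the discrete distribution of topic occurrences within one
    document, given the cluster (topic) label assigned to each sentence."""
    counts = Counter(sentence_topics)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

# Hypothetical document: four sentences, each mapped to a topic cluster.
doc_labels = [0, 0, 1, 2]
dist = topic_distribution(doc_labels)
# dist == {0: 0.5, 1: 0.25, 2: 0.25}
```

Because each "topic term" here is a whole sentence, the most central sentence of a cluster can double as a human-readable topic label, which is the interpretability gain the abstract describes.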

Author(s):  
R. Derbanosov ◽  
M. Bakhanova
Probabilistic topic modeling is a tool for statistical text analysis that can give us information about the inner structure of a large corpus of documents. The most popular models—Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation—produce topics in the form of discrete distributions over the set of all words of the corpus. They build topics using an iterative algorithm that starts from some random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to random initialization, meaning that different initial points can produce significantly different solutions. Several studies have shown that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use the BigARTM library to perform experiments on several text collections. We show that using side information as an additional modality improves topic stability without significant quality loss in the model.
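The stability problem the abstract mentions can be quantified by comparing the topics produced by two runs from different random initializations. A common rough proxy, not necessarily the paper's exact metric, is the average best-match Jaccard similarity between the runs' top-word lists:

```python
def jaccard(a, b):
    """Jaccard similarity between two word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def topic_stability(topics_run1, topics_run2):
    """Average best-match Jaccard similarity between the top-word lists of
    two topic model runs (greedy matching; a rough stability proxy)."""
    scores = [max(jaccard(t1, t2) for t2 in topics_run2)
              for t1 in topics_run1]
    return sum(scores) / len(scores)

# Hypothetical top words from two runs of the same model:
run1 = [["cat", "dog", "pet"], ["stock", "market", "price"]]
run2 = [["market", "price", "trade"], ["dog", "cat", "animal"]]
print(topic_stability(run1, run2))  # 0.5
```

A model whose side-information modality works as claimed would score closer to 1.0 across repeated restarts.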


2019 ◽  
Vol 8 (2S8) ◽  
pp. 1366-1371

Topic modeling, such as LDA, is a useful tool for the statistical analysis of text document collections and other text-based data. Recently, topic modeling has become an attractive research field due to its wide range of applications. However, traditional topic models such as LDA retain disadvantages stemming from the shortcomings of the bag-of-words (BOW) model, as well as poor performance on large text corpora. Therefore, in this paper we present a novel topic model, called LDA-GOW, which combines a word co-occurrence model, also called a graph-of-words (GOW) model, with the traditional LDA topic discovery model. The LDA-GOW topic model not only extracts more informative topics from text but also accelerates topic discovery on large-scale text corpora. We compare our proposed model with the traditional LDA topic model on several standardized datasets, including WebKB, Reuters-R8, and annotated scientific documents collected from the ACM digital library, to demonstrate its effectiveness. Across all experiments, our proposed LDA-GOW model achieves approximately 70.86% accuracy.
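The graph-of-words representation that replaces plain bag-of-words can be built with a sliding co-occurrence window. The sketch below shows the standard construction; the window size is an illustrative choice, not taken from the paper:

```python
from collections import defaultdict

def graph_of_words(tokens, window=2):
    """Build an undirected co-occurrence graph: two distinct words share an
    edge (weighted by count) whenever they appear within `window` tokens
    of each other."""
    edges = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if w != tokens[j]:
                edges[tuple(sorted((w, tokens[j])))] += 1
    return dict(edges)

g = graph_of_words("topic model learns topic structure".split())
# ('model', 'topic') co-occurs twice within the window
```

Unlike BOW counts, the edge weights preserve local word order information, which is what LDA-GOW feeds into topic discovery instead of raw term frequencies.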


Author(s):  
Shafquat Hussain ◽  
Athula Ginige

Chatbots, or conversational agents, are computer programs that interact with users in natural language through artificial intelligence, in such a way that users believe they are having a dialogue with a human. One of the main limitations of chatbot technology is the construction of its local knowledge base. A conventional chatbot knowledge base is typically hand constructed, which is a very time-consuming process and may take years to train a chatbot in a particular field of expertise. This chapter extends the knowledge base of a conventional chatbot beyond its local knowledge base to the external knowledge source Wikipedia. This has been achieved by using the MediaWiki API to retrieve information from Wikipedia when the chatbot's local knowledge base does not contain the answer to a user query. To make the conversation with the chatbot more meaningful with regard to the user's previous chat sessions, a user-specific session ability has been added to the chatbot architecture. An open source AIML web-based chatbot has been modified and programmed for use in the health informatics domain. The chatbot has been named VDMS, the Virtual Diabetes Management System. It is intended to be used by the general community and diabetic patients for diabetes education and management.
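The fallback logic described (answer locally, then consult Wikipedia) can be sketched as follows. The external lookup is passed in as a function so it can be stubbed here; the names and the stub response are illustrative, not the chapter's actual implementation or MediaWiki query:

```python
def answer(query, local_kb, wiki_lookup):
    """Answer from the chatbot's local knowledge base when possible,
    otherwise fall back to an external lookup (e.g., via the MediaWiki
    API). `wiki_lookup` is injected so the network call can be stubbed."""
    if query in local_kb:
        return local_kb[query]
    return wiki_lookup(query) or "Sorry, I don't know."

# Hypothetical local KB and a stand-in for the Wikipedia call:
kb = {"what is insulin": "A hormone that regulates blood glucose."}
stub = lambda q: f"[Wikipedia summary for: {q}]"
print(answer("what is insulin", kb, stub))  # local KB hit
print(answer("what is hba1c", kb, stub))    # falls back to Wikipedia
```

Keeping the external source behind a function boundary is also what makes it easy to swap Wikipedia for another knowledge source, or to cache answers back into the local KB.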


2020 ◽  
Vol 14 (02) ◽  
pp. 273-293
Author(s):  
Yingcheng Sun ◽  
Richard Kolacinski ◽  
Kenneth Loparo

With the explosive growth of online discussions published every day on social media platforms, comprehension and discovery of the most popular topics have become a challenging problem. Conventional topic models have had limited success in online discussions because the corpus is extremely sparse and noisy. To overcome their limitations, we use the discussion thread's tree structure and propose a "popularity" metric, which quantifies the number of replies to a comment, to extend the frequency of word occurrences, and a "transitivity" concept to characterize topic dependency among nodes in a nested discussion thread. We build a Conversational Structure Aware Topic Model (CSATM) based on popularity and transitivity to infer topics and their assignments to comments. Experiments on real forum datasets demonstrate improved performance for topic extraction under six different coherence measures and impressive accuracy for topic assignments.
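The popularity idea, extending raw word frequency by how much a comment is replied to, can be sketched as below. The exact weighting in CSATM may differ; this only illustrates why heavily replied-to comments contribute more to the counts:

```python
from collections import Counter

def popularity_weighted_counts(comments):
    """Extend raw word frequency by a comment's popularity: each word's
    count is scaled by (1 + number of replies), so words in heavily
    replied-to comments carry more weight in topic inference."""
    counts = Counter()
    for text, n_replies in comments:
        for word in text.split():
            counts[word] += 1 + n_replies
    return counts

# Hypothetical thread: (comment text, reply count) pairs.
thread = [("battery drains fast", 3), ("battery fine here", 0)]
c = popularity_weighted_counts(thread)
# c["battery"] == 4 + 1 == 5
```

This directly counters the sparsity problem the abstract names: a short but popular comment no longer looks statistically identical to an ignored one.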


Information ◽  
2020 ◽  
Vol 11 (8) ◽  
pp. 376 ◽  
Author(s):  
Cornelia Ferner ◽  
Clemens Havas ◽  
Elisabeth Birnbacher ◽  
Stefan Wegenkittl ◽  
Bernd Resch

In the event of a natural disaster, geo-tagged Tweets are an immediate source of information for locating casualties and damages, and for supporting disaster management. Topic modeling can help in detecting disaster-related Tweets in the noisy Twitter stream in an unsupervised manner. However, the results of topic models are difficult to interpret and require manual identification of one or more "disaster topics". Immediate disaster response would benefit from a fully automated process for interpreting the modeled topics and extracting disaster-relevant information. Initializing the topic model with a set of seed words makes it possible to identify the corresponding disaster topic directly. To enable an automated end-to-end process, we automatically generate seed words using older Tweets from the same geographic area. The results for two past events (the 2014 Napa Valley earthquake and hurricane Harvey in 2017) show that the geospatial distribution of Tweets identified as disaster-related conforms with the officially released disaster footprints. The suggested approach is applicable when there is a single topic of interest and comparative data is available.
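Seed-word generation from older Tweets of the same area can be sketched as a frequency ranking over a disaster-related vocabulary. The vocabulary filter and cutoff here are hypothetical; the paper's actual generation procedure may differ in detail:

```python
from collections import Counter

def seed_words(historic_tweets, disaster_terms, k=5):
    """Pick the k most frequent disaster-related terms from older tweets
    of the same geographic area, to seed (initialize) the topic model."""
    counts = Counter(
        w for tweet in historic_tweets
        for w in tweet.lower().split()
        if w in disaster_terms
    )
    return [w for w, _ in counts.most_common(k)]

tweets = ["Earthquake shook the valley", "strong earthquake damage reported"]
vocab = {"earthquake", "damage", "flood", "fire"}
print(seed_words(tweets, vocab))  # ['earthquake', 'damage']
```

Seeding the model this way is what removes the manual step: the topic initialized from these words is, by construction, the disaster topic.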


Symmetry ◽  
2019 ◽  
Vol 11 (12) ◽  
pp. 1486
Author(s):  
Zhinan Gou ◽  
Zheng Huo ◽  
Yuanzhen Liu ◽  
Yi Yang

Supervised topic modeling has been successfully applied in the fields of document classification and tag recommendation in recent years. However, most existing models neglect the fact that topic terms differ in their ability to distinguish topics. In this paper, we propose a term frequency-inverse topic frequency (TF-ITF) method for constructing a supervised topic model, in which the weight of each topic term indicates its ability to distinguish topics. We conduct a series of experiments with both symmetric and asymmetric Dirichlet prior parameters. Experimental results demonstrate that introducing TF-ITF into a supervised topic model outperforms several state-of-the-art supervised topic models.
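By analogy with TF-IDF, a TF-ITF weight can be written as term frequency discounted by how many topics contain the term. The formula below is the natural reading of the name; the paper's exact normalization may differ:

```python
import math

def tf_itf(term_freq, n_topics, topic_freq):
    """Term frequency-inverse topic frequency: weight a term by its
    frequency within a topic, discounted by the fraction of topics that
    contain it. A term present in every topic gets weight 0."""
    return term_freq * math.log(n_topics / topic_freq)

# A term appearing 10 times, present in 2 of 20 topics:
w = tf_itf(10, 20, 2)  # 10 * ln(10), about 23.03
```

The effect is exactly the discriminative weighting the abstract describes: ubiquitous terms are downweighted, topic-specific terms are boosted.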


2020 ◽  
Vol 32 (12) ◽  
pp. 2322-2335 ◽  
Author(s):  
Peng Zhang ◽  
Suge Wang ◽  
Deyu Li ◽  
Xiaoli Li ◽  
Zhikang Xu

2018 ◽  
Vol 11 (1) ◽  
pp. 18-27 ◽  
Author(s):  
Micah D. Saxton

Topic modeling is a data mining method which can be used to understand and categorize large corpora of data; as such, it is a tool which theological librarians can use in their professional workflows and scholarly practices. In this article I provide a gentle introduction to topic modeling for those who have no prior knowledge of the topic. I begin with a conceptual overview of topic modeling which does not rely on the complicated mathematics behind the process. Then, I illustrate topic modeling by providing a narrative of building a topic model using the entirety of Theological Librarianship as my example corpus. This narrative ends with an analysis of the success of the model and suggestions for improvement. Finally, I recommend a few resources for those who would like to pursue topic modeling further.

