Modeling Topics in the Alternative Uses Task

Mapping Intimacies ◽ 10.31234/osf.io/2t9qw ◽ 2018

Author(s): Rick Hass

Keyword(s): Topic Modeling ◽ Latent Dirichlet Allocation ◽ Semantic Analysis ◽ Topic Model ◽ Short Paper ◽ Log P ◽ Test Items ◽ Log Likelihood ◽ Data Log ◽ Representation Of Knowledge

What is the representation of knowledge underlying human responses to alternative uses test items? This short paper describes an application of Latent Dirichlet Allocation (LDA, also known as topic modeling) to solve this problem of knowledge representation. For this small application, a document was defined as the set of responses given by a single participant to the alternative uses test brick prompt. This was chosen instead of single responses as the document unit, as single responses to alternative uses items are rather short, and LDA assumes that documents are probabilistic mixtures of topics. The approach explored in this paper used LDA with Gibbs sampling, with the primary goal of model selection. The log likelihood of the data (log P(w | T)) was computed as the number of topics varied from 5 to 100. Results showed that the log likelihood increased to a peak at 15 topics and then steadily declined up to 100 topics. In the 15-topic model the most frequently appearing topic was that which gave the highest probability to the terms build, house, step, and smash. Documents best represented by that topic assignment were, on average, more similar to the dictionary definition of a brick based on vector cosines computed with Latent Semantic Analysis. Additional implications for using the topic model as a knowledge base for cognitive systems, and also as a tool for quantifying flexibility, the number of categories present in alternative uses response arrays, will also be discussed.
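
As a rough illustration of the document definition described above, the sketch below pools each participant's responses to the brick prompt into one bag-of-words document. The participant IDs, the responses, and the helper name `pool_by_participant` are invented for illustration and are not taken from the paper's data or code.

```python
from collections import defaultdict


def pool_by_participant(responses):
    """Group (participant_id, response) pairs into one token list per participant."""
    documents = defaultdict(list)
    for participant_id, response in responses:
        # Naive whitespace tokenization; a real pipeline would likely
        # also stem words and drop stop words (assumption).
        documents[participant_id].extend(response.lower().split())
    return dict(documents)


# Toy responses to the "brick" prompt (made up for this example).
toy_responses = [
    ("p01", "build a house"),
    ("p01", "door stop"),
    ("p02", "smash a window"),
    ("p02", "build a wall"),
]

documents = pool_by_participant(toy_responses)
for participant_id, tokens in documents.items():
    print(participant_id, tokens)
```

Each pooled token list would then serve as one document for the topic model, which gives LDA more co-occurrence information than a single short response would.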

STABILITY OF TOPIC MODELING VIA MODALITY REGULARIZATION

Computational Linguistics and Intellectual Technologies ◽ 10.28995/2075-7182-2020-19-198-210 ◽ 2020

Author(s): R. Derbanosov ◽ M. Bakhanova

Keyword(s): Topic Modeling ◽ Latent Dirichlet Allocation ◽ Semantic Analysis ◽ Topic Model ◽ Side Information ◽ Auxiliary Information ◽ Discrete Distributions ◽ Probabilistic Latent Semantic Analysis ◽ Probabilistic Topic Modeling ◽ Random Initialization

Probabilistic topic modeling is a tool for statistical text analysis that can give us information about the inner structure of a large corpus of documents. The most popular models, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation, produce topics in the form of discrete distributions over the set of all words of the corpus. They build topics using an iterative algorithm that starts from some random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to random initialization, that is, different initial points can produce significantly different solutions. Several studies showed that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use the BigARTM library to perform experiments on several text collections. We show that using side information as an additional modality improves topic stability without significant loss of model quality.
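
One simple way to quantify the sensitivity described above is to compare the top words of topics obtained from two runs with different random seeds. The sketch below matches each topic from one run to its most similar topic in the other run by Jaccard similarity and averages the result. This measure, the function names, and the toy topics are assumptions made for illustration; the paper's own stability metric and the BigARTM API are not shown here.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def run_stability(topics_a, topics_b):
    """Average, over topics in run A, of the best Jaccard match in run B."""
    best_matches = [max(jaccard(t, u) for u in topics_b) for t in topics_a]
    return sum(best_matches) / len(best_matches)


# Top words per topic from two hypothetical runs with different seeds.
run_1 = [["gene", "cell", "protein"], ["market", "price", "trade"]]
run_2 = [["price", "market", "stock"], ["cell", "gene", "protein"]]

print(run_stability(run_1, run_2))
```

A score near 1 means the two runs recovered nearly the same topics regardless of topic order; lower scores indicate that the solution depends on the initialization.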

Multi-Task Topic Analysis Framework for Hallmarks of Cancer with Weak Supervision

Applied Sciences ◽ 10.3390/app10030834 ◽ 2020 ◽ Vol 10 (3) ◽ pp. 834

Author(s): Erdenebileg Batbaatar ◽ Van-Huy Pham ◽ Keun Ho Ryu

Keyword(s): Topic Modeling ◽ Latent Dirichlet Allocation ◽ Semantic Analysis ◽ Topic Model ◽ Label Propagation ◽ Probabilistic Latent Semantic Analysis ◽ Hallmarks Of Cancer ◽ Weak Supervision ◽ Topic Analysis ◽ Cancer Data

The hallmarks of cancer represent an essential concept for discovering novel knowledge about cancer and for extracting the complexity of cancer. Due to the lack of topic analysis frameworks optimized specifically for cancer data, studies on topic modeling in cancer research still face a strong challenge. Recently, deep learning (DL) based approaches were successfully employed to learn semantic and contextual information from scientific documents using word embeddings according to the hallmarks of cancer (HoC). However, those approaches are only applicable to labeled data, and comparatively few documents are labeled by experts. In the real world, a massive number of unlabeled documents are available online. In this paper, we present a multi-task topic analysis (MTTA) framework to analyze cancer hallmark-specific topics from documents. The MTTA framework consists of three main subtasks: (1) cancer hallmark learning (CHL), used to learn cancer hallmarks on existing labeled documents; (2) weak label propagation (WLP), used to classify a large number of unlabeled documents with the pre-trained model from the CHL task; and (3) topic modeling (ToM), used to discover topics for each hallmark category. In the CHL task, we employed a convolutional neural network (CNN) with pre-trained word embeddings that represent semantic meanings obtained from a large unlabeled corpus. In the ToM task, we employed latent topic models such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) to capture the semantic information learned by the CNN model for topic analysis. To evaluate the MTTA framework, we collected a large number of documents related to lung cancer in a case study. We also conducted a comprehensive performance evaluation of the MTTA framework, comparing it with several approaches.

Automated Seeded Latent Dirichlet Allocation for Social Media Based Event Detection and Mapping

Information ◽ 10.3390/info11080376 ◽ 2020 ◽ Vol 11 (8) ◽ pp. 376 ◽ Cited By ~ 2

Author(s): Cornelia Ferner ◽ Clemens Havas ◽ Elisabeth Birnbacher ◽ Stefan Wegenkittl ◽ Bernd Resch

Keyword(s): Event Detection ◽ Disaster Response ◽ Topic Modeling ◽ Latent Dirichlet Allocation ◽ Topic Model ◽ Geographic Area ◽ Relevant Information ◽ Suggested Approach ◽ Napa Valley ◽ Source Of Information

In the event of a natural disaster, geo-tagged Tweets are an immediate source of information for locating casualties and damages, and for supporting disaster management. Topic modeling can help in detecting disaster-related Tweets in the noisy Twitter stream in an unsupervised manner. However, the results of topic models are difficult to interpret and require manual identification of one or more “disaster topics”. Immediate disaster response would benefit from a fully automated process for interpreting the modeled topics and extracting disaster-relevant information. Initializing the topic model with a set of seed words already allows the corresponding disaster topic to be identified directly. In order to enable an automated end-to-end process, we automatically generate seed words using older Tweets from the same geographic area. The results for two past events (the Napa Valley earthquake of 2014 and Hurricane Harvey in 2017) show that the geospatial distribution of Tweets identified as disaster-related conforms with the officially released disaster footprints. The suggested approach is applicable when there is a single topic of interest and comparative data are available.

A word embedding topic model for topic detection and summary in social networks

Measurement and Control ◽ 10.1177/0020294019865750 ◽ 2019 ◽ Vol 52 (9-10) ◽ pp. 1289-1298 ◽ Cited By ~ 1

Author(s): Lei Shi ◽ Gang Cheng ◽ Shang-ru Xie ◽ Gang Xie

Keyword(s): Social Networks ◽ Social Network ◽ Latent Dirichlet Allocation ◽ Semantic Analysis ◽ Topic Model ◽ Word Embedding ◽ Probabilistic Latent Semantic Analysis ◽ Topic Detection ◽ Short Text ◽ Internal Relationship

The aim of topic detection is to automatically identify the events and hot topics in social networks and continuously track known topics. Applying traditional methods such as Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis is difficult given the high dimensionality of massive event texts and the short-text sparsity problems of social networks. The sparse distribution of topics also leads to unclear topics. To address these challenges, we propose a novel word embedding topic model, named Cbow Topic Model (CTM), that combines the topic model with the continuous bag-of-words (Cbow) word embedding method for topic detection and summary in social networks. We cluster similar words in the target social network text dataset using the classic Cbow word vectorization method, which can effectively learn the internal relationship between words and reduce the dimensionality of the input texts. We employ the topic model to model short text, which effectively weakens the sparsity problem of social network texts. To detect and summarize the topic, we propose a topic detection method that leverages similarity computing for social networks. We collected a Sina microblog dataset to conduct various experiments. The experimental results demonstrate that the CTM method is superior to the existing topic model method.

Topic Modeling in Embedding Spaces

Transactions of the Association for Computational Linguistics ◽ 10.1162/tacl_a_00325 ◽ 2020 ◽ Vol 8 ◽ pp. 439-453 ◽ Cited By ~ 2

Author(s): Adji B. Dieng ◽ Francisco J. R. Ruiz ◽ David M. Blei

Keyword(s): Topic Modeling ◽ Latent Dirichlet Allocation ◽ Topic Model ◽ Topic Models ◽ Predictive Performance ◽ Inner Product ◽ Natural Parameter ◽ Document Models ◽ Heavy Tailed ◽ Categorical Distribution

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
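
The per-topic word distribution described above can be sketched as a softmax over the inner products between each word embedding and a topic embedding. The embeddings below are tiny made-up vectors, and the names `rho`, `alpha_k`, and `topic_word_distribution` are illustrative choices; the sketch does not cover the variational inference procedure.

```python
import math


def topic_word_distribution(rho, alpha_k):
    """Softmax over the vocabulary of the inner products <rho_v, alpha_k>."""
    logits = [sum(r * a for r, a in zip(rho_v, alpha_k)) for rho_v in rho]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - shift) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 2-dimensional embeddings for a 3-word vocabulary.
vocab = ["brick", "house", "tweet"]
rho = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
alpha_k = [2.0, 0.0]  # embedding of one topic (made up)

beta_k = topic_word_distribution(rho, alpha_k)
for word, prob in zip(vocab, beta_k):
    print(f"{word}: {prob:.3f}")
```

Words whose embeddings point in the same direction as the topic embedding receive higher probability, which is how rare words can still be placed sensibly within a topic.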

Incorporating Biterm Correlation Knowledge into Topic Modeling for Short Texts

The Computer Journal ◽ 10.1093/comjnl/bxaa079 ◽ 2020

Author(s): Kai Zhang ◽ Yuan Zhou ◽ Zheng Chen ◽ Yufei Liu ◽ Zhuo Tang ◽ ...

Keyword(s): Topic Modeling ◽ Latent Dirichlet Allocation ◽ Topic Model ◽ Semantic Knowledge ◽ Superior Performance ◽ Knowledge Based ◽ Modeling Process ◽ Proposed Model ◽ Benchmark Datasets ◽ Latent Topic

The prevalence of short texts on the Web has made mining the latent topic structures of short texts a critical and fundamental task for many applications. However, due to the lack of word co-occurrence information induced by the content sparsity of short texts, it is challenging for traditional topic models like latent Dirichlet allocation (LDA) to extract coherent topic structures from short texts. Incorporating external semantic knowledge into the topic modeling process is an effective strategy to improve the coherence of inferred topics. In this paper, we develop a novel topic model, called the biterm correlation knowledge-based topic model (BCK-TM), to infer latent topics from short texts. Specifically, the proposed model mines biterm correlation knowledge automatically based on recent progress in word embedding, which can represent semantic information of words in a continuous vector space. To incorporate external knowledge, a knowledge incorporation mechanism is designed over the latent topic layer to regularize the topic assignment of each biterm during the topic sampling process. Experimental results on three public benchmark datasets illustrate the superior performance of the proposed approach over several state-of-the-art baseline models.
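
For readers unfamiliar with the term, a biterm is usually taken to be an unordered pair of words that co-occur in the same short text. The sketch below extracts biterms under that assumption; the function name and the example text are illustrative, and the correlation-knowledge mechanism of BCK-TM is not reproduced.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs from one short text."""
    # Sort each pair so that (a, b) and (b, a) are stored the same way.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


short_text = ["visit", "apple", "store"]
biterms = extract_biterms(short_text)
print(biterms)
```

Modeling pairs pooled over the whole corpus, rather than word counts within each short document, is what lets biterm-based models recover co-occurrence structure that a single short text is too sparse to provide.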

Ldagibbs: A Command for Topic Modeling in Stata Using Latent Dirichlet Allocation

The Stata Journal: Promoting communications on statistics and Stata ◽ 10.1177/1536867x1801800107 ◽ 2018 ◽ Vol 18 (1) ◽ pp. 101-117 ◽ Cited By ~ 10

Author(s): Carlo Schwarz

Keyword(s): Machine Learning ◽ Probability Distribution ◽ Topic Modeling ◽ Latent Dirichlet Allocation ◽ Topic Model ◽ Topic Models ◽ Text Documents ◽ Text Data ◽ Dirichlet Allocation

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
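
The two distributions described above combine into a mixture: the probability of a word in a document is the sum, over topics, of the document's topic probability times that topic's word probability. The sketch below works through this with invented numbers; it is written in Python for illustration and does not reflect the syntax or output of the `ldagibbs` command.

```python
def word_probability(theta_d, beta, word_index):
    """p(w | d) = sum over topics k of theta_d[k] * beta[k][w]."""
    return sum(theta_k * beta_k[word_index] for theta_k, beta_k in zip(theta_d, beta))


vocab = ["tax", "vote", "goal", "team"]
# Each row is one topic's distribution over the vocabulary (rows sum to 1).
beta = [
    [0.5, 0.4, 0.05, 0.05],  # a "politics"-like topic (made up)
    [0.05, 0.05, 0.4, 0.5],  # a "sports"-like topic (made up)
]
theta_d = [0.75, 0.25]  # one document's distribution over the two topics

p_words = [word_probability(theta_d, beta, i) for i in range(len(vocab))]
print(dict(zip(vocab, p_words)))
```

Because both `theta_d` and every row of `beta` sum to one, the resulting word probabilities for the document also sum to one.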

CLDA: An Effective Topic Model for Mining User Interest Preference under Big Data Background

Complexity ◽ 10.1155/2018/2503816 ◽ 2018 ◽ Vol 2018 ◽ pp. 1-10 ◽ Cited By ~ 5

Author(s): Lirong Qiu ◽ Jia Yu

Keyword(s): Big Data ◽ Topic Modeling ◽ Latent Dirichlet Allocation ◽ Topic Model ◽ User Interest ◽ Text Data ◽ Data Set ◽ Data Sparsity ◽ Short Text ◽ Text Filtering

Against the present big data background, a central problem is how to effectively extract useful information. The purpose of this study is to construct a more effective method of mining the interest preferences of users in a particular field in the context of today’s big data. We mainly study a large amount of user text data from microblogs. LDA is an effective method of text mining, but it does not perform well when applied directly to the large number of short texts in microblogs. In current topic modeling practice, short texts need to be aggregated into long texts to avoid data sparsity. However, aggregated short texts are mixed with a lot of noise, reducing the accuracy of mining the user’s interest preferences. In this paper, we propose Combining Latent Dirichlet Allocation (CLDA), a new topic model that can learn the potential topics of microblog short texts and long texts simultaneously. The data sparsity of short texts is avoided by using aggregated long texts to assist in learning short texts. Short texts are in turn used to filter the long texts to improve mining accuracy, so that long texts and short texts are effectively combined. Experimental results on a real microblog data set show that CLDA outperforms many advanced models in mining user interest, and we also confirm that CLDA performs well in recommender systems.

An Efficient Topic Modeling Approach for Text Mining and Information Retrieval through K-means Clustering

Mehran University Research Journal of Engineering and Technology ◽ 10.22581/muet1982.2001.20 ◽ 2020 ◽ Vol 39 (1) ◽ pp. 213-222

Author(s): Junaid Rashid ◽ Syed Muhammad Adnan Shah ◽ Aun Irtaza

Keyword(s): Information Retrieval ◽ Text Mining ◽ Topic Modeling ◽ Clustering Algorithm ◽ Latent Dirichlet Allocation ◽ Semantic Analysis ◽ State Of The Art ◽ Text Documents ◽ New Perspective ◽ Better Than

Topic modeling is an effective text mining and information retrieval approach to organizing knowledge with various contents under a specific topic. Text documents in the form of news articles are increasing very fast on the web. Analysis of these documents is very important in the fields of text mining and information retrieval. Meaningful information extraction from these documents is a challenging task. One approach for discovering the theme of text documents is topic modeling, but this approach still needs a new perspective to improve its performance. In topic modeling, documents have topics and topics are collections of words. In this paper, we propose a new k-means topic modeling (KTM) approach that uses the k-means clustering algorithm. KTM discovers better semantic topics from a collection of documents. Experiments on two real-world datasets, Reuters-21578 and BBC News, show that KTM performs better than state-of-the-art topic models like LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis). KTM is also applicable to classification and clustering tasks in text mining and achieves higher performance than its competitors LDA and LSA.

Topic Modeling as a Tool for Analyzing Library Chat Transcripts

Information Technology and Libraries ◽ 10.6017/ital.v40i3.13333 ◽ 2021 ◽ Vol 40 (3)

Author(s): HyunSeung Koh ◽ Mark Fienup

Keyword(s): Topic Modeling ◽ Latent Dirichlet Allocation ◽ Semantic Analysis ◽ Academic Library ◽ Qualitative Evaluation ◽ Probabilistic Latent Semantic Analysis ◽ Analysis Tool ◽ Chat Reference ◽ Better Than

Library chat services are an increasingly important communication channel to connect patrons to library resources and services. Analysis of chat transcripts could provide librarians with insights into improving services. Unfortunately, chat transcripts consist of unstructured text data, making it impractical for librarians to go beyond simple quantitative analysis (e.g., chat duration, message count, word frequencies) with existing tools. As a stepping-stone toward a more sophisticated chat transcript analysis tool, this study investigated the application of different types of topic modeling techniques to analyze one academic library’s chat reference data collected from April 10, 2015, to May 31, 2019, with the goal of extracting the most accurate and easily interpretable topics. In this study, topic accuracy and interpretability (the quality of topic outcomes) were quantitatively measured with topic coherence metrics. Additionally, qualitative accuracy and interpretability were assessed by the librarian author of this paper, based on a subjective judgment of whether topics are aligned with frequently asked questions or easily inferable themes in academic library contexts. This study found that, in a human’s qualitative evaluation, Probabilistic Latent Semantic Analysis (pLSA) produced more accurate and interpretable topics, which is not necessarily aligned with the findings of the quantitative evaluation with all three types of topic coherence metrics. Interestingly, the commonly used technique Latent Dirichlet Allocation (LDA) did not necessarily perform better than pLSA. Also, semi-supervised techniques with human-curated anchor words, Correlation Explanation (CorEx) and guided LDA (GuidedLDA), did not necessarily perform better than the unsupervised technique Dirichlet Multinomial Mixture (DMM). Last, the study found that, across different techniques, using the entire transcript, including both sides of the interaction between the library patron and the librarian, yielded higher-quality topic outcomes than using only the initial question asked by the library patron.