scholarly journals Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy

Entropy ◽  
2020 ◽  
Vol 22 (4) ◽  
pp. 394
Author(s):  
Sergei Koltcov ◽  
Vera Ignatenko ◽  
Zeyd Boukhers ◽  
Steffen Staab

Topic modeling is a popular technique for clustering large collections of text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on results of topic modeling. Based on Renyi entropy, this approach is inspired by the concepts from statistical physics, where an inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models—Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, LDA with variational inference (VLDA)—we, first of all, show that the minimum of Renyi entropy coincides with the “true” number of topics, as determined in two labelled collections. Simultaneously, we find that Hierarchical Dirichlet Process (HDP) model as a well-known approach for topic number optimization fails to detect such optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the minimum of entropy from the topic number optimum, which effect is not observed for hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models that need further research.

Entropy ◽  
2019 ◽  
Vol 21 (7) ◽  
pp. 660 ◽  
Author(s):  
Sergei Koltcov ◽  
Vera Ignatenko ◽  
Olessia Koltsova

Topic modeling is a popular approach for clustering text documents. However, current tools have a number of unsolved problems such as instability and a lack of criteria for selecting the values of model parameters. In this work, we propose a method to solve partially the problems of optimizing model parameters, simultaneously accounting for semantic stability. Our method is inspired by the concepts from statistical physics and is based on Sharma–Mittal entropy. We test our approach on two models: probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) with Gibbs sampling, and on two datasets in different languages. We compare our approach against a number of standard metrics, each of which is able to account for just one of the parameters of our interest. We demonstrate that Sharma–Mittal entropy is a convenient tool for selecting both the number of topics and the values of hyper-parameters, simultaneously controlling for semantic stability, which none of the existing metrics can do. Furthermore, we show that concepts from statistical physics can be used to contribute to theory construction for machine learning, a rapidly-developing sphere that currently lacks a consistent theoretical ground.


Author(s):  
R. Derbanosov ◽  
◽  
M. Bakhanova ◽  
◽  

Probabilistic topic modeling is a tool for statistical text analysis that can give us information about the inner structure of a large corpus of documents. The most popular models—Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation—produce topics in a form of discrete distributions over the set of all words of the corpus. They build topics using an iterative algorithm that starts from some random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to random initialization that means producing significantly different solutions from different initial points. Several studies showed that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use BigARTM library in order to perform experiments on several text collections. We show that using side information as an additional modality improves topics stability without significant quality loss of the model.


2021 ◽  
Vol 40 (3) ◽  
Author(s):  
HyunSeung Koh ◽  
Mark Fienup

Library chat services are an increasingly important communication channel to connect patrons to library resources and services. Analysis of chat transcripts could provide librarians with insights into improving services. Unfortunately, chat transcripts consist of unstructured text data, making it impractical for librarians to go beyond simple quantitative analysis (e.g., chat duration, message count, word frequencies) with existing tools. As a stepping-stone toward a more sophisticated chat transcript analysis tool, this study investigated the application of different types of topic modeling techniques to analyze one academic library’s chat reference data collected from April 10, 2015, to May 31, 2019, with the goal of extracting the most accurate and easily interpretable topics. In this study, topic accuracy and interpretability—the quality of topic outcomes—were quantitatively measured with topic coherence metrics. Additionally, qualitative accuracy and interpretability were measured by the librarian author of this paper depending on the subjective judgment on whether topics are aligned with frequently asked questions or easily inferable themes in academic library contexts. This study found that from a human’s qualitative evaluation, Probabilistic Latent Semantic Analysis (pLSA) produced more accurate and interpretable topics, which is not necessarily aligned with the findings of the quantitative evaluation with all three types of topic coherence metrics. Interestingly, the commonly used technique Latent Dirichlet Allocation (LDA) did not necessarily perform better than pLSA. Also, semi-supervised techniques with human-curated anchor words of Correlation Explanation (CorEx) or guided LDA (GuidedLDA) did not necessarily perform better than an unsupervised technique of Dirichlet Multinomial Mixture (DMM). Last, the study found that using the entire transcript, including both sides of the interaction between the library patron and the librarian, performed better than using only the initial question asked by the library patron across different techniques in increasing the quality of topic outcomes.


2021 ◽  
Vol 7 ◽  
pp. e608
Author(s):  
Sergei Koltcov ◽  
Vera Ignatenko ◽  
Maxim Terpilovskii ◽  
Paolo Rosso

Hierarchical topic modeling is a potentially powerful instrument for determining topical structures of text collections that additionally allows constructing a hierarchy representing the levels of topic abstractness. However, parameter optimization in hierarchical models, which includes finding an appropriate number of topics at each level of hierarchy, remains a challenging task. In this paper, we propose an approach based on Renyi entropy as a partial solution to the above problem. First, we introduce a Renyi entropy-based metric of quality for hierarchical models. Second, we propose a practical approach to obtaining the “correct” number of topics in hierarchical topic models and show how model hyperparameters should be tuned for that purpose. We test this approach on the datasets with the known number of topics, as determined by the human mark-up, three of these datasets being in the English language and one in Russian. In the numerical experiments, we consider three different hierarchical models: hierarchical latent Dirichlet allocation model (hLDA), hierarchical Pachinko allocation model (hPAM), and hierarchical additive regularization of topic models (hARTM). We demonstrate that the hLDA model possesses a significant level of instability and, moreover, the derived numbers of topics are far from the true numbers for the labeled datasets. For the hPAM model, the Renyi entropy approach allows determining only one level of the data structure. For hARTM model, the proposed approach allows us to estimate the number of topics for two levels of hierarchy.


Author(s):  
Samuel Kim ◽  
Panayiotis Georgiou ◽  
Shrikanth Narayanan

We propose the notion of latent acoustic topics to capture contextual information embedded within a collection of audio signals. The central idea is to learn a probability distribution over a set of latent topics of a given audio clip in an unsupervised manner, assuming that there exist latent acoustic topics and each audio clip can be described in terms of those latent acoustic topics. In this regard, we use the latent Dirichlet allocation (LDA) to implement the acoustic topic models over elemental acoustic units, referred as acoustic words, and perform text-like audio signal processing. Experiments on audio tag classification with the BBC sound effects library demonstrate the usefulness of the proposed latent audio context modeling schemes. In particular, the proposed method is shown to be superior to other latent structure analysis methods, such as latent semantic analysis and probabilistic latent semantic analysis. We also demonstrate that topic models can be used as complementary features to content-based features and offer about 9% relative improvement in audio classification when combined with the traditional Gaussian mixture model (GMM)–Support Vector Machine (SVM) technique.


2020 ◽  
Vol 10 (3) ◽  
pp. 834
Author(s):  
Erdenebileg Batbaatar ◽  
Van-Huy Pham ◽  
Keun Ho Ryu

The hallmarks of cancer represent an essential concept for discovering novel knowledge about cancer and for extracting the complexity of cancer. Due to the lack of topic analysis frameworks optimized specifically for cancer data, the studies on topic modeling in cancer research still have a strong challenge. Recently, deep learning (DL) based approaches were successfully employed to learn semantic and contextual information from scientific documents using word embeddings according to the hallmarks of cancer (HoC). However, those are only applicable to labeled data. There is a comparatively small number of documents that are labeled by experts. In the real world, there is a massive number of unlabeled documents that are available online. In this paper, we present a multi-task topic analysis (MTTA) framework to analyze cancer hallmark-specific topics from documents. The MTTA framework consists of three main subtasks: (1) cancer hallmark learning (CHL)—used to learn cancer hallmarks on existing labeled documents; (2) weak label propagation (WLP)—used to classify a large number of unlabeled documents with the pre-trained model in the CHL task; and (3) topic modeling (ToM)—used to discover topics for each hallmark category. In the CHL task, we employed a convolutional neural network (CNN) with pre-trained word embedding that represents semantic meanings obtained from an unlabeled large corpus. In the ToM task, we employed a latent topic model such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) model to catch the semantic information learned by the CNN model for topic analysis. To evaluate the MTTA framework, we collected a large number of documents related to lung cancer in a case study. We also conducted a comprehensive performance evaluation for the MTTA framework, comparing it with several approaches.


2021 ◽  
pp. 1-17
Author(s):  
Zeinab Shahbazi ◽  
Yung-Cheol Byun

Understanding the real-world short texts become an essential task in the recent research area. The document deduction analysis and latent coherent topic named as the important aspect of this process. Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA) are suggested to model huge information and documents. This type of contexts’ main problem is the information limitation, words relationship, sparsity, and knowledge extraction. The knowledge discovery and machine learning techniques integrated with topic modeling were proposed to overcome this issue. The knowledge discovery was applied based on the hidden information extraction to increase the suitable dataset for further analysis. The integration of machine learning techniques, Artificial Neural Network (ANN) and Long Short-Term (LSTM) are applied to anticipate topic movements. LSTM layers are fed with latent topic distribution learned from the pre-trained Latent Dirichlet Allocation (LDA) model. We demonstrate general information from different techniques applied in short text topic modeling. We proposed three categories based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation using representative design and analysis of all categories’ performance in different tasks. Finally, the proposed system evaluates with state-of-art methods on real-world datasets, comprises them with long document topic modeling algorithms, and creates a classification framework that considers further knowledge and represents it in the machine learning pipeline.


2019 ◽  
Vol 52 (9-10) ◽  
pp. 1289-1298 ◽  
Author(s):  
Lei Shi ◽  
Gang Cheng ◽  
Shang-ru Xie ◽  
Gang Xie

The aim of topic detection is to automatically identify the events and hot topics in social networks and continuously track known topics. Applying the traditional methods such as Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis is difficult given the high dimensionality of massive event texts and the short-text sparsity problems of social networks. The problem also exists of unclear topics caused by the sparse distribution of topics. To solve the above challenge, we propose a novel word embedding topic model by combining the topic model and the continuous bag-of-words mode (Cbow) method in word embedding method, named Cbow Topic Model (CTM), for topic detection and summary in social networks. We conduct similar word clustering of the target social network text dataset by introducing the classic Cbow word vectorization method, which can effectively learn the internal relationship between words and reduce the dimensionality of the input texts. We employ the topic model-to-model short text for effectively weakening the sparsity problem of social network texts. To detect and summarize the topic, we propose a topic detection method by leveraging similarity computing for social networks. We collected a Sina microblog dataset to conduct various experiments. The experimental results demonstrate that the CTM method is superior to the existing topic model method.


Author(s):  
М.А. Нокель ◽  
Н.В. Лукашевич

Представлены результаты экспериментов по добавлению биграмм в тематические модели и учету сходства между ними и униграммами. Предложен новый алгоритм PLSA-SIM, являющийся модификацией алгоритма построения тематических моделей PLSA (Probabilistic Latent Semantic Analysis). Предложенный алгоритм позволяет добавлять биграммы и учитывать сходство между ними и униграммными компонентами. Исследована возможность применения ассоциативных мер для выбора и последующего включения биграмм в тематические модели. В качестве текстовых коллекций взяты русскоязычная подборка статей из электронных банковских журналов, английские части корпусов параллельных текстов Europarl и JRC-Acquiz и англоязычный архив исследовательских работ по компьютерной лингвистике ACL Anthology. Выполненные эксперименты показывают, что существует подгруппа тестируемых мер, упорядочивающих биграммы таким образом, что при последующем их добавлении в предложенный алгоритм PLSA-SIM качество получающихся тематических моделей значительно повышается. Предложен новый итеративный алгоритм PLSA-ITER без учителя, позволяющий добавлять наиболее подходящие биграммы. Эксперименты показывают дальнейшее улучшение качества тематических моделей по сравнению с исходным алгоритмом PLSA. The results of experimental study of adding bigrams and taking account of the similarity between them and unigrams are discussed. A novel PLSA-SIM algorithm based on a modification of the original PLSA (Probabilistic Latent Semantic Analysis) algorithm is proposed. The proposed algorithm incorporates bigrams and takes into account the similarity between them and unigram components. Various word association measures are analyzed to integrate top-ranked bigrams into topic models. As target text collections, articles from various Russian electronic banking magazines, English parts of parallel corpora Europarl and JRC-Acquiz, and the English digital archive of research papers in computational linguistics (ACL Anthology) are chosen. The computational experiments show that there exists a subgroup of tested measures that produce top-ranked bigrams in such a way that their inclusion into the PLSA-SIM algorithm significantly improves the quality of topic models for all collections. A novel unsupervised iterative algorithm named PLSA-ITER is also proposed for adding the most relevant bigrams. The computational experiments show a further improvement in the quality of topic models compared to the PLSA algorithm.


2020 ◽  
Vol 75 (6) ◽  
pp. 559-569
Author(s):  
T. N. Bakiev ◽  
D. V. Nakashidze ◽  
A. M. Savchenko

Sign in / Sign up

Export Citation Format

Share Document