An evolutionary analysis of new energy and industry policy tools in China based on large-scale policy topic modeling

PLoS ONE ◽  
2021 ◽  
Vol 16 (5) ◽  
pp. e0252502
Author(s):  
Qiqing Wang ◽  
Cunbin Li

This study investigates the evolution of provincial new energy policies and industries in China using a topic modeling approach. To this end, six of China's 31 provinces are first selected as research samples, and central and provincial new energy policies from 2010 to 2019 are collected to establish a text corpus of 23,674 documents. The policy corpus is then fed to two different topic models: Latent Dirichlet Allocation for modeling static policy topics and the Dynamic Topic Model for extracting topics over time. Finally, the obtained topics are mapped onto policy tools for comparison. The dynamic policy topics are further analyzed with panel data from provincial new energy industries. The results show that the provincial new energy policies moved onto different tracks after about 2014 due to regional conditions such as the economy and CO2 emission intensity. Underdeveloped provinces tend to use environment-oriented tools to regulate and control CO2 emissions, while developed regions employ a more balanced policy mix to improve new energy vehicles and other industries. Widespread hysteretic effects are revealed in the correlation analysis of the policy topics and new energy capacity.
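
The two-stage setup described here, a static LDA model plus a Dynamic Topic Model over yearly slices, can be sketched in Python; the snippet below assumes the gensim library, and the toy documents and time slices are placeholders rather than the authors' corpus or code.

```python
# A minimal sketch of the static + dynamic topic modeling setup (assumed gensim API;
# the policy documents, vocabulary, and yearly time slices are placeholders).
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.ldaseqmodel import LdaSeqModel

docs = [["new", "energy", "subsidy", "vehicle"],       # tokenized policy documents
        ["carbon", "emission", "target", "province"]]  # (placeholder examples)
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Static topics with LDA
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=5)

# Topics over time with a Dynamic Topic Model; time_slice gives the number of
# documents per year (2010-2019 in the study), here just a placeholder split.
dtm = LdaSeqModel(corpus=corpus, id2word=dictionary, num_topics=2,
                  time_slice=[1, 1])
```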

2016 ◽  
Author(s):  
Timothy N. Rubin ◽  
Oluwasanmi Koyejo ◽  
Krzysztof J. Gorgolewski ◽  
Michael N. Jones ◽  
Russell A. Poldrack ◽  
...  

A central goal of cognitive neuroscience is to decode human brain activity, that is, to infer mental processes from observed patterns of whole-brain activation. Previous decoding efforts have focused on classifying brain activity into a small set of discrete cognitive states. To attain maximal utility, a decoding framework must be open-ended, systematic, and context-sensitive: capable of interpreting numerous brain states, presented in arbitrary combinations, in light of prior information. Here we take steps towards this objective by introducing a Bayesian decoding framework based on a novel topic model, Generalized Correspondence Latent Dirichlet Allocation, that learns latent topics from a database of over 11,000 published fMRI studies. The model produces highly interpretable, spatially circumscribed topics that enable flexible decoding of whole-brain images. Importantly, the Bayesian nature of the model allows one to “seed” decoder priors with arbitrary images and text, enabling researchers, for the first time, to generate quantitative, context-sensitive interpretations of whole-brain activity patterns.
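
The ability to "seed" decoder priors can be illustrated with a small naive Bayes-style sketch: per-topic spatial distributions combine with a topic prior (biased by seed text or images) via Bayes' rule to give a posterior over topics for an observed activation map. This is a conceptual sketch with placeholder arrays, not the GC-LDA model or its inference procedure.

```python
# Conceptual sketch of Bayesian decoding with a seeded topic prior
# (placeholder arrays; not the GC-LDA implementation).
import numpy as np

K, V = 5, 1000                          # topics, voxels (toy sizes)
rng = np.random.default_rng(0)
p_voxel_given_topic = rng.dirichlet(np.ones(V), size=K)   # K x V spatial topics

prior = np.ones(K) / K                  # uniform prior ...
prior[2] *= 5.0                         # ... "seeded" towards topic 2 by prior text/images
prior /= prior.sum()

active_voxels = rng.choice(V, size=50, replace=False)      # observed activation map

log_post = np.log(prior) + np.log(p_voxel_given_topic[:, active_voxels]).sum(axis=1)
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()            # p(topic | activation, seeded prior)
```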


Information ◽  
2020 ◽  
Vol 11 (8) ◽  
pp. 376 ◽  
Author(s):  
Cornelia Ferner ◽  
Clemens Havas ◽  
Elisabeth Birnbacher ◽  
Stefan Wegenkittl ◽  
Bernd Resch

In the event of a natural disaster, geo-tagged Tweets are an immediate source of information for locating casualties and damages and for supporting disaster management. Topic modeling can help detect disaster-related Tweets in the noisy Twitter stream in an unsupervised manner. However, the results of topic models are difficult to interpret and require manual identification of one or more “disaster topics”. Immediate disaster response would benefit from a fully automated process for interpreting the modeled topics and extracting disaster-relevant information. Initializing the topic model with a set of seed words already allows the corresponding disaster topic to be identified directly. In order to enable an automated end-to-end process, we automatically generate seed words using older Tweets from the same geographic area. The results for two past events (the Napa Valley earthquake in 2014 and hurricane Harvey in 2017) show that the geospatial distribution of Tweets identified as disaster-related conforms with the officially released disaster footprints. The suggested approach is applicable when there is a single topic of interest and comparative data are available.
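
One common way to implement such seeding, assumed here for illustration, is to bias the topic-word prior (eta) of a single topic towards the seed words; gensim's LdaModel accepts a full eta matrix for this. The seed words and corpus below are placeholders, and this is not necessarily the authors' exact implementation.

```python
# Sketch of seeding one LDA topic with disaster-related seed words by biasing
# the topic-word prior eta (assumed gensim API; seed words and corpus are placeholders).
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["earthquake", "damage", "help"], ["coffee", "morning", "traffic"]]
seed_words = ["earthquake", "damage", "flood"]          # auto-generated in the paper

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

num_topics = 2
eta = np.full((num_topics, len(dictionary)), 0.01)      # weak symmetric prior
for w in seed_words:
    if w in dictionary.token2id:
        eta[0, dictionary.token2id[w]] = 1.0            # boost seed words in topic 0

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, eta=eta)
# Topic 0 can now be read off directly as the "disaster topic".
```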


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Hui Xiong ◽  
Kaiqiang Xie ◽  
Lu Ma ◽  
Feng Yuan ◽  
Rui Shen

Understanding human mobility patterns is of great importance for a wide range of applications, from social networks to transportation planning. Toward this end, the spatial-temporal information of a large-scale dataset of taxi trips was collected via GPS from March 10 to 23, 2014, in Beijing. The data contain trips generated by a large portion of taxi vehicles citywide. We reveal that the geographic displacement of those trips follows a power law distribution and that the corresponding travel time follows a mixture of exponential and power law distributions. To identify human mobility patterns, a topic model with the latent Dirichlet allocation (LDA) algorithm was proposed to infer sixty-five key topics. By measuring the variation of trip displacement over time, we find that travel distances in the morning rush hour are much shorter than at other times. As for daily patterns, taxi mobility presents weekly regularity on both weekdays and weekends. Among different days of the same week, mobility patterns on Tuesday and Wednesday are quite similar. By quantifying trip distance over time, we find that Topic 44 exhibits dominant patterns, meaning that trips shorter than 10 km predominate at all times of day. The findings could serve as references for travelers arranging trips and for policymakers formulating sound traffic management policies.
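
The power law fit to trip displacements can be sketched with the standard maximum-likelihood estimator for a continuous power law, alpha = 1 + n / sum(ln(x_i / x_min)); the displacements and cutoff below are synthetic placeholders, not the Beijing taxi data.

```python
# Sketch: maximum-likelihood fit of a power-law exponent to trip displacements
# (Clauset-style estimator for a continuous power law; data are synthetic placeholders).
import numpy as np

rng = np.random.default_rng(1)
displacements_km = (rng.pareto(a=1.5, size=10_000) + 1.0) * 0.5   # synthetic trips

x_min = 1.0                                     # assumed lower cutoff of the tail
tail = displacements_km[displacements_km >= x_min]
alpha_hat = 1.0 + len(tail) / np.sum(np.log(tail / x_min))
print(f"estimated power-law exponent: {alpha_hat:.2f}")
```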


2021 ◽  
Author(s):  
Yuri Ahuja ◽  
Yuesong Zou ◽  
Aman Verma ◽  
David Buckeridge ◽  
Yue Li

Electronic Health Records (EHRs) contain rich clinical data collected at the point of care, and their increasing adoption offers exciting opportunities for clinical informatics, disease risk prediction, and personalized treatment recommendation. However, effective use of EHR data for research and clinical decision support is often hampered by a lack of reliable disease labels. To compile gold-standard labels, researchers often rely on clinical experts to develop rule-based phenotyping algorithms from billing codes and other surrogate features. This process is tedious and error-prone due to recall and observer biases in how codes and measures are selected, and some phenotypes are incompletely captured by a handful of surrogate features. To address this challenge, we present a novel automatic phenotyping model called MixEHR-Guided (MixEHR-G), a multimodal hierarchical Bayesian topic model that efficiently models the EHR generative process by identifying latent phenotype structure in the data. Unlike existing topic modeling algorithms, in which the inferred topics are not identifiable, MixEHR-G uses prior information from informative surrogate features to align topics with known phenotypes. We applied MixEHR-G to an openly available EHR dataset of 38,597 intensive care patients (MIMIC-III) in Boston, USA, and to administrative claims data for a population-based cohort (PopHR) of 1.3 million people in Quebec, Canada. Qualitatively, we demonstrate that MixEHR-G learns interpretable phenotypes and yields meaningful insights about phenotype similarities, comorbidities, and epidemiological associations. Quantitatively, MixEHR-G outperforms existing unsupervised phenotyping methods on a phenotype label annotation task, and it can accurately estimate relative phenotype prevalence functions without gold-standard phenotype information. Altogether, MixEHR-G is an important step towards building an interpretable and automated phenotyping system using EHR data.
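
A toy illustration of the guiding idea, letting surrogate features such as billing codes give known phenotypes extra prior weight in a record's topic mixture, is sketched below; the actual MixEHR-G is a multimodal hierarchical Bayesian model with different inference, and all phenotype names and codes here are hypothetical.

```python
# Toy illustration of guiding topics with surrogate features: phenotypes whose
# surrogate billing codes appear in a record get a larger prior weight in that
# record's topic mixture. This is NOT the MixEHR-G inference procedure; all
# phenotype names and codes are hypothetical placeholders.
import numpy as np

phenotypes = ["diabetes", "asthma", "depression"]
surrogate_codes = {"diabetes": {"E11"}, "asthma": {"J45"}, "depression": {"F32"}}

def guided_topic_prior(record_codes, base_alpha=0.1, boost=1.0):
    """Dirichlet prior over phenotype topics, boosted where surrogates are observed."""
    alpha = np.full(len(phenotypes), base_alpha)
    for k, name in enumerate(phenotypes):
        if surrogate_codes[name] & record_codes:
            alpha[k] += boost
    return alpha

print(guided_topic_prior({"E11", "I10"}))   # diabetes topic gets extra prior mass
```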


2020 ◽  
Vol 8 ◽  
pp. 439-453 ◽  
Author(s):  
Adji B. Dieng ◽  
Francisco J. R. Ruiz ◽  
David M. Blei

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To address this, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
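
The core of this construction, each topic's word distribution as a softmax over inner products between word embeddings and the topic embedding, fits in a few lines; the embedding matrices below are random placeholders rather than trained parameters.

```python
# Sketch of the ETM word distribution: topic k's distribution over the vocabulary
# is softmax(alpha_k @ rho.T), the inner products of the topic embedding with the
# word embeddings (random placeholder matrices, not trained parameters).
import numpy as np

V, K, D = 5000, 20, 300                      # vocabulary, topics, embedding dim
rng = np.random.default_rng(0)
rho = rng.normal(size=(V, D))                # word embeddings
alpha = rng.normal(size=(K, D))              # topic embeddings

logits = alpha @ rho.T                       # K x V natural parameters
beta = np.exp(logits - logits.max(axis=1, keepdims=True))
beta /= beta.sum(axis=1, keepdims=True)      # each row is a topic's word distribution
```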


2020 ◽  
Author(s):  
Kai Zhang ◽  
Yuan Zhou ◽  
Zheng Chen ◽  
Yufei Liu ◽  
Zhuo Tang ◽  
...  

The prevalence of short texts on the Web has made mining the latent topic structures of short texts a critical and fundamental task for many applications. However, due to the lack of word co-occurrence information induced by the content sparsity of short texts, it is challenging for traditional topic models like latent Dirichlet allocation (LDA) to extract coherent topic structures from short texts. Incorporating external semantic knowledge into the topic modeling process is an effective strategy for improving the coherence of inferred topics. In this paper, we develop a novel topic model, the biterm correlation knowledge-based topic model (BCK-TM), to infer latent topics from short texts. Specifically, the proposed model mines biterm correlation knowledge automatically based on recent progress in word embedding, which can represent the semantic information of words in a continuous vector space. To incorporate external knowledge, a knowledge incorporation mechanism is designed over the latent topic layer to regularize the topic assignment of each biterm during the topic sampling process. Experimental results on three public benchmark datasets illustrate the superior performance of the proposed approach over several state-of-the-art baseline models.
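
Mining biterm correlation knowledge from word embeddings can be sketched as keeping word pairs whose embedding cosine similarity exceeds a threshold; the vectors and cutoff below are placeholders, and the knowledge-incorporation mechanism used during sampling is not reproduced.

```python
# Sketch: mining biterm correlation "knowledge" as word pairs whose embedding
# cosine similarity exceeds a threshold (placeholder vectors and threshold;
# the BCK-TM knowledge-incorporation step itself is not shown).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
vocab = ["price", "cost", "market", "banana"]
emb = {w: rng.normal(size=50) for w in vocab}    # stand-in for trained embeddings

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

threshold = 0.4                                   # assumed correlation cutoff
correlated_biterms = {(w1, w2) for w1, w2 in combinations(vocab, 2)
                      if cosine(emb[w1], emb[w2]) > threshold}
```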


Author(s):  
Carlo Schwarz

In this article, I introduce the ldagibbs command, which implements latent Dirichlet allocation in Stata. Latent Dirichlet allocation is the most popular machine-learning topic model. Topic models automatically cluster text documents into a user-chosen number of topics. Latent Dirichlet allocation represents each document as a probability distribution over topics and represents each topic as a probability distribution over words. Therefore, latent Dirichlet allocation provides a way to analyze the content of large unclassified text data and an alternative to predefined document classifications.
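
For readers outside Stata, the two representations the command estimates, a document-topic distribution and a topic-word distribution, can be reproduced in Python with scikit-learn; this is an illustrative analogue, not the ldagibbs command itself.

```python
# Python analogue (scikit-learn) of what LDA estimates: a document-topic matrix and
# a topic-word matrix, each row a probability distribution. Not the ldagibbs command.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the central bank raised interest rates",
        "the team won the championship game"]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                            # rows sum to 1
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```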


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Lirong Qiu ◽  
Jia Yu

Against the background of big data, how to effectively mine useful information is a pressing problem. The purpose of this study is to construct a more effective method for mining the interest preferences of users in a particular field in today's big data context. We mainly study a large volume of user text data from microblogs. LDA is an effective text mining method, but applying it directly to large numbers of short microblog texts does not work well. For effective topic modeling, short texts need to be aggregated into long texts to avoid data sparsity. However, aggregated short texts are mixed with a lot of noise, reducing the accuracy of mining users' interest preferences. In this paper, we propose Combining Latent Dirichlet Allocation (CLDA), a new topic model that can learn the latent topics of microblog short texts and long texts simultaneously. The data sparsity of short texts is avoided by using aggregated long texts to assist in learning the short texts, and short texts are in turn used to filter the long texts to improve mining accuracy, so that long and short texts are effectively combined. Experimental results on a real microblog data set show that CLDA outperforms many advanced models in mining user interests, and we also confirm that CLDA performs well in recommender systems.
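
The aggregation step, pooling a user's short posts into one long pseudo-document before training a topic model, can be sketched as below; pooling by user is an assumption for illustration, the posts are toy examples, and the full CLDA model with its filtering step is not reproduced.

```python
# Sketch of the aggregation strategy: pool each user's short microblog posts into
# one long pseudo-document before training LDA (toy posts; pooling by user is an
# assumption, and the full CLDA model is not reproduced here).
from collections import defaultdict
from gensim.corpora import Dictionary
from gensim.models import LdaModel

posts = [("user_a", "new phone camera review"),
         ("user_a", "best camera lens deals"),
         ("user_b", "marathon training plan tips")]

pooled = defaultdict(list)
for user, text in posts:
    pooled[user].extend(text.split())                 # long pseudo-document per user

long_docs = list(pooled.values())
dictionary = Dictionary(long_docs)
corpus = [dictionary.doc2bow(d) for d in long_docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=5)

# Short posts can then be folded in against the learned topics:
short_bow = dictionary.doc2bow("camera lens".split())
print(lda.get_document_topics(short_bow))
```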


Author(s):  
R. Derbanosov ◽  
M. Bakhanova

Probabilistic topic modeling is a tool for statistical text analysis that can give us information about the inner structure of a large corpus of documents. The most popular models, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation, produce topics in the form of discrete distributions over the set of all words of the corpus. They build topics using an iterative algorithm that starts from a random initialization and optimizes a loss function. One of the main problems of topic modeling is sensitivity to random initialization, meaning that different initial points can produce significantly different solutions. Several studies have shown that side information about documents may improve the overall quality of a topic model. In this paper, we consider the use of additional information in the context of the stability problem. We represent auxiliary information as an additional modality and use the BigARTM library to perform experiments on several text collections. We show that using side information as an additional modality improves topic stability without significant quality loss of the model.
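
One simple way to quantify the stability problem is to train the same model from two random initializations and compare matched topics by the overlap of their top words; the sketch below uses scikit-learn LDA with greedy Jaccard matching on toy documents and does not reproduce the BigARTM multimodal setup.

```python
# Sketch: measuring topic stability across random initializations by matching topics
# on top-word Jaccard overlap (scikit-learn LDA; the BigARTM multimodal setup used
# in the paper is not reproduced here).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stocks rise on strong earnings", "team wins final match",
        "bank cuts interest rates", "player scores winning goal"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
vocab = np.array(vec.get_feature_names_out())

def top_words(model, k=5):
    return [set(vocab[np.argsort(row)[-k:]]) for row in model.components_]

runs = [LatentDirichletAllocation(n_components=2, random_state=seed).fit(X)
        for seed in (0, 1)]
t0, t1 = top_words(runs[0]), top_words(runs[1])

# Greedy matching: for each topic of run 0, take its best Jaccard match in run 1.
stability = np.mean([max(len(a & b) / len(a | b) for b in t1) for a in t0])
print(f"mean top-word Jaccard stability: {stability:.2f}")
```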


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Meisam Dastani ◽  
Farshid Danesh

COVID-19 is a threat to the lives of people all over the world. Because of the new and unknown nature of COVID-19, much research has been conducted recently. Given the increasing growth rate of Iranian publications on COVID-19, this article aims to analyze these publications in LitCovid to identify their topical and content structure through topic modeling. The present article is applied research using an analytical approach together with text mining techniques. The statistical population is all the publications of Iranian researchers in LitCovid. Latent Dirichlet Allocation (LDA) and Python were used to analyze the data and implement the text mining and topic modeling algorithms. Data analysis shows that the percentage of Iranian publications in the eight topical groups in LitCovid is as follows: prevention (39.57%), treatment (18.99%), diagnosis (18.99%), forecasting (7.83%), case report (6.52%), mechanism (3.91%), transmission (3.62%), and general (0.58%). The results indicate that patient, pandemic, outbreak, case, Iranian, model, care, health, coronavirus, and disease are the most important words in the publications of Iranian researchers in LitCovid. Implementing the topic modeling algorithm yielded six topics for prevention; four topics for treatment, case report, and forecasting; and three topics for diagnosis, mechanism, transmission, and general. Most of the Iranian publications in LitCovid are related to the topic “pandemic status,” with 22.47% in the prevention category, and the lowest number of publications is related to the topic “environment,” with 11.11% in the transmission category. The present study provides a better understanding of the essential and strategic issues in Iranian publications in LitCovid. The results reveal that many Iranian studies on COVID-19 primarily addressed issues related to prevention, management, and control. These findings provide a structured and research-based viewpoint of COVID-19 research in Iran to guide researchers and policymakers.
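
The LDA-in-Python workflow described, vectorizing the publications, fitting a topic model, and reading off top words and topic shares, can be sketched with scikit-learn; the abstracts, topic count, and category shares below are placeholders, not the study's data.

```python
# Sketch of the LDA-in-Python workflow described above: vectorize abstracts, fit LDA,
# list top words per topic, and report each topic's share of documents
# (toy abstracts; the number of topics and texts are placeholders).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = ["patient outcomes during the pandemic outbreak",
             "diagnosis of coronavirus disease with chest imaging",
             "forecasting case numbers with a transmission model"]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)
vocab = np.array(vec.get_feature_names_out())

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(X)

for k, row in enumerate(lda.components_):
    print(f"topic {k}:", ", ".join(vocab[np.argsort(row)[-5:][::-1]]))

shares = np.bincount(doc_topic.argmax(axis=1), minlength=3) / len(abstracts)
print("share of documents per topic:", np.round(shares * 100, 1), "%")
```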

