Improving Topic Models with Latent Feature Word Representations

Author(s):  
Dat Quoc Nguyen ◽  
Richard Billingsley ◽  
Lan Du ◽  
Mark Johnson

Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
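The core idea can be illustrated with a minimal sketch: each topic's word distribution is a mixture of a standard Dirichlet-multinomial component and a latent-feature component, where the latter is a softmax over dot products between a per-topic vector and pre-trained word vectors. All dimensions, names, and the mixture weight below are illustrative stand-ins; the paper's actual models learn the topic vectors jointly with the topic assignments rather than drawing them at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny setup: 4 topics, 6-word vocabulary, 5-dimensional word vectors.
num_topics, vocab_size, dim = 4, 6, 5
word_vectors = rng.normal(size=(vocab_size, dim))   # stand-in for pre-trained embeddings
topic_vectors = rng.normal(size=(num_topics, dim))  # per-topic latent feature vectors
phi_dirichlet = rng.dirichlet(np.ones(vocab_size), size=num_topics)  # Dirichlet-multinomial part

def latent_feature_component(topic_vectors, word_vectors):
    """Softmax over topic-vector / word-vector dot products gives each topic
    a distribution over the vocabulary informed by the embeddings."""
    scores = topic_vectors @ word_vectors.T
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

lam = 0.6  # illustrative mixture weight between the two components
phi = lam * latent_feature_component(topic_vectors, word_vectors) + (1 - lam) * phi_dirichlet

assert np.allclose(phi.sum(axis=1), 1.0)  # each topic is still a valid word distribution
```

Because the latent-feature component is estimated from embeddings trained on very large external corpora, it can smooth the topic-word distributions learnt on a small or short-document corpus.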

2015 ◽  
Author(s):  
Benjamin Adams

Probabilistic topic models are a class of unsupervised machine learning models used for understanding the latent topics in a corpus of documents. A new method for combining geographic feature data with text from geo-referenced documents to create topic models that are grounded in the physical environment is proposed. The Geographic Feature Type Topic Model (GFTTM) models each document in a corpus as a mixture of feature type topics and abstract topics. Feature type topics are conditioned on additional observation data, namely the relative densities of geographic feature types co-located with the document's location referent, whereas abstract topics are trained independently of that information. The GFTTM is evaluated using geo-referenced Wikipedia articles and feature type data from volunteered geographic information sources. A technique for measuring the semantic similarity of feature types and places, based on the mixtures of topics associated with them, is also presented. The results of the evaluation demonstrate that GFTTM finds two distinct kinds of topics that can be used to disentangle how places are described in terms of their physical features versus more abstract topics such as history and culture.
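A rough sketch of the two-part document mixture, under strong simplifying assumptions: the feature-type topics inherit their weight directly from the observed feature-type densities at the document's location, the abstract topics get the remaining mass, and a cosine similarity over topic mixtures stands in for the paper's similarity measurement. All feature types, topic counts, and weights here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: 3 feature-type topics (say lake, forest, road) and
# 2 abstract topics (say history, culture). All numbers are made up.
feature_densities = np.array([0.5, 0.3, 0.2])  # observed densities near the document's location
abstract_prior = rng.dirichlet(np.ones(2))     # abstract topics, independent of geography

switch = 0.7  # share of this document's topic mass given to feature-type topics
doc_topic_mixture = np.concatenate([switch * feature_densities,
                                    (1 - switch) * abstract_prior])

def topic_similarity(p, q):
    """Cosine similarity between two topic mixtures, a simple stand-in for
    comparing feature types and places by their associated topics."""
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
```

The split into density-conditioned and independent components is what lets the model separate descriptions of a place's physical environment from its history and culture.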


2016 ◽  
Vol 6 (1) ◽  
Author(s):  
Mirwaes Wahabzada ◽  
Anne-Katrin Mahlein ◽  
Christian Bauckhage ◽  
Ulrike Steiner ◽  
Erich-Christian Oerke ◽  
...  

2021 ◽  
Vol 20 ◽  
pp. 199-206
Author(s):  
Seda Postalcioglu

This study focuses on the classification of EEG signals. It aims at a classifier with fast response and a high performance rate, making it suitable for real-time control applications such as Brain-Computer Interface (BCI) systems. The feature vector is created by the wavelet transform and statistical calculations, and is trained and tested with a neural network. The db4 wavelet is used in the study. Pwelch, skewness, kurtosis, band power, median, standard deviation, min, max, energy, and entropy are computed to make the wavelet coefficients meaningful. A performance of 99.414% is achieved with a running time of 0.0209 seconds.
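The feature-extraction step described above can be sketched as follows: each band of wavelet coefficients is summarized by a handful of statistics, and the per-band statistics are concatenated into one feature vector for the neural network. The band data here is random stand-in data (a real pipeline would obtain db4 sub-bands via a wavelet decomposition, e.g. PyWavelets' `wavedec`, and add the Pwelch spectral estimate), and the statistics use simple population formulas.

```python
import numpy as np

rng = np.random.default_rng(2)

def statistical_features(coeffs):
    """Summary statistics turning one band of wavelet coefficients into a
    short feature list: skewness, kurtosis, band power, median, standard
    deviation, min, max, energy, and entropy."""
    c = np.asarray(coeffs, dtype=float)
    mean, std = c.mean(), c.std()
    skewness = ((c - mean) ** 3).mean() / std ** 3
    kurtosis = ((c - mean) ** 4).mean() / std ** 4
    band_power = np.mean(c ** 2)
    energy = np.sum(c ** 2)
    p = c ** 2 / energy                        # normalized squared coefficients
    entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy of that distribution
    return [skewness, kurtosis, band_power, np.median(c), std,
            c.min(), c.max(), energy, entropy]

# Stand-in for four db4 sub-bands of one EEG segment
bands = [rng.normal(size=64) for _ in range(4)]
feature_vector = np.concatenate([statistical_features(b) for b in bands])
```

The resulting fixed-length vector is what gets fed to the neural network classifier.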


Author(s):  
Murugan Anandarajan ◽  
Chelsey Hill ◽  
Thomas Nolan

2020 ◽  
Vol 34 (04) ◽  
pp. 5444-5453
Author(s):  
Edward Raff ◽  
Charles Nicholas ◽  
Mark McLean

Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and introduce our new Burrows Wheeler Markov Distance (BWMD) as an alternative. The BWMD avoids the shortcomings of earlier efforts, and allows us to tackle problems in variable length DNA sequence clustering. BWMD is also more adaptable to other domains, which we demonstrate on malware classification tasks. Unlike other compression-based distance metrics known to us, BWMD works by embedding sequences into a fixed-length feature vector. This allows us to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods.
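A toy sketch of the overall shape of such a method, not the authors' actual BWMD definition: apply the Burrows-Wheeler Transform to a sequence, summarize the transformed string by its normalized character-transition (bigram) frequencies, and compare sequences by the Euclidean distance between those fixed-length vectors. The naive rotation-sort BWT below is O(n² log n) and only meant to be readable; the `alphabet`, embedding, and distance choices are illustrative assumptions.

```python
import numpy as np

def bwt(seq, end="$"):
    """Naive Burrows-Wheeler Transform: sort all rotations of seq plus a
    sentinel and take the last column of the sorted rotation matrix."""
    s = seq + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def bwmd_embedding(seq, alphabet="ACGT$"):
    """Fixed-length embedding in the spirit of BWMD: normalized character
    transition (bigram) frequencies of the BWT-transformed string."""
    t = bwt(seq)
    idx = {c: i for i, c in enumerate(alphabet)}
    counts = np.zeros((len(alphabet), len(alphabet)))
    for a, b in zip(t, t[1:]):
        counts[idx[a], idx[b]] += 1.0
    return (counts / counts.sum()).ravel()

def bwmd(x, y):
    """Distance between two sequences via their fixed-length embeddings."""
    return float(np.linalg.norm(bwmd_embedding(x) - bwmd_embedding(y)))
```

Because every sequence maps to a vector of the same length regardless of its own length, standard vector-space clustering algorithms apply directly, which is what enables clustering of variable-length DNA sequences and large malware corpora.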

