A Tutorial on Probabilistic Topic Models for Text Data Retrieval and Analysis

Text data clustering organizes a set of text documents into a desired number of coherent and meaningful sub-clusters. Modeling the documents in terms of their derived topics is a vital task in text data clustering. Each tweet is treated as a text document, and various topic models are used to model tweets. Existing topic models initially assess the clustering tendency of tweets using Euclidean dissimilarity features. The cosine metric gives a more informative assessment, especially for text clustering. This paper therefore develops a novel cosine-based internal and external validity assessment of cluster tendency to improve the computational efficiency of tweet data clustering. In the experiments, the tweet clustering results are evaluated using cluster validity indices. The results indicate that the cosine-based internal and external validity metrics outperform the other metrics on benchmark and Twitter-based datasets.
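The preference for cosine over Euclidean dissimilarity on text can be illustrated with a minimal sketch. The term-count vectors and the three-word vocabulary below are hypothetical and are not taken from the paper; the point is only that two documents with the same word proportions but different lengths look far apart under Euclidean distance and identical under cosine dissimilarity.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two term-count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_dissimilarity(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical term counts over the vocabulary ["topic", "model", "bike"].
short_tweet = [1, 1, 0]
long_tweet = [4, 4, 0]    # same word proportions as short_tweet, four times longer
other_tweet = [0, 1, 1]   # same length as short_tweet, different content

# Euclidean distance is dominated by document length.
d_euc_same = euclidean_distance(short_tweet, long_tweet)
d_euc_diff = euclidean_distance(short_tweet, other_tweet)

# Cosine dissimilarity depends only on the direction of the vectors.
d_cos_same = cosine_dissimilarity(short_tweet, long_tweet)
d_cos_diff = cosine_dissimilarity(short_tweet, other_tweet)

print(f"Euclidean: same content {d_euc_same:.3f}, different content {d_euc_diff:.3f}")
print(f"Cosine:    same content {d_cos_same:.3f}, different content {d_cos_diff:.3f}")
```

In this toy case Euclidean distance ranks the differently worded tweet as closer than the longer tweet with identical wording, while cosine dissimilarity ranks them the other way round.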

Discovering Mobility Patterns on Bicycle-Based Public Transportation System by Using Probabilistic Topic Models

Ambient Intelligence - Software and Applications - Advances in Intelligent and Soft Computing

DOI: 10.1007/978-3-642-28783-1_18

2012, pp. 145-153

Cited By ~ 5

Author(s): Raul Montoliu

Keyword(s): Public Transportation, Topic Models, Transportation System, Mobility Patterns, Probabilistic Topic Models, Public Transportation System

Improving Topic Models with Latent Feature Word Representations

Transactions of the Association for Computational Linguistics

DOI: 10.1162/tacl_a_00140

2015, Vol 3, pp. 299-313

Cited By ~ 94

Author(s): Dat Quoc Nguyen, Richard Billingsley, Lan Du, Mark Johnson

Keyword(s): High Performance, Feature Vector, Topic Models, Document Collections, Probabilistic Topic Models, New Models, Classification Tasks, Latent Topics, Vector Representations, Feature Word

Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
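The abstract does not spell out how the word vectors enter the model, so the following is only a rough sketch of one way to combine the two sources of information: a topic's word distribution is taken as a mixture of a count-based multinomial and a softmax over dot products between a topic vector and pre-trained word vectors. The function names, the mixture weight `lam`, and all numbers are assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_word_distribution(phi_t, topic_vec, word_vecs, lam):
    """Mix a topic's multinomial over words with a latent-feature component.

    phi_t     : word probabilities for one topic, learnt on the small corpus
    topic_vec : vector for the same topic (hypothetical values here)
    word_vecs : one pre-trained vector per vocabulary word
    lam       : assumed mixture weight in [0, 1] for the latent-feature part
    """
    scores = [sum(a * b for a, b in zip(topic_vec, w)) for w in word_vecs]
    latent = softmax(scores)
    return [(1.0 - lam) * p + lam * q for p, q in zip(phi_t, latent)]

# Hypothetical three-word vocabulary with two-dimensional word vectors.
phi_t = [0.7, 0.2, 0.1]
word_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
topic_vec = [2.0, 0.0]

mixed = mixed_word_distribution(phi_t, topic_vec, word_vecs, lam=0.5)
print([round(p, 3) for p in mixed])
```

With `lam=0` the sketch reduces to the plain multinomial; larger values let words whose vectors lie close to the topic vector gain probability even if they are rare in the small corpus.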