DistSNNMF: Solving Large-Scale Semantic Topic Model Problems on HPC for Streaming Texts

Decoding brain activity using a large-scale probabilistic functional-anatomical atlas of human cognition

10.1101/059618 ◽

2016 ◽

Cited By ~ 4

Author(s):

Timothy N. Rubin ◽

Oluwasanmi Koyejo ◽

Krzysztof J. Gorgolewski ◽

Michael N. Jones ◽

Russell A. Poldrack ◽

...

Keyword(s):

Large Scale ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Brain Activity ◽

Human Cognition ◽

Brain Images ◽

Whole Brain ◽

Context Sensitive ◽

Cognitive States ◽

Small Set

AbstractA central goal of cognitive neuroscience is to decode human brain activity--i.e., to infer mental processes from observed patterns of whole-brain activation. Previous decoding efforts have focused on classifying brain activity into a small set of discrete cognitive states. To attain maximal utility, a decoding framework must be open-ended, systematic, and context-sensitive--i.e., capable of interpreting numerous brain states, presented in arbitrary combinations, in light of prior information. Here we take steps towards this objective by introducing a Bayesian decoding framework based on a novel topic model---Generalized Correspondence Latent Dirichlet Allocation---that learns latent topics from a database of over 11,000 published fMRI studies. The model produces highly interpretable, spatially-circumscribed topics that enable flexible decoding of whole-brain images. Importantly, the Bayesian nature of the model allows one to “seed” decoder priors with arbitrary images and text--enabling researchers, for the first time, to generative quantitative, context-sensitive interpretations of whole-brain patterns of brain activity.

Download Full-text

Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing

Journal of Computer Science and Technology ◽

10.1007/s11390-018-1871-y ◽

2018 ◽

Vol 33 (5) ◽

pp. 1007-1022

Author(s):

Yang Li ◽

Wen-Zhuo Song ◽

Bo Yang

Keyword(s):

Large Scale ◽

Topic Model ◽

Text Processing ◽

Variational Inference ◽

Stochastic Variational Inference

Download Full-text

SRTM: a supervised relation topic model for multi-classification on large-scale document network

Neural Computing and Applications ◽

10.1007/s00521-019-04145-5 ◽

2019 ◽

Vol 32 (10) ◽

pp. 6383-6392

Author(s):

Chunshan Li ◽

Hua Zhang ◽

Dianhui Chu ◽

Xiaofei Xu

Keyword(s):

Large Scale ◽

Topic Model ◽

Multi Classification

Download Full-text

ZenLDA: Large-scale topic model training on distributed data-parallel platform

Big Data Mining and Analytics ◽

10.26599/bdma.2018.9020006 ◽

2018 ◽

Vol 1 (1) ◽

pp. 57-74 ◽

Cited By ~ 11

Keyword(s):

Large Scale ◽

Topic Model ◽

Distributed Data ◽

Data Parallel ◽

Parallel Platform ◽

Model Training

Download Full-text

Exploring the Citywide Human Mobility Patterns of Taxi Trips through a Topic-Modeling Analysis

Journal of Advanced Transportation ◽

10.1155/2021/6697827 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Hui Xiong ◽

Kaiqiang Xie ◽

Lu Ma ◽

Feng Yuan ◽

Rui Shen

Keyword(s):

Power Law ◽

Traffic Management ◽

Large Scale ◽

Latent Dirichlet Allocation ◽

Topic Model ◽

Human Mobility ◽

Mobility Patterns ◽

Power Law Distribution ◽

Modeling Analysis ◽

Wide Range

Understanding human mobility patterns is of great importance for a wide range of applications from social networks to transportation planning. Toward this end, the spatial-temporal information of a large-scale dataset of taxi trips was collected via GPS, from March 10 to 23, 2014, in Beijing. The data contain trips generated by a great portion of taxi vehicles citywide. We revealed that the geographic displacement of those trips follows the power law distribution and the corresponding travel time follows a mixture of the exponential and power law distribution. To identify human mobility patterns, a topic model with the latent Dirichlet allocation (LDA) algorithm was proposed to infer the sixty-five key topics. By measuring the variation of trip displacement over time, we find that the travel distance in the morning rush hour is much shorter than that in the other time. As for daily patterns, it shows that taxi mobility presents weekly regularity both on weekdays and on weekends. Among different days in the same week, mobility patterns on Tuesday and Wednesday are quite similar. By quantifying the trip distance along time, we find that Topic 44 exhibits dominant patterns, which means distance less than 10 km is predominant no matter what time in a day. The findings could be references for travelers to arrange trips and policymakers to formulate sound traffic management policies.

Download Full-text

MixEHR-Guided: A guided multi-modal topic modeling approach for large-scale automatic phenotyping using the electronic health record

10.1101/2021.12.17.473215 ◽

2021 ◽

Author(s):

Yuri Ahuja ◽

Yuesong Zou ◽

Aman Verma ◽

David Buckeridge ◽

Yue Li

Keyword(s):

Gold Standard ◽

Topic Modeling ◽

Large Scale ◽

Topic Model ◽

Disease Risk ◽

Clinical Decision ◽

Treatment Recommendation ◽

Administrative Claims ◽

Electronic Health ◽

Automatic Phenotyping

Electronic Health Records (EHRs) contain rich clinical data collected at the point of the care, and their increasing adoption offers exciting opportunities for clinical informatics, disease risk prediction, and personalized treatment recommendation. However, effective use of EHR data for research and clinical decision support is often hampered by a lack of reliable disease labels. To compile gold-standard labels, researchers often rely on clinical experts to develop rule-based phenotyping algorithms from billing codes and other surrogate features. This process is tedious and error-prone due to recall and observer biases in how codes and measures are selected, and some phenotypes are incompletely captured by a handful of surrogate features. To address this challenge, we present a novel automatic phenotyping model called MixEHR-Guided (MixEHR-G), a multimodal hierarchical Bayesian topic model that efficiently models the EHR generative process by identifying latent phenotype structure in the data. Unlike existing topic modeling algorithms wherein, the inferred topics are not identifiable, MixEHR-G uses prior information from informative surrogate features to align topics with known phenotypes. We applied MixEHR-G to an openly available EHR dataset of 38,597 intensive care patients (MIMIC-III) in Boston, USA and to administrative claims data for a population-based cohort (PopHR) of 1.3 million people in Quebec, Canada. Qualitatively, we demonstrate that MixEHR-G learns interpretable phenotypes and yields meaningful insights about phenotype similarities, comorbidities, and epidemiological associations. Quantitatively, MixEHR-G outperforms existing unsupervised phenotyping methods on a phenotype label annotation task, and it can accurately estimate relative phenotype prevalence functions without gold-standard phenotype information. Altogether, MixEHR-G is an important step towards building an interpretable and automated phenotyping system using EHR data.

Download Full-text

A Distributed Topic Model for Large-Scale Streaming Text

Knowledge Science, Engineering and Management - Lecture Notes in Computer Science ◽

10.1007/978-3-030-29563-9_4 ◽

2019 ◽

pp. 37-48

Author(s):

Yicong Li ◽

Dawei Feng ◽

Menglong Lu ◽

Dongsheng Li

Keyword(s):

Large Scale ◽

Topic Model

Download Full-text

Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data

Nature Communications ◽

10.1038/s41467-021-25534-2 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Yifan Zhao ◽

Huiyu Cai ◽

Zuobai Zhang ◽

Jian Tang ◽

Yue Li

Keyword(s):

Single Cell ◽

Large Scale ◽

Topic Model ◽

Enrichment Analysis ◽

Gene Signature ◽

Gene Set Enrichment Analysis ◽

Learning Performance ◽

Cell Type ◽

Gene Set Enrichment ◽

Gene Sets

AbstractThe advent of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies. However, large-scale integrative analysis of scRNA-seq data remains a challenge largely due to unwanted batch effects and the limited transferabilty, interpretability, and scalability of the existing computational methods. We present single-cell Embedded Topic Model (scETM). Our key contribution is the utilization of a transferable neural-network-based encoder while having an interpretable linear decoder via a matrix tri-factorization. In particular, scETM simultaneously learns an encoder network to infer cell type mixture and a set of highly interpretable gene embeddings, topic embeddings, and batch-effect linear intercepts from multiple scRNA-seq datasets. scETM is scalable to over 106 cells and confers remarkable cross-tissue and cross-species zero-shot transfer-learning performance. Using gene set enrichment analysis, we find that scETM-learned topics are enriched in biologically meaningful and disease-related pathways. Lastly, scETM enables the incorporation of known gene sets into the gene embeddings, thereby directly learning the associations between pathways and topics via the topic embeddings.

Download Full-text

Collapsed variational bayesian inference of the author-topic model: application to large-scale coordinate-based meta-analysis

2016 International Workshop on Pattern Recognition in Neuroimaging (PRNI) ◽

10.1109/prni.2016.7552332 ◽

2016 ◽

Cited By ~ 2

Author(s):

Gia H. Ngo ◽

Simon B. Eickhoff ◽

Peter T. Fox ◽

B.T. Thomas Yeo

Keyword(s):

Bayesian Inference ◽

Large Scale ◽

Topic Model ◽

Meta Analysis ◽

Model Application ◽

Variational Bayesian ◽

Variational Bayesian Inference

Download Full-text

A Tag-Based Improved LDA and Web Page Clustering Analysis

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.667.277 ◽

2014 ◽

Vol 667 ◽

pp. 277-285 ◽

Cited By ~ 1

Author(s):

Fang Chen ◽

Yan Hui Zhou

Keyword(s):

Clustering Analysis ◽

Large Scale ◽

Clustering Algorithm ◽

Semantic Analysis ◽

Topic Model ◽

Rapid Development ◽

Expansion Method ◽

Web Page ◽

Web Page Clustering ◽

The Web

With the rapid development of Internet, tag technology has been widely used in various sites. The brief text labels of network resources are greatly convenient for people to access the massive data. Social tags allows the user to use any word ----to tag network objects, and to share these tags, because of its simple and flexible operation, and it has become one of the popular applications. However, there exists some problems like noise of tags, lack of using criteria, and sparse distribution etc. Especially sparsity of tags seriously limits its application in the semantic analysis of web pages. This paper, by exploiting the user-related tag expansion method to overcome this problem, at the same time by using the topic model----LDA to model the web tags, mine its potential topic from the large-scale web page, and obtain the topic distribution of the text to the text clustering analysis. The experimental results show that, compared with the traditional clustering algorithm, the method of based LDA clustering on the analysis of the web tags have a larger increase.

Download Full-text