Topic Modeling of Large Scale Social Text

BACKGROUND The COVID-19 pandemic has impacted nearly all aspects of life and has posed significant threats to international health and the economy. Given the rapidly unfolding nature of the current pandemic, there is an urgent need to streamline literature synthesis of the growing scientific research to elucidate targeted solutions. While traditional systematic literature reviews provide valuable insights, they have limitations: they analyze a limited number of papers, are subject to various biases, are time-consuming and labor-intensive, focus on a few topics, cannot support trend analysis, and lack data-driven tools. OBJECTIVE This study addresses these limitations by using text mining methods to analyze two biomedical concepts, clinical manifestations of disease and therapeutic chemical compounds, in a corpus of COVID-19 research papers and to find associations between the two concepts. METHODS We collected COVID-19 preprints and peer-reviewed research papers published in 2020. We used frequency analysis to find the most frequent manifestations and therapeutic chemicals, taking frequency as an indicator of the importance of the two biomedical concepts. We also applied topic modeling to find relationships between the two biomedical concepts. RESULTS We analyzed 9,298 research papers published through May 5, 2020 and found 3,645 disease-related and 2,434 chemical-related articles. The most frequent clinical manifestations of disease terminology included COVID-19, SARS, cancer, pneumonia, fever, and cough. The most frequent chemical-related terminology included Lopinavir, Ritonavir, Oxygen, Chloroquine, Remdesivir, and water. Topic modeling provided 25 categories showing relationships between our two overarching categories. These categories represent statistically significant associations between multiple aspects of each category, some of which were novel and not previously identified by the scientific community. CONCLUSIONS Appreciation of this context is vital due to the lack of a systematic large-scale literature review survey and the importance of fast literature review during the current COVID-19 pandemic for developing treatments. This study is beneficial to researchers for obtaining a macro-level picture of the literature, to educators for knowing the scope of the literature, to journals for exploring the most discussed disease symptoms and pharmaceutical targets, and to policymakers and funding agencies for creating scientific strategic plans regarding COVID-19.
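The frequency analysis described in the METHODS section can be illustrated with a minimal sketch. The abstract does not say how terms were extracted, so the whole-word matching below, the function name `document_frequency`, the toy documents, and the two short term lists are all assumptions made for illustration, not the study's pipeline.

```python
import re
from collections import Counter

def document_frequency(docs, terms):
    """Count, for each term, how many documents mention it at least once."""
    counts = Counter()
    for doc in docs:
        text = doc.lower()
        for term in terms:
            # Whole-word, case-insensitive match (an assumed matching rule).
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", text):
                counts[term] += 1
    return counts

# Toy corpus and term lists, invented for this example.
docs = [
    "Patients presented with fever and cough; remdesivir was administered.",
    "Fever was the most common symptom. Chloroquine showed no benefit.",
    "Pneumonia and cough were reported in severe cases.",
]
manifestations = ["fever", "cough", "pneumonia"]
chemicals = ["remdesivir", "chloroquine", "oxygen"]

disease_df = document_frequency(docs, manifestations)
chemical_df = document_frequency(docs, chemicals)
print(disease_df.most_common())
print(chemical_df.most_common())
```

Counting documents rather than raw occurrences keeps one long paper from dominating the ranking; whether the study counted documents or occurrences is not stated in the abstract.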


Topic modeling and improvement of image representation for large-scale image retrieval

Information Sciences. DOI: 10.1016/j.ins.2016.05.029. 2016, Vol 366, pp. 99-120. Cited By ~ 11

Author(s): Nguyen Anh Tu, Dong-Luong Dinh, Mostofa Kamal Rasel, Young-Koo Lee

Keyword(s): Image Retrieval, Topic Modeling, Large Scale, Image Representation, Large Scale Image Retrieval


Topic modeling for large-scale text data

Frontiers of Information Technology & Electronic Engineering. DOI: 10.1631/fitee.1400352. 2015, Vol 16 (6), pp. 457-465. Cited By ~ 3

Author(s): Xi-ming Li, Ji-hong Ouyang, You Lu

Keyword(s): Topic Modeling, Large Scale, Text Data


Local Topic Discovery via Boosted Ensemble of Nonnegative Matrix Factorization

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. DOI: 10.24963/ijcai.2017/699. 2017. Cited By ~ 4

Author(s): Sangho Suh, Jaegul Choo, Joonseok Lee, Chandan K. Reddy

Keyword(s): Matrix Factorization, Topic Modeling, Large Scale, Nonnegative Matrix Factorization, Nonnegative Matrix, Gradient Boosting, Ensemble Model, High Quality, Residual Matrix, The Given

Nonnegative matrix factorization (NMF) has become increasingly popular for topic modeling of large-scale documents. However, the resulting topics often represent only general, and thus redundant, information about the data rather than minor but potentially meaningful information for users. To tackle this problem, we propose a novel ensemble model of nonnegative matrix factorization for discovering high-quality local topics. Our method leverages the idea of an ensemble model to successively perform NMF on a residual matrix obtained from previous stages, generating a sequence of topic sets. The novelty of our method lies in the fact that it utilizes a residual matrix inspired by a state-of-the-art gradient boosting model and applies a sophisticated local weighting scheme to the given matrix to enhance the locality of topics, which in turn delivers high-quality, focused topics of interest to users.
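The stage-wise idea in this abstract, running NMF repeatedly on a residual matrix, can be sketched roughly as follows. This is a simplified illustration and not the authors' algorithm: it omits the local weighting scheme entirely, it assumes plain Lee-Seung multiplicative updates for each stage, and the names `nmf` and `residual_nmf_ensemble` as well as the clipping of the residual at zero are assumptions.

```python
import numpy as np

def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Plain NMF via multiplicative updates, so that X is approximately W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def residual_nmf_ensemble(X, k, n_stages=3):
    """Run NMF stage by stage, each time on the residual left by the previous stage."""
    residual = X.astype(float).copy()
    stages = []
    for stage in range(n_stages):
        W, H = nmf(residual, k, seed=stage)
        stages.append((W, H))
        # Remove what this stage explained; clip so the matrix stays nonnegative
        # (an assumed choice, needed for the next NMF run to be well defined).
        residual = np.maximum(residual - W @ H, 0.0)
    return stages, residual

# Toy term-by-document matrix, invented for this example.
X = np.random.default_rng(1).random((6, 5))
stages, residual = residual_nmf_ensemble(X, k=2, n_stages=3)
print(len(stages), np.linalg.norm(X), np.linalg.norm(residual))
```

Each stage yields its own topic set (the columns of `W`), and later stages can only see what earlier stages failed to reconstruct, which is what pushes them toward less dominant structure in the data.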


MixEHR-Guided: A guided multi-modal topic modeling approach for large-scale automatic phenotyping using the electronic health record

DOI: 10.1101/2021.12.17.473215. 2021

Author(s): Yuri Ahuja, Yuesong Zou, Aman Verma, David Buckeridge, Yue Li

Keyword(s): Gold Standard, Topic Modeling, Large Scale, Topic Model, Disease Risk, Clinical Decision, Treatment Recommendation, Administrative Claims, Electronic Health, Automatic Phenotyping

Electronic Health Records (EHRs) contain rich clinical data collected at the point of care, and their increasing adoption offers exciting opportunities for clinical informatics, disease risk prediction, and personalized treatment recommendation. However, effective use of EHR data for research and clinical decision support is often hampered by a lack of reliable disease labels. To compile gold-standard labels, researchers often rely on clinical experts to develop rule-based phenotyping algorithms from billing codes and other surrogate features. This process is tedious and error-prone due to recall and observer biases in how codes and measures are selected, and some phenotypes are incompletely captured by a handful of surrogate features. To address this challenge, we present a novel automatic phenotyping model called MixEHR-Guided (MixEHR-G), a multimodal hierarchical Bayesian topic model that efficiently models the EHR generative process by identifying latent phenotype structure in the data. Unlike existing topic modeling algorithms, in which the inferred topics are not identifiable, MixEHR-G uses prior information from informative surrogate features to align topics with known phenotypes. We applied MixEHR-G to an openly available EHR dataset of 38,597 intensive care patients (MIMIC-III) in Boston, USA and to administrative claims data for a population-based cohort (PopHR) of 1.3 million people in Quebec, Canada. Qualitatively, we demonstrate that MixEHR-G learns interpretable phenotypes and yields meaningful insights about phenotype similarities, comorbidities, and epidemiological associations. Quantitatively, MixEHR-G outperforms existing unsupervised phenotyping methods on a phenotype label annotation task, and it can accurately estimate relative phenotype prevalence functions without gold-standard phenotype information. Altogether, MixEHR-G is an important step towards building an interpretable and automated phenotyping system using EHR data.


Topic Modeling for Large-Scale Multimedia Analysis and Retrieval

Big Data. DOI: 10.1201/b18050-24. 2015, pp. 412-429

Keyword(s): Topic Modeling, Large Scale, Multimedia Analysis


Demystifying COVID-19 publications: institutions, journals, concepts, and topics

Journal of the Medical Library Association JMLA. DOI: 10.5195/jmla.2021.1141. 2021, Vol 109 (3)

Author(s): Haihua Chen, Jiangping Chen, Huyen Nguyen

Keyword(s): Topic Modeling, Large Scale, Descriptive Analysis, Key Words And Phrases, Research Institutions, Powerful Approach, The Us, Visualization Techniques, The University, Academy Of Sciences

Objective: We analyzed the COVID-19 Open Research Dataset (CORD-19) to understand leading research institutions, collaborations among institutions, major publication venues, key research concepts, and topics covered by pandemic-related research. Methods: We conducted a descriptive analysis of authors’ institutions and relationships, automatic content extraction of key words and phrases from titles and abstracts, and topic modeling and evolution. Data visualization techniques were applied to present the results of the analysis. Results: We found that leading research institutions on COVID-19 included the Chinese Academy of Sciences, the US National Institutes of Health, and the University of California. Research studies mostly involved collaboration among different institutions at national and international levels. In addition to bioRxiv, major publication venues included journals such as The BMJ, PLOS One, Journal of Virology, and The Lancet. Key research concepts included the coronavirus, acute respiratory impairments, health care, and social distancing. The ten most popular topics were identified through topic modeling and included human metapneumovirus and livestock, clinical outcomes of severe patients, and risk factors for higher mortality rate. Conclusion: Data analytics is a powerful approach for quickly processing and understanding large-scale datasets like CORD-19. This approach could help medical librarians, researchers, and the public understand important characteristics of COVID-19 research and could be applied to the analysis of other large datasets.


Measuring Equity-Promoting Behaviors in Digital Teaching Simulations: A Topic Modeling Approach

AERA Open. DOI: 10.1177/23328584211045685. 2021, Vol 7, pp. 233285842110456

Author(s): Joshua Littenberg-Tobias, Elizabeth Borneman, Justin Reich

Keyword(s): Language Processing, Topic Modeling, Large Scale, Topic Model, Online Course, Massive Open Online, Massive Open Online Course, Digital Simulations, Equity And Inclusion, Structural Topic Model

Diversity, equity, and inclusion (DEI) issues are urgent in education. We developed and evaluated a massive open online course (N = 963) with embedded equity simulations that attempted to equip educators with equity teaching practices. Applying a structural topic model (STM), a natural language processing (NLP) technique, we examined how participants with different equity attitudes responded in simulations. Over a sequence of four simulations, the simulation behavior of participants with less equitable beliefs converged toward the simulation behavior of participants with more equitable beliefs (ES [effect size] = 1.08 SD). This finding was corroborated by overall changes in equity mindsets (ES = 0.88 SD) and changes in self-reported equity-promoting practices (ES = 0.32 SD). Digital simulations, when combined with NLP, offer a compelling approach to both teaching about DEI topics and formatively assessing learner behavior in large-scale learning environments.


COVID-19 Vaccine–Related Discussion on Twitter: Topic Modeling and Sentiment Analysis (Preprint)

DOI: 10.2196/preprints.24435. 2021

Author(s): Joanne Chen Lyu, Eileen Le Han, Garving K Luli

Keyword(s): Social Media, Topic Modeling, Large Scale, National Research Council, Global Perspective, World Health, Data Set, Emotion Analysis, The Public, The World

BACKGROUND Vaccination is a cornerstone of the prevention of communicable infectious diseases; however, vaccines have traditionally met with public fear and hesitancy, and COVID-19 vaccines are no exception. Social media use has been demonstrated to play a role in the low acceptance of vaccines. OBJECTIVE The aim of this study is to identify the topics and sentiments in the public COVID-19 vaccine–related discussion on social media and discern the salient changes in topics and sentiments over time to better understand the public perceptions, concerns, and emotions that may influence the achievement of herd immunity goals. METHODS Tweets were downloaded from a large-scale COVID-19 Twitter chatter data set from March 11, 2020, the day the World Health Organization declared COVID-19 a pandemic, to January 31, 2021. We used R software to clean the tweets and retain tweets that contained the keywords vaccination, vaccinations, vaccine, vaccines, immunization, vaccinate, and vaccinated. The final data set included in the analysis consisted of 1,499,421 unique tweets from 583,499 different users. We used R to perform latent Dirichlet allocation for topic modeling as well as sentiment and emotion analysis using the National Research Council of Canada Emotion Lexicon. RESULTS Topic modeling of tweets related to COVID-19 vaccines yielded 16 topics, which were grouped into 5 overarching themes. Opinions about vaccination (227,840/1,499,421 tweets, 15.2%) was the most tweeted topic and remained a highly discussed topic during the majority of the period of our examination. Vaccine progress around the world became the most discussed topic around August 11, 2020, when Russia approved the world’s first COVID-19 vaccine. With the advancement of vaccine administration, the topic of instruction on getting vaccines gradually became more salient and became the most discussed topic after the first week of January 2021. Weekly mean sentiment scores showed that despite fluctuations, the sentiment was increasingly positive in general. Emotion analysis further showed that trust was the most predominant emotion, followed by anticipation, fear, sadness, etc. The trust emotion reached its peak on November 9, 2020, when Pfizer announced that its vaccine is 90% effective. CONCLUSIONS Public COVID-19 vaccine–related discussion on Twitter was largely driven by major events about COVID-19 vaccines and mirrored the active news topics in mainstream media. The discussion also demonstrated a global perspective. The increasingly positive sentiment around COVID-19 vaccines and the dominant emotion of trust shown in the social media discussion may imply higher acceptance of COVID-19 vaccines compared with previous vaccines.
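The keyword-based retention step in the METHODS section might look roughly like the sketch below. The study performed this step in R; this Python version is only an illustration. The keyword list is taken from the abstract, while the whole-word matching, the whitespace and case normalization used to detect duplicates, the function name `filter_tweets`, and the sample tweets are assumptions.

```python
import re

# Keywords as listed in the abstract.
KEYWORDS = [
    "vaccination", "vaccinations", "vaccine", "vaccines",
    "immunization", "vaccinate", "vaccinated",
]
PATTERN = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)

def filter_tweets(tweets):
    """Keep unique tweets that contain at least one keyword as a whole word."""
    seen = set()
    kept = []
    for text in tweets:
        # Assumed duplicate rule: same text after lowercasing and collapsing spaces.
        normalized = " ".join(text.lower().split())
        if normalized in seen:
            continue
        if PATTERN.search(text):
            seen.add(normalized)
            kept.append(text)
    return kept

# Sample tweets, invented for this example.
sample = [
    "Got my vaccine today!",
    "got my   VACCINE today!",
    "Stay home, stay safe.",
    "Vaccinated and relieved",
]
kept = filter_tweets(sample)
print(kept)
```

The retained tweets would then be passed to topic modeling and lexicon-based emotion scoring, which this sketch does not attempt to reproduce.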
