Intersection of Resilience and COVID-19: Structural Topic Modelling and Word Embeddings from Reddit Titles (Preprint)
BACKGROUND Topic modelling and word embedding studies of Twitter data related to COVID-19 have been reported extensively. Reddit, another social media platform that saw a sharp increase in new users and posts due to COVID-19, remains a much less explored alternative; its submission titles are of particular interest because of their format (≤ 300 characters) and content rules. The positivity of self-presentation on social media influences both the quantity and the quality of reactions (upvotes) from other social media contacts. OBJECTIVE 1) Expand on the concept of resilience by identifying possibly related topics, taking their number of upvotes into account, together with the terms closest to resilience; and 2) associate specific emotions drawn from the state-of-the-art literature with their closest terms, in order to relate these emotions to experienced situations. METHODS Reddit data were collected from pushshift.io with the pushshiftr R package. Data cleaning and preprocessing were performed using the quanteda, tidyverse, and tidytext R packages. A word2vec (W2V) model was trained on submission titles with the wordVectors R package; preliminary validation was performed using a subset of Mikolov’s analogies and a COVID-19 glossary. Main topics (represented as sets of words) were extracted using structural topic modelling (STM) with the spectral method in the stm R package, with the number of upvotes as a covariate. Topics were validated using semantic coherence and exclusivity. Clusters were assessed using the Dunn index. RESULTS We collected all 374,421 titles submitted by 104,351 different redditors to the r/Coronavirus subreddit between January 20, 2020, and May 14, 2021. We trained W2V and identified more than 20 valid analogies (e.g., doctor – hospital + teacher = school). We further validated W2V with representative terms extracted from a COVID-19 glossary; all closest terms retrieved by W2V were verified against state-of-the-art publications.
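The analogy check reported above can be illustrated with a minimal sketch. It is not the authors' code (the study used the wordVectors R package); it is a standard-library Python example on hand-made toy vectors, and it assumes the example is read as "doctor is to hospital as teacher is to school". The function names, the four-dimensional vectors, and the extra vocabulary words are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' by vector arithmetic.

    Builds the target vec(b) - vec(a) + vec(c) and returns the vocabulary
    word most similar to it, excluding the three query words.
    """
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy embeddings (assumption, for illustration only).
# Dimensions: [medical, education, place, person]
TOY_VECTORS = {
    "doctor":   [1.0, 0.0, 0.0, 1.0],
    "hospital": [1.0, 0.0, 1.0, 0.0],
    "teacher":  [0.0, 1.0, 0.0, 1.0],
    "school":   [0.0, 1.0, 1.0, 0.0],
    "nurse":    [0.9, 0.0, 0.0, 1.0],
    "vaccine":  [1.0, 0.0, 0.0, 0.0],
}

answer = solve_analogy(TOY_VECTORS, "doctor", "hospital", "teacher")
print(answer)  # school
```

With a trained model, the same arithmetic would be applied to the learned embeddings, and an analogy would count as valid when the expected word ranks among the nearest neighbours of the target vector.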
STM retrieved 20 topics (with 20 words each), ordered by their number of upvotes. We ran W2V on a representative topic (addressing vaccines), using two terms as seeds that led to other related terms (represented using cluster analysis), which we validated against scientific publications. STM did not retrieve any topic containing the term “resilience”, which hardly appeared in the titles overall (less than 0.02%). Nevertheless, we identified several closest terms (e.g., wellbeing, roadmap) and combined terms (e.g., resilience and elderly, resilience and indigenous), as well as specific emotions that W2V related to lived experiences (e.g., the emotion of gratitude associated with applause and balconies). CONCLUSIONS We applied, for the first time, the combination of STM and a word2vec model trained on a relatively small coronavirus dataset of Reddit titles. This yielded immediate and accurate terms that can be used to expand our knowledge of topics associated with the pandemic (e.g., vaccines) or of specific aspects such as resilience.
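The abstract states that clusters were assessed using the Dunn index but gives no formula. The sketch below shows one common definition, assumed here: the smallest distance between points in different clusters divided by the largest within-cluster diameter, with Euclidean distance. The function names and toy points are hypothetical, and the study's own implementation may use a different variant.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dunn_index(clusters):
    """Dunn index for a list of clusters, each a list of points.

    Numerator: minimum distance between any two points in different clusters.
    Denominator: maximum distance between any two points in the same cluster.
    Higher values indicate compact, well-separated clusters. Raises
    ZeroDivisionError if every cluster has zero diameter.
    """
    max_diameter = max(
        (euclidean(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    min_separation = min(
        euclidean(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    )
    return min_separation / max_diameter


# Toy 2-D clusters (assumption, for illustration only).
TOY_CLUSTERS = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(3.0, 0.0), (3.0, 1.0)],
]

print(dunn_index(TOY_CLUSTERS))  # 3.0
```

In this toy case each cluster has diameter 1 and the closest pair of points across clusters is 3 apart, so the index is 3.0. For term clusters, the points would be the W2V vectors of the clustered terms.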