A reliable cross-site user generated content modeling method based on topic model

2020 ◽  
Vol 209 ◽  
pp. 106435
Author(s):  
Baoxi Liu ◽  
Peng Zhang ◽  
Tun Lu ◽  
Ning Gu
2017 ◽  
Vol 35 (4) ◽  
pp. 770-782 ◽  
Author(s):  
Qingqing Zhou ◽  
Chengzhi Zhang

Purpose: The development of social media has led large numbers of internet users to produce massive amounts of user-generated content (UGC). UGC, which directly shows users’ opinions about events, is valuable for monitoring public opinion. Current research has focused on analysing topic evolution in UGC. However, few studies pay attention to the emotion evolution of sub-topics of popular events. Important details about users’ opinions might be missed when users’ emotions are ignored. This paper aims to extract sub-topics of a popular event from UGC and investigate the emotion evolution of each sub-topic.
Design/methodology/approach: This paper first collects UGC about a popular event as experimental data and conducts subjectivity classification on the data to obtain a subjective corpus. Second, the subjective corpus is classified into different emotion categories using supervised emotion classification. Meanwhile, a topic model is used to extract sub-topics of the event from the subjective corpus. Finally, the authors use the results of emotion classification and sub-topic extraction to analyse emotion evolution over time.
Findings: Experimental results show that specific primary emotions exist in each sub-topic and evolve differently. Moreover, the authors find that the emotion classifier performs best with term frequency and relevance frequency as the feature-weighting method.
Originality/value: To the best of the authors’ knowledge, this is the first research to mine the emotion evolution of sub-topics of an event from UGC. It mines users’ opinions about sub-topics of an event, which may offer more details that are useful for analysing users’ emotions in preparation for decision-making.
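The feature weighting named in the findings can be illustrated with a minimal sketch. It assumes the commonly used relevance-frequency formulation, rf = log2(2 + a / max(1, c)), where a and c count the documents containing a term in the target emotion category and in all other categories; the abstract does not state which variant the authors used, and the function name and toy data below are hypothetical.

```python
import math
from collections import Counter


def tf_rf_weights(docs, labels, positive_label):
    """Return one {term: tf * rf} dict per document.

    docs is a list of token lists, labels the parallel list of category
    labels, and positive_label the category the weights are computed for.
    """
    pos_df = Counter()  # a: documents of the target category containing the term
    neg_df = Counter()  # c: documents of all other categories containing the term
    for tokens, label in zip(docs, labels):
        target = pos_df if label == positive_label else neg_df
        target.update(set(tokens))  # count each term once per document

    def rf(term):
        # Assumed formulation: log2(2 + a / max(1, c))
        return math.log2(2 + pos_df[term] / max(1, neg_df[term]))

    weighted = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within the document
        weighted.append({term: count * rf(term) for term, count in tf.items()})
    return weighted


# Toy corpus (hypothetical), weighted for the "joy" category.
docs = [["great", "match", "great"], ["sad", "match"], ["great", "win"]]
labels = ["joy", "sadness", "joy"]
weights = tf_rf_weights(docs, labels, positive_label="joy")
print(weights[0])
```

Terms concentrated in the target category receive a larger factor than terms spread evenly across categories, which is the property that makes this weighting attractive for a supervised classifier.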


2020 ◽  
Vol 12 (7) ◽  
pp. 2843 ◽  
Author(s):  
Eunhye (Olivia) Park ◽  
Bongsug (Kevin) Chae ◽  
Junehee Kwon ◽  
Woo-Hyuk Kim

Although green practices are increasingly adopted in the restaurant industry, little research has investigated their impact on customer satisfaction. This study used user-generated content from green restaurant customers to identify various aspects of green restaurants, including perceived green restaurant practices. Our data are based on U.S. green-certified restaurants listed on Yelp. Structural topic modeling was used to discover latent restaurant attributes from the user-generated content. With a longitudinal approach, the changes in customers’ interest in green practices were estimated. Finally, the common restaurant attributes and green attributes were used to predict customer satisfaction. This study will contribute to marketing strategies for the restaurant industry.


2022 ◽  
Vol 40 (4) ◽  
pp. 1-28
Author(s):  
Peng Zhang ◽  
Baoxi Liu ◽  
Tun Lu ◽  
Xianghua Ding ◽  
Hansu Gu ◽  
...  

User-generated content (UGC) in social media is the direct expression of users’ interests, preferences, and opinions. User behavior prediction based on UGC has been increasingly investigated in recent years. Compared with learning a person’s behavioral patterns on each social media site separately, jointly predicting user behavior across multiple sites so that the predictions complement each other (cross-site user behavior prediction) can be more accurate. However, cross-site user behavior prediction based on UGC is challenging because of the difficulty of cross-site data sampling, the complexity of UGC modeling, and the uncertainty of knowledge sharing among different sites. To address these problems, we propose a Cross-Site Multi-Task (CSMT) learning method to jointly predict user behavior on multiple social media sites. CSMT mainly derives from the hierarchical attention network and multi-task learning. With this method, the UGC on each social media site obtains fine-grained representations in terms of words, topics, posts, hashtags, and time slices, as well as the relevance among them, and the prediction tasks on different sites can be carried out jointly and complement each other. Using two cross-site datasets sampled from Weibo, Douban, Facebook, and Twitter, we validate our method’s superiority on several classification metrics compared with existing related methods.
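The multi-task idea in this abstract can be sketched in its simplest form, hard parameter sharing: one shared encoder feeds a separate prediction head per site, and the per-site losses are summed into a single objective. This is an illustrative assumption, not the CSMT architecture; it omits the hierarchical attention network entirely, and every name, weight, and data point below is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shared_encoder(features, shared_weights):
    """One tanh layer whose weights are shared by every site."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, features)))
        for row in shared_weights
    ]


def site_head(hidden, head_weights):
    """Site-specific logistic head on top of the shared representation."""
    return sigmoid(sum(w * h for w, h in zip(head_weights, hidden)))


def joint_loss(batches, shared_weights, heads):
    """Sum of binary cross-entropy losses over all sites."""
    total = 0.0
    for site, examples in batches.items():
        for features, label in examples:
            p = site_head(shared_encoder(features, shared_weights), heads[site])
            total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total


# Hypothetical toy data: (feature vector, binary behavior label) per site.
batches = {
    "weibo": [([1.0, 0.0], 1), ([0.0, 1.0], 0)],
    "douban": [([1.0, 1.0], 1)],
}
shared_weights = [[0.5, -0.2], [0.1, 0.3]]
heads = {"weibo": [0.4, -0.1], "douban": [-0.3, 0.2]}
print(joint_loss(batches, shared_weights, heads))
```

Because the encoder weights appear in every site's loss term, minimizing the summed loss lets data from one site shape the representation used by the others, which is the sense in which the tasks complement each other.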


10.2196/18273 ◽  
2020 ◽  
Vol 8 (10) ◽  
pp. e18273
Author(s):  
Sicheng Zhou ◽  
Yunpeng Zhao ◽  
Jiang Bian ◽  
Ann F Haynos ◽  
Rui Zhang

Background: Eating disorders (EDs) are a group of mental illnesses that have an adverse effect on both mental and physical health. As social media platforms (eg, Twitter) have become an important data source for public health research, some studies have qualitatively explored the ways in which EDs are discussed on these platforms. Initial results suggest that such research offers a promising method for further understanding this group of diseases. Nevertheless, an efficient computational method is needed to further identify and analyze tweets relevant to EDs on a larger scale.
Objective: This study aims to develop and validate a machine learning–based classifier to identify tweets related to EDs and to explore factors (ie, topics) related to EDs using a topic modeling method.
Methods: We collected potential ED-relevant tweets using keywords from previous studies and annotated these tweets into different groups (ie, ED relevant vs irrelevant and then promotional information vs laypeople discussion). Several supervised machine learning methods, such as convolutional neural network (CNN), long short-term memory (LSTM), support vector machine, and naïve Bayes, were developed and evaluated using annotated data. We used the classifier with the best performance to identify ED-relevant tweets and applied a topic modeling method, Correlation Explanation (CorEx), to analyze the content of the identified tweets. To validate these machine learning results, we also collected a cohort of ED-relevant tweets on the basis of manually curated rules.
Results: A total of 123,977 tweets were collected during the set period. We randomly annotated 2219 tweets for developing the machine learning classifiers. We developed a CNN-LSTM classifier to identify ED-relevant tweets published by laypeople in 2 steps: first relevant versus irrelevant (F1 score=0.89) and then promotional versus published by laypeople (F1 score=0.90). A total of 40,790 ED-relevant tweets were identified using the CNN-LSTM classifier. We also identified another set of tweets (ie, 17,632 ED-relevant and 83,557 ED-irrelevant tweets) posted by laypeople using manually specified rules. Using CorEx on all ED-relevant tweets, the topic model identified 162 topics. Overall, the coherence rate for topic modeling was 77.07% (1264/1640), indicating a high quality of the produced topics. The topics were further reviewed and analyzed by a domain expert.
Conclusions: The developed CNN-LSTM classifier could improve the efficiency of identifying ED-relevant tweets compared with the traditional manual method. The CorEx topic model was applied separately to the tweets identified by the machine learning–based classifier and to those identified by the traditional manual approach. Highly overlapping topics were observed between the 2 cohorts of tweets. The produced topics were further reviewed by a domain expert. Some of the topics identified from the potential ED tweets may provide new avenues for understanding this serious set of disorders.
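The 2-step identification and the reported figures can be made concrete with a short sketch. The cascade below only shows the control flow (relevance first, then layperson versus promotional); the two keyword-based stand-in classifiers are hypothetical placeholders, not the CNN-LSTM model from the study. The F1 helper uses the standard definition, and the last lines reproduce the coherence rate from the counts given in the abstract.

```python
def cascade_filter(tweets, is_relevant, is_layperson):
    """Keep tweets that pass step 1 (relevant) and then step 2 (layperson)."""
    kept = []
    for tweet in tweets:
        if not is_relevant(tweet):
            continue  # step 1: drop irrelevant tweets
        if is_layperson(tweet):
            kept.append(tweet)  # step 2: drop promotional tweets
    return kept


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical stand-ins for the two trained classifiers.
def toy_is_relevant(tweet):
    return "eating disorder" in tweet.lower()


def toy_is_layperson(tweet):
    return "http" not in tweet.lower()  # toy rule: links treated as promotional


tweets = [
    "My eating disorder recovery is going well",
    "Buy our eating disorder guide http://example.com",
    "Great weather today",
]
laypeople_tweets = cascade_filter(tweets, toy_is_relevant, toy_is_layperson)

# Coherence rate from the counts reported in the abstract.
coherence_rate = round(100 * 1264 / 1640, 2)
print(laypeople_tweets, coherence_rate)
```

Splitting the task into two binary decisions lets each step be evaluated with its own F1 score, as the abstract does.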


2016 ◽  
Vol 46 (6) ◽  
pp. 908-920 ◽  
Author(s):  
Shu Wu ◽  
Weiyu Guo ◽  
Song Xu ◽  
Yongzhen Huang ◽  
Liang Wang ◽  
...  

SAGE Open ◽  
2021 ◽  
Vol 11 (3) ◽  
pp. 215824402110315
Author(s):  
Eunhye Park ◽  
Junehee Kwon ◽  
Bongsug (Kevin) Chae ◽  
Sung-Bum Kim

This study aims to survey user-generated content (UGC) from diners in certified green restaurants, discover the green images they recall, and demonstrate the usefulness of applying a probabilistic topic model to comprehend customers’ perceptions. Postvisit online reviews (N = 28,098), in the form of unstructured texts from the TripAdvisor.com website, were used to find freely recalled green-restaurant images. These data were preprocessed with a structural topic model (STM) algorithm to select 51 relevant categories of images. These image categories were compared with the findings of previous studies to discover unique restaurant attributes. Furthermore, a topic-level network and a green-restaurant network were drawn to discover the most easily recallable image categories and their attributes. This machine-learning-based approach improved the reproducibility of unstructured data analyses, overcoming the subjectivity of qualitative data analysis. Theoretical and practical implications are offered for topic modeling methodology along with marketing strategies for restaurateurs.

