Promoting Reproducible Research for Characterizing Nonmedical Use of Medications Through Data Annotation: Description of a Twitter Corpus and Guidelines

Background Social media data are being increasingly used for population-level health research because they provide near real-time access to large volumes of consumer-generated data. Recently, a number of studies have explored the possibility of using social media data, such as from Twitter, for monitoring prescription medication abuse. However, there is a paucity of annotated data or guidelines for data characterization that discuss how information related to abuse-prone medications is presented on Twitter. Objective This study discusses the creation of an annotated corpus suitable for training supervised classification algorithms for the automatic classification of medication abuse–related chatter. The annotation strategies used for improving interannotator agreement (IAA), a detailed annotation guideline, and machine learning experiments that illustrate the utility of the annotated corpus are also described. Methods We employed an iterative annotation strategy, with interannotator discussions held and updates made to the annotation guidelines at each iteration to improve IAA for the manual annotation task. Using the grounded theory approach, we first characterized tweets into fine-grained categories and then grouped them into 4 broad classes: abuse or misuse, personal consumption, mention, and unrelated. After the completion of manual annotations, we experimented with several machine learning algorithms to illustrate the utility of the corpus and generate baseline performance metrics for automatic classification on these data. Results Our final annotated set consisted of 16,443 tweets mentioning at least 20 abuse-prone medications including opioids, benzodiazepines, atypical antipsychotics, central nervous system stimulants, and gamma-aminobutyric acid analogs. Our final overall IAA was 0.86 (Cohen kappa), which represents high agreement.
The manual annotation process revealed the variety of ways in which prescription medication misuse or abuse is discussed on Twitter, including expressions indicating coingestion, nonmedical use, nonstandard route of intake, and consumption above the prescribed doses. Among machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) over the test set (n=3271). Conclusions Our manual analysis and annotations of a large number of tweets have revealed types of information posted on Twitter about a set of abuse-prone prescription medications and their distributions. In the interests of reproducible and community-driven research, we have made our detailed annotation guidelines and the training data for the classification experiments publicly available, and the test data will be used in future shared tasks.
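The agreement figure in this abstract (Cohen kappa of 0.86) compares observed agreement between two annotators with the agreement expected by chance. The following is a minimal sketch of that calculation, not the authors' code: the function name `cohen_kappa` and the toy labels are invented for illustration, and only the four class names echo the abstract.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over classes of the product of marginal rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for illustration (not taken from the corpus).
annotator_1 = ["abuse", "mention", "consumption", "unrelated", "mention", "abuse"]
annotator_2 = ["abuse", "mention", "mention", "unrelated", "mention", "abuse"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

On the toy data the two annotators agree on 5 of 6 items, chance agreement is 11/36, and kappa works out to 0.76.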

Promoting Reproducible Research for Characterizing Nonmedical Use of Medications Through Data Annotation: Description of a Twitter Corpus and Guidelines (Preprint)

10.2196/preprints.15861 ◽

2019 ◽

Author(s):

Karen O'Connor ◽

Abeed Sarker ◽

Jeanmarie Perrone ◽

Graciela Gonzalez Hernandez

Keyword(s):

Machine Learning ◽

Social Media ◽

Prescription Medication ◽

Automatic Classification ◽

Support Vector ◽

Manual Annotation ◽

Social Media Data ◽

Nonmedical Use ◽

Medication Abuse ◽

Media Data

BACKGROUND Social media data are being increasingly used for population-level health research because they provide near real-time access to large volumes of consumer-generated data. Recently, a number of studies have explored the possibility of using social media data, such as from Twitter, for monitoring prescription medication abuse. However, there is a paucity of annotated data or guidelines for data characterization that discuss how information related to abuse-prone medications is presented on Twitter. OBJECTIVE This study discusses the creation of an annotated corpus suitable for training supervised classification algorithms for the automatic classification of medication abuse–related chatter. The annotation strategies used for improving interannotator agreement (IAA), a detailed annotation guideline, and machine learning experiments that illustrate the utility of the annotated corpus are also described. METHODS We employed an iterative annotation strategy, with interannotator discussions held and updates made to the annotation guidelines at each iteration to improve IAA for the manual annotation task. Using the grounded theory approach, we first characterized tweets into fine-grained categories and then grouped them into 4 broad classes: abuse or misuse, personal consumption, mention, and unrelated. After the completion of manual annotations, we experimented with several machine learning algorithms to illustrate the utility of the corpus and generate baseline performance metrics for automatic classification on these data. RESULTS Our final annotated set consisted of 16,443 tweets mentioning at least 20 abuse-prone medications including opioids, benzodiazepines, atypical antipsychotics, central nervous system stimulants, and gamma-aminobutyric acid analogs. Our final overall IAA was 0.86 (Cohen kappa), which represents high agreement.
The manual annotation process revealed the variety of ways in which prescription medication misuse or abuse is discussed on Twitter, including expressions indicating coingestion, nonmedical use, nonstandard route of intake, and consumption above the prescribed doses. Among machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) over the test set (n=3271). CONCLUSIONS Our manual analysis and annotations of a large number of tweets have revealed types of information posted on Twitter about a set of abuse-prone prescription medications and their distributions. In the interests of reproducible and community-driven research, we have made our detailed annotation guidelines and the training data for the classification experiments publicly available, and the test data will be used in future shared tasks.
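The reported accuracy interval (73.00%, 95% CI 71.4-74.5, n=3271) can be roughly reproduced with a normal-approximation interval for a proportion. This is a sketch under an assumption: the abstract does not say which interval method was used, so the approximation below (and the helper name `accuracy_ci`) is illustrative only.

```python
import math


def accuracy_ci(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy on n items."""
    # Standard error of a binomial proportion.
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


# Figures from the abstract: 73.00% accuracy on a test set of 3271 tweets.
low, high = accuracy_ci(0.73, 3271)
print(f"95% CI: {low:.3f} to {high:.3f}")
```

The result is about 0.715 to 0.745, which is close to the published bounds. The small difference at the lower end could come from rounding or from a different interval method.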

The Bigger Picture: Combining Econometrics with Analytics Improves Forecasts of Movie Success

Management Science ◽

10.1287/mnsc.2020.3911 ◽

2021 ◽

Author(s):

Steven F. Lehrer ◽

Tian Xie

Keyword(s):

Machine Learning ◽

Social Media ◽

Big Data ◽

Predictive Analytics ◽

Big Data Analytics ◽

Forecast Accuracy ◽

Support Vector ◽

Significant Heterogeneity ◽

Social Media Data ◽

Media Data

There exists significant hype regarding how much machine learning and incorporating social media data can improve forecast accuracy in commercial applications. To assess if the hype is warranted, we use data from the film industry in simulation experiments that contrast econometric approaches with tools from the predictive analytics literature. Further, we propose new strategies that combine elements from each literature in a bid to capture richer patterns of heterogeneity in the underlying relationship governing revenue. Our results demonstrate the importance of social media data and value from hybrid strategies that combine econometrics and machine learning when conducting forecasts with new big data sources. Specifically, although both least squares support vector regression and recursive partitioning strategies greatly outperform dimension reduction strategies and traditional econometrics approaches in forecast accuracy, there are further significant gains from using hybrid approaches. Further, Monte Carlo experiments demonstrate that these benefits arise from the significant heterogeneity in how social media measures and other film characteristics influence box office outcomes. This paper was accepted by J. George Shanthikumar, big data analytics.

Hybrid features prediction model of movie quality using Multi-machine learning techniques for effective business resource planning

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201844 ◽

2021 ◽

Vol 40 (5) ◽

pp. 9361-9382 ◽

Cited By ~ 1

Author(s):

Naeem Iqbal ◽

Rashid Ahmad ◽

Faisal Jamil ◽

Do-Hyeun Kim

Keyword(s):

Machine Learning ◽

Social Media ◽

Resource Planning ◽

Experimental Results ◽

Quality Prediction ◽

Classification Models ◽

Hybrid Features ◽

Social Media Data ◽

Media Data

Quality prediction plays an essential role in the business outcome of a product, and because of this business interest it has been studied extensively in the last few years. With advances in machine learning (ML) techniques and the advent of robust and sophisticated ML algorithms, it is necessary to analyze the factors influencing the success of movies. This paper presents a hybrid features prediction model based on pre-released and social media data features, using multiple ML techniques to predict the quality of pre-released movies for effective business resource planning. This study aims to integrate pre-released and social media data features to form a hybrid features-based movie quality prediction (MQP) model. The proposed model comprises two experimental models: (i) predicting movie quality using the original set of features and (ii) deriving a subset of features with principal component analysis to predict the movie success class. This work employs and implements different ML-based classification models, such as Decision Tree (DT), Support Vector Machines with linear and quadratic kernels (L-SVM and Q-SVM), Logistic Regression (LR), Bagged Tree (BT), and Boosted Tree (BOT), to predict the quality of the movies. Different performance measures are used to evaluate the proposed ML-based classification models, such as Accuracy (AC), Precision (PR), Recall (RE), and F-Measure (FM). The experimental results reveal that the BT and BOT classifiers produced higher accuracy than the other classifiers (DT, LR, L-SVM, and Q-SVM). The BT and BOT classifiers achieved accuracies of 90.1% and 89.7%, respectively, which shows the efficiency of the proposed MQP model compared to other state-of-the-art techniques. The proposed work is also compared with existing prediction models, and the experimental results indicate that the proposed MQP model performed slightly better than the other models.
The experimental results will help the movie industry to plan business resources effectively, such as investment, number of screens, and release date.
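The four measures named in this abstract (accuracy, precision, recall, F-measure) all derive from the counts of a confusion matrix. The sketch below computes them for a single positive class. The function name, the "hit"/"flop" labels, and the toy predictions are hypothetical, and the paper's own class scheme and averaging method are not stated in the abstract.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Toy labels, invented for illustration.
y_true = ["hit", "hit", "hit", "flop", "flop", "flop", "flop", "hit"]
y_pred = ["hit", "hit", "flop", "flop", "flop", "hit", "flop", "hit"]
metrics = classification_metrics(y_true, y_pred, positive="hit")
print(metrics)
```

Here there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four measures equal 0.75.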

Predicting ethnicity with data on personal names in Russia

10.31235/osf.io/wf6p4 ◽

2021 ◽

Author(s):

Alexey Bessudnov ◽

Denis Tarasov ◽

Viacheslav Panasovets ◽

Veronica Kostenko ◽

Ivan Smirnov ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Ethnic Groups ◽

Geographical Location ◽

Ethnic Relations ◽

Social Media Data ◽

Personal Names ◽

Learning Classifier ◽

Media Data

In this paper we develop a machine learning classifier that predicts perceived ethnicity from data on personal names for major ethnic groups populating Russia. We collect data from VK, the largest Russian social media website. Ethnicity was determined from the languages spoken by users and their geographical location, with the data manually cleaned by crowd workers. The classifier achieves an accuracy of 0.82 for a scheme with 24 ethnic groups and 0.92 for 15 aggregated ethnic groups. It can be used for research on ethnicity and ethnic relations in Russia, in particular with VK and other social media data.

Sentiment Analysis in Social Media using Machine Learning Techniques

Iraqi Journal of Science ◽

10.24996/ijs.2020.61.1.22 ◽

2020 ◽

pp. 193-201 ◽

Cited By ~ 1

Author(s):

Hayder A. Alatabi ◽

Ayad R. Abbas

Keyword(s):

Machine Learning ◽

Social Media ◽

Sentiment Analysis ◽

Machine Learning Techniques ◽

Great Success ◽

Social Media Data ◽

Learning Techniques ◽

The World ◽

Analysis System ◽

Media Data

In recent years, social media has achieved widespread use worldwide; statistics indicate that more than three billion people are on social media, leading to large quantities of data online. To analyze these large quantities of data, a special classification method known as sentiment analysis is used. This paper presents a new sentiment analysis system based on machine learning techniques, which aims to create a process to extract the polarity from social media texts. Using machine learning techniques, sentiment analysis has achieved great success around the world. This paper investigates this topic and proposes a sentiment analysis system built on the Bayesian Rough Decision Tree (BRDT) algorithm. The experimental results show the success of this system, with an accuracy of more than 95% on social media data.

Communication Sentiment Analyzer using Machine Learning with Naive Bayes Bernoullinb

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1610.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 5976-5979

Keyword(s):

Machine Learning ◽

Social Media ◽

Major Part ◽

Naive Bayes ◽

Naïve Bayes ◽

User Preferences ◽

Social Media Data ◽

Machine Learning Model ◽

The World ◽

Media Data

In this never-ending social media era, it is estimated that over 5 billion people use smartphones. Of these, over 1.5 billion are active users worldwide. We are all a major part of this, and before opening our messages we are all curious about what we have received; no doubt, we always hope to receive a good message. Sentiment analysis on social media data has therefore been seen by many as an effective tool to monitor user preferences and inclination. We propose a scalable machine learning model to analyze the polarity of a communicative text using the Bernoulli Naive Bayes classifier. This paper considers only two polarities, that is, whether the sentence is positive or negative. The Bernoulli classifier is used because it is best suited for binary inputs, which in turn raises the accuracy to as much as 97%.
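A Bernoulli Naive Bayes classifier treats each vocabulary word as a binary feature (present or absent in the text) and scores both presence and absence for each class. The following standard-library sketch illustrates that idea for two polarities. It is not the paper's implementation: the function names, the Laplace smoothing value, the whitespace tokenization, and the toy sentences are all assumptions made for illustration.

```python
import math


def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Fit per-class word-presence probabilities with Laplace smoothing."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    model = {"vocab": vocab, "log_prior": {}, "p_word": {}}
    for c in sorted(set(labels)):
        class_docs = [set(d.lower().split())
                      for d, y in zip(docs, labels) if y == c]
        n_c = len(class_docs)
        model["log_prior"][c] = math.log(n_c / len(docs))
        # P(word present | class), smoothed so it is never 0 or 1.
        model["p_word"][c] = {
            w: (sum(w in d for d in class_docs) + alpha) / (n_c + 2 * alpha)
            for w in vocab
        }
    return model


def predict(model, text):
    """Return the class with the highest log posterior score."""
    present = set(text.lower().split())
    scores = {}
    for c, log_prior in model["log_prior"].items():
        score = log_prior
        for w in model["vocab"]:
            p = model["p_word"][c][w]
            # Bernoulli model: absent words contribute log(1 - p).
            score += math.log(p) if w in present else math.log(1 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy training sentences, invented for illustration.
docs = ["great news love it", "so happy today", "love this good day",
        "terrible news hate it", "so sad today", "hate this bad day"]
labels = ["positive"] * 3 + ["negative"] * 3
model = train_bernoulli_nb(docs, labels)
print(predict(model, "love happy day"))
print(predict(model, "hate bad news"))
```

Words outside the training vocabulary are ignored at prediction time. A production system would more likely rely on an established library implementation than on hand-written code like this.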

An unsupervised machine learning model for discovering latent infectious diseases using social media data

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2016.12.007 ◽

2017 ◽

Vol 66 ◽

pp. 82-94 ◽

Cited By ~ 43

Author(s):

Sunghoon Lim ◽

Conrad S. Tucker ◽

Soundar Kumara

Keyword(s):

Machine Learning ◽

Social Media ◽

Infectious Diseases ◽

Learning Model ◽

Unsupervised Machine Learning ◽

Social Media Data ◽

Machine Learning Model ◽

Media Data

Mining social media for prescription medication abuse monitoring: a review and proposal for a data-centric framework

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocz162 ◽

2019 ◽

Vol 27 (2) ◽

pp. 315-329 ◽

Cited By ~ 6

Author(s):

Abeed Sarker ◽

Annika DeRoos ◽

Jeanmarie Perrone

Keyword(s):

Social Media ◽

Language Processing ◽

Prescription Medication ◽

Inclusion Criteria ◽

Major Health Problem ◽

Social Media Data ◽

Use Of Data ◽

Media Source ◽

Multiple Characteristics ◽

Media Data

Abstract Objective Prescription medication (PM) misuse and abuse is a major health problem globally, and a number of recent studies have focused on exploring social media as a resource for monitoring nonmedical PM use. Our objectives are to present a methodological review of social media–based PM abuse or misuse monitoring studies, and to propose a potential generalizable, data-centric processing pipeline for the curation of data from this resource. Materials and Methods We identified studies involving social media, PMs, and misuse or abuse (inclusion criteria) from Medline, Embase, Scopus, Web of Science, and Google Scholar. We categorized studies based on multiple characteristics including but not limited to data size; social media source(s); medications studied; and primary objectives, methods, and findings. Results A total of 39 studies met our inclusion criteria, with 31 (∼79.5%) published since 2015. Twitter has been the most popular resource, with Reddit and Instagram gaining popularity recently. Early studies focused mostly on manual, qualitative analyses, with a growing trend toward the use of data-centric methods involving natural language processing and machine learning. Discussion There is a paucity of standardized, data-centric frameworks for curating social media data for task-specific analyses and near real-time surveillance of nonmedical PM use. Many existing studies do not quantify human agreements for manual annotation tasks or take into account the presence of noise in data. Conclusion The development of reproducible and standardized data-centric frameworks that build on the current state-of-the-art methods in data and text mining may enable effective utilization of social media data for understanding and monitoring nonmedical PM use.

Analysis of Social Media Data to Classify and Detect Frequent Issues Using Machine Learning Approach

2020 2nd International Conference on Advanced Information and Communication Technology (ICAICT) ◽

10.1109/icaict51780.2020.9333452 ◽

2020 ◽

Author(s):

Pankaj Bhowmik ◽

Md. Sohrawordi ◽

U.A. Md. Ehsan Ali ◽

Md. Najmul Hasan ◽

Prodip Kumar Roy

Keyword(s):

Machine Learning ◽

Social Media ◽

Learning Approach ◽

Social Media Data ◽

Machine Learning Approach ◽

Media Data

Assessing Patient-Perceived Hospital Service Quality and Sentiment in Malaysian Public Hospitals Using Machine Learning and Facebook Reviews

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18189912 ◽

2021 ◽

Vol 18 (18) ◽

pp. 9912

Author(s):

Afiq Izzudin A. Rahim ◽

Mohd Ismail Ibrahim ◽

Kamarul Imran Musa ◽

Sook-Ling Chua ◽

Najib Majdi Yaacob

Keyword(s):

Machine Learning ◽

Social Media ◽

Quality Of Care ◽

Service Quality ◽

Hospital Quality ◽

Public Hospitals ◽

Social Media Data ◽

Positive Sentiment ◽

Media Data

Social media is emerging as a new avenue for hospitals and patients to solicit input on the quality of care. However, social media data is unstructured and enormous in volume. Moreover, no empirical research on the use of social media data and perceived hospital quality of care based on patient online reviews has been performed in Malaysia. The purpose of this study was to investigate the determinants of positive sentiment expressed in hospital Facebook reviews in Malaysia, as well as the association between hospital accreditation and sentiments expressed in Facebook reviews. From 2017 to 2019, we retrieved comments from 48 official public hospitals’ Facebook pages. We used machine learning to build a sentiment analyzer and service quality (SERVQUAL) classifier that automatically classifies the sentiment and SERVQUAL dimensions. We used logistic regression analysis to address these objectives. We evaluated a total of 1852 reviews, and our machine learning sentiment analyzer classified 72.1% of reviews as positive and 27.9% as negative. We classified 240 reviews as tangible, 1257 reviews as trustworthy, 125 reviews as responsive, 356 reviews as assurance, and 1174 reviews as empathy using our machine learning SERVQUAL classifier. After adjusting for hospital characteristics, all SERVQUAL dimensions except Tangible were associated with positive sentiment. However, no significant relationship between hospital accreditation and online sentiment was discovered. Facebook reviews powered by machine learning algorithms provide valuable, real-time data that may be missed by traditional hospital quality assessments. Additionally, online patient reviews offer a hitherto untapped indication of quality that may benefit all healthcare stakeholders. Our results confirm prior studies and support the use of Facebook reviews as an adjunct method for assessing the quality of hospital services in Malaysia.
