Developing an automatic pipeline for analyzing chatter about health services from social media: A case study for Medicaid
Abstract

Objective
Social media can be an effective but challenging resource for conducting close-to-real-time assessments of consumers’ perceptions about health services. Our objective was to develop and evaluate an automatic pipeline, involving natural language processing and machine learning, for automatically characterizing user-posted Twitter data about Medicaid.

Material and Methods
We collected Twitter data via the public API using Medicaid-related keywords (Corpus-1), and the website’s search option using agency-specific handles (Corpus-2). We manually labeled a sample of tweets into five pre-determined categories or other, and artificially increased the number of training posts from specific low-frequency categories. We trained and evaluated several supervised learning algorithms using manually-labeled data, and applied the best-performing classifier to collected tweets for post-classification analyses assessing the utility of our methods.

Results
We collected 628,411 and 27,377 tweets for Corpus-1 and -2, respectively. We manually annotated 9,571 (Corpus-1: 8,180; Corpus-2: 1,391) tweets, using 7,923 (82.8%) for training and 1,648 (17.2%) for evaluation. A BERT-based (bidirectional encoder representations from transformers) classifier obtained the highest accuracies (83.9%, Corpus-1; 86.4%, Corpus-2), outperforming the second-best classifier (SVMs: 79.6%; 76.4%). Post-classification analyses revealed differing inter-corpora distributions of tweet categories, with political (63%) and consumer-feedback (43%) tweets being most frequent for Corpus-1 and -2, respectively.

Discussion and Conclusion
The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed pipeline presents a feasible solution for automatic categorization, and can be deployed/generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies (LINK_TO_BE_AVAILABLE).
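
To make the training-set step concrete, the following is a minimal sketch of one way to raise the number of training posts in low-frequency categories and to score a classifier by accuracy. The abstract does not state how the additional posts were produced, so random duplication is used here purely as an assumed stand-in. The function names (`oversample`, `accuracy`), the `min_count` parameter, and the toy posts are hypothetical; only the category names "political" and "consumer-feedback" come from the abstract.

```python
import random


def oversample(texts, labels, min_count, seed=0):
    """Duplicate randomly chosen posts until every category has min_count examples.

    ASSUMPTION: random duplication stands in for whatever method the authors
    used to increase low-frequency categories; the abstract does not say.
    """
    rng = random.Random(seed)

    # Group the original posts by category.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # Start from the original data and append duplicates where needed.
    out_texts, out_labels = list(texts), list(labels)
    for label, items in by_label.items():
        shortfall = max(0, min_count - len(items))
        for _ in range(shortfall):
            out_texts.append(rng.choice(items))
            out_labels.append(label)
    return out_texts, out_labels


def accuracy(gold, predicted):
    """Fraction of posts whose predicted category matches the manual label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical toy data, not drawn from the study's corpora.
toy_texts = ["post a", "post b", "post c", "post d"]
toy_labels = ["political", "political", "political", "consumer-feedback"]
balanced_texts, balanced_labels = oversample(toy_texts, toy_labels, min_count=3)
```

Oversampling of this kind would be applied to the training portion only, so that the evaluation set keeps the category distribution of the manually annotated sample.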