Preprint

2021 ◽  
Author(s):  
Sungkyu Park ◽  
Sungwon Han ◽  
Jeongwook Kim ◽  
Mir Majid Molaie ◽  
Hoang Dieu Vu ◽  
...  

BACKGROUND COVID-19, caused by SARS-CoV-2, has led to a global pandemic. The World Health Organization has also declared an infodemic (ie, a plethora of information regarding COVID-19, containing both false and accurate information, circulated on the internet). Hence, it has become critical to test the veracity of information shared online and to analyze the evolution of pandemic-related topics discussed among citizens. OBJECTIVE This research analyzes the public discourse on COVID-19. It characterizes risk communication patterns in four Asian countries with outbreaks of varying degrees of severity: South Korea, Iran, Vietnam, and India. METHODS We collected tweets on COVID-19 from the four Asian countries in the early phase of the disease outbreak, from January to March 2020. The data set was collected using relevant keywords in each language, as suggested by local speakers. We present a method based on natural language processing to automatically extract a time–topic cohesive relationship in an unsupervised fashion. The extracted topics were evaluated qualitatively based on their semantic meanings. RESULTS This research found that each government’s official phases of the epidemic were not well aligned with the degree of public attention represented by the daily tweet counts. Inspired by issue-attention cycle theory, the presented natural language processing model can identify meaningful transition phases in the topics discussed among citizens. The analysis revealed an inverse relationship between the tweet count and topic diversity. CONCLUSIONS This paper compares the similarities and differences of pandemic-related social media discourse across Asian countries. We observed multiple prominent peaks in the daily tweet counts across all countries, indicating multiple issue-attention cycles. Our analysis identified which topics the public concentrated on; some of these topics were related to misinformation and hate speech. These findings, and the ability to quickly identify key topics, can empower global efforts to fight an infodemic during a pandemic.

Author(s):  
Sungkyu Park ◽  
Sungwon Han ◽  
Jeongwook Kim ◽  
Mir Majid Molaie ◽  
Hoang Dieu Vu ◽  
...  

BACKGROUND The novel coronavirus disease (hereafter COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has caused a global pandemic. During this time, a plethora of information regarding COVID-19, containing both false information (misinformation) and accurate information, circulated on social media. The World Health Organization has declared a need to fight not only the pandemic but also the infodemic (a portmanteau of information and epidemic). In this context, it is critical to analyze the quality and veracity of information shared on social media and the evolution of discussions on major topics regarding COVID-19. OBJECTIVE This research characterizes risk communication patterns by analyzing public discourse on the novel coronavirus in four Asian countries that suffered outbreaks of varying degrees of severity: South Korea, Iran, Vietnam, and India. METHODS We collect tweets on COVID-19 posted from the four Asian countries from the start of their respective COVID-19 outbreaks in January through March 2020. We consult with local speakers and use relevant keywords from the local languages, following each country's tweet conventions. We then apply a natural language processing (NLP) method to learn topics automatically in an unsupervised fashion. Finally, we qualitatively label the extracted topics to comprehend their semantic meanings. RESULTS We find that the official phases of the epidemic, as announced by the governments of the studied countries, do not align well with the online attention paid to COVID-19. Motivated by this misalignment, we develop a new natural language processing method to identify transitions between topic phases and compare the identified topics across the four Asian countries. We examine the time lag between social media attention and confirmed patient counts. We confirm an inverse relationship between the tweet count and topic diversity. CONCLUSIONS Through the current research, we observe similarities and differences in the social media discourse on the pandemic in different Asian countries. We observe that once the daily tweet count hits its peak, the subsequent tweet count trend tends to decrease in all countries. This phenomenon aligns with the dynamics of the issue-attention cycle, an existing construct from communication theory conceptualizing how an issue rises and falls from public attention. Little work has been done to identify topics in online risk communication by collectively considering temporal tweet trends in different countries. In this regard, if a critical piece of misinformation can be detected at an early stage in one country, it can be reported to prevent the spread of the same misinformation in other countries. Therefore, this work can help social media services, social media communicators, journalists, policymakers, and medical professionals fight the infodemic on a global scale.
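The topic-phase analysis lends itself to a brief illustration. Below is a minimal sketch, assuming scikit-learn and a toy set of daily tweet windows (both are assumptions; the authors' actual model and data differ): latent Dirichlet allocation is fit per day, and topic diversity is taken as the entropy of the day's average topic mixture, so the inverse relationship between tweet volume and diversity could be tracked over time.

```python
# Minimal sketch, not the authors' exact model: per-day topic extraction
# and a simple topic-diversity score over toy tweet windows.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical daily tweet windows; real data would come from the keyword-based crawl.
daily_tweets = {
    "2020-02-20": ["mask shortage reported at local pharmacies",
                   "health authorities confirm new infection cases",
                   "rumor about garlic curing the virus spreads"],
    "2020-02-21": ["schools closed and quarantine rules extended",
                   "officials announce new testing centers"],
}

for day, tweets in daily_tweets.items():
    X = CountVectorizer().fit_transform(tweets)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    mixtures = lda.transform(X)                      # per-tweet topic mixtures
    mean_mix = mixtures.mean(axis=0)                 # day-level topic mixture
    diversity = float(-(mean_mix * np.log(mean_mix + 1e-12)).sum())  # entropy
    print(day, "tweets:", len(tweets), "topic diversity:", round(diversity, 3))
```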


2019 ◽  
Vol 26 (11) ◽  
pp. 1355-1359 ◽  
Author(s):  
Joshua Feldman ◽  
Andrea Thomas-Bachli ◽  
Jack Forsyth ◽  
Zaki Hasnain Patel ◽  
Kamran Khan

Abstract Objective We assessed whether machine learning can be used to efficiently extract infectious disease activity information from online media reports. Materials and Methods We curated a data set of labeled media reports (n = 8322) indicating which articles contain updates about disease activity. We trained a classifier on this data set. To validate our system, we used a held-out test set and compared our articles to the World Health Organization (WHO) Disease Outbreak News reports. Results Our classifier achieved a recall and precision of 88.8% and 86.1%, respectively. The overall surveillance system detected 94% of the outbreaks identified by the WHO that were covered by online media (89%) and did so 43.4 (IQR: 9.5–61) days earlier on average. Discussion We constructed a global real-time disease activity database surveilling 114 illnesses and syndromes. We must further assess our system for bias, representativeness, granularity, and accuracy. Conclusion Machine learning, natural language processing, and human expertise can be used to efficiently identify disease activity from digital media reports.
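As an illustration of the classification step, here is a minimal sketch assuming TF-IDF features and logistic regression (the abstract does not specify this exact model, and the toy articles and labels are invented): a classifier is trained to flag articles containing disease-activity updates and scored by precision and recall on a held-out split.

```python
# Hedged sketch of a media-report classifier; model choice and data are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

articles = ["Cholera cases rise in region X", "New hospital wing opens",
            "Measles outbreak confirmed in city Y", "Health minister gives speech"] * 50
labels = [1, 0, 1, 0] * 50  # 1 = contains a disease-activity update

X_train, X_test, y_train, y_test = train_test_split(
    articles, labels, test_size=0.25, random_state=0)

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_train), y_train)
preds = clf.predict(vec.transform(X_test))

print("precision:", precision_score(y_test, preds))
print("recall:", recall_score(y_test, preds))
```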


2020 ◽  
pp. 3-17
Author(s):  
Peter Nabende

Natural Language Processing for under-resourced languages is now a mainstream research area. However, there are limited studies on Natural Language Processing applications for many indigenous East African languages. As a contribution toward closing this knowledge gap, this paper focuses on evaluating the application of well-established machine translation methods for one heavily under-resourced indigenous East African language called Lumasaaba. Specifically, we review the most common machine translation methods in the context of Lumasaaba, including both rule-based and data-driven methods. Then we apply a state-of-the-art data-driven machine translation method to learn models for automating translation between Lumasaaba and English using a very limited data set of parallel sentences. Automatic evaluation results show that a transformer-based Neural Machine Translation model architecture leads to consistently better BLEU scores than recurrent neural network-based models. Moreover, the automatically generated translations can be comprehended to a reasonable extent and are usually consistent with the source-language input.
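A small sketch of the automatic evaluation step, assuming the sacrebleu package (the reference sentences and system outputs below are invented English placeholders, not Lumasaaba data from the paper): corpus-level BLEU is computed for a transformer system and an RNN system against the same references.

```python
# Hedged sketch: comparing two MT systems by corpus BLEU with sacrebleu.
import sacrebleu

references = [["the children are going to school", "she is cooking food"]]
transformer_outputs = ["the children are going to school", "she cooks food"]
rnn_outputs = ["children school are going", "she food cooking"]

bleu_transformer = sacrebleu.corpus_bleu(transformer_outputs, references)
bleu_rnn = sacrebleu.corpus_bleu(rnn_outputs, references)
print("Transformer BLEU:", round(bleu_transformer.score, 2))
print("RNN BLEU:", round(bleu_rnn.score, 2))
```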


2021 ◽  
Author(s):  
Monique B. Sager ◽  
Aditya M. Kashyap ◽  
Mila Tamminga ◽  
Sadhana Ravoori ◽  
Christopher Callison-Burch ◽  
...  

BACKGROUND Reddit, the fifth most popular website in the United States, boasts a large and engaged user base on its dermatology forums, where users crowdsource free medical opinions. Unfortunately, much of the advice provided is unvalidated and could lead to inappropriate care. Initial testing has shown that artificially intelligent bots can detect misinformation on Reddit forums and may be able to produce responses to posts containing misinformation. OBJECTIVE To analyze the ability of bots to find and respond to health misinformation on Reddit’s dermatology forums in a controlled test environment. METHODS Using natural language processing techniques, we trained bots to target misinformation using relevant keywords and to post prefabricated responses. We compared the performance of different model architectures on a held-out test set. RESULTS Our models yielded test accuracies ranging from 95% to 100%, with a fine-tuned BERT model achieving the highest test accuracy. Bots were then able to post corrective prefabricated responses to misinformation. CONCLUSIONS Using a limited data set, bots had near-perfect ability to detect these examples of health misinformation within Reddit dermatology forums. Given that these bots can then post prefabricated responses, this technique may allow for interception of misinformation. Providing correct information, even instantly, however, does not mean users will be receptive or find such interventions persuasive. Further work should investigate this strategy’s effectiveness to inform future deployment of bots as a technique in combating health misinformation.
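A hedged sketch of the fine-tuning approach, using the Hugging Face transformers library: a BERT classifier is trained to flag misinformation and a prefabricated reply is printed for flagged posts. The toy posts, labels, and canned reply are assumptions, not the authors' training data or exact setup.

```python
# Illustrative sketch: fine-tune a BERT classifier on a toy batch, then
# reply to posts predicted as misinformation with a prefabricated response.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

POSTS = ["black salve cures melanoma", "sunscreen reduces skin cancer risk"]
LABELS = torch.tensor([1, 0])  # 1 = misinformation, 0 = accurate (assumed labels)
CANNED_REPLY = "Please consult a board-certified dermatologist for advice."

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
enc = tok(POSTS, padding=True, truncation=True, return_tensors="pt")

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the toy batch
    out = model(**enc, labels=LABELS)
    out.loss.backward()
    optim.step()
    optim.zero_grad()

model.eval()
with torch.no_grad():
    preds = model(**enc).logits.argmax(dim=-1)
for post, pred in zip(POSTS, preds):
    if pred.item() == 1:
        print(f"Reply to '{post}': {CANNED_REPLY}")
```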


JAMIA Open ◽  
2021 ◽  
Vol 4 (3) ◽  
Author(s):  
Craig H Ganoe ◽  
Weiyi Wu ◽  
Paul J Barr ◽  
William Haslett ◽  
Michelle D Dannenberg ◽  
...  

Abstract Objectives The objective of this study is to build and evaluate a natural language processing approach to identify medication mentions in primary care visit conversations between patients and physicians. Materials and Methods Eight clinicians contributed to a data set of 85 clinic visit transcripts, and 10 transcripts were randomly selected from this data set as a development set. Our approach utilizes Apache cTAKES and the Unified Medical Language System (UMLS) controlled vocabulary to generate a list of medication candidates in the transcribed text and then applies multiple customized filters to exclude common false positives from this list while retaining some additional common mentions of supplements and immunizations. Results Sixty-five transcripts with 1121 medication mentions were randomly selected as an evaluation set. Our proposed method achieved an F-score of 85.0% for identifying the medication mentions in the test set, significantly outperforming existing medication information extraction systems for medical records, whose F-scores ranged from 42.9% to 68.9% on the same test set. Discussion Our medication information extraction approach for primary care visit conversations showed promising results, extracting about 27% more medication mentions from our evaluation set while eliminating many false positives in comparison to existing baseline systems. We made our approach publicly available on the web as open-source software. Conclusion Integration of our annotation system with clinical recording applications has the potential to improve patients’ understanding and recall of key information from their clinic visits, and, in turn, to positively impact health outcomes.
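The evaluation logic can be illustrated with a tiny sketch: extracted medication mentions are compared against gold annotations per transcript, and precision, recall, and F-score are computed. The transcript IDs and mention sets below are made-up examples, not data from the study.

```python
# Minimal sketch of mention-level precision/recall/F-score computation.
gold = {("visit_01", "lisinopril"), ("visit_01", "metformin"), ("visit_02", "vitamin d")}
predicted = {("visit_01", "lisinopril"), ("visit_02", "vitamin d"), ("visit_02", "aspirin")}

tp = len(gold & predicted)            # true positives: mentions found in both sets
precision = tp / len(predicted)
recall = tp / len(gold)
f_score = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_score:.3f}")
```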


2021 ◽  
Author(s):  
Vishal Dey ◽  
Peter Krasniak ◽  
Minh Nguyen ◽  
Clara Lee ◽  
Xia Ning

BACKGROUND A new illness can come to public attention through social media before it is medically defined, formally documented, or systematically studied. One example is a condition known as breast implant illness (BII), which has been extensively discussed on social media, although it is vaguely defined in the medical literature. OBJECTIVE The objective of this study is to construct a data analysis pipeline to understand emerging illnesses using social media data and to apply the pipeline to understand the key attributes of BII. METHODS We constructed a pipeline of social media data analysis using natural language processing and topic modeling. Mentions related to signs, symptoms, diseases, disorders, and medical procedures were extracted from social media data using the clinical Text Analysis and Knowledge Extraction System (cTAKES). We mapped the mentions to standard medical concepts and then summarized these mapped concepts as topics using latent Dirichlet allocation. Finally, we applied this pipeline to understand BII from several BII-dedicated social media sites. RESULTS Our pipeline identified topics related to toxicity, cancer, and mental health issues that were highly associated with BII. It also showed that cancers, autoimmune disorders, and mental health problems were emerging concerns associated with breast implants, based on social media discussions. Furthermore, the pipeline identified mentions such as rupture, infection, pain, and fatigue as common self-reported issues among the public, as well as concerns about toxicity from silicone implants. CONCLUSIONS Our study could inspire future studies on the suggested symptoms and factors of BII. It provides the first analysis and derived knowledge of BII from social media using natural language processing techniques and demonstrates the potential of using social media information to better understand similar emerging illnesses.
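A brief sketch of the topic-modeling stage, assuming the gensim library and plain strings standing in for the UMLS concepts that cTAKES would extract (the example posts are fabricated for illustration): concept mentions per post are turned into a bag-of-words corpus and summarized with latent Dirichlet allocation.

```python
# Hedged sketch: LDA over per-post concept mentions (placeholders for mapped UMLS concepts).
from gensim import corpora, models

posts_as_concepts = [
    ["fatigue", "joint pain", "silicone implant", "rupture"],
    ["anxiety", "depression", "breast implant", "explantation"],
    ["infection", "capsular contracture", "pain", "silicone implant"],
]

dictionary = corpora.Dictionary(posts_as_concepts)
corpus = [dictionary.doc2bow(doc) for doc in posts_as_concepts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Print the top concepts per topic.
for topic_id, terms in lda.print_topics(num_words=4):
    print(topic_id, terms)
```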


2019 ◽  
Author(s):  
Auss Abbood ◽  
Alexander Ullrich ◽  
Rüdiger Busche ◽  
Stéphane Ghozzi

Abstract According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of epidemiologists sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural-language-processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles’ key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: first, EpiTator, an open-source epidemiological annotation tool, suggested many candidate values for each key datum; then, we trained a naive Bayes classifier to select the single most likely one, using RKI’s EBS database as labels. Then, for relevance scoring, we defined two classes to which any article might belong: the article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using document and word embeddings. Two of the tested algorithms stood out: the multilayer perceptron performed best overall, with a precision of 0.19, recall of 0.50, specificity of 0.89, F1 of 0.28, and the highest tested index balanced accuracy of 0.46. The support-vector machine, on the other hand, had the highest recall (0.88), which can be of greater interest to epidemiologists. Finally, we integrated these functionalities into a web application called EventEpi, in which relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, which will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, already works well and can be used in production, promising improvements in EBS. The source code is publicly available at https://github.com/aauss/EventEpi.
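For the relevance-scoring comparison, here is a minimal sketch with scikit-learn (not the EventEpi code itself; the random "document embeddings" and derived labels are stand-ins): an MLP and an SVM are trained on the same embeddings and compared by precision, recall, and F1.

```python
# Hedged sketch: comparing two relevance classifiers on synthetic embeddings.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))                               # document embeddings
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)  # 1 = relevant

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, clf in [("MLP", MLPClassifier(max_iter=500)), ("SVM", SVC())]:
    clf.fit(X_tr, y_tr)
    p, r, f1, _ = precision_recall_fscore_support(
        y_te, clf.predict(X_te), average="binary")
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```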


10.2196/27386 ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. e27386
Author(s):  
Qingyu Chen ◽  
Alex Rankine ◽  
Yifan Peng ◽  
Elaheh Aghaarabi ◽  
Zhiyong Lu

Background Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for entries to the National Natural Language Processing Clinical Challenges. This work describes our entry, an ensemble model that leverages a range of deep learning (DL) models. Our team from the National Library of Medicine obtained a Pearson correlation of 0.8967 on the official test set during the 2019 National Natural Language Processing Clinical Challenges/Open Health Natural Language Processing shared task and achieved second rank. Objective Although our models strongly correlate with manual annotations, annotator-level correlation was only moderate (weighted Cohen κ=0.60). We are cautious about the potential use of DL models in production systems and argue that it is more critical to evaluate the models in depth, especially those with extremely high correlations. In this study, we benchmark the effectiveness and efficiency of top-ranked DL models. We quantify their robustness and inference times to validate their usefulness in real-time applications. Methods We benchmarked five DL models, which are the top-ranked systems for STS tasks: Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (official evaluation metric) and running time. We further evaluated Spearman correlation, R², and mean squared error as additional measures. Results Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). BioSentVec also had the highest results in 3 of 4 effectiveness measures, followed by BioBERT. However, their robustness to sentence pairs of different similarity levels varies significantly. A particular observation is that the BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs. They cannot capture highly similar sentence pairs effectively when they have different negation terms or word orders. In addition, time efficiency is dramatically different from the effectiveness results. On average, the BERT models were approximately 20 times and 50 times slower than the Convolutional Neural Network and BioSentVec models, respectively. This poses challenges for real-time applications. Conclusions Despite the excitement of further improving Pearson correlations in this data set, our results highlight that evaluations of the effectiveness and efficiency of STS models are critical. In the future, we suggest more evaluation of the generalization capability and user-level testing of the models. We call for community efforts to create more biomedical and clinical STS data sets from different perspectives to reflect the multifaceted notion of sentence-relatedness.
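The evaluation measures named above are easy to reproduce in a small sketch with SciPy and scikit-learn (all numbers below are illustrative, not the study's actual predictions or runs): Pearson and Spearman correlations and mean squared error for one model's predictions, plus a Wilcoxon rank-sum test comparing two models' repeated-run Pearson scores.

```python
# Hedged sketch of the STS evaluation metrics on made-up numbers.
from scipy.stats import pearsonr, spearmanr, ranksums
from sklearn.metrics import mean_squared_error

gold = [0.0, 1.5, 2.5, 3.0, 4.5, 5.0]    # annotated similarity scores (illustrative)
pred = [0.4, 1.2, 2.9, 2.8, 4.1, 4.9]    # one model's predictions (illustrative)

print("Pearson:", pearsonr(gold, pred)[0])
print("Spearman:", spearmanr(gold, pred)[0])
print("MSE:", mean_squared_error(gold, pred))

# Ten repeated runs per model (e.g., one sentence-embedding model vs one BERT variant).
runs_model_a = [0.850, 0.849, 0.851, 0.848, 0.850, 0.852, 0.849, 0.848, 0.851, 0.850]
runs_model_b = [0.848, 0.847, 0.846, 0.849, 0.848, 0.847, 0.848, 0.846, 0.849, 0.847]
print("Wilcoxon rank-sum:", ranksums(runs_model_a, runs_model_b))
```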


2018 ◽  
Vol 10 ◽  
pp. 117822261879286 ◽  
Author(s):  
Glen Coppersmith ◽  
Ryan Leary ◽  
Patrick Crutchley ◽  
Alex Fine

Suicide is among the 10 most common causes of death, as assessed by the World Health Organization. For every death by suicide, an estimated 138 people’s lives are meaningfully affected, and almost any other statistic around suicide deaths is equally alarming. The pervasiveness of social media—and the near-ubiquity of mobile devices used to access social media networks—offers new types of data for understanding the behavior of those who (attempt to) take their own lives and suggests new possibilities for preventive intervention. We demonstrate the feasibility of using social media data to detect those at risk for suicide. Specifically, we use natural language processing and machine learning (specifically deep learning) techniques to detect quantifiable signals around suicide attempts, and describe designs for an automated system for estimating suicide risk, usable by those without specialized mental health training (eg, a primary care doctor). We also discuss the ethical use of such technology and examine privacy implications. Currently, this technology is only used for intervention for individuals who have “opted in” for the analysis and intervention, but the technology enables scalable screening for suicide risk, potentially identifying many people who are at risk preventively and prior to any engagement with a health care system. This raises a significant cultural question about the trade-off between privacy and prevention—we have potentially life-saving technology that is currently reaching only a fraction of the possible people at risk because of respect for their privacy. Is the current trade-off between privacy and prevention the right one?


Author(s):  
Yinjun Hu ◽  
Mengmeng Chen ◽  
Qian Wang ◽  
Yue Zhu ◽  
Bei Wang ◽  
...  

Abstract [Background] On January 7, 2020, the novel coronavirus that causes COVID-19, which aroused worldwide concern, was identified by Chinese scientists. Many related research works have been developed in response to the emerging, rapidly evolving situation of this epidemic. This study aimed to analyze the research literature on SARS, MERS, and COVID-19 to retrieve important information for virologists, epidemiologists, and policy makers. [Methods] In this study, we collected data from multiple sources and compared bibliometric indices for COVID-19, Severe Acute Respiratory Syndrome (SARS), and Middle East Respiratory Syndrome (MERS) up to March 25, 2020. To extract data of comparable quantity and scale, the volume of search results was balanced by limiting the range of publication years. For further analysis, we extracted 1,480 documents from 1,671 candidates using Natural Language Processing technologies. [Results] In total, 13,945 research documents from 7 data sets were selected for analysis. Unlike other topics, research interest in an epidemic tends to reach its peak in the first year after the outbreak. The document type distributions of SARS, MERS, and COVID-19 are nearly the same (less than a 6-point difference for each type); however, there was notable growth in research quality across these three epidemics (Field-Weighted Citation Impact scores of 3.68, 6.63, and 11.35, respectively). Asian countries have less international collaboration (less than 35.1%) than Western countries (more than 49.5%), which deserves as much attention as the research itself. [Conclusions] We found that research interest in an epidemic generally reaches its peak in the first year after the outbreak; however, the peak of research on MERS appeared in the third year because of its renewed outbreak in 2015. Regarding research quality, although quality has improved over time, especially for COVID-19, research on epidemics that did not start in one's own country should not be overlooked. Another important and effective strategy for enhancing epidemic prevention in China and other Asian countries is to continue strengthening international collaboration.

