Development of a global infectious disease activity database using natural language processing, machine learning, and human expertise

2019 ◽  
Vol 26 (11) ◽  
pp. 1355-1359 ◽  
Author(s):  
Joshua Feldman ◽  
Andrea Thomas-Bachli ◽  
Jack Forsyth ◽  
Zaki Hasnain Patel ◽  
Kamran Khan

Abstract Objective We assessed whether machine learning can be used to efficiently extract infectious disease activity information from online media reports. Materials and Methods We curated a data set of labeled media reports (n = 8322) indicating which articles contain updates about disease activity, and trained a classifier on this data set. To validate our system, we used a held-out test set and compared our articles to the World Health Organization (WHO) Disease Outbreak News reports. Results Our classifier achieved a recall of 88.8% and a precision of 86.1%. The overall surveillance system detected 94% of the outbreaks identified by the WHO that were covered by online media (89% of outbreaks had such coverage) and did so 43.4 (IQR: 9.5–61) days earlier on average. Discussion We constructed a global real-time disease activity database surveilling 114 illnesses and syndromes. The system must be further assessed for bias, representativeness, granularity, and accuracy. Conclusion Machine learning, natural language processing, and human expertise can be used to efficiently identify disease activity from digital media reports.
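To make the evaluation step concrete, the following is a minimal sketch of training a media-report classifier and scoring it on a held-out split. The sample reports, labels, TF-IDF features, and logistic regression model are all illustrative assumptions, not the authors' actual system or corpus.

```python
# Hedged sketch: classify media reports as disease-activity updates (1) or
# not (0), then measure recall and precision on a held-out test set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical labeled reports; 1 = contains a disease-activity update.
reports = [
    "Officials confirm 12 new measles cases in the northern district",
    "Health ministry reports cholera outbreak after severe flooding",
    "Hospital opens a new pediatric wing downtown",
    "Mayor announces city marathon for charity",
    "Avian influenza detected at two poultry farms",
    "Local school wins national science competition",
    "Dengue cases triple compared with last month",
    "New art museum attracts record visitors",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    reports, labels, test_size=0.25, random_state=0, stratify=labels
)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print(f"recall:    {recall_score(y_test, pred, zero_division=0):.3f}")
print(f"precision: {precision_score(y_test, pred, zero_division=0):.3f}")
```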

2020 ◽  
Author(s):  
Patrick James Ward ◽  
April M Young

BACKGROUND Public health surveillance is critical to detecting emerging population health threats and improvements. Surveillance data have increased in size and complexity, posing challenges to data management and analysis. Natural language processing (NLP) and machine learning (ML) are valuable tools for analyzing unstructured free-text data and have been used in innovative ways to examine a variety of health outcomes. OBJECTIVE Given the cross-disciplinary applications of NLP and ML, research on their applications in surveillance has been disseminated across a variety of outlets. The aim of this narrative review was therefore to describe the current state of NLP and ML use in surveillance science and to identify directions for future research. METHODS Information was abstracted from articles, identified through a PubMed search, describing the use of natural language processing and machine learning in public health surveillance. RESULTS Twenty-two articles met the review criteria: 12 involving traditional surveillance data sources and 10 involving online media sources. The traditional sources analyzed with NLP and ML consisted primarily of death certificates (n=6) and hospital data (n=5), while the online media studies most often drew on Twitter (n=8). CONCLUSIONS The reviewed articles demonstrate the potential of NLP and ML to enhance surveillance by improving timeliness, identifying cases in the absence of standardized case definitions, and enabling the mining of social media for public health surveillance.


2022 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Krishnadas Nanath ◽  
Supriya Kaitheri ◽  
Sonia Malik ◽  
Shahid Mustafa

Purpose The purpose of this paper is to examine the factors that significantly affect the prediction of fake news from the virality theory perspective. The paper looks at a mix of emotion-driven content, sentimental resonance, topic modeling, and linguistic features of news articles to predict the probability of fake news. Design/methodology/approach A data set of over 12,000 articles was chosen to develop a model for fake news detection. Machine learning algorithms and natural language processing techniques were used to handle the big data efficiently. Lexicon-based emotion analysis provided eight kinds of emotions used in the article text. Clusters of topics were extracted using topic modeling (five topics), while sentiment analysis provided the resonance between the title and the text. Linguistic features were added to the coding outcomes to develop a logistic regression predictive model for testing the significant variables. Other machine learning algorithms were also executed and compared. Findings The results revealed that positive emotions in a text lower the probability of the news being fake. Sensational content, such as reports of illegal activities and crime, was associated with fake news. News whose title and text exhibited similar sentiments was found to have a lower chance of being fake. Titles with more words and bodies with fewer words were found to significantly affect fake news detection. Practical implications Several systems and social media platforms today are trying to implement fake news detection methods to filter content. This research provides parameters, grounded in virality theory, that could help develop automated fake news detectors. Originality/value While several studies have explored fake news detection, this study adopts the new perspective of virality theory. It also introduces new parameters, such as sentimental resonance, that could help predict fake news. The study deals with an extensive data set and uses advanced natural language processing to automate the coding techniques used in developing the prediction model.
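The final modeling step described above can be sketched as a logistic regression over coded article features. The feature names and the tiny data set below are invented stand-ins for the paper's emotion, topic, sentiment-resonance, and linguistic codings.

```python
# Hedged sketch: logistic regression over hand-coded article features,
# inspecting the coefficient sign for each (hypothetical) predictor.
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = [
    "positive_emotion",      # lexicon-based positive-emotion score
    "crime_topic",           # topic-model weight for crime-related content
    "title_text_resonance",  # sentiment agreement between title and body
    "title_word_count",
    "body_word_count",
]

# Each row codes one article; y = 1 marks fake news.
X = np.array([
    [0.8, 0.1, 0.9, 8, 600],
    [0.2, 0.7, 0.3, 15, 150],
    [0.6, 0.2, 0.8, 9, 450],
    [0.1, 0.9, 0.2, 14, 120],
    [0.7, 0.3, 0.7, 10, 500],
    [0.3, 0.8, 0.4, 16, 180],
])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name:>22}: {coef:+.3f}")
```

A negative coefficient on positive_emotion in such a model would mirror the reported finding that positive emotions lower the probability of news being fake.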


Author(s):  
Ayushi Mitra

Sentiment analysis, also known as opinion mining or emotion AI, is an ongoing field that applies Natural Language Processing and text analysis to extract, quantify, and study the emotional states expressed in a given piece of text. It remains an area of active progress within text mining. Sentiment analysis is used in many corporations to review products and social media comments, often simply to check whether a text is positive, negative, or neutral. In this research work we adopt a rule-based approach that defines a set of rules over inputs from classic Natural Language Processing techniques, such as stemming, tokenization, part-of-speech tagging, and parsing, combined with machine learning for sentiment analysis, implemented in Python.
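A rule-based pipeline of this kind can be outlined in a few lines of plain Python. The polarity lexicon and negation rule below are hypothetical; a fuller system would add stemming and part-of-speech tagging (for example, via NLTK), as the text describes.

```python
# Hedged sketch: lexicon-based sentiment with a simple negation rule.
# The word lists are tiny illustrative stand-ins for a real lexicon.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "poor", "hate", "sad"}
NEGATORS = {"not", "no", "never"}

def sentiment(text: str) -> str:
    score, negate = 0, False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATORS:
            negate = True        # flip the polarity of the next hit
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The product is not bad at all"))    # positive
print(sentiment("I hate the new interface"))         # negative
print(sentiment("The package arrived on Tuesday"))   # neutral
```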


2017 ◽  
Vol 25 (4) ◽  
pp. 1170-1187 ◽  
Author(s):  
Madhav Erraguntla ◽  
Josef Zapletal ◽  
Mark Lawley

The impact of infectious disease on human populations is a function of many factors, including environmental conditions, vector dynamics, transmission mechanics, social and cultural behaviors, and public policy. A comprehensive framework for disease management must fully connect the complete disease lifecycle, including emergence from reservoir populations, zoonotic vector transmission, and impact on human societies. The Framework for Infectious Disease Analysis is a software environment and conceptual architecture for data integration, situational awareness, visualization, prediction, and intervention assessment. The framework automatically collects biosurveillance data using natural language processing, integrates structured and unstructured data from multiple sources, applies advanced machine learning, and uses multi-modeling for analyzing disease dynamics and testing interventions in complex, heterogeneous populations. In the illustrative case studies, natural language processing of social media, news feeds, and websites was used for information extraction, biosurveillance, and situational awareness. Classification machine learning algorithms (support vector machines, random forests, and boosting) were used for disease prediction.
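A comparison of the classifier families named above can be sketched with scikit-learn. The synthetic data below stands in for the framework's real surveillance features; only the model families (support vector machines, random forests, and boosting) come from the text.

```python
# Hedged sketch: cross-validated comparison of SVM, random forest, and
# gradient boosting on a synthetic stand-in for disease-prediction data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic features standing in for real surveillance signals.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "SVM": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```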


2021 ◽  
Vol 17 (1) ◽  
pp. 39-52
Author(s):  
Aditya Kamleshbhai Lakkad ◽  
Rushit Dharmendrabhai Bhadaniya ◽  
Vraj Nareshkumar Shah ◽  
Lavanya K.

The explosive growth of news content generated worldwide, coupled with its expansion through online media and rapid access to data, has made the monitoring and screening of news tedious. There is a growing need for a model that can preprocess, analyze, and classify this content to extract interpretable information, specifically by recognizing topics and content-driven groupings of articles. This paper proposes automated analysis of heterogeneous news through complex event processing (CEP) and machine learning (ML) algorithms. News content is first streamed using Apache Kafka and stored in Apache Druid, then processed by a blend of natural language processing (NLP) and unsupervised machine learning (ML) techniques.
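The unsupervised grouping step might look like the sketch below, which assumes the articles have already been consumed from the stream (for example, via kafka-python's KafkaConsumer) and are held as plain text. The sample articles, TF-IDF features, and k-means clustering are illustrative choices, not the paper's exact pipeline.

```python
# Hedged sketch: group already-streamed news articles into content-driven
# clusters with TF-IDF features and k-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# In the described system these would arrive from Kafka and Druid;
# here they are hard-coded placeholders.
articles = [
    "Central bank raises interest rates amid inflation fears",
    "Stock markets rally after the rate decision",
    "New vaccine shows promise against seasonal flu",
    "Hospitals report a drop in flu admissions this winter",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for label, article in zip(clusters, articles):
    print(label, article)
```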


2021 ◽  
Author(s):  
Sungkyu Park ◽  
Sungwon Han ◽  
Jeongwook Kim ◽  
Mir Majid Molaie ◽  
Hoang Dieu Vu ◽  
...  

BACKGROUND COVID-19, caused by SARS-CoV-2, has led to a global pandemic. The World Health Organization has also declared an infodemic (ie, a plethora of information regarding COVID-19, containing both false and accurate information, circulated on the internet). Hence, it has become critical to test the veracity of information shared online and analyze the evolution of topics discussed among citizens in relation to the pandemic. OBJECTIVE This research analyzes the public discourse on COVID-19. It characterizes risk communication patterns in four Asian countries with outbreaks at varying degrees of severity: South Korea, Iran, Vietnam, and India. METHODS We collected tweets on COVID-19 from four Asian countries in the early phase of the disease outbreak, from January to March 2020. The data set was collected using relevant keywords in each language, as suggested by locals. We present a method to automatically extract a time–topic cohesive relationship in an unsupervised fashion based on natural language processing. The extracted topics were evaluated qualitatively based on their semantic meanings. RESULTS This research found that each government’s official phases of the epidemic were not well aligned with the degree of public attention represented by the daily tweet counts. Inspired by the issue-attention cycle theory, the presented natural language processing model can identify meaningful transition phases in the discussed topics among citizens. The analysis revealed an inverse relationship between the tweet count and topic diversity. CONCLUSIONS This paper compares similarities and differences of pandemic-related social media discourse in Asian countries. We observed multiple prominent peaks in the daily tweet counts across all countries, indicating multiple issue-attention cycles. Our analysis identified which topics the public concentrated on; some of these topics were related to misinformation and hate speech. These findings and the ability to quickly identify key topics can empower global efforts to fight against an infodemic during a pandemic.
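One simple reading of the per-window topic extraction is sketched below, fitting a toy single-topic LDA model to each time window and printing its top terms. The tweets, windows, and model settings are invented placeholders; the paper's actual unsupervised model may differ substantially.

```python
# Hedged sketch: extract the dominant terms per time window with a
# single-topic LDA fit (a toy stand-in for the paper's method).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

windows = {
    "January": [
        "first confirmed case reported in the capital",
        "airport screening begins for arriving passengers",
        "officials trace contacts of the first confirmed case",
    ],
    "February": [
        "schools closed as confirmed cases rise sharply",
        "mask shortage reported in several pharmacies",
        "cases rise despite citywide lockdown measures",
    ],
}

for window, tweets in windows.items():
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(tweets)
    lda = LatentDirichletAllocation(n_components=1, random_state=0).fit(counts)
    terms = vectorizer.get_feature_names_out()
    top = lda.components_[0].argsort()[::-1][:3]
    print(window, "->", [terms[i] for i in top])
```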


2020 ◽  
pp. 3-17
Author(s):  
Peter Nabende

Natural Language Processing for under-resourced languages is now a mainstream research area. However, there are limited studies on Natural Language Processing applications for many indigenous East African languages. As a contribution toward closing this knowledge gap, this paper focuses on evaluating the application of well-established machine translation methods for one heavily under-resourced indigenous East African language called Lumasaaba. Specifically, we review the most common machine translation methods in the context of Lumasaaba, including both rule-based and data-driven methods. We then apply a state-of-the-art data-driven machine translation method to learn models for automating translation between Lumasaaba and English using a very limited data set of parallel sentences. Automatic evaluation results show that a transformer-based Neural Machine Translation model architecture leads to consistently better BLEU scores than the recurrent neural network-based models. Moreover, the automatically generated translations can be comprehended to a reasonable extent and are usually associated with the source language input.
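The BLEU comparison reported above can be illustrated with sacreBLEU. The English reference sentences and the two systems' outputs below are invented placeholders, not data from the study.

```python
# Hedged sketch: corpus-level BLEU for two hypothetical systems against
# one reference stream, mirroring the transformer-vs-RNN comparison.
import sacrebleu

references = [[
    "the children are going to school",
    "she is cooking food for the visitors",
]]
transformer_output = [
    "the children are going to school",
    "she is cooking the food for the visitors",
]
rnn_output = [
    "children going school",
    "she cooking food visitors",
]

print("transformer BLEU:", sacrebleu.corpus_bleu(transformer_output, references).score)
print("RNN BLEU:        ", sacrebleu.corpus_bleu(rnn_output, references).score)
```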


Author(s):  
Rohan Pandey ◽  
Vaibhav Gautam ◽  
Ridam Pal ◽  
Harsh Bandhey ◽  
Lovedeep Singh Dhingra ◽  
...  

BACKGROUND The COVID-19 pandemic has uncovered the potential of digital misinformation in shaping the health of nations. The deluge of unverified information that spreads faster than the epidemic itself is an unprecedented phenomenon that has put millions of lives in danger. Mitigating this ‘infodemic’ requires strong health messaging systems that are engaging, vernacular, scalable, and effective, and that continuously learn the new patterns of misinformation. OBJECTIVE We created WashKaro, a multi-pronged intervention for mitigating misinformation through conversational AI, machine translation, and natural language processing. WashKaro provides the right information, matched against WHO guidelines through AI, and delivers it in the right format in local languages. METHODS We theorize (i) an NLP-based AI engine that could continuously incorporate user feedback to improve the relevance of information, (ii) bite-sized audio in the local language to improve penetrance in a country with skewed gender literacy ratios, and (iii) conversational but interactive AI engagement with users toward increased health awareness in the community. RESULTS A total of 5026 people downloaded the app during the study window; among them, 1545 were active users. Our study shows that 3.4 times more females than males engaged with the app in Hindi, the relevance of AI-filtered news content doubled within 45 days of continuous machine learning, and the prudence of the integrated AI chatbot “Satya” increased, demonstrating the usefulness of an mHealth platform for mitigating health misinformation. CONCLUSIONS We conclude that a multi-pronged machine learning application delivering vernacular bite-sized audios and conversational AI is an effective approach to mitigate health misinformation. CLINICALTRIAL Not Applicable
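One plausible reading of the "right information" matching step is a similarity search over guideline text, sketched below. The guideline snippets are paraphrased placeholders and the TF-IDF retrieval is an assumption; WashKaro's actual engine is not specified here.

```python
# Hedged sketch: retrieve the guideline snippet most similar to a user
# message using TF-IDF vectors and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

guidelines = [
    "Wash hands frequently with soap and water for at least 20 seconds",
    "Maintain physical distance from people who are coughing or sneezing",
    "Wear a mask in crowded indoor settings",
]

query = "how long should I wash my hands"

vectorizer = TfidfVectorizer().fit(guidelines + [query])
scores = cosine_similarity(
    vectorizer.transform([query]), vectorizer.transform(guidelines)
)[0]
print(guidelines[scores.argmax()])  # best-matching guideline snippet
```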

