scholarly journals EventEpi—A natural language processing framework for event-based surveillance

2020 ◽  
Vol 16 (11) ◽  
pp. e1008277
Author(s):  
Auss Abbood ◽  
Alexander Ullrich ◽  
Rüdiger Busche ◽  
Stéphane Ghozzi

According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of public health agents sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural language processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at the RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles’ key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: EpiTator, an open-source epidemiological annotation tool, suggested many different possibilities for each. We extracted the key country and disease using a heuristic with good results. We trained a naive Bayes classifier to find the key date and confirmed-case count, using the RKI’s EBS database as labels which performed modestly. Then, for relevance scoring, we defined two classes to which any article might belong: The article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using bag-of-words, document and word embeddings. The best classifier, a logistic regression, achieved a sensitivity of 0.82 and an index balanced accuracy of 0.61. Finally, we integrated these functionalities into a web application called EventEpi where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, that will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, works already well and can be used in production, promising improvements in EBS. The source code and data are publicly available under open licenses.

2019 ◽  
Author(s):  
Auss Abbood ◽  
Alexander Ullrich ◽  
Rüdiger Busche ◽  
Stéphane Ghozzi

AbstractAccording to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of epidemiologists sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural-language-processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles’ key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: EpiTator, an open-source epidemiological annotation tool, suggested many different possibilities for each. We trained a naive Bayes classifier to find the single most likely one using RKI’s EBS database as labels. Then, for relevance scoring, we defined two classes to which any article might belong: The article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using document and word embeddings. Two of the tested algorithms stood out: The multilayer perceptron performed best overall, with a precision of 0.19, recall of 0.50, specificity of 0.89, F1 of 0.28, and the highest tested index balanced accuracy of 0.46. The support-vector machine, on the other hand, had the highest recall (0.88) which can be of higher interest for epidemiologists. Finally, we integrated these functionalities into a web application called EventEpi where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, that will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, works already well and can be used in production, promising improvements in EBS. The source code is publicly available at https://github.com/aauss/EventEpi.


2022 ◽  
Vol 3 (1) ◽  
pp. 1-23
Author(s):  
Yu Gu ◽  
Robert Tinn ◽  
Hao Cheng ◽  
Michael Lucas ◽  
Naoto Usuyama ◽  
...  

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB .


2019 ◽  
pp. 1-8 ◽  
Author(s):  
Tomasz Oliwa ◽  
Steven B. Maron ◽  
Leah M. Chase ◽  
Samantha Lomnicki ◽  
Daniel V.T. Catenacci ◽  
...  

PURPOSE Robust institutional tumor banks depend on continuous sample curation or else subsequent biopsy or resection specimens are overlooked after initial enrollment. Curation automation is hindered by semistructured free-text clinical pathology notes, which complicate data abstraction. Our motivation is to develop a natural language processing method that dynamically identifies existing pathology specimen elements necessary for locating specimens for future use in a manner that can be re-implemented by other institutions. PATIENTS AND METHODS Pathology reports from patients with gastroesophageal cancer enrolled in The University of Chicago GI oncology tumor bank were used to train and validate a novel composite natural language processing-based pipeline with a supervised machine learning classification step to separate notes into internal (primary review) and external (consultation) reports; a named-entity recognition step to obtain label (accession number), location, date, and sublabels (block identifiers); and a results proofreading step. RESULTS We analyzed 188 pathology reports, including 82 internal reports and 106 external consult reports, and successfully extracted named entities grouped as sample information (label, date, location). Our approach identified up to 24 additional unique samples in external consult notes that could have been overlooked. Our classification model obtained 100% accuracy on the basis of 10-fold cross-validation. Precision, recall, and F1 for class-specific named-entity recognition models show strong performance. CONCLUSION Through a combination of natural language processing and machine learning, we devised a re-implementable and automated approach that can accurately extract specimen attributes from semistructured pathology notes to dynamically populate a tumor registry.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
George Mastorakos ◽  
Aditya Khurana ◽  
Ming Huang ◽  
Sunyang Fu ◽  
Ahmad P. Tafti ◽  
...  

Background. Patients increasingly use asynchronous communication platforms to converse with care teams. Natural language processing (NLP) to classify content and automate triage of these messages has great potential to enhance clinical efficiency. We characterize the contents of a corpus of portal messages generated by patients using NLP methods. We aim to demonstrate descriptive analyses of patient text that can contribute to the development of future sophisticated NLP applications. Methods. We collected approximately 3,000 portal messages from the cardiology, dermatology, and gastroenterology departments at Mayo Clinic. After labeling these messages as either Active Symptom, Logistical, Prescription, or Update, we used NER (named entity recognition) to identify medical concepts based on the UMLS library. We hierarchically analyzed the distribution of these messages in terms of departments, message types, medical concepts, and keywords therewithin. Results. Active Symptom and Logistical content types comprised approximately 67% of the message cohort. The “Findings” medical concept had the largest number of keywords across all groupings of content types and departments. “Anatomical Sites” and “Disorders” keywords were more prevalent in Active Symptom messages, while “Drugs” keywords were most prevalent in Prescription messages. Logistical messages tended to have the lower proportions of “Anatomical Sites,”, “Disorders,”, “Drugs,”, and “Findings” keywords when compared to other message content types. Conclusions. This descriptive corpus analysis sheds light on the content and foci of portal messages. The insight into the content and differences among message themes can inform the development of more robust NLP models.


2020 ◽  
Vol 6 ◽  
Author(s):  
David Owen ◽  
Laurence Livermore ◽  
Quentin Groom ◽  
Alex Hardisty ◽  
Thijs Leegwater ◽  
...  

We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on some of the state-of-the-art technologies. Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images. Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets, and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text. Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. Potentially, NER could be used in conjunction with other online services, such as those of the Biodiversity Heritage Library to map the named entities to entities in the biodiversity literature (https://www.biodiversitylibrary.org/docs/api3.html). We have highlighted the main recommendations for potential pipeline components. The document also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.


Author(s):  
Ayush Srivastav ◽  
Hera Khan ◽  
Amit Kumar Mishra

The chapter provides an eloquent account of the major methodologies and advances in the field of Natural Language Processing. The most popular models that have been used over time for the task of Natural Language Processing have been discussed along with their applications in their specific tasks. The chapter begins with the fundamental concepts of regex and tokenization. It provides an insight to text preprocessing and its methodologies such as Stemming and Lemmatization, Stop Word Removal, followed by Part-of-Speech tagging and Named Entity Recognition. Further, this chapter elaborates the concept of Word Embedding, its various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing is provided next, followed by Neural Networks and its advanced forms such as Recursive Neural Networks and Seq2seq models that are used in Computational Linguistics. A brief description of chatbots and Memory Networks concludes the chapter.


2018 ◽  
Vol 10 ◽  
pp. 117822261879286 ◽  
Author(s):  
Glen Coppersmith ◽  
Ryan Leary ◽  
Patrick Crutchley ◽  
Alex Fine

Suicide is among the 10 most common causes of death, as assessed by the World Health Organization. For every death by suicide, an estimated 138 people’s lives are meaningfully affected, and almost any other statistic around suicide deaths is equally alarming. The pervasiveness of social media—and the near-ubiquity of mobile devices used to access social media networks—offers new types of data for understanding the behavior of those who (attempt to) take their own lives and suggests new possibilities for preventive intervention. We demonstrate the feasibility of using social media data to detect those at risk for suicide. Specifically, we use natural language processing and machine learning (specifically deep learning) techniques to detect quantifiable signals around suicide attempts, and describe designs for an automated system for estimating suicide risk, usable by those without specialized mental health training (eg, a primary care doctor). We also discuss the ethical use of such technology and examine privacy implications. Currently, this technology is only used for intervention for individuals who have “opted in” for the analysis and intervention, but the technology enables scalable screening for suicide risk, potentially identifying many people who are at risk preventively and prior to any engagement with a health care system. This raises a significant cultural question about the trade-off between privacy and prevention—we have potentially life-saving technology that is currently reaching only a fraction of the possible people at risk because of respect for their privacy. Is the current trade-off between privacy and prevention the right one?


Energies ◽  
2019 ◽  
Vol 12 (17) ◽  
pp. 3258 ◽  
Author(s):  
Bai ◽  
Sun ◽  
Zang ◽  
Zhang ◽  
Shen ◽  
...  

Power dispatching systems currently receive massive, complicated, and irregular monitoring alarms during their operation, which prevents the controllers from making accurate judgments on the alarm events that occur within a short period of time. In view of the current situation with the low efficiency of monitoring alarm information, this paper proposes a method based on natural language processing (NLP) and a hybrid model that combines long short-term memory (LSTM) and convolutional neural network (CNN) for the identification of grid monitoring alarm events. Firstly, the characteristics of the alarm information text were analyzed and induced and then preprocessed. Then, the monitoring alarm information was vectorized based on the Word2vec model. Finally, a monitoring alarm event identification model based on a combination of LSTM and CNN was established for the characteristics of the alarm information. The feasibility and effectiveness of the method in this paper were verified by comparison with multiple identification models.


2017 ◽  
Vol 25 (3) ◽  
pp. 331-336 ◽  
Author(s):  
Ergin Soysal ◽  
Jingqi Wang ◽  
Min Jiang ◽  
Yonghui Wu ◽  
Serguei Pakhomov ◽  
...  

Abstract Existing general clinical natural language processing (NLP) systems such as MetaMap and Clinical Text Analysis and Knowledge Extraction System have been successfully applied to information extraction from clinical text. However, end users often have to customize existing systems for their individual tasks, which can require substantial NLP skills. Here we present CLAMP (Clinical Language Annotation, Modeling, and Processing), a newly developed clinical NLP toolkit that provides not only state-of-the-art NLP components, but also a user-friendly graphic user interface that can help users quickly build customized NLP pipelines for their individual applications. Our evaluation shows that the CLAMP default pipeline achieved good performance on named entity recognition and concept encoding. We also demonstrate the efficiency of the CLAMP graphic user interface in building customized, high-performance NLP pipelines with 2 use cases, extracting smoking status and lab test values. CLAMP is publicly available for research use, and we believe it is a unique asset for the clinical NLP community.


Sign in / Sign up

Export Citation Format

Share Document