PICO Entity Extraction For Preclinical Animal Literature

Mapping Intimacies ◽

10.21203/rs.3.rs-1008099/v1 ◽

2021 ◽

Author(s):

Qianying Wang ◽

Jing Liao ◽

Mirella Lapata ◽

Malcolm Macleod

Keyword(s):

Language Processing ◽

Systematic Reviews ◽

Animal Studies ◽

Fine Tuning ◽

Entity Recognition ◽

Entity Extraction ◽

Published Evidence ◽

Sentence Classification ◽

The Difference

Abstract Background: Natural language processing could assist multiple tasks in systematic reviews to reduce workload, including the extraction of PICO elements such as study populations, interventions and outcomes. The PICO framework provides a basis for the retrieval and selection for inclusion of published evidence relevant to a specific systematic review question, and automatic approaches to PICO extraction have been developed, particularly for reviews of clinical trial findings. Considering the difference between preclinical animal studies and clinical trials, developing separate approaches is necessary. Facilitating preclinical systematic reviews will inform the translation from preclinical to clinical research. Methods: We randomly selected 400 abstracts from the PubMed Central Open Access database which described in vivo animal research and manually annotated these with PICO phrases for Species, Strain, model Induction, Intervention, Comparator and Outcome. We developed a two-stage workflow for preclinical PICO extraction. First, we fine-tuned BERT with different pre-trained modules for PICO sentence classification. Then, after removing text irrelevant to PICO features, we explored LSTM, CRF and BERT-based models for PICO entity recognition. We also explored a self-training approach because of the small training corpus. Results: For PICO sentence classification, BERT models using all pre-trained modules achieved an F1 score over 80%, and models pre-trained on PubMed abstracts achieved the highest F1 of 85%. For PICO entity recognition, fine-tuning BERT pre-trained on PubMed abstracts achieved an overall F1 of 71%, and satisfactory F1 for Species (98%), Strain (70%), Intervention (70%) and Outcome (67%). The scores for Induction and Comparator are less satisfactory, but the F1 for Comparator can be improved to 50% by applying self-training. Conclusions: Our study indicates that, of the approaches tested, BERT pre-trained on PubMed abstracts is the best for both PICO sentence classification and PICO entity recognition in preclinical abstracts. Self-training yields better performance for identifying comparators and strains.
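
The two-stage workflow described in this abstract (filter PICO sentences first, then tag entities only in the sentences that remain) can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the function `extract_pico`, the keyword-based `toy_sentence_filter` and `toy_tagger`, and the example sentences are all hypothetical stand-ins for the fine-tuned BERT models the abstract mentions.

```python
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (text span, PICO label)


def extract_pico(
    sentences: List[str],
    is_pico_sentence: Callable[[str], bool],
    tag_entities: Callable[[str], List[Entity]],
) -> List[Entity]:
    """Stage 1: drop sentences judged irrelevant to PICO.
    Stage 2: run entity tagging on the remaining sentences only."""
    kept = [s for s in sentences if is_pico_sentence(s)]
    entities: List[Entity] = []
    for sentence in kept:
        entities.extend(tag_entities(sentence))
    return entities


# Hypothetical stand-ins: a real system would use trained models here.
SPECIES = {"rats", "mice", "rabbits"}


def toy_sentence_filter(sentence: str) -> bool:
    """Keep a sentence if it mentions any known species word."""
    return any(w.strip(".,").lower() in SPECIES for w in sentence.split())


def toy_tagger(sentence: str) -> List[Entity]:
    """Label every known species word as a Species entity."""
    return [
        (w.strip(".,"), "Species")
        for w in sentence.split()
        if w.strip(".,").lower() in SPECIES
    ]


abstract = [
    "Stroke is a leading cause of death.",
    "Male Wistar rats received the drug or saline.",
]
print(extract_pico(abstract, toy_sentence_filter, toy_tagger))
```

Passing the two stages in as callables keeps the pipeline independent of the model choice, so a sentence classifier or tagger could be swapped without touching the control flow.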

Transformers-sklearn: a toolkit for medical language understanding with transformer-based models

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01459-0 ◽

2021 ◽

Vol 21 (S2) ◽

Author(s):

Feihong Yang ◽

Xuwen Wang ◽

Hetong Ma ◽

Jiao Li

Keyword(s):

Language Processing ◽

Pearson Correlation ◽

Fine Tuning ◽

Entity Recognition ◽

Training Dataset ◽

Training Methods ◽

Code Size ◽

Model Framework ◽

Language Understanding ◽

Medical Language

Abstract Background The Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To reduce the difficulty of beginning to use transformer-based models in medical language understanding and to expand the capability of the scikit-learn toolkit in deep learning, we proposed an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits. Methods In transformers-sklearn, three Python classes were implemented, namely, BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods, i.e., fit for fine-tuning transformer-based models with the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The input data format is automatically generated by transformers-sklearn from the annotated corpus. Newcomers only need to prepare the dataset. The model framework and training methods are predefined in transformers-sklearn. Results We collected four open-source medical language datasets, including TrialClassification for Chinese medical trial text multilabel classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition and BIOSSES for English biomedical sentence similarity estimation. In the four medical NLP tasks, the average code size of our script is 45 lines/task, which is one-sixth the size of transformers’ script. The experimental results show that transformers-sklearn based on pretrained BERT models achieved macro F1 scores of 0.8225, 0.8703 and 0.6908, respectively, on the TrialClassification, BC5CDR and DiabetesNER tasks and a Pearson correlation of 0.8260 on the BIOSSES task, which is consistent with the results of transformers. Conclusions The proposed toolkit could help newcomers address medical language understanding tasks easily using the scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In future, more medical language understanding tasks will be supported to improve the applications of transformers-sklearn.
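
The three-method contract the abstract describes (fit, score, predict) follows the general scikit-learn estimator style. The sketch below illustrates that contract with a deliberately trivial stand-in model. It is not the transformers-sklearn API: `MajorityClassifier` and its behavior are hypothetical and chosen only so the example runs without any dependencies.

```python
from collections import Counter
from typing import Hashable, List, Sequence


class MajorityClassifier:
    """Toy estimator exposing the fit / predict / score trio.

    Hypothetical stand-in: it always predicts the most frequent
    training label, where a real toolkit would fine-tune a model.
    """

    def fit(self, X: Sequence[str], y: Sequence[Hashable]) -> "MajorityClassifier":
        # "Training" here is just remembering the most common label.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X: Sequence[str]) -> List[Hashable]:
        # One prediction per input, all equal to the majority label.
        return [self.label_ for _ in X]

    def score(self, X: Sequence[str], y: Sequence[Hashable]) -> float:
        # Plain accuracy: fraction of predictions that match y.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


texts = ["trial a", "trial b", "trial c"]
labels = [1, 1, 0]
clf = MajorityClassifier().fit(texts, labels)
print(clf.predict(["unseen text"]), clf.score(texts, labels))
```

Because callers only depend on these three methods, any model that honors the same contract can be dropped in behind them, which is the design idea the abstract attributes to the toolkit.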

EventEpi–A Natural Language Processing Framework for Event-Based Surveillance

10.1101/19006395 ◽

2019 ◽

Author(s):

Auss Abbood ◽

Alexander Ullrich ◽

Rüdiger Busche ◽

Stéphane Ghozzi

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Web Application ◽

Fine Tuning ◽

Entity Recognition ◽

World Health ◽

Support Vector ◽

Event Based ◽

Processing Framework

Abstract According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of epidemiologists sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural-language-processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles’ key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: EpiTator, an open-source epidemiological annotation tool, suggested many different possibilities for each. We trained a naive Bayes classifier to find the single most likely one using RKI’s EBS database as labels. Then, for relevance scoring, we defined two classes to which any article might belong: The article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using document and word embeddings. Two of the tested algorithms stood out: The multilayer perceptron performed best overall, with a precision of 0.19, recall of 0.50, specificity of 0.89, F1 of 0.28, and the highest tested index balanced accuracy of 0.46. The support-vector machine, on the other hand, had the highest recall (0.88), which can be of higher interest for epidemiologists. Finally, we integrated these functionalities into a web application called EventEpi where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, which will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, already works well and can be used in production, promising improvements in EBS. The source code is publicly available at https://github.com/aauss/EventEpi.
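
The F1 score reported for the multilayer perceptron can be checked against its precision and recall, since F1 is the harmonic mean of the two. The helper below is an illustrative sketch (the function name `f1_score` is chosen here, not taken from the EventEpi code).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the multilayer perceptron.
mlp_f1 = f1_score(0.19, 0.50)
print(round(mlp_f1, 2))  # 0.28
```

The result, 2 × 0.19 × 0.50 / 0.69 ≈ 0.275, rounds to the 0.28 stated in the abstract, and shows how a low precision pulls F1 down even when recall is moderate.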

Myanmar named entity corpus and its use in syllable-based neural named entity recognition

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i2.pp1544-1551 ◽

2020 ◽

Vol 10 (2) ◽

pp. 1544 ◽

Cited By ~ 1

Author(s):

Hsu Myat Mo ◽

Khin Mar Soe

Keyword(s):

Neural Network ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Future Research ◽

Network Architectures ◽

Entity Extraction ◽

Named Entity ◽

Named Entity Extraction ◽

Network Approaches

Myanmar is a low-resource language, which is one of the main reasons why Myanmar natural language processing has lagged behind that of other languages. Currently, there is no publicly available named entity corpus for the Myanmar language. As part of this work, the first manually annotated named-entity-tagged corpus for the Myanmar language was developed and proposed to support the evaluation of named entity extraction. At present, our named entity corpus contains approximately 170,000 named entities and 60,000 sentences. This work also contributes the first evaluation of various deep neural network architectures on Myanmar named entity recognition. Experimental results of 10-fold cross-validation revealed that syllable-based neural sequence models without additional feature engineering can give better results than the baseline CRF model. This work also aims to discover the effectiveness of neural network approaches to textual processing for the Myanmar language, as well as to promote future research on this understudied language.

EventEpi—A natural language processing framework for event-based surveillance

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008277 ◽

2020 ◽

Vol 16 (11) ◽

pp. e1008277

Author(s):

Auss Abbood ◽

Alexander Ullrich ◽

Rüdiger Busche ◽

Stéphane Ghozzi

Keyword(s):

Public Health ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Fine Tuning ◽

Entity Recognition ◽

World Health ◽

Case Count ◽

Event Based ◽

Processing Framework

According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of public health agents sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural language processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at the RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles’ key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: EpiTator, an open-source epidemiological annotation tool, suggested many different possibilities for each. We extracted the key country and disease using a heuristic, with good results. We trained a naive Bayes classifier to find the key date and confirmed-case count, using the RKI’s EBS database as labels; this classifier performed modestly. Then, for relevance scoring, we defined two classes to which any article might belong: The article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using bag-of-words, document and word embeddings. The best classifier, a logistic regression, achieved a sensitivity of 0.82 and an index balanced accuracy of 0.61. Finally, we integrated these functionalities into a web application called EventEpi where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, which will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, already works well and can be used in production, promising improvements in EBS. The source code and data are publicly available under open licenses.

Mining of Textual Health Information from Reddit: Analysis of Chronic Diseases With Extracted Entities and Their Relations (Preprint)

10.2196/preprints.12876 ◽

2018 ◽

Cited By ~ 1

Author(s):

Vasiliki Foufi ◽

Tatsawan Timakum ◽

Christophe Gaudet-Blavignac ◽

Christian Lovis ◽

Min Song

Keyword(s):

Social Media ◽

Chronic Diseases ◽

Language Processing ◽

Relation Extraction ◽

Entity Recognition ◽

Entity Extraction ◽

Privacy And Security ◽

Mining System ◽

Social Media Platforms ◽

The Way

BACKGROUND Social media platforms constitute a rich data source for natural language processing tasks such as named entity recognition, relation extraction, and sentiment analysis. In particular, social media platforms about health provide a different insight into patients’ experiences with diseases and treatment than those found in the scientific literature. OBJECTIVE This paper aimed to report a study of entities related to chronic diseases and their relations in user-generated text posts. The major focus of our research is the study of biomedical entities found in health social media platforms, their relations, and the way people suffering from chronic diseases express themselves. METHODS We collected a corpus of 17,624 text posts from disease-specific subreddits of the social news and discussion website Reddit. For entity and relation extraction from this corpus, we employed the PKDE4J tool developed by Song et al (2015). PKDE4J is a text mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework. RESULTS Using PKDE4J, we extracted 2 types of entities and relations: biomedical entities and relations and subject-predicate-object entity relations. In total, 82,138 entities and 30,341 relation pairs were extracted from the Reddit dataset. The most frequently mentioned entities were those related to oncological disease (2884 occurrences of cancer) and asthma (2180 occurrences). The relation pair anatomy-disease was the most frequent (5550 occurrences), the most frequent entities in this pair being cancer and lymph. The manual validation of the extracted entities showed a very good performance of the system at the entity extraction task (3682/5151, 71.48% of extracted entities were correctly labeled). CONCLUSIONS This study showed that people are eager to share their personal experience with chronic diseases on social media platforms despite possible privacy and security issues. The results reported in this paper are promising and demonstrate the need for more in-depth studies on the way patients with chronic diseases express themselves on social media platforms.

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

ACM Transactions on Computing for Healthcare ◽

10.1145/3458754 ◽

2022 ◽

Vol 3 (1) ◽

pp. 1-23

Author(s):

Yu Gu ◽

Robert Tinn ◽

Hao Cheng ◽

Michael Lucas ◽

Naoto Usuyama ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Fine Tuning ◽

Entity Recognition ◽

Language Models ◽

General Domain ◽

Domain Specific ◽

And Task

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB .

Cabernet: A Question-and-Answer System to Extract Data from Free-Text Pathology Reports (Preprint)

10.2196/preprints.27210 ◽

2021 ◽

Author(s):

Joseph Ross Mitchell ◽

Phillip Szepietowski ◽

Rachel Howard ◽

Phillip Reisman ◽

Jennie D. Jones ◽

...

Keyword(s):

Language Processing ◽

Language Model ◽

Ground Truth ◽

Fine Tuning ◽

Entity Recognition ◽

Training Dataset ◽

Pathology Report ◽

Free Text ◽

Tumor Site ◽

Pathology Reports

BACKGROUND Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, Bidirectional Encoder Representations from Transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question-answering, named entity recognition, speech recognition, and more. OBJECTIVE To develop a BERT-based system to automatically extract detailed tumor site and histology information from free-text pathology reports. METHODS We pursued three specific aims: 1) extract accurate tumor site and histology descriptions from free-text pathology reports; 2) accommodate the diverse terminology used to indicate the same pathology; and 3) provide accurate standardized tumor site and histology codes for use by downstream applications. We first trained a base language model to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients that included 121 million words. Next, we trained a Q&A “head” that would connect to, and work with, the pathology language model to answer pathology questions. Our Q&A system was designed to search for the answers to two predefined questions in each pathology report: 1) “What organ contains the tumor?”; and 2) “What is the kind of tumor or carcinoma?”. This involved supervised training on 8,197 pathology reports, each with ground truth answers to these two questions determined by Certified Tumor Registrars. The dataset included 214 tumor sites and 193 histologies. The tumor site and histology phrases extracted by the Q&A model were used to predict ICD-O-3 site and histology codes. This involved fine-tuning two additional BERT models: one to predict site codes, and the second to predict histology codes. Our final system includes a network of 3 BERT-based models. We call this caBERTnet (pronounced “Cabernet”). We evaluated caBERTnet using a sequestered test dataset of 2,050 pathology reports with ground truth answers determined by Certified Tumor Registrars. RESULTS caBERTnet’s accuracies for predicting group-level site and histology codes were 93.5% and 97.7%, respectively. The top-5 accuracies for predicting fine-grained ICD-O-3 site and histology codes with 5 or more samples each in the training dataset were 93.6% and 95.4%, respectively. CONCLUSIONS This is the first time an NLP system has achieved expert-level performance predicting ICD-O-3 codes across a broad range of tumor sites and histologies. Our new system could help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.

UMLS-based data augmentation for natural language processing of clinical research literature

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocaa309 ◽

2020 ◽

Author(s):

Tian Kang ◽

Adler Perotte ◽

Youlan Tang ◽

Casey Ta ◽

Chunhua Weng

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Data Augmentation ◽

Training Data ◽

Entity Recognition ◽

Classification Model ◽

Learning Models ◽

Sentence Classification

Abstract Objective The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity. Materials and Methods We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating the Unified Medical Language System (UMLS) knowledge and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT. Results UMLS-EDA enables substantial improvement for NER tasks from the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain on micro-F1 score is 9%, from 0.75 to 0.84, better than classifiers with BERT pretraining (0.82). Conclusions This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.

Decellularized bone extracellular matrix in skeletal tissue engineering

Biochemical Society Transactions ◽

10.1042/bst20190079 ◽

2020 ◽

Vol 48 (3) ◽

pp. 755-764

Author(s):

Benjamin B. Rothrauff ◽

Rocky S. Tuan

Keyword(s):

Tissue Engineering ◽

Bone Regeneration ◽

Tissue Regeneration ◽

Animal Studies ◽

Tumor Resection ◽

Regenerative Capacity ◽

Skeletal Tissue ◽

Large Bone

Bone possesses an intrinsic regenerative capacity, which can be compromised by aging, disease, trauma, and iatrogenesis (e.g. tumor resection, pharmacological). At present, autografts and allografts are the principal biological treatments available to replace large bone segments, but both entail several limitations that reduce wider use and consistent success. The use of decellularized extracellular matrices (ECM), often derived from xenogeneic sources, has been shown to favorably influence the immune response to injury and promote site-appropriate tissue regeneration. Decellularized bone ECM (dbECM), utilized in several forms — whole organ, particles, hydrogels — has shown promise in both in vitro and in vivo animal studies to promote osteogenic differentiation of stem/progenitor cells and enhance bone regeneration. However, dbECM has yet to be investigated in clinical studies, which are needed to determine the relative efficacy of this emerging biomaterial as compared with established treatments. This mini-review highlights the recent exploration of dbECM as a biomaterial for skeletal tissue engineering and considers modifications on its future use to more consistently promote bone regeneration.

Enhancement of ADP-induced Platelet Aggregation by Adrenaline in Vivo and Its Prevention

Thrombosis and Haemostasis ◽

10.1055/s-0038-1647788 ◽

1973 ◽

Vol 29 (02) ◽

pp. 490-498 ◽

Cited By ~ 6

Author(s):

Hiroh Yamazaki ◽

Itsuro Kobayashi ◽

Tadahiro Sano ◽

Takio Shimamoto

Keyword(s):

Platelet Count ◽

Platelet Aggregation ◽

Platelet Rich Plasma ◽

The Other ◽

Endothelial Surface ◽

Important Interaction ◽

Blood Coagulability ◽

The Difference

Summary The authors previously reported a transient decrease in adhesive platelet count and an enhancement of blood coagulability after administration of a small amount of adrenaline (0.1-1 µg per kg, i.v.) in man and rabbit. In such circumstances, the sensitivity of platelets to aggregation induced by ADP was studied by an optical density method. Five minutes after i.v. injection of 1 µg per kg of adrenaline in 10 rabbits, intensity of platelet aggregation increased to 115.1 ± 4.9% (mean ± S.E.) by 10⁻⁵ molar, 121.8 ± 7.8% by 3 × 10⁻⁶ molar and 129.4 ± 12.8% of the value before the injection by 10⁻⁶ molar ADP. The difference was statistically significant (P<0.01-0.05). The above change was not observed in any of the groups of rabbits injected with saline, 1 µg per kg of 1-noradrenaline or 0.1 and 10 µg per kg of adrenaline. Also, it was prevented by oral administration of 10 mg per kg of phenoxybenzamine or propranolol or aspirin or pyridinolcarbamate 3 hours before the challenge. On the other hand, the enhancement of ADP-induced platelet aggregation was not observed in vitro, when 10⁻⁵, 3 × 10⁻⁶ or 10⁻⁶ molar ADP was added to citrated platelet rich plasma (CPRP) of rabbit after incubation at 37°C for 30 seconds with 0.01, 0.1, 1, 10 or 100 µg per ml of adrenaline or noradrenaline. These results suggest an important interaction between endothelial surface and platelets in connection with the enhancement of ADP-induced platelet aggregation by adrenaline in vivo.
