De-identification of Emergency Medical Records in French: Survey and Comparison of State-of-the-Art Automated Systems

Author(s):  
Loick Bourdois ◽  
Marta Avalos ◽  
Gabrielle Chenais ◽  
Frantz Thiessard ◽  
Philippe Revel ◽  
...  

In France, structured data from emergency room (ER) visits are aggregated at the national level to build a syndromic surveillance system for several health events. For visits motivated by a traumatic event, information on the causes is stored in free-text clinical notes. To exploit these data, an automated de-identification system guaranteeing privacy protection is required. In this study we review available tools for de-identifying free-text clinical documents in French. A key question is how to overcome the resource barrier that hampers NLP applications in languages other than English. We compare rule-based, named entity recognition, recent Transformer-based deep learning, and hybrid systems, using, when required, a fine-tuning set of 30,000 unlabeled clinical notes. The evaluation is performed on a test set of 3,000 manually annotated notes. Hybrid systems, which combine capabilities on complementary tasks, show the best performance. This work is a first step toward a national surveillance system based on the exhaustive collection of ER visit reports for automated trauma monitoring.
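Hybrid systems combine rule-based and model-based components. A minimal sketch of one such combination; the regex patterns, PHI labels, and overlap-resolution policy here are illustrative assumptions, not taken from the systems actually surveyed:

```python
import re

# Hypothetical rule patterns for two PHI types that rules handle well;
# a statistical model (not shown) supplies the remaining predictions.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b"),  # French phone format
}

def hybrid_deidentify(text, model_spans):
    """Union rule-based spans with model spans; rules win on overlap.

    `model_spans` is a list of (start, end, label) tuples from the model.
    """
    spans = [(m.start(), m.end(), label)
             for label, pat in RULES.items() for m in pat.finditer(text)]
    for s, e, label in spans_from_model(model_spans, spans):
        spans.append((s, e, label))
    return sorted(spans)

def spans_from_model(model_spans, rule_spans):
    """Keep only model spans that do not overlap a rule-based span."""
    for s, e, label in model_spans:
        if not any(s < re_ and rs < e for rs, re_, _ in rule_spans):
            yield (s, e, label)
```

Real hybrid de-identification pipelines also normalize offsets across tokenizers and may prefer model spans for some entity types; this sketch only illustrates the span-merging idea.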

2020 ◽  
Author(s):  
Shintaro Tsuji ◽  
Andrew Wen ◽  
Naoki Takahashi ◽  
Hongjian Zhang ◽  
Katsuhiko Ogasawara ◽  
...  

BACKGROUND Named entity recognition (NER) plays an important role in extracting features from free-text radiology reports for mining. However, the performance of existing NER tools is limited because the number of recognizable entities depends on dictionary lookup. In particular, recognizing compound terms is complicated because they follow a variety of patterns. OBJECTIVE The objective of this study was to develop and evaluate an NER tool that handles compound terms, using RadLex, for mining free-text radiology reports. METHODS We leveraged the clinical Text Analysis and Knowledge Extraction System (cTAKES) to develop customized pipelines using both RadLex and SentiWordNet (a general-purpose dictionary, GPD). We manually annotated 400 radiology reports for compound terms (CTs) in noun phrases and used them as the gold standard for the performance evaluation (precision, recall, and F-measure). Additionally, we created a compound-term-enhanced dictionary (CtED) by analyzing false negatives (FNs) and false positives (FPs), and applied it to another 100 radiology reports for validation. We also evaluated the stem terms of compound terms by defining two measures: an occurrence ratio (OR) and a matching ratio (MR). RESULTS The F-measure of cTAKES+RadLex+GPD was 32.2% (precision 92.1%, recall 19.6%), and that of the pipeline combined with the CtED was 67.1% (precision 98.1%, recall 51.0%). The OR indicated that the stem terms "effusion", "node", "tube", and "disease" were used frequently, but the pipeline still failed to capture many CTs. The MR showed that 71.9% of stem terms matched those in ontologies, and RadLex improved the MR by about 22 percentage points over the cTAKES default dictionary. The OR and MR revealed that the characteristics of stem terms could help generate synonymous phrases using ontologies.
CONCLUSIONS We developed a RadLex-based customized pipeline for parsing radiology reports and demonstrated that the CtED and stem term analysis have the potential to improve dictionary-based NER performance toward expanding vocabularies.
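The precision, recall, and F-measure used for the gold-standard evaluation above follow the standard count-based definitions, sketched here:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from raw counts.

    tp: predicted entities that match the gold standard,
    fp: predicted entities not in the gold standard,
    fn: gold-standard entities the system missed.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

Note how a high-precision, low-recall system (as with cTAKES+RadLex+GPD) yields a low F-measure: the harmonic mean is dominated by the weaker of the two components.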


2020 ◽  
Vol 2020 ◽  
pp. 1-8
Author(s):  
Lejun Gong ◽  
Zhifei Zhang ◽  
Shiqi Chen

Background. Clinical named entity recognition is the basic task of mining electronic medical record (EMR) text, and it poses several challenges: Chinese EMR text contains many compound entities, frequently omits sentence components, and has unclear entity boundaries. Moreover, corpora of Chinese electronic medical records are difficult to obtain. Methods. To address these characteristics, this study proposed a Chinese clinical entity recognition model based on deep learning pretraining. The model uses word embeddings trained on a domain corpus and fine-tunes an entity recognition model pretrained on a related corpus. BiLSTM and Transformer are then used, respectively, as feature extractors to identify four types of clinical entities (diseases, symptoms, drugs, and operations) in Chinese EMR text. Results. The model achieved 75.06% macro-precision, 76.40% macro-recall, and 75.72% macro-F1 on the test dataset. Conclusions. These experiments show that the proposed Chinese clinical entity recognition model based on deep learning pretraining can effectively improve recognition performance.
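The macro-averaged metrics reported above are unweighted means of the per-entity-type scores. A minimal sketch (the per-type numbers in the test are made up for illustration):

```python
def macro_scores(per_type):
    """Macro-average precision, recall, and F1 over entity types.

    `per_type` maps each entity type (e.g. disease, symptom, drug,
    operation) to its (precision, recall, f1) tuple. Macro averaging
    weights every type equally, regardless of how many mentions it has.
    """
    n = len(per_type)
    sums = [sum(scores[i] for scores in per_type.values()) for i in range(3)]
    return tuple(s / n for s in sums)
```

The unweighted mean means rare entity types (often the hardest) pull the macro score down as much as frequent ones, which is why macro-F1 is a stricter summary than micro-F1 for imbalanced clinical corpora.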


2020 ◽  
Author(s):  
Xi Yang ◽  
Jiang Bian ◽  
Yonghui Wu

ABSTRACT Electronic Health Records (EHRs) are a valuable resource for both clinical and translational research. However, much detailed patient information is embedded in clinical narratives, including a large amount of patients' identifiable information. De-identification of clinical notes is a critical technology for protecting the privacy and confidentiality of patients. Previous studies have presented many automated de-identification systems that capture and remove protected health information from clinical text. However, most were tested only in a single-institution setting, where training and test data came from the same institution. Directly adopting these systems without customization can lead to a dramatic performance drop. Recent studies have shown that fine-tuning is a promising method for customizing deep learning-based NLP systems across institutions. However, it is still unclear how much local data is required. In this study, we examined the customization of a deep learning-based de-identification system using different amounts of local notes from UF Health. Our results showed that fine-tuning can significantly improve model performance even with a small local dataset. Yet, once the local data exceeded a threshold (e.g., 700 notes in this study), the performance improvement became marginal.
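The experimental design above is a learning-curve sweep: fine-tune on nested subsets of local notes of increasing size and evaluate each resulting model. A schematic sketch, where `fine_tune` and `evaluate` are caller-supplied stand-ins for the actual deep de-identification model and its test-set scoring (both hypothetical here):

```python
def learning_curve(fine_tune, evaluate, local_notes, sizes):
    """Fine-tune on nested subsets of local notes and record test scores.

    `fine_tune(notes)` returns a customized model; `evaluate(model)`
    returns its score on a fixed held-out test set. Plotting `curve`
    against `sizes` reveals where added local data stops paying off
    (roughly 700 notes in the study above).
    """
    curve = {}
    for n in sizes:
        model = fine_tune(local_notes[:n])
        curve[n] = evaluate(model)
    return curve
```

Keeping the test set fixed across all subset sizes is what makes the resulting scores comparable.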


2017 ◽  
Author(s):  
Tiba Delespierre ◽  
Loic Josseran

BACKGROUND New nursing home (NH) data warehouses fed from residents' medical records allow monitoring the health of the elderly population on a daily basis. Syndromic surveillance has already shown that professional data can be used for public health (PH) surveillance, but not yet during long-term follow-up of a single cohort. OBJECTIVE This study aimed to build and assess a national ecological NH PH surveillance system (SS). METHODS Using a national network of 126 NHs, we built a residents' cohort, extracted medical and personal data from their electronic health records, and transmitted them through the internet to a national server in near real time. After recording sociodemographic, autonomic, and syndromic information, a set of 26 syndromes was defined between November 2010 and June 2016, using pattern matching with the Structured Query Language (SQL) LIKE operator and a Delphi-like technique. We used the early aberration reporting system (EARS) and Bayes surveillance algorithms of the R surveillance package (Höhle) to assess our influenza and acute gastroenteritis (AGE) syndromic data against data from the Sentinelles network, the gold standard for French epidemics, following the Centers for Disease Control and Prevention surveillance system assessment guidelines. RESULTS By extracting all sociodemographic residents' data, a cohort of 41,061 senior citizens was built. Over a 6-year period, the EARS_C3 algorithm applied to NH influenza and AGE syndromic data gave sensitivities of 0.482 and 0.539 and specificities of 0.844 and 0.952, respectively, forecasting the last influenza outbreak by catching early flu signals. In addition, assessment of influenza and AGE syndromic data quality showed precisions of 0.98 and 0.96 during the last season's epidemic peak weeks (weeks 03-2017 and 01-2017) and precisions of 0.95 and 0.92 during the last summer's epidemic low (week 33-2016).
CONCLUSIONS This study confirmed that syndromic information offers a good opportunity to develop a genuine French national PH SS dedicated to senior citizens. Access to senior citizens' validated free-text health data on influenza and AGE responds to a PH need for the surveillance of this fragile population. This database will also make possible new ecological research on other subjects that will improve prevention, care, and rapid response when facing health threats.
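The EARS algorithms flag a day as aberrant when its count exceeds a threshold derived from a short moving baseline. A minimal sketch of the simplest variant, C1; the C2/C3 variants used in the study add a 2-day gap before the baseline and (for C3) accumulate recent excesses, and the R surveillance package implements all three:

```python
from statistics import mean, stdev

def ears_c1(counts, threshold=3.0):
    """Flag days whose count exceeds mean + threshold * sd of the
    previous 7 days (the EARS C1 rule). Returns one boolean per day
    from day 8 onward."""
    alarms = []
    for t in range(7, len(counts)):
        baseline = counts[t - 7:t]
        mu, sd = mean(baseline), stdev(baseline)
        alarms.append(counts[t] > mu + threshold * sd)
    return alarms
```

Because the baseline is only one week long, C1 reacts quickly to sudden jumps but is sensitive to day-of-week effects, which is one reason the study also assessed the more robust C3 and Bayes algorithms.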


2021 ◽  
Author(s):  
Nona Naderi ◽  
Julien Knafou ◽  
Jenny Copara ◽  
Patrick Ruch ◽  
Douglas Teodoro

Abstract The health and life science domains are well known for their wealth of entities. These entities appear as free text in large corpora, such as the biomedical scientific literature and electronic health records. To enable the secondary use of these corpora and unlock their value, named entity recognition (NER) methods have been proposed. Inspired by the success of deep masked language models, we present an ensemble approach for NER using these models. Results show a statistically significant improvement of the ensemble models over baselines based on individual models in multiple domains (chemical, clinical, and wet lab) and languages (English and French). The ensemble model achieves an overall performance of 79.2% macro F1-score, a 4.6 percentage point increase over the baseline across multiple domains and languages. These results suggest that ensembles are a more effective strategy for tackling NER. We further perform a detailed analysis of their performance based on a set of entity properties.
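A common way to ensemble NER models is token-level majority voting over their label sequences. A minimal sketch; the tie-breaking rule here (prefer the earliest model's label) is an assumption, and the paper's actual ensembling may handle ties and entity spans differently:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-token label sequences from several NER models.

    `predictions` is a list of label sequences (one per model), all of
    the same length. For each token position, the most frequent label
    wins; ties go to the earliest model's label.
    """
    ensemble = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = counts.most_common(1)[0][1]
        ensemble.append(next(l for l in labels if counts[l] == top))
    return ensemble
```

Token-level voting can split BIO spans (e.g. a B- without its I- tokens), so production ensembles often add a repair pass over the voted sequence.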


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Chengkun Wu ◽  
Xinyi Xiao ◽  
Canqun Yang ◽  
JinXiang Chen ◽  
Jiacai Yi ◽  
...  

Abstract Background Interactions of microbes and diseases are of great importance for biomedical research. However, large numbers of microbe–disease interactions remain hidden in the biomedical literature, and structured databases of such interactions are scarce. In this paper, we aim to construct a large-scale database of microbe–disease interactions automatically. We attained this goal by applying text mining methods based on a deep learning model, with a moderate curation cost. We also built a user-friendly web interface that allows researchers to navigate and query the required information. Results First, we manually constructed a gold-standard corpus and a silver-standard corpus (SSC) of microbe–disease interactions for curation. We then proposed a text mining framework for microbe–disease interaction extraction based on the pretrained model BERE. We applied named entity recognition tools to detect microbe and disease mentions in free biomedical texts. After that, we fine-tuned BERE, which was originally built for drug–target and drug–drug interactions, to recognize relations between the targeted entities. The introduction of the SSC for model fine-tuning greatly improved detection performance for microbe–disease interactions, with an average reduction in error of approximately 10%. The MDIDB website offers data browsing, custom searching for specific diseases or microbes, and batch downloading. Conclusions Evaluation results demonstrate that our method outperforms the baseline model (rule-based PKDE4J) with an average F1-score of 73.81%. For further validation, we randomly sampled nearly 1000 interactions predicted by our model and manually checked the correctness of each, which gave a 73% accuracy. The MDIDB website is freely available at http://dbmdi.com/index/


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Etran Bouchouar ◽  
Benjamin M. Hetman ◽  
Brendan Hanley

Abstract Background Automated Emergency Department syndromic surveillance systems (ED-SyS) are useful tools in routine surveillance activities and during mass gathering events to rapidly detect public health threats. To improve the existing surveillance infrastructure in a lower-resourced rural/remote setting and enhance monitoring during an upcoming mass gathering event, an automated low-cost, low-resource ED-SyS was developed and validated in Yukon, Canada. Methods Syndromes of interest were identified in consultation with the local public health authorities. For each syndrome, case definitions were developed using published resources and expert elicitation. Natural language processing algorithms were then written using Stata LP 15.1 (Texas, USA) to detect syndromic cases from three different fields (triage notes, chief complaint, and discharge diagnosis), comprising free text and standardized codes. Validation was conducted using data from 19,082 visits between October 1, 2018 and April 30, 2019. National Ambulatory Care Reporting System (NACRS) records were used as a reference for the inclusion of International Classification of Diseases, 10th edition (ICD-10) diagnosis codes. The automatic identification of cases was then manually validated by two raters, and the results were used to calculate positive predictive values for each syndrome and identify improvements to the detection algorithms. Results A daily secure file transfer of Yukon's Meditech ED-Tracker system data and an aberration detection plan were set up. Six syndromes were originally identified for the syndromic surveillance system (Gastrointestinal, Influenza-like Illness, Mumps, Neurological Infections, Rash, and Respiratory), with an additional syndrome added to assist in detecting potential cases of COVID-19.
The positive predictive value for the automated detection of each syndrome ranged from 48.8% to 89.5% before, and from 62.5% to 94.1% after, implementing the improvements identified during validation. As expected, no records were flagged for COVID-19 from our validation dataset. Conclusions The development and validation of an automated ED-SyS in lower-resourced settings can be achieved without sophisticated platforms or intensive resources, time, or costs. Validation is an important step for measuring the accuracy of syndromic surveillance and ensuring it performs adequately in the local context. The use of three different fields and the integration of both free-text and structured fields improved case detection.
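Detecting syndromic cases from several free-text fields and validating them with a positive predictive value can be sketched as follows; the keyword patterns below are illustrative placeholders, not the study's actual case definitions (which were built from published resources and expert elicitation, in Stata rather than Python):

```python
import re

# Hypothetical keyword patterns for two of the syndromes.
SYNDROMES = {
    "gastrointestinal": re.compile(r"\b(vomit\w*|diarrho?ea|gastro\w*)\b", re.I),
    "respiratory": re.compile(r"\b(cough\w*|dyspn\w*|wheez\w*)\b", re.I),
}

def flag_visit(triage_notes, chief_complaint, discharge_dx):
    """Return the syndromes matched in any of the three record fields."""
    text = " ".join([triage_notes, chief_complaint, discharge_dx])
    return {name for name, pat in SYNDROMES.items() if pat.search(text)}

def ppv(flags, validated):
    """Positive predictive value: manually confirmed true positives
    divided by all automatically flagged visits."""
    tp = sum(1 for f, v in zip(flags, validated) if f and v)
    return tp / sum(flags)
```

Searching all three fields at once is what let the system catch cases mentioned only in triage notes or only in the discharge diagnosis.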


2021 ◽  
Vol 11 (13) ◽  
pp. 6007
Author(s):  
Muzamil Hussain Syed ◽  
Sun-Tae Chung

Entity-based information extraction is one of the main applications of Natural Language Processing (NLP). Recently, deep transfer learning using contextualized word embeddings from pre-trained language models has shown remarkable results for many NLP tasks, including named-entity recognition (NER). BERT (Bidirectional Encoder Representations from Transformers) has gained prominent attention among contextualized word embedding models as a state-of-the-art pre-trained language model. However, training a BERT model from scratch for a new application domain is expensive, since it requires a huge dataset and enormous computing time. In this paper, we focus on menu entity extraction from online restaurant reviews and propose a simple but effective approach to NER in a new domain where a large dataset is rarely available or difficult to prepare, such as the food menu domain, based on domain adaptation of word embeddings and fine-tuning of the popular Bi-LSTM+CRF NER network with extended feature vectors. The proposed approach (named 'MenuNER') consists of two steps: (1) domain adaptation to the target domain, i.e., further pre-training of the off-the-shelf BERT-base language model in a semi-supervised fashion on a domain-specific dataset; and (2) supervised fine-tuning of the Bi-LSTM+CRF network for the downstream task, with extended feature vectors obtained by concatenating word embeddings from the domain-adapted BERT model of the first step, character embeddings, and POS-tag features. Experimental results on a hand-crafted food-menu corpus drawn from a customer review dataset show that the proposed approach to domain-specific NER, namely food-menu named-entity recognition, performs significantly better than one based on the baseline off-the-shelf BERT-base model. The proposed approach achieves a 92.5% F1 score on the YELP dataset for the MenuNER task.
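The extended feature vector in step (2) is a per-token concatenation of three sources. A minimal sketch; the word-embedding size 768 matches BERT-base, but the character-embedding size and POS tag set size here are illustrative assumptions, not the paper's values:

```python
def build_token_features(word_emb, char_emb, pos_id, n_pos_tags=17):
    """Concatenate word embedding, character embedding, and a one-hot
    POS-tag vector into a single per-token feature vector, as fed to
    the Bi-LSTM+CRF network."""
    pos_onehot = [0.0] * n_pos_tags
    pos_onehot[pos_id] = 1.0
    return list(word_emb) + list(char_emb) + pos_onehot
```

In the actual model these vectors are dense tensors batched per sentence; the list concatenation here only illustrates how the feature dimensions add up (768 + 50 + 17 = 835 with these assumed sizes).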


Author(s):  
Ying Zhang ◽  
Fandong Meng ◽  
Yufeng Chen ◽  
Jinan Xu ◽  
Jie Zhou

Author(s):  
Nona Naderi ◽  
Julien Knafou ◽  
Jenny Copara ◽  
Patrick Ruch ◽  
Douglas Teodoro

The health and life science domains are well known for their wealth of named entities found in large free-text corpora, such as the scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods have been proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and ensembles of them perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority-voting strategies. Experiments show a statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.

