scholarly journals Active learning: a step towards automating medical concept extraction

2015 ◽  
Vol 23 (2) ◽  
pp. 289-296 ◽  
Author(s):  
Mahnoosh Kholghi ◽  
Laurianne Sitbon ◽  
Guido Zuccon ◽  
Anthony Nguyen

Abstract Objective This paper presents an automatic, active learning-based system for the extraction of medical concepts from clinical free-text reports. Specifically, (1) the contribution of active learning in reducing the annotation effort and (2) the robustness of incremental active learning framework across different selection criteria and data sets are determined. Materials and methods The comparative performance of an active learning framework and a fully supervised approach were investigated to study how active learning reduces the annotation effort while achieving the same effectiveness as a supervised approach. Conditional random fields as the supervised method, and least confidence and information density as 2 selection criteria for active learning framework were used. The effect of incremental learning vs standard learning on the robustness of the models within the active learning framework with different selection criteria was also investigated. The following 2 clinical data sets were used for evaluation: the Informatics for Integrating Biology and the Bedside/Veteran Affairs (i2b2/VA) 2010 natural language processing challenge and the Shared Annotated Resources/Conference and Labs of the Evaluation Forum (ShARe/CLEF) 2013 eHealth Evaluation Lab. Results The annotation effort saved by active learning to achieve the same effectiveness as supervised learning is up to 77%, 57%, and 46% of the total number of sequences, tokens, and concepts, respectively. Compared with the random sampling baseline, the saving is at least doubled. Conclusion Incremental active learning is a promising approach for building effective and robust medical concept extraction models while significantly reducing the burden of manual annotation.

2020 ◽  
Vol 12 (2) ◽  
pp. 297 ◽  
Author(s):  
Nasehe Jamshidpour ◽  
Abdolreza Safari ◽  
Saeid Homayouni

This paper introduces a novel multi-view multi-learner (MVML) active learning method, in which the different views are generated by a genetic algorithm (GA). The GA-based view generation method attempts to construct diverse, sufficient, and independent views by considering both inter- and intra-view confidences. Hyperspectral data inherently owns high dimensionality, which makes it suitable for multi-view learning algorithms. Furthermore, by employing multiple learners at each view, a more accurate estimation of the underlying data distribution can be obtained. We also implemented a spectral-spatial graph-based semi-supervised learning (SSL) method as the classifier, which improved the performance of the classification task in comparison with supervised learning. The evaluation of the proposed method was based on three different benchmark hyperspectral data sets. The results were also compared with other state-of-the-art AL-SSL methods. The experimental results demonstrated the efficiency and statistically significant superiority of the proposed method. The GA-MVML AL method improved the classification performances by 16.68%, 18.37%, and 15.1% for different data sets after 40 iterations.


2015 ◽  
Vol 22 (3) ◽  
pp. 671-681 ◽  
Author(s):  
Azadeh Nikfarjam ◽  
Abeed Sarker ◽  
Karen O’Connor ◽  
Rachel Ginn ◽  
Graciela Gonzalez

Abstract Objective Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words’ semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.


2020 ◽  
Author(s):  
Xi Yang ◽  
Hansi Zhang ◽  
Xing He ◽  
Jiang Bian ◽  
Yonghui Wu

BACKGROUND Patients’ family history (FH) is a critical risk factor associated with numerous diseases. However, FH information is not well captured in the structured database but often documented in clinical narratives. Natural language processing (NLP) is the key technology to extract patients’ FH from clinical narratives. In 2019, the National NLP Clinical Challenge (n2c2) organized shared tasks to solicit NLP methods for FH information extraction. OBJECTIVE This study presents our end-to-end FH extraction system developed during the 2019 n2c2 open shared task as well as the new transformer-based models that we developed after the challenge. We seek to develop a machine learning–based solution for FH information extraction without task-specific rules created by hand. METHODS We developed deep learning–based systems for FH concept extraction and relation identification. We explored deep learning models including long short-term memory-conditional random fields and bidirectional encoder representations from transformers (BERT) as well as developed ensemble models using a majority voting strategy. To further optimize performance, we systematically compared 3 different strategies to use BERT output representations for relation identification. RESULTS Our system was among the top-ranked systems (3 out of 21) in the challenge. Our best system achieved micro-averaged F1 scores of 0.7944 and 0.6544 for concept extraction and relation identification, respectively. After challenge, we further explored new transformer-based models and improved the performances of both subtasks to 0.8249 and 0.6775, respectively. For relation identification, our system achieved a performance comparable to the best system (0.6810) reported in the challenge. CONCLUSIONS This study demonstrated the feasibility of utilizing deep learning methods to extract FH information from clinical narratives.


Entropy ◽  
2019 ◽  
Vol 21 (7) ◽  
pp. 651 ◽  
Author(s):  
Dina Elreedy ◽  
Amir F. Atiya ◽  
Samir I. Shaheen

Recently, active learning is considered a promising approach for data acquisition due to the significant cost of the data labeling process in many real world applications, such as natural language processing and image processing. Most active learning methods are merely designed to enhance the learning model accuracy. However, the model accuracy may not be the primary goal and there could be other domain-specific objectives to be optimized. In this work, we develop a novel active learning framework that aims to solve a general class of optimization problems. The proposed framework mainly targets the optimization problems exposed to the exploration-exploitation trade-off. The active learning framework is comprehensive, it includes exploration-based, exploitation-based and balancing strategies that seek to achieve the balance between exploration and exploitation. The paper mainly considers regression tasks, as they are under-researched in the active learning field compared to classification tasks. Furthermore, in this work, we investigate the different active querying approaches—pool-based and the query synthesis—and compare them. We apply the proposed framework to the problem of learning the price-demand function, an application that is important in optimal product pricing and dynamic (or time-varying) pricing. In our experiments, we provide a comparative study including the proposed framework strategies and some other baselines. The accomplished results demonstrate a significant performance for the proposed methods.


10.2196/22982 ◽  
2020 ◽  
Vol 8 (12) ◽  
pp. e22982
Author(s):  
Xi Yang ◽  
Hansi Zhang ◽  
Xing He ◽  
Jiang Bian ◽  
Yonghui Wu

Background Patients’ family history (FH) is a critical risk factor associated with numerous diseases. However, FH information is not well captured in the structured database but often documented in clinical narratives. Natural language processing (NLP) is the key technology to extract patients’ FH from clinical narratives. In 2019, the National NLP Clinical Challenge (n2c2) organized shared tasks to solicit NLP methods for FH information extraction. Objective This study presents our end-to-end FH extraction system developed during the 2019 n2c2 open shared task as well as the new transformer-based models that we developed after the challenge. We seek to develop a machine learning–based solution for FH information extraction without task-specific rules created by hand. Methods We developed deep learning–based systems for FH concept extraction and relation identification. We explored deep learning models including long short-term memory-conditional random fields and bidirectional encoder representations from transformers (BERT) as well as developed ensemble models using a majority voting strategy. To further optimize performance, we systematically compared 3 different strategies to use BERT output representations for relation identification. Results Our system was among the top-ranked systems (3 out of 21) in the challenge. Our best system achieved micro-averaged F1 scores of 0.7944 and 0.6544 for concept extraction and relation identification, respectively. After challenge, we further explored new transformer-based models and improved the performances of both subtasks to 0.8249 and 0.6775, respectively. For relation identification, our system achieved a performance comparable to the best system (0.6810) reported in the challenge. Conclusions This study demonstrated the feasibility of utilizing deep learning methods to extract FH information from clinical narratives.


Natural Language Processing (NLP) Algorithms are the key factors for automatic information extraction form the unstructured free-text radiology reports .To extract clinically important findings and recommendations, various NLP algorithms are used . A rule-based NLP system is used in most of the automated IE applications in medical domain; whereas some applications are using Random Forest classifier, PageRank Algorithm, clustering algorithm, Conditional Random Fields (CRF) algorithm, and deep learning-based approaches. Some papers found with methods used for IE like, Support Vector Machines (SVMs), linear-chain conditional random fields (LC-CRFs), k-means or k-medoids algorithm ,Affinity Propagation (AP) clustering algorithm , supervised machine learning algorithm and many more. Thus through this survey we can say that, NLP methods used to extract information ,brings new insights into already known clinical evidences. It also helps to identify previously unknown treatment and causal relations between biomedical entities .Therefore NLP algorithms has empowered Radiology monarchy.


2020 ◽  
Vol 27 (12) ◽  
pp. 1935-1942
Author(s):  
Xi Yang ◽  
Jiang Bian ◽  
William R Hogan ◽  
Yonghui Wu

Abstract Objective The goal of this study is to explore transformer-based models (eg, Bidirectional Encoder Representations from Transformers [BERT]) for clinical concept extraction and develop an open-source package with pretrained clinical models to facilitate concept extraction and other downstream natural language processing (NLP) tasks in the medical domain. Methods We systematically explored 4 widely used transformer-based architectures, including BERT, RoBERTa, ALBERT, and ELECTRA, for extracting various types of clinical concepts using 3 public datasets from the 2010 and 2012 i2b2 challenges and the 2018 n2c2 challenge. We examined general transformer models pretrained using general English corpora as well as clinical transformer models pretrained using a clinical corpus and compared them with a long short-term memory conditional random fields (LSTM-CRFs) mode as a baseline. Furthermore, we integrated the 4 clinical transformer-based models into an open-source package. Results and Conclusion The RoBERTa-MIMIC model achieved state-of-the-art performance on 3 public clinical concept extraction datasets with F1-scores of 0.8994, 0.8053, and 0.8907, respectively. Compared to the baseline LSTM-CRFs model, RoBERTa-MIMIC remarkably improved the F1-score by approximately 4% and 6% on the 2010 and 2012 i2b2 datasets. This study demonstrated the efficiency of transformer-based models for clinical concept extraction. Our methods and systems can be applied to other clinical tasks. The clinical transformer package with 4 pretrained clinical models is publicly available at https://github.com/uf-hobi-informatics-lab/ClinicalTransformerNER. We believe this package will improve current practice on clinical concept extraction and other tasks in the medical domain.


Author(s):  
Azad Dehghan ◽  
Tom Liptrot ◽  
Catherine O’hara ◽  
Matthew Barker-Hewitt ◽  
Daniel Tibble ◽  
...  

ABSTRACTObjectivesIncreasing interest to use unstructured electronic patient records for research has attracted attention to automated de-identification methods to conduct large scale removal of Personal Identifiable Information (PII). PII mainly include identifiable information such as person names, dates (e.g., date of birth), reference numbers (e.g., hospital number, NHS number), locations (e.g., hospital names, addresses), contacts (e.g., telephone, e-mail), occupation, age, and other identity information (ethnical, religion, sexual) mentioned in a private context.  De-identification of clinical free-text remains crucial to enable large-scale data access for health research while adhering to legal (Data Protection Act 1998) and ethical obligations. Here we present a computational method developed to automatically remove PII from clinical text. ApproachIn order to automatically identify PII in clinical text, we have developed and validated a Natural Language Processing (NLP) method which combine knowledge- (lexical dictionaries and rules) and data-driven (linear-chain conditional random fields) techniques. In addition, we have designed a novel two-pass recognition approach that uses the output of the initial pass to create patient-level and run-time dictionaries used to identify PII mentions that lack specific contextual clues considered by the initial entity extraction modules. The labelled data used to model and validate our techniques were generated using six human annotators and two distinct types of free-text from The Christie NHS Foundation Trust: (1) clinical correspondence (400 documents) and (2) clinical notes (1,300 documents). ResultsThe de-identification approach was developed and validated using a 60/40 percent split between the development and test datasets. The preliminary results show that our method achieves 97% and 93% token-level F1-measure on clinical correspondence and clinical notes respectively. In addition, the proposed two-pass recognition method was found particularly effective for longitudinal records. Notably, the performances are comparable to human benchmarks (using inter annotator agreements) of 97% and 90% F1 respectively. ConclusionsWe have developed and validated a state-of-the-art method that matches human benchmarks in identification and removal of PII from free-text clinical records. The method has been further validated across multiple institutions and countries (United States and United Kingdom), where we have identified a notable NLP challenge of cross-dataset adaption and have proposed using active learning methods to address this problem. The algorithm, including an active learning component, will be provided as open source to the healthcare community.


Sign in / Sign up

Export Citation Format

Share Document