Clinical Named Entity Recognition based on Deep Learning with Pre-trained Word Embedding for Colonoscopy Reports: Model Development and Performance Evaluation (Preprint)

2021
Author(s):
Donghyeong Seong
Yoonho Choi
Sungwon Jung
Sungchul Bae
Soo-Yong Shin
...

BACKGROUND Colorectal cancer is a leading cause of cancer deaths. Several screening tests, such as colonoscopy, can be used to find polyps or colorectal cancer. Colonoscopy reports are often written in unstructured narrative text. The information embedded in the reports can be used for various purposes, including colorectal cancer risk prediction, follow-up recommendation, and quality measurement. However, the availability and accessibility of the unstructured text data are still very low despite the large amounts of accumulated data. OBJECTIVE We aimed to develop a deep learning-based natural language processing (NLP) method for named entity recognition (NER) in colonoscopy reports. To the best of our knowledge, no previous studies on clinical NLP for colonoscopy reports have applied deep learning techniques. METHODS This study proposed a method to apply pre-trained word embedding to a deep learning-based NER model using large volumes of unlabeled colonoscopy reports. In total, 280,668 colonoscopy reports were extracted from the clinical data warehouse of the Samsung Medical Center. For 5,000 reports, procedural information and colonoscopic findings were manually annotated with 17 labels. We compared variants of the long short-term memory (LSTM) model to select the one with the best performance for colonoscopy reports, which was the bidirectional LSTM with conditional random fields. Then, we applied pre-trained word embedding, learned from the large unlabeled corpus (280,668 reports), to the selected model. RESULTS The NER model with pre-trained word embedding performed better for most labels than the model with one-hot encoding. The F1 scores for colonoscopic findings were 0.9564 for lesions, 0.9722 for locations, 0.9809 for shapes, 0.9720 for colors, 0.9862 for sizes, and 0.9717 for numbers. CONCLUSIONS In this study, clinical NER was applied to extract meaningful information from colonoscopy reports. We proposed a deep learning-based NER model with pre-trained word embedding. The proposed method achieved promising results that demonstrate it can be applied to various practical purposes.
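As a rough illustration of the kind of model described above, the following sketch builds a BiLSTM-CRF tagger whose embedding layer is initialized from word vectors pre-trained on unlabeled text. It is not the authors' code: PyTorch with the pytorch-crf package is assumed, and the class name, dimensions, and the pretrained_vectors matrix are placeholders.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTMCRFTagger(nn.Module):
    """BiLSTM-CRF sequence tagger whose embedding layer is initialized from
    word vectors pre-trained on a large unlabeled corpus."""

    def __init__(self, pretrained_vectors, num_tags, hidden_dim=256):
        super().__init__()
        emb_dim = pretrained_vectors.shape[1]
        # Initialize the embedding table from the pre-trained vectors; keep it trainable.
        self.embedding = nn.Embedding.from_pretrained(
            torch.as_tensor(pretrained_vectors, dtype=torch.float), freeze=False)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        self.emit = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, token_ids):
        hidden, _ = self.bilstm(self.embedding(token_ids))
        return self.emit(hidden)

    def loss(self, token_ids, tags, mask):
        # mask: bool tensor, True for real tokens and False for padding
        return -self.crf(self._emissions(token_ids), tags, mask=mask, reduction="mean")

    def predict(self, token_ids, mask):
        # Viterbi decoding of the most likely tag sequence per sentence
        return self.crf.decode(self._emissions(token_ids), mask=mask)
```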

2018
Vol 10 (12)
pp. 123
Author(s):
Mohammed Ali
Guanzheng Tan
Aamir Hussain

Recurrent neural networks (RNNs) have achieved remarkable success in sequence labeling tasks that require memory. An RNN can remember previous information in a sequence and can thus be used to solve natural language processing (NLP) tasks. Named entity recognition (NER) is a common NLP task and can be considered a classification problem. We propose a bidirectional long short-term memory (LSTM) model for the entity recognition task on Arabic text. The LSTM network can process a sequence while relating each part of it to the others, which makes it well suited to NER. Moreover, we use pre-trained word embeddings to represent the inputs fed into the LSTM network. The proposed model is evaluated on a popular dataset called "ANERcorp." Experimental results show that the model with word embeddings achieves a high F-score of 88.01%.
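For illustration only, a common way to plug pre-trained word2vec-style vectors into the embedding layer of such an LSTM model looks roughly like this (gensim is assumed; the vector file name and the vocab mapping are placeholders, not taken from the paper).

```python
import numpy as np
from gensim.models import KeyedVectors


def build_embedding_matrix(vocab, vectors_path="arabic_vectors.bin"):
    """vocab: dict mapping each corpus word to its integer index."""
    kv = KeyedVectors.load_word2vec_format(vectors_path, binary=True)
    matrix = np.random.normal(scale=0.1, size=(len(vocab), kv.vector_size))
    hits = 0
    for word, idx in vocab.items():
        if word in kv:                     # out-of-vocabulary words keep a random row
            matrix[idx] = kv[word]
            hits += 1
    print(f"{hits}/{len(vocab)} words initialized from pre-trained vectors")
    return matrix
```

The returned matrix would then be used to initialize the embedding layer of the bidirectional LSTM tagger.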


10.2196/14850
2019
Vol 7 (4)
pp. e14850
Author(s):
Min Jiang
Todd Sanger
Xiong Liu

Background Named entity recognition (NER) is a key step in clinical natural language processing (NLP). Traditionally, rule-based systems leverage prior knowledge to define rules to identify named entities. Recently, deep learning–based NER systems have become more and more popular. Contextualized word embedding, as a new type of word representation, has been proposed to dynamically capture word sense using context information and has proven successful in many deep learning–based systems in both the general and the medical domain. However, very few studies have investigated the effects of combining multiple contextualized embeddings and prior knowledge on the clinical NER task. Objective This study aims to improve the performance of NER in clinical text by combining multiple contextual embeddings and prior knowledge. Methods In this study, we investigate the effects of combining multiple contextualized word embeddings with classic word embedding in deep neural networks to predict named entities in clinical text. We also investigate whether using a semantic lexicon could further improve the performance of the clinical NER system. Results By combining contextualized embeddings such as ELMo and Flair, our system achieves an F-1 score of 87.30% when training on only a portion of the 2010 Informatics for Integrating Biology and the Bedside (i2b2) NER task dataset. After incorporating the medical lexicon into the word embedding, the F-1 score was further increased to 87.44%. Another finding was that our system could still achieve an F-1 score of 85.36% when the size of the training data was reduced to 40%. Conclusions Combining contextualized embeddings could be beneficial for the clinical NER task. Moreover, the semantic lexicon could be used to further improve the performance of the clinical NER system.
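One widely used way to stack classic and contextualized embeddings is the Flair library; the sketch below is a generic example of that pattern, not the authors' pipeline. The corpus paths, column layout, embedding choices, and the make_tag_dictionary call (renamed make_label_dictionary in newer Flair releases) are all assumptions.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import (WordEmbeddings, FlairEmbeddings,
                              ELMoEmbeddings, StackedEmbeddings)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style column files: token in column 0, NER tag in column 1 (placeholder paths)
corpus = ColumnCorpus("data/clinical_ner", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")

# Stack a classic word embedding with contextualized character-LM and ELMo embeddings
embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),            # classic (static) word embedding
    FlairEmbeddings("news-forward"),    # contextualized character LM, forward
    FlairEmbeddings("news-backward"),   # contextualized character LM, backward
    ELMoEmbeddings("original"),         # contextualized ELMo embedding
])

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=corpus.make_tag_dictionary(tag_type="ner"),
                        tag_type="ner",
                        use_crf=True)

ModelTrainer(tagger, corpus).train("models/clinical-ner", max_epochs=50)
```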


Electronics
2020
Vol 9 (6)
pp. 1001
Author(s):
Jingang Liu
Chunhe Xia
Haihua Yan
Wenjing Xu

Named entity recognition (NER) is a basic but crucial task in the field of natural language processing (NLP) and big data analysis. The recognition of named entities in Chinese is more complicated and difficult than in English, which makes the Chinese NER task more challenging. In particular, fine-grained named entity recognition is more challenging than traditional NER tasks, mainly because fine-grained tasks place higher demands on a deep neural model's ability to extract features automatically and represent information. In this paper, we propose an innovative neural network model named En2BiLSTM-CRF to improve fine-grained Chinese entity recognition. The proposed model, comprising an initial encoding layer, an enhanced encoding layer, and a decoding layer, combines the advantages of pre-trained model encoding, dual bidirectional long short-term memory (BiLSTM) networks, and a residual connection mechanism. Hence, it can encode information multiple times and extract contextual features hierarchically. We conducted extensive experiments on two representative datasets using multiple important metrics and compared our model with other advanced baselines. We present promising results showing that the proposed En2BiLSTM-CRF achieves better performance as well as better generalization ability in both fine-grained and coarse-grained Chinese entity recognition tasks.
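The following PyTorch fragment is one possible reading of the dual-BiLSTM-plus-residual idea (an illustration, not the authors' En2BiLSTM-CRF; layer sizes and the projection used for the residual path are placeholders).

```python
import torch
import torch.nn as nn


class DualBiLSTMResidualEncoder(nn.Module):
    """Two stacked BiLSTM passes over pre-trained encodings, with a residual
    connection that re-injects the original encoding into the output."""

    def __init__(self, input_dim=768, hidden_dim=256):
        super().__init__()
        self.bilstm1 = nn.LSTM(input_dim, hidden_dim // 2,
                               batch_first=True, bidirectional=True)
        self.bilstm2 = nn.LSTM(hidden_dim, hidden_dim // 2,
                               batch_first=True, bidirectional=True)
        self.proj = nn.Linear(input_dim, hidden_dim)  # match dims for the residual path

    def forward(self, pretrained_encodings):          # (batch, seq_len, input_dim)
        h1, _ = self.bilstm1(pretrained_encodings)    # initial encoding pass
        h2, _ = self.bilstm2(h1)                      # enhanced (second) encoding pass
        return h2 + self.proj(pretrained_encodings)   # residual connection
```

The encoder output would then feed a CRF decoding layer, as in the BiLSTM-CRF sketch earlier in this listing.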


Information
2020
Vol 11 (1)
pp. 45
Author(s):
Shardrom Johnson
Sherlock Shen
Yuanchen Liu

Named Entity Recognition (NER), which usually takes Part-Of-Speech (POS) tags as linguistic features, is a major task in Natural Language Processing (NLP). In this paper, we put forward a new comprehensive-embedding that considers three aspects, namely character-embedding, word-embedding, and pos-embedding, stitched together in a fixed order so that their dependencies are captured. Based on this, we propose a new Character–Word–Position Combined BiLSTM-Attention (CWPC_BiAtt) model for the Chinese NER task. The comprehensive-embedding is passed through a Bidirectional Long Short-Term Memory (BiLSTM) layer to capture the connection between historical and future information, and an attention mechanism is then employed to capture the connection between the content of the sentence at the current position and that at any other position. Finally, we utilize a Conditional Random Field (CRF) to decode the entire tagging sequence. Experiments show that the proposed CWPC_BiAtt model is well suited to the NER task on the Microsoft Research Asia (MSRA) dataset and the Weibo NER corpus. High precision and recall were obtained, which verifies the stability of the model. The position-embedding in the comprehensive-embedding compensates for the attention mechanism's lack of explicit position information, which shows that the comprehensive-embedding is complete. Looking at the entire model, our proposed CWPC_BiAtt has three distinct characteristics: completeness, simplicity, and stability. The proposed CWPC_BiAtt model achieved the highest F-score, reaching state-of-the-art performance on the MSRA dataset and the Weibo NER corpus.
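A minimal sketch of such a comprehensive-embedding, concatenating character-, word-, and pos-embeddings per token, might look as follows (illustrative only; dimensions are placeholders, and the character summary is a simple mean-pooled BiLSTM, which is a simplification of the paper's design).

```python
import torch
import torch.nn as nn


class ComprehensiveEmbedding(nn.Module):
    """Concatenate character-, word-, and POS-level embeddings per token."""

    def __init__(self, n_chars, n_words, n_pos, char_dim=25, word_dim=100, pos_dim=25):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_dim, batch_first=True, bidirectional=True)
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.pos_emb = nn.Embedding(n_pos, pos_dim)

    def forward(self, char_ids, word_ids, pos_ids):
        # char_ids: (batch, seq_len, max_chars); word_ids, pos_ids: (batch, seq_len)
        b, s, c = char_ids.shape
        char_out, _ = self.char_lstm(self.char_emb(char_ids.view(b * s, c)))
        char_repr = char_out.mean(dim=1).view(b, s, -1)   # mean-pool over characters
        # stitch character, word, and POS embeddings together in a fixed order
        return torch.cat([char_repr, self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)
```

The concatenated vectors would then be fed to the BiLSTM, attention, and CRF layers described above.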


2019
Vol 20 (1)
Author(s):
Hyejin Cho
Hyunju Lee

Abstract Background In biomedical text mining, named entity recognition (NER) is an important task used to extract information from biomedical articles. Previously proposed methods for NER are dictionary- or rule-based methods and machine learning approaches. However, these traditional approaches rely heavily on large-scale dictionaries, target-specific rules, or well-constructed corpora. They have been superseded by deep learning-based approaches that are independent of hand-crafted features. However, although such NER methods employ an additional conditional random field (CRF) to capture important correlations between neighboring labels, they often do not incorporate all the contextual information from the text into the deep learning layers. Results We propose herein an NER system for biomedical entities that incorporates n-grams with a bi-directional long short-term memory (BiLSTM) network and a CRF; this system is referred to as a contextual long short-term memory network with CRF (CLSTM). We assess the CLSTM model on three corpora: the disease corpus of the National Center for Biotechnology Information (NCBI), the BioCreative II Gene Mention corpus (GM), and the BioCreative V Chemical Disease Relation corpus (CDR). Our framework was compared with several deep learning approaches, such as BiLSTM, BiLSTM with CRF, GRAM-CNN, and BERT. On the NCBI corpus, our model recorded an F-score of 85.68% for the NER of diseases, an improvement of 1.50% over previous methods. Moreover, although BERT used transfer learning over more than 2.5 billion words, our system showed performance similar to BERT, with an F-score of 81.44% for gene NER on the GM corpus, and outperformed it with an F-score of 86.44% for the NER of chemicals and diseases on the CDR corpus. We conclude that our method significantly improves performance on biomedical NER tasks. Conclusion The proposed approach is robust in recognizing biological entities in text.
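As one possible illustration of feeding n-gram context into the recurrent layers (an interpretation, not the authors' CLSTM), each token's embedding can be concatenated with those of its immediate neighbors before the BiLSTM-CRF:

```python
import torch
import torch.nn.functional as F


def trigram_context(embedded):
    """embedded: (batch, seq_len, dim) -> (batch, seq_len, 3 * dim).
    Concatenates each token's embedding with its left and right neighbors,
    zero-padded at the sentence boundaries."""
    left = F.pad(embedded, (0, 0, 1, 0))[:, :-1, :]   # previous token (zeros at position 0)
    right = F.pad(embedded, (0, 0, 0, 1))[:, 1:, :]   # next token (zeros at the last position)
    return torch.cat([left, embedded, right], dim=-1)


# usage
x = torch.randn(2, 5, 100)        # (batch, seq_len, embedding_dim)
print(trigram_context(x).shape)   # torch.Size([2, 5, 300])
```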


2019
Vol 27 (3)
pp. 457-470
Author(s):
Stephen Wu
Kirk Roberts
Surabhi Datta
Jingcheng Du
Zongcheng Ji
...

Abstract Objective This article methodically reviews the literature on deep learning (DL) for natural language processing (NLP) in the clinical domain, providing quantitative analysis to answer 3 research questions concerning methods, scope, and context of current research. Materials and Methods We searched MEDLINE, EMBASE, Scopus, the Association for Computing Machinery Digital Library, and the Association for Computational Linguistics Anthology for articles using DL-based approaches to NLP problems in electronic health records. After screening 1,737 articles, we collected data on 25 variables across 212 papers. Results DL in clinical NLP publications more than doubled each year, through 2018. Recurrent neural networks (60.8%) and word2vec embeddings (74.1%) were the most popular methods; the information extraction tasks of text classification, named entity recognition, and relation extraction were dominant (89.2%). However, there was a “long tail” of other methods and specific tasks. Most contributions were methodological variants or applications, but 20.8% were new methods of some kind. The earliest adopters were in the NLP community, but the medical informatics community was the most prolific. Discussion Our analysis shows growing acceptance of deep learning as a baseline for NLP research, and of DL-based NLP in the medical community. A number of common associations were substantiated (eg, the preference of recurrent neural networks for sequence-labeling named entity recognition), while others were surprisingly nuanced (eg, the scarcity of French language clinical NLP with deep learning). Conclusion Deep learning has not yet fully penetrated clinical NLP and is growing rapidly. This review highlighted both the popular and unique trends in this active field.


2019
Vol 20 (S16)
Author(s):
Xusheng Li
Chengcheng Fu
Ran Zhong
Duo Zhong
Tingting He
...

Abstract Background Microbes have been shown to play a crucial role in various ecosystems. Many human diseases have been proven to be associated with bacteria, so it is essential to extract interactions between bacteria for medical research and application. At the same time, many bacterial interactions supported by experimental evidence have been reported in the biomedical literature. Integrating this knowledge into a database or knowledge graph could accelerate the progress of biomedical research. A crucial and necessary step in interaction extraction (IE) is named entity recognition (NER). However, due to the specificity of bacterial naming, bacterial named entity recognition still poses challenges. Results In this paper, we propose a novel method for bacterial named entity recognition that integrates domain features into a deep learning framework combining a bidirectional long short-term memory network and a convolutional neural network. Without domain features, the model achieves an F1-measure of 89.14%. After part-of-speech (POS) features and dictionary features are added, the F1-measure rises to 89.7%. Hence, with domain features, our model achieves advanced performance in bacterial NER. Conclusions We propose an efficient method for bacterial named entity recognition that combines domain features and deep learning models. Compared with previous methods, our model performs better. At the same time, the complex manual extraction and feature design process is significantly reduced.
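A dictionary feature of the kind mentioned above can be as simple as a per-token membership flag that is later concatenated with the word representation; the sketch below is illustrative, and the bacteria name list is a placeholder.

```python
def dictionary_feature(tokens, bacteria_dict):
    """Return a 0/1 flag per token marking membership in a bacteria-name dictionary."""
    lowered = {name.lower() for name in bacteria_dict}
    return [1 if tok.lower() in lowered else 0 for tok in tokens]


# usage (placeholder dictionary)
tokens = ["Helicobacter", "pylori", "colonizes", "the", "stomach"]
print(dictionary_feature(tokens, {"Helicobacter", "pylori"}))  # [1, 1, 0, 0, 0]
```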


2021
Vol 4 (1)
pp. 38
Author(s):
Joan Santoso
Esther Irawati Setiawan
Christian Nathaniel Purwanto
Fachrul Kurniawan

Detecting sentence boundaries is one of the crucial pre-processing steps in natural language processing. It defines where one sentence ends and the next begins, since the border between two sentences can be ambiguous. Because there are multiple separators and dynamic sentence patterns, relying on a full stop at the end of a sentence is sometimes inappropriate. This research uses a deep learning approach to split each sentence from an Indonesian news document, so there is no need to define any handcrafted features or rules. As in Part-of-Speech tagging and Named Entity Recognition, we use sequence labeling to determine sentence boundaries. Two labels are used, namely O for a non-boundary token and E for the last token of a sentence. To do this, we use the Bi-LSTM approach, which has been widely applied to sequence labeling. We show that our approach works for Indonesian text using pre-trained Indonesian embeddings, as in previous studies. This study achieved an F1-score of 98.49 percent, a significant improvement over previous results.
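The O/E labeling scheme and the Bi-LSTM tagger can be sketched as follows (illustrative Keras code; vocabulary size, dimensions, and the example tokens are placeholders, and the pre-trained Indonesian vectors would be loaded into the embedding layer).

```python
from tensorflow.keras import layers, models

# Token stream from a news document; E marks the last token of a sentence, O everything else.
# tokens: ["Presiden", "tiba", "di", "Jakarta", "kemarin", "Ia", "langsung", ...]
# labels: ["O",        "O",    "O",  "O",       "E",       "O",  "O",       ...]

model = models.Sequential([
    # pre-trained Indonesian word vectors would be loaded as the initial weights here
    layers.Embedding(input_dim=50_000, output_dim=300, mask_zero=True),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(2, activation="softmax")),  # per-token O / E decision
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```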


2020
Vol 9 (1)
pp. 1344-1349

In this data-driven world, AI is being used in almost every task to automate processes and make human life more comfortable. One industry where the importance of Artificial Intelligence (AI) is growing is recruiting. This paper proposes a new and better method to match the most suitable talent to jobs, implemented in two directions: suggesting the top resumes from a talent pool to a recruiter for a given job opening, and recommending the top matching jobs to a candidate based on the candidate's resume. Natural Language Processing techniques are used to implement this approach: Named Entity Recognition (NER), a word embedding model, and cosine similarity, through which a resume and a job are matched. The NER model extracts useful entities from documents and is enhanced by the word2vec model, which makes the system more generic; the similarity is then calculated using the cosine similarity algorithm.
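The matching step can be sketched as averaging the word2vec vectors of the extracted entities and comparing them with cosine similarity (illustrative; the vector file and the entity lists are placeholders, and extraction by the NER model is assumed to have already happened).

```python
import numpy as np
from gensim.models import KeyedVectors


def doc_vector(entities, kv):
    """Average the word2vec vectors of the entity terms extracted by the NER model."""
    vecs = [kv[term] for term in entities if term in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


# usage (placeholder path and entity lists)
kv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)
resume_entities = ["python", "nlp", "machine", "learning"]
job_entities = ["nlp", "deep", "learning", "python"]
score = cosine_similarity(doc_vector(resume_entities, kv), doc_vector(job_entities, kv))
print(f"resume-job match score: {score:.3f}")   # rank resumes or jobs by this score
```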


Author(s):  
Ismail El Bazi
Nabil Laachfoubi

Most Arabic Named Entity Recognition (NER) systems depend massively on external resources and handmade feature engineering to achieve state-of-the-art results. To overcome such limitations, we propose in this paper a deep learning approach to tackle the Arabic NER task. We introduce a neural network architecture based on a bidirectional Long Short-Term Memory (LSTM) network and Conditional Random Fields (CRF) and experiment with various commonly used hyperparameters to assess their effect on the overall performance of our system. Our model takes two sources of information about words as input, pre-trained word embeddings and character-based representations, and eliminates the need for any task-specific knowledge or feature engineering. We obtained a state-of-the-art result on the standard ANERcorp corpus with an F1 score of 90.6%.
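A common way to obtain the character-based representation that is concatenated with the pre-trained word embedding is a small character-level BiLSTM; the sketch below shows that general pattern (illustrative; dimensions are placeholders, and this is not the authors' exact configuration).

```python
import torch
import torch.nn as nn


class CharWordRepresentation(nn.Module):
    """Concatenate a pre-trained word embedding with a character-level BiLSTM summary."""

    def __init__(self, n_chars, n_words, char_dim=30, word_dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_dim, batch_first=True, bidirectional=True)
        # rows of this table would be overwritten with pre-trained word vectors
        self.word_emb = nn.Embedding(n_words, word_dim)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, s, c = char_ids.shape
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids.view(b * s, c)))
        # concatenate the forward and backward final states as the per-word char summary
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1).view(b, s, -1)
        return torch.cat([self.word_emb(word_ids), char_repr], dim=-1)  # input to BiLSTM-CRF
```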

