scholarly journals Improving Biomedical Named Entity Recognition with Syntactic Information

2020 ◽  
Author(s):  
YUANHE TIAN ◽  
Wang Shen ◽  
Yan Song ◽  
Fei Xia ◽  
Min He ◽  
...  

Abstract Background Biomedical named entity recognition (BioNER) is an important task for understanding biomedical texts. The task can be challenging due to the lack of large-scale labeled training data and domain knowledge. Previous studies have shown that syntactic information can be useful for named entity recognition; however, most of them fail to weigh that information with respect to its contribution as they treat the syntactic information as gold reference. Results In this paper, we propose BioKMNER, a BioNER model for biomedical texts with key-value memory networks to incorporate syntactic information, which is extracted from syntactic structures automatically generated by existing toolkits. Our approach outperforms baselines without memories and achieves new state-of-the-art results on on four biomedical datasets compared with previous studies, i.e., 85.67% on BC2GM, 94.22% on BC5CDR-chemical, 90.11% on NCBI-diease, and 76.33% on Species-800. Conclusion Experimental results on four benchmark datasets demonstrate the effectiveness of our method, where the state-of-the-art performance is achieved on all of them.

2020 ◽  
Author(s):  
Usman Naseem ◽  
Matloob Khushi ◽  
Vinay Reddy ◽  
Sakthivel Rajendran ◽  
Imran Razzak ◽  
...  

Abstract Background: In recent years, with the growing amount of biomedical documents, coupled with advancement in natural language processing algorithms, the research on biomedical named entity recognition (BioNER) has increased exponentially. However, BioNER research is challenging as NER in the biomedical domain are: (i) often restricted due to limited amount of training data, (ii) an entity can refer to multiple types and concepts depending on its context and, (iii) heavy reliance on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt the state-of-the-art (SOTA) models trained in general corpora which often yields unsatisfactory results. Results: We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) - bioALBERT - an effective domain-specific pre-trained language model trained on huge biomedical corpus designed to capture biomedical context-dependent NER. We adopted self-supervised loss function used in ALBERT that targets on modelling inter-sentence coherence to better learn context-dependent representations and incorporated parameter reduction strategies to minimise memory usage and enhance the training time in BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets with four different entity types. The performance is increased for; (i) disease type corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug-chem type corpora by 4.61% (BC5CDR-Chem) and 3.89 (BC4CHEMD); (iii) gene-protein type corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) Species type corpora by 6.19% (LINNAEUS) and 23.71% (Species-800) is observed which leads to a state-of-the-art results. Conclusions: The performance of proposed model on four different biomedical entity types shows that our model is robust and generalizable in recognizing biomedical entities in text. We trained four different variants of BioALBERT models which are available for the research community to be used in future research.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Yuanhe Tian ◽  
Wang Shen ◽  
Yan Song ◽  
Fei Xia ◽  
Min He ◽  
...  

Abstract Background Biomedical named entity recognition (BioNER) is an important task for understanding biomedical texts, which can be challenging due to the lack of large-scale labeled training data and domain knowledge. To address the challenge, in addition to using powerful encoders (e.g., biLSTM and BioBERT), one possible method is to leverage extra knowledge that is easy to obtain. Previous studies have shown that auto-processed syntactic information can be a useful resource to improve model performance, but their approaches are limited to directly concatenating the embeddings of syntactic information to the input word embeddings. Therefore, such syntactic information is leveraged in an inflexible way, where inaccurate one may hurt model performance. Results In this paper, we propose BioKMNER, a BioNER model for biomedical texts with key-value memory networks (KVMN) to incorporate auto-processed syntactic information. We evaluate BioKMNER on six English biomedical datasets, where our method with KVMN outperforms the strong baseline method, namely, BioBERT, from the previous study on all datasets. Specifically, the F1 scores of our best performing model are 85.29% on BC2GM, 77.83% on JNLPBA, 94.22% on BC5CDR-chemical, 90.08% on NCBI-disease, 89.24% on LINNAEUS, and 76.33% on Species-800, where state-of-the-art performance is obtained on four of them (i.e., BC2GM, BC5CDR-chemical, NCBI-disease, and Species-800). Conclusion The experimental results on six English benchmark datasets demonstrate that auto-processed syntactic information can be a useful resource for BioNER and our method with KVMN can appropriately leverage such information to improve model performance.


2018 ◽  
Author(s):  
Xuan Wang ◽  
Yu Zhang ◽  
Xiang Ren ◽  
Yuhao Zhang ◽  
Marinka Zitnik ◽  
...  

AbstractMotivationState-of-the-art biomedical named entity recognition (BioNER) systems often require handcrafted features specific to each entity type, such as genes, chemicals and diseases. Although recent studies explored using neural network models for BioNER to free experts from manual feature engineering, the performance remains limited by the available training data for each entity type.ResultsWe propose a multi-task learning framework for BioNER to collectively use the training data of different types of entities and improve the performance on each of them. In experiments on 15 benchmark BioNER datasets, our multi-task model achieves substantially better performance compared with state-of-the-art BioNER systems and baseline neural sequence labeling models. Further analysis shows that the large performance gains come from sharing character- and word-level information among relevant biomedical entities across differently labeled corpora.AvailabilityOur source code is available at https://github.com/yuzhimanhua/[email protected], [email protected].


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Kanix Wang ◽  
Robert Stevens ◽  
Halima Alachram ◽  
Yu Li ◽  
Larisa Soldatova ◽  
...  

AbstractMachine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades1,2, the most dramatic advances in MR have followed in the wake of critical corpus development3. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet4 was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new, Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction, and; (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.


2020 ◽  
Vol 36 (15) ◽  
pp. 4331-4338
Author(s):  
Mei Zuo ◽  
Yang Zhang

Abstract Motivation Named entity recognition is a critical and fundamental task for biomedical text mining. Recently, researchers have focused on exploiting deep neural networks for biomedical named entity recognition (Bio-NER). The performance of deep neural networks on a single dataset mostly depends on data quality and quantity while high-quality data tends to be limited in size. To alleviate task-specific data limitation, some studies explored the multi-task learning (MTL) for Bio-NER and achieved state-of-the-art performance. However, these MTL methods did not make full use of information from various datasets of Bio-NER. The performance of state-of-the-art MTL method was significantly limited by the number of training datasets. Results We propose two dataset-aware MTL approaches for Bio-NER which jointly train all models for numerous Bio-NER datasets, thus each of these models could discriminatively exploit information from all of related training datasets. Both of our two approaches achieve substantially better performance compared with the state-of-the-art MTL method on 14 out of 15 Bio-NER datasets. Furthermore, we implemented our approaches by incorporating Bio-NER and biomedical part-of-speech (POS) tagging datasets. The results verify Bio-NER and POS can significantly enhance one another. Availability and implementation Our source code is available at https://github.com/zmmzGitHub/MTL-BC-LBC-BioNER and all datasets are publicly available at https://github.com/cambridgeltl/MTL-Bioinformatics-2016. Supplementary information Supplementary data are available at Bioinformatics online.


2015 ◽  
Vol 7 (S1) ◽  
Author(s):  
Tsendsuren Munkhdalai ◽  
Meijing Li ◽  
Khuyagbaatar Batsuren ◽  
Hyeon Ah Park ◽  
Nak Hyeon Choi ◽  
...  

2021 ◽  
Vol 2 (4) ◽  
pp. 1-24
Author(s):  
Pratyay Banerjee ◽  
Kuntal Kumar Pal ◽  
Murthy Devarakonda ◽  
Chitta Baral

In this work, we formulated the named entity recognition (NER) task as a multi-answer knowledge guided question-answer task (KGQA) and showed that the knowledge guidance helps to achieve state-of-the-art results for 11 of 18 biomedical NER datasets. We prepended five different knowledge contexts—entity types, questions, definitions, and examples—to the input text and trained and tested BERT-based neural models on such input sequences from a combined dataset of the 18 different datasets. This novel formulation of the task (a) improved named entity recognition and illustrated the impact of different knowledge contexts, (b) reduced system confusion by limiting prediction to a single entity-class for each input token (i.e., B , I , O only) compared to multiple entity-classes in traditional NER (i.e., B entity 1, B entity 2, I entity 1, I , O ), (c) made detection of nested entities easier, and (d) enabled the models to jointly learn NER-specific features from a large number of datasets. We performed extensive experiments of this KGQA formulation on the biomedical datasets, and through the experiments, we showed when knowledge improved named entity recognition. We analyzed the effect of the task formulation, the impact of the different knowledge contexts, the multi-task aspect of the generic format, and the generalization ability of KGQA. We also probed the model to better understand the key contributors for these improvements.


2020 ◽  
Author(s):  
Xie-Yuan Xie

Abstract Named Entity Recognition (NER) is a key task which automatically extracts Named Entities (NE) from the text. Names of persons, places, date and time are examples of NEs. We are applying Conditional Random Fields (CRFs) for NER in biomedical domain. Examples of NEs in biomedical texts are gene, proteins. We used a minimal set of features to train CRF algorithm and obtained a good results for biomedical texts.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Huiwei Zhou ◽  
Zhe Liu ◽  
Chengkun Lang ◽  
Yibin Xu ◽  
Yingyu Lin ◽  
...  

Abstract Background Biomedical named entity recognition is one of the most essential tasks in biomedical information extraction. Previous studies suffer from inadequate annotated datasets, especially the limited knowledge contained in them. Methods To remedy the above issue, we propose a novel Biomedical Named Entity Recognition (BioNER) framework with label re-correction and knowledge distillation strategies, which could not only create large and high-quality datasets but also obtain a high-performance recognition model. Our framework is inspired by two points: (1) named entity recognition should be considered from the perspective of both coverage and accuracy; (2) trustable annotations should be yielded by iterative correction. Firstly, for coverage, we annotate chemical and disease entities in a large-scale unlabeled dataset by PubTator to generate a weakly labeled dataset. For accuracy, we then filter it by utilizing multiple knowledge bases to generate another weakly labeled dataset. Next, the two datasets are revised by a label re-correction strategy to construct two high-quality datasets, which are used to train two recognition models, respectively. Finally, we compress the knowledge in the two models into a single recognition model with knowledge distillation. Results Experiments on the BioCreative V chemical-disease relation corpus and NCBI Disease corpus show that knowledge from large-scale datasets significantly improves the performance of BioNER, especially the recall of it, leading to new state-of-the-art results. Conclusions We propose a framework with label re-correction and knowledge distillation strategies. Comparison results show that the two perspectives of knowledge in the two re-corrected datasets respectively are complementary and both effective for BioNER.


2021 ◽  
Author(s):  
Christoph Brandl ◽  
Jens Albrecht ◽  
Renato Budinich

The task of relation extraction aims at classifying the semantic relations between entities in a text. When coupled with named-entity recognition these can be used as the building blocks for an information extraction procedure that results in the construction of a Knowledge Graph. While many NLP libraries support named-entity recognition, there is no off-the-shelf solution for relation extraction. In this paper, we evaluate and compare several state-of-the-art approaches on a subset of the FewRel data set as well as a manually annotated corpus. The custom corpus contains six relations from the area of market research and is available for public use. Our approach provides guidance for the selection of models and training data for relation extraction in realworld projects.


Sign in / Sign up

Export Citation Format

Share Document