Chinese named entity identification using class-based language model

Abstract Background: In recent years, the growing volume of biomedical documents, coupled with advances in natural language processing algorithms, has driven an exponential increase in research on biomedical named entity recognition (BioNER). However, BioNER is challenging because (i) training data is often limited, (ii) an entity can refer to multiple types and concepts depending on its context, and (iii) biomedical text relies heavily on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt state-of-the-art (SOTA) models trained on general corpora, which often yields unsatisfactory results.

Results: We propose BioALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), an effective domain-specific pre-trained language model trained on a large biomedical corpus and designed to capture biomedical context-dependent NER. We adopted the self-supervised loss function used in ALBERT, which targets modelling inter-sentence coherence, to better learn context-dependent representations, and incorporated parameter reduction strategies to minimise memory usage and shorten training time in BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets covering four entity types. Performance improved for (i) disease corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug-chem corpora by 4.61% (BC5CDR-Chem) and 3.89% (BC4CHEMD); (iii) gene-protein corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) species corpora by 6.19% (LINNAEUS) and 23.71% (Species-800), leading to state-of-the-art results.

Conclusions: The performance of the proposed model on four different biomedical entity types shows that it is robust and generalizable in recognizing biomedical entities in text. We trained four variants of BioALBERT, which are available to the research community for use in future research.

Named Entity Identification and Cyberinfrastructure

Research and Advanced Technology for Digital Libraries - Lecture Notes in Computer Science ◽

10.1007/978-3-540-74851-9_22 ◽

2007 ◽

pp. 259-270 ◽

Cited By ~ 5

Author(s):

Alison Babeu ◽

David Bamman ◽

Gregory Crane ◽

Robert Kummer ◽

Gabriel Weaver

Keyword(s):

Named Entity ◽

Entity Identification

A Comparative Study on the Performance of Named Entity Recognition in Materials and Chemistry Fields through Multiple Embedding Combination Based on a Pre-trained Neural Network Language Model

Journal of KIISE ◽

10.5626/jok.2021.48.6.696 ◽

2021 ◽

Vol 48 (6) ◽

pp. 696-706

Author(s):

Myunghoon Lee ◽

Hyeonho Shin ◽

Hong-Woo Chun ◽

Jae-Min Lee ◽

Taehyun Ha ◽

...

Keyword(s):

Neural Network ◽

Comparative Study ◽

Language Model ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Trained Neural Network ◽

Network Language

Leveraging Concept-Enhanced Pre-Training Model and Masked-Entity Language Model for Named Entity Disambiguation

IEEE Access ◽

10.1109/access.2020.2994247 ◽

2020 ◽

Vol 8 ◽

pp. 100469-100484

Author(s):

Zizheng Ji ◽

Lin Dai ◽

Jin Pang ◽

Tingting Shen

Keyword(s):

Language Model ◽

Training Model ◽

Named Entity ◽

Entity Disambiguation ◽

Named Entity Disambiguation

Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training

10.18653/v1/2021.emnlp-main.810 ◽

2021 ◽

Author(s):

Yu Meng ◽

Yunyi Zhang ◽

Jiaxin Huang ◽

Xuan Wang ◽

Yu Zhang ◽

...

Keyword(s):

Language Model ◽

Named Entity Recognition ◽

Entity Recognition ◽

Robust Learning ◽

Named Entity ◽

Noise Robust

Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization

BMC Bioinformatics ◽

10.1186/s12859-021-04247-9 ◽

2021 ◽

Vol 22 (S1) ◽

Author(s):

Renzo M. Rivera-Zavala ◽

Paloma Martínez

Keyword(s):

Deep Learning ◽

Recognition Performance ◽

Named Entity Recognition ◽

Biomedical Literature ◽

Entity Recognition ◽

Pharmaceutical Chemical ◽

Learning Models ◽

Named Entity ◽

Entity Identification ◽

Biomedical Texts

Abstract Background: The volume of biomedical literature and clinical data is growing at an exponential rate. Efficient access to data described in unstructured biomedical texts is therefore a crucial task for the biomedical industry and research. Named Entity Recognition (NER) is the first step in information and knowledge acquisition when dealing with unstructured texts. Recent NER approaches use contextualized word representations as input for a downstream classification task. However, distributed word vectors (embeddings) are very limited for Spanish, and even more so for the biomedical domain.

Methods: In this work, we develop several biomedical Spanish word representations and introduce two deep learning approaches for recognizing pharmaceutical, chemical, and other biomedical entities in Spanish clinical case texts and biomedical texts, one based on a Bi-LSTM-CRF model and the other on a BERT-based architecture.

Results: Several Spanish biomedical embeddings, together with the two deep learning models, were evaluated on the PharmaCoNER and CORD-19 datasets. The PharmaCoNER dataset is composed of Spanish clinical cases annotated with drugs, chemical compounds, and pharmacological substances; on it, our extended Bi-LSTM-CRF model obtains an F-score of 85.24% on entity identification and classification, and the BERT model obtains an F-score of 88.80%. For the entity normalization task, the extended Bi-LSTM-CRF model achieves an F-score of 72.85% and the BERT model achieves 79.97%. The CORD-19 dataset consists of scholarly articles written in English and annotated with biomedical concepts such as disorder, species, chemical or drug, gene and protein, enzyme, and anatomy. On CORD-19, the Bi-LSTM-CRF and BERT models obtain F-measures of 78.23% and 78.86%, respectively, on entity identification and classification.

Conclusion: These results show that deep learning models with in-domain knowledge learned from large-scale datasets substantially improve named entity recognition performance. Moreover, contextualized representations help capture the complexity and ambiguity inherent in biomedical texts. Embeddings based on words, concepts, senses, and so on are needed for languages other than English to improve NER in those languages.

DeNERT-KG: Named Entity and Relation Extraction Model Using DQN, Knowledge Graph, and BERT

Applied Sciences ◽

10.3390/app10186429 ◽

2020 ◽

Vol 10 (18) ◽

pp. 6429

Author(s):

SungMin Yang ◽

SoYeop Yoo ◽

OkRan Jeong

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Language Model ◽

Named Entity Recognition ◽

Relation Extraction ◽

Entity Recognition ◽

Knowledge Graph ◽

Named Entity ◽

Artificial Intelligence Technology

Alongside studies on artificial intelligence technology, research is being actively carried out in natural language processing, which aims to understand and process human (natural) language. For computers to learn on their own, the ability to understand natural language is very important. Natural language processing involves a wide variety of tasks, but we focus on named entity recognition and relation extraction, which are considered the most important for understanding sentences. We propose DeNERT-KG, a model that extracts subjects, objects, and relationships in order to grasp the meaning inherent in a sentence. A named entity recognition (NER) model for extracting the subject and object is built on the BERT language model and a Deep Q-Network, and a knowledge graph is applied for relation extraction. With DeNERT-KG, it is possible to extract the subject, type of subject, object, type of object, and relationship from a sentence, and we verify the model through experiments.

Using a Pre-Trained Language Model for Medical Named Entity Extraction in Chinese Clinic Text

2020 IEEE 10th International Conference on Electronics Information and Emergency Communication (ICEIEC) ◽

10.1109/iceiec49280.2020.9152257 ◽

2020 ◽

Author(s):

Mengyuan Zhang ◽

Jin Wang ◽

Xuejie Zhang

Keyword(s):

Language Model ◽

Entity Extraction ◽

Named Entity ◽

Named Entity Extraction

An Improved Tri-Training Based Named Entity Identification Approach for Legal Knowledgebase of Properties Involved in Criminal Cases

10.1109/icnisc54316.2021.00124 ◽

2021 ◽

Author(s):

Yimin Yang ◽

Zhaochong Wang ◽

Zongshen Jiang

Keyword(s):

Criminal Cases ◽

Named Entity ◽

Identification Approach ◽

Entity Identification

Joint Pre-trained Chinese Named Entity Recognition based on Bi-directional Language Model

BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition
