Investigation of Improving The Pre-Training And Fine-Tuning of BERT Model For Biomedical Relation Extraction

Author(s):  
Peng Su ◽  
K. Vijay-Shanker

Abstract Background: Recently, automatically extracting biomedical relations has become a significant subject in biomedical research due to the rapid growth of biomedical literature. Since their adaptation to the biomedical domain, transformer-based BERT models have produced leading results on many biomedical natural language processing tasks. In this work, we explore approaches to improve the BERT model for relation extraction tasks in both the pre-training and fine-tuning stages of its application. In the pre-training stage, we add another level of BERT adaptation on sub-domain data to bridge the gap between domain knowledge and task-specific knowledge. We also propose methods to incorporate the knowledge ignored in the last layer of BERT to improve its fine-tuning. Results: The experimental results demonstrate that our approaches for pre-training and fine-tuning can improve the BERT model's performance. After combining the two proposed techniques, our approach outperforms the original BERT models with an average F1 score improvement of 2.1% on relation extraction tasks. Moreover, our approach achieves state-of-the-art performance on three relation extraction benchmark datasets. Conclusions: The extra pre-training step on sub-domain data can help the BERT model generalize to specific tasks, and our proposed fine-tuning mechanism can utilize the knowledge in the last layer of BERT to boost model performance. Furthermore, the combination of these two approaches further improves the performance of the BERT model on relation extraction tasks.
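The extra sub-domain adaptation step described above amounts to continuing masked-language-model pre-training of an already domain-adapted BERT on a narrower corpus before task fine-tuning. Below is a minimal sketch of such a continued pre-training step using the Hugging Face transformers and datasets libraries; the checkpoint name and the sub-domain corpus file are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: extra masked-language-model pre-training on sub-domain text.
# The checkpoint and "subdomain_corpus.txt" are placeholders (assumptions).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "dmis-lab/biobert-base-cased-v1.1"   # assumed domain-adapted starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Sub-domain corpus: one sentence or abstract per line (hypothetical file).
raw = load_dataset("text", data_files={"train": "subdomain_corpus.txt"})
tokenized = raw.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

# Standard masked-language-modeling objective for the extra pre-training stage.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-subdomain", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

The resulting checkpoint would then be fine-tuned on the relation extraction task as usual.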

2021 ◽  
Vol 22 (S1) ◽  
Author(s):  
Cong Sun ◽  
Zhihao Yang ◽  
Lei Wang ◽  
Yin Zhang ◽  
Hongfei Lin ◽  
...  

Abstract Background The recognition of pharmacological substances, compounds and proteins is essential for biomedical relation extraction, knowledge graph construction, drug discovery and medical question answering. Although considerable effort has been devoted to recognizing biomedical entities in English texts, to date only a few limited attempts have been made to recognize them in biomedical texts in other languages. PharmaCoNER is a named entity recognition challenge for recognizing pharmacological entities in Spanish texts. Given the abundant resources now available in natural language processing, how to leverage these resources for the PharmaCoNER challenge is a meaningful question to study. Methods Inspired by the success of deep learning with language models, we compare and explore various representative BERT models to promote the development of the PharmaCoNER task. Results The experimental results show that deep learning with language models can effectively improve model performance on the PharmaCoNER dataset. Our method achieves state-of-the-art performance on the PharmaCoNER dataset, with a maximum F1-score of 92.01%. Conclusion For the BERT models on the PharmaCoNER dataset, biomedical domain knowledge has a greater impact on model performance than the native language (i.e., Spanish). The BERT models can obtain competitive performance by using WordPiece to alleviate the out-of-vocabulary limitation, and their performance can be further improved by constructing a specific vocabulary based on domain knowledge. Moreover, character casing also has some impact on model performance.
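For context, recognizing PharmaCoNER entities with a BERT model reduces to standard token classification over WordPiece sub-words. The sketch below shows this setup with the Hugging Face transformers library; the multilingual checkpoint, the reduced label set and the example sentence are illustrative assumptions rather than the exact configurations compared in the paper.

```python
# Minimal sketch: BERT token classification for Spanish pharmacological NER.
# Checkpoint and label set are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-NORMALIZABLES", "I-NORMALIZABLES", "B-PROTEINAS", "I-PROTEINAS"]
checkpoint = "bert-base-multilingual-cased"       # one of several candidate models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

# WordPiece alleviates out-of-vocabulary Spanish/biomedical tokens via sub-word splitting.
enc = tokenizer("La metformina reduce la glucosa plasmática.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                  # (1, seq_len, num_labels)
pred = [labels[i] for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), pred)))
```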


2015 ◽  
Vol 5 (3) ◽  
pp. 19-38 ◽  
Author(s):  
María Herrero-Zazo ◽  
Isabel Segura-Bedmar ◽  
Janna Hastings ◽  
Paloma Martínez

Natural Language Processing (NLP) techniques provide an interesting way to mine the growing biomedical literature and a promising approach to new knowledge discovery. However, the major bottleneck in this area is that such systems rely on specific resources that provide the domain knowledge. Domain ontologies provide a contextual framework and a semantic representation of the domain, and they can contribute to better performance of current NLP systems. However, their contribution to information extraction has not yet been well studied. The aim of this paper is to provide insights into the potential role that domain ontologies can play in NLP. To do this, the authors apply the drug-drug interactions ontology (DINTO) to named entity recognition and relation extraction from pharmacological texts. The authors use the DDI corpus, a gold-standard corpus for the development and evaluation of information extraction systems in this domain, and evaluate their results within the framework of the SemEval-2013 DDIExtraction task.


Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Tao Chen ◽  
Mingfen Wu ◽  
Hexi Li

Abstract The automatic extraction of meaningful relations from biomedical literature or clinical records is crucial in various biomedical applications. Most current deep learning approaches to medical relation extraction require large-scale training data to prevent overfitting of the training model. We propose using a pre-trained model and a fine-tuning technique to improve these approaches without additional time-consuming human labeling. First, we describe the architecture of Bidirectional Encoder Representations from Transformers (BERT), an approach for pre-training a model on large-scale unstructured text. We then combine BERT with a one-dimensional convolutional neural network (1d-CNN) to fine-tune the pre-trained model for relation extraction. Extensive experiments on three datasets, namely the BioCreative V chemical disease relation corpus, the traditional Chinese medicine literature corpus and the i2b2 2012 temporal relation challenge corpus, show that the proposed approach achieves state-of-the-art results (giving relative improvements of 22.2%, 7.77% and 38.5% in F1 score, respectively, compared with a traditional 1d-CNN classifier). The source code is available at https://github.com/chentao1999/MedicalRelationExtraction.
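The combination described above can be read as BERT producing contextual token representations that a one-dimensional convolution then pools into a fixed-size vector for relation classification. The following is a minimal sketch of such a BERT + 1d-CNN head in PyTorch; the checkpoint, filter count, kernel size and binary label set are assumptions for illustration, not the exact architecture released in the linked repository.

```python
# Minimal sketch: BERT encoder followed by a 1d-CNN pooling head for relation classification.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertCnnRE(nn.Module):
    def __init__(self, checkpoint="bert-base-uncased", num_relations=2,
                 filters=128, kernel_size=3):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        hidden = self.bert.config.hidden_size
        self.conv = nn.Conv1d(hidden, filters, kernel_size, padding=1)
        self.classifier = nn.Linear(filters, num_relations)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h = self.conv(h.transpose(1, 2))          # (batch, filters, seq_len)
        h = torch.relu(h).max(dim=-1).values      # max-pool over the sequence
        return self.classifier(h)                 # relation logits

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["Aspirin induces gastric ulcers in rats."], return_tensors="pt")
print(BertCnnRE()(batch["input_ids"], batch["attention_mask"]).shape)   # torch.Size([1, 2])
```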


GEOMATICA ◽  
2020 ◽  
Author(s):  
Qinjun Qiu ◽  
Zhong Xie ◽  
Liang Wu

Unlike English and other Western languages, Chinese does not delimit words with white space. Chinese Word Segmentation (CWS) is therefore the crucial first step of natural language processing. However, for the geoscience subject domain, the CWS problem remains unresolved and presents many challenges. Although traditional methods can be used to process geoscience documents, they lack the domain knowledge needed for large collections of geoscience documents. These challenges motivated us to build a segmenter specifically for the geoscience domain. Currently, most state-of-the-art methods for Chinese word segmentation are based on supervised learning, with features mostly extracted from a local context. In this paper, we propose a framework for sequence learning that incorporates cyclic self-learning corpus training. Following this framework, we build GeoSegmenter on a Bi-directional Long Short-Term Memory (Bi-LSTM) network model to perform Chinese word segmentation. The model gains a substantial advantage from iterating over the training data. Empirical results on geoscience documents and benchmark datasets show that GeoSegmenter can segment geological documents correctly and can also handle generic documents.
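At its core, Bi-LSTM-based Chinese word segmentation is character-level sequence labelling, typically with BMES tags (Begin/Middle/End of a word, or Single-character word). The sketch below shows only that core tagger in PyTorch; the vocabulary, dimensions and the cyclic self-learning corpus training loop described in the paper are omitted.

```python
# Minimal sketch: Bi-LSTM character tagger for Chinese word segmentation (BMES tags).
import torch
import torch.nn as nn

TAGS = ["B", "M", "E", "S"]

class BiLSTMSegmenter(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, len(TAGS))

    def forward(self, char_ids):                  # (batch, seq_len) of character indices
        h, _ = self.lstm(self.embed(char_ids))
        return self.out(h)                        # (batch, seq_len, 4) tag scores

model = BiLSTMSegmenter(vocab_size=5000)
chars = torch.randint(0, 5000, (1, 8))            # a toy 8-character sentence
tags = model(chars).argmax(-1)[0]                 # predicted BMES tag per character
print([TAGS[t] for t in tags])
```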


2019 ◽  
Author(s):  
Peng Su ◽  
Gang Li ◽  
Cathy Wu ◽  
K. Vijay-Shanker

Abstract Significant progress has recently been made in applying deep learning to natural language processing tasks. However, deep learning models typically require large amounts of annotated training data, while only small labeled datasets are available for many natural language processing tasks in the biomedical literature. Building large datasets for deep learning is expensive, since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained by distant supervision. However, because data obtained by distant supervision are often noisy, we first apply heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
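As a concrete illustration of the kind of heuristic filtering that can be applied to distantly supervised examples, the sketch below keeps a sentence only if it is reasonably short, its two entities are close together, and a relation trigger word appears between them. These rules and thresholds are illustrative assumptions, not the heuristics used in the paper.

```python
# Minimal sketch: heuristic filtering of noisy distantly supervised sentences.
# The specific rules and thresholds are assumptions for illustration only.
def keep_example(sentence_tokens, ent1_idx, ent2_idx, trigger_words):
    """Return True if a distantly supervised sentence looks reliable enough to keep."""
    if len(sentence_tokens) > 60:                       # very long sentences tend to be noisy
        return False
    if abs(ent1_idx - ent2_idx) > 20:                   # entities too far apart
        return False
    between = sentence_tokens[min(ent1_idx, ent2_idx) + 1:max(ent1_idx, ent2_idx)]
    return any(tok.lower() in trigger_words for tok in between)   # require a relation cue

toks = "EGFR activates the MAPK signalling cascade in tumour cells".split()
print(keep_example(toks, 0, 3, {"activates", "inhibits", "phosphorylates"}))  # True
```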


2020 ◽  
Author(s):  
Tian Cai ◽  
Hansaim Lim ◽  
Kyra Alyssa Abbu ◽  
Yue Qiu ◽  
Ruth Nussinov ◽  
...  

Abstract Endogenous or surrogate ligands of a vast number of proteins remain unknown. Identification of small molecules that bind to these orphan proteins will not only shed new light on their biological functions but also provide new opportunities for drug discovery. Deep learning plays an increasing role in the prediction of chemical-protein interactions, but it faces several challenges in protein deorphanization. Bioassay data are highly biased toward certain proteins, making it difficult to train a generalizable machine learning model for proteins that are dissimilar from those in the training data set. Pre-training offers a general solution for improving model generalization, but it requires the incorporation of domain knowledge and the customization of task-specific supervised learning. To address these challenges, we develop a novel protein pre-training method, DIstilled Sequence Alignment Embedding (DISAE), and a module-based fine-tuning strategy for protein deorphanization. In benchmark studies, DISAE significantly improves generalizability and outperforms state-of-the-art methods by a large margin. An interpretability analysis of the pre-trained model suggests that it learns biologically meaningful information. We further use DISAE to assign ligands to 649 human orphan G-Protein Coupled Receptors (GPCRs) and to cluster the human GPCRome by integrating their phylogenetic and ligand relationships. The promising results of DISAE open an avenue for exploring the chemical landscape of entire sequenced genomes.
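Module-based fine-tuning generally means freezing most of a pre-trained encoder and updating only selected modules together with the task head. The sketch below shows that pattern in PyTorch; the module names and the frozen/unfrozen split are illustrative and do not reproduce DISAE's actual architecture.

```python
# Minimal sketch: module-based fine-tuning by freezing all but selected encoder modules.
import torch.nn as nn

def module_based_finetune(encoder: nn.Module, head: nn.Module, unfrozen=("layer.11",)):
    for name, param in encoder.named_parameters():
        param.requires_grad = any(key in name for key in unfrozen)   # freeze everything else
    for param in head.parameters():
        param.requires_grad = True                                   # always train the task head
    trainable = [p for p in list(encoder.parameters()) + list(head.parameters())
                 if p.requires_grad]
    return trainable                                                  # pass these to the optimizer
```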


2021 ◽  
Author(s):  
Ziheng Zhang ◽  
Feng Han ◽  
Hongjian Zhang ◽  
Tomohiro Aoki ◽  
Katsuhiko Ogasawara

BACKGROUND Biomedical terms extracted using Word2vec, the most popular word embedding model in recent years, serve as the foundation for various natural language processing (NLP) applications, such as biomedical information retrieval, relation extraction, and recommendation systems. OBJECTIVE The objective of this study is to examine how changes in the ratio of biomedical-domain to general-domain data in the corpus affect the extraction of similar biomedical terms using Word2vec. METHODS We downloaded abstracts of 214,892 articles from PubMed Central (PMC) and the 3.9 GB Billion Word (BW) benchmark corpus from the computer science community. The datasets were preprocessed and grouped into 11 corpora based on the ratio of BW to PMC, ranging from 0:10 to 10:0, and Word2vec models were trained on these corpora. The cosine similarities between the biomedical terms obtained from the Word2vec models were then compared across models. RESULTS The results indicated that the models trained with both BW and PMC data outperformed the model trained only on medical data. The similarity between the biomedical terms extracted by the Word2vec models increased when the ratio of biomedical-domain to general-domain data was between 3:7 and 5:5. CONCLUSIONS This study gives NLP researchers a better-informed basis for applying Word2vec and for increasing the similarity of extracted biomedical terms, improving their effectiveness in NLP applications such as biomedical information extraction.
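The experimental setup can be pictured as training Word2vec on sentence collections mixed at a fixed general-to-biomedical ratio and then comparing cosine similarities of biomedical term pairs across models. The gensim sketch below illustrates this with toy sentences standing in for the BW and PMC corpora; the 3:7 mix and the term pair are placeholders.

```python
# Minimal sketch: Word2vec trained on a general/biomedical corpus mix, then term similarity.
from gensim.models import Word2Vec

general = [["the", "market", "rose", "sharply"]] * 3                    # "BW"-like sentences
biomedical = [["aspirin", "inhibits", "platelet", "aggregation"]] * 7   # "PMC"-like sentences
corpus = general + biomedical                                           # 3:7 general-to-biomedical mix

model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, epochs=50)
print(model.wv.similarity("aspirin", "platelet"))                       # cosine similarity of two terms
```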


Author(s):  
Amir Pouran Ben Veyseh ◽  
Thien Nguyen ◽  
Dejing Dou

Relation Extraction (RE) is one of the fundamental tasks in Information Extraction and Natural Language Processing. Dependency trees have been shown to be a very useful source of information for this task. Current deep learning models for relation extraction have mainly exploited this dependency information by guiding their computation along the structures of the dependency trees. One potential problem with this approach is that it might prevent the models from capturing important context information beyond the syntactic structures and cause poor cross-domain generalization. This paper introduces a novel method for using dependency trees in RE with deep learning models that jointly predicts dependency and semantic relations. We also propose a new mechanism to control the information flow in the model based on the input entity mentions. Our extensive experiments on benchmark datasets show that the proposed model significantly outperforms existing methods for RE.


Author(s):  
Jie Liu ◽  
Shaowei Chen ◽  
Bingquan Wang ◽  
Jiaxin Zhang ◽  
Na Li ◽  
...  

Joint entity and relation extraction is critical for many natural language processing (NLP) tasks and has attracted increasing research interest. However, it still faces the challenges of identifying overlapping relation triplets along with complete entity boundaries and of detecting multi-type relations. In this paper, we propose an attention-based joint model, consisting mainly of an entity extraction module and a relation detection module, to address these challenges. The key to our model is a supervised multi-head self-attention mechanism, serving as the relation detection module, that learns token-level correlations for each relation type separately. With this attention mechanism, our model can effectively identify overlapping relations and flexibly predict each relation type together with its corresponding intensity. To verify the effectiveness of our model, we conduct comprehensive experiments on two benchmark datasets. The experimental results demonstrate that our model achieves state-of-the-art performance.
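The central idea can be sketched as a multi-head self-attention layer in which each head scores token-pair correlation for one relation type, so the attention map itself acts as the relation detector with a per-pair intensity. The PyTorch sketch below illustrates this; the dimensions and the exact scoring form are assumptions, not the paper's formulation.

```python
# Minimal sketch: one attention head per relation type, producing token-pair intensities.
import torch
import torch.nn as nn

class RelationAttention(nn.Module):
    def __init__(self, hidden_dim=256, num_relations=4):
        super().__init__()
        self.q = nn.Linear(hidden_dim, hidden_dim * num_relations)
        self.k = nn.Linear(hidden_dim, hidden_dim * num_relations)
        self.num_relations = num_relations

    def forward(self, tokens):                    # tokens: (batch, seq_len, hidden_dim)
        b, n, d = tokens.shape
        q = self.q(tokens).view(b, n, self.num_relations, d).transpose(1, 2)
        k = self.k(tokens).view(b, n, self.num_relations, d).transpose(1, 2)
        scores = torch.matmul(q, k.transpose(-1, -2)) / d ** 0.5
        return torch.sigmoid(scores)              # (batch, relation, seq_len, seq_len) intensities

attn = RelationAttention()
print(attn(torch.randn(2, 10, 256)).shape)        # torch.Size([2, 4, 10, 10])
```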


Author(s):  
Jinhyuk Lee ◽  
Wonjin Yoon ◽  
Sungdong Kim ◽  
Donghyeon Kim ◽  
Sunkyu Kim ◽  
...  

Abstract Motivation Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. Results We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. Availability and implementation We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
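For readers who want to try BioBERT directly, the released weights can also be loaded through the Hugging Face hub and paired with a sequence-classification head for relation extraction, as sketched below. The hub identifier and the entity-masked input format are assumptions about a convenient setup, not part of the paper's official release instructions (the official links are given above).

```python
# Minimal sketch: loading BioBERT weights and attaching a relation-classification head.
# Note: the classification head is randomly initialized until fine-tuned on an RE dataset.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.1", num_labels=2)    # e.g. relation / no relation

inputs = tokenizer("@GENE$ mutations are associated with @DISEASE$.", return_tensors="pt")
print(model(**inputs).logits)                             # unnormalized relation scores
```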

