Biomedical relation extraction via knowledge-enhanced reading comprehension

Abstract Background In biomedical research, extracting chemical and disease relations from unstructured biomedical literature is an essential task. Effective context understanding and knowledge integration are the two main research problems in this task. Most work on relation extraction focuses on classifying entity mention pairs. Given the effectiveness of machine reading comprehension (RC) for context understanding, solving biomedical relation extraction with the RC framework at both the intra-sentential and inter-sentential levels is a new topic worth exploring. Besides unstructured biomedical text, many structured knowledge bases (KBs) provide valuable guidance for biomedical relation extraction, so utilizing knowledge in the RC framework is also worth investigating. We propose a knowledge-enhanced reading comprehension (KRC) framework to leverage reading comprehension and prior knowledge for biomedical relation extraction. First, we generate questions for each relation, which reformulates the relation extraction task as a question answering task. Second, based on the RC framework, we integrate knowledge representation through an efficient knowledge-enhanced attention interaction mechanism to guide biomedical relation extraction. Results The proposed model was evaluated on the BioCreative V CDR dataset and the CHR dataset. Experiments show that our model achieved competitive document-level F1 scores of 71.18% and 93.3%, respectively, compared with other methods. Conclusion Result analysis reveals that open-domain reading comprehension data and knowledge representation can help improve biomedical relation extraction in our proposed KRC framework. Our work can encourage more research on bridging reading comprehension and biomedical relation extraction and advance biomedical relation extraction.
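The reformulation of relation extraction as question answering described in this abstract might be sketched as follows. This is only an illustration under stated assumptions: the question wording, the function name `build_rc_instances`, and the instance layout are hypothetical, since the abstract does not give the actual questions or data format used by the KRC framework.

```python
# Hypothetical question template for a chemical-induced-disease relation.
# The wording is an assumption; the abstract does not list the real questions.
CID_TEMPLATE = "Which diseases are induced by {chemical}?"


def build_rc_instances(document, chemicals, diseases):
    """Turn one document into reading-comprehension instances.

    Each chemical yields one question; the disease mentions found in the
    document are kept as candidate answers for an RC model to score.
    """
    instances = []
    for chemical in chemicals:
        instances.append({
            "question": CID_TEMPLATE.format(chemical=chemical),
            "context": document,          # full abstract, so answers may cross sentences
            "candidates": list(diseases),  # candidate answer spans
        })
    return instances


instances = build_rc_instances(
    "Lithium caused tremor in two patients.",
    chemicals=["lithium"],
    diseases=["tremor", "nausea"],
)
print(instances[0]["question"])
```

Passing the whole document as context is what lets a single question cover both intra-sentential and inter-sentential relations in this sketch.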


Biomedical Relation Extraction Using Distant Supervision

Scientific Programming ◽

10.1155/2020/8893749 ◽

2020 ◽

Vol 2020 ◽

pp. 1-9 ◽

Cited By ~ 2

Author(s):

Nada Boudjellal ◽

Huaping Zhang ◽

Asif Khan ◽

Arshad Ahmad

Keyword(s):

Big Data ◽

Information Extraction ◽

State Of The Art ◽

Relation Extraction ◽

Knowledge Bases ◽

Structured Data ◽

Distant Supervision ◽

Future Challenges ◽

Unstructured Information ◽

Biomedical Relation Extraction

With the accelerating growth of big data, especially in healthcare, information extraction is needed more than ever, for it can convert unstructured information into easily interpretable structured data. Relation extraction is the second of the two important tasks of information extraction. This study presents an overview of relation extraction using distant supervision, providing a generalized architecture of this task based on the state-of-the-art work that proposed this method. It also surveys the methods used in the literature on this topic and describes the knowledge bases and corpora used in the process, which can be helpful for beginner practitioners seeking knowledge on this subject. Finally, the limitations of the proposed approaches and future challenges are highlighted, and possible solutions are proposed.
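As a rough illustration of the distant supervision idea surveyed here, the sketch below labels a sentence with a knowledge-base relation whenever both entities of a triple appear in it. The function name `distant_label`, the triple layout, and the substring matching are assumptions made for this example and are not taken from the paper.

```python
def distant_label(sentences, kb_triples):
    """Align KB triples with text to produce weakly labeled training examples.

    kb_triples: iterable of (head_entity, relation, tail_entity) strings.
    Returns a list of (sentence, head, tail, relation) tuples.
    """
    labeled = []
    for sentence in sentences:
        text = sentence.lower()
        for head, relation, tail in kb_triples:
            # Naive case-insensitive substring match; real systems would use
            # entity recognition and normalization instead.
            if head.lower() in text and tail.lower() in text:
                labeled.append((sentence, head, tail, relation))
    return labeled


kb = [("aspirin", "treats", "headache")]
corpus = ["Aspirin relieves headache.", "Aspirin is a common drug."]
print(distant_label(corpus, kb))
```

Labels produced this way are noisy, because a sentence that mentions both entities does not necessarily express the KB relation.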


PPPred: Classifying Protein-phenotype Co-mentions Extracted from Biomedical Literature

10.1101/654475 ◽

2019 ◽

Author(s):

Morteza Pourreza Shahri ◽

Mandi M. Roe ◽

Gillian Reynolds ◽

Indika Kahanda

Keyword(s):

Relation Extraction ◽

Biomedical Literature ◽

Supervised Machine Learning ◽

Human Phenotype ◽

Unstructured Text ◽

Gold Standard Dataset ◽

Sentence Level ◽

Machine Learning Approach ◽

Human Proteins ◽

Biomedical Relation Extraction

Abstract The MEDLINE database provides an extensive source of scientific articles and heterogeneous biomedical information in the form of unstructured text. Among the most important knowledge present within articles are the relations between human proteins and their phenotypes, which can stay hidden due to the exponential growth of publications. This has presented a range of opportunities for the development of computational methods to extract these biomedical relations from the articles. However, currently, no such method exists for the automated extraction of relations involving human proteins and Human Phenotype Ontology (HPO) terms. In our previous work, we developed a comprehensive database composed of all co-mentions of proteins and phenotypes. In this study, we present a supervised machine learning approach called PPPred (Protein-Phenotype Predictor) for classifying the validity of a given sentence-level co-mention. Using an in-house developed gold standard dataset, we demonstrate that PPPred significantly outperforms several baseline methods. This two-step approach of co-mention extraction and classification constitutes a complete biomedical relation extraction pipeline for extracting protein-phenotype relations. CCS Concepts • Computing methodologies → Information extraction; Supervised learning by classification; • Applied computing → Bioinformatics


Biomedical Literature Mining for Biomedical Relation Extraction

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i8.8493 ◽

2018 ◽

Vol 6 (8) ◽

pp. 84-93

Author(s):

Jahiruddin

Keyword(s):

Relation Extraction ◽

Biomedical Literature ◽

Literature Mining ◽

Biomedical Literature Mining ◽

Biomedical Relation Extraction


A semi-automatic approach to construct Vietnamese ontology from online text

The International Review of Research in Open and Distributed Learning ◽

10.19173/irrodl.v13i5.1250 ◽

2012 ◽

Vol 13 (5) ◽

pp. 148 ◽

Cited By ~ 3

Author(s):

Bao-An Nguyen ◽

Don-Lin Yang

Keyword(s):

Language Processing ◽

Question Answering ◽

Expert Knowledge ◽

Relation Extraction ◽

Knowledge Bases ◽

Instructional Materials ◽

Formal Representation ◽

Text Documents ◽

Ontology Construction ◽

Question Answering Systems

An ontology is an effective formal representation of knowledge used commonly in artificial intelligence, semantic web, software engineering, and information retrieval. In open and distance learning, ontologies are used as knowledge bases for e-learning supplements, educational recommenders, and question answering systems that support students with much needed resources. In such systems, ontology construction is one of the most important phases. Since there are abundant documents on the Internet, useful learning materials can be acquired openly with the use of an ontology. However, due to the lack of system support for ontology construction, it is difficult to construct self-instructional materials for Vietnamese people. In general, the cost of manual acquisition of ontologies from domain documents and expert knowledge is too high. Therefore, we present a support system for Vietnamese ontology construction using pattern-based mechanisms to discover Vietnamese concepts and conceptual relations from Vietnamese text documents. In this system, we use the combination of statistics-based, data mining, and Vietnamese natural language processing methods to develop concept and conceptual relation extraction algorithms to discover knowledge from Vietnamese text documents. From the experiments, we show that our approach provides a feasible solution to build Vietnamese ontologies used for supporting systems in education.


A sui generis QA approach using RoBERTa for adverse drug event identification

BMC Bioinformatics ◽

10.1186/s12859-021-04249-7 ◽

2021 ◽

Vol 22 (S11) ◽

Author(s):

Harshit Jain ◽

Nishant Raj ◽

Suyash Mishra

Keyword(s):

Question Answering ◽

Adverse Drug Events ◽

Short Term Memory ◽

Domain Adaptation ◽

Relation Extraction ◽

Biomedical Literature ◽

Drug Event ◽

Textual Data ◽

Entity Relation Extraction ◽

End To End

Abstract Background Extraction of adverse drug events from biomedical literature and other textual data is an important component of drug-safety monitoring and has attracted the attention of many researchers in healthcare. Existing work centers on entity-relation extraction using bidirectional long short-term memory networks (Bi-LSTM), which do not attain the best feature representations. Results In this paper, we introduce a question answering framework that exploits the robustness, masking and dynamic attention capabilities of RoBERTa through domain adaptation and attempts to overcome the aforementioned limitations. With the formulation of an end-to-end pipeline, our model outperforms prior work by 9.53% F1-score. Conclusion An end-to-end pipeline that leverages a state-of-the-art transformer architecture in conjunction with a QA approach can bolster the performance of entity-relation extraction tasks in the biomedical domain. In particular, we believe our research would be helpful in identifying potential adverse drug reactions in textual data related to mono as well as combination therapy.


Document-Level Biomedical Relation Extraction Leveraging Pretrained Self-Attention Structure and Entity Replacement: Algorithm and Pretreatment Method Validation Study

JMIR Medical Informatics ◽

10.2196/17644 ◽

2020 ◽

Vol 8 (5) ◽

pp. e17644

Author(s):

Xiaofeng Liu ◽

Jianye Fan ◽

Shoubin Dong

Keyword(s):

State Of The Art ◽

Relation Extraction ◽

Biomedical Literature ◽

Long Distance ◽

Replacement Method ◽

Replacement Algorithm ◽

Additional Noise ◽

The Relationship ◽

Document Level ◽

Biomedical Relation Extraction

Background Most current methods applied for intrasentence relation extraction in the biomedical literature are inadequate for document-level relation extraction, in which the relationship may cross sentence boundaries. Hence, some approaches have been proposed to extract relations by splitting the document-level datasets through heuristic rules and learning methods. However, these approaches may introduce additional noise and do not really solve the problem of intersentence relation extraction. It is challenging to avoid noise and extract cross-sentence relations. Objective This study aimed to avoid the errors introduced by dividing the document-level dataset, verify that a self-attention structure can extract biomedical relations in a document with long-distance dependencies and complex semantics, and discuss the relative benefits of different entity pretreatment methods for biomedical relation extraction. Methods This paper proposes a new data preprocessing method and applies a pretrained self-attention structure to document-level biomedical relation extraction, with an entity replacement method, to capture very long-distance dependencies and complex semantics. Results Compared with state-of-the-art approaches, our method greatly improved the precision. The results show that our approach increases the F1 value, compared with state-of-the-art methods. Through experiments on biomedical entity pretreatments, we found that a model using an entity replacement method can improve performance. Conclusions When considering all target entity pairs as a whole in the document-level dataset, a pretrained self-attention structure is suitable to capture very long-distance dependencies and learn the textual context and complicated semantics. A replacement method for biomedical entities is conducive to biomedical relation extraction, especially to document-level relation extraction.
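One way to picture an entity replacement step of the kind this abstract mentions is the sketch below, which swaps annotated mention spans for type placeholders before the text reaches a pretrained model. The placeholder strings, the span format, and the function name `replace_entities` are assumptions for illustration; the abstract does not specify the paper's actual replacement scheme.

```python
def replace_entities(text, mentions):
    """Replace entity mentions with type placeholders.

    mentions: list of (start, end, placeholder) character spans, assumed
    non-overlapping. Spans are applied from right to left so that earlier
    offsets stay valid while the string changes length.
    """
    result = text
    for start, end, placeholder in sorted(mentions, key=lambda m: m[0], reverse=True):
        result = result[:start] + placeholder + result[end:]
    return result


sentence = "Lithium induced tremor in patients."
spans = [(0, 7, "@CHEMICAL$"), (16, 22, "@DISEASE$")]  # hypothetical placeholders
print(replace_entities(sentence, spans))
# prints "@CHEMICAL$ induced @DISEASE$ in patients."
```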


Asking Effective and Diverse Questions: A Machine Reading Comprehension based Framework for Joint Entity-Relation Extraction

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/546 ◽

2020 ◽

Author(s):

Tianyang Zhao ◽

Zhao Yan ◽

Yunbo Cao ◽

Zhoujun Li

Keyword(s):

Reading Comprehension ◽

Question Answering ◽

Relation Extraction ◽

Effective Solution ◽

Joint Learning ◽

Selection Strategies ◽

Entity Relation Extraction ◽

Extraction Model ◽

Machine Reading ◽

Single Question

Recent advances cast entity-relation extraction as a multi-turn question answering (QA) task and provide an effective solution based on machine reading comprehension (MRC) models. However, they use a single question to characterize the meaning of entities and relations, which is intuitively not enough because of the variety of context semantics. Meanwhile, existing models enumerate all relation types to generate questions, which is inefficient and easily leads to confusing questions. In this paper, we improve the existing MRC-based entity-relation extraction model through diverse question answering. First, a diversity question answering mechanism is introduced to detect entity spans, and two answering selection strategies are designed to integrate different answers. Then, we propose to predict a subset of potential relations and filter out irrelevant ones to generate questions effectively. Finally, entity and relation extractions are integrated in an end-to-end way and optimized through joint learning. Experimental results show that the proposed method significantly outperforms baseline models, improving the relation F1 to 62.1% (+1.9%) on ACE05 and 71.9% (+3.0%) on CoNLL04. Our implementation is available at https://github.com/TanyaZhao/MRC4ERE.


Extraction of chemical–protein interactions from the literature using neural networks and narrow instance representation

Database ◽

10.1093/database/baz095 ◽

2019 ◽

Vol 2019 ◽

Author(s):

Rui Antunes ◽

Sérgio Matos

Keyword(s):

Neural Networks ◽

Deep Learning ◽

Protein Interactions ◽

Short Term Memory ◽

Relation Extraction ◽

Knowledge Bases ◽

Individual Variability ◽

Biomedical Literature ◽

Complex Sentence ◽

The Individual

Abstract The scientific literature contains large amounts of information on genes, proteins, chemicals and their interactions. Extraction and integration of this information in curated knowledge bases help researchers support their experimental results, leading to new hypotheses and discoveries. This is especially relevant for precision medicine, which aims to understand the individual variability across patient groups in order to select the most appropriate treatments. Methods for improved retrieval and automatic relation extraction from biomedical literature are therefore required for collecting structured information from the growing number of published works. In this paper, we follow a deep learning approach for extracting mentions of chemical–protein interactions from biomedical articles, based on various enhancements over our participation in the BioCreative VI CHEMPROT task. A significant aspect of our best method is the use of a simple deep learning model together with a very narrow representation of the relation instances, using only up to 10 words from the shortest dependency path and the respective dependency edges. Bidirectional long short-term memory recurrent networks or convolutional neural networks are used to build the deep learning models. We report the results of several experiments and show that our best model is competitive with more complex sentence representations or network structures, achieving an F1-score of 0.6306 on the test set. The source code of our work, along with detailed statistics, is publicly available.
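The narrow instance representation described in this abstract relies on the shortest dependency path between the two entities. A minimal sketch of finding such a path with breadth-first search is given below; the edge format, the function name `shortest_dependency_path`, and the choice to truncate the path from the source side when it exceeds `max_words` are assumptions, as the abstract only states that up to 10 words from the path are used.

```python
from collections import deque


def shortest_dependency_path(edges, source, target, max_words=10):
    """Return token indices on the shortest path between two tokens.

    edges: list of (head_index, dependent_index) pairs from a dependency
    parse, treated as undirected. Returns None if no path exists.
    """
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, []).append(dependent)
        graph.setdefault(dependent, []).append(head)

    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)

    if target not in parents:
        return None

    # Walk back from target to source, then reverse.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parents[node]
    path.reverse()
    return path[:max_words]  # assumed truncation rule


# Tokens: 0 Aspirin, 1 inhibits, 2 the, 3 enzyme, 4 COX-1
edges = [(1, 0), (1, 3), (3, 2), (3, 4)]
print(shortest_dependency_path(edges, 0, 4))  # prints [0, 1, 3, 4]
```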


CovidBERT-Biomedical Relation Extraction for Covid-19

The International FLAIRS Conference Proceedings ◽

10.32473/flairs.v34i1.128488 ◽

2021 ◽

Vol 34 (1) ◽

Author(s):

Shashank Hebbar ◽

Ying Xie

Keyword(s):

Relation Extraction ◽

Biomedical Literature ◽

Medical Treatments ◽

Relationship Extraction ◽

Short Period ◽

Approved Drugs ◽

New Treatments ◽

Extraction Model ◽

Fine Tune ◽

Biomedical Relation Extraction

Given the ongoing Covid-19 pandemic, which has had a devastating impact on society and the economy, and the explosive growth of biomedical literature, there is a growing need to find suitable medical treatments and therapeutics in a short period of time. Developing new treatments and therapeutics can be an expensive and time-consuming process. It can be practical to repurpose existing approved drugs and put them into clinical trials. Hence, we propose CovidBERT, a biomedical relationship extraction model based on BERT that extracts new relationships between various biomedical entities, namely gene-disease and chemical-disease relationships. We use the transformer architecture to train on Covid-19 related literature and fine-tune it using standard annotated datasets to show improvement in performance over baseline models. This research uses the transformer BERT model as its foundation and extracts relations from newly published biomedical papers.


A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing

Database ◽

10.1093/database/baaa104 ◽

2020 ◽

Vol 2020 ◽

Author(s):

Diana Sousa ◽

Andre Lamurias ◽

Francisco M Couto

Keyword(s):

Hybrid Approach ◽

Relation Extraction ◽

Knowledge Bases ◽

Amazon Mechanical Turk ◽

Domain Expert ◽

Human Phenotype ◽

Distant Supervision ◽

Original Dataset ◽

Partial Domain ◽

Biomedical Relation Extraction

Abstract Biomedical relation extraction (RE) datasets are vital for constructing knowledge bases and for fostering the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. Researchers have little control over who engages in crowdsourcing platforms, how, and in what context. Hence, combining distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers would be asked only to rectify or discard already existing annotations, which would make the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype–gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. Also, for Task 2, we added an extra on-site rater and a domain expert to further assess the crowdsourcing validation quality. Here, we describe a detailed pipeline for RE crowdsourcing validation, create a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with the original PGR dataset, as well as combinations of the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset are available at https://github.com/lasigeBioTM/PGR-crowd.
