A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing

Database, 2020, Vol 2020
Author(s): Diana Sousa, Andre Lamurias, Francisco M Couto

Abstract
Biomedical relation extraction (RE) datasets are vital for the construction of knowledge bases and for potentiating the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. Researchers have little control over who the workers are, how they work and in what context they engage with crowdsourcing platforms. Hence, allying distant supervision with crowdsourcing can be a more reliable alternative: crowdsourcing workers are asked only to rectify or discard already existing annotations, which makes the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created, distantly supervised human phenotype–gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. For Task 2, we also added an extra on-site rater and a domain expert to further assess the quality of the crowdsourcing validation. Here, we describe a detailed pipeline for RE crowdsourcing validation, create a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared their performance with that obtained on the original PGR dataset, as well as on combinations of the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset are available at https://github.com/lasigeBioTM/PGR-crowd.
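
The validation step described above, in which workers only rectify or discard already existing annotations, can be illustrated with a small aggregation routine. The Python sketch below is a hypothetical example, not the authors' pipeline; the data layout, vote labels and agreement threshold are all assumptions.

```python
# Minimal sketch (not the authors' code) of aggregating crowd judgments to
# validate distantly supervised relation annotations. Fields and thresholds
# are illustrative assumptions.
from collections import Counter

def aggregate_crowd_votes(annotations, judgments, min_agreement=4):
    """Keep, discard, or flag each distantly supervised annotation.

    annotations: dict mapping annotation_id -> (phenotype, gene, sentence)
    judgments:   dict mapping annotation_id -> list of worker votes,
                 each vote being "correct" or "incorrect"
    min_agreement: votes needed (e.g. out of seven workers) to accept a decision
    """
    validated, discarded, uncertain = {}, {}, {}
    for ann_id, votes in judgments.items():
        counts = Counter(votes)
        if counts["correct"] >= min_agreement:
            validated[ann_id] = annotations[ann_id]
        elif counts["incorrect"] >= min_agreement:
            discarded[ann_id] = annotations[ann_id]
        else:
            # low inter-worker agreement: route to an on-site rater or expert
            uncertain[ann_id] = annotations[ann_id]
    return validated, discarded, uncertain
```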

2020, Vol 2020, pp. 1-9
Author(s): Nada Boudjellal, Huaping Zhang, Asif Khan, Arshad Ahmad

With the accelerating growth of big data, especially in the healthcare area, information extraction is needed now more than ever, since it can convert unstructured information into easily interpretable structured data. Relation extraction is the second of the two main subtasks of information extraction, the first being named entity recognition. This study presents an overview of relation extraction using distant supervision, providing a generalized architecture of this task based on the state-of-the-art work that proposed this method. It also surveys the methods used in the literature on this topic, with a description of the different knowledge bases used in the process along with the corpora, which can be helpful for beginner practitioners seeking knowledge on this subject. Moreover, the limitations of the proposed approaches and future challenges are highlighted, and possible solutions are proposed.
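
As a concrete illustration of the generalized distant-supervision architecture surveyed here, the Python sketch below shows the core labeling heuristic: any sentence containing both entities of a knowledge-base triple is labeled with that triple's relation. The function name, naive string matching and toy data are assumptions for illustration only; real systems rely on named entity recognition and entity linking.

```python
def distantly_label(sentences, kb_triples):
    """sentences: list of raw sentence strings.
    kb_triples: iterable of (head_entity, relation, tail_entity) strings.
    Returns (sentence, head, tail, relation) training instances."""
    instances = []
    for sent in sentences:
        for head, relation, tail in kb_triples:
            # heuristic: co-occurrence of both entities implies the relation
            if head in sent and tail in sent:
                instances.append((sent, head, tail, relation))
    return instances

# toy example
corpus = ["Aspirin is commonly used to treat headache and mild fever."]
kb = [("Aspirin", "treats", "headache")]
print(distantly_label(corpus, kb))   # one heuristically labeled instance
```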


Author(s): Gaetano Rossiello, Alfio Gliozzo, Michael Glass

We propose a novel approach to learn representations of relations expressed by their textual mentions. Our assumption is that if two pairs of entities belong to the same relation, then those two pairs are analogous. We collect a large set of analogous pairs by matching triples in knowledge bases with web-scale corpora through distant supervision. This dataset is used to train a hierarchical siamese network that learns entity-entity embeddings encoding relational information across the different linguistic paraphrases expressing the same relation. The model can be used to generate pre-trained embeddings that provide a valuable signal when integrated into an existing neural model, outperforming state-of-the-art methods on a relation extraction task.
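
A rough PyTorch sketch of the siamese idea, pulling together entity-pair representations that express the same relation and pushing apart those that do not, is given below. The network sizes, the simple feed-forward encoder and the contrastive loss are illustrative assumptions, not the hierarchical architecture of the paper.

```python
# Illustrative siamese setup over entity-pair vectors (not the authors' model):
# analogous pairs (same relation) are pulled together, others pushed apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairEncoder(nn.Module):
    def __init__(self, in_dim=600, hid=300, out=128):
        super().__init__()
        # input: concatenation of two entity embeddings (dimensions are assumptions)
        self.net = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                                 nn.Linear(hid, out))

    def forward(self, pair_vec):
        return F.normalize(self.net(pair_vec), dim=-1)

def contrastive_loss(z1, z2, analogous, margin=0.5):
    """analogous: 1 if the two entity pairs share a relation, else 0."""
    d = 1 - F.cosine_similarity(z1, z2)
    return (analogous * d.pow(2) +
            (1 - analogous) * F.relu(margin - d).pow(2)).mean()

encoder = PairEncoder()
a, b = torch.randn(32, 600), torch.randn(32, 600)   # random stand-in pair vectors
y = torch.randint(0, 2, (32,)).float()
loss = contrastive_loss(encoder(a), encoder(b), y)
loss.backward()
```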


2019
Author(s): Morteza Pourreza Shahri, Mandi M. Roe, Gillian Reynolds, Indika Kahanda

Abstract
The MEDLINE database provides an extensive source of scientific articles and heterogeneous biomedical information in the form of unstructured text. Among the most important knowledge present in articles are the relations between human proteins and their phenotypes, which can stay hidden due to the exponential growth of publications. This has presented a range of opportunities for the development of computational methods to extract such biomedical relations from articles. However, no method currently exists for the automated extraction of relations involving human proteins and human phenotype ontology (HPO) terms. In our previous work, we developed a comprehensive database composed of all co-mentions of proteins and phenotypes. In this study, we present a supervised machine learning approach called PPPred (Protein-Phenotype Predictor) for classifying the validity of a given sentence-level co-mention. Using an in-house developed gold standard dataset, we demonstrate that PPPred significantly outperforms several baseline methods. This two-step approach of co-mention extraction and classification constitutes a complete biomedical relation extraction pipeline for extracting protein-phenotype relations.
CCS Concepts: • Computing methodologies → Information extraction; Supervised learning by classification; • Applied computing → Bioinformatics.
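
As a generic illustration of the co-mention classification step (the second stage of the pipeline described above), the scikit-learn sketch below trains a simple lexical classifier to label sentence-level protein-phenotype co-mentions as valid or invalid. The features, toy sentences and labels are assumptions; they are not PPPred's actual feature set or data.

```python
# Hypothetical sentence-level co-mention classifier (not PPPred itself):
# TF-IDF features plus logistic regression over valid/invalid labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy co-mention sentences with hypothetical gold labels (1 = valid relation)
sentences = [
    "Mutations in BRCA1 are associated with increased breast cancer risk.",
    "BRCA1 and hearing loss were both mentioned in unrelated contexts.",
]
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(sentences, labels)
print(clf.predict(["Variants of BRCA1 increase susceptibility to breast cancer."]))
```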


Author(s): Yujin Yuan, Liyuan Liu, Siliang Tang, Zhongfei Zhang, Yueting Zhuang, ...

Distant supervision leverages knowledge bases to automatically label instances, allowing us to train a relation extractor without human annotations. However, the generated training data typically contain massive noise and may result in poor performance under vanilla supervised learning. In this paper, we propose to conduct multi-instance learning with a novel Cross-relation Cross-bag Selective Attention (C2SA), which leads to noise-robust training for distantly supervised relation extractors. Specifically, we employ sentence-level selective attention to reduce the effect of noisy or mismatched sentences, while the correlation among relations is captured to improve the quality of the attention weights. Moreover, instead of treating all entity pairs equally, we pay more attention to entity pairs of higher quality, again using a selective attention mechanism. Experiments with two types of relation extractors demonstrate the superiority of the proposed approach over the state of the art, while further ablation studies verify our intuitions and demonstrate the effectiveness of the two proposed techniques.
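
The sentence-level selective attention that C2SA builds on can be sketched in a few lines of PyTorch: each sentence in a bag (all sentences mentioning one entity pair) is scored against a relation query, and noisy sentences receive low weights. The dimensions and random tensors below are placeholders, not the paper's model.

```python
# Minimal sketch of sentence-level selective attention over a distantly
# labeled bag (the building block C2SA extends across relations and bags).
import torch
import torch.nn.functional as F

def bag_representation(sentence_reprs, relation_query):
    """sentence_reprs: (num_sentences, dim) encodings of one entity pair's bag.
    relation_query: (dim,) learned embedding of the candidate relation.
    Noisy or mismatched sentences should receive low attention weights."""
    scores = sentence_reprs @ relation_query          # (num_sentences,)
    alpha = F.softmax(scores, dim=0)                  # attention weights
    return alpha @ sentence_reprs                     # weighted bag vector

bag = torch.randn(5, 256)        # 5 sentences mentioning the same entity pair
query = torch.randn(256)
print(bag_representation(bag, query).shape)           # torch.Size([256])
```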


Author(s): Juan-Luis García-Mendoza, Luis Villaseñor-Pineda, Felipe Orihuela-Espina, Lázaro Bustio-Martínez

Distant Supervision is an approach that allows automatic labeling of instances and has been used in Relation Extraction. Still, the main challenge of this task is handling instances with noisy labels (e.g., when two entities in a sentence are automatically labeled with an invalid relation). The approaches reported in the literature address this problem by employing noise-tolerant classifiers. However, introducing a noise-reduction stage before the classification step can increase macro precision. This paper proposes an Adversarial Autoencoder-based approach for obtaining a new representation that allows noise reduction in Distant Supervision. The representation obtained using Adversarial Autoencoders minimizes the intra-cluster distance compared with pre-trained embeddings and classic Autoencoders. Experiments demonstrate that the noise-reduced datasets yield macro precision values similar to those of the original dataset while using fewer instances with the same classifier. For example, in one of the noise-reduced datasets, macro precision improved by approximately 2.32% using only 77% of the original instances. This suggests the validity of using Adversarial Autoencoders to obtain representations well suited for noise reduction. The proposed approach also maintains the macro precision values of the original dataset while reducing the total number of instances needed for classification.
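
A simplified NumPy sketch of the noise-reduction stage is shown below: given instance representations (assumed here to come from the adversarial autoencoder, or any other encoder), instances far from their relation's centroid are discarded before classification. The 77% keep fraction echoes the figure quoted above but is otherwise arbitrary, and the centroid-distance filter itself is an illustrative stand-in for the paper's procedure.

```python
# Illustrative noise-reduction filter over learned instance representations.
import numpy as np

def filter_noisy_instances(X, labels, keep_fraction=0.77):
    """X: (n, d) instance representations; labels: (n,) relation labels.
    Keeps the keep_fraction of instances closest to their class centroid."""
    keep = np.zeros(len(X), dtype=bool)
    for rel in np.unique(labels):
        idx = np.where(labels == rel)[0]
        centroid = X[idx].mean(axis=0)
        dists = np.linalg.norm(X[idx] - centroid, axis=1)
        n_keep = max(1, int(keep_fraction * len(idx)))
        keep[idx[np.argsort(dists)[:n_keep]]] = True
    return keep

X = np.random.randn(100, 64)                 # stand-in representations
y = np.random.randint(0, 5, size=100)        # stand-in relation labels
mask = filter_noisy_instances(X, y)
print(mask.sum(), "of", len(X), "instances retained")
```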


2015, Vol 2015, pp. 1-10
Author(s): Yang Xiang, Yaoyun Zhang, Xiaolong Wang, Yang Qin, Wenying Han

Distant supervision (DS) automatically annotates free text with relation mentions from existing knowledge bases (KBs), providing a way to alleviate the problem of insufficient training data for relation extraction in natural language processing (NLP). However, the heuristic annotation process does not guarantee the correctness of the generated labels, promoting a hot research issue: how to efficiently make use of the noisy training data. In this paper, we model two types of biases to reduce noise: (1) bias-dist, to model the relative distance between points (instances) and classes (relation centers); and (2) bias-reward, to model the possibility of each heuristically generated label being incorrect. Based on these biases, we propose three noise-tolerant models, MIML-dist, MIML-dist-classify and MIML-reward, built on top of a state-of-the-art distantly supervised learning algorithm. Experimental evaluations compared with three landmark methods on the KBP dataset validate the effectiveness of the proposed methods.
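
The bias-reward intuition, that each heuristic label carries some probability of being wrong and should influence training accordingly, can be illustrated with a weighted loss as in the PyTorch fragment below. The fixed reliability vector is a placeholder; it is not how MIML-reward estimates label correctness.

```python
# Illustrative instance-weighted loss: noisier distant labels count less.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5, requires_grad=True)    # 4 instances, 5 relations
labels = torch.tensor([0, 2, 2, 4])               # heuristically generated labels
p_correct = torch.tensor([0.9, 0.4, 0.8, 0.6])    # assumed label reliability

per_instance = F.cross_entropy(logits, labels, reduction="none")
loss = (p_correct * per_instance).mean()          # down-weight unreliable labels
loss.backward()
```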


Author(s): Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, Wen-tau Yih

Past work in relation extraction has focused on binary relations in single sentences. Recent NLP inroads in high-value domains have sparked interest in the more general setting of extracting n-ary relations that span multiple sentences. In this paper, we explore a general relation extraction framework based on graph long short-term memory networks (graph LSTMs) that can be easily extended to cross-sentence n-ary relation extraction. The graph formulation provides a unified way of exploring different LSTM approaches and incorporating various intra-sentential and inter-sentential dependencies, such as sequential, syntactic, and discourse relations. A robust contextual representation is learned for the entities, which serves as input to the relation classifier. This simplifies handling of relations with arbitrary arity and enables multi-task learning with related relations. We evaluate this framework in two important precision medicine settings, demonstrating its effectiveness with both conventional supervised learning and distant supervision. Cross-sentence extraction produced larger knowledge bases, and multi-task learning significantly improved extraction accuracy. A thorough analysis of various LSTM approaches yielded useful insight into the impact of linguistic analysis on extraction accuracy.
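
The document graph at the heart of a graph-LSTM formulation can be sketched as follows: token nodes connected by adjacent-word edges plus dependency and discourse edges. The hard-coded dependency arcs in the example are toy values, not parser output, and the adjacency-list representation is just one possible encoding.

```python
# Illustrative document-graph construction for cross-sentence extraction.
def build_document_graph(tokens, dependency_edges, discourse_edges=()):
    """Returns an adjacency list over token indices."""
    graph = {i: set() for i in range(len(tokens))}
    for i in range(len(tokens) - 1):              # sequential (adjacent-word) edges
        graph[i].add(i + 1)
        graph[i + 1].add(i)
    for head, dep in list(dependency_edges) + list(discourse_edges):
        graph[head].add(dep)                      # syntactic / discourse edges
        graph[dep].add(head)
    return graph

tokens = ["EGFR", "mutations", "confer", "sensitivity", "to", "gefitinib"]
deps = [(2, 1), (2, 3), (3, 5)]                   # toy dependency arcs
print(build_document_graph(tokens, deps))
```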


2022, Vol 23 (1)
Author(s): Jing Chen, Baotian Hu, Weihua Peng, Qingcai Chen, Buzhou Tang

Abstract
Background: In biomedical research, chemical and disease relation extraction from unstructured biomedical literature is an essential task. Effective context understanding and knowledge integration are two main research problems in this task. Most work on relation extraction focuses on classification of entity mention pairs. Inspired by the effectiveness of machine reading comprehension (RC) for context understanding, solving biomedical relation extraction with the RC framework at both the intra-sentential and inter-sentential levels is a new topic worth exploring. Besides unstructured biomedical text, many structured knowledge bases (KBs) provide valuable guidance for biomedical relation extraction, and utilizing knowledge in the RC framework is also worth investigating. We propose a knowledge-enhanced reading comprehension (KRC) framework that leverages reading comprehension and prior knowledge for biomedical relation extraction. First, we generate questions for each relation, which reformulates the relation extraction task as a question answering task. Second, based on the RC framework, we integrate knowledge representations through an efficient knowledge-enhanced attention interaction mechanism to guide biomedical relation extraction.
Results: The proposed model was evaluated on the BioCreative V CDR dataset and the CHR dataset. Experiments show that our model achieved competitive document-level F1 scores of 71.18% and 93.3%, respectively, compared with other methods.
Conclusion: Result analysis reveals that open-domain reading comprehension data and knowledge representation can help improve biomedical relation extraction within the proposed KRC framework. Our work can encourage more research on bridging reading comprehension and biomedical relation extraction and promote biomedical relation extraction.
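
A minimal sketch of the question-answering reformulation is given below, using a generic extractive reading-comprehension model from the Hugging Face transformers library as a stand-in for the knowledge-enhanced model described above. The question templates, relation name and example passage are illustrative assumptions.

```python
# Casting relation extraction as reading comprehension: one template question
# per relation, answered by a generic extractive QA model (not the KRC model).
from transformers import pipeline

qa = pipeline("question-answering")   # downloads a default extractive QA model

TEMPLATES = {
    "chemical_induces_disease": "Which disease is induced by {entity}?",
}

def extract_relation(passage, relation, entity):
    question = TEMPLATES[relation].format(entity=entity)
    result = qa(question=question, context=passage)
    return result["answer"], result["score"]

passage = ("Prolonged use of aspirin has been reported to induce "
           "gastrointestinal bleeding in some patients.")
print(extract_relation(passage, "chemical_induces_disease", "aspirin"))
```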


2014
Author(s): Miao Fan, Deli Zhao, Qiang Zhou, Zhiyuan Liu, Thomas Fang Zheng, ...
