PEDL: extracting protein–protein associations using deep language models and distant supervision

Leon Weber; Kirsten Thobe; Oscar Arturo Migueles Lozano; Jana Wolf; Ulf Leser

doi:10.1093/bioinformatics/btaa430

PEDL: extracting protein–protein associations using deep language models and distant supervision

Bioinformatics ◽

10.1093/bioinformatics/btaa430 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i490-i498

Author(s):

Leon Weber ◽

Kirsten Thobe ◽

Oscar Arturo Migueles Lozano ◽

Jana Wolf ◽

Ulf Leser

Keyword(s):

Relation Extraction ◽

Training Data ◽

Language Models ◽

Supplementary Information ◽

Functional Protein ◽

Major Pathway ◽

Distant Supervision ◽

Order Of Magnitude ◽

Biomedical Publications ◽

Pathway Databases

Abstract

Motivation: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein–protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance.

Results: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA.

Availability and implementation: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article.

Supplementary information: Supplementary data are available at Bioinformatics online.


CoCoScore: context-aware co-occurrence scoring for text mining applications using distant supervision

Bioinformatics ◽

10.1093/bioinformatics/btz490 ◽

2019 ◽

Vol 36 (1) ◽

pp. 264-271 ◽

Cited By ~ 4

Author(s):

Alexander Junge ◽

Lars Juhl Jensen

Keyword(s):

Text Mining ◽

Language Processing ◽

Gold Standard ◽

Relation Extraction ◽

Supplementary Information ◽

Functional Protein ◽

Distant Supervision ◽

Sentence Level ◽

Gene Associations ◽

Standard Set

Abstract

Motivation: Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence.

Results: We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease–gene and tissue–gene associations as well as in identifying physical and functional protein–protein associations in different species. CoCoScore is a versatile text mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications.

Availability and implementation: CoCoScore is available at: https://github.com/JungeAlexander/cocoscore.

Supplementary information: Supplementary data are available at Bioinformatics online.
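The distant-supervision labeling step described in this abstract can be illustrated with a minimal sketch. It is not taken from the CoCoScore repository: the `CoMention` type, the `label_co_mentions` function and the example sentences are hypothetical, and the sketch covers only the labeling of co-mentions against a gold standard, not the scoring model itself.

```python
from typing import Iterable, List, NamedTuple, Tuple


class CoMention(NamedTuple):
    """A sentence that mentions two entities (hypothetical container type)."""
    sentence: str
    entity_a: str
    entity_b: str


def label_co_mentions(
    co_mentions: Iterable[CoMention],
    gold_pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[CoMention, int]]:
    """Label each co-mention 1 if its entity pair is in the gold standard, else 0."""
    # Store pairs as frozensets so that (A, B) and (B, A) count as the same pair.
    gold = {frozenset(pair) for pair in gold_pairs}
    return [
        (m, int(frozenset((m.entity_a, m.entity_b)) in gold))
        for m in co_mentions
    ]


# Invented example data, for illustration only.
examples = [
    CoMention("TP53 binds MDM2.", "TP53", "MDM2"),
    CoMention("TP53 and BRCA1 were both measured.", "TP53", "BRCA1"),
]
labeled = label_co_mentions(examples, gold_pairs=[("MDM2", "TP53")])
for mention, label in labeled:
    print(label, mention.sentence)
```

Because the label depends only on the entity pair, a sentence that co-mentions a gold-standard pair is marked positive even when the sentence itself does not state the association. This is the source of the label noise that distantly supervised methods have to tolerate.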


Using distant supervision to augment manually annotated data for relation extraction

10.1101/626226 ◽

2019 ◽

Author(s):

Peng Su ◽

Gang Li ◽

Cathy Wu ◽

K. Vijay-Shanker

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Relation Extraction ◽

Biomedical Literature ◽

Training Data ◽

Distant Supervision ◽

Large Size ◽

Domain Expertise

Abstract

Significant progress has recently been made in applying deep learning to natural language processing tasks. However, deep learning models typically require a large amount of annotated training data, while often only small labeled datasets are available for many natural language processing tasks in biomedical literature. Building large datasets for deep learning is expensive since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained using distant supervision. Because data obtained by distant supervision is often noisy, we first apply heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.


Cross-Relation Cross-Bag Attention for Distantly-Supervised Relation Extraction

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.3301419 ◽

2019 ◽

Vol 33 ◽

pp. 419-426 ◽

Cited By ~ 6

Author(s):

Yujin Yuan ◽

Liyuan Liu ◽

Siliang Tang ◽

Zhongfei Zhang ◽

Yueting Zhuang ◽

...

Keyword(s):

Selective Attention ◽

Supervised Learning ◽

State Of The Art ◽

Relation Extraction ◽

Knowledge Bases ◽

Training Data ◽

Distant Supervision ◽

Sentence Level ◽

Noise Robust

Distant supervision leverages knowledge bases to automatically label instances, thus allowing us to train relation extractors without human annotations. However, the generated training data typically contain massive noise and may result in poor performance with vanilla supervised learning. In this paper, we propose to conduct multi-instance learning with a novel Cross-relation Cross-bag Selective Attention (C2SA), which leads to noise-robust training for distantly supervised relation extractors. Specifically, we employ sentence-level selective attention to reduce the effect of noisy or mismatched sentences, while the correlation among relations is captured to improve the quality of attention weights. Moreover, instead of treating all entity pairs equally, we pay more attention to entity pairs of higher quality, again adopting a selective attention mechanism to achieve this goal. Experiments with two types of relation extractor demonstrate the superiority of the proposed approach over the state of the art, while further ablation studies verify our intuitions and demonstrate the effectiveness of our two proposed techniques.
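As a rough illustration of sentence-level selective attention over a bag of sentences, the sketch below weights each sentence vector by a softmax over its dot product with a relation query vector. It is a generic sketch, not the C2SA model: the cross-relation and cross-bag components are omitted, and the function names, the dot-product scoring and the toy vectors are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def selective_attention(
    sentence_vecs: Sequence[Sequence[float]],
    relation_query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return attention weights and the weighted bag representation.

    Each sentence is scored by its dot product with the relation query
    (an assumed scoring function); sentences that match the relation
    poorly receive low weight in the bag representation.
    """
    scores = [
        sum(x * q for x, q in zip(vec, relation_query))
        for vec in sentence_vecs
    ]
    weights = softmax(scores)
    dim = len(relation_query)
    bag = [
        sum(w * vec[d] for w, vec in zip(weights, sentence_vecs))
        for d in range(dim)
    ]
    return weights, bag


# Toy bag of two sentence vectors; the first aligns with the query.
weights, bag = selective_attention([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
print(weights, bag)
```

In a trained model the sentence vectors and the relation query would be learned parameters or encoder outputs; here they are fixed numbers so that the weighting can be inspected directly.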


Relation Extraction for the Food Domain without Labeled Training Data – Is Distant Supervision the Best Solution?

Advances in Natural Language Processing - Lecture Notes in Computer Science ◽

10.1007/978-3-319-10888-9_35 ◽

2014 ◽

pp. 345-357 ◽

Cited By ~ 1

Author(s):

Melanie Reiplinger ◽

Michael Wiegand ◽

Dietrich Klakow

Keyword(s):

Relation Extraction ◽

Training Data ◽

Distant Supervision


Language Models Application in Sentiment Attitude Extraction Task

Proceedings of the Institute for System Programming of RAS ◽

10.15514/ispras-2021-33(3)-14 ◽

2021 ◽

Vol 33 (3) ◽

pp. 199-222

Author(s):

Nicolay Leonidovich Rusnachenko

Keyword(s):

Mass Media ◽

Language Model ◽

Training Data ◽

Language Models ◽

Negative Effects ◽

Named Entities ◽

Distant Supervision ◽

Lexical Resource ◽

Attitude Extraction ◽

Over The Top

Large texts can convey various forms of sentiment information, including the author’s position, positive or negative effects of some events, and attitudes of mentioned entities towards each other. In this paper, we experiment with BERT-based language models for extracting sentiment attitudes between named entities. Given a mass media article and a list of mentioned named entities, the task is to extract positive or negative attitudes between them. The efficiency of language model methods depends on the amount of training data. To enrich the training data, we adopt a distant supervision method, which provides automatic annotation of unlabeled texts using an additional lexical resource. The proposed approach is subdivided into two stages: (1) sentiment pairs list completion (PAIR-BASED), (2) document annotation using PAIR-BASED and FRAME-BASED factors. Applied to a large news collection, the method generates the automatically annotated RuAttitudes2017 collection. We evaluate the approach on RuSentRel-1.0, which consists of mass media articles written in Russian. Adopting RuAttitudes2017 in the training process results in a 10-13% quality improvement by F1-measure over supervised learning and a 25% improvement over the top neural network based model results.


Simultaneously Linking Entities and Extracting Relations from Biomedical Text without Mention-Level Supervision

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6236 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7407-7414

Author(s):

Trapit Bansal ◽

Pat Verga ◽

Neha Choudhary ◽

Andrew McCallum

Keyword(s):

State Of The Art ◽

Relation Extraction ◽

Training Data ◽

Biomedical Text ◽

Entity Linking ◽

Distant Supervision ◽

Entity Relationships ◽

And Training

Understanding the meaning of text often involves reasoning about entities and their relationships. This requires identifying textual mentions of entities, linking them to a canonical concept, and discerning their relationships. These tasks are nearly always viewed as separate components within a pipeline, each requiring a distinct model and training data. While relation extraction can often be trained with readily available weak or distant supervision, entity linkers typically require expensive mention-level supervision – which is not available in many domains. Instead, we propose a model which is trained to simultaneously produce entity linking and relation decisions while requiring no mention-level annotations. This approach avoids cascading errors that arise from pipelined methods and more accurately predicts entity relationships from text. We show that our model outperforms a state-of-the-art entity linking and relation extraction pipeline on two biomedical datasets and can drastically improve the overall recall of the system.


Exploiting Parallel News Streams for Unsupervised Event Extraction

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00127 ◽

2015 ◽

Vol 3 ◽

pp. 117-129 ◽

Cited By ~ 2

Author(s):

Congle Zhang ◽

Stephen Soderland ◽

Daniel S. Weld

Keyword(s):

Graphical Model ◽

Relation Extraction ◽

Event Extraction ◽

Training Data ◽

Natural Language Text ◽

Distant Supervision ◽

Wide Range ◽

Precision Recall Curve ◽

Language Text ◽

Better Than

Most approaches to relation extraction, the task of extracting ground facts from natural language text, are based on machine learning and thus starved by scarce training data. Manual annotation is too expensive to scale to a comprehensive set of relations. Distant supervision, which automatically creates training data, only works with relations that already populate a knowledge base (KB). Unfortunately, KBs such as FreeBase rarely cover event relations (e.g. “person travels to location”). Thus, the problem of extracting a wide range of events, e.g. from news streams, is an important, open challenge. This paper introduces NewsSpike-RE, a novel, unsupervised algorithm that discovers event relations and then learns to extract them. NewsSpike-RE uses a novel probabilistic graphical model to cluster sentences describing similar events from parallel news streams. These clusters then comprise training data for the extractor. Our evaluation shows that NewsSpike-RE generates high quality training sentences and learns extractors that perform much better than rival approaches, more than doubling the area under a precision-recall curve compared to Universal Schemas.


Bias Modeling for Distantly Supervised Relation Extraction

Mathematical Problems in Engineering ◽

10.1155/2015/969053 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 2

Author(s):

Yang Xiang ◽

Yaoyun Zhang ◽

Xiaolong Wang ◽

Yang Qin ◽

Wenying Han

Keyword(s):

Language Processing ◽

Learning Algorithm ◽

State Of The Art ◽

Relation Extraction ◽

Knowledge Bases ◽

Training Data ◽

Free Text ◽

Distant Supervision ◽

Annotation Process ◽

Noise Tolerant

Distant supervision (DS) automatically annotates free text with relation mentions from existing knowledge bases (KBs), providing a way to alleviate the problem of insufficient training data for relation extraction in natural language processing (NLP). However, the heuristic annotation process does not guarantee the correctness of the generated labels, which has made the efficient use of noisy training data an active research issue. In this paper, we model two types of biases to reduce noise: (1) bias-dist, to model the relative distance between points (instances) and classes (relation centers); (2) bias-reward, to model the possibility of each heuristically generated label being incorrect. Based on the biases, we propose three noise-tolerant models, MIML-dist, MIML-dist-classify, and MIML-reward, building on top of a state-of-the-art distantly supervised learning algorithm. Experimental evaluations compared with three landmark methods on the KBP dataset validate the effectiveness of the proposed methods.


A Brief Review of Relation Extraction Based on Pre-Trained Language Models

Fuzzy Systems and Data Mining VI - Frontiers in Artificial Intelligence and Applications ◽

10.3233/faia200755 ◽

2020 ◽

Author(s):

Tiange Xu ◽

Fu Zhang

Keyword(s):

Recurrent Neural Networks ◽

Rapid Development ◽

Relation Extraction ◽

Language Models ◽

Research Progress ◽

Future Research ◽

Distant Supervision ◽

Future Research Directions ◽

Knowledge Graphs ◽

Key Techniques

Relation extraction aims to extract the semantic relations between entity pairs in text, and it is a key step in building knowledge graphs and in information extraction. The rapid development of deep learning in recent years has produced rich research results in relation extraction tasks. At present, the accuracy of relation extraction based on pre-trained language models such as BERT exceeds that of methods based on convolutional or recurrent neural networks. This review mainly summarizes the research progress of pre-trained language models such as BERT in supervised and distantly supervised relation extraction. In addition, directions for future research and some comparisons and analyses are discussed. The survey may help readers understand key techniques for the task and identify future research directions.


CoCoScore: Context-aware co-occurrence scoring for text mining applications using distant supervision

10.1101/444398 ◽

2018 ◽

Cited By ~ 1

Author(s):

Alexander Junge ◽

Lars Juhl Jensen

Keyword(s):

Text Mining ◽

Language Processing ◽

Gold Standard ◽

Relation Extraction ◽

Functional Protein ◽

Training Corpus ◽

Distant Supervision ◽

Sentence Level ◽

Gene Associations ◽

Standard Set

Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence. We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease-gene and tissue-gene associations as well as in identifying physical and functional protein-protein associations in different species. CoCoScore is a versatile text-mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications. CoCoScore is available at: https://github.com/JungeAlexander/cocoscore
