Injecting Event Knowledge into Pre-Trained Language Models for Event Extraction

2020 ◽  
Author(s):  
Zining Yang ◽  
Siyu Zhan ◽  
Mengshu Hou ◽  
Xiaoyang Zeng ◽  
Hao Zhu

Recent pre-trained language models have achieved great success in many NLP tasks. In this paper, we propose an event extraction system based on the pre-trained language model BERT to extract both event triggers and arguments. As with any deep-learning-based method, the size of the training dataset has a crucial impact on performance. To address the lack of training data for event extraction, we further train the pre-trained language model on a carefully constructed in-domain corpus, injecting event knowledge into our event extraction system with minimal effort. Empirical evaluation on the ACE2005 dataset shows that injecting event knowledge significantly improves event extraction performance.
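
A minimal sketch of the continued masked-language-model pre-training step described in this abstract, assuming the HuggingFace transformers and datasets libraries; the corpus file name, model checkpoint and hyperparameters are illustrative placeholders rather than the authors' configuration.

```python
# Continued MLM pre-training of BERT on a hypothetical in-domain event corpus
# ("event_corpus.txt" is a placeholder, one document or sentence per line).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

corpus = load_dataset("text", data_files={"train": "event_corpus.txt"})["train"]
corpus = corpus.map(lambda ex: tok(ex["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-event", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15),
)
trainer.train()
model.save_pretrained("bert-event")  # starting point for the trigger/argument extractor
```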

2021 ◽  
pp. 1-55
Author(s):  
Daniel Loureiro ◽  
Kiamehr Rezaee ◽  
Mohammad Taher Pilehvar ◽  
Jose Camacho-Collados

Abstract Transformer-based language models have taken many fields in NLP by storm. BERT and its derivatives dominate most of the existing evaluation benchmarks, including those for Word Sense Disambiguation (WSD), thanks to their ability to capture context-sensitive semantic nuances. However, there is still little knowledge about their capabilities and potential limitations in encoding and recovering word senses. In this article, we provide an in-depth quantitative and qualitative analysis of the celebrated BERT model with respect to lexical ambiguity. One of the main conclusions of our analysis is that BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense. Our analysis also reveals that in some cases language models come close to solving coarse-grained noun disambiguation under ideal conditions in terms of availability of training data and computing resources. However, this scenario rarely occurs in real-world settings and, hence, many practical challenges remain even in the coarse-grained setting. We also perform an in-depth comparison of the two main language-model-based WSD strategies, i.e., fine-tuning and feature extraction, finding that the latter approach is more robust with respect to sense bias and can better exploit limited available training data. In fact, the simple feature-extraction strategy of averaging contextualized embeddings proves robust even when only three training sentences per word sense are used, with minimal improvements obtained by increasing the size of this training data.
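
The feature-extraction strategy mentioned above (averaging contextualized embeddings per sense) can be sketched roughly as follows. This is an illustrative approximation, not the authors' code; the model name, example sentences and the word_embedding helper are assumptions for the demo.

```python
# Average BERT embeddings of a target word over a few sentences per sense, then
# disambiguate a new occurrence by cosine similarity to the sense prototypes.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    """Contextual embedding of `word`: mean of its subword vectors in `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    piece_ids = tok(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(piece_ids) + 1):   # first occurrence of the word
        if ids[i:i + len(piece_ids)] == piece_ids:
            span = list(range(i, i + len(piece_ids)))
            break
    else:
        raise ValueError(f"{word!r} not found in sentence")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    return hidden[span].mean(dim=0)

# three training sentences per sense, matching the low-data setting discussed above
senses = {
    "bank_financial": ["She deposited the cheque at the bank.",
                       "The bank approved the loan.",
                       "He works as a teller in a bank."],
    "bank_river":     ["They had a picnic on the bank of the river.",
                       "The boat drifted towards the muddy bank.",
                       "Reeds grew along the bank."],
}
prototypes = {s: torch.stack([word_embedding(x, "bank") for x in xs]).mean(dim=0)
              for s, xs in senses.items()}

query = word_embedding("Fishermen lined the bank at dawn.", "bank")
pred = max(prototypes, key=lambda s: torch.cosine_similarity(query, prototypes[s], dim=0))
print(pred)  # typically predicts "bank_river"
```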


2021 ◽  
Vol 15 (1) ◽  
pp. 31-45
Author(s):  
Arjit Jain ◽  
Sunita Sarawagi ◽  
Prithviraj Sen

Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real-world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active learning is a promising approach for ER in low-resource settings. However, the search space for finding informative samples for the user to label grows quadratically for instance-pair tasks, making active learning hard to scale. Previous works in this setting rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss important regions of the product space, leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time.
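
A rough, self-contained illustration of the index-by-committee blocking idea, not the DIAL implementation: random projections stand in for the committee's transformer encoders, each member retrieves top-k neighbours, the union forms the candidate pairs, and pairs the members disagree on are routed to the labeller.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_left, n_right, k, committee_size = 64, 200, 300, 5, 3

# stand-in record features (in DIAL these would be learned transformer embeddings)
left = rng.normal(size=(n_left, dim))
right = rng.normal(size=(n_right, dim))

def normalise(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

candidates, votes = set(), {}
for seed in range(committee_size):
    proj = np.random.default_rng(100 + seed).normal(size=(dim, dim))  # one committee member
    L, R = normalise(left @ proj), normalise(right @ proj)
    topk = np.argsort(-(L @ R.T), axis=1)[:, :k]        # top-k right neighbours per left record
    for i, js in enumerate(topk):
        for j in js:
            pair = (i, int(j))
            candidates.add(pair)
            votes[pair] = votes.get(pair, 0) + 1

# pairs retrieved by only some members are where the committee disagrees most:
# these are the informative examples to route to the human labeller.
to_label = [p for p in candidates if votes[p] < committee_size]
print(len(candidates), "candidate pairs;", len(to_label), "sent for labelling")
```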


Author(s):  
A. Evtushenko

Machine learning language models are combinations of algorithms and neural networks for processing text written in natural language (Natural Language Processing, NLP). In 2020, the artificial intelligence research company OpenAI released GPT-3, its largest language model to date, with up to 175 billion parameters. This more-than-hundredfold increase in parameterization made it possible to improve the quality of generated text to a level that is hard to distinguish from human-written text. Notably, the model was trained on a dataset collected mainly from open sources on the Internet, whose volume is estimated at 570 GB. This article discusses the problem of memorizing critical information, in particular the personal data of individuals, during the training of large language models (GPT-2/3 and derivatives). It also describes an algorithmic approach to this problem, which consists of additional preprocessing of the training dataset and refinement of model inference so that pseudo-personal data are generated and embedded into the outputs of summarization, text generation, question answering, and other seq2seq tasks.
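
One simple way to picture the preprocessing step discussed here is a rule-based scrubber that replaces personal-data spans with placeholder tokens before training; the patterns and placeholders below are illustrative examples, not the article's algorithm.

```python
# Replace obviously personal fields in the training corpus with placeholder tokens,
# which could later be filled with generated pseudo-personal data at inference time.
import re

PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scrub(text: str) -> str:
    """Mask personal-data spans before the text enters the training set."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(scrub("Contact John at john.doe@example.com or +1 (555) 123-4567."))
# Contact John at [EMAIL] or [PHONE].
```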


2020 ◽  
Vol 34 (05) ◽  
pp. 7634-7642
Author(s):  
David Demeter ◽  
Doug Downey

Neural network language models (NNLMs) have achieved ever-improving accuracy due to more sophisticated architectures and increasing amounts of training data. However, the inductive bias of these models (formed by the distributional hypothesis of language), while ideally suited to modeling most running text, results in key limitations for today's models. In particular, the models often struggle to learn certain spatial, temporal, or quantitative relationships, which are commonplace in text and are second nature for human readers. Yet, in many cases, these relationships can be encoded with simple mathematical or logical expressions. How can we augment today's neural models with such encodings? In this paper, we propose a general methodology to enhance the inductive bias of NNLMs by incorporating simple functions into a neural architecture to form a hierarchical neural-symbolic language model (NSLM). These functions explicitly encode symbolic deterministic relationships to form probability distributions over words. We explore the effectiveness of this approach on numbers and geographic locations, and show that NSLMs significantly reduce perplexity in small-corpus language modeling, and that the performance improvement persists for rare tokens even on much larger corpora. The approach is simple and general, and we discuss how it can be applied to other word classes beyond numbers and geography.
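
A toy numerical sketch of the hierarchical neural-symbolic idea: the neural model supplies P(class | context) and the within-class distribution for ordinary words, while a simple symbolic function supplies the within-class distribution for numbers. All values below are invented for illustration and do not come from the paper.

```python
import numpy as np

vocab_words = ["the", "temperature", "was", "degrees"]
vocab_numbers = [str(n) for n in range(0, 101)]

# neural component (placeholder values): P(class | context), e.g. after "temperature was"
p_class = {"word": 0.2, "number": 0.8}
p_word = np.full(len(vocab_words), 1.0 / len(vocab_words))   # placeholder neural softmax

# symbolic component: numbers near a contextually salient value are more likely
def number_distribution(reference: float, width: float = 3.0) -> np.ndarray:
    vals = np.array([float(n) for n in vocab_numbers])
    scores = np.exp(-((vals - reference) ** 2) / (2 * width ** 2))
    return scores / scores.sum()

p_number = number_distribution(reference=21.0)               # e.g. a salient temperature

# full distribution: P(w) = P(class) * P(w | class)
p_full = np.concatenate([p_class["word"] * p_word,
                         p_class["number"] * p_number])
tokens = vocab_words + vocab_numbers
print(tokens[int(p_full.argmax())])   # a number close to 21
```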


2014 ◽  
Vol 102 (1) ◽  
pp. 81-92 ◽  
Author(s):  
Baltescu Paul ◽  
Blunsom Phil ◽  
Hoang Hieu

Abstract This paper presents an open-source implementation of a neural language model for machine translation. Neural language models deal with the problem of data sparsity by learning distributed representations for words in a continuous vector space. The language modelling probabilities are estimated by projecting a word's context into the same space as the word representations and by assigning probabilities according to the distance between the words and the context's projection. Neural language models are notoriously slow to train and test. Our framework is designed with scalability in mind and provides two optional techniques for reducing the computational cost: the so-called class decomposition trick and a training algorithm based on noise contrastive estimation. Our models may be extended to incorporate direct n-gram features that learn weights for every n-gram in the training data. Our framework comes with wrappers for the cdec and Moses translation toolkits, allowing our language models to be incorporated as normalized features in their decoders (inside the beam search).
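
The class decomposition trick mentioned above can be sketched as a class-factored output layer; the snippet below is a compact PyTorch illustration under assumed class assignments, not the toolkit's actual implementation.

```python
# Factor the softmax over a large vocabulary into a softmax over word classes and a
# softmax over the words inside the chosen class, so each prediction touches roughly
# |C| + |V|/|C| outputs instead of |V|.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassFactoredSoftmax(nn.Module):
    def __init__(self, hidden: int, word_to_class: list, class_sizes: list):
        super().__init__()
        self.word_to_class = torch.tensor(word_to_class)
        self.class_logits = nn.Linear(hidden, len(class_sizes))
        # one small output layer per class, covering only the words in that class
        self.word_logits = nn.ModuleList([nn.Linear(hidden, s) for s in class_sizes])

    def log_prob(self, h: torch.Tensor, word_id: int, within_class_id: int) -> torch.Tensor:
        """log P(w | h) = log P(class(w) | h) + log P(w | class(w), h)."""
        c = int(self.word_to_class[word_id])
        log_p_class = F.log_softmax(self.class_logits(h), dim=-1)[c]
        log_p_word = F.log_softmax(self.word_logits[c](h), dim=-1)[within_class_id]
        return log_p_class + log_p_word

# toy usage: 6-word vocabulary split into 2 classes of 3 words each
layer = ClassFactoredSoftmax(hidden=16, word_to_class=[0, 0, 0, 1, 1, 1],
                             class_sizes=[3, 3])
h = torch.randn(16)                                   # context vector from the LM
print(layer.log_prob(h, word_id=4, within_class_id=1))  # word 4 is the 2nd word of class 1
```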


2021 ◽  
Vol 33 (3) ◽  
pp. 199-222
Author(s):  
Nicolay Leonidovich Rusnachenko

Large texts can convey various forms of sentiment information, including the author's position, positive or negative effects of certain events, and attitudes of mentioned entities towards each other. In this paper, we experiment with BERT-based language models for extracting sentiment attitudes between named entities. Given a mass-media article and a list of mentioned named entities, the task is to extract positive or negative attitudes between them. The efficiency of language-model methods depends on the amount of training data. To enrich the training data, we adopt a distant supervision method, which provides automatic annotation of unlabeled texts using an additional lexical resource. The proposed approach is subdivided into two stages: (1) sentiment pair list completion (PAIR-BASED), and (2) document annotation using PAIR-BASED and FRAME-BASED factors. Applied to a large news collection, the method produces RuAttitudes2017, an automatically annotated collection. We evaluate the approach on RuSentRel-1.0, which consists of mass-media articles written in Russian. Adopting RuAttitudes2017 in the training process yields a 10-13% improvement in F1-measure over supervised learning and a 25% improvement over the best neural network based models.
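
The distant-supervision stage can be pictured as projecting a pre-compiled list of sentiment pairs onto unlabeled sentences that mention both entities; the entities, labels and sentences below are invented for illustration, and the snippet is not the paper's pipeline.

```python
# Automatically annotate unlabeled sentences with attitude labels taken from a
# pre-compiled sentiment pair list (the PAIR-BASED resource described above).
sentiment_pairs = {
    ("CountryA", "CountryB"): "neg",
    ("CompanyX", "CompanyY"): "pos",
}

sentences = [
    "CountryA imposed new sanctions against CountryB on Tuesday.",
    "CompanyX announced a joint venture with CompanyY.",
    "CountryA hosted a trade summit last week.",
]

annotated = []
for sent in sentences:
    for (e1, e2), label in sentiment_pairs.items():
        if e1 in sent and e2 in sent:
            annotated.append({"text": sent, "source": e1, "target": e2, "label": label})

for ex in annotated:
    print(ex)
# two automatically labeled attitude examples; the third sentence stays unlabeled
```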


Author(s):  
Linshu Ouyang ◽  
Yongzheng Zhang ◽  
Hui Liu ◽  
Yige Chen ◽  
Yipeng Wang

Authorship verification is an important problem that has many applications. The state-of-the-art deep authorship verification methods typically leverage character-level language models to encode author-specific writing styles. However, they often fail to capture syntactic-level patterns, leading to sub-optimal accuracy in cross-topic scenarios. Also, due to imperfect cross-author parameter sharing, it is difficult for them to distinguish author-specific writing style from common patterns, leading to data-inefficient learning. This paper introduces a novel POS-level (part-of-speech) gated RNN based language model to effectively learn author-specific syntactic styles. The author-agnostic syntactic information obtained from a POS tagger pre-trained on large external datasets greatly reduces the number of effective parameters of our model, enabling the model to learn accurate author-specific syntactic styles with limited training data. We also utilize a gated architecture to learn common syntactic writing styles with a small set of shared parameters and let the author-specific parameters focus on each author's special syntactic styles. Extensive experimental results show that our method achieves significantly better accuracy than state-of-the-art competing methods, especially in cross-topic scenarios (over 5% in terms of AUC-ROC).
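
A minimal PyTorch sketch of a POS-level gated model in the spirit described above: a shared GRU over POS-tag sequences, a shared head, and a small per-author head combined through a learned gate. The architecture details are assumptions for illustration, not the authors' exact model.

```python
import torch
import torch.nn as nn

class POSGatedVerifier(nn.Module):
    """Shared GRU over POS-tag sequences plus a small per-author head,
    gated against a shared head, scoring whether a text matches an author."""
    def __init__(self, n_pos_tags: int, n_authors: int, emb: int = 32, hid: int = 64):
        super().__init__()
        self.pos_emb = nn.Embedding(n_pos_tags, emb)
        self.gru = nn.GRU(emb, hid, batch_first=True)        # shared across authors
        self.shared_head = nn.Linear(hid, 1)
        self.author_head = nn.ModuleList([nn.Linear(hid, 1) for _ in range(n_authors)])
        self.gate = nn.Linear(hid, 1)

    def forward(self, pos_ids: torch.Tensor, author: int) -> torch.Tensor:
        h, _ = self.gru(self.pos_emb(pos_ids))               # (batch, seq, hid)
        summary = h.mean(dim=1)
        g = torch.sigmoid(self.gate(summary))                # author-specific vs. shared mix
        score = g * self.author_head[author](summary) + (1 - g) * self.shared_head(summary)
        return torch.sigmoid(score).squeeze(-1)              # P(text written by `author`)

model = POSGatedVerifier(n_pos_tags=17, n_authors=10)        # 17 = Universal POS tag set
batch = torch.randint(0, 17, (4, 50))                        # POS-tag ids from an external tagger
print(model(batch, author=3).shape)                          # torch.Size([4])
```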


Author(s):  
Hao Fei ◽  
Yafeng Ren ◽  
Yue Zhang ◽  
Donghong Ji ◽  
Xiaohui Liang

Abstract Biomedical information extraction (BioIE) is an important task. The aim is to analyze biomedical texts and extract structured information such as named entities and the semantic relations between them. In recent years, pre-trained language models have largely improved the performance of BioIE. However, they neglect to incorporate external structural knowledge, which can provide rich factual information to support the underlying understanding and reasoning required for biomedical information extraction. In this paper, we first evaluate current extraction methods, including vanilla neural networks, general language models and pre-trained contextualized language models, on biomedical information extraction tasks, including named entity recognition, relation extraction and event extraction. We then propose to enrich a contextualized language model by integrating large-scale biomedical knowledge graphs (the resulting model is called BioKGLM). In order to effectively encode knowledge, we explore a three-stage training procedure and introduce different fusion strategies to facilitate knowledge injection. Experimental results on multiple tasks show that BioKGLM consistently outperforms state-of-the-art extraction models. A further analysis shows that BioKGLM can capture the underlying relations between biomedical knowledge concepts, which are crucial for BioIE.
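
One very simple form of knowledge injection, shown below for illustration only, is to verbalize knowledge-graph triples as sentences and mix them into a further pre-training corpus; this is not BioKGLM's three-stage procedure or its fusion strategies, just a minimal sketch of the general idea. The triples, templates and file name are invented.

```python
# Turn KG triples into natural-language sentences that can be appended to the
# corpus used for continued pre-training of a contextualized language model.
triples = [
    ("aspirin", "treats", "headache"),
    ("BRCA1", "associated_with", "breast cancer"),
]

templates = {
    "treats": "{h} is a drug used to treat {t}.",
    "associated_with": "{h} is associated with {t}.",
}

def verbalize(head: str, relation: str, tail: str) -> str:
    return templates[relation].format(h=head, t=tail)

with open("kg_sentences.txt", "w") as f:
    for h, r, t in triples:
        f.write(verbalize(h, r, t) + "\n")
# the resulting file can be mixed into the pre-training corpus
```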


2021 ◽  
Vol 9 ◽  
pp. 1408-1424
Author(s):  
Timo Schick ◽  
Sahana Udupa ◽  
Hinrich Schütze

Abstract ⚠ This paper contains prompts and model outputs that are offensive in nature. When trained on large, unfiltered crawls from the Internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: They often generate racist, sexist, violent, or otherwise toxic language. As large models require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. In this paper, we first demonstrate a surprising finding: Pretrained language models recognize, to a considerable degree, their undesirable biases and the toxicity of the content they produce. We refer to this capability as self-diagnosis. Based on this finding, we then propose a decoding algorithm that, given only a textual description of the undesired behavior, reduces the probability of a language model producing problematic text. We refer to this approach as self-debiasing. Self-debiasing does not rely on manually curated word lists, nor does it require any training data or changes to the model’s parameters. While we by no means eliminate the issue of language models generating biased text, we believe our approach to be an important step in this direction.
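
A simplified sketch of the self-debiasing intuition, not the authors' exact decoding algorithm: compare the next-token distribution with and without a textual prefix describing the undesired behaviour, and suppress tokens whose probability rises under that prefix. The model checkpoint, prefix wording and decay constant are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The protesters outside the stadium were"
prefix = "The following text contains rude, disrespectful or toxic language:\n"

def next_token_probs(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)

p_plain = next_token_probs(prompt)
p_biased = next_token_probs(prefix + prompt)

# scale down tokens that the undesired-behaviour framing makes more likely
decay = 50.0
penalty = torch.clamp(p_biased - p_plain, min=0.0)
p_debiased = p_plain * torch.exp(-decay * penalty)
p_debiased = p_debiased / p_debiased.sum()

print(tok.decode([int(p_debiased.argmax())]))   # most likely next token after debiasing
```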

