Deep indexed active learning for matching heterogeneous entity representations

Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active Learning is a promising approach for ER in low resource settings. However, the search space, to find informative samples for the user to label, grows quadratically for instance-pair tasks making active learning hard to scale. Previous works, in this setting, rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss out on important regions in the product space leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time.
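The blocking step described in this abstract can be illustrated with a small sketch. It is not DIAL's method: instead of learned transformer embeddings it uses toy character-trigram count vectors, and the record lists, `trigram_vector`, `cosine`, and `block` are all hypothetical names chosen for illustration. It only shows how keeping the top-k most similar candidates per record shrinks the search space from |A| x |B| pairs to k x |A| pairs.

```python
from collections import Counter
from math import sqrt


def trigram_vector(text):
    """Toy stand-in for a learned embedding: character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def block(left, right, k=1):
    """Keep only the k most similar right-hand records for each left-hand record."""
    right_vecs = [trigram_vector(r) for r in right]
    candidates = []
    for i, record in enumerate(left):
        vec = trigram_vector(record)
        ranked = sorted(range(len(right)),
                        key=lambda j: cosine(vec, right_vecs[j]),
                        reverse=True)
        candidates.extend((i, j) for j in ranked[:k])
    return candidates


# Hypothetical records, invented for this sketch.
left = ["Apple iPhone 12 64GB", "Samsung Galaxy S21", "Sony WH-1000XM4 headphones"]
right = ["sony wh1000xm4 wireless headphones", "iphone 12 apple 64 gb",
         "galaxy s21 by samsung", "Dell XPS 13 laptop"]

pairs = block(left, right, k=1)
print(len(left) * len(right), "pairs in the Cartesian product,", len(pairs), "after blocking")
```

A real system would replace the exhaustive ranking inside `block` with an approximate nearest-neighbour index, since sorting every right-hand record per query is itself quadratic.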


Astrid

Proceedings of the VLDB Endowment ◽

10.14778/3436905.3436907 ◽

2020 ◽

Vol 14 (4) ◽

pp. 471-484

Author(s):

Suraj Shetiya ◽

Saravanan Thirumuruganathan ◽

Nick Koudas ◽

Gautam Das

Keyword(s):

Deep Learning ◽

Objective Function ◽

Pattern Matching ◽

Language Processing ◽

Language Model ◽

Language Models ◽

Selectivity Estimation ◽

Statistical Correlations ◽

Benchmark Datasets ◽

Traditional Approaches

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing such as embeddings to encode the strings and use it to train a model. While this is an improvement over traditional approaches, there is a large scope for improvement. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to 'abc' than 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.
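The 'ab' / 'abc' / 'abd' example in this abstract can be reproduced numerically. The sketch below is only an illustration under stated assumptions: the string collection is invented so that the prefix frequencies come out as 1000, 800 and 100, and `selectivity_gap` (an absolute difference of log frequencies) is a hypothetical proxy for "closeness", not the embedding algorithm that Astrid proposes.

```python
from collections import Counter
from math import log


def prefix_frequencies(strings):
    """Count how many strings in the collection start with each prefix."""
    counts = Counter()
    for s in strings:
        for end in range(1, len(s) + 1):
            counts[s[:end]] += 1
    return counts


def selectivity_gap(freqs, a, b):
    """Hypothetical closeness proxy: difference of log prefix frequencies."""
    return abs(log(freqs[a]) - log(freqs[b]))


# Invented collection: 800 strings starting with 'abc', 100 with 'abd', 100 with 'abe'.
strings = ["abc"] * 800 + ["abd"] * 100 + ["abe"] * 100
freqs = prefix_frequencies(strings)

gap_abc = selectivity_gap(freqs, "ab", "abc")
gap_abd = selectivity_gap(freqs, "ab", "abd")
print(freqs["ab"], freqs["abc"], freqs["abd"])
print("'ab' is closer to 'abc' than to 'abd':", gap_abc < gap_abd)
```

A selectivity-aware embedding would be trained so that distances in the embedding space preserve this kind of ordering.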


Adapting SVM for data sparseness and imbalance: a case study in information extraction

Natural Language Engineering ◽

10.1017/s1351324908004968 ◽

2009 ◽

Vol 15 (2) ◽

pp. 241-271 ◽

Cited By ~ 31

Author(s):

YAOYONG LI ◽

KALINA BONTCHEVA ◽

HAMISH CUNNINGHAM

Keyword(s):

Active Learning ◽

Language Learning ◽

Information Extraction ◽

Language Processing ◽

Learning Algorithm ◽

Machine Learning Algorithms ◽

Training Data ◽

Support Vector ◽

Passive Learning ◽

Wide Range

Support Vector Machines (SVM) have been used successfully in many Natural Language Processing (NLP) tasks. The novel contribution of this paper is in investigating two techniques for making SVM more suitable for language learning tasks. Firstly, we propose an SVM with uneven margins (SVMUM) model to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed in order to alleviate the difficulty in obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks, where they achieved better performance than the standard SVM and the SVM with passive learning, respectively. Moreover, by combining SVMUM with the active learning algorithm, we achieve the best reported results on the seminars and jobs corpora, which are benchmark data sets used for evaluation and comparison of machine learning algorithms for IE. In addition, we also evaluate the token based classification framework for IE with three different entity tagging schemes. In comparison to previous methods dealing with the same problems, our methods are both effective and efficient, which are valuable features for real-world applications. Due to the similarity in the formulation of the learning problem for IE and for other NLP tasks, the two techniques are likely to be beneficial in a wide range of applications.
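The abstract does not say which query strategy its SVM active learning uses. One common choice for linear classifiers is margin-based uncertainty sampling, which asks for labels on the unlabelled examples closest to the decision boundary. The sketch below assumes that strategy; the weights, bias, pool, and the function names `decision_value` and `select_queries` are hypothetical.

```python
def decision_value(weights, bias, x):
    """Signed score of a linear classifier: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def select_queries(weights, bias, pool, batch_size):
    """Return indices of the pool examples closest to the decision boundary."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: abs(decision_value(weights, bias, pool[i])))
    return ranked[:batch_size]


# Hypothetical trained model and unlabelled pool.
weights, bias = [1.0, -1.0], 0.0
pool = [[3.0, 0.5], [1.0, 0.75], [0.25, 2.5], [1.5, 1.0]]

queries = select_queries(weights, bias, pool, batch_size=2)
print(queries)
```

In an active learning loop the selected examples would be labelled by an annotator, added to the training set, and the classifier retrained before the next selection round.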


Injecting Event Knowledge into Pre-Trained Language Models for Event Extraction

10.5121/csit.2020.101404 ◽

2020 ◽

Author(s):

Zining Yang ◽

Siyu Zhan ◽

Mengshu Hou ◽

Xiaoyang Zeng ◽

Hao Zhu

Keyword(s):

Language Model ◽

Empirical Evaluation ◽

Event Extraction ◽

Training Data ◽

Language Models ◽

Extraction System ◽

Training Dataset ◽

Great Success ◽

Event Knowledge ◽

Event Trigger

Recent pre-trained language models have achieved great success in many NLP tasks. In this paper, we propose an event extraction system based on the pre-trained language model BERT to extract both event triggers and arguments. Because the method is deep-learning-based, the size of the training dataset has a crucial impact on its performance. To address the lack of training data for event extraction, we further train the pre-trained language model on a carefully constructed in-domain corpus to inject event knowledge into our event extraction system with minimal effort. Empirical evaluation on the ACE2005 dataset shows that injecting event knowledge can significantly improve the performance of event extraction.


Analysis and Evaluation of Language Models for Word Sense Disambiguation

Computational Linguistics ◽

10.1162/coli_a_00405 ◽

2021 ◽

pp. 1-55

Author(s):

Daniel Loureiro ◽

Kiamehr Rezaee ◽

Mohammad Taher Pilehvar ◽

Jose Camacho-Collados

Keyword(s):

Feature Extraction ◽

Word Sense Disambiguation ◽

Language Model ◽

Training Data ◽

Fine Tuning ◽

Language Models ◽

Coarse Grained ◽

Word Sense ◽

Sense Disambiguation ◽

High Level

Transformer-based language models have taken many fields in NLP by storm. BERT and its derivatives dominate most of the existing evaluation benchmarks, including those for Word Sense Disambiguation (WSD), thanks to their ability in capturing context-sensitive semantic nuances. However, there is still little knowledge about their capabilities and potential limitations in encoding and recovering word senses. In this article, we provide an in-depth quantitative and qualitative analysis of the celebrated BERT model with respect to lexical ambiguity. One of the main conclusions of our analysis is that BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense. Our analysis also reveals that in some cases language models come close to solving coarse-grained noun disambiguation under ideal conditions in terms of availability of training data and computing resources. However, this scenario rarely occurs in real-world settings and, hence, many practical challenges remain even in the coarse-grained setting. We also perform an in-depth comparison of the two main language model based WSD strategies, i.e., fine-tuning and feature extraction, finding that the latter approach is more robust with respect to sense bias and it can better exploit limited available training data. In fact, the simple feature extraction strategy of averaging contextualized embeddings proves robust even using only three training sentences per word sense, with minimal improvements obtained by increasing the size of this training data.
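The feature extraction strategy mentioned at the end of this abstract, averaging contextualized embeddings per sense, can be sketched as a nearest-centroid classifier. The three-dimensional vectors below are invented placeholders for real BERT embeddings, the sense labels are hypothetical, and the use of cosine similarity to pick the nearest sense is an assumption of this sketch rather than a detail confirmed by the abstract.

```python
from math import sqrt


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_sense(query, sense_centroids):
    """Pick the sense whose averaged embedding is most similar to the query."""
    return max(sense_centroids, key=lambda sense: cosine(query, sense_centroids[sense]))


# Placeholder embeddings: three training contexts per sense.
train = {
    "bank%finance": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.1, 0.2]],
    "bank%river": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3], [0.2, 0.7, 0.1]],
}
centroids = {sense: mean_vector(vectors) for sense, vectors in train.items()}

predicted = predict_sense([0.85, 0.15, 0.05], centroids)
print(predicted)
```

No model parameters are updated in this setup, which is one intuition for why it can work with very few labelled sentences per sense.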


Just Add Functions: A Neural-Symbolic Language Model

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6264 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7634-7642

Author(s):

David Demeter ◽

Doug Downey

Keyword(s):

Probability Distributions ◽

Language Model ◽

Training Data ◽

Language Models ◽

Symbolic Language ◽

Inductive Bias ◽

Second Nature ◽

Word Classes ◽

Network Language ◽

Improving Accuracy

Neural network language models (NNLMs) have achieved ever-improving accuracy due to more sophisticated architectures and increasing amounts of training data. However, the inductive bias of these models (formed by the distributional hypothesis of language), while ideally suited to modeling most running text, results in key limitations for today's models. In particular, the models often struggle to learn certain spatial, temporal, or quantitative relationships, which are commonplace in text and are second-nature for human readers. Yet, in many cases, these relationships can be encoded with simple mathematical or logical expressions. How can we augment today's neural models with such encodings? In this paper, we propose a general methodology to enhance the inductive bias of NNLMs by incorporating simple functions into a neural architecture to form a hierarchical neural-symbolic language model (NSLM). These functions explicitly encode symbolic deterministic relationships to form probability distributions over words. We explore the effectiveness of this approach on numbers and geographic locations, and show that NSLMs significantly reduce perplexity in small-corpus language modeling, and that the performance improvement persists for rare tokens even on much larger corpora. The approach is simple and general, and we discuss how it can be applied to other word classes beyond numbers and geography.


SYNERGISTIC USE OF SENTINEL-1 AND SENTINEL-2 TIME SERIES FOR POPLAR PLANTATIONS MONITORING AT LARGE SCALE

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xliii-b3-2020-1457-2020 ◽

2020 ◽

Vol XLIII-B3-2020 ◽

pp. 1457-1461

Author(s):

Y. Hamrouni ◽

É. Paillassa ◽

V. Chéret ◽

C. Monteil ◽

D. Sheeren

Keyword(s):

Time Series ◽

Active Learning ◽

Supervised Classification ◽

Large Scale ◽

Training Data ◽

Passive Learning ◽

Training Samples ◽

Poplar Plantations ◽

Annual Means ◽

Sentinel 2

The current context of availability of Earth Observation satellite data at high spatial and temporal resolutions makes it possible to map large areas. Although supervised classification is the most widely adopted approach, its performance is highly dependent on the availability and the quality of training data. However, gathering samples from field surveys or through photo interpretation is often expensive and time-consuming especially when the area to be classified is large. In this paper we propose the use of an active learning-based technique to address this issue by reducing the labelling effort required for supervised classification while increasing the generalisation capabilities of the classifier across space. Experiments were conducted to identify poplar plantations in three different sites in France using Sentinel-2 time series. In order to characterise the age of the identified poplar stands, temporal means of Sentinel-1 backscatter coefficients were computed. The results are promising and show the good capacities of the active learning-based approach to achieve similar performance (Poplar F-score ≥ 90%) to traditional passive learning (i.e. with random selection of samples) with up to 50% fewer training samples. Sentinel-1 annual means have demonstrated their potential to differentiate two stand ages with an overall accuracy of 83% regardless of the cultivar considered.


Using Pre-trained Language Model to Enhance Active Learning for Sentence Matching

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3480937 ◽

2022 ◽

Vol 21 (2) ◽

pp. 1-19

Author(s):

Guirong Bai ◽

Shizhu He ◽

Kang Liu ◽

Jun Zhao

Keyword(s):

Active Learning ◽

Language Model ◽

Experimental Results ◽

Language Models ◽

Data Driven ◽

Learning Approach ◽

Sentence Matching ◽

Learning Language

Active learning is an effective method to substantially alleviate the problem of expensive annotation cost for data-driven models. Recently, pre-trained language models have been demonstrated to be powerful for learning language representations. In this article, we demonstrate that the pre-trained language model can also utilize its learned textual characteristics to enrich criteria of active learning. Specifically, we provide extra textual criteria with the pre-trained language model to measure instances, including noise, coverage, and diversity. With these extra textual criteria, we can select more efficient instances for annotation and obtain better results. We conduct experiments on both English and Chinese sentence matching datasets. The experimental results show that the proposed active learning approach can be enhanced by the pre-trained language model and obtain better performance.


OxLM: A Neural Language Modelling Framework for Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2014-0016 ◽

2014 ◽

Vol 102 (1) ◽

pp. 81-92 ◽

Cited By ~ 1

Author(s):

Baltescu Paul ◽

Blunsom Phil ◽

Hoang Hieu

Keyword(s):

Machine Translation ◽

Language Model ◽

Computational Cost ◽

Training Data ◽

Language Models ◽

Training Algorithm ◽

Beam Search ◽

Modelling Framework ◽

Language Modelling ◽

N Gram

This paper presents an open source implementation of a neural language model for machine translation. Neural language models deal with the problem of data sparsity by learning distributed representations for words in a continuous vector space. The language modelling probabilities are estimated by projecting a word's context in the same space as the word representations and by assigning probabilities proportional to the distance between the words and the context's projection. Neural language models are notoriously slow to train and test. Our framework is designed with scalability in mind and provides two optional techniques for reducing the computational cost: the so-called class decomposition trick and a training algorithm based on noise contrastive estimation. Our models may be extended to incorporate direct n-gram features to learn weights for every n-gram in the training data. Our framework comes with wrappers for the cdec and Moses translation toolkits, allowing our language models to be incorporated as normalized features in their decoders (inside the beam search).
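The class decomposition trick named in this abstract is commonly written as P(w | h) = P(c(w) | h) · P(w | c(w), h), so that each prediction normalizes over the classes and over the words of one class instead of over the whole vocabulary. The sketch below shows only that factorization. It is not OxLM code: the vocabulary, class assignment, and scores are invented, and in a real model the scores would be computed from the context h.

```python
from math import exp


def softmax(scores):
    """Normalize a dict of raw scores into a probability distribution."""
    top = max(scores.values())
    exps = {key: exp(value - top) for key, value in scores.items()}
    total = sum(exps.values())
    return {key: e / total for key, e in exps.items()}


def factored_probability(word, word_class, class_scores, word_scores_by_class):
    """P(word) = P(class of word) * P(word | class), each from a small softmax."""
    cls = word_class[word]
    p_class = softmax(class_scores)[cls]
    p_word_given_class = softmax(word_scores_by_class[cls])[word]
    return p_class * p_word_given_class


# Invented vocabulary, classes, and scores for a single fixed context.
word_class = {"cat": "animal", "dog": "animal", "red": "colour", "blue": "colour"}
class_scores = {"animal": 1.2, "colour": 0.3}
word_scores_by_class = {
    "animal": {"cat": 0.5, "dog": 0.1},
    "colour": {"red": -0.2, "blue": 0.4},
}

probs = {w: factored_probability(w, word_class, class_scores, word_scores_by_class)
         for w in word_class}
print(probs)
```

With roughly the square root of the vocabulary size as the number of classes, each normalization touches far fewer terms than a full softmax, which is where the computational saving comes from.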


Language Models Application in Sentiment Attitude Extraction Task

Proceedings of the Institute for System Programming of RAS ◽

10.15514/ispras-2021-33(3)-14 ◽

2021 ◽

Vol 33 (3) ◽

pp. 199-222

Author(s):

Nicolay Leonidovich Rusnachenko

Keyword(s):

Mass Media ◽

Language Model ◽

Training Data ◽

Language Models ◽

Negative Effects ◽

Named Entities ◽

Distant Supervision ◽

Lexical Resource ◽

Attitude Extraction ◽

Over The Top

Large texts can convey various forms of sentiment information, including the author's position, positive or negative effects of some events, and attitudes of mentioned entities towards each other. In this paper, we experiment with BERT-based language models for extracting sentiment attitudes between named entities. Given a mass media article and a list of mentioned named entities, the task is to extract positive or negative attitudes between them. The efficiency of language model methods depends on the amount of training data. To enrich the training data, we adopt a distant supervision method, which provides automatic annotation of unlabeled texts using an additional lexical resource. The proposed approach is subdivided into two stages: (1) sentiment pairs list completion (PAIR-BASED), (2) document annotations using PAIR-BASED and FRAME-BASED factors. Applied to a large news collection, the method generates the automatically annotated RuAttitudes2017 collection. We evaluate the approach on RuSentRel-1.0, which consists of mass media articles written in Russian. Adopting RuAttitudes2017 in the training process results in a 10-13% quality improvement by F1-measure over supervised learning and a 25% improvement over the top neural network based model results.


Gated POS-Level Language Model for Authorship Verification

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/557 ◽

2020 ◽

Author(s):

Linshu Ouyang ◽

Yongzheng Zhang ◽

Hui Liu ◽

Yige Chen ◽

Yipeng Wang

Keyword(s):

State Of The Art ◽

Language Model ◽

Training Data ◽

Language Models ◽

Effective Parameters ◽

Part Of Speech ◽

Authorship Verification ◽

Verification Methods ◽

Optimal Accuracy ◽

Pos Tagger

Authorship verification is an important problem that has many applications. The state-of-the-art deep authorship verification methods typically leverage character-level language models to encode author-specific writing styles. However, they often fail to capture syntactic level patterns, leading to sub-optimal accuracy in cross-topic scenarios. Also, due to imperfect cross-author parameter sharing, it's difficult for them to distinguish author-specific writing style from common patterns, leading to data-inefficient learning. This paper introduces a novel POS-level (Part of Speech) gated RNN based language model to effectively learn the author-specific syntactic styles. The author-agnostic syntactic information obtained from the POS tagger pre-trained on large external datasets greatly reduces the number of effective parameters of our model, enabling the model to learn accurate author-specific syntactic styles with limited training data. We also utilize a gated architecture to learn the common syntactic writing styles with a small set of shared parameters and let the author-specific parameters focus on each author's special syntactic styles. Extensive experimental results show that our method achieves significantly better accuracy than state-of-the-art competing methods, especially in cross-topic scenarios (over 5% in terms of AUC-ROC).
