Using Pre-trained Language Model to Enhance Active Learning for Sentence Matching

Author(s):  
Guirong Bai ◽  
Shizhu He ◽  
Kang Liu ◽  
Jun Zhao

Active learning is an effective method to substantially alleviate the problem of expensive annotation cost for data-driven models. Recently, pre-trained language models have been demonstrated to be powerful for learning language representations. In this article, we demonstrate that a pre-trained language model can also use its learned textual characteristics to enrich the criteria of active learning. Specifically, we use the pre-trained language model to provide extra textual criteria for measuring instances, including noise, coverage, and diversity. With these extra textual criteria, we can select more informative instances for annotation and obtain better results. We conduct experiments on both English and Chinese sentence matching datasets. The experimental results show that the proposed active learning approach is indeed enhanced by the pre-trained language model and achieves better performance.
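
As a rough illustration of selection under multiple criteria (the scoring and weighting below are assumptions, not the paper's exact noise/coverage/diversity measures), the following sketch combines model uncertainty with diversity in a pre-trained language model's embedding space:

```python
# Hypothetical sketch: greedily pick instances that are uncertain but also
# far (in embedding space) from everything already selected. The embeddings
# stand in for pre-trained language model representations.
import numpy as np

def select_for_annotation(uncertainty, embeddings, budget):
    selected = []
    candidates = list(range(len(uncertainty)))
    while candidates and len(selected) < budget:
        if not selected:
            scores = uncertainty[candidates]
        else:
            chosen = embeddings[selected]                          # (k, d)
            dists = np.linalg.norm(
                embeddings[candidates][:, None, :] - chosen[None, :, :], axis=-1
            ).min(axis=1)                                          # distance to nearest selected
            scores = uncertainty[candidates] * dists               # uncertain AND diverse
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy usage with random stand-ins for real model outputs.
rng = np.random.default_rng(0)
uncertainty = rng.random(100)            # e.g. 1 - max softmax probability
embeddings = rng.normal(size=(100, 32))  # e.g. [CLS] vectors from a PLM
print(select_for_annotation(uncertainty, embeddings, budget=10))
```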

2021 ◽  
Vol 15 (1) ◽  
pp. 31-45
Author(s):  
Arjit Jain ◽  
Sunita Sarawagi ◽  
Prithviraj Sen

Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real-world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active learning is a promising approach for ER in low-resource settings. However, the search space to find informative samples for the user to label grows quadratically for instance-pair tasks, making active learning hard to scale. Previous works in this setting rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss important regions of the product space, leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall, and running time.
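
A minimal sketch of committee-based selection for instance pairs, assuming vote entropy as the disagreement measure (the names and details are illustrative, not DIAL's implementation):

```python
# Each committee member votes "match"/"non-match" for a candidate pair;
# pairs with the highest vote entropy are sent to the user for labeling.
import numpy as np

def vote_entropy(votes):
    """votes: (n_pairs, n_members) binary matrix of committee predictions."""
    p_match = votes.mean(axis=1)
    p = np.clip(np.stack([p_match, 1 - p_match], axis=1), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

rng = np.random.default_rng(0)
votes = rng.integers(0, 2, size=(1000, 5))   # stand-in committee predictions
query_idx = np.argsort(-vote_entropy(votes))[:20]
print("pairs to label:", query_idx)
```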


Author(s):  
A. Evtushenko

Machine learning language models are combinations of algorithms and neural networks designed for processing text written in natural language (Natural Language Processing, NLP). In 2020, OpenAI, an artificial intelligence research company, released its largest language model, GPT-3, with up to 175 billion parameters. Increasing the number of parameters by more than 100 times made it possible to improve the quality of generated texts to a level that is hard to distinguish from human-written text. Notably, this model was trained on a dataset collected mainly from open sources on the Internet, whose volume is estimated at 570 GB. This article discusses the problem of memorizing critical information, in particular the personal data of individuals, during the training of large language models (GPT-2/3 and derivatives). It also describes an algorithmic approach to solving this problem, which consists of additional preprocessing of the training dataset and refinement of model inference so that pseudo-personal data are generated and embedded into the outputs of summarization, text generation, question answering, and other seq2seq tasks.
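
A minimal sketch of the kind of dataset preprocessing the article refers to, assuming simple regular-expression rules (the concrete rules here are illustrative, not the author's algorithm):

```python
# Replace personal identifiers in the training corpus with pseudo-personal
# placeholders before language-model training.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d ()-]{7,}\d"),
}

def pseudonymize(text):
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(pseudonymize("Contact Ivan at ivan.petrov@example.com or +7 912 345-67-89."))
# -> "Contact Ivan at <EMAIL> or <PHONE>."
```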


Author(s):  
MIRJAM SEPESY MAUČEC ◽  
TOMAŽ ROTOVNIK ◽  
ZDRAVKO KAČIČ ◽  
JANEZ BREST

This paper presents the results of a study on modeling the highly inflective Slovenian language. We focus on creating a language model for a large-vocabulary speech recognition system. A new data-driven method is proposed for inducing inflectional morphology into language modeling. The research focus is on data sparsity, which results from the complex morphology of the language. The idea of using subword units is examined: an attempt is made to find a segmentation of words into two subword units, stems and endings, using no prior knowledge of the language. The subword units should fit into the frameworks of probabilistic language models. We do not seek a morphologically correct decomposition of words, but rather a decomposition that yields the minimum entropy of the training corpus, where this entropy is approximated using N-gram models. Despite some seemingly over-simplified assumptions, the subword models improve the applicability of language models for a sparse training corpus. The experiments were performed using the VEČER newswire text corpus as a training corpus. The test set was taken from the SNABI speech database, because the final models were evaluated in speech recognition experiments on that database. Two different subword-based models are proposed and examined experimentally. The experiments demonstrate that subword-based models, which considerably reduce the OOV rate, improve speech recognition WER compared with standard word-based models, even though they increase test set perplexity. Subword-based models with improved perplexity, but which reduce the OOV rate much less than the previous ones, do not improve speech recognition results.
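
The following toy sketch illustrates the minimum-entropy idea, assuming a greedy per-word search and a unigram approximation of corpus code length (not the paper's algorithm):

```python
# For each word, try every stem+ending split and keep the split that adds
# the fewest bits to the growing subword corpus.
import math
from collections import Counter

def total_bits(counts):
    """Approximate corpus code length: total tokens x unigram entropy."""
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

def best_split(word, counts):
    best = None
    for i in range(1, len(word) + 1):          # the ending may be empty
        units = [word[:i]] + ([word[i:]] if word[i:] else [])
        trial = counts.copy()
        trial.update(units)
        bits = total_bits(trial)
        if best is None or bits < best[0]:
            best = (bits, units)
    return best[1]

counts = Counter()
for word in ["delati", "delamo", "delate", "gledati", "gledamo"]:  # toy word forms
    units = best_split(word, counts)
    counts.update(units)
    print(word, "->", "+".join(units))
```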


2021 ◽  
Author(s):  
Andrew E Blanchard ◽  
John Gounley ◽  
Debsindhu Bhowmik ◽  
Mayanka Chandra Shekar ◽  
Isaac Lyngaas ◽  
...  

The COVID-19 pandemic highlights the need for computational tools to automate and accelerate drug design for novel protein targets. We leverage deep learning language models to generate and score drug candidates based on predicted protein binding affinity. We pre-trained a deep learning language model (BERT) on ~9.6 billion molecules and achieved peak performance of 603 petaflops in mixed precision. Our work reduces pre-training time from days to hours, compared to previous efforts with this architecture, while also increasing the dataset size by nearly an order of magnitude. For scoring, we fine-tuned the language model using an assembled set of thousands of protein targets with binding affinity data and searched for inhibitors of specific protein targets, SARS-CoV-2 Mpro and PLpro. We utilized a genetic algorithm approach for finding optimal candidates using the generation and scoring capabilities of the language model. Our generalizable models accelerate the identification of inhibitors for emerging therapeutic targets.
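
The search loop can be pictured roughly as follows; the generation and scoring functions are random stand-ins for the fine-tuned language model (a schematic sketch, not the authors' implementation):

```python
# Genetic search over candidate molecules: generate variants, score them,
# and keep the top-scoring population for the next generation.
import random

def generate_mutations(smiles, n=4):          # stand-in for LM-based generation
    return [smiles + random.choice("CNOF") for _ in range(n)]

def score_affinity(smiles):                   # stand-in for the fine-tuned affinity scorer
    return random.random()

def genetic_search(seed_population, generations=10, keep=8):
    population = list(seed_population)
    for _ in range(generations):
        offspring = [m for s in population for m in generate_mutations(s)]
        population = sorted(population + offspring, key=score_affinity, reverse=True)[:keep]
    return population

print(genetic_search(["CCO", "c1ccccc1"])[:3])
```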


2013 ◽  
Vol 21 (2) ◽  
pp. 201-226 ◽  
Author(s):  
DEYI XIONG ◽  
MIN ZHANG

The language model is one of the most important knowledge sources for statistical machine translation. In this article, we present two extensions to standard n-gram language models in statistical machine translation: a backward language model that augments the conventional forward language model, and a mutual information trigger model which captures long-distance dependencies that go beyond the scope of standard n-gram language models. We introduce algorithms to integrate the two proposed models into two kinds of state-of-the-art phrase-based decoders. Our experimental results on Chinese/Spanish/Vietnamese-to-English show that both models are able to significantly improve translation quality in terms of BLEU and METEOR over a competitive baseline.
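
For reference, the two components in their commonly used forms (the paper's exact formulations may differ):

```latex
% Forward and backward n-gram language models over a sentence w_1 ... w_m:
P_f(w_1^m) = \prod_{i=1}^{m} P\left(w_i \mid w_{i-n+1}^{\,i-1}\right)
\qquad
P_b(w_1^m) = \prod_{i=1}^{m} P\left(w_i \mid w_{i+1}^{\,i+n-1}\right)

% Pointwise mutual information used to score a long-distance trigger pair (x -> y):
\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\, P(y)}
```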


2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-25
Author(s):  
Gust Verbruggen ◽  
Vu Le ◽  
Sumit Gulwani

The ability to learn programs from few examples is a powerful technology with disruptive applications in many domains, as it allows users to automate repetitive tasks in an intuitive way. Existing frameworks for inductive synthesis only perform syntactic manipulations, relying on the syntactic structure of the given examples and not their meaning. Any semantic manipulations, such as transforming dates, have to be manually encoded by the designer of the inductive programming framework. Recent advances in large language models have shown these models to be very adept at performing semantic transformations of their input when given just a few examples of the task at hand. When it comes to syntactic transformations, however, these models are limited in their expressive power. In this paper, we propose a novel framework for integrating inductive synthesis with few-shot learning language models to combine the strengths of these two popular technologies. In particular, the inductive synthesis is tasked with breaking down the problem into smaller subproblems, among which those that cannot be solved syntactically are passed to the language model. We formalize three semantic operators that can be integrated with inductive synthesizers. To minimize invoking expensive semantic operators during learning, we introduce a novel deferred query execution algorithm that treats the operators as oracles during learning. We evaluate our approach in the domain of string transformations: the combined methodology can automate tasks that cannot be handled by either technology on its own. Finally, we demonstrate the generality of our approach via a case study in the domain of string profiling.
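
A small sketch of the idea of deferring semantic subproblems to a language-model oracle; the class and function names are hypothetical, not the paper's API:

```python
# During search, semantic subproblems are recorded as deferred oracle queries
# instead of being sent to the language model immediately; only the queries
# that survive in a complete program are executed at the end.
class DeferredQuery:
    def __init__(self, task, example_inputs):
        self.task, self.example_inputs = task, example_inputs

    def execute(self, oracle):
        return [oracle(self.task, x) for x in self.example_inputs]

def toy_oracle(task, x):                      # stand-in for a few-shot language model call
    return {"month-name": {"01": "January", "02": "February"}}.get(task, {}).get(x, "?")

# The syntactic part (substring extraction) is solved by the synthesizer;
# the semantic part (naming the month) is deferred.
dates = ["2021-01-05", "2021-02-11"]
months = [d.split("-")[1] for d in dates]     # synthesized syntactic step
query = DeferredQuery("month-name", months)   # deferred semantic step
print(query.execute(toy_oracle))              # executed only once the program is complete
```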


Author(s):  
Sho Takase ◽  
Jun Suzuki ◽  
Masaaki Nagata

This paper proposes a novel Recurrent Neural Network (RNN) language model that takes advantage of character information. We focus on character n-grams, building on research in the field of word embedding construction (Wieting et al. 2016). Our proposed method constructs word embeddings from character n-gram embeddings and combines them with ordinary word embeddings. We demonstrate that the proposed method achieves the best perplexities on the language modeling datasets Penn Treebank, WikiText-2, and WikiText-103. Moreover, we conduct experiments on two application tasks: machine translation and headline generation. The experimental results indicate that our proposed method also positively affects these tasks.
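
A rough sketch of the embedding construction, assuming summed character n-gram vectors concatenated with a word vector (the dimensions and the combination operator are assumptions, not the paper's exact design):

```python
# Build a word representation from its character n-grams plus an ordinary
# word embedding; the result would feed the RNN language model.
import numpy as np

def char_ngrams(word, n=3):
    padded = f"^{word}$"                       # mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

rng = np.random.default_rng(0)
dim = 8
ngram_emb = {}   # lazily created character n-gram embeddings
word_emb = {}    # lazily created ordinary word embeddings

def embed(word):
    ngram_vec = sum(ngram_emb.setdefault(g, rng.normal(size=dim))
                    for g in char_ngrams(word))
    word_vec = word_emb.setdefault(word, rng.normal(size=dim))
    return np.concatenate([ngram_vec, word_vec])

print(embed("language").shape)   # (16,)
```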


Author(s):  
Wenhua Yang ◽  
Yu Zhou ◽  
Zhiqiu Huang

Application Programming Interfaces (APIs) play an important role in modern software development. Developers interact with APIs on a daily basis and thus need to learn and memorize the APIs suitable for implementing the required functions. This can be a burden even for experienced developers, since there exists a mass of available APIs. API recommendation techniques focus on assisting developers in selecting suitable APIs. However, existing API recommendation techniques have not taken developers' personal characteristics into account. As a result, they cannot provide developers with personalized API recommendation services. Meanwhile, they lack support for self-defined APIs in the recommendation. To this end, we propose a personalized API recommendation method that considers developers' differences. Our API recommendation method is based on statistical language modeling. We propose a model structure that combines the N-gram model and the long short-term memory (LSTM) neural network, and we train predictive models using API invocation sequences extracted from GitHub code repositories. A general language model trained on all sorts of code data is first acquired; based on it, two personalized language models that recommend personalized library APIs and self-defined APIs are trained using the code data of the developer who needs personalized services. We evaluate our personalized API recommendation method on real-world developers, and the experimental results show that our approach achieves better accuracy in recommending both library APIs and self-defined APIs compared with the state of the art. The experimental results also confirm the effectiveness of our hybrid model structure and the choice of the LSTM's size.
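
To illustrate the general idea, the sketch below simply interpolates bigram and sequence-model probabilities over API tokens; the paper's hybrid N-gram/LSTM structure is more integrated than this, and the LSTM component here is stubbed out:

```python
# Recommend the next API call by mixing bigram statistics from code
# repositories with a (stubbed) sequence-model distribution.
from collections import Counter, defaultdict

sequences = [
    ["File.open", "File.read", "File.close"],
    ["File.open", "File.write", "File.close"],
]

bigrams = defaultdict(Counter)
for seq in sequences:
    for prev, nxt in zip(seq, seq[1:]):
        bigrams[prev][nxt] += 1

vocab = sorted({api for seq in sequences for api in seq})

def ngram_probs(prev):
    counts = bigrams[prev]
    total = sum(counts.values()) or 1
    return {api: counts[api] / total for api in vocab}

def lstm_probs(prev):                          # stand-in for the trained LSTM's softmax
    return {api: 1 / len(vocab) for api in vocab}

def recommend(prev, lam=0.5, k=3):
    ng, lm = ngram_probs(prev), lstm_probs(prev)
    mixed = {api: lam * ng[api] + (1 - lam) * lm[api] for api in vocab}
    return sorted(mixed, key=mixed.get, reverse=True)[:k]

print(recommend("File.open"))
```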


2015 ◽  
pp. 20 ◽  
Author(s):  
Stig-Arne Grönroos ◽  
Kristiina Jokinen ◽  
Katri Hiovain ◽  
Mikko Kurimo ◽  
Sami Virpioja

Many Uralic languages have a rich morphological structure but lack the tools for morphological analysis needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation of the North Sámi language with a large unannotated corpus and a small amount of human-annotated word forms selected using an active learning approach. For statistical learning, we use the semi-supervised Morfessor Baseline and FlatCat methods. After annotating 237 words with our active learning setup, we improve morph boundary recall by over 20% with no loss of precision.
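
A hypothetical sketch of the selection step only (the study relies on Morfessor's semi-supervised training, and its actual selection criterion is not detailed in this abstract):

```python
# Pick the unannotated word forms whose current segmentation the model is
# least confident about and send them to the human annotator.
import numpy as np

def select_words(words, confidence, budget=237):
    """confidence[i]: the model's confidence in its segmentation of words[i]."""
    order = np.argsort(confidence)             # least confident first
    return [words[i] for i in order[:budget]]

rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(1000)]
confidence = rng.random(1000)                  # stand-in for real segmentation scores
print(select_words(words, confidence, budget=5))
```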


Author(s):  
Ying-Peng Tang ◽  
Sheng-Jun Huang

Active learning queries labels from the oracle for the most valuable instances to reduce the labeling cost. In many active learning studies, informative and representative instances are preferred because they are expected to have higher potential value for improving the model. Recently, results in self-paced learning show that training the model with easy examples first and then gradually with harder examples can improve performance. While informative and representative instances could be easy or hard, querying valuable but hard examples at an early stage may lead to wasted labeling cost. In this paper, we propose a self-paced active learning approach that simultaneously considers the potential value and the easiness of an instance, and tries to train the model with the least cost by querying the right thing at the right time. Experimental results show that the proposed approach is superior to state-of-the-art batch-mode active learning methods.
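
A conceptual sketch of trading off value against easiness over the course of training; the specific weighting scheme is an assumption, not the paper's formulation:

```python
# Score each unlabeled instance by both its potential value and its easiness,
# shifting emphasis from easy instances early on to valuable ones later.
import numpy as np

def self_paced_scores(value, easiness, progress):
    """progress in [0, 1]: 0 = early training (favor easy), 1 = late (favor value)."""
    return value ** progress * easiness ** (1 - progress)

rng = np.random.default_rng(0)
value = rng.random(50)        # e.g. informativeness + representativeness
easiness = rng.random(50)     # e.g. agreement of the current model with a simple baseline
early = np.argsort(-self_paced_scores(value, easiness, progress=0.1))[:5]
late = np.argsort(-self_paced_scores(value, easiness, progress=0.9))[:5]
print("early queries:", early, "late queries:", late)
```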

