Gated POS-Level Language Model for Authorship Verification

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/557 ◽

2020 ◽

Author(s):

Linshu Ouyang ◽

Yongzheng Zhang ◽

Hui Liu ◽

Yige Chen ◽

Yipeng Wang

Keyword(s):

State Of The Art ◽

Language Model ◽

Training Data ◽

Language Models ◽

Effective Parameters ◽

Part Of Speech ◽

Authorship Verification ◽

Verification Methods ◽

Optimal Accuracy ◽

Pos Tagger

Authorship verification is an important problem that has many applications. The state-of-the-art deep authorship verification methods typically leverage character-level language models to encode author-specific writing styles. However, they often fail to capture syntactic level patterns, leading to sub-optimal accuracy in cross-topic scenarios. Also, due to imperfect cross-author parameter sharing, it's difficult for them to distinguish author-specific writing style from common patterns, leading to data-inefficient learning. This paper introduces a novel POS-level (Part of Speech) gated RNN based language model to effectively learn the author-specific syntactic styles. The author-agnostic syntactic information obtained from the POS tagger pre-trained on large external datasets greatly reduces the number of effective parameters of our model, enabling the model to learn accurate author-specific syntactic styles with limited training data. We also utilize a gated architecture to learn the common syntactic writing styles with a small set of shared parameters and let the author-specific parameters focus on each author's special syntactic styles. Extensive experimental results show that our method achieves significantly better accuracy than state-of-the-art competing methods, especially in cross-topic scenarios (over 5\% in terms of AUC-ROC).

Download Full-text

Attending to Entities for Better Text Understanding

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6254 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7554-7561

Author(s):

Pengxiang Cheng ◽

Katrin Erk

Keyword(s):

Large Scale ◽

Human Performance ◽

State Of The Art ◽

Syntactic Structure ◽

Semantic Knowledge ◽

Training Data ◽

Language Models ◽

Long Distance ◽

Future Directions ◽

Text Understanding

Recent progress in NLP witnessed the development of large-scale pre-trained language models (GPT, BERT, XLNet, etc.) based on Transformer (Vaswani et al. 2017), and in a range of end tasks, such models have achieved state-of-the-art results, approaching human performance. This clearly demonstrates the power of the stacked self-attention architecture when paired with a sufficient number of layers and a large amount of pre-training data. However, on tasks that require complex and long-distance reasoning where surface-level cues are not enough, there is still a large gap between the pre-trained models and human performance. Strubell et al. (2018) recently showed that it is possible to inject knowledge of syntactic structure into a model through supervised self-attention. We conjecture that a similar injection of semantic knowledge, in particular, coreference information, into an existing model would improve performance on such complex problems. On the LAMBADA (Paperno et al. 2016) task, we show that a model trained from scratch with coreference as auxiliary supervision for self-attention outperforms the largest GPT-2 model, setting the new state-of-the-art, while only containing a tiny fraction of parameters compared to GPT-2. We also conduct a thorough analysis of different variants of model architectures and supervision configurations, suggesting future directions on applying similar techniques to other problems.

Download Full-text

Improved CCG Parsing with Semi-supervised Supertagging

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00186 ◽

2014 ◽

Vol 2 ◽

pp. 327-338 ◽

Cited By ~ 7

Author(s):

Mike Lewis ◽

Mark Steedman

Keyword(s):

Wall Street Journal ◽

State Of The Art ◽

Training Data ◽

Word Embeddings ◽

Important Goal ◽

Wall Street ◽

Feature Sets ◽

Lexical Categories ◽

Unlabelled Data ◽

Pos Tagger

Current supervised parsers are limited by the size of their labelled training data, making improving them with unlabelled data an important goal. We show how a state-of-the-art CCG parser can be enhanced, by predicting lexical categories using unsupervised vector-space embeddings of words. The use of word embeddings enables our model to better generalize from the labelled data, and allows us to accurately assign lexical categories without depending on a POS-tagger. Our approach leads to substantial improvements in dependency parsing results over the standard supervised CCG parser when evaluated on Wall Street Journal (0.8%), Wikipedia (1.8%) and biomedical (3.4%) text. We compare the performance of two recently proposed approaches for classification using a wide variety of word embeddings. We also give a detailed error analysis demonstrating where using embeddings outperforms traditional feature sets, and showing how including POS features can decrease accuracy.

Download Full-text

MSA Transformer

10.1101/2021.02.12.430858 ◽

2021 ◽

Author(s):

Roshan Rao ◽

Jason Liu ◽

Robert Verkuil ◽

Joshua Meier ◽

John F. Canny ◽

...

Keyword(s):

Structure Learning ◽

State Of The Art ◽

Language Model ◽

Language Modeling ◽

Language Models ◽

Multiple Sequence ◽

Wide Margin ◽

Current State ◽

Individual Sequences ◽

And Function

AbstractUnsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.

Download Full-text

BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition

10.21203/rs.3.rs-90025/v1 ◽

2020 ◽

Author(s):

Usman Naseem ◽

Matloob Khushi ◽

Vinay Reddy ◽

Sakthivel Rajendran ◽

Imran Razzak ◽

...

Keyword(s):

State Of The Art ◽

Language Model ◽

Named Entity Recognition ◽

Training Data ◽

Entity Recognition ◽

Future Research ◽

Named Entity ◽

Domain Specific ◽

Context Dependent ◽

Biomedical Named Entity Recognition

Abstract Background: In recent years, with the growing amount of biomedical documents, coupled with advancement in natural language processing algorithms, the research on biomedical named entity recognition (BioNER) has increased exponentially. However, BioNER research is challenging as NER in the biomedical domain are: (i) often restricted due to limited amount of training data, (ii) an entity can refer to multiple types and concepts depending on its context and, (iii) heavy reliance on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt the state-of-the-art (SOTA) models trained in general corpora which often yields unsatisfactory results. Results: We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) - bioALBERT - an effective domain-specific pre-trained language model trained on huge biomedical corpus designed to capture biomedical context-dependent NER. We adopted self-supervised loss function used in ALBERT that targets on modelling inter-sentence coherence to better learn context-dependent representations and incorporated parameter reduction strategies to minimise memory usage and enhance the training time in BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets with four different entity types. The performance is increased for; (i) disease type corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug-chem type corpora by 4.61% (BC5CDR-Chem) and 3.89 (BC4CHEMD); (iii) gene-protein type corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) Species type corpora by 6.19% (LINNAEUS) and 23.71% (Species-800) is observed which leads to a state-of-the-art results. Conclusions: The performance of proposed model on four different biomedical entity types shows that our model is robust and generalizable in recognizing biomedical entities in text. We trained four different variants of BioALBERT models which are available for the research community to be used in future research.

Download Full-text

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00112 ◽

2016 ◽

Vol 4 ◽

pp. 477-490 ◽

Cited By ~ 4

Author(s):

Ehsan Shareghi ◽

Matthias Petri ◽

Gholamreza Haffari ◽

Trevor Cohn

Keyword(s):

Infinite Order ◽

State Of The Art ◽

Language Model ◽

Language Models ◽

Memory Usage ◽

Suffix Trees ◽

Construction Time ◽

Modest Increase ◽

Order Language ◽

Language Modelling

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).

Download Full-text

QASC: A Dataset for Question Answering via Sentence Composition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6319 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8082-8090

Author(s):

Tushar Khot ◽

Peter Clark ◽

Michal Guerquin ◽

Peter Jansen ◽

Ashish Sabharwal

Keyword(s):

Common Sense ◽

Human Performance ◽

Question Answering ◽

State Of The Art ◽

Multiple Choice ◽

Training Data ◽

Language Models ◽

Current State ◽

New Concepts ◽

Large Corpus

Composing knowledge from multiple pieces of texts is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags by 20% behind human performance.

Download Full-text

Improving word prediction for augmentative communication by using idiolects and sociolects

Dutch Journal of Applied Linguistics ◽

10.1075/dujal.3.2.03sto ◽

2014 ◽

Vol 3 (2) ◽

pp. 137-154 ◽

Cited By ~ 1

Author(s):

Wessel Stoop ◽

Antal van den Bosch

Keyword(s):

Data Collection ◽

State Of The Art ◽

Training Data ◽

Language Models ◽

Prediction System ◽

Word Prediction ◽

The People ◽

The Social ◽

Statistical Language Models ◽

Assistive Communication

Word prediction, or predictive editing, has a long history as a tool for augmentative and assistive communication. Improvements in the state-of-the-art can still be achieved, for instance by training personalized statistical language models. We developed the word prediction system Soothsayer. The main innovation of Soothsayer is that it not only uses idiolects, the language of one individual person, as training data, but also sociolects, the language of the social circle around that person. We use Twitter for data collection and experimentation. The idiolect models are based on individual Twitter feeds, the sociolect models are based on the tweets of a particular person and the tweets of the people he often communicates with. The sociolect approach achieved the best results. For a number of users, more than 50% of the keystrokes could have been saved if they had used Soothsayer.

Download Full-text

Injecting Event Knowledge into Pre-Trained Language Models for Event Extraction

10.5121/csit.2020.101404 ◽

2020 ◽

Author(s):

Zining Yang ◽

Siyu Zhan ◽

Mengshu Hou ◽

Xiaoyang Zeng ◽

Hao Zhu

Keyword(s):

Language Model ◽

Empirical Evaluation ◽

Event Extraction ◽

Training Data ◽

Language Models ◽

Extraction System ◽

Training Dataset ◽

Great Success ◽

Event Knowledge ◽

Event Trigger

The recent pre-trained language model has made great success in many NLP tasks. In this paper, we propose an event extraction system based on the novel pre-trained language model BERT to extract both event trigger and argument. As a deep-learningbased method, the size of the training dataset has a crucial impact on performance. To address the lacking training data problem for event extraction, we further train the pretrained language model with a carefully constructed in-domain corpus to inject event knowledge to our event extraction system with minimal efforts. Empirical evaluation on the ACE2005 dataset shows that injecting event knowledge can significantly improve the performance of event extraction.

Download Full-text

Analysis and Evaluation of Language Models for Word Sense Disambiguation

Computational Linguistics ◽

10.1162/coli_a_00405 ◽

2021 ◽

pp. 1-55

Author(s):

Daniel Loureiro ◽

Kiamehr Rezaee ◽

Mohammad Taher Pilehvar ◽

Jose Camacho-Collados

Keyword(s):

Feature Extraction ◽

Word Sense Disambiguation ◽

Language Model ◽

Training Data ◽

Fine Tuning ◽

Language Models ◽

Coarse Grained ◽

Word Sense ◽

Sense Disambiguation ◽

High Level

Abstract Transformer-based language models have taken many fields in NLP by storm. BERT and its derivatives dominate most of the existing evaluation benchmarks, including those for Word Sense Disambiguation (WSD), thanks to their ability in capturing context-sensitive semantic nuances. However, there is still little knowledge about their capabilities and potential limitations in encoding and recovering word senses. In this article, we provide an in-depth quantitative and qualitative analysis of the celebrated BERT model with respect to lexical ambiguity. One of the main conclusions of our analysis is that BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense. Our analysis also reveals that in some cases language models come close to solving coarse-grained noun disambiguation under ideal conditions in terms of availability of training data and computing resources. However, this scenario rarely occurs in real-world settings and, hence, many practical challenges remain even in the coarse-grained setting. We also perform an in-depth comparison of the two main language model based WSD strategies, i.e., fine-tuning and feature extraction, finding that the latter approach is more robust with respect to sense bias and it can better exploit limited available training data. In fact, the simple feature extraction strategy of averaging contextualized embeddings proves robust even using only three training sentences per word sense, with minimal improvements obtained by increasing the size of this training data.

Download Full-text

Deep indexed active learning for matching heterogeneous entity representations

Proceedings of the VLDB Endowment ◽

10.14778/3485450.3485455 ◽

2021 ◽

Vol 15 (1) ◽

pp. 31-45

Author(s):

Arjit Jain ◽

Sunita Sarawagi ◽

Prithviraj Sen

Keyword(s):

Active Learning ◽

Committee Member ◽

Language Model ◽

Cartesian Product ◽

Rule Learning ◽

Search Space ◽

Training Data ◽

Language Models ◽

Passive Learning ◽

Benchmark Datasets

Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active Learning is a promising approach for ER in low resource settings. However, the search space, to find informative samples for the user to label, grows quadratically for instance-pair tasks making active learning hard to scale. Previous works, in this setting, rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss out on important regions in the product space leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time.

Download Full-text