Dependency-based n-gram models for general purpose sentence realisation

This paper presents a general-purpose, wide-coverage, probabilistic sentence generator based on dependency n-gram models. This is particularly interesting as many semantic or abstract syntactic input specifications for sentence realisation can be represented as labelled bi-lexical dependencies or typed predicate-argument structures. Our generation method captures the mapping between semantic representations and surface forms by linearising a set of dependencies directly, rather than via the application of grammar rules as in more traditional chart-style or unification-based generators. In contrast to conventional n-gram language models over surface word forms, we exploit structural information and various linguistic features inherent in the dependency representations to constrain the generation space and improve the generation quality. A series of experiments shows that dependency-based n-gram models generalise well to different languages (English and Chinese) and representations (LFG and CoNLL). Compared with state-of-the-art generation systems, our general-purpose sentence realiser is highly competitive with the added advantages of being simple, fast, robust and accurate.


When Low Resource NLP Meets Unsupervised Language Model: Meta-Pretraining then Meta-Learning for Few-Shot Text Classification (Student Abstract)

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i10.7158 ◽

2020 ◽

Vol 34 (10) ◽

pp. 13773-13774

Author(s):

Shumin Deng ◽

Ningyu Zhang ◽

Zhanlin Sun ◽

Jiaoyan Chen ◽

Huajun Chen

Keyword(s):

Text Classification ◽

State Of The Art ◽

Language Model ◽

Language Models ◽

Generic Model ◽

Effective Strategy ◽

Linguistic Features ◽

Meta Learning ◽

Promising Solution ◽

Model Initialization

Text classification tends to be difficult when data are deficient or when it is required to adapt to unseen classes. In such challenging scenarios, recent studies have often used meta-learning to simulate the few-shot task, thus neglecting implicit common linguistic features across tasks. This paper addresses such problems using meta-learning and unsupervised language models. Our approach is based on the insight that good generalization from a few examples relies on both a generic model initialization and an effective strategy for adapting this model to newly arising tasks. We show that our approach is not only simple but also achieves state-of-the-art performance on a well-studied sentiment classification dataset. This suggests that pretraining could be a promising solution for few-shot learning of many other NLP tasks. The code and the dataset to replicate the experiments are made available at https://github.com/zxlzr/FewShotNLP.


Sparse Non-negative Matrix Language Modeling

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00102 ◽

2016 ◽

Vol 4 ◽

pp. 329-342 ◽

Cited By ~ 1

Author(s):

Joris Pelemans ◽

Noam Shazeer ◽

Ciprian Chelba

Keyword(s):

Neural Network ◽

State Of The Art ◽

Language Modeling ◽

Language Models ◽

Close Match ◽

Modeling Techniques ◽

Network Estimation ◽

N Gram ◽

Network Language ◽

The One

We present Sparse Non-negative Matrix (SNM) estimation, a novel probability estimation technique for language modeling that can efficiently incorporate arbitrary features. We evaluate SNM language models on two corpora: the One Billion Word Benchmark and a subset of the LDC English Gigaword corpus. Results show that SNM language models trained with n-gram features are a close match for the well-established Kneser-Ney models. The addition of skip-gram features yields a model that is in the same league as the state-of-the-art recurrent neural network language models, as well as complementary: combining the two modeling techniques yields the best known result on the One Billion Word Benchmark. On the Gigaword corpus further improvements are observed using features that cross sentence boundaries. The computational advantages of SNM estimation over both maximum entropy and neural network estimation are probably its main strength, promising an approach that has large flexibility in combining arbitrary features and yet scales gracefully to large amounts of data.


BLiMP: The Benchmark of Linguistic Minimal Pairs for English

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00321 ◽

2020 ◽

Vol 8 ◽

pp. 377-392

Author(s):

Alex Warstadt ◽

Alicia Parrish ◽

Haokun Liu ◽

Anhad Mohananey ◽

Wei Peng ◽

...

Keyword(s):

State Of The Art ◽

Negative Polarity ◽

Minimal Pair ◽

Language Models ◽

Linguistic Knowledge ◽

Negative Polarity Items ◽

Polarity Items ◽

Minimal Pairs ◽

Knowledge Of Language ◽

N Gram

We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP), a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs, that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate a specific phenomenon in syntax, morphology, or semantics. We generate the data according to linguist-crafted grammar templates, and human aggregate agreement with the labels is 96.4%. We evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs by observing whether they assign a higher probability to the acceptable sentence in each minimal pair. We find that state-of-the-art models identify morphological contrasts related to agreement reliably, but they struggle with some subtle semantic and syntactic phenomena, such as negative polarity items and extraction islands.
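
The evaluation criterion described above (a model is correct on a pair when it assigns the acceptable sentence the higher probability) can be illustrated with a minimal sketch. It is not the BLiMP authors' code: the add-one-smoothed bigram model, the toy training sentences, and the names `train_bigram`, `log_prob`, and `minimal_pair_accuracy` are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenised text."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def log_prob(sentence, model):
    """Add-one-smoothed bigram log-probability of a sentence (toy model)."""
    contexts, bigrams, vocab = model
    size = len(vocab) + 1                        # +1 reserves mass for unseen words
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + size))
        for a, b in zip(tokens, tokens[1:])
    )

def minimal_pair_accuracy(pairs, model):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence receives the strictly higher log-probability."""
    correct = sum(log_prob(good, model) > log_prob(bad, model) for good, bad in pairs)
    return correct / len(pairs)

# Hypothetical toy data, not taken from BLiMP.
model = train_bigram(["the cats sleep", "the cat sleeps",
                      "the dogs bark", "the dog barks"])
pairs = [("the cats sleep", "the cats sleeps"),
         ("the dog barks", "the dog bark")]
accuracy = minimal_pair_accuracy(pairs, model)
```

Because the two sentences in a minimal pair differ only minimally and here have equal length, comparing raw sentence log-probabilities is a fair test; pairs of unequal length might call for a length-aware comparison.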


Toward an optimal code for communication: The case of scientific English

Corpus Linguistics and Linguistic Theory ◽

10.1515/cllt-2018-0088 ◽

2019 ◽

Vol 0 (0) ◽

Author(s):

Stefania Degaetano-Ortlieb ◽

Elke Teich

Keyword(s):

Information Content ◽

Time Course ◽

Modern Science ◽

Language Models ◽

Late Nineteenth Century ◽

Social Changes ◽

Linguistic Features ◽

External Pressures ◽

N Gram ◽

Late Modern

We present a model of the linguistic development of scientific English from the mid-seventeenth to the late-nineteenth century, a period that witnessed significant political and social changes, including the evolution of modern science. There is a wealth of descriptive accounts of scientific English, from both a synchronic and a diachronic perspective, but only a few attempts at a unified explanation of its evolution. The explanation we offer here is a communicative one: while external pressures (specialization, diversification) push for an increase in expressivity, communicative concerns pull toward convergence on particular options (conventionalization). What emerges over time is a code which is optimized for written, specialist communication, relying on specific linguistic means to modulate information content. As we show, this is achieved by the systematic interplay between lexis and grammar. The corpora we employ are the Royal Society Corpus (RSC) and, for comparative purposes, the Corpus of Late Modern English (CLMET). We build various diachronic, computational n-gram language models of these corpora and then apply formal measures of information content (here: relative entropy and surprisal) to detect the linguistic features significantly contributing to diachronic change, estimate the (changing) level of information of features and capture the time course of change.
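
The two information measures named above can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses add-alpha-smoothed unigram distributions rather than the n-gram models the abstract mentions, the token lists are invented, and the function names `unigram_dist`, `surprisal`, and `kl_contributions` are hypothetical.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab, alpha=1.0):
    """Add-alpha-smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}

def surprisal(word, dist):
    """Surprisal in bits: -log2 p(word)."""
    return -math.log2(dist[word])

def kl_contributions(p, q):
    """Per-word terms of the relative entropy D(p || q) in bits.
    Their sum is the divergence; large positive terms mark words that
    are typical of p compared with q."""
    return {word: p[word] * math.log2(p[word] / q[word]) for word in p}

# Invented token lists standing in for an earlier and a later time slice.
early = ["i", "saw", "the", "flame", "i", "saw"]
late = ["the", "flame", "was", "observed", "the", "flame"]
vocab = set(early) | set(late)

p_late = unigram_dist(late, vocab)
p_early = unigram_dist(early, vocab)
terms = kl_contributions(p_late, p_early)
divergence = sum(terms.values())
```

Ranking `terms` by value gives a rough picture of which items distinguish the later slice from the earlier one, which is the kind of feature detection the abstract describes at corpus scale.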


From General Language Understanding to Noisy Text Comprehension

Applied Sciences ◽

10.3390/app11177814 ◽

2021 ◽

Vol 11 (17) ◽

pp. 7814

Author(s):

Buddhika Kasthuriarachchy ◽

Madhu Chetty ◽

Adrian Shatte ◽

Darren Walls

Keyword(s):

Text Comprehension ◽

State Of The Art ◽

Language Model ◽

General Purpose ◽

Language Models ◽

Language Understanding ◽

English Usage ◽

Latent Representations ◽

Noisy Text ◽

Better Than

Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. For this, we propose a new generic methodology to derive a diverse set of sentence vectors by combining and extracting various linguistic characteristics from latent representations of multi-layer, pre-trained language models. Further, we clearly establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks to study noisy text comprehension. Experiments are carried out for classification accuracy by deriving the sentence vectors from GloVe-based pre-trained models and Sentence-BERT, and by using different hidden layers from the BERT model. We show that the initial and middle layers of BERT capture the key linguistic characteristics of noisy texts better than its later layers. With complex predictive models, we further show that sentence vector length is less important for capturing linguistic information, and that the proposed sentence vectors for noisy texts perform better than the existing state-of-the-art sentence vectors.


TreeGen: A Tree-Based Transformer Architecture for Code Generation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6430 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8984-8991

Author(s):

Zeyu Sun ◽

Qihao Zhu ◽

Yingfei Xiong ◽

Yican Sun ◽

Lili Mou ◽

...

Keyword(s):

Code Generation ◽

State Of The Art ◽

Structural Information ◽

Semantic Parsing ◽

Generation System ◽

Neural Architecture ◽

Percentage Points ◽

Code Generators ◽

Grammar Rules ◽

Previous State

A code generation system generates programming language code based on an input natural language description. State-of-the-art approaches rely on neural networks for code generation. However, these code generators suffer from two problems. One is the long-dependency problem, where a code element often depends on another far-away code element. A variable reference, for example, depends on its definition, which may appear many lines earlier. The other problem is structure modeling, as programs contain rich structural information. In this paper, we propose a novel tree-based neural architecture, TreeGen, for code generation. TreeGen uses the attention mechanism of Transformers to alleviate the long-dependency problem, and introduces a novel AST reader (encoder) to incorporate grammar rules and AST structures into the network. We evaluated TreeGen on a Python benchmark, HearthStone, and two semantic parsing benchmarks, ATIS and GEO. TreeGen outperformed the previous state-of-the-art approach by 4.5 percentage points on HearthStone, and achieved the best accuracy among neural network-based approaches on ATIS (89.1%) and GEO (89.6%). We also conducted an ablation test to better understand each component of our model.


Learning N-Gram Language Models from Uncertain Data

10.21437/interspeech.2016-1093 ◽

2016 ◽

Cited By ~ 4

Author(s):

Vitaly Kuznetsov ◽

Hank Liao ◽

Mehryar Mohri ◽

Michael Riley ◽

Brian Roark

Keyword(s):

Uncertain Data ◽

Language Models ◽

N Gram


Rescore in a Flash: Compact, Cache Efficient Hashing Data Structures for n-Gram Language Models

10.21437/interspeech.2020-1939 ◽

2020 ◽

Author(s):

Grant P. Strimel ◽

Ariya Rastrow ◽

Gautam Tiwari ◽

Adrien Piérard ◽

Jon Webb

Keyword(s):

Data Structures ◽

Language Models ◽

Cache Efficient ◽

N Gram


Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202286 ◽

2021 ◽

pp. 1-12

Author(s):

Yingwen Fu ◽

Nankai Lin ◽

Xiaotian Lin ◽

Shengyi Jiang

Keyword(s):

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Neural Models ◽

Performance Models ◽

Named Entity ◽

High Resource ◽

Benchmark Datasets

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English. For Indonesian, by contrast, related resources (both datasets and technology) are not yet well developed. In addition, affixation is an important part of word formation in Indonesian, which makes character and token features essential for token-wise Indonesian NLP tasks. However, the features extracted by current top-performing models are insufficient. Targeting the Indonesian NER task, we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.


Principled approach to the selection of the embedding dimension of networks

Nature Communications ◽

10.1038/s41467-021-23795-5 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Weiwei Gu ◽

Aditya Tandon ◽

Yong-Yeol Ahn ◽

Filippo Radicchi

Keyword(s):

Real World ◽

Structural Information ◽

General Purpose ◽

Embedding Dimension ◽

Network Embedding ◽

Machine Learning Technique ◽

Learning Technique ◽

Low Dimensional ◽

Large Corpus ◽

Selection Of

Network embedding is a general-purpose machine learning technique that encodes network structure in vector spaces with tunable dimension. Choosing an appropriate embedding dimension – small enough to be efficient and large enough to be effective – is challenging but necessary to generate embeddings applicable to a multitude of tasks. Existing strategies for the selection of the embedding dimension rely on performance maximization in downstream tasks. Here, we propose a principled method such that all structural information of a network is parsimoniously encoded. The method is validated on various embedding algorithms and a large corpus of real-world networks. The embedding dimension selected by our method in real-world networks suggests that efficient encoding in low-dimensional spaces is usually possible.
