Syntactic Structure Distillation Pretraining for Bidirectional Encoders

Textual representation learners trained on large amounts of data have achieved notable success on downstream tasks; intriguingly, they have also performed well on challenging tests of syntactic competence. Hence, it remains an open question whether scalable learners like BERT can become fully proficient in the syntax of natural language by virtue of data scale alone, or whether they still benefit from more explicit syntactic biases. To answer this question, we introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining, by distilling the syntactically informative predictions of a hierarchical—albeit harder to scale—syntactic language model. Since BERT models masked words in bidirectional context, we propose to distill the approximate marginal distribution over words in context from the syntactic LM. Our approach reduces relative error by 2–21% on a diverse set of structured prediction tasks, although we obtain mixed results on the GLUE benchmark. Our findings demonstrate the benefits of syntactic biases, even for representation learners that exploit large amounts of data, and contribute to a better understanding of where syntactic biases are helpful in benchmarks of natural language understanding.

Download Full-text

Semantics-Aware BERT for Language Understanding

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6510 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9628-9635

Author(s):

Zhuosheng Zhang ◽

Yuwei Wu ◽

Hai Zhao ◽

Zuchao Li ◽

Shuailiang Zhang ◽

...

Keyword(s):

Reading Comprehension ◽

Natural Language ◽

Language Model ◽

Fine Tuning ◽

Semantic Role Labeling ◽

Language Understanding ◽

Context Sensitive ◽

Language Representation ◽

Model Training ◽

Machine Reading

The latest work on language representations carefully integrates contextualized features into language model training, which enables a series of success especially in various machine reading comprehension and natural language inference tasks. However, the existing language representation models including ELMo, GPT and BERT only exploit plain context-sensitive features such as character or word embeddings. They rarely consider incorporating structured semantic information which can provide rich semantics for language representation. To promote natural language understanding, we propose to incorporate explicit contextual semantics from pre-trained semantic role labeling, and introduce an improved language representation model, Semantics-aware BERT (SemBERT), which is capable of explicitly absorbing contextual semantics over a BERT backbone. SemBERT keeps the convenient usability of its BERT precursor in a light fine-tuning way without substantial task-specific modifications. Compared with BERT, semantics-aware BERT is as simple in concept but more powerful. It obtains new state-of-the-art or substantially improves results on ten reading comprehension and language inference tasks.

Download Full-text

How to Select One Among All ? An Empirical Study Towards the Robustness of Knowledge Distillation in Natural Language Understanding

10.18653/v1/2021.findings-emnlp.65 ◽

2021 ◽

Author(s):

Tianda Li ◽

Ahmad Rashid ◽

Aref Jafari ◽

Pranav Sharma ◽

Ali Ghodsi ◽

...

Keyword(s):

Empirical Study ◽

Natural Language ◽

Natural Language Understanding ◽

Language Understanding ◽

Knowledge Distillation

Download Full-text

Where do Clinical Language Models Break Down? A Critical Behavioural Exploration of the ClinicalBERT Deep Transformer Model

Journal of Computational Vision and Imaging Systems ◽

10.15353/jcvis.v6i1.3548 ◽

2021 ◽

Vol 6 (1) ◽

pp. 1-4

Author(s):

Alexander MacLean ◽

Alexander Wong

Keyword(s):

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Language Model ◽

Language Models ◽

Clinical Knowledge ◽

Language Understanding ◽

Improved Performance ◽

Transformer Model ◽

Clinical Domain

The introduction of Bidirectional Encoder Representations from Transformers (BERT) was a major breakthrough for transfer learning in natural language processing, enabling state-of-the-art performance across a large variety of complex language understanding tasks. In the realm of clinical language modeling, the advent of BERT led to the creation of ClinicalBERT, a state-of-the-art deep transformer model pretrained on a wealth of patient clinical notes to facilitate for downstream predictive tasks in the clinical domain. While ClinicalBERT has been widely leveraged by the research community as the foundation for building clinical domain-specific predictive models given its overall improved performance in the Medical Natural Language inference (MedNLI) challenge compared to the seminal BERT model, the fine-grained behaviour and intricacies of this popular clinical language model has not been well-studied. Without this deeper understanding, it is very challenging to understand where ClinicalBERT does well given its additional exposure to clinical knowledge, where it doesn't, and where it can be improved in a meaningful manner. Motivated to garner a deeper understanding, this study presents a critical behaviour exploration of the ClinicalBERT deep transformer model using MedNLI challenge dataset to better understanding the following intricacies: 1) decision-making similarities between ClinicalBERT and BERT (leverage a new metric we introduce called Model Alignment), 2) where ClinicalBERT holds advantages over BERT given its clinical knowledge exposure, and 3) where ClinicalBERT struggles when compared to BERT. The insights gained about the behaviour of ClinicalBERT will help guide towards new directions for designing and training clinical language models in a way that not only addresses the remaining gaps and facilitates for further improvements in clinical language understanding performance, but also highlights the limitation and boundaries of use for such models.

Download Full-text

K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce

10.18653/v1/2021.findings-emnlp.1 ◽

2021 ◽

Author(s):

Song Xu ◽

Haoran Li ◽

Peng Yuan ◽

Yujia Wang ◽

Youzheng Wu ◽

...

Keyword(s):

Natural Language ◽

Language Model ◽

Natural Language Understanding ◽

Language Understanding

Download Full-text

Knowledge Distillation with Noisy Labels for Natural Language Understanding

10.18653/v1/2021.wnut-1.33 ◽

2021 ◽

Author(s):

Shivendra Bhardwaj ◽

Abbas Ghaddar ◽

Ahmad Rashid ◽

Khalil Bibi ◽

Chengyang Li ◽

...

Keyword(s):

Natural Language ◽

Natural Language Understanding ◽

Language Understanding ◽

Knowledge Distillation ◽

Noisy Labels

Download Full-text

Language Model is all You Need: Natural Language Understanding as Question Answering

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp39728.2021.9413810 ◽

2021 ◽

Author(s):

Mahdi Namazifar ◽

Alexandros Papangelis ◽

Gokhan Tur ◽

Dilek Hakkani-Tur

Keyword(s):

Natural Language ◽

Question Answering ◽

Language Model ◽

Natural Language Understanding ◽

Language Understanding

Download Full-text

Scalable Attentive Sentence Pair Modeling via Distilled Sentence Embedding

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5722 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3235-3242 ◽

Cited By ~ 2

Author(s):

Oren Barkan ◽

Noam Razin ◽

Itzik Malkiel ◽

Ori Katz ◽

Avi Caciularu ◽

...

Keyword(s):

Natural Language ◽

State Of The Art ◽

Sentence Pair ◽

Language Understanding ◽

Universal Sentence ◽

Multiple Cross ◽

Knowledge Distillation ◽

Vector Mapping ◽

Embedding Methods ◽

Teacher Model

Recent state-of-the-art natural language understanding models, such as BERT and XLNet, score a pair of sentences (A and B) using multiple cross-attention operations – a process in which each word in sentence A attends to all words in sentence B and vice versa. As a result, computing the similarity between a query sentence and a set of candidate sentences, requires the propagation of all query-candidate sentence-pairs throughout a stack of cross-attention layers. This exhaustive process becomes computationally prohibitive when the number of candidate sentences is large. In contrast, sentence embedding techniques learn a sentence-to-vector mapping and compute the similarity between the sentence vectors via simple elementary operations. In this paper, we introduce Distilled Sentence Embedding (DSE) – a model that is based on knowledge distillation from cross-attentive models, focusing on sentence-pair tasks. The outline of DSE is as follows: Given a cross-attentive teacher model (e.g. a fine-tuned BERT), we train a sentence embedding based student model to reconstruct the sentence-pair scores obtained by the teacher model. We empirically demonstrate the effectiveness of DSE on five GLUE sentence-pair tasks. DSE significantly outperforms several ELMO variants and other sentence embedding methods, while accelerating computation of the query-candidate sentence-pairs similarities by several orders of magnitude, with an average relative degradation of 4.6% compared to BERT. Furthermore, we show that DSE produces sentence embeddings that reach state-of-the-art performance on universal sentence representation benchmarks. Our code is made publicly available at https://github.com/microsoft/Distilled-Sentence-Embedding.

Download Full-text

Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00240 ◽

2018 ◽

Vol 6 ◽

pp. 605-617 ◽

Cited By ~ 1

Author(s):

Kellie Webster ◽

Marta Recasens ◽

Vera Axelrod ◽

Jason Baldridge

Keyword(s):

Natural Language ◽

Gender Bias ◽

Real World ◽

Syntactic Structure ◽

Natural Language Understanding ◽

Important Task ◽

Coreference Resolution ◽

Neural Models ◽

Language Understanding ◽

Practical Utility

Coreference resolution is an important task for natural language understanding, and the resolution of ambiguous pronouns a longstanding challenge. Nonetheless, existing corpora do not capture ambiguous pronouns in sufficient volume or diversity to accurately indicate the practical utility of models. Furthermore, we find gender bias in existing corpora and systems favoring masculine entities. To address this, we present and release GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun–name pairs sampled to provide diverse coverage of challenges posed by real-world text. We explore a range of baselines that demonstrate the complexity of the challenge, the best achieving just 66.9% F1. We show that syntactic structure and continuous neural models provide promising, complementary cues for approaching the challenge.

Download Full-text

From NLP to Natural Language Understanding for medical decision making

PsycEXTRA Dataset ◽

10.1037/e615572012-006 ◽

2012 ◽

Author(s):

Chitta Baral

Keyword(s):

Decision Making ◽

Natural Language ◽

Medical Decision Making ◽

Natural Language Understanding ◽

Medical Decision ◽

Language Understanding

Download Full-text

The Distinction between Linguistic and Conceptual Semantics in Medical Terminology and its Implication for NLP-Based Knowledge Acquisition

Methods of Information in Medicine ◽

10.1055/s-0038-1634568 ◽

1998 ◽

Vol 37 (04/05) ◽

pp. 327-333 ◽

Cited By ~ 3

Author(s):

F. Buekens ◽

G. De Moor ◽

A. Waagmeester ◽

W. Ceusters

Keyword(s):

Natural Language ◽

Knowledge Acquisition ◽

Natural Language Understanding ◽

Knowledge Bases ◽

Linguistic Knowledge ◽

Medical Terminology ◽

Language Understanding ◽

Conceptual Semantics

AbstractNatural language understanding systems have to exploit various kinds of knowledge in order to represent the meaning behind texts. Getting this knowledge in place is often such a huge enterprise that it is tempting to look for systems that can discover such knowledge automatically. We describe how the distinction between conceptual and linguistic semantics may assist in reaching this objective, provided that distinguishing between them is not done too rigorously. We present several examples to support this view and argue that in a multilingual environment, linguistic ontologies should be designed as interfaces between domain conceptualizations and linguistic knowledge bases.

Download Full-text