Analysis and Evaluation of Language Models for Word Sense Disambiguation

Transformer-based language models have taken many fields in NLP by storm. BERT and its derivatives dominate most of the existing evaluation benchmarks, including those for Word Sense Disambiguation (WSD), thanks to their ability to capture context-sensitive semantic nuances. However, little is yet known about their capabilities and potential limitations in encoding and recovering word senses. In this article, we provide an in-depth quantitative and qualitative analysis of the celebrated BERT model with respect to lexical ambiguity. One of the main conclusions of our analysis is that BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense. Our analysis also reveals that in some cases language models come close to solving coarse-grained noun disambiguation under ideal conditions in terms of availability of training data and computing resources. However, this scenario rarely occurs in real-world settings and, hence, many practical challenges remain even in the coarse-grained setting. We also perform an in-depth comparison of the two main language-model-based WSD strategies, i.e., fine-tuning and feature extraction, finding that the latter approach is more robust with respect to sense bias and can better exploit limited available training data. In fact, the simple feature extraction strategy of averaging contextualized embeddings proves robust even using only three training sentences per word sense, with minimal improvements obtained by increasing the size of this training data.
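The feature extraction strategy mentioned above (averaging contextualized embeddings per sense, then picking the closest sense for a new occurrence) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors are toy stand-ins for real BERT embeddings, the sense labels are made up, and the use of cosine similarity with a nearest-centroid rule is an assumption.

```python
import math
from collections import defaultdict

def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_sense_centroids(examples):
    """examples: iterable of (sense_label, embedding) pairs."""
    by_sense = defaultdict(list)
    for sense, vec in examples:
        by_sense[sense].append(vec)
    return {sense: mean_vector(vecs) for sense, vecs in by_sense.items()}

def disambiguate(embedding, centroids):
    """Return the sense whose averaged embedding is most similar."""
    return max(centroids, key=lambda s: cosine(embedding, centroids[s]))

# Toy stand-ins for contextualized embeddings of "bank" (hypothetical values),
# three training sentences per sense.
train = [
    ("bank%finance", [0.9, 0.1, 0.0]),
    ("bank%finance", [0.8, 0.2, 0.1]),
    ("bank%finance", [1.0, 0.0, 0.1]),
    ("bank%river", [0.1, 0.9, 0.2]),
    ("bank%river", [0.0, 1.0, 0.1]),
    ("bank%river", [0.2, 0.8, 0.0]),
]
centroids = build_sense_centroids(train)
prediction = disambiguate([0.7, 0.3, 0.1], centroids)
print(prediction)
```

In a real setting the vectors would come from a pretrained encoder's hidden states for the target word; nothing else in the sketch would need to change.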


The Noisy Channel Model for Unsupervised Word Sense Disambiguation

Computational Linguistics ◽

10.1162/coli.2010.36.1.36103 ◽

2010 ◽

Vol 36 (1) ◽

pp. 111-127 ◽

Cited By ~ 10

Author(s):

Deniz Yuret ◽

Mehmet Ali Yatbaz

Keyword(s):

Channel Model ◽

Word Sense Disambiguation ◽

Language Model ◽

Ambiguous Word ◽

Coarse Grained ◽

Word Sense ◽

Noisy Channel ◽

Intended Meaning ◽

Fine Grained ◽

Sense Disambiguation

We introduce a generative probabilistic model, the noisy channel model, for unsupervised word sense disambiguation. In our model, each context C is modeled as a distinct channel through which the speaker intends to transmit a particular meaning S using a possibly ambiguous word W. To reconstruct the intended meaning, the hearer uses the distribution of possible meanings in the given context, P(S|C), and the possible words that can express each meaning, P(W|S). We assume P(W|S) is independent of the context and estimate it using WordNet sense frequencies. The main problem of unsupervised WSD is estimating the context-dependent P(S|C) without access to any sense-tagged text. We show one way to solve this problem using a statistical language model based on large amounts of untagged text. Our model uses coarse-grained semantic classes for S internally, and we explore the effect of using different levels of granularity on WSD performance. The system outputs fine-grained senses for evaluation, and its performance on noun disambiguation is better than most previously reported unsupervised systems and close to the best supervised systems.
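The decision rule implied by this description can be illustrated with a short sketch: each candidate meaning S is scored by P(S|C) · P(W|S), and the highest-scoring meaning is chosen. The probability tables below are invented toy values, and the function and variable names are hypothetical; the paper's actual estimation of P(S|C) from a language model is not reproduced here.

```python
def noisy_channel_sense(word, p_sense_given_context, p_word_given_sense):
    """Score each sense S by P(S|C) * P(W|S) and return (best_sense, posterior).

    p_sense_given_context: dict mapping sense -> P(S|C) for one fixed context C
    p_word_given_sense:    dict mapping sense -> {word: P(W|S)}
    """
    scores = {}
    for sense, p_s in p_sense_given_context.items():
        p_w = p_word_given_sense.get(sense, {}).get(word, 0.0)
        scores[sense] = p_s * p_w
    total = sum(scores.values())
    if total == 0:
        # No sense can generate this word under the given tables.
        return None, scores
    posterior = {s: v / total for s, v in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior

# Toy numbers (assumptions, not taken from the paper).
p_s_given_c = {"institution": 0.7, "landform": 0.3}
p_w_given_s = {
    "institution": {"bank": 0.2, "school": 0.8},
    "landform": {"bank": 0.1, "hill": 0.9},
}
best, posterior = noisy_channel_sense("bank", p_s_given_c, p_w_given_s)
print(best, posterior)
```

Normalizing the scores is optional for picking the best sense, but it makes the output readable as a distribution over meanings for the observed word in that context.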


Word vs. Class-Based Word Sense Disambiguation

Journal of Artificial Intelligence Research ◽

10.1613/jair.4727 ◽

2015 ◽

Vol 54 ◽

pp. 83-122 ◽

Cited By ~ 4

Author(s):

Ruben Izquierdo ◽

Armando Suarez ◽

German Rigau

Keyword(s):

Word Sense Disambiguation ◽

Coarse Grained ◽

Semantic Features ◽

Word Sense ◽

Simple Method ◽

Word Meanings ◽

Semantic Class ◽

Semantic Classes ◽

Sense Disambiguation ◽

Word Senses

As empirically demonstrated by the Word Sense Disambiguation (WSD) tasks of the last SensEval/SemEval exercises, assigning the appropriate meaning to words in context has resisted all attempts at a successful solution. Many authors argue that one possible reason could be the use of inappropriate sets of word meanings. In particular, WordNet has been used as a de facto standard repository of word meanings in most of these tasks. Thus, instead of using the word senses defined in WordNet, some approaches have derived semantic classes representing groups of word senses. However, the meanings represented by WordNet have only been used for WSD at a very fine-grained sense level or at a very coarse-grained semantic class level (also called SuperSenses). We suspect that an appropriate level of abstraction could lie between these two levels. The contributions of this paper are manifold. First, we propose a simple method to automatically derive semantic classes at intermediate levels of abstraction covering all nominal and verbal WordNet meanings. Second, we empirically demonstrate that our automatically derived semantic classes outperform classical approaches based on word senses and more coarse-grained sense groupings. Third, we demonstrate that our supervised WSD system benefits from using these new semantic classes as additional semantic features while reducing the number of training examples. Finally, we demonstrate the robustness of our supervised semantic class-based WSD system when tested on an out-of-domain corpus.


deepBioWSD: effective deep neural word sense disambiguation of biomedical text data

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocy189 ◽

2019 ◽

Vol 26 (5) ◽

pp. 438-446 ◽

Cited By ~ 3

Author(s):

Ahmad Pesaranghader ◽

Stan Matwin ◽

Marina Sokolova ◽

Ali Pesaranghader

Keyword(s):

Language Processing ◽

Short Term Memory ◽

Word Sense Disambiguation ◽

Training Data ◽

Biomedical Text ◽

Word Sense ◽

Vocabulary Size ◽

Unified Medical Language System ◽

Knowledge Based ◽

Sense Disambiguation

Objective: In biomedicine, there is a wealth of information hidden in unstructured narratives such as research articles and clinical reports. To exploit these data properly, a word sense disambiguation (WSD) algorithm prevents downstream difficulties in the natural language processing applications pipeline. Supervised WSD algorithms largely outperform un- or semisupervised and knowledge-based methods; however, they train one separate classifier for each ambiguous term, necessitating a large amount of expert-labeled training data, an unattainable goal in medical informatics. To alleviate this need, a single model that shares statistical strength across all instances and scales well with the vocabulary size is desirable. Materials and Methods: Built on recent advances in deep learning, our deepBioWSD model leverages a single bidirectional long short-term memory network that makes sense predictions for any ambiguous term. In the model, the Unified Medical Language System sense embeddings are first computed using their text definitions; then, after the network is initialized with these embeddings, it is trained on all (available) training data collectively. This method also considers a novel technique for automatic collection of training data from PubMed to (pre)train the network in an unsupervised manner. Results: We use the MSH WSD dataset to compare WSD algorithms, with macro and micro accuracies employed as evaluation metrics. deepBioWSD outperforms existing models in biomedical text WSD by achieving the state-of-the-art performance of 96.82% for macro accuracy. Conclusions: Apart from the disambiguation improvement and unsupervised training, deepBioWSD depends on considerably less expert-labeled data, as it learns the target and the context terms jointly. These merits make deepBioWSD conveniently deployable in real-time biomedical applications.
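The difference between the two evaluation metrics named in the Results can be shown with a small sketch. It assumes the common convention that micro accuracy pools all test instances, while macro accuracy averages the per-term accuracies so that every ambiguous term counts equally; the abstract itself does not spell out these definitions, and the records below are invented.

```python
from collections import defaultdict

def micro_macro_accuracy(records):
    """records: iterable of (ambiguous_term, gold_sense, predicted_sense).

    Returns (micro, macro): micro pools all instances; macro averages the
    accuracy computed separately for each ambiguous term.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for term, gold, pred in records:
        total[term] += 1
        correct[term] += int(gold == pred)
    if not total:
        raise ValueError("no records given")
    micro = sum(correct.values()) / sum(total.values())
    macro = sum(correct[t] / total[t] for t in total) / len(total)
    return micro, macro

# Invented predictions for two ambiguous terms.
records = [
    ("cold", "temperature", "temperature"),
    ("cold", "illness", "illness"),
    ("cold", "illness", "temperature"),
    ("cold", "temperature", "temperature"),
    ("BAT", "animal", "animal"),
    ("BAT", "tissue", "animal"),
]
micro, macro = micro_macro_accuracy(records)
print(f"micro={micro:.4f} macro={macro:.4f}")
```

The two numbers diverge whenever terms have different numbers of test instances, which is why both are usually reported together.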


The role of domain information in Word Sense Disambiguation

Natural Language Engineering ◽

10.1017/s1351324902003029 ◽

2002 ◽

Vol 8 (4) ◽

pp. 359-373 ◽

Cited By ~ 47

Author(s):

Bernardo Magnini ◽

Carlo Strapparava ◽

Giovanni Pezzulo ◽

Alfio Gliozzo

Keyword(s):

Word Sense Disambiguation ◽

Semantic Relations ◽

Word Sense ◽

Sense Disambiguation ◽

High Level ◽

Word Senses ◽

Very High ◽

Domain Information

This paper explores the role of domain information in word sense disambiguation. The underlying hypothesis is that domain labels, such as MEDICINE, ARCHITECTURE and SPORT, provide a useful way to establish semantic relations among word senses, which can be profitably used during the disambiguation process. Results obtained at the SENSEVAL-2 initiative confirm that for a significant subset of words domain information can be used to disambiguate with a very high level of precision.


Evaluating sense disambiguation across diverse parameter spaces

Natural Language Engineering ◽

10.1017/s135132490200298x ◽

2002 ◽

Vol 8 (4) ◽

pp. 293-310 ◽

Cited By ~ 45

Author(s):

David Yarowsky ◽

Radu Florian

Keyword(s):

Word Sense Disambiguation ◽

Model Performance ◽

Training Data ◽

Target Language ◽

Word Sense ◽

Parameter Spaces ◽

Diverse Range ◽

Part Of Speech ◽

Sense Disambiguation ◽

Training Examples

This paper presents a comprehensive empirical exploration and evaluation of a diverse range of data characteristics which influence word sense disambiguation performance. It focuses on a set of six core supervised algorithms, including three variants of Bayesian classifiers, a cosine model, non-hierarchical decision lists, and an extension of the transformation-based learning model. Performance is investigated in detail with respect to the following parameters: (a) target language (English, Spanish, Swedish and Basque); (b) part of speech; (c) sense granularity; (d) inclusion and exclusion of major feature classes; (e) variable context width (further broken down by part of speech of the keyword); (f) number of training examples; (g) baseline probability of the most likely sense; (h) sense distributional entropy; (i) number of senses per keyword; (j) divergence between training and test data; (k) degree of (artificially introduced) noise in the training data; (l) the effectiveness of an algorithm's confidence rankings; and (m) a full keyword breakdown of the performance of each algorithm. The paper concludes with a brief analysis of similarities, differences, strengths and weaknesses of the algorithms and a hierarchical clustering of these algorithms based on agreement of sense classification behavior. Collectively, the paper constitutes the most comprehensive survey of evaluation measures and tests yet applied to sense disambiguation algorithms. It does so over a diverse range of supervised algorithms, languages and parameter spaces in a single unified experimental framework.
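Three of the data characteristics listed above, namely (g) the baseline probability of the most likely sense, (h) sense distributional entropy, and (i) the number of senses per keyword, can be computed from a list of sense labels as in the sketch below. This is only an illustration under the assumption that entropy is measured in bits over the empirical sense distribution; the paper's exact definitions may differ, and the labels are invented.

```python
import math
from collections import Counter

def sense_statistics(sense_labels):
    """Summarize the sense distribution of one keyword.

    Returns the number of distinct senses, the relative frequency of the
    most frequent sense (a common baseline), and the Shannon entropy in bits.
    """
    counts = Counter(sense_labels)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no sense labels given")
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return {
        "num_senses": len(counts),
        "mfs_baseline": max(probs),
        "entropy_bits": entropy,
    }

# Invented annotations for one keyword: senses a, b, c with counts 2, 1, 1.
stats = sense_statistics(["a", "a", "b", "c"])
print(stats)
```

A keyword with a skewed distribution has a high most-frequent-sense baseline and low entropy, so raw accuracy on it is easier to achieve than on a keyword whose senses are evenly spread.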


Exemplification Modeling: Can You Give Me an Example, Please?

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/520 ◽

2021 ◽

Author(s):

Edoardo Barba ◽

Luigi Procopio ◽

Caterina Lacerra ◽

Tommaso Pasini ◽

Roberto Navigli

Keyword(s):

Gold Standard ◽

State Of The Art ◽

Word Sense Disambiguation ◽

Full Range ◽

Training Data ◽

Training Procedure ◽

Word Sense ◽

The Novel ◽

Current State ◽

Sense Disambiguation

Recently, generative approaches have been used effectively to provide definitions of words in their context. However, the opposite, i.e., generating a usage example given one or more words along with their definitions, has not yet been investigated. In this work, we introduce the novel task of Exemplification Modeling (ExMod), along with a sequence-to-sequence architecture and a training procedure for it. Starting from a set of (word, definition) pairs, our approach is capable of automatically generating high-quality sentences which express the requested semantics. As a result, we can drive the creation of sense-tagged data which cover the full range of meanings in any inventory of interest, and their interactions within sentences. Human annotators agree that the sentences generated are as fluent and semantically-coherent with the input definitions as the sentences in manually-annotated corpora. Indeed, when employed as training data for Word Sense Disambiguation, our examples enable the current state of the art to be outperformed, and higher results to be achieved than when using gold-standard datasets only. We release the pretrained model, the dataset and the software at https://github.com/SapienzaNLP/exmod.


Coarse-Grained +/-Effect Word Sense Disambiguation for Implicit Sentiment Analysis

IEEE Transactions on Affective Computing ◽

10.1109/taffc.2017.2734085 ◽

2017 ◽

Vol 8 (4) ◽

pp. 471-479 ◽

Cited By ~ 2

Author(s):

Yoonjung Choi ◽

Janyce Wiebe ◽

Rada Mihalcea

Keyword(s):

Sentiment Analysis ◽

Word Sense Disambiguation ◽

Coarse Grained ◽

Word Sense ◽

Sense Disambiguation


Selecting Training Data for Unsupervised Domain Adaptation in Word Sense Disambiguation

PRICAI 2016: Trends in Artificial Intelligence - Lecture Notes in Computer Science ◽

10.1007/978-3-319-42911-3_18 ◽

2016 ◽

pp. 220-232

Author(s):

Kanako Komiya ◽

Minoru Sasaki ◽

Hiroyuki Shinnou ◽

Yoshiyuki Kotani ◽

Manabu Okumura

Keyword(s):

Domain Adaptation ◽

Word Sense Disambiguation ◽

Training Data ◽

Word Sense ◽

Unsupervised Domain Adaptation ◽

Sense Disambiguation


Chinese word sense disambiguation by combining pseudo training data

International Conference on Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 ◽

10.1109/nlpke.2003.1275884 ◽

2004 ◽

Cited By ~ 2

Author(s):

Xiaojie Wang ◽

Y. Matsumoto

Keyword(s):

Word Sense Disambiguation ◽

Training Data ◽

Word Sense ◽

Chinese Word ◽

Sense Disambiguation


SENSE: an analogy-based Word Sense Disambiguation system

Natural Language Engineering ◽

10.1017/s135132499900217x ◽

1999 ◽

Vol 5 (2) ◽

pp. 207-218 ◽

Cited By ~ 3

Author(s):

Stefano Federici ◽

Simonetta Montemagni ◽

Vito Pirrelli

Keyword(s):

Word Sense Disambiguation ◽

Training Data ◽

Word Sense ◽

Data Sparseness ◽

Sense Disambiguation ◽

Conservative Bias

The paper describes SENSE, a word sense disambiguation system which makes use of multidimensional analogy-based proportions to infer the most likely sense of a word given its context. The architecture and functioning of the system are illustrated in detail. Results of different experimental settings are given, showing that the system, in spite of its conservative bias, successfully copes with the problem of training data sparseness.
