A Large-Scale Pseudoword-Based Evaluation Framework for State-of-the-Art Word Sense Disambiguation

2014 ◽  
Vol 40 (4) ◽  
pp. 837-881 ◽  
Author(s):  
Mohammad Taher Pilehvar ◽  
Roberto Navigli

The evaluation of several tasks in lexical semantics is often limited by the lack of large amounts of manual annotations, not only for training purposes, but also for testing purposes. Word Sense Disambiguation (WSD) is a case in point, as hand-labeled datasets are particularly hard and time-consuming to create. Consequently, evaluations tend to be performed on a small scale, which does not allow for in-depth analysis of the factors that determine a system's performance. In this paper we address this issue by means of a realistic simulation of large-scale evaluation for the WSD task. We do this by providing two main contributions: First, we put forward two novel approaches to the wide-coverage generation of semantically aware pseudowords (i.e., artificial words capable of modeling real polysemous words); second, we leverage the most suitable type of pseudoword to create large pseudosense-annotated corpora, which enable a large-scale experimental framework for the comparison of state-of-the-art supervised and knowledge-based algorithms. Using this framework, we study the impact of supervision and knowledge on the two major disambiguation paradigms and perform an in-depth analysis of the factors which affect their performance.
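To make the pseudoword idea concrete: a pseudoword conflates two or more unambiguous words into one artificial ambiguous word, and each occurrence keeps the word it came from as its gold "pseudosense", so a sense-annotated corpus falls out for free. The sketch below is a minimal illustration of this classic construction, not the paper's semantically aware variant; the function name and toy corpus are hypothetical.

```python
import re

def make_pseudoword_corpus(sentences, constituents, pseudoword="banana-door"):
    """Build a pseudosense-annotated corpus by conflating `constituents`
    (e.g., ["banana", "door"]) into one artificial ambiguous word.
    The constituent a token came from serves as its gold pseudosense."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, constituents)) + r")\b")
    tagged = []
    for sentence in sentences:
        match = pattern.search(sentence)
        if match:  # keep only sentences containing a constituent
            tagged.append((pattern.sub(pseudoword, sentence), match.group(1)))
    return tagged

corpus = ["The banana was ripe.", "Please close the door."]
for text, sense in make_pseudoword_corpus(corpus, ["banana", "door"]):
    print(sense, "->", text)
```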

Author(s):  
Pushpak Bhattacharyya ◽  
Mitesh Khapra

This chapter discusses the basic concepts of Word Sense Disambiguation (WSD) and the approaches to solving this problem. Both general-purpose WSD and domain-specific WSD are presented. The first part of the discussion focuses on existing approaches for WSD, including knowledge-based, supervised, semi-supervised, unsupervised, hybrid, and bilingual approaches. The accuracy of general-purpose WSD currently seems to be pegged at around 65%, which has motivated investigations into domain-specific WSD, the current trend in the field. In the latter part of the chapter, we present a greedy, neural-network-inspired algorithm for domain-specific WSD and compare its performance with other state-of-the-art algorithms for WSD. Our experiments suggest that for domain-specific WSD, simply selecting the most frequent sense of a word does as well as any state-of-the-art algorithm.
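The most-frequent-sense baseline referred to above is easy to reproduce; here is a minimal sketch using NLTK's WordNet interface, which lists a word's synsets in decreasing order of SemCor frequency (assumes the WordNet data has been downloaded).

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def most_frequent_sense(word, pos=None):
    """Return the first-listed (most frequent) WordNet synset for `word`.
    NLTK orders synsets by decreasing SemCor frequency, so synsets[0]
    is exactly the most-frequent-sense baseline."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

sense = most_frequent_sense("bank", pos=wn.NOUN)
print(sense, "-", sense.definition() if sense else "no sense found")
```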


Electronics ◽  
2021 ◽  
Vol 10 (23) ◽  
pp. 2938
Author(s):  
Minho Kim ◽  
Hyuk-Chul Kwon

Supervised disambiguation using a large amount of corpus data delivers better performance than other word sense disambiguation methods. However, constructing large-scale sense-tagged corpora is costly and time-consuming. Unsupervised disambiguation, on the other hand, is relatively easy to implement, although most such efforts have not produced satisfactory results. A primary reason for the performance degradation of unsupervised disambiguation is that the semantic occurrence probability of ambiguous words is not available, so a data deficiency problem arises when determining the dependency between words. This paper proposes an unsupervised disambiguation method that estimates prior probabilities from the Korean WordNet and performs better than supervised disambiguation. In the Korean WordNet, every word shares semantic characteristics with its related words, so we assume that the dependency between two words equals the dependency between their related words. This resolves the data deficiency problem: the dependency between words is determined by calculating the χ² statistic between their related words. Moreover, to obtain the same effect as the semantic occurrence probability used as a prior in supervised disambiguation, the semantically related words of an ambiguous word are collected and used as prior-probability data. An experiment was conducted on Korean, English, and Chinese to evaluate the performance of the proposed lexical disambiguation method. We found that it outperformed supervised disambiguation methods even though it is an unsupervised, knowledge-based approach.
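As an illustration of the kind of dependency score involved, here is a minimal sketch of Pearson's χ² statistic computed over a 2×2 co-occurrence contingency table for a pair of (related) words; the counts are invented for the example.

```python
def chi_square_2x2(n11, n12, n21, n22):
    """Pearson's chi-squared statistic for a 2x2 contingency table:
        n11: A and B co-occur        n12: A occurs without B
        n21: B occurs without A      n22: neither occurs
    Larger values indicate a stronger dependency between A and B."""
    n = n11 + n12 + n21 + n22
    num = n * (n11 * n22 - n12 * n21) ** 2
    den = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return num / den if den else 0.0

# Hypothetical counts: a word related to one sense of an ambiguous word
# versus a context word, tallied over a corpus.
print(round(chi_square_2x2(30, 70, 20, 880), 2))
```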


2014 ◽  
Vol 1049-1050 ◽  
pp. 1327-1338
Author(s):  
Guo Zhen Zhao ◽  
Wan Li Zuo

Word sense disambiguation, as a central research topic in natural language processing, can promote the development of many applications such as information retrieval, speech synthesis, machine translation, summarization, and question answering. Previous approaches can be grouped into three categories: supervised, unsupervised, and knowledge-based. Supervised methods achieve the highest accuracy, but they suffer from the knowledge acquisition bottleneck. Unsupervised methods avoid this bottleneck, but their results are not satisfactory. With the build-up of large-scale knowledge resources, knowledge-based approaches have attracted more and more attention. This paper introduces a new context weighting method and, based on it, proposes a novel semi-supervised approach for word sense disambiguation. The significant contribution of our method is that thesaurus and machine learning techniques are integrated in word sense disambiguation. Compared with the state of the art on the test data of the English all-words disambiguation task in Senseval-3, our method yields clear improvements over existing methods in noun, adjective, and verb disambiguation.
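The paper's exact weighting scheme isn't reproduced here, but a common distance-based variant illustrates the general idea: context words nearer the ambiguous target receive higher weights, and each candidate sense is scored by its weighted overlap with thesaurus words related to that sense. The function, the toy thesaurus, and the sense labels below are all hypothetical.

```python
def score_senses(context, target_idx, sense_related):
    """Pick the sense whose thesaurus-related words have the largest
    distance-weighted overlap with the context.
    context: list of tokens; target_idx: position of the ambiguous word;
    sense_related: {sense label: set of thesaurus-related words}."""
    scores = {}
    for sense, related in sense_related.items():
        scores[sense] = sum(
            1.0 / abs(i - target_idx)            # nearer words count more
            for i, word in enumerate(context)
            if i != target_idx and word in related)
    return max(scores, key=scores.get), scores

# Toy thesaurus for "bank" (hypothetical sense labels and related words).
thesaurus = {"bank/finance": {"money", "loan", "deposit"},
             "bank/river":   {"water", "shore", "fishing"}}
context = "deposit money at the bank by the water".split()
print(score_senses(context, context.index("bank"), thesaurus))
```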


2015 ◽  
Author(s):  
Rodrigo Goulart ◽  
Juliano De Carvalho ◽  
Vera De Lima

Word Sense Disambiguation (WSD) is an important task for biomedical text mining. Supervised WSD methods achieve the best results, but they are complex and their testing cost is high. This work presents an experiment on WSD using graph-based (unsupervised) approaches. Three algorithms were tested and compared to the state of the art. Results indicate that similar performance can be reached with different levels of complexity, which may point to a new approach to this problem.
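Graph-based WSD of the kind evaluated here typically builds a graph over the candidate senses of the words in a text, connected by relations from a knowledge base, and ranks senses with a centrality measure such as PageRank. Below is a minimal sketch with networkx; the toy sense inventory and edges are invented for illustration.

```python
import networkx as nx

# Toy sense graph: nodes are candidate senses of words in a sentence;
# edges stand in for semantic relations from a knowledge base.
G = nx.Graph()
G.add_edges_from([
    ("cold/illness", "treatment/medical"),
    ("treatment/medical", "patient/person"),
    ("cold/illness", "patient/person"),
    ("cold/temperature", "weather/state"),
])

# Rank all senses jointly; for each ambiguous word, keep its
# highest-ranked candidate sense.
rank = nx.pagerank(G)
candidates = ["cold/illness", "cold/temperature"]
best = max(candidates, key=rank.get)
print(best, round(rank[best], 3))
```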


2019 ◽  
Vol 26 (5) ◽  
pp. 438-446 ◽  
Author(s):  
Ahmad Pesaranghader ◽  
Stan Matwin ◽  
Marina Sokolova ◽  
Ali Pesaranghader

Abstract
Objective In biomedicine, there is a wealth of information hidden in unstructured narratives such as research articles and clinical reports. To exploit these data properly, a word sense disambiguation (WSD) algorithm prevents downstream difficulties in the natural language processing applications pipeline. Supervised WSD algorithms largely outperform un- or semisupervised and knowledge-based methods; however, they train 1 separate classifier for each ambiguous term, necessitating large amounts of expert-labeled training data, an unattainable goal in medical informatics. To alleviate this need, a single model that shares statistical strength across all instances and scales well with the vocabulary size is desirable.
Materials and Methods Built on recent advances in deep learning, our deepBioWSD model leverages 1 single bidirectional long short-term memory network that makes sense prediction for any ambiguous term. In the model, first, the Unified Medical Language System sense embeddings are computed from their text definitions; then, after initializing the network with these embeddings, it is trained on all (available) training data collectively. The method also includes a novel technique for automatic collection of training data from PubMed to (pre)train the network in an unsupervised manner.
Results We use the MSH WSD dataset to compare WSD algorithms, with macro and micro accuracies employed as evaluation metrics. deepBioWSD outperforms existing models in biomedical text WSD, achieving state-of-the-art performance of 96.82% macro accuracy.
Conclusions Apart from the disambiguation improvement and unsupervised training, deepBioWSD requires considerably fewer expert-labeled data because it learns the target and context terms jointly. These merits make deepBioWSD conveniently deployable in real-time biomedical applications.
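The core idea, one BiLSTM shared across all ambiguous terms whose encoded context is matched against precomputed sense embeddings, can be sketched roughly as below. This is not the authors' implementation: the dimensions, the mean-pooling choice, and the random placeholder "sense embeddings" (standing in for the UMLS definition-derived ones) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBiLSTMWSD(nn.Module):
    """One BiLSTM shared by every ambiguous term: it encodes the context,
    and the predicted sense is the one whose (definition-derived)
    embedding is most similar to the encoded context."""

    def __init__(self, vocab_size, emb_dim=100, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.project = nn.Linear(2 * hidden, emb_dim)  # into sense space

    def forward(self, token_ids, sense_embeddings):
        # token_ids: (batch, seq_len); sense_embeddings: (n_senses, emb_dim)
        out, _ = self.lstm(self.embed(token_ids))
        context = self.project(out.mean(dim=1))        # (batch, emb_dim)
        return F.cosine_similarity(context.unsqueeze(1),
                                   sense_embeddings.unsqueeze(0), dim=-1)

model = SharedBiLSTMWSD(vocab_size=5000)
tokens = torch.randint(0, 5000, (1, 12))     # one 12-token context window
senses = torch.randn(3, 100)                 # 3 placeholder sense embeddings
print(model(tokens, senses).argmax(dim=-1))  # index of the predicted sense
```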


2017 ◽  
Vol 43 (3) ◽  
pp. 593-617 ◽  
Author(s):  
Sascha Rothe ◽  
Hinrich Schütze

We present AutoExtend, a system that combines word embeddings with semantic resources by learning embeddings for non-word objects like synsets and entities, and by learning word embeddings that incorporate the semantic information from the resource. The method is based on encoding and decoding the word embeddings and is flexible in that it can take any word embeddings as input and does not need an additional training corpus. The obtained embeddings live in the same vector space as the input word embeddings. A sparse tensor formalization guarantees efficiency and parallelizability. We use WordNet, GermaNet, and Freebase as semantic resources. AutoExtend achieves state-of-the-art performance on Word-in-Context Similarity and Word Sense Disambiguation tasks.
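To give the flavor of what AutoExtend produces, synset embeddings living in the same space as the word embeddings, here is a deliberately crude stand-in that averages the vectors of a synset's member lemmas. AutoExtend instead learns the word-to-lexeme-to-synset decomposition with an autoencoder under constraints; the names and toy data below are hypothetical.

```python
import numpy as np

def naive_synset_embeddings(word_vecs, synset_lemmas):
    """Toy stand-in for AutoExtend's output: embed each synset in the
    same vector space as the input word embeddings by averaging the
    vectors of its member lemmas (zeroth-order intuition only)."""
    return {syn: np.mean([word_vecs[w] for w in lemmas if w in word_vecs],
                         axis=0)
            for syn, lemmas in synset_lemmas.items()}

rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=50) for w in ["suit", "lawsuit", "case"]}
synsets = {"suit.n.01(clothing)": ["suit"],
           "lawsuit.n.01(legal)": ["suit", "lawsuit", "case"]}
for syn, vec in naive_synset_embeddings(word_vecs, synsets).items():
    print(syn, vec.shape)
```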

