Translational equivalence in Statistical Machine Translation or meaning as co-occurrence

Author(s):  
Lieve Macken ◽  
Els Lefever

In this paper, we describe the current state of the art in Statistical Machine Translation (SMT) and reflect on how SMT handles meaning. Statistical Machine Translation is a corpus-based approach to MT: it derives the knowledge required to generate new translations from corpora. General-purpose SMT systems do not use any formal semantic representation. Instead, they directly extract translationally equivalent words or word sequences – expressions with the same meaning – from bilingual parallel corpora. All statistical translation models are based on the idea of word alignment, i.e., the automatic linking of corresponding words in parallel texts. The first-generation SMT systems were word-based. From a linguistic point of view, the major problem with word-based systems is that the meaning of a word is often ambiguous and is determined by its context. Current state-of-the-art SMT systems try to capture local contextual dependencies by using phrases instead of words as units of translation. In order to solve more complex ambiguity problems (where a broader text scope or even domain information is needed), a Word Sense Disambiguation (WSD) module is integrated in the Machine Translation environment.
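
A rough Python sketch of the co-occurrence view of translational equivalence described above: it extracts phrase pairs consistent with a given word alignment and scores them by relative frequency, as in phrase-based SMT. The sentence pairs and alignments below are toy data, not the authors' system.

```python
from collections import Counter

def extract_phrases(src, tgt, alignment, max_len=3):
    """Extract phrase pairs consistent with the word alignment:
    no word inside the pair may be aligned to a word outside it."""
    pairs = []
    n = len(src)
    for i1 in range(n):
        for i2 in range(i1, min(n, i1 + max_len)):
            # target positions aligned to the source span [i1, i2]
            tps = [t for (s, t) in alignment if i1 <= s <= i2]
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            # reject if any target word in [j1, j2] links outside [i1, i2]
            if all(i1 <= s <= i2 for (s, t) in alignment if j1 <= t <= j2):
                pairs.append((" ".join(src[i1:i2 + 1]),
                              " ".join(tgt[j1:j2 + 1])))
    return pairs

# Toy parallel corpus with hand-made word alignments.
corpus = [
    ("het huis is klein".split(), "the house is small".split(),
     [(0, 0), (1, 1), (2, 2), (3, 3)]),
    ("het huis is groot".split(), "the house is big".split(),
     [(0, 0), (1, 1), (2, 2), (3, 3)]),
]

counts, src_counts = Counter(), Counter()
for src, tgt, align in corpus:
    for s_phrase, t_phrase in extract_phrases(src, tgt, align):
        counts[(s_phrase, t_phrase)] += 1
        src_counts[s_phrase] += 1

# Relative-frequency translation probabilities p(target | source).
for (s, t), c in sorted(counts.items()):
    print(f"p({t!r} | {s!r}) = {c / src_counts[s]:.2f}")
```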

Author(s):  
Pushpak Bhattacharyya ◽  
Mitesh Khapra

This chapter discusses the basic concepts of Word Sense Disambiguation (WSD) and the approaches to solving this problem. Both general-purpose WSD and domain-specific WSD are presented. The first part of the discussion focuses on existing approaches to WSD, including knowledge-based, supervised, semi-supervised, unsupervised, hybrid, and bilingual approaches. The accuracy of general-purpose WSD currently seems to be pegged at around 65%, which has motivated investigations into domain-specific WSD, the current trend in the field. In the latter part of the chapter, we present a greedy, neural-network-inspired algorithm for domain-specific WSD and compare its performance with other state-of-the-art WSD algorithms. Our experiments suggest that for domain-specific WSD, simply selecting the most frequent sense of a word does as well as any state-of-the-art algorithm.
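
The most-frequent-sense baseline mentioned above is easy to reproduce with NLTK's WordNet interface, which returns synsets in decreasing order of tagged frequency; this sketch is an illustration of the baseline, not the chapter's own algorithm.

```python
# pip install nltk; then: python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=None):
    """Return the most frequent WordNet sense of `word`.
    NLTK lists synsets by decreasing tagged frequency, so the
    first synset is exactly the most-frequent-sense baseline."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

sense = most_frequent_sense("bank", pos=wn.NOUN)
print(sense.name())        # bank.n.01
print(sense.definition())  # sloping land beside a body of water
```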


Author(s):  
Edoardo Barba ◽  
Luigi Procopio ◽  
Caterina Lacerra ◽  
Tommaso Pasini ◽  
Roberto Navigli

Recently, generative approaches have been used effectively to provide definitions of words in their context. However, the opposite task, i.e., generating a usage example given one or more words along with their definitions, has not yet been investigated. In this work, we introduce the novel task of Exemplification Modeling (ExMod), along with a sequence-to-sequence architecture and a training procedure for it. Starting from a set of (word, definition) pairs, our approach is capable of automatically generating high-quality sentences which express the requested semantics. As a result, we can drive the creation of sense-tagged data covering the full range of meanings in any inventory of interest, and their interactions within sentences. Human annotators agree that the generated sentences are as fluent and as semantically coherent with the input definitions as the sentences in manually annotated corpora. Indeed, when our examples are employed as training data for Word Sense Disambiguation, they enable the current state of the art to be outperformed, yielding higher results than gold-standard datasets alone. We release the pretrained model, the dataset and the software at https://github.com/SapienzaNLP/exmod.
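
In rough terms, inference with such a sequence-to-sequence exemplifier looks like the sketch below. The checkpoint path and the "word: definition" input serialization are assumptions for illustration; the authors' actual model and format are released at https://github.com/SapienzaNLP/exmod.

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "path/to/exmod-checkpoint"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input: a (word, definition) pair, serialized as plain text.
word, definition = "bank", "a financial institution that accepts deposits"
inputs = tokenizer(f"{word}: {definition}", return_tensors="pt")

# Generate a usage example expressing the requested sense.
output_ids = model.generate(**inputs, max_length=40, num_beams=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```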


2014 ◽  
Vol 101 (1) ◽  
pp. 29-41
Author(s):  
Aleš Tamchyna ◽  
Fabienne Braune ◽  
Alexander Fraser ◽  
Marine Carpuat ◽  
Hal Daumé III ◽ 
...  

Current state-of-the-art statistical machine translation (SMT) relies on simple feature functions which make independence assumptions at the level of phrases or hierarchical rules. However, it is well known that discriminative models can benefit from rich features extracted from the source-sentence context outside of the applied phrase or hierarchical rule, which is available at decoding time. We present a framework for the open-source decoder Moses that allows discriminative models over source context to be trained easily on a large number of examples and then included as feature functions in decoding.
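
The actual integration lives in Moses's feature-function interface; as a decoder-agnostic illustration of the idea (toy data, invented features), the sketch below trains a discriminative classifier whose score for a candidate phrase pair is conditioned on source words outside the applied phrase.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def context_features(source_words, span, target_phrase):
    """Features over source context *outside* the applied phrase,
    plus the candidate target phrase itself."""
    i, j = span  # phrase covers source_words[i:j]
    left = source_words[i - 1] if i > 0 else "<s>"
    right = source_words[j] if j < len(source_words) else "</s>"
    return {f"left={left}": 1, f"right={right}": 1,
            f"target={target_phrase}": 1}

# Toy examples: (source sentence, span, candidate translation, correct?)
data = [
    ("the river bank eroded".split(), (2, 3), "rive", 1),
    ("the river bank eroded".split(), (2, 3), "banque", 0),
    ("my bank account number".split(), (1, 2), "banque", 1),
    ("my bank account number".split(), (1, 2), "rive", 0),
]

vec = DictVectorizer()
X = vec.fit_transform(context_features(s, sp, t) for s, sp, t, _ in data)
y = [label for *_, label in data]
clf = LogisticRegression().fit(X, y)

# At decoding time, the classifier's probability would enter the
# decoder's log-linear model as one more feature function.
test = context_features("a steep river bank".split(), (3, 4), "rive")
print(clf.predict_proba(vec.transform([test]))[0, 1])
```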


2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Jae-Hoon Kim ◽  
Hong-Seok Kwon ◽  
Hyeong-Won Seo

A pivot-based approach to bilingual lexicon extraction relies on the similarity of context vectors represented by words in a pivot language such as English. In this paper, to show the validity and usability of the pivot-based approach, we evaluate it in combination with two different methods for estimating context vectors: one estimates them from two parallel corpora based on word association between source words (resp., target words) and pivot words, and the other estimates them from two parallel corpora based on word alignment tools for statistical machine translation. Empirical results on two language pairs (Korean-Spanish and Korean-French) show that the pivot-based approach is very promising for resource-poor languages, confirming its validity and usability. Furthermore, our method also performs well for words with low frequency.
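
A minimal sketch of the pivot-based comparison: each source and target word is represented by a context vector over English pivot words, and translation candidates are ranked by cosine similarity. The association scores below are made up for illustration; in the paper they are estimated from Korean-English and Spanish-English parallel corpora.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse context vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Context vectors over English pivot words (toy association scores).
korean = {"학교": {"school": 0.9, "teacher": 0.6, "student": 0.7}}
spanish = {
    "escuela": {"school": 0.8, "teacher": 0.5, "student": 0.6},
    "pescado": {"fish": 0.9, "sea": 0.4},
}

# Rank Spanish candidates for the Korean word '학교' ("school").
src_vec = korean["학교"]
ranked = sorted(spanish.items(),
                key=lambda kv: cosine(src_vec, kv[1]), reverse=True)
for word, vec in ranked:
    print(word, round(cosine(src_vec, vec), 3))
# 'escuela' ranks first: its pivot context overlaps with '학교'.
```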


2010 ◽  
Vol 36 (2) ◽  
pp. 247-277 ◽  
Author(s):  
Wei Wang ◽  
Jonathan May ◽  
Kevin Knight ◽  
Daniel Marcu

This article shows that the structure of bilingual material from standard parsing and alignment tools is not optimal for training syntax-based statistical machine translation (SMT) systems. We present three modifications to the MT training data to improve the accuracy of a state-of-the-art syntax MT system: re-structuring changes the syntactic structure of training parse trees to enable reuse of substructures; re-labeling alters bracket labels to enrich rule application context; and re-aligning unifies word alignment across sentences to remove bad word alignments and refine good ones. Better structures, labels, and word alignments are learned by the EM algorithm. We show that each individual technique leads to improvement as measured by BLEU, and we also show that the greatest improvement is achieved by combining them. We report an overall 1.48 BLEU improvement on the NIST08 evaluation set over a strong baseline in Chinese/English translation.
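
As a small illustration of the re-structuring idea only (the paper learns its transformations with EM; plain binarization is used here as a stand-in), breaking a flat bracket into nested binary brackets exposes substructures that syntax-based translation rules can match and reuse. NLTK's tree transforms show the effect.

```python
# pip install nltk
from nltk import Tree

# A flat NP yields one monolithic rule, so its substructures
# cannot be reused by a syntax-based translation system.
tree = Tree.fromstring("(NP (DT the) (JJ small) (JJ old) (NN house))")

# Binarization (one form of re-structuring) nests the flat bracket
# into binary brackets whose pieces can be matched independently.
binarized = tree.copy(deep=True)
binarized.chomsky_normal_form(factor="right")

tree.pretty_print()
binarized.pretty_print()
```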


Author(s):  
Santosh Kumar T.S.

Machine Translation has been an area of linguistic research for more than two decades now, but devising an automated system that delivers accurate translations of natural languages remains a very challenging task. Nevertheless, great strides have been made in this field owing to the development of web technologies, and of late there is renewed interest in this area of research.

Technological advancements in the preceding two decades have influenced Machine Translation considerably. Several MT approaches, including Statistical Machine Translation, benefitted greatly from these advancements, chiefly by making use of the availability of extensive corpora. Web 3.0 uses semantic web technology, which represents any object or resource on the web both syntactically and semantically. This type of representation helps computing systems search content on the internet in a way that goes beyond lexical search, making internet-based translation more effective and efficient.

In this paper we propose a technique to improve existing statistical Machine Translation methods by making use of semantic web technology, focusing on Tamil and Tamil-to-English MT. The proposed method integrates a semantic web technique into the WSD process that forms part of the MT system; the integration is accomplished by bringing the capabilities of RDFS and OWL into the WSD component of the MT model. The contribution of this work lies in showing that integrating a semantic web technique into the WSD system significantly improves the performance of a statistical MT system for translation from Tamil to English.

We assume the availability of large Tamil corpora and domain-specific ontologies built with Tamil semantic web technology using Web 3.0. We are optimistic about the expansion and development of the Tamil semantic web and infer that Tamil-to-English MT will improve greatly in disambiguation, apart from other related benefits. This method enhances translation quality by improving the word sense disambiguation process while text is translated from Tamil to English, and it can also be extended to Hindi and other Indian languages.
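
A minimal sketch of the RDFS/OWL idea using a made-up toy ontology (the paper's actual Tamil resources are assumed, not reproduced): a small RDF graph links each sense of an ambiguous word to a domain concept, and a SPARQL query selects the sense whose domain matches the context. The Tamil word ஆறு ("aaru") is genuinely ambiguous between "river" and "six".

```python
# pip install rdflib
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/wsd#")

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/wsd#> .
ex:aaru_river ex:lemma "aaru" ; ex:domain ex:Geography   ; ex:english "river" .
ex:aaru_six   ex:lemma "aaru" ; ex:domain ex:Mathematics ; ex:english "six" .
""", format="turtle")

def disambiguate(lemma, context_domain):
    """Return English equivalents of `lemma` whose sense matches
    the domain inferred from the surrounding text."""
    query = """
        PREFIX ex: <http://example.org/wsd#>
        SELECT ?english WHERE {
            ?sense ex:lemma ?lemma ;
                   ex:domain ?domain ;
                   ex:english ?english .
        }"""
    rows = g.query(query, initBindings={
        "lemma": Literal(lemma),
        "domain": EX[context_domain],
    })
    return [str(row.english) for row in rows]

print(disambiguate("aaru", "Geography"))    # ['river']
print(disambiguate("aaru", "Mathematics"))  # ['six']
```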

