Translational equivalence in Statistical Machine Translation or meaning as co-occurrence

Author(s):  
Lieve Macken ◽  
Els Lefever

In this paper, we describe the current state of the art in Statistical Machine Translation (SMT) and reflect on how SMT handles meaning. Statistical Machine Translation is a corpus-based approach to MT: it derives the knowledge required to generate new translations from corpora. General-purpose SMT systems do not use any formal semantic representation. Instead, they directly extract translationally equivalent words or word sequences – expressions with the same meaning – from bilingual parallel corpora. All statistical translation models are based on the idea of word alignment, i.e., the automatic linking of corresponding words in parallel texts. The first-generation SMT systems were word-based. From a linguistic point of view, the major problem with word-based systems is that the meaning of a word is often ambiguous and is determined by its context. Current state-of-the-art SMT systems try to capture local contextual dependencies by using phrases instead of words as units of translation. In order to solve more complex ambiguity problems (where a broader text scope or even domain information is needed), a Word Sense Disambiguation (WSD) module is integrated in the Machine Translation environment.
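
A rough Python sketch of the co-occurrence view of translational equivalence described above: it extracts phrase pairs consistent with a given word alignment and scores them by relative frequency, as in phrase-based SMT. The sentence pairs and alignments below are toy data, not the authors' system.

```python
from collections import Counter

def extract_phrases(src, tgt, alignment, max_len=3):
    """Extract phrase pairs consistent with the word alignment:
    no word inside the pair may be aligned to a word outside it."""
    pairs = []
    n = len(src)
    for i1 in range(n):
        for i2 in range(i1, min(n, i1 + max_len)):
            # target positions aligned to the source span [i1, i2]
            tps = [t for (s, t) in alignment if i1 <= s <= i2]
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            # reject if any target word in [j1, j2] links outside [i1, i2]
            if all(i1 <= s <= i2 for (s, t) in alignment if j1 <= t <= j2):
                pairs.append((" ".join(src[i1:i2 + 1]),
                              " ".join(tgt[j1:j2 + 1])))
    return pairs

# Toy parallel corpus with hand-made word alignments.
corpus = [
    ("het huis is klein".split(), "the house is small".split(),
     [(0, 0), (1, 1), (2, 2), (3, 3)]),
    ("het huis is groot".split(), "the house is big".split(),
     [(0, 0), (1, 1), (2, 2), (3, 3)]),
]

counts, src_counts = Counter(), Counter()
for src, tgt, align in corpus:
    for s_phrase, t_phrase in extract_phrases(src, tgt, align):
        counts[(s_phrase, t_phrase)] += 1
        src_counts[s_phrase] += 1

# Relative-frequency translation probabilities p(target | source).
for (s, t), c in sorted(counts.items()):
    print(f"p({t!r} | {s!r}) = {c / src_counts[s]:.2f}")
```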

Author(s):  
Pushpak Bhattacharyya ◽  
Mitesh Khapra

This chapter discusses the basic concepts of Word Sense Disambiguation (WSD) and the approaches to solving this problem. Both general-purpose WSD and domain-specific WSD are presented. The first part of the discussion focuses on existing approaches to WSD, including knowledge-based, supervised, semi-supervised, unsupervised, hybrid, and bilingual approaches. The accuracy of general-purpose WSD currently seems to be pegged at around 65%, which has motivated investigations into domain-specific WSD, the current trend in the field. In the latter part of the chapter, we present a greedy, neural-network-inspired algorithm for domain-specific WSD and compare its performance with other state-of-the-art WSD algorithms. Our experiments suggest that for domain-specific WSD, simply selecting the most frequent sense of a word does as well as any state-of-the-art algorithm.
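
The most-frequent-sense baseline mentioned above is easy to reproduce with NLTK's WordNet interface, which returns synsets in decreasing order of tagged frequency; this sketch is an illustration of the baseline, not the chapter's own algorithm.

```python
# pip install nltk; then: python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=None):
    """Return the most frequent WordNet sense of `word`.
    NLTK lists synsets by decreasing tagged frequency, so the
    first synset is exactly the most-frequent-sense baseline."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

sense = most_frequent_sense("bank", pos=wn.NOUN)
print(sense.name())        # bank.n.01
print(sense.definition())  # sloping land beside a body of water
```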


Author(s):  
Edoardo Barba ◽  
Luigi Procopio ◽  
Caterina Lacerra ◽  
Tommaso Pasini ◽  
Roberto Navigli

Recently, generative approaches have been used effectively to provide definitions of words in their context. However, the opposite task, i.e., generating a usage example given one or more words along with their definitions, has not yet been investigated. In this work, we introduce the novel task of Exemplification Modeling (ExMod), along with a sequence-to-sequence architecture and a training procedure for it. Starting from a set of (word, definition) pairs, our approach is capable of automatically generating high-quality sentences which express the requested semantics. As a result, we can drive the creation of sense-tagged data covering the full range of meanings in any inventory of interest, and their interactions within sentences. Human annotators agree that the generated sentences are as fluent and as semantically coherent with the input definitions as the sentences in manually annotated corpora. Indeed, when our examples are employed as training data for Word Sense Disambiguation, they enable the current state of the art to be outperformed, yielding higher results than gold-standard datasets alone. We release the pretrained model, the dataset and the software at https://github.com/SapienzaNLP/exmod.
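
In rough terms, inference with such a sequence-to-sequence exemplifier looks like the sketch below. The checkpoint path and the "word: definition" input serialization are assumptions for illustration; the authors' actual model and format are released at https://github.com/SapienzaNLP/exmod.

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "path/to/exmod-checkpoint"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input: a (word, definition) pair, serialized as plain text.
word, definition = "bank", "a financial institution that accepts deposits"
inputs = tokenizer(f"{word}: {definition}", return_tensors="pt")

# Generate a usage example expressing the requested sense.
output_ids = model.generate(**inputs, max_length=40, num_beams=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```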


2014 ◽  
Vol 101 (1) ◽  
pp. 29-41
Author(s):  
Aleš Tamchyna ◽  
Fabienne Braune ◽  
Alexander Fraser ◽  
Marine Carpuat ◽  
Hal Daumé III ◽ 
...  

Current state-of-the-art statistical machine translation (SMT) relies on simple feature functions which make independence assumptions at the level of phrases or hierarchical rules. However, it is well known that discriminative models can benefit from rich features extracted from the source-sentence context outside of the applied phrase or hierarchical rule, which is available at decoding time. We present a framework for the open-source decoder Moses that allows discriminative models over source context to be trained easily on a large number of examples and then included as feature functions in decoding.
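
The actual integration lives in Moses's feature-function interface; as a decoder-agnostic illustration of the idea (toy data, invented features), the sketch below trains a discriminative classifier whose score for a candidate phrase pair is conditioned on source words outside the applied phrase.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def context_features(source_words, span, target_phrase):
    """Features over source context *outside* the applied phrase,
    plus the candidate target phrase itself."""
    i, j = span  # phrase covers source_words[i:j]
    left = source_words[i - 1] if i > 0 else "<s>"
    right = source_words[j] if j < len(source_words) else "</s>"
    return {f"left={left}": 1, f"right={right}": 1,
            f"target={target_phrase}": 1}

# Toy examples: (source sentence, span, candidate translation, correct?)
data = [
    ("the river bank eroded".split(), (2, 3), "rive", 1),
    ("the river bank eroded".split(), (2, 3), "banque", 0),
    ("my bank account number".split(), (1, 2), "banque", 1),
    ("my bank account number".split(), (1, 2), "rive", 0),
]

vec = DictVectorizer()
X = vec.fit_transform(context_features(s, sp, t) for s, sp, t, _ in data)
y = [label for *_, label in data]
clf = LogisticRegression().fit(X, y)

# At decoding time, the classifier's probability would enter the
# decoder's log-linear model as one more feature function.
test = context_features("a steep river bank".split(), (3, 4), "rive")
print(clf.predict_proba(vec.transform([test]))[0, 1])
```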


2015 ◽  
Vol 2015 ◽  
pp. 1-13 ◽  
Author(s):  
Jae-Hoon Kim ◽  
Hong-Seok Kwon ◽  
Hyeong-Won Seo

A pivot-based approach to bilingual lexicon extraction relies on the similarity of context vectors represented by words in a pivot language such as English. In this paper, to show the validity and usability of the pivot-based approach, we evaluate it in combination with two different methods for estimating context vectors: one estimates them from two parallel corpora based on word association between source words (resp., target words) and pivot words, and the other estimates them from two parallel corpora based on word alignment tools for statistical machine translation. Empirical results on two language pairs (Korean-Spanish and Korean-French) show that the pivot-based approach is very promising for resource-poor languages, confirming its validity and usability. Furthermore, our method also performs well for words with low frequency.
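
A minimal sketch of the pivot-based comparison: each source and target word is represented by a context vector over English pivot words, and translation candidates are ranked by cosine similarity. The association scores below are made up for illustration; in the paper they are estimated from Korean-English and Spanish-English parallel corpora.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse context vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Context vectors over English pivot words (toy association scores).
korean = {"학교": {"school": 0.9, "teacher": 0.6, "student": 0.7}}
spanish = {
    "escuela": {"school": 0.8, "teacher": 0.5, "student": 0.6},
    "pescado": {"fish": 0.9, "sea": 0.4},
}

# Rank Spanish candidates for the Korean word '학교' ("school").
src_vec = korean["학교"]
ranked = sorted(spanish.items(),
                key=lambda kv: cosine(src_vec, kv[1]), reverse=True)
for word, vec in ranked:
    print(word, round(cosine(src_vec, vec), 3))
# 'escuela' ranks first: its pivot context overlaps with '학교'.
```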


2010 ◽  
Vol 36 (2) ◽  
pp. 247-277 ◽  
Author(s):  
Wei Wang ◽  
Jonathan May ◽  
Kevin Knight ◽  
Daniel Marcu

This article shows that the structure of bilingual material from standard parsing and alignment tools is not optimal for training syntax-based statistical machine translation (SMT) systems. We present three modifications to the MT training data to improve the accuracy of a state-of-the-art syntax MT system: re-structuring changes the syntactic structure of training parse trees to enable reuse of substructures; re-labeling alters bracket labels to enrich rule application context; and re-aligning unifies word alignment across sentences to remove bad word alignments and refine good ones. Better structures, labels, and word alignments are learned by the EM algorithm. We show that each individual technique leads to improvement as measured by BLEU, and we also show that the greatest improvement is achieved by combining them. We report an overall 1.48 BLEU improvement on the NIST08 evaluation set over a strong baseline in Chinese/English translation.
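
As a small illustration of the re-structuring idea only (the paper learns its transformations with EM; plain binarization is used here as a stand-in), breaking a flat bracket into nested binary brackets exposes substructures that syntax-based translation rules can match and reuse. NLTK's tree transforms show the effect.

```python
# pip install nltk
from nltk import Tree

# A flat NP yields one monolithic rule, so its substructures
# cannot be reused by a syntax-based translation system.
tree = Tree.fromstring("(NP (DT the) (JJ small) (JJ old) (NN house))")

# Binarization (one form of re-structuring) nests the flat bracket
# into binary brackets whose pieces can be matched independently.
binarized = tree.copy(deep=True)
binarized.chomsky_normal_form(factor="right")

tree.pretty_print()
binarized.pretty_print()
```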


Author(s):  
Santosh Kumar T.S.

Machine Translation has been an area of linguistic research for more than two decades now, but devising an automated system that delivers accurate translations of natural languages remains a very challenging task. Nevertheless, great strides have been made in this field owing to the development of web technologies, and of late there is renewed interest in this area of research.

Technological advancements in the preceding two decades have influenced Machine Translation considerably. Several MT approaches, including Statistical Machine Translation, benefitted greatly from these advancements, chiefly by making use of the availability of extensive corpora. Web 3.0 uses semantic web technology, which represents any object or resource on the web both syntactically and semantically. This type of representation helps computing systems search content on the internet in a way that goes beyond lexical search, making internet-based translation more effective and efficient.

In this paper we propose a technique to improve existing statistical Machine Translation methods by making use of semantic web technology, focusing on Tamil and Tamil-to-English MT. The proposed method integrates a semantic web technique into the WSD process that forms part of the MT system; the integration is accomplished by bringing the capabilities of RDFS and OWL into the WSD component of the MT model. The contribution of this work lies in showing that integrating a semantic web technique into the WSD system significantly improves the performance of a statistical MT system for translation from Tamil to English.

We assume the availability of large Tamil corpora and domain-specific ontologies built with Tamil semantic web technology using Web 3.0. We are optimistic about the expansion and development of the Tamil semantic web and infer that Tamil-to-English MT will improve greatly in disambiguation, apart from other related benefits. This method enhances translation quality by improving the word sense disambiguation process while text is translated from Tamil to English, and it can also be extended to Hindi and other Indian languages.
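
A minimal sketch of the RDFS/OWL idea using a made-up toy ontology (the paper's actual Tamil resources are assumed, not reproduced): a small RDF graph links each sense of an ambiguous word to a domain concept, and a SPARQL query selects the sense whose domain matches the context. The Tamil word ஆறு ("aaru") is genuinely ambiguous between "river" and "six".

```python
# pip install rdflib
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/wsd#")

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/wsd#> .
ex:aaru_river ex:lemma "aaru" ; ex:domain ex:Geography   ; ex:english "river" .
ex:aaru_six   ex:lemma "aaru" ; ex:domain ex:Mathematics ; ex:english "six" .
""", format="turtle")

def disambiguate(lemma, context_domain):
    """Return English equivalents of `lemma` whose sense matches
    the domain inferred from the surrounding text."""
    query = """
        PREFIX ex: <http://example.org/wsd#>
        SELECT ?english WHERE {
            ?sense ex:lemma ?lemma ;
                   ex:domain ?domain ;
                   ex:english ?english .
        }"""
    rows = g.query(query, initBindings={
        "lemma": Literal(lemma),
        "domain": EX[context_domain],
    })
    return [str(row.english) for row in rows]

print(disambiguate("aaru", "Geography"))    # ['river']
print(disambiguate("aaru", "Mathematics"))  # ['six']
```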

