A Pendulum Swung Too Far

2011 ◽  
Vol 6 ◽  
Author(s):  
Kenneth Church

Today's students might, in the not-too-distant future, be faced with a very different set of challenges from those of the 1990s. What should they do when most of the low-hanging fruit has been picked over? In the particular case of Machine Translation, the revival of statistical approaches (e.g., Brown et al. (1993)) started out with finite-state methods for pragmatic reasons, but over time researchers have become more and more receptive to the use of syntax to capture long-distance dependencies, especially when there is little parallel data, and for language pairs with very different word orders (e.g., translating between a subject-verb-object (SVO) language like English and a verb-final language like Japanese). Going forward, we should expect Machine Translation research to make use of richer and richer linguistic representations. So too, there will soon be a day when stress becomes important for speech recognition. Since it isn't possible for textbooks in computational linguistics to cover all of these topics, we should work with colleagues in other departments to make sure that students receive an education that is broad enough to prepare them for all possible futures, or at least all probable futures.

Author(s):  
Rajesh. K. S ◽  
Veena A Kumar ◽  
CH. Dayakar Reddy

Word alignment in bilingual corpora has been a very active topic in Machine Translation research. In this paper, we describe an alignment system that aligns English-Malayalam texts at the word level in parallel sentences. A parallel corpus is a collection of texts in two languages, one of which is the translation equivalent of the other, and the alignment of translated segments with their source segments is essential for building one. Since word-alignment research on Malayalam and English is still in its infancy, the task is far from trivial for Malayalam-English text. The main purpose of this system is therefore to construct a word-aligned parallel corpus for use in Malayalam-English machine translation. The proposed approach is a hybrid one, combining a corpus-based approach with dictionary lookup. The corpus-based approach rests on the first three IBM models and the Expectation Maximization (EM) algorithm, while the dictionary-lookup approach uses a bilingual Malayalam-English dictionary.
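
The corpus-based half of such a hybrid system can be illustrated with IBM Model 1, the simplest of the IBM models, trained by EM. The sketch below uses tiny hypothetical romanized Malayalam-English pairs as stand-ins for a real corpus; it is an illustration of the general technique, not the authors' system.

```python
from collections import defaultdict

# Toy parallel corpus: (English sentence, romanized Malayalam sentence).
# These pairs are hypothetical stand-ins for real aligned data.
corpus = [
    (["the", "house"], ["veedu"]),
    (["the", "book"], ["pusthakam"]),
    (["house"], ["veedu"]),
]

# Initialize translation probabilities t(f|e) uniformly.
t = defaultdict(lambda: 1.0)

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for src, tgt in corpus:
        for f in tgt:
            # E-step: distribute each target word's count over source words.
            norm = sum(t[(f, e)] for e in src)
            for e in src:
                frac = t[(f, e)] / norm
                count[(f, e)] += frac
                total[e] += frac
    # M-step: re-estimate translation probabilities from expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

# After training, "veedu" aligns most strongly to "house".
best = max(["the", "house"], key=lambda e: t[("veedu", e)])
print(best)  # house
```

Higher IBM models add alignment and distortion parameters on top of this lexical model, but the EM skeleton stays the same.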


2004 ◽  
Vol 30 (2) ◽  
pp. 205-225 ◽  
Author(s):  
Francisco Casacuberta ◽  
Enrique Vidal

Finite-state transducers are models used in different areas of pattern recognition and computational linguistics. One of these areas is machine translation, in which approaches based on building models automatically from training examples are becoming more and more attractive. Finite-state transducers are well suited to constrained tasks in which training samples of pairs of sentences are available. A technique for inferring finite-state transducers is proposed in this article. This technique is based on formal relations between finite-state transducers and rational grammars. Given a training corpus of source-target pairs of sentences, the proposed approach uses statistical alignment methods to produce a set of conventional strings from which a stochastic rational grammar (e.g., an n-gram) is inferred. This grammar is finally converted into a finite-state transducer. The proposed methods are assessed through a series of machine translation experiments within the framework of the EuTrans project.
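
The core idea — turn aligned sentence pairs into strings of joint source/target symbols, infer an n-gram over them, and read the result as a transducer — can be sketched as follows. The toy Spanish-English corpus, the assumed monotone 1:1 alignment, and the greedy decoder are illustrative simplifications, not the actual EuTrans setup.

```python
from collections import defaultdict

# Tiny hypothetical corpus of aligned sentence pairs.
pairs = [
    (["una", "mesa"], ["a", "table"]),
    (["una", "silla"], ["a", "chair"]),
]

# Step 1: build joint source/target symbols (assuming a monotone
# one-to-one alignment for simplicity).
joint = [[f"{s}/{t}" for s, t in zip(src, tgt)] for src, tgt in pairs]

# Step 2: infer a bigram model over joint symbols; each bigram acts
# as a transition of the resulting finite-state transducer.
bigram = defaultdict(lambda: defaultdict(int))
for sent in joint:
    prev = "<s>"
    for sym in sent:
        bigram[prev][sym] += 1
        prev = sym

def translate(src_words):
    """Greedy decoding: from the current state, follow the most frequent
    outgoing transition whose input side matches the next source word."""
    out, prev = [], "<s>"
    for w in src_words:
        cands = {s: c for s, c in bigram[prev].items() if s.startswith(w + "/")}
        sym = max(cands, key=cands.get)
        out.append(sym.split("/", 1)[1])
        prev = sym
    return out

print(translate(["una", "silla"]))  # ['a', 'chair']
```

In the full method the alignment need not be monotone or one-to-one; the alignment step emits composite output segments so that the joint strings remain well-formed.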


2011 ◽  
Vol 37 (6) ◽  
pp. 637-659 ◽  
Author(s):  
Dimitra Anastasiou ◽  
Rajat Gupta

In this paper we examine the model of crowdsourcing for translation and compare it with Machine Translation (MT). The large volume of material to be translated, its translation into many languages, and tight deadlines lead enterprises today to adopt crowdsourcing, MT, or both. Crowdsourcing translation shares many characteristics with MT, as both can cope with high volume, perform at high speed, and reduce translation cost. MT is an older technology, whereas crowdsourcing is a new phenomenon gaining ground over time, mainly through Web 2.0. Examples and challenges of both models are discussed, and the paper closes with future prospects for combining crowdsourcing and MT so that they are not regarded as opponents. These prospects are partially based on the results of a survey we conducted. Based on our background, experience, and research, this paper covers aspects from the points of view of translation studies and computational linguistics applications as well as of information sciences, particularly the development of the Web with regard to user-generated content.


2010 ◽  
Vol 93 (1) ◽  
pp. 67-76 ◽  
Author(s):  
Francis Tyers ◽  
Felipe Sánchez-Martínez ◽  
Sergio Ortiz-Rojas ◽  
Mikel Forcada

Free/Open-Source Resources in the Apertium Platform for Machine Translation Research and Development

This paper describes the resources available in the Apertium platform, a free/open-source framework for creating rule-based machine translation systems. Resources within the platform take the form of finite-state morphologies for morphological analysis and generation, bilingual transfer lexica, probabilistic part-of-speech taggers, and transfer rule files, all in standardised formats. These resources are described, and some examples are given of their reuse and recycling in combination with other machine translation systems.


2020 ◽  
Vol 10 (11) ◽  
pp. 3904
Author(s):  
Van-Hai Vu ◽  
Quang-Phuoc Nguyen ◽  
Joon-Choul Shin ◽  
Cheol-Young Ock

Machine translation (MT) has recently attracted much research on various advanced techniques (i.e., statistical-based and deep learning-based) and achieved great results for popular languages. However, research involving low-resource languages such as Korean often suffers from the lack of openly available bilingual language resources. In this research, we built open, extensive parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora consisting of Korean-English and Korean-Vietnamese datasets. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes word-ambiguity issues in MT. To address this problem, we developed a powerful word-sense annotation system, named UTagger, based on a combination of sub-word conditional probability and knowledge-based methods. We applied UTagger to UPC and used these corpora to train both statistical and deep-learning-based neural MT systems. The experimental results demonstrated that high-quality MT systems (in terms of Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores) can be built using UPC. Both UPC and UTagger are available for free download and use.
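
The homograph problem can be made concrete with a toy sense tagger. The sketch below is not UTagger's actual algorithm; it merely illustrates the conditional-probability side of the idea, choosing the sense of a homograph from counts of (neighboring word, sense) pairs in a hypothetical sense-tagged corpus.

```python
from collections import defaultdict

# Hypothetical sense-tagged training data for the Korean homograph "bae"
# (which can mean "pear" or "boat", among other senses), romanized for
# readability: ((left-context word, surface word), sense tag).
tagged = [
    (("eat", "bae"), "bae_pear"),
    (("ride", "bae"), "bae_boat"),
    (("eat", "bae"), "bae_pear"),
]

# Count sense occurrences conditioned on (left context, word).
counts = defaultdict(lambda: defaultdict(int))
for (left, word), sense in tagged:
    counts[(left, word)][sense] += 1

def disambiguate(left, word):
    """Return the most probable sense given the left-context word,
    falling back to the surface form when the context is unseen."""
    senses = counts[(left, word)]
    return max(senses, key=senses.get) if senses else word

print(disambiguate("eat", "bae"))   # bae_pear
print(disambiguate("ride", "bae"))  # bae_boat
```

Annotating each homograph with its sense tag before training gives the MT system distinct vocabulary entries for distinct meanings, which is what removes the ambiguity downstream.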


2016 ◽  
Vol 22 (4) ◽  
pp. 517-548 ◽  
Author(s):  
ANN IRVINE ◽  
CHRIS CALLISON-BURCH

We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present a detailed analysis of the accuracy of bilingual lexicon induction and show how a discriminative model can be used to combine various signals of translation equivalence (such as contextual similarity, temporal similarity, orthographic similarity, and topic similarity). Our discriminative model produces higher-accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features in a phrase-based SMT system. These monolingually estimated features enhance low-resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.
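
Combining several similarity signals discriminatively can be sketched with a tiny logistic regression trained by gradient descent. The four-dimensional feature vectors below (contextual, temporal, orthographic, topic similarity) and their labels are hypothetical; the point is only the shape of the technique, not the paper's actual model or features.

```python
import math

# Hypothetical training data: feature vector of four similarity signals,
# labeled 1 for true translation pairs and 0 for non-translations.
data = [
    ([0.9, 0.8, 0.7, 0.9], 1),
    ([0.8, 0.9, 0.2, 0.8], 1),  # true pair with low orthographic overlap
    ([0.2, 0.3, 0.1, 0.2], 0),
    ([0.3, 0.1, 0.4, 0.3], 0),
]

w = [0.0] * 4
b = 0.0
lr = 0.5

def score(x):
    """Sigmoid of the weighted combination of the four signals."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Stochastic gradient descent on the logistic loss.
for _ in range(200):
    for x, y in data:
        err = score(x) - y
        for i in range(4):
            w[i] -= lr * err * x[i]
        b -= lr * err

# A candidate pair that is strong on all four signals scores high.
print(round(score([0.9, 0.9, 0.8, 0.9]), 2))
```

A useful property of this setup is that no individual signal has to be reliable on its own; the learned weights let strong evidence from one signal compensate for weak evidence from another.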


Author(s):  
Mans Hulden

Finite-state machines—automata and transducers—are ubiquitous in natural-language processing and computational linguistics. This chapter introduces the fundamentals of finite-state automata and transducers, both probabilistic and non-probabilistic, illustrating the technology with example applications and common usage. It also covers the construction of transducers, which correspond to regular relations, and automata, which correspond to regular languages. The technologies introduced are widely employed in natural language processing, computational phonology and morphology in particular, and this is illustrated through common practical use cases.
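
The essentials — states, transitions labeled with input:output symbol pairs, and a final-state set — fit in a few lines. The toy deterministic transducer below, a hypothetical morphological-generation example that pluralizes two English nouns, is a minimal sketch of the machinery the chapter covers, not code from it.

```python
# Transition table for a small deterministic FST:
# (state, input symbol) -> (output string, next state).
transitions = {
    (0, "cat"): ("cat", 1),
    (0, "dog"): ("dog", 1),
    (1, "+PL"): ("s", 2),   # plural morpheme realized as "s"
    (1, "+SG"): ("", 2),    # singular morpheme realized as nothing
}
finals = {2}

def transduce(symbols, start=0):
    """Apply the transducer to a symbol sequence; return the output
    string, or None if the input is rejected."""
    state, out = start, []
    for sym in symbols:
        if (state, sym) not in transitions:
            return None  # no matching transition: reject
        piece, state = transitions[(state, sym)]
        out.append(piece)
    return "".join(out) if state in finals else None

print(transduce(["cat", "+PL"]))  # cats
print(transduce(["dog", "+SG"]))  # dog
```

Inverting the transition table (swapping input and output roles) turns this generator into an analyzer, which is one reason transducers are so convenient for computational morphology.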


2011 ◽  
Vol 95 (1) ◽  
pp. 87-106 ◽  
Author(s):  
Bushra Jawaid ◽  
Daniel Zeman

Word-Order Issues in English-to-Urdu Statistical Machine Translation

We investigate phrase-based statistical machine translation between English and Urdu, two Indo-European languages that differ significantly in their word-order preferences. Reordering of words and phrases is thus a necessary part of the translation process. While local reordering is modeled nicely by phrase-based systems, long-distance reordering is known to be a hard problem. We perform experiments using the Moses SMT system and discuss the reordering models available in Moses. We then present our novel, Urdu-aware, yet generalizable approach based on reordering phrases in the syntactic parse tree of the source English sentence. Our technique significantly improves the quality of English-Urdu translation with Moses, both in terms of BLEU score and of subjective human judgments.
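
Parse-tree pre-ordering of this kind can be sketched with a single rule. The rule below (move the head verb to the end of its VP, approximating Urdu's verb-final order) is a simplified stand-in for the paper's rule set, and the bracketed tree is a hypothetical example.

```python
# A parse tree is (label, children), where children are subtrees or words.

def reorder(tree):
    """Recursively reorder the source tree toward verb-final order:
    inside each VP, move the head verb (first child) to the end."""
    label, children = tree
    children = [reorder(c) if isinstance(c, tuple) else c for c in children]
    if label == "VP" and len(children) >= 2:
        children = children[1:] + children[:1]
    return (label, children)

def leaves(tree):
    """Read the words off the tree in left-to-right order."""
    label, children = tree
    out = []
    for c in children:
        if isinstance(c, tuple):
            out.extend(leaves(c))
        else:
            out.append(c)
    return out

# "John reads books" (SVO) -> "John books reads" (SOV-like order)
sent = ("S", [("NP", ["John"]),
              ("VP", [("V", ["reads"]), ("NP", ["books"])])])
print(" ".join(leaves(reorder(sent))))  # John books reads
```

Because the reordering happens on the source side before phrase extraction and decoding, the phrase-based system itself only ever has to handle short-range movement.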


Author(s):  
Raj Dabre ◽  
Atsushi Fujita

In encoder-decoder based sequence-to-sequence modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in the encoder and decoder. While the addition of each new layer improves the sequence generation quality, it also leads to a significant increase in the number of parameters. In this paper, we propose to share parameters across all layers, leading to a recurrently stacked sequence-to-sequence model. We report on an extensive case study on neural machine translation (NMT) using our proposed method, experimenting with a variety of datasets. We empirically show that the translation quality of a model that recurrently stacks a single layer six times, despite its significantly fewer parameters, approaches that of a model that stacks six different layers. We also show how our method can benefit from a prevalent way of improving NMT, i.e., extending the training data with pseudo-parallel corpora generated by back-translation. We then analyze the effects of recurrently stacked layers by visualizing the attentions of models that use recurrently stacked layers and models that do not. Finally, we explore the limits of parameter sharing, where we share even the parameters between the encoder and decoder in addition to recurrently stacking layers.
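
The parameter-sharing idea is easy to see in miniature: reuse one layer's weights at every depth instead of allocating fresh weights per layer. The toy "layer" below (an affine map with a ReLU, at made-up dimensions) is an illustrative sketch, not an actual NMT encoder.

```python
import random

random.seed(0)
DIM, DEPTH = 4, 6  # toy hidden size and stack depth

def make_layer():
    """Allocate one layer's parameters: a DIM x DIM weight matrix and a bias."""
    W = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
    b = [0.0] * DIM
    return (W, b)

def apply_layer(layer, x):
    """Affine map followed by ReLU."""
    W, b = layer
    return [max(0.0, sum(W[i][j] * x[j] for j in range(DIM)) + b[i])
            for i in range(DIM)]

# Recurrent stacking: the SAME parameter object reused DEPTH times.
shared = [make_layer()] * DEPTH
# Conventional stacking: DEPTH independently parameterized layers.
distinct = [make_layer() for _ in range(DEPTH)]

# Forward pass through the shared stack works exactly like a normal stack.
x = [1.0, 0.5, -0.5, 0.2]
for layer in shared:
    x = apply_layer(layer, x)

# Unique parameter count: recurrent stacking stores DEPTH times fewer weights.
params_per_layer = DIM * DIM + DIM
print(len({id(l) for l in shared}) * params_per_layer, "vs",
      len({id(l) for l in distinct}) * params_per_layer)  # 20 vs 120
```

The forward computation (and hence inference cost) is identical in both cases; only the storage and the number of trainable parameters differ.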

