A Pendulum Swung Too Far

2011 ◽  
Vol 6 ◽  
Author(s):  
Kenneth Church

Today's students might, in the not-too-distant future, be faced with a very different set of challenges from those of the 1990s. What should they do when most of the low-hanging fruit has been picked over? In the particular case of Machine Translation, the revival of statistical approaches (e.g., Brown et al. (1993)) started out with finite-state methods for pragmatic reasons, but over time researchers have become more and more receptive to the use of syntax to capture long-distance dependencies, especially when there is little parallel data, and for language pairs with very different word orders (e.g., translating between a subject-verb-object (SVO) language like English and a verb-final language like Japanese). Going forward, we should expect Machine Translation research to make use of richer and richer linguistic representations. So too, there will soon be a day when stress becomes important for speech recognition. Since it isn't possible for textbooks in computational linguistics to cover all of these topics, we should work with colleagues in other departments to make sure that students receive an education that is broad enough to prepare them for all possible futures, or at least all probable futures.

Author(s):  
Rajesh. K. S ◽  
Veena A Kumar ◽  
CH. Dayakar Reddy

Word alignment in bilingual corpora has been a very active topic in Machine Translation research. In this paper, we describe an alignment system that aligns English-Malayalam texts at the word level in parallel sentences. A parallel corpus is a collection of texts in two languages, one of which is the translation equivalent of the other, and the alignment of translated segments with their source segments is essential for building one. Since word-alignment research on Malayalam and English is still in its infancy, the task is far from trivial for Malayalam-English text. The main purpose of this system is therefore to construct a word-aligned parallel corpus for use in Malayalam-English machine translation. The proposed approach is a hybrid one, combining a corpus-based approach with dictionary lookup. The corpus-based approach rests on the first three IBM models and the Expectation Maximization (EM) algorithm, while the dictionary-lookup approach uses a bilingual Malayalam-English dictionary.
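
The corpus-based half of such a hybrid system can be illustrated with IBM Model 1, the simplest of the IBM models, trained by EM. The sketch below uses tiny hypothetical romanized Malayalam-English pairs as stand-ins for a real corpus; it is an illustration of the general technique, not the authors' system.

```python
from collections import defaultdict

# Toy parallel corpus: (English sentence, romanized Malayalam sentence).
# These pairs are hypothetical stand-ins for real aligned data.
corpus = [
    (["the", "house"], ["veedu"]),
    (["the", "book"], ["pusthakam"]),
    (["house"], ["veedu"]),
]

# Initialize translation probabilities t(f|e) uniformly.
t = defaultdict(lambda: 1.0)

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for src, tgt in corpus:
        for f in tgt:
            # E-step: distribute each target word's count over source words.
            norm = sum(t[(f, e)] for e in src)
            for e in src:
                frac = t[(f, e)] / norm
                count[(f, e)] += frac
                total[e] += frac
    # M-step: re-estimate translation probabilities from expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

# After training, "veedu" aligns most strongly to "house".
best = max(["the", "house"], key=lambda e: t[("veedu", e)])
print(best)  # house
```

Higher IBM models add alignment and distortion parameters on top of this lexical model, but the EM skeleton stays the same.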


2004 ◽  
Vol 30 (2) ◽  
pp. 205-225 ◽  
Author(s):  
Francisco Casacuberta ◽  
Enrique Vidal

Finite-state transducers are models used in different areas of pattern recognition and computational linguistics. One of these areas is machine translation, in which approaches based on building models automatically from training examples are becoming more and more attractive. Finite-state transducers are well suited to constrained tasks in which training samples of pairs of sentences are available. A technique for inferring finite-state transducers is proposed in this article. This technique is based on formal relations between finite-state transducers and rational grammars. Given a training corpus of source-target pairs of sentences, the proposed approach uses statistical alignment methods to produce a set of conventional strings from which a stochastic rational grammar (e.g., an n-gram) is inferred. This grammar is finally converted into a finite-state transducer. The proposed methods are assessed through a series of machine translation experiments within the framework of the EuTrans project.
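
The core idea — turn aligned sentence pairs into strings of joint source/target symbols, infer an n-gram over them, and read the result as a transducer — can be sketched as follows. The toy Spanish-English corpus, the assumed monotone 1:1 alignment, and the greedy decoder are illustrative simplifications, not the actual EuTrans setup.

```python
from collections import defaultdict

# Tiny hypothetical corpus of aligned sentence pairs.
pairs = [
    (["una", "mesa"], ["a", "table"]),
    (["una", "silla"], ["a", "chair"]),
]

# Step 1: build joint source/target symbols (assuming a monotone
# one-to-one alignment for simplicity).
joint = [[f"{s}/{t}" for s, t in zip(src, tgt)] for src, tgt in pairs]

# Step 2: infer a bigram model over joint symbols; each bigram acts
# as a transition of the resulting finite-state transducer.
bigram = defaultdict(lambda: defaultdict(int))
for sent in joint:
    prev = "<s>"
    for sym in sent:
        bigram[prev][sym] += 1
        prev = sym

def translate(src_words):
    """Greedy decoding: from the current state, follow the most frequent
    outgoing transition whose input side matches the next source word."""
    out, prev = [], "<s>"
    for w in src_words:
        cands = {s: c for s, c in bigram[prev].items() if s.startswith(w + "/")}
        sym = max(cands, key=cands.get)
        out.append(sym.split("/", 1)[1])
        prev = sym
    return out

print(translate(["una", "silla"]))  # ['a', 'chair']
```

In the full method the alignment need not be monotone or one-to-one; the alignment step emits composite output segments so that the joint strings remain well-formed.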


2011 ◽  
Vol 37 (6) ◽  
pp. 637-659 ◽  
Author(s):  
Dimitra Anastasiou ◽  
Rajat Gupta

In this paper we examine the model of crowdsourcing for translation and compare it with Machine Translation (MT). The large volume of material to be translated, its translation into many languages, and tight deadlines lead enterprises today to adopt crowdsourcing, MT, or both. Crowdsourcing translation shares many characteristics with MT, as both can cope with high volume, perform at high speed, and reduce translation cost. MT is an older technology, whereas crowdsourcing is a new phenomenon gaining ground over time, mainly through Web 2.0. Examples and challenges of both models are discussed, and the paper closes with future prospects for combining crowdsourcing and MT so that they are not regarded as opponents. These prospects are partially based on the results of a survey we conducted. Based on our background, experience, and research, this paper covers aspects from the points of view of translation studies and computational linguistics applications as well as of information sciences, particularly the development of the Web with regard to user-generated content.


2010 ◽  
Vol 93 (1) ◽  
pp. 67-76 ◽  
Author(s):  
Francis Tyers ◽  
Felipe Sánchez-Martínez ◽  
Sergio Ortiz-Rojas ◽  
Mikel Forcada

Free/Open-Source Resources in the Apertium Platform for Machine Translation Research and Development

This paper describes the resources available in the Apertium platform, a free/open-source framework for creating rule-based machine translation systems. Resources within the platform take the form of finite-state morphologies for morphological analysis and generation, bilingual transfer lexica, probabilistic part-of-speech taggers, and transfer rule files, all in standardised formats. These resources are described, and some examples are given of their reuse and recycling in combination with other machine translation systems.


2020 ◽  
Vol 10 (11) ◽  
pp. 3904
Author(s):  
Van-Hai Vu ◽  
Quang-Phuoc Nguyen ◽  
Joon-Choul Shin ◽  
Cheol-Young Ock

Machine translation (MT) has recently attracted much research on various advanced techniques (i.e., statistical-based and deep learning-based) and achieved great results for popular languages. However, research involving low-resource languages such as Korean often suffers from the lack of openly available bilingual language resources. In this research, we built open, extensive parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora consisting of Korean-English and Korean-Vietnamese datasets. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes word-ambiguity issues in MT. To address this problem, we developed a powerful word-sense annotation system, named UTagger, based on a combination of sub-word conditional probability and knowledge-based methods. We applied UTagger to UPC and used these corpora to train both statistical and deep-learning-based neural MT systems. The experimental results demonstrated that high-quality MT systems (in terms of Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores) can be built using UPC. Both UPC and UTagger are available for free download and use.
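
The homograph problem can be made concrete with a toy sense tagger. The sketch below is not UTagger's actual algorithm; it merely illustrates the conditional-probability side of the idea, choosing the sense of a homograph from counts of (neighboring word, sense) pairs in a hypothetical sense-tagged corpus.

```python
from collections import defaultdict

# Hypothetical sense-tagged training data for the Korean homograph "bae"
# (which can mean "pear" or "boat", among other senses), romanized for
# readability: ((left-context word, surface word), sense tag).
tagged = [
    (("eat", "bae"), "bae_pear"),
    (("ride", "bae"), "bae_boat"),
    (("eat", "bae"), "bae_pear"),
]

# Count sense occurrences conditioned on (left context, word).
counts = defaultdict(lambda: defaultdict(int))
for (left, word), sense in tagged:
    counts[(left, word)][sense] += 1

def disambiguate(left, word):
    """Return the most probable sense given the left-context word,
    falling back to the surface form when the context is unseen."""
    senses = counts[(left, word)]
    return max(senses, key=senses.get) if senses else word

print(disambiguate("eat", "bae"))   # bae_pear
print(disambiguate("ride", "bae"))  # bae_boat
```

Annotating each homograph with its sense tag before training gives the MT system distinct vocabulary entries for distinct meanings, which is what removes the ambiguity downstream.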


2016 ◽  
Vol 22 (4) ◽  
pp. 517-548 ◽  
Author(s):  
ANN IRVINE ◽  
CHRIS CALLISON-BURCH

We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present a detailed analysis of the accuracy of bilingual lexicon induction and show how a discriminative model can be used to combine various signals of translation equivalence (such as contextual similarity, temporal similarity, orthographic similarity, and topic similarity). Our discriminative model produces higher-accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features in a phrase-based SMT system. These monolingually estimated features enhance low-resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.
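
Combining several similarity signals discriminatively can be sketched with a tiny logistic regression trained by gradient descent. The four-dimensional feature vectors below (contextual, temporal, orthographic, topic similarity) and their labels are hypothetical; the point is only the shape of the technique, not the paper's actual model or features.

```python
import math

# Hypothetical training data: feature vector of four similarity signals,
# labeled 1 for true translation pairs and 0 for non-translations.
data = [
    ([0.9, 0.8, 0.7, 0.9], 1),
    ([0.8, 0.9, 0.2, 0.8], 1),  # true pair with low orthographic overlap
    ([0.2, 0.3, 0.1, 0.2], 0),
    ([0.3, 0.1, 0.4, 0.3], 0),
]

w = [0.0] * 4
b = 0.0
lr = 0.5

def score(x):
    """Sigmoid of the weighted combination of the four signals."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Stochastic gradient descent on the logistic loss.
for _ in range(200):
    for x, y in data:
        err = score(x) - y
        for i in range(4):
            w[i] -= lr * err * x[i]
        b -= lr * err

# A candidate pair that is strong on all four signals scores high.
print(round(score([0.9, 0.9, 0.8, 0.9]), 2))
```

A useful property of this setup is that no individual signal has to be reliable on its own; the learned weights let strong evidence from one signal compensate for weak evidence from another.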


Author(s):  
Mans Hulden

Finite-state machines—automata and transducers—are ubiquitous in natural-language processing and computational linguistics. This chapter introduces the fundamentals of finite-state automata and transducers, both probabilistic and non-probabilistic, illustrating the technology with example applications and common usage. It also covers the construction of transducers, which correspond to regular relations, and automata, which correspond to regular languages. The technologies introduced are widely employed in natural language processing, computational phonology and morphology in particular, and this is illustrated through common practical use cases.
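
The essentials — states, transitions labeled with input:output symbol pairs, and a final-state set — fit in a few lines. The toy deterministic transducer below, a hypothetical morphological-generation example that pluralizes two English nouns, is a minimal sketch of the machinery the chapter covers, not code from it.

```python
# Transition table for a small deterministic FST:
# (state, input symbol) -> (output string, next state).
transitions = {
    (0, "cat"): ("cat", 1),
    (0, "dog"): ("dog", 1),
    (1, "+PL"): ("s", 2),   # plural morpheme realized as "s"
    (1, "+SG"): ("", 2),    # singular morpheme realized as nothing
}
finals = {2}

def transduce(symbols, start=0):
    """Apply the transducer to a symbol sequence; return the output
    string, or None if the input is rejected."""
    state, out = start, []
    for sym in symbols:
        if (state, sym) not in transitions:
            return None  # no matching transition: reject
        piece, state = transitions[(state, sym)]
        out.append(piece)
    return "".join(out) if state in finals else None

print(transduce(["cat", "+PL"]))  # cats
print(transduce(["dog", "+SG"]))  # dog
```

Inverting the transition table (swapping input and output roles) turns this generator into an analyzer, which is one reason transducers are so convenient for computational morphology.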


2011 ◽  
Vol 95 (1) ◽  
pp. 87-106 ◽  
Author(s):  
Bushra Jawaid ◽  
Daniel Zeman

Word-Order Issues in English-to-Urdu Statistical Machine Translation

We investigate phrase-based statistical machine translation between English and Urdu, two Indo-European languages that differ significantly in their word-order preferences. Reordering of words and phrases is thus a necessary part of the translation process. While local reordering is modeled nicely by phrase-based systems, long-distance reordering is known to be a hard problem. We perform experiments using the Moses SMT system and discuss the reordering models available in Moses. We then present our novel, Urdu-aware, yet generalizable approach based on reordering phrases in the syntactic parse tree of the source English sentence. Our technique significantly improves the quality of English-Urdu translation with Moses, both in terms of BLEU score and of subjective human judgments.
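
Parse-tree pre-ordering of this kind can be sketched with a single rule. The rule below (move the head verb to the end of its VP, approximating Urdu's verb-final order) is a simplified stand-in for the paper's rule set, and the bracketed tree is a hypothetical example.

```python
# A parse tree is (label, children), where children are subtrees or words.

def reorder(tree):
    """Recursively reorder the source tree toward verb-final order:
    inside each VP, move the head verb (first child) to the end."""
    label, children = tree
    children = [reorder(c) if isinstance(c, tuple) else c for c in children]
    if label == "VP" and len(children) >= 2:
        children = children[1:] + children[:1]
    return (label, children)

def leaves(tree):
    """Read the words off the tree in left-to-right order."""
    label, children = tree
    out = []
    for c in children:
        if isinstance(c, tuple):
            out.extend(leaves(c))
        else:
            out.append(c)
    return out

# "John reads books" (SVO) -> "John books reads" (SOV-like order)
sent = ("S", [("NP", ["John"]),
              ("VP", [("V", ["reads"]), ("NP", ["books"])])])
print(" ".join(leaves(reorder(sent))))  # John books reads
```

Because the reordering happens on the source side before phrase extraction and decoding, the phrase-based system itself only ever has to handle short-range movement.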


Author(s):  
Raj Dabre ◽  
Atsushi Fujita

In encoder-decoder based sequence-to-sequence modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in the encoder and decoder. While the addition of each new layer improves the sequence generation quality, it also leads to a significant increase in the number of parameters. In this paper, we propose to share parameters across all layers, leading to a recurrently stacked sequence-to-sequence model. We report on an extensive case study on neural machine translation (NMT) using our proposed method, experimenting with a variety of datasets. We empirically show that the translation quality of a model that recurrently stacks a single layer six times, despite its significantly fewer parameters, approaches that of a model that stacks six different layers. We also show how our method can benefit from a prevalent way of improving NMT, i.e., extending the training data with pseudo-parallel corpora generated by back-translation. We then analyze the effects of recurrently stacked layers by visualizing the attentions of models that use recurrently stacked layers and models that do not. Finally, we explore the limits of parameter sharing, where we share even the parameters between the encoder and decoder in addition to recurrently stacking layers.
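
The parameter-sharing idea is easy to see in miniature: reuse one layer's weights at every depth instead of allocating fresh weights per layer. The toy "layer" below (an affine map with a ReLU, at made-up dimensions) is an illustrative sketch, not an actual NMT encoder.

```python
import random

random.seed(0)
DIM, DEPTH = 4, 6  # toy hidden size and stack depth

def make_layer():
    """Allocate one layer's parameters: a DIM x DIM weight matrix and a bias."""
    W = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
    b = [0.0] * DIM
    return (W, b)

def apply_layer(layer, x):
    """Affine map followed by ReLU."""
    W, b = layer
    return [max(0.0, sum(W[i][j] * x[j] for j in range(DIM)) + b[i])
            for i in range(DIM)]

# Recurrent stacking: the SAME parameter object reused DEPTH times.
shared = [make_layer()] * DEPTH
# Conventional stacking: DEPTH independently parameterized layers.
distinct = [make_layer() for _ in range(DEPTH)]

# Forward pass through the shared stack works exactly like a normal stack.
x = [1.0, 0.5, -0.5, 0.2]
for layer in shared:
    x = apply_layer(layer, x)

# Unique parameter count: recurrent stacking stores DEPTH times fewer weights.
params_per_layer = DIM * DIM + DIM
print(len({id(l) for l in shared}) * params_per_layer, "vs",
      len({id(l) for l in distinct}) * params_per_layer)  # 20 vs 120
```

The forward computation (and hence inference cost) is identical in both cases; only the storage and the number of trainable parameters differ.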

