Improve Example-Based Machine Translation Quality for Low-Resource Language Using Ontology

Improving Thai-Lao neural machine translation with similarity lexicon

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-212236 ◽

2021 ◽

pp. 1-10

Author(s):

Zhiqiang Yu ◽

Yuxin Huang ◽

Junjun Guo

Keyword(s):

Machine Translation ◽

Semantic Information ◽

Neural Machine Translation ◽

Low Resource ◽

Translation Quality ◽

Decoder Architecture ◽

Baseline System ◽

Input Sentence ◽

Resource Conditions ◽

Language Pair

It has been shown that the performance of neural machine translation (NMT) drops starkly in low-resource conditions. Thai-Lao is a typical low-resource language pair with a tiny parallel corpus, which leads to suboptimal NMT performance. However, Thai and Lao have considerable similarities in linguistic morphology, and a bilingual lexicon for the pair is relatively easy to obtain. To use this feature, we first build a bilingual similarity lexicon composed of pairs of similar words. Then we propose a novel NMT architecture to leverage the similarity between Thai and Lao. Specifically, besides the prevailing sentence encoder, we introduce an extra similarity lexicon encoder into the conventional encoder-decoder architecture, by which the semantic information carried by the similarity lexicon can be represented. We further provide a simple mechanism in the decoder to balance the information representations delivered from the input sentence and the similarity lexicon. Our approach can fully exploit the linguistic similarity carried by the similarity lexicon to improve translation quality. Experimental results demonstrate that our approach achieves significant improvements over the state-of-the-art Transformer baseline system and previous similar works.
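The abstract does not say how the decoder balances the two representations. One common way to do this kind of balancing is a learned scalar gate, sketched below purely as an illustration. The function name `gated_fusion`, the scalar gate, and the weight layout are all assumptions and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Logistic function mapping any real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    sentence_ctx: Sequence[float],
    lexicon_ctx: Sequence[float],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> Tuple[float, List[float]]:
    """Blend two context vectors with a scalar gate g (hypothetical design).

    g is computed from the concatenation of both vectors; the fused vector is
    g * sentence_ctx + (1 - g) * lexicon_ctx, element by element.
    """
    if len(sentence_ctx) != len(lexicon_ctx):
        raise ValueError("context vectors must have the same length")
    concat = list(sentence_ctx) + list(lexicon_ctx)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    g = sigmoid(score)
    fused = [g * s + (1.0 - g) * l for s, l in zip(sentence_ctx, lexicon_ctx)]
    return g, fused


# With all-zero gate weights the gate is 0.5, so the result is a plain average.
g, fused = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```

In a trained model the gate weights would be learned jointly with the rest of the network, so the decoder could lean on the lexicon representation more for some inputs than for others.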

Improve Example-Based Machine Translation Quality for Low-Resource Language Using Ontology

International Journal of Networked and Distributed Computing ◽

10.2991/ijndc.2017.5.3.6 ◽

2017 ◽

Vol 5 (3) ◽

pp. 176 ◽

Cited By ~ 1

Author(s):

Khan Md Anwarus Salam ◽

Setsuo Yamada ◽

Nishino Tetsuro

Keyword(s):

Machine Translation ◽

Low Resource ◽

Translation Quality

Augmenting Neural Machine Translation through Round-Trip Training Approach

Open Computer Science ◽

10.1515/comp-2019-0019 ◽

2019 ◽

Vol 9 (1) ◽

pp. 268-278 ◽

Cited By ~ 1

Author(s):

Benyamin Ahmadnia ◽

Bonnie J. Dorr

Keyword(s):

Machine Translation ◽

Training Data ◽

Training Dataset ◽

Round Trip ◽

Neural Machine Translation ◽

Low Resource ◽

Translation Quality ◽

High Resource ◽

Training Approach ◽

Language Pair

The quality of Neural Machine Translation (NMT), as a data-driven approach, depends heavily on the quantity, quality, and relevance of the training dataset. Such approaches have achieved promising results in bilingually high-resource scenarios but are inadequate for low-resource conditions. Generally, NMT systems learn from millions of words of bilingual training data. However, the human labeling process is very costly and time-consuming. In this paper, we describe a round-trip training approach to bilingual low-resource NMT that takes advantage of monolingual datasets to address the training data bottleneck, thus improving translation quality. We conduct detailed experiments on English-Spanish as a high-resource language pair as well as Persian-Spanish as a low-resource language pair. Experimental results show that this competitive approach outperforms the baseline systems and improves translation quality.

Terminology Translation in Low-Resource Scenarios

Information ◽

10.3390/info10090273 ◽

2019 ◽

Vol 10 (9) ◽

pp. 273

Author(s):

Rejwanul Haque ◽

Mohammed Hasanuzzaman ◽

Andy Way

Keyword(s):

Machine Translation ◽

End Users ◽

Training Data ◽

Classification Task ◽

Domain Experts ◽

Low Resource ◽

Translation Quality ◽

Classification Framework ◽

Industrial Setting ◽

Language Pair

Evaluating term translation quality in machine translation (MT), which is usually done by domain experts, is a time-consuming and expensive task. In fact, this is unimaginable in an industrial setting where customised MT systems often need to be updated for many reasons (e.g., availability of new training data, leading MT techniques). To the best of our knowledge, there is as yet no publicly available solution for evaluating terminology translation in MT automatically. Hence, there is a genuine need for a faster and less expensive solution to this problem, which could help end-users to identify term translation problems in MT instantly. This study presents a faster and less expensive strategy for evaluating terminology translation in MT. High correlations of our evaluation results with human judgements demonstrate the effectiveness of the proposed solution. The paper also introduces a classification framework, TermCat, that can automatically classify term translation-related errors and expose specific problems in relation to terminology translation in MT. We carried out our experiments with a low-resource language pair, English–Hindi, and found that our classifier, whose accuracy varies across the translation directions, error classes, the morphological nature of the languages, and MT models, generally performs competently in the terminology translation classification task.

Extremely low-resource neural machine translation for Asian languages

Machine Translation ◽

10.1007/s10590-020-09258-6 ◽

2020 ◽

Vol 34 (4) ◽

pp. 347-382

Author(s):

Raphael Rubino ◽

Benjamin Marie ◽

Raj Dabre ◽

Atsushi Fujita ◽

Masao Utiyama ◽

...

Keyword(s):

Machine Translation ◽

Data Augmentation ◽

Statistical Machine Translation ◽

Synthetic Data ◽

Parameter Tuning ◽

Data Generation ◽

Neural Machine Translation ◽

Low Resource ◽

Translation Quality ◽

Asian Languages

This paper presents a set of effective approaches to handle extremely low-resource language pairs for self-attention based neural machine translation (NMT), focusing on English and four Asian languages. Starting from an initial set of parallel sentences used to train bilingual baseline models, we introduce additional monolingual corpora and data processing techniques to improve translation quality. We describe a series of best practices and empirically validate the methods through an evaluation conducted on eight translation directions, based on state-of-the-art NMT approaches such as hyper-parameter search, data augmentation with forward and backward translation in combination with tags and noise, as well as joint multilingual training. Experiments show that the commonly used default architecture of self-attention NMT models does not reach the best results, validating previous work on the importance of hyper-parameter tuning. Additionally, empirical results indicate the amount of synthetic data required to efficiently increase the parameters of the models leading to the best translation quality measured by automatic metrics. We show that the best NMT models trained on a large amount of tagged back-translations outperform three other synthetic data generation approaches. Finally, comparison with statistical machine translation (SMT) indicates that extremely low-resource NMT requires a large amount of synthetic parallel data obtained with back-translation in order to close the performance gap with the preceding SMT approach.
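As a rough illustration of the tagged back-translation mentioned in the abstract, the sketch below builds a training set in which synthetic source sentences carry a marker token. The tag string `<BT>`, the function name `build_training_pairs`, and the `back_translate` callable (standing in for a trained target-to-source model) are assumptions for illustration, not details from the paper.

```python
from typing import Callable, List, Sequence, Tuple

BT_TAG = "<BT>"  # hypothetical marker token; the paper's actual tag is not given here


def build_training_pairs(
    parallel: Sequence[Tuple[str, str]],
    target_monolingual: Sequence[str],
    back_translate: Callable[[str], str],
    tag: str = BT_TAG,
) -> List[Tuple[str, str]]:
    """Combine genuine (source, target) pairs with tagged synthetic pairs.

    Each monolingual target sentence is translated back into the source
    language, and the synthetic source side is prefixed with `tag` so the
    model can tell synthetic input from genuine input.
    """
    synthetic = [(f"{tag} {back_translate(t)}", t) for t in target_monolingual]
    return list(parallel) + synthetic


# Toy usage: str.upper stands in for a real target-to-source translation model.
pairs = build_training_pairs(
    parallel=[("hello", "bonjour")],
    target_monolingual=["merci"],
    back_translate=str.upper,
)
```

The tag is added only to synthetic source sentences, so at inference time untagged input is treated the same way as the genuine parallel data.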

Keeping Models Consistent between Pretraining and Translation for Low-Resource Neural Machine Translation

Future Internet ◽

10.3390/fi12120215 ◽

2020 ◽

Vol 12 (12) ◽

pp. 215

Author(s):

Wenbo Zhang ◽

Xiao Li ◽

Yating Yang ◽

Rui Dong ◽

Gongxu Luo

Keyword(s):

Machine Translation ◽

Language Model ◽

Neural Machine Translation ◽

Translation Model ◽

Parallel Corpus ◽

Model Experiments ◽

Low Resource ◽

Translation Quality ◽

Number Of Layers ◽

Cross Lingual

Recently, the pretraining of models has been successfully applied to unsupervised and semi-supervised neural machine translation. A cross-lingual language model uses a pretrained masked language model to initialize the encoder and decoder of the translation model, which greatly improves the translation quality. However, because of a mismatch in the number of layers, the pretrained model can only initialize part of the decoder’s parameters. In this paper, we use a layer-wise coordination transformer and a consistent pretraining translation transformer instead of a vanilla transformer as the translation model. The former has only an encoder, and the latter has an encoder and a decoder, but the encoder and decoder have exactly the same parameters. Both models can guarantee that all parameters in the translation model can be initialized by the pretrained model. Experiments on the Chinese–English and English–German datasets show that compared with the vanilla transformer baseline, our models achieve better performance with fewer parameters when the parallel corpus is small.

Neural machine translation with a polysynthetic low resource language

Machine Translation ◽

10.1007/s10590-020-09255-9 ◽

2020 ◽

Vol 34 (4) ◽

pp. 325-346

Author(s):

John E. Ortega ◽

Richard Castro Mamani ◽

Kyunghyun Cho

Keyword(s):

Machine Translation ◽

Neural Machine Translation ◽

Low Resource

Introduction to the Special Issue on Machine Translation for Low-Resource Languages

Machine Translation ◽

10.1007/s10590-020-09256-8 ◽

2021 ◽

Author(s):

Chao-Hong Liu ◽

Alina Karakanta ◽

Audrey N. Tong ◽

Oleg Aulov ◽

Ian M. Soboroff ◽

...

Keyword(s):

Machine Translation ◽

Special Issue ◽

Low Resource

Context-Aware Neural Machine Translation for Korean Honorific Expressions

Electronics ◽

10.3390/electronics10131589 ◽

2021 ◽

Vol 10 (13) ◽

pp. 1589

Author(s):

Yongkeun Hwang ◽

Yanghoon Kim ◽

Kyomin Jung

Keyword(s):

Machine Translation ◽

Deep Neural Networks ◽

Contextual Information ◽

Context Aware ◽

Neural Machine Translation ◽

Translation Quality ◽

Sentence Level ◽

Proposed Model

Neural machine translation (NMT) is a text generation task that has improved significantly with the rise of deep neural networks. However, language-specific problems such as the translation of honorifics have received little attention. In this paper, we propose a context-aware NMT to improve the translation of Korean honorifics. By exploiting information from the surrounding sentences, such as the relationship between speakers, our proposed model effectively manages the use of honorific expressions. Specifically, we utilize a novel encoder architecture that can represent the contextual information of the given input sentences. Furthermore, a context-aware post-editing (CAPE) technique is adopted to refine a set of inconsistent sentence-level honorific translations. To demonstrate the efficacy of the proposed method, honorific-labeled test data is required. Thus, we also design a heuristic that labels Korean sentences to distinguish between honorific and non-honorific styles. Experimental results show that our proposed method outperforms sentence-level NMT baselines both in overall translation quality and in honorific translations.

A Survey on Document-level Neural Machine Translation

ACM Computing Surveys ◽

10.1145/3441691 ◽

2021 ◽

Vol 54 (2) ◽

pp. 1-36

Author(s):

Sameen Maruf ◽

Fahimeh Saleh ◽

Gholamreza Haffari

Keyword(s):

Machine Translation ◽

Language Processing ◽

Research Field ◽

Translation Process ◽

Future Directions ◽

Translation Quality ◽

Current State ◽

Evaluation Strategies ◽

Document Level

Machine translation (MT) is an important task in natural language processing (NLP), as it automates the translation process and reduces the reliance on human translators. With the resurgence of neural networks, translation quality now surpasses that of translations obtained using statistical techniques for most language pairs. Up until a few years ago, almost all neural translation models translated sentences independently, without incorporating the wider document context and the inter-dependencies among sentences. The aim of this survey article is to highlight the major works that have been undertaken in the space of document-level machine translation after the neural revolution, so that researchers can recognize the current state and future directions of this field. We provide an organization of the literature based on novelties in modelling and architectures as well as training and decoding strategies. In addition, we cover evaluation strategies that have been introduced to account for the improvements in document MT, including automatic metrics and discourse-targeted test sets. We conclude by presenting possible avenues for future exploration in this research field.
