Tackling neural machine translation in low-resource settings: a Portuguese case study

2021 · Author(s): Arthur T. Estrella, João B. O. Souza Filho

Neural machine translation (NMT) nowadays requires an increasing amount of data and computational power, so succeeding in this task with limited data and a single GPU can be challenging. Strategies such as pre-trained word embeddings, subword embeddings, and data augmentation can potentially address some of the issues faced in low-resource experimental settings, but their impact on translation quality is unclear. This work evaluates some of these strategies in two low-resource experiments and goes beyond simply reporting BLEU: with the help of a translator, errors on the Portuguese-English pair are categorized according to semantic and syntactic aspects. The BPE subword approach proved to be the most effective solution, yielding a BLEU increase of 59% p.p. over the standard Transformer.
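
The subword segmentation step can be illustrated with a short sketch. Below is a minimal example of training and applying a BPE model with the sentencepiece library; the file names, vocabulary size, and example sentence are illustrative assumptions, not the authors' actual setup.

```python
# A minimal sketch of BPE subword segmentation with the sentencepiece library.
# File paths, vocabulary size, and the model prefix are illustrative assumptions.
import sentencepiece as spm

# Train a BPE model on the (hypothetical) Portuguese side of the training corpus.
spm.SentencePieceTrainer.train(
    input="train.pt.txt",    # assumed path to raw training text
    model_prefix="bpe_pt",   # assumed output prefix
    vocab_size=8000,         # illustrative size for a low-resource setting
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_pt.model")

# Rare words are split into frequent subword units, shrinking the OOV rate.
print(sp.encode("A tradução automática neural é desafiadora.", out_type=str))
```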

2021 · pp. 1-12 · Author(s): Sahinur Rahman Laskar, Abdullah Faiz Ur Rahman Khilji, Partha Pakray, Sivaji Bandyopadhyay

Language translation is essential to bring the world closer and plays a significant part in building a community among people of different linguistic backgrounds. Machine translation dramatically helps in removing the language barrier and allows easier communication among linguistically diverse communities. Owing to the unavailability of resources, the majority of the world's languages are regarded as low-resource languages, which makes automating translation among them, to the benefit of their native speakers, a challenging task. This article investigates neural machine translation for the English–Assamese resource-poor language pair by tackling the insufficient-data and out-of-vocabulary problems. We also propose a data augmentation-based NMT approach that exploits synthetic parallel data and significantly improves translation accuracy for English-to-Assamese and Assamese-to-English translation, obtaining state-of-the-art results.
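
A hedged sketch of how synthetic parallel data can be assembled for this kind of augmentation is given below; translate_as_to_en, the data structures, and the mixing ratio are hypothetical placeholders, not the authors' implementation.

```python
# A minimal sketch of data augmentation via synthetic parallel data (back-translation).
# translate_as_to_en is a placeholder for any Assamese-to-English model.
def build_synthetic_parallel(mono_assamese, translate_as_to_en):
    """Pair each authentic Assamese sentence with a machine-generated English source."""
    synthetic_pairs = []
    for as_sentence in mono_assamese:
        en_synthetic = translate_as_to_en(as_sentence)   # reverse-direction model
        # (synthetic source, authentic target) for training English-to-Assamese
        synthetic_pairs.append((en_synthetic, as_sentence))
    return synthetic_pairs

# The augmented corpus mixes authentic and synthetic pairs before training.
def augment(authentic_pairs, synthetic_pairs, synthetic_ratio=1.0):
    n = int(len(authentic_pairs) * synthetic_ratio)
    return authentic_pairs + synthetic_pairs[:n]
```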


2020 · Vol 2020 · pp. 1-11 · Author(s): Gong-Xu Luo, Ya-Ting Yang, Rui Dong, Yan-Hong Chen, Wen-Bo Zhang

Neural machine translation (NMT) for low-resource languages has drawn great attention in recent years. In this paper, we propose a joint back-translation and transfer learning method for low-resource languages. Data augmentation and transfer learning are both widely recognized as straightforward and effective remedies for low-resource problems; however, existing methods that use either of them alone limit the capacity of NMT models in such settings. To make full use of the advantages of both and further improve the translation performance of low-resource languages, we propose a new method that integrates back-translation with mainstream transfer learning architectures: the NMT model is initialized by transferring parameters from pretrained models, while synthetic parallel data is generated by translating large-scale target-side monolingual data, boosting the fluency of translations. We conduct experiments to explore the effectiveness of the joint method by incorporating back-translation into the parent-child and hierarchical transfer learning architectures. In addition, different preprocessing and training methods are explored to obtain better performance. Experimental results on Uygur-Chinese and Turkish-English translation demonstrate the superiority of the proposed method over baselines that use a single method.
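
The transfer-learning half of such a joint method can be sketched roughly as follows; the snippet assumes a PyTorch model and a parent checkpoint that is a plain state dict, and it is not the authors' exact parent-child procedure.

```python
# A minimal PyTorch-style sketch of parent-child transfer learning: the child
# (low-resource) model is initialized from a parent trained on a high-resource pair,
# then fine-tuned on authentic plus back-translated data. Names and paths are assumptions.
import torch

def init_child_from_parent(child_model, parent_checkpoint_path):
    parent_state = torch.load(parent_checkpoint_path, map_location="cpu")
    child_state = child_model.state_dict()
    transferred = {
        name: tensor
        for name, tensor in parent_state.items()
        if name in child_state and child_state[name].shape == tensor.shape
    }
    child_state.update(transferred)          # copy every compatible parameter
    child_model.load_state_dict(child_state)
    return child_model

# After initialization, the child is trained on the concatenation of the small
# authentic corpus and the synthetic corpus produced by back-translation.
```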


2019 · Vol 55 (2) · pp. 491-515 · Author(s): Krzysztof Jassem, Tomasz Dwojak

Abstract Neural Machine Translation (NMT) has recently achieved promising results for a number of translation pairs. Although the method requires larger volumes of data and more computational power than Statistical Machine Translation (SMT), it is believed to become dominant in the near future. In this paper we evaluate SMT and NMT models trained on a domain-specific English-Polish corpus of moderate size (1,200,000 segments). The experiment shows that both solutions significantly outperform a general-domain online translator. The SMT model achieves a slightly better BLEU score than the NMT model; on the other hand, decoding is noticeably faster with NMT. Human evaluation carried out on a sizeable sample of translations (2,000 pairs) reveals the superiority of the NMT approach, particularly in terms of output fluency.
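
As a rough illustration of the automatic side of such an evaluation, a BLEU comparison between the two systems can be computed with the sacrebleu library as sketched below; the file names are assumptions and the snippet is not tied to the authors' setup.

```python
# A minimal sketch of scoring SMT and NMT outputs against shared references with sacrebleu.
# The hypothesis and reference file names are illustrative assumptions.
import sacrebleu

with open("smt.hyp.pl") as f:
    smt_hyps = [line.strip() for line in f]
with open("nmt.hyp.pl") as f:
    nmt_hyps = [line.strip() for line in f]
with open("test.ref.pl") as f:
    refs = [line.strip() for line in f]

# corpus_bleu takes a list of hypothesis strings and a list of reference streams.
print("SMT BLEU:", sacrebleu.corpus_bleu(smt_hyps, [refs]).score)
print("NMT BLEU:", sacrebleu.corpus_bleu(nmt_hyps, [refs]).score)
```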


2020 · Vol 34 (4) · pp. 347-382 · Author(s): Raphael Rubino, Benjamin Marie, Raj Dabre, Atsushi Fujita, Masao Utiyama, ...

Abstract This paper presents a set of effective approaches to handle extremely low-resource language pairs for self-attention-based neural machine translation (NMT), focusing on English and four Asian languages. Starting from an initial set of parallel sentences used to train bilingual baseline models, we introduce additional monolingual corpora and data processing techniques to improve translation quality. We describe a series of best practices and empirically validate the methods through an evaluation conducted on eight translation directions, based on state-of-the-art NMT approaches such as hyper-parameter search, data augmentation with forward and backward translation in combination with tags and noise, as well as joint multilingual training. Experiments show that the commonly used default architecture of self-attention NMT models does not reach the best results, validating previous work on the importance of hyper-parameter tuning. Additionally, empirical results indicate the amount of synthetic data required to efficiently increase the model parameters while reaching the best translation quality as measured by automatic metrics. We show that the best NMT models, trained on a large amount of tagged back-translations, outperform three other synthetic data generation approaches. Finally, a comparison with statistical machine translation (SMT) indicates that extremely low-resource NMT requires a large amount of synthetic parallel data obtained with back-translation in order to close the performance gap with the preceding SMT approach.
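
One of the referenced techniques, back-translation combined with tags and noise, can be sketched roughly as follows; the tag token and the noise probabilities are illustrative assumptions rather than the paper's exact configuration.

```python
# A minimal sketch of tagged and noised back-translation: synthetic source sentences are
# marked with a reserved tag token and lightly perturbed before being mixed into training.
import random

BT_TAG = "<BT>"  # assumed reserved token marking back-translated sources

def tag_and_noise(synthetic_source, p_drop=0.1, p_swap=0.1):
    tokens = synthetic_source.split()
    # word dropout: randomly remove tokens (keep the sentence if everything is dropped)
    tokens = [t for t in tokens if random.random() > p_drop] or tokens
    # local noise: occasionally swap adjacent tokens
    for i in range(len(tokens) - 1):
        if random.random() < p_swap:
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    # the tag lets the model distinguish synthetic from authentic source sentences
    return " ".join([BT_TAG] + tokens)
```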


Information · 2020 · Vol 11 (5) · pp. 255 · Author(s): Yu Li, Xiao Li, Yating Yang, Rui Dong

One important issue that affects the performance of neural machine translation is the scale of available parallel data. For low-resource languages, the amount of parallel data is insufficient, which results in poor translation quality. In this paper, we propose a diversity data augmentation method that does not use extra monolingual data. We expand the training data by generating diverse pseudo-parallel data on both the source and target sides. To generate diverse data, a restricted sampling strategy is employed during decoding. Finally, we filter and merge the original data and the synthetic parallel corpus to train the final model. In the experiments, the proposed approach achieved an improvement of 1.96 BLEU points on the IWSLT2014 German–English translation task, which was used to simulate a low-resource language. Our approach also consistently and substantially obtained 1.0 to 2.0 BLEU improvements in three other low-resource translation tasks: English–Turkish, Nepali–English, and Sinhala–English.
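
A restricted sampling step of the kind mentioned above might look like the following PyTorch sketch; the top-k cutoff and temperature are illustrative assumptions, not the values used in the paper.

```python
# A minimal PyTorch sketch of restricted (top-k) sampling at one decoding step, the kind of
# strategy that yields diverse pseudo-parallel sentences instead of a single greedy output.
import torch
import torch.nn.functional as F

def restricted_sample(logits, k=10, temperature=1.0):
    """Sample the next token only from the k most probable candidates."""
    logits = logits / temperature
    topk_logits, topk_ids = torch.topk(logits, k)        # keep the k best tokens
    probs = F.softmax(topk_logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)     # sample within the restricted set
    return topk_ids.gather(-1, choice)
```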


Author(s): Jordi Armengol-Estapé, Marta R. Costa-jussà

Abstract Introducing factors such as linguistic features has long been proposed in machine translation as a way to improve the quality of translations. More recently, factored machine translation has proven to still be useful in the case of sequence-to-sequence systems. In this work, we investigate whether these gains hold for the state-of-the-art architecture in neural machine translation, the Transformer, instead of recurrent architectures. We propose a new model, the Factored Transformer, to introduce an arbitrary number of word features in the source sequence of an attentional system. Specifically, we suggest two variants depending on the level at which the features are injected, as well as two mechanisms for combining the word features with the words themselves. We experiment with both classical linguistic features and semantic features extracted from a linked-data database, on two low-resource datasets. With the best-found configuration, we show improvements of 0.8 BLEU over the baseline Transformer on the IWSLT German-to-English task. Moreover, on the more challenging FLoRes English-to-Nepali benchmark, which involves both a low-resource setting and very distant languages, we obtain an improvement of 1.2 BLEU. These improvements are achieved with the linguistic and not the semantic information.
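
The two combination mechanisms for word features and words can be illustrated with a minimal embedding module; the dimensions, vocabulary sizes, and the concatenation split below are assumptions, not the Factored Transformer's actual configuration.

```python
# A minimal PyTorch sketch of combining word embeddings with a linguistic-feature embedding
# (e.g., POS tags) before the encoder, either by summation or by concatenation.
import torch
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    def __init__(self, vocab_size=32000, feat_size=20, d_model=512, mode="sum"):
        super().__init__()
        self.mode = mode
        if mode == "sum":
            # feature embedding has the same width as the word embedding and is added to it
            self.word_emb = nn.Embedding(vocab_size, d_model)
            self.feat_emb = nn.Embedding(feat_size, d_model)
        else:  # "concat": the feature receives a small slice of the model dimension
            feat_dim = 64
            self.word_emb = nn.Embedding(vocab_size, d_model - feat_dim)
            self.feat_emb = nn.Embedding(feat_size, feat_dim)

    def forward(self, word_ids, feat_ids):
        w, f = self.word_emb(word_ids), self.feat_emb(feat_ids)
        return w + f if self.mode == "sum" else torch.cat([w, f], dim=-1)
```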


Author(s): Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, ...

Non-autoregressive translation (NAT) models, which remove the dependence on previous target tokens from the inputs of the decoder, achieve significant inference speedup but at the cost of inferior accuracy compared to autoregressive translation (AT) models. Previous work shows that the quality of the decoder inputs is important and largely impacts model accuracy. In this paper, we propose two methods to enhance the decoder inputs and thereby improve NAT models. The first directly leverages a phrase table generated by conventional SMT approaches to translate source tokens into target tokens, which are then fed into the decoder as inputs. The second transforms source-side word embeddings into target-side word embeddings through sentence-level alignment and word-level adversarial learning, and then feeds the transformed word embeddings into the decoder as inputs. Experimental results show our method outperforms the NAT baseline (Gu et al. 2017) by 5.11 BLEU points on the WMT14 English-German task and 4.72 BLEU points on the WMT16 English-Romanian task.
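
The first enhancement can be sketched at the word level as follows; the toy dictionary stands in for a real SMT phrase table and is purely illustrative, not the paper's implementation.

```python
# A minimal sketch of the first decoder-input enhancement: map each source token to its most
# probable target-side translation and feed the mapped tokens to the NAT decoder as inputs.
def phrase_table_decoder_inputs(source_tokens, phrase_table, unk="<unk>"):
    """Replace each source token with its target-side translation (or <unk> if missing)."""
    return [phrase_table.get(tok, unk) for tok in source_tokens]

# Example usage with a toy table: the mapped tokens (not the source tokens) become the
# decoder inputs, giving the non-autoregressive decoder target-side information up front.
table = {"das": "the", "Haus": "house", "ist": "is", "klein": "small"}
print(phrase_table_decoder_inputs(["das", "Haus", "ist", "klein"], table))
```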

