Ensemble and Deep Learning for Language-Independent Automatic Selection of Parallel Data

Algorithms ◽  
2019 ◽  
Vol 12 (1) ◽  
pp. 26 ◽  
Author(s):  
Despoina Mouratidis ◽  
Katia Kermanidis

Machine translation is used in many applications of everyday life. Due to the growing number of translated documents that need to be categorized as useful or not (for building a translation model), automated text categorization (classification) is a popular research field in machine learning. This kind of information can be quite helpful for machine translation. Our parallel corpora (English-Greek and English-Italian) are based on educational data, which are quite difficult to translate. We apply two state-of-the-art architectures, Random Forest (RF) and Deeplearning4j (DL4J), to our data (which comprise three translation outputs). To our knowledge, this is the first time that deep learning architectures have been applied to the automatic selection of parallel data. We also propose new string-based features that appear to be effective for the classifier, and we investigate whether an attribute selection method can be used to achieve better classification accuracy. Experimental results indicate an increase of up to 4% (compared to our previous work) using RF and rather satisfactory results using DL4J.
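As a rough illustration of the classification setup described above, the following sketch trains a Random Forest on simple string-based features of a (source, translation) pair. The features (length ratios, shared-token rate) and the toy data are illustrative assumptions, not the authors' actual feature set.

# Illustrative sketch: a Random Forest over simple string-based features
# of a (source, translation) pair. The features and toy data are
# assumptions for demonstration, not the paper's actual feature set.
from sklearn.ensemble import RandomForestClassifier

def string_features(source, translation):
    src_tokens, trg_tokens = source.split(), translation.split()
    shared = len(set(src_tokens) & set(trg_tokens))
    return [
        len(translation) / max(len(source), 1),     # character length ratio
        len(trg_tokens) / max(len(src_tokens), 1),  # token length ratio
        shared / max(len(trg_tokens), 1),           # shared-token rate
    ]

# toy labels: 1 = translation output worth keeping, 0 = not useful
pairs = [("the cat sits", "il gatto siede", 1),
         ("the cat sits", "testo del tutto sbagliato e molto troppo lungo", 0)]
X = [string_features(s, t) for s, t, _ in pairs]
y = [label for *_, label in pairs]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([string_features("a dog runs", "un cane corre")]))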

2020 ◽  
Vol 10 (11) ◽  
pp. 3904
Author(s):  
Van-Hai Vu ◽  
Quang-Phuoc Nguyen ◽  
Joon-Choul Shin ◽  
Cheol-Young Ock

Machine translation (MT) has recently attracted much research on various advanced techniques (i.e., statistical-based and deep learning-based) and achieved great results for popular languages. However, research involving low-resource languages such as Korean often suffers from a lack of openly available bilingual language resources. In this research, we built extensive, openly available parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora consisting of Korean-English and Korean-Vietnamese datasets. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes word-ambiguity issues in MT. To address this problem, we developed a powerful word-sense annotation system, named UTagger, based on a combination of sub-word conditional probability and knowledge-based methods. We applied UTagger to UPC and used these corpora to train both statistical and deep learning-based neural MT systems. The experimental results demonstrated that high-quality MT systems (in terms of Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores) can be built using UPC. Both UPC and UTagger are freely available for download and use.
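The abstract does not detail UTagger's internals; the sketch below only illustrates the general idea of sense selection by conditional statistics over neighboring sub-words, with made-up counts and a deliberately simplified scoring rule.

# Toy sketch of sense selection by co-occurrence statistics over
# neighboring sub-words. The counts, the Korean homograph "배"
# (pear vs. ship), and the scoring rule are all illustrative; they
# are not UTagger's actual design.
from collections import defaultdict

# counts[(neighboring_subword, sense)] as gathered from a
# sense-annotated corpus (made-up numbers here)
counts = defaultdict(int)
counts[("먹", "bae/pear")] = 5   # "배" near "먹-" (to eat) -> pear
counts[("타", "bae/ship")] = 7   # "배" near "타-" (to ride) -> ship

def pick_sense(neighbors, senses):
    # highest co-occurrence count wins; a real system would back off
    # to knowledge-based resources when statistics are sparse
    return max(senses, key=lambda s: sum(counts[(n, s)] for n in neighbors))

print(pick_sense(["타"], ["bae/pear", "bae/ship"]))  # -> bae/ship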


2020 ◽  
Vol 12 (1) ◽  
pp. 42-51
Author(s):  
Vishal Goyal ◽  
Ajit Kumar ◽  
Manpreet Singh Lehal

Comparable corpora serve as an alternative to parallel corpora for languages where parallel corpora are scarce. Models trained on comparable corpora are less efficient than those trained on parallel corpora, but they compensate considerably for the scarcity of parallel data in machine translation. In this article, the authors explore Wikipedia as a potential source and delineate the process of aligning documents, which are then used for the extraction of parallel data. The parallel data thus extracted helps to enhance the performance of statistical machine translation.
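A minimal sketch of one plausible first step, pairing Wikipedia articles across languages through interlanguage links via the public MediaWiki API. The target language code ("pa") and the overall flow are illustrative assumptions; the article's own alignment pipeline may differ.

# Hedged sketch: find the article linked to an English Wikipedia page
# in another language edition, using the MediaWiki langlinks API.
import requests

def aligned_title(en_title, lang="pa"):
    """Return the title of the article linked to en_title in `lang`, if any."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "titles": en_title, "prop": "langlinks",
                "lllang": lang, "format": "json"},
        timeout=10,
    ).json()
    for page in resp["query"]["pages"].values():
        for link in page.get("langlinks", []):
            return link["*"]
    return None  # no interlanguage link in that language

print(aligned_title("Machine translation"))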


2005 ◽  
Vol 31 (4) ◽  
pp. 477-504 ◽  
Author(s):  
Dragos Stefan Munteanu ◽  
Daniel Marcu

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
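A maximum entropy classifier over sentence pairs can be approximated with logistic regression; the sketch below uses two simplified features (token length ratio and bilingual-dictionary coverage) as stand-ins for the paper's richer feature set, with toy data throughout.

# Sketch of parallel-sentence detection: a maximum entropy classifier
# (logistic regression) over features of a candidate sentence pair.
from sklearn.linear_model import LogisticRegression

bilingual_dict = {"cat": "gato", "dog": "perro", "house": "casa"}

def pair_features(src, trg):
    src_tokens, trg_tokens = src.split(), trg.split()
    covered = sum(1 for w in src_tokens if bilingual_dict.get(w) in trg_tokens)
    return [len(trg_tokens) / max(len(src_tokens), 1),  # length ratio
            covered / max(len(src_tokens), 1)]          # dictionary coverage

X = [pair_features("the cat", "el gato"),          # parallel pair
     pair_features("the dog", "una casa grande")]  # non-parallel pair
y = [1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict([pair_features("the house", "la casa")]))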


2019 ◽  
Vol 9 (10) ◽  
pp. 2036
Author(s):  
Jinyi Zhang ◽  
Tadahiro Matsumoto

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the training data size. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method with two variations: one for all language pairs, and the other specifically for the Chinese-Japanese language pair. The method uses both source and target sentences of the existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks, as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with "shared Chinese character rates" in segments of the sentence pairs. The experimental results of Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply the code (see Supplementary Materials) that can reproduce our proposed method.
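The three-step recipe maps naturally onto code. The sketch below generates pseudo-parallel pairs by splitting at punctuation and splicing back-translated target segments into the source; back_translate is a placeholder for a real target-to-source MT system, and real split points would come from word alignments rather than a naive regex.

# Hedged sketch of the split / back-translate / replace augmentation idea.
import re

def back_translate(segment):
    # placeholder: in practice this calls a trained target->source model
    return f"<bt:{segment}>"

def augment(src, trg, split_re=r"[,;]"):
    src_parts = re.split(split_re, src)
    trg_parts = re.split(split_re, trg)
    if len(src_parts) != len(trg_parts):
        return []  # split points must align (decided by word alignments in the paper)
    pseudo_pairs = []
    for i, trg_part in enumerate(trg_parts):
        new_src = src_parts.copy()
        new_src[i] = back_translate(trg_part.strip())  # splice in back-translation
        pseudo_pairs.append((", ".join(p.strip() for p in new_src), trg))
    return pseudo_pairs

for pair in augment("he came home, she left", "er kam heim, sie ging"):
    print(pair)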


Author(s):  
Chang Xu ◽  
Tao Qin ◽  
Gang Wang ◽  
Tie-Yan Liu

Neural machine translation (NMT) has achieved great success. However, collecting large-scale parallel data for training is costly and laborious. Recently, unsupervised neural machine translation has attracted increasing attention because it requires only monolingual corpora, which are common and easy to obtain, and because of its great potential for low-resource or even zero-resource machine translation. In this work, we propose a general framework called Polygon-Net, which leverages multiple auxiliary languages to jointly boost unsupervised neural machine translation models. Specifically, we design a novel loss function for multi-language unsupervised neural machine translation. In addition, unlike prior work that updates only one or two models individually, Polygon-Net for the first time enables multiple unsupervised models in the framework to update in turn and enhance each other. In this way, multiple unsupervised translation models are trained jointly to achieve better performance. Experiments on benchmark datasets, including the UN Corpus and WMT, show that our approach significantly improves over two-language-based methods and achieves better performance as more languages are introduced into the framework.
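A schematic sketch of the round-robin training scheme described above: one model per language pair, each updated in turn against a joint objective that sums over all pairs. The loss values are numeric placeholders, not the paper's actual loss function.

# Schematic only: round-robin updates over all language-pair models.
languages = ["en", "fr", "de"]
pairs = [(a, b) for a in languages for b in languages if a != b]

def pair_loss(src, trg, step):
    # placeholder: a real objective combines denoising autoencoding and
    # back-translation terms for the (src, trg) model, plus signal
    # contributed by the auxiliary languages
    return 1.0 / (step + 1)

for step in range(3):
    # the joint loss sums over all pairs, so each model's progress
    # feeds back into the others as training alternates between them
    joint = sum(pair_loss(a, b, step) for a, b in pairs)
    for a, b in pairs:
        print(f"step {step}: update {a}->{b} (joint loss {joint:.2f})")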


Algorithms ◽  
2020 ◽  
Vol 13 (3) ◽  
pp. 61 ◽  
Author(s):  
Konstantinos Demertzis ◽  
Lazaros Iliadis

Deep learning architectures are the most effective methods for analyzing and classifying Ultra-Spectral Images (USI). However, effectively training a Deep Learning (DL) gradient classifier to achieve high classification accuracy is extremely costly and time-consuming: it requires huge datasets with hundreds or thousands of specimens labeled by expert scientists. This research exploits the MAML++ algorithm to introduce the Model-Agnostic Meta-Ensemble Zero-shot Learning (MAME-ZsL) approach. MAME-ZsL overcomes the above difficulties and can be used as a powerful model to perform Hyperspectral Image Analysis (HIA). It is a novel optimization-based Meta-Ensemble Learning architecture following a Zero-shot Learning (ZsL) prototype; to the best of our knowledge, it is introduced to the literature for the first time. It facilitates the learning of specialized techniques for extracting user-mediated representations in complex Deep Learning architectures. Moreover, it leverages first- and second-order derivatives as pre-training methods, enhancing the learning of features that do not cause exploding or vanishing gradients and thus avoiding potential overfitting. It also significantly reduces computational cost and training time, and offers improved training stability, high generalization performance, and remarkable classification accuracy.
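MAME-ZsL's exact training procedure is not spelled out in the abstract; the sketch below only illustrates the MAML-style inner/outer optimization loop that MAML++ builds on, using a 1-D quadratic as a stand-in for a real hyperspectral classifier and a first-order meta-gradient for simplicity.

# Minimal first-order MAML-style sketch: an inner gradient step per
# task, then an outer update of the shared initialization.
def loss(w, task_target):
    return (w - task_target) ** 2

def grad(w, task_target):
    return 2 * (w - task_target)

w = 0.0                      # shared initialization (meta-parameters)
inner_lr, outer_lr = 0.1, 0.05
tasks = [1.0, 2.0, 3.0]      # toy "tasks" with different optima

for epoch in range(100):
    outer_grad = 0.0
    for t in tasks:
        w_task = w - inner_lr * grad(w, t)   # inner adaptation step
        outer_grad += grad(w_task, t)        # first-order meta-gradient
    w -= outer_lr * outer_grad / len(tasks)  # outer (meta) update

print(round(w, 3))  # ends near the mean of the task optima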

