Ensemble and Deep Learning for Language-Independent Automatic Selection of Parallel Data

Algorithms ◽  
2019 ◽  
Vol 12 (1) ◽  
pp. 26 ◽  
Author(s):  
Despoina Mouratidis ◽  
Katia Kermanidis

Machine translation is used in many applications of everyday life. Due to the growing number of translated documents that need to be categorized as useful or not (for building a translation model), automated text categorization (classification) is a popular research field in machine learning. This kind of information can be quite helpful for machine translation. Our parallel corpora (English-Greek and English-Italian) are based on educational data, which are quite difficult to translate. We apply two state-of-the-art architectures, Random Forest (RF) and Deeplearning4j (DL4J), to our data (which comprise three translation outputs). To our knowledge, this is the first time that deep learning architectures have been applied to the automatic selection of parallel data. We also propose new string-based features that appear to be effective for the classifier, and we investigate whether an attribute selection method can be used to achieve better classification accuracy. Experimental results indicate an increase of up to 4% (compared to our previous work) using RF and rather satisfactory results using DL4J.
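As a rough illustration of the classification setup described above, the following sketch trains a Random Forest on simple string-based features of a (source, translation) pair. The features (length ratios, shared-token rate) and the toy data are illustrative assumptions, not the authors' actual feature set.

# Illustrative sketch: a Random Forest over simple string-based features
# of a (source, translation) pair. The features and toy data are
# assumptions for demonstration, not the paper's actual feature set.
from sklearn.ensemble import RandomForestClassifier

def string_features(source, translation):
    src_tokens, trg_tokens = source.split(), translation.split()
    shared = len(set(src_tokens) & set(trg_tokens))
    return [
        len(translation) / max(len(source), 1),     # character length ratio
        len(trg_tokens) / max(len(src_tokens), 1),  # token length ratio
        shared / max(len(trg_tokens), 1),           # shared-token rate
    ]

# toy labels: 1 = translation output worth keeping, 0 = not useful
pairs = [("the cat sits", "il gatto siede", 1),
         ("the cat sits", "testo del tutto sbagliato e molto troppo lungo", 0)]
X = [string_features(s, t) for s, t, _ in pairs]
y = [label for *_, label in pairs]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([string_features("a dog runs", "un cane corre")]))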

2020 ◽  
Vol 10 (11) ◽  
pp. 3904
Author(s):  
Van-Hai Vu ◽  
Quang-Phuoc Nguyen ◽  
Joon-Choul Shin ◽  
Cheol-Young Ock

Machine translation (MT) has recently attracted much research on various advanced techniques (i.e., statistical-based and deep learning-based) and achieved great results for popular languages. However, research involving low-resource languages such as Korean often suffers from a lack of openly available bilingual language resources. In this research, we built extensive, openly available parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora consisting of Korean-English and Korean-Vietnamese datasets. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes word-ambiguity issues in MT. To address this problem, we developed a powerful word-sense annotation system, named UTagger, based on a combination of sub-word conditional probability and knowledge-based methods. We applied UTagger to UPC and used these corpora to train both statistical and deep learning-based neural MT systems. The experimental results demonstrated that high-quality MT systems (in terms of Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores) can be built using UPC. Both UPC and UTagger are freely available for download and use.
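The abstract does not detail UTagger's internals; the sketch below only illustrates the general idea of sense selection by conditional statistics over neighboring sub-words, with made-up counts and a deliberately simplified scoring rule.

# Toy sketch of sense selection by co-occurrence statistics over
# neighboring sub-words. The counts, the Korean homograph "배"
# (pear vs. ship), and the scoring rule are all illustrative; they
# are not UTagger's actual design.
from collections import defaultdict

# counts[(neighboring_subword, sense)] as gathered from a
# sense-annotated corpus (made-up numbers here)
counts = defaultdict(int)
counts[("먹", "bae/pear")] = 5   # "배" near "먹-" (to eat) -> pear
counts[("타", "bae/ship")] = 7   # "배" near "타-" (to ride) -> ship

def pick_sense(neighbors, senses):
    # highest co-occurrence count wins; a real system would back off
    # to knowledge-based resources when statistics are sparse
    return max(senses, key=lambda s: sum(counts[(n, s)] for n in neighbors))

print(pick_sense(["타"], ["bae/pear", "bae/ship"]))  # -> bae/ship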


2020 ◽  
Vol 12 (1) ◽  
pp. 42-51
Author(s):  
Vishal Goyal ◽  
Ajit Kumar ◽  
Manpreet Singh Lehal

Comparable corpora serve as an alternative to parallel corpora for languages where parallel corpora are scarce. Models trained on comparable corpora are less efficient than those trained on parallel corpora, but they compensate considerably for the scarcity of parallel data in machine translation. In this article, the authors explore Wikipedia as a potential source and delineate the process of aligning documents, which are then used for the extraction of parallel data. The parallel data thus extracted helps to enhance the performance of statistical machine translation.
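A minimal sketch of one plausible first step, pairing Wikipedia articles across languages through interlanguage links via the public MediaWiki API. The target language code ("pa") and the overall flow are illustrative assumptions; the article's own alignment pipeline may differ.

# Hedged sketch: find the article linked to an English Wikipedia page
# in another language edition, using the MediaWiki langlinks API.
import requests

def aligned_title(en_title, lang="pa"):
    """Return the title of the article linked to en_title in `lang`, if any."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "titles": en_title, "prop": "langlinks",
                "lllang": lang, "format": "json"},
        timeout=10,
    ).json()
    for page in resp["query"]["pages"].values():
        for link in page.get("langlinks", []):
            return link["*"]
    return None  # no interlanguage link in that language

print(aligned_title("Machine translation"))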


2005 ◽  
Vol 31 (4) ◽  
pp. 477-504 ◽  
Author(s):  
Dragos Stefan Munteanu ◽  
Daniel Marcu

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
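A maximum entropy classifier over sentence pairs can be approximated with logistic regression; the sketch below uses two simplified features (token length ratio and bilingual-dictionary coverage) as stand-ins for the paper's richer feature set, with toy data throughout.

# Sketch of parallel-sentence detection: a maximum entropy classifier
# (logistic regression) over features of a candidate sentence pair.
from sklearn.linear_model import LogisticRegression

bilingual_dict = {"cat": "gato", "dog": "perro", "house": "casa"}

def pair_features(src, trg):
    src_tokens, trg_tokens = src.split(), trg.split()
    covered = sum(1 for w in src_tokens if bilingual_dict.get(w) in trg_tokens)
    return [len(trg_tokens) / max(len(src_tokens), 1),  # length ratio
            covered / max(len(src_tokens), 1)]          # dictionary coverage

X = [pair_features("the cat", "el gato"),          # parallel pair
     pair_features("the dog", "una casa grande")]  # non-parallel pair
y = [1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict([pair_features("the house", "la casa")]))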


2019 ◽  
Vol 9 (10) ◽  
pp. 2036
Author(s):  
Jinyi Zhang ◽  
Tadahiro Matsumoto

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the training data size. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method with two variations: one for all language pairs, and the other specifically for the Chinese-Japanese language pair. The method uses both source and target sentences of the existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks, as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with "shared Chinese character rates" in segments of the sentence pairs. The experimental results of Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply the code (see Supplementary Materials) that can reproduce our proposed method.
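The three-step recipe maps naturally onto code. The sketch below generates pseudo-parallel pairs by splitting at punctuation and splicing back-translated target segments into the source; back_translate is a placeholder for a real target-to-source MT system, and real split points would come from word alignments rather than a naive regex.

# Hedged sketch of the split / back-translate / replace augmentation idea.
import re

def back_translate(segment):
    # placeholder: in practice this calls a trained target->source model
    return f"<bt:{segment}>"

def augment(src, trg, split_re=r"[,;]"):
    src_parts = re.split(split_re, src)
    trg_parts = re.split(split_re, trg)
    if len(src_parts) != len(trg_parts):
        return []  # split points must align (decided by word alignments in the paper)
    pseudo_pairs = []
    for i, trg_part in enumerate(trg_parts):
        new_src = src_parts.copy()
        new_src[i] = back_translate(trg_part.strip())  # splice in back-translation
        pseudo_pairs.append((", ".join(p.strip() for p in new_src), trg))
    return pseudo_pairs

for pair in augment("he came home, she left", "er kam heim, sie ging"):
    print(pair)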


Author(s):  
Chang Xu ◽  
Tao Qin ◽  
Gang Wang ◽  
Tie-Yan Liu

Neural machine translation (NMT) has achieved great success. However, collecting large-scale parallel data for training is costly and laborious. Recently, unsupervised neural machine translation has attracted increasing attention because it requires only monolingual corpora, which are common and easy to obtain, and because of its great potential for low-resource or even zero-resource machine translation. In this work, we propose a general framework called Polygon-Net, which leverages multiple auxiliary languages to jointly boost unsupervised neural machine translation models. Specifically, we design a novel loss function for multi-language unsupervised neural machine translation. In addition, unlike prior work that updates only one or two models individually, Polygon-Net for the first time enables multiple unsupervised models in the framework to update in turn and enhance each other. In this way, multiple unsupervised translation models are trained jointly to achieve better performance. Experiments on benchmark datasets, including the UN Corpus and WMT, show that our approach significantly improves over two-language-based methods and achieves better performance as more languages are introduced into the framework.
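A schematic sketch of the round-robin training scheme described above: one model per language pair, each updated in turn against a joint objective that sums over all pairs. The loss values are numeric placeholders, not the paper's actual loss function.

# Schematic only: round-robin updates over all language-pair models.
languages = ["en", "fr", "de"]
pairs = [(a, b) for a in languages for b in languages if a != b]

def pair_loss(src, trg, step):
    # placeholder: a real objective combines denoising autoencoding and
    # back-translation terms for the (src, trg) model, plus signal
    # contributed by the auxiliary languages
    return 1.0 / (step + 1)

for step in range(3):
    # the joint loss sums over all pairs, so each model's progress
    # feeds back into the others as training alternates between them
    joint = sum(pair_loss(a, b, step) for a, b in pairs)
    for a, b in pairs:
        print(f"step {step}: update {a}->{b} (joint loss {joint:.2f})")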


Algorithms ◽  
2020 ◽  
Vol 13 (3) ◽  
pp. 61 ◽  
Author(s):  
Konstantinos Demertzis ◽  
Lazaros Iliadis

Deep learning architectures are the most effective methods for analyzing and classifying Ultra-Spectral Images (USI). However, effectively training a Deep Learning (DL) gradient classifier to achieve high classification accuracy is extremely costly and time-consuming: it requires huge datasets with hundreds or thousands of specimens labeled by expert scientists. This research exploits the MAML++ algorithm to introduce the Model-Agnostic Meta-Ensemble Zero-shot Learning (MAME-ZsL) approach. MAME-ZsL overcomes the above difficulties and can be used as a powerful model to perform Hyperspectral Image Analysis (HIA). It is a novel optimization-based Meta-Ensemble Learning architecture following a Zero-shot Learning (ZsL) prototype; to the best of our knowledge, it is introduced to the literature for the first time. It facilitates the learning of specialized techniques for extracting user-mediated representations in complex Deep Learning architectures. Moreover, it leverages first- and second-order derivatives as pre-training methods, enhancing the learning of features that do not cause exploding or vanishing gradients and thus avoiding potential overfitting. It also significantly reduces computational cost and training time, and offers improved training stability, high generalization performance, and remarkable classification accuracy.
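MAME-ZsL's exact training procedure is not spelled out in the abstract; the sketch below only illustrates the MAML-style inner/outer optimization loop that MAML++ builds on, using a 1-D quadratic as a stand-in for a real hyperspectral classifier and a first-order meta-gradient for simplicity.

# Minimal first-order MAML-style sketch: an inner gradient step per
# task, then an outer update of the shared initialization.
def loss(w, task_target):
    return (w - task_target) ** 2

def grad(w, task_target):
    return 2 * (w - task_target)

w = 0.0                      # shared initialization (meta-parameters)
inner_lr, outer_lr = 0.1, 0.05
tasks = [1.0, 2.0, 3.0]      # toy "tasks" with different optima

for epoch in range(100):
    outer_grad = 0.0
    for t in tasks:
        w_task = w - inner_lr * grad(w, t)   # inner adaptation step
        outer_grad += grad(w_task, t)        # first-order meta-gradient
    w -= outer_lr * outer_grad / len(tasks)  # outer (meta) update

print(round(w, 3))  # ends near the mean of the task optima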

