Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach

To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual web sites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitutes a genuine translation as a supervised regression problem. We trained our algorithm on a manually labeled dataset in the FR–NL language pair. Our algorithm used shallow features and features obtained after an initial translation step. We showed that the Levenshtein distance between the target and the translated source and the cosine distance between sentence embeddings of the source and the target were the two most important features for misalignment detection. Using gold standards for alignment, we demonstrated that our model can substantially increase the quality of alignments in a corpus, reaching a precision close to 100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
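The two features the abstract singles out can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the length normalization, and the use of plain Python lists as sentence embeddings are all assumptions made for illustration, and the abstract does not say which embedding model or regressor was used.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cosine_distance(u, v) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def pair_features(target, translated_source, src_vec, tgt_vec):
    """Hypothetical feature dict for one aligned sentence pair.

    Dividing by the longer string's length is an assumption, chosen so
    that long and short sentences are comparable.
    """
    longest = max(len(target), len(translated_source), 1)
    return {
        "norm_levenshtein": levenshtein(target, translated_source) / longest,
        "cosine_distance": cosine_distance(src_vec, tgt_vec),
    }
```

Features of this kind could then be passed to any supervised regressor trained on the manually labeled pairs.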


Unsupervised Quality Estimation for Neural Machine Translation

Transactions of the Association for Computational Linguistics, 2020, Vol 8, pp. 539-555. DOI: 10.1162/tacl_a_00330

Author(s): Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, ...

Keyword(s): Machine Translation, Real World, State Of The Art, Black Box, Test Time, Quality Estimation, Neural Machine Translation, Real World Applications, Unsupervised Approach

Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user about the quality of the MT output at test time. Existing approaches require large amounts of expert-annotated data, computation, and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Unlike most current work, which treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By utilizing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivaling state-of-the-art supervised QE models. To evaluate our approach, we collect the first dataset that enables work on both black-box and glass-box approaches to QE.
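As a rough illustration of what "information extracted from the MT system as a by-product of translation" might look like, the sketch below computes two simple uncertainty indicators from decoder output probabilities. The abstract does not list the exact indicators used, so the choice of mean token log-probability and mean softmax entropy, along with the function names and input layout, is an assumption.

```python
import math


def mean_log_prob(token_probs):
    """Average log-probability of the tokens the decoder actually emitted.

    token_probs: one probability per output token, each assumed to be > 0.
    Values closer to 0 suggest a more confident translation.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def mean_entropy(distributions):
    """Average entropy of the per-step output distributions.

    distributions: one probability vector over the vocabulary per decoding
    step. Higher entropy suggests the model was less certain at that step.
    """
    if not distributions:
        raise ValueError("distributions must not be empty")

    def entropy(dist):
        # Zero-probability entries contribute nothing to the entropy.
        return -sum(p * math.log(p) for p in dist if p > 0)

    return sum(entropy(d) for d in distributions) / len(distributions)
```

Either score needs no QE training data; evaluating it would mean correlating the scores with human quality judgments, as the abstract describes.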


An Improved English-to-Mizo Neural Machine Translation

ACM Transactions on Asian and Low-Resource Language Information Processing, 2021, Vol 20 (4), pp. 1-21. DOI: 10.1145/3445974

Author(s): Candy Lalrempuii, Badal Soni, Partha Pakray

Keyword(s): Machine Translation, Model Performance, Sentence Length, Prediction Errors, Indian Languages, Neural Machine Translation, Automatic Translation, Translation Techniques, Language Pair

Machine Translation is an effort to bridge language barriers and misinterpretations, making communication more convenient through the automatic translation of languages. The quality of translations produced by corpus-based approaches depends predominantly on the availability of a large parallel corpus. Although machine translation of many Indian languages has progressively gained attention, there is very limited research on machine translation and the challenges of using various machine translation techniques for a low-resource language such as Mizo. In this article, we have implemented and compared statistical approaches with modern neural approaches for the English–Mizo language pair. We have experimented with different tokenization methods, architectures, and configurations. The performance of translations predicted by the trained models has been evaluated using automatic and human evaluation measures. Furthermore, we have analyzed the prediction errors of the models and the quality of predictions based on variations in sentence length, and compared the model performance with the existing baselines.


Quality of neural machine translation for the Korean-Japanese language pair - the development of editing codes for machine translation -

Interpretation and Translation, 2018, Vol 20 (1), pp. 43-71. DOI: 10.20305/it201801043071

Author(s): JuRiAe Lee

Keyword(s): Machine Translation, Japanese Language, Neural Machine Translation, Language Pair


Is Neural Machine Translation the New State of the Art?

Prague Bulletin of Mathematical Linguistics, 2017, Vol 108 (1), pp. 109-120. DOI: 10.1515/pralin-2017-0013. Cited by ~37

Author(s): Sheila Castilho, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley, ...

Keyword(s): Machine Translation, State Of The Art, Evaluation Methods, Automatic Evaluation, New Paradigm, Neural Machine Translation, Human Evaluation, Statistical MT

This paper discusses neural machine translation (NMT), a new paradigm in the MT field, comparing the quality of NMT systems with statistical MT by describing three studies using automatic and human evaluation methods. Automatic evaluation results presented for NMT are very promising; however, human evaluations show mixed results. We report increases in fluency but inconsistent results for adequacy and post-editing effort. NMT undoubtedly represents a step forward for the MT field, but one that the community should be careful not to oversell.


Evaluation of English–Slovak Neural and Statistical Machine Translation

Applied Sciences, 2021, Vol 11 (7), pp. 2948. DOI: 10.3390/app11072948

Author(s): Lucia Benkova, Dasa Munkova, Ľubomír Benko, Michal Munk

Keyword(s): Machine Translation, Statistical Approach, Statistical Machine Translation, Specific Domain, Neural Network Approach, Neural Machine Translation, Translation Quality, The Neural Network, Language Pair

This study compares phrase-based statistical machine translation (SMT) systems and neural machine translation (NMT) systems using automatic metrics for translation quality evaluation for the language pair of English and Slovak. As the statistical approach is the predecessor of neural machine translation, it was assumed that the neural network approach would generate results of better quality. An experiment was performed using residuals to compare the scores of automatic metrics of the accuracy (BLEU_n) of the statistical machine translation with those of the neural machine translation. The results confirmed the assumption of better neural machine translation quality regardless of the system used. There were statistically significant differences between SMT and NMT in favor of NMT based on all BLEU_n scores. The neural machine translation achieved a better quality of translation of journalistic texts from English into Slovak, regardless of whether it was a system trained on general texts, such as Google Translate, or on specific ones, such as the European Commission’s (EC’s) tool, which was trained on a specific domain.
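For readers unfamiliar with BLEU-style scores, the following sketch shows the clipped (modified) n-gram precision that BLEU is built on. It is only a building block: full BLEU also combines several n-gram orders and applies a brevity penalty, and the abstract does not specify how its BLEU_n scores were computed. The function names and whitespace tokenization are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference, so repeating a correct word does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


# Usage: a degenerate candidate that repeats one correct word.
candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
unigram_precision = modified_precision(candidate, reference, 1)  # 2 of 7 tokens credited
```

Comparing such scores for SMT and NMT outputs on the same source sentences is the kind of per-metric comparison the study describes.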


Towards Making the Most of BERT in Neural Machine Translation

Proceedings of the AAAI Conference on Artificial Intelligence, 2020, Vol 34 (05), pp. 9378-9385. DOI: 10.1609/aaai.v34i05.6479

Author(s): Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Weinan Zhang, ...

Keyword(s): Machine Translation, Language Processing, State Of The Art, Fine Tuning, Language Models, German Language, Neural Machine Translation, Dynamic Switching, Previous State, Language Pair

GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (CTnmt) that is key to integrating pre-trained LMs into neural machine translation (NMT). Our proposed CTnmt consists of three techniques: a) asymptotic distillation to ensure that the NMT model can retain the previous pre-trained knowledge; b) a dynamic switching gate to avoid catastrophic forgetting of pre-trained knowledge; and c) a strategy to adjust the learning paces according to a scheduled policy. Our experiments in machine translation show that CTnmt gains up to 3 BLEU on the WMT14 English-German language pair, which even surpasses the previous state-of-the-art pre-training-aided NMT by 1.4 BLEU. On the large WMT14 English-French task with 40 million sentence pairs, our base model still significantly improves upon the state-of-the-art Transformer big model by more than 1 BLEU.


Fully Character-Level Neural Machine Translation without Explicit Segmentation

Transactions of the Association for Computational Linguistics, 2017, Vol 5, pp. 365-378. DOI: 10.1162/tacl_a_00067. Cited by ~43

Author(s): Jason Lee, Kyunghyun Cho, Thomas Hofmann

Keyword(s): Machine Translation, Convolutional Network, Neural Machine Translation, Single Character, Comparable Performance, Translation Systems, Multiple Languages, Character Sequence, Language Pair

Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT’15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of the BLEU score and human judgment.


Investigating post-editing effort

Cognitive Linguistic Studies, 2021, Vol 8 (2), pp. 378-403. DOI: 10.1075/cogls.00083.sta

Author(s): Maria Stasimioti, Vilelmini Sosoni, Konstantinos Chatzitheodorou

Keyword(s): Machine Translation, First Language, Working Environment, Translation Product, Greek Language, Neural Machine Translation, Product And Process, Technical Effort, Language Pair

The working environment of translators has changed significantly in recent decades, with post-editing (PE) emerging as a new trend in the human translation workflow, particularly following the advent of neural machine translation (NMT) and the improvement in the quality of raw machine translation (MT) output, especially at the level of fluency. In addition, the directionality axiom is increasingly being questioned, with translators working from and into their first language both in the context of translation (Buchweitz and Alves 2006; Pavlović and Jensen 2009; Fonseca and Barbosa 2015; Hunziker Heeb 2015; Ferreira 2013, 2014; Ferreira et al. 2016; Feng 2017) and in the context of PE (Garcia 2011; Sánchez-Gijón and Torres-Hostench 2014; da Silva et al. 2017; Toledo Báez 2018). In this study we employ product- and process-oriented approaches to investigate directionality in PE in the English-Greek language pair. In particular, we compare the cognitive, temporal, and technical effort expended by translators for the full PE of NMT output in L1 (Greek) with the effort required for the full PE of NMT output in L2 (English), and we also analyze the quality of the final translation product. Our findings reveal that PE in L2, i.e., inverse PE, is less demanding than PE in L1, i.e., direct PE, in terms of the time and keystrokes required and the cognitive load exerted on translators. Finally, our research shows that directionality does not imply differences in quality.


Data science in economics: comprehensive review of advanced machine learning and deep learning methods

2020. DOI: 10.31232/osf.io/4pxq2

Author(s): Saeed Nosratabadi, Amir Mosavi, Puhong Duan, Pedram Ghamisi, Ferdinand Filip, ...

Keyword(s): Machine Learning, Deep Learning, Data Science, State Of The Art, Science Methods, Learning Models, Diverse Range, Hybrid Machine, Economics Research

This paper provides a state-of-the-art investigation of advances in data science in emerging economic applications. The analysis was performed on novel data science methods in four individual classes: deep learning models, hybrid deep learning models, hybrid machine learning, and ensemble models. Application domains include a wide and diverse range of economics research, from the stock market, marketing, and e-commerce to corporate banking and cryptocurrency. The PRISMA method, a systematic literature review methodology, was used to ensure the quality of the survey. The findings reveal that the trends follow the advancement of hybrid models, which, based on the accuracy metric, outperform other learning algorithms. It is further expected that the trends will converge toward the advancement of sophisticated hybrid deep learning models.


EVALUATING THE CONTRIBUTION OF TOTAL FACTOR PRODUCTIVITY ON AGRICULTURAL GROWTH IN AN GIANG PROVINCE

Scientific Journal of Tra Vinh University, 2018, Vol 1 (29), pp. 29-37. DOI: 10.35382/18594816.1.29.2018.30

Author(s): Tan Van Truong

Keyword(s): Total Factor Productivity, Agricultural Production, Science And Technology, Agricultural Labor, Factor Productivity, Agricultural Growth, Regression Approach, Investment Capital, Agricultural Demand

Using the growth regression approach, the research identified that investment capital contributed 1,939 and agricultural labor contributed 1,291 to the agricultural growth of An Giang province. More specifically, the contribution of TFP (Total Factor Productivity) to agricultural growth averaged 0,11% in the period 2000 - 2004, -5,03% in 2005 - 2010, and 0,81% in 2011 - 2016. Total factor productivity thus contributed little to agricultural growth. To raise the contribution of TFP, the research proposes five solutions: using investment capital more effectively, improving the quality of labor, applying science and technology to agricultural production, restructuring agriculture, and increasing agricultural demand.
