Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach

Informatics ◽  
2019 ◽  
Vol 6 (3) ◽  
pp. 35
Author(s):  
Defauw ◽  
Szoc ◽  
Bardadym ◽  
Brabers ◽  
Everaert ◽  
...  

To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual web sites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitutes a genuine translation as a supervised regression problem. We trained our algorithm on a manually labeled dataset in the FR–NL language pair. Our algorithm used shallow features and features obtained after an initial translation step. We showed that the Levenshtein distance between the target and the translated source and the cosine distance between sentence embeddings of the source and the target were the two most important features for the task of misalignment detection. Using gold standards for alignment, we demonstrated that our model can increase the quality of alignments in a corpus substantially, reaching a precision close to 100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
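The two features named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `levenshtein` and `cosine_distance` are hypothetical, the edit distance is computed at character level (the abstract does not say whether characters or tokens were used), and the embeddings are assumed to be plain lists of floats from some sentence encoder.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cosine_distance(u, v) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

In a pipeline of this kind, `levenshtein` would be applied to the target sentence and a machine translation of the source, and `cosine_distance` to the two sentence embeddings; both values would then be passed to the regressor as features.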

2020 ◽  
Vol 8 ◽  
pp. 539-555
Author(s):  
Marina Fomicheva ◽  
Shuo Sun ◽  
Lisa Yankovskaya ◽  
Frédéric Blain ◽  
Francisco Guzmán ◽  
...  

Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user of the quality of the MT output at test time. Existing approaches require large amounts of expert annotated data, computation, and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most of the current work that treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By utilizing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivaling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.


Author(s):  
Candy Lalrempuii ◽  
Badal Soni ◽  
Partha Pakray

Machine Translation is an effort to bridge language barriers and misinterpretations, making communication more convenient through the automatic translation of languages. The quality of translations produced by corpus-based approaches predominantly depends on the availability of a large parallel corpus. Although machine translation of many Indian languages has progressively gained attention, there is very limited research on machine translation and the challenges of using various machine translation techniques for a low-resource language such as Mizo. In this article, we have implemented and compared statistical-based approaches with modern neural-based approaches for the English–Mizo language pair. We have experimented with different tokenization methods, architectures, and configurations. The performance of translations predicted by the trained models has been evaluated using automatic and human evaluation measures. Furthermore, we have analyzed the prediction errors of the models and the quality of predictions based on variations in sentence length and compared the model performance with the existing baselines.


2017 ◽  
Vol 108 (1) ◽  
pp. 109-120 ◽  
Author(s):  
Sheila Castilho ◽  
Joss Moorkens ◽  
Federico Gaspari ◽  
Iacer Calixto ◽  
John Tinsley ◽  
...  

This paper discusses neural machine translation (NMT), a new paradigm in the MT field, comparing the quality of NMT systems with statistical MT by describing three studies using automatic and human evaluation methods. Automatic evaluation results presented for NMT are very promising; however, human evaluations show mixed results. We report increases in fluency but inconsistent results for adequacy and post-editing effort. NMT undoubtedly represents a step forward for the MT field, but one that the community should be careful not to oversell.


2021 ◽  
Vol 11 (7) ◽  
pp. 2948
Author(s):  
Lucia Benkova ◽  
Dasa Munkova ◽  
Ľubomír Benko ◽  
Michal Munk

This study compares phrase-based statistical machine translation (SMT) systems and neural machine translation (NMT) systems using automatic metrics for translation quality evaluation for the language pair of English and Slovak. As the statistical approach is the predecessor of neural machine translation, it was assumed that the neural network approach would generate results with a better quality. An experiment was performed using residuals to compare the scores of automatic metrics of the accuracy (BLEU_n) of the statistical machine translation with those of the neural machine translation. The results confirmed the assumption of better neural machine translation quality regardless of the system used. There were statistically significant differences between the SMT and NMT in favor of the NMT based on all BLEU_n scores. The neural machine translation achieved a better quality of translation of journalistic texts from English into Slovak, regardless of whether it was a system trained on general texts, such as Google Translate, or specific ones, such as the European Commission’s (EC’s) tool, which was trained on a specific domain.
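For readers unfamiliar with BLEU_n, the core ingredient is the clipped (modified) n-gram precision. The sketch below shows only that component for a single reference, under the assumption that sentences are already tokenized into lists of strings; it omits the brevity penalty and the geometric mean over n-gram orders, and the function names are hypothetical rather than taken from the study.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against one reference.

    Each hypothesis n-gram is credited at most as many times as it
    occurs in the reference, so repeating a correct word is not rewarded.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0  # hypothesis shorter than n tokens
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / total
```

With the hypothesis "the the the the the the the" and the reference "the cat is on the mat", the unigram precision is 2/7, because "the" occurs only twice in the reference.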


2020 ◽  
Vol 34 (05) ◽  
pp. 9378-9385
Author(s):  
Jiacheng Yang ◽  
Mingxuan Wang ◽  
Hao Zhou ◽  
Chengqi Zhao ◽  
Weinan Zhang ◽  
...  

GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (CTnmt) that is the key to integrating pre-trained LMs into neural machine translation (NMT). Our proposed CTnmt consists of three techniques: a) asymptotic distillation to ensure that the NMT model can retain the previous pre-trained knowledge; b) a dynamic switching gate to avoid catastrophic forgetting of pre-trained knowledge; and c) a strategy to adjust the learning paces according to a scheduled policy. Our experiments in machine translation show CTnmt gains of up to 3 BLEU score on the WMT14 English-German language pair, which even surpasses the previous state-of-the-art pre-training aided NMT by 1.4 BLEU score. For the large WMT14 English-French task with 40 million sentence pairs, our base model still significantly improves upon the state-of-the-art Transformer big model by more than 1 BLEU score.


Author(s):  
Jason Lee ◽  
Kyunghyun Cho ◽  
Thomas Hofmann

Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT’15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of the BLEU score and human judgment.
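The length reduction that max-pooling provides at the encoder can be sketched as follows. This is an illustration of non-overlapping max-pooling over time only, not the paper's architecture: the convolutional layers are omitted, the stride value is an assumption, and `max_pool_over_time` is a hypothetical name.

```python
def max_pool_over_time(sequence, stride):
    """Non-overlapping max-pooling along the time axis.

    `sequence` is a list of equal-length feature vectors, one per
    character position. Every window of `stride` consecutive vectors is
    replaced by their element-wise maximum, so the output has
    ceil(len(sequence) / stride) positions.
    """
    pooled = []
    for start in range(0, len(sequence), stride):
        window = sequence[start:start + stride]
        # zip(*window) groups the values of each feature across the window.
        pooled.append([max(feature) for feature in zip(*window)])
    return pooled
```

A source of 6 character positions pooled with a stride of 3 yields 2 positions, which is the sense in which pooling shortens the source representation before it reaches the rest of the model.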


2021 ◽  
Vol 8 (2) ◽  
pp. 378-403
Author(s):  
Maria Stasimioti ◽  
Vilelmini Sosoni ◽  
Konstantinos Chatzitheodorou

The working environment of translators has changed significantly in recent decades, with post-editing (PE) emerging as a new trend in the human translation workflow, particularly following the advent of neural machine translation (NMT) and the improvement of the quality of the machine translation (MT) raw output, especially at the level of fluency. In addition, the directionality axiom is increasingly being questioned, with translators working from and into their first language both in the context of translation (Buchweitz and Alves 2006; Pavlović and Jensen 2009; Fonseca and Barbosa 2015; Hunziker Heeb 2015; Ferreira 2013, 2014; Ferreira et al. 2016; Feng 2017) and in the context of PE (Garcia 2011; Sánchez-Gijón and Torres-Hostench 2014; da Silva et al. 2017; Toledo Báez 2018). In this study we employ product- and process-oriented approaches to investigate directionality in PE in the English-Greek language pair. In particular, we compare the cognitive, temporal, and technical effort expended by translators for the full PE of NMT output in L1 (Greek) with the effort required for the full PE of NMT output in L2 (English), while we also analyze the quality of the final translation product. Our findings reveal that PE in L2, i.e., inverse PE, is less demanding than PE in L1, i.e., direct PE, in terms of the time and keystrokes required, and the cognitive load exerted on translators. Finally, our research shows that directionality does not imply differences in quality.


2020 ◽  
Author(s):  
Saeed Nosratabadi ◽  
Amir Mosavi ◽  
Puhong Duan ◽  
Pedram Ghamisi ◽  
Ferdinand Filip ◽  
...  

This paper provides a state-of-the-art investigation of advances in data science in emerging economic applications. The analysis was performed on novel data science methods in four individual classes: deep learning models, hybrid deep learning models, hybrid machine learning, and ensemble models. Application domains include a wide and diverse range of economics research from the stock market, marketing, and e-commerce to corporate banking and cryptocurrency. The PRISMA method, a systematic literature review methodology, was used to ensure the quality of the survey. The findings reveal that the trends follow the advancement of hybrid models, which, based on the accuracy metric, outperform other learning algorithms. It is further expected that the trends will converge toward the advancements of sophisticated hybrid deep learning models.


2018 ◽  
Vol 1 (29) ◽  
pp. 29-37
Author(s):  
Tan Van Truong

Using a growth regression approach, the research identified that investment capital contributed 1,939 and agricultural labor contributed 1,291 to the agricultural growth of An Giang province. More specifically, the contribution of TFP (Total Factor Productivity) to agricultural growth averaged 0,11% in the period 2000 - 2004, -5,03% in 2005 - 2010, and 0,81% in 2011 - 2016. Total factor productivity contributed to agricultural growth slowly. To raise the contribution of TFP, the research presents five solutions: increasing the effectiveness of investment capital use, improving the quality of labor, applying science and technology to agricultural production, agricultural restructuring, and increasing agricultural demand.

