Efficient Phrase Table pruning for Hindi to English machine translation through syntactic and marker-based filtering and hybrid similarity measurement

2018 ◽  
Vol 25 (1) ◽  
pp. 171-210
Author(s):  
NILADRI CHATTERJEE ◽  
SUSMITA GUPTA

For a given training corpus of parallel sentences, the quality of the output produced by a translation system relies heavily on the underlying similarity measurement criteria. A phrase-based machine translation system derives its output through a generative process using a Phrase Table comprising source and target language phrases. As a consequence, the more effective the Phrase Table is, in terms of its size and the output that may be derived out of it, the better is the expected outcome of the underlying translation system. However, finding the most similar phrase(s) from a given training corpus that can help generate a good quality translation poses a serious challenge. In practice, often there are many parallel phrase entries in a Phrase Table that are either redundant, or do not contribute to the translation results effectively. Identifying these candidate entries and removing them from the Phrase Table will not only reduce the size of the Phrase Table, but should also help in improving the processing speed for generating the translations. The present paper develops a scheme based on syntactic structure and the marker hypothesis (Green 1979, The necessity of syntax markers: two experiments with artificial languages, Journal of Verbal Learning and Behavior) for reducing the size of a Phrase Table, without compromising much on the translation quality of the output, by retaining the non-redundant and meaningful parallel phrases only. The proposed scheme is complemented with an appropriate similarity measurement scheme to achieve maximum efficiency in terms of BLEU scores. Although designed for Hindi to English machine translation, the overall approach is quite general, and is expected to be easily adaptable for other language pairs as well.

2019 ◽  
Vol 8 (2S8) ◽  
pp. 1324-1330

The Bicolano-Tagalog Transfer-based Machine Translation System is a unidirectional machine translator from Bicolano to Tagalog. The transfer-based approach is divided into three phases: Pre-Processing Analysis, Morphological Transfer, and Sentence Generation. The system first analyzes the source language (Bicolano) input to create an internal representation. This phase includes the tokenizer, stemmer, POS tagger, and parser. Through transfer rules, it then manipulates this internal representation to transfer the parsed source language syntactic structure into the target language syntactic structure. Finally, the system generates the Tagalog sentence from its own morphological and syntactic information. Each phase undergoes training and an evaluation test to assess the quality of the end results. Overall performance shows a 71.71% accuracy rate.


2021 ◽  
Vol 18 (1) ◽  
pp. 217-234
Author(s):  
Katarina Welnitzova ◽  
Barbara Jakubickova ◽  
Roman Králik

Digitalization is one of the key distinctive features of the modern environment and social life. Nowadays more and more functions are transferred to the artificial mind. How effective is the replacement of human activity with computer activity? This article addresses the question through the example of integrating digital technologies into translation activities. In this paper, emphasis is placed on the quality of machine translation (MT) output of legal texts in the English-Slovak language pair. It studies a Criminal Code formulated in the Slovak language which was translated by a human translator into English and subsequently back into Slovak by the machine translation system Google Translate (GT). Back-translation, i.e. the translation of a translated text back into its original language, was used as a quality assessment tool to detect discrepancies, mistranslations and inevitable differences between the source text and the target text. The quality of MT output was evaluated according to Multidimensional Quality Metrics (MQM) standards with the focus on the dimension of Fluency. Multiple comparisons were applied to determine which issues (errors) in the Fluency dimension differ from the others. A statistically significant difference is noticed between Agreement and other issues, as well as between Ambiguity and other issues. The errors in Agreement are related to the differences between the languages: English is considered mostly an analytic language, whereas Slovak is a synthetic language. The issues in the Ambiguity dimension correlate with the type of the text being examined, since legal texts are characterized by relatively complicated wording and numerous terms; moreover, accuracy and unambiguity need to be preserved. Generally, the MT output is able to provide users with basic information about the text. On the other hand, most of the segments need revision and/or correction; in such cases, human intervention and post-editing are necessary.


2019 ◽  
Vol 28 (3) ◽  
pp. 447-453
Author(s):  
Sainik Kumar Mahata ◽  
Dipankar Das ◽  
Sivaji Bandyopadhyay

Machine translation (MT) is the automatic translation of a source language into a target language by a computer system. In the current paper, we propose an approach that uses recurrent neural networks (RNNs) over traditional statistical MT (SMT). We compare the performance of the phrase table of SMT to the performance of the proposed RNN and in turn improve the quality of the MT output. This work has been done as a part of the shared task problem provided by MTIL2017. We have constructed the traditional MT model using the Moses toolkit and have additionally enriched the language model using external data sets. Thereafter, we have ranked the phrase tables using an RNN encoder-decoder module created originally as a part of the GroundHog project of LISA lab.


Author(s):  
A.V. Kozina ◽  
Yu.S. Belov

Automatically assessing the quality of machine translation is an important yet challenging task for machine translation research. Translation quality assessment is understood as predicting translation quality without reference to the source text. Translation quality depends on the specific machine translation system and often requires post-editing. Manual editing is a long and expensive process. As the need to determine translation quality quickly grows, the process must be automated. In this paper, we propose a quality assessment method based on ensemble supervised machine learning methods. The bilingual corpus WMT 2019 for the English-Russian language pair was used as data. The data comprise 17089 sentences; 85% were used for training and 15% for testing the model. Linguistic features extracted from the text in the source and target languages were used to train the system, since these characteristics most accurately describe the translation in terms of quality. The following tools were used for feature extraction: a free language modeling tool based on SRILM and the Stanford POS Tagger. Before training the system, the text was preprocessed. The model was trained using three regression methods: Bagging, Extra Trees, and Random Forest. The algorithms were implemented in the Python programming language using the scikit-learn library. The parameters of the random forest method were optimized using a grid search. The performance of the model was assessed by the mean absolute error (MAE) and the root mean square error (RMSE), as well as by the Pearson coefficient, which determines the correlation with human judgment. Testing was carried out using three machine translation systems: the Google and Bing neural systems, and the Moses phrase-based and syntax-based statistical machine translation systems. Based on the results of the work, the Extra Trees method performed best. In addition, for all categories of indicators under consideration, the best results are achieved using the Google machine translation system. The developed method showed good results close to human judgment. The system can be used for further research in the task of assessing the quality of translation.
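
The three evaluation measures named in this abstract (MAE, RMSE, and the Pearson coefficient) can be sketched in plain Python. This is only an illustrative sketch, not the authors' code: the function names and the toy score lists below are assumptions made for the example, and the paper itself reports using scikit-learn.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted scores."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Hypothetical toy data: human quality scores and model predictions.
human = [0.2, 0.5, 0.7, 0.9]
predicted = [0.3, 0.4, 0.8, 0.8]

print("MAE:", mae(human, predicted))
print("RMSE:", rmse(human, predicted))
print("Pearson:", pearson(human, predicted))
```

Lower MAE and RMSE indicate predictions closer to the human scores, while a Pearson coefficient closer to 1 indicates stronger agreement with human judgment.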


2020 ◽  
Vol 30 (01) ◽  
pp. 2050002
Author(s):  
Taichi Aida ◽  
Kazuhide Yamamoto

Current methods of neural machine translation may generate sentences with different levels of quality. Methods for automatically evaluating translation output from machine translation can be broadly classified into two types: a method that uses human post-edited translations for training an evaluation model, and a method that uses a reference translation that is the correct answer during evaluation. On the one hand, it is difficult to prepare post-edited translations because it is necessary to tag each word in comparison with the original translated sentences. On the other hand, users who actually employ the machine translation system do not have a correct reference translation. Therefore, we propose a method that trains the evaluation model without using human post-edited sentences and in the test set, estimates the quality of output sentences without using reference translations. We define some indices and predict the quality of translations with a regression model. For the quality of the translated sentences, we employ the BLEU score calculated from the number of word n-gram matches between the translated sentence and the reference translation. After that, we compute the correlation between quality scores predicted by our method and BLEU actually computed from references. According to the experimental results, the correlation with BLEU is the highest when XGBoost uses all the indices. Moreover, looking at each index, we find that the sentence log-likelihood and the model uncertainty, which are based on the joint probability of generating the translated sentence, are important in BLEU estimation.
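
The word n-gram matching that underlies BLEU can be illustrated with a short Python sketch. It computes only the clipped (modified) n-gram precision for a single sentence pair, without the brevity penalty or the geometric mean over n-gram orders that full BLEU includes. The function names and example sentences are assumptions made for illustration, not taken from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference (the "clipping" step).
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Hypothetical example sentences.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

p1 = modified_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = modified_precision(candidate, reference, 2)  # 3 of 5 bigrams match
print(p1, p2)
```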


2018 ◽  
Vol 6 (3) ◽  
pp. 79-92
Author(s):  
Sahar A. El-Rahman ◽  
Tarek A. El-Shishtawy ◽  
Raafat A. El-Kammar

This article presents a realistic technique for a machine-aided translation system. In this technique, the system dictionary is partitioned into a multi-module structure for fast retrieval of Arabic features of English words. Each module is accessed through an interface that includes the necessary morphological rules, which direct the search toward the proper sub-dictionary. Another factor that aids fast retrieval of Arabic features of words is the prediction of the word category, which lets the system access the matching sub-dictionary to retrieve the corresponding attributes. The system consists of three main parts: the source language analysis, the transfer rules between the source language (English) and the target language (Arabic), and the generation of the target language. The proposed system is able to translate some negative forms, demonstrations, and conjunctions, and also adjusts nouns, verbs, and adjectives according to their attributes. Then, it adds the symptom of Arabic words to generate a correct sentence.


2016 ◽  
Vol 13 ◽  
Author(s):  
Sharid Loáiciga ◽  
Cristina Grisot

This paper proposes a method for improving the results of a statistical Machine Translation system using boundedness, a pragmatic component of the verbal phrase’s lexical aspect. First, the paper presents manual and automatic annotation experiments for lexical aspect in English-French parallel corpora. It will be shown that this aspectual property is identified and classified with ease both by humans and by automatic systems. Second, Statistical Machine Translation experiments using the boundedness annotations are presented. These experiments show that the information regarding lexical aspect is useful to improve the output of a Machine Translation system in terms of better choices of verbal tenses in the target language, as well as better lexical choices. Ultimately, this work aims at providing a method for the automatic annotation of data with boundedness information and at contributing to Machine Translation by taking into account linguistic data.


2021 ◽  
Vol 11 (16) ◽  
pp. 7662
Author(s):  
Yong-Seok Choi ◽  
Yo-Han Park ◽  
Seung Yun ◽  
Sang-Hun Kim ◽  
Kong-Joo Lee

Korean and Japanese have different writing scripts but share the same Subject-Object-Verb (SOV) word order. In this study, we pre-train a language-generation model using a Masked Sequence-to-Sequence pre-training (MASS) method on Korean and Japanese monolingual corpora. When building the pre-trained generation model, we allow the smallest number of shared vocabularies between the two languages. Then, we build an unsupervised Neural Machine Translation (NMT) system between Korean and Japanese based on the pre-trained generation model. Despite the different writing scripts and few shared vocabularies, the unsupervised NMT system performs well compared to other pairs of languages. Our interest is in the common characteristics of both languages that make the unsupervised NMT perform so well. In this study, we propose a new method to analyze cross-attentions between a source and target language to estimate the language differences from the perspective of machine translation. We calculate cross-attention measurements between Korean–Japanese and Korean–English pairs and compare their performances and characteristics. The Korean–Japanese pair has little difference in word order and a morphological system, and thus the unsupervised NMT between Korean and Japanese can be trained well even without parallel sentences and shared vocabularies.


2022 ◽  
Vol 2022 ◽  
pp. 1-11
Author(s):  
Syed Abdul Basit Andrabi ◽  
Abdul Wahid

Machine translation has been an active field of research for decades. The main aim of machine translation is to remove the language barrier. Earlier research in this field started with the direct word-to-word replacement of the source language by the target language. Later on, with the advancement in computer and communication technology, there was a paradigm shift to data-driven models like statistical and neural machine translation approaches. In this paper, we have used a neural network-based deep learning technique for English-to-Urdu translation. A parallel corpus of around 30923 sentences is used. The corpus contains sentences from an English-Urdu parallel corpus, news, and sentences which are frequently used in day-to-day life. The corpus contains 542810 English tokens and 540924 Urdu tokens, and the proposed system is trained and tested using a 70:30 split. In order to evaluate the efficiency of the proposed system, several automatic evaluation metrics are used, and the model output is also compared with the output from Google Translator. The proposed model has an average BLEU score of 45.83.

