An improvement of translation quality with adding key-words in parallel corpus

Author(s):  
Liang Tian ◽  
Fai Wong ◽  
Sam Chao
2009 ◽  
Vol 54 (1) ◽  
pp. 181-188 ◽  
Author(s):  
Tayebeh Mosavi Miangah

Abstract In recent years the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to constitute an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is of course open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string together with the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
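The parallel concordancing functionality described above can be illustrated with a minimal sketch: given a list of aligned English-Persian sentence pairs, a query in one language returns every citation of the search string together with its counterpart in the other language. The in-memory corpus format and the sample pairs below are assumptions for illustration, not the project's actual data structures.

```python
# Minimal parallel-concordancer sketch over an aligned English-Persian corpus.
# The list-of-pairs format and the sample sentences are hypothetical.

def concordance(pairs, query, source="en"):
    """Return all aligned (en, fa) pairs whose chosen side contains `query`."""
    side = 0 if source == "en" else 1
    return [pair for pair in pairs if query.lower() in pair[side].lower()]

if __name__ == "__main__":
    corpus = [
        ("The weather is nice today.", "هوا امروز خوب است."),
        ("He reads the newspaper every morning.", "او هر روز صبح روزنامه می‌خواند."),
    ]
    for en, fa in concordance(corpus, "newspaper"):
        print(en, "|||", fa)
```

A production concordancer would work over an indexed, sentence-aligned corpus rather than an in-memory list, but the lookup logic is the same.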


Author(s):  
Rupjyoti Baruah ◽  
Rajesh Kumar Mundotiya ◽  
Anil Kumar Singh

Machine translation (MT) systems have been built using numerous different techniques for bridging language barriers. These techniques are broadly categorized into approaches such as Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). End-to-end NMT systems significantly outperform SMT in translation quality on many language pairs, especially those with adequate parallel corpora. We report comparative experiments on baseline MT systems for Assamese to other Indo-Aryan languages (in both translation directions) using traditional Phrase-Based SMT as well as some more successful NMT architectures, namely a basic sequence-to-sequence model with attention, the Transformer, and a finetuned Transformer. The results are evaluated using the most prominent and popular standard automatic metric, BLEU (BiLingual Evaluation Understudy), as well as other well-known metrics, to explore the performance of the different baseline MT systems, since this is the first such work involving Assamese. The evaluation scores are compared between the SMT and NMT models for the effectiveness of bi-directional language pairs involving Assamese and other Indo-Aryan languages (Bangla, Gujarati, Hindi, Marathi, Odia, Sinhalese, and Urdu). The highest BLEU scores obtained are for Assamese to Sinhalese for SMT (35.63) and for Assamese to Bangla for the NMT systems (seq2seq 50.92, Transformer 50.01, and finetuned Transformer 50.19). We also try to relate the results to language characteristics, distances, family trees, domains, data sizes, and sentence lengths. We find that the effect of the domain is the most important factor affecting the results for the given data domains and sizes. We compare our results with the only existing MT system for Assamese (Bing Translator) and also with pairs involving Hindi.
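Since BLEU is the headline metric in the comparison above, a short hedged sketch of how such scores are typically computed may help. It uses the sacrebleu package, which is a common choice but not necessarily the toolkit the authors used, and the sentences are invented placeholders.

```python
# Corpus-level BLEU with sacrebleu (illustrative only; toy sentences).
import sacrebleu

hypotheses = ["the cat sat on the mat"]            # system outputs
references = [["the cat is sitting on the mat"]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```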


10.29007/gxv3 ◽  
2018 ◽  
Author(s):  
Shuyuan Cao ◽  
Iria Da-Cunha ◽  
Mikel Iruskieta

Spanish and Chinese are two very different languages at all linguistic levels. Therefore, translation (both human and machine translation) from one to the other, and learning one of them as a foreign language, are challenging tasks. Some automatic translation systems exist for this language pair, but there is still considerable room for improving the translation quality between Spanish and Chinese. In addition, accessible resources for studying and understanding this language pair, such as parallel corpora, are still scarce. In this paper, we present how we have created a Spanish-Chinese parallel corpus designed for language learning and translation tasks at the discourse level. This corpus has been enriched automatically with part-of-speech (POS) tags, and several queries based on morpho-syntactic information can be performed. We have made the parallel corpus available to the academic community.
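As a rough illustration of the morpho-syntactic queries such a POS-enriched corpus supports, the sketch below matches a tag pattern against a POS-annotated Spanish sentence. The tag set (Universal POS tags) and the toy annotation are assumptions, not the corpus's actual annotation scheme.

```python
# Toy morpho-syntactic query over a POS-annotated sentence (illustrative only).

def find_pattern(tagged_sentence, pattern):
    """Return token spans whose POS-tag sequence matches `pattern`."""
    tags = [tag for _, tag in tagged_sentence]
    hits = []
    for i in range(len(tags) - len(pattern) + 1):
        if tags[i:i + len(pattern)] == list(pattern):
            hits.append([tok for tok, _ in tagged_sentence[i:i + len(pattern)]])
    return hits

es = [("La", "DET"), ("casa", "NOUN"), ("blanca", "ADJ")]
print(find_pattern(es, ("NOUN", "ADJ")))   # [['casa', 'blanca']]
```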


Author(s):  
Hidayatul Khoiriyah

<p style="text-align: justify;"><em>The development of technology has a big impact on human life. The existence of a machine translation is the result of technological advancements that aim to facilitate humans in translating one language into another. The focus of this research is to examine the quality of the google translate machine in terms of vocabulary accuracy, clarity, and reasonableness of meaning. Data of mufradāt taken from several Arabic translation dictionaries, while the text is taken from the phenomenal work of Dr. Aidh Qorni in the book Lā Tahzan. The method used in this research is the translation critic method. </em></p><p style="text-align: justify;"><em>The results showed that in terms of the accuracy of vocabulary and terms, Google Translate has a good translation quality. In terms of clarity and reasonableness of meaning, google translate has not been able to transmit ideas from the source language well into the target language. Furthermore, in grammatical, the results of the google translate translation do not have a grammatical arrangement, the results of the google translate translation do not have a good grammatical structure and are by following the rules that applied in the target Indonesian language.</em></p><p style="text-align: justify;"><em>From the data, it shows that google translate should not be used as a basis for translating an Arabic text into Indonesian, especially in translating verses of the Qur'</em><em>ā</em><em>n and Hadīts. A beginner translator should prefer a dictionary rather than using google translate to effort and improve the ability to translate.</em></p><p style="text-align: justify;"><strong><em>Key Words: Translation, Google Translate, Arabic</em></strong></p>


Author(s):  
Nataša Milivojević

The paper revisits the issue of the semantic equivalence of two aspectual verbs, start and krenuti, as proposed by xxx (2021a, 2021b). The present analysis focuses on the causative and dynamic semantic features of start and krenuti, with the aim of contrastively analyzing the aspectual constructions headed by these two verbs. It is shown that both start and krenuti, provided that the necessary linguistic conditions are met, have the ability to “cancel” the initiated event via constructional phase modification. The conditions for such event-cancelling result from the lexical semantics of start and krenuti, as well as from semantic co-composition at the level of the aspectual construction as a whole. The theoretical framework of the analysis is the presupposition and consequence account of A. Freed (1979). The contrastive analysis and the theoretical conclusions presented are supported by a parallel corpus of 200 English and Serbian sentences compiled from the Corpus of Global Web-Based English (GlowBE 2013) and the Corpus of Contemporary Serbian Language (SrpKor 2013). Key words: aspectualizers, aspectual constructions, aspectual event, temporal structure, presupposition and consequence, event-cancelling


2016 ◽  
Vol 2 (2) ◽  
Author(s):  
Irta Fitriana

<p>Translating utterances is not similar to translating sentences. It requires special attention as there is an intended meaning or message transferred by a speaker to a hearer. Context of the situation overshadowing the utterance must be obeyed carefully. Thus the messages will be easily revealed. Speech act is a way that allows the messages of utterances to be seen. Schiffrin (2001) stated that speech act is one of pragmatics’ basic ingredients arranging by words and corresponding to sentences and some ways to avoid kinds of misunderstanding in communication. The focus of speech act is illucution since it shows the intention of utterances uttered. It is also much correlated withtranslation. Intranslatingan utterance, itis not merelytranslated literally, butthere is also an intentionthat shouldbe translated. This paper is aimed to analyze directive speech act in Eat Pray Love and its translation into Indonesian. It tries to reveal the functions of directive speech acts, translation techniques used and the translation quality (readability, accuracy, and acceptability).</p><p>Key words: speech act, directives, translation, readability, accuracy, and <br />acceptability</p>


2020 ◽  
Vol 7 (3) ◽  
pp. 471
Author(s):  
Herry Sujaini

<p class="Body">Korpus paralel memiliki peran yang sangat penting dalam mesin penerjemah statistik (MPS). Korpus paralel yang diperoleh berbagai sumber biasanya memiliki kualitas yang kurang baik, sedangkan kuantitas korpus paralel merupakan tuntutan utama bagi hasil penerjemahan yang baik. Penelitian ini bertujuan untuk mengetahui efek ukuran dan kualitas korpus paralel di MPS. Penelitian ini menggunakan metode <em>bilingual</em> <em>evaluation understudy</em> (BLEU) untuk mengklasifikasikan pasangan kalimat paralel sebagai kalimat berkualitas tinggi atau buruk. Metode ini diterapkan ke korpus paralel yang berisi 1,5 M pasangan kalimat Inggris-Indonesia paralel dan memperoleh 900K pasangan kalimat paralel berkualitas tinggi. Beberapa sistem MPS dengan berbagai ukuran korpus paralel mentah dan korpus berkualitas tinggi yang difilter dilatih dengan MOSES dan dievaluasi kinerjanya. Hasil percobaan yang dilakukan menunjukkan bahwa ukuran korpus paralel merupakan  faktor utama dalam kinerja terjemahan. Selain itu, kinerja terjemahan yang  lebih baik dapat dicapai dengan korpus berkualitas tinggi yang lebih kecil menggunakan metode filter berkualitas. Hasil eksperimen pada MPS bahasa Inggris-Indonesia menunjukkan bahwa dengan menggunakan 60% kalimat yang kualitas terjemahannya baik, kualitas terjemahan dapat meningkat sebesar 7,31%.</p><p class="Body"> </p><p class="Body"><em><strong>Abstract</strong></em></p><p class="Abstract"><em>The parallel corpus has a very important role in the statistical machine translator (SMT) system. The parallel corpus obtained by various sources usually has poor quality, while the quantity of parallel corpus is the main demand for good translation results. This study aims to determine the effect of the size and quality of parallel corpus at SMT. This study uses the bilingual evaluation understudy (BLEU) method to classify pairs of parallel sentences as high-quality or bad sentences. This method is applied to a parallel corpus containing 1.5 M parallel English-Indonesian sentence pairs and obtaining 900K pairs of high-quality parallel sentences. Some SMT systems with various sizes of raw parallel bodies and high-quality corpus filtered are trained with MOSES and evaluated for performance. The experimental results show that the size of the parallel corpus is a major factor in translation performance. In addition, better translation performance can be achieved with a smaller high-quality corpus using a quality filter method.The experimental results in the English-Indonesian SMT show that by using 60% of sentences whose translation quality is good, the quality of the translation can increase by 7.31%.</em></p><p class="Body"><em><strong><br /></strong></em></p>


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Boxiang Liu ◽  
Liang Huang

Abstract Background Biomedical language translation requires multi-lingual fluency as well as relevant domain knowledge. Such requirements make it challenging to train qualified translators and costly to generate high-quality translations. Machine translation represents an effective alternative, but accurate machine translation requires large amounts of in-domain data. While such datasets are abundant in general domains, they are less accessible in the biomedical domain. Chinese and English are two of the most widely spoken languages, yet to our knowledge, a parallel corpus does not exist for this language pair in the biomedical domain. Description We developed an effective pipeline to acquire and process an English-Chinese parallel corpus from the New England Journal of Medicine (NEJM). This corpus consists of about 100,000 sentence pairs and 3,000,000 tokens on each side. We showed that training on out-of-domain data and fine-tuning with as few as 4,000 NEJM sentence pairs improve translation quality by 25.3 (13.4) BLEU for the en→zh (zh→en) direction. Translation quality continues to improve at a slower pace on larger in-domain data subsets, with a total increase of 33.0 (24.3) BLEU for the en→zh (zh→en) direction on the full dataset. Conclusions The code and data are available at https://github.com/boxiangliu/ParaMed.
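The out-of-domain-pretraining plus in-domain fine-tuning recipe reported above can be sketched, under clearly stated assumptions, with the Hugging Face Transformers library and a generic pretrained en→zh Marian model; the model name, the toy sentence pair, and the hyperparameters are illustrative assumptions, not the authors' actual setup (their code is in the linked repository).

```python
# Hedged sketch: fine-tune a generic pretrained en->zh model on a tiny
# in-domain (biomedical) batch. Model name and data are assumptions.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-zh"          # assumed general-domain model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src = ["The patient was treated with aspirin."]    # toy in-domain pair
tgt = ["患者接受了阿司匹林治疗。"]

batch = tokenizer(src, text_target=tgt, return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):                                  # a few illustrative steps
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```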


Author(s):  
Yana Fedorko ◽  
Tetiana Yablonskaya

The article is focused on the peculiarities of translating English and Chinese political discourse into Ukrainian. The advantages and disadvantages of machine translation are described on the basis of a linguistic analysis of the online Google Translate and M-Translate systems. The reasons for translation errors are identified, and the need for post-editing to improve translation quality is highlighted. Key words: political discourse, automatic translation, online machine translation systems, machine translation quality assessment.


2020 ◽  
Vol 12 (12) ◽  
pp. 215
Author(s):  
Wenbo Zhang ◽  
Xiao Li ◽  
Yating Yang ◽  
Rui Dong ◽  
Gongxu Luo

Recently, the pretraining of models has been successfully applied to unsupervised and semi-supervised neural machine translation. A cross-lingual language model uses a pretrained masked language model to initialize the encoder and decoder of the translation model, which greatly improves the translation quality. However, because of a mismatch in the number of layers, the pretrained model can only initialize part of the decoder’s parameters. In this paper, we use a layer-wise coordination transformer and a consistent pretraining translation transformer instead of a vanilla transformer as the translation model. The former has only an encoder, and the latter has an encoder and a decoder, but the encoder and decoder have exactly the same parameters. Both models can guarantee that all parameters in the translation model can be initialized by the pretrained model. Experiments on the Chinese–English and English–German datasets show that compared with the vanilla transformer baseline, our models achieve better performance with fewer parameters when the parallel corpus is small.
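A loose, hedged illustration of the parameter-sharing idea described above: a single stack of Transformer layers is reused for both the source and the target side, so every parameter of the translation model can be covered by a pretrained checkpoint of that one stack. This toy sketch is not the authors' architecture; in particular, a real layer-wise coordination model would also attend to the encoder output, which the toy target-side pass below omits.

```python
# Toy sketch of encoder/decoder parameter sharing (illustrative only).
import torch
import torch.nn as nn

d_model, n_layers = 512, 6
shared_layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
    for _ in range(n_layers)
)

def encode(x):
    for layer in shared_layers:        # source-side pass
        x = layer(x)
    return x

def decode(y):
    for layer in shared_layers:        # target-side pass reuses the same parameters
        y = layer(y)
    return y

src = torch.randn(2, 7, d_model)       # (batch, src_len, d_model)
tgt = torch.randn(2, 5, d_model)
memory = encode(src)                   # unused in this toy pass (see note above)
print(decode(tgt).shape)               # torch.Size([2, 5, 512])
```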

