An improvement of translation quality with adding key-words in parallel corpus

Author(s):  
Liang Tian ◽  
Fai Wong ◽  
Sam Chao
2009 ◽  
Vol 54 (1) ◽  
pp. 181-188 ◽  
Author(s):  
Tayebeh Mosavi Miangah

Abstract In recent years the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to constitute an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is of course open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string together with the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
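The parallel concordancing functionality described above can be illustrated with a minimal sketch: given a list of aligned English-Persian sentence pairs, a query in one language returns every citation of the search string together with its counterpart in the other language. The in-memory corpus format and the sample pairs below are assumptions for illustration, not the project's actual data structures.

```python
# Minimal parallel-concordancer sketch over an aligned English-Persian corpus.
# The list-of-pairs format and the sample sentences are hypothetical.

def concordance(pairs, query, source="en"):
    """Return all aligned (en, fa) pairs whose chosen side contains `query`."""
    side = 0 if source == "en" else 1
    return [pair for pair in pairs if query.lower() in pair[side].lower()]

if __name__ == "__main__":
    corpus = [
        ("The weather is nice today.", "هوا امروز خوب است."),
        ("He reads the newspaper every morning.", "او هر روز صبح روزنامه می‌خواند."),
    ]
    for en, fa in concordance(corpus, "newspaper"):
        print(en, "|||", fa)
```

A production concordancer would work over an indexed, sentence-aligned corpus rather than an in-memory list, but the lookup logic is the same.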


Author(s):  
Rupjyoti Baruah ◽  
Rajesh Kumar Mundotiya ◽  
Anil Kumar Singh

Machine translation (MT) systems have been built using numerous different techniques for bridging language barriers. These techniques are broadly categorized into approaches such as Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). End-to-end NMT systems significantly outperform SMT in translation quality on many language pairs, especially those with adequate parallel corpora. We report comparative experiments on baseline MT systems for Assamese to other Indo-Aryan languages (in both translation directions) using traditional Phrase-Based SMT as well as some more successful NMT architectures, namely a basic sequence-to-sequence model with attention, the Transformer, and a finetuned Transformer. The results are evaluated using the most prominent and popular standard automatic metric, BLEU (BiLingual Evaluation Understudy), as well as other well-known metrics, to explore the performance of the different baseline MT systems, since this is the first such work involving Assamese. The evaluation scores are compared between the SMT and NMT models for the effectiveness of bi-directional language pairs involving Assamese and other Indo-Aryan languages (Bangla, Gujarati, Hindi, Marathi, Odia, Sinhalese, and Urdu). The highest BLEU scores obtained are for Assamese to Sinhalese for SMT (35.63) and for Assamese to Bangla for the NMT systems (seq2seq 50.92, Transformer 50.01, and finetuned Transformer 50.19). We also try to relate the results to language characteristics, distances, family trees, domains, data sizes, and sentence lengths. We find that the effect of the domain is the most important factor affecting the results for the given data domains and sizes. We compare our results with the only existing MT system for Assamese (Bing Translator) and also with pairs involving Hindi.
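Since BLEU is the headline metric in the comparison above, a short hedged sketch of how such scores are typically computed may help. It uses the sacrebleu package, which is a common choice but not necessarily the toolkit the authors used, and the sentences are invented placeholders.

```python
# Corpus-level BLEU with sacrebleu (illustrative only; toy sentences).
import sacrebleu

hypotheses = ["the cat sat on the mat"]            # system outputs
references = [["the cat is sitting on the mat"]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```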


10.29007/gxv3 ◽  
2018 ◽  
Author(s):  
Shuyuan Cao ◽  
Iria Da-Cunha ◽  
Mikel Iruskieta

Spanish and Chinese are two very different languages at all linguistic levels. Therefore, translation (both human and machine translation) from one to the other, and learning one of them as a foreign language, are challenging tasks. Some automatic translation systems exist for this language pair, but there is still considerable room for improving the translation quality between Spanish and Chinese. In addition, accessible resources for studying and understanding this language pair, such as parallel corpora, are still scarce. In this paper, we present how we have created a Spanish-Chinese parallel corpus designed for language learning and translation tasks at the discourse level. This corpus has been enriched automatically with part-of-speech (POS) tags, and several queries based on morpho-syntactic information can be performed. We have made the parallel corpus available to the academic community.
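As a rough illustration of the morpho-syntactic queries such a POS-enriched corpus supports, the sketch below matches a tag pattern against a POS-annotated Spanish sentence. The tag set (Universal POS tags) and the toy annotation are assumptions, not the corpus's actual annotation scheme.

```python
# Toy morpho-syntactic query over a POS-annotated sentence (illustrative only).

def find_pattern(tagged_sentence, pattern):
    """Return token spans whose POS-tag sequence matches `pattern`."""
    tags = [tag for _, tag in tagged_sentence]
    hits = []
    for i in range(len(tags) - len(pattern) + 1):
        if tags[i:i + len(pattern)] == list(pattern):
            hits.append([tok for tok, _ in tagged_sentence[i:i + len(pattern)]])
    return hits

es = [("La", "DET"), ("casa", "NOUN"), ("blanca", "ADJ")]
print(find_pattern(es, ("NOUN", "ADJ")))   # [['casa', 'blanca']]
```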


Author(s):  
Hidayatul Khoiriyah

<p style="text-align: justify;"><em>The development of technology has a big impact on human life. The existence of a machine translation is the result of technological advancements that aim to facilitate humans in translating one language into another. The focus of this research is to examine the quality of the google translate machine in terms of vocabulary accuracy, clarity, and reasonableness of meaning. Data of mufradāt taken from several Arabic translation dictionaries, while the text is taken from the phenomenal work of Dr. Aidh Qorni in the book Lā Tahzan. The method used in this research is the translation critic method. </em></p><p style="text-align: justify;"><em>The results showed that in terms of the accuracy of vocabulary and terms, Google Translate has a good translation quality. In terms of clarity and reasonableness of meaning, google translate has not been able to transmit ideas from the source language well into the target language. Furthermore, in grammatical, the results of the google translate translation do not have a grammatical arrangement, the results of the google translate translation do not have a good grammatical structure and are by following the rules that applied in the target Indonesian language.</em></p><p style="text-align: justify;"><em>From the data, it shows that google translate should not be used as a basis for translating an Arabic text into Indonesian, especially in translating verses of the Qur'</em><em>ā</em><em>n and Hadīts. A beginner translator should prefer a dictionary rather than using google translate to effort and improve the ability to translate.</em></p><p style="text-align: justify;"><strong><em>Key Words: Translation, Google Translate, Arabic</em></strong></p>


Author(s):  
Nataša Milivojević

The paper revisits the issue of the semantic equivalence of two aspectual verbs, start and krenuti, as proposed by xxx (2021a, 2021b). The present analysis focuses on the causative and dynamic semantic features of start and krenuti, with the aim of contrastively analyzing the aspectual constructions headed by these two verbs. It is shown that both start and krenuti, provided that the necessary linguistic conditions are met, have the ability to “cancel” the initiated event via constructional phase modification. The conditions for such event-cancelling result from the lexical semantics of start and krenuti, as well as from semantic co-composition at the level of the aspectual construction as a whole. The theoretical framework of the analysis is the presupposition and consequence account of A. Freed (1979). The contrastive analysis and the theoretical conclusions presented are supported by a parallel corpus of 200 English and Serbian sentences compiled from the Corpus of Global Web-Based English (GlowBE 2013) and the Corpus of Contemporary Serbian Language (SrpKor 2013). Key words: aspectualizers, aspectual constructions, aspectual event, temporal structure, presupposition and consequence, event-cancelling


2016 ◽  
Vol 2 (2) ◽  
Author(s):  
Irta Fitriana

<p>Translating utterances is not similar to translating sentences. It requires special attention as there is an intended meaning or message transferred by a speaker to a hearer. Context of the situation overshadowing the utterance must be obeyed carefully. Thus the messages will be easily revealed. Speech act is a way that allows the messages of utterances to be seen. Schiffrin (2001) stated that speech act is one of pragmatics’ basic ingredients arranging by words and corresponding to sentences and some ways to avoid kinds of misunderstanding in communication. The focus of speech act is illucution since it shows the intention of utterances uttered. It is also much correlated withtranslation. Intranslatingan utterance, itis not merelytranslated literally, butthere is also an intentionthat shouldbe translated. This paper is aimed to analyze directive speech act in Eat Pray Love and its translation into Indonesian. It tries to reveal the functions of directive speech acts, translation techniques used and the translation quality (readability, accuracy, and acceptability).</p><p>Key words: speech act, directives, translation, readability, accuracy, and <br />acceptability</p>


2020 ◽  
Vol 7 (3) ◽  
pp. 471
Author(s):  
Herry Sujaini

<p class="Body">Korpus paralel memiliki peran yang sangat penting dalam mesin penerjemah statistik (MPS). Korpus paralel yang diperoleh berbagai sumber biasanya memiliki kualitas yang kurang baik, sedangkan kuantitas korpus paralel merupakan tuntutan utama bagi hasil penerjemahan yang baik. Penelitian ini bertujuan untuk mengetahui efek ukuran dan kualitas korpus paralel di MPS. Penelitian ini menggunakan metode <em>bilingual</em> <em>evaluation understudy</em> (BLEU) untuk mengklasifikasikan pasangan kalimat paralel sebagai kalimat berkualitas tinggi atau buruk. Metode ini diterapkan ke korpus paralel yang berisi 1,5 M pasangan kalimat Inggris-Indonesia paralel dan memperoleh 900K pasangan kalimat paralel berkualitas tinggi. Beberapa sistem MPS dengan berbagai ukuran korpus paralel mentah dan korpus berkualitas tinggi yang difilter dilatih dengan MOSES dan dievaluasi kinerjanya. Hasil percobaan yang dilakukan menunjukkan bahwa ukuran korpus paralel merupakan  faktor utama dalam kinerja terjemahan. Selain itu, kinerja terjemahan yang  lebih baik dapat dicapai dengan korpus berkualitas tinggi yang lebih kecil menggunakan metode filter berkualitas. Hasil eksperimen pada MPS bahasa Inggris-Indonesia menunjukkan bahwa dengan menggunakan 60% kalimat yang kualitas terjemahannya baik, kualitas terjemahan dapat meningkat sebesar 7,31%.</p><p class="Body"> </p><p class="Body"><em><strong>Abstract</strong></em></p><p class="Abstract"><em>The parallel corpus has a very important role in the statistical machine translator (SMT) system. The parallel corpus obtained by various sources usually has poor quality, while the quantity of parallel corpus is the main demand for good translation results. This study aims to determine the effect of the size and quality of parallel corpus at SMT. This study uses the bilingual evaluation understudy (BLEU) method to classify pairs of parallel sentences as high-quality or bad sentences. This method is applied to a parallel corpus containing 1.5 M parallel English-Indonesian sentence pairs and obtaining 900K pairs of high-quality parallel sentences. Some SMT systems with various sizes of raw parallel bodies and high-quality corpus filtered are trained with MOSES and evaluated for performance. The experimental results show that the size of the parallel corpus is a major factor in translation performance. In addition, better translation performance can be achieved with a smaller high-quality corpus using a quality filter method.The experimental results in the English-Indonesian SMT show that by using 60% of sentences whose translation quality is good, the quality of the translation can increase by 7.31%.</em></p><p class="Body"><em><strong><br /></strong></em></p>


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Boxiang Liu ◽  
Liang Huang

Abstract Background Biomedical language translation requires multi-lingual fluency as well as relevant domain knowledge. Such requirements make it challenging to train qualified translators and costly to generate high-quality translations. Machine translation represents an effective alternative, but accurate machine translation requires large amounts of in-domain data. While such datasets are abundant in general domains, they are less accessible in the biomedical domain. Chinese and English are two of the most widely spoken languages, yet to our knowledge, a parallel corpus does not exist for this language pair in the biomedical domain. Description We developed an effective pipeline to acquire and process an English-Chinese parallel corpus from the New England Journal of Medicine (NEJM). This corpus consists of about 100,000 sentence pairs and 3,000,000 tokens on each side. We showed that training on out-of-domain data and fine-tuning with as few as 4,000 NEJM sentence pairs improve translation quality by 25.3 (13.4) BLEU for the en→zh (zh→en) direction. Translation quality continues to improve at a slower pace on larger in-domain data subsets, with a total increase of 33.0 (24.3) BLEU for the en→zh (zh→en) direction on the full dataset. Conclusions The code and data are available at https://github.com/boxiangliu/ParaMed.
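The out-of-domain-pretraining plus in-domain fine-tuning recipe reported above can be sketched, under clearly stated assumptions, with the Hugging Face Transformers library and a generic pretrained en→zh Marian model; the model name, the toy sentence pair, and the hyperparameters are illustrative assumptions, not the authors' actual setup (their code is in the linked repository).

```python
# Hedged sketch: fine-tune a generic pretrained en->zh model on a tiny
# in-domain (biomedical) batch. Model name and data are assumptions.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-zh"          # assumed general-domain model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src = ["The patient was treated with aspirin."]    # toy in-domain pair
tgt = ["患者接受了阿司匹林治疗。"]

batch = tokenizer(src, text_target=tgt, return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):                                  # a few illustrative steps
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```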


Author(s):  
Yana Fedorko ◽  
Tetiana Yablonskaya

The article is focused on the peculiarities of translating English and Chinese political discourse into Ukrainian. The advantages and disadvantages of machine translation are described on the basis of a linguistic analysis of the online Google Translate and M-Translate systems. The reasons for translation errors are identified, and the need for post-editing to improve translation quality is highlighted. Key words: political discourse, automatic translation, online machine translation systems, machine translation quality assessment.


2020 ◽  
Vol 12 (12) ◽  
pp. 215
Author(s):  
Wenbo Zhang ◽  
Xiao Li ◽  
Yating Yang ◽  
Rui Dong ◽  
Gongxu Luo

Recently, the pretraining of models has been successfully applied to unsupervised and semi-supervised neural machine translation. A cross-lingual language model uses a pretrained masked language model to initialize the encoder and decoder of the translation model, which greatly improves the translation quality. However, because of a mismatch in the number of layers, the pretrained model can only initialize part of the decoder’s parameters. In this paper, we use a layer-wise coordination transformer and a consistent pretraining translation transformer instead of a vanilla transformer as the translation model. The former has only an encoder, and the latter has an encoder and a decoder, but the encoder and decoder have exactly the same parameters. Both models can guarantee that all parameters in the translation model can be initialized by the pretrained model. Experiments on the Chinese–English and English–German datasets show that compared with the vanilla transformer baseline, our models achieve better performance with fewer parameters when the parallel corpus is small.
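A loose, hedged illustration of the parameter-sharing idea described above: a single stack of Transformer layers is reused for both the source and the target side, so every parameter of the translation model can be covered by a pretrained checkpoint of that one stack. This toy sketch is not the authors' architecture; in particular, a real layer-wise coordination model would also attend to the encoder output, which the toy target-side pass below omits.

```python
# Toy sketch of encoder/decoder parameter sharing (illustrative only).
import torch
import torch.nn as nn

d_model, n_layers = 512, 6
shared_layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
    for _ in range(n_layers)
)

def encode(x):
    for layer in shared_layers:        # source-side pass
        x = layer(x)
    return x

def decode(y):
    for layer in shared_layers:        # target-side pass reuses the same parameters
        y = layer(y)
    return y

src = torch.randn(2, 7, d_model)       # (batch, src_len, d_model)
tgt = torch.randn(2, 5, d_model)
memory = encode(src)                   # unused in this toy pass (see note above)
print(decode(tgt).shape)               # torch.Size([2, 5, 512])
```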

