Learning Lessons from Bilingual Corpora: Benefits for Machine Translation

Oliver Streiter; Leonid L. Iomdin

doi:10.1075/ijcl.5.2.06str

International Journal of Corpus Linguistics ◽

2000 ◽

Vol 5 (2) ◽

pp. 199-230 ◽

Cited By ~ 1

Author(s):

Oliver Streiter ◽

Leonid L. Iomdin

Keyword(s):

Machine Translation ◽

Subject Domain ◽

Rule Based ◽

Parallel Corpora ◽

Specific Subject ◽

Multiword Expressions ◽

Bilingual Corpora ◽

Subject Domains

The research described in this paper is rooted in efforts to combine the advantages of corpus-based and rule-based MT approaches in order to improve the performance of MT systems, most importantly the quality of translation. The authors review ongoing activities in the field and present a case study showing how translation knowledge can be drawn from parallel corpora and compiled into the lexicon of a rule-based MT system. These data are obtained with the help of three procedures: (1) identification of hitherto unknown one-word translations, (2) statistical rating of the known one-word translations, and (3) extraction of new translations of multiword expressions (MWEs), followed by compilation steps that create new rules for the MT engine. As a result, the lexicon is enriched with translation equivalents attested for different subject domains, which facilitates tuning the MT system to a specific subject domain and improves the quality and adequacy of translation.
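Procedure (2) above, the statistical rating of known one-word translations, can be illustrated with a minimal sketch. The abstract does not say which statistic the authors use, so the relative-frequency rating below is an assumption, and the function name `rate_translations`, the toy sentence pairs, and the toy lexicon are all hypothetical.

```python
from collections import Counter

def rate_translations(sentence_pairs, lexicon):
    """Rate each known translation by how often it is attested in the corpus.

    sentence_pairs: list of (source_tokens, target_tokens) tuples.
    lexicon: dict mapping a source word to a list of candidate target words.
    Returns a dict: source word -> {candidate: relative frequency}.
    """
    counts = {src: Counter() for src in lexicon}
    for src_tokens, tgt_tokens in sentence_pairs:
        tgt_set = set(tgt_tokens)
        for src in set(src_tokens):
            for cand in lexicon.get(src, []):
                # A candidate is "attested" when it occurs in the aligned target sentence.
                if cand in tgt_set:
                    counts[src][cand] += 1
    ratings = {}
    for src, counter in counts.items():
        total = sum(counter.values())
        ratings[src] = {
            cand: (counter[cand] / total if total else 0.0)
            for cand in lexicon[src]
        }
    return ratings

# Toy data, invented for illustration only (not from the paper).
toy_pairs = [
    (["bank", "closed"], ["Bank", "geschlossen"]),
    (["river", "bank"], ["Fluss", "Ufer"]),
    (["bank", "account"], ["Bank", "Konto"]),
]
toy_lexicon = {"bank": ["Bank", "Ufer"]}
ratings = rate_translations(toy_pairs, toy_lexicon)
```

Computing such ratings separately per subject-domain corpus would give domain-specific preferences among translation equivalents, which is the kind of tuning the abstract describes.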

Hybrid Arabic-English Machine Translation to Solve Reordering and Ambiguity Problems

Journal of University of Human Development ◽

10.21928/juhd.v1n4y2015.pp413-416 ◽

2015 ◽

Vol 1 (4) ◽

pp. 413

Author(s):

Khalid Shaker Alubaidi

Keyword(s):

Machine Translation ◽

Target Material ◽

Rule Based ◽

Parallel Corpora ◽

Linguistic Rule ◽

Hybrid Machine Translation ◽

In The Beginning ◽

Better Than ◽

Lexical Analyzer

The problem in Arabic-to-English rule-based machine translation is that the rule-based lexical analyzer leaves some ambiguity unresolved; a statistical approach is therefore used to resolve it. Rule-based machine translation (RBMT) uses linguistic rules between two languages, which are generally built manually by humans, whereas statistical machine translation (SMT) uses word occurrence statistics from parallel corpora. In this paper, the two approaches are combined into an Arabic-English hybrid machine translation (HMT) system to take advantage of both kinds of information. First, the Arabic text is passed to the RBMT component to solve the reordering problem. The output is then edited by the SMT component to solve the ambiguity problem and generate the final English translation. SMT can do this because, during training, it uses the RBMT output (English) as source material and the real translation (English) as target material. The results show that the translation quality of the HMT system is better than that of the SMT system.

A Novel Rules Based Approach for Estimating Software Birthmark

The Scientific World JOURNAL ◽

10.1155/2015/579390 ◽

2015 ◽

Vol 2015 ◽

pp. 1-8 ◽

Cited By ~ 8

Author(s):

Shah Nazir ◽

Sara Shahzad ◽

Sher Afzal Khan ◽

Norma Binti Alias ◽

Sajid Anwar

Keyword(s):

Fuzzy Logic ◽

Soft Computing ◽

Fuzzy Rule ◽

New Technique ◽

License Agreement ◽

Rule Based ◽

Software Birthmark ◽

A New Technique

A software birthmark is a unique characteristic of a program that can be used to detect software theft. Comparing the birthmarks of two programs can tell us whether one is a copy of the other. Software theft and piracy, that is, copying, stealing, and misusing software without the permission required by its license agreement, are rapidly increasing problems. Estimating a birthmark can play a key role in understanding its effectiveness. In this paper, a new technique is presented to evaluate and estimate software birthmarks based on the two most sought-after properties of birthmarks, that is, credibility and resilience. For this purpose, soft computing concepts such as probabilistic and fuzzy computing are taken into account, and fuzzy logic is used to estimate the properties of a birthmark. The proposed fuzzy rule based technique is validated through a case study, and the results show that the technique is successful in assessing the specified properties of the birthmark, its resilience and credibility. This, in turn, shows how much effort will be required to detect the originality of the software based on its birthmark.

Recurrent Stacking of Layers for Compact Neural Machine Translation Models

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016292 ◽

2019 ◽

Vol 33 ◽

pp. 6292-6299 ◽

Cited By ~ 2

Author(s):

Raj Dabre ◽

Atsushi Fujita

Keyword(s):

Machine Translation ◽

Single Layer ◽

Training Data ◽

Neural Machine Translation ◽

Parallel Corpora ◽

Translation Quality ◽

Sequence Generation ◽

Sequence Modeling ◽

Back Translation

In encoder-decoder based sequence-to-sequence modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in the encoder and decoder. While the addition of each new layer improves the sequence generation quality, this also leads to a significant increase in the number of parameters. In this paper, we propose to share parameters across all layers thereby leading to a recurrently stacked sequence-to-sequence model. We report on an extensive case study on neural machine translation (NMT) using our proposed method, experimenting with a variety of datasets. We empirically show that the translation quality of a model that recurrently stacks a single-layer 6 times, despite its significantly fewer parameters, approaches that of a model that stacks 6 different layers. We also show how our method can benefit from a prevalent way for improving NMT, i.e., extending training data with pseudo-parallel corpora generated by back-translation. We then analyze the effects of recurrently stacked layers by visualizing the attentions of models that use recurrently stacked layers and models that do not. Finally, we explore the limits of parameter sharing where we share even the parameters between the encoder and decoder in addition to recurrent stacking of layers.
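The parameter saving from recurrent stacking can be seen in a small sketch. This is a toy illustration under stated assumptions, not the authors' model: the tanh feed-forward layer, the dimension 8, and the helper names `make_layer`, `apply_layer`, `run_stack`, and `count_params` are all hypothetical stand-ins for a real NMT layer.

```python
import math
import random

def make_layer(dim, rng):
    """Create one toy feed-forward layer as a (weights, bias) pair."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return weights, bias

def apply_layer(layer, x):
    """Apply tanh(Wx + b) to the input vector x."""
    weights, bias = layer
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def run_stack(layers, x):
    """Feed x through the layers in order."""
    for layer in layers:
        x = apply_layer(layer, x)
    return x

def count_params(layers):
    """Count parameters, counting each distinct layer object only once."""
    unique = {id(layer): layer for layer in layers}
    return sum(len(w) * len(w[0]) + len(b) for w, b in unique.values())

rng = random.Random(0)
dim, depth = 8, 6
vanilla = [make_layer(dim, rng) for _ in range(depth)]  # 6 different layers
shared = [make_layer(dim, rng)] * depth                 # 1 layer applied 6 times

vanilla_params = count_params(vanilla)
shared_params = count_params(shared)
output = run_stack(shared, [1.0] * dim)
```

Both stacks perform six layer applications per input, but the shared stack stores the parameters of only one layer, so in this toy setting its parameter count is one sixth of the vanilla stack's.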

Function words in statistical machine-translated Chinese and original Chinese: A study into the translationese of machine translation systems

Digital Scholarship in the Humanities ◽

10.1093/llc/fqy050 ◽

2018 ◽

Vol 34 (4) ◽

pp. 752-771

Author(s):

Chen-li Kuo

Keyword(s):

Machine Translation ◽

Attribute Selection ◽

Close Attention ◽

Function Words ◽

Rule Based ◽

Source Language ◽

Statistical Mt ◽

Chinese Texts ◽

Translation Systems

Statistical approaches have become the mainstream in machine translation (MT), for their potential in producing less rigid and more natural translations than rule-based approaches. However, on closer examination, the uses of function words between statistical machine-translated Chinese and the original Chinese are different, and such differences may be associated with translationese as discussed in translation studies. This article examines the distribution of Chinese function words in a comparable corpus consisting of MTs and the original Chinese texts extracted from Wikipedia. An attribute selection technique is used to investigate which types of function words are significant in discriminating between statistical machine-translated Chinese and the original texts. The results show that statistical MT overuses the most frequent function words, even when alternatives exist. To improve the quality of the end product, developers of MT should pay close attention to modelling Chinese conjunctions and adverbial function words. The results also suggest that machine-translated Chinese shares some characteristics with human-translated texts, including normalization and being influenced by the source language; however, machine-translated texts do not exhibit other characteristics of translationese such as explicitation.

Machine Translation: Phrase-Based, Rule-Based and Neural Approaches with Linguistic Evaluation

Cybernetics and Information Technologies ◽

10.1515/cait-2017-0014 ◽

2017 ◽

Vol 17 (2) ◽

pp. 28-43 ◽

Cited By ~ 4

Author(s):

Vivien Macketanz ◽

Eleftherios Avramidis ◽

Aljoscha Burchardt ◽

Jindrich Helcl ◽

Ankit Srivastava

Keyword(s):

Machine Translation ◽

Evaluation Method ◽

Open Data ◽

Linked Open Data ◽

Rule Based ◽

Named Entity ◽

Phrase Extraction ◽

Translation Rule ◽

The World

In this article we present a novel linguistically driven evaluation method and apply it to the main approaches of Machine Translation (Rule-based, Phrase-based, Neural) to gain insights into their strengths and weaknesses in much more detail than provided by current evaluation schemes. Translating between two languages requires substantial modelling of knowledge about the two languages, about translation, and about the world. Using English-German IT-domain translation as a case-study, we also enhance the Phrase-based system by exploiting parallel treebanks for syntax-aware phrase extraction and by interfacing with Linked Open Data (LOD) for extracting named entity translations in a post decoding framework.

Corpus Quality Improvement to Improve the Quality of Statistical Translator Machines (Case Study of Indonesian Language to Java Krama)

Jurnal Linguistik Komputasional (JLK) ◽

10.26418/jlk.v1i2.12 ◽

2018 ◽

Vol 1 (2) ◽

pp. 65

Author(s):

Muhammad Gerdy Asparilla ◽

Herry Sujaini ◽

Rudy Dwi Nyoto

Keyword(s):

Quality Improvement ◽

Machine Translation ◽

Statistical Machine Translation ◽

Bahasa Indonesia

Language is a communication tool used as a means of interacting with the surrounding community. Mastery of many languages naturally makes it easier to interact with people from different regions. A translator is therefore needed to broaden knowledge of the various existing languages. Statistical machine translation (SMT) is a machine translation approach in which translations are produced from a statistical model whose parameters are derived from the analysis of a parallel corpus. A parallel corpus is a pair of corpora containing sentences in one language and their translations. One feature used to improve translation quality is corpus optimization. The aim of this study is to examine the effect of corpus quality by filtering sentence pairs so that only those with high-quality translations are kept. The filter used is a minimum score for each sentence, tested with the Bilingual Evaluation Understudy (BLEU) method. Testing was carried out by comparing translation accuracy before and after corpus optimization. The results show that corpus optimization can improve translation quality for the Indonesian to Javanese Krama machine translator. Adding corpus optimization yielded an average BLEU increase of 10.53% on 15 test sentences from outside the corpus; on 100 test sentences drawn from the optimized corpus, the average BLEU increase was 11.63% in automatic testing and 0.03% in testing by linguists. On this basis, using the corpus optimization feature in the Indonesian to Javanese Krama statistical machine translator can improve translation accuracy.
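The filtering step can be sketched as a threshold on a per-sentence score. The abstract does not describe how the sentence-level BLEU score is computed or what each sentence is compared against, so this sketch makes two labeled assumptions: each pair comes with a hypothetical system translation of the source, and clipped unigram precision (the 1-gram building block of BLEU) stands in for the full metric. The names `clipped_unigram_precision` and `filter_corpus` are illustrative.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Modified unigram precision, used here as a stand-in for full BLEU."""
    if not candidate:
        return 0.0
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # Each candidate token is credited at most as often as it occurs in the reference.
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(candidate)

def filter_corpus(triples, min_score, score_fn=clipped_unigram_precision):
    """Keep (source, target) pairs whose score reaches min_score.

    triples: list of (source_tokens, target_tokens, candidate_tokens), where
    candidate_tokens is an assumed system translation of the source.
    """
    kept = []
    for source, target, candidate in triples:
        if score_fn(candidate, target) >= min_score:
            kept.append((source, target))
    return kept

# Placeholder tokens, invented for illustration only.
toy_triples = [
    (["s1"], ["x", "y"], ["x", "y"]),  # candidate matches the target
    (["s2"], ["x", "y"], ["z", "z"]),  # candidate shares nothing with the target
]
filtered = filter_corpus(toy_triples, min_score=0.5)
```

Passing a different `score_fn` would let a full sentence-level BLEU implementation replace the stand-in without changing the filter itself.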

A brief study of the Autshumato Machine Translation Web Service for South African languages

Literator ◽

10.4102/lit.v42i1.1766 ◽

2021 ◽

Vol 42 (1) ◽

Author(s):

Nomsa J. Skosana ◽

Respect Mlambo

Keyword(s):

Web Service ◽

South African ◽

Machine Translation ◽

Language Processing ◽

High Speed ◽

Training Data ◽

Parallel Corpora ◽

African Languages ◽

Official Languages

The scarcity of adequate resources for South African languages poses a huge challenge for their functional development in specialised fields such as science and technology. The study examines the Autshumato Machine Translation (MT) Web Service, created by the Centre for Text Technology at the North-West University. This software supports both formal and informal translations as a machine-aided human translation tool. We investigate the system in terms of its advantages and limitations and suggest possible solutions for South African languages. The results show that the system is essential as it offers high-speed translation and operates as an open-source platform. It also provides multiple translations from sentences, documents and web pages. Some South African languages were included whilst others were excluded and we find this to be a limitation of the system. We also find that the system was trained with a limited amount of data, and this has an adverse effect on the quality of the output. The study suggests that adding specialised parallel corpora from various contemporary fields for all official languages and involving language experts in the pre-editing of training data can be a major step towards improving the quality of the system’s output. The study also outlines that developers should consider integrating the system with other natural language processing applications. Finally, the initiatives discussed in this study will help to improve this MT system to be a more effective translation tool for all the official languages of South Africa.

Improving the quality of Machine Translation using rule based tense synthesizer for Hindi

2015 IEEE International Advance Computing Conference (IACC) ◽

10.1109/iadcc.2015.7154741 ◽

2015 ◽

Cited By ~ 1

Author(s):

Shashi Pal Singh ◽

Ajai Kumar ◽

Hemant Darbari ◽

Anshika Gupta

Keyword(s):

Machine Translation ◽

Rule Based

Building a Bilingual Corpus based on Hybrid Approach for Malayalam-English Machine Translation

International Journal of Computer Science and Informatics ◽

10.47893/ijcsi.2013.1095 ◽

2013 ◽

pp. 219-224

Author(s):

Rajesh. K. S ◽

Veena A Kumar ◽

CH. Dayakar Reddy

Keyword(s):

Machine Translation ◽

Hybrid Approach ◽

Word Alignment ◽

Translation Research ◽

Parallel Corpora ◽

Parallel Corpus ◽

Word Level ◽

Alignment System ◽

Bilingual Corpora ◽

Active Research

Word alignment in bilingual corpora has been a very active research topic among machine translation research groups. In this paper, we describe an alignment system that aligns English-Malayalam texts at word level in parallel sentences. Aligning translated segments with source segments is essential for building parallel corpora. Since word alignment research on Malayalam and English is still immature, this is not a trivial task for Malayalam-English text. A parallel corpus is a collection of texts in two languages, one of which is the translation equivalent of the other. The main purpose of this system is thus to construct a word-aligned parallel corpus to be used in Malayalam-English machine translation. The proposed approach is a hybrid approach, a combination of corpus-based and dictionary lookup approaches. The corpus-based approach is based on the first three IBM models and the Expectation Maximization (EM) algorithm. For the dictionary lookup approach, the proposed system uses a bilingual Malayalam-English dictionary.
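As an illustration of the corpus-based component, the following is a minimal sketch of EM training for IBM Model 1 only, the simplest of the three IBM models mentioned. It is not the authors' system: it omits the NULL word, Models 2 and 3, and the dictionary lookup, it assumes every sentence is non-empty, and the German-English toy sentences are a common textbook-style example rather than data from the paper.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target | source) with EM, as in IBM Model 1 (no NULL word)."""
    src_vocab = {s for src, _ in pairs for s in src}
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    # Uniform initialization of the translation probabilities.
    uniform = 1.0 / len(tgt_vocab)
    t = {(tw, sw): uniform for sw in src_vocab for tw in tgt_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize expected counts into probabilities.
        for (tw, sw) in t:
            t[(tw, sw)] = count[(tw, sw)] / total[sw]
    return t

def align(src, tgt, t):
    """Link each target word to its most probable source word."""
    return [(tw, max(src, key=lambda sw: t[(tw, sw)])) for tw in tgt]

# Toy corpus, for illustration only.
toy_corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t_table = train_ibm_model1(toy_corpus)
links = align(["das", "Haus"], ["the", "house"], t_table)
```

In a hybrid setup like the one described, entries found in a bilingual dictionary could be used to confirm or override such statistically derived links; how the paper combines the two is not stated in the abstract.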

A North Saami to South Saami Machine Translation Prototype

Northern European Journal of Language Technology ◽

10.3384/nejlt.2000-1533.1642 ◽

2016 ◽

Vol 4 ◽

pp. 11-27

Author(s):

Lene Antonsen ◽

Trond Trosterud ◽

Francis M. Tyers

Keyword(s):

Machine Translation ◽

School Administration ◽

Single Domain ◽

Restricted Domain ◽

Rule Based ◽

Rule Based System

The paper describes a rule-based machine translation (MT) system from North to South Saami. The system is designed for a workflow where North Saami functions as pivot language in translation from Norwegian or Swedish. We envisage manual translation from Norwegian or Swedish to North Saami, and thereafter MT to South Saami. The system was aimed at a single domain, that of texts for use in school administration. We evaluated the system in terms of the quality of translations for postediting. Two out of three of the Norwegian to South Saami professional translators found the output of the system to be useful. The evaluation shows that it is possible to make a functioning rule-based system with a small transfer lexicon and a small number of rules and achieve results that are useful for a restricted domain, even if there are substantial differences between the languages.
