Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages

Machine Translation ◽

10.1007/s10590-021-09260-6 ◽

2021 ◽

Author(s):

Tanmai Khanna ◽

Jonathan N. Washington ◽

Francis M. Tyers ◽

Sevilay Bayatlı ◽

Daniel G. Swanson ◽

...

Keyword(s):

Open Source ◽

Machine Translation ◽

Lexical Selection ◽

Rule Based ◽

Low Resource ◽

Language Technology ◽

Language Data ◽

Recursive Structures ◽

Platform Translation ◽

Free Open Source

AbstractThis paper presents an overview of Apertium, a free and open-source rule-based machine translation platform. Translation in Apertium happens through a pipeline of modular tools, and the platform continues to be improved as more language pairs are added. Several advances have been implemented since the last publication, including some new optional modules: a module that allows rules to process recursive structures at the structural transfer stage, a module that deals with contiguous and discontiguous multi-word expressions, and a module that resolves anaphora to aid translation. Also highlighted is the hybridisation of Apertium through statistical modules that augment the pipeline, and statistical methods that augment existing modules. This includes morphological disambiguation, weighted structural transfer, and lexical selection modules that learn from limited data. The paper also discusses how a platform like Apertium can be a critical part of access to language technology for so-called low-resource languages, which might be ignored or deemed unapproachable by popular corpus-based translation technologies. Finally, the paper presents some of the released and unreleased language pairs, concluding with a brief look at some supplementary Apertium tools that prove valuable to users as well as language developers. All Apertium-related code, including language data, is free/open-source and available at https://github.com/apertium.

Download Full-text

Apertium: a free/open-source platform for rule-based machine translation

Machine Translation ◽

10.1007/s10590-011-9090-0 ◽

2011 ◽

Vol 25 (2) ◽

pp. 127-144 ◽

Cited By ~ 53

Author(s):

Mikel L. Forcada ◽

Mireia Ginestí-Rosell ◽

Jacob Nordfalk ◽

Jim O’Regan ◽

Sergio Ortiz-Rojas ◽

...

Keyword(s):

Open Source ◽

Machine Translation ◽

Rule Based ◽

Free Open Source

Download Full-text

Vélþýðingar á íslensku og Apertium-þýðingarkerfið

Orð og tunga ◽

10.33112/ordogtunga.18.8 ◽

2016 ◽

Vol 18 ◽

pp. 131-143

Author(s):

Ingibjörg Elsa Björnsdóttir

Keyword(s):

Open Source ◽

Machine Translation ◽

Rapid Development ◽

Translation System ◽

Rule Based ◽

Language Technology ◽

Translation Rule ◽

Machine Translation System

There has been rapid development in language technology and machine translation in recent decades. There are three main types of machine translation: statistical ma-chine translation, rule-based machine translation, and example-based machine translation. In this article the Apertium machine translation system is discussed in particular. While Apertium was originally designed to translate between closely related languages, it can now handle languages that are much more different and variable in structure. Anyone can participate in the development of the Apertium system since it is an open source soft ware. Thus Apertium is one of the best options available in order to research and develop a machine translation system for Icelandic. The Apertium system has an easy-to-use interface, and it translates almost instantly from Icelandic into English or Swedish. However, the system still has certain limitations as regards vocabulary and ambiguity.

Download Full-text

English to Kurdish Rule-based Machine Translation System

UHD Journal of Science and Technology ◽

10.21928/uhdjst.v2n2y2018.pp32-39 ◽

2018 ◽

Vol 2 (2) ◽

pp. 32

Author(s):

Kanaan Mikael Kaka-Khan

Keyword(s):

Open Source ◽

Machine Translation ◽

Translation System ◽

Simple Sentence ◽

Rule Based ◽

Ongoing Effort ◽

Machine Translation System ◽

Compound Sentence ◽

Free Open Source

In this paper we present a machine translation system developed to translate simple English sentences to Kurdish. The system is based on the (apertuim) free open source engine that provides the environment and the required tools to develop a machine translation system. The developed system is used to translate some as simple sentence, compound sentence, phrases and idioms from English to Kurdish. The resulting translation is then evaluated manually for accuracy and completeness compared to the result produced by the popular (inKurdish) English to Kurdish machine translation system. The result shows that our system is more accurate than inkurdish system. This paper contributes towards the ongoing effort to achieve full machine-based translation in general and English to Kurdish machine translation in specific.

Download Full-text

UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

10.36227/techrxiv.16924255 ◽

2021 ◽

Author(s):

Samreen Ahmed ◽

shakeel khoja

Keyword(s):

Reading Comprehension ◽

Machine Translation ◽

Large Scale ◽

Question Answering ◽

Training Data ◽

Significant Progress ◽

Rule Based ◽

Low Resource ◽

Machine Reading ◽

Answer Format

<p>In recent years, low-resource Machine Reading Comprehension (MRC) has made significant progress, with models getting remarkable performance on various language datasets. However, none of these models have been customized for the Urdu language. This work explores the semi-automated creation of the Urdu Question Answering Dataset (UQuAD1.0) by combining machine-translated SQuAD with human-generated samples derived from Wikipedia articles and Urdu RC worksheets from Cambridge O-level books. UQuAD1.0 is a large-scale Urdu dataset intended for extractive machine reading comprehension tasks consisting of 49k question Answers pairs in question, passage, and answer format. In UQuAD1.0, 45000 pairs of QA were generated by machine translation of the original SQuAD1.0 and approximately 4000 pairs via crowdsourcing. In this study, we used two types of MRC models: rule-based baseline and advanced Transformer-based models. However, we have discovered that the latter outperforms the others; thus, we have decided to concentrate solely on Transformer-based architectures. Using XLMRoBERTa and multi-lingual BERT, we acquire an F<sub>1</sub> score of 0.66 and 0.63, respectively.</p>

Download Full-text

Otedama: Fast Rule-Based Pre-Ordering for Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2016-0015 ◽

2016 ◽

Vol 106 (1) ◽

pp. 159-168 ◽

Cited By ~ 1

Author(s):

Julian Hitschler ◽

Laura Jehl ◽

Sariya Karimova ◽

Mayumi Ohta ◽

Benjamin Körner ◽

...

Keyword(s):

Open Source ◽

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Training Data ◽

Translation System ◽

Rule Based ◽

Machine Translation System ◽

Target Languages ◽

Established Technique

Abstract We present Otedama, a fast, open-source tool for rule-based syntactic pre-ordering, a well established technique in statistical machine translation. Otedama implements both a learner for pre-ordering rules, as well as a component for applying these rules to parsed sentences. Our system is compatible with several external parsers and capable of accommodating many source and all target languages in any machine translation paradigm which uses parallel training data. We demonstrate improvements on a patent translation task over a state-of-the-art English-Japanese hierarchical phrase-based machine translation system. We compare Otedama with an existing syntax-based pre-ordering system, showing comparable translation performance at a runtime speedup of a factor of 4.5-10.

Download Full-text

Bridging the “gApp”: improving neural machine translation systems for multiword expression detection

Yearbook of Phraseology ◽

10.1515/phras-2020-0005 ◽

2020 ◽

Vol 11 (1) ◽

pp. 61-80

Author(s):

Carlos Manuel Hidalgo-Ternero ◽

Gloria Corpas Pastor

Keyword(s):

Open Source ◽

Machine Translation ◽

Automatic Identification ◽

Neural Machine Translation ◽

Multiword Expressions ◽

Continuous Form ◽

Text Preprocessing ◽

Translation Systems ◽

Free Open Source

AbstractThe present research introduces the tool gApp, a Python-based text preprocessing system for the automatic identification and conversion of discontinuous multiword expressions (MWEs) into their continuous form in order to enhance neural machine translation (NMT). To this end, an experiment with semi-fixed verb–noun idiomatic combinations (VNICs) will be carried out in order to evaluate to what extent gApp can optimise the performance of the two main free open-source NMT systems —Google Translate and DeepL— under the challenge of MWE discontinuity in the Spanish into English directionality. In the light of our promising results, the study concludes with suggestions on how to further optimise MWE-aware NMT systems.

Download Full-text

UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

10.36227/techrxiv.16924255.v1 ◽

2021 ◽

Author(s):

Samreen Ahmed ◽

shakeel khoja

Keyword(s):

Reading Comprehension ◽

Machine Translation ◽

Large Scale ◽

Question Answering ◽

Training Data ◽

Significant Progress ◽

Rule Based ◽

Low Resource ◽

Machine Reading ◽

Answer Format

Download Full-text

Free/open-source machine translation: preface

Machine Translation ◽

10.1007/s10590-011-9113-x ◽

2011 ◽

Vol 25 (2) ◽

pp. 83-86 ◽

Cited By ~ 1

Author(s):

Felipe Sánchez-Martínez ◽

Mikel L. Forcada

Keyword(s):

Open Source ◽

Machine Translation ◽

Free Open Source

Download Full-text

Inferring Shallow-Transfer Machine Translation Rules from Small Parallel Corpora

Journal of Artificial Intelligence Research ◽

10.1613/jair.2735 ◽

2009 ◽

Vol 34 ◽

pp. 605-635 ◽

Cited By ~ 11

Author(s):

F. Sánchez-Martínez ◽

M. L. Forcada

Keyword(s):

Open Source ◽

Machine Translation ◽

Parallel Corpora ◽

Bilingual Dictionary ◽

Translation Quality ◽

Statistical Mt ◽

Transfer Rules ◽

Word Translation ◽

And Control ◽

Free Open Source

This paper describes a method for the automatic inference of structural transfer rules to be used in a shallow-transfer machine translation (MT) system from small parallel corpora. The structural transfer rules are based on alignment templates, like those used in statistical MT. Alignment templates are extracted from sentence-aligned parallel corpora and extended with a set of restrictions which are derived from the bilingual dictionary of the MT system and control their application as transfer rules. The experiments conducted using three different language pairs in the free/open-source MT platform Apertium show that translation quality is improved as compared to word-for-word translation (when no transfer rules are used), and that the resulting translation quality is close to that obtained using hand-coded transfer rules. The method we present is entirely unsupervised and benefits from information in the rest of modules of the MT system in which the inferred rules are applied.

Download Full-text

RuLearn: an Open-source Toolkit for the Automatic Inference of Shallow-transfer Rules for Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2016-0018 ◽

2016 ◽

Vol 106 (1) ◽

pp. 193-204

Author(s):

Víctor M. Sánchez-Cartagena ◽

Juan Antonio Pérez-Ortiz ◽

Felipe Sánchez-Martínez

Keyword(s):

Open Source ◽

Machine Translation ◽

Statistical Machine Translation ◽

Rule Based ◽

Parallel Corpora ◽

Linguistic Resources ◽

Data Sparseness ◽

Translation Quality ◽

Transfer Rules ◽

Automatic Inference

Abstract This paper presents ruLearn, an open-source toolkit for the automatic inference of rules for shallow-transfer machine translation from scarce parallel corpora and morphological dictionaries. ruLearn will make rule-based machine translation a very appealing alternative for under-resourced language pairs because it avoids the need for human experts to handcraft transfer rules and requires, in contrast to statistical machine translation, a small amount of parallel corpora (a few hundred parallel sentences proved to be sufficient). The inference algorithm implemented by ruLearn has been recently published by the same authors in Computer Speech & Language (volume 32). It is able to produce rules whose translation quality is similar to that obtained by using hand-crafted rules. ruLearn generates rules that are ready for their use in the Apertium platform, although they can be easily adapted to other platforms. When the rules produced by ruLearn are used together with a hybridisation strategy for integrating linguistic resources from shallow-transfer rule-based machine translation into phrase-based statistical machine translation (published by the same authors in Journal of Artificial Intelligence Research, volume 55), they help to mitigate data sparseness. This paper also shows how to use ruLearn and describes its implementation.

Download Full-text