RuLearn: an Open-source Toolkit for the Automatic Inference of Shallow-transfer Rules for Machine Translation

This paper presents ruLearn, an open-source toolkit for the automatic inference of rules for shallow-transfer machine translation from scarce parallel corpora and morphological dictionaries. ruLearn will make rule-based machine translation a very appealing alternative for under-resourced language pairs because it avoids the need for human experts to handcraft transfer rules and requires, in contrast to statistical machine translation, only a small amount of parallel text (a few hundred parallel sentences proved to be sufficient). The inference algorithm implemented by ruLearn has been recently published by the same authors in Computer Speech & Language (volume 32). It is able to produce rules whose translation quality is similar to that obtained by using hand-crafted rules. ruLearn generates rules that are ready for use in the Apertium platform, although they can be easily adapted to other platforms. When the rules produced by ruLearn are used together with a hybridisation strategy for integrating linguistic resources from shallow-transfer rule-based machine translation into phrase-based statistical machine translation (published by the same authors in Journal of Artificial Intelligence Research, volume 55), they help to mitigate data sparseness. This paper also shows how to use ruLearn and describes its implementation.

Inferring Shallow-Transfer Machine Translation Rules from Small Parallel Corpora

Journal of Artificial Intelligence Research ◽

10.1613/jair.2735 ◽

2009 ◽

Vol 34 ◽

pp. 605-635 ◽

Cited By ~ 11

Author(s):

F. Sánchez-Martínez ◽

M. L. Forcada

Keyword(s):

Open Source ◽

Machine Translation ◽

Parallel Corpora ◽

Bilingual Dictionary ◽

Translation Quality ◽

Statistical MT ◽

Transfer Rules ◽

Word Translation ◽

And Control ◽

Free Open Source

This paper describes a method for the automatic inference of structural transfer rules to be used in a shallow-transfer machine translation (MT) system from small parallel corpora. The structural transfer rules are based on alignment templates, like those used in statistical MT. Alignment templates are extracted from sentence-aligned parallel corpora and extended with a set of restrictions which are derived from the bilingual dictionary of the MT system and control their application as transfer rules. The experiments conducted using three different language pairs in the free/open-source MT platform Apertium show that translation quality improves over word-for-word translation (when no transfer rules are used), and that the resulting translation quality is close to that obtained using hand-coded transfer rules. The method we present is entirely unsupervised and benefits from information in the other modules of the MT system in which the inferred rules are applied.

Integrating Rules and Dictionaries from Shallow-Transfer Machine Translation into Phrase-Based Statistical Machine Translation

Journal of Artificial Intelligence Research ◽

10.1613/jair.4761 ◽

2016 ◽

Vol 55 ◽

pp. 17-61 ◽

Cited By ~ 4

Author(s):

Víctor M. Sánchez-Cartagena ◽

Juan Antonio Pérez-Ortiz ◽

Felipe Sánchez-Martínez

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Black Box ◽

Source Language ◽

Linguistic Resources ◽

Translation Quality ◽

Hybrid Approaches ◽

Transfer Rule ◽

Transfer Rules ◽

New Strategy

We describe a hybridisation strategy whose objective is to integrate linguistic resources from shallow-transfer rule-based machine translation (RBMT) into phrase-based statistical machine translation (PBSMT). It consists of enriching the phrase table of a PBSMT system with bilingual phrase pairs matching transfer rules and dictionary entries from a shallow-transfer RBMT system. This new strategy takes advantage of how the linguistic resources are used by the RBMT system to segment the source-language sentences to be translated, and overcomes the limitations of existing hybrid approaches that treat the RBMT systems as a black box. Experimental results confirm that our approach delivers translations of higher quality than existing ones, and that it is especially useful when the parallel corpus available for training the SMT system is small or when translating out-of-domain texts that are well covered by the RBMT dictionaries. A combination of this approach with a recently proposed unsupervised shallow-transfer rule inference algorithm results in significantly higher translation quality than that of a baseline PBSMT system; in this case, the only hand-crafted resources used are the dictionaries commonly used in RBMT. Moreover, the translation quality achieved by the hybrid system built with automatically inferred rules is similar to that obtained by those built with hand-crafted rules.
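The phrase-table enrichment idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `PhrasePair` type, the `enrich_phrase_table` function, the `default_score` parameter and the boolean "comes from RBMT" flag are all hypothetical names chosen for illustration, and the way scores are assigned is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhrasePair:
    """A bilingual phrase pair (hypothetical type for this sketch)."""
    source: str
    target: str


def enrich_phrase_table(corpus_table, rbmt_pairs, default_score=1.0):
    """Merge corpus-extracted phrase pairs with pairs derived from RBMT resources.

    corpus_table: dict mapping PhrasePair -> score estimated from the parallel corpus.
    rbmt_pairs:   iterable of PhrasePair matching RBMT rules or dictionary entries.
    Returns a dict mapping PhrasePair -> (score, from_rbmt), where from_rbmt is an
    assumed binary flag marking pairs licensed by the RBMT resources.
    """
    # Start from the corpus-extracted pairs, none of them flagged as RBMT-derived.
    enriched = {pair: (score, False) for pair, score in corpus_table.items()}
    for pair in rbmt_pairs:
        # Keep the corpus score when the pair was also seen in the corpus;
        # otherwise fall back to the assumed default score.
        score, _ = enriched.get(pair, (default_score, True))
        enriched[pair] = (score, True)
    return enriched


corpus_table = {
    PhrasePair("la casa", "the house"): 0.7,
    PhrasePair("el gato", "the cat"): 0.9,
}
rbmt_pairs = [
    PhrasePair("la casa", "the house"),
    PhrasePair("casa blanca", "white house"),
]
table = enrich_phrase_table(corpus_table, rbmt_pairs)
```

In this sketch a pair found only in the RBMT resources enters the table with the default score, so the decoder can still use it when the training corpus is too small to contain it.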

Otedama: Fast Rule-Based Pre-Ordering for Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2016-0015 ◽

2016 ◽

Vol 106 (1) ◽

pp. 159-168 ◽

Cited By ~ 1

Author(s):

Julian Hitschler ◽

Laura Jehl ◽

Sariya Karimova ◽

Mayumi Ohta ◽

Benjamin Körner ◽

...

Keyword(s):

Open Source ◽

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Training Data ◽

Translation System ◽

Rule Based ◽

Machine Translation System ◽

Target Languages ◽

Established Technique

We present Otedama, a fast, open-source tool for rule-based syntactic pre-ordering, a well-established technique in statistical machine translation. Otedama implements both a learner for pre-ordering rules and a component for applying these rules to parsed sentences. Our system is compatible with several external parsers and capable of accommodating many source and all target languages in any machine translation paradigm which uses parallel training data. We demonstrate improvements on a patent translation task over a state-of-the-art English-Japanese hierarchical phrase-based machine translation system. We compare Otedama with an existing syntax-based pre-ordering system, showing comparable translation performance at a runtime speedup of a factor of 4.5-10.

Sulis: An Open Source Transfer Decoder for Deep Syntactic Statistical Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.2478/v10108-010-0005-7 ◽

2010 ◽

Vol 93 (1) ◽

pp. 17-26 ◽

Cited By ~ 1

Author(s):

Yvette Graham

Keyword(s):

Open Source ◽

Linear Combination ◽

Machine Translation ◽

Statistical Machine Translation ◽

Language Model ◽

Beam Search ◽

Translation Model ◽

Transfer Rules ◽

Log Linear

In this paper, we describe an open source transfer decoder for Deep Syntactic Transfer-Based Statistical Machine Translation. Transfer decoding involves the application of transfer rules to a SL structure. The N-best TL structures are found via a beam search of TL hypothesis structures which are ranked via a log-linear combination of feature scores, such as translation model and dependency-based language model.
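The beam search with log-linear ranking mentioned in this abstract can be sketched as follows. This is a generic illustration and not the Sulis decoder: hypotheses are flattened to tuples of words rather than deep syntactic structures, and the function names, the `expand` callback, the feature names `tm` and `lm`, and the toy scores are all assumptions.

```python
import heapq


def log_linear_score(features, weights):
    """Weighted sum of (log-domain) feature scores."""
    return sum(weights[name] * value for name, value in features.items())


def beam_search(start, expand, weights, beam_size=2, steps=3):
    """Keep the beam_size best-scoring hypotheses after each expansion step.

    expand(hyp) must return a list of (new_hyp, features) pairs, where features
    is a dict of feature scores for that single extension (assumed interface).
    Returns a list of (score, hypothesis) pairs, best first.
    """
    beam = [(0.0, start)]
    for _ in range(steps):
        candidates = []
        for score, hyp in beam:
            for new_hyp, features in expand(hyp):
                total = score + log_linear_score(features, weights)
                candidates.append((total, new_hyp))
        if not candidates:
            break  # nothing can be extended further; keep the current beam
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beam


def toy_expand(hyp):
    """Toy extension function: append word 'a' or 'b' with fixed feature scores."""
    options = {
        "a": {"tm": -1.0, "lm": -0.5},
        "b": {"tm": -0.2, "lm": -2.0},
    }
    return [(hyp + (word,), feats) for word, feats in options.items()]


n_best = beam_search((), toy_expand, {"tm": 1.0, "lm": 1.0}, beam_size=2, steps=2)
```

The returned list plays the role of an N-best list; in a real transfer decoder the expansion step would apply transfer rules to the source structure instead of appending words.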

Margin Infused Relaxed Algorithm for Moses

Prague Bulletin of Mathematical Linguistics ◽

10.2478/v10108-011-0012-3 ◽

2011 ◽

Vol 96 (1) ◽

pp. 69-78 ◽

Cited By ~ 5

Author(s):

Eva Hasler ◽

Barry Haddow ◽

Philipp Koehn

Keyword(s):

Open Source ◽

Machine Translation ◽

Error Rate ◽

Statistical Machine Translation ◽

Experimental Results ◽

Minimum Error ◽

Feature Sets ◽

Translation Quality ◽

Core Feature ◽

Minimum Error Rate Training

We describe an open-source implementation of the Margin Infused Relaxed Algorithm (MIRA) for statistical machine translation (SMT). The implementation is part of the Moses toolkit and can be used as an alternative to standard minimum error rate training (MERT). A description of the implementation and its usage on core feature sets as well as large, sparse feature sets is given and we report experimental results comparing the performance of MIRA with MERT in terms of translation quality and stability.
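For readers unfamiliar with MIRA, a simplified single-constraint update can be sketched as below. This is a textbook-style illustration and not the Moses implementation, which differs in how it selects hypotheses and batches updates; the function name `mira_update`, the sparse-dict feature representation and the clipping constant `C` are assumptions.

```python
def mira_update(weights, oracle_feats, hyp_feats, loss, C=0.01):
    """One simplified MIRA step for a single (oracle, hypothesis) pair.

    The weights change as little as possible so that the oracle outscores the
    hypothesis by a margin of at least `loss`; the step size is clipped at C.
    Feature vectors are sparse dicts mapping feature name -> value.
    """
    keys = set(oracle_feats) | set(hyp_feats)
    diff = {k: oracle_feats.get(k, 0.0) - hyp_feats.get(k, 0.0) for k in keys}
    margin = sum(weights.get(k, 0.0) * v for k, v in diff.items())
    norm_sq = sum(v * v for v in diff.values())
    violation = loss - margin
    if violation <= 0 or norm_sq == 0:
        return dict(weights)  # constraint already satisfied, or nothing to move along
    step = min(C, violation / norm_sq)
    updated = dict(weights)
    for k, v in diff.items():
        updated[k] = updated.get(k, 0.0) + step * v
    return updated


w0 = {"lm": 0.0, "tm": 0.0}
oracle = {"lm": 1.0, "tm": 2.0}
hypothesis = {"lm": 2.0, "tm": 1.0}
w1 = mira_update(w0, oracle, hypothesis, loss=1.0, C=1.0)
w2 = mira_update(w1, oracle, hypothesis, loss=1.0, C=1.0)
```

Because only features that differ between the two translations are touched, the same update applies unchanged to small core feature sets and to large sparse ones.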

“Bilingual Expert” Can Find Translation Errors

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016367 ◽

2019 ◽

Vol 33 ◽

pp. 6367-6374 ◽

Cited By ~ 7

Author(s):

Kai Fan ◽

Jiayi Wang ◽

Bo Li ◽

Fengming Zhou ◽

Boxing Chen ◽

...

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Language Model ◽

Target Language ◽

Quality Estimation ◽

Parallel Corpora ◽

Translation Quality ◽

Expert Model ◽

Parallel Data ◽

Translation Errors

The performance of machine translation (MT) systems is usually evaluated with the BLEU metric when gold references are provided. However, during model inference or production deployment, gold references are usually expensive to obtain, for example through human annotation requiring bilingual expertise. To address translation quality estimation (QE) without references, we propose a general framework for automatic evaluation of translation output for the QE task of the Conference on Statistical Machine Translation (WMT). We first build a conditional target language model with a novel bidirectional transformer, named the neural bilingual expert model, which is pre-trained on large parallel corpora for feature extraction. For QE inference, the bilingual expert model can simultaneously produce the joint latent representation between the source and the translation, and real-valued measurements of possibly erroneous tokens based on the prior knowledge learned from parallel data. Subsequently, the features are fed into a simple Bi-LSTM predictive model for quality estimation. The experimental results show that our approach achieves state-of-the-art performance on most publicly available datasets of the WMT 2017/2018 QE task.

Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information

Computational Linguistics ◽

10.1162/089120104323093285 ◽

2004 ◽

Vol 30 (2) ◽

pp. 181-204 ◽

Cited By ~ 31

Author(s):

Sonja Nießen ◽

Hermann Ney

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Training Data ◽

Target Language ◽

Linguistic Knowledge ◽

Parallel Corpora ◽

Scarce Resources ◽

Translation Quality ◽

Sentence Level ◽

Statistical Systems

In statistical machine translation, correspondences between the words in the source and the target language are learned from parallel corpora, and often little or no linguistic knowledge is used to structure the underlying models. In particular, existing statistical systems for machine translation often treat different inflected forms of the same lemma as if they were independent of one another. The bilingual training data can be better exploited by explicitly taking into account the interdependencies of related inflected forms. We propose the construction of hierarchical lexicon models on the basis of equivalence classes of words. In addition, we introduce sentence-level restructuring transformations which aim at the assimilation of word order in related sentences. We have systematically investigated the amount of bilingual training data required to maintain an acceptable quality of machine translation. The combination of the suggested methods for improving translation quality in frameworks with scarce resources has been successfully tested: We were able to reduce the amount of bilingual training data to less than 10% of the original corpus, while losing only 1.6% in translation quality. The improvement of the translation results is demonstrated on two German-English corpora taken from the Verbmobil task and the Nespole! task.

Adabot: Fault-Tolerant Java Decompiler (Student Abstract)

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i10.7203 ◽

2020 ◽

Vol 34 (10) ◽

pp. 13861-13862

Author(s):

Zhiming Li ◽

Qing Wu ◽

Kun Qian

Keyword(s):

Reverse Engineering ◽

Machine Translation ◽

Fault Tolerant ◽

Safety Concern ◽

Statistical Machine Translation ◽

Abstract Syntax ◽

Rule Based ◽

Abstract Syntax Tree ◽

Important Field ◽

Internal Architecture

Reverse engineering has been an extremely important field in software engineering; it helps us to better understand and analyze the internal architecture and interrelations of executables. Classical Java reverse engineering tasks include disassembly and decompilation. Traditional Abstract Syntax Tree (AST) based disassemblers and decompilers are strictly rule-defined and thus highly fault-intolerant when bytecode obfuscation is introduced for safety reasons. In this work, we view decompilation as a statistical machine translation task and propose a decompilation framework which is fully based on the self-attention mechanism. Through better adaptation to the linguistic uniqueness of bytecode, our model outperforms rule-based models and previous works based on the recurrence mechanism.

An Open-Source Web-Based Tool for Resource-Agnostic Interactive Translation Prediction

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2014-0015 ◽

2014 ◽

Vol 102 (1) ◽

pp. 69-80 ◽

Cited By ~ 2

Author(s):

Daniel Torregrosa ◽

Mikel L. Forcada ◽

Juan Antonio Pérez-Ortiz

Keyword(s):

Open Source ◽

Machine Translation ◽

Web Application ◽

Statistical Machine Translation ◽

Black Box ◽

Translation System ◽

Web Tool ◽

Web Based ◽

Strongly Coupled ◽

Machine Translation System

We present a web-based open-source tool for interactive translation prediction (ITP) and describe its underlying architecture. ITP systems assist human translators by making context-based computer-generated suggestions as they type. Most of the ITP systems in the literature are strongly coupled with a statistical machine translation system that is conveniently adapted to provide the suggestions. Our system, however, follows a resource-agnostic approach and suggestions are obtained from any unmodified black-box bilingual resource. This paper reviews our ITP method and describes the architecture of Forecat, a web tool, partly based on the recent technology of web components, that eases the use of our ITP approach in any web application requiring this kind of translation assistance. We also evaluate the performance of our method when using an unmodified Moses-based statistical machine translation system as the bilingual resource.

End-to-end statistical machine translation with zero or small parallel texts

Natural Language Engineering ◽

10.1017/s1351324916000127 ◽

2016 ◽

Vol 22 (4) ◽

pp. 517-548 ◽

Cited By ~ 5

Author(s):

Ann Irvine ◽

Chris Callison-Burch

Keyword(s):

Detailed Analysis ◽

Machine Translation ◽

Statistical Machine Translation ◽

Parallel Corpora ◽

Low Resource ◽

Bilingual Lexicon ◽

Orthographic Similarity ◽

Discriminative Model ◽

End To End ◽

Parallel Texts

We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present detailed analysis of the accuracy of bilingual lexicon induction, and show how a discriminative model can be used to combine various signals of translation equivalence (like contextual similarity, temporal similarity, orthographic similarity and topic similarity). Our discriminative model produces higher accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features on a phrase-based SMT system. These monolingually estimated features enhance low resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.
