scholarly journals Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation

2017 ◽  
Vol 5 ◽  
pp. 487-500
Author(s):  
Benjamin Marie ◽  
Atsushi Fujita

We present a new framework to induce an in-domain phrase table from in-domain monolingual data that can be used to adapt a general-domain statistical machine translation system to the targeted domain. Our method first compiles sets of phrases in source and target languages separately and generates candidate phrase pairs by taking the Cartesian product of the two phrase sets. It then computes inexpensive features for each candidate phrase pair and filters them using a supervised classifier in order to induce an in-domain phrase table. We experimented on the language pair English–French, both translation directions, in two domains and obtained consistently better results than a strong baseline system that uses an in-domain bilingual lexicon. We also conducted an error analysis that showed the induced phrase tables proposed useful translations, especially for words and phrases unseen in the parallel data used to train the general-domain baseline system.

2005 ◽  
Vol 31 (4) ◽  
pp. 477-504 ◽  
Author(s):  
Dragos Stefan Munteanu ◽  
Daniel Marcu

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.


2016 ◽  
Vol 106 (1) ◽  
pp. 159-168 ◽  
Author(s):  
Julian Hitschler ◽  
Laura Jehl ◽  
Sariya Karimova ◽  
Mayumi Ohta ◽  
Benjamin Körner ◽  
...  

Abstract We present Otedama, a fast, open-source tool for rule-based syntactic pre-ordering, a well established technique in statistical machine translation. Otedama implements both a learner for pre-ordering rules, as well as a component for applying these rules to parsed sentences. Our system is compatible with several external parsers and capable of accommodating many source and all target languages in any machine translation paradigm which uses parallel training data. We demonstrate improvements on a patent translation task over a state-of-the-art English-Japanese hierarchical phrase-based machine translation system. We compare Otedama with an existing syntax-based pre-ordering system, showing comparable translation performance at a runtime speedup of a factor of 4.5-10.


2020 ◽  
Vol 44 (1) ◽  
pp. 33-50
Author(s):  
Ivan Dunđer

Machine translation is increasingly becoming a hot research topic in information and communication sciences, computer science and computational linguistics, due to the fact that it enables communication and transferring of meaning across different languages. As the Croatian language can be considered low-resourced in terms of available services and technology, development of new domain-specific machine translation systems is important, especially due to raised interest and needs of industry, academia and everyday users. Machine translation is not perfect, but it is crucial to assure acceptable quality, which is purpose-dependent. In this research, different statistical machine translation systems were built – but one system utilized domain adaptation in particular, with the intention of boosting the output of machine translation. Afterwards, extensive evaluation has been performed – in form of applying several automatic quality metrics and human evaluation with focus on various aspects. Evaluation is done in order to assess the quality of specific machine-translated text.


Author(s):  
K. Jaya ◽  
Deepa Gupta

Even though lot of Statistical Machine Translation(SMT) research work is happening for English-Hindi language pair, there is no effort done to standardize the dataset. Each of the research work uses different dataset, different parameters and different number of sentences during various phases of translation resulting in varied translation output. So comparing  these models, understand the result of these models, to get insight into corpus behavior for these models, regenerating the result of these research work  becomes tedious. This necessitates the need for standardization of dataset and to identify the common parameter for the development of model.  The main contribution of this paper is to discuss an approach to standardize the dataset and to identify the best parameter which in combination gives best performance. It also investigates a novel corpus augmentation approach to improve the translation quality of English-Hindi bidirectional statistical machine translation system. This model works well for the scarce resource without incorporating the external parallel data corpus of the underlying language.  This experiment is carried out using Open Source phrase-based toolkit Moses. Indian Languages Corpora Initiative (ILCI) Hindi-English tourism corpus is used.  With limited dataset, considerable improvement is achieved using the corpus augmentation approach for the English-Hindi bidirectional SMT system.


Author(s):  
K. Jaya ◽  
Deepa Gupta

Even though lot of Statistical Machine Translation(SMT) research work is happening for English-Hindi language pair, there is no effort done to standardize the dataset. Each of the research work uses different dataset, different parameters and different number of sentences during various phases of translation resulting in varied translation output. So comparing  these models, understand the result of these models, to get insight into corpus behavior for these models, regenerating the result of these research work  becomes tedious. This necessitates the need for standardization of dataset and to identify the common parameter for the development of model.  The main contribution of this paper is to discuss an approach to standardize the dataset and to identify the best parameter which in combination gives best performance. It also investigates a novel corpus augmentation approach to improve the translation quality of English-Hindi bidirectional statistical machine translation system. This model works well for the scarce resource without incorporating the external parallel data corpus of the underlying language.  This experiment is carried out using Open Source phrase-based toolkit Moses. Indian Languages Corpora Initiative (ILCI) Hindi-English tourism corpus is used.  With limited dataset, considerable improvement is achieved using the corpus augmentation approach for the English-Hindi bidirectional SMT system.


2016 ◽  
Vol 1 (1) ◽  
pp. 45-49
Author(s):  
Avinash Singh ◽  
Asmeet Kour ◽  
Shubhnandan S. Jamwal

The objective behind this paper is to analyze the English-Dogri parallel corpus translation. Machine translation is the translation from one language into another language. Machine translation is the biggest application of the Natural Language Processing (NLP). Moses is statistical machine translation system allow to train translation models for any language pair. We have developed translation system using Statistical based approach which helps in translating English to Dogri and vice versa. The parallel corpus consists of 98,973 sentences. The system gives accuracy of 80% in translating English to Dogri and the system gives accuracy of 87% in translating Dogri to English system.


2014 ◽  
Vol 102 (1) ◽  
pp. 69-80 ◽  
Author(s):  
Torregrosa Daniel ◽  
Forcada Mikel L. ◽  
Pérez-Ortiz Juan Antonio

Abstract We present a web-based open-source tool for interactive translation prediction (ITP) and describe its underlying architecture. ITP systems assist human translators by making context-based computer-generated suggestions as they type. Most of the ITP systems in literature are strongly coupled with a statistical machine translation system that is conveniently adapted to provide the suggestions. Our system, however, follows a resource-agnostic approach and suggestions are obtained from any unmodified black-box bilingual resource. This paper reviews our ITP method and describes the architecture of Forecat, a web tool, partly based on the recent technology of web components, that eases the use of our ITP approach in any web application requiring this kind of translation assistance. We also evaluate the performance of our method when using an unmodified Moses-based statistical machine translation system as the bilingual resource.


2016 ◽  
Vol 5 (4) ◽  
pp. 51-66 ◽  
Author(s):  
Krzysztof Wolk ◽  
Krzysztof P. Marasek

The quality of machine translation is rapidly evolving. Today one can find several machine translation systems on the web that provide reasonable translations, although the systems are not perfect. In some specific domains, the quality may decrease. A recently proposed approach to this domain is neural machine translation. It aims at building a jointly-tuned single neural network that maximizes translation performance, a very different approach from traditional statistical machine translation. Recently proposed neural machine translation models often belong to the encoder-decoder family in which a source sentence is encoded into a fixed length vector that is, in turn, decoded to generate a translation. The present research examines the effects of different training methods on a Polish-English Machine Translation system used for medical data. The European Medicines Agency parallel text corpus was used as the basis for training of neural and statistical network-based translation systems. A comparison and implementation of a medical translator is the main focus of our experiments.


Sign in / Sign up

Export Citation Format

Share Document