English-Kazakh Parallel Corpus For Statistical Machine Translation

The objective of this paper is to analyze translation with an English-Dogri parallel corpus. Machine translation is the automatic translation of text from one language into another, and it is one of the biggest applications of Natural Language Processing (NLP). Moses is a statistical machine translation system that allows translation models to be trained for any language pair. We have developed a translation system using a statistical approach that translates English to Dogri and vice versa. The parallel corpus consists of 98,973 sentences. The system achieves an accuracy of 80% when translating English to Dogri and 87% when translating Dogri to English.

Analyzing Subword Techniques to Improve English to Sinhala Neural Machine Translation

International Journal of Asian Language Processing ◽

10.1142/s2717554520500174 ◽

2021 ◽

pp. 2050017

Author(s):

Rashmini Naranpanawa ◽

Ravinga Perera ◽

Thilakshi Fonseka ◽

Uthayasanker Thayasivam

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Translation System ◽

Rare Word ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Low Resource ◽

Word Level ◽

Morphologically Rich Languages

Neural machine translation (NMT) performs much better than statistical machine translation (SMT) models when parallel data is abundant. However, vanilla NMT operates primarily at the word level with a fixed vocabulary. Therefore, low-resource, morphologically rich languages such as Sinhala are strongly affected by the out-of-vocabulary (OOV) and rare-word problems. Recent advancements in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the Transformer and explore standard subword techniques on top of it to identify which subword approach has the greater effect on the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies combined with state-of-the-art NMT can perform remarkably well when translating English sentences into a morphologically rich language, even without a large parallel corpus.
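
To illustrate how subword segmentation enables open-vocabulary translation, here is a minimal sketch of byte-pair encoding (BPE), one widely used subword technique. The abstract does not say which techniques the authors compared, so this is an illustration under that assumption, not their method. The function names (`learn_bpe`, `merge_pair`, `segment`) and the toy word counts are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge operations from a {word: count} dictionary."""
    # Start from characters, plus an end-of-word marker (an illustrative convention).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges = learn_bpe(toy_counts, num_merges=10)
    # "lowest" never appears in the toy data, yet it can still be segmented.
    print(segment("lowest", merges))
```

Because an unseen word is always decomposable into learned subword units (in the worst case, single characters), the model never has to emit an unknown-word token, which is what makes the vocabulary effectively open.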

Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Computational Linguistics ◽

10.1162/089120105775299168 ◽

2005 ◽

Vol 31 (4) ◽

pp. 477-504 ◽

Cited By ~ 104

Author(s):

Dragos Stefan Munteanu ◽

Daniel Marcu

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Translation System ◽

Parallel Corpora ◽

Parallel Corpus ◽

Scarce Resources ◽

Parallel Data ◽

Machine Translation System ◽

Novel Method ◽

Arabic And English

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.

Refining semi-automatic parallel corpus creation for Zulu to English statistical machine translation

2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech) ◽

10.1109/robomech.2016.7813168 ◽

2016 ◽

Author(s):

Gideon Kotze

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Parallel Corpus ◽

Corpus Creation

Building a Spanish-Portuguese parallel corpus for statistical machine translation

Companion Proceedings of the XIV Brazilian Symposium on Multimedia and the Web - WebMedia '08 ◽

10.1145/1809980.1810069 ◽

2008 ◽

Author(s):

Wilker F. Aziz ◽

Thiago A. S. Pardo ◽

Ivandré Paraboni

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Parallel Corpus

Neural machine translation of low-resource languages using SMT phrase pair injection

Natural Language Engineering ◽

10.1017/s1351324920000303 ◽

2020 ◽

pp. 1-22

Author(s):

Sukanta Sen ◽

Mohammed Hasanuzzaman ◽

Asif Ekbal ◽

Pushpak Bhattacharyya ◽

Andy Way

Keyword(s):

Machine Translation ◽

Large Scale ◽

Production Systems ◽

Statistical Machine Translation ◽

Training Data ◽

Original Training ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Low Resource ◽

Better Than

Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and a sufficiently large corpus is not always available because building one requires time, money, and professionals. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in a low-resource scenario without using any additional data. Our approach augments the original training data with parallel phrases extracted from the original training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on gated recurrent unit (GRU) and Transformer networks. We choose the Hindi–English and Hindi–Bengali datasets for the Health, Tourism, and Judicial (only for Hindi–English) domains. We train our NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show improvements in the range of 1.38–15.36 BiLingual Evaluation Understudy (BLEU) points over the baseline systems, and that Transformer models perform better than GRU models in low-resource scenarios. In addition, we find that our proposed method outperforms SMT, which is known to work better than neural models in low-resource scenarios, for some translation directions. To further show the effectiveness of our proposed model, we also apply our approach to another interesting NMT task, old-to-modern English translation, using a tiny parallel corpus of only 2.7K sentences. For this task, we use publicly available old-modern English text which is approximately 1000 years old. Evaluation for this task shows significant improvement over the baseline NMT.

Optimizing Statistical Machine Translation for Text Simplification

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00107 ◽

2016 ◽

Vol 4 ◽

pp. 401-415 ◽

Cited By ~ 32

Author(s):

Wei Xu ◽

Courtney Napoles ◽

Ellie Pavlick ◽

Quanze Chen ◽

Chris Callison-Burch

Keyword(s):

Machine Translation ◽

Large Scale ◽

Statistical Machine Translation ◽

Parallel Corpus ◽

Iterative Development ◽

Text Simplification ◽

Multiple References

Most recent sentence simplification systems use basic machine translation models to learn lexical and syntactic paraphrases from a manually simplified parallel corpus. These methods are limited by the quality and quantity of manually simplified corpora, which are expensive to build. In this paper, we conduct an in-depth adaptation of statistical machine translation to perform text simplification, taking advantage of large-scale paraphrases learned from bilingual texts and a small amount of manual simplifications with multiple references. Our work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task.

Verb Phrases Alignment Technique for English-Malayalam Parallel Corpus in Statistical Machine Translation (Special issue on MTIL 2017)

Journal of Intelligent Systems ◽

10.1515/jisys-2018-0066 ◽

2019 ◽

Vol 28 (3) ◽

pp. 479-492

Author(s):

Mary Priya Sebastian ◽

G. Santhosh Kumar

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Foreign Languages ◽

Parallel Corpus ◽

Linguistic Resources ◽

Translation Tools ◽

Verb Phrases ◽

Alignment Technique ◽

Developing Area ◽

Malayalam Language

Machine translation (MT) from English to foreign languages is a fast-developing area of research, and various translation techniques are discussed in the literature. However, translation from English to Malayalam, a Dravidian language, is still at an early stage, and work in this field has not yet flourished to a great extent. The main reason for this shortcoming is the non-availability of linguistic resources and translation tools for the Malayalam language. A parallel corpus with alignment is one such resource that is essential for a machine translation system. This paper focuses on a technique that enables the automatic construction of a verb-aligned parallel corpus by exploring the internal structure of the English and Malayalam languages, which in turn facilitates machine translation from English to Malayalam.

Low Resource Neural Machine Translation: Assamese to/from Other Indo-Aryan (Indic) Languages

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3469721 ◽

2022 ◽

Vol 21 (1) ◽

pp. 1-32

Author(s):

Rupjyoti Baruah ◽

Rajesh Kumar Mundotiya ◽

Anil Kumar Singh

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Basic Sequence ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Translation Quality ◽

Evaluation Scores ◽

Language Characteristics ◽

The Given ◽

Family Trees

Machine translation (MT) systems have been built using numerous different techniques for bridging language barriers. These techniques are broadly categorized into approaches such as Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). End-to-end NMT systems significantly outperform SMT in translation quality on many language pairs, especially those with adequate parallel corpora. We report comparative experiments on baseline MT systems for Assamese to other Indo-Aryan languages (in both translation directions) using traditional Phrase-Based SMT as well as some more successful NMT architectures, namely a basic sequence-to-sequence model with attention, Transformer, and finetuned Transformer. The results are evaluated using the most prominent standard automatic metric, BLEU (BiLingual Evaluation Understudy), as well as other well-known metrics for exploring the performance of different baseline MT systems, since this is the first such work involving Assamese. The evaluation scores are compared for SMT and NMT models across bi-directional language pairs involving Assamese and other Indo-Aryan languages (Bangla, Gujarati, Hindi, Marathi, Odia, Sinhalese, and Urdu). The highest BLEU scores obtained are for Assamese to Sinhalese for SMT (35.63) and Assamese to Bangla for the NMT systems (seq2seq is 50.92, Transformer is 50.01, and finetuned Transformer is 50.19). We also try to relate the results to language characteristics, distances, family trees, domains, data sizes, and sentence lengths. We find that the domain is the most important factor affecting the results for the given data domains and sizes. We compare our results with the only existing MT system for Assamese (Bing Translator) and also with pairs involving Hindi.
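
As a rough illustration of what a BLEU score measures, the following sketch computes an unsmoothed, single-reference, sentence-level BLEU from clipped n-gram precisions and a brevity penalty. It is a simplified assumption-laden example: published scores such as those above are normally computed at corpus level with a standard tool and reported on a 0-100 scale, whereas this hypothetical `sentence_bleu` helper returns a value in [0, 1] for one sentence pair.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU for two token lists (value in [0, 1])."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or overlap == 0:
            # Without smoothing, any zero precision makes the geometric mean zero.
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates that are not longer than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions with uniform weights.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    ref = "the cat sat on the mat".split()
    hyp = "the cat sat on a mat".split()
    print(round(100 * sentence_bleu(hyp, ref), 2))
```

Multiplying the result by 100 gives a number on the same scale as the scores quoted in the abstract, although sentence-level and corpus-level values are not directly comparable.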

Aspects of Multilingual News Summarisation

Advances in Data Mining and Database Management - Innovative Document Summarization Techniques ◽

10.4018/978-1-4666-5019-0.ch012 ◽

2014 ◽

pp. 277-294

Author(s):

Josef Steinberger ◽

Ralf Steinberger ◽

Hristo Tanev ◽

Vanni Zavarella ◽

Marco Turchi

Keyword(s):

Machine Translation ◽

Latent Semantic Analysis ◽

Automatic System ◽

Semantic Analysis ◽

Statistical Machine Translation ◽

Challenging Problem ◽

Parallel Corpus ◽

High Performing ◽

Domain Specific ◽

Multiple Languages

In this chapter, the authors discuss several pertinent aspects of an automatic system that generates summaries in multiple languages for sets of topic-related news articles (multilingual multi-document summarisation), gathered by news aggregation systems. The discussion follows a framework based on Latent Semantic Analysis (LSA) because LSA was shown to be a high-performing method across many different languages. Starting from a sentence-extractive approach, the authors show how domain-specific aspects can be used and how a compression and paraphrasing method can be plugged in. They also discuss the challenging problem of summarisation evaluation in different languages. In particular, the authors describe two approaches: the first uses a parallel corpus and the second statistical machine translation.
