Building a Bilingual Corpus based on Hybrid Approach for Malayalam-English Machine Translation

Word alignment in bilingual corpora has been a very active research topic in the Machine Translation research groups. In this research paper, we describe an alignment system that aligns English-Malayalam texts at word level in parallel sentences. The alignment of translated segments with source segments is very essential for building parallel corpora. Since word alignment research on Malayalam and English languages is still in its immaturity, it is not a trivial task for Malayalam-English text. A parallel corpus is a collection of texts in two languages, one of which is the translation equivalent of the other. Thus, the main purpose of this system is to construct word-aligned parallel corpus to be used in Malayalam-English machine translation. The proposed approach is a hybrid approach, a combination of corpus based and dictionary lookup approaches. The corpus based approach is based on the first three IBM models and Expectation Maximization (EM) algorithm. For the dictionary lookup approach, the proposed system uses the bilingual Malayalam-English Dictionary.

Download Full-text

Analyzing Subword Techniques to Improve English to Sinhala Neural Machine Translation

International Journal of Asian Language Processing ◽

10.1142/s2717554520500174 ◽

2021 ◽

pp. 2050017

Author(s):

Rashmini Naranpanawa ◽

Ravinga Perera ◽

Thilakshi Fonseka ◽

Uthayasanker Thayasivam

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Translation System ◽

Rare Word ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Low Resource ◽

Word Level ◽

Morphologically Rich Languages

Neural machine translation (NMT) is a remarkable approach which performs much better than the Statistical machine translation (SMT) models when there is an abundance of parallel corpus. However, vanilla NMT is primarily based upon word-level with a fixed vocabulary. Therefore, low resource morphologically rich languages such as Sinhala are mostly affected by the out of vocabulary (OOV) and Rare word problems. Recent advancements in subword techniques have opened up opportunities for low resource communities by enabling open vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the transformer and explore standard subword techniques on top of it to identify which subword approach has a greater effect on English Sinhala language pair. Our models demonstrate that subword segmentation strategies along with the state-of-the-art NMT can perform remarkably when translating English sentences into a rich morphology language regardless of a large parallel corpus.

Download Full-text

Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation

Computational Intelligence and Neuroscience ◽

10.1155/2021/6682385 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Michael Adjeisah ◽

Guohua Liu ◽

Douglas Omwenga Nyabuga ◽

Richard Nuetey Nortey ◽

Jinling Song

Keyword(s):

Machine Translation ◽

Language Processing ◽

Training Data ◽

Target Language ◽

Similarity Metrics ◽

Mahalanobis Distances ◽

Parallel Corpora ◽

Parallel Corpus ◽

Low Resource ◽

Sentence Level

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain on a low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often perplexing to learn and understand what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve the MT performance in such low-resource language pairs, we propose to expand the training data by injecting synthetic-parallel corpus obtained by translating a monolingual corpus from the target language based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair engaging squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on a diverse amount of available parallel corpus demonstrate that injecting pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits tremendous developments in BLEU and TER scores.

Download Full-text

Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Computational Linguistics ◽

10.1162/089120105775299168 ◽

2005 ◽

Vol 31 (4) ◽

pp. 477-504 ◽

Cited By ~ 104

Author(s):

Dragos Stefan Munteanu ◽

Daniel Marcu

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Translation System ◽

Parallel Corpora ◽

Parallel Corpus ◽

Scarce Resources ◽

Parallel Data ◽

Machine Translation System ◽

Novel Method ◽

Arabic And English

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.

Download Full-text

Neural machine translation research based on the semantic vector of the tri-lingual parallel corpus

2016 International Conference on Machine Learning and Cybernetics (ICMLC) ◽

10.1109/icmlc.2016.7860879 ◽

2016 ◽

Cited By ~ 2

Author(s):

Xiao-Xue Wang ◽

Cong-Hui Zhu ◽

Sheng Li ◽

Tie-Jun Zhao ◽

De-Quan Zheng

Keyword(s):

Machine Translation ◽

Translation Research ◽

Neural Machine Translation ◽

Parallel Corpus

Download Full-text

Corpus Augmentation for Neural Machine Translation with Chinese-Japanese Parallel Corpora

Applied Sciences ◽

10.3390/app9102036 ◽

2019 ◽

Vol 9 (10) ◽

pp. 2036

Author(s):

Jinyi Zhang ◽

Tadahiro Matsumoto

Keyword(s):

Machine Translation ◽

Scientific Paper ◽

Training Data ◽

Word Alignment ◽

Sentence Pair ◽

Neural Machine Translation ◽

Parallel Corpora ◽

Translation Quality ◽

Parallel Data ◽

Source Sentence

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the training data size. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method, which has two variations: one is for all language pairs, and the other is for the Chinese-Japanese language pair. The method uses both source and target sentences of the existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with “shared Chinese character rates” in segments of the sentence pairs. The experiment results of the Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply the code (see Supplementary Materials) that can reproduce our proposed method.

Download Full-text

A Character Level Based and Word Level Based Approach for Chinese-Vietnamese Machine Translation

Computational Intelligence and Neuroscience ◽

10.1155/2016/9821608 ◽

2016 ◽

Vol 2016 ◽

pp. 1-11 ◽

Cited By ~ 6

Author(s):

Phuoc Tran ◽

Dien Dinh ◽

Hien T. Nguyen

Keyword(s):

Machine Translation ◽

Hybrid Approach ◽

Sparse Data ◽

Word Segmentation ◽

Experimental Results ◽

Translation System ◽

Word Level ◽

Data Problem ◽

Sparse Data Problem ◽

Language Pair

Chinese and Vietnamese have the same isolated language; that is, the words are not delimited by spaces. In machine translation, word segmentation is often done first when translating from Chinese or Vietnamese into different languages (typically English) and vice versa. However, it is a matter for consideration that words may or may not be segmented when translating between two languages in which spaces are not used between words, such as Chinese and Vietnamese. Since Chinese-Vietnamese is a low-resource language pair, the sparse data problem is evident in the translation system of this language pair. Therefore, while translating, whether it should be segmented or not becomes more important. In this paper, we propose a new method for translating Chinese to Vietnamese based on a combination of the advantages of character level and word level translation. In addition, a hybrid approach that combines statistics and rules is used to translate on the word level. And at the character level, a statistical translation is used. The experimental results showed that our method improved the performance of machine translation over that of character or word level translation.

Download Full-text

Maximum Expected Likelihood Estimation for Zero-resource Neural Machine Translation

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/594 ◽

2017 ◽

Cited By ~ 4

Author(s):

Hao Zheng ◽

Yong Cheng ◽

Yang Liu

Keyword(s):

Machine Translation ◽

Likelihood Estimation ◽

Neural Machine Translation ◽

Parallel Corpora ◽

Translation Model ◽

Parallel Corpus ◽

Target Model ◽

High Resource ◽

Remarkable Progress

While neural machine translation (NMT) has made remarkable progress in translating a handful of high-resource language pairs recently, parallel corpora are not always available for many zero-resource language pairs. To deal with this problem, we propose an approach to zero-resource NMT via maximum expected likelihood estimation. The basic idea is to maximize the expectation with respect to a pivot-to-source translation model for the intended source-to-target model on a pivot-target parallel corpus. To approximate the expectation, we propose two methods to connect the pivot-to-source and source-to-target models. Experiments on two zero-resource language pairs show that the proposed approach yields substantial gains over baseline methods. We also observe that when trained jointly with the source-to-target model, the pivot-to-source translation model also obtains improvements over independent training.

Download Full-text

Equivalence and Non-equivalence in Parallel Corpora

International Journal of Corpus Linguistics ◽

10.1075/ijcl.6.si.13var ◽

2001 ◽

Vol 6 (3) ◽

pp. 167-177

Author(s):

Tamás Váradi ◽

Gábor Kiss

Keyword(s):

Implicit Knowledge ◽

Source Material ◽

Target Language ◽

Word Alignment ◽

Parallel Corpora ◽

Parallel Corpus ◽

Multiword Units ◽

Word Classes ◽

Lexical Items ◽

Bilingual Lexicography

The present paper shows how an aligned parallel corpus can be used to investigate the consistency of translation equivalence across the two languages in a parallel corpus. The particular issues addressed are the bidirectionality of translation equivalence, the coverage of multiword units, and the amount of implicit knowledge presupposed on the part of the user in interpreting the data. Three lexical items belonging to different word classes were chosen for analysis: the noun head, the verb give, and the preposition with. George Orwell’s novel 1984 was used as source material as it available in English-Hungarian sentence-aligned form. It is argued that the analysis of translation equivalents displayed in sets of concordances with aligned sentences in the target language holds important implications for bilingual lexicography and automatic word alignment methodology.

Download Full-text

Evaluating a Pivot-Based Approach for Bilingual Lexicon Extraction

Computational Intelligence and Neuroscience ◽

10.1155/2015/434153 ◽

2015 ◽

Vol 2015 ◽

pp. 1-13 ◽

Cited By ~ 3

Author(s):

Jae-Hoon Kim ◽

Hong-Seok Kwon ◽

Hyeong-Won Seo

Keyword(s):

Machine Translation ◽

Word Association ◽

Statistical Machine Translation ◽

Low Frequency ◽

The Other ◽

Word Alignment ◽

Parallel Corpora ◽

Empirical Results ◽

Bilingual Lexicon ◽

Resource Poor

A pivot-based approach for bilingual lexicon extraction is based on the similarity of context vectors represented by words in a pivot language like English. In this paper, in order to show validity and usability of the pivot-based approach, we evaluate the approach in company with two different methods for estimating context vectors: one estimates them from two parallel corpora based on word association between source words (resp., target words) and pivot words and the other estimates them from two parallel corpora based on word alignment tools for statistical machine translation. Empirical results on two language pairs (e.g., Korean-Spanish and Korean-French) have shown that the pivot-based approach is very promising for resource-poor languages and this approach observes its validity and usability. Furthermore, for words with low frequency, our method is also well performed.

Download Full-text

Learning Lessons from Bilingual Corpora: Benefits for Machine Translation

International Journal of Corpus Linguistics ◽

10.1075/ijcl.5.2.06str ◽

2000 ◽

Vol 5 (2) ◽

pp. 199-230 ◽

Cited By ~ 1

Author(s):

Oliver Streiter ◽

Leonid L. Iomdin

Keyword(s):

Machine Translation ◽

Subject Domain ◽

Rule Based ◽

Parallel Corpora ◽

Specific Subject ◽

Multiword Expressions ◽

Bilingual Corpora ◽

Subject Domains

The research described in this paper is rooted in the endeavors to combine the advantages of corpus-based and rule-based MT approaches in order to improve the performance of MT systems—most importantly, the quality of translation. The authors review the ongoing activities in the field and present a case study, which shows how translation knowledge can be drawn from parallel corpora and compiled into the lexicon of a rule-based MT system. These data are obtained with the help of three procedures: (1) identification of hence unknown one-word translations, (2) statistical rating of the known one-word translations, and (3) extraction of new translations of multiword expressions (MWEs) followed by compilation steps which create new rules for the MT engine. As a result, the lexicon is enriched with translation equivalents attested for different subject domains, which facilitates the tuning of the MT system to a specific subject domain and improves the quality and adequacy of translation.

Download Full-text