Statistical Machine Translation Dayak Language – Indonesia Language

2021 ◽  
Vol 16 (1) ◽  
pp. 49
Author(s):  
Muhammad Fiqri Khaikal ◽  
Arie Ardiyanti Suryani

This paper discusses building a machine translator for a local language of Indonesia into Indonesian. A local language was chosen because machine translators for local languages are still rarely found, particularly for Dayak. The machine translation in this research uses a statistical approach, with source data taken from articles on the dayaknews.com pages, yielding a parallel corpus of approximately 1,000 Dayak–Indonesian sentence pairs. Because the corpus contains 1,000 sentences, it was divided into three sections in order to analyze the patterns produced. A monolingual corpus of approximately 1,000 Indonesian sentences was also collected. Testing was carried out with the Bilingual Evaluation Understudy (BLEU) tool and yielded a highest accuracy of 49.15%, an improvement of approximately 3% over other machine translators.
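The evaluation above relies on the BLEU metric. As a rough illustration (not the exact tooling used in the paper), a toy single-reference, sentence-level BLEU can be sketched as clipped n-gram precisions combined with a brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty, single reference only."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values()) or 1
        log_prec += math.log(max(clipped, 1e-9) / total) / max_n
    # brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)
```

Real toolkits compute corpus-level BLEU with four n-gram orders and specific tokenization, so scores such as the 49.15% above are not reproducible from this sketch.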

Webology ◽  
2021 ◽  
Vol 18 (Special Issue 02) ◽  
pp. 208-222
Author(s):  
Vikas Pandey ◽  
Dr. M.V. Padmavati ◽  
Dr. Ramesh Kumar

Machine translation is a subfield of natural language processing (NLP) that translates a source language into a target language. In this paper an attempt has been made to build a Hindi–Chhattisgarhi machine translation system based on a statistical approach. In the state of Chhattisgarh there is a long-awaited need for a Hindi-to-Chhattisgarhi machine translation system, especially for non-Chhattisgarhi-speaking people. To develop the Hindi–Chhattisgarhi statistical machine translation system, the open-source software Moses is used. Moses is a statistical machine translation system that automatically trains a translation model for the Hindi–Chhattisgarhi language pair from a parallel corpus; a corpus is a collection of structured text used to study linguistic properties. The system works on a parallel corpus of 40,000 Hindi–Chhattisgarhi bilingual sentences, extracted from various domains such as stories, novels, textbooks, and newspapers. To overcome translation problems with proper nouns and unknown words, a transliteration system is also embedded in it. The system was tested on 1,000 sentences for grammatical correctness, and an accuracy of 75% was achieved.
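The transliteration fallback described above can be sketched as a phrase-table lookup with an out-of-vocabulary escape hatch. The table entries and the transliterator below are made-up placeholders, not the paper's trained Moses models:

```python
def translate(sentence, phrase_table, transliterate):
    """Greedy word-by-word decoding sketch: use the learned table when a
    word is known, otherwise fall back to transliteration, as the paper
    does for proper nouns and unknown words."""
    out = []
    for word in sentence.split():
        out.append(phrase_table.get(word) or transliterate(word))
    return " ".join(out)

# Toy, made-up romanized entries standing in for a trained phrase table.
toy_table = {"paani": "pani", "ghar": "ghar"}
toy_translit = lambda w: w.upper()  # placeholder transliterator
```

A real Moses system decodes over phrase spans with a log-linear model rather than word by word; this only illustrates where the transliteration hook sits.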


Author(s):  
Muhammad Hermawan ◽  
Herry Sujaini ◽  
Novi Safriadi

The diversity of languages creates a need for translation so that communication between individuals speaking different languages can be properly established. A statistical machine translation (SMT) engine is a translator based on statistical analysis of a parallel corpus. One crucial part of SMT is language modeling (LM): the calculation of word probabilities from a corpus based on n-grams. An LM uses a smoothing algorithm to assign non-zero probability to words whose observed count is zero. This study compares the best smoothing algorithms from each of the three LM toolkits supported by standard Moses, namely KenLM, SRILM, and IRSTLM. For SRILM, interpolation with Witten-Bell and interpolation with Ristad's natural discounting were used; for KenLM, interpolation with modified Kneser-Ney; and for IRSTLM, modified Kneser-Ney and Witten-Bell, following previous research. The study uses a corpus of 10,000 sentences. Tests were carried out with BLEU and by Melayu Sambas linguists. Based on the BLEU results and the linguist evaluation, the best smoothing algorithm was modified Kneser-Ney in the KenLM toolkit, where the average automatic-test results for Indonesian to Melayu Sambas and vice versa were 41.6925% and 46.66%. For the linguist evaluation, the accuracy for Indonesian to Melayu Sambas and vice versa was 77.3165% and 77.9095%.
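As a rough illustration of the smoothing schemes being compared, a minimal interpolated Witten-Bell bigram model (one of the SRILM/IRSTLM options above) might look like the following; modified Kneser-Ney has the same interpolated shape but discounts counts and uses continuation probabilities for the lower order:

```python
from collections import Counter, defaultdict

class WittenBellBigram:
    """Interpolated Witten-Bell bigram LM sketch: the interpolation
    weight for a history h grows with its count c(h) and shrinks with
    the number of distinct words T(h) seen after it."""
    def __init__(self, sentences):
        self.uni = Counter()
        self.bi = Counter()
        self.types_after = defaultdict(set)
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]
            self.uni.update(toks)
            for a, b in zip(toks, toks[1:]):
                self.bi[(a, b)] += 1
                self.types_after[a].add(b)
        self.total = sum(self.uni.values())

    def prob(self, w, h):
        t = len(self.types_after[h])          # T(h): distinct continuations
        c_h = self.uni[h]                     # c(h)
        p_uni = self.uni[w] / self.total      # lower-order (unigram) model
        if c_h + t == 0:
            return p_uni                      # unseen history: back off fully
        lam = c_h / (c_h + t)
        p_ml = self.bi[(h, w)] / c_h if c_h else 0.0
        return lam * p_ml + (1 - lam) * p_uni
```

Unseen continuations of a seen history get the unigram share (1 − λ)·P(w), which is exactly the non-zero mass the abstract says smoothing supplies.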


2016 ◽  
Vol 1 (1) ◽  
pp. 45-49
Author(s):  
Avinash Singh ◽  
Asmeet Kour ◽  
Shubhnandan S. Jamwal

The objective of this paper is to analyze English–Dogri parallel corpus translation. Machine translation is the translation from one language into another, and it is among the biggest applications of natural language processing (NLP). Moses is a statistical machine translation system that allows translation models to be trained for any language pair. We have developed a statistical translation system that translates English to Dogri and vice versa. The parallel corpus consists of 98,973 sentences. The system achieves an accuracy of 80% when translating English to Dogri and 87% when translating Dogri to English.


Author(s):  
Rashmini Naranpanawa ◽  
Ravinga Perera ◽  
Thilakshi Fonseka ◽  
Uthayasanker Thayasivam

Neural machine translation (NMT) is a remarkable approach that performs much better than statistical machine translation (SMT) models when an abundant parallel corpus is available. However, vanilla NMT operates primarily at the word level with a fixed vocabulary, so low-resource, morphologically rich languages such as Sinhala are strongly affected by the out-of-vocabulary (OOV) and rare-word problems. Recent advancements in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the transformer and explore standard subword techniques on top of it to identify which subword approach has the greater effect on the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies combined with state-of-the-art NMT can perform remarkably well when translating English sentences into a morphologically rich language, even without a large parallel corpus.
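A standard subword technique of the kind explored above is byte-pair encoding (BPE). The abstract does not name its exact segmentation methods, so the following is a generic BPE merge-learning sketch over a toy word-frequency dictionary:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent
    symbol pair, starting from characters plus an end-of-word marker."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, f in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += f
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the merge everywhere in the vocabulary
        new_vocab = {}
        for sym, f in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i < len(sym) - 1 and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] = f
        vocab = new_vocab
    return merges
```

Applying the learned merge list to unseen words splits them into known subwords, which is how BPE gives open-vocabulary translation for rare and OOV words.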


2014 ◽  
Vol 3 (3) ◽  
pp. 65-72
Author(s):  
Ayana Kuandykova ◽  
Amandyk Kartbayev ◽  
Tannur Kaldybekov

2005 ◽  
Vol 31 (4) ◽  
pp. 477-504 ◽  
Author(s):  
Dragos Stefan Munteanu ◽  
Daniel Marcu

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
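A minimal sketch of such a classifier: hand-picked features in the spirit of this work (length ratio, bilingual-lexicon coverage) scored by a log-linear (maximum entropy) model. The features, lexicon, and weights below are illustrative stand-ins, not the paper's trained model:

```python
import math

def features(src, tgt, lexicon):
    """Toy feature vector for a candidate sentence pair: a bias term,
    the length ratio, and the fraction of source words with a
    lexicon translation appearing in the target."""
    s, t = src.split(), tgt.split()
    ratio = min(len(s), len(t)) / max(len(s), len(t))
    covered = sum(1 for w in s if any(tw in t for tw in lexicon.get(w, [])))
    coverage = covered / len(s)
    return [1.0, ratio, coverage]

def maxent_prob(feats, weights):
    """Log-linear score squashed to P(parallel | pair)."""
    z = sum(f * w for f, w in zip(feats, weights))
    return 1 / (1 + math.exp(-z))
```

In the actual method the weights are learned from labeled parallel and non-parallel pairs, and pairs scoring above a threshold are kept as extracted training data.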


2020 ◽  
pp. 1-22
Author(s):  
Sukanta Sen ◽  
Mohammed Hasanuzzaman ◽  
Asif Ekbal ◽  
Pushpak Bhattacharyya ◽  
Andy Way

Abstract Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and a sufficiently large corpus is not always available, as building one requires time, money, and professionals. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in a low-resource scenario without using any additional data. Our approach augments the original training data with parallel phrases extracted from that training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on the gated recurrent unit (GRU) and transformer networks. We choose the Hindi–English and Hindi–Bengali datasets for the Health, Tourism, and Judicial (only for Hindi–English) domains. We train our NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show improvements in the range of 1.38–15.36 BiLingual Evaluation Understudy points over the baseline systems. Experiments also show that transformer models perform better than GRU models in low-resource scenarios. In addition, we find that our proposed method outperforms SMT, which is known to work better than neural models in low-resource scenarios, for some translation directions. To further show the effectiveness of our proposed model, we also apply our approach to another interesting NMT task, old-to-modern English translation, using a tiny parallel corpus of only 2.7K sentences. For this task, we use publicly available old and modern English text that is approximately 1000 years old. Evaluation for this task shows significant improvement over the baseline NMT.
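The SMT-style phrase extraction that the augmentation relies on can be sketched with the standard consistency criterion over word alignments. This is a simplified version in which the alignment links are given by hand rather than learned:

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Extract consistent phrase pairs from one aligned sentence pair.
    `alignment` is a set of (src_idx, tgt_idx) links; a source span is
    paired with the minimal target span covering its links, provided no
    link crosses the box boundary."""
    s, t = src.split(), tgt.split()
    pairs = set()
    for i1 in range(len(s)):
        for i2 in range(i1, min(i1 + max_len, len(s))):
            # target positions linked to this source span
            tj = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tj:
                continue
            j1, j2 = min(tj), max(tj)
            # consistency: every link touching the target span must
            # originate inside the source span
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.add((" ".join(s[i1:i2 + 1]), " ".join(t[j1:j2 + 1])))
    return pairs
```

In the proposed augmentation, such extracted phrase pairs are appended to the original sentence-level training data before the NMT model is trained.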


Author(s):  
Wei Xu ◽  
Courtney Napoles ◽  
Ellie Pavlick ◽  
Quanze Chen ◽  
Chris Callison-Burch

Most recent sentence simplification systems use basic machine translation models to learn lexical and syntactic paraphrases from a manually simplified parallel corpus. These methods are limited by the quality and quantity of manually simplified corpora, which are expensive to build. In this paper, we conduct an in-depth adaptation of statistical machine translation to perform text simplification, taking advantage of large-scale paraphrases learned from bilingual texts and a small amount of manual simplifications with multiple references. Our work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task.
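As a toy illustration of paraphrase-based lexical simplification in the spirit of this work (not the paper's actual model or metrics), one can substitute a word with its most frequent paraphrase, using corpus frequency as a crude proxy for simplicity. The paraphrase table and frequency list below are hypothetical stand-ins for PPDB-style resources:

```python
def simplify(sentence, paraphrases, freq):
    """Replace each word with its most frequent candidate among the word
    itself and its paraphrases; frequency is a rough simplicity proxy."""
    out = []
    for w in sentence.split():
        cands = paraphrases.get(w, [])
        best = max(cands + [w], key=lambda c: freq.get(c, 0))
        out.append(best)
    return " ".join(out)
```

The full system instead learns weighted lexical and syntactic paraphrases within an SMT decoder and tunes against simplification-specific automatic metrics.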

