Statistical Machine Translation
Recently Published Documents

TOTAL DOCUMENTS: 998 (five years: 113)
H-INDEX: 31 (five years: 4)

Author(s): Rupjyoti Baruah, Rajesh Kumar Mundotiya, Anil Kumar Singh

Machine translation (MT) systems have been built using numerous different techniques for bridging language barriers. These techniques are broadly categorized into approaches such as Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). End-to-end NMT systems significantly outperform SMT in translation quality on many language pairs, especially those with adequate parallel corpora. We report comparative experiments on baseline MT systems for Assamese to other Indo-Aryan languages (in both translation directions) using traditional Phrase-Based SMT as well as several successful NMT architectures, namely a basic sequence-to-sequence model with attention, the Transformer, and a fine-tuned Transformer. The results are evaluated using the standard automatic metric BLEU (BiLingual Evaluation Understudy), as well as other well-known metrics, to explore the performance of the different baseline MT systems, since this is the first such work involving Assamese. The evaluation scores of the SMT and NMT models are compared across the bi-directional language pairs involving Assamese and other Indo-Aryan languages (Bangla, Gujarati, Hindi, Marathi, Odia, Sinhalese, and Urdu). The highest BLEU scores obtained are for Assamese to Sinhalese with SMT (35.63) and for Assamese to Bangla with the NMT systems (50.92 for seq2seq, 50.01 for the Transformer, and 50.19 for the fine-tuned Transformer). We also try to relate the results to language characteristics, distances, family trees, domains, data sizes, and sentence lengths. We find that the domain is the most important factor affecting the results for the given data domains and sizes. We compare our results with the only existing MT system for Assamese (Bing Translator) and also with pairs involving Hindi.
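For reference, corpus-level BLEU can be computed with an off-the-shelf scorer. The sketch below assumes the sacrebleu Python package and hypothetical hypothesis/reference files for one Assamese-to-Bangla run; the paper does not specify its scoring toolchain.

```python
# Minimal sketch: corpus-level BLEU for one system output, assuming sacrebleu
# is installed and the hypothetical files below exist (one sentence per line).
import sacrebleu

with open("hyp.asm-bn.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]        # system translations
with open("ref.asm-bn.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]        # reference translations

bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # one reference set
print(f"BLEU = {bleu.score:.2f}")
```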


2021, Vol 8 (4), pp. 2173-2186
Author(s): Quranul Alfahrezi Agigi

Despite rapid technological development, there are still few machine translators from regional languages into Indonesian. This paper therefore describes the construction of a statistical machine translation system from the Muna language into Indonesian, a pair for which such a translator is still lacking. The approach is statistically based and uses a parallel corpus. The data were taken from a book entitled Folklore of Buton and Muna in Southeast Sulawesi and from several folklore articles on the internet. The parallel corpus contains 1050 sentence lines and the monolingual corpus 1351 sentence lines. The experiments are divided into two scenarios. In scenario 1, the system is trained on the available sentence lines of the parallel corpus (training), with more sentence lines added in each experiment, while the remaining sentence lines are used as the parallel corpus (testing). In scenario 2, the test is carried out by comparing results after sentence lines are removed from and added to the monolingual corpus, starting from the best-performing configuration of scenario 1. The tests were carried out six times using the BLEU (Bilingual Evaluation Understudy) tool. The best accuracy value obtained is 29.83%.
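A rough sketch of the scenario-1 style incremental split might look as follows; the file names, starting size, and step size are illustrative assumptions, since the abstract does not state the exact increments used in the six experiments.

```python
# Sketch: a growing training portion of the 1050-line Muna-Indonesian parallel
# corpus, with the remainder held out for testing in each experiment.
def incremental_splits(pairs, start=150, step=150):
    """Yield (train, test) splits, enlarging the training set each round."""
    for cut in range(start, len(pairs), step):
        yield pairs[:cut], pairs[cut:]

# Hypothetical corpus files: one sentence per line, aligned by line number.
with open("muna.txt", encoding="utf-8") as src, open("indonesian.txt", encoding="utf-8") as tgt:
    pairs = list(zip(src.read().splitlines(), tgt.read().splitlines()))

for i, (train, test) in enumerate(incremental_splits(pairs), 1):
    print(f"experiment {i}: {len(train)} training pairs, {len(test)} test pairs")
```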


Author(s): Zolzaya Byambadorj, Ryota Nishimura, Altangerel Ayush, Norihide Kitaoka

The huge increase in social media use in recent years has resulted in new forms of social interaction, changing our daily lives. Globalization has increased contact between people from different cultures, and with it the use of the Latin alphabet, so a large amount of transliterated text now appears on social media. In this study, we propose a variety of character-level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which only a limited amount of training data is available. We applied performance-enhancement methods, including various beam search strategies, N-gram-based context adoption, edit distance-based correction, and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed seq2seq models improved the robustness of the basic seq2seq models for normalizing out-of-vocabulary (OOV) words, and most of our models achieved higher normalization performance than the conventional method. On the test data in our text normalization experiment, the proposed method that checks each hypothesis during inference achieved the lowest word error rate (WER = 13.41%), 4.51% fewer errors than the conventional SMT method.
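As an illustration of combining edit distance-based correction with dictionary-based checking of hypotheses, the following is a minimal sketch; the toy Cyrillic word list and distance threshold are assumptions, not values or resources from the paper.

```python
# Sketch: replace each hypothesis token missing from a Cyrillic word list with
# its closest in-vocabulary word, if the edit distance stays under a threshold.
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def check_hypothesis(tokens, dictionary, max_dist=2):
    corrected = []
    for tok in tokens:
        if tok in dictionary:
            corrected.append(tok)
            continue
        best = min(dictionary, key=lambda w: edit_distance(tok, w))
        corrected.append(best if edit_distance(tok, best) <= max_dist else tok)
    return corrected

dictionary = {"сайн", "байна", "уу"}                 # toy Cyrillic word list
print(check_hypothesis(["саин", "байна", "уу"], dictionary))
```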


2021, Vol 12 (5), pp. 1-51
Author(s): Yu Wang, Yuelin Wang, Kai Dang, Jie Liu, Zhuo Liu

Grammatical error correction (GEC) is an important application of natural language processing techniques, and GEC systems are important intelligent systems that have long been explored in both academic and industrial communities. The past decade has witnessed significant progress in GEC owing to the increasing popularity of machine learning and deep learning. However, no survey yet untangles the large body of research and progress in this field. We present the first survey of GEC as a comprehensive retrospective of the literature in this area. We first give the definition of the GEC task and introduce the public datasets and data annotation schemas. After that, we discuss six kinds of basic approaches, six commonly applied performance-boosting techniques for GEC systems, and three data augmentation methods. Since GEC is typically viewed as a sister task of Machine Translation (MT), we put more emphasis on the statistical machine translation (SMT)-based and neural machine translation (NMT)-based approaches because of their importance. Similarly, some performance-boosting techniques are adapted from MT and have been successfully combined with GEC systems to enhance final performance. More importantly, after introducing evaluation in GEC, we make an in-depth analysis based on empirical results, in terms of both GEC approaches and GEC systems, to give a clearer picture of progress in GEC, with error type analysis and system recapitulation clearly presented. Finally, we discuss five prospective directions for future GEC research.
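One of the simplest data augmentation strategies used for GEC is to inject synthetic errors into clean text, yielding (erroneous, corrected) training pairs. The sketch below illustrates this idea only; the corruption operations and probabilities are chosen purely for illustration and are not taken from the survey.

```python
# Sketch: corrupt clean sentences with random token drops, duplications, and
# neighbor swaps to create synthetic (erroneous, corrected) pairs for GEC.
import random

def corrupt(tokens, p_drop=0.05, p_swap=0.05, p_dup=0.05, seed=None):
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        if rng.random() < p_drop:
            continue                      # simulate a missing word
        noisy.append(tok)
        if rng.random() < p_dup:
            noisy.append(tok)             # simulate a repeated word
    for i in range(len(noisy) - 1):
        if rng.random() < p_swap:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]   # word-order error
    return noisy

clean = "she has lived in this city for ten years".split()
pair = (" ".join(corrupt(clean, seed=3)), " ".join(clean))
print(pair)   # (synthetic erroneous sentence, original corrected sentence)
```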


Author(s): Andy Way

Phrase-Based Statistical Machine Translation (PB-SMT) is clearly the leading paradigm in the field today. Nevertheless—and this may come as some surprise to the PB-SMT community—most translators and, somewhat more surprisingly perhaps, many experienced MT protagonists find the basic model extremely difficult to understand. The main aim of this paper, therefore, is to discuss why this might be the case. Our basic thesis is that proponents of PB-SMT do not seek to address any community other than their own, for they do not feel any need to do so. We demonstrate that this was not always the case; on the contrary, when statistical models of translation were first presented, the language used to describe how such a model might work was very conciliatory and inclusive. Over the next five years, things changed considerably; once SMT achieved dominance, particularly over the rule-based paradigm, it no longer needed to bring the rest of the MT community along with it, and in our view this has largely remained the case to this day. Having addressed this question, we discuss three additional issues: the role of automatic MT evaluation metrics when describing PB-SMT systems; the recent syntactic embellishments of PB-SMT, noting especially that most of these contributions have come from researchers with prior experience in fields other than statistical models of translation; and the relationship between PB-SMT and other models of translation, suggesting that there are many gains to be had if the SMT community were to open up more to the other MT paradigms.


Author(s): Lieve Macken, Els Lefever

In this paper, we will describe the current state of the art of Statistical Machine Translation (SMT) and reflect on how SMT handles meaning. Statistical Machine Translation is a corpus-based approach to MT: it derives the knowledge required to generate new translations from corpora. General-purpose SMT systems do not use any formal semantic representation. Instead, they directly extract translationally equivalent words or word sequences – expressions with the same meaning – from bilingual parallel corpora. All statistical translation models are based on the idea of word alignment, i.e., the automatic linking of corresponding words in parallel texts. The first-generation SMT systems were word-based. From a linguistic point of view, the major problem with word-based systems is that the meaning of a word is often ambiguous and is determined by its context. Current state-of-the-art SMT systems try to capture local contextual dependencies by using phrases instead of words as units of translation. In order to solve more complex ambiguity problems (where a broader text scope or even domain information is needed), a Word Sense Disambiguation (WSD) module is integrated into the Machine Translation environment.
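The word-alignment idea can be illustrated with a toy expectation-maximization routine in the spirit of IBM Model 1; the miniature corpus below is an assumption for demonstration only and is not drawn from the paper or any specific SMT system.

```python
# Toy IBM Model 1 sketch: EM iterations estimate word translation
# probabilities t(f|e) from a tiny (English, Dutch) parallel corpus.
from collections import defaultdict
from itertools import product

corpus = [("the house".split(), "het huis".split()),
          ("the book".split(), "het boek".split()),
          ("a book".split(),   "een boek".split())]

e_vocab = {e for es, _ in corpus for e in es}
f_vocab = {f for _, fs in corpus for f in fs}
t = {(f, e): 1.0 / len(f_vocab) for f, e in product(f_vocab, e_vocab)}  # uniform init

for _ in range(10):                                  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for es, fs in corpus:
        for f in fs:                                 # E-step: expected alignment counts
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                frac = t[(f, e)] / norm
                count[(f, e)] += frac
                total[e] += frac
    for f, e in t:                                   # M-step: re-estimate t(f|e)
        if total[e] > 0:
            t[(f, e)] = count[(f, e)] / total[e]

# Probability mass concentrates on the correct pairing over the iterations.
print(f't(huis|house) = {t[("huis", "house")]:.3f}')
```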


Author(s): Arun Babhulgaonkar, Shefali Sonavane

Hindi is one of the official languages of India. However, most Government records, resolutions, news, etc., are documented in English, which remote villagers may not understand. This fact motivates the development of an automatic language translation system from English to Hindi. Machine translation is the process of translating a text in one natural language into another natural language using a computer system. The grammatical structure of Hindi is much more complex than that of English. This structural difference between English and Hindi makes it difficult to achieve good-quality translation results. In this paper, a phrase-based statistical machine translation (PBSMT) approach is used for translation. The translation model, reordering model, and language model are the main working components of a PBSMT system. This paper evaluates the impact of various combinations of these PBSMT system parameters on automated English-to-Hindi translation quality. The freely available n-gram-based BLEU metric and the TER metric are used for evaluating the results.
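The way these components interact can be sketched as a log-linear combination of their scores, as in standard PBSMT decoders; the feature values and weights below are purely illustrative, not taken from the paper, and in practice the weights would be tuned (for example with MERT).

```python
# Sketch: a PBSMT-style log-linear score mixing translation, reordering
# (distortion), and language model probabilities with tunable weights.
import math

def loglinear_score(features, weights):
    """Log-linear model: score = sum_i w_i * log(h_i)."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

hypothesis_features = {
    "translation": 0.4,    # phrase translation probability
    "reordering":  0.7,    # distortion / reordering probability
    "language":    0.05,   # target language model probability
}
weights = {"translation": 1.0, "reordering": 0.6, "language": 1.2}  # illustrative weights

print(f"log-linear score = {loglinear_score(hypothesis_features, weights):.3f}")
```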


Author(s): Quyền Đặng Thanh

In statistical machine translation (SMT) systems, word alignment is an important task that strongly affects translation quality. To date, no study has applied subword segmentation techniques to statistical machine translation for the Vietnamese-English language pair. In this paper, we propose an approach that uses subword segmentation techniques in a statistical machine translation system to improve word alignment quality and thereby improve translation quality for the Vietnamese-English pair. In addition to applying subword segmentation as a preprocessing step, we also propose an improvement to the word alignment model to further raise translation quality. The proposed method was implemented and tested with different subword segmentation techniques such as BPE, WordPiece, unigram, and Morfessor; the experimental results show that applying the proposed method increases the BLEU score over the baseline in every case, with the best result, using BPE, improving BLEU by 0.81 points.
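As an illustration of subword segmentation as a preprocessing step, the sketch below applies BPE with the sentencepiece library to a hypothetical Vietnamese-English parallel corpus (vi.txt, en.txt); the paper's actual toolchain and vocabulary sizes are not specified, and only BPE of the compared schemes (BPE, WordPiece, unigram, Morfessor) is shown here.

```python
# Sketch: train a BPE model per language side, then segment each sentence pair
# into subword units before word alignment / SMT training.
import sentencepiece as spm

spm.SentencePieceTrainer.train(input="vi.txt", model_prefix="vi_bpe",
                               vocab_size=8000, model_type="bpe")
spm.SentencePieceTrainer.train(input="en.txt", model_prefix="en_bpe",
                               vocab_size=8000, model_type="bpe")

vi_sp = spm.SentencePieceProcessor(model_file="vi_bpe.model")
en_sp = spm.SentencePieceProcessor(model_file="en_bpe.model")

with open("vi.txt", encoding="utf-8") as src, open("en.txt", encoding="utf-8") as tgt:
    for vi_line, en_line in zip(src, tgt):
        vi_tokens = vi_sp.encode(vi_line.strip(), out_type=str)
        en_tokens = en_sp.encode(en_line.strip(), out_type=str)
        print(" ".join(vi_tokens), "|||", " ".join(en_tokens))
```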

