Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages

Abstract: Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within the same text, and also for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language schemes, and that a context-variable tokenization scheme can outperform a context-constant scheme with a statistically significant improvement of about 1.4 BLEU points.


An Experimental Platform for Cross-Language Document Retrieval

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.284-287.3325 ◽

2013 ◽

Vol 284-287 ◽

pp. 3325-3329

Author(s):

Long Yue Wang ◽

Derek F. Wong ◽

Lidia S. Chao

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Document Retrieval ◽

Training Data ◽

Target Language ◽

Source Language ◽

Experimental Platform ◽

Precision Evaluation ◽

Query Generation ◽

Cross Language

This paper presents a Cross-Language Document Retrieval experimental platform that integrates modules for training-data preprocessing, document translation, query generation, document retrieval and precision evaluation. A document in the source language is translated into the target language by a statistical machine translation module trained on selected training data. The query generation module then selects the most relevant words in the translated document as the search query. After the document retrieval module has ranked all the documents in the target language, the system chooses the N-best documents as the target-language versions of the source document. Finally, a precision evaluator scores the results, which reflects the merits of the strategies used. Experimental results showed that this platform was effective and achieved very good performance.
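The stages described above (query generation, ranking, N-best selection, precision evaluation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: the abstract does not say how word relevance or document similarity is computed, so TF-IDF weighting and cosine similarity are assumed here, and every function name is hypothetical. The translation step is omitted; the input is taken to be an already translated document.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are dropped."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def generate_query(translated_doc, idf, k=3):
    """Query generation (assumed TF-IDF): keep the k highest-weighted words."""
    weights = tfidf(translated_doc, idf)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_terms, docs, idf, n=1):
    """Rank all target-language documents and return the indices of the N best."""
    q = tfidf(" ".join(query_terms), idf)
    scored = [(cosine(q, tfidf(doc, idf)), i) for i, doc in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:n]]


def precision_at_n(retrieved, relevant):
    """Fraction of retrieved documents that are in the relevant set."""
    if not retrieved:
        return 0.0
    return sum(1 for i in retrieved if i in relevant) / len(retrieved)
```

A caller would build `idf_table` once over the target-language collection, pass each translated document through `generate_query` and `retrieve`, and average `precision_at_n` over the test documents.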


Source Language Adaptation Approaches for Resource-Poor Machine Translation

Computational Linguistics ◽

10.1162/coli_a_00248 ◽

2016 ◽

Vol 42 (2) ◽

pp. 277-306 ◽

Cited By ~ 8

Author(s):

Pidong Wang ◽

Preslav Nakov ◽

Hwee Tou Ng

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Target Language ◽

Source Language ◽

World Languages ◽

Word Level ◽

Resource Poor ◽

Morphological Variants ◽

Cross Lingual ◽

Translation Systems

Most of the world's languages are resource-poor for statistical machine translation; still, many of them are actually related to some resource-rich language. Thus, we propose three novel, language-independent approaches to source language adaptation for resource-poor statistical machine translation. Specifically, we build improved statistical machine translation models from a resource-poor language POOR into a target language TGT by adapting and using a large bitext for a related resource-rich language RICH and the same target language TGT. We assume a small POOR–TGT bitext from which we learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-rich and the resource-poor language. Our work is of importance for resource-poor machine translation because it can provide a useful guideline for people building machine translation systems for resource-poor languages. Our experiments for Indonesian/Malay–English translation show that using the large adapted resource-rich bitext yields 7.26 BLEU points of improvement over the unadapted one and 3.09 BLEU points over the original small bitext. Moreover, combining the small POOR–TGT bitext with the adapted bitext outperforms the corresponding combinations with the unadapted bitext by 1.93–3.25 BLEU points. We also demonstrate the applicability of our approaches to other languages and domains.


Synthetic Treebanking for Cross-Lingual Dependency Parsing

Journal of Artificial Intelligence Research ◽

10.1613/jair.4785 ◽

2016 ◽

Vol 55 ◽

pp. 209-248 ◽

Cited By ~ 7

Author(s):

Jörg Tiedemann ◽

Zeljko Agić

Keyword(s):

Machine Translation ◽

Target Language ◽

Dependency Parsing ◽

Practical Applications ◽

Source Language ◽

Part Of Speech ◽

Statistical Dependency ◽

Target Languages ◽

Cross Lingual ◽

The Impact

How do we parse the languages for which no treebanks are available? This contribution addresses the cross-lingual viewpoint on statistical dependency parsing, in which we attempt to make use of resource-rich source language treebanks to build and adapt models for the under-resourced target languages. We outline the benefits, and indicate the drawbacks of the current major approaches. We emphasize synthetic treebanking: the automatic creation of target language treebanks by means of annotation projection and machine translation. We present competitive results in cross-lingual dependency parsing using a combination of various techniques that contribute to the overall success of the method. We further include a detailed discussion about the impact of part-of-speech label accuracy on parsing results that provide guidance in practical applications of cross-lingual methods for truly under-resourced languages.


Emoji-Powered Representation Learning for Cross-Lingual Sentiment Classification (Extended Abstract)

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/649 ◽

2020 ◽

Author(s):

Zhenpeng Chen ◽

Sheng Shen ◽

Ziniu Hu ◽

Xuan Lu ◽

Qiaozhu Mei ◽

...

Keyword(s):

Machine Translation ◽

Representation Learning ◽

Sentiment Classification ◽

Target Language ◽

Learning Method ◽

Source Language ◽

Translation Tools ◽

Target Languages ◽

Cross Lingual ◽

Cross Language

Sentiment classification typically relies on a large amount of labeled data. In practice, the availability of labels is highly imbalanced among different languages. To tackle this problem, cross-lingual sentiment classification approaches aim to transfer knowledge learned from one language that has abundant labeled examples (i.e., the source language, usually English) to another language with fewer labels (i.e., the target language). The source and the target languages are usually bridged through off-the-shelf machine translation tools. Through such a channel, cross-language sentiment patterns can be successfully learned from English and transferred into the target languages. This approach, however, often fails to capture sentiment knowledge specific to the target language. In this paper, we employ emojis, which are widely available in many languages, as a new channel to learn both the cross-language and the language-specific sentiment patterns. We propose a novel representation learning method that uses emoji prediction as an instrument to learn respective sentiment-aware representations for each language. The learned representations are then integrated to facilitate cross-lingual sentiment classification.


An Overview of Statistical Machine Translation Tools

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse/v7i7/0201 ◽

2017 ◽

Vol 7 (7) ◽

pp. 289

Author(s):

Mir Aadil ◽

M. Asger

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Target Language ◽

Quality Of Results ◽

Source Language ◽

Translation Tools

Machine translation is a combination of many complex sub-processes, and the quality of the results of each sub-process, executed in a well-defined sequence, determines the overall accuracy of the translation. The Statistical Machine Translation (SMT) approach considers each sentence in the target language as a possible translation of any source-language sentence. Each candidate is assigned a probability, and the sentence with the highest probability is treated as the best translation. SMT is the most favoured approach not only because of its good results for corpus-rich language pairs, but also because of the tools with which it has been enhanced over the past two and a half decades. The paper gives a brief introduction to SMT: its steps and the different tools available for each step.
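The decision rule described above, picking the target sentence with the highest probability, is commonly written as an argmax. The notation is introduced here for illustration: f is the source sentence and e ranges over candidate target sentences. The second equality is the standard noisy-channel factorization via Bayes' rule; the abstract does not spell it out, so treat it as an assumption about which formulation the paper has in mind.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

In this reading, P(f | e) is the translation model and P(e) is the target-language model; the denominator P(f) is dropped because it does not depend on e.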


Inflection rules for Marathi to English in rule based machine translation

IAES International Journal of Artificial Intelligence (IJ-AI) ◽

10.11591/ijai.v10.i3.pp780-788 ◽

2021 ◽

Vol 10 (3) ◽

pp. 780

Author(s):

Namrata G Kharate ◽

Varsha H Patil

Keyword(s):

Natural Language Processing ◽

Machine Translation ◽

Language Processing ◽

Important Application ◽

Target Language ◽

Rule Based ◽

Parts Of Speech ◽

Source Language ◽

Target Languages ◽

Correct Translation

Machine translation is an important application of natural language processing. It means translating from a source language to a target language while preserving the meaning of the sentence. A large amount of research is going on in the area of machine translation. However, this research remains highly localized to particular source and target languages, as they differ syntactically and morphologically. Appropriate inflections result in correct translation. This paper elaborates the rules for inflecting the parts of speech and implements the inflection for Marathi to English translation. The inflection of nouns, pronouns, verbs and adjectives is carried out on the basis of the semantics of the sentence. The results are discussed with examples.


Reusing Monolingual Pre-Trained Models by Cross-Connecting Seq2seq Models for Machine Translation

Applied Sciences ◽

10.3390/app11188737 ◽

2021 ◽

Vol 11 (18) ◽

pp. 8737

Author(s):

Jiun Oh ◽

Yong-Suk Choi

Keyword(s):

Machine Translation ◽

Intermediate Layer ◽

Language Model ◽

Target Language ◽

Source Language ◽

Performance Change ◽

The Cross ◽

Target Languages ◽

Cross Connection

This work uses sequence-to-sequence (seq2seq) models pre-trained on monolingual corpora for machine translation. We pre-train two seq2seq models with monolingual corpora for the source and target languages, then combine the encoder of the source language model and the decoder of the target language model, i.e., the cross-connection. We add an intermediate layer between the pre-trained encoder and the decoder to help the mapping of each other since the modules are pre-trained completely independently. These monolingual pre-trained models can work as a multilingual pre-trained model because one model can be cross-connected with another model pre-trained on any other language, while their capacity is not affected by the number of languages. We will demonstrate that our method improves the translation performance significantly over the random baseline. Moreover, we will analyze the appropriate choice of the intermediate layer, the importance of each part of a pre-trained model, and the performance change along with the size of the bitext.


THE IMPLEMENTATION OF COGNITIVE ACADEMIC LANGUAGE LEARNING STRATEGIES (CALLS) TO HELP THE STUDENTS IN RECOGNIZING THE LEXICAL CONSTRAINTS ON THE STUDENTS TRANSLATION

JURNAL ELINK ◽

10.30736/e-link.v4i2.18 ◽

2016 ◽

Vol 4 (2) ◽

Author(s):

Diah Astuty

Keyword(s):

Language Learning ◽

Learning Strategies ◽

Academic Language ◽

Target Language ◽

Language Learning Strategies ◽

Source Language ◽

Lexical Constraints ◽

Language Text

This study aims to describe the sorts of lexical constraints that appeared in the students' translations when translating source language texts into target language texts. The linguistic competence that the students have acquired is in fact assumed to be inadequate, and this can cause the lexical constraints. Keywords: CALLS, lexical constraints, source language text, target language text


Analysis Accuracy of Similar Word Based Clustering (EWSB) Algorithm on Machine Translator Bahasa Indonesia-Minang

Kinetik Game Technology Information System Computer Network Computing Electronics and Control ◽

10.22219/kinetik.v3i3.241 ◽

2018 ◽

Vol 3 (3) ◽

Author(s):

Herry Sujaini

Keyword(s):

Machine Translation ◽

Clustering Algorithm ◽

Statistical Machine Translation ◽

Target Language ◽

Word Similarity ◽

Similar Word ◽

Word Clustering ◽

Translation Accuracy ◽

Bahasa Indonesia

Extended Word Similarity Based (EWSB) Clustering is a word clustering algorithm based on word similarity values computed from a corpus. One of the benefits of clustering with this algorithm is improved translation quality in statistical machine translation. Previous research showed that the EWSB algorithm could improve an Indonesian-English translator, where the algorithm was applied to Indonesian as the target language. This paper discusses the results of applying the EWSB algorithm to an Indonesian-to-Minang statistical machine translator, where the algorithm is applied to Minang as the target language. The research found that the EWSB algorithm is quite effective when used with Minang as the target language. The results of this study indicate that the EWSB algorithm can improve translation accuracy by 6.36%.


Analyzing Subword Techniques to Improve English to Sinhala Neural Machine Translation

International Journal of Asian Language Processing ◽

10.1142/s2717554520500174 ◽

2021 ◽

pp. 2050017

Author(s):

Rashmini Naranpanawa ◽

Ravinga Perera ◽

Thilakshi Fonseka ◽

Uthayasanker Thayasivam

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Translation System ◽

Rare Word ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Low Resource ◽

Word Level ◽

Morphologically Rich Languages

Neural machine translation (NMT) is a remarkable approach that performs much better than statistical machine translation (SMT) models when parallel corpora are abundant. However, vanilla NMT primarily operates at the word level with a fixed vocabulary. Therefore, low-resource, morphologically rich languages such as Sinhala are heavily affected by the out-of-vocabulary (OOV) and rare-word problems. Recent advancements in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the transformer and explore standard subword techniques on top of it to identify which subword approach has a greater effect on the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies along with state-of-the-art NMT can perform remarkably when translating English sentences into a morphologically rich language regardless of a large parallel corpus.
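To make the idea of open-vocabulary translation through subwords concrete, here is a minimal sketch of byte-pair encoding (BPE), one widely used subword technique. The abstract does not say which subword methods the authors compared, so BPE is an assumption chosen for illustration, and the function names and the `</w>` end-of-word marker are conventions of this sketch rather than anything taken from the paper.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: frequency} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by replaying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = next(iter(merge_pair(pair, {symbols: 1})))
    return list(symbols)
```

Because `segment` falls back to single characters when no merge applies, a word never seen in training is still representable as a sequence of known units, which is the property that addresses the OOV and rare-word problems mentioned above.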
