Machine Translation Utilizing the Frequent-Item Set Concept

Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1493
Author(s):  
Hanan A. Hosni Mahmoud ◽  
Hanan Abdullah Mengash

In this paper, we introduce new concepts in the machine translation paradigm. We treat the corpus as a database of frequent word sets. A translation request triggers association rules joining phrases present in the source language with phrases present in the target language. A sequential scan of the corpus for such phrases would, however, increase the response time unacceptably. We therefore pre-process the bilingual corpus into a proposed data structure called the Corpus-Trie (CT), which renders a bilingual parallel corpus as a compact representation of frequent item sets. We also present algorithms that use the CT to answer translation requests, and we explore these techniques in exhaustive experiments. Experiments were performed on specific language pairs, although the proposed method is not restricted to any particular language; moreover, the Corpus-Trie can be extended from bilingual corpora to accommodate multi-language corpora. Experiments indicated that the response time of a translation request is logarithmic in the number of unique phrases in the original bilingual corpus (and hence in the Corpus-Trie size); in practical cases, only 5–20% of the logarithm of the number of nodes has to be visited. The experimental results indicate that the BLEU score of the proposed CT system increases with the number of phrases in the CT for both English-Arabic and English-French translation. The proposed CT system was demonstrated to outperform both OmegaT and Apertium in translation quality from a corpus size exceeding 1,600,000 phrases for English-Arabic translation, and 300,000 phrases for English-French translation.
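As a rough illustration of the idea, a trie keyed on source-phrase words, whose terminal nodes hold candidate target phrases, answers a phrase lookup in time proportional to the phrase length rather than the corpus size. This minimal Python sketch is ours, not the paper's implementation; the class and method names are invented:

```python
class CorpusTrie:
    """Toy sketch of a phrase trie: each path of source-language words
    ends in a node holding candidate target-language phrases."""

    def __init__(self):
        self.children = {}       # word -> child CorpusTrie node
        self.translations = []   # target phrases for the phrase ending here

    def insert(self, source_words, target_phrase):
        node = self
        for word in source_words:
            node = node.children.setdefault(word, CorpusTrie())
        node.translations.append(target_phrase)

    def lookup(self, source_words):
        node = self
        for word in source_words:
            node = node.children.get(word)
            if node is None:
                return []        # phrase not in the corpus
        return node.translations


trie = CorpusTrie()
trie.insert(["good", "morning"], "bonjour")
trie.insert(["good", "evening"], "bonsoir")
print(trie.lookup(["good", "morning"]))  # ['bonjour']
```

Lookup cost depends only on the length of the queried phrase, which is why response time scales with the trie depth rather than with a linear scan of the corpus.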

Author(s):  
Herry Sujaini

Extended Word Similarity Based (EWSB) clustering is a word-clustering algorithm based on word-similarity values computed from a corpus. One benefit of clustering with this algorithm is improved output from a statistical machine translator. Previous research showed that the EWSB algorithm could improve an Indonesian-English translator, where the algorithm was applied to Indonesian as the target language. This paper discusses the results of applying the EWSB algorithm to an Indonesian-to-Minang statistical machine translator, with Minang as the target language. The results show that the EWSB algorithm is quite effective with Minang as the target language, improving translation accuracy by 6.36%.


2017 ◽  
Vol 108 (1) ◽  
pp. 257-269 ◽  
Author(s):  
Nasser Zalmout ◽  
Nizar Habash

Abstract Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within the same text, and for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian, and Chinese. Our results show that different target languages indeed require different source-language schemes, and that a context-variable tokenization scheme can outperform a context-constant scheme with a statistically significant improvement of about 1.4 BLEU points.
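The contrast between context-constant and context-variable tokenization can be sketched as follows. Everything here is invented for illustration: the toy "wa+" segmentation rule, the length-based selection heuristic, and the example words stand in for the paper's linguistically informed Arabic schemes:

```python
# Context-constant tokenization applies one scheme to every word;
# context-variable tokenization picks a scheme per word.

def tokenize_whole(word):
    """No segmentation (a D0-like scheme)."""
    return [word]

def tokenize_split(word):
    """Toy segmentation: split a leading 'wa' conjunction off long words."""
    if word.startswith("wa") and len(word) > 4:
        return ["wa+", word[2:]]
    return [word]

def tokenize_context_variable(words, scheme_for):
    """Apply a possibly different scheme to each word."""
    tokens = []
    for w in words:
        tokens.extend(scheme_for(w)(w))
    return tokens

# Invented heuristic: segment long words, keep short ones whole.
choose = lambda w: tokenize_split if len(w) > 4 else tokenize_whole
print(tokenize_context_variable(["wakitab", "min"], choose))
# ['wa+', 'kitab', 'min']
```

In the paper's setting the per-word choice would be driven by context and the target language rather than by a fixed rule like word length.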


Author(s):  
Hutheifa Y. Turki ◽  
Juma’a Q. Hussein ◽  
Ahmed A. Al-Kubaisy

This paper investigates how Iraqi EFL learners refuse different speech acts across different proficiency levels. It examines the strategies used by 2nd-year students of English, as compared with those of 4th-year students, when refusing their interlocutors' invitations, suggestions, and offers. A WDCT questionnaire was used to collect data from 40 Iraqi undergraduate students of English: 20 2nd-year and 20 4th-year. Adopting Beebe et al.'s (1990) theory of refusal, the collected data were analyzed quantitatively using statistical analysis. The findings revealed that the 2nd-year students used direct refusals more frequently than their 4th-year counterparts, meaning the latter were more aware of refusing politely than the former. Conversely, the 4th-year students used indirect refusal strategies more frequently than the 2nd-year students. This indicates that EFL learners at a low proficiency level may fail to bridge the gap between pragmalinguistic strategies and the grammatical form of the target language; that is, they were not pragmatically competent in the use of appropriate pragmalinguistic strategies. This implies that 2nd-year students need to pay more attention to pragmatics and use refusal strategies appropriately. The paper therefore recommends further research on the use of the refusal speech act in Arabic and English.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Michael Adjeisah ◽  
Guohua Liu ◽  
Douglas Omwenga Nyabuga ◽  
Richard Nuetey Nortey ◽  
Jinling Song

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains challenging. This research contributes to the domain through low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often difficult to determine what a good-quality corpus looks like in low-resource conditions, particularly when the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose expanding the training data by injecting a synthetic-parallel corpus, obtained by translating a monolingual corpus from the target language via bootstrapping with different parameter settings. Furthermore, we perform unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpora demonstrate that injecting a pseudo-parallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach shows substantial gains in BLEU and TER scores.
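The squared-Mahalanobis filtering step can be illustrated as follows: score each sentence pair's feature vector by its squared Mahalanobis distance from the sample mean, and drop the most atypical pairs. This is a generic sketch under invented assumptions (random stand-in features, a 90th-percentile cutoff), not the paper's pipeline:

```python
import numpy as np

def squared_mahalanobis(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)        # sample covariance of the features
    inv_cov = np.linalg.inv(cov)
    diff = X - mu
    # Quadratic form diff_i^T * inv_cov * diff_i for every row i.
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 4))        # stand-in sentence-pair features
d2 = squared_mahalanobis(feats)
keep = d2 < np.quantile(d2, 0.9)         # drop the 10% most atypical pairs
print(keep.sum())                        # 90
```

In a real filtering setup, the feature vectors would come from joint embeddings of the source and target sentences, and the cutoff would be tuned on held-out data.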


2020 ◽  
Vol 34 (05) ◽  
pp. 8830-8837
Author(s):  
Xin Sheng ◽  
Linli Xu ◽  
Junliang Guo ◽  
Jingchang Liu ◽  
Ruoyu Zhao ◽  
...  

We propose a novel introspective model for variational neural machine translation (IntroVNMT) in this paper, inspired by the recent successful application of the introspective variational autoencoder (IntroVAE) to high-quality image synthesis. Unlike the vanilla variational NMT model, IntroVNMT is capable of improving itself introspectively by evaluating the quality of the generated target sentences according to the high-level latent variables of the real and generated target sentences. As a consequence of introspective training, the proposed model is able to discriminate between generated and real sentences of the target language via the latent variables produced by the model's encoder. In this way, IntroVNMT is able to generate more realistic target sentences in practice. At the same time, IntroVNMT inherits the advantages of variational autoencoders (VAEs), and its training process is more stable than that of generative adversarial network (GAN) based models. Experimental results on different translation tasks demonstrate that the proposed model achieves significant improvements over the vanilla variational NMT model.


2020 ◽  
Vol 34 (05) ◽  
pp. 8568-8575
Author(s):  
Xing Niu ◽  
Marine Carpuat

This work aims to produce translations that convey source language content at a formality level that is appropriate for a particular audience. Framing this problem as a neural sequence-to-sequence task ideally requires training triplets consisting of a bilingual sentence pair labeled with target language formality. However, in practice, available training examples are limited to English sentence pairs of different styles, and bilingual parallel sentences of unknown formality. We introduce a novel training scheme for multi-task models that automatically generates synthetic training triplets by inferring the missing element on the fly, thus enabling end-to-end training. Comprehensive automatic and human assessments show that our best model outperforms existing models by producing translations that better match desired formality levels while preserving the source meaning.


2019 ◽  
Vol 28 (3) ◽  
pp. 447-453 ◽  
Author(s):  
Sainik Kumar Mahata ◽  
Dipankar Das ◽  
Sivaji Bandyopadhyay

Abstract Machine translation (MT) is the automatic translation of the source language to its target language by a computer system. In the current paper, we propose an approach of using recurrent neural networks (RNNs) over traditional statistical MT (SMT). We compare the performance of the phrase table of SMT to the performance of the proposed RNN and in turn improve the quality of the MT output. This work has been done as a part of the shared task problem provided by the MTIL2017. We have constructed the traditional MT model using Moses toolkit and have additionally enriched the language model using external data sets. Thereafter, we have ranked the phrase tables using an RNN encoder-decoder module created originally as a part of the GroundHog project of LISA lab.


Babel ◽  
2004 ◽  
Vol 50 (3) ◽  
pp. 215-229 ◽  
Author(s):  
Mohammad Al-Khawalda

This paper investigates the accuracy of the translation of the Arabic copula kaana (be-past-he) in the holy Quran. The first one hundred usages of kaana are selected for investigation, derived exclusively from Surat al-Baqarah (1) and Surat Ali-Imran (2). The translation under discussion is taken from ‘Holy Quran, CD, 6th ed., Saxir for Computer Programs’. The translation has been checked via back translation, which was compared with the original temporal and aspectual meaning expressed by the usage of kaana. It turns out that the translation of kaana caused confusion rather than understanding. Most of the inadequacies appear to result from insufficient understanding of the mechanism of tense and aspect in both Arabic and English. Moreover, in most cases the modal usage of kaana, which plays a significant role, is ignored by the translator(s). In addition to back translation carried out by some scholars, the translation has also been checked via machine translation, which shows a real abuse of the original text.


2017 ◽  
Vol 108 (1) ◽  
pp. 283-294 ◽  
Author(s):  
Álvaro Peris ◽  
Mara Chinea-Ríos ◽  
Francisco Casacuberta

Abstract Corpora are precious resources, as they allow for a proper estimation of statistical machine translation models. Data selection is a variant of the domain adaptation field, aimed at extracting those sentences from an out-of-domain corpus that are the most useful for translating a different target domain. We address the data selection problem in statistical machine translation as a classification task. We present a new method, based on neural networks, able to deal with monolingual and bilingual corpora. Empirical results show that our data selection method provides slightly better translation quality, compared to a state-of-the-art method (cross-entropy), while requiring substantially less data. Moreover, the results obtained are coherent across different language pairs, demonstrating the robustness of our proposal.
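Data selection framed as classification can be illustrated with a deliberately simplified stand-in: train a classifier to separate in-domain from out-of-domain sentences, then keep the pool sentences it scores as in-domain. A logistic regression over character n-grams replaces the paper's neural network here, and the tiny corpora are invented:

```python
# Simplified stand-in for data selection as classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

in_domain = ["the patient received a dose", "symptoms improved after treatment"]
out_domain = ["the stock market fell sharply", "the team won the final match"]
pool = ["the dose was reduced gradually", "shares rallied in late trading"]

# Character n-gram features are robust for short, noisy sentences.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(in_domain + out_domain)
y = [1] * len(in_domain) + [0] * len(out_domain)   # 1 = in-domain

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(vec.transform(pool))[:, 1]
selected = [s for s, p in zip(pool, scores) if p > 0.5]
print(list(zip(pool, scores.round(3))))
```

In practice the classifier's probability becomes a ranking score, and one keeps the top-scoring fraction of the out-of-domain pool rather than thresholding at 0.5.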

