Topic-Based Dissimilarity and Sensitivity Models for Translation Rule Selection

Journal of Artificial Intelligence Research ◽

10.1613/jair.4265 ◽

2014 ◽

Vol 50 ◽

pp. 1-30 ◽

Cited By ~ 3

Author(s):

M. Zhang ◽

X. Xiao ◽

D. Xiong ◽

Q. Liu

Keyword(s):

Machine Translation ◽

Topic Model ◽

Statistical Machine Translation ◽

Model Space ◽

Target Language ◽

Translation Quality ◽

Rule Selection ◽

Translation Rule ◽

Selection Experiments ◽

Target Side

Translation rule selection is the task of selecting appropriate translation rules for an ambiguous source-language segment. As translation ambiguities are pervasive in statistical machine translation, we introduce two topic-based models for translation rule selection that incorporate global topic information into translation disambiguation. We associate each synchronous translation rule with source- and target-side topic distributions. With these topic distributions, we propose a topic dissimilarity model to select desirable (less dissimilar) rules by imposing penalties for rules whose topic distributions are highly dissimilar to those of given documents. In order to encourage the use of non-topic-specific translation rules, we also present a topic sensitivity model to balance translation rule selection between generic rules and topic-specific rules. Furthermore, we project target-side topic distributions onto the source-side topic model space so that we can benefit from topic information of both the source and target language. We integrate the proposed topic dissimilarity and sensitivity models into hierarchical phrase-based machine translation for synchronous translation rule selection. Experiments show that our topic-based translation rule selection model can substantially improve translation quality.
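
The dissimilarity idea in the abstract above can be illustrated with a small sketch. The abstract does not name a specific measure, so the Hellinger distance used here is an assumption, as is the use of entropy as a rough indicator of how topic-specific a rule is. All function names and the three-topic distributions are hypothetical values chosen for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def topic_entropy(p):
    """Shannon entropy in nats; zero-probability topics contribute nothing."""
    return -sum(a * math.log(a) for a in p if a > 0.0)


# Hypothetical topic distributions over three topics (not taken from the paper).
rule_topics = [0.7, 0.2, 0.1]
doc_close = [0.6, 0.3, 0.1]
doc_far = [0.1, 0.2, 0.7]

# A rule would be penalized more for the document whose topics differ more from its own.
dissimilarity_close = hellinger(rule_topics, doc_close)
dissimilarity_far = hellinger(rule_topics, doc_far)

# Assumed proxy: lower entropy suggests a more topic-specific rule, higher entropy a more generic one.
rule_entropy = topic_entropy(rule_topics)
```

In a decoder, scores of this kind would typically enter as additional features alongside the usual translation model scores; how the authors weight them is not stated in the abstract.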

Pre-Reordering for Neural Machine Translation: Helpful or Harmful?

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0018 ◽

2017 ◽

Vol 108 (1) ◽

pp. 171-182 ◽

Cited By ~ 5

Author(s):

Jinhua Du ◽

Andy Way

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Word Class ◽

Word Embeddings ◽

Neural Machine Translation ◽

Parts Of Speech ◽

Translation Quality ◽

The Impact ◽

Japanese English ◽

Target Side

Pre-reordering, a preprocessing step that makes the source-side word order close to that of the target side, has been proven very helpful for statistical machine translation (SMT) in improving translation quality. However, is this also the case in neural machine translation (NMT)? In this paper, we first investigate the impact of pre-reordered source-side data on NMT, and then propose to incorporate features for the pre-reordering model in SMT as input factors into NMT (factored NMT). The features, namely parts-of-speech (POS), word class and reordered index, are encoded as feature vectors and concatenated to the word embeddings to provide extra knowledge for NMT. Pre-reordering experiments conducted on Japanese↔English and Chinese↔English show that pre-reordering the source-side data for NMT is redundant and that NMT models trained on pre-reordered data deteriorate translation performance. However, factored NMT using SMT-based pre-reordering features on Japanese→English and Chinese→English is beneficial and can further improve by 4.48 and 5.89 relative BLEU points, respectively, compared to the baseline NMT system.
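
The concatenation of feature vectors to word embeddings described above can be sketched as follows. This is a minimal illustration with plain Python lists, not the authors' implementation: the lookup tables, their dimensions, the example token, and the function name `factored_input` are all hypothetical.

```python
def factored_input(word, pos, word_class, reordered_index, tables):
    """Concatenate the word embedding with one feature vector per input factor."""
    return (
        tables["word"][word]
        + tables["pos"][pos]
        + tables["class"][word_class]
        + tables["index"][reordered_index]
    )


# Hypothetical toy lookup tables; in a real system these would be learned embeddings.
tables = {
    "word": {"neko": [0.1, 0.2, 0.3, 0.4]},  # 4-dimensional word embedding
    "pos": {"NOUN": [1.0, 0.0]},             # 2-dimensional POS vector
    "class": {17: [0.5, 0.5]},               # 2-dimensional word-class vector
    "index": {2: [0.0, 1.0]},                # 2-dimensional reordered-index vector
}

# The encoder input for one token is the 4 + 2 + 2 + 2 = 10 dimensional concatenation.
vector = factored_input("neko", "NOUN", 17, 2, tables)
```

Concatenation keeps each factor in its own slice of the input vector, so the network can learn how much weight to give the word itself versus the extra features.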

“Bilingual Expert” Can Find Translation Errors

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016367 ◽

2019 ◽

Vol 33 ◽

pp. 6367-6374 ◽

Cited By ~ 7

Author(s):

Kai Fan ◽

Jiayi Wang ◽

Bo Li ◽

Fengming Zhou ◽

Boxing Chen ◽

...

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Language Model ◽

Target Language ◽

Quality Estimation ◽

Parallel Corpora ◽

Translation Quality ◽

Expert Model ◽

Parallel Data ◽

Translation Errors

The performance of machine translation (MT) systems is usually evaluated with the BLEU metric when gold references are provided. However, in the case of model inference or production deployment, gold references are usually expensive to obtain, since they require human annotation with bilingual expertise. In order to address the issue of translation quality estimation (QE) without references, we propose a general framework for automatic evaluation of the translation output for the QE task in the Conference on Statistical Machine Translation (WMT). We first build a conditional target language model with a novel bidirectional transformer, named the neural bilingual expert model, which is pre-trained on large parallel corpora for feature extraction. For QE inference, the bilingual expert model can simultaneously produce the joint latent representation between the source and the translation, and real-valued measurements of possible erroneous tokens based on the prior knowledge learned from parallel data. Subsequently, the features are fed into a simple Bi-LSTM predictive model for quality estimation. The experimental results show that our approach achieves state-of-the-art performance on most publicly available datasets of the WMT 2017/2018 QE task.

Modality and Negation in SIMT Use of Modality and Negation in Semantically-Informed Syntactic MT

Computational Linguistics ◽

10.1162/coli_a_00099 ◽

2012 ◽

Vol 38 (2) ◽

pp. 411-438 ◽

Cited By ~ 11

Author(s):

Kathryn Baker ◽

Michael Bloodgood ◽

Bonnie J. Dorr ◽

Chris Callison-Burch ◽

Nathaniel W. Filardo ◽

...

Keyword(s):

Machine Translation ◽

Semantic Information ◽

Statistical Machine Translation ◽

Target Language ◽

Annotation Scheme ◽

Data Set ◽

Translation Quality ◽

Semantic Annotations ◽

Language Technology ◽

Johns Hopkins University

This article describes the resource- and system-building efforts of an 8-week Johns Hopkins University Human Language Technology Center of Excellence Summer Camp for Applied Language Exploration (SCALE-2009) on Semantically Informed Machine Translation (SIMT). We describe a new modality/negation (MN) annotation scheme, the creation of a (publicly available) MN lexicon, and two automated MN taggers that we built using the annotation scheme and lexicon. Our annotation scheme isolates three components of modality and negation: a trigger (a word that conveys modality or negation), a target (an action associated with modality or negation), and a holder (an experiencer of modality). We describe how our MN lexicon was semi-automatically produced and we demonstrate that a structure-based MN tagger results in precision around 86% (depending on genre) for tagging of a standard LDC data set. We apply our MN annotation scheme to statistical machine translation using a syntactic framework that supports the inclusion of semantic annotations. Syntactic tags enriched with semantic annotations are assigned to parse trees in the target-language training texts through a process of tree grafting. Although the focus of our work is modality and negation, the tree grafting procedure is general and supports other types of semantic information. We exploit this capability by including named entities, produced by a pre-existing tagger, in addition to the MN elements produced by the taggers described here. The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu–English test set. This finding supports the hypothesis that both syntactic and semantic information can improve translation quality.

Neural Network Machine Translation Method Based on Unsupervised Domain Adaptation

Complexity ◽

10.1155/2020/6657344 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Rui Wang

Keyword(s):

Neural Network ◽

Machine Translation ◽

Large Scale ◽

Domain Adaptation ◽

Structural Information ◽

Statistical Machine Translation ◽

Target Language ◽

Great Success ◽

Parallel Corpora ◽

Translation Rule

Relying on large-scale parallel corpora, neural machine translation has achieved great success in certain language pairs. However, the acquisition of high-quality parallel corpora is one of the main difficulties in machine translation research. In order to solve this problem, this paper proposes an unsupervised domain-adaptive neural network machine translation method. This method can be trained using only two unrelated monolingual corpora and obtain a good translation result. This article first measures the matching degree of translation rules by adding relevant subject information to the translation rules and dynamically calculating the similarity between each translation rule and the document to be translated during the decoding process. Secondly, through the joint training of multiple training tasks, the source language can learn useful semantic and structural information from the monolingual corpus of a third language that is not parallel to the current two languages during the process of translation into the target language. Experimental results show that the method obtains better results than traditional statistical machine translation.

Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information

Computational Linguistics ◽

10.1162/089120104323093285 ◽

2004 ◽

Vol 30 (2) ◽

pp. 181-204 ◽

Cited By ~ 31

Author(s):

Sonja Nießen ◽

Hermann Ney

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Training Data ◽

Target Language ◽

Linguistic Knowledge ◽

Parallel Corpora ◽

Scarce Resources ◽

Translation Quality ◽

Sentence Level ◽

Statistical Systems

In statistical machine translation, correspondences between the words in the source and the target language are learned from parallel corpora, and often little or no linguistic knowledge is used to structure the underlying models. In particular, existing statistical systems for machine translation often treat different inflected forms of the same lemma as if they were independent of one another. The bilingual training data can be better exploited by explicitly taking into account the interdependencies of related inflected forms. We propose the construction of hierarchical lexicon models on the basis of equivalence classes of words. In addition, we introduce sentence-level restructuring transformations which aim at the assimilation of word order in related sentences. We have systematically investigated the amount of bilingual training data required to maintain an acceptable quality of machine translation. The combination of the suggested methods for improving translation quality in frameworks with scarce resources has been successfully tested: We were able to reduce the amount of bilingual training data to less than 10% of the original corpus, while losing only 1.6% in translation quality. The improvement of the translation results is demonstrated on two German-English corpora taken from the Verbmobil task and the Nespole! task.

Analysis Accuracy of Similar Word Based Clustering (EWSB) Algorithm on Machine Translator Bahasa Indonesia-Minang

Kinetik Game Technology Information System Computer Network Computing Electronics and Control ◽

10.22219/kinetik.v3i3.241 ◽

2018 ◽

Vol 3 (3) ◽

Author(s):

Herry Sujaini

Keyword(s):

Machine Translation ◽

Clustering Algorithm ◽

Statistical Machine Translation ◽

Target Language ◽

Word Similarity ◽

Similar Word ◽

Word Clustering ◽

Translation Accuracy ◽

Bahasa Indonesia

Extended Word Similarity Based (EWSB) Clustering is a word clustering algorithm based on word similarity values computed from a corpus. One of the benefits of clustering with this algorithm is improved translation in statistical machine translation. Previous research showed that the EWSB algorithm could improve an Indonesian-English translator, where the algorithm was applied to Indonesian as the target language. This paper discusses the results of research using the EWSB algorithm on an Indonesian-to-Minang statistical machine translator, where the algorithm is applied to Minang as the target language. The research found that the EWSB algorithm is quite effective when used with Minang as the target language. The results of this study indicate that the EWSB algorithm can improve translation accuracy by 6.36%.

A Context-Aware Topic Model for Statistical Machine Translation

10.3115/v1/p15-1023 ◽

2015 ◽

Cited By ~ 5

Author(s):

Jinsong Su ◽

Deyi Xiong ◽

Yang Liu ◽

Xianpei Han ◽

Hongyu Lin ◽

...

Keyword(s):

Machine Translation ◽

Topic Model ◽

Statistical Machine Translation ◽

Context Aware

Discontinuous Statistical Machine Translation with Target-Side Dependency Syntax

10.18653/v1/w15-3029 ◽

2015 ◽

Author(s):

Nina Seemann ◽

Andreas Maletti

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Dependency Syntax ◽

Target Side

Improving statistical machine translation using lexicalized rule selection

Proceedings of the 22nd International Conference on Computational Linguistics - COLING '08 ◽

10.3115/1599081.1599122 ◽

2008 ◽

Cited By ~ 6

Author(s):

Zhongjun He ◽

Qun Liu ◽

Shouxun Lin

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Rule Selection

Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0025 ◽

2017 ◽

Vol 108 (1) ◽

pp. 257-269 ◽

Cited By ~ 4

Author(s):

Nasser Zalmout ◽

Nizar Habash

Keyword(s):

Machine Translation ◽

Performance Enhancement ◽

Statistical Machine Translation ◽

Target Language ◽

Source Language ◽

Context Variable ◽

Significant Performance ◽

Morphologically Rich Languages ◽

Target Languages ◽

Language Text

Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within the same text, and also for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language schemes, and that a context-variable tokenization scheme can outperform a context-constant scheme with a statistically significant performance enhancement of about 1.4 BLEU points.
