DiaBLa: a corpus of bilingual spontaneous written dialogues for machine translation

Language Resources and Evaluation ◽

10.1007/s10579-020-09514-4 ◽

2020 ◽

Author(s):

Rachel Bawden ◽

Eric Bilinski ◽

Thomas Lavergne ◽

Sophie Rosset

Keyword(s):

Machine Translation ◽

Role Play ◽

A Posteriori ◽

Mediated Communication ◽

Test Set ◽

Fine Grained ◽

Initial Analysis ◽

Sentence Level

We present a new English–French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves, as well as by manually normalised versions and reference translations produced a posteriori. The motivation for the corpus is twofold: to provide (i) a unique resource for evaluating MT models, and (ii) a corpus for the analysis of MT-mediated communication. We provide an initial analysis of the corpus to confirm that the participants’ judgments reveal perceptible differences in MT quality between the two MT systems used.

Enhancing Lexical Translation Consistency for Document-Level Neural Machine Translation

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3485469 ◽

2022 ◽

Vol 21 (3) ◽

pp. 1-21

Author(s):

Xiaomian Kang ◽

Yang Zhao ◽

Jiajun Zhang ◽

Chengqing Zong

Keyword(s):

Machine Translation ◽

English Translation ◽

Test Set ◽

Neural Machine Translation ◽

Global Context ◽

Translation Quality ◽

Sentence Level ◽

Document Level

Document-level neural machine translation (DocNMT) has yielded attractive improvements. In this article, we systematically analyze the discourse phenomena in Chinese-to-English translation and focus on the most obvious one, namely lexical translation consistency. To alleviate lexical inconsistency, we propose an effective approach that is aware of the words which need to be translated consistently and constrains the model to produce more consistent translations. Specifically, we first introduce a global context extractor to extract the document context and consistency context, respectively. Then, the two types of global context are integrated into an encoder enhancer and a decoder enhancer to improve the lexical translation consistency. We create a test set to evaluate the lexical consistency automatically. Experiments demonstrate that our approach can significantly alleviate the lexical translation inconsistency. In addition, our approach can also substantially improve the translation quality compared to sentence-level Transformer.
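
The idea of measuring lexical translation consistency automatically can be illustrated with a minimal sketch. This is not the metric or test set described in the abstract: the function name `lexical_consistency`, the word-aligned input format, and the scoring rule (share of repeated source words that always receive the same translation) are all assumptions made for illustration.

```python
from collections import defaultdict


def lexical_consistency(aligned_pairs):
    """Share of repeated source words that are always translated the same way.

    aligned_pairs: iterable of (source_word, translated_word) tuples collected
    from one document. The alignment itself is assumed to be given.
    """
    translations = defaultdict(list)
    for source_word, translated_word in aligned_pairs:
        translations[source_word].append(translated_word)

    # Only words that occur more than once can be translated inconsistently.
    repeated = [targets for targets in translations.values() if len(targets) > 1]
    if not repeated:
        return 1.0

    consistent = sum(1 for targets in repeated if len(set(targets)) == 1)
    return consistent / len(repeated)


# Hypothetical romanized Chinese source words with their English translations.
pairs = [
    ("yinhang", "bank"),
    ("yinhang", "bank"),
    ("xieyi", "agreement"),
    ("xieyi", "deal"),
    ("shichang", "market"),
]
score = lexical_consistency(pairs)
```

In this toy document two source words repeat, and only one of them is translated consistently, so the score is 0.5. A real evaluation would also need to decide which words ought to be consistent, which this sketch ignores.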

The Suboptimal WMT Test Sets and Its Impact on Human Parity

10.20944/preprints202110.0199.v1 ◽

2021 ◽

Author(s):

Ahrii Kim ◽

Yunju Bak ◽

Jimin Sun ◽

Sungwon Lyu ◽

Changmin Lee

Keyword(s):

Machine Translation ◽

Web Crawling ◽

Data Set ◽

Test Set ◽

Neural Machine Translation ◽

Sentence Level ◽

Test Sets ◽

Source Test

With the advent of Neural Machine Translation, the more often human-machine parity is claimed at WMT, the more we must ask whether its evaluation environment can be trusted. In this paper, we argue that the low quality of the source test set of the news track at WMT may lead to an overrated human parity claim. First, we report nine types of so-called technical contaminants in the data set, originating from an absence of meticulous inspection after web-crawling. Our empirical findings show that when they are corrected, about 5% of the segments that have previously achieved a human parity claim turn out to be statistically invalid. This tendency becomes evident when only the contaminated sentences are considered. To the best of our knowledge, this is the first attempt to question the “source” side of the test set as a potential cause of the overclaim of human parity. As evidence, we show that, according to sentence-level TER scores, those trivial errors change a good part of system translations. We conclude that overlooking this would be a mistake, especially in NMT evaluation.

Improving Context-Aware Neural Machine Translation Using Self-Attentive Sentence Embedding

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6494 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9498-9506 ◽

Cited By ~ 1

Author(s):

Hyeongu Yun ◽

Yongkeun Hwang ◽

Kyomin Jung

Keyword(s):

Machine Translation ◽

Contextual Information ◽

Context Aware ◽

Pronoun Resolution ◽

Test Set ◽

Neural Machine Translation ◽

Attentional Networks ◽

Multiple Context ◽

Sentence Level ◽

Level Information

Fully Attentional Networks (FAN) like Transformer (Vaswani et al. 2017) have shown superior results in Neural Machine Translation (NMT) tasks and have become a solid baseline for translation tasks. More recent studies have also reported experimental results showing that additional contextual sentences improve the translation quality of NMT models (Voita et al. 2018; Müller et al. 2018; Zhang et al. 2018). However, those studies have exploited multiple context sentences as a single long concatenated sentence, which may cause the models to suffer from inefficient computational complexities and long-range dependencies. In this paper, we propose Hierarchical Context Encoder (HCE) that is able to exploit multiple context sentences separately using the hierarchical FAN structure. Our proposed encoder first abstracts sentence-level information from preceding sentences in a self-attentive way, and then hierarchically encodes context-level information. Through extensive experiments, we observe that our HCE records the best performance measured in BLEU score on English-German, English-Turkish, and English-Korean corpora. In addition, we observe that our HCE records the best performance on a crowd-sourced test set which is designed to evaluate how well an encoder can exploit contextual information. Finally, evaluation on an English-Korean pronoun resolution test suite also shows that our HCE can properly exploit contextual information.

Context-Aware Neural Machine Translation for Korean Honorific Expressions

Electronics ◽

10.3390/electronics10131589 ◽

2021 ◽

Vol 10 (13) ◽

pp. 1589

Author(s):

Yongkeun Hwang ◽

Yanghoon Kim ◽

Kyomin Jung

Keyword(s):

Machine Translation ◽

Deep Neural Networks ◽

Contextual Information ◽

Context Aware ◽

Neural Machine Translation ◽

Translation Quality ◽

Sentence Level ◽

Proposed Model ◽

The Given ◽

The Relationship

Neural machine translation (NMT) is one of the text generation tasks which has achieved significant improvement with the rise of deep neural networks. However, language-specific problems such as handling the translation of honorifics have received little attention. In this paper, we propose a context-aware NMT to promote translation improvements of Korean honorifics. By exploiting information such as the relationship between speakers from the surrounding sentences, our proposed model effectively manages the use of honorific expressions. Specifically, we utilize a novel encoder architecture that can represent the contextual information of the given input sentences. Furthermore, a context-aware post-editing (CAPE) technique is adopted to refine a set of inconsistent sentence-level honorific translations. To demonstrate the efficacy of the proposed method, honorific-labeled test data is required. Thus, we also design a heuristic that labels Korean sentences to distinguish between honorific and non-honorific styles. Experimental results show that our proposed method outperforms sentence-level NMT baselines both in overall translation quality and honorific translations.

Using Word Embeddings to Enforce Document-Level Lexical Consistency in Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0011 ◽

2017 ◽

Vol 108 (1) ◽

pp. 85-96 ◽

Cited By ~ 2

Author(s):

Eva Martínez Garcia ◽

Carles Creus ◽

Cristina España-Bonet ◽

Lluís Màrquez

Keyword(s):

Machine Translation ◽

Evaluation Metrics ◽

Automatic Evaluation ◽

Word Embeddings ◽

Standard Document ◽

Sentence Level ◽

Word Translation ◽

Stochastic Mechanism ◽

Document Level

We integrate new mechanisms in a document-level machine translation decoder to improve the lexical consistency of document translations. First, we develop a document-level feature designed to score the lexical consistency of a translation. This feature, which applies to words that have been translated into different forms within the document, uses word embeddings to measure the adequacy of each word translation given its context. Second, we extend the decoder with a new stochastic mechanism that, at translation time, allows changes to be introduced in the translation to improve its lexical consistency. We evaluate our system on English–Spanish document translation, and we conduct automatic and manual assessments of its quality. The automatic evaluation metrics, applied mainly at sentence level, do not reflect significant variations. By contrast, the manual evaluation shows that the system dealing with lexical consistency is preferred over both a standard sentence-level and a standard document-level phrase-based MT system.

Efficient Context-Aware Neural Machine Translation with Layer-Wise Weighting and Input-Aware Gating

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/544 ◽

2020 ◽

Author(s):

Hongfei Xu ◽

Deyi Xiong ◽

Josef van Genabith ◽

Qiuhui Liu

Keyword(s):

Machine Translation ◽

Contextual Information ◽

Computational Cost ◽

Representation Learning ◽

Vital Role ◽

Context Aware ◽

Neural Machine Translation ◽

Gating Mechanism ◽

Sentence Level ◽

Parallel Data

Existing Neural Machine Translation (NMT) systems are generally trained on a large amount of sentence-level parallel data, and during prediction sentences are independently translated, ignoring cross-sentence contextual information. This leads to inconsistency between translated sentences. In order to address this issue, context-aware models have been proposed. However, document-level parallel data constitutes only a small part of the parallel data available, and many approaches build context-aware models based on a pre-trained frozen sentence-level translation model in a two-step training manner. The computational cost of these approaches is usually high. In this paper, we propose to make the most of layers pre-trained on sentence-level data in contextual representation learning, reusing representations from the sentence-level Transformer and significantly reducing the cost of incorporating contexts in translation. We find that representations from shallow layers of a pre-trained sentence-level encoder play a vital role in source context encoding, and propose to perform source context encoding upon weighted combinations of pre-trained encoder layers' outputs. Instead of separately performing source context and input encoding, we propose to iteratively and jointly encode the source input and its contexts and to generate input-aware context representations with a cross-attention layer and a gating mechanism, which resets irrelevant information in context encoding. Our context-aware Transformer model outperforms the recent CADec [Voita et al., 2019c] on the English-Russian subtitle data and is about twice as fast in training and decoding.

Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation

Computational Intelligence and Neuroscience ◽

10.1155/2021/6682385 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Michael Adjeisah ◽

Guohua Liu ◽

Douglas Omwenga Nyabuga ◽

Richard Nuetey Nortey ◽

Jinling Song

Keyword(s):

Machine Translation ◽

Language Processing ◽

Training Data ◽

Target Language ◽

Similarity Metrics ◽

Mahalanobis Distances ◽

Parallel Corpora ◽

Parallel Corpus ◽

Low Resource ◽

Sentence Level

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain on a low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often perplexing to learn and understand what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve the MT performance in such low-resource language pairs, we propose to expand the training data by injecting synthetic-parallel corpus obtained by translating a monolingual corpus from the target language based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair engaging squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on a diverse amount of available parallel corpus demonstrate that injecting pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits tremendous developments in BLEU and TER scores.
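
The squared Mahalanobis distance mentioned above can be sketched in a few lines. This is a minimal two-feature illustration, not the authors' pipeline: the function name `squared_mahalanobis_2d`, the choice of features, the reference points, and the threshold are all hypothetical. The distance is d² = (x − μ)ᵀ S⁻¹ (x − μ), where μ and S are the mean and sample covariance of a set of trusted sentence pairs.

```python
def squared_mahalanobis_2d(points, x):
    """Squared Mahalanobis distance of x from the distribution of 2-D points.

    points: list of (feature_1, feature_2) tuples for trusted sentence pairs.
    x: (feature_1, feature_2) tuple for the candidate sentence pair.
    """
    n = len(points)
    mean_1 = sum(p[0] for p in points) / n
    mean_2 = sum(p[1] for p in points) / n

    # Entries of the 2x2 sample covariance matrix S.
    s11 = sum((p[0] - mean_1) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - mean_2) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - mean_1) * (p[1] - mean_2) for p in points) / (n - 1)

    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")

    d1, d2 = x[0] - mean_1, x[1] - mean_2
    # Closed form of d^T S^-1 d using the inverse of a 2x2 matrix.
    return (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det


# Hypothetical feature vectors (for example, a length ratio and a similarity score).
reference_pairs = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
candidates = [(1.0, 1.0), (3.0, 1.0), (9.0, 9.0)]
THRESHOLD = 4.0  # assumed cut-off, not taken from the paper

kept = [c for c in candidates
        if squared_mahalanobis_2d(reference_pairs, c) <= THRESHOLD]
```

Candidates far from the bulk of the reference distribution, such as `(9.0, 9.0)` here, receive a large distance and are filtered out. With more than two features, a linear algebra library would be the natural choice for inverting the covariance matrix.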

Identity Theories and Technology

Handbook of Research on Technoself ◽

10.4018/978-1-4666-2211-1.ch002 ◽

2013 ◽

pp. 26-44 ◽

Cited By ~ 1

Author(s):

Robert Andrew Dunn

Keyword(s):

Communication Theory ◽

Computer Mediated Communication ◽

Role Play ◽

Multiple Identities ◽

The Internet ◽

Mediated Communication ◽

Virtual Identity ◽

Computer Mediated ◽

The Impact ◽

Modern Identity

Modern identity has been shaped by technology, which has in turn shaped theories in understanding identity. How one communicates who they are to others is given limitless possibilities by the advent of the Internet and computer-mediated environments. Thus, identity theory today must take into account computer-mediated communication theory and research. Such research indicates four ways in which identity is affected by technology. First, researchers have discussed the differences between an individual’s true identity and the virtual identity he or she presents, via self-selected text and images, to an online world. Second, researchers have discussed how the Internet can provide both protective anonymity for those who seek it and cathartic disclosure for those who need it. Third, researchers have discussed ways in which users pursue both reflective virtual lives online and role-play with identities, often multiple identities. Fourth, researchers have conducted experiments that reflect the impact that virtual identity has on the practice of communication and the impact communication has on the presentation of the self.

Predicting and Analyzing Language Specificity in Social Media Posts

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016415 ◽

2019 ◽

Vol 33 ◽

pp. 6415-6422 ◽

Cited By ~ 1

Author(s):

Yifan Gao ◽

Yang Zhong ◽

Daniel Preoţiuc-Pietro ◽

Junyi Jessy Li

Keyword(s):

Social Media ◽

Computational Linguistics ◽

Large Scale ◽

Pearson Correlation ◽

Absolute Error ◽

Fine Grained ◽

Argumentation Mining ◽

Sentence Level ◽

Prediction Systems ◽

Mental Health Factors

In computational linguistics, specificity quantifies how much detail is engaged in text. It is an important characteristic of speaker intention and language style, and is useful in NLP applications such as summarization and argumentation mining. Yet to date, expert-annotated data for sentence-level specificity are scarce and confined to the news genre. In addition, systems that predict sentence specificity are classifiers trained to produce binary labels (general or specific). We collect a dataset of over 7,000 tweets annotated with specificity on a fine-grained scale. Using this dataset, we train a supervised regression model that accurately estimates specificity in social media posts, reaching a mean absolute error of 0.3578 (for ratings on a scale of 1-5) and 0.73 Pearson correlation, significantly improving over baselines and previous sentence specificity prediction systems. We also present the first large-scale study revealing the social, temporal and mental health factors underlying language specificity on social media.
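
The two evaluation measures reported above, mean absolute error and Pearson correlation, can be computed as in the following sketch. The ratings are invented toy values on a 1-5 scale, not data from the paper, and the function names are chosen here for illustration.

```python
from math import sqrt


def mean_absolute_error(gold, predicted):
    """Average absolute difference between gold and predicted ratings."""
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Toy specificity ratings on a 1-5 scale (hypothetical values).
gold_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_ratings = [1.5, 2.5, 2.5, 4.5, 4.5]

mae = mean_absolute_error(gold_ratings, predicted_ratings)
r = pearson_correlation(gold_ratings, predicted_ratings)
```

The two measures answer different questions: mean absolute error reports how far predictions are from the gold ratings in rating units, while Pearson correlation reports how well the predictions track the ordering and spread of the gold ratings regardless of a constant offset.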

Computer-Mediated Communication Research

Handbook of Research on Electronic Surveys and Measurements ◽

10.4018/978-1-59140-792-8.ch022 ◽

2007 ◽

pp. 207-222

Author(s):

J.D. Wallace

Keyword(s):

Business Education ◽

Computer Mediated Communication ◽

Scholarly Communication ◽

Communication Research ◽

Journal Articles ◽

Mediated Communication ◽

Fine Grained ◽

Computer Mediated ◽

Core Areas ◽

Education Psychology

This chapter asks “what is meant by computer-mediated communication research?” Numerous databases were examined concerning business, education, psychology, sociology, and social sciences from 1966 through 2005. A survey of the literature produced close to two thousand scholarly journal articles, and bibliometric techniques were used to establish core areas. Specifically, journals, authors and concepts were identified. Then, more prevalent features within the dataset were targeted and a fine-grained analysis was conducted on research-affiliated terms and concepts clustering around those terms. What was found was an area of scholarly communication, heavily popularized in education-related journals. Likewise, topics under investigation tended to be education and internet affiliated. The distribution of first authors was overwhelmingly populated by one-time authorship. The most prominent research methodology emerging was case studies. Other specific research methodologies tended to be textually related, such as content and discourse analysis. This study was significant for two reasons. First, it documented the historical emergence of the CMC literature through a longitudinal analysis. Second, it identified descriptive boundaries concerning authors, journals, and concepts that were prevalent in the literature.
