Guidance to Pre-tokenization for SacreBLEU: Meta-Evaluation in Korean

Author(s):  
Ahrii Kim
Jinhyun Kim

SacreBLEU, which incorporates a text-normalization step in its pipeline, has been well received as an automatic evaluation metric in recent years. For agglutinative languages such as Korean, however, the metric cannot produce a reliable result without customized pre-tokenization. This paper therefore examines the influence of diverse pre-tokenization schemes (word, morpheme, character, and subword) on the metric by performing a meta-evaluation with manually constructed into-Korean human evaluation data. Our empirical study demonstrates that the correlation of SacreBLEU with human judgment varies consistently with the token type. Some tokenizations even degrade the reliability of the metric, and MeCab is no exception. Guiding users toward the proper tokenizer for each metric, we stress the significance of the character level and the insignificance of the Jamo level in MT evaluation.
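To see why token granularity moves n-gram metrics like BLEU, here is a minimal pure-Python sketch (not the paper's experiment; the Korean sentence pair and the unigram-only precision are illustrative assumptions). Word-level tokens treat "매트에" and "매트" as a total mismatch, while character-level tokens credit the shared characters:

```python
from collections import Counter

def ngram_precision(hyp_toks, ref_toks, n):
    # clipped n-gram precision, the core quantity behind BLEU
    hyp_ngrams = Counter(tuple(hyp_toks[i:i + n]) for i in range(len(hyp_toks) - n + 1))
    ref_ngrams = Counter(tuple(ref_toks[i:i + n]) for i in range(len(ref_toks) - n + 1))
    overlap = sum((hyp_ngrams & ref_ngrams).values())
    return overlap / max(sum(hyp_ngrams.values()), 1)

# illustrative reference/hypothesis pair ("the cat sat on the mat")
ref = "고양이가 매트 위에 앉았다"
hyp = "고양이가 매트에 앉았다"

word_p1 = ngram_precision(hyp.split(), ref.split(), 1)           # word-level tokens
char_p1 = ngram_precision(list(hyp.replace(" ", "")),
                          list(ref.replace(" ", "")), 1)         # character-level tokens
print(word_p1, char_p1)  # character tokenization credits partial word matches
```

The same hypothesis scores very differently under the two tokenizations, which is exactly the sensitivity the meta-evaluation measures.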

2020
Vol 65 (1)
pp. 181-205
Author(s):  
Hye-Yeon Chung

Human evaluation (HE) of translation is generally considered valid, but it requires substantial effort. Automatic evaluation (AE), which assesses the quality of machine translations, can be done easily, but it still requires validation. This study addresses the questions of whether and how AE can be used for human translations. For this purpose, AE formulas and HE criteria were compared to each other in order to examine the validity of AE. In the empirical part of the study, 120 translations were evaluated by professional translators as well as by two representative AE systems, BLEU and METEOR. The correlations between AE and HE were relatively high at 0.849** (BLEU) and 0.862** (METEOR) in the overall analysis, but in the ratings of the individual texts AE and HE exhibited substantial differences: the AE-HE correlations were often below 0.3 or even in the negative range. Ultimately, the results indicate that neither METEOR nor BLEU can be used to assess human translation at this stage. The paper nevertheless suggests three ways to apply AE to compensate for the weaknesses of HE.
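A correlation like the 0.849**/0.862** figures above is a plain Pearson coefficient over paired AE and HE scores. A minimal self-contained sketch (the score lists below are invented for illustration, not the study's data):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length score lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical human ratings (1-5 scale) and BLEU scores for five translations
he_scores = [4.5, 3.0, 2.5, 4.0, 3.5]
bleu_scores = [0.62, 0.35, 0.30, 0.55, 0.41]

r = pearson(he_scores, bleu_scores)
print(r)  # high overall correlation, as in the aggregate analysis
```

The study's point is that a high aggregate r can coexist with near-zero or negative r on individual texts, so the aggregate figure alone does not validate the metric.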


Author(s):  
Shaohan Huang
Yu Wu
Furu Wei
Zhongzhi Luan

An intuitive way for a human to write a paraphrase is to replace words or phrases in the original sentence with their synonyms and make the changes necessary to keep the new sentence fluent and grammatically correct. We propose a novel approach to modeling this process with dictionary-guided editing networks, which rewrite the source sentence to generate paraphrases. The model jointly learns to select appropriate word-level and phrase-level paraphrase pairs from an off-the-shelf dictionary, in the context of the original sentence, and to generate fluent natural-language sentences. Specifically, the system retrieves a set of word-level and phrase-level paraphrase pairs derived from the Paraphrase Database (PPDB) for the original sentence, which guides the decision of which words should be deleted or inserted via a soft attention mechanism under the sequence-to-sequence framework. We conduct experiments on two benchmark datasets for paraphrase generation, MSCOCO and Quora. The automatic evaluation results demonstrate that our dictionary-guided editing networks outperform the baseline methods. In human evaluation, the results indicate that the generated paraphrases are grammatically correct and relevant to the input sentence.
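The "replace words with dictionary synonyms" intuition that motivates the model can be sketched in a few lines. This is only the naive baseline the paper improves on, and the dictionary entries below are hypothetical stand-ins for PPDB pairs; the actual model scores candidates in context and re-generates the sentence for fluency:

```python
# hypothetical word-level paraphrase pairs, standing in for PPDB entries
ppdb = {"quick": "fast", "purchase": "buy", "assist": "help"}

def naive_paraphrase(sentence, table):
    # substitute every word that has a dictionary entry; no context,
    # no fluency model - exactly what the editing network adds on top
    return " ".join(table.get(word, word) for word in sentence.split())

print(naive_paraphrase("please assist with the purchase", ppdb))
```

The naive version produces disfluent output ("the buy"), which is why the editing network conditions pair selection on the sentence context.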


2020
Vol 34 (05)
pp. 8050-8057
Author(s):  
Hidetaka Kamigaito ◽  
Manabu Okumura

Sentence compression is the task of compressing a long sentence into a short one by deleting redundant words. In sequence-to-sequence (Seq2Seq) based models, the decoder unidirectionally decides whether to retain or delete each word. It therefore cannot usually capture, explicitly, the relationships between already-decoded words and unseen words that will be decoded in future time steps, and to avoid generating ungrammatical sentences the decoder sometimes drops important words. To solve this problem, we propose a novel Seq2Seq model, the syntactically look-ahead attention network (SLAHAN), which generates informative summaries by explicitly tracking both dependency parent and child words during decoding, capturing important words that will be decoded in the future. In the automatic evaluation on the Google sentence compression dataset, SLAHAN achieved the best kept-token F1, ROUGE-1, ROUGE-2, and ROUGE-L scores of 85.5, 79.3, 71.3, and 79.1, respectively. SLAHAN also improved summarization performance on longer sentences. Furthermore, in the human evaluation, SLAHAN improved informativeness without losing readability.
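The kept-token F1 reported above treats compression as a per-token keep/delete decision and compares the kept sets. A minimal sketch of that metric (the index sets in the example are invented for illustration):

```python
def kept_token_f1(pred_keep, gold_keep):
    # pred_keep / gold_keep: sets of indices of tokens retained in the compression
    tp = len(pred_keep & gold_keep)
    if not pred_keep or not gold_keep or tp == 0:
        return 0.0
    precision = tp / len(pred_keep)
    recall = tp / len(gold_keep)
    return 2 * precision * recall / (precision + recall)

# gold compression keeps tokens 0-3; a model that wrongly deletes token 2
print(kept_token_f1({0, 1, 3}, {0, 1, 2, 3}))
```

Dropping a single important token costs recall directly, which is why a model that looks ahead to words not yet decoded can score higher on this metric.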


2019
Vol 27 (10)
pp. 1497-1506
Author(s):  
Pairui Li
Chuan Chen
Wujie Zheng
Yuetang Deng
Fanghua Ye
...  

Author(s):  
Alexandra Constantin ◽  
Maja Matarić

In this paper, we present a metric for assessing the quality of arm-movement imitation. We develop a joint-rotational-angle-based segmentation and comparison algorithm that rates the pairwise similarity of arm-movement trajectories on a scale of 1 to 10. We describe an empirical study, designed to validate the algorithm, that compares it to human evaluation of imitation. The results provide evidence that the automatic metric's ratings did not significantly differ from human evaluation.
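The core of such a metric is mapping a per-frame joint-angle distance onto the 1-10 scale. This is a deliberately simplified sketch, not the paper's segmentation algorithm: it assumes single-joint trajectories sampled at aligned frames and takes 180 degrees as maximal dissimilarity, both of which are assumptions for illustration:

```python
def imitation_score(angles_a, angles_b, max_diff=180.0):
    # mean absolute per-frame joint-angle difference (degrees), mapped to 1-10
    diffs = [abs(a - b) for a, b in zip(angles_a, angles_b)]
    mean_diff = sum(diffs) / len(diffs)
    return 1.0 + 9.0 * max(0.0, 1.0 - mean_diff / max_diff)

print(imitation_score([0, 45, 90], [0, 45, 90]))    # identical trajectories -> 10
print(imitation_score([0, 0, 0], [180, 180, 180]))  # maximally different -> 1
```

The paper's algorithm additionally segments the trajectories before comparison, so it does not require the frame-by-frame alignment this sketch assumes.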


Author(s):  
Nora Aranberri-Monasterio ◽  
Sharon O'Brien

-ing forms in English are reported to be problematic for Machine Translation and are often the focus of rules in Controlled Language rule sets. We investigated how problematic -ing forms are for an RBMT system translating into four target languages in the IT domain. Constituent-based human evaluation was used, and the results showed that, in general, -ing forms do not deserve their bad reputation. A comparison with the results of five automated MT evaluation metrics showed promising correlations. Some issues prevail, however, and can vary from target language to target language. We propose different strategies for dealing with these problems, such as Controlled Language rules, semi-automatic post-editing, source text tagging and "post-editing" the source text.

