Guidance to Pre-tokenization for SacreBLEU: Meta-Evaluation in Korean

Author(s):  
Ahrii Kim
Jinhyun Kim

SacreBLEU, which incorporates a text-normalization step in its pipeline, has been well received as an automatic evaluation metric in recent years. For agglutinative languages such as Korean, however, the metric cannot produce a reliable result without customized pre-tokenization. This paper therefore examines the influence of diverse pre-tokenization schemes (word, morpheme, character, and subword) on the metric by performing a meta-evaluation with manually constructed into-Korean human evaluation data. Our empirical study demonstrates that the correlation of SacreBLEU with human judgment varies consistently with the token type. Some tokenizations even degrade the reliability of the metric, and MeCab is no exception. Guiding users toward the proper tokenizer for each metric, we stress the significance of the character level and the insignificance of the Jamo level in MT evaluation.
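To see why token granularity moves n-gram metrics like BLEU, here is a minimal pure-Python sketch (not the paper's experiment; the Korean sentence pair and the unigram-only precision are illustrative assumptions). Word-level tokens treat "매트에" and "매트" as a total mismatch, while character-level tokens credit the shared characters:

```python
from collections import Counter

def ngram_precision(hyp_toks, ref_toks, n):
    # clipped n-gram precision, the core quantity behind BLEU
    hyp_ngrams = Counter(tuple(hyp_toks[i:i + n]) for i in range(len(hyp_toks) - n + 1))
    ref_ngrams = Counter(tuple(ref_toks[i:i + n]) for i in range(len(ref_toks) - n + 1))
    overlap = sum((hyp_ngrams & ref_ngrams).values())
    return overlap / max(sum(hyp_ngrams.values()), 1)

# illustrative reference/hypothesis pair ("the cat sat on the mat")
ref = "고양이가 매트 위에 앉았다"
hyp = "고양이가 매트에 앉았다"

word_p1 = ngram_precision(hyp.split(), ref.split(), 1)           # word-level tokens
char_p1 = ngram_precision(list(hyp.replace(" ", "")),
                          list(ref.replace(" ", "")), 1)         # character-level tokens
print(word_p1, char_p1)  # character tokenization credits partial word matches
```

The same hypothesis scores very differently under the two tokenizations, which is exactly the sensitivity the meta-evaluation measures.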

2020
Vol 65 (1)
pp. 181-205
Author(s):  
Hye-Yeon Chung

Human evaluation (HE) of translation is generally considered valid, but it requires substantial effort. Automatic evaluation (AE), which assesses the quality of machine translations, can be done easily, but it still requires validation. This study addresses the questions of whether and how AE can be used for human translations. For this purpose, AE formulas and HE criteria were compared to each other in order to examine the validity of AE. In the empirical part of the study, 120 translations were evaluated by professional translators as well as by two representative AE systems, BLEU and METEOR. The correlations between AE and HE were relatively high at 0.849** (BLEU) and 0.862** (METEOR) in the overall analysis, but in the ratings of the individual texts AE and HE exhibited substantial differences: the AE-HE correlations were often below 0.3 or even in the negative range. Ultimately, the results indicate that neither METEOR nor BLEU can be used to assess human translation at this stage. The paper nevertheless suggests three ways to apply AE to compensate for the weaknesses of HE.
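A correlation like the 0.849**/0.862** figures above is a plain Pearson coefficient over paired AE and HE scores. A minimal self-contained sketch (the score lists below are invented for illustration, not the study's data):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length score lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical human ratings (1-5 scale) and BLEU scores for five translations
he_scores = [4.5, 3.0, 2.5, 4.0, 3.5]
bleu_scores = [0.62, 0.35, 0.30, 0.55, 0.41]

r = pearson(he_scores, bleu_scores)
print(r)  # high overall correlation, as in the aggregate analysis
```

The study's point is that a high aggregate r can coexist with near-zero or negative r on individual texts, so the aggregate figure alone does not validate the metric.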


Author(s):  
Shaohan Huang
Yu Wu
Furu Wei
Zhongzhi Luan

An intuitive way for a human to write a paraphrase is to replace words or phrases in the original sentence with their synonyms and make the changes necessary to keep the new sentence fluent and grammatically correct. We propose a novel approach to modeling this process with dictionary-guided editing networks, which rewrite the source sentence to generate paraphrases. The model jointly learns to select appropriate word-level and phrase-level paraphrase pairs from an off-the-shelf dictionary, in the context of the original sentence, and to generate fluent natural-language sentences. Specifically, the system retrieves a set of word-level and phrase-level paraphrase pairs derived from the Paraphrase Database (PPDB) for the original sentence, which guides the decision of which words should be deleted or inserted via a soft attention mechanism under the sequence-to-sequence framework. We conduct experiments on two benchmark datasets for paraphrase generation, MSCOCO and Quora. The automatic evaluation results demonstrate that our dictionary-guided editing networks outperform the baseline methods. In human evaluation, the results indicate that the generated paraphrases are grammatically correct and relevant to the input sentence.
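The "replace words with dictionary synonyms" intuition that motivates the model can be sketched in a few lines. This is only the naive baseline the paper improves on, and the dictionary entries below are hypothetical stand-ins for PPDB pairs; the actual model scores candidates in context and re-generates the sentence for fluency:

```python
# hypothetical word-level paraphrase pairs, standing in for PPDB entries
ppdb = {"quick": "fast", "purchase": "buy", "assist": "help"}

def naive_paraphrase(sentence, table):
    # substitute every word that has a dictionary entry; no context,
    # no fluency model - exactly what the editing network adds on top
    return " ".join(table.get(word, word) for word in sentence.split())

print(naive_paraphrase("please assist with the purchase", ppdb))
```

The naive version produces disfluent output ("the buy"), which is why the editing network conditions pair selection on the sentence context.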


2020
Vol 34 (05)
pp. 8050-8057
Author(s):  
Hidetaka Kamigaito ◽  
Manabu Okumura

Sentence compression is the task of compressing a long sentence into a short one by deleting redundant words. In sequence-to-sequence (Seq2Seq) based models, the decoder unidirectionally decides whether to retain or delete each word. It therefore cannot usually capture, explicitly, the relationships between already-decoded words and unseen words that will be decoded in future time steps, and to avoid generating ungrammatical sentences the decoder sometimes drops important words. To solve this problem, we propose a novel Seq2Seq model, the syntactically look-ahead attention network (SLAHAN), which generates informative summaries by explicitly tracking both dependency parent and child words during decoding, capturing important words that will be decoded in the future. In the automatic evaluation on the Google sentence compression dataset, SLAHAN achieved the best kept-token F1, ROUGE-1, ROUGE-2, and ROUGE-L scores of 85.5, 79.3, 71.3, and 79.1, respectively. SLAHAN also improved summarization performance on longer sentences. Furthermore, in the human evaluation, SLAHAN improved informativeness without losing readability.
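The kept-token F1 reported above treats compression as a per-token keep/delete decision and compares the kept sets. A minimal sketch of that metric (the index sets in the example are invented for illustration):

```python
def kept_token_f1(pred_keep, gold_keep):
    # pred_keep / gold_keep: sets of indices of tokens retained in the compression
    tp = len(pred_keep & gold_keep)
    if not pred_keep or not gold_keep or tp == 0:
        return 0.0
    precision = tp / len(pred_keep)
    recall = tp / len(gold_keep)
    return 2 * precision * recall / (precision + recall)

# gold compression keeps tokens 0-3; a model that wrongly deletes token 2
print(kept_token_f1({0, 1, 3}, {0, 1, 2, 3}))
```

Dropping a single important token costs recall directly, which is why a model that looks ahead to words not yet decoded can score higher on this metric.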


2019
Vol 27 (10)
pp. 1497-1506
Author(s):  
Pairui Li
Chuan Chen
Wujie Zheng
Yuetang Deng
Fanghua Ye
...  

Author(s):  
Alexandra Constantin ◽  
Maja Matarić

In this paper, we present a metric for assessing the quality of arm-movement imitation. We develop a joint-rotational-angle-based segmentation and comparison algorithm that rates the pairwise similarity of arm-movement trajectories on a scale of 1 to 10. We describe an empirical study, designed to validate the algorithm, that compares it to human evaluation of imitation. The results provide evidence that the automatic metric's ratings did not significantly differ from human evaluation.
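The core of such a metric is mapping a per-frame joint-angle distance onto the 1-10 scale. This is a deliberately simplified sketch, not the paper's segmentation algorithm: it assumes single-joint trajectories sampled at aligned frames and takes 180 degrees as maximal dissimilarity, both of which are assumptions for illustration:

```python
def imitation_score(angles_a, angles_b, max_diff=180.0):
    # mean absolute per-frame joint-angle difference (degrees), mapped to 1-10
    diffs = [abs(a - b) for a, b in zip(angles_a, angles_b)]
    mean_diff = sum(diffs) / len(diffs)
    return 1.0 + 9.0 * max(0.0, 1.0 - mean_diff / max_diff)

print(imitation_score([0, 45, 90], [0, 45, 90]))    # identical trajectories -> 10
print(imitation_score([0, 0, 0], [180, 180, 180]))  # maximally different -> 1
```

The paper's algorithm additionally segments the trajectories before comparison, so it does not require the frame-by-frame alignment this sketch assumes.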


Author(s):  
Nora Aranberri-Monasterio ◽  
Sharon O'Brien

-ing forms in English are reported to be problematic for Machine Translation and are often the focus of rules in Controlled Language rule sets. We investigated how problematic -ing forms are for an RBMT system translating into four target languages in the IT domain. Constituent-based human evaluation was used, and the results showed that, in general, -ing forms do not deserve their bad reputation. A comparison with the results of five automated MT evaluation metrics showed promising correlations. Some issues prevail, however, and can vary from target language to target language. We propose different strategies for dealing with these problems, such as Controlled Language rules, semi-automatic post-editing, source text tagging and "post-editing" the source text.

