Question Answering as an Automatic Evaluation Metric for News Article Summarization

2019 ◽  
Author(s):  
Matan Eyal ◽  
Tal Baumel ◽  
Michael Elhadad
2019 ◽  
Vol 27 (10) ◽  
pp. 1497-1506 ◽  
Author(s):  
Pairui Li ◽  
Chuan Chen ◽  
Wujie Zheng ◽  
Yuetang Deng ◽  
Fanghua Ye ◽  
...  

Author(s):  
Kexin Yang ◽  
Dayiheng Liu ◽  
Qian Qu ◽  
Yongsheng Sang ◽  
Jiancheng Lv

2021 ◽  
Vol 9 ◽  
pp. 774-789
Author(s):  
Daniel Deutsch ◽  
Tania Bedrax-Weiss ◽  
Dan Roth

Abstract A desirable property of a reference-based evaluation metric that measures the content quality of a summary is that it should estimate how much information the summary has in common with a reference. Traditional text-overlap metrics such as ROUGE fail to achieve this because they are limited to matching tokens, either lexically or via embeddings. In this work, we propose a metric to evaluate the content quality of a summary using question answering (QA). QA-based methods directly measure a summary’s information overlap with a reference, making them fundamentally different from text-overlap metrics. We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval. QAEval outperforms current state-of-the-art metrics on most evaluations using benchmark datasets, while being competitive on others due to limitations of current models. Through a careful analysis of each component of QAEval, we identify its performance bottlenecks and estimate that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.
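At its final step, a QA-based metric like the one described above reduces to comparing the answer extracted from the candidate summary against the reference answer, typically with SQuAD-style token F1, averaged over all generated questions. A minimal sketch of that scoring step (the `qa_pairs` below are illustrative placeholders, not the output of an actual question-generation or QA model):

```python
from collections import Counter

def answer_f1(predicted: str, gold: str) -> float:
    """SQuAD-style token-level F1 between a predicted and a gold answer."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A QA-based content metric averages this score over (predicted, gold)
# answer pairs for questions generated from the reference summary.
qa_pairs = [("the mayor", "mayor"), ("on tuesday", "on tuesday")]
score = sum(answer_f1(p, g) for p, g in qa_pairs) / len(qa_pairs)
```

Because the score is computed over answers rather than surface n-grams, two summaries can match even when they phrase the same fact differently, which is exactly what separates QA-based metrics from ROUGE-style overlap.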


Author(s):  
Ahrii Kim ◽  
Jinhyun Kim

SacreBLEU, by incorporating a text-normalizing step in its pipeline, has been well received as an automatic evaluation metric in recent years. For agglutinative languages such as Korean, however, the metric cannot produce a meaningful score without customized pre-tokenization. This paper therefore examines the influence of diverse pre-tokenization schemes (word, morpheme, character, and subword) on the metric by performing a meta-evaluation with manually constructed into-Korean human evaluation data. Our empirical study demonstrates that the correlation of SacreBLEU with human judgment varies consistently with the token type. Some tokenizations even degrade the metric's reliability, and MeCab is no exception. In guiding proper tokenizer usage for the metric, we stress the significance of character-level tokenization and the insignificance of Jamo-level tokenization in MT evaluation.
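To make the character-vs-Jamo distinction concrete: Unicode NFD normalization decomposes each precomposed Hangul syllable into its conjoining jamo, so the two granularities can be sketched without any external tokenizer. (The tokenizers compared in the paper, such as MeCab for morphemes, are separate tools; this is only an illustration of the two finest token types.)

```python
import unicodedata

def char_tokens(text: str) -> list[str]:
    """Character-level tokens: each Hangul syllable is one token."""
    return [c for c in text if not c.isspace()]

def jamo_tokens(text: str) -> list[str]:
    """Jamo-level tokens: NFD splits each syllable into its jamo."""
    return [c for c in unicodedata.normalize("NFD", text) if not c.isspace()]

sentence = "안녕하세요"
print(char_tokens(sentence))  # 5 syllable tokens
print(jamo_tokens(sentence))  # 12 jamo tokens
```

Jamo-level tokenization inflates n-gram matches (most syllables share common jamo), which is one plausible reason the paper finds it uninformative relative to the character level.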


2019 ◽  
Vol 26 (4) ◽  
pp. 120-128 ◽  
Author(s):  
Michael Mantzios ◽  
Kirby Skillett ◽  
Helen Egan

Abstract. The present study aimed to investigate and compare the impact of the Mindful Construal Diary (MCD) and the mindful raisin exercise on the sensory tasting experience of chocolate and on participants’ chocolate consumption. Participants were randomly allocated to one of three conditions (MCD, mindful raisin exercise, or mindless control) and engaged with the MCD, the mindful raisin exercise, or a news article, respectively, while eating a piece of chocolate. They then rated their satisfaction and their desire to consume more chocolate on a 10-point Likert scale and completed a state mindful eating scale. Afterward, participants were told that the study had ended and were asked to wait while the experimenter recorded some information; any extra chocolate consumed during this time was recorded. Participants in both mindfulness conditions consumed significantly less chocolate after the exercise than participants in the control condition. No significant differences were found between the three conditions in ratings of satisfaction or desire to consume more chocolate. Both the MCD and the raisin exercise can be used to moderate the intake of calorific foods, and the MCD can serve as an alternative to typical meditation-based interventions.

