Automatic Evaluation of Chat-Oriented Dialogue Systems Using Large-Scale Multi-references

Author(s): Hiroaki Sugiyama, Toyomi Meguro, Ryuichiro Higashinaka
AI Magazine, 2020, Vol 41 (3), pp. 18-27
Author(s): Mikhail Burtsev, Varvara Logacheva

Development of conversational systems is one of the most challenging tasks in natural language processing, and it is especially hard in the case of open-domain dialogue. The main factors that hinder progress in this area are the lack of training data and the difficulty of automatic evaluation. Thus, to reliably evaluate the quality of such models, one needs to resort to time-consuming and expensive human evaluation. We tackle these problems by organizing the Conversational Intelligence Challenge (ConvAI), an open competition of dialogue systems. Our goals are threefold: to work out a good design for human evaluation of open-domain dialogue, to grow an open-source code base for conversational systems, and to harvest and publish new datasets. Over the course of the ConvAI1 and ConvAI2 competitions, we developed a framework for evaluating chatbots in messaging platforms and used it to evaluate over 30 dialogue systems on two conversational tasks: discussion of short text snippets from Wikipedia and personalized small talk. These large-scale evaluation experiments were performed with recruited volunteers as well as paid workers. As a result, we collected a dataset of around 5,000 long, meaningful human-to-bot dialogues and gained many insights into the organization of human evaluation. This dataset can be used to train an automatic evaluation model or to improve the quality of dialogue systems. Our analysis of the ConvAI1 and ConvAI2 competitions shows that future work in this area should center on more active participation of volunteers in the assessment of dialogue systems. To achieve that, we plan to make the evaluation setup more engaging.
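The abstract does not describe the rating scheme in detail; as a rough illustration of how per-dialogue human ratings collected through such a framework could be turned into system-level scores, the sketch below assumes one overall quality rating per dialogue. The record layout and field names are hypothetical, not the released ConvAI format.

```python
# Minimal sketch: aggregating per-dialogue human ratings into system-level scores.
# The record layout below is hypothetical, not the ConvAI release format.
from collections import defaultdict
from statistics import mean

dialogues = [
    {"system": "bot_A", "turns": ["Hi!", "Hello, what shall we discuss?"], "quality": 4},
    {"system": "bot_A", "turns": ["Tell me about cats.", "Cats are great."], "quality": 2},
    {"system": "bot_B", "turns": ["How are you?", "Fine, thanks. You?"], "quality": 5},
]

ratings = defaultdict(list)
for d in dialogues:
    ratings[d["system"]].append(d["quality"])

# Rank systems by mean human rating, as a competition leaderboard would.
for system, scores in sorted(ratings.items(), key=lambda kv: -mean(kv[1])):
    print(f"{system}: mean quality {mean(scores):.2f} over {len(scores)} dialogues")
```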


2019, Vol 1 (2), pp. 187-200
Author(s): Zhengyu Zhao, Weinan Zhang, Wanxiang Che, Zhigang Chen, Yibo Zhang

Human-computer dialogue has recently attracted extensive attention from both academia and industry as an important branch of artificial intelligence (AI). However, there are few studies on the evaluation of large-scale Chinese human-computer dialogue systems. In this paper, we introduce the Second Evaluation of Chinese Human-Computer Dialogue Technology, which focuses on identifying a user's intents and intelligently processing intent words. The evaluation consists of user intent classification (Task 1) and online testing of task-oriented dialogues (Task 2), with data sets provided by iFLYTEK Corporation. The evaluation tasks and data sets are described in detail, and the evaluation results and remaining problems are discussed.
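Task 1 (user intent classification) is a standard text-classification setup. The snippet below is a generic baseline sketch, not the evaluation's reference system; the utterances and intent labels are toy examples rather than the iFLYTEK data.

```python
# Generic intent-classification baseline (TF-IDF + logistic regression).
# Utterances and labels are toy examples, not the iFLYTEK evaluation data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_utts = [
    "播放周杰伦的歌",          # "play a Jay Chou song"
    "明天北京天气怎么样",      # "what is the weather in Beijing tomorrow"
    "帮我订一张去上海的机票",  # "book me a flight ticket to Shanghai"
]
train_intents = ["music.play", "weather.query", "flight.book"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),  # char n-grams suit Chinese text
    LogisticRegression(max_iter=1000),
)
clf.fit(train_utts, train_intents)
print(clf.predict(["后天上海会下雨吗"]))  # expected: weather.query
```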


Author(s): Shiquan Yang, Rui Zhang, Sarah M. Erfani, Jey Han Lau

Knowledge bases (KBs) are usually essential for building practical dialogue systems. Recently, there has been rapidly growing interest in integrating knowledge bases into dialogue systems. However, existing approaches mostly deal with knowledge bases of a single modality, typically textual information. As today's knowledge bases become rich in multimodal information such as images, audio, and video, this limitation greatly hinders the development of dialogue systems. In this paper, we focus on task-oriented dialogue systems and address this limitation by proposing a novel model that integrates external multimodal KB reasoning with pre-trained language models. We further enhance the model with a novel multi-granularity fusion mechanism that captures multi-grained semantics in the dialogue history. To validate the effectiveness of the proposed model, we collect a new large-scale (14K) dialogue dataset, MMDialKB, built upon a multimodal KB. Both automatic and human evaluation results on MMDialKB demonstrate the superiority of our proposed framework over strong baselines.
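The abstract does not specify the fusion architecture. The PyTorch sketch below only illustrates one plausible way to attend from a pretrained-LM encoding of the dialogue history over KB entries carrying both textual and visual features; all dimensions, projections, and layer choices are assumptions, not the authors' model.

```python
# A minimal sketch (not the authors' code) of fusing multimodal KB entries
# with a pretrained-LM encoding of the dialogue history via attention.
import torch
import torch.nn as nn

class MultimodalKBFusion(nn.Module):
    def __init__(self, hidden=768, n_heads=8):
        super().__init__()
        # Hypothetical projections for text and image features of KB entries.
        self.text_proj = nn.Linear(hidden, hidden)
        self.image_proj = nn.Linear(2048, hidden)   # e.g. CNN image features
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)

    def forward(self, dialogue_enc, kb_text_feats, kb_image_feats):
        # dialogue_enc:   (B, T, H) token-level encoding from a pretrained LM
        # kb_text_feats:  (B, K, H) textual features of K KB entries
        # kb_image_feats: (B, K, 2048) visual features of the same entries
        kb = self.text_proj(kb_text_feats) + self.image_proj(kb_image_feats)
        fused, _ = self.attn(query=dialogue_enc, key=kb, value=kb)
        return dialogue_enc + fused  # residual fusion with the dialogue history

fusion = MultimodalKBFusion()
out = fusion(torch.randn(2, 20, 768), torch.randn(2, 6, 768), torch.randn(2, 6, 2048))
print(out.shape)  # torch.Size([2, 20, 768])
```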


Author(s): Fabio Paternò, Francesca Pulina, Carmen Santoro, Henrike Gappa, Yehya Mohamad

Abstract: Recent European legislation emphasizes the importance of enabling people with disabilities to access online information and services of public sector bodies. In this regard, automatic evaluation and monitoring of Web accessibility can play a key role for the various stakeholders involved in creating and maintaining accessible products over time. In this paper we present the results of elicitation activities carried out in a European project to collect experience and feedback from Web commissioners, developers, and content authors of websites and web applications. The purpose was to understand their current practices in addressing accessibility issues, identify the barriers they encounter when using automatic support to ensure the accessibility of Web resources, and gather indications about what functionality they would like in order to better manage accessibility evaluation and monitoring.


2020, Vol 8, pp. 281-295
Author(s): Qi Zhu, Kaili Huang, Zheng Zhang, Xiaoyan Zhu, Minlie Huang

To advance multi-domain (cross-domain) dialogue modeling and alleviate the shortage of Chinese task-oriented datasets, we propose CrossWOZ, the first large-scale Chinese Cross-Domain Wizard-of-Oz task-oriented dataset. It contains 6K dialogue sessions and 102K utterances across 5 domains: hotel, restaurant, attraction, metro, and taxi. Moreover, the corpus contains rich annotation of dialogue states and dialogue acts on both the user and system sides. About 60% of the dialogues have cross-domain user goals that favor inter-domain dependency and encourage natural transitions across domains within a conversation. We also provide a user simulator and several benchmark models for pipelined task-oriented dialogue systems, which will help researchers compare and evaluate their models on this corpus. The large size and rich annotation of CrossWOZ make it suitable for investigating a variety of tasks in cross-domain dialogue modeling, such as dialogue state tracking, policy learning, and user simulation.
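As a rough illustration of what dialogue-state and dialogue-act annotation on both sides can look like for a cross-domain goal (hotel leading into restaurant), here is a toy turn record. The keys and act tuples are illustrative only and do not reproduce the exact CrossWOZ JSON schema.

```python
# Toy example of a cross-domain annotated turn (hotel -> restaurant).
# Keys and act tuples are illustrative, not the exact CrossWOZ schema.
turn = {
    "user_utterance": "帮我找一家那家酒店附近的川菜馆。",  # "Find a Sichuan restaurant near that hotel."
    "user_dialogue_acts": [("Inform", "餐馆", "菜系", "川菜"),
                           ("Inform", "餐馆", "位置", "酒店附近")],
    "system_utterance": "好的，推荐您峨嵋酒家。",          # "Sure, I recommend Emei Restaurant."
    "system_dialogue_acts": [("Recommend", "餐馆", "名称", "峨嵋酒家")],
    "belief_state": {"酒店": {"名称": "北京饭店"},
                     "餐馆": {"菜系": "川菜", "位置": "酒店附近"}},
}

# A dialogue-state tracker is evaluated by comparing its predicted belief
# state against the annotated one after every turn (joint goal accuracy).
predicted = {"酒店": {"名称": "北京饭店"}, "餐馆": {"菜系": "川菜", "位置": "酒店附近"}}
print("joint goal match:", predicted == turn["belief_state"])
```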


Author(s): Christian Bühler, Helmut Heck, Olaf Perlick, Annika Nietzio, Nils Ulltveit-Moe

Author(s): Sixing Wu, Ying Li, Dawei Zhang, Yang Zhou, Zhonghai Wu

In generative dialogue systems, insufficient semantic understanding of the dialogue often leads to generic responses. Recently, high-quality knowledge bases have been introduced to enhance dialogue understanding and to reduce the prevalence of boring responses. Although such knowledge-aware approaches have shown tremendous potential, they typically use the knowledge in a black-box fashion; as a result, the generation process is somewhat uncontrollable and not interpretable. In this paper, we introduce a topic-fact-based commonsense knowledge-aware approach, TopicKA. Unlike previous work, TopicKA generates responses conditioned not only on the query message but also on a topic fact with an explicit semantic meaning, which also controls the direction of generation. Topic facts are recommended by a recommendation network trained under the teacher-student framework. To integrate the recommendation network and the generation network, we design four schemes: two non-sampling schemes and two sampling schemes. We also collected and constructed a large-scale Chinese commonsense knowledge graph. Experimental results on an open Chinese benchmark dataset indicate that our model outperforms baselines on both objective and subjective metrics.
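The abstract describes recommending a topic fact and conditioning generation on it, with both sampling and non-sampling integration schemes. The sketch below shows, under assumed shapes and scoring, one way to score candidate facts against the query and pick one either greedily (non-sampling) or by sampling from the recommendation distribution; it is an illustration, not the authors' model.

```python
# Sketch of integrating a fact-recommendation network with a generator:
# score candidate topic facts against the query, then pick a fact either
# greedily (non-sampling) or by sampling. Shapes and scoring are assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden = 256
query_enc = torch.randn(hidden)            # encoding of the user query
fact_encs = torch.randn(5, hidden)         # encodings of 5 candidate topic facts

scores = fact_encs @ query_enc             # relevance of each fact to the query
probs = F.softmax(scores, dim=0)           # recommendation distribution

greedy_fact = int(torch.argmax(probs))             # non-sampling scheme
sampled_fact = int(torch.multinomial(probs, 1))    # sampling scheme

# The generator would then condition on [query ; selected fact] to decode a
# response whose direction is controlled by the chosen topic fact.
print("greedy:", greedy_fact, "sampled:", sampled_fact)
```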


Author(s): Tong Wang, Ping Chen, Boyang Li

An important and difficult challenge in building computational models of narratives is the automatic evaluation of narrative quality. Quality evaluation connects narrative understanding and generation, as generation systems need to evaluate their own products. To circumvent the difficulty of acquiring annotations, we employ upvotes on social media as an approximate measure of story quality. We collected 54,484 answers from a crowd-powered question-and-answer website, Quora, and then used active learning to build a classifier that labeled 28,320 answers as stories. To predict the number of upvotes without using social network features, we create neural networks that model textual regions and the interdependence among regions, which serve as strong benchmarks for future research. To the best of our knowledge, this is the first large-scale study of automatic evaluation of narrative quality.
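As a rough sketch of predicting upvotes from text alone (no social-network features), the snippet below encodes an answer's paragraphs ("regions") separately and lets a small attention layer model their interdependence before regressing a log-upvote score. The architecture and dimensions are assumptions for illustration, not the paper's exact model.

```python
# Sketch: region-level encoding + inter-region attention -> log-upvote regression.
# Architecture and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class RegionUpvoteRegressor(nn.Module):
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)              # one vector per region
        self.region_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.out = nn.Linear(dim, 1)                                # predicts log(1 + upvotes)

    def forward(self, region_token_ids):
        # region_token_ids: list of LongTensors, one per paragraph/region
        regions = torch.stack([self.embed(t.unsqueeze(0)).squeeze(0)
                               for t in region_token_ids]).unsqueeze(0)   # (1, R, dim)
        fused, _ = self.region_attn(regions, regions, regions)            # inter-region dependence
        return self.out(fused.mean(dim=1)).squeeze(-1)

model = RegionUpvoteRegressor()
answer = [torch.randint(0, 10000, (40,)), torch.randint(0, 10000, (25,))]  # two regions
print(model(answer))  # predicted log-upvote score (untrained, so arbitrary)
```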

