Meta-evaluation of Conversational Search Evaluation Metrics

2021 ◽  
Vol 39 (4) ◽  
pp. 1-42
Author(s):  
Zeyang Liu ◽  
Ke Zhou ◽  
Max L. Wilson

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is very challenging, given that any natural language response could be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across scenarios, and, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
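The fidelity analysis described above can be illustrated with a small sketch: given per-system scores from a metric and matching user preference ratings, compute a rank correlation between them. The system scores below are invented for illustration, and Kendall's tau is hand-rolled (ties are not handled) rather than taken from the paper's tooling.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation between two paired score lists (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / pairs

# Hypothetical per-system scores: a single-turn metric vs. user preference.
metric_scores = [0.31, 0.45, 0.27, 0.52]   # e.g., METEOR per system
user_prefs    = [3.1,  4.0,  2.5,  3.6]    # e.g., mean preference ratings

print(kendall_tau(metric_scores, user_prefs))
```

A tau near 1 would indicate strong agreement with user preference; the meta-evaluation above reports only weak correlations for existing single-turn metrics.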

2017 ◽  
Vol 7 (1.5) ◽  
pp. 170 ◽  
Author(s):  
Saravanan Chandrasekaran ◽  
Vijay Bhanu Srinivasan ◽  
Latha Parthiban

Quality of Service (QoS) is enforced in discovering an optimal web service (WS). QoS is uncertain due to the fluctuating performance of WSs in a dynamic cloud environment. We propose a Fuzzy-based Bayesian Network (FBN) system for efficient QoS prediction. The novel method comprises three processes: Semantic QoS Annotation, QoS Prediction, and Adaptive QoS using cloud infrastructure. The FBN employs a compliance factor to measure the performance of QoS attributes and fuzzy inference rules to infer service capability. The inference rules are defined according to user preference, which helps achieve user satisfaction. The FBN returns the optimal WSs from a set of functionally equivalent WSs. Unpredictable and extreme access to a selected WS is handled using cloud infrastructure. The results show that the FBN approach achieves nearly 95% QoS prediction accuracy given an adequate amount of past QoS data, and improves prediction probability by 2.6% over the existing approach.
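The abstract does not spell out the form of the fuzzy inference rules, so the following is only a toy illustration of the general idea, assuming triangular membership functions over a [0, 1] compliance factor and two invented rules that map compliance onto a service-capability score.

```python
def triangular(x, a, b, c):
    """Triangular fuzzy membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def service_capability(compliance):
    """Toy Mamdani-style inference over a [0, 1] compliance factor.

    Rule 1 (invented): IF compliance is LOW  THEN capability is POOR (0.2)
    Rule 2 (invented): IF compliance is HIGH THEN capability is GOOD (0.9)
    Defuzzified as the firing-strength-weighted average of rule outputs.
    """
    low = triangular(compliance, -0.5, 0.0, 0.6)
    high = triangular(compliance, 0.4, 1.0, 1.5)
    if low + high == 0:
        return 0.0
    return (low * 0.2 + high * 0.9) / (low + high)

print(service_capability(0.9))   # highly compliant service scores near 0.9
```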


Author(s):  
Hong-In Cheng ◽  
Patrick E. Patterson

With the increasing use of e-business web sites, users are often asked to select a menu item from a large number of options. In this research, the pull-down menu, fisheye menu, and grid menu were tested to compare the performance time, error rate, user satisfaction, simplicity, user friendliness, usefulness, and overall user preference for each menu type. The grid menu was more efficient in selection speed than the pull-down and fisheye menus when the number of menu items was 50 and 100. The time needed to choose a menu item with a grid menu was less affected by the size of the menu and the physical location of an item within it. The pull-down and grid menus were considered more satisfactory, simple, user friendly, and useful than the fisheye menu. Among subjects, 42.3 percent indicated that the grid menu was their preferred selection tool. The grid menu is an efficient and robust alternative for small- and medium-sized menu lists.


2018 ◽  
Vol 44 (3) ◽  
pp. 393-401 ◽  
Author(s):  
Ehud Reiter

The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique—in other words, whether BLEU scores correlate with real-world utility and user satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outside of MT, for evaluation of individual texts, or for scientific hypothesis testing.
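For readers unfamiliar with BLEU's mechanics, its 1-gram component is a clipped ("modified") precision; the sketch below implements just that component in plain Python, leaving out full BLEU's higher-order n-grams and brevity penalty.

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    """Clipped unigram precision, the 1-gram component of BLEU.

    Each candidate token counts at most as often as it appears in the
    reference ("clipping"), so repeating a word cannot inflate the score.
    """
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return clipped / max(len(candidate), 1)

cand = "the the the cat".split()
ref = "the cat sat on the mat".split()
print(modified_unigram_precision(cand, ref))   # 'the' is clipped at 2 -> 3/4
```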


2011 ◽  
Vol 14 (1) ◽  
Author(s):  
Rocío L. Cecchini ◽  
Carlos M. Lorenzetti ◽  
Ana G. Maguitman ◽  
Filippo Menczer

The absence of reliable and efficient techniques to evaluate information retrieval systems has become a bottleneck in the development of novel retrieval methods. In traditional approaches, users or hired evaluators provide manual assessments of relevance. However, these approaches are neither efficient nor reliable, since they do not scale with the complexity and heterogeneity of available digital information. Automatic approaches, on the other hand, can be efficient but disregard semantic data, which is usually important to assess the actual performance of the evaluated methods. This article proposes to use topic ontologies, and semantic similarity data derived from them, to implement an automatic semantic evaluation framework for information retrieval systems. The use of semantic similarity data makes it possible to capture the notion of partial relevance, generalizing traditional evaluation metrics and giving rise to novel performance measures such as semantic precision and the semantic harmonic mean. The validity of the approach is supported by user studies, and its application is illustrated with the evaluation of topical retrieval systems. The evaluated systems include a baseline, a supervised version of the Bo1 query refinement method, and two multi-objective evolutionary algorithms for context-based retrieval. Finally, we discuss the advantages of applying evaluation metrics that account for semantic similarity data and partial relevance over existing metrics based on the notion of total relevance.
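The article's exact definitions live in the full text; as an illustration, one common way to generalize precision to partial relevance is to replace binary relevance with graded, ontology-derived similarity scores. The similarity values and relevant-mass total below are invented.

```python
def semantic_precision(similarities):
    """Precision generalized to partial relevance: mean ontology-derived
    similarity of retrieved documents to the topic (each in [0, 1]).
    With binary similarities this reduces to classic precision."""
    return sum(similarities) / len(similarities)

def semantic_recall(similarities, total_relevant_mass):
    """Retrieved similarity mass over the total relevant mass in the corpus."""
    return sum(similarities) / total_relevant_mass

def semantic_harmonic_mean(p, r):
    """F1-style harmonic mean of semantic precision and semantic recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

sims = [1.0, 0.7, 0.4, 0.0]           # hypothetical graded similarities
p = semantic_precision(sims)           # mean similarity of retrieved docs
r = semantic_recall(sims, 3.5)         # against a hypothetical total mass
print(semantic_harmonic_mean(p, r))
```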


2021 ◽  
Vol 39 (4) ◽  
pp. 1-22
Author(s):  
Aldo Lipani ◽  
Ben Carterette ◽  
Emine Yilmaz

As conversational agents like Siri and Alexa gain in popularity and use, conversation is becoming a more and more important mode of interaction for search. Conversational search shares some features with traditional search, but differs in some important respects: conversational search systems are less likely to return ranked lists of results (a SERP), more likely to involve iterated interactions, and more likely to feature longer, well-formed user queries in the form of natural language questions. Because of these differences, traditional methods for search evaluation (such as the Cranfield paradigm) do not translate easily to conversational search. In this work, we propose a framework for offline evaluation of conversational search, which includes a methodology for creating test collections with relevance judgments, an evaluation measure based on a user interaction model, and an approach to collecting user interaction data to train the model. The framework is based on the idea of “subtopics”, often used to model novelty and diversity in search and recommendation, and the user model is similar to the geometric browsing model introduced by RBP and used in ERR. As far as we know, this is the first work to combine these ideas into a comprehensive framework for offline evaluation of conversational search.
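The geometric browsing model borrowed from RBP can be sketched directly: the user examines rank i with probability p**(i-1), where p is a persistence parameter. Below is plain Rank-Biased Precision over a graded relevance list; the paper's subtopic-based adaptation is more involved.

```python
def rbp(relevances, p=0.8):
    """Rank-Biased Precision (Moffat & Zobel): the user moves to the next
    result with persistence probability p, so rank i (1-indexed) is
    examined with probability p**(i-1); relevances are graded in [0, 1]."""
    return (1 - p) * sum(rel * p**i for i, rel in enumerate(relevances))

print(rbp([1, 0, 1, 1], p=0.5))   # 0.5 * (1 + 0 + 0.25 + 0.125)
```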


2020 ◽  
Vol 34 (01) ◽  
pp. 1137-1144
Author(s):  
Tong Yu ◽  
Yilin Shen ◽  
Hongxia Jin

With recent advances in multimodal interactive recommendation, users can express their preferences through natural language feedback on item images to find the desired items. However, existing systems either retrieve only one item or require the user to specify (e.g., by click or touch) the commented items from a list of recommendations in each interaction. As a result, users are not hands-free and the recommendations may be impractical. We propose a hands-free visual dialog recommender system that interactively recommends a list of items. At each turn, the system shows a list of items with their visual appearance. The user can comment on the list in natural language, describing the further features they desire. With these multimodal data, the system chooses another list of items to recommend. To understand user preference from these multimodal data, we develop neural network models that identify the described items in the list and further predict the desired attributes. To achieve efficient interactive recommendation, we leverage the inferred user preference and develop a novel bandit algorithm. Specifically, to avoid the system exploring more than needed, the desired attributes are used to reduce the exploration space. More importantly, to achieve sample-efficient learning in this hands-free setting, we derive additional samples from the user's relative preference expressed in natural language and design a pairwise logistic loss for bandit learning. Our bandit model is jointly updated by the pairwise logistic loss on the additional samples derived from natural language feedback and the traditional logistic loss. Empirical results show that the probability of finding the desired items with our system is about three times that of traditional interactive recommenders after a few user interactions.
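The pairwise logistic loss mentioned above has a standard form: for a derived pair in which one item should outrank another, the loss is -log(sigmoid(s_preferred - s_other)). The scores below are invented; in the paper this term is combined with a conventional logistic loss on the usual feedback signal.

```python
import math

def pairwise_logistic_loss(score_preferred, score_other):
    """Logistic loss on a relative preference: penalizes the model when
    the item the user described as better does not score higher.
    Equivalent to -log(sigmoid(score_preferred - score_other))."""
    margin = score_preferred - score_other
    return math.log(1.0 + math.exp(-margin))

# A comment like "more like the red one" yields a derived preference pair:
print(pairwise_logistic_loss(2.0, 0.5))   # small loss: ranking agrees
print(pairwise_logistic_loss(0.5, 2.0))   # large loss: ranking disagrees
```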


2019 ◽  
Vol 75 (6) ◽  
pp. 1370-1395
Author(s):  
Sophie Rutter ◽  
Elaine G. Toms ◽  
Paul David Clough

Purpose
To design effective task-responsive search systems, sufficient understanding of users’ tasks must be gained and their characteristics described. Although existing multi-dimensional task schemes can be used to describe users’ search and work tasks, they do not take into account the information use environment (IUE) that contextualises the task. The paper aims to discuss these issues.

Design/methodology/approach
With a focus on English primary schools, a multi-dimensional task scheme was developed in four stages, distinguishing between task characteristics generic to all environments and those specific to schools. In Stage 1, a provisional scheme was developed based upon the existing literature. In the next two stages, through interviews with teachers and observations of school children, the provisional scheme was populated and revised. In Stage 4, whether search tasks with the same information use can be distinguished by their characteristics was examined.

Findings
Ten generic characteristics were identified (nature of work task, search task originator, search task flexibility, search task doer, search task necessity, task output, search goal, stage in work task, resources, and information use), along with four characteristics specific to primary schools (curricular area, use in curricular area, planning, and location). For the different information uses, some characteristics are more typical than others.

Practical implications
The resulting scheme, based on children’s real-life information seeking, should be used in the design and evaluation of search systems and digital libraries that support school children. More generally, the scheme can also be used in other environments.

Originality/value
This is the first study to develop a multi-dimensional task scheme that encompasses the IUE.

