Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation Experimentation through Crowdsourcing: A Review

Test collection is used to evaluate the information retrieval systems in laboratory-based evaluation experimentation. In a classic setting, generating relevance judgments involves human assessors and is a costly and time consuming task. Researchers and practitioners are still being challenged in performing reliable and low-cost evaluation of retrieval systems. Crowdsourcing as a novel method of data acquisition is broadly used in many research fields. It has been proven that crowdsourcing is an inexpensive and quick solution as well as a reliable alternative for creating relevance judgments. One of the crowdsourcing applications in IR is to judge relevancy of query document pair. In order to have a successful crowdsourcing experiment, the relevance judgment tasks should be designed precisely to emphasize quality control. This paper is intended to explore different factors that have an influence on the accuracy of relevance judgments accomplished by workers and how to intensify the reliability of judgments in crowdsourcing experiment.

Download Full-text

A Comparison of Retrieval Result Relevance Judgments Between American and Chinese Users

Journal of Global Information Management ◽

10.4018/jgim.2020070108 ◽

2020 ◽

Vol 28 (3) ◽

pp. 148-168

Author(s):

Jin Zhang ◽

Yuehua Zhao ◽

Xin Cai ◽

Taowen Le ◽

Wei Fei ◽

...

Keyword(s):

Information Retrieval ◽

Relevance Judgment ◽

Search Tasks ◽

Retrieval Systems ◽

Relevance Judgments ◽

Cross Language Information Retrieval ◽

Subject Categories ◽

Information Retrieval Systems ◽

Retrieval Result ◽

Cross Language

Relevance judgment plays an extremely significant role in information retrieval. This study investigates the differences between American users and Chinese users in relevance judgment during the information retrieval process. 384 sets of relevance scores with 50 scores in each set were collected from 16 American users and 16 Chinese users as they judged retrieval records from two major search engines based on 24 predefined search tasks from 4 domain categories. Statistical analyses reveal that there are significant differences between American assessors and Chinese assessors in relevance judgments. Significant gender differences also appear within both the American and the Chinese assessor groups. The study also revealed significant interactions among cultures, genders, and subject categories. These findings can enhance the understanding of cultural impact on information retrieval and can assist in the design of effective cross-language information retrieval systems.

Download Full-text

Low-cost evaluation techniques for information retrieval systems: A review

Journal of Informetrics ◽

10.1016/j.joi.2012.12.001 ◽

2013 ◽

Vol 7 (2) ◽

pp. 301-312 ◽

Cited By ~ 8

Author(s):

Shiva Imani Moghadasi ◽

Sri Devi Ravana ◽

Sudharshan N. Raman

Keyword(s):

Information Retrieval ◽

Low Cost ◽

Cost Evaluation ◽

Retrieval Systems ◽

Information Retrieval Systems ◽

Evaluation Techniques

Download Full-text

Evaluating the effectiveness of information retrieval systems using effort-based relevance judgment

Aslib Journal of Information Management ◽

10.1108/ajim-04-2018-0086 ◽

2019 ◽

Vol 71 (1) ◽

pp. 2-17

Author(s):

Prabha Rajagopal ◽

Sri Devi Ravana ◽

Yun Sing Koh ◽

Vimala Balakrishnan

Keyword(s):

Information Retrieval ◽

Design Methodology ◽

Shallow Depth ◽

Content Type ◽

Relevance Judgment ◽

Retrieval Systems ◽

Relevance Judgments ◽

Information Retrieval Systems

Purpose The effort in addition to relevance is a major factor for satisfaction and utility of the document to the actual user. The purpose of this paper is to propose a method in generating relevance judgments that incorporate effort without human judges’ involvement. Then the study determines the variation in system rankings due to low effort relevance judgment in evaluating retrieval systems at different depth of evaluation. Design/methodology/approach Effort-based relevance judgments are generated using a proposed boxplot approach for simple document features, HTML features and readability features. The boxplot approach is a simple yet repeatable approach in classifying documents’ effort while ensuring outlier scores do not skew the grading of the entire set of documents. Findings The retrieval systems evaluation using low effort relevance judgments has a stronger influence on shallow depth of evaluation compared to deeper depth. It is proved that difference in the system rankings is due to low effort documents and not the number of relevant documents. Originality/value Hence, it is crucial to evaluate retrieval systems at shallow depth using low effort relevance judgments.

Download Full-text

Mahak: A Test Collection for Evaluation of Farsi Information Retrieval Systems

2007 IEEE/ACS International Conference on Computer Systems and Applications ◽

10.1109/aiccsa.2007.370697 ◽

2007 ◽

Cited By ~ 8

Author(s):

Kyumars Sheykh Esmaili ◽

Hassan Abolhassani ◽

Mahmood Neshati ◽

Ehsan Behrangi ◽

Asreen Rostami ◽

...

Keyword(s):

Information Retrieval ◽

Test Collection ◽

Retrieval Systems ◽

Information Retrieval Systems

Download Full-text

Document-based approach to improve the accuracy of pairwise comparison in evaluating information retrieval systems

Aslib Journal of Information Management ◽

10.1108/ajim-12-2014-0171 ◽

2015 ◽

Vol 67 (4) ◽

pp. 408-421

Author(s):

Sri Devi Ravana ◽

MASUMEH SADAT TAHERI ◽

Prabha Rajagopal

Keyword(s):

Information Retrieval ◽

Pairwise Comparison ◽

Current Method ◽

Statistical Testing ◽

Test Collection ◽

Content Type ◽

Retrieval Systems ◽

Information Retrieval Systems ◽

The Mean ◽

The Difference

Purpose – The purpose of this paper is to propose a method to have more accurate results in comparing performance of the paired information retrieval (IR) systems with reference to the current method, which is based on the mean effectiveness scores of the systems across a set of identified topics/queries. Design/methodology/approach – Based on the proposed approach, instead of the classic method of using a set of topic scores, the documents level scores are considered as the evaluation unit. These document scores are the defined document’s weight, which play the role of the mean average precision (MAP) score of the systems as a significance test’s statics. The experiments were conducted using the TREC 9 Web track collection. Findings – The p-values generated through the two types of significance tests, namely the Student’s t-test and Mann-Whitney show that by using the document level scores as an evaluation unit, the difference between IR systems is more significant compared with utilizing topic scores. Originality/value – Utilizing a suitable test collection is a primary prerequisite for IR systems comparative evaluation. However, in addition to reusable test collections, having an accurate statistical testing is a necessity for these evaluations. The findings of this study will assist IR researchers to evaluate their retrieval systems and algorithms more accurately.

Download Full-text

Relevant Intra-Actions in Networked Environments

Social Information Retrieval Systems ◽

10.4018/978-1-59904-543-6.ch012 ◽

2011 ◽

pp. 230-251 ◽

Cited By ~ 1

Author(s):

Theresa Dirndorfer Anderson

Keyword(s):

Information Retrieval ◽

Information Systems ◽

Conceptual Framework ◽

Case Studies ◽

Theoretical Framework ◽

Design Development ◽

Retrieval Systems ◽

Relevance Judgments ◽

Information Retrieval Systems ◽

Collaborative Information Retrieval

This chapter uses a study of human assessments of relevance to demonstrate how individual relevance judgments and retrieval practices embody collaborative elements that contribute to the overall progress of that person’s individual work. After discussing key themes of the conceptual framework, the chapter will discuss two case studies that serve as powerful illustrations of these themes for researchers and practitioners alike. These case studies, outcomes of a two-year ethnographic exploration of research practices, illustrate the theoretical position presented in part one of the chapter, providing lessons for the ways that people work with information systems to generate knowledge and the conditions that will support these practices. The chapter shows that collaboration does not have to be explicit to influence searcher behavior. It seeks to present both a theoretical framework and case studies that can be applied to the design, development and evaluation of collaborative information retrieval systems.

Download Full-text

Low-cost and robust evaluation of information retrieval systems

ACM SIGIR Forum ◽

10.1145/1480506.1480527 ◽

2008 ◽

Vol 42 (2) ◽

pp. 104-104 ◽

Cited By ~ 3

Author(s):

Benjamin A. Carterette

Keyword(s):

Information Retrieval ◽

Low Cost ◽

Retrieval Systems ◽

Information Retrieval Systems

Download Full-text

An evaluation of some conflation algorithms for information retrieval

Journal of Information Science ◽

10.1177/016555158100300403 ◽

1981 ◽

Vol 3 (4) ◽

pp. 177-183 ◽

Cited By ~ 63

Author(s):

Martin Lennon ◽

David S. Peirce ◽

Brian D. Tarry ◽

Peter Willett

Keyword(s):

Information Retrieval ◽

Test Collection ◽

Retrieval Systems ◽

Information Retrieval Systems

The characteristics of conflation algorithms are discussed and examples given of some algorithms which have been used for information retrieval systems. Comparative experiments with a range of keyword dictionaries and with the Cranfield document test collection suggest that there is relatively little difference in the performance of the algorithms despite the widely disparate means by which they have been developed and by which they operate.

Download Full-text