Effective collection construction for information retrieval evaluation and optimization

The availability of test collections in Cranfield paradigm has significantly benefited the development of models, methods and tools in information retrieval. Such test collections typically consist of a set of topics, a document collection and a set of relevance assessments. Constructing these test collections requires effort of various perspectives such as topic selection, document selection, relevance assessment, and relevance label aggregation etc. The work in the thesis provides a fundamental way of constructing and utilizing test collections in information retrieval in an effective, efficient and reliable manner. To that end, we have focused on four aspects. We first study the document selection issue when building test collections. We devise an active sampling method for efficient large-scale evaluation [Li and Kanoulas, 2017]. Different from past sampling-based approaches, we account for the fact that some systems are of higher quality than others, and we design the sampling distribution to over-sample documents from these systems. At the same time, the estimated evaluation measures are unbiased, and assessments can be used to evaluate new, novel systems without introducing any systematic error. Then a natural further step is determining when to stop the document selection and assessment procedure. This is an important but understudied problem in the construction of test collections. We consider both the gain of identifying relevant documents and the cost of assessing documents as the optimization goals. We handle the problem under the continuous active learning framework by jointly training a ranking model to rank documents, and estimating the total number of relevant documents in the collection using a "greedy" sampling method [Li and Kanoulas, 2020]. The next stage of constructing a test collection is assessing relevance. We study how to denoise relevance assessments by aggregating from multiple crowd annotation sources to obtain high-quality relevance assessments. This helps to boost the quality of relevance assessments acquired in a crowdsourcing manner. We assume a Gaussian process prior on query-document pairs to model their correlation. The proposed model shows good performance in terms of interring true relevance labels. Besides, it allows predicting relevance labels for new tasks that has no crowd annotations, which is a new functionality of CrowdGP. Ablation studies demonstrate that the effectiveness is attributed to the modelling of task correlation based on the axillary information of tasks and the prior relevance information of documents to queries. After a test collection is constructed, it can be used to either evaluate retrieval systems or train a ranking model. We propose to use it to optimize the configuration of retrieval systems. We use Bayesian optimization approach to model the effect of a δ -step in the configuration space to the effectiveness of the retrieval system, by suggesting to use different similarity functions (covariance functions) for continuous and categorical values, and examine their ability to effectively and efficiently guide the search in the configuration space [Li and Kanoulas, 2018]. Beyond the algorithmic and empirical contributions, work done as part of this thesis also contributed to the research community as the CLEF Technology Assisted Reviews in Empirical Medicine Tracks in 2017, 2018, and 2019 [Kanoulas et al., 2017, 2018, 2019]. Awarded by: University of Amsterdam, Amsterdam, The Netherlands. Supervised by: Evangelos Kanoulas. Available at: https://dare.uva.nl/search?identifier=3438a2b6-9271-4f2c-add5-3c811cc48d42.

Download Full-text

Mahak: A Test Collection for Evaluation of Farsi Information Retrieval Systems

2007 IEEE/ACS International Conference on Computer Systems and Applications ◽

10.1109/aiccsa.2007.370697 ◽

2007 ◽

Cited By ~ 8

Author(s):

Kyumars Sheykh Esmaili ◽

Hassan Abolhassani ◽

Mahmood Neshati ◽

Ehsan Behrangi ◽

Asreen Rostami ◽

...

Keyword(s):

Information Retrieval ◽

Test Collection ◽

Retrieval Systems ◽

Information Retrieval Systems

Download Full-text

Document-based approach to improve the accuracy of pairwise comparison in evaluating information retrieval systems

Aslib Journal of Information Management ◽

10.1108/ajim-12-2014-0171 ◽

2015 ◽

Vol 67 (4) ◽

pp. 408-421

Author(s):

Sri Devi Ravana ◽

MASUMEH SADAT TAHERI ◽

Prabha Rajagopal

Keyword(s):

Information Retrieval ◽

Pairwise Comparison ◽

Current Method ◽

Statistical Testing ◽

Test Collection ◽

Content Type ◽

Retrieval Systems ◽

Information Retrieval Systems ◽

The Mean ◽

The Difference

Purpose – The purpose of this paper is to propose a method to have more accurate results in comparing performance of the paired information retrieval (IR) systems with reference to the current method, which is based on the mean effectiveness scores of the systems across a set of identified topics/queries. Design/methodology/approach – Based on the proposed approach, instead of the classic method of using a set of topic scores, the documents level scores are considered as the evaluation unit. These document scores are the defined document’s weight, which play the role of the mean average precision (MAP) score of the systems as a significance test’s statics. The experiments were conducted using the TREC 9 Web track collection. Findings – The p-values generated through the two types of significance tests, namely the Student’s t-test and Mann-Whitney show that by using the document level scores as an evaluation unit, the difference between IR systems is more significant compared with utilizing topic scores. Originality/value – Utilizing a suitable test collection is a primary prerequisite for IR systems comparative evaluation. However, in addition to reusable test collections, having an accurate statistical testing is a necessity for these evaluations. The findings of this study will assist IR researchers to evaluate their retrieval systems and algorithms more accurately.

Download Full-text

Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation Experimentation through Crowdsourcing: A Review

The Scientific World JOURNAL ◽

10.1155/2014/135641 ◽

2014 ◽

Vol 2014 ◽

pp. 1-13 ◽

Cited By ~ 6

Author(s):

Parnia Samimi ◽

Sri Devi Ravana

Keyword(s):

Information Retrieval ◽

Low Cost ◽

Cost Evaluation ◽

Test Collection ◽

Relevance Judgment ◽

Research Fields ◽

Retrieval Systems ◽

Relevance Judgments ◽

Information Retrieval Systems ◽

Novel Method

Test collection is used to evaluate the information retrieval systems in laboratory-based evaluation experimentation. In a classic setting, generating relevance judgments involves human assessors and is a costly and time consuming task. Researchers and practitioners are still being challenged in performing reliable and low-cost evaluation of retrieval systems. Crowdsourcing as a novel method of data acquisition is broadly used in many research fields. It has been proven that crowdsourcing is an inexpensive and quick solution as well as a reliable alternative for creating relevance judgments. One of the crowdsourcing applications in IR is to judge relevancy of query document pair. In order to have a successful crowdsourcing experiment, the relevance judgment tasks should be designed precisely to emphasize quality control. This paper is intended to explore different factors that have an influence on the accuracy of relevance judgments accomplished by workers and how to intensify the reliability of judgments in crowdsourcing experiment.

Download Full-text

A Test Collection for Dataset Retrieval in Biodiversity Research

Research Ideas and Outcomes ◽

10.3897/rio.7.e67887 ◽

2021 ◽

Vol 7 ◽

Author(s):

Felicitas Löffler ◽

Andreas Schuldt ◽

Birgitta König-Ries ◽

Helge Bruelheide ◽

Friederike Klan

Keyword(s):

Information Needs ◽

Research Field ◽

Test Collection ◽

Test Collections ◽

Biodiversity Research ◽

Retrieval Systems ◽

Binary Relevance ◽

Increasing Demand ◽

World Information ◽

Bef China

Searching for scientific datasets is a prominent task in scholars' daily research practice. A variety of data publishers, archives and data portals offer search applications that allow the discovery of datasets. The evaluation of such dataset retrieval systems requires proper test collections, including questions that reflect real world information needs of scholars, a set of datasets and human judgements assessing the relevance of the datasets to the questions in the benchmark corpus. Unfortunately, only very few test collections exist for a dataset search. In this paper, we introduce the BEF-China test collection, the very first test collection for dataset retrieval in biodiversity research, a research field with an increasing demand in data discovery services. The test collection consists of 14 questions, a corpus of 372 datasets from the BEF-China project and binary relevance judgements provided by a biodiversity expert.

Download Full-text

An evaluation of some conflation algorithms for information retrieval

Journal of Information Science ◽

10.1177/016555158100300403 ◽

1981 ◽

Vol 3 (4) ◽

pp. 177-183 ◽

Cited By ~ 63

Author(s):

Martin Lennon ◽

David S. Peirce ◽

Brian D. Tarry ◽

Peter Willett

Keyword(s):

Information Retrieval ◽

Test Collection ◽

Retrieval Systems ◽

Information Retrieval Systems

The characteristics of conflation algorithms are discussed and examples given of some algorithms which have been used for information retrieval systems. Comparative experiments with a range of keyword dictionaries and with the Cranfield document test collection suggest that there is relatively little difference in the performance of the algorithms despite the widely disparate means by which they have been developed and by which they operate.

Download Full-text

Information retrieval evaluation using test collections

Information Retrieval ◽

10.1007/s10791-016-9281-7 ◽

2016 ◽

Vol 19 (3) ◽

pp. 225-229 ◽

Cited By ~ 6

Author(s):

Falk Scholer ◽

Diane Kelly ◽

Ben Carterette

Keyword(s):

Information Retrieval ◽

Test Collections ◽

Information Retrieval Evaluation

Download Full-text

On the effect of relevance scales in crowdsourcing relevance assessments for Information Retrieval evaluation

Information Processing & Management ◽

10.1016/j.ipm.2021.102688 ◽

2021 ◽

Vol 58 (6) ◽

pp. 102688

Author(s):

Kevin Roitero ◽

Eddy Maddalena ◽

Stefano Mizzaro ◽

Falk Scholer

Keyword(s):

Information Retrieval ◽

Information Retrieval Evaluation ◽

Relevance Assessments

Download Full-text

The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation

Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '15 ◽

10.1145/2766462.2767760 ◽

2015 ◽

Cited By ~ 12

Author(s):

Andrew Turpin ◽

Falk Scholer ◽

Stefano Mizzaro ◽

Eddy Maddalena

Keyword(s):

Information Retrieval ◽

Magnitude Estimation ◽

Information Retrieval Evaluation ◽

Relevance Assessments

Download Full-text

Test Collection Based Evaluation of Information Retrieval Systems

Foundations and Trends® in Information Retrieval ◽

10.1561/1500000009 ◽

2010 ◽

Vol 4 (4) ◽

pp. 247-375 ◽

Cited By ~ 164

Author(s):

Mark Sanderson

Keyword(s):

Information Retrieval ◽

Test Collection ◽

Retrieval Systems ◽

Information Retrieval Systems

Download Full-text

Cheap IR evaluation

ACM SIGIR Forum ◽

10.1145/3483382.3483400 ◽

2020 ◽

Vol 54 (2) ◽

pp. 1-2

Author(s):

Kevin Roitero

Keyword(s):

Statistical Power ◽

Information Needs ◽

Web Search ◽

State Of The Art ◽

Extensive Study ◽

Test Collection ◽

Test Collections ◽

Fine Grained ◽

Retrieval Systems ◽

Ranked List

To evaluate Information Retrieval (IR) effectiveness, a possible approach is to use test collections, which are composed of a collection of documents, a set of description of information needs (called topics), and a set of relevant documents to each topic. Test collections are modelled in a competition scenario: for example, in the well known TREC initiative, participants run their own retrieval systems over a set of topics and they provide a ranked list of retrieved documents; some of the retrieved documents (usually the first ranked) constitute the so called pool, and their relevance is evaluated by human assessors; the document list is then used to compute effectiveness metrics and rank the participant systems. Private Web Search companies also run their in-house evaluation exercises; although the details are mostly unknown, and the aims are somehow different, the overall approach shares several issues with the test collection approach. The aim of this work is to: (i) develop and improve some state-of-the-art work on the evaluation of IR effectiveness while saving resources, and (ii) propose a novel, more principled and engineered, overall approach to test collection based effectiveness evaluation. In this thesis we focus on three main directions: the first part details the usage of few topics (i.e., information needs) in retrieval evaluation and shows an extensive study detailing the effect of using fewer topics for retrieval evaluation in terms of number of topics, topics subsets, and statistical power. The second part of this thesis discusses the evaluation without relevance judgements, reproducing, extending, and generalizing state-of-the-art methods and investigating their combinations by means of data fusion techniques and machine learning. Finally, the third part uses crowdsourcing to gather relevance labels, and in particular shows the effect of using fine grained judgement scales; furthermore, explores methods to transform judgements between different relevance scales. Awarded by: University of Udine, Udine, Italy on 19 March 2020. Supervised by: Professor Stefano Mizzaro. Available at: https://kevinroitero.com/resources/kr-phd-thesis.pdf.

Download Full-text