relevance judgments
Recently Published Documents

TOTAL DOCUMENTS: 108 (FIVE YEARS: 13)
H-INDEX: 21 (FIVE YEARS: 1)

2021 · Vol 39 (4) · pp. 1-22
Author(s): Aldo Lipani, Ben Carterette, Emine Yilmaz

As conversational agents like Siri and Alexa gain in popularity and use, conversation is becoming a more and more important mode of interaction for search. Conversational search shares some features with traditional search, but differs in some important respects: conversational search systems are less likely to return ranked lists of results (a SERP), more likely to involve iterated interactions, and more likely to feature longer, well-formed user queries in the form of natural language questions. Because of these differences, traditional methods for search evaluation (such as the Cranfield paradigm) do not translate easily to conversational search. In this work, we propose a framework for offline evaluation of conversational search, which includes a methodology for creating test collections with relevance judgments, an evaluation measure based on a user interaction model, and an approach to collecting user interaction data to train the model. The framework is based on the idea of “subtopics”, often used to model novelty and diversity in search and recommendation, and the user model is similar to the geometric browsing model introduced by RBP and used in ERR. As far as we know, this is the first work to combine these ideas into a comprehensive framework for offline evaluation of conversational search.
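The geometric browsing model the abstract refers to can be sketched concretely. The snippet below is a minimal illustration, not the authors' exact measure: it assumes a user who, after each conversation turn, continues with fixed probability p (the geometric patience assumption underlying RBP and ERR), and it scores a session by the expected discounted gain over turns. The function name and example gains are hypothetical.

```python
# Minimal sketch of an RBP-style geometric user model for scoring a
# conversational search session. Hypothetical illustration only.

def geometric_session_score(turn_gains, p=0.8):
    """Expected normalized gain when the user moves on to the next turn
    with probability p (geometric patience, as in RBP/ERR)."""
    score = 0.0
    for i, gain in enumerate(turn_gains):
        score += gain * (p ** i)   # turn i is reached with probability p**i
    return (1 - p) * score         # (1 - p) normalizes, RBP-style

# Example: a 4-turn session; each value is the subtopic gain at that turn
gains = [1.0, 0.5, 0.0, 1.0]
print(round(geometric_session_score(gains, p=0.8), 4))
```

With p = 0.8, gains at later turns are discounted geometrically, so a system that satisfies subtopics early scores higher, mirroring the early-precision bias of RBP.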


2021
Author(s): Zhumin Chu, Jiaxin Mao, Fan Zhang, Yiqun Liu, Tetsuya Sakai, ...

2020 · Vol 54 (2) · pp. 1-9
Author(s): Richard Zanibbi, Behrooz Mansouri, Anurag Agarwal, Douglas W. Oard

The Answer Retrieval for Questions on Math (ARQMath) evaluation was run for the first time at CLEF 2020. ARQMath is the first Community Question Answering (CQA) shared task for math, retrieving existing answers from Math Stack Exchange (MSE) that can help to answer previously unseen math questions. ARQMath also introduces a new protocol for math formula search, where formulas are evaluated in context using a query formula's associated question post, and posts associated with each retrieved formula. Over 70 topics were annotated for each task by eight undergraduate students supervised by a professor of mathematics. A formula index is provided in three formats: LaTeX, Presentation MathML, and Content MathML, avoiding the need for participants to extract these themselves. In addition to detailed relevance judgments, tools are provided to parse MSE data, generate question threads in HTML, and evaluate retrieval results. To make comparisons with participating systems fairer, nDCG' (i.e., nDCG for assessed hits only) is used to compare systems for each task. ARQMath will continue in CLEF 2021, with training data from 2020 and baseline systems for both tasks to reduce barriers to entry for this challenging problem domain.
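The nDCG' measure mentioned above ("nDCG-prime") removes unjudged documents from a ranking before computing nDCG, so that systems are compared on assessed hits only. The sketch below is an illustrative implementation under that definition; the document IDs and relevance grades are made up.

```python
# Hedged sketch of nDCG': drop unjudged documents (a "condensed list"),
# then compute standard graded nDCG. Illustrative data only.
import math

def ndcg_prime(ranking, qrels, k=10):
    # keep only judged documents, preserving rank order
    judged = [d for d in ranking if d in qrels][:k]
    dcg = sum(qrels[d] / math.log2(i + 2) for i, d in enumerate(judged))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

qrels = {"a": 3, "b": 1, "c": 2}    # doc -> relevance grade
ranking = ["b", "x", "a", "c"]      # "x" is unjudged and is dropped
print(round(ndcg_prime(ranking, qrels), 4))
```

Because unjudged documents are skipped rather than treated as non-relevant, pooled-but-shallow judgments penalize novel systems less, which is why shared tasks with incomplete pools favor this variant.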


2020
Author(s): Jimmy Chen, William R. Hersh

The COVID-19 pandemic has resulted in a rapidly growing quantity of scientific publications from journal articles, preprints, and other sources. The TREC-COVID Challenge was created to evaluate information retrieval methods and systems for this quickly expanding corpus. Based on the COVID-19 Open Research Dataset (CORD-19), several dozen research teams participated across five rounds of the TREC-COVID Challenge. While previous work has compared IR techniques used on other test collections, there are no studies that have analyzed the methods used by participants in the TREC-COVID Challenge. We manually reviewed team run reports from Rounds 2 and 5, extracted features from the documented methodologies, and used univariate and multivariate regression analyses to identify features associated with higher retrieval performance. We observed that fine-tuning datasets with relevance judgments, MS-MARCO, and CORD-19 document vectors was associated with improved performance in Round 2 but not in Round 5. Though the relatively decreased heterogeneity of runs in Round 5 may explain the lack of significance in that round, fine-tuning has been found to improve search performance in previous challenge evaluations by improving a system’s ability to map relevant queries and phrases to documents. Furthermore, term expansion was associated with improvement in system performance, and the use of the narrative field in the TREC-COVID topics was associated with decreased system performance in both rounds. These findings emphasize the need for clear queries in search. While our study has some limitations in its generalizability and scope of techniques analyzed, we identified some IR techniques that may be useful in building search systems for COVID-19 using the TREC-COVID test collections.
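The kind of feature analysis described, regressing run performance on binary method features, can be sketched as an ordinary least-squares fit. The feature names, run scores, and coefficients below are entirely made up for illustration; they are not the study's data.

```python
# Hedged sketch of a multivariate regression over run features.
# All data here is fabricated for illustration.
import numpy as np

# rows: runs; columns: [fine_tuned, term_expansion, used_narrative]
X = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
], dtype=float)
y = np.array([0.62, 0.55, 0.48, 0.35, 0.58, 0.40])  # per-run scores

# add an intercept column and solve the least-squares problem
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
names = ["intercept", "fine_tuned", "term_expansion", "used_narrative"]
print(dict(zip(names, np.round(coef, 3))))
```

With data constructed this way, the fitted coefficients are positive for fine-tuning and term expansion and negative for use of the narrative field, mirroring the direction of the associations the abstract reports.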


2020 · Vol 28 (3) · pp. 148-168
Author(s): Jin Zhang, Yuehua Zhao, Xin Cai, Taowen Le, Wei Fei, ...

Relevance judgment plays an extremely significant role in information retrieval. This study investigates the differences between American users and Chinese users in relevance judgment during the information retrieval process. A total of 384 sets of relevance scores, with 50 scores in each set, were collected from 16 American users and 16 Chinese users as they judged retrieval records from two major search engines based on 24 predefined search tasks from 4 domain categories. Statistical analyses reveal significant differences between American assessors and Chinese assessors in relevance judgments. Significant gender differences also appear within both the American and the Chinese assessor groups. The study also revealed significant interactions among cultures, genders, and subject categories. These findings can enhance the understanding of cultural impact on information retrieval and can assist in the design of effective cross-language information retrieval systems.


Author(s): Yunqiu Shao, Jiaxin Mao, Yiqun Liu, Weizhi Ma, Ken Satoh, ...

Legal case retrieval is a specialized IR task that involves retrieving supporting cases given a query case. Compared with traditional ad-hoc text retrieval, the legal case retrieval task is more challenging since the query case is much longer and more complex than common keyword queries. Moreover, the definition of relevance between a query case and a supporting case goes beyond general topical relevance, making it difficult to construct a large-scale case retrieval dataset, especially one with accurate relevance judgments. To address these challenges, we propose BERT-PLI, a novel model that utilizes BERT to capture the semantic relationships at the paragraph level and then infers the relevance between two cases by aggregating paragraph-level interactions. We fine-tune the BERT model with a relatively small-scale case law entailment dataset to adapt it to the legal scenario and employ a cascade framework to reduce the computational cost. We conduct extensive experiments on the benchmark of the relevant case retrieval task in COLIEE 2019. Experimental results demonstrate that our proposed method outperforms existing solutions.
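The paragraph-level interaction idea can be sketched as follows. This is a simplified stand-in for BERT-PLI, not the authors' model: `pair_score` substitutes a token-overlap ratio for the fine-tuned BERT pair classifier, and the final aggregation is a plain mean where BERT-PLI uses a recurrent network over query paragraphs. The example case texts are invented.

```python
# Hedged sketch of paragraph-level interaction aggregation in the spirit
# of BERT-PLI: score every (query paragraph, candidate paragraph) pair,
# max-pool over candidate paragraphs, then aggregate over query paragraphs.
import numpy as np

def pair_score(q_para: str, c_para: str) -> float:
    # placeholder for BERT([CLS] q_para [SEP] c_para) relevance score;
    # here a Jaccard token overlap, for illustration only
    q, c = set(q_para.lower().split()), set(c_para.lower().split())
    return len(q & c) / max(len(q | c), 1)

def case_relevance(query_paras, cand_paras):
    # interaction matrix: rows = query paragraphs, cols = candidate paragraphs
    M = np.array([[pair_score(q, c) for c in cand_paras] for q in query_paras])
    best_per_query = M.max(axis=1)       # max-pool over candidate paragraphs
    return float(best_per_query.mean())  # simple mean; BERT-PLI uses an RNN here

query = ["the tenant breached the lease agreement", "damages were awarded"]
cand = ["the lease agreement was breached by the tenant", "the court awarded damages"]
print(round(case_relevance(query, cand), 3))
```

Max-pooling over candidate paragraphs captures the intuition that a query paragraph is supported if any paragraph of the candidate case matches it, which sidesteps BERT's input-length limit on full case documents.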


2019 · Vol 56 (6) · pp. 102091
Author(s): Barbara M. Wildemuth, Gary Marchionini, Xin Fu, Jun Sung Oh, Meng Yang

2019 · Vol 9 (1) · pp. 59
Author(s): C. Dominik Güss, Travis Bishop

Research articles are widely used in the training of undergraduate students. Editors and reviewers of the top scientific psychology journals influence the development of the field by publishing certain articles and rejecting others, presumably on the assumption that the published articles are empirically sound and theoretically highly relevant. The current study investigated whether published articles are indeed regarded as relevant by a sample of 393 psychology undergraduate students from a university in the southeastern United States. The students' ages ranged from 18 to 57 (M = 23, SD = 6.05) and 84% were female. Students received brief statements about potential research studies and rated them for relevance, not knowing that the summaries came from actual research studies published in peer-reviewed journals. Results showed that (1) overall, research articles were regarded as generally irrelevant, (2) applied articles were regarded as more relevant than basic research articles, (3) ratings did not differ by gender or age, and (4) the more advanced students were in the Psychology program, the higher their relevance ratings were for applied research as compared to basic research. The results can be read as either comforting or disturbing: comforting, because students may lack the professional expertise to make such relevance judgments; disturbing, because they may indicate how specialized and insulated journals have become by not addressing topics relevant to a wider population. The results also have implications for teaching research methods and experimental psychology courses.

