relevance judgments
Recently Published Documents

TOTAL DOCUMENTS: 108 (FIVE YEARS: 13)
H-INDEX: 21 (FIVE YEARS: 1)

2021 · Vol 39 (4) · pp. 1-22
Author(s): Aldo Lipani, Ben Carterette, Emine Yilmaz

As conversational agents like Siri and Alexa gain in popularity and use, conversation is becoming a more and more important mode of interaction for search. Conversational search shares some features with traditional search, but differs in some important respects: conversational search systems are less likely to return ranked lists of results (a SERP), more likely to involve iterated interactions, and more likely to feature longer, well-formed user queries in the form of natural language questions. Because of these differences, traditional methods for search evaluation (such as the Cranfield paradigm) do not translate easily to conversational search. In this work, we propose a framework for offline evaluation of conversational search, which includes a methodology for creating test collections with relevance judgments, an evaluation measure based on a user interaction model, and an approach to collecting user interaction data to train the model. The framework is based on the idea of “subtopics”, often used to model novelty and diversity in search and recommendation, and the user model is similar to the geometric browsing model introduced by RBP and used in ERR. As far as we know, this is the first work to combine these ideas into a comprehensive framework for offline evaluation of conversational search.
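The geometric browsing model the abstract refers to can be sketched concretely. The snippet below is a minimal illustration, not the authors' exact measure: it assumes a user who, after each conversation turn, continues with fixed probability p (the geometric patience assumption underlying RBP and ERR), and it scores a session by the expected discounted gain over turns. The function name and example gains are hypothetical.

```python
# Minimal sketch of an RBP-style geometric user model for scoring a
# conversational search session. Hypothetical illustration only.

def geometric_session_score(turn_gains, p=0.8):
    """Expected normalized gain when the user moves on to the next turn
    with probability p (geometric patience, as in RBP/ERR)."""
    score = 0.0
    for i, gain in enumerate(turn_gains):
        score += gain * (p ** i)   # turn i is reached with probability p**i
    return (1 - p) * score         # (1 - p) normalizes, RBP-style

# Example: a 4-turn session; each value is the subtopic gain at that turn
gains = [1.0, 0.5, 0.0, 1.0]
print(round(geometric_session_score(gains, p=0.8), 4))
```

With p = 0.8, gains at later turns are discounted geometrically, so a system that satisfies subtopics early scores higher, mirroring the early-precision bias of RBP.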


2021
Author(s): Zhumin Chu, Jiaxin Mao, Fan Zhang, Yiqun Liu, Tetsuya Sakai, ...

2020 · Vol 54 (2) · pp. 1-9
Author(s): Richard Zanibbi, Behrooz Mansouri, Anurag Agarwal, Douglas W. Oard

The Answer Retrieval for Questions on Math (ARQMath) evaluation was run for the first time at CLEF 2020. ARQMath is the first Community Question Answering (CQA) shared task for math, retrieving existing answers from Math Stack Exchange (MSE) that can help to answer previously unseen math questions. ARQMath also introduces a new protocol for math formula search, where formulas are evaluated in context using a query formula's associated question post, and posts associated with each retrieved formula. Over 70 topics were annotated for each task by eight undergraduate students supervised by a professor of mathematics. A formula index is provided in three formats: LaTeX, Presentation MathML, and Content MathML, avoiding the need for participants to extract these themselves. In addition to detailed relevance judgments, tools are provided to parse MSE data, generate question threads in HTML, and evaluate retrieval results. To make comparisons with participating systems fairer, nDCG' (i.e., nDCG for assessed hits only) is used to compare systems for each task. ARQMath will continue in CLEF 2021, with training data from 2020 and baseline systems for both tasks to reduce barriers to entry for this challenging problem domain.
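The nDCG' measure mentioned above ("nDCG-prime") removes unjudged documents from a ranking before computing nDCG, so that systems are compared on assessed hits only. The sketch below is an illustrative implementation under that definition; the document IDs and relevance grades are made up.

```python
# Hedged sketch of nDCG': drop unjudged documents (a "condensed list"),
# then compute standard graded nDCG. Illustrative data only.
import math

def ndcg_prime(ranking, qrels, k=10):
    # keep only judged documents, preserving rank order
    judged = [d for d in ranking if d in qrels][:k]
    dcg = sum(qrels[d] / math.log2(i + 2) for i, d in enumerate(judged))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

qrels = {"a": 3, "b": 1, "c": 2}    # doc -> relevance grade
ranking = ["b", "x", "a", "c"]      # "x" is unjudged and is dropped
print(round(ndcg_prime(ranking, qrels), 4))
```

Because unjudged documents are skipped rather than treated as non-relevant, pooled-but-shallow judgments penalize novel systems less, which is why shared tasks with incomplete pools favor this variant.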


2020
Author(s): Jimmy Chen, William R. Hersh

The COVID-19 pandemic has resulted in a rapidly growing quantity of scientific publications from journal articles, preprints, and other sources. The TREC-COVID Challenge was created to evaluate information retrieval methods and systems for this quickly expanding corpus. Based on the COVID-19 Open Research Dataset (CORD-19), several dozen research teams participated across five rounds of the TREC-COVID Challenge. While previous work has compared IR techniques used on other test collections, there are no studies that have analyzed the methods used by participants in the TREC-COVID Challenge. We manually reviewed team run reports from Rounds 2 and 5, extracted features from the documented methodologies, and used univariate and multivariate regression analyses to identify features associated with higher retrieval performance. We observed that fine-tuning datasets with relevance judgments, MS-MARCO, and CORD-19 document vectors was associated with improved performance in Round 2 but not in Round 5. Though the relatively decreased heterogeneity of runs in Round 5 may explain the lack of significance in that round, fine-tuning has been found to improve search performance in previous challenge evaluations by improving a system’s ability to map relevant queries and phrases to documents. Furthermore, term expansion was associated with improvement in system performance, and the use of the narrative field in the TREC-COVID topics was associated with decreased system performance in both rounds. These findings emphasize the need for clear queries in search. While our study has some limitations in its generalizability and scope of techniques analyzed, we identified some IR techniques that may be useful in building search systems for COVID-19 using the TREC-COVID test collections.
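The kind of feature analysis described, regressing run performance on binary method features, can be sketched as an ordinary least-squares fit. The feature names, run scores, and coefficients below are entirely made up for illustration; they are not the study's data.

```python
# Hedged sketch of a multivariate regression over run features.
# All data here is fabricated for illustration.
import numpy as np

# rows: runs; columns: [fine_tuned, term_expansion, used_narrative]
X = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
], dtype=float)
y = np.array([0.62, 0.55, 0.48, 0.35, 0.58, 0.40])  # per-run scores

# add an intercept column and solve the least-squares problem
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
names = ["intercept", "fine_tuned", "term_expansion", "used_narrative"]
print(dict(zip(names, np.round(coef, 3))))
```

With data constructed this way, the fitted coefficients are positive for fine-tuning and term expansion and negative for use of the narrative field, mirroring the direction of the associations the abstract reports.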


2020 · Vol 28 (3) · pp. 148-168
Author(s): Jin Zhang, Yuehua Zhao, Xin Cai, Taowen Le, Wei Fei, ...

Relevance judgment plays an extremely significant role in information retrieval. This study investigates the differences between American users and Chinese users in relevance judgment during the information retrieval process. A total of 384 sets of relevance scores, with 50 scores in each set, were collected from 16 American users and 16 Chinese users as they judged retrieval records from two major search engines based on 24 predefined search tasks from 4 domain categories. Statistical analyses reveal significant differences between American assessors and Chinese assessors in relevance judgments. Significant gender differences also appear within both the American and the Chinese assessor groups. The study also revealed significant interactions among cultures, genders, and subject categories. These findings can enhance the understanding of cultural impact on information retrieval and can assist in the design of effective cross-language information retrieval systems.


Author(s): Yunqiu Shao, Jiaxin Mao, Yiqun Liu, Weizhi Ma, Ken Satoh, ...

Legal case retrieval is a specialized IR task that involves retrieving supporting cases given a query case. Compared with traditional ad-hoc text retrieval, the legal case retrieval task is more challenging since the query case is much longer and more complex than common keyword queries. Moreover, the definition of relevance between a query case and a supporting case goes beyond general topical relevance, making it difficult to construct a large-scale case retrieval dataset, especially one with accurate relevance judgments. To address these challenges, we propose BERT-PLI, a novel model that utilizes BERT to capture the semantic relationships at the paragraph level and then infers the relevance between two cases by aggregating paragraph-level interactions. We fine-tune the BERT model with a relatively small-scale case law entailment dataset to adapt it to the legal scenario and employ a cascade framework to reduce the computational cost. We conduct extensive experiments on the benchmark of the relevant case retrieval task in COLIEE 2019. Experimental results demonstrate that our proposed method outperforms existing solutions.
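The paragraph-level interaction idea can be sketched as follows. This is a simplified stand-in for BERT-PLI, not the authors' model: `pair_score` substitutes a token-overlap ratio for the fine-tuned BERT pair classifier, and the final aggregation is a plain mean where BERT-PLI uses a recurrent network over query paragraphs. The example case texts are invented.

```python
# Hedged sketch of paragraph-level interaction aggregation in the spirit
# of BERT-PLI: score every (query paragraph, candidate paragraph) pair,
# max-pool over candidate paragraphs, then aggregate over query paragraphs.
import numpy as np

def pair_score(q_para: str, c_para: str) -> float:
    # placeholder for BERT([CLS] q_para [SEP] c_para) relevance score;
    # here a Jaccard token overlap, for illustration only
    q, c = set(q_para.lower().split()), set(c_para.lower().split())
    return len(q & c) / max(len(q | c), 1)

def case_relevance(query_paras, cand_paras):
    # interaction matrix: rows = query paragraphs, cols = candidate paragraphs
    M = np.array([[pair_score(q, c) for c in cand_paras] for q in query_paras])
    best_per_query = M.max(axis=1)       # max-pool over candidate paragraphs
    return float(best_per_query.mean())  # simple mean; BERT-PLI uses an RNN here

query = ["the tenant breached the lease agreement", "damages were awarded"]
cand = ["the lease agreement was breached by the tenant", "the court awarded damages"]
print(round(case_relevance(query, cand), 3))
```

Max-pooling over candidate paragraphs captures the intuition that a query paragraph is supported if any paragraph of the candidate case matches it, which sidesteps BERT's input-length limit on full case documents.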


2019 · Vol 56 (6) · pp. 102091
Author(s): Barbara M. Wildemuth, Gary Marchionini, Xin Fu, Jun Sung Oh, Meng Yang

2019 · Vol 9 (1) · pp. 59
Author(s): C. Dominik Güss, Travis Bishop

Research articles are widely used in the training of undergraduate students. Editors and reviewers of the top scientific psychology journals influence the development of the field by publishing certain articles and rejecting others, presumably on the assumption that the published articles are empirically sound and theoretically highly relevant. The current study investigated whether published articles are indeed regarded as relevant by a sample of 393 psychology undergraduate students from a university in the southeastern United States. The students' ages ranged from 18 to 57 (M = 23, SD = 6.05) and 84% were female. Students received brief statements about potential research studies and rated them for relevance, not knowing that the summaries came from actual research studies published in peer-reviewed journals. Results showed that (1) overall, research articles were regarded as generally irrelevant, (2) applied articles were regarded as more relevant than basic research articles, (3) ratings did not differ by gender or age, and (4) the more advanced students were in the Psychology program, the higher their relevance ratings were for applied research as compared to basic research. The results can be read as either comforting or disturbing: comforting, because students may lack the professional expertise to make such relevance judgments; disturbing, because they may indicate how specialized and insulated journals have become by not addressing topics relevant to a wider population. The results also have implications for teaching research methods and experimental psychology courses.

