retrieval models
Recently Published Documents


Total documents: 282 (five years: 82)
H-index: 25 (five years: 4)

2022 ◽  
Vol 40 (3) ◽  
pp. 1-37
Author(s):  
Edward Kai Fung Dang ◽  
Robert Wing Pong Luk ◽  
James Allan

In Information Retrieval, numerous retrieval models or document ranking functions have been developed in the quest for better retrieval effectiveness. Apart from some formal retrieval models formulated on a theoretical basis, various recent works have applied heuristic constraints to guide the derivation of document ranking functions. While many recent methods are shown to improve over established and successful models, comparison among these new methods under a common environment is often missing. To address this issue, we perform an extensive and up-to-date comparison of leading term-independence retrieval models implemented in our own retrieval system. Our study focuses on the following questions: (RQ1) Is there a retrieval model that consistently outperforms all other models across multiple collections? (RQ2) What are the important features of an effective document ranking function? Our retrieval experiments, performed on several TREC test collections spanning a wide range of sizes (up to the terabyte-sized ClueWeb09 Category B), enable us to answer these research questions. This work also serves as a reproducibility study for leading retrieval models. While our experiments show that no single retrieval model outperforms all others across all tested collections, some recent retrieval models, such as MATF and MVD, consistently perform better than the common baselines.


2022 ◽  
Vol 40 (3) ◽  
pp. 1-24
Author(s):  
Jiaul H. Paik ◽  
Yash Agrawal ◽  
Sahil Rishi ◽  
Vaishal Shah

Existing probabilistic retrieval models do not restrict the domain of the random variables that they deal with. In this article, we show that the upper bound of the normalized term frequency (tf) from the relevant documents is much smaller than the upper bound of the normalized tf from the whole collection. As a result, the existing models suffer from two major problems: (i) the domain mismatch causes data modeling error, and (ii) since the outliers have very large magnitude and the retrieval models follow the tf hypothesis, the combination of these two factors tends to overestimate the relevance score. To address these problems, we propose novel weighted probabilistic models based on truncated distributions. We evaluate our models on a set of large document collections and demonstrate significant performance improvement over six existing probabilistic models.
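
As a minimal sketch of what truncating a distribution means, the Python below restricts an exponential density to a bounded interval and renormalizes it. The choice of an exponential distribution, and the names `truncated_exp_pdf`, `rate`, and `upper`, are assumptions made for illustration; the abstract does not say which distribution families the authors truncate or how they weight them.

```python
import math


def truncated_exp_pdf(x: float, rate: float, upper: float) -> float:
    """Density of an exponential(rate) distribution truncated to [0, upper].

    Illustrative only: the exponential family is an assumption, not the
    model described in the abstract.
    """
    if x < 0 or x > upper:
        return 0.0
    # Renormalize by the probability mass the untruncated distribution
    # places on [0, upper], so the truncated density integrates to 1.
    mass = 1.0 - math.exp(-rate * upper)
    return rate * math.exp(-rate * x) / mass


def integrate(f, lo: float, hi: float, steps: int = 10_000) -> float:
    """Midpoint-rule integral, used here only as a sanity check."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width
```

Values of normalized tf above `upper` receive zero density, so outliers beyond the bound cannot contribute to the fitted model.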


2022 ◽  
Vol 40 (1) ◽  
pp. 1-36
Author(s):  
J. Shane Culpepper ◽  
Guglielmo Faggioli ◽  
Nicola Ferro ◽  
Oren Kurland

Several recent studies have explored the interaction effects between topics, systems, corpora, and components when measuring retrieval effectiveness. However, all of these previous studies assume that a topic or information need is represented by a single query. In reality, users routinely reformulate queries to satisfy an information need. In recent years, there has been renewed interest in the notion of “query variations”, which are essentially multiple user formulations for an information need. Like many retrieval models, some queries are highly effective while others are not. This is often an artifact of the collection being searched, which might be more or less sensitive to word choice. Users rarely have perfect knowledge about the underlying collection, and so finding queries that work is often a trial-and-error process. In this work, we explore the fundamental problem of system interaction effects between collections, ranking models, and queries. To answer this important question, we formalize the analysis using ANalysis Of VAriance (ANOVA) models to measure multiple component effects across collections and topics by nesting multiple query variations within each topic. Our findings show that query formulations have an effect size comparable to that of the topic factor itself, which is known to be the factor with the greatest effect size in prior ANOVA studies. Both topic and formulation have a substantially larger effect size than any other factor, including the ranking algorithms and, surprisingly, even query expansion. This finding reinforces the importance of further research in understanding the role of query rewriting in IR-related tasks.
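
The nesting of query formulations within topics can be sketched as a sum-of-squares decomposition. The Python below assumes a balanced design (every topic has the same number of formulations, and every formulation is run on every system) and reports eta-squared, the share of total variance attributed to each factor. The function name `effect_sizes`, the `scores[t][q][s]` layout, and the choice of eta-squared are assumptions for illustration; the abstract does not state which effect-size estimator or model terms the authors use.

```python
from statistics import mean


def effect_sizes(scores):
    """Eta-squared per factor for a balanced nested design.

    scores[t][q][s] is the effectiveness of system s on formulation q,
    where formulation q is nested within topic t.
    """
    n_topics = len(scores)
    n_forms = len(scores[0])
    n_systems = len(scores[0][0])

    cells = [x for topic in scores for form in topic for x in form]
    grand = mean(cells)
    ss_total = sum((x - grand) ** 2 for x in cells)
    if ss_total == 0:
        raise ValueError("all scores are identical; effect sizes are undefined")

    # Topic factor: spread of topic means around the grand mean.
    topic_means = [mean(x for form in topic for x in form) for topic in scores]
    ss_topic = n_forms * n_systems * sum((m - grand) ** 2 for m in topic_means)

    # Formulation nested in topic: spread of formulation means around
    # the mean of the topic they belong to.
    ss_form = n_systems * sum(
        (mean(form) - topic_means[t]) ** 2
        for t, topic in enumerate(scores)
        for form in topic
    )

    # System factor: spread of system means around the grand mean.
    system_means = [
        mean(scores[t][q][s] for t in range(n_topics) for q in range(n_forms))
        for s in range(n_systems)
    ]
    ss_system = n_topics * n_forms * sum((m - grand) ** 2 for m in system_means)

    ss_residual = ss_total - ss_topic - ss_form - ss_system
    sums = {
        "topic": ss_topic,
        "formulation": ss_form,
        "system": ss_system,
        "residual": ss_residual,
    }
    return {name: ss / ss_total for name, ss in sums.items()}
```

Under this decomposition, a large `formulation` share relative to `system` would correspond to the kind of finding the abstract reports.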


2022 ◽  
Vol 40 (1) ◽  
pp. 1-44
Author(s):  
Longxuan Ma ◽  
Mingda Li ◽  
Wei-Nan Zhang ◽  
Jiapeng Li ◽  
Ting Liu

Incorporating external knowledge into dialogue generation has been proven to benefit the performance of an open-domain Dialogue System (DS), for example in generating informative or stylized responses and controlling conversation topics. In this article, we study the open-domain DS that uses unstructured text as external knowledge sources (Unstructured Text Enhanced Dialogue System, UTEDS). The existence of unstructured text entails distinctions between UTEDS and traditional data-driven DS, and we aim to analyze these differences. We first define the UTEDS-related concepts, then summarize the recently released datasets and models. We categorize UTEDS into Retrieval and Generative models and introduce them from the perspective of model components. The retrieval models consist of Fusion, Matching, and Ranking modules, while the generative models comprise Dialogue and Knowledge Encoding, Knowledge Selection (KS), and Response Generation modules. We further summarize the evaluation methods used in UTEDS and analyze the current models’ performance. Finally, we discuss the future development trends of UTEDS, hoping to inspire new research in this field.


Water ◽  
2022 ◽  
Vol 14 (1) ◽  
pp. 128
Author(s):  
Mengying Cui ◽  
Yonghua Sun ◽  
Chen Huang ◽  
Mengjun Li

The water components affecting turbidity are complex and changeable, and the spectral response mechanism of each water quality parameter is different. This study therefore focuses on turbidity monitoring with unmanned aerial vehicle (UAV) hyperspectral technology: it establishes a set of turbidity retrieval models through an artificial control experiment and verifies the models’ accuracy with UAV flight data and water samples collected in the same period. The results of this experiment can also be extended to different inland waters for turbidity retrieval. Retrieving turbidity values for small inland water bodies can support the study of the degree of water pollution. We collected images and data from aquaculture ponds and irrigation ditches in Dawa District, Panjin City, Liaoning Province. Twenty-nine standard turbidity solutions with different concentration gradients (from 0 to 360 NTU, where NTU is the Nephelometric Turbidity Unit, which stands for scattered turbidity) were prepared under manual control, and we simultaneously collected hyperspectral data from the standard solutions. The band sensitive to turbidity was identified by analyzing the spectral information. We established four kinds of retrieval models: single band, band ratio, normalized ratio, and partial least squares (PLS). We selected the two models with the highest R² for accuracy verification. The band ratio model and the PLS model had the highest accuracy, with R² of 0.65 and 0.72, respectively. The hyperspectral image data obtained by the UAV were combined with the PLS model, which had the highest R², to estimate the spatial distribution of water turbidity. Turbidity in the study area ranged from 5 to 300 NTU, with most values between 5 and 80 NTU. This shows that the PLS model can retrieve, with high accuracy, the turbidity of aquaculture ponds, irrigation canals, and reservoirs in Dawa District of Panjin City, Liaoning Province. The experimental results are consistent with the conclusions of the field investigation.
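
The band ratio and normalized ratio predictors, and the R² used to compare models, can be sketched in a few lines of Python. This sketch assumes the usual definitions (ratio of two band reflectances, and their difference over their sum) and a simple one-variable least-squares fit; the abstract does not give the authors' band choices or fitting procedure, so all names here are hypothetical.

```python
def band_ratio(r1: float, r2: float) -> float:
    """Ratio of reflectance in band 1 to reflectance in band 2."""
    return r1 / r2


def normalized_ratio(r1: float, r2: float) -> float:
    """Normalized difference of two band reflectances."""
    return (r1 - r2) / (r1 + r2)


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, predictions) -> float:
    """Coefficient of determination of predictions against observations."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot
```

In such a workflow, each spectral index would be computed per sample, regressed against measured turbidity with `fit_line`, and the candidate models ranked by `r_squared`.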


2022 ◽  
Vol 2022 ◽  
pp. 1-11
Author(s):  
Meichao Yan ◽  
Yu Wen ◽  
Qingxuan Shi ◽  
Xuedong Tian

To address the shortcomings of traditional full-text retrieval models in dealing with mathematical expressions, which are special objects different from ordinary text, a multimodal retrieval and ranking method for scientific documents based on hesitant fuzzy sets (HFS) and XLNet is proposed. This method integrates multimodal information, such as mathematical expression images and context text, as keywords to realize the retrieval of scientific documents. In the image modality, the images of mathematical expressions are recognized, and hesitant fuzzy set theory is introduced to calculate the hesitant fuzzy similarity between mathematical query expressions and the mathematical expressions in candidate scientific documents. Meanwhile, in the text modality, XLNet is used to generate word vectors of the mathematical expression context to obtain the similarity between the query text and the mathematical expression context of the candidate scientific documents. Finally, the multimodal evaluation is integrated, and a hesitant fuzzy set is constructed at the document level to obtain the final scores of the scientific documents and the corresponding ranked output. The experimental results show that the recall and precision of this method are 0.774 and 0.663 on the NTCIR dataset, respectively, and the average normalized discounted cumulative gain (NDCG) value of the top-10 ranking results is 0.880 on the Chinese scientific document (CSD) dataset.
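
For readers unfamiliar with the metric, NDCG at a cutoff of 10 can be computed as in the Python sketch below. It assumes the linear-gain form of DCG (relevance divided by the log of the rank position); an exponential-gain variant also exists, and the abstract does not say which one the authors used.

```python
import math


def dcg_at_k(relevances, k: int = 10) -> float:
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the graded relevance of each returned document in ranked order; averaging `ndcg_at_k` over all queries gives the kind of figure reported above.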


2021 ◽  
Vol 21 (1) ◽  
pp. 58-62
Author(s):  
P.K. SHARMA ◽  
D. KUMAR ◽  
H. S. SRIVASTAVA ◽  
P. PATEL ◽  
T. SIVASANKAR

The study aims to retrieve soil moisture from RISAT-1 hybrid polarimetric SAR data. Although the use of linear polarimetric SAR data is well understood and documented, hybrid polarimetric SAR data remains largely unexplored and underreported for this purpose. Regression analysis was carried out to develop and validate soil moisture retrieval models. The retrieval models were developed from backscattering coefficients (RH & RV) and m- space decomposition parameters (even bounce, odd bounce, and volume component) generated from RISAT-1 hybrid polarimetric SAR data. Three models are analyzed in this work, using (i) both RH & RV, (ii) the volume component, and (iii) the even bounce, odd bounce, and volume components. The results show that the model using m- decomposition derived parameters provides better accuracy than the other two models, with R² and RMSE of 0.92 and 2.45 per cent, respectively.


Author(s):  
Washington Cunha ◽  
Leonardo Rocha ◽  
Marcos A. Gonçalves

Pipelines for Text Classification are sequences of tasks that must be performed to classify documents. The pre-processing phase of these pipelines involves different ways of manipulating documents for the learning phase. This Master's thesis introduces three new steps into the traditional pre-processing phase: 1) Meta-Features Generation; 2) Sparsification; and 3) Selective Sampling. Our experimental results, based on more than 5,600 measurements, show that our proposal can achieve significant gains in effectiveness when compared to the traditional TF-IDF representation (up to 52%) and word embeddings (up to 46%), at a much lower cost (9.7x faster). The thesis also includes a thorough and rigorous evaluation of the trade-offs between cost and effectiveness associated with the introduction of these new steps into the pipeline, as well as a comprehensive comparative experimental evaluation of many alternatives. This thesis falls under the topics of (i) Document Management and Classification, (ii) Information Retrieval Models and Techniques, and (iii) Text Database of the SBBD Call for Papers.


Information ◽  
2021 ◽  
Vol 12 (10) ◽  
pp. 402
Author(s):  
Stefano Marchesin ◽  
Giorgio Maria Di Nunzio ◽  
Maristella Agosti

In Information Retrieval (IR), the semantic gap represents the mismatch between users’ queries and how retrieval models answer these queries. In this paper, we explore how to use external knowledge resources to enhance bag-of-words representations and reduce the effect of the semantic gap between queries and documents. To this end, we propose several simple but effective knowledge-based query expansion and reduction techniques, and we evaluate them for the medical domain. The proposed query reformulations increase the probability of retrieving relevant documents by adding highly specific terms to, or removing them from, the original query. The experimental analyses on different test collections for Precision Medicine IR show the effectiveness of the developed techniques. In particular, a specific subset of query reformulations allows retrieval models to achieve top-performing results in all the considered test collections.
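
A toy Python sketch of the two reformulation directions follows. The knowledge resource is stood in for by a plain dictionary and a set, and the names `expand_query`, `reduce_query`, `related`, and `removable` are hypothetical; the abstract does not describe the actual resources or the term-selection rules the authors use.

```python
def expand_query(terms, related):
    """Append knowledge-derived terms for each query term, without duplicates.

    `related` maps a query term to terms drawn from an external resource
    (a hypothetical stand-in for a real knowledge base).
    """
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


def reduce_query(terms, removable):
    """Drop the query terms that the knowledge resource flags for removal."""
    return [term for term in terms if term not in removable]
```

Both functions keep the original term order, so the reformulated query can be passed unchanged to a bag-of-words retrieval model.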


2021 ◽  
pp. 547-562
Author(s):  
Digvijay Desai ◽  
Aniruddha Ghadge ◽  
Roshan Wazare ◽  
Jayshree Bagade
