What is the difference? A cognitive dissimilarity measure for information retrieval result sets

Abstract: Average precision (AP) is one of the most widely used metrics in information retrieval and natural language processing research. It is usually thought that the expected AP of a system that ranks documents randomly is equal to the proportion of relevant documents in the collection. This paper shows that this value is only approximate, and provides a procedure for efficiently computing the exact value. An analysis of the difference between the approximate and the exact value shows that the discrepancy is large when the collection contains few documents, but becomes very small when it contains at least 600 documents.
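
The gap between the approximate and the exact value can be checked numerically. The sketch below is not the paper's procedure. It assumes every ordering of the collection is equally likely and uses a closed form derived from that assumption: if rank k holds a relevant document, the k-1 ranks above it hold (k-1)(R-1)/(N-1) relevant documents on average. The function names and the brute-force check are illustrative only.

```python
from fractions import Fraction
from itertools import combinations


def expected_ap_closed_form(n_docs, n_rel):
    """Expected AP of a uniformly random ranking (assumed closed form)."""
    if not 1 <= n_rel <= n_docs:
        raise ValueError("need 1 <= n_rel <= n_docs")
    if n_docs == 1:
        return Fraction(1)
    total = Fraction(0)
    for k in range(1, n_docs + 1):
        # Expected number of relevant documents above rank k,
        # given that rank k itself is relevant.
        rel_above = Fraction((k - 1) * (n_rel - 1), n_docs - 1)
        total += (1 + rel_above) / k
    return total / n_docs


def expected_ap_brute_force(n_docs, n_rel):
    """Average AP over every equally likely placement of the relevant documents."""
    total = Fraction(0)
    count = 0
    for positions in combinations(range(1, n_docs + 1), n_rel):
        # positions is sorted, so the i-th relevant document sits at rank p
        # and precision at that rank is i / p.
        ap = sum(Fraction(i, p) for i, p in enumerate(positions, start=1)) / n_rel
        total += ap
        count += 1
    return total / count


if __name__ == "__main__":
    for n_docs, n_rel in [(2, 1), (10, 3), (600, 60)]:
        exact = expected_ap_closed_form(n_docs, n_rel)
        approx = Fraction(n_rel, n_docs)
        print(n_docs, n_rel, float(exact), float(approx))
```

For a two-document collection with one relevant document, the two possible rankings have AP 1 and 1/2, so the expected value is 3/4 rather than the approximate 1/2.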


A Comparison of Retrieval Result Relevance Judgments Between American and Chinese Users

Journal of Global Information Management ◽ 10.4018/jgim.2020070108 ◽ 2020 ◽ Vol 28 (3) ◽ pp. 148-168

Author(s): Jin Zhang ◽ Yuehua Zhao ◽ Xin Cai ◽ Taowen Le ◽ Wei Fei ◽ ...

Keyword(s): Information Retrieval ◽ Relevance Judgment ◽ Search Tasks ◽ Retrieval Systems ◽ Relevance Judgments ◽ Cross Language Information Retrieval ◽ Subject Categories ◽ Information Retrieval Systems ◽ Retrieval Result ◽ Cross Language

Relevance judgment plays a significant role in information retrieval. This study investigates the differences between American and Chinese users in relevance judgment during the information retrieval process. A total of 384 sets of relevance scores, with 50 scores in each set, were collected from 16 American users and 16 Chinese users as they judged retrieval records from two major search engines based on 24 predefined search tasks from 4 domain categories. Statistical analyses reveal significant differences between American and Chinese assessors in relevance judgments. Significant gender differences also appear within both the American and the Chinese assessor groups. The study also reveals significant interactions among cultures, genders, and subject categories. These findings can enhance the understanding of cultural impact on information retrieval and can assist in the design of effective cross-language information retrieval systems.


A Civil Code Article Information Retrieval System based on Phrase Alignment with Article Structure Analysis and Ensemble Approach

10.29007/5zzj ◽ 2018

Author(s): Masaharu Yoshioka ◽ Daiki Onodera

Keyword(s): Information Retrieval ◽ Retrieval System ◽ Phase 1 ◽ Civil Code ◽ Information Retrieval System ◽ Ensemble Approach ◽ Good For ◽ The Difference ◽ Final Answer ◽ Phrase Alignment

In this paper, we introduce a system for COLIEE task phase 1 that retrieves the relevant civil code article(s) needed to make a correct entailment decision for questions from the Japanese Bar Exam. This system is an extended version of our previous system, which was based on legal terminology and civil code article structure. However, the previous system did not perform as well as the best-performing system for the task. In this paper, we introduce the concept of phrase alignment that takes the civil code article structure into account. In addition, because question types vary, settings that are good for one type of question may not be good for other types. Therefore, we propose to use systems with different settings and to generate the final answer by aggregating the output of the different systems with an ensemble approach. Finally, we also discuss the difference between the English task and the Japanese task based on the retrieval results of Indri, one of the state-of-the-art information retrieval systems.


Document-based approach to improve the accuracy of pairwise comparison in evaluating information retrieval systems

Aslib Journal of Information Management ◽ 10.1108/ajim-12-2014-0171 ◽ 2015 ◽ Vol 67 (4) ◽ pp. 408-421

Author(s): Sri Devi Ravana ◽ Masumeh Sadat Taheri ◽ Prabha Rajagopal

Keyword(s): Information Retrieval ◽ Pairwise Comparison ◽ Current Method ◽ Statistical Testing ◽ Test Collection ◽ Content Type ◽ Retrieval Systems ◽ Information Retrieval Systems ◽ The Mean ◽ The Difference

Purpose – The purpose of this paper is to propose a method that compares the performance of paired information retrieval (IR) systems more accurately than the current method, which is based on the mean effectiveness scores of the systems across a set of identified topics/queries. Design/methodology/approach – In the proposed approach, document-level scores are used as the evaluation unit instead of the classic set of topic scores. These document scores are the defined document weights, which take the place of the systems' mean average precision (MAP) scores as the statistic for the significance test. The experiments were conducted using the TREC 9 Web track collection. Findings – The p-values generated by two types of significance tests, namely the Student's t-test and the Mann-Whitney test, show that when document-level scores are used as the evaluation unit, the difference between IR systems is more significant than when topic scores are used. Originality/value – Using a suitable test collection is a primary prerequisite for the comparative evaluation of IR systems. However, in addition to reusable test collections, accurate statistical testing is a necessity for these evaluations. The findings of this study will help IR researchers evaluate their retrieval systems and algorithms more accurately.
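
As a rough illustration of the two significance tests named above, the sketch below computes the Student's t statistic and the Mann-Whitney U statistic for two lists of effectiveness scores. The independent two-sample form with pooled variance is an assumption, since the abstract does not say which variant was used. The score values are made up, and the sketch stops at the test statistics rather than computing p-values.

```python
from math import sqrt
from statistics import mean, variance


def student_t_statistic(scores_a, scores_b):
    """Two-sample Student's t statistic with pooled variance (assumed variant)."""
    n_a, n_b = len(scores_a), len(scores_b)
    pooled = ((n_a - 1) * variance(scores_a) + (n_b - 1) * variance(scores_b)) / (
        n_a + n_b - 2
    )
    return (mean(scores_a) - mean(scores_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))


def mann_whitney_u(scores_a, scores_b):
    """U statistic for sample A: pairs where A beats B, with ties counted as 0.5."""
    u = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


if __name__ == "__main__":
    # Hypothetical per-unit scores for two systems (not data from the paper).
    system_a = [0.5, 0.6, 0.7]
    system_b = [0.1, 0.2, 0.3]
    print(student_t_statistic(system_a, system_b))
    print(mann_whitney_u(system_a, system_b))
```

Using more evaluation units, as in the document-level approach, increases the sample sizes that feed both statistics.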


Some Measures of Picture Fuzzy Sets and Their Application

Journal of Science and Technology Issue on Information and Communications Technology ◽ 10.31130/jst.2017.49 ◽ 2017 ◽ Vol 3 (2) ◽ pp. 35

Author(s): Nguyen Van Dinh ◽ Nguyen Xuan Thao

Keyword(s): Decision Making ◽ Fuzzy Sets ◽ Fuzzy Set ◽ Distance Measure ◽ Intuitionistic Fuzzy Set ◽ Dissimilarity Measure ◽ Intuitionistic Fuzzy ◽ Picture Fuzzy Set ◽ The Difference ◽ Picture Fuzzy Sets

To measure the difference between two fuzzy sets (FSs) or intuitionistic fuzzy sets (IFSs), we can use a distance measure or a dissimilarity measure between them. Characterizing distance and dissimilarity measures between fuzzy sets and intuitionistic fuzzy sets is important because they have applications in different areas: pattern recognition, image segmentation, and decision making. The picture fuzzy set (PFS) is a generalization of the fuzzy set and the intuitionistic fuzzy set, so it has many applications. In this paper, we introduce three concepts: the difference between picture fuzzy sets, the distance measure, and the dissimilarity measure between picture fuzzy sets, and we provide formulas for determining these values. We also present an application of dissimilarity measures to sample recognition problems, which can also be considered decision-making problems.


Combination of Evidence with Different Weighting Factors: A Novel Probabilistic-Based Dissimilarity Measure Approach

Journal of Sensors ◽ 10.1155/2015/509385 ◽ 2015 ◽ Vol 2015 ◽ pp. 1-9 ◽ Cited By ~ 23

Author(s): Mengmeng Ma ◽ Jiyao An

Keyword(s): Dissimilarity Measure ◽ New Combination ◽ Basic Belief ◽ Multisensor Data Fusion ◽ Weighting Factors ◽ The Difference ◽ Combination Approach ◽ Shafer Theory ◽ Theory Of Evidence ◽ Comparison Of The Results

To solve the invalidation problem of the Dempster-Shafer (DS) theory of evidence under high conflict in multisensor data fusion, this paper presents a novel approach for combining conflicting evidence with different weighting factors using a new probabilistic dissimilarity measure. First, an improved probabilistic transformation function is proposed to map basic belief assignments (BBAs) to probabilities. Then, a new dissimilarity measure integrating fuzzy nearness and an introduced correlation coefficient is proposed to characterize not only the difference between BBAs but also the degree of divergence between the hypotheses that two BBAs support. Finally, the weighting factors used to reassign conflicts on BBAs are developed, and Dempster's rule is chosen to combine the discounted sources. Simple numerical examples demonstrate the merit of the proposed method. Analysis and comparison of the results show that the new combination approach can effectively solve the problem of conflict management with better convergence performance and robustness.
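
The high-conflict problem mentioned above can be reproduced with the classical form of Dempster's rule. The sketch below is not the paper's weighted method: it implements only the unweighted rule, and the two sources and their masses are a textbook-style illustration (often attributed to Zadeh), not data from the paper. The function and variable names are hypothetical.

```python
from fractions import Fraction


def dempster_combine(m1, m2):
    """Combine two mass functions (dicts: frozenset -> mass) with Dempster's rule.

    Returns the normalized combined masses and the conflict mass K.
    """
    combined = {}
    conflict = Fraction(0)
    for set_a, mass_a in m1.items():
        for set_b, mass_b in m2.items():
            product = mass_a * mass_b
            intersection = set_a & set_b
            if intersection:
                combined[intersection] = combined.get(intersection, Fraction(0)) + product
            else:
                # Mass assigned to incompatible hypotheses counts as conflict.
                conflict += product
    if conflict == 1:
        raise ValueError("total conflict: Dempster's rule is undefined")
    normalized = {s: m / (1 - conflict) for s, m in combined.items()}
    return normalized, conflict


# Two sources that almost completely disagree (illustrative masses).
source_1 = {frozenset({"A"}): Fraction(99, 100), frozenset({"B"}): Fraction(1, 100)}
source_2 = {frozenset({"C"}): Fraction(99, 100), frozenset({"B"}): Fraction(1, 100)}

fused, conflict = dempster_combine(source_1, source_2)

if __name__ == "__main__":
    print(conflict)  # conflict mass K
    print(fused)     # all remaining belief goes to {"B"}
```

Here both sources give hypothesis B a mass of only 1/100, yet normalization assigns it the entire combined belief, which is the counterintuitive behavior that discounting schemes aim to avoid.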


An Improved Semidiscrete Matrix Decomposition and its Application in Chinese Information Retrieval

Applied Mechanics and Materials ◽ 10.4028/www.scientific.net/amm.241-244.3121 ◽ 2012 ◽ Vol 241-244 ◽ pp. 3121-3124 ◽ Cited By ~ 1

Author(s): Yang Luo

Keyword(s): Information Retrieval ◽ Natural Language Processing ◽ Natural Language ◽ Language Processing ◽ Matrix Decomposition ◽ Latent Semantic Indexing ◽ Semantic Indexing ◽ Storage Space ◽ Important Direction ◽ The Difference

Information retrieval is an important direction in the area of natural language processing. This paper introduces semidiscrete matrix decomposition (SDD) in latent semantic indexing. We address its disadvantage in storage space and present SSDD, and then compare the performance of SVD, SDD, and SSDD.


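Recuperación de Información sobre Patentes: Comparación de Recuperación de Información Web Entre Patentscope y Google Patents [Patent Information Retrieval: A Comparison of Web Information Retrieval Between Patentscope and Google Patents]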

KnE Engineering ◽ 10.18502/keg.v3i1.1480 ◽ 2018 ◽ Vol 3 (1) ◽ pp. 768

Author(s): Gema Castillo ◽ Aránzazu Berbey Álvarez ◽ Humberto Alvarez ◽ Isabel De La Torre Diez

Keyword(s): Information Retrieval ◽ Open Access ◽ Information Needs ◽ Retrieval System ◽ Information Retrieval System ◽ Reliable Information ◽ Summary Table ◽ The People ◽ Global Companies ◽ The Difference

The goal is to present the main free, open-access patent search engines, such as PATENTSCOPE and Google Patents. The paper also examines the information retrieval system, which transforms the user's information needs into a list or collection of documents whose content satisfies that need. We compare the two engines by examining each one independently and then presenting a summary table. Finally, we conclude that a constant search for inventions can make the difference in competitive position between global companies; for this reason, patents prove to be a source of reliable information on the subjects of interest to people or companies. Patentscope and Google Patents allow you to download data as a table for future analysis. Keywords: Information retrieval, Patents, Patentscope, Google Patents, Web


Probability-based fusion of information retrieval result sets

Artificial Intelligence Review ◽ 10.1007/s10462-007-9021-x ◽ 2007 ◽ Vol 25 (1-2) ◽ pp. 179-191 ◽ Cited By ~ 6

Author(s): D. Lillis ◽ F. Toolan ◽ A. Mur ◽ L. Peng ◽ R. Collier ◽ ...

Keyword(s): Information Retrieval ◽ Retrieval Result


Generating Javanese Stopwords List using K-means Clustering Algorithm

Knowledge Engineering and Data Science ◽ 10.17977/um018v3i22020p106-111 ◽ 2020 ◽ Vol 3 (2) ◽ pp. 106

Author(s): Aji Prasetya Wibawa ◽ Hidayah Kariima Fithri ◽ Ilham Ari Elbaith Zaeni ◽ Andrew Nafalski

Keyword(s): Information Retrieval ◽ Clustering Algorithm ◽ Confusion Matrix ◽ Word List ◽ Memory Storage ◽ Clustering Method ◽ Specific Language ◽ Stop Word ◽ The Difference

Stopword removal is necessary in information retrieval. It removes frequently occurring, general words to reduce memory storage. The algorithm eliminates each word that exactly matches a word in the stopword list. However, generating the list can be time-consuming: the words in a specific language and domain must be collected and validated by specialists. This research aims to develop a new way to generate a stopword list using the K-means clustering method. The proposed approach groups words based on their frequency. A confusion matrix measures the difference between the findings and a valid stopword list created by a Javanese linguist. The accuracy of the proposed method is 78.28% (K=7). The result shows that generating Javanese stopword lists using a clustering method is reliable.
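
Grouping words by frequency with K-means can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes raw term counts as the only feature, uses a small deterministic one-dimensional K-means written for this example, and treats the cluster with the highest mean frequency as the stopword candidates. The toy English sentence and the function names are hypothetical.

```python
from collections import Counter


def kmeans_1d(values, k, max_iter=100):
    """Cluster numbers into k groups; returns (labels, centroids)."""
    distinct = sorted(set(values))
    if not 1 <= k <= len(distinct):
        raise ValueError("k must be between 1 and the number of distinct values")
    if k == 1:
        centroids = [sum(values) / len(values)]
    else:
        # Deterministic start: evenly spaced picks from the sorted distinct values.
        span = len(distinct) - 1
        centroids = [float(distinct[i * span // (k - 1)]) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid for each value.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        updated = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def stopword_candidates(tokens, k):
    """Words in the highest-frequency cluster (assumed to be stopword candidates)."""
    counts = Counter(tokens)
    words = list(counts)
    labels, centroids = kmeans_1d([counts[w] for w in words], k)
    top_cluster = max(range(k), key=lambda c: centroids[c])
    return {w for w, label in zip(words, labels) if label == top_cluster}


if __name__ == "__main__":
    tokens = "the cat and the dog and the bird saw the cat".split()
    print(stopword_candidates(tokens, k=2))
```

The candidate set would then be compared against a linguist-validated list, for example with a confusion matrix, to estimate accuracy.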
