Large-scale image search with text for information retrieval

2021 ◽  
Vol 4 (1) ◽  
pp. 87-89
Author(s):  
Janardan Bhatta

Searching images in a large database is a major requirement in information retrieval systems, and returning image search results for a text query is a challenging task. In this paper, we leverage the power of computer vision and natural language processing on distributed machines to lower the latency of search results. Image pixel features are computed with a contrastive loss function for image search, and text features are computed with an attention mechanism for text search. These features are then aligned so that the information in each text and image feature is preserved. Previously, the approach had been tested only on multilingual models; we test it on an image-text dataset, which enables searching with any form of text or image with high accuracy.
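As a rough illustration of the alignment step this abstract describes, the sketch below computes a symmetric InfoNCE-style contrastive loss over a batch of paired image and text embeddings. The embedding size, temperature, and random inputs are placeholder assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a symmetric contrastive loss that aligns image
# and text embeddings (names and dimensions are illustrative only).
import numpy as np

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of paired image/text embeddings."""
    # L2-normalise so the dot product becomes a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature   # (batch, batch)
    labels = np.arange(len(logits))                 # matching pairs on the diagonal

    def cross_entropy(scores, targets):
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(targets)), targets].mean()

    # Symmetric: image-to-text and text-to-image retrieval directions.
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2

# Toy usage with random 64-dimensional embeddings for a batch of 8 pairs.
rng = np.random.default_rng(0)
loss = contrastive_alignment_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64)))
print(round(float(loss), 4))
```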

1988 ◽  
Vol 11 (1-2) ◽  
pp. 33-46 ◽  
Author(s):  
Tove Fjeldvig ◽  
Anne Golden

The fact that a lexeme can appear in various forms causes problems in information retrieval. As a solution to this problem, we have developed methods for automatic root lemmatization, automatic truncation, and automatic splitting of compound words. All the methods are based on a set of rules describing inflected and derived word forms, rather than on a dictionary. The methods have been tested on several collections of texts and have produced very good results. By controlled experiments in text retrieval, we have studied their effects on search results. These results show that both automatic root lemmatization and automatic truncation considerably improve search quality. The experiments with splitting of compound words did not yield quite the same improvement, but they nevertheless showed that such a method could contribute to a richer and more complete search request.
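A minimal sketch of the dictionary-free, rule-based idea described above is given below: suffix-stripping rules for root lemmatization and a simple compound splitter. The suffix list and the known-root set are invented for illustration and are not the rule sets developed in the paper.

```python
# Rule-based (dictionary-free) root lemmatization and compound splitting;
# the suffix rules and known roots below are illustrative placeholders.
SUFFIX_RULES = ["ations", "ation", "ings", "ing", "ers", "er", "ed", "s"]

def root_lemmatize(word, min_stem=3):
    """Strip the longest matching suffix, keeping a minimum stem length."""
    for suffix in SUFFIX_RULES:          # rules are ordered longest-first
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def split_compound(word, known_roots):
    """Split a compound into two parts if both halves look like known roots."""
    for i in range(3, len(word) - 2):
        head, tail = word[:i], word[i:]
        if head in known_roots and root_lemmatize(tail) in known_roots:
            return [head, tail]
    return [word]

print(root_lemmatize("searching"))                    # -> search
print(split_compound("database", {"data", "base"}))   # -> ['data', 'base']
```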


2012 ◽  
pp. 386-409 ◽  
Author(s):  
Ourdia Bouidghaghen ◽  
Lynda Tamine

The explosion of the information available on the Internet has made traditional information retrieval systems, characterized by one-size-fits-all approaches, less effective. Indeed, users are overwhelmed by the information delivered by such systems in response to their queries, particularly when the latter are ambiguous. To tackle this problem, the state of the art reveals a growing interest in contextual information retrieval (CIR), which relies on various sources of evidence drawn from the user's search background and environment in order to improve retrieval accuracy. This chapter focuses on the mobile context, highlights the challenges it presents for IR, and gives an overview of CIR approaches applied in this environment. The authors then present an approach to personalize search results for mobile users by exploiting both cognitive and spatio-temporal contexts. The experimental evaluation, conducted against Yahoo search, shows that the approach improves the quality of top search result lists and enhances search result precision.
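The sketch below shows the general shape of such contextual re-ranking: a content score for each result is combined with a context score derived from the user's situation. The linear weighting and the example scores are assumptions for illustration, not the authors' model.

```python
# Personalising a ranked list with contextual evidence: combine each
# result's content score with a spatio-temporal context score.
# The weighting scheme and example values are illustrative only.
def rerank_with_context(results, context_score, alpha=0.7):
    """results: list of (doc_id, content_score); context_score: doc_id -> [0, 1]."""
    combined = [
        (doc_id, alpha * score + (1 - alpha) * context_score.get(doc_id, 0.0))
        for doc_id, score in results
    ]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)

results = [("d1", 0.9), ("d2", 0.8), ("d3", 0.4)]
nearby_and_recent = {"d2": 1.0, "d3": 0.9}   # e.g. close to the user's location
print(rerank_with_context(results, nearby_and_recent))
```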


Author(s):  
Max Chevalier ◽  
Christine Julien ◽  
Chantal Soulé-Dupuy

Searching for information is carried out with specific tools called information retrieval systems (IRS), also known as "search engines". To provide more accurate results to users, most such systems offer personalization features: each system models the user in order to adapt the search results that will be displayed. In a multi-application context (e.g., when several search engines are used for a single query), these personalization techniques are limited because the user model (also called a profile) is incomplete; it does not exploit actions/queries coming from other search engines. Sharing user models between several search engines is therefore a key challenge for providing more effective personalization. A semantic architecture for user profile interoperability is proposed to reach this goal. This architecture is also important because it can be used in many other contexts to share various resource models between applications, for instance a document model. It also ensures that every system can keep its own representation of each resource while providing a way to share it easily.
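As a rough illustration of the interoperability idea, the sketch below lets each engine keep its internal profile while exporting a simple shared term-weight record that other engines can merge. The exchange format and merging rule are invented for illustration and are not the proposed semantic architecture.

```python
# Exchanging user-profile evidence between search engines: each engine
# keeps its own internal model but exports a shared term-weight record.
# The exchange format below is a hypothetical simplification.
from collections import defaultdict

def export_profile(engine_name, term_weights):
    """Wrap an engine's internal profile in a shared exchange record."""
    return {"source": engine_name, "terms": dict(term_weights)}

def merge_profiles(exported_profiles):
    """Combine exported profiles into one interest vector (mean weight per term)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for profile in exported_profiles:
        for term, weight in profile["terms"].items():
            totals[term] += weight
            counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}

p1 = export_profile("engineA", {"python": 0.9, "retrieval": 0.6})
p2 = export_profile("engineB", {"retrieval": 0.8, "ranking": 0.5})
print(merge_profiles([p1, p2]))
```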


Author(s):  
Sherry Koshman ◽  
Edie Rasmussen

From the 1994 CAIS Conference: The Information Industry in Transition, McGill University, Montreal, Quebec, May 25-27, 1994. "Conventional" information retrieval systems (IRS), originating in the research of the 1950s and 1960s, are based on keyword matching and the application of Boolean operators to produce a set of retrieved documents from a database. In the ensuing years, research in information retrieval has identified a number of innovations (for example, automatic weighting of terms, ranked output, and relevance feedback) which have the potential to significantly enhance the performance of IRS, though commercial vendors have been slow to incorporate these changes into their systems. This was the situation in 1988 which led Radecki, in a special issue of Information Processing & Management, to examine the potential for improvements in conventional Boolean retrieval systems and to explore the reasons why these improvements had not been implemented in operational systems. Over the last five years, this position has begun to change as commercial vendors such as Dialog, Dow Jones, West Publishing, and Mead have implemented new, non-Boolean features in their systems, including natural language input, weighted keyword terms, and document ranking. This paper identifies some of the significant findings of IR research and compares them to the implementation of non-Boolean features in such systems. The preliminary survey of new features in commercial systems suggests the need for new methods of evaluation, including the development of evaluation measures appropriate to large-scale, interactive systems.
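The contrast the paper draws between Boolean set retrieval and weighted, ranked retrieval can be sketched as follows. The toy corpus and the simple TF-IDF weighting are illustrative assumptions, not any particular vendor's implementation.

```python
# Boolean AND matching (unordered set) versus weighted, ranked retrieval
# using a toy TF-IDF scheme; the documents below are placeholders.
import math
from collections import Counter

docs = {
    "d1": "boolean retrieval with keyword matching",
    "d2": "ranked retrieval with weighted keyword terms",
    "d3": "relevance feedback improves ranked retrieval",
}

def boolean_and(query_terms):
    """Return every document containing all query terms, with no ordering."""
    return [d for d, text in docs.items() if all(t in text.split() for t in query_terms)]

def tfidf_rank(query_terms):
    """Score documents by term weights and return them in ranked order."""
    n = len(docs)
    df = Counter(t for text in docs.values() for t in set(text.split()))
    scores = {}
    for d, text in docs.items():
        tf = Counter(text.split())
        scores[d] = sum(tf[t] * math.log(n / df[t]) for t in query_terms if df[t])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(boolean_and(["ranked", "retrieval"]))   # exact-match set
print(tfidf_rank(["ranked", "retrieval"]))    # ranked output with graded scores
```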


Author(s):  
Tomoki Takada ◽  
Mizuki Arai ◽  
Tomohiro Takagi

Nowadays, an increasingly large amount of information exists on the web, and finding the necessary information quickly is becoming increasingly difficult for users, so a method is needed to support this. To solve this problem, information retrieval systems like Google and recommendation systems like Amazon's are used. In this paper, we focus on information retrieval systems. These retrieval systems require index terms, which affect the precision of retrieval. Index terms are generally decided in one of two ways. One is to analyze the text using natural language processing and decide the index terms from various statistics. The other is to have a person choose document keywords as index terms. However, the latter method requires too much time and effort and becomes more impractical as the amount of information grows. Therefore, we propose the Nikkei annotator system, which is based on a model of the human brain, learns patterns of past keyword annotation, and automatically outputs keywords that users prefer. The purposes of the proposed method are to automate manual keyword annotation and to achieve high-speed, high-accuracy keyword annotation. Experimental results showed that the proposed method is more accurate than TF-IDF and Naive Bayes in terms of P@5 and P@10. Moreover, these results also showed that the proposed method could annotate about 19 times faster than Naive Bayes.
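The P@5 and P@10 measures used in the comparison can be computed as below. The suggested keyword ranking and the gold annotations are made-up examples, not the Nikkei data.

```python
# Precision at k (P@5, P@10) for a ranked list of suggested keywords
# against the keywords a human annotator actually chose (toy data).
def precision_at_k(ranked_keywords, gold_keywords, k):
    """Fraction of the top-k suggested keywords that the annotator also chose."""
    top_k = ranked_keywords[:k]
    return sum(1 for kw in top_k if kw in gold_keywords) / k

suggested = ["economy", "yen", "exports", "inflation", "policy",
             "trade", "banking", "tariffs", "equities", "growth"]
gold = {"yen", "exports", "policy", "tariffs", "bonds"}
print(precision_at_k(suggested, gold, 5))    # P@5  -> 0.6
print(precision_at_k(suggested, gold, 10))   # P@10 -> 0.4
```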


Author(s):  
S. Naseehath

Webometric research has fallen into two main categories, namely link analysis and search engine evaluation. Search engines are also used to collect data for link analysis. A set of measurements is proposed for evaluating web search engine performance. Some measurements are adapted from the concepts of recall and precision, which are commonly used in evaluating traditional information retrieval systems. Others are newly developed to evaluate search engine stability, which is unique to web information retrieval systems. Overlap of search results, annual growth of search results on each search engine, and variation of results when searching with synonyms are also used to evaluate the relative efficiency of search engines. In this study, the investigator conducts a webometric study on the topic of medical tourism in Kerala using six search engines: three general search engines, namely Bing, Google, and Lycos, and three metasearch engines, namely Dogpile, ixquick, and WebCrawler.
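One of the measures mentioned, overlap of search results, can be sketched as a Jaccard overlap of the URL sets two engines return for the same query. The result lists below are placeholders, not collected data.

```python
# Jaccard overlap of the result sets two engines return for one query;
# the URL lists are illustrative placeholders.
def result_overlap(results_a, results_b):
    """Share of distinct URLs that both engines returned."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

engine_one = ["u1", "u2", "u3", "u4"]
engine_two = ["u2", "u4", "u5"]
print(result_overlap(engine_one, engine_two))   # 2 shared of 5 distinct -> 0.4
```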


Author(s):  
Claudio Gutiérrez-Soto ◽  
Gilles Hubert

When information retrieval systems are used, information related to searches is typically stored in files commonly known as log files. By contrast, the past search results of previously submitted queries are ignored most of the time. Nevertheless, past search results can be profitable for new searches. Some approaches in information retrieval exploit previous searches in a personalized way for a single user; approaches that deal with past searches collectively are less common. This paper deals with such an approach, using the past results of similar queries submitted by other users to build the answers for newly submitted queries. It proposes two Monte Carlo algorithms that build the result for a new query by selecting relevant documents associated with the most similar past query. Experiments were carried out to evaluate the effectiveness of the proposed algorithms using several dataset variants. The algorithms were also compared with a baseline approach based on the cosine measure, from which they reuse past results. Simulated datasets were designed for the experiments, following the Cranfield paradigm, well established in the information retrieval domain. The empirical results show the value of our approach.
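The general idea can be sketched as follows: pick the most similar past query by cosine similarity, then sample documents from its stored result list in a Monte Carlo fashion, biased toward documents judged relevant. This sampling scheme is a simplification for illustration, not the paper's two algorithms.

```python
# Build an answer for a new query from the result list of the most
# similar past query; the sampling scheme and data are illustrative.
import math
import random
from collections import Counter

def cosine(q1, q2):
    """Cosine similarity between two queries as bags of words."""
    v1, v2 = Counter(q1.split()), Counter(q2.split())
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def answer_from_past(new_query, past_queries, past_results, k=3, trials=200):
    """Monte Carlo sampling of documents from the most similar past query."""
    best = max(past_queries, key=lambda q: cosine(new_query, q))
    docs, weights = zip(*past_results[best])   # (doc_id, relevance weight)
    counts = Counter(random.choices(docs, weights=weights, k=trials))
    return [doc for doc, _ in counts.most_common(k)]

past_results = {"ranked retrieval models": [("d1", 5), ("d2", 1), ("d3", 3)]}
print(answer_from_past("retrieval ranked documents", list(past_results), past_results))
```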


2018 ◽  
Vol 36 (3) ◽  
pp. 430-444
Author(s):  
Sholeh Arastoopoor

Purpose: The degree to which a text is considered readable depends on the capability of the reader. This assumption puts different information retrieval systems at risk of retrieving unreadable, or hard-to-read yet relevant, documents for their users. This paper aims to examine the potential use of concept-based readability measures, along with classic measures, for re-ranking search results in information retrieval systems, specifically in the Persian language.
Design/methodology/approach: Flesch–Dayani, as a classic readability measure, along with document scope (DS) and document cohesion (DC) as domain-specific measures, have been applied for scoring the documents retrieved from Google (181 documents) and the RICeST database (215 documents) in the field of computer science and information technology (IT). The re-ranked results have been compared with rankings made by potential users regarding readability.
Findings: The results show that subcategories of the computer science and IT field differ in their readability and understandability. The study also shows that it is possible to develop a hybrid score based on the DS and DC measures and that, among the four scores applied in re-ranking the documents, the list re-ranked by the DSDC score correlates with the participants' re-ranking in both groups.
Practical implications: The findings of this study offer a new option for re-ranking search results based on their difficulty for experts and non-experts in different fields.
Originality/value: The findings and the two-mode re-ranking model proposed in this paper, along with its primary focus on domain-specific readability in the Persian language, would help web search engines and online databases further refine search results in pursuit of retrieving useful texts for users with differing expertise.
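A minimal sketch of readability-based re-ranking with a hybrid DS/DC score is shown below. The equal weighting and the example scores are assumptions for illustration, not the paper's exact DSDC formulation.

```python
# Re-rank retrieved documents by a hybrid readability score that combines
# document scope (DS) and document cohesion (DC); weights and values are
# illustrative placeholders.
def dsdc_score(ds, dc, w_ds=0.5, w_dc=0.5):
    """Hybrid readability score from scope and cohesion, both in [0, 1]."""
    return w_ds * ds + w_dc * dc

def rerank_by_readability(retrieved):
    """retrieved: list of (doc_id, ds, dc); higher-scoring documents come first."""
    return sorted(retrieved, key=lambda d: dsdc_score(d[1], d[2]), reverse=True)

results = [("doc_google_1", 0.4, 0.7), ("doc_ricest_3", 0.9, 0.8), ("doc_google_7", 0.6, 0.5)]
for doc_id, ds, dc in rerank_by_readability(results):
    print(doc_id, round(dsdc_score(ds, dc), 2))
```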


2020 ◽  
Vol 54 (1) ◽  
pp. 1-2
Author(s):  
Joel M. Mackenzie

As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem: how can such large amounts of data continue to be served efficiently enough to satisfy end users? This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency corresponds to the high-percentile latency that is observed from a system; in the case of search, this latency typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence the worst possible user experience. The key idea in targeting tail latency is to move from questions such as "what is the median latency of our search engine?" to questions which more accurately capture user experience, such as "how many queries take more than 200 ms to return answers?" or "what is the worst-case latency that a user may be subject to, and how often might it occur?" While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness. We then propose and solve a new problem, which involves processing a number of related query variations together, known as multi-queries, to yield higher-quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
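The shift from average to tail latency that the abstract describes can be made concrete with a short measurement sketch: compute median, 95th, and 99th percentile query times and count how many queries exceed a latency budget. The simulated latencies and the 200 ms budget are illustrative assumptions.

```python
# Measuring tail latency: percentiles of per-query processing time and
# the number of queries over a latency budget (simulated data).
import random
import statistics

random.seed(42)
# Simulated per-query processing times in milliseconds (heavy right tail).
latencies = [random.lognormvariate(3.5, 0.6) for _ in range(10_000)]

cuts = statistics.quantiles(latencies, n=100)   # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
over_budget = sum(1 for t in latencies if t > 200)

print(f"median={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
print(f"queries slower than 200 ms: {over_budget} of {len(latencies)}")
```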

