Large-scale image search with text for information retrieval

2021 ◽  
Vol 4 (1) ◽  
pp. 87-89
Author(s):  
Janardan Bhatta

Searching images in a large database is a major requirement in information retrieval systems, and returning image search results for a text query is a challenging task. In this paper, we leverage the power of computer vision and natural language processing on distributed machines to lower the latency of search results. Image pixel features are computed with a contrastive loss function for image search, and text features are computed with an attention mechanism for text search. These features are then aligned so that the information in each text and image feature is preserved. Previously, the approach had been tested only on multilingual models; we test it on an image-text dataset, which enables searching with any form of text or image with high accuracy.
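As a rough illustration of the alignment step this abstract describes, the sketch below computes a symmetric InfoNCE-style contrastive loss over a batch of paired image and text embeddings. The embedding size, temperature, and random inputs are placeholder assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a symmetric contrastive loss that aligns image
# and text embeddings (names and dimensions are illustrative only).
import numpy as np

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of paired image/text embeddings."""
    # L2-normalise so the dot product becomes a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature   # (batch, batch)
    labels = np.arange(len(logits))                 # matching pairs on the diagonal

    def cross_entropy(scores, targets):
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(targets)), targets].mean()

    # Symmetric: image-to-text and text-to-image retrieval directions.
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2

# Toy usage with random 64-dimensional embeddings for a batch of 8 pairs.
rng = np.random.default_rng(0)
loss = contrastive_alignment_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64)))
print(round(float(loss), 4))
```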

1988 ◽  
Vol 11 (1-2) ◽  
pp. 33-46 ◽  
Author(s):  
Tove Fjeldvig ◽  
Anne Golden

The fact that a lexeme can appear in various forms causes problems in information retrieval. As a solution to this problem, we have developed methods for automatic root lemmatization, automatic truncation, and automatic splitting of compound words. All the methods are based on a set of rules describing inflected and derived word forms, rather than on a dictionary. The methods have been tested on several collections of texts and have produced very good results. By controlled experiments in text retrieval, we have studied their effects on search results. These results show that both automatic root lemmatization and automatic truncation considerably improve search quality. The experiments with splitting of compound words did not yield quite the same improvement, but they nevertheless showed that such a method could contribute to a richer and more complete search request.
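A minimal sketch of the dictionary-free, rule-based idea described above is given below: suffix-stripping rules for root lemmatization and a simple compound splitter. The suffix list and the known-root set are invented for illustration and are not the rule sets developed in the paper.

```python
# Rule-based (dictionary-free) root lemmatization and compound splitting;
# the suffix rules and known roots below are illustrative placeholders.
SUFFIX_RULES = ["ations", "ation", "ings", "ing", "ers", "er", "ed", "s"]

def root_lemmatize(word, min_stem=3):
    """Strip the longest matching suffix, keeping a minimum stem length."""
    for suffix in SUFFIX_RULES:          # rules are ordered longest-first
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def split_compound(word, known_roots):
    """Split a compound into two parts if both halves look like known roots."""
    for i in range(3, len(word) - 2):
        head, tail = word[:i], word[i:]
        if head in known_roots and root_lemmatize(tail) in known_roots:
            return [head, tail]
    return [word]

print(root_lemmatize("searching"))                    # -> search
print(split_compound("database", {"data", "base"}))   # -> ['data', 'base']
```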


2012 ◽  
pp. 386-409 ◽  
Author(s):  
Ourdia Bouidghaghen ◽  
Lynda Tamine

The explosion of the information available on the Internet has made traditional information retrieval systems, characterized by one-size-fits-all approaches, less effective. Indeed, users are overwhelmed by the information delivered by such systems in response to their queries, particularly when the latter are ambiguous. To tackle this problem, the state of the art reveals a growing interest in contextual information retrieval (CIR), which relies on various sources of evidence drawn from the user's search background and environment in order to improve retrieval accuracy. This chapter focuses on the mobile context, highlights the challenges it presents for IR, and gives an overview of CIR approaches applied in this environment. The authors then present an approach to personalize search results for mobile users by exploiting both cognitive and spatio-temporal contexts. The experimental evaluation, conducted against Yahoo search, shows that the approach improves the quality of top search result lists and enhances search result precision.
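The sketch below shows the general shape of such contextual re-ranking: a content score for each result is combined with a context score derived from the user's situation. The linear weighting and the example scores are assumptions for illustration, not the authors' model.

```python
# Personalising a ranked list with contextual evidence: combine each
# result's content score with a spatio-temporal context score.
# The weighting scheme and example values are illustrative only.
def rerank_with_context(results, context_score, alpha=0.7):
    """results: list of (doc_id, content_score); context_score: doc_id -> [0, 1]."""
    combined = [
        (doc_id, alpha * score + (1 - alpha) * context_score.get(doc_id, 0.0))
        for doc_id, score in results
    ]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)

results = [("d1", 0.9), ("d2", 0.8), ("d3", 0.4)]
nearby_and_recent = {"d2": 1.0, "d3": 0.9}   # e.g. close to the user's location
print(rerank_with_context(results, nearby_and_recent))
```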


Author(s):  
Max Chevalier ◽  
Christine Julien ◽  
Chantal Soulé-Dupuy

Searching for information is carried out with specific tools called information retrieval systems (IRS), also known as "search engines". To provide more accurate results to users, most such systems offer personalization features: each system models the user in order to adapt the search results that will be displayed. In a multi-application context (e.g., when several search engines are used for a single query), these personalization techniques are limited because the user model (also called a profile) is incomplete; it does not exploit actions/queries coming from other search engines. Sharing user models between several search engines is therefore a key challenge for providing more effective personalization. A semantic architecture for user profile interoperability is proposed to reach this goal. This architecture is also important because it can be used in many other contexts to share various resource models between applications, for instance a document model. It also ensures that every system can keep its own representation of each resource while providing a way to share it easily.
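As a rough illustration of the interoperability idea, the sketch below lets each engine keep its internal profile while exporting a simple shared term-weight record that other engines can merge. The exchange format and merging rule are invented for illustration and are not the proposed semantic architecture.

```python
# Exchanging user-profile evidence between search engines: each engine
# keeps its own internal model but exports a shared term-weight record.
# The exchange format below is a hypothetical simplification.
from collections import defaultdict

def export_profile(engine_name, term_weights):
    """Wrap an engine's internal profile in a shared exchange record."""
    return {"source": engine_name, "terms": dict(term_weights)}

def merge_profiles(exported_profiles):
    """Combine exported profiles into one interest vector (mean weight per term)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for profile in exported_profiles:
        for term, weight in profile["terms"].items():
            totals[term] += weight
            counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}

p1 = export_profile("engineA", {"python": 0.9, "retrieval": 0.6})
p2 = export_profile("engineB", {"retrieval": 0.8, "ranking": 0.5})
print(merge_profiles([p1, p2]))
```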


Author(s):  
Sherry Koshman ◽  
Edie Rasmussen

From the 1994 CAIS Conference: The Information Industry in Transition, McGill University, Montreal, Quebec, May 25-27, 1994. "Conventional" information retrieval systems (IRS), originating in the research of the 1950s and 1960s, are based on keyword matching and the application of Boolean operators to produce a set of retrieved documents from a database. In the ensuing years, research in information retrieval has identified a number of innovations (for example, automatic weighting of terms, ranked output, and relevance feedback) which have the potential to significantly enhance the performance of IRS, though commercial vendors have been slow to incorporate these changes into their systems. This was the situation in 1988 which led Radecki, in a special issue of Information Processing & Management, to examine the potential for improvements in conventional Boolean retrieval systems and to explore the reasons why these improvements had not been implemented in operational systems. Over the last five years, this position has begun to change as commercial vendors such as Dialog, Dow Jones, West Publishing, and Mead have implemented new, non-Boolean features in their systems, including natural language input, weighted keyword terms, and document ranking. This paper identifies some of the significant findings of IR research and compares them to the implementation of non-Boolean features in such systems. The preliminary survey of new features in commercial systems suggests the need for new methods of evaluation, including the development of evaluation measures appropriate to large-scale, interactive systems.
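The contrast the paper draws between Boolean set retrieval and weighted, ranked retrieval can be sketched as follows. The toy corpus and the simple TF-IDF weighting are illustrative assumptions, not any particular vendor's implementation.

```python
# Boolean AND matching (unordered set) versus weighted, ranked retrieval
# using a toy TF-IDF scheme; the documents below are placeholders.
import math
from collections import Counter

docs = {
    "d1": "boolean retrieval with keyword matching",
    "d2": "ranked retrieval with weighted keyword terms",
    "d3": "relevance feedback improves ranked retrieval",
}

def boolean_and(query_terms):
    """Return every document containing all query terms, with no ordering."""
    return [d for d, text in docs.items() if all(t in text.split() for t in query_terms)]

def tfidf_rank(query_terms):
    """Score documents by term weights and return them in ranked order."""
    n = len(docs)
    df = Counter(t for text in docs.values() for t in set(text.split()))
    scores = {}
    for d, text in docs.items():
        tf = Counter(text.split())
        scores[d] = sum(tf[t] * math.log(n / df[t]) for t in query_terms if df[t])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(boolean_and(["ranked", "retrieval"]))   # exact-match set
print(tfidf_rank(["ranked", "retrieval"]))    # ranked output with graded scores
```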


Author(s):  
Tomoki Takada ◽  
Mizuki Arai ◽  
Tomohiro Takagi

Nowadays, an increasingly large amount of information exists on the web, and finding the necessary information quickly is becoming increasingly difficult for users, so a method is needed to support this. To solve this problem, information retrieval systems like Google and recommendation systems like Amazon's are used. In this paper, we focus on information retrieval systems. These retrieval systems require index terms, which affect the precision of retrieval. Index terms are generally decided in one of two ways. One is to analyze the text using natural language processing and decide the index terms from various statistics. The other is to have a person choose document keywords as index terms. However, the latter method requires too much time and effort and becomes more impractical as the amount of information grows. Therefore, we propose the Nikkei annotator system, which is based on a model of the human brain, learns patterns of past keyword annotation, and automatically outputs keywords that users prefer. The purposes of the proposed method are to automate manual keyword annotation and to achieve high-speed, high-accuracy keyword annotation. Experimental results showed that the proposed method is more accurate than TF-IDF and Naive Bayes in terms of P@5 and P@10. Moreover, these results also showed that the proposed method could annotate about 19 times faster than Naive Bayes.
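The P@5 and P@10 measures used in the comparison can be computed as below. The suggested keyword ranking and the gold annotations are made-up examples, not the Nikkei data.

```python
# Precision at k (P@5, P@10) for a ranked list of suggested keywords
# against the keywords a human annotator actually chose (toy data).
def precision_at_k(ranked_keywords, gold_keywords, k):
    """Fraction of the top-k suggested keywords that the annotator also chose."""
    top_k = ranked_keywords[:k]
    return sum(1 for kw in top_k if kw in gold_keywords) / k

suggested = ["economy", "yen", "exports", "inflation", "policy",
             "trade", "banking", "tariffs", "equities", "growth"]
gold = {"yen", "exports", "policy", "tariffs", "bonds"}
print(precision_at_k(suggested, gold, 5))    # P@5  -> 0.6
print(precision_at_k(suggested, gold, 10))   # P@10 -> 0.4
```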


Author(s):  
S. Naseehath

Webometric research has fallen into two main categories, namely link analysis and search engine evaluation. Search engines are also used to collect data for link analysis. A set of measurements is proposed for evaluating web search engine performance. Some measurements are adapted from the concepts of recall and precision, which are commonly used in evaluating traditional information retrieval systems. Others are newly developed to evaluate search engine stability, which is unique to web information retrieval systems. Overlap of search results, annual growth of search results on each search engine, and variation of results when searching with synonyms are also used to evaluate the relative efficiency of search engines. In this study, the investigator conducts a webometric study on the topic of medical tourism in Kerala using six search engines: three general search engines, namely Bing, Google, and Lycos, and three metasearch engines, namely Dogpile, ixquick, and WebCrawler.
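One of the measures mentioned, overlap of search results, can be sketched as a Jaccard overlap of the URL sets two engines return for the same query. The result lists below are placeholders, not collected data.

```python
# Jaccard overlap of the result sets two engines return for one query;
# the URL lists are illustrative placeholders.
def result_overlap(results_a, results_b):
    """Share of distinct URLs that both engines returned."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

engine_one = ["u1", "u2", "u3", "u4"]
engine_two = ["u2", "u4", "u5"]
print(result_overlap(engine_one, engine_two))   # 2 shared of 5 distinct -> 0.4
```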


Author(s):  
Claudio Gutiérrez-Soto ◽  
Gilles Hubert

When information retrieval systems are used, information related to searches is typically stored in files commonly known as log files. By contrast, the past search results of previously submitted queries are ignored most of the time. Nevertheless, past search results can be profitable for new searches. Some approaches in information retrieval exploit previous searches in a personalized way for a single user; approaches that deal with past searches collectively are less common. This paper deals with such an approach, using the past results of similar queries submitted by other users to build the answers for newly submitted queries. It proposes two Monte Carlo algorithms that build the result for a new query by selecting relevant documents associated with the most similar past query. Experiments were carried out to evaluate the effectiveness of the proposed algorithms using several dataset variants. The algorithms were also compared with a baseline approach based on the cosine measure, from which they reuse past results. Simulated datasets were designed for the experiments, following the Cranfield paradigm, well established in the information retrieval domain. The empirical results show the value of our approach.
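The general idea can be sketched as follows: pick the most similar past query by cosine similarity, then sample documents from its stored result list in a Monte Carlo fashion, biased toward documents judged relevant. This sampling scheme is a simplification for illustration, not the paper's two algorithms.

```python
# Build an answer for a new query from the result list of the most
# similar past query; the sampling scheme and data are illustrative.
import math
import random
from collections import Counter

def cosine(q1, q2):
    """Cosine similarity between two queries as bags of words."""
    v1, v2 = Counter(q1.split()), Counter(q2.split())
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def answer_from_past(new_query, past_queries, past_results, k=3, trials=200):
    """Monte Carlo sampling of documents from the most similar past query."""
    best = max(past_queries, key=lambda q: cosine(new_query, q))
    docs, weights = zip(*past_results[best])   # (doc_id, relevance weight)
    counts = Counter(random.choices(docs, weights=weights, k=trials))
    return [doc for doc, _ in counts.most_common(k)]

past_results = {"ranked retrieval models": [("d1", 5), ("d2", 1), ("d3", 3)]}
print(answer_from_past("retrieval ranked documents", list(past_results), past_results))
```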


2018 ◽  
Vol 36 (3) ◽  
pp. 430-444
Author(s):  
Sholeh Arastoopoor

Purpose: The degree to which a text is considered readable depends on the capability of the reader. This assumption puts different information retrieval systems at risk of retrieving unreadable, or hard-to-read yet relevant, documents for their users. This paper aims to examine the potential use of concept-based readability measures, along with classic measures, for re-ranking search results in information retrieval systems, specifically in the Persian language.
Design/methodology/approach: Flesch–Dayani, as a classic readability measure, along with document scope (DS) and document cohesion (DC) as domain-specific measures, have been applied for scoring the documents retrieved from Google (181 documents) and the RICeST database (215 documents) in the field of computer science and information technology (IT). The re-ranked results have been compared with rankings made by potential users regarding readability.
Findings: The results show that subcategories of the computer science and IT field differ in their readability and understandability. The study also shows that it is possible to develop a hybrid score based on the DS and DC measures and that, among the four scores applied in re-ranking the documents, the list re-ranked by the DSDC score correlates with the participants' re-ranking in both groups.
Practical implications: The findings of this study offer a new option for re-ranking search results based on their difficulty for experts and non-experts in different fields.
Originality/value: The findings and the two-mode re-ranking model proposed in this paper, along with its primary focus on domain-specific readability in the Persian language, would help web search engines and online databases further refine search results in pursuit of retrieving useful texts for users with differing expertise.
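A minimal sketch of readability-based re-ranking with a hybrid DS/DC score is shown below. The equal weighting and the example scores are assumptions for illustration, not the paper's exact DSDC formulation.

```python
# Re-rank retrieved documents by a hybrid readability score that combines
# document scope (DS) and document cohesion (DC); weights and values are
# illustrative placeholders.
def dsdc_score(ds, dc, w_ds=0.5, w_dc=0.5):
    """Hybrid readability score from scope and cohesion, both in [0, 1]."""
    return w_ds * ds + w_dc * dc

def rerank_by_readability(retrieved):
    """retrieved: list of (doc_id, ds, dc); higher-scoring documents come first."""
    return sorted(retrieved, key=lambda d: dsdc_score(d[1], d[2]), reverse=True)

results = [("doc_google_1", 0.4, 0.7), ("doc_ricest_3", 0.9, 0.8), ("doc_google_7", 0.6, 0.5)]
for doc_id, ds, dc in rerank_by_readability(results):
    print(doc_id, round(dsdc_score(ds, dc), 2))
```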


2020 ◽  
Vol 54 (1) ◽  
pp. 1-2
Author(s):  
Joel M. Mackenzie

As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem: how can such large amounts of data continue to be served efficiently enough to satisfy end users? This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency corresponds to the high-percentile latency that is observed from a system; in the case of search, this latency typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence the worst possible user experience. The key idea in targeting tail latency is to move from questions such as "what is the median latency of our search engine?" to questions which more accurately capture user experience, such as "how many queries take more than 200 ms to return answers?" or "what is the worst-case latency that a user may be subject to, and how often might it occur?" While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness. We then propose and solve a new problem, which involves processing a number of related query variations together, known as multi-queries, to yield higher-quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
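The shift from average to tail latency that the abstract describes can be made concrete with a short measurement sketch: compute median, 95th, and 99th percentile query times and count how many queries exceed a latency budget. The simulated latencies and the 200 ms budget are illustrative assumptions.

```python
# Measuring tail latency: percentiles of per-query processing time and
# the number of queries over a latency budget (simulated data).
import random
import statistics

random.seed(42)
# Simulated per-query processing times in milliseconds (heavy right tail).
latencies = [random.lognormvariate(3.5, 0.6) for _ in range(10_000)]

cuts = statistics.quantiles(latencies, n=100)   # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
over_budget = sum(1 for t in latencies if t > 200)

print(f"median={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
print(f"queries slower than 200 ms: {over_budget} of {len(latencies)}")
```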

