Indexing Textual Information

Author(s):  
Ioannis N. Kouris ◽  
Christos Makris ◽  
Evangelos Theodoridis ◽  
Athanasios Tsakalidis

Information retrieval is the computational discipline that deals with the efficient representation, organization, and access to information objects that represent natural language texts (Baeza-Yates & Ribeiro-Neto, 1999; Salton & McGill, 1983; Witten, Moffat, & Bell, 1999). A crucial subproblem in information retrieval is the design and implementation of efficient data structures and algorithms for indexing and searching information objects that are only vaguely described. In this article, we present the latest developments in the indexing area, with special emphasis on data structures and algorithmic techniques for string manipulation, space-efficient implementations, and compression techniques for the efficient storage of information objects. These problems arise in a range of applications such as digital libraries, molecular sequence databases (DNA sequences and protein databases; Gusfield, 1997), the implementation of Web search engines, Web mining, and information filtering.
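To make the flavour of these techniques concrete, the following is a minimal, illustrative sketch (not drawn from the article itself) of an inverted index whose posting lists are compressed with gap encoding and variable-byte codes; the function names and the toy document collection are assumptions made here.

```python
# A toy inverted index with gap + variable-byte compressed posting lists.
# Illustrative only: names and structure are not taken from the article.
from collections import defaultdict

def vb_encode(number: int) -> bytes:
    """Variable-byte encode a non-negative integer (7 data bits per byte)."""
    out = bytearray()
    while True:
        out.insert(0, number % 128)
        if number < 128:
            break
        number //= 128
    out[-1] += 128          # set the continuation bit on the last byte
    return bytes(out)

def vb_decode(data: bytes) -> list[int]:
    """Decode a sequence of variable-byte encoded integers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            numbers.append(n * 128 + byte - 128)
            n = 0
    return numbers

def build_index(docs: dict[int, str]) -> dict[str, bytes]:
    """Map each term to a compressed, gap-encoded posting list of doc ids."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            postings[term].append(doc_id)
    index = {}
    for term, ids in postings.items():
        gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
        index[term] = b"".join(vb_encode(g) for g in gaps)
    return index

docs = {1: "suffix trees index text", 2: "compressed text indexes", 3: "text retrieval"}
index = build_index(docs)
gaps = vb_decode(index["text"])
doc_ids = [sum(gaps[:i + 1]) for i in range(len(gaps))]
print(doc_ids)   # [1, 2, 3]
```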

2006 ◽  
Author(s):  
Χρήστος Τρυφωνόπουλος

Much information of interest to humans is today available on the Web. People can easily gain access to information, but at the same time they have to cope with the problem of information overload. Consequently, they have to rely on specialised tools and systems designed for searching, querying and retrieving information from the Web. Currently, Web search is controlled by a few search engines that shoulder the burden of following this information explosion using centralised search infrastructures. Additionally, users are striving to stay informed by sifting through enormous amounts of new information, and by relying on tools and techniques that are not able to capture the dynamic nature of the Web. In this setting, peer-to-peer Web search seems an ideal candidate: it can offer adaptivity to high dynamics, scalability and resilience to failures, and it can leverage the functionality of the traditional search engine to offer new features and services.

In this thesis, we study the problem of peer-to-peer resource sharing in wide-area networks such as the Internet and the Web. In the architecture that we envision, each peer owns resources which it is willing to share: documents, web pages or files that are appropriately annotated and queried using constructs from information retrieval models. There are two kinds of basic functionality that we expect this architecture to offer: information retrieval and information filtering (also known as publish/subscribe or information dissemination). The main focus of our work is on providing models and languages for expressing publications, queries and subscriptions; protocols that regulate peer interactions in this distributed environment; and indexing mechanisms that are utilised locally by each peer.

Initially, we present three progressively more expressive data models, WP, AWP and AWPS, that are based on information retrieval concepts, together with their respective query languages. Then, we study the complexity of query satisfiability and entailment for the models WP and AWP using techniques from propositional logic and computational complexity. Subsequently, we propose a peer-to-peer architecture designed to support full-fledged information retrieval and filtering functionality in a single unifying framework. In the context of this architecture, we focus on the problem of information filtering using the model AWPS, and present centralised and distributed algorithms for efficient, adaptive information filtering in a peer-to-peer environment.

We use two levels of indexing to store the queries submitted by users. The first level corresponds to the partitioning of the global query index across different peers, using a distributed hash table as the underlying routing infrastructure. Each node is responsible for a fraction of the submitted user queries through a mapping of attribute values to peer identifiers. The distributed hash table infrastructure is used to define the mapping scheme and also manages the routing of messages between different nodes. Our set of protocols, collectively called DHTrie, extends the basic functionality of the distributed hash table to offer filtering functionality in a dynamic peer-to-peer environment. Additionally, the use of a self-maintainable routing table allows efficient communication between the peers, offering significantly lower network load and latency. This extra routing table uses only local information collected by each peer to speed up the retrieval and filtering process.
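As a rough, simplified sketch of the first indexing level described above (and not the actual DHTrie protocol), the snippet below hashes query terms onto a ring of peer identifiers so that each peer stores the subscriptions for the terms it is responsible for; the peer names and the consistent-hashing scheme are illustrative assumptions.

```python
# Minimal sketch of term-to-peer mapping over a distributed hash table:
# each query term is hashed onto a ring of peer identifiers, and the first
# peer clockwise from the term's identifier stores the subscription.
# Illustrative simplification, not the DHTrie protocols themselves.
import hashlib
from bisect import bisect_right

ID_SPACE = 2 ** 32

def key(value: str) -> int:
    """Hash a term (or peer name) into the identifier space."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % ID_SPACE

class Ring:
    def __init__(self, peer_names: list[str]):
        # Sorted peer identifiers form a consistent-hashing ring.
        self.peers = sorted((key(p), p) for p in peer_names)
        self.store: dict[str, list[tuple[str, str]]] = {p: [] for p in peer_names}

    def responsible_peer(self, term: str) -> str:
        """The first peer clockwise from the term's identifier is responsible."""
        ids = [pid for pid, _ in self.peers]
        idx = bisect_right(ids, key(term)) % len(self.peers)
        return self.peers[idx][1]

    def subscribe(self, subscriber: str, query: str) -> None:
        # Index the subscription once per distinct query term.
        for term in set(query.lower().split()):
            self.store[self.responsible_peer(term)].append((term, subscriber))

ring = Ring(["peer-A", "peer-B", "peer-C"])
ring.subscribe("alice", "peer-to-peer information filtering")
for peer, entries in ring.store.items():
    print(peer, entries)
```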
The second level of our indexing mechanism is managed locally by each peer and is used for indexing the user queries that the peer is responsible for. At this level of the index, each peer is able to store large numbers of user queries and match them against incoming documents. We have proposed data structures and local indexing algorithms that enable us to solve the filtering problem efficiently for large databases of queries. The main idea behind these algorithms is to store sets of words compactly, exploiting their common elements using trie-like data structures. Since these algorithms use heuristics to cluster user queries, we also consider the periodic reorganisation of the query database when the clustering of queries deteriorates. Our experimental results show the scalability and efficiency of the proposed algorithms in a dynamic setting. The distributed protocols manage to provide exact query answering (precision and recall are the same as those of a centralised system) at low network cost and low latency. Additionally, the local algorithms we have proposed outperform solutions in the current literature: our trie-based query indexing algorithms proved more than 20% faster than their counterparts, offering sophisticated clustering of user queries and mechanisms for the adaptive reorganisation of the query database when filtering performance drops.
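The following is a much-simplified sketch of the trie-like idea behind this second level, not the authors' exact algorithms: the distinct words of each query are sorted and inserted into a trie so that queries with common words share nodes, and matching a document only follows branches whose words occur in it.

```python
# Simplified sketch of the local query index: each query's distinct words are
# sorted and inserted into a trie, so queries sharing words share nodes.
# Matching walks only the branches whose words occur in the document.

class TrieNode:
    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.query_ids: list[int] = []      # queries whose word set ends here

class QueryIndex:
    def __init__(self):
        self.root = TrieNode()

    def add_query(self, query_id: int, words: set[str]) -> None:
        node = self.root
        for word in sorted(words):          # canonical order enables sharing
            node = node.children.setdefault(word, TrieNode())
        node.query_ids.append(query_id)

    def match(self, document: str) -> list[int]:
        doc_words = set(document.lower().split())
        matched: list[int] = []

        def walk(node: TrieNode) -> None:
            matched.extend(node.query_ids)
            for word, child in node.children.items():
                if word in doc_words:       # prune branches not in the document
                    walk(child)

        walk(self.root)
        return matched

index = QueryIndex()
index.add_query(1, {"peer", "filtering"})
index.add_query(2, {"peer", "retrieval"})
print(index.match("adaptive peer to peer filtering of documents"))   # [1]
```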


Author(s):  
Qiaozhu Mei ◽  
Dragomir Radev

This chapter is a basic introduction to text information retrieval. Information retrieval (IR) refers to the activities of obtaining, from a much larger collection, information resources (usually in the form of textual documents) that are relevant to an information need of the user (usually expressed as a query). Practical instances of an IR system include digital libraries and Web search engines. This chapter presents the typical architecture of an IR system, an overview of the methods used in the design and implementation of each major component of an IR system, a discussion of evaluation methods for an IR system, and finally a summary of recent developments and research trends in the field of information retrieval.
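As a small illustration of the ranking component in such an architecture (the chapter itself surveys a range of retrieval models), the sketch below scores documents against a keyword query with plain TF-IDF weighting; the toy collection and the weighting details are assumptions made for the example.

```python
# A minimal sketch of the ranking step: documents are scored against a
# keyword query with simple TF-IDF weighting. Purely illustrative.
import math
from collections import Counter

def rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Return documents sorted by a simple TF-IDF score for the query."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter()                       # document frequency per term
    for words in tokenized.values():
        df.update(set(words))

    def score(words: list[str]) -> float:
        tf = Counter(words)
        return sum(
            (tf[t] / len(words)) * math.log((n + 1) / (df[t] + 1))
            for t in query.lower().split() if t in tf
        )

    return sorted(((d, score(w)) for d, w in tokenized.items()),
                  key=lambda x: x[1], reverse=True)

docs = {"d1": "web search engines index the web",
        "d2": "digital libraries preserve documents",
        "d3": "search over digital libraries"}
print(rank("digital libraries", docs))   # d2 and d3 rank above d1
```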


Author(s):  
Iris Xie

The emergence of the Internet has allowed millions of people to use a variety of electronic information retrieval (IR) systems, such as digital libraries, Web search engines, online databases, and Online Public Access Catalogues (OPACs). The nature of IR is interaction: interactive information retrieval is defined as the communication process between users and IR systems. However, the dynamics of interactive IR are not yet fully understood. Moreover, most existing IR systems do not support the full range of users’ interactions with IR systems. Instead, they support only one type of information-seeking strategy: specifying queries with terms in order to select relevant information. Yet new digital environments require users to apply multiple information-seeking strategies and to shift from one information-seeking strategy to another during the information retrieval process.


Author(s):  
Ji-Rong Wen

A Web query log is a file that records the activities of the users of a search engine. Compared with the traditional information retrieval setting, in which documents are the only available information source, query logs are an additional information source in the Web search setting. Based on query logs, a set of Web mining techniques, such as log-based query clustering, log-based query expansion, collaborative filtering, and personalized search, can be employed to improve the performance of Web search.
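As an illustrative sketch of one of the techniques listed above, log-based query clustering, the snippet below groups queries whose clicked URLs overlap; the log format, the Jaccard similarity measure, and the threshold are assumptions made here, not details from the article.

```python
# Sketch of log-based query clustering: queries that lead to clicks on the
# same URLs are grouped together. Log format and threshold are illustrative.
from itertools import combinations

# (query, clicked_url) pairs, as they might be extracted from a query log.
log = [
    ("car rental", "http://example.com/rent"),
    ("hire a car", "http://example.com/rent"),
    ("rent a car", "http://example.com/rent"),
    ("python tutorial", "http://example.org/python"),
]

def cluster_queries(log: list[tuple[str, str]], min_overlap: float = 0.5):
    """Group queries whose sets of clicked URLs overlap enough (Jaccard)."""
    clicks: dict[str, set[str]] = {}
    for query, url in log:
        clicks.setdefault(query, set()).add(url)

    clusters = [{q} for q in clicks]                 # start with singletons
    for q1, q2 in combinations(clicks, 2):
        a, b = clicks[q1], clicks[q2]
        if len(a & b) / len(a | b) >= min_overlap:   # similar click behaviour
            c1 = next(c for c in clusters if q1 in c)
            c2 = next(c for c in clusters if q2 in c)
            if c1 is not c2:                         # merge the two clusters
                c1 |= c2
                clusters.remove(c2)
    return clusters

print(cluster_queries(log))
# one cluster for the three car-rental queries, one for the python query
```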


2011 ◽  
Vol 3 (3) ◽  
pp. 41-58 ◽  
Author(s):  
Cédric Pruski ◽  
Nicolas Guelfi ◽  
Chantal Reynaud

Finding relevant information on the Web is difficult for most users. Although Web search applications are improving, they must become more “intelligent” and adapt to the search domains targeted by queries, to the evolution of these domains, and to users’ characteristics. In this paper, the authors present the TARGET framework for Web Information Retrieval. The proposed approach relies on ontologies of a particular kind, called adaptive ontologies, to represent both the search domain and the user’s profile. Unlike existing ontology-based approaches, the authors make adaptive ontologies adapt semi-automatically to the evolution of the modeled domain. The ontologies and their properties are exploited for domain-specific Web search. The authors propose graph-based data structures for semantically enriching Web data, and define an automatic query expansion technique that adapts a query to users’ real needs. The enriched query is evaluated on the graph-based data structures representing a set of Web pages returned by a conventional search engine, in order to extract the information most relevant to user needs. The overall TARGET framework is formalized using first-order logic and is fully supported by tools.
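A toy sketch of ontology-driven query expansion in the spirit of the approach described above (it does not reproduce the TARGET framework or its adaptive ontologies): each query term is expanded with strongly related neighbours from a small, hand-written domain graph. The graph, weights, and cut-off are assumptions made for the example.

```python
# Simplified ontology-based query expansion: each query term is expanded
# with its strongly related neighbours in a toy domain graph.
# The graph and the threshold are illustrative assumptions.

ontology = {
    "car":    {"vehicle": 0.9, "automobile": 0.95, "engine": 0.4},
    "rental": {"hire": 0.9, "lease": 0.7},
}

def expand_query(query: str, ontology: dict[str, dict[str, float]],
                 threshold: float = 0.6) -> list[str]:
    """Add strongly related ontology neighbours to the original query terms."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        for neighbour, weight in ontology.get(term, {}).items():
            if weight >= threshold:          # keep only strong relations
                expanded.append(neighbour)
    return expanded

print(expand_query("car rental", ontology))
# ['car', 'vehicle', 'automobile', 'rental', 'hire', 'lease']
```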


2021 ◽  
Vol 55 (1) ◽  
pp. 1-2
Author(s):  
Bhaskar Mitra

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from those in these other application areas. A common form of IR involves ranking documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling the relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and should avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve efficiently from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives besides relevance, such as parity of exposure for retrieved items and content publishers.

In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018]. Our key contribution towards improving the effectiveness of deep ranking models is the development of the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of the query and the document. To retrieve efficiently from large collections, we develop a framework that incorporates query term independence [Mitra et al., 2019] into any arbitrary deep model, enabling large-scale precomputation and the use of an inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
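As a toy illustration of the intuition behind the Duet principle mentioned above (not the actual neural architecture), the sketch below combines evidence from exact term matches with the similarity of latent representations of query and document; the tiny hand-set embeddings and the unweighted sum stand in for learned models.

```python
# Toy illustration of combining exact-match and latent-match evidence.
# Hand-set 2-d "embeddings" and a plain sum replace the learned components
# of the real model; everything here is an illustrative assumption.
import math

embeddings = {
    "laptop":   [0.9, 0.1],
    "notebook": [0.85, 0.2],
    "battery":  [0.1, 0.9],
    "charger":  [0.15, 0.85],
}

def centroid(terms: list[str]) -> list[float]:
    vecs = [embeddings[t] for t in terms if t in embeddings]
    return ([sum(v[i] for v in vecs) / len(vecs) for i in range(2)]
            if vecs else [0.0, 0.0])

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def duet_style_score(query: str, doc: str) -> float:
    q, d = query.lower().split(), doc.lower().split()
    exact = sum(1.0 for t in q if t in d) / len(q)      # exact-match evidence
    latent = cosine(centroid(q), centroid(d))           # latent-match evidence
    return exact + latent                                # learned mixing in a real model

print(duet_style_score("laptop battery", "notebook battery charger"))
print(duet_style_score("laptop battery", "garden furniture sale"))
```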

