Peer-to-peer techniques for web information retrieval and filtering

2006
Author(s):  
Χρήστος Τρυφωνόπουλος

Much information of interest to humans is today available on the Web. People can easily gain access to information, but at the same time they have to cope with the problem of information overload. Consequently, they have to rely on specialised tools and systems designed for searching, querying and retrieving information from the Web. Currently, Web search is controlled by a few search engines that bear the burden of keeping pace with this information explosion using centralised search infrastructures. Additionally, users are striving to stay informed by sifting through enormous amounts of new information, relying on tools and techniques that cannot capture the dynamic nature of the Web. In this setting, peer-to-peer Web search is an ideal candidate: it can offer adaptivity to high dynamics, scalability and resilience to failures, and it can extend the functionality of the traditional search engine with new features and services.

In this thesis, we study the problem of peer-to-peer resource sharing in wide-area networks such as the Internet and the Web. In the architecture that we envision, each peer owns resources which it is willing to share: documents, web pages or files that are appropriately annotated and queried using constructs from information retrieval models. There are two kinds of basic functionality that we expect this architecture to offer: information retrieval and information filtering (also known as publish/subscribe or information dissemination). The main focus of our work is on providing models and languages for expressing publications, queries and subscriptions, protocols that regulate peer interactions in this distributed environment, and indexing mechanisms that are utilised locally by each one of the peers.

Initially, we present three progressively more expressive data models, WP, AWP and AWPS, that are based on information retrieval concepts, together with their respective query languages. Then, we study the complexity of query satisfiability and entailment for models WP and AWP using techniques from propositional logic and computational complexity. Subsequently, we propose a peer-to-peer architecture designed to support full-fledged information retrieval and filtering functionality in a single unifying framework. In the context of this architecture, we focus on the problem of information filtering using the model AWPS, and present centralised and distributed algorithms for efficient, adaptive information filtering in a peer-to-peer environment. We use two levels of indexing to store queries submitted by users. The first level corresponds to the partitioning of the global query index across different peers using a distributed hash table as the underlying routing infrastructure. Each peer is responsible for a fraction of the submitted user queries through a mapping of attribute values to peer identifiers. The distributed hash table infrastructure is used to define the mapping scheme and also manages the routing of messages between different peers. Our set of protocols, collectively called DHTrie, extends the basic functionality of the distributed hash table to offer filtering functionality in a dynamic peer-to-peer environment. Additionally, the use of a self-maintainable routing table allows efficient communication between the peers, offering significantly lower network load and latency. This extra routing table uses only local information collected by each peer to speed up the retrieval and filtering process.
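
The first indexing level can be pictured with a small sketch. The following is a minimal illustration (not the DHTrie protocol itself) of how attribute values might be hashed to peer identifiers, assuming a hypothetical fixed set of peers; a real DHT such as Chord assigns keys to the closest peer on an identifier ring and handles churn.

```python
import hashlib

def peer_for_key(key: str, num_peers: int) -> int:
    """Hash an attribute-value key to a peer identifier (toy scheme)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_peers

def index_query(query_id: str, attribute_words, num_peers: int) -> dict:
    """Assign a user query to the peers responsible for its attribute words."""
    placements = {}
    for attribute, word in attribute_words:
        peer = peer_for_key(f"{attribute}:{word}", num_peers)
        placements.setdefault(peer, []).append(query_id)
    return placements

# A query over two attributes is stored at (up to) two different peers.
print(index_query("q42", [("title", "peer"), ("body", "filtering")], num_peers=64))
```
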
The second level of our indexing mechanism is managed locally by each peer, and is used for indexing the user queries the peer is responsible for. At this level of the index, each peer is able to store large numbers of user queries and match them against incoming documents. We have proposed data structures and local indexing algorithms that enable us to solve the filtering problem efficiently for large databases of queries. The main idea behind these algorithms is to store sets of words compactly by exploiting their common elements using trie-like data structures. Since these algorithms use heuristics to cluster user queries, we also consider the periodic re-organisation of the query database when the clustering of queries deteriorates. Our experimental results show the scalability and efficiency of the proposed algorithms in a dynamic setting. The distributed protocols manage to provide exact query answering functionality (precision and recall are the same as those of a centralised system) at a low network cost and low latency. Additionally, the local algorithms we have proposed outperform solutions in the current literature: our trie-based query indexing algorithms proved more than 20% faster than their counterparts, offering sophisticated clustering of user queries and mechanisms for the adaptive reorganisation of the query database when filtering performance drops.
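
A minimal sketch of the trie idea follows, with hypothetical names: the word set of each query is inserted along a path of sorted words, so queries sharing words share trie nodes, and a document matches every stored word set contained in its own. The published algorithms add clustering heuristics and re-organisation on top of this basic structure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # word -> TrieNode
        self.queries = []    # queries whose word set ends at this node

def insert_query(root: TrieNode, query_id: str, words) -> None:
    """Store a query's word set along a path of sorted words, so that
    queries with common words share a prefix of trie nodes."""
    node = root
    for word in sorted(set(words)):
        node = node.children.setdefault(word, TrieNode())
    node.queries.append(query_id)

def match_document(root: TrieNode, doc_words) -> list:
    """Find all queries whose word set is contained in the document."""
    doc = sorted(set(doc_words))
    matched = []
    def walk(node, start):
        matched.extend(node.queries)
        for i in range(start, len(doc)):
            child = node.children.get(doc[i])
            if child is not None:
                walk(child, i + 1)
    walk(root, 0)
    return matched
```
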

Author(s):  
Cédric Pruski ◽  
Nicolas Guelfi ◽  
Chantal Reynaud

Finding relevant information on the Web is difficult for most users. Although Web search applications are improving, they must become more “intelligent” to adapt to the search domains targeted by queries, the evolution of these domains, and users’ characteristics. In this paper, the authors present the TARGET framework for Web Information Retrieval. The proposed approach relies on ontologies of a particular nature, called adaptive ontologies, for representing both the search domain and a user’s profile. Unlike existing ontology-based approaches, the authors make adaptive ontologies adapt semi-automatically to the evolution of the modeled domain. The ontologies and their properties are exploited for domain-specific Web search purposes. The authors propose graph-based data structures for enriching Web data with semantics, and define an automatic query expansion technique that adapts a query to users’ real needs. The enriched query is evaluated over the graph-based data structures representing a set of Web pages returned by a conventional search engine, in order to extract the most relevant information according to user needs. The overall TARGET framework is formalized using first-order logic and fully supported by tools.
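
As a rough illustration of ontology-driven query expansion, the toy sketch below (hypothetical names and structure, not the TARGET formalization) enriches a query with concepts adjacent to its terms in an ontology graph:

```python
from collections import defaultdict

class Ontology:
    """A toy ontology: an undirected graph over domain concepts."""
    def __init__(self):
        self.related = defaultdict(set)  # concept -> semantically linked concepts

    def add_relation(self, a: str, b: str) -> None:
        self.related[a].add(b)
        self.related[b].add(a)

def expand_query(terms, ontology: Ontology, max_extra: int = 3) -> list:
    """Enrich a query with concepts adjacent to its terms in the ontology."""
    expanded = list(terms)
    for term in terms:
        for neighbor in sorted(ontology.related.get(term, set()))[:max_extra]:
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded

onto = Ontology()
onto.add_relation("jaguar", "big cat")
onto.add_relation("jaguar", "panthera")
print(expand_query(["jaguar", "habitat"], onto))
```
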


Author(s):  
Ioannis N. Kouris ◽  
Christos Makris ◽  
Evangelos Theodoridis ◽  
Athanasios Tsakalidis

Information retrieval is the computational discipline that deals with the efficient representation, organization, and access to information objects that represent natural language texts (Baeza-Yates & Ribeiro-Neto, 1999; Salton & McGill, 1983; Witten, Moffat, & Bell, 1999). A crucial subproblem in the information retrieval area is the design and implementation of efficient data structures and algorithms for indexing and searching information objects that are vaguely described. In this article, we present the latest developments in the indexing area, with special emphasis on data structures and algorithmic techniques for string manipulation, space-efficient implementations, and compression techniques for efficient storage of information objects. The aforementioned problems appear in a series of applications such as digital libraries, molecular sequence databases (DNA sequences, protein databases (Gusfield, 1997)), implementation of Web search engines, web mining, and information filtering.
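
As a small taste of this area, the sketch below builds a naive suffix array, one of the classic string-indexing structures in this literature; the construction trades efficiency for clarity (practical algorithms build the array in linear time):

```python
import bisect

def build_suffix_array(text: str) -> list:
    """Naive suffix array: start positions of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa: list, pattern: str) -> list:
    """Locate all occurrences of pattern via binary search on the suffix array."""
    # Compare the pattern against same-length prefixes of the sorted suffixes.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    lo = bisect.bisect_left(prefixes, pattern)
    hi = bisect.bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])

sa = build_suffix_array("banana")
print(find_occurrences("banana", sa, "ana"))  # [1, 3]
```
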


Author(s):  
HEINER STUCKENSCHMIDT

Web page categorization is an approach for improving precision and efficiency of information retrieval on the web by filtering out irrelevant pages. Current approaches to information filtering based on categorization assume the existence of a single classification hierarchy used for filtering. In this paper, we address the problem of filtering information categorized according to different classification hierarchies. We describe a method for approximating Boolean queries over class names across different class hierarchies.
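
For concreteness, a minimal sketch of the idea (with a hypothetical data layout, not the paper's exact method): a class name from one hierarchy is rewritten as a disjunction over classes of another hierarchy known to be narrower (a lower approximation) or broader (an upper approximation) than it.

```python
def approximate_class(class_name: str, mapping: dict, mode: str = "lower") -> str:
    """Rewrite a class name from one hierarchy as a Boolean disjunction
    over a second hierarchy, from below ("lower") or above ("upper")."""
    key = "narrower" if mode == "lower" else "broader"
    classes = mapping.get(class_name, {}).get(key, set())
    return " OR ".join(sorted(classes)) if classes else ""

# Hypothetical cross-hierarchy mapping.
mapping = {"Sports": {"narrower": {"Soccer", "Tennis"},
                      "broader": {"Recreation"}}}
print(approximate_class("Sports", mapping, "lower"))  # Soccer OR Tennis
print(approximate_class("Sports", mapping, "upper"))  # Recreation
```
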


Author(s):  
Xianghan Zheng ◽  
Vladimir Oleshchuk

Today, Peer-to-Peer SIP based communication systems have attracted much attention from both academia and industry. The decentralized nature of P2P makes it possible to build a distributed peer-to-peer communication system without the help of a traditional SIP server. However, decentralization comes at the cost of reduced manageability and creates new concerns. Until now, the main focus of research has been on the availability of the network and systems, while few attempts have been made to protect privacy. In this chapter, we investigate P2PSIP security issues and introduce two enhancement solutions: centralized security and distributed trust security, each of which has its own advantages and disadvantages. After that, we study an appropriate combination of these two approaches to obtain optimized protection. Our design is independent of the DHT (Distributed Hash Table) overlay technology. We take the Chord overlay as an example and then analyze the system in several respects: security and privacy, number of hops, message flows, etc.
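
For orientation on the Chord example used in the chapter, the sketch below (toy parameters, hypothetical names) shows the core of Chord-style key placement: identifiers are hashed onto a ring, and a key is served by the first peer at or clockwise after it.

```python
import hashlib

RING_BITS = 8  # toy identifier space of 2**8 positions

def chord_id(name: str) -> int:
    """Hash a peer name or resource key onto the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)

def successor(key_id: int, peer_ids: list) -> int:
    """The peer responsible for a key: the first peer clockwise from it."""
    ring = sorted(peer_ids)
    for peer in ring:
        if peer >= key_id:
            return peer
    return ring[0]  # wrap around the ring

peers = [chord_id(f"peer-{i}") for i in range(5)]
print(successor(chord_id("sip:alice@example.com"), peers))
```
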


Author(s):  
R. Subhashini ◽  
V.Jawahar Senthil Kumar

The World Wide Web is a large distributed digital information space. The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential. Information Retrieval (IR) plays an important role in search engines. Today’s most advanced engines use the keyword-based (“bag of words”) paradigm, which has inherent disadvantages. Organizing web search results into clusters facilitates the user’s quick browsing of the results. Traditional clustering techniques are inadequate because they do not generate clusters with highly readable names. This paper proposes an approach to clustering web search results based on a phrase-based clustering algorithm, as an alternative to the single ordered result list of search engines. This approach presents a list of clusters to the user. Experimental results verify the method’s feasibility and effectiveness.
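
A toy sketch of phrase-based grouping (hypothetical names; real systems such as Suffix Tree Clustering use suffix trees and merge overlapping base clusters): result snippets are grouped by shared phrases, and the phrase itself becomes the readable cluster label.

```python
from collections import defaultdict

def phrases(text: str, n: int = 2) -> set:
    """All word n-grams of a snippet (candidate cluster labels)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def cluster_results(snippets, n: int = 2, min_size: int = 2) -> dict:
    """Group search results by shared phrases; the phrase labels the cluster."""
    by_phrase = defaultdict(set)
    for idx, snippet in enumerate(snippets):
        for phrase in phrases(snippet, n):
            by_phrase[phrase].add(idx)
    return {phrase: sorted(members)
            for phrase, members in by_phrase.items()
            if len(members) >= min_size}

results = ["apple iphone review", "new apple iphone release", "banana bread recipe"]
print(cluster_results(results))  # {'apple iphone': [0, 1]}
```
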


Author(s):  
Zoltán Czirkos ◽  
Gábor Hosszú

In this chapter, the authors present a novel peer-to-peer based intrusion detection system called Komondor, and more specifically its internals regarding the underlying peer-to-peer transport layer. The novelty of our intrusion detection system is that it is composed of independent software instances running on different hosts, organized into a peer-to-peer network. The maintenance of this overlay network does not require any user interaction. The applied P2P overlay network model enables the nodes to communicate evenly over an unstable network. The base of our Komondor NIDS is a P2P network similar to Kademlia. To achieve high reliability and availability, we had to modify the Kademlia overlay network so that it would be resistant to network failures and support broadcast messages. The main purpose of this chapter is to present our modifications and enhancements to Kademlia.
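
For background, the sketch below (toy code, not the Komondor implementation) shows the Kademlia primitive the chapter builds on: closeness between identifiers is measured by XOR, and a lookup converges on the k nodes closest to a key.

```python
import hashlib

def node_id(name: str) -> int:
    """160-bit Kademlia-style identifier for a node or key."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    """Kademlia measures closeness as the XOR of two identifiers."""
    return a ^ b

def k_closest(key: int, nodes: list, k: int = 3) -> list:
    """The k nodes closest to a key; iterative lookups converge toward these."""
    return sorted(nodes, key=lambda n: xor_distance(n, key))[:k]

nodes = [node_id(f"host-{i}") for i in range(10)]
print(k_closest(node_id("alert:worm-signature"), nodes))
```
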


Author(s):  
Esharenana E. Adomi

The World Wide Web (WWW) has led to the advent of the information age. With increased demand for information from various quarters, the Web has turned out to be a veritable resource. Web surfers in the early days were frustrated by the delay in finding the information they needed. The first major leap for information retrieval came from the deployment of Web search engines such as Lycos, Excite, AltaVista, etc. The rapid growth in the popularity of the Web during the past few years has led to a precipitous pronouncement of death for the online services that preceded the Web in the wired world.


Author(s):  
Ji-Rong Wen

The Web is an open and free environment for people to publish and get information. Everyone on the Web can be an author, a reader, or both. The language of the Web, HTML (Hypertext Markup Language), is mainly designed for information display, not for semantic representation. Therefore, current Web search engines usually treat Web pages as unstructured documents, and traditional information retrieval (IR) technologies are employed for Web page parsing, indexing, and searching. The unstructured essence of Web pages seriously blocks more accurate search and advanced applications on the Web. For example, many sites contain structured information about various products. Extracting and integrating product information from multiple Web sites could lead to powerful search functions, such as comparison shopping and business intelligence. However, these structured data are embedded in Web pages, and there are no proper traditional methods to extract and integrate them. Another example is the link structure of the Web. If used properly, information hidden in the links can be exploited to improve search performance and take Web search beyond traditional information retrieval (Page, Brin, Motwani, & Winograd, 1998; Kleinberg, 1998).
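
To make the link-structure point concrete, here is a minimal power-iteration PageRank sketch in the spirit of Page et al. (1998); the graph encoding and parameter values are illustrative:

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[page] / n
            else:
                for q in outs:
                    new_rank[q] += damping * rank[page] / len(outs)
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))
```
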

