Web Clustering: Latest Research Papers

Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers, each of which must be matched to pages of one particular template. Selecting pages of a single template type from all crawled Web pages is a time-consuming task. Although there are methods to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally cluster Web pages by employing XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world Websites in a few minutes and achieving >90% accuracy.
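
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea: pages reached through links at the same XPath address are likely to share a template. The function names, the input format (pairs of anchor XPath and target URL), and the rule of stripping positional indices are all assumptions made for illustration, not the authors' method.

```python
import re
from collections import Counter, defaultdict


def generalize_xpath(xpath):
    """Drop positional predicates such as [3] so sibling links share one address."""
    return re.sub(r"\[\d+\]", "", xpath)


def cluster_by_inbound_xpath(links):
    """Group target URLs by the most frequent generalized XPath of their inbound links.

    links: iterable of (xpath_of_anchor, target_url) pairs (hypothetical input format).
    Returns a dict mapping each generalized XPath to a sorted list of target URLs.
    """
    votes = defaultdict(Counter)
    for xpath, target in links:
        votes[target][generalize_xpath(xpath)] += 1

    clusters = defaultdict(set)
    for target, counter in votes.items():
        # Assign each page to the XPath address that links to it most often.
        best_xpath, _ = counter.most_common(1)[0]
        clusters[best_xpath].add(target)
    return {xpath: sorted(urls) for xpath, urls in clusters.items()}


example_links = [
    ("/html/body/div[2]/ul/li[1]/a", "/product/1"),
    ("/html/body/div[2]/ul/li[2]/a", "/product/2"),
    ("/html/body/div[1]/nav/a[1]", "/about"),
]
example_clusters = cluster_by_inbound_xpath(example_links)
```

Because this sketch only reads link addresses and never compares full page structures, it makes a single pass over the links, which is one plausible reason an approach of this kind could scale to large crawls.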

Personalized Information Recommendation Based on Web Clustering

Lecture Notes in Electrical Engineering - Proceedings of the 9th International Symposium on Linear Drives for Industry Applications, Volume 2 ◽ 10.1007/978-3-642-40630-0_66 ◽ 2013 ◽ pp. 511-519

Author(s): Xiaoru Sun

Keyword(s): Information Recommendation ◽ Web Clustering

Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction

Computational Linguistics ◽ 10.1162/coli_a_00148 ◽ 2013 ◽ Vol 39 (3) ◽ pp. 709-754 ◽ Cited By ~ 62

Author(s): Antonio Di Marco ◽ Roberto Navigli

Keyword(s): Information Search ◽ Web Search ◽ Word Sense ◽ Clustering Methods ◽ Search Results ◽ Search Result ◽ Word Sense Induction ◽ Novel Approach ◽ Web Clustering ◽ Word Senses

Web search result clustering aims to facilitate information search on the Web. Rather than the results of a query being presented as a flat list, they are grouped on the basis of their similarity and subsequently shown to the user as a list of clusters. Each cluster is intended to represent a different meaning of the input query, thus taking into account the lexical ambiguity (i.e., polysemy) issue. Existing Web clustering methods typically rely on some shallow notion of textual similarity between search result snippets, however. As a result, text snippets with no word in common tend to be clustered separately even if they share the same meaning, whereas snippets with words in common may be grouped together even if they refer to different meanings of the input query. In this article we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction. Key to our approach is to first acquire the various senses (i.e., meanings) of an ambiguous query and then cluster the search results based on their semantic similarity to the word senses induced. Our experiments, conducted on data sets of ambiguous queries, show that our approach outperforms both Web clustering and search engines.
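
The second step described above, clustering results by their similarity to already-induced senses, can be illustrated with a small sketch. It assumes the senses are given as bags of related words and uses plain word overlap as a stand-in for the semantic similarity measure. The graph-based sense induction itself is not shown, and all names and data here are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a snippet and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def cluster_snippets(snippets, senses):
    """Assign each snippet to the sense whose word set it overlaps most.

    senses: dict mapping a sense label to a set of words (assumed to be
    produced beforehand by some word sense induction step).
    Snippets with no overlap at all go to the "unassigned" cluster.
    """
    clusters = {label: [] for label in senses}
    clusters["unassigned"] = []
    for snippet in snippets:
        words = tokenize(snippet)
        scores = {label: len(words & vocab) for label, vocab in senses.items()}
        best = max(scores, key=scores.get, default=None)
        if best is None or scores[best] == 0:
            clusters["unassigned"].append(snippet)
        else:
            clusters[best].append(snippet)
    return clusters


example_senses = {
    "animal": {"cat", "wild", "species"},
    "car": {"car", "engine", "luxury"},
}
example_snippets = [
    "Jaguar is a luxury car maker",
    "The jaguar is a wild cat species",
    "Jaguar news",
]
example_result = cluster_snippets(example_snippets, example_senses)
```

Note that the two snippets are separated here because they are compared with the sense vocabularies rather than with each other, which is the property the abstract contrasts with snippet-to-snippet textual similarity.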

The Research of Web Clustering Algorithm Based on Binarized Zipf's Law

Applied Mechanics and Materials ◽ 10.4028/www.scientific.net/amm.336-338.2152 ◽ 2013 ◽ Vol 336-338 ◽ pp. 2152-2156

Author(s): Rong Wang ◽ Bo Ping Zhang

Keyword(s): Clustering Algorithm ◽ Zipf's Law ◽ Time Property ◽ Web Clustering

This paper first analyzes Web clustering, then proposes describing objects with binary properties for user clustering. The time property is binarized using Zipf's Law, and the objects are then clustered using the ROCK algorithm. Experiments indicate that the approach clusters well.
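
As a rough illustration of the clustering step, the sketch below computes the two quantities that ROCK is built on: neighbors (pairs whose Jaccard similarity reaches a threshold) and links (the number of neighbors two points share). It assumes the binarization has already been done, so each user is represented by the set of properties equal to 1. It also treats a point as its own neighbor and omits the agglomerative merging phase. The data and names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rock_links(items, theta):
    """Count shared neighbors (ROCK "links") for every pair of items.

    items: dict mapping an item name to the set of binary properties that are 1.
    theta: similarity threshold in [0, 1] above which two items are neighbors.
    """
    names = list(items)
    neighbors = {
        n: {m for m in names if jaccard(items[n], items[m]) >= theta}
        for n in names
    }
    return {
        (p, q): len(neighbors[p] & neighbors[q])
        for p, q in combinations(names, 2)
    }


example_users = {
    "u1": {"morning", "news"},
    "u2": {"morning", "news", "sport"},
    "u3": {"news", "sport"},
    "u4": {"night", "games"},
}
example_link_counts = rock_links(example_users, theta=0.5)
```

In this toy data, u1 and u3 are not neighbors of each other but still receive one link through u2, which shows how link counts capture similarity that a direct pairwise comparison misses.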