scholarly journals Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction

2013 ◽  
Vol 39 (3) ◽  
pp. 709-754 ◽  
Author(s):  
Antonio Di Marco ◽  
Roberto Navigli

Web search result clustering aims to facilitate information search on the Web. Rather than the results of a query being presented as a flat list, they are grouped on the basis of their similarity and subsequently shown to the user as a list of clusters. Each cluster is intended to represent a different meaning of the input query, thus taking into account the lexical ambiguity (i.e., polysemy) issue. Existing Web clustering methods typically rely on some shallow notion of textual similarity between search result snippets, however. As a result, text snippets with no word in common tend to be clustered separately even if they share the same meaning, whereas snippets with words in common may be grouped together even if they refer to different meanings of the input query. In this article we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction. Key to our approach is to first acquire the various senses (i.e., meanings) of an ambiguous query and then cluster the search results based on their semantic similarity to the word senses induced. Our experiments, conducted on data sets of ambiguous queries, show that our approach outperforms both Web clustering and search engines.

Author(s):  
Reinald Kim Amplayo ◽  
Seung-won Hwang ◽  
Min Song

Word sense induction (WSI), or the task of automatically discovering multiple senses or meanings of a word, has three main challenges: domain adaptability, novel sense detection, and sense granularity flexibility. While current latent variable models are known to solve the first two challenges, they are not flexible to different word sense granularities, which differ very much among words, from aardvark with one sense, to play with over 50 senses. Current models either require hyperparameter tuning or nonparametric induction of the number of senses, which we find both to be ineffective. Thus, we aim to eliminate these requirements and solve the sense granularity problem by proposing AutoSense, a latent variable model based on two observations: (1) senses are represented as a distribution over topics, and (2) senses generate pairings between the target word and its neighboring word. These observations alleviate the problem by (a) throwing garbage senses and (b) additionally inducing fine-grained word senses. Results show great improvements over the stateof-the-art models on popular WSI datasets. We also show that AutoSense is able to learn the appropriate sense granularity of a word. Finally, we apply AutoSense to the unsupervised author name disambiguation task where the sense granularity problem is more evident and show that AutoSense is evidently better than competing models. We share our data and code here: https://github.com/rktamplayo/AutoSense.


2021 ◽  
Author(s):  
◽  
Daniel Wayne Crabtree

<p>This thesis investigates the refinement of web search results with a special focus on the use of clustering and the role of queries. It presents a collection of new methods for evaluating clustering methods, performing clustering effectively, and for performing query refinement. The thesis identifies different types of query, the situations where refinement is necessary, and the factors affecting search difficulty. It then analyses hard searches and argues that many of them fail because users and search engines have different query models. The thesis identifies best practice for evaluating web search results and search refinement methods. It finds that none of the commonly used evaluation measures for clustering meet all of the properties of good evaluation measures. It then presents new quality and coverage measures that satisfy all the desired properties and that rank clusterings correctly in all web page clustering situations. The thesis argues that current web page clustering methods work well when different interpretations of the query have distinct vocabulary, but still have several limitations and often produce incomprehensible clusters. It then presents a new clustering method that uses the query to guide the construction of semantically meaningful clusters. The new clustering method significantly improves performance. Finally, the thesis explores how searches and queries are composed of different aspects and shows how to use aspects to reduce the distance between the query models of search engines and users. It then presents fully automatic methods that identify query aspects, identify underrepresented aspects, and predict query difficulty. Used in combination, these methods have many applications — the thesis describes methods for two of them. The first method improves the search results for hard queries with underrepresented aspects by automatically expanding the query using semantically orthogonal keywords related to the underrepresented aspects. The second method helps users refine hard ambiguous queries by identifying the different query interpretations using a clustering of a diverse set of refinements. Both methods significantly outperform existing methods.</p>


2021 ◽  
Author(s):  
◽  
Daniel Wayne Crabtree

<p>This thesis investigates the refinement of web search results with a special focus on the use of clustering and the role of queries. It presents a collection of new methods for evaluating clustering methods, performing clustering effectively, and for performing query refinement. The thesis identifies different types of query, the situations where refinement is necessary, and the factors affecting search difficulty. It then analyses hard searches and argues that many of them fail because users and search engines have different query models. The thesis identifies best practice for evaluating web search results and search refinement methods. It finds that none of the commonly used evaluation measures for clustering meet all of the properties of good evaluation measures. It then presents new quality and coverage measures that satisfy all the desired properties and that rank clusterings correctly in all web page clustering situations. The thesis argues that current web page clustering methods work well when different interpretations of the query have distinct vocabulary, but still have several limitations and often produce incomprehensible clusters. It then presents a new clustering method that uses the query to guide the construction of semantically meaningful clusters. The new clustering method significantly improves performance. Finally, the thesis explores how searches and queries are composed of different aspects and shows how to use aspects to reduce the distance between the query models of search engines and users. It then presents fully automatic methods that identify query aspects, identify underrepresented aspects, and predict query difficulty. Used in combination, these methods have many applications — the thesis describes methods for two of them. The first method improves the search results for hard queries with underrepresented aspects by automatically expanding the query using semantically orthogonal keywords related to the underrepresented aspects. The second method helps users refine hard ambiguous queries by identifying the different query interpretations using a clustering of a diverse set of refinements. Both methods significantly outperform existing methods.</p>


2018 ◽  
Vol 7 (3.3) ◽  
pp. 90
Author(s):  
Sumathi Rani Manukonda ◽  
Asst.Prof Kmit ◽  
Narayanguda . ◽  
Hyderabad . ◽  
Nomula Divya ◽  
...  

Clustering the document in data mining is one of the traditional approach in which the same documents that are more relevant are grouped together. Document clustering take part in achieving accuracy that retrieve information for systems that identifies the nearest neighbors of the document. Day to day the massive quantity of data is being generated and it is clustered. According to particular sequence to improve the cluster qualityeven though different clustering methods have been introduced, still many challenges exist for the improvement of document clustering. For web search purposea document in group is efficiently arranged for the result retrieval.The users accordingly search query in an organized way. Hierarchical clustering is attained by document clustering.To the greatest algorithms for groupingdo not concentrate on the semantic approach, hence resulting to the unsatisfactory output clustering. The involuntary approach of organizing documents of web like Google, Yahoo is often considered as a reference. A distinct method to identify the existing group of similar things in the previously organized documents and retrieves effective document classifier for new documents. In this paper the main concentration is on hierarchical clustering and k-means algorithms, hence prove that k-means and its variant are efficient than hierarchical clustering along with this by implementing greedy fast k-means algorithm (GFA) for cluster document in efficient way is considered.  


1999 ◽  
Vol 08 (02) ◽  
pp. 137-156 ◽  
Author(s):  
CHING-CHI HSU ◽  
CHIA-HUI CHANG

This paper describes a Web information search tool called WebYacht. The goal of WebYacht is to solve the problem of imprecise search results in current Web search engines. Due to incomplete information given by users and the diversified information published on the Web, conventional document ranking based on an automatic assessment of document relevance to the query may not be the best approach when little information is given as in most cases. In order to clarify the ambiguity of the short queries given by users, WebYacht adopts cluster-based browsing model as well as relevance feedback to facilitate Web information search. The idea is to have users give two to three times more feedback in the same amount of time that would be required to give feedback for conventional feedback mechanisms. With the assistance of cluster-based representation provided by WebYacht, a lot of browsing labor can be reduced. In this paper, we explain the techniques used in the design of WebYacht and compare the performances of feedback interface designs and to conventional similarity ranking search results.


Author(s):  
Jesse Thomason ◽  
Raymond J. Mooney

A word in natural language can be polysemous, having multiple meanings, as well as synonymous, meaning the same thing as other words. Word sense induction attempts to find the senses of polysemous words. Synonymy detection attempts to find when two words are interchangeable. We combine these tasks, first inducing word senses and then detecting similar senses to form word-sense synonym sets (synsets) in an unsupervised fashion. Given pairs of images and text with noun phrase labels, we perform synset induction to produce collections of underlying concepts described by one or more noun phrases. We find that considering multi-modal features from both visual and textual context yields better induced synsets than using either context alone. Human evaluations show that our unsupervised, multi-modally induced synsets are comparable in quality to annotation-assisted ImageNet synsets, achieving about 84% of ImageNet synsets' approval.


Sign in / Sign up

Export Citation Format

Share Document