I/O-Conscious Data Preparation for Large-Scale Web Search Engines

Author(s):  
Maxim Lifantsev ◽  
Tzi-cker Chiueh
Author(s):  
Jon Atle Gulla ◽  
Hans Olaf Borch ◽  
Jon Espen Ingvaldsen

Due to the large amount of information on the web and the difficulty of relating users' expressed information needs to document content, large-scale web search engines tend to return thousands of ranked documents. This chapter discusses the use of clustering to help users navigate through the result sets and explore the domain. A newly developed system, HOBSearch, makes use of suffix tree clustering to overcome many of the weaknesses of traditional clustering approaches. Using result snippets rather than full documents, HOBSearch both speeds up clustering substantially and manages to tailor the clustering to the topics indicated by the user's query. An inherent problem with clustering, though, is the choice of cluster labels. Our experiments with HOBSearch show that cluster labels of acceptable quality can be generated with no supervision or predefined structures and within the constraints given by large-scale web search.
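To make the idea concrete, the sketch below shows a simplified phrase-based grouping of result snippets that approximates the base-cluster step of suffix tree clustering: shared phrases become candidate clusters, and the phrase itself doubles as the cluster label. The snippet data and parameter names are hypothetical and are not taken from HOBSearch.

```python
from collections import defaultdict

def base_clusters(snippets, min_len=2, max_len=4, min_docs=2):
    """Group snippets by shared word phrases (n-grams).

    Approximates the base-cluster step of suffix tree clustering: every
    phrase occurring in at least `min_docs` snippets becomes a candidate
    cluster, labeled by the phrase itself.
    """
    phrase_to_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                phrase_to_docs[phrase].add(doc_id)
    # Keep phrases shared by enough snippets; prefer clusters that cover
    # more snippets and, at equal coverage, longer (more specific) labels.
    clusters = {p: d for p, d in phrase_to_docs.items() if len(d) >= min_docs}
    return sorted(clusters.items(), key=lambda kv: (len(kv[1]), len(kv[0])), reverse=True)

# Hypothetical result snippets for the ambiguous query "jaguar"
snippets = [
    "jaguar cars announces new electric model",
    "the jaguar is a large cat native to the americas",
    "jaguar cars reports record sales this quarter",
    "habitat loss threatens the jaguar, a large cat of the americas",
]
for label, docs in base_clusters(snippets)[:5]:
    print(f"{label!r}: snippets {sorted(docs)}")
```

Because only short snippets are scanned rather than full documents, this kind of grouping stays within the latency budget of an interactive search result page, which is the trade-off the chapter highlights.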


Author(s):  
Alonso Inostrosa-Psijas ◽  
Gabriel Wainer ◽  
Veronica Gil-Costa ◽  
Mauricio Marin

2012 ◽  
Vol 3 (3) ◽  
pp. 255-268 ◽  
Author(s):  
Ioannis Papadakis ◽  
Michalis Stefanidakis ◽  
Sofia Stamou ◽  
Ioannis Andreou

2021 ◽  
pp. 089443932110068
Author(s):  
Aleksandra Urman ◽  
Mykola Makhortykh ◽  
Roberto Ulloa

We examine how six search engines filter and rank information in relation to queries on the 2020 U.S. presidential primary elections under default (i.e., nonpersonalized) conditions. For that, we utilize an algorithmic auditing methodology that uses virtual agents to conduct large-scale analysis of algorithmic information curation in a controlled environment. Specifically, we look at the text search results for the queries "us elections," "donald trump," "joe biden," and "bernie sanders" on Google, Baidu, Bing, DuckDuckGo, Yahoo, and Yandex during the 2020 primaries. Our findings indicate substantial differences in the search results between search engines and multiple discrepancies within the results generated for different agents using the same search engine. This highlights that whether users see certain information is effectively decided by chance due to the inherent randomization of search results. We also find that some search engines prioritize different categories of information sources with respect to specific candidates. These observations demonstrate that algorithmic curation of political information can create information inequalities between search engine users even under nonpersonalized conditions. Such inequalities are particularly troubling considering that search results are highly trusted by the public and can shift the opinions of undecided voters, as demonstrated by previous research.
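A minimal sketch of the kind of comparison such an audit involves is shown below: identically configured agents issue the same query, and the overlap between their ranked result lists is measured. The agent names, URLs, and the use of Jaccard similarity are illustrative assumptions, not the authors' actual analysis pipeline.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets of result URLs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical top results collected by three virtual agents issuing the
# same query to the same engine at the same time.
agent_results = {
    "agent_1": ["nytimes.com/a", "cnn.com/b", "foxnews.com/c", "wikipedia.org/d"],
    "agent_2": ["cnn.com/b", "nytimes.com/a", "wikipedia.org/d", "reuters.com/e"],
    "agent_3": ["foxnews.com/c", "breitbart.com/f", "cnn.com/b", "nytimes.com/a"],
}

# Pairwise overlap well below 1.0 indicates that identically configured
# agents are shown different results for the same query, i.e., the
# randomization effect the study reports.
for (name_a, res_a), (name_b, res_b) in combinations(agent_results.items(), 2):
    print(f"{name_a} vs {name_b}: Jaccard = {jaccard(res_a, res_b):.2f}")
```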

