The Matter of Chance: Auditing Web Search Results Related to the 2020 U.S. Presidential Primary Elections Across Six Search Engines

2021 ◽  
pp. 089443932110068
Author(s):  
Aleksandra Urman ◽  
Mykola Makhortykh ◽  
Roberto Ulloa

We examine how six search engines filter and rank information in relation to queries on the 2020 U.S. presidential primary elections under default, that is, nonpersonalized, conditions. To do so, we utilize an algorithmic auditing methodology that uses virtual agents to conduct a large-scale analysis of algorithmic information curation in a controlled environment. Specifically, we look at the text search results for the queries “us elections,” “donald trump,” “joe biden,” and “bernie sanders” on Google, Baidu, Bing, DuckDuckGo, Yahoo, and Yandex during the 2020 primaries. Our findings indicate substantial differences in the search results between search engines and multiple discrepancies within the results generated for different agents using the same search engine. These discrepancies highlight that whether users see certain information is decided by chance owing to the inherent randomization of search results. We also find that some search engines prioritize different categories of information sources with respect to specific candidates. These observations demonstrate that algorithmic curation of political information can create information inequalities between search engine users even under nonpersonalized conditions. Such inequalities are particularly troubling considering that search results are highly trusted by the public and, as previous research has demonstrated, can shift the opinions of undecided voters.
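
To make the audit logic concrete, here is a minimal sketch (not the authors' actual toolchain) of how agreement between identically configured agents could be measured; the toy result lists are invented for illustration:

```python
# Minimal sketch of the audit logic. Toy result lists stand in for what
# identical virtual agents scraped from the same engine for the same query
# at the same time.
from itertools import combinations

def mean_pairwise_jaccard(result_lists):
    """Mean Jaccard overlap of the result sets seen by each pair of agents."""
    scores = [len(set(a) & set(b)) / len(set(a) | set(b))
              for a, b in combinations(result_lists, 2)]
    return sum(scores) / len(scores)

# Four "identical" agents, same engine, same query: anything below 1.0
# reflects randomization rather than personalization.
agents = [
    ["cnn.com/a", "nyt.com/b", "fox.com/c", "bbc.com/d"],
    ["cnn.com/a", "fox.com/c", "nyt.com/b", "wapo.com/e"],
    ["nyt.com/b", "cnn.com/a", "bbc.com/d", "fox.com/c"],
    ["cnn.com/a", "nyt.com/b", "wapo.com/e", "bbc.com/d"],
]
print(f"mean agent agreement: {mean_pairwise_jaccard(agents):.2f}")
```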

Author(s):  
Shanfeng Zhu ◽  
Xiaotie Deng ◽  
Qizhi Fang ◽  
Weimin Zhang

Web search engines are among the most popular services for helping users find useful information on the Web. Although many studies have estimated the size and overlap of general web search engines, these estimates may not benefit ordinary web users, who care more about the overlap of the top N (N = 10, 20, or 50) search results for concrete queries than about the overlap of the total index databases. In this study, we present experimental results comparing the overlap of the top N (N = 10, 20, or 50) search results from AlltheWeb, Google, AltaVista, and WiseNut for the 58 most popular queries, as well as the rank distance of the overlapping results. These 58 queries were chosen from the WordTracker service, which records the most popular queries submitted to well-known metasearch engines such as MetaCrawler and Dogpile. We divide the 58 queries into three categories for further investigation. Through this in-depth study, we observe a number of interesting results: the overlap of the top N results retrieved by different search engines is very small; the search results for queries in different categories behave in dramatically different ways; Google, on average, has the highest overlap among the four search engines; and each search engine tends to adopt its own ranking algorithm independently.
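
A minimal sketch of the two measures described above, under the assumption that overlap is the shared fraction of the top N and distance is the mean rank difference of shared URLs (the paper's exact formulas may differ):

```python
def top_n_overlap(a, b, n=10):
    """Fraction of engine A's top-n results that also appear in engine B's top-n."""
    return len(set(a[:n]) & set(b[:n])) / n

def mean_rank_distance(a, b, n=10):
    """Mean absolute rank difference of the URLs the two top-n lists share."""
    shared = set(a[:n]) & set(b[:n])
    return (sum(abs(a.index(u) - b.index(u)) for u in shared) / len(shared)
            if shared else float("nan"))

# Toy top-10 lists for one query on two engines.
google    = ["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8", "u9", "u10"]
altavista = ["u3", "u11", "u1", "u12", "u13", "u14", "u15", "u16", "u17", "u18"]
print(top_n_overlap(google, altavista), mean_rank_distance(google, altavista))
# 0.2 2.0 -> tiny overlap, and the shared URLs sit at different ranks
```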


Author(s):  
Gloria Bordogna ◽  
Alessandro Campi ◽  
Giuseppe Psaila ◽  
Stefania Ronchi

In this chapter, the authors propose a novel multi-granular framework for visualizing and exploring the results of a complex search process in which a user submits several queries to possibly distinct search engines. The primary aim of the approach is to supply users with summaries, at distinct levels of detail, of the results of a search process. It applies dynamic clustering to the results in each ordered list retrieved by a search engine evaluating a user’s query. The single retrieved items, the clusters so identified, and the single retrieved lists are treated as topics at distinct levels of granularity, from the finest to the coarsest, respectively. Implicit topics are revealed by associating labels with the retrieved items, the clusters, and the retrieved lists. Then, manipulation operators defined in this chapter are applied to each pair of retrieved lists, clusters, and single items to reveal their implicit relationships. These relationships have a semantic nature, since they are labeled to approximately represent the shared documents and shared sub-topics of each pair of combined elements. Finally, both the topics retrieved by the distinct searches and their relationships are represented through multi-granular graphs that show the retrieved topics at three distinct levels of granularity. The results can be explored by expanding the graph nodes to see their contents and by expanding the edges to see their shared contents and common sub-topics.
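
As an illustration, a minimal sketch of one such manipulation operator, intersection over two retrieved lists; the naive common-term labeling here is a stand-in for the chapter's actual labeling technique:

```python
# Illustrative "intersection" operator over two retrieved lists (or clusters).
# Each argument maps a document URL to its set of descriptive terms.
def intersect(list_a: dict[str, set[str]], list_b: dict[str, set[str]]):
    shared_docs = list_a.keys() & list_b.keys()
    # Label the relationship with the terms common to all shared documents.
    labels = set.intersection(*(list_a[d] for d in shared_docs)) if shared_docs else set()
    return shared_docs, labels

list_a = {"doc1": {"jaguar", "car"}, "doc2": {"jaguar", "cat"}}
list_b = {"doc2": {"jaguar", "cat"}, "doc3": {"mac", "os"}}
docs, labels = intersect(list_a, list_b)
print(docs, labels)  # {'doc2'} plus the shared sub-topic terms jaguar/cat
```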


2013 ◽  
Vol 284-287 ◽  
pp. 3375-3379
Author(s):  
Chun Hsiung Tseng ◽  
Fu Cheng Yang ◽  
Yu Ping Tseng ◽  
Yi Yun Chang

Most Web users today rely heavily on search engines to gather information. To achieve better search results, algorithms such as PageRank have been developed. However, most Web search engines employ keyword-based search and thus have some inherent weaknesses. A well-known problem is that it is very difficult for search engines to infer semantics from user queries and returned results. Hence, despite efforts to rank search results, users may still have to navigate through a huge number of Web pages to locate the desired resources. In this research, the researchers developed a clustering-based methodology to improve the performance of search engines. Instead of extracting the features used for clustering from the returned documents themselves, the proposed method extracts features from the Delicious service, a social tagging provider. By utilizing such information, the resulting system can benefit from crowd intelligence. The obtained information is then used to enhance the ordinary k-means algorithm and achieve better clustering results.
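
A minimal sketch of the idea, assuming tag data is already available per URL (the original Delicious API is no longer live); scikit-learn's plain k-means stands in for the enhanced variant the paper describes:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in: crowd-assigned tags for each search result, of the kind a
# tag provider such as Delicious would supply for the returned URLs.
tagged_results = {
    "url1": "python programming tutorial",
    "url2": "python snake biology",
    "url3": "programming language python code",
    "url4": "snake reptile biology animal",
}
X = CountVectorizer().fit_transform(tagged_results.values())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(tagged_results, labels)))  # cluster id per result URL
```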


2019 ◽  
Vol 8 (2S8) ◽  
pp. 1541-1545

Search engine spam is created by spammers for commercial gain. Spammers apply various strategies to web pages so that they appear on the first page of web search results. These strategies can keep good-quality web pages from appearing at the top of the search engine results page. Although numerous algorithms have been devised to identify search engine spam, search engines are still affected by it, so the search engine industry needs to filter search engine spam as effectively as possible. The proposed study identifies spam in web search engines. Spammers tend to use the most popular search keywords, popular links, and advertising keywords in web pages; this strategy helps increase a page's ranking so that it appears at the top of the search results. The proposed method uses important features to detect spam pages, which are classified using the C4.5 decision tree classifier. This method produces better performance than existing classification methods.
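
A minimal sketch of the classification step; scikit-learn has no C4.5 implementation, so an entropy-criterion tree is used as a stand-in, and the three features are illustrative assumptions rather than the paper's exact feature set:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Each row: [popular-keyword count, popular-link count, ad-keyword count]
X = [[12, 30, 8], [1, 2, 0], [15, 25, 10], [0, 1, 1], [9, 40, 7], [2, 3, 0]]
y = [1, 0, 1, 0, 1, 0]  # 1 = spam page, 0 = legitimate page

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_tr, y_tr)
print(clf.predict(X_te))  # predicted spam/legitimate labels for held-out pages
```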


2019 ◽  
Vol 71 (3) ◽  
pp. 310-324
Author(s):  
Dirk Lewandowski ◽  
Sebastian Sünkler

Purpose
The purpose of this paper is to describe a new method to improve the analysis of search engine results by considering the provider level as well as the domain level. This approach is tested by conducting a study using queries on the topic of insurance comparisons.

Design/methodology/approach
The authors conducted an empirical study that analyses the results of search queries aimed at comparing insurance companies. The authors used a self-developed software system that automatically queries commercial search engines and automatically extracts the content of the returned result pages for further data analysis. The data analysis was carried out using the KNIME Analytics Platform.

Findings
Google's top search results are served by only a few providers that frequently appear in these results. The authors show that some providers operate several domains on the same topic and that these domains appear for the same queries in the result lists.

Research limitations/implications
The authors demonstrate the feasibility of the approach and draw conclusions for further investigations from the empirical study. However, the study is a limited use case based on a limited number of search queries.

Originality/value
The proposed method allows large-scale analysis of the composition of the top results from commercial search engines. It allows valid empirical data to be used to determine what users actually see on the search engine result pages.
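
A minimal sketch of the provider-level aggregation, assuming a hand-built domain-to-provider mapping (compiling that mapping is the manual part of the method; the entries below are hypothetical):

```python
from collections import Counter
from urllib.parse import urlparse

DOMAIN_TO_PROVIDER = {  # hypothetical mapping, built by hand in practice
    "check24.de": "Check24", "check24.net": "Check24", "verivox.de": "Verivox",
}

def provider_counts(result_urls: list[str]) -> Counter:
    """Collapse result domains to their operating provider, then count."""
    counts = Counter()
    for url in result_urls:
        domain = urlparse(url).netloc.removeprefix("www.")
        counts[DOMAIN_TO_PROVIDER.get(domain, domain)] += 1
    return counts

print(provider_counts([
    "https://www.check24.de/kfz", "https://check24.net/haftpflicht",
    "https://www.verivox.de/kfz",
]))  # Counter({'Check24': 2, 'Verivox': 1}): two domains, one provider
```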


2019 ◽  
Vol 71 (1) ◽  
pp. 54-71 ◽  
Author(s):  
Artur Strzelecki

Purpose
The purpose of this paper is to clarify how many removal requests are made, how often, and who makes these requests, as well as which websites are reported to search engines so they can be removed from the search results.

Design/methodology/approach
Undertakes a deep analysis of more than 3.2bn pages removed from Google's search results at the request of reporting organizations from 2011 to 2018 and over 460m pages removed from Bing's search results at the request of reporting organizations from 2015 to 2017. The paper focuses on pages that belong to the .pl country-code top-level domain (ccTLD).

Findings
Although the number of requests to remove data from search results has been growing year on year, fewer URLs have been reported in recent years. Some requests are, however, unjustified and are rejected by the teams representing the search engines. In terms of reporting copyright violations, one company in particular stands out (AudioLock.Net), accounting for 28.1 percent of all reports sent to Google (the top ten companies combined were responsible for 61.3 percent of the total number of reports).

Research limitations/implications
As not every request can be published, the study is based only on what is publicly available. Also, the data assigned to Poland is based only on the ccTLD domain name (.pl); other domain extensions used by Polish internet users were not considered.

Originality/value
This is the first global analysis of data from the transparency reports published by search engine companies, as prior research has been based on specific notices.
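
A minimal sketch of the kind of aggregation behind these findings; the CSV column names are assumptions, since the actual Google and Bing transparency-report exports are formatted differently:

```python
import csv
from collections import Counter
from urllib.parse import urlparse

def top_reporters(path: str, n: int = 10) -> list[tuple[str, float]]:
    """Share of removal requests per reporting organization, .pl URLs only."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if urlparse(row["url"]).netloc.endswith(".pl"):  # ccTLD filter
                counts[row["reporting_organization"]] += 1
    total = sum(counts.values())
    return [(org, 100 * c / total) for org, c in counts.most_common(n)]

# Usage (hypothetical file): top_reporters("google_copyright_requests.csv")
```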


Author(s):  
Novario Jaya Perdana

The accuracy of search results from a search engine depends on the keywords used. A lack of information in the keywords can reduce the accuracy of the search results, which makes searching for information on the internet hard work. In this research, software was built to create document keyword sequences. The software uses the Google Latent Semantic Distance, which can extract relevant information from a document. The information is expressed in the form of specific word sequences that can be used as keyword recommendations in search engines. The results show that the implemented method for creating document keyword recommendations achieves high accuracy and finds the most relevant information in the top search results.
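
The abstract does not spell out the Google Latent Semantic Distance itself; as an illustration of hit-count-based semantic distance in general, here is the related, well-known Normalized Google Distance (Cilibrasi and Vitányi), with toy hit counts:

```python
import math

def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance from search hit counts:
    NGD(x, y) = (max(log fx, log fy) - log fxy) / (log N - min(log fx, log fy)),
    where fx, fy are hits for each term, fxy for both, N the pages indexed."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Toy hit counts: terms that frequently co-occur get a small distance.
print(ngd(fx=9_000_000, fy=8_000_000, fxy=6_000_000, n=25_000_000_000))
```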


2021 ◽  
Author(s):  
Daniel Wayne Crabtree

This thesis investigates the refinement of web search results, with a special focus on the use of clustering and the role of queries. It presents a collection of new methods for evaluating clustering methods, for performing clustering effectively, and for performing query refinement. The thesis identifies different types of query, the situations where refinement is necessary, and the factors affecting search difficulty. It then analyses hard searches and argues that many of them fail because users and search engines have different query models. The thesis identifies best practice for evaluating web search results and search refinement methods. It finds that none of the commonly used evaluation measures for clustering meet all of the properties of good evaluation measures. It then presents new quality and coverage measures that satisfy all the desired properties and that rank clusterings correctly in all web page clustering situations. The thesis argues that current web page clustering methods work well when different interpretations of the query have distinct vocabulary, but still have several limitations and often produce incomprehensible clusters. It then presents a new clustering method that uses the query to guide the construction of semantically meaningful clusters. The new clustering method significantly improves performance. Finally, the thesis explores how searches and queries are composed of different aspects and shows how to use aspects to reduce the distance between the query models of search engines and users. It then presents fully automatic methods that identify query aspects, identify underrepresented aspects, and predict query difficulty. Used in combination, these methods have many applications; the thesis describes methods for two of them. The first method improves the search results for hard queries with underrepresented aspects by automatically expanding the query using semantically orthogonal keywords related to the underrepresented aspects. The second method helps users refine hard ambiguous queries by identifying the different query interpretations using a clustering of a diverse set of refinements. Both methods significantly outperform existing methods.
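
A minimal sketch of that last idea: clustering a diverse set of refinements so that each cluster suggests one interpretation of an ambiguous query. TF-IDF plus k-means is a generic stand-in here, not the thesis' own clustering method:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Diverse refinements of the ambiguous query "jaguar"; each cluster that
# emerges corresponds to one interpretation (car, animal, operating system).
refinements = [
    "jaguar car dealer", "jaguar xf price",
    "jaguar animal habitat", "jaguar big cat facts",
    "jaguar os download", "jaguar mac os requirements",
]
X = TfidfVectorizer().fit_transform(refinements)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for label, refinement in sorted(zip(labels, refinements)):
    print(label, refinement)  # refinements grouped by interpretation
```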

