Building a Post-Search Academic Search Engine Based on a Serial of Clustering Methods

2013 ◽ Vol 284-287 ◽ pp. 3051-3055
Author(s): Lin-Chih Chen

Academic search engines, such as Google Scholar and Scirus, provide a Web-based interface that helps researchers find relevant scientific articles effectively. However, current academic search engines lack the ability to cluster the search results into a hierarchical tree structure. In this paper, we develop a post-search academic search engine using a mixed clustering method. In this method, we first adopt suffix tree clustering and a two-way hash mechanism to generate all meaningful labels. We then develop a divisive hierarchical clustering algorithm to organize the labels into a hierarchical tree. According to the experimental results, we conclude that using our mixed clustering method to cluster the search results gives significant performance gains over current academic search engines. This paper makes two contributions. First, we present a high-performance academic search engine based on our mixed clustering method. Second, we develop a divisive hierarchical clustering algorithm that organizes all returned search results into a hierarchical tree structure.
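The paper's implementation is not reproduced here. As a rough illustration of its two stages, the following minimal Python sketch uses shared word n-grams to stand in for the suffix-tree and two-way-hash label generation (an assumption, not the authors' code) and then splits the document set divisively on the best-covering label:

```python
from collections import defaultdict

def extract_labels(snippets, max_len=3, min_docs=2):
    """Map candidate phrase labels to the set of documents containing them.
    Shared word n-grams approximate the paper's suffix-tree + two-way-hash
    label generation (an assumption, not the authors' implementation)."""
    label_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                label_docs[" ".join(words[i:i + n])].add(doc_id)
    return {lbl: docs for lbl, docs in label_docs.items() if len(docs) >= min_docs}

def build_tree(docs, label_docs):
    """Divisively split a document set on the label that best covers a
    proper subset: matching documents form one branch, the rest the other."""
    candidates = {l: d & docs for l, d in label_docs.items()
                  if 0 < len(d & docs) < len(docs)}
    if len(docs) <= 1 or not candidates:
        return sorted(docs)
    label = max(candidates, key=lambda l: len(candidates[l]))
    inside, outside = candidates[label], docs - candidates[label]
    remaining = {l: d for l, d in label_docs.items() if l != label}
    return {label: build_tree(inside, remaining),
            "other": build_tree(outside, remaining)}

snippets = ["suffix tree clustering of search results",
            "hierarchical clustering of search results",
            "academic search engine interface"]
labels = extract_labels(snippets)
print(build_tree(set(range(len(snippets))), labels))
# {'clustering': [0, 1], 'other': [2]}
```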

2012 ◽ pp. 274-290
Author(s): Lin-Chih Chen

Result clustering has recently attracted considerable attention as a way to give users a more succinct overview of relevant search results than traditional search engines provide. This chapter proposes a mixed clustering method to organize all returned search results into a hierarchical tree structure. The clustering method accomplishes two main tasks: label construction and tree building. The chapter uses precision to measure the quality of the clustering results. Based on the experimental results, the author preliminarily concludes that the system performs better than many other well-known commercial and academic systems. This chapter makes several contributions. First, it presents a high-performance system based on the clustering method. Second, it develops a divisive hierarchical clustering algorithm to organize all returned snippets into a hierarchical tree structure. Third, it performs a wide range of experimental analyses showing that almost all commercial systems are significantly better than most current academic systems.
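The chapter's exact precision protocol is not spelled out in this abstract; one common per-cluster formulation (an assumption) is the fraction of a cluster's snippets that are relevant to its label, sketched here:

```python
def cluster_precision(cluster, relevant):
    """Fraction of documents in a cluster that are relevant
    (a common reading of precision for clustering; an assumption,
    since the chapter's exact protocol is not given here)."""
    return len(set(cluster) & set(relevant)) / len(cluster) if cluster else 0.0

print(cluster_precision(["d1", "d2", "d3"], {"d1", "d3"}))  # 0.666...
```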


2021 ◽ pp. 089443932110068
Author(s): Aleksandra Urman ◽ Mykola Makhortykh ◽ Roberto Ulloa

We examine how six search engines filter and rank information in relation to queries on the 2020 U.S. presidential primary elections under default (that is, nonpersonalized) conditions. To do so, we utilize an algorithmic auditing methodology that uses virtual agents to conduct a large-scale analysis of algorithmic information curation in a controlled environment. Specifically, we look at the text search results for the queries "us elections," "donald trump," "joe biden," and "bernie sanders" on Google, Baidu, Bing, DuckDuckGo, Yahoo, and Yandex during the 2020 primaries. Our findings indicate substantial differences in the search results between search engines and multiple discrepancies within the results generated for different agents using the same search engine. This highlights that whether users see certain information is decided by chance due to the inherent randomization of search results. We also find that some search engines prioritize different categories of information sources with respect to specific candidates. These observations demonstrate that algorithmic curation of political information can create information inequalities between search engine users even under nonpersonalized conditions. Such inequalities are particularly troubling considering that search results are highly trusted by the public and, as previous research has demonstrated, can shift the opinions of undecided voters.
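The audit infrastructure itself is not shown in this abstract. As a hedged sketch of the kind of between-agent agreement analysis described (hypothetical data, and a simple top-N Jaccard overlap rather than the authors' exact measures):

```python
from itertools import combinations

def jaccard_top_n(results_a, results_b, n=10):
    """Jaccard overlap of the top-n result URLs returned to two agents."""
    a, b = set(results_a[:n]), set(results_b[:n])
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical top results seen by three virtual agents for one query.
agents = {
    "agent1": ["cnn.com", "nytimes.com", "wikipedia.org"],
    "agent2": ["cnn.com", "foxnews.com", "wikipedia.org"],
    "agent3": ["nytimes.com", "cnn.com", "wikipedia.org"],
}
for (n1, r1), (n2, r2) in combinations(agents.items(), 2):
    print(n1, n2, round(jaccard_top_n(r1, r2), 2))
```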


2019 ◽ Vol 71 (1) ◽ pp. 54-71
Author(s): Artur Strzelecki

Purpose
The purpose of this paper is to clarify how many removal requests are made, how often, and who makes these requests, as well as which websites are reported to search engines so they can be removed from the search results.

Design/methodology/approach
The paper undertakes a deep analysis of more than 3.2bn pages removed from Google's search results at the request of reporting organizations from 2011 to 2018, and of over 460m pages removed from Bing's search results at the request of reporting organizations from 2015 to 2017. The paper focuses on pages that belong to the .pl country-code top-level domain (ccTLD).

Findings
Although the number of requests to remove data from search results has been growing year on year, fewer URLs have been reported in recent years. Some of the requests are, however, unjustified and are rejected by the teams representing the search engines. In terms of reporting copyright violations, one company in particular stands out (AudioLock.Net), accounting for 28.1 percent of all reports sent to Google (the top ten companies combined were responsible for 61.3 percent of the total number of reports).

Research limitations/implications
As not every request can be published, the study is based only on what is publicly available. Also, the data assigned to Poland is based only on the ccTLD domain name (.pl); other domain extensions used by Polish internet users were not considered.

Originality/value
This is the first global analysis of data from transparency reports published by search engine companies, as prior research has been based on specific notices.
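As an illustration of the kind of per-organization aggregation behind figures such as the AudioLock.Net share (hypothetical data; the transparency-report parsing itself is omitted):

```python
from collections import Counter

# Hypothetical (organization, reported-URL count) records; the real data
# would come from Google's and Bing's published transparency reports.
reports = [("AudioLock.Net", 900), ("OrgB", 400), ("OrgC", 300), ("OrgD", 150)]

totals = Counter()
for org, n_urls in reports:
    totals[org] += n_urls
grand_total = sum(totals.values())
for org, n in totals.most_common():
    print(f"{org}: {100 * n / grand_total:.1f}% of reported URLs")
```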


Author(s): Novario Jaya Perdana

The accuracy of search results from a search engine depends on the keywords used. A lack of information in the keywords can reduce the accuracy of the search results, which makes finding information on the internet hard work. In this research, software has been built to create document keyword sequences. The software uses Google Latent Semantic Distance, which can extract relevant information from a document. The information is expressed in the form of specific word sequences that can be used as keyword recommendations in search engines. The results show that the implemented method for creating document keyword recommendations achieves high accuracy and finds the most relevant information in the top search results.
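Google Latent Semantic Distance is not defined in this abstract. As a related, well-known formulation, the Normalized Google Distance (Cilibrasi and Vitányi) estimates the semantic distance between two terms from search hit counts; a sketch follows (the hit counts are hypothetical, and the paper's measure may differ):

```python
from math import log

def normalized_google_distance(f_x, f_y, f_xy, n_pages):
    """Normalized Google Distance from hit counts. Shown as a related
    formulation only; the paper's Google Latent Semantic Distance may
    differ. f_x, f_y: hits for each term; f_xy: hits for both terms
    together; n_pages: total number of indexed pages."""
    num = max(log(f_x), log(f_y)) - log(f_xy)
    den = log(n_pages) - min(log(f_x), log(f_y))
    return num / den

# Hypothetical hit counts for two candidate keywords.
print(normalized_google_distance(f_x=1e6, f_y=5e5, f_xy=2e5, n_pages=5e10))
```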


2021
Author(s): Daniel Wayne Crabtree

This thesis investigates the refinement of web search results, with a special focus on the use of clustering and the role of queries. It presents a collection of new methods for evaluating clustering methods, performing clustering effectively, and performing query refinement. The thesis identifies different types of query, the situations where refinement is necessary, and the factors affecting search difficulty. It then analyses hard searches and argues that many of them fail because users and search engines have different query models.

The thesis identifies best practice for evaluating web search results and search refinement methods. It finds that none of the commonly used evaluation measures for clustering meets all of the properties of good evaluation measures. It then presents new quality and coverage measures that satisfy all the desired properties and rank clusterings correctly in all web page clustering situations.

The thesis argues that current web page clustering methods work well when different interpretations of the query have distinct vocabulary, but they still have several limitations and often produce incomprehensible clusters. It then presents a new clustering method that uses the query to guide the construction of semantically meaningful clusters; the new method significantly improves performance.

Finally, the thesis explores how searches and queries are composed of different aspects and shows how to use aspects to reduce the distance between the query models of search engines and users. It then presents fully automatic methods that identify query aspects, identify underrepresented aspects, and predict query difficulty. Used in combination, these methods have many applications; the thesis describes methods for two of them. The first method improves the search results for hard queries with underrepresented aspects by automatically expanding the query using semantically orthogonal keywords related to the underrepresented aspects. The second method helps users refine hard, ambiguous queries by identifying the different query interpretations using a clustering of a diverse set of refinements. Both methods significantly outperform existing methods.
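As a rough sketch of the second idea, the snippet below groups a diverse set of query refinements into interpretations by lexical similarity. This is a simplification with hypothetical refinements and a greedy token-overlap grouping, not the thesis's algorithm:

```python
def token_similarity(a, b):
    """Jaccard similarity between the token sets of two refinements."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def group_refinements(refinements, threshold=0.25):
    """Greedily group refinements whose token overlap exceeds a threshold."""
    groups = []
    for ref in refinements:
        for group in groups:
            if any(token_similarity(ref, m) >= threshold for m in group):
                group.append(ref)
                break
        else:
            groups.append([ref])
    return groups

refs = ["jaguar car dealers", "jaguar car reviews",
        "jaguar animal habitat", "jaguar animal diet"]
print(group_refinements(refs))
# [['jaguar car dealers', 'jaguar car reviews'],
#  ['jaguar animal habitat', 'jaguar animal diet']]
```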


Author(s): R. Subhashini ◽ V. Jawahar Senthil Kumar

The World Wide Web is a large, distributed digital information space. The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential. Information Retrieval (IR) plays an important role in search engines. Today's most advanced engines use the keyword-based ("bag of words") paradigm, which has inherent disadvantages. Organizing web search results into clusters facilitates the user's quick browsing of the results. Traditional clustering techniques are inadequate because they do not generate clusters with highly readable names. This paper proposes an approach for clustering web search results based on a phrase-based clustering algorithm, as an alternative to the single ordered result list of search engines. The approach presents a list of clusters to the user. Experimental results verify the method's feasibility and effectiveness.
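A minimal sketch of why shared phrases yield readable cluster names (hypothetical snippets; the paper's algorithm is not reproduced here): take the longest phrase that every snippet in a group shares as the group's label.

```python
def common_phrases(snippets):
    """Return phrases (contiguous word n-grams) shared by every snippet."""
    def phrases(text):
        words = text.lower().split()
        return {" ".join(words[i:j])
                for i in range(len(words))
                for j in range(i + 1, len(words) + 1)}
    shared = phrases(snippets[0])
    for s in snippets[1:]:
        shared &= phrases(s)
    return shared

group = ["apache lucene search library tutorial",
         "indexing with the apache lucene search library"]
label = max(common_phrases(group), key=len, default="misc")
print(label)  # "apache lucene search library"
```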


Author(s): Xiannong Meng ◽ Song Xing

This chapter reports the results of a project that assesses the performance of a few major search engines from various perspectives. The search engines involved in the study include the Microsoft Search Engine (MSE) while it was in its beta test stage, AllTheWeb, and Yahoo. In a few comparisons, other search engines such as Google and Vivisimo are also included. The study collects statistics such as the average user response time, the average processing time for a query reported by MSE, and the number of pages relevant to a query reported by all search engines involved. The project also studies the quality of the search results generated by MSE and the other search engines, using RankPower as the metric. We found that MSE performs well in speed and diversity of query results but is weaker on other statistics, compared with some other leading search engines. The contribution of this chapter is to review performance evaluation techniques for search engines and to use different measures to assess and compare the quality of different search engines, especially MSE.
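RankPower is not defined in this abstract. One published formulation attributed to Meng (treat this as an assumption rather than the chapter's exact definition) is the average rank of the relevant results divided by the number of relevant results, so lower values are better:

```python
def rank_power(relevant_ranks):
    """RankPower: average rank of relevant results divided by their count
    (lower is better; approaches 0.5 when every top result is relevant).
    Follows one published formulation attributed to Meng; the chapter's
    exact definition is assumed, not quoted."""
    r = len(relevant_ranks)
    if r == 0:
        return float("inf")
    return (sum(relevant_ranks) / r) / r

# Ranks (1-based positions) of relevant results among the top 10.
print(rank_power([1, 2, 3, 5]))  # (11/4)/4 = 0.6875
```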


Author(s): Chandran M ◽ Ramani A. V

This research work tests the quality of a website and improves it by analyzing hit counts, impressions, clicks, click-through rates (CTR), and average positions. This is accomplished using WRPA and SEO techniques. The quality of a website lies mainly in the keywords it contains. The keywords come from the search queries that users type into search engines, and based on these keywords the websites are displayed in the search results. This research work concentrates on bringing a particular website to the top of the search results; the website chosen for the research is SRKV. The work is carried out by creating an index array that holds all the Meta tags. All the search keywords entered by users for the website are stored in another array. The index array is matched and compared with the search-keywords array, and from this the hit count is calculated for the analysis. The calculated hit count and the searched keywords are then analyzed to improve the performance of the website. The special keywords matched in this comparison are added to the Meta tags to improve the website's performance. Then all the Meta tags and the newly specified keywords in the index array are matched against the SEO keywords; any matching keyword is stored for improving the quality of the website. Metrics such as impressions, clicks, CTR, and average positions are measured along with the hit counts. The research is carried out under different types of browsers and on different platforms, and queries about the website from different countries are also measured. In conclusion, if the number of clicks for the website is greater than the average number of clicks, the quality of the website is good. This research helps improve the keywords using WRPA and SEO and thereby easily improves the quality of the website.
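A minimal sketch of the keyword-matching and metric steps described (the keyword arrays are hypothetical and WRPA itself is not reproduced):

```python
# Hypothetical Meta-tag keywords indexed from the site, and user queries.
meta_keywords = ["vivekananda", "srkv", "school", "admission"]
search_keywords = ["srkv admission 2020", "srkv school fees", "kalamandir"]

# Hit count: how many user queries contain at least one Meta-tag keyword.
hits = sum(any(kw in query.split() for kw in meta_keywords)
           for query in search_keywords)
print("hit count:", hits)  # 2

def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions."""
    return clicks / impressions if impressions else 0.0

print(f"CTR: {ctr(clicks=120, impressions=4000):.2%}")  # 3.00%
```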

