Building a Post-Search Academic Search Engine Based on a Serial of Clustering Methods

2013 ◽ Vol 284-287 ◽ pp. 3051-3055
Author(s): Lin-Chih Chen

Academic search engines, such as Google Scholar and Scirus, provide a Web-based interface that helps researchers find relevant scientific articles effectively. However, current academic search engines lack the ability to cluster the search results into a hierarchical tree structure. In this paper, we develop a post-search academic search engine using a mixed clustering method. In this method, we first adopt suffix tree clustering and a two-way hash mechanism to generate all meaningful labels. We then develop a divisive hierarchical clustering algorithm to organize the labels into a hierarchical tree. According to the experimental results, we conclude that using our mixed clustering method to cluster the search results gives significant performance gains over current academic search engines. This paper makes two contributions. First, we present a high-performance academic search engine based on our mixed clustering method. Second, we develop a divisive hierarchical clustering algorithm that organizes all returned search results into a hierarchical tree structure.
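The paper's implementation is not reproduced here. As a rough illustration of its two stages, the following minimal Python sketch uses shared word n-grams to stand in for the suffix-tree and two-way-hash label generation (an assumption, not the authors' code) and then splits the document set divisively on the best-covering label:

```python
from collections import defaultdict

def extract_labels(snippets, max_len=3, min_docs=2):
    """Map candidate phrase labels to the set of documents containing them.
    Shared word n-grams approximate the paper's suffix-tree + two-way-hash
    label generation (an assumption, not the authors' implementation)."""
    label_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                label_docs[" ".join(words[i:i + n])].add(doc_id)
    return {lbl: docs for lbl, docs in label_docs.items() if len(docs) >= min_docs}

def build_tree(docs, label_docs):
    """Divisively split a document set on the label that best covers a
    proper subset: matching documents form one branch, the rest the other."""
    candidates = {l: d & docs for l, d in label_docs.items()
                  if 0 < len(d & docs) < len(docs)}
    if len(docs) <= 1 or not candidates:
        return sorted(docs)
    label = max(candidates, key=lambda l: len(candidates[l]))
    inside, outside = candidates[label], docs - candidates[label]
    remaining = {l: d for l, d in label_docs.items() if l != label}
    return {label: build_tree(inside, remaining),
            "other": build_tree(outside, remaining)}

snippets = ["suffix tree clustering of search results",
            "hierarchical clustering of search results",
            "academic search engine interface"]
labels = extract_labels(snippets)
print(build_tree(set(range(len(snippets))), labels))
# {'clustering': [0, 1], 'other': [2]}
```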

2012 ◽ pp. 274-290
Author(s): Lin-Chih Chen

Result clustering has recently attracted considerable attention as a way to give users a more succinct overview of relevant search results than traditional search engines provide. This chapter proposes a mixed clustering method to organize all returned search results into a hierarchical tree structure. The clustering method accomplishes two main tasks: label construction and tree building. The chapter uses precision to measure the quality of the clustering results. Based on the experimental results, the author preliminarily concludes that the system performs better than many other well-known commercial and academic systems. This chapter makes several contributions. First, it presents a high-performance system based on the clustering method. Second, it develops a divisive hierarchical clustering algorithm to organize all returned snippets into a hierarchical tree structure. Third, it performs a wide range of experimental analyses showing that almost all commercial systems are significantly better than most current academic systems.
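The chapter's exact precision protocol is not spelled out in this abstract; one common per-cluster formulation (an assumption) is the fraction of a cluster's snippets that are relevant to its label, sketched here:

```python
def cluster_precision(cluster, relevant):
    """Fraction of documents in a cluster that are relevant
    (a common reading of precision for clustering; an assumption,
    since the chapter's exact protocol is not given here)."""
    return len(set(cluster) & set(relevant)) / len(cluster) if cluster else 0.0

print(cluster_precision(["d1", "d2", "d3"], {"d1", "d3"}))  # 0.666...
```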


2021 ◽ pp. 089443932110068
Author(s): Aleksandra Urman ◽ Mykola Makhortykh ◽ Roberto Ulloa

We examine how six search engines filter and rank information in relation to queries on the 2020 U.S. presidential primary elections under default (that is, nonpersonalized) conditions. To do so, we utilize an algorithmic auditing methodology that uses virtual agents to conduct a large-scale analysis of algorithmic information curation in a controlled environment. Specifically, we look at the text search results for the queries "us elections," "donald trump," "joe biden," and "bernie sanders" on Google, Baidu, Bing, DuckDuckGo, Yahoo, and Yandex during the 2020 primaries. Our findings indicate substantial differences in the search results between search engines and multiple discrepancies within the results generated for different agents using the same search engine. This highlights that whether users see certain information is decided by chance due to the inherent randomization of search results. We also find that some search engines prioritize different categories of information sources with respect to specific candidates. These observations demonstrate that algorithmic curation of political information can create information inequalities between search engine users even under nonpersonalized conditions. Such inequalities are particularly troubling considering that search results are highly trusted by the public and, as previous research has demonstrated, can shift the opinions of undecided voters.
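The audit infrastructure itself is not shown in this abstract. As a hedged sketch of the kind of between-agent agreement analysis described (hypothetical data, and a simple top-N Jaccard overlap rather than the authors' exact measures):

```python
from itertools import combinations

def jaccard_top_n(results_a, results_b, n=10):
    """Jaccard overlap of the top-n result URLs returned to two agents."""
    a, b = set(results_a[:n]), set(results_b[:n])
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical top results seen by three virtual agents for one query.
agents = {
    "agent1": ["cnn.com", "nytimes.com", "wikipedia.org"],
    "agent2": ["cnn.com", "foxnews.com", "wikipedia.org"],
    "agent3": ["nytimes.com", "cnn.com", "wikipedia.org"],
}
for (n1, r1), (n2, r2) in combinations(agents.items(), 2):
    print(n1, n2, round(jaccard_top_n(r1, r2), 2))
```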


2019 ◽ Vol 71 (1) ◽ pp. 54-71
Author(s): Artur Strzelecki

Purpose
The purpose of this paper is to clarify how many removal requests are made, how often, and who makes these requests, as well as which websites are reported to search engines so they can be removed from the search results.

Design/methodology/approach
The paper undertakes a deep analysis of more than 3.2bn pages removed from Google's search results at the request of reporting organizations from 2011 to 2018, and of over 460m pages removed from Bing's search results at the request of reporting organizations from 2015 to 2017. The paper focuses on pages that belong to the .pl country-code top-level domain (ccTLD).

Findings
Although the number of requests to remove data from search results has been growing year on year, fewer URLs have been reported in recent years. Some of the requests are, however, unjustified and are rejected by the teams representing the search engines. In terms of reporting copyright violations, one company in particular stands out (AudioLock.Net), accounting for 28.1 percent of all reports sent to Google (the top ten companies combined were responsible for 61.3 percent of the total number of reports).

Research limitations/implications
As not every request can be published, the study is based only on what is publicly available. Also, the data assigned to Poland is based only on the ccTLD domain name (.pl); other domain extensions used by Polish internet users were not considered.

Originality/value
This is the first global analysis of data from transparency reports published by search engine companies, as prior research has been based on specific notices.
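As an illustration of the kind of per-organization aggregation behind figures such as the AudioLock.Net share (hypothetical data; the transparency-report parsing itself is omitted):

```python
from collections import Counter

# Hypothetical (organization, reported-URL count) records; the real data
# would come from Google's and Bing's published transparency reports.
reports = [("AudioLock.Net", 900), ("OrgB", 400), ("OrgC", 300), ("OrgD", 150)]

totals = Counter()
for org, n_urls in reports:
    totals[org] += n_urls
grand_total = sum(totals.values())
for org, n in totals.most_common():
    print(f"{org}: {100 * n / grand_total:.1f}% of reported URLs")
```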


Author(s): Novario Jaya Perdana

The accuracy of search results from a search engine depends on the keywords used. A lack of information in the keywords can reduce the accuracy of the search results, which makes finding information on the internet hard work. In this research, software has been built to create document keyword sequences. The software uses Google Latent Semantic Distance, which can extract relevant information from a document. The information is expressed in the form of specific word sequences that can be used as keyword recommendations in search engines. The results show that the implemented method for creating document keyword recommendations achieves high accuracy and finds the most relevant information in the top search results.
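Google Latent Semantic Distance is not defined in this abstract. As a related, well-known formulation, the Normalized Google Distance (Cilibrasi and Vitányi) estimates the semantic distance between two terms from search hit counts; a sketch follows (the hit counts are hypothetical, and the paper's measure may differ):

```python
from math import log

def normalized_google_distance(f_x, f_y, f_xy, n_pages):
    """Normalized Google Distance from hit counts. Shown as a related
    formulation only; the paper's Google Latent Semantic Distance may
    differ. f_x, f_y: hits for each term; f_xy: hits for both terms
    together; n_pages: total number of indexed pages."""
    num = max(log(f_x), log(f_y)) - log(f_xy)
    den = log(n_pages) - min(log(f_x), log(f_y))
    return num / den

# Hypothetical hit counts for two candidate keywords.
print(normalized_google_distance(f_x=1e6, f_y=5e5, f_xy=2e5, n_pages=5e10))
```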


2021
Author(s): Daniel Wayne Crabtree

This thesis investigates the refinement of web search results, with a special focus on the use of clustering and the role of queries. It presents a collection of new methods for evaluating clustering methods, performing clustering effectively, and performing query refinement. The thesis identifies different types of query, the situations where refinement is necessary, and the factors affecting search difficulty. It then analyses hard searches and argues that many of them fail because users and search engines have different query models.

The thesis identifies best practice for evaluating web search results and search refinement methods. It finds that none of the commonly used evaluation measures for clustering meets all of the properties of good evaluation measures. It then presents new quality and coverage measures that satisfy all the desired properties and rank clusterings correctly in all web page clustering situations.

The thesis argues that current web page clustering methods work well when different interpretations of the query have distinct vocabulary, but they still have several limitations and often produce incomprehensible clusters. It then presents a new clustering method that uses the query to guide the construction of semantically meaningful clusters; the new method significantly improves performance.

Finally, the thesis explores how searches and queries are composed of different aspects and shows how to use aspects to reduce the distance between the query models of search engines and users. It then presents fully automatic methods that identify query aspects, identify underrepresented aspects, and predict query difficulty. Used in combination, these methods have many applications; the thesis describes methods for two of them. The first method improves the search results for hard queries with underrepresented aspects by automatically expanding the query using semantically orthogonal keywords related to the underrepresented aspects. The second method helps users refine hard, ambiguous queries by identifying the different query interpretations using a clustering of a diverse set of refinements. Both methods significantly outperform existing methods.
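As a rough sketch of the second idea, the snippet below groups a diverse set of query refinements into interpretations by lexical similarity. This is a simplification with hypothetical refinements and a greedy token-overlap grouping, not the thesis's algorithm:

```python
def token_similarity(a, b):
    """Jaccard similarity between the token sets of two refinements."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def group_refinements(refinements, threshold=0.25):
    """Greedily group refinements whose token overlap exceeds a threshold."""
    groups = []
    for ref in refinements:
        for group in groups:
            if any(token_similarity(ref, m) >= threshold for m in group):
                group.append(ref)
                break
        else:
            groups.append([ref])
    return groups

refs = ["jaguar car dealers", "jaguar car reviews",
        "jaguar animal habitat", "jaguar animal diet"]
print(group_refinements(refs))
# [['jaguar car dealers', 'jaguar car reviews'],
#  ['jaguar animal habitat', 'jaguar animal diet']]
```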


Author(s): R. Subhashini ◽ V. Jawahar Senthil Kumar

The World Wide Web is a large, distributed digital information space. The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential. Information Retrieval (IR) plays an important role in search engines. Today's most advanced engines use the keyword-based ("bag of words") paradigm, which has inherent disadvantages. Organizing web search results into clusters facilitates the user's quick browsing of the results. Traditional clustering techniques are inadequate because they do not generate clusters with highly readable names. This paper proposes an approach for clustering web search results based on a phrase-based clustering algorithm, as an alternative to the single ordered result list of search engines. The approach presents a list of clusters to the user. Experimental results verify the method's feasibility and effectiveness.
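A minimal sketch of why shared phrases yield readable cluster names (hypothetical snippets; the paper's algorithm is not reproduced here): take the longest phrase that every snippet in a group shares as the group's label.

```python
def common_phrases(snippets):
    """Return phrases (contiguous word n-grams) shared by every snippet."""
    def phrases(text):
        words = text.lower().split()
        return {" ".join(words[i:j])
                for i in range(len(words))
                for j in range(i + 1, len(words) + 1)}
    shared = phrases(snippets[0])
    for s in snippets[1:]:
        shared &= phrases(s)
    return shared

group = ["apache lucene search library tutorial",
         "indexing with the apache lucene search library"]
label = max(common_phrases(group), key=len, default="misc")
print(label)  # "apache lucene search library"
```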


Author(s): Xiannong Meng ◽ Song Xing

This chapter reports the results of a project that assesses the performance of a few major search engines from various perspectives. The search engines involved in the study include the Microsoft Search Engine (MSE) while it was in its beta test stage, AllTheWeb, and Yahoo. In a few comparisons, other search engines such as Google and Vivisimo are also included. The study collects statistics such as the average user response time, the average processing time for a query reported by MSE, and the number of pages relevant to a query reported by all search engines involved. The project also studies the quality of the search results generated by MSE and the other search engines, using RankPower as the metric. We found that MSE performs well in speed and diversity of query results but is weaker on other statistics, compared with some other leading search engines. The contribution of this chapter is to review performance evaluation techniques for search engines and to use different measures to assess and compare the quality of different search engines, especially MSE.
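RankPower is not defined in this abstract. One published formulation attributed to Meng (treat this as an assumption rather than the chapter's exact definition) is the average rank of the relevant results divided by the number of relevant results, so lower values are better:

```python
def rank_power(relevant_ranks):
    """RankPower: average rank of relevant results divided by their count
    (lower is better; approaches 0.5 when every top result is relevant).
    Follows one published formulation attributed to Meng; the chapter's
    exact definition is assumed, not quoted."""
    r = len(relevant_ranks)
    if r == 0:
        return float("inf")
    return (sum(relevant_ranks) / r) / r

# Ranks (1-based positions) of relevant results among the top 10.
print(rank_power([1, 2, 3, 5]))  # (11/4)/4 = 0.6875
```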


Author(s): Chandran M ◽ Ramani A. V

This research work tests the quality of a website and improves it by analyzing hit counts, impressions, clicks, click-through rates (CTR), and average positions. This is accomplished using WRPA and SEO techniques. The quality of a website lies mainly in the keywords it contains. The keywords come from the search queries that users type into search engines, and based on these keywords the websites are displayed in the search results. This research work concentrates on bringing a particular website to the top of the search results; the website chosen for the research is SRKV. The work is carried out by creating an index array that holds all the Meta tags. All the search keywords entered by users for the website are stored in another array. The index array is matched and compared with the search-keywords array, and from this the hit count is calculated for the analysis. The calculated hit count and the searched keywords are then analyzed to improve the performance of the website. The special keywords matched in this comparison are added to the Meta tags to improve the website's performance. Then all the Meta tags and the newly specified keywords in the index array are matched against the SEO keywords; any matching keyword is stored for improving the quality of the website. Metrics such as impressions, clicks, CTR, and average positions are measured along with the hit counts. The research is carried out under different types of browsers and on different platforms, and queries about the website from different countries are also measured. In conclusion, if the number of clicks for the website is greater than the average number of clicks, the quality of the website is good. This research helps improve the keywords using WRPA and SEO and thereby easily improves the quality of the website.
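A minimal sketch of the keyword-matching and metric steps described (the keyword arrays are hypothetical and WRPA itself is not reproduced):

```python
# Hypothetical Meta-tag keywords indexed from the site, and user queries.
meta_keywords = ["vivekananda", "srkv", "school", "admission"]
search_keywords = ["srkv admission 2020", "srkv school fees", "kalamandir"]

# Hit count: how many user queries contain at least one Meta-tag keyword.
hits = sum(any(kw in query.split() for kw in meta_keywords)
           for query in search_keywords)
print("hit count:", hits)  # 2

def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions."""
    return clicks / impressions if impressions else 0.0

print(f"CTR: {ctr(clicks=120, impressions=4000):.2%}")  # 3.00%
```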

