Web Mining in Thematic Search Engines

Author(s):  
Massimiliano Caramia ◽  
Giovanni Felici

The recent improvements of search engine technologies have made available to Internet users an enormous amount of knowledge that can be accessed in many different ways. The most popular search engines now provide search facilities for databases containing billions of Web pages, where queries are executed instantly. The focus is switching from quantity (maintaining and indexing large databases of Web pages and quickly selecting pages matching some criterion) to quality (identifying pages of high quality for the user). This trend is motivated by the natural evolution of Internet users, who are now more selective in their choice of search tool and may be willing to pay the price of providing extra feedback to the system and of waiting longer for their queries to be better matched. In this framework, several authors have considered the use of data-mining and optimization techniques, which are often referred to as Web mining (for a recent bibliography on this topic see, e.g., Getoor, Senator, Domingos, & Faloutsos, 2003, and Zaïane, Srivastava, Spiliopoulou, & Masand, 2002). Here, we describe a method for improving standard search results in a thematic search engine, where the documents and pages made available are restricted to a finite number of topics and the users are considered to belong to a finite number of user profiles. The method uses clustering techniques to identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; we then construct a number of small and potentially good subsets of pages, extracting from each cluster the pages with the highest scores. Operating on these subsets with a genetic algorithm, we identify the subset with a good overall score and high internal dissimilarity. This provides the user with a few nonduplicated pages that more faithfully represent the structure of the initial set of pages. Because pages are seen by the algorithms as vectors of fixed dimension, the role of the context- or profile-based vectorization is central and specific to the thematic approach of this method.
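
The pipeline lends itself to a compact illustration. Below is a minimal sketch, assuming pages are already vectorized and scored; the cluster count, subset size, and genetic-algorithm parameters are illustrative choices, not those of the original method.

```python
# Cluster-then-select sketch: k-means over page vectors, top scorers per
# cluster as candidates, then a tiny genetic algorithm that trades off
# overall score against internal dissimilarity. All parameters are
# illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def select_pages(vectors, scores, n_clusters=5, per_cluster=3,
                 subset_size=6, generations=50, pop_size=30):
    """Pick a small page subset with high scores and high mutual dissimilarity."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)

    # Keep only the highest-scoring pages of each cluster as candidates.
    candidates = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        candidates.extend(members[np.argsort(scores[members])[-per_cluster:]])
    candidates = np.array(candidates)

    def fitness(subset):
        idx = candidates[subset]
        vs = vectors[idx]
        # Average pairwise distance rewards internal dissimilarity.
        dists = np.linalg.norm(vs[:, None] - vs[None, :], axis=-1)
        return scores[idx].mean() + dists.mean()

    # Population of fixed-size index subsets into `candidates`.
    pop = [rng.choice(len(candidates), subset_size, replace=False)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = rng.choice(len(survivors), 2, replace=False)
            pool = np.union1d(survivors[a], survivors[b])  # crossover
            child = rng.choice(pool, subset_size, replace=False)
            if rng.random() < 0.2:  # mutation: swap in an outside candidate
                outside = np.setdiff1d(np.arange(len(candidates)), child)
                child[rng.integers(subset_size)] = rng.choice(outside)
            children.append(child)
        pop = survivors + children
    return candidates[max(pop, key=fitness)]

vectors = rng.normal(size=(40, 8))   # toy page vectors
scores = rng.random(40)              # toy relevance scores
print(select_pages(vectors, scores))
```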

2021 ◽  
Author(s):  
Prem Sagar Sharma ◽  
Divakar Yadav

Purpose: Due to the exponential growth of internet users and internet traffic, information seekers depend heavily upon search engines to extract relevant information. With large amounts of textual, audio, and video content accessible, the responsibility of search engines has increased.

Design/methodology/approach: A search engine provides internet users with information relevant to their queries, based on content, link structure, and other signals. However, it does not guarantee the correctness of that information. The performance of a search engine depends heavily on its ranking module, which in turn depends on the link structure of web pages, analyzed through Web structure mining (WSM), and on their content, analyzed through Web content mining (WCM). Web mining thus plays an important role in computing the rank of web pages.

Findings: This article presents web mining types, techniques, tools, algorithms, and their challenges. It further provides a critical, comprehensive survey for researchers by presenting the features of web pages that are important for assessing their quality.

Originality: The authors present the approaches, techniques, algorithms, and evaluation methods used in previous research and identify critical issues in page ranking and web mining, which provide future directions for researchers working in the area.
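
As a small illustration of the content side (WCM), the sketch below scores toy pages against a query with TF-IDF cosine similarity; the documents, the query, and the scikit-learn-based implementation are assumptions for illustration, not material from the article.

```python
# WCM illustration: rank pages by TF-IDF cosine similarity to a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pages = [
    "web mining techniques for page ranking",
    "audio and video content on the web",
    "link structure analysis of web pages",
]
query = ["ranking web pages by link structure"]

vec = TfidfVectorizer()
page_matrix = vec.fit_transform(pages)   # one TF-IDF row per page
query_vec = vec.transform(query)

# Higher cosine similarity = more relevant content for this query.
for page, score in zip(pages, cosine_similarity(query_vec, page_matrix)[0]):
    print(f"{score:.3f}  {page}")
```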


Author(s):  
Massimiliano Caramia ◽  
Giovanni Felici

In the present chapter we report on some extensions of the work presented in the first edition of the Encyclopedia of Data Mining. In Caramia and Felici (2005) we described a method based on clustering and a heuristic search method - based on a genetic algorithm - to extract pages with relevant information for a specific user query in a thematic search engine. Starting from these results, we have extended the research work to address some issues related to the semantic aspects of the search, focusing on the keywords that are used to establish the similarity among the pages that result from the query. Complete details on this method, omitted here for brevity, can be found in Caramia and Felici (2006). Search engine technologies remain a strong research topic, as new problems and new demands from the market and the users arise. The process of switching from quantity (maintaining and indexing large databases of web pages and quickly selecting pages matching some criterion) to quality (identifying pages of high quality for the user), already highlighted in Caramia and Felici (2005), has not been interrupted but has gained further energy, motivated by the natural evolution of internet users, who are more selective in their choice of search tool and willing to pay the price of providing extra feedback to the system and waiting longer to have their queries better matched. In this framework, several authors have considered the use of data mining and optimization techniques, often referred to as web mining (for a recent bibliography on this topic see, e.g., Getoor, Senator, Domingos, and Faloutsos, 2003, and Zaïane, Srivastava, Spiliopoulou, and Masand, 2002). The work described in this chapter is based on clustering techniques that identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; then, a number of small and potentially good subsets of pages is constructed, extracting from each cluster the pages with the highest scores. Operating on these subsets with a genetic algorithm, a subset with a good overall score and high internal dissimilarity is identified. A related problem is then considered: the selection of a subset of pages that are compliant with the search keywords but are also characterized by the fact that they share a large subset of words different from the search keywords. This characteristic represents a sort of semantic connection among these pages that may be of use to spot particular aspects of the information they contain. Such a task is accomplished by the construction of a special graph, whose maximum-weight clique and k-densest subgraph should represent the page subsets with the desired properties. In the following we summarize the main background topics and provide a synthetic description of the methods. Interested readers may find additional information in Caramia and Felici (2004), Caramia and Felici (2005), and Caramia and Felici (2006).
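
The shared-keyword graph can be sketched as follows: nodes are pages, edge weights count the non-query words two pages share, and a greedy peeling heuristic stands in for the exact k-densest-subgraph computation. The actual graph construction and the maximum-weight clique formulation in Caramia and Felici (2006) are more involved; this is only illustrative.

```python
# Greedy k-densest heuristic on a shared-keyword graph (illustrative only).
from itertools import combinations

def densest_k_pages(page_words, query_words, k):
    pages = list(page_words)
    # Edge weight = number of shared words that are not search keywords.
    weight = {
        (a, b): len((page_words[a] & page_words[b]) - query_words)
        for a, b in combinations(pages, 2)
    }
    def w(a, b):
        return weight.get((a, b)) or weight.get((b, a), 0)

    # Peeling: repeatedly drop the page contributing the least edge weight.
    current = set(pages)
    while len(current) > k:
        worst = min(current,
                    key=lambda p: sum(w(p, q) for q in current if q != p))
        current.remove(worst)
    return current

pages = {
    "p1": {"wine", "tuscany", "vineyard", "travel"},
    "p2": {"wine", "vineyard", "harvest", "tuscany"},
    "p3": {"wine", "stock", "market", "prices"},
}
print(densest_k_pages(pages, query_words={"wine"}, k=2))  # -> {'p1', 'p2'}
```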


2019 ◽  
Vol 71 (1) ◽  
pp. 54-71 ◽  
Author(s):  
Artur Strzelecki

Purpose: The purpose of this paper is to clarify how many removal requests are made, how often, and who makes these requests, as well as which websites are reported to search engines so they can be removed from the search results.

Design/methodology/approach: Undertakes a deep analysis of more than 3.2bn pages removed from Google's search results at the request of reporting organizations from 2011 to 2018 and over 460m pages removed from Bing's search results at the request of reporting organizations from 2015 to 2017. The paper focuses on pages that belong to the .pl country-code top-level domain (ccTLD).

Findings: Although the number of requests to remove data from search results has been growing year on year, fewer URLs have been reported in recent years. Some of the requests are, however, unjustified and are rejected by the teams representing the search engines. In terms of reporting copyright violations, one company in particular stands out (AudioLock.Net), accounting for 28.1 percent of all reports sent to Google (the top ten companies combined were responsible for 61.3 percent of the total number of reports).

Research limitations/implications: As not every request can be published, the study is based only on what is publicly available. Also, the data assigned to Poland is based only on the ccTLD domain name (.pl); other domain extensions used by Polish internet users were not considered.

Originality/value: This is the first global analysis of data from transparency reports published by search engine companies, as prior research has been based on specific notices.


Author(s):  
Ravi P. Kumar ◽  
Ashutosh K. Singh ◽  
Anand Mohan

In this era of Web computing, cyber security is very important, as more and more data is moving onto the Web and some of it is confidential. Web data faces many threats, some of which can be addressed by designing websites properly using Search Engine Optimization techniques. One such threat is the hanging page, which gives room for link spamming. This chapter addresses the issues caused by hanging pages in Web computing and has four objectives: 1) compare and review the different types of link-structure-based ranking algorithms for ranking Web pages, with PageRank used as the base algorithm throughout the chapter; 2) study hanging pages, explore their effects on Web security, and compare the existing methods for handling them; 3) study link spam and explore the contribution of hanging pages to it; and 4) study Search Engine Optimization (SEO)/Web Site Optimization (WSO) and explore the effect of hanging pages on SEO.
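
To see why hanging pages matter, consider PageRank itself: a page with no out-links leaks rank mass unless it is handled explicitly. The sketch below uses one common remedy, spreading a hanging page's rank uniformly over all pages; the chapter compares several such handling methods, and the toy graph is an illustrative assumption.

```python
# PageRank with explicit hanging-page (dangling-node) handling.
def pagerank(links, damping=0.85, iters=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Rank held by hanging pages is redistributed uniformly, one
        # standard way to keep the total rank mass equal to 1.
        hanging = sum(rank[p] for p in pages if not links[p])
        new = {}
        for p in pages:
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * (incoming + hanging / n)
        rank = new
    return rank

# "d" is a hanging page: everything links to it, it links to nowhere.
graph = {"a": ["b", "d"], "b": ["d"], "c": ["a", "d"], "d": []}
print(pagerank(graph))
```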


Author(s):  
Oğuzhan Menemencioğlu ◽  
İlhami Muharrem Orak

The semantic web aims to produce machine-readable data and to cope with large amounts of data. The most important tool for accessing the data available on the web is the search engine. Traditional search engines are insufficient in the face of the amount of data contained in existing web pages; semantic search engines extend traditional engines and overcome the difficulties they face. This paper summarizes the semantic web, the concepts of traditional and semantic search engines, and their infrastructure, and details semantic search approaches. A summary of the literature is provided, touching on the trends with respect to the types of applications and the areas they address. Based on data for two different years, the trends on these points are analyzed and the impacts of the changes are discussed. The analysis shows that the evolution of the semantic web continues and that new applications and areas keep emerging. Multimedia retrieval is a new scope of semantics; hence, multimedia retrieval approaches are discussed, and text and multimedia retrieval are analyzed within semantic search.


2019 ◽  
Vol 16 (9) ◽  
pp. 3712-3716
Author(s):  
Kailash Kumar ◽  
Abdulaziz Al-Besher

This paper examines the overlap among the results retrieved by three major search engines, namely Google, Yahoo, and Bing. A rigorous analysis of the overlap among these search engines was conducted on 100 random queries. For each query, the first ten web page results from each search engine were considered, and only non-sponsored results were taken into account. Search engines have their own update frequencies and rank results based on their own relevance criteria; moreover, sponsored-search advertisers differ between search engines, and no single search engine can index all Web pages. The overlap analysis was carried out between October 1, 2018 and October 31, 2018 across these major search engines. A framework was built in Java to analyze the overlap among them: it eliminates the common results, merges them into a unified list, and uses a ranking algorithm to re-rank the search engine results before displaying them to the user.
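
The overlap-and-merge step can be pictured with a short sketch. The Jaccard-style overlap measure and the Borda-style re-ranking below are assumptions for illustration; the paper's Java framework may measure and re-rank differently.

```python
# Overlap measurement and duplicate-free merged re-ranking for top-k lists.
def overlap(a, b):
    """Fraction of URLs shared by two top-k result lists (Jaccard)."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

def merge_and_rerank(result_lists):
    # Score each URL by summing (k - position) over the engines listing it,
    # so URLs ranked highly by several engines float to the top; duplicates
    # collapse into a single entry.
    scores = {}
    for results in result_lists:
        k = len(results)
        for pos, url in enumerate(results):
            scores[url] = scores.get(url, 0) + (k - pos)
    return sorted(scores, key=scores.get, reverse=True)

google = ["u1", "u2", "u3"]
yahoo = ["u2", "u4", "u1"]
bing = ["u5", "u2", "u6"]
print(overlap(google, yahoo))                 # -> 0.5
print(merge_and_rerank([google, yahoo, bing]))
```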


Author(s):  
R. D. Gaharwar ◽  
D. B. Shah

Search engines are used by most of the world's population as their basic information retrieval system for getting useful information from the internet. For a service provider using the internet for digital marketing, achieving high ranks in search engines is essential, and search engine optimization (SEO) techniques are used for this purpose. Black hat SEO techniques promise quick results but are prohibited by most search engines. Web space users and website developers should therefore be well aware of SEO techniques and how to use them optimally. This paper presents some of the most commonly used black hat SEO techniques and the countermeasures taken by different search engines to prohibit them.


2021 ◽  
Author(s):  
Srihari Vemuru ◽  
Eric John ◽  
Shrisha Rao

Humans can easily parse and find answers to complex queries such as "What was the capital of the country of the discoverer of the element which has atomic number 1?" by breaking them up into small pieces, querying these appropriately, and assembling a final answer. However, contemporary search engines lack such capability and fail to handle even slightly complex queries. Search engines process queries by identifying keywords and searching against them in knowledge bases or indexed web pages. The results are, therefore, dependent on the keywords and how well the search engine handles them. In our work, we propose a three-step approach called parsing, tree generation, and querying (PTGQ) for effective searching of larger and more expressive queries of potentially unbounded complexity. PTGQ parses a complex query and constructs a query tree where each node represents a simple query. It then processes the complex query by recursively querying a back-end search engine, going over the corresponding query tree in postorder. Using PTGQ makes sure that the search engine always handles a simpler query containing very few keywords. Results demonstrate that PTGQ can handle queries of much higher complexity than standalone search engines.
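
A minimal sketch of the PTGQ flow is given below: the query tree here is a chain of simple-query templates resolved in postorder, with `ask_engine` as a canned stand-in for the real back-end search engine; the names and the toy answers are assumptions for illustration.

```python
# PTGQ-style postorder resolution of a query tree of simple queries.
def ask_engine(simple_query):
    # Stand-in for a real search API; returns canned answers here.
    canned = {
        "element with atomic number 1": "hydrogen",
        "discoverer of hydrogen": "Henry Cavendish",
        "country of Henry Cavendish": "England",
        "capital of England": "London",
    }
    return canned[simple_query]

class Node:
    def __init__(self, template, child=None):
        self.template = template  # e.g. "capital of {}"
        self.child = child

def resolve(node):
    """Postorder: resolve the child first, then query with it filled in."""
    if node.child is None:
        return ask_engine(node.template)
    return ask_engine(node.template.format(resolve(node.child)))

# "capital of the country of the discoverer of the element with atomic number 1"
tree = Node("capital of {}",
            Node("country of {}",
                 Node("discoverer of {}",
                      Node("element with atomic number 1"))))
print(resolve(tree))  # -> London
```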


Author(s):  
Pavel Šimek ◽  
Jiří Vaněk ◽  
Jan Jarolímek

The majority of Internet users use the global network to search for information with fulltext search engines such as Google, Yahoo!, or Seznam. Web presentation operators try, with the help of different optimization techniques, to reach the top places in the results of fulltext search engines. This is where Search Engine Optimization and Search Engine Marketing become so important, because ordinary users usually try only the links on the first few result pages of the fulltext search engines for given keywords, and in catalogs they primarily use links placed higher in each category's hierarchy. The key to success is the application of optimization methods that deal with keywords, the structure and quality of content, domain names, individual pages, and the quantity and reliability of backlinks. The process is demanding, long-lasting, and without a guaranteed outcome. A website operator without advanced analytical tools cannot identify the contribution of the individual documents of which the entire website consists. If web presentation operators want an overview of their documents and of their website as a whole, it is appropriate to quantify these positions in a specific way, depending on specific keywords. This purpose is served by the quantification of the competitive value of documents, which in turn determines the global competitive value of a website. The quantification of competitive values is performed on a specific fulltext search engine, and different engines can, and often do, yield different results. According to published reports by the ClickZ agency and Market Share, Google is the most widely used search engine by number of searches by English-speaking users, with a market share of more than 80%. The overall procedure for quantifying competitive values is the same across engines; however, the initial step, the analysis of keywords, depends on the choice of fulltext search engine.
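
As an illustration of such a quantification, the sketch below scores a document by reciprocal rank weighted by keyword importance and sums document values into a site-level value. The formula and the numbers are assumptions for illustration only; the article does not spell out the computation here.

```python
# Hypothetical competitive-value scoring from SERP positions (illustrative).
def document_value(positions, keyword_weights):
    """positions: keyword -> rank of the document in the SERP (1 = top)."""
    return sum(keyword_weights[kw] / rank for kw, rank in positions.items())

def site_value(documents):
    """Global competitive value of a site = sum over its documents."""
    return sum(documents.values())

weights = {"farm machinery": 3.0, "crop prices": 2.0}  # toy keyword weights
doc_a = document_value({"farm machinery": 2, "crop prices": 10}, weights)
doc_b = document_value({"farm machinery": 15, "crop prices": 3}, weights)
print(site_value({"doc_a": doc_a, "doc_b": doc_b}))
```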

