Web Mining in Thematic Search Engines

Author(s):  
Massimiliano Caramia ◽  
Giovanni Felici

The recent improvements of search engine technologies have made available to Internet users an enormous amount of knowledge that can be accessed in many different ways. The most popular search engines now provide search facilities for databases containing billions of Web pages, where queries are executed instantly. The focus is switching from quantity (maintaining and indexing large databases of Web pages and quickly selecting pages matching some criterion) to quality (identifying pages of high quality for the user). This trend is motivated by the natural evolution of Internet users, who are now more selective in their choice of search tool and may be willing to pay the price of providing extra feedback to the system and of waiting longer for their queries to be better matched. In this framework, several authors have considered the use of data-mining and optimization techniques, which are often referred to as Web mining (for a recent bibliography on this topic see, e.g., Getoor, Senator, Domingos, & Faloutsos, 2003, and Zaïane, Srivastava, Spiliopoulou, & Masand, 2002). Here, we describe a method for improving standard search results in a thematic search engine, where the documents and pages made available are restricted to a finite number of topics and the users are considered to belong to a finite number of user profiles. The method uses clustering techniques to identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; we then construct a number of small and potentially good subsets of pages, extracting from each cluster the pages with the highest scores. Operating on these subsets with a genetic algorithm, we identify the subset with a good overall score and high internal dissimilarity. This provides the user with a few nonduplicated pages that more faithfully represent the structure of the initial set of pages. Because pages are seen by the algorithms as vectors of fixed dimension, the role of the context- or profile-based vectorization is central and specific to the thematic approach of this method.
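
The pipeline lends itself to a compact illustration. Below is a minimal sketch, assuming pages are already vectorized and scored; the cluster count, subset size, and genetic-algorithm parameters are illustrative choices, not those of the original method.

```python
# Cluster-then-select sketch: k-means over page vectors, top scorers per
# cluster as candidates, then a tiny genetic algorithm that trades off
# overall score against internal dissimilarity. All parameters are
# illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def select_pages(vectors, scores, n_clusters=5, per_cluster=3,
                 subset_size=6, generations=50, pop_size=30):
    """Pick a small page subset with high scores and high mutual dissimilarity."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)

    # Keep only the highest-scoring pages of each cluster as candidates.
    candidates = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        candidates.extend(members[np.argsort(scores[members])[-per_cluster:]])
    candidates = np.array(candidates)

    def fitness(subset):
        idx = candidates[subset]
        vs = vectors[idx]
        # Average pairwise distance rewards internal dissimilarity.
        dists = np.linalg.norm(vs[:, None] - vs[None, :], axis=-1)
        return scores[idx].mean() + dists.mean()

    # Population of fixed-size index subsets into `candidates`.
    pop = [rng.choice(len(candidates), subset_size, replace=False)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = rng.choice(len(survivors), 2, replace=False)
            pool = np.union1d(survivors[a], survivors[b])  # crossover
            child = rng.choice(pool, subset_size, replace=False)
            if rng.random() < 0.2:  # mutation: swap in an outside candidate
                outside = np.setdiff1d(np.arange(len(candidates)), child)
                child[rng.integers(subset_size)] = rng.choice(outside)
            children.append(child)
        pop = survivors + children
    return candidates[max(pop, key=fitness)]

vectors = rng.normal(size=(40, 8))   # toy page vectors
scores = rng.random(40)              # toy relevance scores
print(select_pages(vectors, scores))
```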

2021 ◽  
Author(s):  
Prem Sagar Sharma ◽  
Divakar Yadav

Purpose: Due to the exponential growth of internet users and internet traffic, information seekers depend heavily upon search engines to extract relevant information. With large amounts of textual, audio, and video content accessible, the responsibility of search engines has increased.

Design/methodology/approach: A search engine provides internet users with information relevant to their queries, based on content, link structure, and other signals. However, it does not guarantee the correctness of that information. The performance of a search engine depends heavily on its ranking module, which in turn depends on the link structure of web pages, analyzed through Web structure mining (WSM), and on their content, analyzed through Web content mining (WCM). Web mining thus plays an important role in computing the rank of web pages.

Findings: This article presents web mining types, techniques, tools, algorithms, and their challenges. It further provides a critical, comprehensive survey for researchers by presenting the features of web pages that are important for assessing their quality.

Originality: The authors present the approaches, techniques, algorithms, and evaluation methods used in previous research and identify critical issues in page ranking and web mining, which provide future directions for researchers working in the area.
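
As a small illustration of the content side (WCM), the sketch below scores toy pages against a query with TF-IDF cosine similarity; the documents, the query, and the scikit-learn-based implementation are assumptions for illustration, not material from the article.

```python
# WCM illustration: rank pages by TF-IDF cosine similarity to a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pages = [
    "web mining techniques for page ranking",
    "audio and video content on the web",
    "link structure analysis of web pages",
]
query = ["ranking web pages by link structure"]

vec = TfidfVectorizer()
page_matrix = vec.fit_transform(pages)   # one TF-IDF row per page
query_vec = vec.transform(query)

# Higher cosine similarity = more relevant content for this query.
for page, score in zip(pages, cosine_similarity(query_vec, page_matrix)[0]):
    print(f"{score:.3f}  {page}")
```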


Author(s):  
Massimiliano Caramia ◽  
Giovanni Felici

In the present chapter we report on some extensions of the work presented in the first edition of the Encyclopedia of Data Mining. In Caramia and Felici (2005) we described a method based on clustering and a heuristic search method - based on a genetic algorithm - to extract pages with relevant information for a specific user query in a thematic search engine. Starting from these results, we have extended the research work to address some issues related to the semantic aspects of the search, focusing on the keywords that are used to establish the similarity among the pages that result from the query. Complete details on this method, omitted here for brevity, can be found in Caramia and Felici (2006). Search engine technologies remain a strong research topic, as new problems and new demands from the market and the users arise. The process of switching from quantity (maintaining and indexing large databases of web pages and quickly selecting pages matching some criterion) to quality (identifying pages of high quality for the user), already highlighted in Caramia and Felici (2005), has not been interrupted but has gained further energy, motivated by the natural evolution of internet users, who are more selective in their choice of search tool and willing to pay the price of providing extra feedback to the system and waiting longer to have their queries better matched. In this framework, several authors have considered the use of data mining and optimization techniques, often referred to as web mining (for a recent bibliography on this topic see, e.g., Getoor, Senator, Domingos, and Faloutsos, 2003, and Zaïane, Srivastava, Spiliopoulou, and Masand, 2002). The work described in this chapter is based on clustering techniques that identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; then, a number of small and potentially good subsets of pages is constructed, extracting from each cluster the pages with the highest scores. Operating on these subsets with a genetic algorithm, a subset with a good overall score and high internal dissimilarity is identified. A related problem is then considered: the selection of a subset of pages that are compliant with the search keywords but are also characterized by the fact that they share a large subset of words different from the search keywords. This characteristic represents a sort of semantic connection among these pages that may be of use to spot particular aspects of the information they contain. Such a task is accomplished by the construction of a special graph, whose maximum-weight clique and k-densest subgraph should represent the page subsets with the desired properties. In the following we summarize the main background topics and provide a synthetic description of the methods. Interested readers may find additional information in Caramia and Felici (2004), Caramia and Felici (2005), and Caramia and Felici (2006).
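
The shared-keyword graph can be sketched as follows: nodes are pages, edge weights count the non-query words two pages share, and a greedy peeling heuristic stands in for the exact k-densest-subgraph computation. The actual graph construction and the maximum-weight clique formulation in Caramia and Felici (2006) are more involved; this is only illustrative.

```python
# Greedy k-densest heuristic on a shared-keyword graph (illustrative only).
from itertools import combinations

def densest_k_pages(page_words, query_words, k):
    pages = list(page_words)
    # Edge weight = number of shared words that are not search keywords.
    weight = {
        (a, b): len((page_words[a] & page_words[b]) - query_words)
        for a, b in combinations(pages, 2)
    }
    def w(a, b):
        return weight.get((a, b)) or weight.get((b, a), 0)

    # Peeling: repeatedly drop the page contributing the least edge weight.
    current = set(pages)
    while len(current) > k:
        worst = min(current,
                    key=lambda p: sum(w(p, q) for q in current if q != p))
        current.remove(worst)
    return current

pages = {
    "p1": {"wine", "tuscany", "vineyard", "travel"},
    "p2": {"wine", "vineyard", "harvest", "tuscany"},
    "p3": {"wine", "stock", "market", "prices"},
}
print(densest_k_pages(pages, query_words={"wine"}, k=2))  # -> {'p1', 'p2'}
```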


2019 ◽  
Vol 71 (1) ◽  
pp. 54-71 ◽  
Author(s):  
Artur Strzelecki

Purpose: The purpose of this paper is to clarify how many removal requests are made, how often, and who makes these requests, as well as which websites are reported to search engines so they can be removed from the search results.

Design/methodology/approach: Undertakes a deep analysis of more than 3.2bn pages removed from Google's search results at the request of reporting organizations from 2011 to 2018 and over 460m pages removed from Bing's search results at the request of reporting organizations from 2015 to 2017. The paper focuses on pages that belong to the .pl country-code top-level domain (ccTLD).

Findings: Although the number of requests to remove data from search results has been growing year on year, fewer URLs have been reported in recent years. Some of the requests are, however, unjustified and are rejected by the teams representing the search engines. In terms of reporting copyright violations, one company in particular stands out (AudioLock.Net), accounting for 28.1 percent of all reports sent to Google (the top ten companies combined were responsible for 61.3 percent of the total number of reports).

Research limitations/implications: As not every request can be published, the study is based only on what is publicly available. Also, the data assigned to Poland is based only on the ccTLD domain name (.pl); other domain extensions used by Polish internet users were not considered.

Originality/value: This is the first global analysis of data from transparency reports published by search engine companies, as prior research has been based on specific notices.


Author(s):  
Ravi P. Kumar ◽  
Ashutosh K. Singh ◽  
Anand Mohan

In this era of Web computing, cyber security is very important, as more and more data is moving onto the Web and some of it is confidential. Web data faces many threats, some of which can be addressed by designing websites properly using Search Engine Optimization techniques. One such threat is the hanging page, which gives room for link spamming. This chapter addresses the issues caused by hanging pages in Web computing and has four objectives: 1) compare and review the different types of link-structure-based ranking algorithms for ranking Web pages, with PageRank used as the base algorithm throughout the chapter; 2) study hanging pages, explore their effects on Web security, and compare the existing methods for handling them; 3) study link spam and explore the contribution of hanging pages to it; and 4) study Search Engine Optimization (SEO)/Web Site Optimization (WSO) and explore the effect of hanging pages on SEO.
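
To see why hanging pages matter, consider PageRank itself: a page with no out-links leaks rank mass unless it is handled explicitly. The sketch below uses one common remedy, spreading a hanging page's rank uniformly over all pages; the chapter compares several such handling methods, and the toy graph is an illustrative assumption.

```python
# PageRank with explicit hanging-page (dangling-node) handling.
def pagerank(links, damping=0.85, iters=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Rank held by hanging pages is redistributed uniformly, one
        # standard way to keep the total rank mass equal to 1.
        hanging = sum(rank[p] for p in pages if not links[p])
        new = {}
        for p in pages:
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * (incoming + hanging / n)
        rank = new
    return rank

# "d" is a hanging page: everything links to it, it links to nowhere.
graph = {"a": ["b", "d"], "b": ["d"], "c": ["a", "d"], "d": []}
print(pagerank(graph))
```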


Author(s):  
Oğuzhan Menemencioğlu ◽  
İlhami Muharrem Orak

The semantic web aims to produce machine-readable data and to cope with large amounts of data. The most important tool for accessing the data available on the web is the search engine. Traditional search engines are insufficient in the face of the amount of data contained in existing web pages; semantic search engines extend traditional engines and overcome the difficulties they face. This paper summarizes the semantic web, the concepts of traditional and semantic search engines, and their infrastructure, and details semantic search approaches. A summary of the literature is provided, touching on the trends with respect to the types of applications and the areas they address. Based on data for two different years, the trends on these points are analyzed and the impacts of the changes are discussed. The analysis shows that the evolution of the semantic web continues and that new applications and areas keep emerging. Multimedia retrieval is a new scope of semantics; hence, multimedia retrieval approaches are discussed, and text and multimedia retrieval are analyzed within semantic search.


2019 ◽  
Vol 16 (9) ◽  
pp. 3712-3716
Author(s):  
Kailash Kumar ◽  
Abdulaziz Al-Besher

This paper examines the overlap among the results retrieved by three major search engines, namely Google, Yahoo, and Bing. A rigorous analysis of the overlap among these search engines was conducted on 100 random queries. For each query, the first ten web page results from each search engine were considered, and only non-sponsored results were taken into account. Search engines have their own update frequencies and rank results based on their own relevance criteria; moreover, sponsored-search advertisers differ between search engines, and no single search engine can index all Web pages. The overlap analysis was carried out between October 1, 2018 and October 31, 2018 across these major search engines. A framework was built in Java to analyze the overlap among them: it eliminates the common results, merges them into a unified list, and uses a ranking algorithm to re-rank the search engine results before displaying them to the user.
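
The overlap-and-merge step can be pictured with a short sketch. The Jaccard-style overlap measure and the Borda-style re-ranking below are assumptions for illustration; the paper's Java framework may measure and re-rank differently.

```python
# Overlap measurement and duplicate-free merged re-ranking for top-k lists.
def overlap(a, b):
    """Fraction of URLs shared by two top-k result lists (Jaccard)."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

def merge_and_rerank(result_lists):
    # Score each URL by summing (k - position) over the engines listing it,
    # so URLs ranked highly by several engines float to the top; duplicates
    # collapse into a single entry.
    scores = {}
    for results in result_lists:
        k = len(results)
        for pos, url in enumerate(results):
            scores[url] = scores.get(url, 0) + (k - pos)
    return sorted(scores, key=scores.get, reverse=True)

google = ["u1", "u2", "u3"]
yahoo = ["u2", "u4", "u1"]
bing = ["u5", "u2", "u6"]
print(overlap(google, yahoo))                 # -> 0.5
print(merge_and_rerank([google, yahoo, bing]))
```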


Author(s):  
R. D. Gaharwar ◽  
D. B. Shah

Search engines are used by most of the world's population as their basic information retrieval system for getting useful information from the internet. For a service provider using the internet for digital marketing, achieving high ranks in search engines is essential, and search engine optimization (SEO) techniques are used for this purpose. Black hat SEO techniques promise quick results but are prohibited by most search engines. Web space users and website developers should therefore be well aware of SEO techniques and how to use them optimally. This paper presents some of the most commonly used black hat SEO techniques and the countermeasures taken by different search engines to prohibit them.


2021 ◽  
Author(s):  
Srihari Vemuru ◽  
Eric John ◽  
Shrisha Rao

Humans can easily parse and find answers to complex queries such as "What was the capital of the country of the discoverer of the element which has atomic number 1?" by breaking them up into small pieces, querying these appropriately, and assembling a final answer. However, contemporary search engines lack such capability and fail to handle even slightly complex queries. Search engines process queries by identifying keywords and searching against them in knowledge bases or indexed web pages. The results are, therefore, dependent on the keywords and how well the search engine handles them. In our work, we propose a three-step approach called parsing, tree generation, and querying (PTGQ) for effective searching of larger and more expressive queries of potentially unbounded complexity. PTGQ parses a complex query and constructs a query tree where each node represents a simple query. It then processes the complex query by recursively querying a back-end search engine, going over the corresponding query tree in postorder. Using PTGQ makes sure that the search engine always handles a simpler query containing very few keywords. Results demonstrate that PTGQ can handle queries of much higher complexity than standalone search engines.
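
A minimal sketch of the PTGQ flow is given below: the query tree here is a chain of simple-query templates resolved in postorder, with `ask_engine` as a canned stand-in for the real back-end search engine; the names and the toy answers are assumptions for illustration.

```python
# PTGQ-style postorder resolution of a query tree of simple queries.
def ask_engine(simple_query):
    # Stand-in for a real search API; returns canned answers here.
    canned = {
        "element with atomic number 1": "hydrogen",
        "discoverer of hydrogen": "Henry Cavendish",
        "country of Henry Cavendish": "England",
        "capital of England": "London",
    }
    return canned[simple_query]

class Node:
    def __init__(self, template, child=None):
        self.template = template  # e.g. "capital of {}"
        self.child = child

def resolve(node):
    """Postorder: resolve the child first, then query with it filled in."""
    if node.child is None:
        return ask_engine(node.template)
    return ask_engine(node.template.format(resolve(node.child)))

# "capital of the country of the discoverer of the element with atomic number 1"
tree = Node("capital of {}",
            Node("country of {}",
                 Node("discoverer of {}",
                      Node("element with atomic number 1"))))
print(resolve(tree))  # -> London
```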


Author(s):  
Pavel Šimek ◽  
Jiří Vaněk ◽  
Jan Jarolímek

The majority of Internet users use the global network to search for information with fulltext search engines such as Google, Yahoo!, or Seznam. Web presentation operators try, with the help of different optimization techniques, to reach the top places in the results of fulltext search engines. This is where Search Engine Optimization and Search Engine Marketing become so important, because ordinary users usually try only the links on the first few result pages of the fulltext search engines for given keywords, and in catalogs they primarily use links placed higher in each category's hierarchy. The key to success is the application of optimization methods that deal with keywords, the structure and quality of content, domain names, individual pages, and the quantity and reliability of backlinks. The process is demanding, long-lasting, and without a guaranteed outcome. A website operator without advanced analytical tools cannot identify the contribution of the individual documents of which the entire website consists. If web presentation operators want an overview of their documents and of their website as a whole, it is appropriate to quantify these positions in a specific way, depending on specific keywords. This purpose is served by the quantification of the competitive value of documents, which in turn determines the global competitive value of a website. The quantification of competitive values is performed on a specific fulltext search engine, and different engines can, and often do, yield different results. According to published reports by the ClickZ agency and Market Share, Google is the most widely used search engine by number of searches by English-speaking users, with a market share of more than 80%. The overall procedure for quantifying competitive values is the same across engines; however, the initial step, the analysis of keywords, depends on the choice of fulltext search engine.
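
As an illustration of such a quantification, the sketch below scores a document by reciprocal rank weighted by keyword importance and sums document values into a site-level value. The formula and the numbers are assumptions for illustration only; the article does not spell out the computation here.

```python
# Hypothetical competitive-value scoring from SERP positions (illustrative).
def document_value(positions, keyword_weights):
    """positions: keyword -> rank of the document in the SERP (1 = top)."""
    return sum(keyword_weights[kw] / rank for kw, rank in positions.items())

def site_value(documents):
    """Global competitive value of a site = sum over its documents."""
    return sum(documents.values())

weights = {"farm machinery": 3.0, "crop prices": 2.0}  # toy keyword weights
doc_a = document_value({"farm machinery": 2, "crop prices": 10}, weights)
doc_b = document_value({"farm machinery": 15, "crop prices": 3}, weights)
print(site_value({"doc_a": doc_a, "doc_b": doc_b}))
```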

