Web Mining in Thematic Search Engines

Author(s):  
Massimiliano Caramia ◽  
Giovanni Felici

In this chapter we report on some extensions of the work presented in the first edition of the Encyclopedia of Data Mining. In Caramia and Felici (2005) we described a method based on clustering and a heuristic search method (a genetic algorithm) to extract pages with relevant information for a specific user query in a thematic search engine. Starting from these results, we have extended the research to address some issues related to the semantic aspects of the search, focusing on the keywords that are used to establish the similarity among the pages that result from the query. Complete details on this method, omitted here for brevity, can be found in Caramia and Felici (2006). Search engine technologies remain a strong research topic, as new problems and new demands from the market and the users arise. The shift from quantity (maintaining and indexing large databases of web pages and quickly selecting pages matching some criterion) to quality (identifying pages with a high quality for the user), already highlighted in Caramia and Felici (2005), has not been interrupted. It has instead gained further energy, motivated by the natural evolution of internet users, who are more selective in their choice of search tool and are willing to pay the price of providing extra feedback to the system and waiting longer to have their queries better matched. In this framework, several authors have considered the use of data mining and optimization techniques, often referred to as web mining (for a recent bibliography on this topic see, e.g., Getoor, Senator, Domingos, and Faloutsos, 2003, and Zaïane, Srivastava, Spiliopoulou, and Masand, 2002).
The work described in this chapter is based on clustering techniques to identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; then, a number of small and potentially good subsets of pages are constructed by extracting from each cluster the pages with higher scores. Operating on these subsets with a genetic algorithm, a subset with a good overall score and a high internal dissimilarity is identified. A related problem is then considered: the selection of a subset of pages that comply with the search keywords but that also share a large subset of words different from the search keywords. This characteristic represents a sort of semantic connection among these pages that may be useful for spotting particular aspects of the information present in the pages. This task is accomplished by constructing a special graph, whose maximum-weight clique and k-densest subgraph should represent the page subsets with the desired properties. In the following we summarize the main background topics and provide a synthetic description of the methods. Interested readers may find additional information in Caramia and Felici (2004), Caramia and Felici (2005), and Caramia and Felici (2006).
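The graph-based selection described above can be illustrated with a small sketch. The abstract does not specify how the graph is built, so the construction here is an assumption: each page is a node, and an edge between two pages is weighted by the number of non-keyword words they share. The function names `shared_word_graph` and `max_weight_clique` are hypothetical, and the clique search is a brute-force enumeration suitable only for tiny inputs, not the method used by the authors.

```python
from itertools import combinations


def shared_word_graph(pages, keywords, min_shared=1):
    """Build {(i, j): weight}, where weight is the number of words shared by
    pages i and j once the search keywords are excluded (assumed weighting)."""
    excluded = {k.lower() for k in keywords}
    vocab = [set(text.lower().split()) - excluded for text in pages]
    edges = {}
    for i, j in combinations(range(len(pages)), 2):
        weight = len(vocab[i] & vocab[j])
        if weight >= min_shared:
            edges[(i, j)] = weight
    return edges


def max_weight_clique(n, edges):
    """Enumerate every node subset of size >= 2 and return the clique with the
    largest total edge weight. Exponential time: for illustration only."""
    best_nodes, best_weight = None, -1
    for size in range(2, n + 1):
        for nodes in combinations(range(n), size):
            pairs = list(combinations(nodes, 2))
            if all(pair in edges for pair in pairs):
                weight = sum(edges[pair] for pair in pairs)
                if weight > best_weight:
                    best_nodes, best_weight = nodes, weight
    return best_nodes, best_weight


pages = [
    "jaguar car engine speed",
    "jaguar engine speed review",
    "jaguar cat jungle",
    "jaguar speed engine test",
]
edges = shared_word_graph(pages, keywords=["jaguar"])
clique, weight = max_weight_clique(len(pages), edges)
print(clique, weight)
```

In this toy input, all four pages match the keyword, but only three of them share the additional words "engine" and "speed", so they form the heaviest clique.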

Author(s):  
Mahesh Kumar Singh ◽  
Om Prakash Rishi ◽  
Anukrati Sharma ◽  
Zaved Akhtar

The Internet plays a vital role in doing business. It provides a platform for reaching a huge number of customers and makes business easier. E-business organizations are growing rapidly, and the World Wide Web (WWW) provides a huge amount of information to Internet users. Users' access behavior is recorded in web logs. This information is very helpful in an E-business environment for analysis and decision making. Mining web data faces many new challenges as the amount of information stored in web logs grows. Search engines play a key role in retrieving relevant information from this mass of information. Nowadays, well-known search engines such as Google, MSN, and Yahoo provide users with good search results based on special search strategies. In web search services such as Google, the web page ranker component plays the main role. This paper discusses the new challenges faced by web mining techniques, the ranking of web pages using page ranking algorithms, and their application in E-business analysis to improve business operations.


2013 ◽  
Vol 10 (9) ◽  
pp. 1969-1976
Author(s):  
Sathya Bama ◽  
M.S.Irfan Ahmed ◽  
A. Saravanan

The internet is growing continuously, which increases the need to improve the quality of services. Web mining is a research area that applies data mining techniques to address this need. With billions of pages on the web, it is a very intricate task for search engines to provide relevant information to users. Web structure mining plays a vital role by ranking web pages based on the user query, which is the most essential task of web search engines. PageRank, Weighted PageRank, and HITS are the commonly used algorithms in web structure mining for ranking web pages. However, all these algorithms treat all links equally when distributing initial rank scores. In this paper, an improved page rank algorithm is introduced. The results show that the algorithm performs better than the PageRank algorithm.
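For readers unfamiliar with the baseline, the following is a minimal sketch of standard PageRank computed by power iteration. It is not the improved algorithm proposed in the paper, whose details the abstract does not give. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank by power iteration.

    links maps each page to the list of pages it links to. Every out-link of
    a page receives an equal share of that page's rank.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page starts each round with the uniform "teleport" share.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

The equal split in `share = damping * rank[u] / len(targets)` is the uniform treatment of links that the abstract identifies as a limitation of these algorithms.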


2021 ◽  
Vol 1 (3) ◽  
pp. 29-34
Author(s):  
Ayad Abdulrahman

Due to the daily expansion of the web, the amount of information has increased significantly. Thus, the need for retrieving relevant information has also increased. In order to explore the internet, users depend on various search engines. Search engines face a significant challenge in returning the most relevant results for a user's query. The search engine's performance is determined by the algorithm used to rank web pages, which places the most relevant pages at the top of the result page. In this paper, various web page ranking algorithms such as Page Rank, Time Rank, EigenRumor, Distance Rank, SimRank, etc. are analyzed and compared based on several parameters, including the mining technique to which the algorithm belongs (for instance, Web Content Mining, Web Structure Mining, and Web Usage Mining), the methodology used for ranking web pages, time complexity (amount of time to run an algorithm), input parameters (parameters utilized in the ranking process such as InLink, OutLink, Tag name, Keyword, etc.), and the relevance of the results to the user query.


2015 ◽  
Vol 2015 ◽  
pp. 1-8
Author(s):  
S. Sadesh ◽  
R. C. Suganthe

The web, with its tremendous volume of information, retrieves results for user queries. With the rapid growth of web page recommendation, results retrieved with data mining techniques have not offered a high filtering rate because the relationships between user profiles and queries were not analyzed extensively. At the same time, existing user-profile-based prediction in web data mining is not exhaustive in producing a personalized result rate. To improve the query result rate given the dynamics of user behavior over time, the Hamilton Filtered Regime Switching User Query Probability (HFRS-UQP) framework is proposed. The HFRS-UQP framework is split into two processes: filtering and switching. The data-mining-based filtering uses the Hamilton Filtering framework to filter user results based on personalized information from automatically updated profiles obtained through the search engine. The maximized result is fetched, that is, filtered with respect to user behavior profiles. The switching step performs accurate filtering of the updated profiles using regime switching. Updating the profile changes (i.e., switches) the regime in the HFRS-UQP framework, which identifies the second- and higher-order associations of query results on the updated profiles. Experiments are conducted on factors such as personalized information search retrieval rate, filtering efficiency, and precision ratio.


Author(s):  
Massimiliano Caramia ◽  
Giovanni Felici

The recent improvements in search engine technologies have made available to Internet users an enormous amount of knowledge that can be accessed in many different ways. The most popular search engines now provide search facilities for databases containing billions of Web pages, where queries are executed instantly. The focus is switching from quantity (maintaining and indexing large databases of Web pages and quickly selecting pages matching some criterion) to quality (identifying pages with a high quality for the user). Such a trend is motivated by the natural evolution of Internet users, who are now more selective in their choice of search tool and may be willing to pay the price of providing extra feedback to the system and of waiting longer for their queries to be better matched. In this framework, several authors have considered the use of data-mining and optimization techniques, which are often referred to as Web mining (for a recent bibliography on this topic, see, e.g., Getoor, Senator, Domingos, & Faloutsos, 2003, and Zaïane, Srivastava, Spiliopoulou, & Masand, 2002). Here, we describe a method for improving standard search results in a thematic search engine, where the documents and the pages made available are restricted to a finite number of topics, and the users are considered to belong to a finite number of user profiles. The method uses clustering techniques to identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; then we construct a number of small and potentially good subsets of pages, extracting from each cluster the pages with higher scores. Operating on these subsets with a genetic algorithm, we identify the subset with a good overall score and a high internal dissimilarity. This provides the user with a few nonduplicated pages that represent more correctly the structure of the initial set of pages.
Because pages are seen by the algorithms as vectors of fixed dimension, the role of the context- or profile-based vectorization is central and specific to the thematic approach of this method.
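The trade-off between overall score and internal dissimilarity can be made concrete with a short sketch. The abstract does not give the objective function, so everything below is an assumption: pages are fixed-dimension vectors, dissimilarity is mean pairwise cosine distance, and the two terms are combined by a simple weighted sum. An exhaustive search stands in for the genetic algorithm, which is only practical for very small page sets. The names `subset_fitness` and `best_subset` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def subset_fitness(subset, vectors, scores, weight=1.0):
    """Assumed objective: total page score plus weighted mean pairwise distance."""
    total_score = sum(scores[i] for i in subset)
    pairs = list(combinations(subset, 2))
    if not pairs:
        return total_score
    diversity = sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
    return total_score + weight * diversity


def best_subset(vectors, scores, size, weight=1.0):
    """Exhaustive search over all subsets of the given size (stand-in for the GA)."""
    candidates = combinations(range(len(vectors)), size)
    return max(candidates, key=lambda s: subset_fitness(s, vectors, scores, weight))


vectors = [[1, 0], [1, 0], [0, 1]]   # pages 0 and 1 are near-duplicates
scores = [0.9, 0.9, 0.8]
chosen = best_subset(vectors, scores, size=2)
print(chosen)
```

Here the two highest-scoring pages are duplicates of each other, so the dissimilarity term steers the selection toward a pair that includes the lower-scoring but distinct page.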


2021 ◽  
Author(s):  
Prem Sagar Sharma ◽  
Divakar Yadav

Purpose: Due to the exponential growth of internet users and internet traffic, information seekers are highly dependent upon search engines to extract relevant information. Due to the accessibility of a large amount of textual, audio, video, and other content, the responsibility of search engines has increased.
Design/methodology/approach: The search engine provides internet users with information relevant to their query, based on content, link structure, etc. However, it does not guarantee the correctness of the information. The performance of a search engine is highly dependent upon the ranking module. The performance of the ranking module depends on the link structure of web pages, which is analysed through Web structure mining (WSM), and on their content, which is analysed through Web content mining (WCM). Web mining plays an important role in computing the rank of web pages.
Findings: In this article, web mining types, techniques, tools, algorithms, and their challenges are presented. Further, the article provides a critical, comprehensive survey for researchers by presenting different features of web pages that are important for checking the quality of web pages.
Originality: In this work, the authors present different approaches, techniques, algorithms, and evaluation approaches from previous research and identify some critical issues in page ranking and web mining, which provide future directions for researchers working in the area.


Author(s):  
Sunny Sharma ◽  
Sunita Sunita ◽  
Arjun Kumar ◽  
Vijay Rana

The emergence of Web technology has generated a massive amount of raw data by enabling Internet users to post their opinions, comments, and reviews on the web. Extracting useful information from this raw data can be a very challenging task. Search engines play a critical role in these circumstances. User queries are becoming a main issue for search engines, so a preprocessing operation is essential. In this paper, we present a framework for natural language preprocessing for efficient data retrieval, along with some of the processing required for effective retrieval, such as elongated word handling, stop word removal, stemming, etc. This manuscript starts by building a manually annotated dataset and then takes the reader through the detailed steps of the process. Experiments are conducted for specific stages of this process to examine the accuracy of the system.
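The three preprocessing steps named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the framework from the paper: the stop-word list is a tiny hypothetical sample, elongated words are handled by collapsing any run of three or more identical letters to two, and `crude_stem` is a naive suffix stripper rather than a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in", "this", "it"}


def squeeze_elongated(token):
    """Collapse runs of 3+ identical characters to 2, e.g. 'goooood' -> 'good'."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token)


def crude_stem(token):
    """Strip one common suffix if at least 3 characters remain (naive stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, fix elongation, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [squeeze_elongated(t) for t in tokens]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


result = preprocess("This movie is goooood and the actors performed well")
print(result)
```

Collapsing to two letters rather than one keeps legitimate doubled letters intact ("good", "well"), at the cost of leaving some words such as "soooo" as "soo"; a dictionary lookup would be needed to resolve those cases.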


2019 ◽  
Vol 9 (3) ◽  
pp. 23-47
Author(s):  
Sumita Gupta ◽  
Neelam Duhan ◽  
Poonam Bansal

With the rapid growth of digital information and user needs, it becomes imperative to quickly retrieve relevant domain- or topic-specific documents for a user query. A focused crawler plays a vital role in digital libraries by crawling the web so that researchers can easily explore the domain-specific search results list and find the desired content for the query. In this article, a focused crawler is proposed for online digital library search engines. It considers the meta-data of the query in order to retrieve the corresponding document or other relevant but missing information (e.g., paid publications from ACM, IEEE, etc.) for the user query. Different query strategies are built from the meta-data and submitted to different search engines, with the aim of finding more of the relevant information that is missing. The results returned by these search engines are filtered and then used for further crawling of the Web.


2021 ◽  
Vol 40 (1) ◽  
pp. 43-52
Author(s):  
Ibrahim A. Fadel ◽  
Hussein Alsanabani ◽  
Cemil Öz ◽  
Tariq Kamal ◽  
Murat İskefiyeli ◽  
...  

The genetic algorithm is one of the data mining classification techniques, and it has been applied successfully in a wide range of applications. However, the performance of the genetic algorithm fluctuates significantly. This research work combines the genetic algorithm with fuzzy logic to dynamically adapt its crossover and mutation parameters. Two different datasets are used in the experiments. Several experiments have been performed to demonstrate the effectiveness of the proposed algorithm. Results show that the rules generated by the proposed algorithm are significantly better, with higher fitness, and more efficient than those of a normal genetic algorithm.

