Identification of Web Site Reliability through Data Scrapping at Web Crawler’s Navigation

Searching for specific content on a web site is like looking for a single character in a bunch of pages. When a user enters a keyword into a search engine, the engine passes it to a web mining process that collects all the terms related to the entered key phrase. A few of the returned pages provide legitimate, authenticated material that the user actually wanted to access, whereas many others deliver unwanted or malicious page code, or virus-laden pages that harm the user's activities and the system's functions. An attack in which a web page compromises a targeted system with faulty instructions and malevolent programs through some form of intrusion is generally called phishing. In this kind of attack, the user is led to unknown or illegal sites through unidentified links embedded in legitimate site content. Once the victim's system is compromised, the attacker begins the actual attack. To avoid such abuse, users need to assess the reliability of a web page's contents before continuing to browse. This paper presents a web crawler architecture, its design complexities, and an implementation for scraping web contents from visited pages in order to identify their reliability and freshness.
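
The abstract does not spell out the crawler's internals; as a rough, hedged sketch of the kind of scraping it describes, the following Python snippet fetches a page, extracts its links, and records simple reliability and freshness signals such as HTTPS use and the Last-Modified header. The specific signals and their interpretation are assumptions for illustration, not the paper's actual method.

```python
# Minimal sketch of a crawler step that scrapes a visited page and records
# simple reliability/freshness signals. The chosen signals are illustrative
# assumptions, not the paper's actual scoring scheme.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def inspect_page(url, timeout=10):
    response = requests.get(url, timeout=timeout)
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect outgoing links found on the page.
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

    report = {
        "url": url,
        "uses_https": urlparse(url).scheme == "https",
        "status_code": response.status_code,
        "last_modified": response.headers.get("Last-Modified"),  # freshness hint
        "external_links": sum(
            urlparse(link).netloc != urlparse(url).netloc for link in links
        ),
        "total_links": len(links),
    }
    return report, links


if __name__ == "__main__":
    report, _ = inspect_page("https://example.com")
    print(report)
```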

Nowadays mobile phones are widespread in our daily lives; we use them as a camera, a radio, a music player, and even as a web browser. Since most web pages are created for desktop computers, navigating through them on a phone is highly fatiguing. Hence, there is great interest in computer science in adapting such content-rich pages to the small screens of mobile devices. On the other hand, the different parts of a web page are not equally important to the end user. Consequently, the authors propose a mechanism to identify the part of a web page that is most useful to a user with respect to his or her search query, while avoiding information loss. The challenge comes from the fact that long web contents cannot be easily displayed in either the vertical or the horizontal direction.
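
The selection mechanism itself is not detailed in the abstract; as a hedged illustration of the general idea, the sketch below scores each text block of a page by its overlap with the query terms and returns the best-scoring block. The block extraction and the scoring heuristic are assumptions, not the authors' mechanism.

```python
# Illustrative sketch: rank the text blocks of a page by overlap with the
# user's query terms and return the most relevant block. This is an assumed
# heuristic for illustration only.
from bs4 import BeautifulSoup


def best_block_for_query(html, query):
    query_terms = set(query.lower().split())
    soup = BeautifulSoup(html, "html.parser")

    best_block, best_score = None, 0.0
    for tag in soup.find_all(["p", "div", "section", "article"]):
        text = tag.get_text(" ", strip=True)
        words = text.lower().split()
        if not words:
            continue
        # Fraction of block words that are query terms, lightly weighted by
        # block length so that one-word blocks do not dominate.
        overlap = sum(word in query_terms for word in words)
        score = overlap / len(words) * min(len(words), 50)
        if score > best_score:
            best_block, best_score = text, score
    return best_block
```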


Author(s):  
Paolo Giudici ◽  
Paola Cerchiello

The aim of this contribution is to show how information about the order in which the pages of a Web site are visited can be profitably used to predict visiting behaviour at the site. Usually, every click corresponds to the visualization of a Web page; thus, a Web clickstream defines the sequence of Web pages requested by a user, and such a sequence identifies a user session.
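
As a small illustration of the clickstream notion (not the authors' predictive model), the sketch below groups a user's timestamped page requests into sessions. The 30-minute inactivity threshold is a common convention and an assumption here, not taken from the article.

```python
# Illustrative sketch: group timestamped page requests into user sessions.
# The 30-minute inactivity gap is an assumed convention.
from datetime import datetime, timedelta


def sessionize(clicks, gap=timedelta(minutes=30)):
    """clicks: list of (timestamp, page) tuples sorted by timestamp."""
    sessions, current = [], []
    last_time = None
    for ts, page in clicks:
        if last_time is not None and ts - last_time > gap:
            sessions.append(current)
            current = []
        current.append(page)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions


clicks = [
    (datetime(2024, 1, 1, 10, 0), "/home"),
    (datetime(2024, 1, 1, 10, 5), "/products"),
    (datetime(2024, 1, 1, 11, 0), "/home"),   # new session after a long gap
]
print(sessionize(clicks))  # [['/home', '/products'], ['/home']]
```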


2015 ◽  
Vol 12 (1) ◽  
pp. 91-114 ◽  
Author(s):  
Víctor Prieto ◽  
Manuel Álvarez ◽  
Víctor Carneiro ◽  
Fidel Cacheda

Search engines use crawlers to traverse the Web in order to download web pages and build their indexes. Keeping these indexes up to date is essential to ensure the quality of search results. However, changes in web pages are unpredictable, and identifying the moment a web page changes, as soon as possible and with minimal computational cost, is a major challenge. In this article we present the Web Change Detection system, which in the best case is capable of detecting, almost in real time, when a web page changes. In the worst case it requires, on average, 12 minutes to detect a change on a web site with low PageRank and about one minute on a web site with high PageRank. Meanwhile, current search engines require, on average, more than a day to detect a modification to a web page (in both cases).
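
For context, a naive baseline for change detection is periodic polling with content hashing, as in the hedged sketch below; this is not the Web Change Detection system described in the article, and the polling interval is an assumption.

```python
# Illustrative baseline: detect changes by periodically re-fetching a page
# and comparing a hash of its content. Not the article's system.
import hashlib
import time

import requests


def content_fingerprint(url):
    html = requests.get(url, timeout=10).text
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def watch(url, interval_seconds=60):
    last = content_fingerprint(url)
    while True:
        time.sleep(interval_seconds)
        current = content_fingerprint(url)
        if current != last:
            print(f"Change detected at {url}")
            last = current
```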


Author(s):  
GAURAV AGARWAL ◽  
SACHI GUPTA ◽  
SAURABH MUKHERJEE

Today, web servers are the key repositories of information, and the Internet is the means of accessing it. There is a mammoth amount of data on the Internet, so finding the relevant data is a difficult job, and search engines play a vital role in this task. A search engine follows three steps: web crawling by the crawler, indexing by the indexer, and searching by the searcher. The web crawler retrieves information from web pages by following every link on a site; the gathered content is stored by the search engine and then indexed by the indexer. The main role of the indexer is to make data retrievable quickly according to user requirements. When the client issues a query, the search engine finds the results corresponding to that query in order to provide good output. The ambition here is to develop an algorithm for a search engine that returns the most desirable results for the user's requirements. A ranking method is used by the search engine to rank web pages. Various ranking approaches are discussed in the literature, but in this paper a ranking algorithm is proposed that is based on a parent-child relationship. The proposed algorithm is based on the priority assignment phase of the Heterogeneous Earliest Finish Time (HEFT) algorithm, which was designed for multiprocessor task scheduling. It works with three variables: the density of keywords, the number of successors of a node, and the age of the web page. Density reflects the occurrence of the keyword on a particular web page. The number of successors represents the outgoing links from a single web page. Age is the freshness value of the web page: the page modified most recently is the freshest and has the smallest age, i.e. the largest freshness value. The proposed technique requires that the priority of each page be set using downward rank values, with pages arranged in ascending or descending order of their rank values. Experiments show that the algorithm is valuable: in a comparison with Google, the authors find that their algorithm performs better on 70% of the test problems.
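
The abstract does not give the exact rank formula; as a hedged sketch, the code below combines the three variables it names (keyword density, number of successors, and page age) into a single score and propagates it from a page's children in the spirit of HEFT's downward-rank phase. The combination weights and the recursion are assumptions, not the paper's formula.

```python
# Illustrative sketch: rank pages from keyword density, number of successors
# (outgoing links), and page age, propagated HEFT-style from children.
# Weights and the recursion are assumptions; an acyclic link structure
# (parent-child relationship) is assumed.
def local_score(page, weights=(0.5, 0.3, 0.2)):
    w_density, w_succ, w_fresh = weights
    freshness = 1.0 / (1.0 + page["age_days"])      # newer page -> larger value
    return (w_density * page["keyword_density"]
            + w_succ * len(page["successors"])
            + w_fresh * freshness)


def downward_rank(page_id, pages, memo=None):
    """Rank of a page = its local score + the best rank among its children."""
    memo = {} if memo is None else memo
    if page_id in memo:
        return memo[page_id]
    page = pages[page_id]
    child_ranks = [downward_rank(child, pages, memo) for child in page["successors"]]
    memo[page_id] = local_score(page) + (max(child_ranks) if child_ranks else 0.0)
    return memo[page_id]


pages = {
    "A": {"keyword_density": 0.04, "age_days": 2, "successors": ["B", "C"]},
    "B": {"keyword_density": 0.10, "age_days": 30, "successors": []},
    "C": {"keyword_density": 0.02, "age_days": 1, "successors": []},
}
ranking = sorted(pages, key=lambda p: downward_rank(p, pages), reverse=True)
print(ranking)  # pages ordered by descending rank
```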


Author(s):  
Ben Choi

Web mining aims at searching, organizing, and extracting information on the Web, and search engines focus on searching. The next stage of Web mining is the organization of Web contents, which will then facilitate the extraction of useful information from the Web. This chapter focuses on organizing Web contents. Since the majority of Web contents are stored in the form of Web pages, the chapter concentrates on techniques for automatically organizing Web pages into categories. Various artificial intelligence techniques have been used; the most successful, however, are classification and clustering, and this chapter focuses on clustering. Clustering is well suited for Web mining because it automatically organizes Web pages into categories, each of which contains Web pages with similar contents. However, one problem in clustering is the lack of general methods for automatically determining the number of categories or clusters, and until now no such method has been suitable for Web page clustering. To address this problem, the chapter describes a method to discover a constant factor that characterizes the Web domain and proposes a new method for automatically determining the number of clusters in Web page datasets. The chapter also proposes a new bi-directional hierarchical clustering algorithm, which arranges individual Web pages into clusters, then arranges the clusters into larger clusters, and so on until the average inter-cluster similarity approaches the constant factor. With the constant factor and the algorithm together, the chapter provides a new clustering system suitable for mining the Web.
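
The constant factor and the bi-directional algorithm are the chapter's contribution and are not reproduced here; as a rough sketch of the stopping idea, the code below agglomeratively merges page vectors until the average inter-cluster similarity drops to a given threshold, which stands in for the constant factor. The cosine-similarity representation and the threshold value are assumptions.

```python
# Illustrative sketch: agglomerative clustering of web page vectors that stops
# when the average inter-cluster similarity reaches a threshold (a stand-in
# for the chapter's "constant factor"). Representation and threshold are assumed.
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def avg_intercluster_similarity(centroids):
    sims = [cosine(centroids[i], centroids[j])
            for i in range(len(centroids)) for j in range(i + 1, len(centroids))]
    return sum(sims) / len(sims) if sims else 0.0


def cluster_pages(vectors, threshold=0.2):
    clusters = [[i] for i in range(len(vectors))]
    centroids = [np.array(v, dtype=float) for v in vectors]
    while len(clusters) > 1 and avg_intercluster_similarity(centroids) > threshold:
        # Merge the two most similar clusters.
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cosine(centroids[ab[0]], centroids[ab[1]]),
        )
        clusters[i] += clusters.pop(j)
        centroids[i] = (centroids[i] + centroids.pop(j)) / 2.0
    return clusters
```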


2004 ◽  
Vol 4 (1) ◽  
Author(s):  
David Carabantes Alarcón ◽  
Carmen García Carrión ◽  
Juan Vicente Beneit Montesinos

Quality on the Internet is of great value, all the more so when the web page in question deals with health, as is the case for a resource on drug dependence. This article reviews the most prominent indicators and systems of web quality in order to develop a specific system for assessing the quality of web resources on drug dependence. A feasibility test was carried out by analysing the main web pages on this subject (n=60), gathering users' assessments of the quality of the resources. Areas for improvement were identified regarding the accuracy and reliability of the information, authorship, and the development of descriptions and assessments of external links.


2018 ◽  
Vol 173 ◽  
pp. 03020
Author(s):  
Lu Xing-Hua ◽  
Ye Wen-Quan ◽  
Liu Ming-Yuan

In order to improve users' ability to access websites and web pages, a personalized recommendation design is carried out according to users' interest preferences, and a personalized recommendation model for web page visits is established to meet users' personalized interests when browsing the web. A web page personalized recommendation algorithm based on association rule mining is proposed. Based on the semantic features of web pages, user browsing behaviour is computed by similarity calculation, and a web crawler algorithm is constructed to extract the semantic features of web pages. An autocorrelation matching method is used to match web page features to user browsing behaviour, and association-rule features of users' website browsing behaviour are mined. According to the semantic relevance and semantic information of users' search words, fuzzy registration is applied, and personalized web recommendations are obtained that meet users' browsing needs. Simulation results show that the method is accurate and that user satisfaction is higher.
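
As a hedged illustration of the matching step only, the sketch below compares web page feature vectors against a user's browsing-behaviour profile with cosine similarity and recommends the top pages; the feature construction and threshold are assumptions, and the paper's association-rule mining and fuzzy registration steps are not reproduced.

```python
# Illustrative sketch: recommend pages whose semantic feature vectors are
# most similar to a user's browsing-behaviour profile. Feature construction
# is assumed; this is not the paper's full algorithm.
import numpy as np


def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def recommend(pages, user_profile, top_k=3):
    """pages: dict mapping URL -> semantic feature vector."""
    scored = [(url, cosine_similarity(vec, user_profile)) for url, vec in pages.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


pages = {
    "/sports/news": [1, 0, 1, 0],
    "/tech/review": [0, 1, 0, 1],
    "/tech/howto":  [0, 1, 1, 1],
}
user_profile = [0, 1, 0, 1]   # built from the user's past browsing behaviour
print(recommend(pages, user_profile, top_k=2))
```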


Author(s):  
ALI SELAMAT ◽  
ZHI SAM LEE ◽  
MOHD AIZAINI MAAROF ◽  
SITI MARIYAM SHAMSUDDIN

In this paper, an improved web page classification method (IWPCM) that uses neural networks to identify the illicit contents of web pages is proposed. The IWPCM approach is based on improving the feature selection of web pages using class-based feature vectors (CPBF). The CPBF feature selection approach is computed by giving greater weight to important terms in illicit web documents and reducing the dependence on the weights of less important terms in normal web documents. The IWPCM approach has been examined using the modified term-weighting scheme, comparing it with several traditional term-weighting schemes on non-illicit and illicit web contents available from the web. Precision, recall, and the F1 measure have been used to evaluate the effectiveness of the proposed approach. The experimental results show that the proposed improved term-weighting scheme is able to identify the non-illicit and illicit web contents in the experimental datasets.
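
To illustrate the general idea of class-based term weighting (not the exact CPBF scheme), the hedged sketch below boosts terms that occur much more often in the illicit class than in the normal class; the ratio-style weight is an assumption.

```python
# Illustrative sketch of class-based term weighting: terms characteristic of
# illicit pages receive higher weights. An assumed scheme, not the paper's CPBF.
from collections import Counter


def class_based_weights(illicit_docs, normal_docs):
    """Each argument is a list of documents; each document is a list of terms."""
    illicit_counts = Counter(t for doc in illicit_docs for t in doc)
    normal_counts = Counter(t for doc in normal_docs for t in doc)

    weights = {}
    for term, in_illicit in illicit_counts.items():
        in_normal = normal_counts.get(term, 0)
        # High weight when the term appears mostly in illicit documents.
        weights[term] = in_illicit / (in_normal + 1)
    return weights


illicit = [["buy", "illegal", "content"], ["illegal", "download"]]
normal = [["buy", "book", "online"], ["download", "manual"]]
print(class_based_weights(illicit, normal))
```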


2013 ◽  
Vol 303-306 ◽  
pp. 2311-2316
Author(s):  
Hong Shen Liu ◽  
Peng Fei Wang

The structure and contents of a research-oriented search engine are presented, and its core technology is web page analysis. The characteristics of analysing the web pages of a single website are studied: the relations between the pages that the web crawler gathered at two different times can be obtained, and the information that changed between them is found easily. A new method of analysing the web pages of one website is introduced; the method analyses web pages using this changed information. The results of applying the method show that it is effective for the analysis of web pages.
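
As a generic baseline for working with the changed information between two crawls (not the paper's method), the sketch below compares two snapshots of the same page and extracts the changed lines with a plain text diff.

```python
# Illustrative sketch: extract changed lines between two crawl snapshots of
# the same page using a plain text diff. A generic baseline, not the paper's method.
import difflib


def changed_lines(old_html, new_html):
    diff = difflib.unified_diff(
        old_html.splitlines(), new_html.splitlines(), lineterm=""
    )
    # Keep only added/removed content lines, skipping the diff headers.
    return [line for line in diff
            if line[:1] in "+-" and not line.startswith(("+++", "---"))]


old = "<h1>News</h1>\n<p>Old story</p>"
new = "<h1>News</h1>\n<p>New story</p>\n<p>Extra item</p>"
print(changed_lines(old, new))
# ['-<p>Old story</p>', '+<p>New story</p>', '+<p>Extra item</p>']
```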


2017 ◽  
Vol 23 (4) ◽  
pp. 192-197 ◽  
Author(s):  
Lori Northrup ◽  
Ed Cherry ◽  
Della Darby

Frustrated by the time-consuming process of updating subject Web pages, librarians at Samford University Library (SUL) developed a process for streamlining updates using Server-Side Include (SSI) commands. They created text files on the library server corresponding to each of 143 online resources. Include commands within the HTML document for each subject page refer to these text files, which are pulled into the page as it loads in the user's browser. For the user, the process is seamless. For librarians, time spent updating Web pages is greatly reduced; changes to text files on the server result in simultaneous changes to the edited resources across the library's Web site. For small libraries with limited online resources, this process may provide an elegant solution to an ongoing problem.

