A scale for crawler effectiveness on the client-side hidden web

2012 ◽  
Vol 9 (2) ◽  
pp. 561-583 ◽  
Author(s):  
Víctor Prieto ◽  
Manuel Álvarez ◽  
Rafael López-García ◽  
Fidel Cacheda

The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the "client-side" Hidden Web. First, we perform a thorough analysis of the different client-side technologies and the main features of the web pages in order to determine the basic steps of the aforementioned scale. Then, we define the scale by grouping basic scenarios in terms of several common features, and we propose some methods to evaluate the effectiveness of the crawlers according to the levels of the scale. Finally, we present a testing web site and we show the results of applying the aforementioned methods to the results obtained by some open-source and commercial crawlers that tried to traverse the pages. Only a few crawlers achieve good results in treating client-side technologies. Regarding standalone crawlers, we highlight the open-source crawlers Heritrix and Nutch and the commercial crawler WebCopierPro, which is able to process very complex scenarios. With regard to the crawlers of the main search engines, only Google processes most of the scenarios we have proposed, while Yahoo! and Bing just deal with the basic ones. There are not many studies that assess the capacity of the crawlers to deal with client-side technologies. Also, these studies consider fewer technologies, fewer crawlers and fewer combinations. Furthermore, to the best of our knowledge, our article provides the first scale for classifying crawlers from the point of view of the most important client-side technologies.

2015 ◽  
Vol 12 (1) ◽  
pp. 91-114 ◽  
Author(s):  
Víctor Prieto ◽  
Manuel Álvarez ◽  
Víctor Carneiro ◽  
Fidel Cacheda

Search engines use crawlers to traverse the Web in order to download web pages and build their indexes. Keeping these indexes up to date is essential to ensure the quality of search results. However, changes in web pages are unpredictable. Identifying the moment when a web page changes, as soon as possible and with minimal computational cost, is a major challenge. In this article we present the Web Change Detection system, which, in a best-case scenario, is capable of detecting almost in real time when a web page changes. In a worst-case scenario, it requires, on average, 12 minutes to detect a change on a web site with low PageRank and about one minute on a web site with high PageRank. By comparison, current search engines require more than a day, on average, to detect a modification in a web page (in both cases).


2004 ◽  
Vol 4 (1) ◽  
Author(s):  
David Carabantes Alarcón ◽  
Carmen García Carrión ◽  
Juan Vicente Beneit Montesinos

Quality on the Internet is highly valuable, all the more so for a health-related web page such as a resource on drug dependence. This article reviews the most prominent estimators and systems of web quality in order to develop a specific system for assessing the quality of web resources on drug dependence. A feasibility test was carried out by analysing the main web pages on this topic (n=60), gathering an assessment of the quality of the resources from the user's point of view. Areas for improvement were identified in the accuracy and reliability of the information, authorship, and the provision of descriptions and ratings of external links.


2013 ◽  
Vol 303-306 ◽  
pp. 2311-2316
Author(s):  
Hong Shen Liu ◽  
Peng Fei Wang

The structure and content of search engine research are presented; the core technology is the analysis of web pages. The characteristics of analyzing the web pages of a single website are studied: the relations between the web pages that a web crawler gathers at two different times can be obtained, and the information that changed between them is easily found. A new method of analyzing the web pages of a single website is introduced, which analyzes web pages using this changed information. The results of applying the method show that it is effective for the analysis of web pages.
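The idea of comparing two crawls of the same site can be illustrated with a minimal sketch. The abstract does not describe the authors' actual method, so everything here is an assumption: each crawl is modelled as a mapping from URL to HTML, pages are compared by a content digest, and the names `fingerprint` and `diff_crawls` are hypothetical.

```python
import hashlib


def fingerprint(html: str) -> str:
    """Return a content digest used to decide whether a page changed."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def diff_crawls(old: dict, new: dict) -> dict:
    """Compare two crawls of one site, each mapping URL -> HTML (assumed format)."""
    old_urls, new_urls = set(old), set(new)
    # A page counts as changed if it exists in both crawls with different digests.
    changed = sorted(
        url for url in old_urls & new_urls
        if fingerprint(old[url]) != fingerprint(new[url])
    )
    return {
        "added": sorted(new_urls - old_urls),
        "removed": sorted(old_urls - new_urls),
        "changed": changed,
    }


# Toy data standing in for two crawls taken at different times.
first = {"/": "<h1>Home</h1>", "/news": "<p>Monday</p>", "/old": "<p>x</p>"}
second = {"/": "<h1>Home</h1>", "/news": "<p>Tuesday</p>", "/new": "<p>y</p>"}
report = diff_crawls(first, second)
```

Under these assumptions, only the pages listed under `added` and `changed` would need to be analyzed again after the second crawl.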


Author(s):  
Paolo Giudici ◽  
Paola Cerchiello

The aim of this contribution is to show how the information, concerning the order in which the pages of a Web site are visited, can be profitably used to predict the visit behaviour at the site. Usually every click corresponds to the visualization of a Web page. Thus, a Web clickstream defines the sequence of the Web pages requested by a user. Such a sequence identifies a user session.
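As a rough illustration of using visit order to predict behaviour, the sketch below counts page-to-page transitions across sessions and predicts the most frequent successor of a page. This first-order transition model is an assumption chosen for brevity and is not necessarily the model the authors use; the session data and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def transition_counts(sessions):
    """Count how often each page is immediately followed by each other page."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, page):
    """Return the most frequent successor of `page`, or None if it was never left."""
    successors = counts.get(page)
    if not successors:
        return None
    return successors.most_common(1)[0][0]


# Each inner list is one user session: the ordered pages of a clickstream.
sessions = [
    ["home", "catalog", "product", "cart"],
    ["home", "catalog", "product"],
    ["home", "search", "product", "cart"],
    ["home", "catalog", "help"],
]
counts = transition_counts(sessions)
```

With this toy data, `predict_next(counts, "home")` returns `"catalog"`, since three of the four sessions move from the home page to the catalog.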


2018 ◽  
Vol 7 (3.6) ◽  
pp. 106
Author(s):  
B J. Santhosh Kumar ◽  
Kankanala Pujitha

The application takes a URL as input for detecting web application vulnerabilities. If the URL is too long, scanning it takes more time (Ain Zubaidah et al. 2014). Existing systems can examine individual web pages but not the web application as a whole. This application tests URLs of any length using a string matching algorithm. To prevent XSS and CSRF and to detect attacks that try to bypass browser-enforced policies, it uses white list and DOM sandboxing techniques (Elias Athanasopoulos et al. 2012). The web application includes a list of cryptographic hashes of legitimate (trusted) client-side scripts. The white list is checked for the cryptographic hash of each script: if the hash is found, the script is considered trusted; otherwise it is considered untrusted. The application uses SHA-1 to create the message digest. The web server stores trusted scripts inside div or span HTML elements that are marked as trusted. DOM sandboxing helps identify the script or code by partitioning program symbols into code and non-code, which helps detect any hidden code inside a trusted tag that bypasses the web server. The application scans the website to detect injection points, injects malicious XSS attack vectors at those points, and checks for these attacks in the vulnerable web application (Shashank Gupta et al. 2015). The proposed application improves the false negative rate.
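The white list check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `sha1_digest`, `build_whitelist` and `is_trusted` and the example scripts are hypothetical, and scripts are assumed to be compared byte for byte after UTF-8 encoding.

```python
import hashlib


def sha1_digest(script: str) -> str:
    """Return the SHA-1 message digest of a script as a hex string."""
    return hashlib.sha1(script.encode("utf-8")).hexdigest()


def build_whitelist(trusted_scripts):
    """Build the set of digests of all legitimate (trusted) scripts."""
    return {sha1_digest(script) for script in trusted_scripts}


def is_trusted(script: str, whitelist) -> bool:
    """A script is trusted only if its digest appears in the white list."""
    return sha1_digest(script) in whitelist


# Hypothetical trusted scripts shipped with the web application.
whitelist = build_whitelist(["console.log('hello');", "initMenu();"])
```

Here `is_trusted("initMenu();", whitelist)` is true, while an injected script such as `alert(document.cookie);` is rejected because its digest is not in the list. SHA-1 appears only because the abstract names it; it is no longer considered collision-resistant, so SHA-256 would be the more usual choice today.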


2007 ◽  
Vol 16 (05) ◽  
pp. 793-828 ◽  
Author(s):  
JUAN D. VELÁSQUEZ ◽  
VASILE PALADE

Understanding the web user browsing behaviour in order to adapt a web site to the needs of a particular user represents a key issue for many commercial companies that do their business over the Internet. This paper presents the implementation of a Knowledge Base (KB) for building web-based computerized recommender systems. The Knowledge Base consists of a Pattern Repository that contains patterns extracted from web logs and web pages, by applying various web mining tools, and a Rule Repository containing rules that describe the use of discovered patterns for building navigation or web site modification recommendations. The paper also focuses on testing the effectiveness of the proposed online and offline recommendations. An ample real-world experiment is carried out on a web site of a bank.


Author(s):  
Aso Mohammed Aladdin ◽  
Chnoor M. Rahman ◽  
Mzhda S. Abdulkarim

When developing web sites, there are rules that developers should follow in order to create a site that suits users' needs and makes them as comfortable as possible when they browse it. Before creating any website or running any application, it is important for developers to address the functionality, design, usability and security of the work according to the requirements. Every developer has his or her own way of developing a website: some prefer to use website builders, while others prefer the software and programming languages they already have in mind. Therefore, this paper compares web-based sites and open-source projects in terms of functionality, usability, design and security, in order to help academic staff and business organizations choose the best way to develop an academic or e-commerce web site.


Author(s):  
Pavel Šimek ◽  
Jiří Vaněk ◽  
Jan Jarolímek

The majority of Internet users use the global network to search for information with full-text search engines such as Google, Yahoo!, or Seznam. Operators of web presentations try, with the help of various optimization techniques, to reach the top places in the results of full-text search engines. This is where Search Engine Optimization and Search Engine Marketing become important, because ordinary users usually follow links only on the first few result pages for given keywords, and in catalogs they mainly use the links placed higher in the hierarchy of each category. The key to success is the application of optimization methods that deal with keywords, the structure and quality of content, domain names, individual sites, and the quantity and reliability of backward links. The process is demanding, long-lasting and without a guaranteed outcome. Without advanced analytical tools, a website operator cannot identify the contribution of the individual documents that make up the web site. If operators want an overview of their documents and of the web site as a whole, it is appropriate to quantify these positions in a specific way, depending on specific keywords. The quantification of the competitive value of documents serves this purpose, and it in turn determines the global competitive value of a web site. Competitive values are quantified for a specific full-text search engine; the results can differ, and often do, from one search engine to another. According to published reports by the ClickZ agency and Market Share, Google is the search engine most widely used by English-speaking users in terms of number of searches, with a market share of more than 80%. The overall quantification procedure is the same for all search engines; however, the initial step, the analysis of keywords, depends on the choice of full-text search engine.


2019 ◽  
Author(s):  
Lucas van der Deijl ◽  
Antal van den Bosch ◽  
Roel Smeets

Literary history is no longer written in books alone. As literary reception thrives in blogs, Wikipedia entries, Amazon reviews, and Goodreads profiles, the Web has become a key platform for the exchange of information on literature. Although conventional printed media in the field (academic monographs, literary supplements, and magazines) may still claim the highest authority, online media presumably provide the first (and possibly the only) source for many readers casually interested in literary history. Wikipedia offers quick and free answers to readers’ questions and the range of topics described in its entries dramatically exceeds the volume any printed encyclopedia could possibly cover. While an important share of this expanding knowledge base about literature is produced bottom-up (user based and crowd-sourced), search engines such as Google have become brokers in this online economy of knowledge, organizing information on the Web for its users. Similar to the printed literary histories, search engines prioritize certain information sources over others when ranking and sorting Web pages; as such, their search algorithms create hierarchies of books, authors, and periods.


Author(s):  
Kimihito Ito ◽  
Yuzuru Tanaka

Web applications, which are computer programs ported to the Web, allow end-users to use various remote services and tools through their Web browsers. There are an enormous number of Web applications on the Web, and they are becoming the basic infrastructure of everyday life. In spite of the remarkable development of Web-based infrastructure, it is still difficult for end-users to compose new integrated tools from both existing Web applications and legacy local applications, such as spreadsheets, chart tools, and databases. In this chapter, the authors propose a new framework where end-users can wrap remote Web applications into visual components, called pads, and functionally combine them through drag-and-drop operations. As the basis, the authors use the meme media architecture IntelligentPad, which was proposed by the second author. In the IntelligentPad architecture, each visual component, called a pad, has slots as data I/O ports. By pasting a pad onto another pad, users can integrate their functionalities. The framework presented in this chapter allows users to visually create a wrapper pad for any Web application by defining HTML nodes within the Web application to work as slots. Examples of such nodes include input forms and text strings on Web pages. Users can directly manipulate both wrapped Web applications and wrapped local legacy tools on their desktop screen to define application linkages among them. Since no programming expertise is required to wrap Web applications or to functionally combine them, end-users can build new integrated tools from both wrapped Web applications and local legacy applications.

