Distributed and collaborative Web Change Detection system

2015 ◽  
Vol 12 (1) ◽  
pp. 91-114 ◽  
Author(s):  
Víctor Prieto ◽  
Manuel Álvarez ◽  
Víctor Carneiro ◽  
Fidel Cacheda

Search engines use crawlers to traverse the Web in order to download web pages and build their indexes. Keeping these indexes up to date is essential to ensure the quality of search results. However, changes in web pages are unpredictable, and identifying the moment when a web page changes, as soon as possible and with minimal computational cost, is a major challenge. In this article we present the Web Change Detection system, which, in the best case, is capable of detecting, almost in real time, when a web page changes. In the worst case, it requires, on average, 12 minutes to detect a change on a web site with low PageRank and about one minute on a web site with high PageRank. By contrast, current search engines require, on average, more than a day to detect a modification in either case.
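As a rough illustration of the kind of change detection the abstract describes, the following Python sketch fingerprints a page with a content hash and derives a revisit interval from PageRank; the hash-based check, the interval formula, and all function names are illustrative assumptions, not the authors' implementation (only the 1-minute and 12-minute bounds come from the abstract).

```python
import hashlib
import urllib.request


def page_fingerprint(url: str) -> str:
    """Download a page and return a hash of its body, used as a change fingerprint."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
    return hashlib.sha256(body).hexdigest()


def has_changed(url: str, last_fingerprint: str) -> bool:
    """Return True if the page content differs from the stored fingerprint."""
    return page_fingerprint(url) != last_fingerprint


def revisit_interval_minutes(pagerank: float,
                             min_interval: float = 1.0,
                             max_interval: float = 12.0) -> float:
    """Assumed revisit policy: interpolate between ~1 minute (high PageRank)
    and ~12 minutes (low PageRank), with pagerank normalized to [0, 1]."""
    pagerank = max(0.0, min(1.0, pagerank))
    return max_interval - (max_interval - min_interval) * pagerank
```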

Author(s):  
Paolo Giudici ◽  
Paola Cerchiello

The aim of this contribution is to show how information about the order in which the pages of a Web site are visited can be profitably used to predict visit behaviour at the site. Usually, every click corresponds to the viewing of a Web page. A Web clickstream is therefore the sequence of Web pages requested by a user, and such a sequence identifies a user session.
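The notion of a session built from a clickstream can be made concrete with a small sketch. The Python function below groups (user, page, timestamp) clicks into per-user, time-ordered sessions using an inactivity timeout; the 30-minute timeout and the function itself are illustrative assumptions, not part of the original contribution.

```python
from datetime import datetime, timedelta
from typing import Iterable


def sessionize(clicks: Iterable[tuple[str, str, datetime]],
               timeout: timedelta = timedelta(minutes=30)) -> dict[str, list[list[str]]]:
    """Group (user, page, timestamp) clicks into per-user sessions.

    A new session starts when the gap between consecutive clicks of the same
    user exceeds the timeout; each session is the ordered list of pages visited.
    """
    sessions: dict[str, list[list[str]]] = {}
    last_seen: dict[str, datetime] = {}
    for user, page, ts in sorted(clicks, key=lambda c: (c[0], c[2])):
        if user not in sessions or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])  # open a new session
        sessions[user][-1].append(page)
        last_seen[user] = ts
    return sessions
```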


2004 ◽  
Vol 4 (1) ◽  
Author(s):  
David Carabantes Alarcón ◽  
Carmen García Carrión ◽  
Juan Vicente Beneit Montesinos

Quality on the Internet is of great value, and even more so when it concerns a health-related web page such as a resource on drug dependence. This article reviews the most prominent estimators and systems for web quality in order to develop a specific system for assessing the quality of web resources on drug dependence. A feasibility test was carried out by analysing the main web pages on this subject (n=60), gathering an assessment, from the user's point of view, of the quality of the resources. Areas for improvement were identified regarding the accuracy and reliability of the information, authorship, and the development of descriptions and evaluations of external links.


2013 ◽  
Vol 303-306 ◽  
pp. 2311-2316
Author(s):  
Hong Shen Liu ◽  
Peng Fei Wang

The structure and content of research-oriented search engines are presented; their core technology is the analysis of web pages. The characteristics of analyzing the web pages of a single website are studied: the relations between the pages a web crawler retrieves at two different times can be obtained, and the information that changed between them is found easily. A new method for analyzing the web pages of a single website is introduced; it analyzes the pages on the basis of this change information. The results of applying the method show that it is effective for web page analysis.
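One simple way to realize the idea of comparing the pages a crawler obtained at two different times is to diff two crawl snapshots keyed by URL. The sketch below is an illustration under the assumption that each snapshot maps URLs to content hashes; it is not the specific method proposed in the paper.

```python
def diff_crawls(first: dict[str, str], second: dict[str, str]):
    """Compare two crawl snapshots of one website.

    Each snapshot maps URL -> content hash. Returns the URLs that were
    added, removed, or whose content changed between the two crawls.
    """
    added = sorted(second.keys() - first.keys())
    removed = sorted(first.keys() - second.keys())
    changed = sorted(url for url in first.keys() & second.keys()
                     if first[url] != second[url])
    return added, removed, changed
```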


2018 ◽  
Vol 8 (4) ◽  
pp. 1-13
Author(s):  
Rajnikant Bhagwan Wagh ◽  
Jayantrao Bhaurao Patil

Recommendation systems are growing very rapidly. While surfing, users frequently miss the goal of their search and get lost in an overload of information. To overcome this information overload problem, the authors propose a novel web page recommendation system to save users' surfing time. Users are analyzed as they surf through a particular web site. The authors use a relationship matrix and a frequency matrix to effectively find the connectivity among the web pages of similar users. These web pages are divided into clusters using an enhanced graph-based partitioning concept, and active users are then classified more accurately into the resulting clusters. Threshold values are applied in both the clustering and classification stages for more appropriate results. Experimental results show around 61% accuracy, 37% coverage and a 46% F1 measure, helping to improve users' surfing experience.
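A minimal sketch of what a session-based frequency matrix and a threshold-based recommendation step might look like is given below; the exact matrix definitions, thresholds, and clustering used by the authors are not given in the abstract, so every name and value here is an illustrative assumption.

```python
from collections import defaultdict


def frequency_matrix(sessions: list[list[str]]) -> dict[tuple[str, str], int]:
    """Count how often two pages are visited within the same session.

    One plausible reading of a page-to-page frequency matrix; not the
    authors' exact definition.
    """
    freq: dict[tuple[str, str], int] = defaultdict(int)
    for session in sessions:
        pages = sorted(set(session))
        for i, a in enumerate(pages):
            for b in pages[i + 1:]:
                freq[(a, b)] += 1
    return dict(freq)


def recommend(current_page: str, freq: dict[tuple[str, str], int],
              threshold: int = 2, top_k: int = 5) -> list[str]:
    """Recommend pages that co-occur with the current page above a threshold."""
    scores: dict[str, int] = {}
    for (a, b), count in freq.items():
        if count < threshold:
            continue
        if a == current_page:
            scores[b] = count
        elif b == current_page:
            scores[a] = count
    return [p for p, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```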


Embodiments of the present disclosure can be viewed as providing methods for creating a web site. In this respect, one embodiment includes the following steps: receiving a choice of a design template to be used in creating the website, the design template providing an initial layout for pages of the website and recommended content; rendering for display a representation of the website from the design template in a development environment, wherein the representation provides controls for editing the content and layout of the website representation and the rendering produces HTML files that are displayed to a user; and enabling the user to edit design features of the website based upon a rendered view of the website in the development environment, a presently displayed representation of a web page having editing tools embedded in the web page. Preferably, models conform to the various types of web pages and other features that are typically found or visible on websites, and there may be different options for each feature. The innovation is to provide a platform that makes it easy to build websites and pages from stored templates, so that the website and pages can be modified and configured without the user needing to write any software code.
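A very small sketch of the template idea, under the assumption that a stored design template is simply parameterized HTML that the user fills in without writing code, could look like this (the template, function, and content values are hypothetical, not the disclosed system):

```python
from string import Template

# A stored design template: an initial layout plus placeholders for content.
DESIGN_TEMPLATE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$heading</h1><div>$content</div></body></html>"
)


def render_page(title: str, heading: str, content: str) -> str:
    """Produce the HTML for one page of the site from the stored template."""
    return DESIGN_TEMPLATE.substitute(title=title, heading=heading, content=content)


# The user edits only the content values, never the code.
html = render_page("My Shop", "Welcome", "Recommended starter content goes here.")
print(html)
```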


2012 ◽  
Vol 9 (2) ◽  
pp. 561-583 ◽  
Author(s):  
Víctor Prieto ◽  
Manuel Álvarez ◽  
Rafael López-García ◽  
Fidel Cacheda

The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the "client-side" Hidden Web. First, we perform a thorough analysis of the different client-side technologies and the main features of web pages in order to determine the basic steps of the scale. Then, we define the scale by grouping basic scenarios in terms of several common features, and we propose some methods to evaluate the effectiveness of crawlers according to the levels of the scale. Finally, we present a testing web site and show the results of applying these methods to several open-source and commercial crawlers that tried to traverse its pages. Only a few crawlers achieve good results in handling client-side technologies. Regarding standalone crawlers, we highlight the open-source crawlers Heritrix and Nutch and the commercial crawler WebCopierPro, which is able to process very complex scenarios. With regard to the crawlers of the main search engines, only Google processes most of the scenarios we have proposed, while Yahoo! and Bing deal only with the basic ones. There are not many studies that assess the capacity of crawlers to deal with client-side technologies, and the existing ones consider fewer technologies, fewer crawlers and fewer combinations. Furthermore, to the best of our knowledge, our article provides the first scale for classifying crawlers from the point of view of the most important client-side technologies.
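Evaluating a crawler against such a testing web site can be sketched as a coverage check per scenario: compare the URLs the crawler actually fetched with the URLs it should have reached if it handled a given client-side technology. The scoring function below is an assumption for illustration, not the evaluation procedure used in the study.

```python
def crawler_scale_score(expected: dict[str, set[str]],
                        fetched: set[str]) -> dict[str, float]:
    """Score a crawler against a testing web site, one scenario per scale level.

    `expected` maps a scenario name (e.g. a client-side technology such as
    JavaScript redirects or script-built links) to the set of target URLs a
    crawler handling that technology should reach; `fetched` is the set of
    URLs seen in the crawler's access log. Returns the fraction of target
    URLs reached per scenario.
    """
    return {scenario: len(urls & fetched) / len(urls)
            for scenario, urls in expected.items() if urls}
```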


Author(s):  
Vijay Kasi ◽  
Radhika Jain

In the context of the Internet, a search engine can be defined as a software program designed to help one access information, documents, and other content on the World Wide Web. The adoption and growth of the Internet in the last decade has been unprecedented. The World Wide Web has always been applauded for its simplicity and ease of use, which is evident from how little knowledge one requires to build a Web page. The flexible nature of the Internet has enabled its rapid growth and adoption, but it has also made it hard to search for relevant information on the Web. The number of Web pages has been increasing at an astronomical pace, from around 2 million registered domains in 1995 to 233 million registered domains in 2004 (Consortium, 2004). The Internet, considered a distributed database of information, has the CRUD (create, retrieve, update, and delete) rule applied to it. While the Internet has been effective at creating, updating, and deleting content, it has considerably lacked in enabling the retrieval of relevant information. After all, there is no point in having a Web page that has little or no visibility on the Web. Since the 1990s, when the first search program was released, we have come a long way in terms of searching for information. Although we are currently witnessing tremendous growth in search engine technology, the growth of the Internet has overtaken it, leaving existing search engine technology falling short. When we apply the metrics of relevance, rigor, efficiency, and effectiveness to the search domain, it becomes clear that we have progressed on the rigor and efficiency metrics by utilizing abundant computing power to produce faster searches over a lot of information. Rigor and efficiency are evident in the large number of pages indexed by the leading search engines (Barroso, Dean, & Holzle, 2003). However, more research needs to be done to address the relevance and effectiveness metrics. Users typically type in two to three keywords when searching, only to end up with a search result of thousands of Web pages, which makes it increasingly hard to effectively find any useful, relevant information. Search engines face a number of challenges today, requiring them to perform rigorous searches with relevant results efficiently so that they are effective. These challenges include the following ("Search Engines," 2004):
1. The Web is growing at a much faster rate than any present search engine technology can index.
2. Web pages are updated frequently, forcing search engines to revisit them periodically.
3. Dynamically generated Web sites may be slow or difficult to index, or may result in excessive results from a single Web site.
4. Many dynamically generated Web sites cannot be indexed by search engines at all.
5. The commercial interests of a search engine can interfere with the order of relevant results it shows.
6. Content that is behind a firewall or that is password protected is not accessible to search engines (such as content found in several digital libraries).
7. Some Web sites have started using tricks such as spamdexing and cloaking to manipulate search engines into displaying them as top results for a set of keywords. This pollutes the search results, pushing more relevant links down the result list, and is a consequence of the popularity of Web searches and the business potential search engines can generate today.
8. Search engines index all the content of the Web without any bounds on the sensitivity of information, which has raised a few security and privacy flags.
With the above background and challenges in mind, we lay out the article as follows. In the next section, we begin with a discussion of search engine evolution. To facilitate the examination and discussion of the progress of search engine development, we break this discussion down into the three generations of search engines. Figure 1 depicts this evolution pictorially and highlights the need for better search engine technologies. Next, we present a brief discussion of the contemporary state of search engine technology and the various types of content searches available today. With this background, the following section documents various concerns about existing search engines, setting the stage for better search engine technology. These concerns include information overload, relevance, representation, and categorization. Finally, we briefly address the research efforts under way to alleviate these concerns and then present our conclusion.


2003 ◽  
Vol 13 (3) ◽  
pp. 545-548
Author(s):  
Eileen C. Herring ◽  
Richard A. Criley

The Hawaiian Native Plant Propagation Web Site (http://pdcs.ctahr.hawaii.edu:591/hawnprop) is a collection of organized propagation information for selected Hawaiian indigenous and endemic plants. It was designed to provide easy access to this information for university extension personnel, researchers, students, conservationists, and nursery and landscape professionals. Journals and newsletters published in Hawaii provided the most relevant data for this Web site. The first prototype was a database-driven Web site that provided sophisticated search capability and dynamically generated Web pages for each plant record. Subsequent testing of the prototype identified a number of usability problems. These problems were corrected by redesigning the Web site using a hybrid database-static Web page approach. The database software search features are retained, but each database record is linked to a static Web page containing the propagation information for a specific plant.


Author(s):  
Carmen Domínguez-Falcón ◽  
Domingo Verano-Tacoronte ◽  
Marta Suárez-Fuentes

Purpose: The strong regulation of the Spanish pharmaceutical sector encourages pharmacies to modify their business model, giving the customer a more relevant role by integrating 2.0 tools. However, the study of the implementation of these tools is still quite limited, especially in terms of customer-oriented web page design. This paper aims to analyze the online presence of Spanish community pharmacies by studying the profile of their web pages in order to classify them by their degree of customer orientation.
Design/methodology/approach: In total, 710 community pharmacies were analyzed, of which 160 had web pages. Using items drawn from the literature, a content analysis was performed to evaluate the presence of these items on the web pages. Then, after analyzing the scores on the items, a cluster analysis was conducted to classify the pharmacies according to the degree of development of their online customer orientation strategy.
Findings: The number of pharmacies with a web page is quite low. The development of these websites is limited, and they play a more informational than relational role. The statistical analysis allows the pharmacies to be classified into four groups according to their level of development.
Practical implications: Pharmacists should make incremental use of their websites to facilitate real two-way communication with customers and other stakeholders, maintaining the relationship by incorporating Web 2.0 and social media (SM) platforms.
Originality/value: This study analyses, from a marketing perspective, the degree of Web 2.0 adoption and the characteristics of the websites, in terms of aiding communication and interaction with customers, in the Spanish pharmaceutical sector.
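As an illustration of the methodology (content-analysis scores followed by cluster analysis), the sketch below clusters binary item scores with k-means; the items, the toy data, the number of clusters, and the use of scikit-learn are assumptions for illustration, not the procedure or results reported in the paper.

```python
# Assumes scikit-learn and NumPy are installed; data and cluster count are illustrative.
import numpy as np
from sklearn.cluster import KMeans

# One row per pharmacy web page, one column per content item
# (e.g. contact form, online catalogue, social media links), coded 0/1.
scores = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
print(kmeans.labels_)  # group label per pharmacy, by degree of customer orientation
```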


Information ◽  
2018 ◽  
Vol 9 (9) ◽  
pp. 228 ◽  
Author(s):  
Zuping Zhang ◽  
Jing Zhao ◽  
Xiping Yan

Web page clustering is an important technology for organizing network resources. By extracting features and clustering based on the similarity of Web pages, a large amount of information on the Web can be organized effectively. In this paper, after describing the extraction of Web feature words, calculation methods for weighting the feature words are studied in depth. Taking Web pages as objects and Web feature words as attributes, a formal context is constructed for formal concept analysis. An algorithm for constructing a concept lattice based on cross data links is proposed and successfully applied. This method can be used to cluster Web pages using the concept lattice hierarchy. Experimental results indicate that the proposed algorithm outperforms previous competitors with regard to time consumption and clustering quality.
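To make the formal-context idea concrete, the sketch below builds a small context with Web pages as objects and feature words as attributes and enumerates its formal concepts by brute-force closure. This is a minimal illustration of formal concept analysis with hypothetical data, not the authors' cross-data-link lattice construction algorithm.

```python
from itertools import chain, combinations

# Formal context: each Web page (object) has a set of feature words (attributes).
context = {
    "page1": {"search", "index"},
    "page2": {"search", "crawler"},
    "page3": {"index", "crawler"},
}


def intent(pages, ctx):
    """Attributes shared by all given pages (all attributes for the empty set)."""
    sets = [ctx[p] for p in pages]
    return set.intersection(*sets) if sets else set(chain.from_iterable(ctx.values()))


def extent(words, ctx):
    """Pages having all given attributes."""
    return {p for p, attrs in ctx.items() if words <= attrs}


def concepts(ctx):
    """Enumerate all formal concepts (extent, intent) by brute-force closure."""
    objs = list(ctx)
    found = set()
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            words = intent(subset, ctx)
            pages = frozenset(extent(words, ctx))
            found.add((pages, frozenset(words)))
    return found


# Each concept groups pages that share exactly the same set of feature words;
# ordering concepts by extent size reflects the lattice hierarchy used for clustering.
for pages, words in sorted(concepts(context), key=lambda c: len(c[0])):
    print(sorted(pages), sorted(words))
```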

