Efficient watcher based web crawler design

2015 ◽  
Vol 67 (6) ◽  
pp. 663-686 ◽  
Author(s):  
Saed ALQARALEH ◽  
Omar RAMADAN ◽  
Muhammed SALAMAH

Purpose – The purpose of this paper is to design a watcher-based crawler (WBC) that can crawl static and dynamic web sites and download only the updated and newly added web pages. Design/methodology/approach – In the proposed WBC, a watcher file, which can be uploaded to the web sites' servers, prepares a report that contains the addresses of the updated and newly added web pages. In addition, the WBC is split into five units, where each unit is responsible for performing a specific crawling process. Findings – Several experiments have been conducted, and it has been observed that the proposed WBC increases the number of uniquely visited static and dynamic web sites as compared with existing crawling techniques. In addition, the proposed watcher file not only allows the crawlers to visit the updated and newly added web pages, but also solves the crawlers' overlapping and communication problems. Originality/value – The proposed WBC performs all crawling processes in the sense that it detects all updated and newly added pages automatically, without any explicit human intervention and without downloading the entire web sites.
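
The abstract does not describe how the watcher file works internally. As a rough illustration of the idea only, the sketch below assumes a site made of static files and reports the URLs of files modified since the last crawl. The function name `build_watcher_report`, its parameters, and the use of file modification times are assumptions made for this example, not the authors' design, and dynamic pages are not handled.

```python
from pathlib import Path


def build_watcher_report(site_root, base_url, last_crawl_time):
    """Return the URLs of files changed or added since the last crawl.

    Hypothetical helper: a file counts as "updated or newly added" when its
    modification time (seconds since the epoch) is later than last_crawl_time.
    """
    root = Path(site_root)
    changed = []
    for path in root.rglob("*"):
        # Skip directories; only regular files map to pages in this sketch.
        if path.is_file() and path.stat().st_mtime > last_crawl_time:
            relative = path.relative_to(root).as_posix()
            changed.append(f"{base_url.rstrip('/')}/{relative}")
    return sorted(changed)
```

A crawler that trusts such a report would fetch only the listed addresses instead of downloading the whole site again.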

2014 ◽  
Vol 31 (4) ◽  
pp. 10-13 ◽  
Author(s):  
Sharon Q. Yang

Purpose – This study aims to ascertain the trends and changes of how academic libraries market and deliver information literacy (IL) on the web. Design/methodology/approach – The author compares the findings from two separate studies that scanned the Web sites for IL-related activities in 2009 and 2012, respectively. Findings – Academic libraries intensified their efforts to promote and deliver IL on the web between 2009 and 2012. There was a significant increase in IL-related activities on the web in the three-year period. Practical implications – The findings describe the status quo and changes in IL-related activities on the libraries’ Web sites. This information may help librarians to know what they have been doing and whether there is room for improvement. Originality/value – This is the only study that spans three years in measuring the progress librarians made in marketing and delivering IL on the Web.


2009 ◽  
Vol 19 (4) ◽  
pp. 408-424 ◽  
Author(s):  
Anssi Tarkiainen ◽  
Hanna‐Kaisa Ellonen ◽  
Olli Kuivalainen

Purpose – The purpose of this paper is to increase understanding of the effects of web site extension on the parent‐magazine brand in the context of experiential goods, and to identify factors that are related to success. Design/methodology/approach – The paper focuses on the relationship between consumers' experiences on magazine web sites and their loyalty towards the print magazine. Findings – There are different ways in which the web site can complement the print version. The first mechanism is related to engaging in more frequent communication with the magazine's readers, and the second is related to consumer‐initiated interaction between other readers. In both cases something is offered that cannot be obtained from the print magazine, but is assumed to complement it. Originality/value – The paper increases understanding of brand extensions with regard to experiential goods, but more research is needed on the factors that are related to extension success.


2014 ◽  
Vol 26 (4) ◽  
pp. 244-258
Author(s):  
Nicholas C. Williamson ◽  
Joy Bhadury

Purpose – The purpose of this empirical research is to identify the distinguishing operating characteristics of wineries that use what is alleged to be the most profitable channel of distribution for marketing wine in the USA: the wine club. Design/methodology/approach – The research design entails the contrasting of the Web site-reflected operating features of wineries that support wine clubs with wineries that do not. Findings – Support was found for the great majority of operating features identified in the literature as likely characterizing the operations of wineries with wine clubs. A notable exception concerns the lack of confirmation of hypotheses concerning “Wine 2.0” variables. Research limitations/implications – In the apparent pursuit of higher profits, owners and managers of wineries with wine clubs more frequently adopt operating features that expose them to objective competitive comparisons than do owners and managers of other wineries. The former are also more prone to advertise on their Web sites a variety of offers that collectively constitute a more valuable quid pro quo in their relationships with consumer buyers than appears to be the case with other wineries. Strategically, results demonstrate that a winery’s adoption of a wine club is not a part of an evolutionary process of wineries in general. Originality/value – There has been no other published empirical research that concerned the identification of distinguishing operating features of wineries that use what has been argued to be the most profitable channel for marketing wine at retail in the USA: the wine club channel. Winery owners and managers will find particular value in the results and implications of the research.


2015 ◽  
Vol 33 (6) ◽  
pp. 1163-1173 ◽  
Author(s):  
Kobra Taram ◽  
Abbas Doulani

Purpose – The purpose of this paper is to explore webometric analysis of keywords and expressions of the biochemistry field of study via LexiURL Searcher. Design/methodology/approach – Interfaces for assisting users with information access have received considerable attention. Along with the extraction of data on Web sites for webometric purposes (e.g. link analysis, ranking of Web sites, etc.), LexiURL Searcher presents some information on the arrangement of links among different Web sites. Such capability enables users to identify one or more Web sites around their intended subject and, accordingly, explore all Web sites linked with their identified Web site(s). LexiURL Searcher has proceeded with webometric analysis by considering the main expressions and keywords derived from the MeSH database. Findings – The worldwide survey indicated that links from countries such as England, Japan, Germany, Australia and Canada were among the Web sites that are most used in biochemistry. Alternatively, other countries such as Singapore, Thailand and Poland had the most advantageous links to the outside world, whereas South Africa, New Zealand and The Netherlands had the least link effect. Biochemistry, being a specialized domain, would benefit greatly from site linking and would provide users the most assistance in information processing. Originality/value – Most webometric studies remain on the level of link analysis and Web site statuses; however, this paper gives information on the common thread Web sites based on a standard thesaurus.


2015 ◽  
Vol 49 (2) ◽  
pp. 205-223
Author(s):  
B T Sampath Kumar ◽  
D Vinay Kumar ◽  
K.R. Prithviraj

Purpose – The purpose of this paper is to determine the rate of loss of online citations used as references in scholarly journals. It also intends to recover the vanished online citations using the Wayback Machine and to calculate the half-life of online citations. Design/methodology/approach – The study selected three journals published by Emerald. All 389 articles published in these three scholarly journals were selected. A total of 15,211 citations were extracted, of which 13,281 were print citations and only 1,930 were online citations. The online citations so extracted were then tested to determine whether they were active or missing on the Web. W3C Link Checker was used to check the existence of online citations. The online citations that returned an HTTP error message when tested for accessibility were then entered into the search box of the Wayback Machine to recover the vanished online citations. Findings – The study found that only 12.69 percent (1,930 out of 15,211) of citations were online citations and that the percentage of online citations varied from a low of 9.41 in 2011 to a high of 17.52 in 2009. Another notable finding was that 30.98 percent of online citations were not accessible (vanished) and the remaining 69.02 percent were still accessible (active). The HTTP 404 error message, “page not found”, was the overwhelming message encountered and represented 62.98 percent of all HTTP error messages. It was found that the Wayback Machine had archived only 48.33 percent of the vanished web pages, leaving 51.67 percent still unavailable. The half-life of online citations increased from 5.40 years to 11.73 years after recovering the vanished online citations. Originality/value – This is a systematic and in-depth study on the recovery of vanished online citations cited in journal articles spanning a period of five years. The findings of the study will be helpful to researchers, authors, publishers, and editorial staff in recovering vanishing online citations using the Wayback Machine.
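
The abstract does not state how the half-life was computed. One simple model, given here only as an assumption, treats link loss as exponential decay: if a fraction W of citations is still accessible after t years, the half-life is t · ln(0.5) / ln(W). The function name `citation_half_life` and its parameters are hypothetical, and the sketch is not claimed to reproduce the paper's figures.

```python
import math


def citation_half_life(elapsed_years, fraction_accessible):
    """Estimate the half-life of online citations under exponential decay.

    Assumption: the accessible fraction decays as 0.5 ** (t / half_life),
    which gives half_life = t * ln(0.5) / ln(fraction_accessible).
    """
    if not 0 < fraction_accessible < 1:
        raise ValueError("fraction_accessible must be strictly between 0 and 1")
    return elapsed_years * math.log(0.5) / math.log(fraction_accessible)
```

Under this model, recovering archived copies raises the accessible fraction and therefore lengthens the estimated half-life, which is the direction of the change reported in the findings.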


2014 ◽  
Vol 66 (1) ◽  
pp. 96-116 ◽  
Author(s):  
Enrique Orduña-Malea ◽  
Jose Luis Ortega ◽  
Isidro F. Aguillo

Purpose – The purpose of this paper is to detect whether both file type (a set of rich and web files) and language (English, Spanish, German, French and Italian) influence the web visibility of European universities. Design/methodology/approach – A webometrics analysis of the top 200 European universities (as ranked in the Ranking web of World Universities) was carried out by a manual query for each official URL identified by using the Google search engine (April 2012). A correlation analysis between visibility and file format page count is offered according to language. Finally, a prediction of visibility is shown by using the SMOreg function. Findings – The results indicate that Spanish and English are the languages that correlate most highly with web visibility. This correlation becomes greater – though moderate – when considering only PDF files. Research limitations/implications – The results are limited due to the low correlation between overall page count and visibility. The lack of an accurate search engine that would assist in link counting procedures makes this process difficult. Originality/value – An observed increase in correlation – although moderate – while analysing PDF files (in English and Spanish) is considered to be meaningful. This may indirectly confirm that specific file formats and languages generate different web visibility behaviour on European university web sites.


2005 ◽  
Vol 18 (2) ◽  
pp. 95-97 ◽  
Author(s):  
John Maxymuk

Purpose – To show that despite libraries' tendencies to focus all their efforts – even in the online environment – on developing tools, resources, and finding aids for their patrons, some have also used the web to develop resources for staff needs. Design/methodology/approach – Surveys a number of library web sites and highlights online resources that have been developed to assist library staff in areas of training, organization, and professional development. Findings – Ranging from online instruction for new staff, listings of library policies and passwords, and resources for staff development, many libraries have begun to use their web sites to provide valuable information for staff too. Originality/value – The examples presented in this column can provide guidance for any library beginning to use their web site to provide information resources for their staff. Several types of information are presented showing both the range of information of use to staff and a variety of methods to convey that information.


Author(s):  
Harshala Bhoir ◽  
K. Jayamalini

Nowadays the Internet is widely used to find information, but searching the web for useful information has become more difficult. A web crawler helps to extract the relevant and irrelevant links from the web by downloading web pages programmatically. This paper implements a web crawler with the Python libraries Scrapy and Beautiful Soup to crawl news from news web sites. Scrapy is a web crawling framework that allows programmers to create spiders that define how a certain site or group of sites will be scraped. It has built-in support for extracting data from HTML sources using XPath and CSS expressions. Beautiful Soup is a library that extracts data from web pages. It provides a few simple methods for navigating, searching and modifying a parse tree, and it automatically converts incoming documents to Unicode and outgoing documents to UTF-8. The proposed system uses Beautiful Soup and Scrapy to crawl news web sites. The paper also compares Scrapy and Beautiful Soup 4 as web crawling tools.
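
As a minimal sketch of the parse-tree searching described above, the following uses Beautiful Soup 4 with Python's built-in `html.parser` to pull headline text and links out of an HTML string. The sample markup, the `h2.headline` CSS selector, and the function name `extract_headlines` are invented for illustration; a real news site would need its own selectors, and the paper's actual crawler is not shown in the abstract.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded news page.
SAMPLE_HTML = """
<html><body>
  <h2 class="headline"><a href="/news/1">First story</a></h2>
  <h2 class="headline"><a href="/news/2">Second story</a></h2>
  <a href="/about">About</a>
</body></html>
"""


def extract_headlines(html):
    """Return (title, link) pairs for anchors inside h2.headline elements."""
    soup = BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector and returns matches in document order.
    return [(a.get_text(strip=True), a["href"])
            for a in soup.select("h2.headline a")]
```

The "About" link is ignored because it does not sit inside a headline element, which shows how a selector narrows a crawl to the relevant links.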


2006 ◽  
Vol 19 (2) ◽  
pp. 84-86
Author(s):  
Jennifer Paustenbaugh

Purpose – The purpose of the paper is to provide a tribute to the life and work of library fund‐raiser Gwen Leighty. Design/methodology/approach – The paper uses personal knowledge and references to Academic Libraries Advancement and Development Network (ALADN) and LIBDEV web sites. Findings – The paper finds that fundraising is connecting with people and the journey that each development officer must make while raising funds for their library. Originality/value – The paper presents a brief history of ALADN and the valuable contribution one person made to the cause of library fund‐raising.


2015 ◽  
Vol 31 (1) ◽  
pp. 2-6
Author(s):  
Robert Fox

Purpose – In order to continue to respond to patron needs in a relevant way, it is necessary to continuously reevaluate the central message that the library website is intended to convey. It is necessary to question assumptions, listen to user needs, and shift our paradigm to make the library web presence as effective as possible. Design/methodology/approach – This is a regular viewpoint column. A basic literature review was done prior to the column being written. Findings – The library Web site remains, in many respects, the “first face” of the library for patrons. To remain relevant, traditional methodologies used in library science may need to be set aside or tailored to the needs of the patron. Originality/value – Various methods regarding design philosophy are explored which may be of use to information professionals responsible for the design and content of library Web sites.

