A Novel Approach on Focused Crawling With Anchor Text

2018 ◽  
Vol 7 (1) ◽  
pp. 7-15
Author(s):  
S. Subatra Devi

This paper discusses a novel approach to focused crawling based on anchor text. Most search engines use anchor text to retrieve relevant pages and answer user queries. A crawler searches web pages and filters out unnecessary ones, which focused crawling makes possible: a focused crawler restricts its crawl boundary to relevant pages reached through links and ignores irrelevant pages on the web. In this paper, an effective focused crawling method is implemented to improve the quality of search. Three learning phases, namely content-based, link-based and sibling-based learning, are applied to improve the navigation of the search. With this approach the crawler traverses relevant pages efficiently and retrieves more relevant pages. Experiments show that, for different anchor texts, the three learning phases combined with focused crawling retrieve a larger number of relevant pages.
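The paper's implementation is not reproduced here, but a minimal sketch of a best-first focused crawler driven by anchor-text relevance may help make the idea concrete. The topic keywords, scoring function and threshold below are illustrative assumptions, not the author's method, and the sketch covers only the content-based and link-based cues.

```python
import heapq
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

TOPIC_KEYWORDS = {"crawler", "focused", "anchor", "retrieval"}  # hypothetical topic

def anchor_score(anchor_text):
    """Link-based cue: fraction of topic keywords that appear in the anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & TOPIC_KEYWORDS) / len(TOPIC_KEYWORDS)

def focused_crawl(seed_urls, max_pages=50, threshold=0.25):
    frontier = [(-1.0, url) for url in seed_urls]   # best-first frontier, highest score first
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        # Content-based cue: keep the page if its text mentions the topic.
        if any(k in soup.get_text().lower() for k in TOPIC_KEYWORDS):
            relevant.append(url)
        # Link-based cue: only enqueue out-links whose anchor text looks on-topic.
        for a in soup.find_all("a", href=True):
            score = anchor_score(a.get_text())
            if score >= threshold:
                heapq.heappush(frontier, (-score, urljoin(url, a["href"])))
    return relevant
```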

Author(s):  
Satinder Kaur ◽  
Sunil Gupta

Information plays a very important role in life, and the world now largely depends on the World Wide Web to obtain it. The web comprises many websites from every discipline, and each website consists of web pages interlinked with one another through hyperlinks. The success of a website largely depends on the design of its pages, and researchers have done a great deal of work to appraise web pages quantitatively. Keeping in mind the importance of design aspects, this paper presents an automated evaluation tool that assesses these aspects for any web page. The tool takes the HTML code of the page as input, then extracts the HTML tags and checks them for uniformity. It comprises normalized modules that quantify measures of the design aspects. As a demonstration, the tool has been applied to four web pages from distinct sites and their design aspects reported for comparison. The tool benefits web developers, who can predict the design quality of web pages and improve it before and after a website is implemented, without user interaction.
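The paper's normalized modules are not listed in this abstract; as a rough illustration of the kind of tag-level checks such a tool performs, the sketch below computes a few common design measures from a page's HTML. The specific measures and their 0-to-1 normalization are assumptions for illustration.

```python
from bs4 import BeautifulSoup

def design_report(html):
    """Return a few normalized (0..1) design measures for a page's HTML.
    The measures are illustrative, not the paper's metric set."""
    soup = BeautifulSoup(html, "html.parser")
    imgs = soup.find_all("img")
    imgs_with_alt = [i for i in imgs if i.get("alt") is not None]
    return {
        "has_title": 1.0 if soup.title and soup.title.get_text().strip() else 0.0,
        "alt_text_ratio": len(imgs_with_alt) / len(imgs) if imgs else 1.0,
        "uses_headings": 1.0 if soup.find(["h1", "h2", "h3"]) else 0.0,
        "no_deprecated_tags": 0.0 if soup.find(["font", "center", "marquee"]) else 1.0,
    }

# Usage: design_report(open("page.html", encoding="utf-8").read())
```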


2015 ◽  
Vol 12 (1) ◽  
pp. 91-114 ◽  
Author(s):  
Víctor Prieto ◽  
Manuel Álvarez ◽  
Víctor Carneiro ◽  
Fidel Cacheda

Search engines use crawlers to traverse the Web in order to download web pages and build their indexes. Keeping these indexes up to date is essential to the quality of search results. However, changes in web pages are unpredictable, and identifying the moment a web page changes, as soon as possible and with minimal computational cost, is a major challenge. In this article we present the Web Change Detection system which, in the best case, is capable of detecting a change in a web page almost in real time. In the worst case it requires, on average, 12 minutes to detect a change on a site with low PageRank and about one minute on a site with high PageRank. Current search engines, by contrast, require more than a day on average to detect a modification to a web page (in both cases).
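The abstract does not describe the system's detection mechanism, so the following is only a naive baseline for comparison: poll a page and hash its body, treating a changed hash as a changed page. The polling interval and hash choice are assumptions.

```python
import hashlib
import time

import requests

def fingerprint(url):
    """Hash the raw page body; a different hash signals a changed page."""
    body = requests.get(url, timeout=10).content
    return hashlib.sha256(body).hexdigest()

def watch(url, interval_seconds=60):
    """Naive polling loop; real systems trade polling cost against detection delay."""
    last = fingerprint(url)
    while True:
        time.sleep(interval_seconds)
        current = fingerprint(url)
        if current != last:
            print(f"change detected at {url}")
            last = current
```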


2020 ◽  
Vol 17 (2) ◽  
pp. 1260-1265
Author(s):  
Mohd Sharul Hafiz Razak ◽  
Nor Azman Ismail ◽  
Alif Fikri Mohktar ◽  
Su Elya Namira ◽  
Nurina Izzati Ramzi

This paper investigates 18 web domains of computer science and information technology academic websites of Malaysian universities. We collected more than two million web pages. A webometric analysis was used to explore the number of web pages, inbound links, the web impact factor (WIF) and link relationships. The results show that Fakulti Teknologi dan Sains Maklumat (FTSM), Universiti Kebangsaan Malaysia (UKM) has the highest number of web pages, while Fakulti Teknologi Kreatif dan Warisan (FTKW), Universiti Malaysia Kelantan (UMK) has the largest WIF score. Pearson's correlation coefficient was used to test the relationship between institutional subdomain age and WIF. The correlation indicates only a scant relationship between subdomain age and WIF score across the 18 selected Malaysian schools [r = −.076, n = 18, p < .0005]. This is because the WIF depends heavily on the quality of the content for attracting backlinks and on the Google crawler algorithm, which changes from time to time, for the number of web pages. Subdomain age is also independent of the year a school was established. These findings can serve as a guide when implementing a university web content strategy.
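The web impact factor is conventionally computed as the number of inbound links to a site divided by the number of its pages; the short sketch below shows that calculation and the correlation test reported above, using made-up values rather than the study's data (Python 3.10+ for statistics.correlation).

```python
from statistics import correlation  # Pearson's r; available from Python 3.10

def web_impact_factor(inbound_links, page_count):
    """Conventional WIF: inbound links divided by the number of pages indexed for the site."""
    return inbound_links / page_count if page_count else 0.0

# Hypothetical subdomains: (label, subdomain age in years, inbound links, web pages).
sites = [("site_a", 22, 5400, 120000), ("site_b", 12, 900, 3000), ("site_c", 25, 2100, 60000)]
ages = [age for _, age, _, _ in sites]
wifs = [web_impact_factor(links, pages) for _, _, links, pages in sites]
print(round(correlation(ages, wifs), 3))  # r between subdomain age and WIF
```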


2019 ◽  
Author(s):  
Lucas van der Deijl ◽  
Antal van den Bosch ◽  
Roel Smeets

Literary history is no longer written in books alone. As literary reception thrives in blogs, Wikipedia entries, Amazon reviews, and Goodreads profiles, the Web has become a key platform for the exchange of information on literature. Although conventional printed media in the field (academic monographs, literary supplements, and magazines) may still claim the highest authority, online media presumably provide the first (and possibly the only) source for many readers casually interested in literary history. Wikipedia offers quick and free answers to readers' questions, and the range of topics described in its entries dramatically exceeds the volume any printed encyclopedia could possibly cover. While an important share of this expanding knowledge base about literature is produced bottom-up (user based and crowd-sourced), search engines such as Google have become brokers in this online economy of knowledge, organizing information on the Web for its users. Similar to the printed literary histories, search engines prioritize certain information sources over others when ranking and sorting Web pages; as such, their search algorithms create hierarchies of books, authors, and periods.


2001 ◽  
Vol 20 (4) ◽  
pp. 11-18 ◽  
Author(s):  
Cleborne D. Maddux

The Internet and the World Wide Web are growing at unprecedented rates, and more and more teachers are authoring school or classroom web pages. Such pages have particular potential for use in rural areas by special educators, children with special needs, and the parents of children with special needs. The quality of many of these pages, however, leaves much to be desired. All web pages, especially those authored by special educators, should be accessible to people with disabilities. Many other problems complicate use of the web for all users, whether or not they have disabilities. By taking some simple steps, beginning webmasters can avoid these problems. This article discusses practical solutions to common accessibility problems and to other problems commonly seen on the web.
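One common accessibility problem of the kind the article discusses, images without text alternatives, can be checked automatically. The check below is a generic illustration, not the article's own checklist.

```python
from bs4 import BeautifulSoup

def images_missing_alt(html):
    """List <img> tags that lack an alt attribute, a common accessibility problem."""
    soup = BeautifulSoup(html, "html.parser")
    return [str(img)[:80] for img in soup.find_all("img") if not img.has_attr("alt")]
```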


2004 ◽  
Vol 4 (1) ◽  
Author(s):  
David Carabantes Alarcón ◽  
Carmen García Carrión ◽  
Juan Vicente Beneit Montesinos

Quality on the Internet is of great value, all the more so when the web page concerns health, as does a resource on drug dependence. This article reviews the most prominent estimators and systems of web quality in order to develop a specific system for assessing the quality of web resources on drug dependence. A feasibility test was carried out by analysing the main web pages on this subject (n = 60), gathering user-perspective assessments of the quality of those resources. Areas for improvement were identified concerning the accuracy and reliability of the information, authorship, and the development of descriptions and assessments of external links.
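The paper's actual set of estimators is not reproduced here; as a hedged illustration of how such user-perspective assessments can be aggregated, the sketch below applies a weighted checklist over criteria of the kind named in the abstract. The weights and the 0..1 rating scale are assumptions.

```python
# Hypothetical weights over criteria loosely following the aspects named in the abstract.
CRITERIA_WEIGHTS = {
    "accuracy": 0.40,      # accuracy and reliability of the information
    "authorship": 0.30,    # identifiable, responsible authorship
    "link_quality": 0.30,  # descriptions and assessments of external links
}

def quality_score(ratings):
    """Weighted mean of per-criterion user ratings on a 0..1 scale."""
    return sum(w * ratings.get(criterion, 0.0) for criterion, w in CRITERIA_WEIGHTS.items())

print(quality_score({"accuracy": 0.8, "authorship": 0.5, "link_quality": 0.4}))
```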


2013 ◽  
Vol 303-306 ◽  
pp. 2311-2316
Author(s):  
Hong Shen Liu ◽  
Peng Fei Wang

The structure and contents of a research search engine are presented; its core technology is the analysis of web pages. The characteristics of analysing web pages within a single website are studied: relations between the web pages gathered by the crawler on two successive visits can be established, and the information that changed between them is found easily. A new method of analysing web pages within one website is introduced, which analyses pages based on their changed information. Applying the method shows that it is effective for web page analysis.
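The abstract leaves the method's details open; a minimal illustration of the underlying idea, comparing the pages gathered on two crawls of the same site so that only changed pages are re-analysed, might look like the following. The fingerprinting choice is an assumption.

```python
import hashlib

def fingerprint(html):
    """Cheap content fingerprint for comparing two versions of a page."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

def changed_pages(crawl_old, crawl_new):
    """crawl_old and crawl_new map URL -> HTML from two successive crawls of a site.
    Returns the URLs whose content is new or differs, i.e. the pages worth re-analysing."""
    changed = []
    for url, html in crawl_new.items():
        old_html = crawl_old.get(url)
        if old_html is None or fingerprint(old_html) != fingerprint(html):
            changed.append(url)
    return changed
```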


2021 ◽  
Author(s):  
Xiangyi Chen

Text, link and usage information are the most commonly used sources in the ranking algorithm of a web search engine. In this thesis, we argue that the quality of the web pages, such as the performance of page delivery (e.g. reliability and response time), should also play an important role in ranking, especially for users with a slow Internet connection or mobile users. On this principle, if two pages are equally relevant to a query, the one with the higher delivery quality (e.g. faster response) should be ranked higher. We define several important attributes for Quality of Service (QoS) and explain how we rank web pages based on them. In addition, we have tested and compared different algorithms for aggregating those QoS attributes. The experimental results show that the proposed algorithms promote pages with higher delivery quality to higher positions in the result list, which improves users' overall experience of the search engine, and that the QoS-based re-ranking algorithm consistently achieves the best performance.
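The thesis's exact QoS attributes and aggregation functions are not listed in this abstract; the sketch below only illustrates the general idea of combining a relevance score with normalized delivery-quality attributes such as response time and reliability. The normalization and the weights are assumptions.

```python
def qos_score(response_time_s, reliability, max_time_s=5.0):
    """Combine two illustrative QoS attributes into a 0..1 score:
    faster responses and higher availability score higher."""
    time_score = max(0.0, 1.0 - min(response_time_s, max_time_s) / max_time_s)
    return 0.5 * time_score + 0.5 * reliability

def rerank(results, alpha=0.7):
    """results: list of (url, relevance, response_time_s, reliability).
    alpha weights relevance against delivery quality; both weights are assumed values."""
    scored = [(alpha * relevance + (1 - alpha) * qos_score(t, rel), url)
              for url, relevance, t, rel in results]
    return [url for _, url in sorted(scored, reverse=True)]

# Two equally relevant pages: the faster, more reliable one is promoted.
print(rerank([("a.example", 0.9, 4.0, 0.95), ("b.example", 0.9, 0.3, 0.99)]))
```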


A web crawler, also called a spider, automatically traverses the WWW for the purpose of web indexing. As the web grows day by day, the number of web pages worldwide has increased massively. Search engines are essential to make this volume searchable for users, and they are the means of discovering particular data on the WWW. Without a search engine it would be almost impossible for a person to find anything on the web unless they already knew a specific URL. Every search engine maintains a central repository of HTML documents in indexed form. Each time a user submits a query, the search is performed against this database of indexed web pages. The size of each search engine's database depends on the pages available on the internet, so to increase the efficiency of search engines it is preferable to store only the most relevant and significant pages in the database.
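To make the passage concrete, the toy sketch below fetches pages breadth-first and builds the kind of indexed repository it describes, an inverted index from words to the URLs that contain them. The page limit, parsing and lack of relevance filtering are simplifications.

```python
from collections import defaultdict, deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl_and_index(seeds, max_pages=20):
    """Breadth-first toy crawler: fetch pages, index their words, follow links."""
    index = defaultdict(set)            # word -> set of URLs (the inverted index)
    queue, seen = deque(seeds), set()
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        for word in soup.get_text().lower().split():
            index[word].add(url)
        queue.extend(urljoin(url, a["href"]) for a in soup.find_all("a", href=True))
    return index

# Queries are answered from the index, not by re-fetching pages:
# index = crawl_and_index(["https://example.com"]); index.get("crawler", set())
```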

