Focused Crawlers
Recently Published Documents

Total documents: 21 (last five years: 6)
H-index: 4 (last five years: 1)

2021 ◽  
Vol 15 (3) ◽  
pp. 205-215
Author(s):  
Gurjot Singh Mahi ◽  
Amandeep Verma

Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines, however, and are also widely used to build corpora in different domains and languages. This study developed a set of focused web crawlers for three Punjabi news websites. The crawlers were designed to extract quality text articles and add them to a local repository for use in further research. They were implemented in the Python programming language and were used to construct a corpus of more than 134,000 news articles across nine news genres. The crawler code and the extracted corpora were made publicly available to the scientific community for research purposes.
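The abstract does not include the extraction rules themselves, but a minimal sketch of one such news-article crawler in Python might look as follows; the CSS selectors, genre label, and JSON-lines repository format are hypothetical placeholders, not the authors' actual code.

```python
import json
import requests
from bs4 import BeautifulSoup

def crawl_article(url, genre, repository_path="corpus.jsonl"):
    """Fetch one article page, extract its text, and append it to a
    local JSON-lines repository."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Hypothetical selectors; each target site needs its own rules.
    title = soup.select_one("h1.headline")
    body = soup.select("div.article-body p")
    if title is None or not body:
        return None  # skip pages that do not match the article template
    record = {
        "url": url,
        "genre": genre,
        "title": title.get_text(strip=True),
        "text": "\n".join(p.get_text(strip=True) for p in body),
    }
    with open(repository_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

In practice one such function would be written per target site, fed by a queue of article URLs discovered from each site's section pages.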


2021 ◽  
Vol 21 (2) ◽  
pp. 105-120
Author(s):  
K. S. Sakunthala Prabha ◽  
C. Mahesh ◽  
S. P. Raja

Abstract A topic-precise crawler is a special-purpose web crawler that downloads web pages relevant to a particular topic by measuring a cosine-similarity or semantic-similarity score. The cosine-based measure yields an inaccurate relevance score when the topic term does not occur directly in the web page. The semantic-based measure gives a precise relevance score even when only synonyms of the given topic occur in the page, but when the topic is absent from the ontology, semantic focused crawlers likewise produce inaccurate scores. This paper overcomes these shortcomings with a hybrid string-matching algorithm that combines the semantic similarity measure with a probabilistic similarity measure. The experimental results show that this algorithm increases the efficiency of focused web crawlers and achieves better Harvest Rate (HR), Precision (P), and Irrelevance Ratio (IR) than existing focused web crawlers.
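As an illustration of the hybrid idea, the sketch below combines a TF-IDF cosine score with a WordNet-based semantic score in Python. The paper's actual probabilistic measure and combination rule are not given in the abstract, so the path-similarity fallback and the mixing weight `alpha` are assumptions (WordNet requires `nltk.download("wordnet")`).

```python
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_score(topic, page_text):
    """TF-IDF cosine similarity; near zero when the topic term is absent."""
    tfidf = TfidfVectorizer().fit_transform([topic, page_text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def semantic_score(topic, page_text):
    """Best WordNet path similarity between the topic word and any word on
    the page, so synonyms of the topic still earn a high relevance score."""
    topic_synsets = wn.synsets(topic)
    best = 0.0
    for word in set(page_text.lower().split()):
        for s1 in topic_synsets:
            for s2 in wn.synsets(word):
                sim = s1.path_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
    return best

def hybrid_score(topic, page_text, alpha=0.5):
    # alpha is a hypothetical mixing weight, not taken from the paper.
    return (alpha * cosine_score(topic, page_text)
            + (1 - alpha) * semantic_score(topic, page_text))
```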


Author(s):  
Jingfa Liu ◽  
Wei Zhang ◽  
Zhihe Yang ◽  
Ziang Liu

Traditional crawlers have difficulty performing semantic analysis, so focused crawler technologies with topic-preference characteristics have received much attention in recent years. To increase the precision of focused crawlers and prevent "topic drift", this paper adopts a comprehensive relevancy evaluation (CRE) of hyperlinks based on a combination of web content and link structure. In addition, an improved version of the energy landscape paving (ELP) algorithm, a Metropolis-sampling-based global optimization method, is proposed to keep the focused crawler from falling into local optima. By incorporating the CRE strategy into the improved ELP, a novel focused crawler strategy, denoted IELP, is proposed. Experimental results in the rainstorm disaster domain show that the precision of the proposed focused crawler is clearly higher than that of other focused crawlers in the literature, illustrating the ability of IELP to retrieve topic-related web pages.
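A rough sketch of the two ingredients, under stated assumptions: the CRE is modeled as a weighted sum of content, anchor, and link-structure signals, and the ELP step as Metropolis acceptance with a visit-histogram penalty. The paper's actual energy function and weights are not given in the abstract.

```python
import math
import random
from collections import Counter

visit_counts = Counter()  # ELP histogram: how often each URL was proposed

def comprehensive_relevancy(content_sim, anchor_sim, in_degree,
                            w_content=0.6, w_anchor=0.3, w_link=0.1):
    """Combine page-content relevance, anchor-text relevance, and a
    link-structure signal into one hyperlink score (assumed weights)."""
    link_score = in_degree / (1 + in_degree)  # squash counts into [0, 1)
    return w_content * content_sim + w_anchor * anchor_sim + w_link * link_score

def elp_accept(current_energy, candidate_energy, candidate_url, temperature=1.0):
    """Metropolis acceptance with an ELP-style histogram penalty: URLs
    proposed often get their energy inflated, pushing the crawl out of
    local optima. Energy = 1 - relevancy, so lower is better."""
    penalized = candidate_energy + 0.1 * visit_counts[candidate_url]
    visit_counts[candidate_url] += 1
    if penalized <= current_energy:
        return True
    return random.random() < math.exp(-(penalized - current_energy) / temperature)
```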


Author(s):  
Wei Wang ◽  
Lihua Yu

Focused crawlers, as fundamental components of vertical search engines, crawl only web pages related to a specific topic. Existing focused crawlers commonly suffer from low crawling efficiency and topic drift. In this paper, we propose a learning-based focused crawler that uses a URL knowledge base. To improve the accuracy of the similarity measure, topic similarity is computed from the parent page content, the anchor text, and the URL content, and the URL content score is learned and updated iteratively and continuously. Within the crawler, we implement a crawling mechanism that combines content analysis with a simple link-analysis strategy, which reduces computational complexity and avoids the locality problem of crawling. Experimental results show that the proposed algorithm achieves better precision than traditional methods, including the shark-search and best-first search algorithms, and avoids the local-optimum problem of crawling.
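A minimal sketch of the three-signal priority and the iterative URL knowledge-base update might look like this; the weights, the token-level representation of the knowledge base, and the learning rate are assumptions not taken from the paper.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

url_knowledge_base = {}  # token -> learned relevance weight in [0, 1]

def tokenize_url(url):
    """Split a URL into word-like tokens (scheme, host words, path words)."""
    return [t for t in re.split(r"[^a-zA-Z0-9]+", url.lower()) if t]

def text_similarity(topic, text):
    if not text.strip():
        return 0.0
    tfidf = TfidfVectorizer().fit_transform([topic, text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def link_priority(topic, parent_text, anchor_text, url,
                  w_parent=0.5, w_anchor=0.3, w_url=0.2):
    """Weighted mix of parent-page, anchor-text, and URL-content scores."""
    tokens = tokenize_url(url)
    url_score = (sum(url_knowledge_base.get(t, 0.0) for t in tokens) / len(tokens)
                 if tokens else 0.0)
    return (w_parent * text_similarity(topic, parent_text)
            + w_anchor * text_similarity(topic, anchor_text)
            + w_url * url_score)

def update_url_knowledge(url, page_relevance, learning_rate=0.2):
    """After a page is crawled and judged, nudge its URL tokens' weights
    toward the observed relevance (a simple running update)."""
    for t in tokenize_url(url):
        old = url_knowledge_base.get(t, 0.0)
        url_knowledge_base[t] = old + learning_rate * (page_relevance - old)
```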


2019 ◽  
Vol 19 (2) ◽  
pp. 146-158 ◽  
Author(s):  
S. R. Mani Sekhar ◽  
G. M. Siddesh ◽  
Sunilkumar S. Manvi ◽  
K. G. Srinivasa

Abstract With the rapid growth of digital technologies, crawlers and search engines face unpredictable challenges. Focused web crawlers are essential for mining the boundless data available on the Internet, but they face an indeterminate latency problem caused by differences in the response times of web sources. The proposed work optimizes the design and implementation of focused web crawlers for Bioinformatics web sources using a master-slave architecture. A focused crawler should ideally crawl only relevant pages, yet the relevance of a page can be estimated only after crawling it; this paper therefore proposes a solution for predicting page relevance based on natural language processing. The frequency of the topic keywords in the top-ranked sentences of a page determines its relevance within genomics sources. The proposed solution uses the TextRank algorithm to rank sentences and to ensure correct classification of Bioinformatics web pages. Finally, the model is validated by comparison with a breadth-first-search web crawler; the comparison shows a significant reduction in run time for the same harvest rate.
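The relevance test can be sketched directly: run TextRank (PageRank over a sentence-similarity graph) and count topic keywords in the top-ranked sentences. The top-k size is an assumption; the abstract does not state it.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def page_relevance(sentences, keywords, top_k=5):
    """Rank sentences with TextRank, then count topic keywords in the
    top-ranked ones; higher counts mean a more relevant page."""
    if not sentences:
        return 0
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)      # sentence-by-sentence similarity
    graph = nx.from_numpy_array(sim)    # weighted similarity graph
    scores = nx.pagerank(graph)         # TextRank = PageRank on this graph
    ranked = sorted(range(len(sentences)), key=scores.get, reverse=True)
    top_text = " ".join(sentences[i] for i in ranked[:top_k]).lower()
    return sum(top_text.count(kw.lower()) for kw in keywords)
```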


Author(s):  
Venugopal Boppana ◽  
Sandhya P

The large and ever-growing range of online information has made it difficult for crawlers and search engines to extract related information. This paper discusses focused crawlers, also called topic-specific crawlers, and variations of focused crawlers leading to a distributed architecture, i.e., a context-aware notification architecture. A focused crawler is used to retrieve relevant pages from the huge amount of information available on the Internet; it can return relevant pages for a given topic with fewer searches and in a shorter time. Here the input to the focused crawler is a topic specified using exemplary documents rather than keywords. A focused crawler avoids searching all web documents and instead follows only the links that lie within the crawler's boundary. This focused crawling mechanism saves considerable CPU time in keeping the crawl up to date.
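One way to realize topic-by-example scoring is to rank frontier URLs by similarity to the centroid of the exemplary documents, as sketched below; the best-first frontier is a generic stand-in, since the paper's context-aware notification architecture is not detailed in the abstract.

```python
import heapq
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class ExemplarFrontier:
    """Best-first crawl frontier whose priority is similarity to the
    centroid of the exemplary documents (the topic-by-example profile)."""

    def __init__(self, exemplary_docs):
        self.vectorizer = TfidfVectorizer().fit(exemplary_docs)
        vectors = self.vectorizer.transform(exemplary_docs)
        self.centroid = np.asarray(vectors.mean(axis=0))
        self.heap = []  # max-heap via negated scores

    def score(self, text):
        return float(cosine_similarity(self.centroid,
                                       self.vectorizer.transform([text]))[0, 0])

    def push(self, url, context_text):
        heapq.heappush(self.heap, (-self.score(context_text), url))

    def pop(self):
        """Return the most topic-relevant URL on the crawl boundary."""
        return heapq.heappop(self.heap)[1]
```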


Author(s):  
Ashwani Kumar ◽  
Anuj Kumar ◽  
Rahul Mishra

A focused crawler traverses the World Wide Web, selecting pages relevant to a predefined topic and ignoring those outside it. Collecting domain-specific documents with focused crawlers is considered one of the most important schemes for finding relevant data. While browsing the Internet, it is difficult to deal with irrelevant pages and to predict which links lead to quality pages. Most focused crawlers use a local search algorithm to traverse the web space, but they can easily become trapped within a bounded subgraph of the web surrounding the starting URLs, and relevant pages are missed when no links lead to them from the starting URLs. To address this problem, we design a focused crawler that calculates the absolute frequency of the topic keyword together with its synonyms and sub-level synonyms. A weight table is constructed according to the user query, the resemblance of web pages to the topic keywords is checked, and the priority of each extracted link is computed.
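A sketch of the weight table and the weighted-frequency score, assuming WordNet as the source of synonyms and sub-level synonyms (hyponym lemmas); the specific weights 1.0/0.7/0.4 are illustrative, not from the paper.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def build_weight_table(topic_keyword):
    """Weight table over the keyword (1.0), its synonyms (0.7), and its
    sub-level synonyms, i.e. hyponym lemmas (0.4); weights are illustrative."""
    table = {topic_keyword.lower(): 1.0}
    for synset in wn.synsets(topic_keyword):
        for lemma in synset.lemma_names():
            table.setdefault(lemma.lower().replace("_", " "), 0.7)
        for hypo in synset.hyponyms():
            for lemma in hypo.lemma_names():
                table.setdefault(lemma.lower().replace("_", " "), 0.4)
    return table

def page_score(page_text, weight_table):
    """Weighted absolute frequency of every table entry in the page."""
    text = page_text.lower()
    return sum(weight * text.count(term) for term, weight in weight_table.items())
```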


2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Houqing Lu ◽  
Donghui Zhan ◽  
Lei Zhou ◽  
Dengchao He

A focused crawler is topic-specific and aims to selectively collect web pages relevant to a given topic from the Internet. However, the performance of current focused crawlers easily suffers from the page environment and from multi-topic web pages: in the crawling process, a highly relevant region may be ignored owing to the low overall relevance of its page, and anchor text or link context may misguide the crawler. To solve these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term-weighting approach (ITFIDF) to obtain highly relevant web pages. In addition, the paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content-block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance between the URLs on a page and the given topic. The experimental results demonstrate that the ITFIDF classifier outperforms TFIDF and that our focused crawler is superior to focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content-block partition in terms of harvest rate and target recall. In conclusion, our methods are effective for focused crawling.
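The block-level idea behind LPE can be sketched as follows: score each hyperlink by the relevance of its enclosing content block rather than the whole page. Plain TF-IDF and a fixed linear mix stand in for the paper's ITFIDF weighting and JFE features, which the abstract does not specify.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance(topic, text):
    """Plain TF-IDF cosine relevance (stand-in for the paper's ITFIDF)."""
    if not text.strip():
        return 0.0
    tfidf = TfidfVectorizer().fit_transform([topic, text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def link_priority_evaluation(topic, blocks):
    """blocks: list of (block_text, [(url, anchor_text), ...]) pairs.
    Scores each link by its enclosing block plus its anchor text, so a
    relevant region on a mostly off-topic page still surfaces."""
    priorities = {}
    for block_text, links in blocks:
        b_rel = relevance(topic, block_text)
        for url, anchor in links:
            priorities[url] = 0.7 * b_rel + 0.3 * relevance(topic, anchor)
    return priorities
```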


2015 ◽  
Vol 117 (8) ◽  
pp. 13-20
Author(s):  
Saturi Rajesh ◽  
D. Raju ◽  
P. Ajay Kumar ◽  
P. Srikanth

2015 ◽  
Vol 19 (3) ◽  
pp. 449-474 ◽  
Author(s):  
Karane Vieira ◽  
Luciano Barbosa ◽  
Altigran Soares da Silva ◽  
Juliana Freire ◽  
Edleno Moura
