Focused Crawlers
Recently Published Documents

Total documents: 21 (last five years: 6)
H-index: 4 (last five years: 1)

2021 ◽  
Vol 15 (3) ◽  
pp. 205-215
Author(s):  
Gurjot Singh Mahi ◽  
Amandeep Verma

Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines, however, and are also widely used to build corpora in different domains and languages. This study developed a set of focused web crawlers for three Punjabi news websites. The crawlers were designed to extract quality text articles and add them to a local repository for use in further research. They were implemented in the Python programming language and were used to construct a corpus of more than 134,000 news articles across nine news genres. The crawler code and the extracted corpora were made publicly available to the scientific community for research purposes.
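The abstract does not include the extraction rules themselves, but a minimal sketch of one such news-article crawler in Python might look as follows; the CSS selectors, genre label, and JSON-lines repository format are hypothetical placeholders, not the authors' actual code.

```python
import json
import requests
from bs4 import BeautifulSoup

def crawl_article(url, genre, repository_path="corpus.jsonl"):
    """Fetch one article page, extract its text, and append it to a
    local JSON-lines repository."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Hypothetical selectors; each target site needs its own rules.
    title = soup.select_one("h1.headline")
    body = soup.select("div.article-body p")
    if title is None or not body:
        return None  # skip pages that do not match the article template
    record = {
        "url": url,
        "genre": genre,
        "title": title.get_text(strip=True),
        "text": "\n".join(p.get_text(strip=True) for p in body),
    }
    with open(repository_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

In practice one such function would be written per target site, fed by a queue of article URLs discovered from each site's section pages.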


2021 ◽  
Vol 21 (2) ◽  
pp. 105-120
Author(s):  
K. S. Sakunthala Prabha ◽  
C. Mahesh ◽  
S. P. Raja

Abstract A topic-precise crawler is a special-purpose web crawler that downloads web pages relevant to a particular topic by measuring a cosine-similarity or semantic-similarity score. The cosine-based measure yields an inaccurate relevance score when the topic term does not occur directly in the web page. The semantic-based measure gives a precise relevance score even when only synonyms of the given topic occur in the page, but when the topic is absent from the ontology, semantic focused crawlers likewise produce inaccurate scores. This paper overcomes these shortcomings with a hybrid string-matching algorithm that combines the semantic similarity measure with a probabilistic similarity measure. The experimental results show that this algorithm increases the efficiency of focused web crawlers and achieves better Harvest Rate (HR), Precision (P), and Irrelevance Ratio (IR) than existing focused web crawlers.
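As an illustration of the hybrid idea, the sketch below combines a TF-IDF cosine score with a WordNet-based semantic score in Python. The paper's actual probabilistic measure and combination rule are not given in the abstract, so the path-similarity fallback and the mixing weight `alpha` are assumptions (WordNet requires `nltk.download("wordnet")`).

```python
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_score(topic, page_text):
    """TF-IDF cosine similarity; near zero when the topic term is absent."""
    tfidf = TfidfVectorizer().fit_transform([topic, page_text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def semantic_score(topic, page_text):
    """Best WordNet path similarity between the topic word and any word on
    the page, so synonyms of the topic still earn a high relevance score."""
    topic_synsets = wn.synsets(topic)
    best = 0.0
    for word in set(page_text.lower().split()):
        for s1 in topic_synsets:
            for s2 in wn.synsets(word):
                sim = s1.path_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
    return best

def hybrid_score(topic, page_text, alpha=0.5):
    # alpha is a hypothetical mixing weight, not taken from the paper.
    return (alpha * cosine_score(topic, page_text)
            + (1 - alpha) * semantic_score(topic, page_text))
```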


Author(s):  
Jingfa Liu ◽  
Wei Zhang ◽  
Zhihe Yang ◽  
Ziang Liu

Traditional crawlers have difficulty performing semantic analysis, so focused crawler technologies with topic-preference characteristics have received much attention in recent years. To increase the precision of focused crawlers and prevent "topic drift", this paper adopts a comprehensive relevancy evaluation (CRE) of hyperlinks based on a combination of web content and link structure. In addition, an improved version of the energy landscape paving (ELP) algorithm, a Metropolis-sampling-based global optimization method, is proposed to keep the focused crawler from falling into local optima. By incorporating the CRE strategy into the improved ELP, a novel focused crawler strategy, denoted IELP, is proposed. Experimental results in the rainstorm disaster domain show that the precision of the proposed focused crawler is clearly higher than that of other focused crawlers in the literature, illustrating the ability of IELP to retrieve topic-related web pages.
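A rough sketch of the two ingredients, under stated assumptions: the CRE is modeled as a weighted sum of content, anchor, and link-structure signals, and the ELP step as Metropolis acceptance with a visit-histogram penalty. The paper's actual energy function and weights are not given in the abstract.

```python
import math
import random
from collections import Counter

visit_counts = Counter()  # ELP histogram: how often each URL was proposed

def comprehensive_relevancy(content_sim, anchor_sim, in_degree,
                            w_content=0.6, w_anchor=0.3, w_link=0.1):
    """Combine page-content relevance, anchor-text relevance, and a
    link-structure signal into one hyperlink score (assumed weights)."""
    link_score = in_degree / (1 + in_degree)  # squash counts into [0, 1)
    return w_content * content_sim + w_anchor * anchor_sim + w_link * link_score

def elp_accept(current_energy, candidate_energy, candidate_url, temperature=1.0):
    """Metropolis acceptance with an ELP-style histogram penalty: URLs
    proposed often get their energy inflated, pushing the crawl out of
    local optima. Energy = 1 - relevancy, so lower is better."""
    penalized = candidate_energy + 0.1 * visit_counts[candidate_url]
    visit_counts[candidate_url] += 1
    if penalized <= current_energy:
        return True
    return random.random() < math.exp(-(penalized - current_energy) / temperature)
```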


Author(s):  
Wei Wang ◽  
Lihua Yu

Focused crawlers, as fundamental components of vertical search engines, crawl only web pages related to a specific topic. Existing focused crawlers commonly suffer from low crawling efficiency and topic drift. In this paper, we propose a learning-based focused crawler that uses a URL knowledge base. To improve the accuracy of the similarity measure, topic similarity is computed from the parent page content, the anchor text, and the URL content, and the URL content score is learned and updated iteratively and continuously. Within the crawler, we implement a crawling mechanism that combines content analysis with a simple link-analysis strategy, which reduces computational complexity and avoids the locality problem of crawling. Experimental results show that the proposed algorithm achieves better precision than traditional methods, including the shark-search and best-first search algorithms, and avoids the local-optimum problem of crawling.
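A minimal sketch of the three-signal priority and the iterative URL knowledge-base update might look like this; the weights, the token-level representation of the knowledge base, and the learning rate are assumptions not taken from the paper.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

url_knowledge_base = {}  # token -> learned relevance weight in [0, 1]

def tokenize_url(url):
    """Split a URL into word-like tokens (scheme, host words, path words)."""
    return [t for t in re.split(r"[^a-zA-Z0-9]+", url.lower()) if t]

def text_similarity(topic, text):
    if not text.strip():
        return 0.0
    tfidf = TfidfVectorizer().fit_transform([topic, text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def link_priority(topic, parent_text, anchor_text, url,
                  w_parent=0.5, w_anchor=0.3, w_url=0.2):
    """Weighted mix of parent-page, anchor-text, and URL-content scores."""
    tokens = tokenize_url(url)
    url_score = (sum(url_knowledge_base.get(t, 0.0) for t in tokens) / len(tokens)
                 if tokens else 0.0)
    return (w_parent * text_similarity(topic, parent_text)
            + w_anchor * text_similarity(topic, anchor_text)
            + w_url * url_score)

def update_url_knowledge(url, page_relevance, learning_rate=0.2):
    """After a page is crawled and judged, nudge its URL tokens' weights
    toward the observed relevance (a simple running update)."""
    for t in tokenize_url(url):
        old = url_knowledge_base.get(t, 0.0)
        url_knowledge_base[t] = old + learning_rate * (page_relevance - old)
```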


2019 ◽  
Vol 19 (2) ◽  
pp. 146-158 ◽  
Author(s):  
S. R. Mani Sekhar ◽  
G. M. Siddesh ◽  
Sunilkumar S. Manvi ◽  
K. G. Srinivasa

Abstract With the rapid growth of digital technologies, crawlers and search engines face unpredictable challenges. Focused web crawlers are essential for mining the boundless data available on the Internet, but they face an indeterminate latency problem caused by differences in the response times of web sources. The proposed work optimizes the design and implementation of focused web crawlers for Bioinformatics web sources using a master-slave architecture. A focused crawler should ideally crawl only relevant pages, yet the relevance of a page can be estimated only after crawling it; this paper therefore proposes a solution for predicting page relevance based on natural language processing. The frequency of the topic keywords in the top-ranked sentences of a page determines its relevance within genomics sources. The proposed solution uses the TextRank algorithm to rank sentences and to ensure correct classification of Bioinformatics web pages. Finally, the model is validated by comparison with a breadth-first-search web crawler; the comparison shows a significant reduction in run time for the same harvest rate.
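The relevance test can be sketched directly: run TextRank (PageRank over a sentence-similarity graph) and count topic keywords in the top-ranked sentences. The top-k size is an assumption; the abstract does not state it.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def page_relevance(sentences, keywords, top_k=5):
    """Rank sentences with TextRank, then count topic keywords in the
    top-ranked ones; higher counts mean a more relevant page."""
    if not sentences:
        return 0
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)      # sentence-by-sentence similarity
    graph = nx.from_numpy_array(sim)    # weighted similarity graph
    scores = nx.pagerank(graph)         # TextRank = PageRank on this graph
    ranked = sorted(range(len(sentences)), key=scores.get, reverse=True)
    top_text = " ".join(sentences[i] for i in ranked[:top_k]).lower()
    return sum(top_text.count(kw.lower()) for kw in keywords)
```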


Author(s):  
Venugopal Boppana ◽  
Sandhya P

The large and ever-growing range of online information has made it difficult for crawlers and search engines to extract related information. This paper discusses focused crawlers, also called topic-specific crawlers, and variations of focused crawlers leading to a distributed architecture, i.e., a context-aware notification architecture. A focused crawler is used to retrieve relevant pages from the huge amount of information available on the Internet; it can return relevant pages for a given topic with fewer searches and in a shorter time. Here the input to the focused crawler is a topic specified using exemplary documents rather than keywords. A focused crawler avoids searching all web documents and instead follows only the links that lie within the crawler's boundary. This focused crawling mechanism saves considerable CPU time in keeping the crawl up to date.
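One way to realize topic-by-example scoring is to rank frontier URLs by similarity to the centroid of the exemplary documents, as sketched below; the best-first frontier is a generic stand-in, since the paper's context-aware notification architecture is not detailed in the abstract.

```python
import heapq
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class ExemplarFrontier:
    """Best-first crawl frontier whose priority is similarity to the
    centroid of the exemplary documents (the topic-by-example profile)."""

    def __init__(self, exemplary_docs):
        self.vectorizer = TfidfVectorizer().fit(exemplary_docs)
        vectors = self.vectorizer.transform(exemplary_docs)
        self.centroid = np.asarray(vectors.mean(axis=0))
        self.heap = []  # max-heap via negated scores

    def score(self, text):
        return float(cosine_similarity(self.centroid,
                                       self.vectorizer.transform([text]))[0, 0])

    def push(self, url, context_text):
        heapq.heappush(self.heap, (-self.score(context_text), url))

    def pop(self):
        """Return the most topic-relevant URL on the crawl boundary."""
        return heapq.heappop(self.heap)[1]
```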


Author(s):  
Ashwani Kumar ◽  
Anuj Kumar ◽  
Rahul Mishra

A focused crawler traverses the World Wide Web, selecting pages relevant to a predefined topic and ignoring those outside it. Collecting domain-specific documents with focused crawlers is considered one of the most important schemes for finding relevant data. While browsing the Internet, it is difficult to deal with irrelevant pages and to predict which links lead to quality pages. Most focused crawlers use a local search algorithm to traverse the web space, but they can easily become trapped within a bounded subgraph of the web surrounding the starting URLs, and relevant pages are missed when no links lead to them from the starting URLs. To address this problem, we design a focused crawler that calculates the absolute frequency of the topic keyword together with its synonyms and sub-level synonyms. A weight table is constructed according to the user query, the resemblance of web pages to the topic keywords is checked, and the priority of each extracted link is computed.
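A sketch of the weight table and the weighted-frequency score, assuming WordNet as the source of synonyms and sub-level synonyms (hyponym lemmas); the specific weights 1.0/0.7/0.4 are illustrative, not from the paper.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def build_weight_table(topic_keyword):
    """Weight table over the keyword (1.0), its synonyms (0.7), and its
    sub-level synonyms, i.e. hyponym lemmas (0.4); weights are illustrative."""
    table = {topic_keyword.lower(): 1.0}
    for synset in wn.synsets(topic_keyword):
        for lemma in synset.lemma_names():
            table.setdefault(lemma.lower().replace("_", " "), 0.7)
        for hypo in synset.hyponyms():
            for lemma in hypo.lemma_names():
                table.setdefault(lemma.lower().replace("_", " "), 0.4)
    return table

def page_score(page_text, weight_table):
    """Weighted absolute frequency of every table entry in the page."""
    text = page_text.lower()
    return sum(weight * text.count(term) for term, weight in weight_table.items())
```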


2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Houqing Lu ◽  
Donghui Zhan ◽  
Lei Zhou ◽  
Dengchao He

A focused crawler is topic-specific and aims to selectively collect web pages relevant to a given topic from the Internet. However, the performance of current focused crawlers easily suffers from the page environment and from multi-topic web pages: in the crawling process, a highly relevant region may be ignored owing to the low overall relevance of its page, and anchor text or link context may misguide the crawler. To solve these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term-weighting approach (ITFIDF) to obtain highly relevant web pages. In addition, the paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content-block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance between the URLs on a page and the given topic. The experimental results demonstrate that the ITFIDF classifier outperforms TFIDF and that our focused crawler is superior to focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content-block partition in terms of harvest rate and target recall. In conclusion, our methods are effective for focused crawling.
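The block-level idea behind LPE can be sketched as follows: score each hyperlink by the relevance of its enclosing content block rather than the whole page. Plain TF-IDF and a fixed linear mix stand in for the paper's ITFIDF weighting and JFE features, which the abstract does not specify.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance(topic, text):
    """Plain TF-IDF cosine relevance (stand-in for the paper's ITFIDF)."""
    if not text.strip():
        return 0.0
    tfidf = TfidfVectorizer().fit_transform([topic, text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def link_priority_evaluation(topic, blocks):
    """blocks: list of (block_text, [(url, anchor_text), ...]) pairs.
    Scores each link by its enclosing block plus its anchor text, so a
    relevant region on a mostly off-topic page still surfaces."""
    priorities = {}
    for block_text, links in blocks:
        b_rel = relevance(topic, block_text)
        for url, anchor in links:
            priorities[url] = 0.7 * b_rel + 0.3 * relevance(topic, anchor)
    return priorities
```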


2015 ◽  
Vol 117 (8) ◽  
pp. 13-20
Author(s):  
Saturi Rajesh ◽  
D. Raju ◽  
P. Ajay Kumar ◽  
P. Srikanth

2015 ◽  
Vol 19 (3) ◽  
pp. 449-474 ◽  
Author(s):  
Karane Vieira ◽  
Luciano Barbosa ◽  
Altigran Soares da Silva ◽  
Juliana Freire ◽  
Edleno Moura
