UCrawler: A learning-based web crawler using a URL knowledge base
Focused crawlers, as fundamental components of vertical search engines, focus on crawling the web pages related to a specific topic. Existing focused crawlers commonly suffer from the problems of low efficiency of crawling pages and subject migration. In this paper, we propose a learning-based focused crawler using a URL knowledge base. To improve the accuracy of similarity, the similarity of the topic is measured with the parent page content, anchor information, and URL content. The URL content is also learned and updated iteratively and continuously. Within the crawler, we implement a crawling mechanism based on a combination of content analysis and simple link analysis crawler strategy, which decreases computational complexity and avoids the locality problem of crawling. Experimental results show that our proposed algorithm achieves a better precision than traditional methods including the shark-search and best-first search algorithms, and avoids the local optimum problem of crawling.