Design of a Parallel and Scalable Crawler for the Hidden Web

2022 ◽  
Vol 12 (1) ◽  
pp. 0-0

The WWW contains a huge amount of information from different areas. This information may be present in the form of web pages, media, articles (research journals/magazines), blogs, etc. A major portion of this information resides in web databases and can be retrieved only by raising queries at the interface offered by the specific database; it is thus called the Hidden Web. An important issue is to efficiently retrieve and provide access to this enormous amount of information through crawling. In this paper, we present the architecture of a parallel crawler for the Hidden Web that avoids download overlaps by following a domain-specific approach. The experimental results further show that the proposed parallel Hidden Web crawler (PSHWC) not only effectively but also efficiently extracts and downloads the contents of Hidden Web databases.
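The paper's overlap-avoidance logic is not reproduced here, but a minimal Python sketch illustrates the general idea behind a domain-specific partition: each site's domain is hashed to exactly one crawler process, so no two parallel workers ever download the same site. The function names and the worker count are illustrative assumptions, not the paper's design.

import hashlib
from urllib.parse import urlparse

NUM_WORKERS = 4  # hypothetical number of parallel crawler processes

def assigned_worker(url: str) -> int:
    """Map a URL's domain to exactly one worker index."""
    domain = urlparse(url).netloc.lower()
    digest = hashlib.md5(domain.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS

def should_crawl(url: str, worker_id: int) -> bool:
    """A worker downloads a URL only if its domain is assigned to it."""
    return assigned_worker(url) == worker_id

# Example: worker 0 checks whether it owns a discovered URL.
print(should_crawl("http://books.example.com/search?q=crawler", 0))

Because the assignment depends only on the domain, every worker can compute it locally, with no coordination needed to keep downloads disjoint.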

2015 ◽  
Vol 6 (4) ◽  
pp. 39-56
Author(s):  
Nan Jing ◽  
Mengdi Li ◽  
Su Zhang

Professional social networks give companies a platform to post hiring information and locate professional talent. However, such networks have a great number of users who generate a huge amount of information every day, which makes it difficult for a hiring company to judge the reliability of users' information and to evaluate their professional abilities. In this context, this article uses LinkedIn Mobile as the online professional social network and proposes a research approach to effectively identify unreliable information and evaluate users' abilities. First, the authors look for relevant social network profiles for a cross-site check. Second, on a single professional social networking site, the authors check the similarity between a user's background and his connections' backgrounds to detect possibly unreliable information. Third, they propose an algorithm to rank the trustfulness of users' recommendations based on the PageRank algorithm, which was traditionally used to evaluate the importance of web pages.
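The article's ranking builds on PageRank; a minimal sketch, assuming a toy endorsement graph (an edge u -> v meaning user u recommends user v) and the standard damping factor, shows how power iteration yields a trust score per user. The graph data and parameters below are invented for illustration and are not the paper's adaptation.

# Edge u -> v means user u recommends (endorses) user v.
endorsements = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol"],
}

def trust_rank(graph, damping=0.85, iterations=50):
    """PageRank-style power iteration over an endorsement graph."""
    users = list(graph)
    rank = {u: 1.0 / len(users) for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / len(users) for u in users}
        for u, endorsed in graph.items():
            if not endorsed:
                continue
            share = damping * rank[u] / len(endorsed)
            for v in endorsed:
                new_rank[v] = new_rank.get(v, 0.0) + share
        rank = new_rank
    return rank

# Users endorsed by many trusted users come out on top.
for user, score in sorted(trust_rank(endorsements).items(),
                          key=lambda kv: -kv[1]):
    print(f"{user}: {score:.3f}")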


Author(s):  
Rosy Madaan ◽  
Ashutosh Dixit ◽  
A. K. Sharma ◽  
Komal Kumar Bhatia

2014 ◽  
Vol 4 (2) ◽  
pp. 1-18
Author(s):  
Sonali Gupta ◽  
Komal Kumar Bhatia

A huge number of Hidden Web databases exist over the WWW, forming a massive source of high-quality information. Retrieving this information to enrich the repository of a search engine is the prime target of a Hidden Web crawler. Besides this, the crawler should perform the task at an affordable cost and with reasonable resource utilization. This paper proposes a Random ranking mechanism whereby the queries to be raised by the Hidden Web crawler are ranked. By ranking the queries according to the proposed mechanism, the Hidden Web crawler is able to make an optimal choice among the candidate queries and efficiently retrieve the Hidden Web databases. The Hidden Web crawler proposed here also possesses an extensible and scalable framework to improve the efficiency of crawling. The proposed approach has also been compared with other methods of Hidden Web crawling existing in the literature.
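The Random ranking mechanism itself is not reproduced here; the sketch below only illustrates the surrounding idea of scoring candidate queries and letting the crawler issue the most promising one first. The score used (estimated new records per unit cost) and all numbers are invented stand-ins for the paper's ranking.

candidate_queries = {
    # query term: (estimated new records returned, estimated cost)
    "database": (1500, 3.0),
    "crawler": (400, 1.0),
    "hidden web": (900, 2.0),
}

def rank_queries(candidates):
    """Order queries so the crawler issues the most promising one first."""
    def score(item):
        term, (new_records, cost) = item
        return new_records / cost  # higher is better
    return [term for term, _ in sorted(candidates.items(),
                                       key=score, reverse=True)]

print(rank_queries(candidate_queries))
# -> ['database', 'hidden web', 'crawler']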


The Dark Web ◽  
2018 ◽  
pp. 319-333
Author(s):  
Sudhakar Ranjan ◽  
Komal Kumar Bhatia

Nowadays, with the advent of Internet technologies and e-commerce, the need for smart search engines is rising. Traditional search engines are neither intelligent nor smart, and thus lead to rising search costs. In this paper, the architecture of a vertical search engine based on a domain-specific Hidden Web crawler is proposed. To build a least-cost vertical search engine, improvements to the following techniques are suggested: searching, indexing, ranking, transactions, and the query interface. The domain term analyzer filters out useless information to the maximum extent and finally provides users with high-precision information. The experimental results show that the system accelerates access, computation, storage, and communication, and thereby increases efficiency.
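As a rough illustration of what a domain term analyzer might do, the sketch below keeps a page only if the density of domain-specific terms crosses a threshold. The term list (a travel-domain example), the threshold, and the tokenizer are assumptions for illustration, not taken from the paper.

import re

DOMAIN_TERMS = {"flight", "airline", "fare", "departure", "booking"}
MIN_DENSITY = 0.02  # hypothetical cut-off: 2% of tokens must be domain terms

def is_relevant(page_text: str) -> bool:
    """Keep a page only if enough of its tokens are domain terms."""
    tokens = re.findall(r"[a-z]+", page_text.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in DOMAIN_TERMS)
    return hits / len(tokens) >= MIN_DENSITY

# Only pages passing the density test reach the vertical engine's index.
print(is_relevant("Compare airline fares and book your flight departure."))
print(is_relevant("A recipe blog about sourdough bread."))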


The Dark Web ◽  
2018 ◽  
pp. 65-83
Author(s):  
Sonali Gupta ◽  
Komal Kumar Bhatia

A huge number of Hidden Web databases exist over the WWW, forming a massive source of high-quality information. Retrieving this information to enrich the repository of a search engine is the prime target of a Hidden Web crawler. Besides this, the crawler should perform the task at an affordable cost and with reasonable resource utilization. This paper proposes a Random ranking mechanism whereby the queries to be raised by the Hidden Web crawler are ranked. By ranking the queries according to the proposed mechanism, the Hidden Web crawler is able to make an optimal choice among the candidate queries and efficiently retrieve the Hidden Web databases. The Hidden Web crawler proposed here also possesses an extensible and scalable framework to improve the efficiency of crawling. The proposed approach has also been compared with other methods of Hidden Web crawling existing in the literature.


2017 ◽  
Vol 7 (2) ◽  
pp. 19-33
Author(s):  
Sudhakar Ranjan ◽  
Komal Kumar Bhatia

Nowadays, with the advent of Internet technologies and e-commerce, the need for smart search engines is rising. Traditional search engines are neither intelligent nor smart, and thus lead to rising search costs. In this paper, the architecture of a vertical search engine based on a domain-specific Hidden Web crawler is proposed. To build a least-cost vertical search engine, improvements to the following techniques are suggested: searching, indexing, ranking, transactions, and the query interface. The domain term analyzer filters out useless information to the maximum extent and finally provides users with high-precision information. The experimental results show that the system accelerates access, computation, storage, and communication, and thereby increases efficiency.


Author(s):  
Sawroop Kaur ◽  
Aman Singh ◽  
G. Geetha ◽  
Xiaochun Cheng

Due to the massive size of the hidden web, searching, retrieving and mining rich, high-quality data can be a daunting task. Moreover, with the presence of forms, data cannot be accessed easily. Forms are dynamic, heterogeneous and spread over trillions of web pages. Significant efforts have addressed the problem of tapping into the hidden web to integrate and mine rich data. Effective techniques, as well as applications in special cases, need to be explored to achieve an effective harvest rate. One such special area is atmospheric science, where hidden web crawling is rarely implemented and the crawler must work through the huge web to narrow the search down to specific data. In this study, an intelligent hidden web crawler for harvesting data in urban domains (IHWC) is implemented to address the related problems of classifying domains, preventing exhaustive searching, and prioritizing URLs. The crawler also performs well in curating pollution-related data. The crawler targets relevant web pages and discards irrelevant ones by applying rejection rules. To achieve more accurate results for a focused crawl, IHWC crawls websites in priority order for a given topic. The crawler has fulfilled the dual objective of developing an effective hidden web crawler that can focus on diverse domains and of checking its integration in searching pollution data in smart cities. One of the objectives of smart cities is to reduce pollution; the resulting crawled data can be used to find the causes of pollution, and the crawler can help users search the level of pollution in a specific area. The harvest rate of the crawler is compared with pioneering existing work. With an increase in the size of a dataset, the presented crawler can add significant value to emission accuracy. Our results demonstrate the accuracy and harvest rate of the proposed framework: it efficiently collects hidden web interfaces from large-scale sites and achieves higher rates than other crawlers.
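A minimal sketch of the "reject irrelevant, prioritize relevant" strategy described above: rejection rules drop URLs matching unwanted patterns, and the remaining URLs enter a priority queue scored by topic-term overlap in the anchor text. The rule patterns, topic terms, and URLs are invented examples, not IHWC's actual rules.

import heapq
import re

REJECT_PATTERNS = [r"/login", r"/ads/", r"\.(jpg|png|css|js)$"]  # hypothetical
TOPIC_TERMS = {"pollution", "emission", "air", "quality", "urban"}

def rejected(url: str) -> bool:
    """Rejection rules discard URLs that match unwanted patterns."""
    return any(re.search(p, url) for p in REJECT_PATTERNS)

def priority(url: str, anchor_text: str) -> float:
    """Higher score for URLs whose anchor text mentions topic terms."""
    words = set(anchor_text.lower().split())
    return len(words & TOPIC_TERMS)

frontier = []  # max-heap via negated priority

def enqueue(url: str, anchor_text: str) -> None:
    if not rejected(url):
        heapq.heappush(frontier, (-priority(url, anchor_text), url))

enqueue("http://example.org/air-quality", "urban air pollution data")
enqueue("http://example.org/login", "sign in")   # discarded by a rejection rule
enqueue("http://example.org/news", "sports news")
print(heapq.heappop(frontier)[1])  # the most relevant URL is crawled first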


2014 ◽  
Vol 543-547 ◽  
pp. 2941-2944 ◽  
Author(s):  
Guo Chao Liang ◽  
Cai Feng Cao

An LED-optical-design focused web crawler is proposed based on Shark-Search and a topical dictionary. The crawling strategy is implemented by extending the web crawler Heritrix. The experimental results show that the design scheme (Topic-First) and its web crawler, LED-Crawler, can effectively capture web pages relevant to LED optical design, and that compared to general search engines, LED-Crawler improves accuracy.
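Shark-Search propagates a page's relevance to the links it contains; a minimal sketch, assuming a toy LED topical dictionary and illustrative decay and weighting constants, combines the inherited parent score with the anchor text's dictionary relevance. This simplifies the full Shark-Search formula, and Heritrix integration details are omitted.

LED_DICTIONARY = {"led", "optical", "lens", "luminous", "illumination"}
DECAY = 0.5          # how much of the parent's score a child inherits
ANCHOR_WEIGHT = 0.8  # weight of anchor-text relevance vs. inherited score

def dictionary_relevance(text: str) -> float:
    """Fraction of words that appear in the topical dictionary."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in LED_DICTIONARY for w in words) / len(words)

def child_score(parent_score: float, anchor_text: str) -> float:
    """Shark-Search-style priority for a link found on a parent page."""
    inherited = DECAY * parent_score
    anchor = dictionary_relevance(anchor_text)
    return ANCHOR_WEIGHT * anchor + (1 - ANCHOR_WEIGHT) * inherited

# A link with LED-related anchor text outranks an off-topic one.
print(child_score(0.9, "LED lens optical design notes"))
print(child_score(0.9, "company contact page"))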

