Design of a Parallel and Scalable Crawler for the Hidden Web

2022 ◽  
Vol 12 (1) ◽  
pp. 0-0

The WWW contains a huge amount of information from different areas. This information may be present in the form of web pages, media, articles (research journals/magazines), blogs, etc. A major portion of this information resides in web databases and can be retrieved only by raising queries at the interface offered by the specific database; it is thus called the Hidden Web. An important issue is to efficiently retrieve and provide access to this enormous amount of information through crawling. In this paper, we present the architecture of a parallel crawler for the Hidden Web that avoids download overlaps by following a domain-specific approach. The experimental results further show that the proposed parallel Hidden Web crawler (PSHWC) not only effectively but also efficiently extracts and downloads the contents of Hidden Web databases.
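The paper's overlap-avoidance logic is not reproduced here, but a minimal Python sketch illustrates the general idea behind a domain-specific partition: each site's domain is hashed to exactly one crawler process, so no two parallel workers ever download the same site. The function names and the worker count are illustrative assumptions, not the paper's design.

import hashlib
from urllib.parse import urlparse

NUM_WORKERS = 4  # hypothetical number of parallel crawler processes

def assigned_worker(url: str) -> int:
    """Map a URL's domain to exactly one worker index."""
    domain = urlparse(url).netloc.lower()
    digest = hashlib.md5(domain.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS

def should_crawl(url: str, worker_id: int) -> bool:
    """A worker downloads a URL only if its domain is assigned to it."""
    return assigned_worker(url) == worker_id

# Example: worker 0 checks whether it owns a discovered URL.
print(should_crawl("http://books.example.com/search?q=crawler", 0))

Because the assignment depends only on the domain, every worker can compute it locally, with no coordination needed to keep downloads disjoint.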

2015 ◽  
Vol 6 (4) ◽  
pp. 39-56
Author(s):  
Nan Jing ◽  
Mengdi Li ◽  
Su Zhang

Professional social networks give companies a platform to post hiring information and locate professional talent. However, such networks have a great number of users who generate a huge amount of information every day, which makes it difficult for a hiring company to judge the reliability of users' information and to evaluate their professional abilities. In this context, this article uses LinkedIn Mobile as the online professional social network and proposes a research approach to effectively identify unreliable information and evaluate users' abilities. First, the authors look for relevant social network profiles for a cross-site check. Second, on a single professional social networking site, the authors check the similarity between a user's background and his connections' backgrounds to detect possibly unreliable information. Third, they propose an algorithm to rank the trustfulness of users' recommendations based on the PageRank algorithm, which was traditionally used to evaluate the importance of web pages.
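The article's ranking builds on PageRank; a minimal sketch, assuming a toy endorsement graph (an edge u -> v meaning user u recommends user v) and the standard damping factor, shows how power iteration yields a trust score per user. The graph data and parameters below are invented for illustration and are not the paper's adaptation.

# Edge u -> v means user u recommends (endorses) user v.
endorsements = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol"],
}

def trust_rank(graph, damping=0.85, iterations=50):
    """PageRank-style power iteration over an endorsement graph."""
    users = list(graph)
    rank = {u: 1.0 / len(users) for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / len(users) for u in users}
        for u, endorsed in graph.items():
            if not endorsed:
                continue
            share = damping * rank[u] / len(endorsed)
            for v in endorsed:
                new_rank[v] = new_rank.get(v, 0.0) + share
        rank = new_rank
    return rank

# Users endorsed by many trusted users come out on top.
for user, score in sorted(trust_rank(endorsements).items(),
                          key=lambda kv: -kv[1]):
    print(f"{user}: {score:.3f}")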


Author(s):  
Rosy Madaan ◽  
Ashutosh Dixit ◽  
A. K. Sharma ◽  
Komal Kumar Bhatia

2014 ◽  
Vol 4 (2) ◽  
pp. 1-18
Author(s):  
Sonali Gupta ◽  
Komal Kumar Bhatia

A huge number of Hidden Web databases exist over the WWW, forming a massive source of high-quality information. Retrieving this information to enrich the repository of a search engine is the prime target of a Hidden Web crawler. Besides this, the crawler should perform the task at an affordable cost and with reasonable resource utilization. This paper proposes a Random ranking mechanism whereby the queries to be raised by the Hidden Web crawler are ranked. By ranking the queries according to the proposed mechanism, the Hidden Web crawler is able to make an optimal choice among the candidate queries and efficiently retrieve the Hidden Web databases. The Hidden Web crawler proposed here also possesses an extensible and scalable framework to improve the efficiency of crawling. The proposed approach has also been compared with other methods of Hidden Web crawling existing in the literature.
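The Random ranking mechanism itself is not reproduced here; the sketch below only illustrates the surrounding idea of scoring candidate queries and letting the crawler issue the most promising one first. The score used (estimated new records per unit cost) and all numbers are invented stand-ins for the paper's ranking.

candidate_queries = {
    # query term: (estimated new records returned, estimated cost)
    "database": (1500, 3.0),
    "crawler": (400, 1.0),
    "hidden web": (900, 2.0),
}

def rank_queries(candidates):
    """Order queries so the crawler issues the most promising one first."""
    def score(item):
        term, (new_records, cost) = item
        return new_records / cost  # higher is better
    return [term for term, _ in sorted(candidates.items(),
                                       key=score, reverse=True)]

print(rank_queries(candidate_queries))
# -> ['database', 'hidden web', 'crawler']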


The Dark Web ◽  
2018 ◽  
pp. 319-333
Author(s):  
Sudhakar Ranjan ◽  
Komal Kumar Bhatia

Nowadays, with the advent of Internet technologies and e-commerce, the need for smart search engines is rising. Traditional search engines are neither intelligent nor smart, and thus lead to rising search costs. In this paper, the architecture of a vertical search engine based on a domain-specific Hidden Web crawler is proposed. To build a least-cost vertical search engine, improvements to the following techniques are suggested: searching, indexing, ranking, transactions, and the query interface. The domain term analyzer filters out useless information to the maximum extent and finally provides users with high-precision information. The experimental results show that the system accelerates access, computation, storage, and communication, and thereby increases efficiency.
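As a rough illustration of what a domain term analyzer might do, the sketch below keeps a page only if the density of domain-specific terms crosses a threshold. The term list (a travel-domain example), the threshold, and the tokenizer are assumptions for illustration, not taken from the paper.

import re

DOMAIN_TERMS = {"flight", "airline", "fare", "departure", "booking"}
MIN_DENSITY = 0.02  # hypothetical cut-off: 2% of tokens must be domain terms

def is_relevant(page_text: str) -> bool:
    """Keep a page only if enough of its tokens are domain terms."""
    tokens = re.findall(r"[a-z]+", page_text.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in DOMAIN_TERMS)
    return hits / len(tokens) >= MIN_DENSITY

# Only pages passing the density test reach the vertical engine's index.
print(is_relevant("Compare airline fares and book your flight departure."))
print(is_relevant("A recipe blog about sourdough bread."))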


The Dark Web ◽  
2018 ◽  
pp. 65-83
Author(s):  
Sonali Gupta ◽  
Komal Kumar Bhatia

A huge number of Hidden Web databases exist over the WWW, forming a massive source of high-quality information. Retrieving this information to enrich the repository of a search engine is the prime target of a Hidden Web crawler. Besides this, the crawler should perform the task at an affordable cost and with reasonable resource utilization. This paper proposes a Random ranking mechanism whereby the queries to be raised by the Hidden Web crawler are ranked. By ranking the queries according to the proposed mechanism, the Hidden Web crawler is able to make an optimal choice among the candidate queries and efficiently retrieve the Hidden Web databases. The Hidden Web crawler proposed here also possesses an extensible and scalable framework to improve the efficiency of crawling. The proposed approach has also been compared with other methods of Hidden Web crawling existing in the literature.


2017 ◽  
Vol 7 (2) ◽  
pp. 19-33
Author(s):  
Sudhakar Ranjan ◽  
Komal Kumar Bhatia

Nowadays, with the advent of Internet technologies and e-commerce, the need for smart search engines is rising. Traditional search engines are neither intelligent nor smart, and thus lead to rising search costs. In this paper, the architecture of a vertical search engine based on a domain-specific Hidden Web crawler is proposed. To build a least-cost vertical search engine, improvements to the following techniques are suggested: searching, indexing, ranking, transactions, and the query interface. The domain term analyzer filters out useless information to the maximum extent and finally provides users with high-precision information. The experimental results show that the system accelerates access, computation, storage, and communication, and thereby increases efficiency.


Author(s):  
Sawroop Kaur ◽  
Aman Singh ◽  
G. Geetha ◽  
Xiaochun Cheng

Due to the massive size of the hidden web, searching, retrieving and mining rich, high-quality data can be a daunting task. Moreover, with the presence of forms, data cannot be accessed easily. Forms are dynamic, heterogeneous and spread over trillions of web pages. Significant efforts have addressed the problem of tapping into the hidden web to integrate and mine rich data. Effective techniques, as well as applications in special cases, need to be explored to achieve an effective harvest rate. One such special area is atmospheric science, where hidden web crawling is rarely implemented and the crawler must work through the huge web to narrow the search down to specific data. In this study, an intelligent hidden web crawler for harvesting data in urban domains (IHWC) is implemented to address the related problems of classifying domains, preventing exhaustive searching, and prioritizing URLs. The crawler also performs well in curating pollution-related data. The crawler targets relevant web pages and discards irrelevant ones by applying rejection rules. To achieve more accurate results for a focused crawl, IHWC crawls websites in priority order for a given topic. The crawler has fulfilled the dual objective of developing an effective hidden web crawler that can focus on diverse domains and of checking its integration in searching pollution data in smart cities. One of the objectives of smart cities is to reduce pollution; the resulting crawled data can be used to find the causes of pollution, and the crawler can help users search the level of pollution in a specific area. The harvest rate of the crawler is compared with pioneering existing work. With an increase in the size of a dataset, the presented crawler can add significant value to emission accuracy. Our results demonstrate the accuracy and harvest rate of the proposed framework: it efficiently collects hidden web interfaces from large-scale sites and achieves higher rates than other crawlers.
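A minimal sketch of the "reject irrelevant, prioritize relevant" strategy described above: rejection rules drop URLs matching unwanted patterns, and the remaining URLs enter a priority queue scored by topic-term overlap in the anchor text. The rule patterns, topic terms, and URLs are invented examples, not IHWC's actual rules.

import heapq
import re

REJECT_PATTERNS = [r"/login", r"/ads/", r"\.(jpg|png|css|js)$"]  # hypothetical
TOPIC_TERMS = {"pollution", "emission", "air", "quality", "urban"}

def rejected(url: str) -> bool:
    """Rejection rules discard URLs that match unwanted patterns."""
    return any(re.search(p, url) for p in REJECT_PATTERNS)

def priority(url: str, anchor_text: str) -> float:
    """Higher score for URLs whose anchor text mentions topic terms."""
    words = set(anchor_text.lower().split())
    return len(words & TOPIC_TERMS)

frontier = []  # max-heap via negated priority

def enqueue(url: str, anchor_text: str) -> None:
    if not rejected(url):
        heapq.heappush(frontier, (-priority(url, anchor_text), url))

enqueue("http://example.org/air-quality", "urban air pollution data")
enqueue("http://example.org/login", "sign in")   # discarded by a rejection rule
enqueue("http://example.org/news", "sports news")
print(heapq.heappop(frontier)[1])  # the most relevant URL is crawled first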


2014 ◽  
Vol 543-547 ◽  
pp. 2941-2944 ◽  
Author(s):  
Guo Chao Liang ◽  
Cai Feng Cao

An LED-optical-design focused web crawler is proposed based on Shark-Search and a topical dictionary. The crawling strategy is implemented by extending the web crawler Heritrix. The experimental results show that the design scheme (Topic-First) and its web crawler, LED-Crawler, can effectively capture web pages relevant to LED optical design, and that compared to general search engines, LED-Crawler improves accuracy.
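Shark-Search propagates a page's relevance to the links it contains; a minimal sketch, assuming a toy LED topical dictionary and illustrative decay and weighting constants, combines the inherited parent score with the anchor text's dictionary relevance. This simplifies the full Shark-Search formula, and Heritrix integration details are omitted.

LED_DICTIONARY = {"led", "optical", "lens", "luminous", "illumination"}
DECAY = 0.5          # how much of the parent's score a child inherits
ANCHOR_WEIGHT = 0.8  # weight of anchor-text relevance vs. inherited score

def dictionary_relevance(text: str) -> float:
    """Fraction of words that appear in the topical dictionary."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in LED_DICTIONARY for w in words) / len(words)

def child_score(parent_score: float, anchor_text: str) -> float:
    """Shark-Search-style priority for a link found on a parent page."""
    inherited = DECAY * parent_score
    anchor = dictionary_relevance(anchor_text)
    return ANCHOR_WEIGHT * anchor + (1 - ANCHOR_WEIGHT) * inherited

# A link with LED-related anchor text outranks an off-topic one.
print(child_score(0.9, "LED lens optical design notes"))
print(child_score(0.9, "company contact page"))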

