Deep Web Information Retrieval Process

Author(s):  
Dilip Kumar Sharma ◽  
A. K. Sharma

Web crawlers specialize in downloading web content from the surface web, which consists of interlinked HTML pages, and in analyzing and indexing it. They have limitations, however, when the data sits behind a query interface, where the response depends on the querying party's context and the crawler must engage in a dialogue to negotiate for the information. In this article, the authors discuss deep web searching techniques. A survey of the technical literature on deep web searching contributes to the development of a general framework. The existing frameworks and mechanisms of present web crawlers are taxonomically classified into four steps and analyzed to find their limitations in searching the deep web.
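To make the limitation concrete, here is a minimal sketch of how a crawler might recognize that a page's content sits behind a query interface rather than behind plain hyperlinks. It assumes the third-party requests and beautifulsoup4 packages; the free-text-input heuristic and the example URL are assumptions for illustration, not details from the article.

```python
import requests
from bs4 import BeautifulSoup

def find_query_interfaces(url):
    """Return the form targets on a page that act as query interfaces,
    i.e. the entry points a plain link-following crawler cannot get past."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    interfaces = []
    for form in soup.find_all("form"):
        # Heuristic (an assumption, not from the article): a form with a
        # free-text or search input is a query interface, since the server's
        # response depends on what the client asks for.
        if form.find("input", {"type": ["text", "search"]}):
            interfaces.append(form.get("action"))
    return interfaces

if __name__ == "__main__":
    # Hypothetical target; any form-driven site would do.
    print(find_query_interfaces("https://example.com/"))
```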


Author(s):  
Dilip Kumar Sharma ◽  
A. K. Sharma

A traditional crawler picks up a URL, retrieves the corresponding page, extracts the links it contains, and adds them to the queue. A deep Web crawler, after adding links to the queue, additionally checks the page for forms; if forms are present, it processes them to retrieve the required information. Various techniques have been proposed for crawling deep Web information, but much of it remains undiscovered. In this paper, the authors analyze and compare important deep Web crawling techniques to identify their relative limitations and advantages. To minimize the limitations of existing deep Web crawlers, a novel architecture is proposed based on QIIIEP specifications (Sharma & Sharma, 2009). The proposed architecture is cost-effective and offers both privatized search and general search for deep Web data hidden behind HTML forms.
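The contrast between the two crawl loops can be sketched directly. The following Python sketch assumes the requests and beautifulsoup4 packages; the form-processing step is deliberately left as a stub, since the QIIIEP-based handling is the subject of the paper itself and is not reconstructed here.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def process_form(page_url, form):
    """Placeholder for the deep Web step: fill in and submit the form
    (e.g. with QIIIEP-assisted values) and harvest the response."""
    pass  # form handling is beyond an abstract-level sketch

def crawl(seed, max_pages=50):
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        # Traditional step: extract links and add them to the queue.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Deep Web step: after queuing the links, check for forms.
        for form in soup.find_all("form"):
            process_form(url, form)

crawl("https://example.com/")  # hypothetical seed URL
```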


2021 ◽  
pp. 50-71
Author(s):  
Shakeel Ahmed ◽  
Shubham Sharma ◽  
Saneh Lata Yadav

Information retrieval is the task of finding material of an unstructured nature within large collections stored on computers. The surface web consists of indexed content accessible through traditional browsers, whereas deep or hidden web content cannot be found with traditional search engines and requires a password or network permissions. Within the deep web, the dark web is also growing as new tools make it easier to navigate hidden content, which is accessible only with special software such as Tor. According to a study published in Nature, Google indexes no more than 16% of the surface web and misses all of the deep web; any given search turns up just 0.03% of the information that exists online. Thus, the key part of the hidden web remains inaccessible to users. This chapter begins by posing some questions about this research. Detailed definitions and analogies are explained, related work is discussed, and the advantages and limitations of the existing work proposed by researchers are put forward. The chapter identifies the need for a system that will process both surface and hidden web data and return integrated results to users.
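As an aside on the tooling, the snippet below is a minimal sketch of what "accessible only with special software such as Tor" amounts to in practice: routing requests through a local Tor client. It assumes a Tor daemon listening on its default SOCKS port 9050 and the requests package installed with SOCKS support; the .onion address is a placeholder, not from the chapter.

```python
import requests  # needs SOCKS support: pip install "requests[socks]"

# socks5h (rather than socks5) makes Tor resolve .onion names itself.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_over_tor(url):
    """Fetch a page through a local Tor client on its default SOCKS port."""
    return requests.get(url, proxies=TOR_PROXIES, timeout=60).text

# fetch_over_tor("http://someonionservice.onion/")  # placeholder address
```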


2018 ◽  
Vol 52 (2) ◽  
pp. 266-277 ◽  
Author(s):  
Hyo-Jung Oh ◽  
Dong-Hyun Won ◽  
Chonghyuck Kim ◽  
Sung-Hee Park ◽  
Yong Kim

Purpose: The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.
Design/methodology/approach: This study proposes and develops an algorithm that collects web information, just as a crawler gathers static webpages, by managing script commands as links. The proposed web crawler validates the algorithm experimentally by collecting deep webpages.
Findings: Among the findings of this study is that when the crawled site provides search results as script pages, a conventional crawler collects only the first page, whereas the proposed algorithm can collect the deep webpages beyond it.
Research limitations/implications: To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script or if the web document contains script errors.
Practical implications: The deep web is estimated to contain 450 to 550 times more information than the surface web, and its documents are difficult to collect. This algorithm helps to enable deep web collection through script runs.
Originality/value: This study presents a new method that utilizes script links instead of the previously adopted keywords, allowing a script to be handled like an ordinary URL. The conducted experiment shows that the scripts on individual websites must be analyzed before they can be employed as links.
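As a rough illustration of managing script commands as links, the sketch below uses Selenium as a stand-in for the browser object the authors used (the paper's implementation is tied to Microsoft Visual Studio's browser control, which is not reproduced here). The CSS selector and the javascript:-href convention are assumptions about the page being crawled, not details from the paper.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def collect_script_pages(url):
    """Run each javascript: link on a result page and save the dynamically
    generated page it produces."""
    driver = webdriver.Firefox()  # any Selenium-supported browser works
    pages = []
    try:
        driver.get(url)
        anchors = driver.find_elements(By.CSS_SELECTOR, "a[href^='javascript:']")
        # Collect the script commands first so later navigation does not
        # invalidate the element handles.
        scripts = [a.get_attribute("href") for a in anchors]
        for script in scripts:
            # Treat the script command itself as the "link": execute it,
            # then capture the page it generates.
            driver.execute_script(script.replace("javascript:", "", 1))
            pages.append(driver.page_source)
            driver.get(url)  # return to the result list for the next run
    finally:
        driver.quit()
    return pages
```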


2021 ◽  
Vol 2021 ◽  
pp. 1-21
Author(s):  
Randa Basheer ◽  
Bassel Alkhatib

From the proactive detection of cyberattacks to the identification of key actors, analyzing the contents of the Dark Web plays a significant role in deterring cybercrime and understanding criminal minds. Research on the Dark Web has proved to be an essential step in fighting cybercrime, whether as a standalone investigation of the Dark Web alone or as an integrated one that also includes content from the Surface Web and the Deep Web. In this review, we probe recent studies in the field of analyzing Dark Web content for Cyber Threat Intelligence (CTI), introducing a comprehensive analysis of their techniques, methods, tools, approaches, and results, and discussing their possible limitations. We demonstrate the significance of studying the contents of different platforms on the Dark Web, leading new researchers through state-of-the-art methodologies. Furthermore, we discuss the technical challenges, ethical considerations, and future directions in the domain.


Author(s):  
Hadrian Peter ◽  
Charles Greenidge

Traditionally, a great deal of research has been devoted to data extraction on the web (Crescenzi et al., 2001; Embley et al., 2005; Laender et al., 2002; Hammer et al., 1997; Ribeiro-Neto et al., 1999; Huck et al., 1998; Wang & Lochovsky, 2002, 2003) from areas where data is easily indexed and extracted by a search engine, the so-called Surface Web. There are, however, other sites, greater in number and potentially more vital, that contain information which cannot be readily indexed by standard search engines. These sites, which are designed to require some level of direct human participation (for example, to issue queries rather than simply follow hyperlinks), cannot be handled using the simple link-traversal techniques employed by many web crawlers (Rappaport, 2000; Cho & Garcia-Molina, 2000; Cho et al., 1998; Edwards et al., 2001). This area of the web, which has been operationally off-limits for crawlers using standard indexing procedures, is termed the Deep Web (Zillman, 2005; Bergman, 2000). Much work still needs to be done, as Deep Web sites represent an area that is only recently being explored to identify where potential uses can be developed.

