Deep Web Information Retrieval Process

Author(s):  
Dilip Kumar Sharma ◽  
A. K. Sharma

Web crawlers specialize in downloading web content from the surface web, which consists of interlinked HTML pages, and in analyzing and indexing it. They have limitations, however, when the data sits behind a query interface, where the response depends on the querying party's context and the crawler must engage in a dialogue to negotiate for the information. In this article, the authors discuss deep web searching techniques. A survey of the technical literature on deep web searching contributes to the development of a general framework. The existing frameworks and mechanisms of present web crawlers are taxonomically classified into four steps and analyzed to find their limitations in searching the deep web.
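To make the limitation concrete, here is a minimal sketch of how a crawler might recognize that a page's content sits behind a query interface rather than behind plain hyperlinks. It assumes the third-party requests and beautifulsoup4 packages; the free-text-input heuristic and the example URL are assumptions for illustration, not details from the article.

```python
import requests
from bs4 import BeautifulSoup

def find_query_interfaces(url):
    """Return the form targets on a page that act as query interfaces,
    i.e. the entry points a plain link-following crawler cannot get past."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    interfaces = []
    for form in soup.find_all("form"):
        # Heuristic (an assumption, not from the article): a form with a
        # free-text or search input is a query interface, since the server's
        # response depends on what the client asks for.
        if form.find("input", {"type": ["text", "search"]}):
            interfaces.append(form.get("action"))
    return interfaces

if __name__ == "__main__":
    # Hypothetical target; any form-driven site would do.
    print(find_query_interfaces("https://example.com/"))
```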


Author(s):  
Dilip Kumar Sharma ◽  
A. K. Sharma

A traditional crawler picks up a URL, retrieves the corresponding page, extracts the links it contains, and adds them to the queue. A deep Web crawler, after adding links to the queue, additionally checks the page for forms; if forms are present, it processes them to retrieve the required information. Various techniques have been proposed for crawling deep Web information, but much of it remains undiscovered. In this paper, the authors analyze and compare important deep Web crawling techniques to identify their relative limitations and advantages. To minimize the limitations of existing deep Web crawlers, a novel architecture is proposed based on QIIIEP specifications (Sharma & Sharma, 2009). The proposed architecture is cost-effective and offers both privatized search and general search for deep Web data hidden behind HTML forms.
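The contrast between the two crawl loops can be sketched directly. The following Python sketch assumes the requests and beautifulsoup4 packages; the form-processing step is deliberately left as a stub, since the QIIIEP-based handling is the subject of the paper itself and is not reconstructed here.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def process_form(page_url, form):
    """Placeholder for the deep Web step: fill in and submit the form
    (e.g. with QIIIEP-assisted values) and harvest the response."""
    pass  # form handling is beyond an abstract-level sketch

def crawl(seed, max_pages=50):
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        # Traditional step: extract links and add them to the queue.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Deep Web step: after queuing the links, check for forms.
        for form in soup.find_all("form"):
            process_form(url, form)

crawl("https://example.com/")  # hypothetical seed URL
```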


2021 ◽  
pp. 50-71
Author(s):  
Shakeel Ahmed ◽  
Shubham Sharma ◽  
Saneh Lata Yadav

Information retrieval is the task of finding material of an unstructured nature within large collections stored on computers. The surface web consists of indexed content accessible through traditional browsers, whereas deep or hidden web content cannot be found with traditional search engines and requires a password or network permissions. Within the deep web, the dark web is also growing as new tools make it easier to navigate hidden content, which is accessible only with special software such as Tor. According to a study published in Nature, Google indexes no more than 16% of the surface web and misses all of the deep web; any given search turns up just 0.03% of the information that exists online. Thus, the key part of the hidden web remains inaccessible to users. This chapter begins by posing some questions about this research. Detailed definitions and analogies are explained, related work is discussed, and the advantages and limitations of the existing work proposed by researchers are put forward. The chapter identifies the need for a system that will process both surface and hidden web data and return integrated results to users.
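As an aside on the tooling, the snippet below is a minimal sketch of what "accessible only with special software such as Tor" amounts to in practice: routing requests through a local Tor client. It assumes a Tor daemon listening on its default SOCKS port 9050 and the requests package installed with SOCKS support; the .onion address is a placeholder, not from the chapter.

```python
import requests  # needs SOCKS support: pip install "requests[socks]"

# socks5h (rather than socks5) makes Tor resolve .onion names itself.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_over_tor(url):
    """Fetch a page through a local Tor client on its default SOCKS port."""
    return requests.get(url, proxies=TOR_PROXIES, timeout=60).text

# fetch_over_tor("http://someonionservice.onion/")  # placeholder address
```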


2018 ◽  
Vol 52 (2) ◽  
pp. 266-277 ◽  
Author(s):  
Hyo-Jung Oh ◽  
Dong-Hyun Won ◽  
Chonghyuck Kim ◽  
Sung-Hee Park ◽  
Yong Kim

Purpose: The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.
Design/methodology/approach: This study proposes and develops an algorithm that collects web information, just as a crawler gathers static webpages, by managing script commands as links. The proposed web crawler validates the algorithm experimentally by collecting deep webpages.
Findings: Among the findings of this study is that when the crawled site provides search results as script pages, a conventional crawler collects only the first page, whereas the proposed algorithm can collect the deep webpages beyond it.
Research limitations/implications: To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script or if the web document contains script errors.
Practical implications: The deep web is estimated to contain 450 to 550 times more information than the surface web, and its documents are difficult to collect. This algorithm helps to enable deep web collection through script runs.
Originality/value: This study presents a new method that utilizes script links instead of the previously adopted keywords, allowing a script to be handled like an ordinary URL. The conducted experiment shows that the scripts on individual websites must be analyzed before they can be employed as links.
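As a rough illustration of managing script commands as links, the sketch below uses Selenium as a stand-in for the browser object the authors used (the paper's implementation is tied to Microsoft Visual Studio's browser control, which is not reproduced here). The CSS selector and the javascript:-href convention are assumptions about the page being crawled, not details from the paper.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def collect_script_pages(url):
    """Run each javascript: link on a result page and save the dynamically
    generated page it produces."""
    driver = webdriver.Firefox()  # any Selenium-supported browser works
    pages = []
    try:
        driver.get(url)
        anchors = driver.find_elements(By.CSS_SELECTOR, "a[href^='javascript:']")
        # Collect the script commands first so later navigation does not
        # invalidate the element handles.
        scripts = [a.get_attribute("href") for a in anchors]
        for script in scripts:
            # Treat the script command itself as the "link": execute it,
            # then capture the page it generates.
            driver.execute_script(script.replace("javascript:", "", 1))
            pages.append(driver.page_source)
            driver.get(url)  # return to the result list for the next run
    finally:
        driver.quit()
    return pages
```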


2021 ◽  
Vol 2021 ◽  
pp. 1-21
Author(s):  
Randa Basheer ◽  
Bassel Alkhatib

From the proactive detection of cyberattacks to the identification of key actors, analyzing the contents of the Dark Web plays a significant role in deterring cybercrime and understanding criminal minds. Research on the Dark Web has proved to be an essential step in fighting cybercrime, whether as a standalone investigation of the Dark Web alone or as an integrated one that also includes content from the Surface Web and the Deep Web. In this review, we probe recent studies in the field of analyzing Dark Web content for Cyber Threat Intelligence (CTI), introducing a comprehensive analysis of their techniques, methods, tools, approaches, and results, and discussing their possible limitations. We demonstrate the significance of studying the contents of different platforms on the Dark Web, leading new researchers through state-of-the-art methodologies. Furthermore, we discuss the technical challenges, ethical considerations, and future directions in the domain.


Author(s):  
Hadrian Peter ◽  
Charles Greenidge

Traditionally, a great deal of research has been devoted to data extraction on the web (Crescenzi et al., 2001; Embley et al., 2005; Laender et al., 2002; Hammer et al., 1997; Ribeiro-Neto et al., 1999; Huck et al., 1998; Wang & Lochovsky, 2002, 2003) from areas where data is easily indexed and extracted by a search engine, the so-called Surface Web. There are, however, other sites, greater in number and potentially more vital, that contain information which cannot be readily indexed by standard search engines. These sites, which are designed to require some level of direct human participation (for example, to issue queries rather than simply follow hyperlinks), cannot be handled using the simple link-traversal techniques employed by many web crawlers (Rappaport, 2000; Cho & Garcia-Molina, 2000; Cho et al., 1998; Edwards et al., 2001). This area of the web, which has been operationally off-limits for crawlers using standard indexing procedures, is termed the Deep Web (Zillman, 2005; Bergman, 2000). Much work still needs to be done, as Deep Web sites represent an area that is only recently being explored to identify where potential uses can be developed.

