Efficient Methodology for Deep Web Data Extraction

Author(s):  
Shilpa Deshmukh et al.

Deep Web contents are accessed by queries submitted to Web databases, and the returned data records are enwrapped in dynamically generated Web pages (called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem because of the underlying complex structures of such pages. Until now, a large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they are Web-page-programming-language dependent. As a popular two-dimensional medium, the contents on Web pages are always displayed regularly for users to browse. This motivates us to seek a different way for deep Web data extraction that overcomes the limitations of previous works by utilizing some interesting common visual features of deep Web pages. In this paper, a novel vision-based approach, the Visual Based Deep Web Data Extraction (VBDWDE) algorithm, is proposed. This approach primarily utilizes the visual features of deep Web pages to perform deep Web data extraction, including data record extraction and data item extraction. We also propose a new evaluation measure, revision, to capture the amount of human effort needed to produce perfect extraction. Our experiments on a large set of Web databases show that the proposed vision-based approach is highly effective for deep Web data extraction.
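The core idea of record extraction from visual features can be sketched as follows. This is a hypothetical illustration, not the paper's algorithm: it assumes rendered blocks with known coordinates (here hand-written dictionaries) and treats blocks sharing left alignment and width as one data region.

```python
# Hypothetical sketch of vision-based record grouping: blocks that share
# left alignment and width form a candidate data region.

def group_records(blocks, x_tol=5, w_tol=10):
    """Group rendered blocks (dicts with left/top/width) into the largest
    set of visually aligned blocks, taken as the data-record region."""
    groups = {}
    for b in sorted(blocks, key=lambda b: b["top"]):
        # Bucket by quantized left edge and width -> same visual column.
        key = (round(b["left"] / x_tol), round(b["width"] / w_tol))
        groups.setdefault(key, []).append(b)
    # The data region is assumed to be the largest aligned group.
    return max(groups.values(), key=len)

# Invented layout data standing in for a rendered result page.
blocks = [
    {"left": 100, "top": 50,  "width": 400, "text": "record 1"},
    {"left": 100, "top": 120, "width": 402, "text": "record 2"},
    {"left": 100, "top": 190, "width": 398, "text": "record 3"},
    {"left": 520, "top": 50,  "width": 150, "text": "sidebar ad"},
]
records = group_records(blocks)
```

The tolerance parameters absorb small rendering differences between records; a real system would derive them from font and layout metrics rather than fix them by hand.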

Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

The World Wide Web has become a large pool of information. Extracting structured data from published web pages has drawn attention in the last decade. The process of web data extraction (WDE) faces many challenges, due to the variety of web data and the unstructured nature of hypertext markup language (HTML) files. The aim of this paper is to provide a comprehensive overview of current web data extraction techniques in terms of the quality of extracted data. This paper focuses on data extraction using wrapper approaches and compares them to identify the best approach for extracting data from online sites. To observe the efficiency of the proposed model, we compare the performance of data extraction on single web pages across different models: the document object model (DOM), wrapper using hybrid DOM and JSON (WHDJ), wrapper extraction of image using DOM and JSON (WEIDJ), and WEIDJ (no-rules). Finally, the experiments prove that WEIDJ extracts data fastest, with lower time consumption than the other proposed methods.
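The DOM-plus-JSON wrapper idea can be approximated in a few lines. This is only a minimal sketch of the general technique, not the WEIDJ implementation: it walks the parsed HTML tree and emits extracted `<img>` records as JSON.

```python
# Minimal DOM+JSON image wrapper sketch: parse the markup, collect every
# <img> element's attributes, and serialize the result as JSON.
import json
from html.parser import HTMLParser

class ImageExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Each image becomes one structured record of its attributes.
            self.images.append(dict(attrs))

# Invented markup standing in for a fetched page.
html = '<div><img src="a.jpg" alt="A"><p>text</p><img src="b.png"></div>'
parser = ImageExtractor()
parser.feed(html)
result = json.dumps(parser.images)
```

Serializing to JSON decouples the extraction step from whatever application consumes the records, which is the motivation the paper gives for the hybrid wrappers.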


2004 ◽  
pp. 227-267
Author(s):  
Wee Keong Ng ◽  
Zehua Liu ◽  
Zhao Li ◽  
Ee Peng Lim

With the explosion of information on the Web, traditional ways of browsing and keyword searching over web pages no longer satisfy the demanding needs of web surfers. Web information extraction has emerged as an important research area that aims to automatically extract information from target web pages and convert it into a structured format for further processing. The main issues involved in the extraction process include: (1) the definition of a suitable extraction language; (2) the definition of a data model representing the web information source; (3) the generation of the data model, given a target source; and (4) the extraction and presentation of information according to a given data model. In this chapter, we discuss the challenges of these issues and the approaches that current research activities have taken to resolve them. We propose several classification schemes to classify existing approaches to information extraction from different perspectives. Among the existing works, we focus on the Wiccap system, a software system that enables ordinary end-users to obtain information of interest in a simple and efficient manner by constructing personalized web views of information sources.
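Issues (1) through (4) can be made concrete with a toy example. This is a hypothetical illustration, not Wiccap's design: the "extraction language" is a mapping from field names to regular expressions, and the resulting dictionary plays the role of the data model.

```python
# Tiny declarative wrapper: the rule set is the extraction language, the
# rule names define the data model, and extract() produces the record.
import re

def extract(page, rules):
    """Apply each named regex rule to the page text and return the
    structured record defined by the rule names."""
    record = {}
    for field, pattern in rules.items():
        m = re.search(pattern, page)
        record[field] = m.group(1) if m else None
    return record

# Invented page fragment and rules for illustration.
page = "<h1>Widget</h1><span class='price'>$9.99</span>"
rules = {
    "title": r"<h1>(.*?)</h1>",
    "price": r"class='price'>\$([0-9.]+)<",
}
item = extract(page, rules)
```

Real systems replace the regexes with tree-based or learned rules, but the separation between a declarative rule set and a target data model is the same.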


2013 ◽  
Vol 756-759 ◽  
pp. 2583-2587 ◽  
Author(s):  
Zi Yang Han ◽  
Feng Ying Wang ◽  
Ping Sun ◽  
Zheng Yu Li

There are many Deep Web sources on the Internet, which contain a large amount of valuable data. This paper proposes a Deep Web data extraction and service system based on the principles of cloud technology. We adopt a multi-node parallel computing architecture and design a task scheduling algorithm for the data extraction process; on this foundation, the task load is balanced among nodes to accomplish data extraction rapidly. The experimental results show that using cloud parallel computing and dispersed network resources to extract Deep Web data is valid, and that it improves both the data extraction efficiency and the service quality of the Deep Web system.
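A load-balancing task scheduler of the kind described can be sketched greedily. This is not the paper's algorithm; the node count and task costs are invented, and the strategy shown is the standard longest-processing-time heuristic: always assign the next-largest task to the currently least-loaded node.

```python
# Greedy load-balanced scheduling sketch: a min-heap tracks each node's
# current load; tasks are assigned largest-first to the lightest node.
import heapq

def schedule(tasks, n_nodes):
    """tasks: list of (name, estimated_cost). Returns {node: [task names]}."""
    heap = [(0, i) for i in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignments = {i: [] for i in range(n_nodes)}
    for name, cost in sorted(tasks, key=lambda t: -t[1]):  # largest first
        load, node = heapq.heappop(heap)
        assignments[node].append(name)
        heapq.heappush(heap, (load + cost, node))
    return assignments

# Invented extraction tasks with rough cost estimates.
tasks = [("site-a", 5), ("site-b", 3), ("site-c", 2), ("site-d", 4)]
plan = schedule(tasks, 2)
```

With these costs both nodes end up with equal load, which is the balance property the scheduling step is meant to achieve before extraction starts.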


2013 ◽  
Vol 64 ◽  
pp. 145-155
Author(s):  
Tomas Grigalis ◽  
Antanas Čenys

The success of a company hinges on identifying and responding to competitive pressures. The main objective of online business intelligence is to collect valuable information from many Web sources to support decision making and thus gain competitive advantage. However, online business intelligence presents non-trivial challenges to Web data extraction systems, which must deal with technologically sophisticated modern Web pages where traditional manual programming approaches often fail. In this paper, we review commercially available state-of-the-art Web data extraction systems and their technological advances in the context of online business intelligence.

Keywords: online business intelligence, Web data extraction, Web scraping

Modern Web Data Extraction Systems Suitable for Business Intelligence (summary, translated from Lithuanian)
Tomas Grigalis, Antanas Čenys

The success of a modern business organization depends on its ability to respond appropriately to a constantly changing competitive environment. The main goal of an online business intelligence system is to collect valuable information from many different online sources and thereby help the business organization make sound decisions and gain a competitive advantage. However, collecting information from online sources is a difficult problem, as the collecting systems must work well with technologically highly sophisticated web pages. This article reviews state-of-the-art web data extraction systems in the context of business intelligence. It also presents specific scenarios in which data extraction systems can support business intelligence. Finally, the authors discuss recent technological advances that have the potential to lead to fully automatic web data extraction systems, further improving business intelligence while considerably reducing its cost.


Author(s):  
B. Umamageswari ◽  
R. Kalpana

Web mining is done on huge amounts of data extracted from the WWW. Many researchers have developed state-of-the-art approaches for web data extraction. So far in the literature, the focus has mainly been on techniques for data region extraction. Applications fed with the extracted data require fetching data spread across multiple web pages, which should be crawled automatically. For this to happen, we need to extract not only data regions but also the navigation links. Data extraction techniques are often designed for specific HTML tags, which calls into question their universal applicability for information extraction from differently formatted web pages. This chapter surveys the web data extraction techniques available for different kinds of data-rich pages, classifies them, and compares those techniques across many useful dimensions.
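Extracting navigation links alongside data regions can be sketched briefly. This is a hedged illustration only: the markup and the "Next" heuristic are invented, and a real crawler would use more robust pagination cues than anchor text.

```python
# Sketch of navigation-link extraction: collect every anchor, then pick
# the one whose text looks like a pagination control.
from html.parser import HTMLParser

class NavExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []        # all (href, text) pairs on the page
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

    def next_page(self):
        # Naive heuristic: anchor text that signals pagination.
        for href, text in self.links:
            if text.lower() in ("next", ">", "next page"):
                return href
        return None

p = NavExtractor()
p.feed('<a href="/item/1">Item 1</a><a href="/page/2">Next</a>')
```

Feeding `next_page()` results back into the crawler is what lets extraction continue automatically across a paginated result set.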


Author(s):  
Ruslan R. Fayzrakhmanov

This chapter discusses the main challenges addressed within the fields of Web information extraction and Web page understanding, and considers the different Web page representations in use. As the result of this analysis, a configurable Java-based framework for implementing effective methods of Web Page Processing (WPP), called WPPS, is presented. WPPS leverages a Unified Ontological Model (UOM) of Web pages that describes their different aspects, such as layout, visual features, interface, DOM tree, and logical structure, in one consistent model. The UOM is a formalization of certain layers of the Web page conceptualization defined in the chapter. A WPPS API provided for the development of WPP methods makes it possible to combine the declarative approach, represented by a set of inference rules and SPARQL queries, with the object-oriented approach. The framework is illustrated with an example scenario: the identification of a Web page's navigation menu.
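The combination of a declarative rule layer over a page model can be mimicked in miniature. This is a loose, hypothetical analogue only: WPPS uses an ontology queried with SPARQL, whereas here plain Python triples and set intersection stand in for both, applied to the chapter's navigation-menu scenario.

```python
# Toy analogue of rule-based querying over a page model: elements are
# described as (subject, predicate, object) triples, and a "rule" is an
# intersection of declarative lookups.

# Invented triples describing three page elements.
triples = [
    ("el1", "tag", "ul"),  ("el1", "region", "top"),    ("el1", "links", 6),
    ("el2", "tag", "ul"),  ("el2", "region", "footer"), ("el2", "links", 2),
    ("el3", "tag", "div"), ("el3", "region", "top"),    ("el3", "links", 6),
]

def query(triples, pred, obj):
    """Declarative lookup: subjects having the given (predicate, object)."""
    return {s for s, p, o in triples if p == pred and o == obj}

# Rule: a navigation menu is a <ul> in the top region with several links.
has_links = {s for s, p, o in triples if p == "links" and o >= 4}
menus = query(triples, "tag", "ul") & query(triples, "region", "top") & has_links
```

Expressing the rule as an intersection of independent conditions is what makes the declarative layer easy to extend; the object-oriented side of such a framework would then operate on the matched elements.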

