Full-Page Wrapper Generation for Unsupervised Deep Web Data Extraction

Mapping Intimacies ◽

10.36227/techrxiv.16649947 ◽

2021 ◽

Author(s):

Chia-Hui Chang

Keyword(s):

Data Extraction ◽

Deep Web ◽

Training Data ◽

Web Data ◽

Wrapper Induction ◽

Web Data Extraction ◽

Finite State ◽

Training Examples ◽

Sophisticated Analysis ◽

Wrapper Generation

Web data extraction is a key component in many business intelligence tasks, such as data transformation, exchange, and analysis. Many approaches have been proposed, using either labeled training examples (supervised) or annotation-free training pages (unsupervised). However, most research focuses on extraction effectiveness, and little attention has been paid to extraction efficiency. In fact, most unsupervised web data extraction methods ignore wrapper generation because they can work without any supervision.

In this paper, we argue that wrapper generation for unsupervised web data extraction is as important as supervised wrapper induction, because the generated wrappers can work more efficiently without sophisticated analysis during testing. We consider two approaches for wrapper generation: schema-guided finite-state machine (FSM) approaches and data-driven machine learning (ML) approaches. We exploit unique mandatory templates to improve the FSM-based wrapper, and propose two convolutional neural network (CNN)-based models for sequence labeling. The experimental results show that the FSM wrapper performs well even with small training data, while the CNN-based models require more training pages to achieve the same effectiveness but are more efficient with GPU support. Furthermore, FSM wrappers can work as a filter to reduce the number of training pages and advance the learning curve for wrapper generation.
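As a rough illustration of what a template-driven FSM wrapper does at test time, the sketch below walks a tokenized page and captures the text between fixed template tokens. It is not the paper's implementation: the function name `run_fsm_wrapper`, the `(open_token, field_name, close_token)` template format, and the sample tokens are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical template slot: wait for open_token, capture text under
# field_name, stop at close_token. Not taken from the paper.
Slot = Tuple[str, str, str]


def run_fsm_wrapper(tokens: List[str], template: List[Slot]) -> List[Dict[str, str]]:
    """Extract records by stepping through the template slots in order."""
    records: List[Dict[str, str]] = []
    record: Dict[str, str] = {}
    state = 0            # index of the template slot currently expected
    capturing = False    # True while inside an open/close token pair
    buffer: List[str] = []

    for tok in tokens:
        open_tok, field, close_tok = template[state]
        if not capturing:
            # Skip everything until the slot's opening template token.
            if tok == open_tok:
                capturing = True
                buffer = []
        elif tok == close_tok:
            # Slot finished: store the field and advance to the next slot.
            record[field] = " ".join(buffer)
            capturing = False
            state += 1
            if state == len(template):
                # All slots filled: emit the record and start over.
                records.append(record)
                record = {}
                state = 0
        else:
            buffer.append(tok)
    return records


# Illustrative page tokens and template (both invented for this sketch).
page = ["<li>", "<b>", "Widget", "A", "</b>", "<i>", "$5", "</i>", "</li>",
        "<li>", "<b>", "Gadget", "</b>", "<i>", "$7", "</i>", "</li>"]
template = [("<b>", "name", "</b>"), ("<i>", "price", "</i>")]
extracted = run_fsm_wrapper(page, template)
```

Because the wrapper only compares tokens against a fixed template, extraction is a single linear pass over the page, which is the kind of efficiency gain over per-page analysis that the abstract argues for.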

Full-Page Wrapper Generation for Unsupervised Deep Web Data Extraction

10.36227/techrxiv.16649947.v1 ◽

2021 ◽

Author(s):

Chia-Hui Chang

Keyword(s):

Data Extraction ◽

Deep Web ◽

Training Data ◽

Web Data ◽

Wrapper Induction ◽

Web Data Extraction ◽

Finite State ◽

Training Examples ◽

Sophisticated Analysis ◽

Wrapper Generation

A Deep Web Data Extraction and Application System Based on Cloud Technology

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.2583 ◽

2013 ◽

Vol 756-759 ◽

pp. 2583-2587 ◽

Cited By ~ 1

Author(s):

Zi Yang Han ◽

Feng Ying Wang ◽

Ping Sun ◽

Zheng Yu Li

Keyword(s):

Parallel Computing ◽

Data Extraction ◽

Scheduling Algorithm ◽

Computing System ◽

Extraction Process ◽

Deep Web ◽

System Structure ◽

Web Data ◽

Cloud Technology ◽

Web Data Extraction

The Internet hosts many Deep Web sources that contain a large amount of valuable data. This paper proposes a Deep Web data extraction and service system based on the principles of cloud technology. We adopt a multi-node parallel computing system structure and design a task scheduling algorithm for the data extraction process; on this foundation, we balance the task load among nodes to accomplish data extraction rapidly. The experimental results show that using cloud parallel computing and dispersed network resources to extract Deep Web data is valid and improves both the efficiency of Deep Web data extraction and the service quality.
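The abstract does not specify its scheduling algorithm, so the sketch below substitutes a common greedy heuristic (longest task first, always assigned to the least-loaded node) to show what balancing extraction tasks across nodes can look like. The function name `assign_tasks`, the task names, and the cost estimates are hypothetical.

```python
import heapq
from typing import Dict, List


def assign_tasks(task_costs: Dict[str, float], node_names: List[str]) -> Dict[str, List[str]]:
    """Greedy load balancing: give the largest remaining task to the lightest node."""
    # Min-heap of (current_load, node_name); the lightest node is always on top.
    heap = [(0.0, name) for name in node_names]
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {name: [] for name in node_names}

    # Process tasks from most to least expensive (assumed cost estimates).
    for task, cost in sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, name = heapq.heappop(heap)
        assignment[name].append(task)
        heapq.heappush(heap, (load + cost, name))
    return assignment


# Illustrative extraction tasks with estimated costs (invented values).
tasks = {"site_a": 5.0, "site_b": 4.0, "site_c": 3.0, "site_d": 2.0}
plan = assign_tasks(tasks, ["node1", "node2"])
```

With these sample values both nodes end up with a total estimated load of 7.0; a real system would also need to account for node failures and inaccurate cost estimates.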

Deep web data extraction based on visual information processing

Journal of Ambient Intelligence and Humanized Computing ◽

10.1007/s12652-017-0587-0 ◽

2017 ◽

Cited By ~ 5

Author(s):

Jin Liu ◽

Li Lin ◽

Zehuan Cai ◽

Jin Wang ◽

Hye-jin Kim

Keyword(s):

Information Processing ◽

Visual Information ◽

Data Extraction ◽

Visual Information Processing ◽

Deep Web ◽

Web Data ◽

Web Data Extraction

DWDE-IR: An Efficient Deep Web Data Extraction for Information Retrieval on Web Mining

Journal of Emerging Technologies in Web Intelligence ◽

10.4304/jetwi.6.1.133-141 ◽

2014 ◽

Vol 6 (1) ◽

Cited By ~ 1

Author(s):

Aysha Banu ◽

M. Chitra

Keyword(s):

Information Retrieval ◽

Web Mining ◽

Data Extraction ◽

Deep Web ◽

Web Data ◽

Web Data Extraction

Deep web data extraction

2010 IEEE International Conference on Systems, Man and Cybernetics ◽

10.1109/icsmc.2010.5642466 ◽

2010 ◽

Cited By ~ 12

Author(s):

Jer Lang Hong

Keyword(s):

Data Extraction ◽

Deep Web ◽

Web Data ◽

Web Data Extraction

Accuracy Crawler: An Accurate Crawler for Deep Web Data Extraction

2018 International Conference on Control, Power, Communication and Computing Technologies (ICCPCCT) ◽

10.1109/iccpcct.2018.8574286 ◽

2018 ◽

Author(s):

Prafful Mishra ◽

Anshul Khurana

Keyword(s):

Data Extraction ◽

Deep Web ◽

Web Data ◽

Web Data Extraction

Client-side deep Web data extraction

IEEE International Conference on E-Commerce Technology for Dynamic E-Business ◽

10.1109/cec-east.2004.30 ◽

2005 ◽

Cited By ~ 2

Author(s):

M. Alvarez ◽

A. Pan ◽

J. Raposo ◽

A. Vina

Keyword(s):

Data Extraction ◽

Deep Web ◽

Web Data ◽

Web Data Extraction ◽

Client Side

Review of Deep Web Data Extraction

2019 IEEE Symposium Series on Computational Intelligence (SSCI) ◽

10.1109/ssci44817.2019.9002877 ◽

2019 ◽

Author(s):

Shenglin Li ◽

Chen Chen ◽

Kaiwen Luo ◽

Bo Song

Keyword(s):

Data Extraction ◽

Deep Web ◽

Web Data ◽

Web Data Extraction

Efficiency Improvement Approach of Deep Web Data Extraction

2019 14th International Conference on Computer Engineering and Systems (ICCES) ◽

10.1109/icces48960.2019.9068134 ◽

2019 ◽

Author(s):

Mona Nasr ◽

Hanan Fahmy ◽

Mohamed Thabet

Keyword(s):

Data Extraction ◽

Deep Web ◽

Efficiency Improvement ◽

Web Data ◽

Web Data Extraction

A framework enhancement method of deep web data extraction

Materials Today Proceedings ◽

10.1016/j.matpr.2021.01.132 ◽

2021 ◽

Author(s):

Salar Faisal Noori ◽

B. Bazeer Ahamed

Keyword(s):

Data Extraction ◽

Deep Web ◽

Web Data ◽

Web Data Extraction ◽

Enhancement Method
