web data extraction Latest Research Papers

Web data extraction has seen significant development in the last decade since its inception in the early nineties. It has evolved from a simple manual way of extracting data from web page and documents to automated extraction to an intelligent extraction using machine learning algorithms, tools and techniques. Data extraction is one of the key components of end-to-end life cycle in web data extraction process that includes navigation, extraction, data enrichment and visualization. This paper presents the journey of web data extraction over the last many years highlighting evolution of tools, techniques, frameworks and algorithms for building intelligent web data extraction systems. The paper also throws light into challenges, opportunities for future research and emerging trends over the years in web data extraction with specific focus on machine learning techniques. Both traditional and machine learning approaches to manual and automated web data extraction are experimented and results published with few use cases demonstrating the challenges in web data extraction in the event of changes in the website layout. This paper introduces novel ideas such as self-healing capability in web data extraction and proactive error detection in the event of changes in website layout as an area of future research. This unique perspective will help readers to get deeper insights in to the present and future of web data extraction.

Download Full-text

Full-Page Wrapper Generation for Unsupervised Deep Web Data Extraction

10.36227/techrxiv.16649947 ◽

2021 ◽

Author(s):

Chia-Hui Chang

Keyword(s):

Data Extraction ◽

Deep Web ◽

Training Data ◽

Web Data ◽

Wrapper Induction ◽

Web Data Extraction ◽

Finite State ◽

Training Examples ◽

Sophisticated Analysis ◽

Wrapper Generation

<div>Web data extraction is a key component in many business intelligence tasks, such as data transformation, exchange, and analysis. Many approaches have been proposed, with either labeled training examples (supervised) or annotation-free training pages (unsupervised). However, most research focuses on extraction effectiveness. Not much attention has been paid to extraction efficiency. In fact, most unsupervised web data extraction ignores wrapper generation because they could work alone without any supervision. </div><div>In this paper, we argue that wrapper generation for unsupervised web data extraction is as important as supervised wrapper induction because the generated wrappers could work more efficiently without sophisticated analysis during testing. We consider two approaches for wrapper generation: schema-guided finite-state machine (FSM) approaches and data-driven machine learning (ML) approaches. We exploit unique mandatory templates to improve the FSM-based wrapper, and proposed two convolutional neural network (CNN)-based models for sequence-labeling. The experimental results show that the FSM wrapper performs well even with small training data, while the CNN-based models require more training pages to achieve the same effectiveness but are more efficient with GPU support. Furthermore, FSM wrappers can work as a filter to reduce the number of training pages and advance the learning curve for wrapper generation.</div>

Download Full-text

Full-Page Wrapper Generation for Unsupervised Deep Web Data Extraction

10.36227/techrxiv.16649947.v1 ◽

2021 ◽

Author(s):

Chia-Hui Chang

Keyword(s):

Data Extraction ◽

Deep Web ◽

Training Data ◽

Web Data ◽

Wrapper Induction ◽

Web Data Extraction ◽

Finite State ◽

Training Examples ◽

Sophisticated Analysis ◽

Wrapper Generation

<div>Web data extraction is a key component in many business intelligence tasks, such as data transformation, exchange, and analysis. Many approaches have been proposed, with either labeled training examples (supervised) or annotation-free training pages (unsupervised). However, most research focuses on extraction effectiveness. Not much attention has been paid to extraction efficiency. In fact, most unsupervised web data extraction ignores wrapper generation because they could work alone without any supervision. </div><div>In this paper, we argue that wrapper generation for unsupervised web data extraction is as important as supervised wrapper induction because the generated wrappers could work more efficiently without sophisticated analysis during testing. We consider two approaches for wrapper generation: schema-guided finite-state machine (FSM) approaches and data-driven machine learning (ML) approaches. We exploit unique mandatory templates to improve the FSM-based wrapper, and proposed two convolutional neural network (CNN)-based models for sequence-labeling. The experimental results show that the FSM wrapper performs well even with small training data, while the CNN-based models require more training pages to achieve the same effectiveness but are more efficient with GPU support. Furthermore, FSM wrappers can work as a filter to reduce the number of training pages and advance the learning curve for wrapper generation.</div>

Download Full-text

The smallest extraction problem

Proceedings of the VLDB Endowment ◽

10.14778/3476249.3476293 ◽

2021 ◽

Vol 14 (11) ◽

pp. 2445-2458

Author(s):

Valerio Cetorelli ◽

Paolo Atzeni ◽

Valter Crescenzi ◽

Franco Milicchio

Keyword(s):

Unsupervised Learning ◽

Optimization Problem ◽

Learning Algorithm ◽

State Of The Art ◽

Data Extraction ◽

Source Code ◽

Web Data ◽

Web Data Extraction ◽

New Family ◽

Context Free

We introduce landmark grammars , a new family of context-free grammars aimed at describing the HTML source code of pages published by large and templated websites and therefore at effectively tackling Web data extraction problems. Indeed, they address the inherent ambiguity of HTML, one of the main challenges of Web data extraction, which, despite over twenty years of research, has been largely neglected by the approaches presented in literature. We then formalize the Smallest Extraction Problem (SEP), an optimization problem for finding the grammar of a family that best describes a set of pages and contextually extract their data. Finally, we present an unsupervised learning algorithm to induce a landmark grammar from a set of pages sharing a common HTML template, and we present an automatic Web data extraction system. The experiments on consolidated benchmarks show that the approach can substantially contribute to improve the state-of-the-art.

Download Full-text

A deep web data extraction model for web mining: a review

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v23.i1.pp519-528 ◽

2021 ◽

Vol 23 (1) ◽

pp. 519

Author(s):

Ily Amalina Ahmad Sabri ◽

Mustafa Man

Keyword(s):

Web Mining ◽

Data Extraction ◽

Deep Web ◽

Quality Data ◽

Web Pages ◽

Web Data ◽

Comprehensive Overview ◽

Web Data Extraction ◽

Proposed Model ◽

Extraction Model

The World Wide Web has become a large pool of information. Extracting structured data from a published web pages has drawn attention in the last decade. The process of web data extraction (WDE) has many challenges, dueto variety of web data and the unstructured data from hypertext mark up language (HTML) files. The aim of this paper is to provide a comprehensive overview of current web data extraction techniques, in termsof extracted quality data. This paper focuses on study for data extraction using wrapper approaches and compares each other to identify the best approach to extract data from online sites. To observe the efficiency of the proposed model, we compare the performance of data extraction by single web page extraction with different models such as document object model (DOM), wrapper using hybrid dom and json (WHDJ), wrapper extraction of image using DOM and JSON (WEIDJ) and WEIDJ (no-rules). Finally, the experimentations proved that WEIDJ can extract data fastest and low time consuming compared to other proposed method.<br /><div> </div>

Download Full-text

Efficient Methodology for Deep Web Data Extraction

Turkish Journal of Computer and Mathematics Education (TURCOMAT) ◽

10.17762/turcomat.v12i1s.1769 ◽

2021 ◽

Vol 12 (1S) ◽

pp. 286-293

Author(s):

Shilpa Deshmukh, Et. al.

Keyword(s):

Information Extraction ◽

Data Extraction ◽

Deep Web ◽

Web Pages ◽

Web Data ◽

Web Information Extraction ◽

Web Data Extraction ◽

Web Information ◽

Assessment Measure ◽

Enormous Number

Deep Web substance are gotten to by inquiries submitted to Web information bases and the returned information records are enwrapped in progressively created Web pages (they will be called profound Web pages in this paper). Removing organized information from profound Web pages is a difficult issue because of the fundamental mind boggling structures of such pages. As of not long ago, an enormous number of strategies have been proposed to address this issue, however every one of them have characteristic impediments since they are Web-page-programming-language subordinate. As the mainstream two-dimensional media, the substance on Web pages are constantly shown routinely for clients to peruse. This inspires us to look for an alternate path for profound Web information extraction to beat the constraints of past works by using some fascinating normal visual highlights on the profound Web pages. In this paper, a novel vision-based methodology that is Visual Based Deep Web Data Extraction (VBDWDE) Algorithm is proposed. This methodology basically uses the visual highlights on the profound Web pages to execute profound Web information extraction, including information record extraction and information thing extraction. We additionally propose another assessment measure amendment to catch the measure of human exertion expected to create wonderful extraction. Our investigations on a huge arrangement of Web information bases show that the proposed vision-based methodology is exceptionally viable for profound Web information extraction.

Download Full-text