Robust Web Data Extraction Based on Weighted Path-layer Similarity

Journal of Computer Information Systems ◽

10.1080/08874417.2020.1861571 ◽

2021 ◽

pp. 1-11

Author(s):

Peng Gao ◽

Hao Han

Keyword(s):

Data Extraction ◽

Web Data ◽

Web Data Extraction


Web Data Extraction using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data ◽

10.1145/3318464.3380608 ◽

2020 ◽

Author(s):

Mohammad Raza ◽

Sumit Gulwani

Keyword(s):

Data Extraction ◽

Program Synthesis ◽

Web Data ◽

Top Down ◽

Bottom Up ◽

Web Data Extraction


The smallest extraction problem

Proceedings of the VLDB Endowment ◽

10.14778/3476249.3476293 ◽

2021 ◽

Vol 14 (11) ◽

pp. 2445-2458

Author(s):

Valerio Cetorelli ◽

Paolo Atzeni ◽

Valter Crescenzi ◽

Franco Milicchio

Keyword(s):

Unsupervised Learning ◽

Optimization Problem ◽

Learning Algorithm ◽

State Of The Art ◽

Data Extraction ◽

Source Code ◽

Web Data ◽

Web Data Extraction ◽

We introduce landmark grammars, a new family of context-free grammars aimed at describing the HTML source code of pages published by large, templated websites, and therefore at effectively tackling Web data extraction problems. Indeed, they address the inherent ambiguity of HTML, one of the main challenges of Web data extraction, which, despite over twenty years of research, has been largely neglected by the approaches presented in the literature. We then formalize the Smallest Extraction Problem (SEP), an optimization problem for finding the grammar of a family that best describes a set of pages and contextually extracting their data. Finally, we present an unsupervised learning algorithm to induce a landmark grammar from a set of pages sharing a common HTML template, and we present an automatic Web data extraction system. The experiments on consolidated benchmarks show that the approach can substantially contribute to improving the state of the art.


Web Data Extraction System

Encyclopedia of Database Systems ◽

10.1007/978-1-4614-8265-9_1154 ◽

2018 ◽

pp. 4611-4618

Author(s):

Robert Baumgartner ◽

Wolfgang Gatterbauer ◽

Georg Gottlob

Keyword(s):

Data Extraction ◽

Extraction System ◽

Web Data ◽

Web Data Extraction


Web Data Extraction Based on Ensemble Learning

International Journal of Database Theory and Application ◽

10.14257/ijdta.2015.8.3.27 ◽

2015 ◽

Vol 8 (3) ◽

pp. 311-322

Author(s):

Yongquan Dong ◽

Qiang Chu ◽

Ping Ling

Keyword(s):

Ensemble Learning ◽

Data Extraction ◽

Web Data ◽

Web Data Extraction


Web Data Extraction Based on Structure Feature

Communications in Computer and Information Science - Applied Informatics and Communication ◽

10.1007/978-3-642-23235-0_75 ◽

2011 ◽

pp. 591-599

Author(s):

Ma Anxiang ◽

Gao Kening ◽

Zhang Xiaohong ◽

Zhang Bin

Keyword(s):

Data Extraction ◽

Web Data ◽

Web Data Extraction ◽

Structure Feature


Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction

Web Information Systems Engineering – WISE 2007 - Lecture Notes in Computer Science ◽

10.1007/978-3-540-76993-4_18 ◽

2007 ◽

pp. 212-224

Author(s):

Manuel Álvarez ◽

Alberto Pan ◽

Juan Raposo ◽

Fernando Bellas ◽

Fidel Cacheda

Keyword(s):

Edit Distance ◽

Data Extraction ◽

Web Data ◽

Web Data Extraction


Web Data Extraction and Integration System for Search Engine Results

Web Recommendations Systems ◽

10.1007/978-981-15-2513-1_2 ◽

2020 ◽

pp. 11-25

Author(s):

K. R. Venugopal ◽

K. C. Srikantaiah

Keyword(s):

Search Engine ◽

Data Extraction ◽

Web Data ◽

Integration System ◽

Web Data Extraction


Robust Web Data Extraction: A Novel Approach Based on Minimum Cost Script Edit Model

Web Information Systems and Mining - Lecture Notes in Computer Science ◽

10.1007/978-3-642-33469-6_62 ◽

2012 ◽

pp. 497-509

Author(s):

Donglan Liu ◽

Xinjun Wang ◽

Zhongmin Yan ◽

Qiuyan Li

Keyword(s):

Data Extraction ◽

Minimum Cost ◽

Web Data ◽

Web Data Extraction


Similarity Based Web Data Extraction and Integration System for Web Content Mining

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering - Advances in Communication, Network, and Computing ◽

10.1007/978-3-642-35615-5_41 ◽

2012 ◽

pp. 269-274

Author(s):

K. C. Srikantaiah ◽

M. Suraj ◽

K. R. Venugopal ◽

S. S. Iyengar ◽

L. M. Patnaik

Keyword(s):

Data Extraction ◽

Web Content ◽

Web Data ◽

Integration System ◽

Web Content Mining ◽

Web Data Extraction


Web Data Extraction System

10.1007/springerreference_64082 ◽

2011 ◽

Keyword(s):

Data Extraction ◽

Extraction System ◽

Web Data ◽

Web Data Extraction
