A novel approach for Web data extraction based on XML encoding

Author(s): Tiezheng Nie, Derong Shen, Ge Yu, Zhong Shi
2011, Vol 55-57, pp. 1003-1008
Author(s): Yong Quan Dong, Xiang Jun Zhao, Gong Jie Zhang

A novel approach is proposed to automatically extract data records from detail pages using hierarchical clustering techniques. The approach uses information from the listing pages to identify the content blocks in detail pages, which narrows the scope of Web data extraction. It also uses both structure and content features to cluster content feature vectors. Finally, it aligns the data elements of multiple detail pages to extract the data records. Experimental results on test beds of real web pages show that the approach achieves high extraction accuracy and substantially outperforms existing techniques.
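
The clustering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the feature set, the distance measure, or the linkage criterion, so the three-component feature vectors, the cosine distance, the average linkage, and the stopping threshold below are all assumptions chosen only to show what hierarchical (agglomerative) clustering of feature vectors looks like.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_cluster(vectors, threshold):
    """Average-linkage agglomerative clustering (assumed linkage, not from the paper).

    Starts with one cluster per vector and repeatedly merges the closest pair
    of clusters until the closest pair is farther apart than `threshold`.
    Returns clusters as sorted lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return [sorted(c) for c in clusters]


# Hypothetical feature vectors for four page elements (values are made up).
vectors = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.9],
    [0.1, 0.9, 1.0],
]
clusters = agglomerative_cluster(vectors, threshold=0.3)
print(clusters)
```

With these made-up vectors, the first two elements and the last two elements end up in separate clusters. In a real extraction pipeline, each cluster would then be a candidate group of elements to align across detail pages.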


2021, Vol 14 (11), pp. 2445-2458
Author(s): Valerio Cetorelli, Paolo Atzeni, Valter Crescenzi, Franco Milicchio

We introduce landmark grammars, a new family of context-free grammars aimed at describing the HTML source code of pages published by large, templated websites, and therefore at effectively tackling Web data extraction problems. These grammars address the inherent ambiguity of HTML, one of the main challenges of Web data extraction, which has been largely neglected by the approaches presented in the literature despite over twenty years of research. We then formalize the Smallest Extraction Problem (SEP), an optimization problem for finding the grammar of a family that best describes a set of pages and, at the same time, extracts their data. Finally, we present an unsupervised learning algorithm that induces a landmark grammar from a set of pages sharing a common HTML template, and we present an automatic Web data extraction system. Experiments on consolidated benchmarks show that the approach can contribute substantially to improving the state of the art.


2018, pp. 4611-4618
Author(s): Robert Baumgartner, Wolfgang Gatterbauer, Georg Gottlob

2015, Vol 8 (3), pp. 311-322
Author(s): Yongquan Dong, Qiang Chu, Ping Ling
