Similarity Based Web Data Extraction and Integration System for Web Content Mining

Author(s): Srikantaiah K.C., Suraj M., Venugopal K.R., Iyengar S.S., L. M. Patnaik
2014, Vol 13 (01), pp. 1450005
Author(s): Basavaraj S. Anami, Ramesh S. Wadawadagi, Veerappa B. Pagi

With the incessantly growing amount of information published on Web pages, the World Wide Web (WWW) has become a prolific subject of data mining research. The heterogeneous and semi-structured nature of Web data makes automated knowledge discovery challenging. Web Content Mining (WCM) essentially uses data mining techniques to discover knowledge from Web page contents. The intent of this study is to provide a comparative analysis of the Machine Learning (ML) techniques available in the literature for WCM. The analysis compares the methods on their representation techniques, learning methods, datasets used, and performance. The survey observes that some traditional ML algorithms have been used efficiently on Web data. Finally, the paper concludes by citing some promising issues for further research in this domain.


2021, Vol 14 (11), pp. 2445-2458
Author(s): Valerio Cetorelli, Paolo Atzeni, Valter Crescenzi, Franco Milicchio

We introduce landmark grammars, a new family of context-free grammars aimed at describing the HTML source code of pages published by large, templated websites and therefore at effectively tackling Web data extraction problems. They address the inherent ambiguity of HTML, one of the main challenges of Web data extraction, which, despite over twenty years of research, has been largely neglected by the approaches presented in the literature. We then formalize the Smallest Extraction Problem (SEP), an optimization problem for finding the grammar of a family that best describes a set of pages while at the same time extracting their data. Finally, we present an unsupervised learning algorithm to induce a landmark grammar from a set of pages sharing a common HTML template, and we present an automatic Web data extraction system. Experiments on consolidated benchmarks show that the approach can substantially contribute to improving the state of the art.


2018, pp. 4611-4618
Author(s): Robert Baumgartner, Wolfgang Gatterbauer, Georg Gottlob

2015, Vol 8 (3), pp. 311-322
Author(s): Yongquan Dong, Qiang Chu, Ping Ling
