Site-Wide Wrapper Induction for Life Science Deep Web Databases

Author(s):  
Saqib Mir ◽  
Steffen Staab ◽  
Isabel Rojas
Author(s):  
Ling Song ◽  
Jun Ma ◽  
Po Yan ◽  
Li Lian ◽  
Dongmei Zhang
Keyword(s):  
Deep Web

Author(s):  
Zina Ben Miled ◽  
Nianhua Li ◽  
Yang Liu ◽  
Yue He ◽  
Eric Lynch ◽  
...  
2011 ◽  
Vol 8 (3) ◽  
pp. 779-799 ◽  
Author(s):  
Ying Wang ◽  
Huilai Li ◽  
Wanli Zuo ◽  
Fengling He ◽  
Xin Wang ◽  
...  

Ontology plays an important role in locating domain-specific Deep Web content. This paper therefore presents WFF, a novel framework for efficiently locating domain-specific Deep Web databases based on focused crawling and ontology, built from a Web Page Classifier (WPC), a Form Structure Classifier (FSC), and a Form Content Classifier (FCC) arranged in a hierarchical fashion. First, the WPC discovers potentially interesting pages using an ontology-assisted focused crawler. Then, the FSC analyzes the interesting pages and determines, from structural characteristics, whether they contain searchable forms. Finally, the FCC identifies, at the semantic level, the searchable forms that belong to a given domain and stores the URLs of these domain-specific forms in a database. A detailed experimental evaluation shows that the WFF framework not only simplifies the discovery process but also effectively identifies domain-specific databases.
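The three-stage pipeline described in this abstract can be sketched as a cascade of increasingly specific filters. The sketch below is purely illustrative; the classifier interfaces, page fields, and thresholds are assumptions, not details from the paper.

```python
# Hypothetical sketch of a WPC -> FSC -> FCC filtering cascade: each page
# must pass all three checks before its URL is recorded as a
# domain-specific Deep Web entry point.

def find_domain_specific_forms(pages, wpc, fsc, fcc):
    """Return URLs of pages whose searchable forms match the target domain."""
    matches = []
    for page in pages:
        if not wpc(page):   # Web Page Classifier: topically relevant page?
            continue
        if not fsc(page):   # Form Structure Classifier: contains a searchable form?
            continue
        if fcc(page):       # Form Content Classifier: form fits the domain semantically?
            matches.append(page["url"])
    return matches

# Toy stand-ins for the three classifiers, keyed on simple page features.
pages = [
    {"url": "http://example.org/gene-search", "topic": "biology", "has_form": True,  "domain_terms": 3},
    {"url": "http://example.org/news",        "topic": "news",    "has_form": True,  "domain_terms": 0},
    {"url": "http://example.org/bio-article", "topic": "biology", "has_form": False, "domain_terms": 2},
]
urls = find_domain_specific_forms(
    pages,
    wpc=lambda p: p["topic"] == "biology",
    fsc=lambda p: p["has_form"],
    fcc=lambda p: p["domain_terms"] >= 2,
)
print(urls)  # only the gene-search page survives all three filters
```

The point of the hierarchy is cost ordering: cheap topical filtering runs on every crawled page, while the more expensive semantic form analysis runs only on the small fraction that survives the first two stages.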


2012 ◽  
Vol 40 (1) ◽  
pp. 159-184 ◽  
Author(s):  
Yanni Li ◽  
Yuping Wang ◽  
Jintao Du

2021 ◽  
Author(s):  
Chia-Hui Chang

Web data extraction is a key component of many business intelligence tasks, such as data transformation, exchange, and analysis. Many approaches have been proposed, trained with either labeled examples (supervised) or annotation-free pages (unsupervised). However, most research focuses on extraction effectiveness; little attention has been paid to extraction efficiency. In fact, most unsupervised web data extraction systems skip wrapper generation entirely, since they can operate without any supervision.

In this paper, we argue that wrapper generation for unsupervised web data extraction is as important as supervised wrapper induction, because the generated wrappers can work more efficiently, without sophisticated analysis at extraction time. We consider two approaches to wrapper generation: schema-guided finite-state machine (FSM) approaches and data-driven machine learning (ML) approaches. We exploit unique mandatory templates to improve the FSM-based wrapper and propose two convolutional neural network (CNN)-based models for sequence labeling. The experimental results show that the FSM wrapper performs well even with little training data, while the CNN-based models require more training pages to reach the same effectiveness but are more efficient with GPU support. Furthermore, FSM wrappers can serve as a filter to reduce the number of training pages and advance the learning curve for wrapper generation.
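The efficiency argument in this abstract rests on the idea that once a wrapper is generated, extraction is a cheap pattern match rather than a fresh page analysis. A minimal sketch, assuming the "unique mandatory templates" act as fixed HTML fragments anchoring each data slot; the template strings and field layout are illustrative, not taken from the paper:

```python
import re

def compile_wrapper(template_parts):
    """Build a regex-based wrapper from alternating template/slot parts.

    Fixed template fragments are matched literally; each slot becomes a
    non-greedy capture group, so extraction at test time is a single scan.
    """
    pattern = "".join(
        re.escape(part) if kind == "template" else "(.*?)"
        for kind, part in template_parts
    )
    return re.compile(pattern, re.S)

# Hypothetical wrapper for a record layout <li><b>TITLE</b> by AUTHOR</li>.
wrapper = compile_wrapper([
    ("template", "<li><b>"), ("slot", "title"),
    ("template", "</b> by "), ("slot", "author"),
    ("template", "</li>"),
])

page = ("<ul><li><b>Wrapper Induction</b> by Kushmerick</li>"
        "<li><b>RoadRunner</b> by Crescenzi</li></ul>")
records = wrapper.findall(page)
print(records)  # [('Wrapper Induction', 'Kushmerick'), ('RoadRunner', 'Crescenzi')]
```

A learned FSM wrapper generalizes this idea: states correspond to template positions and transitions consume either mandatory template tokens or variable data, but the runtime cost profile is the same — no per-page structural analysis is needed once the wrapper exists.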



