Automatic Data Extraction from Data-Rich Web Pages

Web site schema detection and data extraction from the Deep Web have been studied a lot. Although, few researches have focused on the more challenging jobs: wrapper verification or extractor generation. A wrapper verifier would check whether a new page from a site complies with the detected schema, and so the extractor will use the wrapper to get instances of the schema types. If the wrapper failed to work with the new page, a new wrapper/schema would be re-generated by calling an unsupervised wrapper induction system. In this paper, a new data extractor called GenDE is proposed. It verifies the site schema and extracts data from the Web pages using Conditional Random Fields (CRFs). The problem is solved by breaking down an observation sequence (a Web page) into simpler subsequences that will be labeled using CRF. Moreover, the system solves the problem of automatic data extraction from modern JavaScript sites in which data/schema are attached (on the client side) in a JSON format. The experiments show an encouraging result as it outperforms the CSP-based extractor algorithm (95% and 96% of recall and precision, respectively). Moreover, it gives a high performance result when tested on the SWDE benchmark dataset (84.91%).

Download Full-text

Automatic Data Extraction from Template-Generated Web Pages

Journal of Software ◽

10.3724/sp.j.1001.2008.00209 ◽

2008 ◽

Vol 19 (2) ◽

pp. 209-223 ◽

Cited By ~ 9

Author(s):

Shao-Hua YANG

Keyword(s):

Data Extraction ◽

Web Pages ◽

Automatic Data

Download Full-text

A FRAME WORK FOR WEB INFORMATION EXTRACTION AND ANALYSIS

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v7i2.3459 ◽

2013 ◽

Vol 7 (2) ◽

pp. 574-579 ◽

Cited By ~ 3

Author(s):

Dr Sunitha Abburu ◽

G. Suresh Babu

Keyword(s):

Information Extraction ◽

Data Extraction ◽

Research Work ◽

Web Pages ◽

Web Documents ◽

E Learning ◽

Structured Information ◽

Frame Work ◽

Effective Decision ◽

The Web

Day by day the volume of information availability in the web is growing significantly. There are several data structures for information available in the web such as structured, semi-structured and unstructured. Majority of information in the web is presented in web pages. The information presented in web pages is semi-structured.Â But the information required for a context are scattered in different web documents. It is difficult to analyze the large volumes of semi-structured information presented in the web pages and to make decisions based on the analysis. The current research work proposed a frame work for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies Â data extraction, data consolidation, data analysis and decision making based on the information presented in the web pages.The proposed frame work integrates web crawling, information extraction and data mining technologies for better information analysis that helps in effective decision making.Â Â It enables people and organizations to extract information from various sourses of web and to make an effective analysis on the extracted data for effective decision making.Â The proposed frame work is applicable for any application domain. Manufacturing,sales,tourisum,e-learning are various application to menction few.The frame work is implemetnted and tested for the effectiveness of the proposed system and the results are promising.

Download Full-text

Conceptual-model-based data extraction from multiple-record Web pages

Data & Knowledge Engineering ◽

10.1016/s0169-023x(99)00027-0 ◽

1999 ◽

Vol 31 (3) ◽

pp. 227-251 ◽

Cited By ~ 174

Author(s):

D.W. Embley ◽

D.M. Campbell ◽

Y.S. Jiang ◽

S.W. Liddle ◽

D.W. Lonsdale ◽

...

Keyword(s):

Conceptual Model ◽

Data Extraction ◽

Web Pages ◽

Model Based

Download Full-text

Research on Framework Load Correlations Based on Automatic Data Extraction Algorithm

Advances in Intelligent Systems and Computing - Advances in Intelligent Systems and Interactive Applications ◽

10.1007/978-3-030-34387-3_74 ◽

2019 ◽

pp. 604-613

Author(s):

Meiwen Hu ◽

Binjie Wang ◽

Shouguang Sun

Keyword(s):

Data Extraction ◽

Automatic Data ◽

Extraction Algorithm

Download Full-text

Automatic data extraction: A prerequisite for productivity measurement

2008 IEEE International Engineering Management Conference ◽

10.1109/iemce.2008.4617971 ◽

2008 ◽

Author(s):

D. Zaum ◽

M. Olbrich ◽

E. Barke

Keyword(s):

Data Extraction ◽

Productivity Measurement ◽

Automatic Data

Download Full-text

Automatic data extraction from 24 hour blood pressure measurement reports of a large multicenter clinical trial

Computer Methods and Programs in Biomedicine ◽

10.1016/j.cmpb.2021.106588 ◽

2021 ◽

pp. 106588

Author(s):

Janis M Nolde ◽

Ajmal Mian ◽

Luca Schlaich ◽

Justine Chan ◽

Leslie Marisol Lugo-Gavidia ◽

...

Keyword(s):

Blood Pressure ◽

Clinical Trial ◽

Pressure Measurement ◽

Data Extraction ◽

Blood Pressure Measurement ◽

Multicenter Clinical Trial ◽

Automatic Data

Download Full-text

Exploiting Enriched Knowledge of Web Network Structures

Enhancing Qualitative and Mixed Methods Research with Technology - Advances in Knowledge Acquisition, Transfer, and Management ◽

10.4018/978-1-4666-6493-7.ch011 ◽

2015 ◽

pp. 255-286

Author(s):

Shalin Hai-Jew

Keyword(s):

Social Media ◽

Data Extraction ◽

Web Pages ◽

Network Structures ◽

Social Media Platform ◽

Special Software ◽

Testing Tool ◽

Media Platform ◽

Data Visualizations ◽

The Web

Understanding Web network structures may offer insights on various organizations and individuals. These structures are often latent and invisible without special software tools; the interrelationships between various websites may not be apparent with a surface perusal of the publicly accessible Web pages. Three publicly available tools may be “chained” (combined in sequence) in a data extraction sequence to enable visualization of various aspects of http network structures in an enriched way (with more detailed insights about the composition of such networks, given their heterogeneous and multimodal contents). Maltego Tungsten™, a penetration-testing tool, enables the mapping of Web networks, which are enriched with a variety of information: the technological understructure and tools used to build the network, some linked individuals (digital profiles), some linked documents, linked images, related emails, some related geographical data, and even the in-degree of the various nodes. NCapture with NVivo enables the extraction of public social media platform data and some basic analysis of these captures. The Network Overview, Discovery, and Exploration for Excel (NodeXL) tool enables the extraction of social media platform data and various evocative data visualizations and analyses. With the size of the Web growing exponentially and new domains (like .ventures, .guru, .education, .company, and others), the ability to map widely will offer a broad competitive advantage to those who would exploit this approach to enhance knowledge.

Download Full-text