Structured Data Extraction from the Web Based on Partial Tree Alignment

<p>Web data extraction is the process of extracting user required information from web page. The information consists of semi-structured data not in structured format. The extraction data involves the web documents in html format. Nowadays, most people uses web data extractors because the extraction involve large information which makes the process of manual information extraction takes time and complicated. We present in this paper WEIDJ approach to extract images from the web, whose goal is to harvest images as object from template-based html pages. The WEIDJ (Web Extraction Image using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory in order to build the structure and JSON as environment of programming. The extraction process leverages both the input of web address and the structure of extraction. Then, WEIDJ splits DOM tree into small subtrees and applies searching algorithm by visual blocks for each web page to find images. Our approach focus on three level of extraction; single web page, multiple web page and the whole web page. Extensive experiments on several biodiversity web pages has been done to show the comparison time performance between image extraction using DOM, JSON and WEIDJ for single web page. The experimental results advocate via our model, WEIDJ image extraction can be done fast and effectively.</p>

Download Full-text

Web Data Extraction Based on Visual Information and Partial Tree Alignment

2014 11th Web Information System and Application Conference ◽

10.1109/wisa.2014.12 ◽

2014 ◽

Cited By ~ 3

Author(s):

Siwu Fan ◽

Xinjun Wang ◽

Yongquan Dong

Keyword(s):

Visual Information ◽

Data Extraction ◽

Web Data ◽

Tree Alignment ◽

Web Data Extraction ◽

Partial Tree

Download Full-text

Data extraction from the web based on pre-defined schema

Journal of Computer Science and Technology ◽

10.1007/bf02943278 ◽

2002 ◽

Vol 17 (4) ◽

pp. 377-388 ◽

Cited By ~ 11

Author(s):

Xiaofeng Meng ◽

Hongjun Lu ◽

Haiyan Gang ◽

Mingzhe Gu

Keyword(s):

Data Extraction ◽

Web Based ◽

The Web

Download Full-text

Rule-Based Parsing for Web Data Extraction

Data Warehousing and Mining ◽

10.4018/978-1-59904-951-9.ch034 ◽

2008 ◽

pp. 469-484

Author(s):

David Camacho ◽

Ricardo Aler ◽

Juan Cuadrado

Keyword(s):

Data Extraction ◽

Research Field ◽

Multi Agent System ◽

Agent System ◽

Web Based ◽

Web Information ◽

Multi Agent ◽

Stored Information ◽

The Web ◽

Web Systems

How to build intelligent robust applications that work with the information stored in the Web is a difficult problem for several reasons which arise from the essential nature of the Web: the information is highly distributed, it is dynamic (both in content and format), it is not usually correctly structured, and the web sources will be unreachable at some times. To build robust and adaptable web systems, it is necessary to provide a standard representation for the information (i.e., using languages such as XML and ontologies to represent the semantics of the stored knowledge). However, this is actually a research field and usually most web sources do not provide their information in a structured way. This chapter analyzes a new approach that allows us to build robust and adaptable web systems by using a multi-agent approach. Several problems, including how to retrieve, extract, and manage the stored information from web sources, are analyzed from an agent perspective. Two difficult problems will be addressed in this chapter: designing a general architecture to deal with the problem of managing web information sources; and how these agents could work semiautomatically, adapting their behaviors to the dynamic conditions of the electronic sources. To achieve the first goal, a generic web-based multi-agent system (MAS) will be proposed, and will be applied in a specific problem to retrieve and manage information from electronic newspapers. To partially solve the problem of retrieving and extracting web information, a semiautomatic web parser will be designed and deployed like a reusable software component. This parser uses two sets of rules to adapt the behavior of the web agent to possible changes in the web sources. The first one is used to define the knowledge to be extracted from the HTML pages; the second one represents the final structure to store the retrieved knowledge. Using this parser, a specific web-based multi-agent system will be implemented.

Download Full-text

Rule-Based Parsing for Web Data Extraction

Intelligent Agents for Data Mining and Information Retrieval ◽

10.4018/978-1-59140-194-0.ch005 ◽

2004 ◽

pp. 65-87 ◽

Cited By ~ 1

Author(s):

David Camacho ◽

Ricardo Aler ◽

Juan Cuadrado

Keyword(s):

Data Extraction ◽

Research Field ◽

Multi Agent System ◽

Agent System ◽

Web Based ◽

Web Information ◽

Multi Agent ◽

Stored Information ◽

The Web ◽

Web Systems

How to build intelligent robust applications that work with the information stored in the Web is a difficult problem for several reasons which arise from the essential nature of the Web: the information is highly distributed, it is dynamic (both in content and format), it is not usually correctly structured, and the web sources will be unreachable at some times. To build robust and adaptable web systems, it is necessary to provide a standard representation for the information (i.e., using languages such as XML and ontologies to represent the semantics of the stored knowledge). However, this is actually a research field and usually most web sources do not provide their information in a structured way. This chapter analyzes a new approach that allows us to build robust and adaptable web systems by using a multi-agent approach. Several problems, including how to retrieve, extract, and manage the stored information from web sources, are analyzed from an agent perspective. Two difficult problems will be addressed in this chapter: designing a general architecture to deal with the problem of managing web information sources; and how these agents could work semiautomatically, adapting their behaviors to the dynamic conditions of the electronic sources. To achieve the first goal, a generic web-based multi-agent system (MAS) will be proposed, and will be applied in a specific problem to retrieve and manage information from electronic newspapers. To partially solve the problem of retrieving and extracting web information, a semiautomatic web parser will be designed and deployed like a reusable software component. This parser uses two sets of rules to adapt the behavior of the web agent to possible changes in the web sources. The first one is used to define the knowledge to be extracted from the HTML pages; the second one represents the final structure to store the retrieved knowledge. Using this parser, a specific web-based multi-agent system will be implemented.

Download Full-text

Web Data Extraction Based on Tag Path Clustering

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.1590 ◽

2013 ◽

Vol 756-759 ◽

pp. 1590-1594

Author(s):

Gui Li ◽

Cheng Chen ◽

Zheng Yu Li ◽

Zi Yang Han ◽

Ping Sun

Keyword(s):

Data Extraction ◽

Structured Data ◽

Web Pages ◽

Web Data Extraction ◽

Web Document ◽

Automatic Methods ◽

Simple Extraction ◽

Web Document Clustering ◽

Fully Automatic ◽

The Web

Fully automatic methods that extract structured data from the Web have been studied extensively. The existing methods suffice for simple extraction, but they often fail to handle more complicated Web pages. This paper introduces a method based on tag path clustering to extract structured data. The method gets complete tag path collection by parsing the DOM tree of the Web document. Clustering of tag paths is performed based on introduced similarity measure and the data area can be targeted, then taking advantage of features of tag position, we can separate and filter record, finally complete data extraction. Experiments show this method achieves higher accuracy than previous methods.

Download Full-text

Using XPaths of inbound links to cluster template-generated web pages

Computer Science and Information Systems ◽

10.2298/csis130416020g ◽

2014 ◽

Vol 11 (1) ◽

pp. 111-131

Author(s):

Tomas Grigalis ◽

Antanas Cenys

Keyword(s):

Real World ◽

Data Extraction ◽

Structural Similarity ◽

Structured Data ◽

Single Type ◽

Web Pages ◽

Template Structure ◽

Computationally Expensive ◽

Web Clustering ◽

The Web

Template-generated Web pages contain most of structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers that must be matched to only particular template pages. Selecting single type of template from all crawled Web pages is a time consuming task. Although there are methods to cluster Web pages according to their structural similarity, however, in most cases they are too computationally expensive to be applicable at Web-Scale. We propose a novel highly scalable approach to structurally cluster Web pages by employing XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real world Websites in a few minutes and achieving >90% accuracy.

Download Full-text

Access to Census Interaction Data

Technologies for Migration and Commuting Analysis ◽

10.4018/978-1-61520-755-8.ch002 ◽

2010 ◽

pp. 31-50

Author(s):

Adam Dennett ◽

John Stillwell ◽

Oliver Duke-Williams

Keyword(s):

Data Extraction ◽

Successful Implementation ◽

Interaction Data ◽

User Perspective ◽

Web Based ◽

History Of ◽

Extraction Processes ◽

Electronic Access ◽

Gain Access ◽

The Web

This chapter is concerned with how users gain access to census interaction data. The authors outline a brief history of electronic access to interaction data sources and identify a number of issues and problems which led to the development of the Web-based Interface to Census Interaction Data (WICID). After presenting a number of practical and technical prerequisites for WICID, the authors explain in detail the architecture underpinning the system and the importance of the metadata framework for both initial successful implementation and ongoing maintenance and flexibility. Much of this chapter is devoted to explaining the basic query building and data extraction processes from a user perspective and further guidance relating to some of WICID’s less basic but no less useful features, is provided.

Download Full-text