Using Logic Programming and XML Technologies for Data Extraction from Web Pages

The Web is designed as a major information provider for the human consumer. However, information published on the Web is difficult to understand and reuse by a machine. In this chapter, we show how well established intelligent techniques based on logic programming and inductive learning combined with more recent XML technologies might help to improve the efficiency of the task of data extraction from Web pages. Our work can be seen as a necessary step of the more general problem of Web data management and integration.

Download Full-text

A FRAME WORK FOR WEB INFORMATION EXTRACTION AND ANALYSIS

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v7i2.3459 ◽

2013 ◽

Vol 7 (2) ◽

pp. 574-579 ◽

Cited By ~ 3

Author(s):

Dr Sunitha Abburu ◽

G. Suresh Babu

Keyword(s):

Information Extraction ◽

Data Extraction ◽

Research Work ◽

Web Pages ◽

Web Documents ◽

E Learning ◽

Structured Information ◽

Frame Work ◽

Effective Decision ◽

The Web

Day by day the volume of information availability in the web is growing significantly. There are several data structures for information available in the web such as structured, semi-structured and unstructured. Majority of information in the web is presented in web pages. The information presented in web pages is semi-structured.Â But the information required for a context are scattered in different web documents. It is difficult to analyze the large volumes of semi-structured information presented in the web pages and to make decisions based on the analysis. The current research work proposed a frame work for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies Â data extraction, data consolidation, data analysis and decision making based on the information presented in the web pages.The proposed frame work integrates web crawling, information extraction and data mining technologies for better information analysis that helps in effective decision making.Â Â It enables people and organizations to extract information from various sourses of web and to make an effective analysis on the extracted data for effective decision making.Â The proposed frame work is applicable for any application domain. Manufacturing,sales,tourisum,e-learning are various application to menction few.The frame work is implemetnted and tested for the effectiveness of the proposed system and the results are promising.

Download Full-text

Exploiting Enriched Knowledge of Web Network Structures

Enhancing Qualitative and Mixed Methods Research with Technology - Advances in Knowledge Acquisition, Transfer, and Management ◽

10.4018/978-1-4666-6493-7.ch011 ◽

2015 ◽

pp. 255-286

Author(s):

Shalin Hai-Jew

Keyword(s):

Social Media ◽

Data Extraction ◽

Web Pages ◽

Network Structures ◽

Social Media Platform ◽

Special Software ◽

Testing Tool ◽

Media Platform ◽

Data Visualizations ◽

The Web

Understanding Web network structures may offer insights on various organizations and individuals. These structures are often latent and invisible without special software tools; the interrelationships between various websites may not be apparent with a surface perusal of the publicly accessible Web pages. Three publicly available tools may be “chained” (combined in sequence) in a data extraction sequence to enable visualization of various aspects of http network structures in an enriched way (with more detailed insights about the composition of such networks, given their heterogeneous and multimodal contents). Maltego Tungsten™, a penetration-testing tool, enables the mapping of Web networks, which are enriched with a variety of information: the technological understructure and tools used to build the network, some linked individuals (digital profiles), some linked documents, linked images, related emails, some related geographical data, and even the in-degree of the various nodes. NCapture with NVivo enables the extraction of public social media platform data and some basic analysis of these captures. The Network Overview, Discovery, and Exploration for Excel (NodeXL) tool enables the extraction of social media platform data and various evocative data visualizations and analyses. With the size of the Web growing exponentially and new domains (like .ventures, .guru, .education, .company, and others), the ability to map widely will offer a broad competitive advantage to those who would exploit this approach to enhance knowledge.

Download Full-text

Improving Performance of DOM in Semi-structured Data Extraction using WEIDJ Model

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v9.i3.pp752-763 ◽

2018 ◽

Vol 9 (3) ◽

pp. 752 ◽

Cited By ~ 2

Author(s):

Ily Amalina Ahmad Sabri ◽

Mustafa Man

Keyword(s):

Data Extraction ◽

Extraction Process ◽

Structured Data ◽

Web Pages ◽

Web Page ◽

Web Data ◽

Web Documents ◽

Web Extraction ◽

Comparison Time ◽

The Web

<p>Web data extraction is the process of extracting user required information from web page. The information consists of semi-structured data not in structured format. The extraction data involves the web documents in html format. Nowadays, most people uses web data extractors because the extraction involve large information which makes the process of manual information extraction takes time and complicated. We present in this paper WEIDJ approach to extract images from the web, whose goal is to harvest images as object from template-based html pages. The WEIDJ (Web Extraction Image using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory in order to build the structure and JSON as environment of programming. The extraction process leverages both the input of web address and the structure of extraction. Then, WEIDJ splits DOM tree into small subtrees and applies searching algorithm by visual blocks for each web page to find images. Our approach focus on three level of extraction; single web page, multiple web page and the whole web page. Extensive experiments on several biodiversity web pages has been done to show the comparison time performance between image extraction using DOM, JSON and WEIDJ for single web page. The experimental results advocate via our model, WEIDJ image extraction can be done fast and effectively.</p>

Download Full-text

Tech Services on the Web: Data Management and Curation Resources from IASSIST; http://www.iassistdata.org/resources/category/data-management-and-curation

Technical Services Quarterly ◽

10.1080/07317131.2013.819758 ◽

2013 ◽

Vol 30 (4) ◽

pp. 437-438

Author(s):

Mary Mallery

Keyword(s):

Data Management ◽

Web Data ◽

Web Data Management ◽

The Web

Download Full-text

Intensional data on the web

10.31219/osf.io/mvs8q ◽

2017 ◽

Author(s):

Antoine Amarilli ◽

Silviu Maniu ◽

Pierre Senellart

Keyword(s):

Social Networks ◽

Data Management ◽

Data Access ◽

Influence Maximization ◽

Management Scenarios ◽

Focused Crawling ◽

Crowdsourced Data ◽

The World ◽

Web Data Management ◽

The Web

We call data intensional when it is not directly available, but must be accessed through a costlyinterface. Intensional data naturally arises in a number of Web data management scenarios, suchas Web crawling or ontology-based data access. Such scenarios require us to model an uncertainview of the world, for which, given a query, we must answer the question “What is the best thingto do next?” Once data has been retrieved, the knowledge of the world is revised, and the wholeprocess is repeated, until enough knowledge about the world has been obtained for the particularapplication considered. In this article, we give an overview of the steps underlying all intensionaldata management scenarios, and illustrate them on three concrete applications: focused crawling,online influence maximization in social networks, and mining crowdsourced data.

Download Full-text

Web Data Extraction Based on Tag Path Clustering

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.1590 ◽

2013 ◽

Vol 756-759 ◽

pp. 1590-1594

Author(s):

Gui Li ◽

Cheng Chen ◽

Zheng Yu Li ◽

Zi Yang Han ◽

Ping Sun

Keyword(s):

Data Extraction ◽

Structured Data ◽

Web Pages ◽

Web Data Extraction ◽

Web Document ◽

Automatic Methods ◽

Simple Extraction ◽

Web Document Clustering ◽

Fully Automatic ◽

The Web

Fully automatic methods that extract structured data from the Web have been studied extensively. The existing methods suffice for simple extraction, but they often fail to handle more complicated Web pages. This paper introduces a method based on tag path clustering to extract structured data. The method gets complete tag path collection by parsing the DOM tree of the Web document. Clustering of tag paths is performed based on introduced similarity measure and the data area can be targeted, then taking advantage of features of tag position, we can separate and filter record, finally complete data extraction. Experiments show this method achieves higher accuracy than previous methods.

Download Full-text

Big Data Analysis of Web Data Extraction

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i4.37.24095 ◽

2018 ◽

Vol 7 (4.37) ◽

pp. 168

Author(s):

Nadia Ibrahim ◽

Alaa Hassan ◽

Marwah Nihad

Keyword(s):

Data Mining ◽

Data Analysis ◽

High Performance ◽

Data Extraction ◽

Large Data ◽

Heterogeneous Data ◽

Web Pages ◽

Target Domain ◽

Data Mining Algorithms ◽

The Web

In this study, the large data extraction techniques; include detection of patterns and secret relationships between factors numbering and bring in the required information. Rapid analysis of massive data can lead to innovation and concepts of the theoretical value. Compared with results from mining between traditional data sets and the vast amount of large heterogeneous data interdependent it has the ability expand the knowledge and ideas about the target domain. We studied in this research data mining on the Internet. The various networks that are used to extract data onto different locations complex may appear sometimes and has been used to extract information on the web technology to extract and data analysis (Marwah et al., 2016). In this research, we extracted the information on large quantities of the web pages and examined the pages of the site using Java code, and we added the extracted information on a special database for the web page. We used the data network function to get accurate results of evaluating and categorizing the data pages found, which identifies the trusted web or risky web pages, and imported the data onto a CSV extension. Consequently, examine and categorize these data using WEKA to obtain accurate results. We concluded from the results that the applied data mining algorithms are better than other techniques in classification and extraction of data and high performance.

Download Full-text

Wrapper Maintenance: A Machine Learning Approach

Journal of Artificial Intelligence Research ◽

10.1613/jair.1145 ◽

2003 ◽

Vol 18 ◽

pp. 149-181 ◽

Cited By ~ 74

Author(s):

K. Lerman ◽

S. N. Minton ◽

C. A. Knoblock

Keyword(s):

Structural Information ◽

Data Extraction ◽

Research Problem ◽

Web Pages ◽

Online Information ◽

Verification Algorithm ◽

Machine Learning Approach ◽

Correct Data ◽

The Web ◽

Important Research Problem

The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.

Download Full-text

Using XPaths of inbound links to cluster template-generated web pages

Computer Science and Information Systems ◽

10.2298/csis130416020g ◽

2014 ◽

Vol 11 (1) ◽

pp. 111-131

Author(s):

Tomas Grigalis ◽

Antanas Cenys

Keyword(s):

Real World ◽

Data Extraction ◽

Structural Similarity ◽

Structured Data ◽

Single Type ◽

Web Pages ◽

Template Structure ◽

Computationally Expensive ◽

Web Clustering ◽

The Web

Template-generated Web pages contain most of structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers that must be matched to only particular template pages. Selecting single type of template from all crawled Web pages is a time consuming task. Although there are methods to cluster Web pages according to their structural similarity, however, in most cases they are too computationally expensive to be applicable at Web-Scale. We propose a novel highly scalable approach to structurally cluster Web pages by employing XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real world Websites in a few minutes and achieving >90% accuracy.

Download Full-text

Web Data Extraction and Integration in Domain

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.1585 ◽

2013 ◽

Vol 756-759 ◽

pp. 1585-1589 ◽

Cited By ~ 1

Author(s):

Gui Li ◽

Zi Yang Han ◽

Zhao Xin Chen ◽

Zheng Yu Li ◽

Ping Sun

Keyword(s):

Data Model ◽

Data Extraction ◽

Value Added ◽

Web Pages ◽

Web Data ◽

Integration Algorithm ◽

Web Data Extraction ◽

Data Schema ◽

The Web

The purpose of WEB data extraction and integration is to provide the domain oriented value-added services. Based on the requirements of domain, and the features of web pages data. this paper proposes a WEB data schema and a domain data model. It also puts forward the web table positioning and web table records extracting based on WEB data schema and an integration algorithm based on the main data model. The experiment results are given to show effectiveness of the proposed algorithm and model.

Download Full-text