Structured Data Extraction: Wrapper Generation

Web Data Mining ◽

10.1007/978-3-540-37882-2_9 ◽

2007 ◽

pp. 323-380

Keyword(s):

Data Extraction ◽

Structured Data ◽

Wrapper Generation


Multiple Types of Semi-structured Data Extraction Using Wrapper for Extraction of Image Using DOM (WEID)

Regional Conference on Science, Technology and Social Sciences (RCSTSS 2016) ◽

10.1007/978-981-13-0074-5_6 ◽

2018 ◽

pp. 67-76 ◽

Cited By ~ 1

Author(s):

Ily Amalina Ahmad Sabri ◽

Mustafa Man

Keyword(s):

Data Extraction ◽

Structured Data


Wrapper Generation for Automatic Data Extraction from Large Web Sites

Databases in Networked Information Systems - Lecture Notes in Computer Science ◽

10.1007/978-3-540-31970-2_3 ◽

2005 ◽

pp. 34-53 ◽

Cited By ~ 1

Author(s):

Nitin Jindal

Keyword(s):

Web Sites ◽

Data Extraction ◽

Automatic Data ◽

Wrapper Generation


Entropy-based automated wrapper generation for weblog data extraction

World Wide Web ◽

10.1007/s11280-013-0269-6 ◽

2013 ◽

Vol 17 (4) ◽

pp. 827-846

Author(s):

George Gkotsis ◽

Karen Stepanyan ◽

Alexandra I. Cristea ◽

Mike Joy

Keyword(s):

Data Extraction ◽

Wrapper Generation


Improving Performance of DOM in Semi-structured Data Extraction using WEIDJ Model

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v9.i3.pp752-763 ◽

2018 ◽

Vol 9 (3) ◽

pp. 752-763 ◽

Cited By ~ 2

Author(s):

Ily Amalina Ahmad Sabri ◽

Mustafa Man

Keyword(s):

Data Extraction ◽

Extraction Process ◽

Structured Data ◽

Web Pages ◽

Web Page ◽

Web Data ◽

Web Documents ◽

Web Extraction ◽

Comparison Time ◽

The Web

<p>Web data extraction is the process of extracting user-required information from web pages. This information is semi-structured rather than structured, and the source documents are in HTML format. Web data extractors are now widely used because the volume of information involved makes manual extraction slow and complicated. In this paper we present WEIDJ, an approach for extracting images from the web whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction Image using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the structure and uses JSON as the programming environment. The extraction process takes as input both the web address and the extraction structure. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over the visual blocks of each web page to find images. Our approach focuses on three levels of extraction: a single web page, multiple web pages, and the whole web page. Extensive experiments on several biodiversity web pages compare the time performance of image extraction using DOM, JSON, and WEIDJ for a single web page. The experimental results indicate that, with our model, WEIDJ image extraction can be done quickly and effectively.</p>
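The general idea of walking an HTML document's element structure to find images and emitting them as JSON can be sketched as follows. This is a minimal illustration using only the Python standard library, not the WEIDJ implementation: the class `ImageCollector`, the functions `extract_images` and `images_to_json`, and the choice of which tags count as a "block" are all assumptions made for this example, and tracking enclosing block tags is only a rough stand-in for the visual-block search the abstract mentions.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect <img> sources together with their enclosing block elements."""

    # Hypothetical choice of tags treated as blocks for this sketch.
    BLOCK_TAGS = {"div", "section", "article", "table", "ul", "figure"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.block_stack = []  # currently open block-level ancestors
        self.images = []       # (block_path, absolute_url) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.block_stack.append(tag)
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                path = "/".join(self.block_stack) or "(root)"
                # Resolve relative image paths against the page address.
                self.images.append((path, urljoin(self.base_url, src)))

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and tag in self.block_stack:
            # Pop up to the matching open block to tolerate unclosed tags.
            while self.block_stack.pop() != tag:
                pass


def extract_images(html, base_url):
    """Return (block_path, image_url) pairs found in one HTML page."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.images


def images_to_json(images):
    """Serialize the extracted images as a JSON array of objects."""
    return json.dumps([{"block": b, "src": s} for b, s in images])
```

For multiple pages, `extract_images` could simply be called once per page and the results concatenated before serialization.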


Self-supervised Automated Wrapper Generation for Weblog Data Extraction

Big Data - Lecture Notes in Computer Science ◽

10.1007/978-3-642-39467-6_26 ◽

2013 ◽

pp. 292-302 ◽

Cited By ~ 3

Author(s):

George Gkotsis ◽

Karen Stepanyan ◽

Alexandra I. Cristea ◽

Mike Joy

Keyword(s):

Data Extraction ◽

Wrapper Generation


Vertical Classification of Web Pages for Structured Data Extraction

Information Retrieval Technology - Lecture Notes in Computer Science ◽

10.1007/978-3-642-35341-3_44 ◽

2012 ◽

pp. 486-495 ◽

Cited By ~ 1

Author(s):

Long Li ◽

Dandan Song ◽

Lejian Liao

Keyword(s):

Data Extraction ◽

Structured Data ◽

Web Pages


Intelligent Bilingual Data Extraction and Rebuilding Using Data Mining for Big Data

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.8699 ◽

2020 ◽

Vol 17 (1) ◽

pp. 513-518

Author(s):

Shashi Pal Singh ◽

Ajai Kumar ◽

Rachna Awasthi ◽

Neetu Yadav ◽

Shikha Jain

Keyword(s):

Data Mining ◽

Big Data ◽

Parallel Processing ◽

Data Extraction ◽

Structured Data ◽

Unstructured Data ◽

File Format ◽

Structure Form ◽

Text Extraction ◽

File Formats

Today, data comes from many sources in various file formats, structures, and types, forming a huge collection of unstructured data on the internet and social media. This gives rise to the categorization of data as unstructured, semi-structured, and structured. Data that exists in an irregular manner without any particular schema is referred to as unstructured data; it is very difficult to process because it contains irregularities and ambiguities. We therefore focus on an Intelligent Processing Unit that converts unstructured big data into meaningful information. Intelligent text extraction is a technique that automatically identifies and extracts text from a file format. The system consists of several stages, including pre-processing, keyphrase extraction, and transformation, which extract text and retrieve structured data from unstructured data. Combining multiple methods gives better results. We are currently working with various file formats, converting each file into DOCX, which arrives in unstructured form, and then obtaining the file in structured form with the help of intelligent pre-processing. The pre-processing stages turn the unstructured data or corpus into meaningful structured data. In the initial stage, the system removes stop words, unwanted symbols, noisy data, and line spacing. The second stage extracts data from various file sources or file types into properly formatted plain text. In the third stage, the data is transformed from one format to another so that the user can understand it. The final step rebuilds the file in its original format, maintaining the tags of the files. Large files are divided into smaller sub-files so that parallel processing algorithms can process larger files and data quickly.
Parallel processing is a very important concept for text extraction: with its help, a big file is broken into small files, which improves the result. Data extraction is bilingual and represents the most relevant information contained in the document. Keyphrase extraction is an important problem in data mining, knowledge retrieval, and natural speech processing. A keyword extraction technique is used to abstract keywords that uniquely identify a document. Rebuilding is an important part of this project: the whole approach is applied to the given file format, and the final output must be in the same format as the original file. This concept is widely used, but not much work has been done on providing many functionalities under one tool, which motivates the need for a tool that can easily and efficiently convert unstructured files into structured ones.
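The initial cleaning stage and the idea of splitting a large input into smaller pieces for parallel processing might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' system: the stop-word list, the function names `clean_text`, `split_into_chunks`, and `clean_lines_parallel`, and the use of a thread pool are all hypothetical choices. The symbol-stripping regular expression is a simplification suited to plain alphabetic text and would need care for scripts that rely on combining marks.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def clean_text(text):
    """Remove symbols, stop words, and extra spacing from one line."""
    text = re.sub(r"[^\w\s]", " ", text)  # replace symbols with spaces
    words = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return " ".join(words)


def split_into_chunks(lines, chunk_size):
    """Divide a list of lines into consecutive chunks of at most chunk_size."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def clean_chunk(chunk):
    """Clean every line in one chunk."""
    return [clean_text(line) for line in chunk]


def clean_lines_parallel(lines, chunk_size=2, workers=4):
    """Clean chunks concurrently and return the lines in original order."""
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so the output lines stay aligned.
        cleaned_chunks = list(pool.map(clean_chunk, chunks))
    return [line for chunk in cleaned_chunks for line in chunk]
```

A thread pool keeps the example short and portable; for CPU-bound cleaning of very large files, a process pool would be the more natural choice.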
