Structured Data Extraction: Wrapper Generation

2011 ◽  
pp. 363-423 ◽  
Author(s):  
Bing Liu
2013 ◽  
Vol 17 (4) ◽  
pp. 827-846
Author(s):  
George Gkotsis ◽  
Karen Stepanyan ◽  
Alexandra I. Cristea ◽  
Mike Joy

Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

<p>Web data extraction is the process of extracting user-required information from web pages. The information is semi-structured rather than structured, and the source documents are in HTML format. Web data extractors are now widely used because the volume of information involved makes manual extraction slow and complicated. In this paper we present WEIDJ, an approach for extracting images from the web whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction Image using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the structure and uses JSON as the programming environment. The extraction process takes both the web address and the extraction structure as input. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over the visual blocks of each web page to find images. Our approach focuses on three levels of extraction: a single web page, multiple web pages and the whole web page. Extensive experiments on several biodiversity web pages compare the time performance of image extraction using DOM, JSON and WEIDJ for a single web page. The experimental results indicate that, with our model, WEIDJ image extraction can be done quickly and effectively.</p>
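The general idea of walking an HTML document's tree to harvest images as JSON objects can be illustrated with a minimal Python sketch. This is not the WEIDJ algorithm itself, which the abstract does not specify in detail: the class name `ImageCollector`, the function `extract_images`, the sample page and the `example.org` address are all hypothetical, and the "visual block" search is approximated here by recording the tag path that leads to each image.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags that never have a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ImageCollector(HTMLParser):
    """Hypothetical helper: collects <img> sources and the tag path to each."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.path = []    # stack of currently open tags (the subtree we are in)
        self.images = []  # one dict per image found

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            if attr.get("src"):
                self.images.append({
                    # resolve relative addresses against the page address
                    "src": urljoin(self.base_url, attr["src"]),
                    "alt": attr.get("alt") or "",
                    "path": "/".join(self.path),
                })
        elif tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass


def extract_images(html, base_url):
    """Return a list of image records for one page (assumed input: raw HTML)."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.images


if __name__ == "__main__":
    page = ('<html><body><div class="card"><img src="/img/a.jpg" alt="A">'
            '<p>text</p></div><div><img src="b.png"/></div></body></html>')
    print(json.dumps(extract_images(page, "https://example.org/species/"), indent=2))
```

Serializing the records with `json.dumps` mirrors the abstract's use of JSON as the output representation; extending the sketch to multiple pages would amount to calling `extract_images` once per page address.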


2020 ◽  
Vol 17 (1) ◽  
pp. 513-518
Author(s):  
Shashi Pal Singh ◽  
Ajai Kumar ◽  
Rachna Awasthi ◽  
Neetu Yadav ◽  
Shikha Jain

In today's world, data exist in many sources and in various file formats, structures and types, forming a huge collection of unstructured data on the internet and social media. This gives rise to the categorization of data as unstructured, semi-structured and structured. Data that exist in an irregular manner without any particular schema are referred to as unstructured data, which are very difficult to process because of their irregularities and ambiguities. We therefore focus on an Intelligent Processing Unit that converts unstructured big data into meaningful information. Intelligent text extraction is a technique that automatically identifies and extracts text from a file format. The system consists of several stages, including pre-processing, key-phrase extraction and transformation, which extract text and retrieve structured data from unstructured data. Combining multiple methods gives better results. We are currently working with various file formats, converting each file into DOCX, which arrives in unstructured form, and then obtaining the file in structured form with the help of intelligent pre-processing. The pre-processing stages convert the unstructured data or corpus into meaningful structured data. In the initial stage, the system removes stop words, unwanted symbols, noisy data and line spacing. The second stage is data extraction from various sources or types of files into plain text. In the third stage, we transform the data from one format to another so that the user can understand it. The final step is rebuilding the file in its original format while maintaining the tags of the file. Large files are divided into smaller sub-files so that parallel processing algorithms can be executed for fast processing of larger files and data.
Parallel processing is a very important concept for text extraction: with its help, a big file is broken into small files, which improves the result. Data are extracted in two languages and represent the most relevant information contained in the document. Key-phrase extraction is an important problem in data mining, knowledge retrieval and natural language processing. Keyword extraction is used to abstract keywords that uniquely identify a document. Rebuilding is an important part of this project: the entire approach is applied to the file in its own format, and the final output must be in the same format as the input file. This concept is widely used, but little work has been done on providing many functionalities within one tool, which motivates the need for a tool that can easily and efficiently convert unstructured files into structured ones.
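The pre-processing, chunking and keyword steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' system: the stop-word list, the function names (`clean`, `split_chunks`, `count_chunk`, `top_keywords`), the chunk size and the use of plain term frequency as the keyword score are all hypothetical choices, and a thread pool stands in for whatever parallel processing algorithm the paper uses.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumed, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "with"}


def clean(text):
    """Lowercase, drop symbols and punctuation, and remove stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def split_chunks(lines, size):
    """Divide a large list of lines into smaller sub-lists of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_chunk(lines):
    """Count cleaned tokens in one chunk; chunks are processed independently."""
    counts = Counter()
    for line in lines:
        counts.update(clean(line))
    return counts


def top_keywords(lines, k=3, chunk_size=2, workers=2):
    """Process chunks in parallel, merge the counts, and return the k most frequent terms."""
    chunks = split_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return [word for word, _ in total.most_common(k)]


if __name__ == "__main__":
    document = [
        "The wrapper extracts data.",
        "Data extraction of the web data",
        "A wrapper is induced",
    ]
    print(top_keywords(document, k=2))
```

For CPU-bound work on genuinely large files, `ProcessPoolExecutor` from the same module could replace the thread pool; threads are used here only to keep the sketch simple.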


2013 ◽  
Vol 8 (6) ◽  
pp. 93-96
Author(s):  
Vimala. S
