Analysis and Improvement of Data Extraction Technology on the Web

Author(s):  
Bi Li

2010 ◽  
Vol 20-23 ◽  
pp. 178-183
Author(s):  
Jun Hua Gu ◽  
Jie Song ◽  
Na Zhang ◽  
Yan Liu Liu

With the increasing speed of the Internet and the growing amount of data it contains, users find it more and more difficult to obtain useful information from the Web; how to extract accurate information from the Web efficiently has become an urgent problem, and Web information extraction technology has emerged to solve it. A method of automatic Web information extraction based on XML is designed by standardizing HTML documents with a data translation algorithm, forming an extraction rule base by learning the XPath expressions of sample pages, and applying the rule base to automatically extract pages of the same kind. The results show that this approach yields a higher recall ratio and precision ratio, and that the extracted result is self-describing, which makes it convenient to build data extraction systems for individual domains.
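The rule-base step this abstract describes can be sketched minimally: learn (here, hard-code) one XPath expression per field, then apply the same rule base to every page of the same kind. The page structure, field names, and rules below are illustrative stand-ins, not the paper's actual rule base, and Python's standard `xml.etree.ElementTree` stands in for the XML tooling applied after HTML standardization.

```python
import xml.etree.ElementTree as ET

# Two "same-kind" pages, assumed already normalized from HTML to XML
# (the paper's data-translation step). Structure is illustrative.
PAGES = [
    "<page><item><name>Vase</name><price>12.50</price></item></page>",
    "<page><item><name>Bowl</name><price>8.00</price></item></page>",
]

# Extraction rule base learned from sample pages: field -> XPath expression.
RULES = {"name": ".//item/name", "price": ".//item/price"}

def extract(xml_doc, rules):
    """Apply each XPath rule to one normalized page."""
    root = ET.fromstring(xml_doc)
    return {field: root.findtext(xpath) for field, xpath in rules.items()}

records = [extract(page, RULES) for page in PAGES]
print(records)
```

Once the rule base exists, every structurally similar page is processed without further human input, which is what makes the extraction "automatic" in the paper's sense.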


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the Web is growing significantly. Web information comes in several forms: structured, semi-structured, and unstructured. The majority of it is presented in web pages, where it is semi-structured, yet the information required for a given context is scattered across different web documents. It is difficult to analyze such large volumes of semi-structured information and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various Web sources and to analyze the extracted data effectively. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few example applications. The framework was implemented and tested for effectiveness, and the results are promising.
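The crawl–extract–mine pipeline this abstract outlines can be sketched end to end; here the crawl step is stubbed with static strings standing in for fetched pages, the extraction step uses simple patterns, and the mining step is a toy aggregation. All page content, class names, and fields are invented for illustration.

```python
import re
from collections import Counter

# Stub "crawl" results standing in for pages fetched by an HTTP crawler.
CRAWLED = [
    "<html><span class='product'>kettle</span><span class='price'>20</span></html>",
    "<html><span class='product'>kettle</span><span class='price'>25</span></html>",
    "<html><span class='product'>teapot</span><span class='price'>15</span></html>",
]

def extract_fields(page):
    """Information-extraction step: pull product name and price."""
    product = re.search(r"class='product'>([^<]+)<", page).group(1)
    price = int(re.search(r"class='price'>(\d+)<", page).group(1))
    return product, price

def analyze(records):
    """Toy mining step: mention counts and average price per product."""
    counts = Counter(p for p, _ in records)
    avg = {p: sum(v for q, v in records if q == p) / counts[p] for p in counts}
    return counts, avg

records = [extract_fields(page) for page in CRAWLED]
counts, avg_price = analyze(records)
print(counts, avg_price)
```

The point of the framework is exactly this chaining: each stage consumes the previous stage's output, so consolidated reports fall out of the pipeline rather than from manual collation.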


Author(s):  
Shalin Hai-Jew

Understanding Web network structures may offer insights on various organizations and individuals. These structures are often latent and invisible without special software tools; the interrelationships between various websites may not be apparent from a surface perusal of the publicly accessible Web pages. Three publicly available tools may be “chained” (combined in sequence) in a data extraction sequence to enable visualization of various aspects of http network structures in an enriched way (with more detailed insights about the composition of such networks, given their heterogeneous and multimodal contents). Maltego Tungsten™, a penetration-testing tool, enables the mapping of Web networks, which are enriched with a variety of information: the technological understructure and tools used to build the network, some linked individuals (digital profiles), some linked documents, linked images, related emails, some related geographical data, and even the in-degree of the various nodes. NCapture with NVivo enables the extraction of public social media platform data and some basic analysis of these captures. The Network Overview, Discovery, and Exploration for Excel (NodeXL) tool enables the extraction of social media platform data and various evocative data visualizations and analyses. With the size of the Web growing exponentially and new top-level domains appearing (like .ventures, .guru, .education, .company, and others), the ability to map widely will offer a broad competitive advantage to those who would exploit this approach to enhance knowledge.


2014 ◽  
Vol 989-994 ◽  
pp. 4322-4325
Author(s):  
Mu Qing Zhan ◽  
Rong Hua Lu

As a means of getting information from the Internet, Web information extraction technology, which can obtain more precise and finer-grained information, differs from a search engine. On the basis of an analysis of the state of Web information extraction technology at home and abroad, this article presents a technical route for extracting Web information about ceramic products, defines the extraction rules, develops an extraction system, and acquires the relevant ceramic products' information.


Author(s):  
Arshid Yousefi Avarvand ◽  
Mehrdad Halaji ◽  
Donya Zare ◽  
Meysam Hasannejad-Bibalan ◽  
Hadi Sedigh Ebrahim-Saraie

Background: Streptococcus pneumoniae is an important pathogen of children, mostly in developing countries. We aimed to investigate the prevalence of invasive S. pneumoniae among Iranian children using a systematic review and meta-analysis. Methods: A systematic search was carried out to identify papers published by Iranian authors in the Web of Science, PubMed, Scopus, and Google Scholar electronic databases from January 2010 to December 2017. Seven publications that met our inclusion criteria were selected for data extraction and analysis. Results: In total, one study was multicenter and six were single-center studies; all of the included studies were performed among hospitalized patients. Seven studies reported the prevalence of invasive S. pneumoniae isolated from children; from these, the pooled prevalence of S. pneumoniae was 2.5% (95% CI: 0.7%-9.1%). Conclusion: The overall prevalence of invasive S. pneumoniae infections among Iranian children is low (2.5%). However, further clinical studies are required to elucidate the burden of infections among Iranian children, especially in eastern regions.
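The abstract does not state which meta-analytic model produced the pooled prevalence; a common choice for pooling proportions is inverse-variance weighting on the logit scale, sketched below under that assumption. The (events, sample size) pairs are invented for illustration and are not the data of the seven included studies.

```python
import math

# Illustrative (events, sample-size) pairs -- NOT the reviewed studies' data.
STUDIES = [(2, 120), (5, 300), (1, 80)]

def pooled_prevalence(studies):
    """Fixed-effect inverse-variance pooling of proportions on the logit scale."""
    num = den = 0.0
    for events, n in studies:
        p = events / n
        logit = math.log(p / (1 - p))
        var = 1 / events + 1 / (n - events)  # approx. variance of the logit
        weight = 1 / var
        num += weight * logit
        den += weight
    pooled_logit = num / den
    se = math.sqrt(1 / den)

    def back(x):  # back-transform logit -> proportion
        return 1 / (1 + math.exp(-x))

    return back(pooled_logit), back(pooled_logit - 1.96 * se), back(pooled_logit + 1.96 * se)

est, lo, hi = pooled_prevalence(STUDIES)
print(f"pooled prevalence {est:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

A random-effects model (e.g. DerSimonian-Laird) would add a between-study variance term to the weights; the asymmetric CI reported in the abstract (0.7%-9.1% around 2.5%) is consistent with back-transforming a symmetric interval from such a transformed scale.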


Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

Web data extraction is the process of extracting the information a user requires from a web page. The information consists of semi-structured data not in a structured format, and the extraction involves web documents in HTML. Nowadays, most people use web data extractors because the extraction involves large amounts of information, which makes manual information extraction time-consuming and complicated. We present in this paper the WEIDJ approach to extracting images from the Web, whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction of Images using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the structure and uses JSON as the programming environment. The extraction process takes as input both a web address and the extraction structure. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over the visual blocks of each web page to find images. Our approach focuses on three levels of extraction: a single web page, multiple web pages, and the whole web site. Extensive experiments on several biodiversity web pages have been conducted to compare the time performance of image extraction using DOM, JSON, and WEIDJ for a single web page. The experimental results show that, with our model, WEIDJ image extraction can be done quickly and effectively.
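The core of the image-harvesting step can be sketched as a DOM walk that emits each image as a JSON-like record. This is a minimal stand-in, not WEIDJ itself: the real system partitions the DOM into visual blocks and works across multiple pages, while this sketch only shows the traversal and JSON-style output using Python's standard `html.parser`; the page markup is invented.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Walk the DOM stream and collect each <img> as a JSON-style record."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            self.images.append({"src": a.get("src"), "alt": a.get("alt", "")})

# Illustrative template-based page fragment (e.g. a biodiversity gallery).
PAGE = """<html><body>
<div class="gallery">
  <img src="fern.jpg" alt="Fern">
  <img src="moss.jpg" alt="Moss">
</div>
</body></html>"""

collector = ImageCollector()
collector.feed(PAGE)
print(collector.images)
```

Because the output is already a list of dictionaries, serializing it with `json.dumps` yields the kind of self-describing extraction result the approach targets.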


Author(s):  
Fatimah Sidi ◽  
Iskandar Ishak ◽  
Marzanah A. Jabar

An enormous number of unstructured documents written in the Malay language are available on the web and on intranets. However, unstructured documents cannot be queried in simple ways, so the knowledge contained in such documents can neither be used by automatic systems nor be understood easily and clearly by humans. This paper proposes a new approach that transforms knowledge extracted from unstructured Malay documents using an ontology, by identifying, organizing, and structuring the documents into an interrogative structured form. A Malay knowledge base, the MalayIK corpus, is developed and used to test MalayIK-Ontology against Ontos, an existing data extraction engine. The experimental results from MalayIK-Ontology show a significant improvement in knowledge extraction over the Ontos implementation. This shows that clear knowledge organization and structuring of concepts can increase understanding, which leads to a potential increase in the sharing and reuse of concepts within the community.

