Search Engine-Based Web Information Extraction

2011 ◽  
pp. 2048-2081
Author(s):  
Gijs Geleijnse ◽  
Jan Korst

In this chapter we discuss approaches to find, extract, and structure information from natural language texts on the Web. Such structured information can be expressed and shared using the standard Semantic Web languages and hence be machine interpreted. In this chapter we focus on two tasks in Web information extraction. The first part focuses on mining facts from the Web, while in the second part, we present an approach to collect community-based meta-data. A search engine is used to retrieve potentially relevant texts. From these texts, instances and relations are extracted. The proposed approaches are illustrated using various case-studies, showing that we can reliably extract information from the Web using simple techniques.

Author(s):  
Gijs Geleijnse

In this chapter we discuss approaches to find, extract, and structure information from natural language texts on the Web. Such structured information can be expressed and shared using the standard Semantic Web languages and hence be machine interpreted. In this chapter we focus on two tasks in Web information extraction. The first part focuses on mining facts from the Web, while in the second part, we present an approach to collect community-based meta-data. A search engine is used to retrieve potentially relevant texts. From these texts, instances and relations are extracted. The proposed approaches are illustrated using various case-studies, showing that we can reliably extract information from the Web using simple techniques.


2014 ◽  
Vol 989-994 ◽  
pp. 4322-4325
Author(s):  
Mu Qing Zhan ◽  
Rong Hua Lu

In the means of getting information from the Internet, the Web information extraction technology which can get more precise and more granular information is different from Search Engine, this article presents the technical route of Web information exaction of ceramic products’ information on the basis of analyzing the developing status of Web information extraction technology at home and abroad, and makes the extraction rules, and develops a set of extraction system, and acquires the relevant ceramic products’ information.


2021 ◽  
pp. 99-110
Author(s):  
Mohammad Ali Tofigh ◽  
◽  
◽  
Zhendong Mu

With the development of society, people pay more and more attention to the safety of food, and relevant laws and policies are gradually introduced and being improved. The research and development of agricultural product quality and safety system has become a research hot spot, and how to obtain the Web information of the system effectively and quickly is the focus of the research, so it is essential to carry out the intelligent extraction of Web information for agricultural product quality and safety system. The purpose of this paper is to solve the problem of how to efficiently extract the Web information of the agricultural product quality and safety system. By studying the Web information extraction methods of various systems, the paper makes a detailed analysis and research on how to realize the efficient and intelligent extraction of the Web information of the agricultural product quality and safety system. This paper analyzes in detail all kinds of template information extraction algorithms used at present, and systematically discusses a set of schemes that can automatically extract the Web information of agricultural product quality and safety system according to the template. The research results show that the proposed scheme is a dynamically extensible information extraction system, which can independently implement dynamic configuration templates according to different requirements without changing the code. Compared with the general way, the Web information extraction speed of agricultural product quality safety system is increased by 25%, the accuracy is increased by 12%, and the recall rate is increased by 30%.


2004 ◽  
pp. 227-267
Author(s):  
Wee Keong Ng ◽  
Zehua Liu ◽  
Zhao Li ◽  
Ee Peng Lim

With the explosion of information on the Web, traditional ways of browsing and keyword searching of information over web pages no longer satisfy the demanding needs of web surfers. Web information extraction has emerged as an important research area that aims to automatically extract information from target web pages and convert them into a structured format for further processing. The main issues involved in the extraction process include: (1) the definition of a suitable extraction language; (2) the definition of a data model representing the web information source; (3) the generation of the data model, given a target source; and (4) the extraction and presentation of information according to a given data model. In this chapter, we discuss the challenges of these issues and the approaches that current research activities have taken to revolve these issues. We propose several classification schemes to classify existing approaches of information extraction from different perspectives. Among the existing works, we focus on the Wiccap system — a software system that enables ordinary end-users to obtain information of interest in a simple and efficient manner by constructing personalized web views of information sources.


2014 ◽  
Vol 614 ◽  
pp. 503-506
Author(s):  
Qi Shen ◽  
Qing Ming Song ◽  
Bo Chen

With the development of web technology, the use of dynamic web pages and the personalization of page contents become more and more popular. Currently, the information of page is protean and the structures of different pages are vastly different, the traditional thinking of web information extraction technology has been difficult to adapt to the situation. In this paper, proposes a web information extraction method based on extended XPath policy through the analysis of structural features of web pages on tourist theme. This algorithm avoids the defects of traditional web information extraction technology; it is simple, practical, high cleaning efficiency, accuracy, and saving the overhead of the system.


2008 ◽  
pp. 211-238
Author(s):  
Wee Keong Ng ◽  
Zehua Liu ◽  
Zhao Li ◽  
Ee Peng Lim

With the explosion of information on the Web, traditional ways of browsing and keyword searching of information over web pages no longer satisfy the demanding needs of web surfers. Web information extraction has emerged as an important research area that aims to automatically extract information from target web pages and convert them into a structured format for further processing. The main issues involved in the extraction process include: (1) the definition of a suitable extraction language; (2) the definition of a data model representing the web information source; (3) the generation of the data model, given a target source; and (4) the extraction and presentation of information according to a given data model. In this chapter, we discuss the challenges of these issues and the approaches that current research activities have taken to revolve these issues. We propose several classification schemes to classify existing approaches of information extraction from different perspectives. Among the existing works, we focus on the Wiccap system — a software system that enables ordinary end-users to obtain information of interest in a simple and efficient manner by constructing personalized web views of information sources.


Sign in / Sign up

Export Citation Format

Share Document