Information Extraction from Heterogenous Web Sites Using Additional Search of Related Contents Based on a User’s Instantiated Example

2004 ◽

Vol 13 (03) ◽

pp. 721-738 ◽

Cited By ~ 1

Author(s):

XIAOYING GAO ◽

MENGJIE ZHANG

Keyword(s):

Information Extraction ◽

Web Sites ◽

Learning Algorithm ◽

Knowledge Bases ◽

Web Pages ◽

Specific Knowledge ◽

Domain Specific ◽

Domain Specific Knowledge ◽

Knowledge Unit ◽

Training Examples

This paper describes a learning/adaptive approach to automatically building knowledge bases for information extraction from text-based web pages. A frame-based representation is introduced to represent domain knowledge as knowledge unit frames. A frame learning algorithm is developed to automatically learn knowledge unit frames from training examples. Some training examples can be obtained by automatically parsing a number of tabular web pages in the same domain, which greatly reduces the amount of time-consuming manual work. The approach was investigated on ten web sites of real estate advertisements and car advertisements, and nearly all the information was successfully extracted with very few false alarms. These results suggest that both the knowledge unit frame representation and the frame learning algorithm work well, that domain-specific knowledge bases can be learned from training examples, and that a domain-specific knowledge base can be used for information extraction from flexible, text-based, semi-structured web pages on multiple web sites. An investigation of the knowledge representation on five other domains suggests that the approach can be applied to other domains simply by changing the training examples.

Adapting information extraction knowledge for unseen Web sites

2002 IEEE International Conference on Data Mining, 2002. Proceedings. ◽

10.1109/icdm.2002.1183995 ◽

2003 ◽

Cited By ~ 3

Author(s):

Tak-Lam Wong ◽

Wai Lam

Keyword(s):

Information Extraction ◽

Web Sites

Unsupervised Extraction of Common Product Attributes From E-Commerce Websites by Considering Client Suggestion

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.j9307.0981119 ◽

2019 ◽

Vol 8 (11) ◽

pp. 1199-1203

Keyword(s):

Electronic Commerce ◽

Information Extraction ◽

Web Sites ◽

Extraction Methods ◽

Agricultural Product ◽

Accounting System ◽

Product Attributes ◽

Customer Reviews ◽

Stage Of Development ◽

Product Description

This paper develops an unsupervised learning framework for extracting popular product attributes from product description pages originating from different E-commerce Web sites. Unlike existing information extraction methods, which do not consider the popularity of product attributes, the proposed framework is able not only to detect popular product features from a collection of customer reviews but also to map these popular features to the related product attributes. Building an intelligent E-commerce system typically involves a component that can automatically extract product attribute information from a variety of product description pages on different E-commerce Web sites. Web information extraction methods such as wrappers are able to automatically extract product attributes from Web content. One novelty of this framework is that it can bridge the vocabulary gap between the text in product description pages and the text in customer reviews. Technically, the framework develops a discriminative graphical model based on hidden Conditional Random Fields. As an unsupervised model, the framework can be easily applied to a variety of new domains and Web sites without the need to label training samples. E-commerce is proposed for enhancing the capability. In the electronic commerce environment, with its many new business models, it is necessary to analyze electronic commerce patterns; such analysis helps us uncover new electronic commerce patterns, provides an approach for modernizing them, and helps enterprises define their electronic commerce strategy and implementation steps.
Motivated by this, the paper proposes the concept of E-commerce marketing of new agricultural products based on a large web data platform. With the rapid development of reform and opening up, China's agriculture has entered a new historical stage of development. The paper evaluates the growth mechanism of agricultural production enterprises from the perspective of dynamic resource supply. In the E-commerce environment, enterprise data and economic information are relatively concentrated, so the economic accounting system can instantly capture current economic data and quickly generate economic information.

Mining web sites using adaptive information extraction

Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - EACL '03 ◽

10.3115/1067737.1067752 ◽

2003 ◽

Cited By ~ 7

Author(s):

Alexiei Dingli ◽

Fabio Ciravegna ◽

David Guthrie ◽

Yorick Wilks

Keyword(s):

Information Extraction ◽

Web Sites

Automatic Information Extraction from E-Commerce Web Sites

2010 International Conference on E-Business and E-Government ◽

10.1109/icee.2010.355 ◽

2010 ◽

Author(s):

Taofen Qiu ◽

Tianqi Yang

Keyword(s):

Information Extraction ◽

Web Sites ◽

Automatic Information ◽

Automatic Information Extraction

Learning knowledge bases for information extraction from multiple text based Web sites

IEEE/WIC International Conference on Intelligent Agent Technology, 2003. IAT 2003. ◽

10.1109/iat.2003.1241057 ◽

2004 ◽

Author(s):

Xiaoying Gao ◽

Mengjie Zhang

Keyword(s):

Information Extraction ◽

Web Sites ◽

Knowledge Bases

An Adaptive Web Information Extraction Approach Based on STU-DOM Tree

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.397-400.1972 ◽

2013 ◽

Vol 397-400 ◽

pp. 1972-1978

Author(s):

Song Pu Wu ◽

Qing Wang

Keyword(s):

Information Extraction ◽

Web Sites ◽

Web Information Extraction ◽

Maintenance Costs ◽

Web Information ◽

Dom Tree ◽

Html Parser

An adaptive web information extraction approach is presented in this paper. Most traditional web information extraction approaches depend on the templates of web sites. If the templates change, the information extraction rules must be redesigned. To reduce maintenance costs and improve the adaptability of information extractors, an adaptive web information extraction approach is proposed based on the STU-DOM tree. The webpage is parsed into DOM trees using HTML Parser. The DOM trees are then filtered into STU-DOM trees to identify blocks that contain keywords of a certain topic. The proposed approach is applied to webpages, and the results show that it not only extracts information efficiently but is also independent of site structures.
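
The abstract above does not define the STU-DOM tree, so the following is only a minimal sketch of the general idea it describes: parse a page into a tree, then keep the innermost blocks whose text contains topic keywords. It uses Python's standard `html.parser` rather than the HTML Parser library named in the abstract, and the names `Node`, `TreeBuilder`, `parse`, `keyword_blocks`, and the `BLOCK_TAGS` set are assumptions made for illustration, not the authors' method.

```python
from html.parser import HTMLParser

# Assumption: which tags count as "blocks" is chosen here for illustration.
BLOCK_TAGS = {"div", "table", "ul", "section", "p"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM-like node: tag, parent, children, own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []

    def full_text(self):
        # Own text followed by the text of all descendants.
        parts = list(self.text)
        parts.extend(child.full_text() for child in self.children)
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.current.text.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def keyword_blocks(node, keywords):
    """Return the innermost block nodes whose text mentions any keyword."""
    hits = []
    for child in node.children:
        hits.extend(keyword_blocks(child, keywords))
    if not hits and node.tag in BLOCK_TAGS:
        text = node.full_text().lower()
        if any(k.lower() in text for k in keywords):
            hits.append(node)
    return hits
```

Because the filter looks only at block text and not at a site-specific template, the same keyword list can be reused across differently structured pages, which is the kind of structure independence the abstract claims.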

AUTOMATIC PATTERN CONSTRUCTION FOR WEB INFORMATION EXTRACTION

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s0218488504002928 ◽

2004 ◽

Vol 12 (04) ◽

pp. 447-470 ◽

Cited By ~ 2

Author(s):

XIAOYING GAO ◽

MENGJIE ZHANG ◽

PETER ANDREAE

Keyword(s):

Information Extraction ◽

Success Rate ◽

Web Sites ◽

Web Pages ◽

Web Information Extraction ◽

Web Information ◽

Significant Strength ◽

Extraction Pattern ◽

Data Elements ◽

Domain Independent

This paper describes a domain independent approach for automatically constructing information extraction patterns for semi-structured web pages. Given a randomly chosen page from a web site of similarly structured pages, the system identifies a region of the page that has a regular "tabular" structure, and then infers an extraction pattern that will match the "rows" of the region and identify the data elements. The approach was tested on three corpora containing a series of tabular web sites from different domains and achieved a success rate of at least 80%. A significant strength of the system is that it can infer extraction patterns from a single training page and does not require any manual labeling of the training page.
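
As a rough illustration of inferring a pattern that matches the "rows" of a tabular region, the sketch below generalizes each row into a token-class signature, takes the most frequent signature as the pattern, and turns it into a regular expression that captures the data elements. This is a simplified stand-in under stated assumptions, not the algorithm from the paper: the token classes (`NUM`, `WORD`, literal punctuation) and the function names `signature`, `infer_pattern`, `pattern_to_regex`, and `extract` are all hypothetical, and the region is assumed to have been located already and split into text rows.

```python
import re
from collections import Counter

# Numbers (with , or . groups), alphabetic words, or single punctuation marks.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|[A-Za-z]+|[^\sA-Za-z\d]")


def signature(row):
    """Generalize a row into a tuple of token classes and literal delimiters."""
    sig = []
    for tok in TOKEN.findall(row):
        if tok[0].isdigit():
            cls = "NUM"
        elif tok[0].isalpha():
            cls = "WORD"
        else:
            cls = tok  # punctuation is kept literally as a delimiter
        # Collapse runs of words so "Toyota Corolla" and "Ford" share a shape.
        if cls == "WORD" and sig and sig[-1] == "WORD":
            continue
        sig.append(cls)
    return tuple(sig)


def infer_pattern(rows):
    """Pick the most frequent row signature as the extraction pattern."""
    counts = Counter(signature(r) for r in rows)
    return counts.most_common(1)[0][0]


def pattern_to_regex(pattern):
    """Compile a signature into a regex with one group per data element."""
    parts = []
    for cls in pattern:
        if cls == "NUM":
            parts.append(r"(\d+(?:[.,]\d+)*)")
        elif cls == "WORD":
            parts.append(r"([A-Za-z]+(?:\s+[A-Za-z]+)*)")
        else:
            parts.append(re.escape(cls))
    return re.compile(r"\s*".join(parts))


def extract(rows, pattern):
    """Return the captured data elements of every row that fits the pattern."""
    regex = pattern_to_regex(pattern)
    records = []
    for row in rows:
        match = regex.fullmatch(row.strip())
        if match:
            records.append(match.groups())
    return records
```

For example, calling `infer_pattern` on the rows of a car listing and then `extract` with the result yields one tuple of data elements per matching row and skips rows, such as navigation text, that do not fit the dominant shape. No manual labeling is involved, which mirrors the property the abstract highlights.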
