Web Content Extraction by Integrating Textual and Visual Importance of Web Pages

In this chapter, we study the problem of extracting company acquisition relations from large collections of web pages and propose a novel algorithm for the task. The algorithm combines the tense of web content with a classification of semantic strength. It first determines the tense of each sentence in a web page using a CRF model. The tense is then used in sentence classification to evaluate how strongly each candidate sentence expresses a company acquisition relation. Finally, we rank the candidate acquisition relations and return the top-k. We run experiments on 6144 pages crawled through Google and measure the performance of the algorithm under several metrics. The experimental results show that the algorithm is effective both in determining sentence tense and in extracting company acquisition relations.

Web content extraction based on maximum continuous sum of text density

2016 International Conference on Asian Language Processing (IALP) ◽ 10.1109/ialp.2016.7875988 ◽ 2016 ◽ Cited By ~ 1

Author(s): Kai Sun ◽ Miao Li ◽ Jinhua Du ◽ Lei Chen ◽ Zhengxin Yang ◽ ...

Keyword(s): Web Content ◽ Content Extraction ◽ Text Density

Enriching the trustworthiness of health-related web pages

Health Informatics Journal ◽ 10.1177/1460458211405006 ◽ 2011 ◽ Vol 17 (2) ◽ pp. 116-126 ◽ Cited By ~ 4

Author(s): Arnaud Gaudinat ◽ Sarah Cruchet ◽ Celia Boyer ◽ Pravir Chrawdhry

Keyword(s): Resource Description Framework ◽ Quality Criteria ◽ Web Pages ◽ Web Content ◽ Quality Model ◽ Dublin Core ◽ Health Related ◽ Description Framework ◽ Resource Description ◽ Element Set

We present an experimental mechanism for enriching web content with quality metadata. This mechanism is based on a simple and well-known initiative in the field of the health-related web, the HONcode. The Resource Description Framework (RDF) format and the Dublin Core Metadata Element Set were used to formalize these metadata. The model of trust proposed is based on a quality model for health-related web pages that has been tested in practice over a period of thirteen years. Our model has been explored in the context of a project to develop a research tool that automatically detects the occurrence of quality criteria in health-related web pages.
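As a rough illustration of what quality metadata expressed in RDF with Dublin Core elements might look like, the following minimal sketch builds an RDF/XML description of a page using only the Python standard library. The `rdf` and `dc` namespace URIs are the standard ones; the `q` namespace, the `criterion` property, the function name `describe_page`, and the criterion labels are hypothetical placeholders and are not taken from the paper or from the HONcode.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
QUAL = "http://example.org/quality#"  # hypothetical vocabulary, for illustration only


def describe_page(url, title, publisher, criteria):
    """Return an RDF/XML string describing one web page.

    Dublin Core elements carry the general metadata; each detected quality
    criterion is attached with a hypothetical q:criterion property.
    """
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dc", DC)
    ET.register_namespace("q", QUAL)

    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": url})
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    ET.SubElement(desc, f"{{{DC}}}publisher").text = publisher
    for criterion in criteria:
        ET.SubElement(desc, f"{{{QUAL}}}criterion").text = criterion
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(describe_page(
        "http://example.org/health/page",
        "Example health page",
        "Example Clinic",
        ["authority", "attribution"],  # placeholder criterion labels
    ))
```

Keeping the general description in Dublin Core and the quality criteria in a separate vocabulary is one possible design; the actual schema used by the authors may differ.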

Opinion Content Extraction from Web Pages Using Embedded Semantic Term Tree Kernels

Proceedings of International Conference on Computational Intelligence and Data Engineering - Lecture Notes on Data Engineering and Communications Technologies ◽ 10.1007/978-981-10-6319-0_29 ◽ 2017 ◽ pp. 345-358

Author(s): Veerappa B. Pagi ◽ Ramesh S. Wadawadagi

Keyword(s): Web Pages ◽ Content Extraction

Extracting Top-k Company Acquisition Relations From the Web

International Journal on Semantic Web and Information Systems ◽ 10.4018/ijswis.2017100102 ◽ 2017 ◽ Vol 13 (4) ◽ pp. 27-41 ◽ Cited By ~ 1

Author(s): Jie Zhao ◽ Jianfei Wang ◽ Jia Yang ◽ Peiquan Jin

Keyword(s): Rapid Development ◽ Relation Extraction ◽ Experimental Results ◽ Competitive Intelligence ◽ Web Pages ◽ Web Content ◽ Web Page ◽ Competitive Strategies ◽ The Web ◽ Novel Algorithm

Company acquisition relations reflect a company's development intent and competitive strategies, and are an important type of enterprise competitive intelligence. Traditionally, competitive intelligence was gathered mainly from newspapers, internal reports, and similar sources, but the rapid development of the Web offers a new way to extract company acquisition relations. In this paper, the authors study the problem of extracting company acquisition relations from large collections of Web pages and propose a novel algorithm for the task. The algorithm combines the tense of Web content with a classification of semantic strength. It first determines the tense of each sentence in a Web page; the tense is then used in sentence classification to evaluate how strongly each candidate sentence expresses a company acquisition relation. The authors then rank the candidate acquisition relations and return the top-k. They run experiments on 6144 pages crawled through Google and measure the performance of the algorithm under several metrics. The experimental results show that the algorithm is effective both in determining sentence tense and in extracting company acquisition relations.
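The pipeline described in this abstract (tense detection, strength scoring, ranking, top-k) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the authors' method: the cue lists stand in for a trained tense classifier, the `STRENGTH` weights and the `RELATION` regular expression are hypothetical, and company names are assumed to be single capitalized words.

```python
import heapq
import re
from collections import defaultdict

# Hypothetical cue lists standing in for a trained tense classifier.
FUTURE_CUES = ("will ", "plans to", "intends to")
PAST_CUES = ("acquired", "bought")

# Hypothetical weights: a completed acquisition counts as stronger evidence.
STRENGTH = {"past": 1.0, "present": 0.6, "future": 0.3}

# Toy pattern for "<Buyer> [has|will|...] acquire/buy <Target>".
RELATION = re.compile(
    r"\b([A-Z][A-Za-z]+)\s+(?:(?:has|will|plans to|intends to)\s+)?"
    r"(?:acquired|acquires|acquire|bought|buys|buy)\s+([A-Z][A-Za-z]+)"
)


def sentence_tense(sentence):
    """Classify a sentence as 'future', 'past', or 'present' from cue words."""
    lowered = sentence.lower()
    if any(cue in lowered for cue in FUTURE_CUES):
        return "future"
    if any(cue in lowered for cue in PAST_CUES):
        return "past"
    return "present"


def top_k_acquisitions(sentences, k):
    """Score (buyer, target) pairs by tense-weighted evidence; return the k best."""
    scores = defaultdict(float)
    for sentence in sentences:
        match = RELATION.search(sentence)
        if match is None:
            continue  # sentence does not mention a candidate relation
        pair = (match.group(1), match.group(2))
        scores[pair] += STRENGTH[sentence_tense(sentence)]
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    pages = [
        "Alpha acquired Beta last year.",
        "Alpha has acquired Beta.",
        "Gamma will acquire Delta.",
        "Nothing relevant here.",
    ]
    for pair, score in top_k_acquisitions(pages, 2):
        print(pair, score)
```

In this sketch, repeated past-tense mentions of the same pair accumulate score, so a relation reported as completed on several pages outranks one that is only announced.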

Automatic Web Content Extraction for Generating Tag Clouds from Thai Web Sites

2011 IEEE 8th International Conference on e-Business Engineering ◽ 10.1109/icebe.2011.34 ◽ 2011 ◽ Cited By ~ 3

Author(s): Wigrai Thanadechteemapat ◽ Chun Che Fung

Keyword(s):

Web Sites ◽

Web Content ◽

Content Extraction ◽