Web Content Extraction by Integrating Textual and Visual Importance of Web Pages

2014 ◽  
Vol 91 (3) ◽  
pp. 20-24
Author(s):  
K. Nethra ◽  
J. Anitha
Author(s):  
Stanislas Morbieu ◽  
Guillaume Bruneval ◽  
Mohamed Lacarne ◽  
Mohamed Kone ◽  
Francois-Xavier Bois

2017 ◽  
Vol 11 (2) ◽  
pp. 39-48 ◽  
Author(s):  
Qingtang Liu ◽  
Mingbo Shao ◽  
Linjing Wu ◽  
Gang Zhao ◽  
Guilin Fan ◽  
...  

Author(s):  
Jie Zhao ◽  
Jianfei Wang ◽  
Jia Yang ◽  
Peiquan Jin

In this chapter, we study the problem of extracting company acquisition relations from large numbers of webpages and propose a novel algorithm for company acquisition relation extraction. Our algorithm considers the tense of Web content and a semantic-strength classification when extracting company acquisition relations from webpages. It first determines the tense of each sentence in a webpage using a CRF model. The tense is then used in sentence classification to evaluate the semantic strength of the candidate sentences in describing a company acquisition relation. After that, we rank the candidate acquisition relations and return the top-k relations. We run experiments on 6144 pages crawled through Google and measure the performance of our algorithm under different metrics. The experimental results show that our algorithm is effective in determining both the tense of sentences and the company acquisition relations.
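The three stages described in this abstract (tense detection, semantic-strength scoring, top-k ranking) can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword-based `guess_tense` function is a toy stand-in for the CRF model, and the `Candidate` type, the `TENSE_WEIGHT` values, and the example sentences are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Candidate(NamedTuple):
    """A candidate acquisition relation and the sentence it came from (hypothetical type)."""
    acquirer: str
    target: str
    sentence: str


# Assumed weights: how strongly a sentence of each tense supports the relation.
TENSE_WEIGHT = {"past": 1.0, "present": 0.6, "future": 0.3}

FUTURE_CUES = {"will", "plans", "intends"}
PAST_CUES = {"acquired", "bought", "purchased", "completed"}


def guess_tense(sentence: str) -> str:
    """Toy keyword rule standing in for the CRF tense tagger named in the abstract."""
    words = set(sentence.lower().split())
    if words & FUTURE_CUES:
        return "future"
    if words & PAST_CUES:
        return "past"
    return "present"


def rank_relations(candidates, k):
    """Sum tense-based scores per (acquirer, target) pair and return the top k."""
    scores = defaultdict(float)
    for c in candidates:
        scores[(c.acquirer, c.target)] += TENSE_WEIGHT[guess_tense(c.sentence)]
    # Sort by descending score; break ties by pair name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


candidates = [
    Candidate("A", "B", "A acquired B last year"),
    Candidate("A", "B", "A completed its purchase of B"),
    Candidate("C", "D", "C will acquire D next year"),
    Candidate("E", "F", "E is in talks to buy F"),
]
top = rank_relations(candidates, 2)
print(top)
```

In this sketch, a pair mentioned in several past-tense sentences outranks one mentioned only in a future-tense sentence, which mirrors the idea that tense signals how strongly a sentence supports an acquisition relation.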


2011 ◽  
Vol 17 (2) ◽  
pp. 116-126 ◽  
Author(s):  
Arnaud Gaudinat ◽  
Sarah Cruchet ◽  
Celia Boyer ◽  
Pravir Chrawdhry

We present an experimental mechanism for enriching web content with quality metadata. This mechanism is based on a simple and well-known initiative in the field of the health-related web, the HONcode. The Resource Description Framework (RDF) format and the Dublin Core Metadata Element Set were used to formalize these metadata. The model of trust proposed is based on a quality model for health-related web pages that has been tested in practice over a period of thirteen years. Our model has been explored in the context of a project to develop a research tool that automatically detects the occurrence of quality criteria in health-related web pages.
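As a rough illustration of what quality metadata expressed in RDF with Dublin Core elements might look like, the sketch below builds a small RDF/XML description using Python's standard library. The `rdf` and `dc` namespaces are the standard ones; the `ex` namespace, its `qualityCriterion` property, the page URL, and all values are hypothetical placeholders and are not taken from the HONcode or from the authors' model.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/quality#"  # hypothetical vocabulary, for illustration only

ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)
ET.register_namespace("ex", EX)


def describe_page(url, title, date, criteria):
    """Return an RDF/XML string describing one web page and its quality criteria."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": url})
    # Dublin Core elements for basic descriptive metadata.
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    ET.SubElement(desc, f"{{{DC}}}date").text = date
    # One hypothetical property per detected quality criterion.
    for criterion in criteria:
        ET.SubElement(desc, f"{{{EX}}}qualityCriterion").text = criterion
    return ET.tostring(root, encoding="unicode")


xml_text = describe_page(
    "http://example.org/page",
    "Example health page",
    "2011-01-01",
    ["example-criterion"],
)
print(xml_text)
```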


Author(s):  
Jie Zhao ◽  
Jianfei Wang ◽  
Jia Yang ◽  
Peiquan Jin

Company acquisition relations reflect a company's development intent and competitive strategies and are an important type of enterprise competitive intelligence. Traditionally, competitive intelligence has been gathered mainly from newspapers, internal reports, and similar sources, but the rapid development of the Web offers a new way to extract company acquisition relations. In this paper, the authors study the problem of extracting company acquisition relations from large numbers of Web pages and propose a novel algorithm for company acquisition relation extraction. The authors' algorithm considers the tense of Web content and a semantic-strength classification when extracting company acquisition relations from Web pages. It first determines the tense of each sentence in a Web page, which is then used in sentence classification to evaluate the semantic strength of the candidate sentences in describing a company acquisition relation. After that, the authors rank the candidate acquisition relations and return the top-k relations. They run experiments on 6144 pages crawled through Google and measure the performance of their algorithm under different metrics. The experimental results show that the algorithm is effective in determining both the tense of sentences and the company acquisition relations.

