Main Content Extraction from Web Pages

Author(s):  
Stanislas Morbieu ◽  
Guillaume Bruneval ◽  
Mohamed Lacarne ◽  
Mohamed Kone ◽  
Francois-Xavier Bois
2017 ◽  
Vol 11 (2) ◽  
pp. 39-48
Author(s):  
Qingtang Liu ◽  
Mingbo Shao ◽  
Linjing Wu ◽  
Gang Zhao ◽  
Guilin Fan ◽  
...  

2018 ◽  
Vol 3 (1) ◽  
pp. 34
Author(s):  
Sanjay K. Dwivedi ◽  
Chandrakala Arya

2021 ◽  
Vol 13 (6) ◽  
pp. 1-13
Author(s):  
Guangxuan Chen ◽  
Guangxiao Chen ◽  
Lei Zhang ◽  
Qiang Liu

To address repeated acquisition, data redundancy, and low efficiency in website forensics, this paper proposes an incremental acquisition method oriented to dynamic websites. The method achieves incremental collection on dynamically updated websites by acquiring and parsing web pages, deduplicating URLs, denoising pages, extracting page content, and hashing. Experiments show that the algorithm achieves relatively high acquisition precision and recall, and can be combined with other data to perform effective digital forensics on dynamically updated real-time websites.
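
The abstract names the pipeline stages but not their details, so the following is only a minimal sketch of how URL deduplication, denoising, content extraction, and hashing could fit together. It is not the authors' algorithm. Every name here (`normalize_url`, `extract_text`, `content_hash`, `IncrementalStore`) is hypothetical, the set of "noise" tags is an assumption, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivially different spellings deduplicate."""
    url, _fragment = urldefrag(url.strip())  # drop "#fragment"
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping tags assumed to hold page noise."""

    NOISE = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every noise element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Denoise a page and return its remaining text content."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def content_hash(html):
    """Hash the denoised text, so changes in noise do not alter the digest."""
    return hashlib.sha256(extract_text(html).encode("utf-8")).hexdigest()


class IncrementalStore:
    """Remember the last content hash seen for each normalized URL."""

    def __init__(self):
        self.seen = {}  # normalized URL -> content hash

    def should_acquire(self, url, html):
        """Return True only if the page is new or its content has changed."""
        key = normalize_url(url)
        digest = content_hash(html)
        if self.seen.get(key) == digest:
            return False
        self.seen[key] = digest
        return True
```

In this sketch a page is re-acquired only when the hash of its denoised text differs from the stored one, so changes confined to scripts or navigation do not trigger a new acquisition. A real forensic collector would also need persistent storage and a more robust denoising step than a fixed tag list.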

