A Web Text Extraction Method Based on Regular Expressions and Text Density

Author(s):  
Fayun Li

2016 ◽  
Vol 28 (7) ◽  
pp. 1944-1944
Author(s):  
Alberto Bartoli ◽  
Andrea De Lorenzo ◽  
Eric Medvet ◽  
Fabiano Tarlao


2013 ◽  
Vol 774-776 ◽  
pp. 1802-1806
Author(s):  
Zhi Ming Zhang ◽  
Shuai Shuai Huang ◽  
Ping Li

With the rapid development of the Internet and the surge in the amount of information online, how to accurately and quickly obtain the information users actually need, such as titles, links, and pictures, has become a hot topic. This paper proposes a fast web information extraction method based on an HTML parser and validates it by extracting commodity information from an e-commerce website. The results show that the proposed method is more accurate than an extraction method based on regular expressions, and that the extraction time is greatly shortened.
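As a rough illustration of parser-based extraction of the kind this abstract describes, the sketch below collects a page's title, links, and image sources with Python's standard `html.parser` module. It is not the authors' implementation: the class name `PageInfoParser`, the helper `extract_page_info`, and the sample markup are all hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Hypothetical helper: collects the title, link targets, and image sources."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is kept only while the parser is inside <title>...</title>.
        if self._in_title:
            self.title += data


def extract_page_info(html):
    """Parses an HTML string and returns the collected fields as a dict."""
    parser = PageInfoParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "links": parser.links,
        "images": parser.images,
    }


# Made-up sample page, used only to exercise the sketch.
SAMPLE = """<html><head><title> Demo Shop </title></head>
<body><a href="/item/1">Item 1</a><img src="/img/1.png" alt="one">
<a href='/item/2'>Item 2</a></body></html>"""

info = extract_page_info(SAMPLE)
print(info)
```

Because the parser tokenizes tags and attributes itself, the single-quoted and double-quoted `href` values in the sample are handled the same way, a variation that a hand-written regular expression would have to cover explicitly.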



Author(s):  
Yan Song ◽  
Anan Liu ◽  
Lin Pang ◽  
Shouxun Lin ◽  
Yongdong Zhang ◽  
...  


2014 ◽  
Vol 989-994 ◽  
pp. 3768-3772
Author(s):  
Xuan Qi Chen ◽  
Biao He ◽  
Guo Cheng Wang ◽  
Yao Xin Li

This paper presents a new method for effective text extraction using mathematical morphology. First, the document is segmented into several parts based on its layout. Then, every part is dilated into large connected regions, whose largest skeleton is extracted and serves as a structuring element (SE). Finally, a proposed region-concatenated operation with the SE is applied, and its result can serve as the input to a subsequent OCR system. Experimentally, the proposed method is robust to noise, text orientation, font style and size, language, and layout.
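The dilation step mentioned in this abstract can be illustrated with a small pure-Python sketch: separate strokes in a binary image are dilated until they merge into one connected region. This covers only that one step, under assumed choices (a square structuring element and 4-connectivity); the skeleton extraction and the region-concatenated operation from the paper are not reproduced, and the function names and toy image are hypothetical.

```python
def dilate(image, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element (assumed shape)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c]:
                # Every foreground pixel paints its whole neighborhood.
                for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                        out[rr][cc] = 1
    return out


def count_regions(image):
    """Counts 4-connected foreground regions with an iterative flood fill."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and not seen[r][c]:
                regions += 1
                seen[r][c] = True
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return regions


# Toy binary image: three vertical strokes standing in for separate characters.
PAGE = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

before = count_regions(PAGE)
after = count_regions(dilate(PAGE, radius=1))
print(before, after)
```

The strokes start as three separate regions and form a single region after one dilation, which is the effect the abstract relies on when it turns each layout part into large connected regions.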



2005 ◽  
Vol 36 (9) ◽  
pp. 87-96
Author(s):  
Osamu Hori ◽  
Takeshi Mita

