Semi-supervised Graph-based Genre Classification for Web Pages

Author(s):  
Noushin Rezapour Asheghi ◽  
Katja Markert ◽  
Serge Sharoff
2019 ◽  
Vol 8 (2S11) ◽  
pp. 2011-2016

With the rapid growth in the number of web pages, it is very hard to find the desired information easily and quickly among the large number of pages retrieved by a search engine. There is an increasing need for automatic classification techniques with higher classification accuracy. In some situations it is important to have an efficient and reliable classification of a web page from the information contained in the URL (Uniform Resource Locator) only, without the need to visit the page itself. For various reasons, we want to know whether a URL is useful without having to open and read the page. Fetching the web page content and sorting through it to determine the genre of the page is very time-consuming and requires the user to know the structure of the page to be classified. To avoid this time-consuming process, we propose an alternative method that determines the genre of an entered URL based on the URL itself and the page metadata, i.e., the description and keywords used in the website, along with the title of the site. This approach relies not only on the URL but also on content from the web application. The proposed system can be evaluated using several available datasets.
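The kind of input this abstract describes (URL tokens plus title, description and keywords) can be illustrated with a minimal sketch. It uses only the Python standard library. The names `url_tokens`, `MetaExtractor` and `page_features` are hypothetical and do not come from the paper, and the tokenization rule is an assumption rather than the authors' method.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host and path of a URL into lower-case alphanumeric tokens."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc} {parsed.path}".lower()
    return [tok for tok in re.split(r"[^a-z0-9]+", raw) if tok]


class MetaExtractor(HTMLParser):
    """Collect the <title> text and the description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Attribute values may be None, so guard before lower-casing.
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_features(url, html):
    """Bundle URL tokens and metadata into one feature record (assumed layout)."""
    extractor = MetaExtractor()
    extractor.feed(html)
    return {
        "url_tokens": url_tokens(url),
        "title": extractor.title.strip(),
        "description": extractor.meta.get("description", ""),
        "keywords": extractor.meta.get("keywords", ""),
    }
```

A record produced by `page_features` could then be passed to any text classifier. The abstract does not name the classifier used, so none is assumed here.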


2016 ◽  
Vol 15 (01) ◽  
pp. 1650008 ◽  
Author(s):  
Chaker Jebari

This paper proposes an adaptive centroid-based classifier (ACC) for multi-label classification of web pages. Using a multi-genre training dataset, ACC constructs a centroid for each genre. To deal with the rapid evolution of web genres, ACC implements an adaptive classification method where web pages are classified one by one. For each web page, ACC calculates its similarity with all genre centroids. Based on this similarity, ACC either adjusts the genre centroid by including the new web page or discards it. A web page is a complex object that contains different sections belonging to different genres. To handle this complexity, ACC implements a multi-label classification where a web page can be assigned to multiple genres at the same time. To improve the performance of genre classification, we propose to aggregate the classifications produced using character n-grams extracted from the URL, title, headings and anchors. Experiments conducted using a known multi-label dataset show that ACC outperforms many other multi-label classifiers and has the lowest computational complexity.
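The adaptive, multi-label centroid idea in this abstract can be sketched roughly as follows. This is not the authors' ACC implementation. The cosine similarity measure, the fixed `threshold`, the trigram size and the class name `AdaptiveCentroidSketch` are all assumptions made for illustration. The aggregation over URL, title, headings and anchors is left out.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AdaptiveCentroidSketch:
    """One centroid per genre; centroids absorb pages they accept."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = {}         # genre -> summed n-gram counts

    def fit(self, pages):
        """pages: iterable of (text, set_of_genres) training pairs."""
        for text, genres in pages:
            grams = char_ngrams(text)
            for genre in genres:
                self.centroids.setdefault(genre, Counter()).update(grams)

    def classify(self, text):
        """Return every genre whose centroid is similar enough to the page."""
        grams = char_ngrams(text)
        labels = set()
        for genre, centroid in self.centroids.items():
            if cosine(grams, centroid) >= self.threshold:
                labels.add(genre)
                centroid.update(grams)  # adapt: include the new page
            # otherwise the page is discarded for this genre
        return labels
```

Summed counts stand in for the centroid here because cosine similarity ignores vector length, so a sum and a mean give the same score. A page can match several centroids at once, which is what makes the output multi-label.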

