web page classification
Recently Published Documents


Total documents: 217 (five years: 32)
H-index: 17 (five years: 2)

Webology ◽  
2021 ◽  
Vol 18 (2) ◽  
pp. 225-242
Author(s):  
Chaithra ◽  
Dr.G.M. Lingaraju ◽  
Dr.S. Jagannatha

Nowadays, the Internet contains a wide variety of online documents, which makes it difficult to find useful information about a given subject without also retrieving irrelevant pages. Web document and page recognition software is useful in a variety of fields, including news, medicine, fitness, research, and information technology. To enhance search capability, a large number of web page classification methods have been proposed, especially for news web pages. Furthermore, existing classification approaches seek to distinguish news web pages while still reducing the high dimensionality of the features derived from these pages. Due to the lack of automated classification methods, this paper focuses on the classification of news web pages based on their scarcity and importance. This work establishes different models for the identification and classification of web pages. The data sets used in this paper were collected from popular news websites; specifically, we used the BBC dataset, which has five predefined categories. First, the input source is preprocessed and errors are eliminated. Then features are extracted from the web page reviews using term frequency-inverse document frequency (tf-idf) vectorization. In this work, 2225 documents are represented by 15286 features, which are the tf-idf scores for different unigrams and bigrams. This representation is useful not only for the classification task but also for analyzing the dataset. Feature selection is done using the chi-squared test, which finds the terms that are most correlated with each of the categories. Finally, the web page is classified according to the chosen classifier. The results showed that list has obtained the highest percentage, which reflects its effectiveness in the classification of web pages.
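The tf-idf and chi-squared steps described in this abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the toy documents, labels, and function names are assumptions made for illustration, the tf-idf formula is one common smoothed variant, and the chi-squared statistic is the classical presence/absence form computed from a 2x2 table.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Return the lowercase unigrams and bigrams of one document."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, assumed variant)."""
    term_counts = [Counter(ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    vectors = []
    for counts in term_counts:
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def chi2_term_category(docs, labels, term, category):
    """Chi-squared statistic of the table (term present?) x (document in category?)."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in ngrams(doc)
        in_category = label == category
        if present and in_category:
            a += 1
        elif present:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

# Hypothetical toy corpus standing in for a labelled news dataset.
docs = ["stock market falls", "market shares rise",
        "team wins final match", "team loses match"]
labels = ["business", "business", "sport", "sport"]

vectors = tfidf(docs)
market_score = chi2_term_category(docs, labels, "market", "business")
falls_score = chi2_term_category(docs, labels, "falls", "business")
```

In this toy corpus "market" appears in every business document and in no sport document, so it scores higher than "falls", which appears in only one document. Ranking terms by this score per category is one way to pick the most correlated unigrams and bigrams.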


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Xiang Song ◽  
Yi Zhu ◽  
Xuemei Zeng ◽  
Xingshu Chen

Web page classification is critical for information retrieval. Most web page classification methods have two faults: (1) they need to analyze the entire web page, and (2) they do not pay enough attention to noise inside the web page. Both faults reduce efficiency and classification performance, especially when classifying a contaminated web page. To solve these problems, this paper proposes a denoising disposal algorithm. We choose the top-down method for hierarchical classification to improve prediction efficiency. The experimental results demonstrate that our method is about 7 times faster than the full-page method and achieves good classification results in most categories. The precision for all 7 parent categories is above 88% and is, on average, 24% higher than that of the other meta tag-based method.
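As a rough illustration of top-down hierarchical classification, the sketch below first picks a parent category and then runs only that parent's child classifier, so classifiers on other branches are never evaluated. The two-level taxonomy, the keyword lists, and the keyword-counting classifier are hypothetical stand-ins; the abstract does not say which classifiers or categories the authors used.

```python
def keyword_classifier(keywords_by_label):
    """Build a toy classifier that returns the label whose keywords occur most often."""
    def classify(text):
        tokens = text.lower().split()
        scores = {
            label: sum(tokens.count(keyword) for keyword in keywords)
            for label, keywords in keywords_by_label.items()
        }
        # On ties, max() keeps the first label in insertion order.
        return max(scores, key=scores.get)
    return classify

# Hypothetical parent-level classifier.
root = keyword_classifier({
    "sport": ["match", "player", "goal", "racket"],
    "tech": ["cpu", "software", "app", "chip"],
})

# Hypothetical child-level classifiers, one per parent category.
children = {
    "sport": keyword_classifier({
        "football": ["goal", "striker"],
        "tennis": ["serve", "racket"],
    }),
    "tech": keyword_classifier({
        "hardware": ["chip", "cpu"],
        "software": ["app", "compiler"],
    }),
}

def classify_top_down(text):
    """Classify at the parent level, then only within the chosen parent."""
    parent = root(text)
    child = children[parent](text)
    return parent, child

sport_result = classify_top_down("the striker scored a late goal in the match")
tech_result = classify_top_down("new chip doubles cpu speed")
```

The trade-off of this design is that a wrong decision at the parent level cannot be corrected further down the tree.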


Author(s):  
Amit Gupta ◽  
Rajesh Bhatia

Web page classification is decisive for information retrieval and management tasks and plays an important role in natural language processing (NLP) problems in web engineering. Traditional machine learning algorithms extract hand-picked features from web pages, whereas deep learning algorithms learn features as the network goes deeper. Pre-trained models such as BERT have attained remarkable results in text classification and continue to show state-of-the-art results. Knowledge Graphs can provide rich structured factual information for better language modelling and representation. In this study, we propose an ensemble Knowledge Based Deep Inception (KBDI) approach for web page classification. It learns bidirectional contextual representations using pre-trained BERT incorporating Knowledge Graph embeddings, and fine-tunes on the target task by applying a Deep Inception network that utilizes parallel multi-scale semantics. The proposed ensemble evaluates the efficacy of fusing domain-specific knowledge embeddings with the pre-trained BERT model. Experimental results show that the proposed BERT-fused KBDI model outperforms benchmark baselines and achieves better performance than other conventional approaches evaluated on web page classification datasets.


2021 ◽  
Vol 9 (4) ◽  
pp. 963-973
Author(s):  
Suleyman Suleymanzade ◽  
Fargana Abdullayeva

The quality of the web page classification process has a huge impact on information retrieval systems. In this paper, we propose combining the results of text and image data classifiers to get an accurate representation of web pages. To collect and analyse the data, we created a complex classifier system with a data miner, a text classifier, and an aggregator. Image and text data classification is performed by deep learning models. To represent a common view of the web pages, we propose three aggregation techniques that combine the data from the classifiers.
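The abstract does not name its three aggregation techniques. Purely as an illustration of how per-class outputs of a text classifier and an image classifier might be combined, the sketch below shows three common late-fusion rules (mean, weighted mean, and normalized maximum). These rules, the class names, and the probabilities are all assumptions and should not be read as the authors' method.

```python
def mean_fusion(text_probs, image_probs):
    """Average the two classifiers' probabilities per class."""
    return {c: (text_probs[c] + image_probs[c]) / 2 for c in text_probs}

def weighted_fusion(text_probs, image_probs, text_weight=0.7):
    """Weighted average; text_weight is a hypothetical tuning parameter."""
    return {
        c: text_weight * text_probs[c] + (1 - text_weight) * image_probs[c]
        for c in text_probs
    }

def max_fusion(text_probs, image_probs):
    """Take the larger probability per class, then renormalize to sum to 1."""
    fused = {c: max(text_probs[c], image_probs[c]) for c in text_probs}
    total = sum(fused.values())
    return {c: value / total for c, value in fused.items()}

def predict(probs):
    """Return the class with the highest fused probability."""
    return max(probs, key=probs.get)

# Hypothetical classifier outputs for a single web page.
text_probs = {"news": 0.6, "shop": 0.3, "blog": 0.1}
image_probs = {"news": 0.2, "shop": 0.7, "blog": 0.1}

mean_label = predict(mean_fusion(text_probs, image_probs))
weighted_label = predict(weighted_fusion(text_probs, image_probs))
max_label = predict(max_fusion(text_probs, image_probs))
```

With these toy numbers the rules disagree: the mean and maximum rules follow the confident image classifier, while the text-weighted rule follows the text classifier, which is why the choice of aggregation rule matters.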


2021 ◽  
Author(s):  
Deeksha Deeksha ◽  
Rajesh Bhatia ◽  
Shikhar Bhardwaj ◽  
Manish Kumar ◽  
Kashish Bhatia ◽  
...  
