Web Page Classification: Latest Research Papers

Nowadays, the Internet contains a wide variety of online documents, which makes it difficult to find useful information on a given subject without also retrieving irrelevant pages. Web document and page recognition software is useful in a variety of fields, including news, medicine, fitness, research, and information technology. To enhance search capability, a large number of web page classification methods have been proposed, especially for news web pages. Furthermore, existing classification approaches seek to distinguish news web pages while still reducing the high dimensionality of the features derived from these pages. Due to the lack of automated classification methods, this paper focuses on the classification of news web pages based on their scarcity and importance. This work establishes different models for the identification and classification of web pages. The data sets used in this paper were collected from popular news websites. In this research we used the BBC dataset, which has five predefined categories. First, the input source is preprocessed and errors are eliminated. Then features are extracted from the web page text using term frequency-inverse document frequency (tf-idf) vectorization. In this work, 2,225 documents are represented by 15,286 features, each of which is the tf-idf score of a unigram or bigram. This representation is useful not only for the classification task but also for analyzing the dataset. Feature selection is done with the chi-squared test, which finds the terms that are most correlated with each category. Finally, the web page is classified by the chosen classifier. The results showed that list obtained the highest percentage, which reflects its effectiveness in classifying web pages.
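
The tf-idf and chi-squared steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the four toy documents, the whitespace tokenizer, the smoothed idf formula, and the presence-based 2x2 chi-squared table are all assumptions, and the paper may well have relied on a library instead.

```python
import math
from collections import Counter

def ngrams(text):
    """Return the lowercase unigrams and bigrams of a whitespace-tokenized text."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

def tfidf(docs):
    """Map each document to {term: tf * idf}, with idf = ln((1 + N) / (1 + df)) + 1."""
    counts = [Counter(ngrams(d)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency per term
    n = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({t: (k / total) * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, k in c.items()})
    return vectors

def chi2_scores(vectors, labels, category):
    """Chi-squared score of each term against one category (one-vs-rest).

    Uses the 2x2 table of term presence vs. category membership and keeps
    only terms that are positively associated with the category.
    """
    n = len(vectors)
    in_cat = sum(1 for y in labels if y == category)
    scores = {}
    for t in {t for v in vectors for t in v}:
        a = sum(1 for v, y in zip(vectors, labels) if t in v and y == category)
        b = sum(1 for v, y in zip(vectors, labels) if t in v and y != category)
        c = in_cat - a          # term absent, document in category
        d = (n - in_cat) - b    # term absent, document not in category
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom and a * d > b * c:
            scores[t] = n * (a * d - b * c) ** 2 / denom
    return scores

# Toy data (hypothetical, not the BBC dataset).
docs = ["stocks fall as markets slide", "markets rally on bank profits",
        "team wins the cup final", "striker scores in cup final"]
labels = ["business", "business", "sport", "sport"]

vectors = tfidf(docs)
business = chi2_scores(vectors, labels, "business")
top_business = max(business, key=business.get)
print(top_business, business[top_business])
```

On real data the same ranking would be computed once per category, and the highest-scoring unigrams and bigrams kept as features.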

Hierarchical Contaminated Web Page Classification Based on Meta Tag Denoising Disposal

Security and Communication Networks ◽ 10.1155/2021/2470897 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-11

Author(s): Xiang Song ◽ Yi Zhu ◽ Xuemei Zeng ◽ Xingshu Chen

Keyword(s): Hierarchical Classification ◽ Classification Performance ◽ Classification Methods ◽ Web Page ◽ Web Page Classification ◽ Full Page ◽ Good Classification ◽ Prediction Efficiency ◽ The Web ◽ Page Classification

Web page classification is critical for information retrieval. Most web page classification methods have two faults: (1) they need to analyze the whole web page, and (2) they pay too little attention to noise inside the web page. Both reduce efficiency and classification performance, especially when classifying contaminated web pages. To solve these problems, this paper proposes a denoising disposal algorithm. We choose the top-down method for hierarchical classification to improve prediction efficiency. The experimental results demonstrate that our method is about 7 times faster than the full-page method and achieves good classification results in most categories. The precision for all 7 parent categories is above 88% and is, on average, 24% higher than that of the other meta tag-based method.
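
A rough sketch of top-down hierarchical classification over meta tags is given below. The abstract does not describe the denoising algorithm or the classifiers, so this example leaves denoising out and uses keyword overlap as a stand-in classifier; `HIERARCHY`, its keyword sets, and the sample page are hypothetical. It only illustrates the two ideas named above: reading the meta tags instead of the full page, and choosing a parent category before a child category.

```python
import re
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect the <title> text and the content of keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("keywords", "description"):
            self.parts.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)

# Hypothetical two-level hierarchy; real systems would use trained classifiers.
HIERARCHY = {
    "sports": {"keywords": {"football", "league", "match", "tennis"},
               "children": {"football": {"football", "league", "goal"},
                            "tennis": {"tennis", "serve", "court"}}},
    "finance": {"keywords": {"stocks", "bank", "market", "loan"},
                "children": {"banking": {"bank", "loan"},
                             "investing": {"stocks", "market"}}},
}

def best_label(tokens, candidates):
    """Return the label whose keyword set overlaps the tokens most, or None."""
    overlap = {label: len(tokens & keywords) for label, keywords in candidates.items()}
    label = max(overlap, key=overlap.get)
    return label if overlap[label] > 0 else None

def classify_top_down(html):
    """Pick a parent category first, then search only that parent's children."""
    parser = MetaTagParser()
    parser.feed(html)
    tokens = set(re.findall(r"[a-z]+", " ".join(parser.parts).lower()))
    parent = best_label(tokens, {p: node["keywords"] for p, node in HIERARCHY.items()})
    if parent is None:
        return None, None
    return parent, best_label(tokens, HIERARCHY[parent]["children"])

page = ('<html><head><title>Premier League match report</title>'
        '<meta name="keywords" content="football, goal, league"></head>'
        '<body>Buy stocks now! Bank loan offers!</body></html>')
result = classify_top_down(page)
print(result)
```

Because only the title and meta tags are read, the unrelated finance text in the body never reaches the classifier, and the child search is restricted to the children of the chosen parent.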

Knowledge Based Deep Inception Model for Web Page Classification

Journal of Web Engineering ◽ 10.13052/jwe1540-9589.2075 ◽ 2021

Author(s): Amit Gupta ◽ Rajesh Bhatia

Keyword(s): Language Processing ◽ Machine Learning Algorithms ◽ Web Page ◽ Web Engineering ◽ Web Page Classification ◽ Domain Specific ◽ Multi Scale ◽ Knowledge Based ◽ Domain Specific Knowledge ◽ Page Classification

Web page classification is central to information retrieval and management tasks and plays an important role in natural language processing (NLP) problems in web engineering. Traditional machine learning algorithms extract the desired features from web pages, whereas deep learning algorithms learn features as the network goes deeper. Pre-trained models such as BERT have achieved remarkable results in text classification and continue to show state-of-the-art results. Knowledge graphs can provide rich, structured factual information for better language modelling and representation. In this study, we propose an ensemble Knowledge Based Deep Inception (KBDI) approach for web page classification. It learns bidirectional contextual representations using pre-trained BERT incorporating knowledge graph embeddings, and fine-tunes on the target task by applying a Deep Inception network that utilizes parallel multi-scale semantics. The proposed ensemble evaluates the efficacy of fusing domain-specific knowledge embeddings with the pre-trained BERT model. Experimental results show that the proposed BERT-fused KBDI model outperforms benchmark baselines and achieves better performance than other conventional approaches evaluated on web page classification datasets.
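
The data flow of "fuse, then process at several scales in parallel" can be illustrated with a toy example. This is not the KBDI model: the vectors below are made up rather than produced by BERT or a knowledge graph, fusion is assumed to be plain concatenation, and fixed averaging windows stand in for the learned convolution filters of a real inception block. All function names are hypothetical.

```python
def fuse(token_vecs, knowledge_vecs):
    """Concatenate each token's contextual vector with its knowledge vector."""
    return [t + k for t, k in zip(token_vecs, knowledge_vecs)]

def branch(seq, width):
    """One scale: average each window of `width` vectors, then max-pool over time."""
    dim = len(seq[0])
    windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    averaged = [[sum(v[j] for v in w) / width for j in range(dim)] for w in windows]
    return [max(a[j] for a in averaged) for j in range(dim)]

def multi_scale_features(seq, widths=(1, 2, 3)):
    """Run one branch per window width and concatenate the branch outputs."""
    features = []
    for width in widths:
        features.extend(branch(seq, width))
    return features

# Made-up stand-ins for contextual token vectors and knowledge-graph vectors.
token_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
knowledge_vecs = [[0.5], [0.0], [1.0]]

fused = fuse(token_vecs, knowledge_vecs)
features = multi_scale_features(fused)
print(len(features), features[:6])
```

In a trained model the concatenated feature vector would feed a classification layer, and each branch would learn its own filters instead of averaging.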

Data pre-processing of website browsing record: An initial step for web page classification

10.1109/icsecs52883.2021.00129 ◽ 2021

Author(s): Siti Hawa Apandi ◽ Jamaludin Sallim ◽ Rozlina Mohamed

Keyword(s): Initial Step ◽ Web Page ◽ Web Page Classification ◽ Page Classification

Web Page Classification Using Convolutional Neural Network (CNN) Towards Eliminating Internet Addiction

10.1109/icsecs52883.2021.00034 ◽ 2021

Author(s): Siti Hawa Apandi ◽ Jamaludin Sallim ◽ Rozlina Mohamed ◽ Araby Madbouly

Keyword(s): Neural Network ◽ Convolutional Neural Network ◽ Internet Addiction ◽ Web Page ◽ Web Page Classification ◽ Page Classification

Full Content-based Web Page Classification Methods by using Deep Neural Networks

Statistics Optimization & Information Computing ◽ 10.19139/soic-2310-5070-1056 ◽ 2021 ◽ Vol 9 (4) ◽ pp. 963-973

Author(s): Suleyman Suleymanzade ◽ Fargana Abdullayeva

Keyword(s): Image Data ◽ Web Pages ◽ Web Page ◽ Common View ◽ Web Page Classification ◽ Aggregation Techniques ◽ Information Retrieval Systems ◽ Huge Impact ◽ The Web ◽ Page Classification

The quality of the web page classification process has a huge impact on information retrieval systems. In this paper, we propose combining the results of text and image data classifiers to obtain an accurate representation of web pages. To collect and analyse the data, we built a complex classifier system consisting of a data miner, a text classifier, and an aggregator. Image and text data are classified by deep learning models. To represent a common view of the web pages, we propose three aggregation techniques that combine the data from the classifiers.
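
The abstract does not name its three aggregation techniques, so the sketch below shows three common late-fusion rules purely as an illustration of how text and image classifier outputs might be combined. The function names, the 0.75 text weight, and the probability values are assumptions, not taken from the paper.

```python
def average(text_probs, image_probs):
    """Unweighted mean of the two class distributions."""
    return {c: (text_probs[c] + image_probs[c]) / 2 for c in text_probs}

def weighted(text_probs, image_probs, text_weight=0.75):
    """Weighted mean that trusts the text classifier more (weight is an assumption)."""
    return {c: text_weight * text_probs[c] + (1 - text_weight) * image_probs[c]
            for c in text_probs}

def max_confidence(text_probs, image_probs):
    """Keep the distribution of whichever classifier is more confident."""
    if max(text_probs.values()) >= max(image_probs.values()):
        return dict(text_probs)
    return dict(image_probs)

def predict(probs):
    """Return the class with the highest aggregated probability."""
    return max(probs, key=probs.get)

# Hypothetical per-class outputs of a text and an image classifier for one page.
text_probs = {"news": 0.75, "shopping": 0.25}
image_probs = {"news": 0.125, "shopping": 0.875}

for rule in (average, weighted, max_confidence):
    print(rule.__name__, predict(rule(text_probs, image_probs)))
```

With these inputs the rules disagree: the unweighted mean and the max-confidence rule follow the image classifier, while the text-weighted mean follows the text classifier, which is why the choice of aggregation rule matters.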
