Automatic Web Page Classification System with Improved Accuracy

Webology ◽  
2021 ◽  
Vol 18 (2) ◽  
pp. 225-242
Author(s):  
Chaithra ◽ 
Dr.G.M. Lingaraju ◽  
Dr.S. Jagannatha

Nowadays, the Internet contains a wide variety of online documents, which makes it difficult to find useful information about a given subject without also retrieving irrelevant pages. Web document and page recognition software is useful in a variety of fields, including news, medicine, fitness, research, and information technology. To enhance search capability, a large number of web page classification methods have been proposed, especially for news web pages. Furthermore, existing classification approaches seek to distinguish news web pages while also reducing the high dimensionality of the features derived from these pages. Due to the lack of automated classification methods, this paper focuses on the classification of news web pages based on their scarcity and importance. This work establishes different models for the identification and classification of web pages. The data sets used in this paper were collected from popular news websites. In this research we used the BBC dataset, which has five predefined categories. First, the input source is preprocessed and errors are eliminated. Then features are extracted based on the web page reviews using term frequency-inverse document frequency (tf-idf) vectorization. In this work, 2225 documents are represented by 15286 features, each of which is the tf-idf score of a unigram or bigram. This representation is used not only for the classification task but also to analyze the dataset. Feature selection is performed with the chi-squared test, which finds the terms most correlated with each category, and the features identified in this way are selected. Finally, the web page is classified by the chosen classifier. The results showed that list obtained the highest percentage, which reflects its effectiveness in classifying web pages.
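The feature pipeline described in this abstract (tf-idf scores over unigrams and bigrams, followed by a chi-squared test per category) can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the four toy documents, the whitespace tokenizer, the unsmoothed tf-idf formula, and the presence/absence form of the chi-squared statistic are all assumptions made for the example.

```python
import math
from collections import Counter

def ngrams(text):
    """Return the lowercase unigrams and bigrams of a text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def tfidf(docs):
    """Map each document to a {term: tf-idf score} dictionary (no smoothing)."""
    term_lists = [ngrams(d) for d in docs]
    doc_freq = Counter(t for terms in term_lists for t in set(terms))
    n = len(docs)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        vectors.append({
            t: (c / len(terms)) * math.log(n / doc_freq[t])
            for t, c in counts.items()
        })
    return vectors

def chi2(term, docs, labels, category):
    """Chi-squared statistic of term presence versus category membership."""
    a = b = c = d = 0  # cells of the 2x2 contingency table
    for doc, label in zip(docs, labels):
        present = term in ngrams(doc)
        if present and label == category:
            a += 1
        elif present:
            b += 1
        elif label == category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Toy corpus (hypothetical; the paper uses the BBC dataset).
docs = [
    "stocks fall as markets slide",
    "team wins the cup final",
    "markets rally as stocks rise",
    "coach praises team after final",
]
labels = ["business", "sport", "business", "sport"]

# Rank every unigram and bigram by its correlation with one category.
vocab = sorted({t for d in docs for t in ngrams(d)})
ranked = sorted(vocab, key=lambda t: chi2(t, docs, labels, "business"), reverse=True)
print(ranked[:5])
```

Terms that occur only in documents of one category receive the highest chi-squared score, which is the property the abstract relies on to find the terms most correlated with each category.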

2019 ◽  
Vol 16 (2) ◽  
pp. 384-388 ◽  
Author(s):  
K. S. Ramanujam ◽  
K. David

Web page classification is one of the significant research areas in the web mining domain. The enormous quantity of data on the web demands the development of effective and robust techniques for web mining tasks, including categorizing web pages based on data labels. Web mining also includes other tasks such as web crawling, analysis of web links, and contextual advertising. Existing machine learning and data mining techniques are used efficiently for various web mining processes, including the classification of web pages. Multiple-classifier techniques are among the most promising research areas in machine learning; they work by merging various classifiers that differ in base classifier and/or dataset distribution. In this way, several highly robust classification models are constructed. In this review paper, FA, PSO, ACO, GA, and IWT are compared to evaluate the best-fit algorithm for classifying web pages.
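As a rough illustration of merging several base classifiers, the sketch below combines three rule-based classifiers by majority vote. The three rules, the page dictionary layout, and the labels are hypothetical and are not taken from the review; majority voting is only one of many ways to merge classifiers.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted most often (ties go to the first label seen)."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical base classifiers, each looking at a different part of the page.
def by_url(page):
    return "sport" if "sport" in page["url"] else "news"

def by_title(page):
    return "sport" if "match" in page["title"].lower() else "news"

def by_body(page):
    return "sport" if "goal" in page["body"].lower() else "news"

page = {
    "url": "https://example.org/world/article-1",
    "title": "Match report",
    "body": "A late goal decided the game.",
}
votes = [clf(page) for clf in (by_url, by_title, by_body)]
label = majority_vote(votes)
print(votes, label)
```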


2021 ◽  
Vol 9 (4) ◽  
pp. 963-973
Author(s):  
Suleyman Suleymanzade ◽  
Fargana Abdullayeva

The quality of the web page classification process has a huge impact on information retrieval systems. In this paper, we propose combining the results of text and image data classifiers to obtain an accurate representation of web pages. To collect and analyse the data, we created a complex classifier system consisting of a data miner, a text classifier, and an aggregator. Image and text data classification is performed by deep learning models. To represent a common view of the web pages, we propose three aggregation techniques that combine the data from the classifiers.
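The abstract does not name its three aggregation techniques, so the sketch below shows only one plausible option, a weighted average of the class probabilities produced by a text classifier and an image classifier. The function name `aggregate`, the `text_weight` parameter, and the probability values are assumptions made for illustration.

```python
def aggregate(text_probs, image_probs, text_weight=0.5):
    """Combine two class-probability dictionaries by weighted average."""
    classes = set(text_probs) | set(image_probs)
    combined = {
        c: text_weight * text_probs.get(c, 0.0)
        + (1 - text_weight) * image_probs.get(c, 0.0)
        for c in classes
    }
    # Return the winning class together with the combined scores.
    return max(combined, key=combined.get), combined

# Hypothetical outputs of the two classifiers for one web page.
text_probs = {"news": 0.7, "shop": 0.3}
image_probs = {"news": 0.4, "shop": 0.6}

equal_label, _ = aggregate(text_probs, image_probs)
image_heavy_label, _ = aggregate(text_probs, image_probs, text_weight=0.2)
print(equal_label, image_heavy_label)
```

With equal weights the text classifier's stronger confidence decides the label; shifting the weight toward the image classifier flips it, which shows why the choice of aggregation rule matters.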


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Xiang Song ◽  
Yi Zhu ◽  
Xuemei Zeng ◽  
Xingshu Chen

Web page classification is critical for information retrieval. Most web page classification methods have the following two faults: (1) they need to analyze the overall web page and (2) they do not pay enough attention to the noise information inside the web page, which decreases efficiency and classification performance, especially when classifying contaminated web pages. To solve these problems, this paper proposes a denoising disposal algorithm. We choose the top-down method for hierarchical classification to improve the prediction efficiency. The experimental results demonstrate that our method is about 7 times faster than the full-page method and achieves good classification results in most categories. The precision of all 7 parent categories is above 88% and is, on average, 24% higher than that of the other meta tag-based method.
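The top-down strategy mentioned here can be sketched as follows: a parent category is chosen first, and only the children of that parent are then considered. The keyword-overlap scorer and the category tree below are hypothetical stand-ins for the paper's trained classifiers and denoising step.

```python
def classify_top_down(text, parent_keywords, child_keywords):
    """Pick a parent category, then a child category within that parent only."""
    words = set(text.lower().split())

    def best(candidates):
        # Score each label by how many of its keywords occur in the text.
        return max(candidates, key=lambda label: len(words & candidates[label]))

    parent = best(parent_keywords)
    child = best(child_keywords[parent])
    return parent, child

# Hypothetical two-level category tree with keyword sets.
parent_keywords = {
    "sports": {"match", "team", "score"},
    "tech": {"software", "chip", "app"},
}
child_keywords = {
    "sports": {"football": {"goal", "striker"}, "tennis": {"serve", "racket"}},
    "tech": {"hardware": {"chip", "processor"}, "software": {"app", "release"}},
}

text = "the team lost the match after a late goal by the striker"
result = classify_top_down(text, parent_keywords, child_keywords)
print(result)
```

Because the second step only compares the children of the chosen parent, the number of comparisons per page stays small even when the full category tree is large.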


2014 ◽  
Vol 2014 ◽  
pp. 1-16 ◽  
Author(s):  
Esra Saraç ◽  
Selma Ayşe Özel

The increased popularity of the web has caused a huge amount of information to be added to the web, and as a result of this explosive information growth, automated web page classification systems are needed to improve search engines’ performance. Web pages have a large number of features such as HTML/XML tags, URLs, hyperlinks, and text contents that should be considered during an automated classification process. The aim of this study is to reduce the number of features to be used to improve runtime and accuracy of the classification of web pages. In this study, we used an ant colony optimization (ACO) algorithm to select the best features, and then we applied the well-known C4.5, naive Bayes, and k-nearest neighbor classifiers to assign class labels to web pages. We used the WebKB and Conference datasets in our experiments, and we showed that using the ACO for feature selection improves both accuracy and runtime performance of classification. We also showed that the proposed ACO-based algorithm can select better features than the well-known information gain and chi-square feature selection methods.


2019 ◽  
Vol 8 (2S11) ◽  
pp. 2011-2016

With the boom in the number of web pages, it is very hard to find desired records easily and quickly among the thousands of web pages retrieved by a search engine. There is an increasing requirement for automatic classification techniques with higher classification accuracy. There are situations today in which it is vital to have an efficient and reliable classification of a web page from the information contained in the URL (Uniform Resource Locator) only, without the need to visit the page itself. For several reasons, we want to know whether the URL can be used without having to look at and visit the page. Fetching the web page content and sorting it to determine the genre of the web page is very time consuming and requires the user to understand the structure of the web page to be classified. To avoid this time-consuming process, we propose an alternative method that obtains the genre of the entered URL based on the URL and the metadata, i.e., the description and keywords used in the website along with the title of the site. This approach relies not only on the URL but also on content from the web application. The proposed system can be evaluated using several available datasets.
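A minimal sketch of how tokens might be drawn from a URL and from page metadata (title, description, keywords) is shown below. The function names, the stop-token list, and the example URL are assumptions for illustration; the abstract does not specify how the features are extracted.

```python
import re
from urllib.parse import urlparse

# Tokens that carry little genre information (hypothetical list).
STOP_TOKENS = {"www", "com", "org", "net", "html", "htm", "php", ""}

def url_tokens(url):
    """Split the host and path of a URL into lowercase word tokens."""
    parsed = urlparse(url)
    raw = re.split(r"[^a-z0-9]+", (parsed.netloc + " " + parsed.path).lower())
    return [t for t in raw if t not in STOP_TOKENS]

def page_features(url, title="", description="", keywords=""):
    """Concatenate URL tokens with tokens from the page metadata."""
    meta = re.findall(r"[a-z0-9]+", f"{title} {description} {keywords}".lower())
    return url_tokens(url) + meta

features = page_features(
    "https://www.example.com/sports/football-news.html",
    title="Football News",
    keywords="soccer, league",
)
print(features)
```

The resulting token list could then be fed to any text classifier, so the page body never has to be downloaded or parsed.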


Author(s):  
Soner Kiziloluk ◽  
Ahmet Bedri Ozer

In recent years, data on the Internet has grown exponentially, attaining enormous dimensions. This situation makes it difficult to obtain useful information from such data. Web mining is the process of using data mining techniques such as association rules, classification, clustering, and statistics to discover and extract information from Web documents. Optimization algorithms play an important role in such techniques. In this work, the parliamentary optimization algorithm (POA), which is one of the latest social-based metaheuristic algorithms, has been adopted for Web page classification. Two different data sets (Course and Student) were selected for experimental evaluation, and HTML tags were used as features. The data sets were tested using different classification algorithms implemented in WEKA, and the results were compared with those of the POA. The POA was found to yield promising results compared to the other algorithms. This study is the first to propose the POA for effective Web page classification.


Author(s):  
H. A. Ali ◽  
Ali I. El Desouky ◽  
Ahmed I. Saleh

Web page classification is considered one of the most challenging research areas. The Web has a huge volume of unstructured and distributed documents related to a variety of domains, so relying on a single base for the classification tasks is extremely difficult. In addition, the Web is full of noise that will certainly harm classifier performance, especially if it is found in the classifier training data. Generally, it is more valuable to build domain-oriented classifiers (vertical classifiers) that classify pages related to a specific domain, and to complement those classifiers with novel learning techniques to achieve better performance. The contribution of this paper is threefold. First, a novel learning technique called "Continuous Learning" is introduced. Second, the paper presents a new trend for Web page classification by presenting domain-oriented classifiers (vertical classifiers). A new way of applying the Bayes and K-Nearest Neighbor algorithms is introduced in order to build Domain Oriented Naïve Bayes (DONB) and Domain Oriented K-Nearest Neighbor (DOKNN) classifiers. The third contribution is combining both disciplines by introducing a novel classification strategy. This strategy adds the continuous learning ability to Bayes' theorem to build a Continuous Learning domain-oriented Naïve Bayes (CLNB) classifier. Because the overfitting problem has a great impact on most Web page classification techniques, continuous learning can be considered as a proposed solution. It allows the classifier to adapt itself continuously to achieve better performance. The proposed classifiers are tested; experimental results show that CLNB demonstrates a significant performance improvement over both DONB and DOKNN, with its accuracy exceeding 94.1% after testing 1000 pages.
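The abstract does not detail how continuous learning is added to Naïve Bayes. One simple reading, assumed here purely for illustration, is a multinomial Naïve Bayes classifier whose counts can be updated one page at a time, so that the model keeps adapting after initial training. The class name and the toy pages are hypothetical, and this is not the CLNB algorithm itself.

```python
import math
from collections import Counter, defaultdict

class IncrementalNaiveBayes:
    """Multinomial Naive Bayes whose counts are updated page by page."""

    def __init__(self):
        self.class_docs = Counter()              # pages seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()

    def learn(self, words, label):
        """Fold one labelled page into the model."""
        self.class_docs[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def predict(self, words):
        """Return the class with the highest log posterior (Laplace smoothing)."""
        total_docs = sum(self.class_docs.values())

        def log_posterior(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_docs[label] / total_docs)
            return score + sum(math.log((counts[w] + 1) / denom) for w in words)

        return max(self.class_docs, key=log_posterior)

model = IncrementalNaiveBayes()
model.learn(["goal", "match"], "sport")
model.learn(["stock", "market"], "finance")
first = model.predict(["match", "goal"])

# A later page is added without retraining from scratch.
model.learn(["match", "transfer"], "sport")
second = model.predict(["transfer"])
print(first, second)
```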


The world currently revolves around web technology. Every year, Web information grows exponentially, and this information is huge and complex. It is difficult for web users to classify and extract useful information from the web, because Web information is noisy, redundant, irrelevant, and often misclassified. Many researchers do not have strong knowledge of the process of web page classification or of the techniques and methods previously used. The objective of this survey is to give an outline of the modern techniques of Web page classification. In this survey, recent papers in this area are selected and explored. This study will thus help researchers obtain the required knowledge about current trends in web page classification.


2018 ◽  
Vol 7 (3.27) ◽  
pp. 227
Author(s):  
A. M. James Raj ◽ 
F Sagayaraj Francis

In this information age, much research is carried out in web page classification to acquire relevant and appropriate information. More specifically, optimized feature sets are chosen using evolutionary algorithms in order to enhance web page classification. Normally, these algorithms are designed around heuristic principles inspired by natural evolution. After analyzing the significance of the various evolutionary algorithms deployed by researchers in this domain so far, this work also intends to apply them to obtain the best solutions (enhanced features). In general, when evolutionary algorithms are applied, the fittest genes are generated and determined by the fitness function. Once the fittest genes are decided, picking the fittest individual genomes from a population to carry into the next generations is the challenging task. In this article, a novel approach is proposed to choose the best solutions.

