A Novel Approach for Extraction of Relevant Web Pages from WWW Using Data Mining

Author(s):  
Jyoti Pandey ◽  
Ashok Sharma ◽  
Amit Goel

Author(s):  
Jayanti Mehra ◽  
Ramjeevan Singh Thakur

Weblog analysis takes raw data from access logs and analyzes it to extract statistical information. This information covers several aspects of website activity: usage statistics such as the average number of hits, the total number of user visits, failed and successful cached hits, average viewing time, and average path length through the website; analytical information such as page-not-found errors and server errors; server information, which includes entry and exit pages, single-access pages, and top visited pages; and requester information such as the search engines used, keywords, and top referring sites. In general, the website administrator uses this kind of knowledge to improve system performance, to assist in modifying the site, and to support marketing decisions. Most advanced web mining systems use this kind of information to derive more complex interpretations using data mining procedures such as association rules, clustering, and classification.
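As a rough illustration of the statistics described above, the following minimal sketch parses access-log lines and counts hits, distinct hosts, page-not-found errors, server errors, and top visited pages. It assumes the logs use the Common Log Format; the names `summarize_log`, `LOG_PATTERN`, and `SAMPLE_LOG` are hypothetical, and the sample entries are invented for the example rather than taken from the paper.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
# host ident user [time] "METHOD path PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarize_log(lines, top_n=3):
    """Compute a few basic usage statistics from access-log lines."""
    hits = 0
    hosts = set()
    not_found = 0
    server_errors = 0
    pages = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        hits += 1
        hosts.add(match.group("host"))
        status = int(match.group("status"))
        if status == 404:
            not_found += 1
        elif status >= 500:
            server_errors += 1
        elif 200 <= status < 300:
            # Only successful requests count toward "top visited pages".
            pages[match.group("path")] += 1
    return {
        "total_hits": hits,
        "unique_hosts": len(hosts),
        "not_found": not_found,
        "server_errors": server_errors,
        "top_pages": pages.most_common(top_n),
    }


# Invented sample entries, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2020:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2020:13:56:40 +0000] "GET /about.html HTTP/1.1" 200 1200',
    '10.0.0.3 - - [10/Oct/2020:13:57:12 +0000] "GET /missing.html HTTP/1.1" 404 512',
    '10.0.0.2 - - [10/Oct/2020:13:58:03 +0000] "GET /report HTTP/1.1" 500 0',
]

if __name__ == "__main__":
    print(summarize_log(SAMPLE_LOG))
```

Counting distinct hosts is only a crude stand-in for user visits; real session identification would also need timestamps and user agents.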


Entropy ◽  
2013 ◽  
Vol 16 (1) ◽  
pp. 389-404 ◽  
Author(s):  
Jikai Chen ◽  
Haoyu Li ◽  
Yanjun Wang ◽  
Ronghua Xie ◽  
Xingbin Liu

Author(s):  
Ralla Suresh ◽  
Saritha Vemuri ◽  
Swetha V

The information extracted from Web pages can be used for effective query expansion. Improving the accuracy of web search engines requires the inclusion of metadata, not only to analyze Web content but also to interpret it. Because today's Web is unstructured and semantically heterogeneous, keyword-based queries are likely to miss important results. Using data mining methods, our system derives dependency rules and applies them to concept-based queries. This paper presents a novel approach for query expansion that applies dependency rules mined from a large Web World, combining several existing techniques for data extraction and mining, and integrates the system into COMPACT, our prototype implementation of a concept-based search engine.
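The abstract does not spell out how its dependency rules are mined, so the following is only a sketch of one plausible reading: pairwise association rules over terms that co-occur in documents, filtered by support and confidence, then used to add related terms to a query. The functions `mine_rules` and `expand_query`, the thresholds, and the toy documents in `DOCS` are all assumptions made for illustration; this is not COMPACT's implementation.

```python
from collections import Counter
from itertools import permutations


def mine_rules(documents, min_support=2, min_confidence=0.6):
    """Mine pairwise rules a -> b from term co-occurrence.

    Returns {antecedent: [(consequent, confidence), ...]}, with each list
    sorted by descending confidence, then alphabetically.
    """
    term_count = Counter()
    pair_count = Counter()
    for doc in documents:
        terms = set(doc)
        term_count.update(terms)
        # Every ordered pair of distinct terms in the same document.
        pair_count.update(permutations(sorted(terms), 2))

    rules = {}
    for (a, b), n_ab in pair_count.items():
        if n_ab < min_support:
            continue
        confidence = n_ab / term_count[a]  # estimate of P(b | a)
        if confidence >= min_confidence:
            rules.setdefault(a, []).append((b, confidence))
    for consequents in rules.values():
        consequents.sort(key=lambda item: (-item[1], item[0]))
    return rules


def expand_query(query_terms, rules):
    """Append rule consequents of the original query terms (one step only)."""
    expanded = list(query_terms)
    for term in query_terms:
        for consequent, _confidence in rules.get(term, []):
            if consequent not in expanded:
                expanded.append(consequent)
    return expanded


# Toy documents represented as sets of terms (invented for this example).
DOCS = [
    {"jaguar", "car", "engine"},
    {"jaguar", "car", "dealer"},
    {"jaguar", "cat", "habitat"},
    {"car", "engine", "repair"},
]

if __name__ == "__main__":
    rules = mine_rules(DOCS)
    print(expand_query(["jaguar"], rules))  # ['jaguar', 'car']
```

Expansion is deliberately limited to one step from the original terms so that a chain of rules does not drift the query away from the user's intent.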


2020 ◽  
pp. 1598-1611
Author(s):  
Monther Aldwairi ◽  
Musaab Hasan ◽  
Zayed Balbahaith

Drive-by download refers to attacks that automatically download malware to a user's computer without the user's knowledge or consent. This type of attack is accomplished by exploiting vulnerabilities in web browsers and plugins. The damage may include data leakage leading to financial loss. Traditional antivirus and intrusion detection systems are not efficient against such attacks. Researchers have proposed many detection approaches, mostly based on passive blacklisting; a few have proposed dynamic classification techniques, which suffer from clear shortcomings. In this paper, we propose a novel approach to detecting drive-by download infected web pages based on features extracted from their source code. We test 23 different machine learning classifiers using a data set of 5435 webpages and, based on detection accuracy, select the top five to build our detection model. The approach is expected to serve as a base for implementing and developing anti drive-by download programs. We develop a graphical user interface program that allows the end user to examine a URL before visiting the website. The Bagged Trees classifier exhibited the highest accuracy of 90.1% and reported a 96.24% true positive rate and a 26.07% false positive rate.
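The abstract does not list the features the authors extract, so the sketch below only illustrates the general idea of turning a page's source code into a numeric feature vector that a classifier could consume. The chosen features (script tags, iframes, zero-sized iframes, and calls such as `eval(`), the names `FeatureExtractor` and `extract_features`, and the sample page are all hypothetical and are not the paper's feature set.

```python
from html.parser import HTMLParser

# Hypothetical indicators; the paper's actual features are not given here.
SUSPICIOUS_CALLS = ("eval(", "unescape(", "document.write(")


class FeatureExtractor(HTMLParser):
    """Collect simple counts from HTML source while parsing it."""

    def __init__(self):
        super().__init__()
        self.features = {
            "script_tags": 0,
            "iframes": 0,
            "hidden_iframes": 0,
            "suspicious_calls": 0,
        }
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script":
            self.features["script_tags"] += 1
            self._in_script = True
        elif tag == "iframe":
            self.features["iframes"] += 1
            # Treat a zero width or height as a hidden frame.
            if attributes.get("width") == "0" or attributes.get("height") == "0":
                self.features["hidden_iframes"] += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Only inspect text that appears inside <script> elements.
        if self._in_script:
            for call in SUSPICIOUS_CALLS:
                self.features["suspicious_calls"] += data.count(call)


def extract_features(html_source):
    """Return a dict of feature counts for one page's source code."""
    parser = FeatureExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.features


# Invented page source, for illustration only.
SAMPLE_PAGE = """
<html><body>
<script>var p = unescape("%75%72%6c"); eval(p);</script>
<iframe src="http://example.invalid/x" width="0" height="0"></iframe>
<iframe src="/widget"></iframe>
</body></html>
"""

if __name__ == "__main__":
    print(extract_features(SAMPLE_PAGE))
```

In a full pipeline, vectors like these would be computed for every labeled page and passed to the candidate classifiers for comparison; that training step is omitted here.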

