Efficient Web Mining for Traversal Path Patterns

Web Mining ◽  
2011 ◽  
pp. 322-338 ◽  
Author(s):  
Zhixiang Chen ◽  
Richard H. Fowler ◽  
Ada Wai-Chee Fu ◽  
Chunyue Wang

A maximal forward reference of a Web user is a longest consecutive sequence of Web pages visited by the user in a session without revisiting any previously visited page in the sequence. Efficient mining of frequent traversal path patterns, that is, large reference sequences of maximal forward references, from very large Web logs is a fundamental problem in Web mining. This chapter aims at designing algorithms for this problem with the best possible efficiency. First, two optimal linear-time algorithms are designed for finding maximal forward references from Web logs. Second, two algorithms for mining frequent traversal path patterns are devised with the help of a fast construction of shallow generalized suffix trees over a very large alphabet. These two algorithms have provably linear and sublinear time complexity, respectively, and their performance is analyzed in comparison with Apriori-like algorithms and the Ukkonen algorithm. It is shown that the two new algorithms are substantially more efficient than both.
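The definition above translates into a single left-to-right scan of a session: keep the current forward path on a stack and emit it as a maximal forward reference whenever the user backtracks. The Python sketch below illustrates that definition only; it is not one of the chapter's two optimal algorithms, and the session format (a list of page identifiers in visit order) is an assumption.

# A minimal sketch illustrating the definition of maximal forward references,
# not the chapter's optimized algorithms. A revisit of a page on the current
# forward path is treated as a backward reference.
def maximal_forward_references(session):
    """Return the maximal forward references of one user session."""
    path = []          # current forward path (used as a stack)
    position = {}      # page -> index in `path`
    refs = []
    moving_forward = True
    for page in session:
        if page in position:
            # Backward reference: emit the path if we were extending it,
            # then truncate back to the revisited page.
            if moving_forward and len(path) > 1:
                refs.append(list(path))
            keep = position[page] + 1
            for p in path[keep:]:
                del position[p]
            path = path[:keep]
            moving_forward = False
        else:
            position[page] = len(path)
            path.append(page)
            moving_forward = True
    if moving_forward and len(path) > 1:
        refs.append(list(path))
    return refs

# Example: the session A -> B -> C -> B -> D yields the references ABC and ABD.
print(maximal_forward_references(["A", "B", "C", "B", "D"]))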

2007 ◽  
Vol 16 (05) ◽  
pp. 793-828 ◽  
Author(s):  
JUAN D. VELÁSQUEZ ◽  
VASILE PALADE

Understanding web users' browsing behaviour in order to adapt a web site to the needs of a particular user is a key issue for many commercial companies that do business over the Internet. This paper presents the implementation of a Knowledge Base (KB) for building web-based computerized recommender systems. The Knowledge Base consists of a Pattern Repository, which contains patterns extracted from web logs and web pages by applying various web mining tools, and a Rule Repository, which contains rules describing how the discovered patterns are used to build navigation or web site modification recommendations. The paper also focuses on testing the effectiveness of the proposed online and offline recommendations. An extensive real-world experiment was carried out on the web site of a bank.
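As a concrete illustration of the two repositories, the hypothetical sketch below models a Pattern Repository of navigation patterns and a Rule Repository of condition/recommendation pairs. The class names, fields and the example rule are assumptions made for exposition, not the paper's actual schema.

# A hypothetical sketch of the Knowledge Base described above; names and
# fields are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class NavigationPattern:
    pages: List[str]      # ordered pages extracted from the web logs
    support: float        # fraction of sessions containing the pattern

@dataclass
class Rule:
    condition: Callable[["NavigationPattern", List[str]], bool]
    recommendation: str   # e.g. a page to suggest or a site modification

@dataclass
class KnowledgeBase:
    patterns: List[NavigationPattern] = field(default_factory=list)
    rules: List[Rule] = field(default_factory=list)

    def recommend(self, current_session: List[str]) -> List[str]:
        """Apply every rule to every stored pattern for the live session."""
        return [rule.recommendation
                for pattern in self.patterns
                for rule in self.rules
                if rule.condition(pattern, current_session)]

# Example: suggest the last page of a pattern the user appears to be following.
kb = KnowledgeBase(
    patterns=[NavigationPattern(["home", "loans", "rates"], support=0.12)],
    rules=[Rule(lambda p, s: bool(s) and s[-1] in p.pages[:-1],
                "suggest: rates page")],
)
print(kb.recommend(["home", "loans"]))   # -> ['suggest: rates page']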


Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 161
Author(s):  
Dominik Köppl

We present linear-time algorithms computing the reversed Lempel–Ziv factorization [Kolpakov and Kucherov, TCS’09] within the space bounds of two different suffix tree representations. We can adapt these algorithms to compute the longest previous non-overlapping reverse factor table [Crochemore et al., JDA’12] within the same space, at the cost of a multiplicative logarithmic time penalty.
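For readers unfamiliar with the notion, the sketch below illustrates one common formulation of the reversed Lempel–Ziv factorization with a naive quadratic-time scan: each factor is the longest prefix of the unprocessed suffix whose reverse occurs in the already processed prefix, with a single fresh character as the fallback. It illustrates the definition only, not the paper's linear-time algorithms within suffix-tree space bounds.

# Naive quadratic-time illustration of the reversed LZ factorization under
# one common formulation; NOT the paper's linear-time, space-bounded method.
def reversed_lz_factorization(text):
    factors = []
    i, n = 0, len(text)
    while i < n:
        best = 0
        # Try ever longer prefixes of text[i:] and test their reverses.
        for length in range(1, n - i + 1):
            if text[i:i + length][::-1] in text[:i]:
                best = length
            else:
                break
        length = max(best, 1)    # fall back to a single fresh character
        factors.append(text[i:i + length])
        i += length
    return factors

# Example: "abbab" factorizes as ['a', 'b', 'ba', 'b'] under this formulation,
# since "ab" (the reverse of "ba") already occurs in the processed prefix "ab".
print(reversed_lz_factorization("abbab"))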


2021 ◽  
Vol 3 (1) ◽  
pp. 95-122
Author(s):  
Kilho Shin ◽  
Taichi Ishikawa ◽  
Yu-Lu Liu ◽  
David Lawrence Shepard

The subpath kernel is a class of positive definite kernels defined over trees, which has the following advantages for classification, regression and clustering: it can be incorporated into a variety of powerful kernel machines, including SVM; it is invariant to whether input trees are ordered or unordered; it can be computed by fast linear-time algorithms; and its excellent learning performance has been demonstrated through intensive experiments in the literature. In this paper, we leverage recent advances in tree kernels to solve real problems. As an example, we apply our method to the problem of detecting fake e-commerce sites. Although the problem is similar to phishing site detection, the fact that mimicking existing authentic sites is harmful for fake e-commerce sites marks a clear difference between the two problems. We focus on fake e-commerce site detection for three reasons: e-commerce fraud is a real problem that companies and law enforcement have been cooperating to solve; existing approaches are hampered by inefficiency because datasets tend to be large, whereas subpath kernel learning overcomes these performance challenges; and our method offers increased resiliency against attempts to subvert existing detection methods by incorporating robust features that adversaries cannot change: the DOM trees of websites. Our real-world results are remarkable: our method exhibited accuracy as high as 0.998 when training an SVM with 1000 instances and evaluating accuracy on almost 7000 independent instances. Its generalization efficiency is also excellent: with only 100 training instances, the accuracy score reached 0.996.
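To make the kernel concrete, the sketch below computes an unweighted subpath kernel between two tiny DOM-like trees in the naive way, by counting common downward label paths. The tree representation and the unweighted counting scheme are illustrative assumptions; this is not the linear-time algorithm the paper builds on.

# Naive illustration of a subpath kernel: count common downward label paths
# between two trees. The unweighted scheme and the Node class are assumptions;
# this is not the linear-time algorithm referenced above.
from collections import Counter
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def subpaths(root):
    """Count every contiguous descending label sequence in the tree."""
    counts = Counter()

    def walk(node):
        # All downward paths starting at `node`: the node itself, plus the
        # node prepended to every path starting at one of its children.
        here = [(node.label,)]
        for child in node.children:
            for path in walk(child):
                here.append((node.label,) + path)
        counts.update(here)
        return here

    walk(root)
    return counts

def subpath_kernel(t1, t2):
    c1, c2 = subpaths(t1), subpaths(t2)
    return sum(c1[p] * c2[p] for p in c1 if p in c2)

# Two DOM-like trees sharing the subpaths (html), (body) and (html, body).
a = Node("html", [Node("body", [Node("div")])])
b = Node("html", [Node("body", [Node("p")])])
print(subpath_kernel(a, b))   # -> 3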


Author(s):  
Yuya Higashikawa ◽  
Naoki Katoh ◽  
Junichi Teruyama ◽  
Koji Watase

2020 ◽  
Vol 2020 ◽  
pp. 1-18
Author(s):  
Sonia Setia ◽  
Verma Jyoti ◽  
Neelam Duhan

The continuous growth of the World Wide Web has led to the problem of long access delays. To reduce this delay, prefetching techniques are used to predict a user's browsing behavior and fetch web pages before the user explicitly demands them. Making near-accurate predictions of users' search behavior is a complex task that researchers have faced for many years, and various web mining techniques have been applied to it. However, each of these methods has its own set of drawbacks. In this paper, a novel hybrid prediction model is proposed that integrates usage mining and content mining techniques to tackle the individual challenges of both approaches. The proposed method uses N-gram parsing along with the click counts of queries to capture more contextual information, in an effort to improve the prediction of web pages. The proposed hybrid approach was evaluated on AOL search logs and shows, on average, a 26% increase in prediction precision and a 10% increase in hit ratio compared with other mining techniques.
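The usage-mining half of the idea can be sketched as a simple N-gram predictor trained on past sessions and weighted by click counts. The data format and weighting scheme below are assumptions, and the paper's hybrid model additionally folds in content mining; this is only a rough illustration of next-page prediction for prefetching.

# A hedged sketch of N-gram next-page prediction with click-count weighting;
# the session format and weighting are illustrative assumptions.
from collections import defaultdict

class NGramPredictor:
    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, sessions, click_weight=None):
        """sessions: iterable of page-id lists; click_weight: optional
        mapping page -> click count used to weight transitions."""
        for session in sessions:
            for i in range(len(session) - self.n + 1):
                context = tuple(session[i:i + self.n - 1])
                nxt = session[i + self.n - 1]
                w = click_weight.get(nxt, 1.0) if click_weight else 1.0
                self.counts[context][nxt] += w

    def predict(self, recent_pages):
        """Return the most likely next page for the last n-1 pages seen."""
        context = tuple(recent_pages[-(self.n - 1):])
        candidates = self.counts.get(context)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

# Example: train on two sessions, then decide which page to prefetch next.
model = NGramPredictor(n=2)
model.train([["home", "search", "results"], ["home", "offers"]],
            click_weight={"search": 3})
print(model.predict(["home"]))   # -> "search" (favoured by its click count)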


Algorithmica ◽  
2013 ◽  
Vol 71 (2) ◽  
pp. 471-495 ◽  
Author(s):  
Maw-Shang Chang ◽  
Ming-Tat Ko ◽  
Hsueh-I Lu

1996 ◽  
Vol 06 (01) ◽  
pp. 127-136 ◽  
Author(s):  
QIAN-PING GU ◽  
SHIETUNG PENG

In this paper, we give two linear-time algorithms for the node-to-node fault-tolerant routing problem in n-dimensional hypercubes Hn and star graphs Gn. The first algorithm, given at most n−1 arbitrary fault nodes and two non-fault nodes s and t in Hn, finds a fault-free path s→t of length at most [Formula: see text] in O(n) time, where d(s, t) is the distance between s and t. Our second algorithm, given at most n−2 fault nodes and two non-fault nodes s and t in Gn, finds a fault-free path s→t of length at most d(Gn)+3 in O(n) time, where d(Gn) is the diameter of Gn. When the time efficiency of finding the routing path is more important than the length of the path, the algorithms in this paper are better than the previous ones.
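For context, the following brute-force sketch solves the same node-to-node problem in Hn by breadth-first search around a given fault set. It only illustrates the problem statement; the paper's algorithms produce a fault-free path in O(n) time without exploring the cube.

# Brute-force illustration of fault-tolerant routing in the hypercube H_n
# (NOT the paper's O(n) algorithm). Nodes are n-bit integers; neighbours
# differ in exactly one bit.
from collections import deque

def fault_free_path(n, s, t, faults):
    """Shortest fault-free s->t path in H_n, or None if every path is blocked."""
    faults = set(faults)
    if s in faults or t in faults:
        return None
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for d in range(n):
            v = u ^ (1 << d)          # flip dimension d
            if v not in parent and v not in faults:
                parent[v] = u
                queue.append(v)
    return None

# Example in H_3: route 000 -> 111 around the faulty node 011.
print(fault_free_path(3, 0b000, 0b111, faults={0b011}))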

