Learning DOM Trees of Web Pages by Subpath Kernel and Detecting Fake e-Commerce Sites

2021 ◽  
Vol 3 (1) ◽  
pp. 95-122
Author(s):  
Kilho Shin ◽  
Taichi Ishikawa ◽  
Yu-Lu Liu ◽  
David Lawrence Shepard

The subpath kernel is a class of positive definite kernels defined over trees, with the following advantages for classification, regression and clustering: it can be incorporated into a variety of powerful kernel machines, including SVM; it is invariant to whether input trees are ordered or unordered; it can be computed by fast linear-time algorithms; and, finally, its excellent learning performance has been proven through intensive experiments in the literature. In this paper, we leverage recent advances in tree kernels to solve real problems. As an example, we apply our method to the problem of detecting fake e-commerce sites. Although the problem is similar to phishing site detection, the fact that mimicking existing authentic sites is harmful for fake e-commerce sites marks a clear difference between these two problems. We focus on fake e-commerce site detection for three reasons: e-commerce fraud is a real problem that companies and law enforcement have been cooperating to solve; inefficiency hampers existing approaches because datasets tend to be large, while subpath kernel learning overcomes these performance challenges; and we offer increased resiliency against attempts to subvert existing detection methods by incorporating robust features that adversaries cannot change: the DOM trees of websites. Our real-world results are remarkable: our method has exhibited accuracy as high as 0.998 when training SVM with 1000 instances and evaluating accuracy for almost 7000 independent instances. Its generalization efficiency is also excellent: with only 100 training instances, the accuracy score reached 0.996.
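To make the idea of counting shared subpaths concrete, here is a minimal Python sketch. It is not the paper's linear-time algorithm: it naively enumerates every downward path (a node followed by successive descendants) and takes the dot product of the resulting counts. The tree encoding as `(label, [children])` tuples, the function names `subpaths` and `subpath_kernel`, and the absence of any decay weighting are all assumptions made for illustration.

```python
from collections import Counter


def subpaths(tree):
    """Count the label sequences of all downward paths in `tree`.

    `tree` is a (label, [children]) tuple; this encoding is an
    illustrative assumption, not the paper's data structure.
    """
    counts = Counter()

    def walk(node, open_paths):
        label, children = node
        # Extend every path ending at the parent, and start a new path here.
        extended = [p + (label,) for p in open_paths] + [(label,)]
        counts.update(extended)
        for child in children:
            walk(child, extended)

    walk(tree, [])
    return counts


def subpath_kernel(t1, t2):
    """Unweighted subpath kernel: sum over shared subpaths of count products."""
    c1, c2 = subpaths(t1), subpaths(t2)
    return sum(n * c2[p] for p, n in c1.items())


# Two toy DOM-like trees (hypothetical data).
page_a = ("html", [("body", [("div", []), ("div", [])])])
page_b = ("body", [("div", [])])
print(subpath_kernel(page_a, page_b))
```

Because child order never enters the count, the value is the same for ordered and unordered trees, which matches the invariance property stated in the abstract. A Gram matrix built from such values could be passed to a kernel machine that accepts precomputed kernels.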

Web Mining ◽  
2011 ◽  
pp. 322-338 ◽  
Author(s):  
Zhixiang Chen ◽  
Richard H. Fowler ◽  
Ada Wai-Chee Fu ◽  
Chunyue Wang

A maximal forward reference of a Web user is a longest consecutive sequence of Web pages visited by the user in a session without revisiting any previously visited page in the sequence. Efficient mining of frequent traversal path patterns, that is, large reference sequences of maximal forward references, from very large Web logs is a fundamental problem in Web mining. This chapter aims at designing algorithms for this problem with the best possible efficiency. First, two optimal linear time algorithms are designed for finding maximal forward references from Web logs. Second, two algorithms for mining frequent traversal path patterns are devised with the help of a fast construction of shallow generalized suffix trees over a very large alphabet. These two algorithms have provable linear and sublinear time complexity, respectively, and their performance is analyzed in comparison with Apriori-like algorithms and the Ukkonen algorithm. It is shown that these two new algorithms are substantially more efficient than the Apriori-like algorithms and the Ukkonen algorithm.
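The definition of a maximal forward reference can be illustrated with a short Python sketch. This is a simple textbook-style scan and not one of the chapter's optimal algorithms: the membership test on a list makes it quadratic in the worst case. The function name `maximal_forward_references` and the sample session are assumptions for illustration.

```python
def maximal_forward_references(session):
    """Split one session (a list of page ids) into maximal forward references.

    A backward move (revisiting a page already on the current path) closes
    the current forward path, provided the previous move was forward.
    """
    path = []              # pages on the current forward path
    result = []
    moving_forward = False
    for page in session:
        if page in path:
            # Backward reference: emit the path once, then back up to `page`.
            if moving_forward:
                result.append(list(path))
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        result.append(list(path))
    return result


# Hypothetical click sequence within one session.
session = list("ABCDCBEGHGWAOUOV")
for ref in maximal_forward_references(session):
    print("".join(ref))
```

For the sample session this yields the five paths ABCD, ABEGH, ABEGW, AOU and AOV; frequent traversal patterns would then be mined from such paths collected over many sessions.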


Author(s):  
Yuya Higashikawa ◽  
Naoki Katoh ◽  
Junichi Teruyama ◽  
Koji Watase

2002 ◽  
Vol 7 (1) ◽  
pp. 9-25 ◽  
Author(s):  
Moses Boudourides ◽  
Gerasimos Antypas

In this paper we present a simple simulation of the World-Wide Web, in which one observes the appearance of web pages belonging to different web sites, covering a number of different thematic topics and possessing links to other web pages. The goal of our simulation is to reproduce the form of the observed World-Wide Web and of its growth, using a small number of simple assumptions. In our simulation, existing web pages may generate new ones as follows: First, each web page is equipped with a topic concerning its contents. Second, links between web pages are established according to common topics. Next, new web pages may be randomly generated and subsequently equipped with a topic and assigned to web sites. Through repeated iterations of these rules, our simulation appears to exhibit the observed structure of the World-Wide Web and, in particular, a power-law type of growth. In order to visualise the network of web pages, we have followed N. Gilbert's (1997) methodology of scientometric simulation, assuming that web pages can be represented by points in the plane. Furthermore, the simulated graph is found to possess the small-world property, as is the case with a large number of other complex networks.


Algorithmica ◽  
2013 ◽  
Vol 71 (2) ◽  
pp. 471-495 ◽  
Author(s):  
Maw-Shang Chang ◽  
Ming-Tat Ko ◽  
Hsueh-I Lu

1996 ◽  
Vol 06 (01) ◽  
pp. 127-136 ◽  
Author(s):  
Qian-Ping Gu ◽  
Shietung Peng

In this paper, we give two linear time algorithms for the node-to-node fault-tolerant routing problem in n-dimensional hypercubes Hn and star graphs Gn. The first algorithm, given at most n−1 arbitrary faulty nodes and two non-faulty nodes s and t in Hn, finds a fault-free path s→t of length at most [Formula: see text] in O(n) time, where d(s, t) is the distance between s and t. Our second algorithm, given at most n−2 faulty nodes and two non-faulty nodes s and t in Gn, finds a fault-free path s→t of length at most d(Gn)+3 in O(n) time, where [Formula: see text] is the diameter of Gn. When the time efficiency of finding the routing path is more important than the length of the path, the algorithms in this paper are better than the previous ones.
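The problem statement for the hypercube case can be illustrated with a brute-force Python sketch. This is explicitly not the paper's O(n) algorithm: it runs a breadth-first search over all 2^n nodes, so it only serves to show what a fault-free path s→t looks like. The function name `fault_free_path` and the encoding of nodes as n-bit integers are assumptions for illustration.

```python
from collections import deque


def fault_free_path(n, s, t, faulty):
    """Breadth-first search for a shortest fault-free path in the hypercube H_n.

    Nodes are integers in [0, 2**n); two nodes are adjacent when they differ
    in exactly one bit. Returns the path as a list, or None if none exists.
    """
    faulty = set(faulty)
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            # Walk the parent links back to s and reverse.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for bit in range(n):
            v = u ^ (1 << bit)   # flip one coordinate
            if v not in faulty and v not in parent:
                parent[v] = u
                queue.append(v)
    return None


# H_3 with two faulty neighbours of the source (hypothetical instance).
path = fault_free_path(3, 0b000, 0b111, {0b001, 0b010})
print([format(v, "03b") for v in path])
```

Since Hn is n-connected, removing at most n−1 faulty nodes leaves s and t connected, so the search always returns a path under the abstract's assumptions.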


2015 ◽  
Vol 1 (3) ◽  
pp. 351
Author(s):  
Hoger Mahmud Hussen ◽  
Mazen Ismaeel Ghareb ◽  
Zana Azeez Kaka Rash

Recently the Kurdistan Region of Iraq (KRI) has experienced an explosion in exposure to new technologies in different sectors, especially in media and telecommunication. The Internet is one of those technologies that have opened a way for information proliferation in a previously censored region. Developing web sites to deliver news and other information is a relatively new phenomenon in Kurdistan; this means that the design and development of web pages may lack the required quality standard. In this paper, the quality of web page interface design and usability in the field of news journalism in the KRI is examined against a set of web interface design and usability criteria. For the purpose of data collection, 9 popular news websites are chosen and 900 questionnaires are sent to 100 random users. The results are analyzed, and we find that the majority of users are satisfied with the interface design and usability of the news web pages; however, the results point out some weaknesses that can be improved. The outcome of this research can be used to enhance website design and usability in the field of journalism in the KRI.

