Learning DOM Trees of Web Pages by Subpath Kernel and Detecting Fake e-Commerce Sites

Kilho Shin; Taichi Ishikawa; Yu-Lu Liu; David Lawrence Shepard

doi:10.3390/make3010006

Machine Learning and Knowledge Extraction ◽

10.3390/make3010006 ◽

2021 ◽

Vol 3 (1) ◽

pp. 95-122

Author(s):

Kilho Shin ◽

Taichi Ishikawa ◽

Yu-Lu Liu ◽

David Lawrence Shepard

Keyword(s):

Web Sites ◽

Linear Time ◽

Learning Performance ◽

Detection Methods ◽

Web Pages ◽

Real Problem ◽

Accuracy Score ◽

Positive Definite Kernels ◽

Performance Challenges ◽

Linear Time Algorithms

The subpath kernel is a class of positive definite kernels defined over trees, with the following advantages for classification, regression, and clustering: it can be incorporated into a variety of powerful kernel machines, including SVM; it is invariant to whether input trees are ordered or unordered; it can be computed by fast linear-time algorithms; and its excellent learning performance has been demonstrated through intensive experiments in the literature. In this paper, we leverage recent advances in tree kernels to solve real problems. As an example, we apply our method to the problem of detecting fake e-commerce sites. Although the problem is similar to phishing site detection, the two differ clearly: mimicking an existing authentic site is harmful for a fake e-commerce site. We focus on fake e-commerce site detection for three reasons: e-commerce fraud is a real problem that companies and law enforcement have been cooperating to solve; inefficiency hampers existing approaches because datasets tend to be large, while subpath kernel learning overcomes these performance challenges; and we offer increased resiliency against attempts to subvert existing detection methods by incorporating robust features that adversaries cannot change: the DOM trees of websites. Our real-world results are remarkable: our method exhibited accuracy as high as 0.998 when training an SVM with 1000 instances and evaluating accuracy on almost 7000 independent instances. Its generalization efficiency is also excellent: with only 100 training instances, the accuracy score reached 0.996.
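As a rough illustration of what a subpath kernel measures, the sketch below counts every downward label path (from a node to one of its descendants) in two small DOM-like trees and sums the products of matching counts. It is a naive enumeration for illustration only, not the linear-time algorithm the abstract refers to, and it omits any decay weighting. The `(label, children)` tree representation and the function names `subpath_counts` and `subpath_kernel` are assumptions made for this example.

```python
from collections import Counter

# Assumed representation: a tree node is a (label, list_of_children) pair.


def subpath_counts(tree):
    """Count every downward label path (node to descendant) in the tree."""
    counts = Counter()

    def visit(node, prefixes):
        label, children = node
        # Paths ending at this node: each path ending at the parent,
        # extended by this label, plus the single-node path.
        here = [p + (label,) for p in prefixes] + [(label,)]
        counts.update(here)
        for child in children:
            visit(child, here)

    visit(tree, [])
    return counts


def subpath_kernel(t1, t2):
    """Sum over shared subpaths of the product of their occurrence counts."""
    c1, c2 = subpath_counts(t1), subpath_counts(t2)
    return sum(n * c2[p] for p, n in c1.items())


# Two toy DOM trees (hypothetical pages).
page_a = ("html", [("body", [("div", []), ("div", [])])])
page_b = ("html", [("body", [("div", [])])])
print(subpath_kernel(page_a, page_b))
```

Because only vertical paths are counted, reordering the children of a node leaves the value unchanged, which matches the ordered/unordered invariance mentioned above. A Gram matrix built from such values could then be handed to a kernel machine that accepts a precomputed kernel.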

Efficient Web Mining for Traversal Path Patterns

Web Mining ◽

10.4018/978-1-59140-414-9.ch015 ◽

2011 ◽

pp. 322-338 ◽

Cited By ~ 1

Author(s):

Zhixiang Chen ◽

Richard H. Fowler ◽

Ada Wai-Chee Fu ◽

Chunyue Wang

Keyword(s):

Web Mining ◽

Linear Time ◽

Fundamental Problem ◽

A Priori ◽

Web Pages ◽

Suffix Trees ◽

Web Logs ◽

Large Alphabet ◽

Optimal Linear ◽

Linear Time Algorithms

A maximal forward reference of a Web user is a longest consecutive sequence of Web pages visited by the user in a session without revisiting any previously visited page in the sequence. Efficient mining of frequent traversal path patterns, that is, large reference sequences of maximal forward references, from very large Web logs is a fundamental problem in Web mining. This chapter aims to design algorithms for this problem with the best possible efficiency. First, two optimal linear-time algorithms are designed for finding maximal forward references from Web logs. Second, two algorithms for mining frequent traversal path patterns are devised with the help of a fast construction of shallow generalized suffix trees over a very large alphabet. These two algorithms have provably linear and sublinear time complexity, respectively, and their performance is analyzed in comparison with the Apriori-like algorithms and the Ukkonen algorithm. It is shown that these two new algorithms are substantially more efficient than the Apriori-like algorithms and the Ukkonen algorithm.
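To make the notion of a maximal forward reference concrete, here is a minimal sketch of one way to extract them from a single session. It assumes that revisiting a page is a backward move that truncates the current path back to that page, and that a path is emitted whenever a forward run ends. This is an illustrative reading of the definition, not the chapter's own algorithms; the function name `maximal_forward_references` and the sample session are hypothetical.

```python
def maximal_forward_references(session):
    """Split one session (a list of page ids) into maximal forward references."""
    path = []          # current forward path
    position = {}      # page -> index in path, for O(1) revisit checks
    moving_forward = False
    result = []

    for page in session:
        if page in position:
            # Backward move: emit the path if the previous step was forward.
            if moving_forward:
                result.append(tuple(path))
            keep = position[page] + 1
            for dropped in path[keep:]:
                del position[dropped]
            path = path[:keep]
            moving_forward = False
        else:
            position[page] = len(path)
            path.append(page)
            moving_forward = True

    # Emit the final path if the session ended on a forward move.
    if moving_forward and path:
        result.append(tuple(path))
    return result


session = list("ABCDCBEGHGWAOUOV")
for ref in maximal_forward_references(session):
    print("".join(ref))
```

Each page is appended to and dropped from the path at most once per visit, so the loop does an amortized constant amount of work per log entry.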

Almost Linear Time Algorithms for Minsum k-Sink Problems on Dynamic Flow Path Networks

Theoretical Computer Science ◽

10.1016/j.tcs.2021.05.003 ◽

2021 ◽

Author(s):

Yuya Higashikawa ◽

Naoki Katoh ◽

Junichi Teruyama ◽

Koji Watase

Keyword(s):

Linear Time ◽

Flow Path ◽

Dynamic Flow ◽

Linear Time Algorithms

Linear time algorithms for linear programming

Computers & Mathematics with Applications ◽

10.1016/s0898-1221(99)00069-3 ◽

1999 ◽

Vol 37 (4-5) ◽

pp. 199-208

Author(s):

E.A. Galperin

Keyword(s):

Linear Programming ◽

Linear Time ◽

Linear Time Algorithms

Invisible: The Online Presence of Medical Library Web Pages on Hospital Web Sites

Journal of Hospital Librarianship ◽

10.1080/15323269.2012.637859 ◽

2012 ◽

Vol 12 (1) ◽

pp. 14-24 ◽

Cited By ~ 6

Author(s):

Christine Marton

Keyword(s):

Web Sites ◽

Web Pages ◽

Online Presence ◽

Medical Library

Almost-linear-time algorithms for Markov chains and new spectral primitives for directed graphs

Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing - STOC 2017 ◽

10.1145/3055399.3055463 ◽

2017 ◽

Cited By ~ 14

Author(s):

Michael B. Cohen ◽

Jonathan Kelner ◽

John Peebles ◽

Richard Peng ◽

Anup B. Rao ◽

...

Keyword(s):

Markov Chains ◽

Linear Time ◽

Directed Graphs ◽

Linear Time Algorithms

A Simulation of the Structure of the World-Wide Web

Sociological Research Online ◽

10.5153/sro.684 ◽

2002 ◽

Vol 7 (1) ◽

pp. 9-25 ◽

Cited By ~ 2

Author(s):

Moses Boudourides ◽

Gerasimos Antypas

Keyword(s):

World Wide Web ◽

Power Law ◽

Web Sites ◽

World Wide ◽

The Internet ◽

Web Pages ◽

Small Worlds ◽

Web Page ◽

Simple Simulation ◽

The World

In this paper we present a simple simulation of the World-Wide Web, in which one observes the appearance of web pages that belong to different web sites, cover a number of different thematic topics, and possess links to other web pages. The goal of our simulation is to reproduce the form of the observed World-Wide Web and of its growth, using a small number of simple assumptions. In our simulation, existing web pages may generate new ones as follows: First, each web page is equipped with a topic concerning its contents. Second, links between web pages are established according to common topics. Next, new web pages may be randomly generated, and subsequently they may be equipped with a topic and assigned to web sites. Through repeated iterations of these rules, our simulation appears to exhibit the observed structure of the World-Wide Web and, in particular, a power-law type of growth. In order to visualise the network of web pages, we have followed N. Gilbert's (1997) methodology of scientometric simulation, assuming that web pages can be represented by points in the plane. Furthermore, the simulated graph is found to possess the small-world property, as is the case with a large number of other complex networks.
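The three rules described above (pages carry a topic, links follow common topics, new pages are generated at random and assigned to sites) can be sketched as a toy loop. Everything specific here is an assumption for illustration: the function name `simulate_web`, the uniform random choices, and the parameters `n_topics`, `n_sites`, and `links_per_page` are not taken from the paper, and the sketch makes no claim to reproduce its results.

```python
import random


def simulate_web(steps, n_topics=3, n_sites=4, links_per_page=2, seed=0):
    """Grow a toy web: one new page per step, linked to same-topic pages."""
    rng = random.Random(seed)   # seeded for reproducibility
    topics, sites, links = [], [], []

    for page in range(steps):
        topic = rng.randrange(n_topics)   # rule 1: each page gets a topic
        site = rng.randrange(n_sites)     # rule 3: page is assigned to a site
        # Rule 2: link only to existing pages that share the topic.
        same_topic = [p for p in range(page) if topics[p] == topic]
        targets = rng.sample(same_topic, min(links_per_page, len(same_topic)))
        links.extend((page, t) for t in targets)
        topics.append(topic)
        sites.append(site)

    return topics, sites, links


topics, sites, links = simulate_web(50)
print(len(topics), "pages,", len(links), "links")
```

The returned link list could be fed to any graph tool to inspect degree distributions or path lengths.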

Linear-Time Algorithms for Tree Root Problems

Algorithmica ◽

10.1007/s00453-013-9815-y ◽

2013 ◽

Vol 71 (2) ◽

pp. 471-495 ◽

Cited By ~ 1

Author(s):

Maw-Shang Chang ◽

Ming-Tat Ko ◽

Hsueh-I Lu

Keyword(s):

Linear Time ◽

Linear Time Algorithms

Fault Tolerant Routing in Hypercubes and Star Graphs

Parallel Processing Letters ◽

10.1142/s0129626496000133 ◽

1996 ◽

Vol 06 (01) ◽

pp. 127-136 ◽

Cited By ~ 5

Author(s):

Qian-Ping Gu ◽

Shietung Peng

Keyword(s):

Free Path ◽

Fault Tolerant ◽

Linear Time ◽

Time Efficiency ◽

Routing Problem ◽

Star Graphs ◽

Linear Time Algorithms ◽

Better Than

In this paper, we give two linear-time algorithms for the node-to-node fault-tolerant routing problem in n-dimensional hypercubes Hn and star graphs Gn. The first algorithm, given at most n−1 arbitrary faulty nodes and two non-faulty nodes s and t in Hn, finds a fault-free path s→t of length at most [Formula: see text] in O(n) time, where d(s, t) is the distance between s and t. The second algorithm, given at most n−2 faulty nodes and two non-faulty nodes s and t in Gn, finds a fault-free path s→t of length at most d(Gn)+3 in O(n) time, where [Formula: see text] is the diameter of Gn. When the time efficiency of finding the routing path is more important than the length of the path, the algorithms in this paper are better than the previous ones.
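To illustrate the problem setting for the hypercube case, the following sketch finds a fault-free path by plain breadth-first search. It is a brute-force baseline that can visit up to 2^n nodes, not the O(n) algorithm of the paper. Nodes are assumed to be encoded as integers whose n bits are the hypercube coordinates, and the function name `fault_free_path` is hypothetical.

```python
from collections import deque


def fault_free_path(n, faults, s, t):
    """Shortest fault-free path from s to t in the n-dimensional hypercube."""
    faults = set(faults)
    if s in faults or t in faults:
        raise ValueError("source and target must be non-faulty")

    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            # Walk parent links back to s and reverse.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for bit in range(n):
            v = u ^ (1 << bit)   # neighbor: flip one coordinate
            if v not in faults and v not in parent:
                parent[v] = u
                queue.append(v)
    return None   # t is cut off from s by the faulty nodes


# H3 with n-1 = 2 faulty nodes, routing from 000 to 111.
path = fault_free_path(3, {0b001, 0b010}, 0b000, 0b111)
print([format(v, "03b") for v in path])
```

A baseline like this is useful mainly for checking the output of a faster routing procedure on small instances.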

Fun with Sub-linear Time Algorithms

Lecture Notes in Computer Science - Fun with Algorithms ◽

10.1007/978-3-540-72914-3_3 ◽

2007 ◽

pp. 15-15

Author(s):

Luca Trevisan

Keyword(s):

Linear Time ◽

Linear Time Algorithms

An Investigation into News Webpage interface Design in Kurdistan Region of Iraq

Journal of University of Human Development ◽

10.21928/juhd.v1n3y2015.pp351-356 ◽

2015 ◽

Vol 1 (3) ◽

pp. 351-356

Author(s):

Hoger Mahmud Hussen ◽

Mazen Ismaeel Ghareb ◽

Zana Azeez Kaka Rash

Keyword(s):

Web Sites ◽

Interface Design ◽

New Technologies ◽

Quality Standard ◽

Website Design ◽

Web Pages ◽

Kurdistan Region ◽

News Websites ◽

Other Information ◽

Web Interface Design

Recently the Kurdistan Region of Iraq (KRI) has experienced an explosion in exposure to new technologies in different sectors, especially in media and telecommunication. The Internet is one of those technologies that have opened a way for information proliferation in a previously censored region. Developing web sites to deliver news and other information is a relatively new phenomenon in Kurdistan; this means that the design and development of web pages may lack the required quality standard. In this paper the quality of webpage interface design and usability in the field of news journalism in the KRI is examined against a set of web interface design and usability criteria. For the purpose of data collection, 9 popular news websites are chosen and 900 questionnaires are sent to 100 random users. The results are analyzed, and we find that the majority of users are satisfied with the interface design and usability of the news web pages; however, the results point out some weaknesses that can be improved. The outcome of this research can be used to enhance website design and usability in the field of journalism in the KRI.