On the Improvement of the Isolation Forest Algorithm for Outlier Detection with Streaming Data

Michael Heigl; Kumar Ashutosh Anand; Andreas Urmann; Dalibor Fiala; Martin Schramm; Robert Hable

doi:10.3390/electronics10131534

Electronics ◽

10.3390/electronics10131534 ◽

2021 ◽

Vol 10 (13) ◽

pp. 1534

Author(s):

Michael Heigl ◽

Kumar Ashutosh Anand ◽

Andreas Urmann ◽

Dalibor Fiala ◽

Martin Schramm ◽

...

Keyword(s):

Outlier Detection ◽

Real World ◽

High Speed ◽

State Of The Art ◽

High Volume ◽

Streaming Data ◽

Steady Increase ◽

Efficient Detection ◽

Real World Datasets ◽

Isolation Forest

In recent years, detecting anomalies in real-world computer networks has become an increasingly challenging task due to the steady increase of high-volume, high-speed and high-dimensional streaming data, for which ground truth information is not available. Efficient detection schemes applied on networked embedded devices need to be fast and memory-constrained, and must be capable of dealing with concept drifts when they occur. Different approaches for unsupervised online outlier detection (OD) have been designed to deal with these circumstances in order to reliably detect malicious activity. In this paper, we introduce a novel framework called PCB-iForest, which, in its generalized form, is able to incorporate any ensemble-based online OD method to function on streaming data. Carefully engineered requirements are compared to the most popular state-of-the-art online methods, with an in-depth focus on variants based on the widely accepted isolation forest algorithm, thereby highlighting the lack of a flexible and efficient solution, a gap that PCB-iForest fills. We integrate two variants into PCB-iForest: an isolation forest improvement called extended isolation forest, and a classic isolation forest variant equipped with the functionality to score features according to their contributions to a sample’s anomalousness. Extensive experiments were performed on 23 different multi-disciplinary and security-related real-world datasets in order to comprehensively evaluate the performance of our implementation compared with off-the-shelf methods. The discussion of results, including AUC, F1 score and averaged execution time metrics, shows that PCB-iForest clearly outperformed the state-of-the-art competitors in 61% of cases and even achieved more promising results in terms of the tradeoff between classification and computational costs.
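For orientation, the classic isolation forest idea that the abstract builds on can be sketched as follows. This is a minimal batch-mode illustration of the standard algorithm (random axis-parallel splits, score 2^(-E[h(x)]/c(n))), not the PCB-iForest framework or the extended isolation forest variant; the function names, the early stop on a constant feature, and all parameter defaults are assumptions made for the example.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Isolate points with random axis-parallel splits until the depth limit."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop if the chosen feature is constant here
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    return {
        "dim": dim,
        "split": split,
        "left": build_tree([p for p in points if p[dim] < split], depth + 1, max_depth, rng),
        "right": build_tree([p for p in points if p[dim] >= split], depth + 1, max_depth, rng),
    }


def path_length(x, node, depth=0):
    """Depth at which x lands, plus the expected extra depth of an unsplit leaf."""
    if "size" in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)


def anomaly_scores(data, n_trees=100, sample_size=64, seed=0):
    """Score each point in (0, 1); higher means easier to isolate. Needs >= 2 points."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c(sample_size)
    return [2.0 ** (-sum(path_length(x, t) for t in trees) / (n_trees * norm))
            for x in data]


# Usage: a tight grid of inliers plus one distant point.
grid = [(0.1 * i, 0.1 * j) for i in range(7) for j in range(7)]
scores = anomaly_scores(grid + [(10.0, 10.0)])
print(max(range(len(scores)), key=scores.__getitem__))  # index of the most anomalous point
```

A streaming variant would additionally need a policy for replacing or retraining trees when the data distribution drifts, which this sketch deliberately leaves out.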

Selective oversampling approach for strongly imbalanced data

PeerJ Computer Science ◽

10.7717/peerj-cs.604 ◽

2021 ◽

Vol 7 ◽

pp. e604

Author(s):

Peter Gnip ◽

Liberios Vokorokos ◽

Peter Drotár

Keyword(s):

Outlier Detection ◽

Real World ◽

State Of The Art ◽

Imbalanced Data ◽

Prediction Performance ◽

Classifier Performance ◽

Real World Applications ◽

Real World Datasets ◽

Synthetic Datasets ◽

Representative Samples

Challenges posed by imbalanced data are encountered in many real-world applications. One possible approach to improving classifier performance on imbalanced data is oversampling. In this paper, we propose a new selective oversampling approach (SOA) that first isolates the most representative samples from minority classes by using an outlier detection technique and then utilizes these samples for synthetic oversampling. We show that the proposed approach improves the performance of two state-of-the-art oversampling methods, namely, the synthetic minority oversampling technique and adaptive synthetic sampling. The prediction performance is evaluated on four synthetic datasets and four real-world datasets, and the proposed SOA methods always achieved performance equal to or better than that of the other existing oversampling methods considered.
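The two-stage idea described above (filter the minority class with an outlier detector, then oversample only from what remains) can be sketched roughly as below. The abstract does not name the outlier detection technique, so the mean k-nearest-neighbour distance used here is an assumption, as are the SMOTE-style interpolation details, the function names, and the `keep_fraction` parameter.

```python
import math
import random


def knn_outlier_scores(points, k=3):
    """Mean distance to the k nearest neighbours; larger means more outlying."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


def selective_oversample(minority, n_new, keep_fraction=0.8, k=3, seed=0):
    """Drop the most outlying minority samples, then interpolate among the rest.

    Assumes at least two minority samples. Returns (kept, synthetic).
    """
    rng = random.Random(seed)
    scores = knn_outlier_scores(minority, k)
    order = sorted(range(len(minority)), key=scores.__getitem__)
    n_keep = max(2, int(len(minority) * keep_fraction))
    kept = [minority[i] for i in order[:n_keep]]

    synthetic = []
    for _ in range(n_new):
        a = rng.choice(kept)
        # SMOTE-style step: interpolate towards one of a's nearest kept neighbours.
        neighbours = sorted((q for q in kept if q is not a),
                            key=lambda q: math.dist(a, q))[:k]
        b = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return kept, synthetic


# Usage: five clustered minority samples and one stray point.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (5.0, 5.0)]
kept, synthetic = selective_oversample(minority, n_new=10)
```

Because synthetic points are interpolated only between retained samples, a stray minority point such as `(5.0, 5.0)` no longer pulls new samples towards itself.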

Unsupervised Feature Selection for Outlier Detection on Streaming Data to Enhance Network Security

Applied Sciences ◽

10.3390/app112412073 ◽

2021 ◽

Vol 11 (24) ◽

pp. 12073

Author(s):

Michael Heigl ◽

Enrico Weigelt ◽

Dalibor Fiala ◽

Martin Schramm

Keyword(s):

Feature Selection ◽

Outlier Detection ◽

Data Streams ◽

State Of The Art ◽

Streaming Data ◽

Detection Methods ◽

Unsupervised Feature Selection ◽

Detection Algorithms ◽

Efficient Detection ◽

Selection For

Over the past couple of years, machine learning methods, especially those for outlier detection, have become established in the cybersecurity field to detect network-based anomalies rooted in novel attack patterns. However, the ubiquity of massive continuously generated data streams poses an enormous challenge to efficient detection schemes and demands fast, memory-constrained online algorithms that are capable of dealing with concept drifts. Feature selection plays an important role in improving outlier detection by identifying noisy data that contain irrelevant or redundant features. State-of-the-art work focuses either on unsupervised feature selection for data streams or on (offline) outlier detection. Substantial requirements to combine both fields are derived and compared with existing approaches. The comprehensive review reveals a research gap in unsupervised feature selection for the improvement of outlier detection methods in data streams. Thus, a novel algorithm for Unsupervised Feature Selection for Streaming Outlier Detection, denoted as UFSSOD, is proposed, which is able to perform unsupervised feature selection for the purpose of outlier detection on streaming data. Furthermore, it is able to determine the number of top-performing features by clustering their score values. A generic concept that shows two application scenarios of UFSSOD in conjunction with off-the-shelf online outlier detection algorithms has been derived. Extensive experiments have shown that a promising feature selection mechanism for streaming data is not applicable in the field of outlier detection. Moreover, UFSSOD, as an online capable algorithm, yields comparable results to a state-of-the-art offline method trimmed for outlier detection.
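The step of choosing how many features to keep by clustering their score values can be illustrated with a small sketch. The abstract does not say which clustering method UFSSOD uses, so the one-dimensional two-means split below is purely an assumption, and the function name and example scores are hypothetical.

```python
def split_top_features(scores, iterations=20):
    """Return indices of features in the higher-scoring of two 1-D clusters.

    Runs a plain two-means (Lloyd) iteration on the score values, starting
    from the minimum and maximum score as the two centroids.
    """
    lo, hi = min(scores), max(scores)
    if lo == hi:  # all scores equal: nothing to separate, keep every feature
        return list(range(len(scores)))
    for _ in range(iterations):
        top = [s for s in scores if abs(s - hi) <= abs(s - lo)]
        rest = [s for s in scores if abs(s - hi) > abs(s - lo)]
        lo, hi = sum(rest) / len(rest), sum(top) / len(top)
    return [i for i, s in enumerate(scores) if abs(s - hi) <= abs(s - lo)]


# Usage: six hypothetical feature scores with a visible gap between two groups.
feature_scores = [0.9, 0.1, 0.85, 0.2, 0.15, 0.8]
selected = split_top_features(feature_scores)
```

The appeal of this kind of split is that the number of selected features follows from the gap in the scores rather than from a fixed, hand-tuned cutoff.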

OFCOD: On the Fly Clustering Based Outlier Detection Framework

Data ◽

10.3390/data6010001 ◽

2020 ◽

Vol 6 (1) ◽

pp. 1

Author(s):

Ahmed Elmogy ◽

Hamada Rizk ◽

Amany M. Sarhan

Keyword(s):

Data Mining ◽

Image Processing ◽

Intrusion Detection ◽

Real Time ◽

Outlier Detection ◽

Real World ◽

Medical Data ◽

Experimental Results ◽

Real Time Applications ◽

Real World Datasets

In data mining, outlier detection is a major challenge, as it plays an important role in many applications such as medical data, image processing, fraud detection, intrusion detection, and so forth. An extensive variety of clustering-based approaches have been developed to detect outliers. However, they are by nature time-consuming, which restricts their use in real-time applications. Furthermore, outlier detection requests are handled one at a time, which means that each request is initiated individually with a particular set of parameters. In this paper, the first clustering-based outlier detection framework, On the Fly Clustering Based Outlier Detection (OFCOD), is presented. OFCOD enables analysts to find outliers promptly on request, even within huge datasets. The proposed framework has been tested and evaluated using two real-world datasets with different features and applications: one with 699 records, and another with five million records. The experimental results show that the proposed framework outperforms other existing approaches across several evaluation metrics.

Embedding-Based Complex Feature Value Coupling Learning for Detecting Outliers in Non-IID Categorical Data

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015541 ◽

2019 ◽

Vol 33 ◽

pp. 5541-5548 ◽

Cited By ~ 2

Author(s):

Hongzuo Xu ◽

Yongjun Wang ◽

Zhiyue Wu ◽

Yijie Wang

Keyword(s):

Outlier Detection ◽

Categorical Data ◽

State Of The Art ◽

High Order ◽

Detection Methods ◽

Order Complex ◽

Value Network ◽

Learning Framework ◽

A Value ◽

Real World Datasets

Non-IID categorical data is ubiquitous in real-world applications. Learning various kinds of couplings has proved to be a reliable means of detecting outliers in such non-IID data. However, it is a critical yet challenging problem to model, represent, and utilise high-order complex value couplings. Existing outlier detection methods normally focus only on pairwise primary value couplings and fail to uncover real relations that hide in complex couplings, resulting in suboptimal and unstable performance. This paper introduces a novel unsupervised embedding-based complex value coupling learning framework, EMAC, and its instance SCAN to address these issues. SCAN first models primary value couplings. Then, coupling bias is defined to capture complex value couplings with different granularities and highlight the essence of outliers. An embedding method is performed on the value network constructed via biased value couplings, which further learns high-order complex value couplings and embeds these couplings into a value representation matrix. Bidirectional selective value coupling learning is proposed to show how to estimate value and object outlierness through value couplings. Substantial experiments show that SCAN (i) significantly outperforms five state-of-the-art outlier detection methods on thirteen real-world datasets; and (ii) has much better resilience to noise than its competitors.

Adaptive Double-Exploration Tradeoff for Outlier Detection

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6164 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6837-6844

Author(s):

Xiaojin Zhang ◽

Honglei Zhuang ◽

Shengyu Zhang ◽

Yuan Zhou

Keyword(s):

Confidence Interval ◽

Outlier Detection ◽

Real World ◽

Efficient Algorithm ◽

Experimental Results ◽

Sample Complexity ◽

Bandit Problem ◽

Real World Datasets ◽

Synthetic Datasets ◽

The Individual

We study a variant of the thresholding bandit problem (TBP) in the context of outlier detection, where the objective is to identify the outliers whose rewards are above a threshold. Distinct from the traditional TBP, the threshold is defined as a function of the rewards of all the arms, which is motivated by the criterion for identifying outliers. The learner needs to explore the rewards of the arms as well as the threshold. We refer to this problem as "double exploration for outlier detection". We construct an adaptively updated confidence interval for the threshold, based on the estimated value of the threshold in the previous rounds. Furthermore, by automatically trading off exploring the individual arms and exploring the outlier threshold, we provide an efficient algorithm in terms of the sample complexity. Experimental results on both synthetic datasets and real-world datasets demonstrate the efficiency of our algorithm.
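To make the problem setting concrete, the sketch below shows why both the arms and the threshold have to be explored: the threshold is computed from the estimated rewards of all arms, so it is only known approximately until every arm has been sampled. This is a naive uniform-sampling baseline, not the adaptive algorithm of the paper; the threshold form (mean plus `k` standard deviations of the arm means), the Gaussian reward noise, and all names and defaults are assumptions.

```python
import random
import statistics


def find_outlier_arms(true_means, k=1.5, pulls_per_arm=200, noise=0.1, seed=0):
    """Uniformly sample every arm, then flag arms above mean + k * std of the estimates.

    Returns (indices of flagged arms, estimated threshold).
    """
    rng = random.Random(seed)
    estimates = []
    for mu in true_means:
        # Simulated noisy rewards for this arm (hypothetical Gaussian noise model).
        rewards = [rng.gauss(mu, noise) for _ in range(pulls_per_arm)]
        estimates.append(statistics.fmean(rewards))
    # The threshold depends on every arm's estimate, not on a fixed constant.
    threshold = statistics.fmean(estimates) + k * statistics.pstdev(estimates)
    outliers = [i for i, est in enumerate(estimates) if est > threshold]
    return outliers, threshold


# Usage: four similar arms and one arm with a much higher expected reward.
outliers, threshold = find_outlier_arms([0.1, 0.12, 0.09, 0.11, 0.9])
```

Uniform sampling spends the same budget on every arm; an adaptive scheme would instead concentrate pulls on arms whose estimates lie close to the current threshold estimate.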

Efficient Heterogeneous Collaborative Filtering without Negative Sampling for Recommendation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i01.5329 ◽

2020 ◽

Vol 34 (01) ◽

pp. 19-26 ◽

Cited By ~ 5

Author(s):

Chong Chen ◽

Min Zhang ◽

Yongfeng Zhang ◽

Weizhi Ma ◽

Yiqun Liu ◽

...

Keyword(s):

Collaborative Filtering ◽

Real World ◽

Large Scale ◽

State Of The Art ◽

Heterogeneous Data ◽

Model Parameters ◽

Online Systems ◽

Practical Applications ◽

Real World Datasets ◽

Primary Type

Recent studies on recommendation have largely focused on exploring state-of-the-art neural networks to improve the expressiveness of models, while typically applying the Negative Sampling (NS) strategy for efficient learning. Despite their effectiveness, two important issues have not been well considered in existing methods: 1) NS suffers from dramatic fluctuation, making it difficult for sampling-based methods to achieve optimal ranking performance in practical applications; 2) although heterogeneous feedback (e.g., view, click, and purchase) is widespread in many online systems, most existing methods leverage only one primary type of user feedback such as purchase. In this work, we propose a novel non-sampling transfer learning solution, named Efficient Heterogeneous Collaborative Filtering (EHCF), for Top-N recommendation. It can not only model fine-grained user-item relations, but also efficiently learn model parameters from the whole heterogeneous data (including all unlabeled data) with a rather low time complexity. Extensive experiments on three real-world datasets show that EHCF significantly outperforms state-of-the-art recommendation methods in both traditional (single-behavior) and heterogeneous scenarios. Moreover, EHCF shows significant improvements in training efficiency, making it more applicable to real-world large-scale systems. Our implementation has been released to facilitate further developments on efficient whole-data based neural methods.

An Efficient Distance and Density Based Outlier Detection Approach

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.155-156.342 ◽

2012 ◽

Vol 155-156 ◽

pp. 342-347 ◽

Cited By ~ 1

Author(s):

Xun Biao Zhong ◽

Xiao Xia Huang

Keyword(s):

Outlier Detection ◽

Real World ◽

High Performance ◽

Detection Problem ◽

Empirical Results ◽

Detection Approach ◽

Real World Datasets ◽

Good Detection

To address the low accuracy and high computational cost of density-based outlier detection, a variance of distance and density (VDD) measure is proposed in this paper. The proposed k-means clustering and score-based VDD (KSVDD) approach can efficiently detect outliers with high performance. For illustration, two real-world datasets are utilized to show the feasibility of the approach. Empirical results show that KSVDD achieves good detection precision.

Efficient Detection of Occlusion prior to Robust Face Recognition

The Scientific World Journal ◽

10.1155/2014/519158 ◽

2014 ◽

Vol 2014 ◽

pp. 1-10 ◽

Cited By ~ 18

Author(s):

Rui Min ◽

Abdenour Hadid ◽

Jean-Luc Dugelay

Keyword(s):

Face Recognition ◽

Facial Expression ◽

Video Surveillance ◽

Real World ◽

State Of The Art ◽

Efficient Detection ◽

Art Works ◽

Enormous Amount ◽

Recognition Systems ◽

Robust Face Recognition

While there has been an enormous amount of research on face recognition under pose/illumination/expression changes and image degradations, problems caused by occlusions have attracted relatively little attention. Facial occlusions, due, for example, to sunglasses, hat/cap, scarf, and beard, can significantly degrade the performance of face recognition systems in uncontrolled environments such as video surveillance. The goal of this paper is to explore face recognition in the presence of partial occlusions, with emphasis on real-world scenarios (e.g., sunglasses and scarf). We propose an efficient approach which consists of first analysing the presence of potential occlusion on a face and then conducting face recognition on the nonoccluded facial regions based on selective local Gabor binary patterns. Experiments demonstrate that the proposed method outperforms state-of-the-art works including KLD-LGBPHS, S-LNMF, OA-LBP, and RSC. Furthermore, evaluations of the proposed approach under illumination and extreme facial expression changes also yield significant results.

Discrete Trust-aware Matrix Factorization for Fast Recommendation

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/191 ◽

2019 ◽

Author(s):

Guibing Guo ◽

Enneng Yang ◽

Li Shen ◽

Xiaochun Yang ◽

Xiaodong He

Keyword(s):

Social Influence ◽

Collaborative Filtering ◽

Recommender Systems ◽

Social Relations ◽

Real World ◽

Matrix Factorization ◽

State Of The Art ◽

Proposed Model ◽

Hamming Space ◽

Real World Datasets

Trust-aware recommender systems have received much attention recently for their ability to capture the influence among connected users. However, they suffer from efficiency issues due to the large amount of data and time-consuming real-valued operations. Although existing discrete collaborative filtering may alleviate this issue to some extent, it is unable to accommodate social influence. In this paper we propose a discrete trust-aware matrix factorization (DTMF) model that takes advantage of both social relations and discrete techniques for fast recommendation. Specifically, we map the latent representations of users and items into a joint Hamming space by recovering the rating and trust interactions between users and items. We adopt a sophisticated discrete coordinate descent (DCD) approach to optimize our proposed model. In addition, experiments on two real-world datasets demonstrate the superiority of our approach over other state-of-the-art approaches in terms of ranking accuracy and efficiency.
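The efficiency argument for Hamming-space representations rests on a general identity: for two codes with entries in {-1, +1} and length r, the inner product equals r minus twice their Hamming distance, so a preference score can be computed with a bitwise XOR and a bit count instead of floating-point multiplications. The sketch below illustrates only this general property; it is not the DTMF model, and the function names and example codes are hypothetical.

```python
def pack(code):
    """Pack a code with entries in {-1, +1} into an int (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for value in code:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming(a_bits, b_bits):
    """Number of positions where two packed codes differ."""
    return bin(a_bits ^ b_bits).count("1")


def inner_product(a_bits, b_bits, r):
    """Inner product of two packed +/-1 codes of length r via the Hamming identity."""
    return r - 2 * hamming(a_bits, b_bits)


# Usage: compare the bitwise result with the ordinary dot product.
user = [1, -1, 1, 1]
item = [1, 1, -1, 1]
fast = inner_product(pack(user), pack(item), r=len(user))
slow = sum(u * v for u, v in zip(user, item))
```

Both computations give the same value; the packed form simply replaces r multiplications with one XOR and one population count.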

EA Reader: Enhance Attentive Reader for Cloze-Style Question Answering via Multi-Space Context Fusion

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016375 ◽

2019 ◽

Vol 33 ◽

pp. 6375-6382

Author(s):

Chengzhen Fu ◽

Yan Zhang

Keyword(s):

Real World ◽

Question Answering ◽

State Of The Art ◽

Unified Model ◽

Inference Process ◽

Context Vector ◽

Attentive Reader ◽

Semantic Spaces ◽

Real World Datasets

Query-document semantic interactions are essential for the success of many cloze-style question answering models. Recently, researchers have proposed several attention-based methods to predict the answer by focusing on appropriate subparts of the context document. In this paper, we design a novel module to produce the query-aware context vector, named Multi-Space based Context Fusion (MSCF), with the following considerations: (1) interactions are applied across multiple latent semantic spaces; (2) attention is measured at bit level, not at token level. Moreover, we extend MSCF to the multi-hop architecture. This unified model is called Enhanced Attentive Reader (EA Reader). During the iterative inference process, the reader is equipped with a novel memory update rule and maintains the understanding of documents through read, update and write operations. We conduct extensive experiments on four real-world datasets. Our results demonstrate that EA Reader outperforms state-of-the-art models.
