Anomalies Detection Using Isolation in Concept-Drifting Data Streams

Maurras Ulbricht Togbe; Yousra Chabchoub; Aliou Boly; Mariam Barry; Raja Chiky; Maroua Bahri

doi:10.3390/computers10010013

Anomalies Detection Using Isolation in Concept-Drifting Data Streams

Computers ◽

10.3390/computers10010013 ◽

2021 ◽

Vol 10 (1) ◽

pp. 13

Author(s):

Maurras Ulbricht Togbe ◽

Yousra Chabchoub ◽

Aliou Boly ◽

Mariam Barry ◽

Raja Chiky ◽

...

Keyword(s):

Anomaly Detection ◽

Half Space ◽

Data Streams ◽

Detection Efficiency ◽

Concept Drift ◽

Streaming Data ◽

Detection Methods ◽

Data Sets ◽

Stream Data ◽

Isolation Forest

Detecting anomalies in streaming data is an important issue for many application domains, such as cybersecurity, natural disasters, or bank fraud. Different approaches have been designed to detect anomalies: statistics-based, isolation-based, clustering-based, etc. In this paper, we present a structured survey of the existing anomaly detection methods for data streams, with an in-depth look at Isolation Forest (iForest). We first provide an implementation of Isolation Forest Anomalies detection in Stream Data (IForestASD), a variant of iForest for data streams. This implementation is built on top of scikit-multiflow (River), an open source machine learning framework for data streams that contains a single anomaly detection algorithm, called Streaming half-space trees. We performed experiments on several well-known real data sets to compare the performance of our implementation of IForestASD with that of half-space trees. Moreover, we extended the IForestASD algorithm to handle drifting data by proposing three algorithms that involve two well-known drift detection methods: ADWIN and KSWIN. ADWIN is an adaptive sliding window algorithm for detecting change in a data stream. KSWIN is a more recent method; the name refers to Kolmogorov–Smirnov Windowing for concept drift detection. More precisely, we extended KSWIN to deal with n-dimensional data streams. We validated and compared all of the proposed methods on both real and synthetic data sets. In particular, we evaluated the F1-score, the execution time, and the memory consumption. The experiments show that our extensions have lower resource consumption than the original version of IForestASD, with similar or better detection efficiency.
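
The idea of applying a Kolmogorov–Smirnov window test to n-dimensional data can be illustrated with a small sketch. This is not the authors' extension of KSWIN, whose details the abstract does not give. It assumes one simple strategy: run a two-sample KS test on each feature separately and report drift if any feature rejects. The function names, the Bonferroni-style correction, and the use of the asymptotic critical value are all assumptions of this sketch.

```python
import math
from typing import List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Step past every value equal to v in both samples (handles ties).
        while i < len(xs) and xs[i] == v:
            i += 1
        while j < len(ys) and ys[j] == v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def ks_critical_value(n: int, m: int, alpha: float) -> float:
    """Asymptotic rejection threshold for the two-sample KS test."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))


def drift_detected(reference: List[Sequence[float]],
                   recent: List[Sequence[float]],
                   alpha: float = 0.01) -> bool:
    """Report drift if any single feature fails its KS test.

    Assumption of this sketch: alpha is divided by the number of
    features (Bonferroni-style) to limit false alarms.
    """
    n_features = len(reference[0])
    threshold = ks_critical_value(len(reference), len(recent),
                                  alpha / n_features)
    for f in range(n_features):
        ref_col = [row[f] for row in reference]
        new_col = [row[f] for row in recent]
        if ks_statistic(ref_col, new_col) > threshold:
            return True
    return False
```

In a streaming setting, `reference` would hold older observations and `recent` the newest window; when drift is reported, the anomaly detector would typically be rebuilt on recent data.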

Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams

Entropy ◽

10.3390/e23070859 ◽

2021 ◽

Vol 23 (7) ◽

pp. 859

Author(s):

Abdulaziz O. AlQabbany ◽

Aqil M. Azmi

Keyword(s):

Big Data ◽

Random Forest ◽

Real Time ◽

Data Streams ◽

Learning Algorithm ◽

Concept Drift ◽

The United States ◽

Careful Consideration ◽

Data Sets ◽

Stream Data

We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift is a change in the data’s underlying distribution, a significant issue, especially when learning from data streams. It requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming settings of machine learning applications. At the same time, the Adaptive Random Forest (ARF) is a stream learning algorithm that showed promising results in terms of its accuracy and ability to deal with various types of drift. The incoming instances’ continuity allows for their binomial distribution to be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase such streaming algorithms’ efficiency by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects in online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed method of enhancement exhibited considerable improvement in most of the situations.
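
The Poisson resampling step mentioned above can be sketched as follows. In online bagging, each ensemble member sees an incoming instance k times, with k drawn from Poisson(λ); this replaces the bootstrap sampling used in batch settings. The sketch below is only an illustration of that step, not the ARF implementation or the paper's ρ measure. The function names are hypothetical, and the sampler uses Knuth's multiplication method, which is adequate for small λ.

```python
import math
import random
from typing import List


def poisson(lam: float, rng: random.Random) -> int:
    """Draw one value from Poisson(lam) using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def online_bagging_weights(n_instances: int, n_learners: int,
                           lam: float, seed: int = 0) -> List[List[int]]:
    """For each instance, draw how many times each learner trains on it."""
    rng = random.Random(seed)
    return [[poisson(lam, rng) for _ in range(n_learners)]
            for _ in range(n_instances)]
```

A larger λ means each learner trains on each instance more often on average, which tends to cost more execution time; that trade-off between accuracy and time is what a tuning procedure over λ would explore.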

Anomaly Pattern Detection in Streaming Data Based on the Transformation to Multiple Binary-Valued Data Streams

Journal of Artificial Intelligence and Soft Computing Research ◽

10.2478/jaiscr-2022-0002 ◽

2021 ◽

Vol 12 (1) ◽

pp. 19-27

Author(s):

Taegong Kim ◽

Cheong Hee Park

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Data Stream ◽

Detection Method ◽

Binary Classification ◽

Streaming Data ◽

Pattern Detection ◽

Detection Methods ◽

Anomaly Pattern ◽

Isolation Forest

Anomaly pattern detection in a data stream aims to detect a time point where outliers begin to occur abnormally. Recently, a method for anomaly pattern detection has been proposed that is based on binary classification of outliers and on statistical tests applied to the resulting stream of binary labels (normal or outlier). It showed that an anomaly pattern can be detected accurately even when outlier detection performance is relatively low. However, since this anomaly pattern detection method relies on binary classification of outliers, most well-known outlier detection methods, which output real-valued outlier scores, cannot be used directly. In this paper, we propose an anomaly pattern detection method for data streams that transforms real-valued outlier scores into multiple binary-valued data streams. The proposed method is tested on artificial and real data sets using three outlier detection methods: Isolation Forest (IF), Autoencoder-based outlier detection, and Local Outlier Factor (LOF). The experimental results show that anomaly pattern detection using Isolation Forest gives the best performance.
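
One plausible way to turn real-valued outlier scores into several binary-valued streams is to threshold the scores at multiple levels. The abstract does not state how the transformation or the subsequent statistical tests are defined, so the sketch below is an assumption throughout: the threshold values, the function names, and the windowed outlier rate are illustrative only.

```python
from typing import List, Sequence


def binarize_scores(scores: Sequence[float],
                    thresholds: Sequence[float]) -> List[List[int]]:
    """Return one 0/1 stream per threshold; 1 marks a score above it."""
    return [[1 if s > t else 0 for s in scores] for t in thresholds]


def outlier_rates(streams: List[List[int]], window: int) -> List[float]:
    """Fraction of flagged points among the last `window` items of each
    non-empty stream."""
    return [sum(s[-window:]) / min(window, len(s)) for s in streams]


scores = [0.1, 0.2, 0.9, 0.8, 0.95]            # hypothetical outlier scores
streams = binarize_scores(scores, [0.5, 0.9])  # hypothetical thresholds
rates = outlier_rates(streams, window=3)
```

A change detector could then monitor each binary stream for a rise in its outlier rate, so that the result does not hinge on a single hand-picked threshold.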

Multivariate Anomaly Detection for Earth Observations: A Comparison of Algorithms and Feature Extraction Techniques

10.5194/esd-2016-51 ◽

2016 ◽

Cited By ~ 1

Author(s):

Milan Flach ◽

Fabian Gans ◽

Alexander Brenning ◽

Joachim Denzler ◽

Markus Reichstein ◽

...

Keyword(s):

Feature Extraction ◽

Anomaly Detection ◽

Data Streams ◽

Multivariate Data ◽

Detection Methods ◽

Earth System ◽

Earth System Science ◽

System Science ◽

Detection Algorithms ◽

Earth Observations

Today, many processes at the Earth's surface are constantly monitored by multiple data streams. These observations have become central to advancing our understanding of, e.g., vegetation dynamics in response to climate or land use change. Another set of important applications is monitoring the effects of climatic extreme events, other disturbances such as fires, or abrupt land transitions. One important methodological question is how to reliably detect anomalies in an automated and generic way within multivariate data streams, which typically vary seasonally and are interconnected across variables. Although many algorithms have been proposed for detecting anomalies in multivariate data, only a few have been investigated in the context of Earth system science applications. In this study, we systematically combine and compare feature extraction and anomaly detection algorithms for detecting anomalous events. Our aim is to identify suitable workflows for automatically detecting anomalous patterns in multivariate Earth system data streams. We rely on artificial data that mimic typical properties and anomalies in multivariate spatiotemporal Earth observations. This artificial experiment is needed as there is no 'gold standard' for the identification of anomalies in real Earth observations. Our results show that a well-chosen feature extraction step (e.g., subtracting seasonal cycles, or dimensionality reduction) is more important than the choice of a particular anomaly detection algorithm. Nevertheless, we identify three detection algorithms (k-nearest neighbours mean distance, kernel density estimation, a recurrence approach) and their combinations (ensembles) that outperform other multivariate approaches as well as univariate extreme event detection methods. Our results therefore provide an effective workflow to automatically detect anomalies in Earth system science data.
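
The workflow described above (a feature extraction step followed by a detector) can be sketched with two of the ingredients the abstract names: subtracting the seasonal cycle and scoring by k-nearest-neighbours mean distance. This is a minimal illustration under simplifying assumptions, not the study's pipeline: the series is univariate, the period is known, the mean seasonal cycle is estimated from the same data, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def remove_seasonal_cycle(series: Sequence[float], period: int) -> List[float]:
    """Subtract the mean value observed at each phase of the cycle."""
    sums = [0.0] * period
    counts = [0] * period
    for i, v in enumerate(series):
        sums[i % period] += v
        counts[i % period] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return [v - means[i % period] for i, v in enumerate(series)]


def knn_mean_distance(points: Sequence[Sequence[float]], k: int) -> List[float]:
    """Score each point by the mean Euclidean distance to its k nearest
    neighbours; larger scores indicate more isolated points."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Synthetic series: a repeating cycle with one injected anomaly at index 9.
series = [0.0, 10.0, 0.0, -10.0] * 6
series[9] += 5.0
residuals = remove_seasonal_cycle(series, period=4)
scores = knn_mean_distance([(r,) for r in residuals], k=3)
```

Without the deseasonalising step, the injected value of 15 would sit close to the regular seasonal peaks of 10 and be harder to separate, which is the point the abstract makes about feature extraction.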

A Dynamic Subspace Anomaly Detection Method Using Generic Algorithm for Streaming Network Data

Handbook of Research on Emerging Developments in Data Privacy - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-4666-7381-6.ch018 ◽

2015 ◽

pp. 403-425

Author(s):

Ji Zhang

Keyword(s):

Anomaly Detection ◽

Data Streams ◽

Training Data ◽

Detection Methods ◽

Network Data ◽

Data Generation ◽

Research Attention ◽

Network Connection ◽

Dimensional Network ◽

Anomaly Classification

A great deal of research attention has been paid to data mining on data streams in recent years. In this chapter, the authors carry out a case study of anomaly detection in large and high-dimensional network connection data streams using the Stream Projected Outlier deTector (SPOT), proposed in Zhang et al. (2009), which detects anomalies in data streams using subspace analysis. SPOT is deployed on the 1999 KDD CUP anomaly detection application. Innovative approaches for training data generation, anomaly classification, false positive reduction, and adaptive detection subspace generation are proposed in this chapter as well. Experimental results demonstrate that SPOT is effective and efficient in detecting anomalies from network data streams and outperforms existing anomaly detection methods.

FuseAD: Unsupervised Anomaly Detection in Streaming Sensors Data by Fusing Statistical and Deep Learning Models

Sensors ◽

10.3390/s19112451 ◽

2019 ◽

Vol 19 (11) ◽

pp. 2451 ◽

Cited By ~ 13

Author(s):

Mohsin Munir ◽

Shoaib Ahmed Siddiqui ◽

Muhammad Ali Chattha ◽

Andreas Dengel ◽

Sheraz Ahmed

Keyword(s):

Deep Learning ◽

Anomaly Detection ◽

Internal State ◽

Arima Model ◽

Streaming Data ◽

Detection Technique ◽

Detection Methods ◽

Smart Devices ◽

Detection Techniques ◽

Unsupervised Anomaly Detection

The need for robust unsupervised anomaly detection in streaming data is increasing rapidly in the current era of smart devices, where enormous data are gathered from numerous sensors. These sensors record the internal state of a machine, the external environment, and the interaction of machines with other machines and humans. It is of prime importance to leverage this information in order to minimize downtime of machines, or even avoid downtime completely by constant monitoring. Since each device generates a different type of streaming data, it is normally the case that a specific kind of anomaly detection technique performs better than the others depending on the data type. For some types of data and use-cases, statistical anomaly detection techniques work better, whereas for others, deep learning-based techniques are preferred. In this paper, we present a novel anomaly detection technique, FuseAD, which takes advantage of both statistical and deep-learning-based approaches by fusing them together in a residual fashion. The obtained results show an increase in area under the curve (AUC) as compared to state-of-the-art anomaly detection methods when FuseAD is tested on a publicly available dataset (Yahoo Webscope benchmark). The obtained results advocate that this fusion-based technique can obtain the best of both worlds by combining their strengths and complementing their weaknesses. We also perform an ablation study to quantify the contribution of the individual components in FuseAD, i.e., the statistical ARIMA model as well as the deep-learning-based convolutional neural network (CNN) model.

Research on the Fastest Detection Method for Weak Trends under Noise Interference

Entropy ◽

10.3390/e23081093 ◽

2021 ◽

Vol 23 (8) ◽

pp. 1093

Author(s):

Guang Li ◽

Jing Liang ◽

Caitong Yue

Keyword(s):

Anomaly Detection ◽

Data Streams ◽

Concept Drift ◽

Detection Algorithm ◽

Detection Accuracy ◽

Industrial Data ◽

Oil Drilling ◽

Noise Interference ◽

Detection Score ◽

Weak Trend

Trend anomaly detection is the practice of comparing and analyzing current and historical data trends to detect real-time abnormalities in online industrial data streams. It has the advantages of tracking concept drift automatically and predicting trend changes in the shortest time, making it important both for algorithmic research and industry. However, industrial data streams contain considerable noise that interferes with detecting weak anomalies. In this paper, the fastest detection algorithm, “sliding nesting”, is adopted. It is based on calculating the data weight in each window by applying variable weights, while maintaining the method of trend-effective integration accumulation. The new algorithm changes the traditional calculation method of the trend anomaly detection score, which calculates the score in a short window. This algorithm, SNWFD–DS, can detect weak trend abnormalities in the presence of noise interference. An on-site oil drilling data test shows that this method can significantly reduce delays compared with other methods and can improve the detection accuracy of weak trend anomalies under noise interference.

A Review of Classification and Novel Class Detection Technique of Data Streams

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v3i2c.2891 ◽

2012 ◽

Vol 3 (2) ◽

pp. 314-316

Author(s):

Manish Rai ◽

Rekha Pandit

Keyword(s):

Machine Learning ◽

Data Streams ◽

Concept Drift ◽

Data Classification ◽

Classification Model ◽

Infinite Length ◽

Stream Data ◽

Machine Learning Technique ◽

Feature Evaluation ◽

Learning Technique

Stream data classification suffers from the problems of infinite length, concept evaluation, feature evaluation, and data drift. Labeling data streams is more challenging than labeling static data because of several unique properties of data streams. Data streams are assumed to have infinite length, which makes it difficult to store and use all the historical data for training. Earlier multi-pass machine learning techniques cannot be applied directly to data streams. Data streams exhibit concept drift, which occurs when the underlying concept of the data changes over time. To address concept drift, a classification model must continuously adapt itself to the most recent concept. Various authors have reduced these problems using machine learning approaches and feature optimization techniques. In this paper, we present various methods for reducing such problems in stream data classification. We also discuss a machine learning technique for the feature evaluation process used in the generation of novel classes.

A Subspace-Based Analysis Method for Anomaly Detection in Large and High-Dimensional Network Connection Data Streams

Privacy, Intrusion Detection and Response ◽

10.4018/978-1-60960-836-1.ch008 ◽

2011 ◽

pp. 193-219

Author(s):

Ji Zhang

Keyword(s):

Anomaly Detection ◽

Data Streams ◽

Training Data ◽

Detection Methods ◽

High Dimensional ◽

Data Generation ◽

Research Attention ◽

Network Connection ◽

Dimensional Network ◽

Anomaly Classification

A great deal of research attention has been paid to data mining on data streams in recent years. In this chapter, the authors carry out a case study of anomaly detection in large and high-dimensional network connection data streams using the Stream Projected Outlier deTector (SPOT), proposed in Zhang et al. (2009), which detects anomalies in data streams using subspace analysis. SPOT is deployed on the 1999 KDD CUP anomaly detection application. Innovative approaches for training data generation, anomaly classification, and false positive reduction are proposed in this chapter as well. Experimental results demonstrate that SPOT is effective and efficient in detecting anomalies from network data streams and outperforms existing anomaly detection methods.

Novel Class Detection with Concept Drift in Data Stream - AhtNODE

International Journal of Distributed Systems and Technologies ◽

10.4018/ijdst.2020010102 ◽

2020 ◽

Vol 11 (1) ◽

pp. 15-26

Author(s):

Jay Gandhi ◽

Vaibhav Gandhi

Keyword(s):

Data Stream ◽

Concept Drift ◽

Ensemble Classifier ◽

Streaming Data ◽

Classification Model ◽

Infinite Length ◽

The Novel ◽

Stream Data ◽

Hoeffding Tree ◽

Discovery Method

Data stream mining has become an interesting analysis topic, and there is growing interest in it as a data discovery method. There are several applications supporting stream data processing, such as device networks and electronic networks. Our approach, AhtNODE (Adaptive Hoeffding Tree based NOvel class DEtection), detects novel classes in the presence of concept drift in streaming data. It addresses three challenges of streaming data: infinite length, concept drift, and concept evolution. This approach automatically detects a novel class whenever it arrives in the data stream. It is a multi-class approach that distinguishes novel classes from existing classes. The authors apply the Adaptive Hoeffding Tree as a classification model that is also used to handle concept drift. Previous approaches used ensemble models to handle concept drift. In AHT, classification is done in a single pass. The experimental results demonstrate the effectiveness of AhtNODE compared to existing ensemble classifiers in terms of classification accuracy, speed, and memory use.
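
Single-pass learning in a Hoeffding tree rests on the Hoeffding bound, which says how many examples are enough to trust the best split seen so far. The sketch below shows only that standard bound and the usual split test; it is not the AhtNODE algorithm, whose novel class detection and drift handling are not described in enough detail here to reproduce. The function and parameter names are hypothetical.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n
    observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split a leaf once the best attribute beats the runner-up by more
    than the Hoeffding bound."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```

Because the bound shrinks as n grows, a leaf that cannot yet separate two candidate attributes simply waits for more examples instead of revisiting old ones.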

AN EFFICIENT FUZZY BASED ANOMALY DETECTION USING COLLECTIVE CLUSTERING ALGORITHAM

Kongunadu Research Journal ◽

10.26524/krj135 ◽

2016 ◽

Vol 3 (1) ◽

pp. 81-83

Author(s):

Gomathi K ◽

Umagandhi R

Keyword(s):

Computational Complexity ◽

Anomaly Detection ◽

Detection Technique ◽

Significant Problem ◽

Detection Methods ◽

Data Sets ◽

Key Factors ◽

Research Areas ◽

The Many ◽

Kernel Mapping

Anomaly detection is a significant problem that has been researched within various research areas and application domains. Many anomaly detection methods have been developed specifically for certain application domains, while others are more generic. This study describes an anomaly detection technique for unsupervised data sets that accurately reduces the data from a kernel eigenspace by performing a batch re-computation. For each type of anomalous behavior, the aim is to identify the key factors that the methods use to differentiate between normal and abnormal actions. This study provides a brief understanding of the techniques belonging to each anomaly and kernel mapping category. Further, for each grouping, it identifies the improvements and drawbacks of the techniques in that category. It also discusses the computational complexity of the techniques, since this is an important issue in real application domains. We hope that this survey will provide a good understanding of the many directions in which research has been done on this topic.