Understanding the Effect of Bias in Deep Anomaly Detection

Anomaly detection presents a unique challenge in machine learning, due to the scarcity of labeled anomaly data. Recent work attempts to mitigate such problems by augmenting training of deep anomaly detection models with additional labeled anomaly samples. However, the labeled data often does not align with the target distribution and introduces harmful bias to the trained model. In this paper, we aim to understand the effect of a biased anomaly set on anomaly detection. Concretely, we view anomaly detection as a supervised learning task where the objective is to optimize the recall at a given false positive rate. We formally study the relative scoring bias of an anomaly detector, defined as the difference in performance with respect to a baseline anomaly detector. We establish the first finite sample rates for estimating the relative scoring bias for deep anomaly detection, and empirically validate our theoretical results on both synthetic and real-world datasets. We also provide an extensive empirical study on how a biased training anomaly set affects the anomaly score function and therefore the detection performance on different anomaly classes. Our study demonstrates scenarios in which the biased anomaly set can be useful or problematic, and provides a solid benchmark for future research.

Download Full-text

EADetection: An efficient and accurate sequential behavior anomaly detection approach over data streams

International Journal of Distributed Sensor Networks ◽

10.1177/1550147718803303 ◽

2018 ◽

Vol 14 (10) ◽

pp. 155014771880330 ◽

Cited By ~ 1

Author(s):

Li Cheng ◽

Yijie Wang ◽

Yong Zhou ◽

Xingkong Ma

Keyword(s):

Anomaly Detection ◽

Real Time ◽

Data Streams ◽

Detection Efficiency ◽

Detection System ◽

False Positive Rate ◽

Time Interval ◽

Sequential Behavior ◽

Detection Approach ◽

Anomaly Score

Due to the increasing arriving rate and complex relationship of behavior data streams, how to detect sequential behavior anomaly in an efficient and accurate manner has become an emerging challenge. However, most of the existing literature simply calculates the anomaly score for segmented sequence, and there is limited work going deep to investigate data stream segment and structural relationship. Moreover, existing studies cannot meet efficiency requirements because of large number of projected subsequences. In this article, we propose EADetection, an efficient and accurate sequential behavior anomaly detection approach over data streams. EADetection adopts time interval and fuzzy logic–based correlation to segment event stream adaptively based on rolling window. Through dynamic projection space–based fast pruning, large number of repeated patterns are reduced to improve detection efficiency. Meanwhile, EADetection calculates the anomaly score by top-k pattern–based abnormal scoring based on directed loop graph–based storage strategy, which ensures the accuracy of detection. Specially, we design and implement a streaming anomaly detection system based on EADetection to perform real-time detection. Extensive experiments confirm that EADetection can achieve real time and improve accuracy, significantly reduces latency by 36.8% and reduces false positive rate by 6.4% compared with existing approach.

Download Full-text

TransET: Knowledge Graph Embedding with Entity Types

Electronics ◽

10.3390/electronics10121407 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1407

Author(s):

Peng Wang ◽

Jing Zhou ◽

Yuzhang Liu ◽

Xingchen Zhou

Keyword(s):

Link Prediction ◽

State Of The Art ◽

Score Function ◽

Graph Embedding ◽

Vector Spaces ◽

Knowledge Graph ◽

Semantic Features ◽

Knowledge Graphs ◽

Real World Datasets ◽

Low Dimensional

Knowledge graph embedding aims to embed entities and relations into low-dimensional vector spaces. Most existing methods only focus on triple facts in knowledge graphs. In addition, models based on translation or distance measurement cannot fully represent complex relations. As well-constructed prior knowledge, entity types can be employed to learn the representations of entities and relations. In this paper, we propose a novel knowledge graph embedding model named TransET, which takes advantage of entity types to learn more semantic features. More specifically, circle convolution based on the embeddings of entity and entity types is utilized to map head entity and tail entity to type-specific representations, then translation-based score function is used to learn the presentation triples. We evaluated our model on real-world datasets with two benchmark tasks of link prediction and triple classification. Experimental results demonstrate that it outperforms state-of-the-art models in most cases.

Download Full-text

xtgeebcv: A command for bias-corrected sandwich variance estimation for GEE analyses of cluster randomized trials

The Stata Journal Promoting communications on statistics and Stata ◽

10.1177/1536867x20931001 ◽

2020 ◽

Vol 20 (2) ◽

pp. 363-381 ◽

Cited By ~ 4

Author(s):

John A. Gallis ◽

Fan Li ◽

Elizabeth L. Turner

Keyword(s):

Randomized Trials ◽

Variance Estimation ◽

Public Health Education ◽

Standard Errors ◽

Future Research ◽

Type I ◽

Cluster Randomized Trials ◽

Finite Sample ◽

Individual Level ◽

Cluster Randomized

Cluster randomized trials, where clusters (for example, schools or clinics) are randomized to comparison arms but measurements are taken on individuals, are commonly used to evaluate interventions in public health, education, and the social sciences. Analysis is often conducted on individual-level outcomes, and such analysis methods must consider that outcomes for members of the same cluster tend to be more similar than outcomes for members of other clusters. A popular individual-level analysis technique is generalized estimating equations (GEE). However, it is common to randomize a small number of clusters (for example, 30 or fewer), and in this case, the GEE standard errors obtained from the sandwich variance estimator will be biased, leading to inflated type I errors. Some bias-corrected standard errors have been proposed and studied to account for this finite-sample bias, but none has yet been implemented in Stata. In this article, we describe several popular bias corrections to the robust sandwich variance. We then introduce our newly created command, xtgeebcv, which will allow Stata users to easily apply finite-sample corrections to standard errors obtained from GEE models. We then provide examples to demonstrate the use of xtgeebcv. Finally, we discuss suggestions about which finite-sample corrections to use in which situations and consider areas of future research that may improve xtgeebcv.

Download Full-text

Unsupervised Anomaly Detection with Distillated Teacher-Student Network Ensemble

Entropy ◽

10.3390/e23020201 ◽

2021 ◽

Vol 23 (2) ◽

pp. 201

Author(s):

Qinfeng Xiao ◽

Jing Wang ◽

Youfang Lin ◽

Wenbo Gongsa ◽

Ganghui Hu ◽

...

Keyword(s):

Anomaly Detection ◽

Multivariate Data ◽

Failure Detection ◽

Superior Performance ◽

Detection Algorithms ◽

Teacher Student ◽

Model Complex ◽

Unsupervised Anomaly Detection ◽

Real World Datasets ◽

Complex Features

We address the problem of unsupervised anomaly detection for multivariate data. Traditional machine learning based anomaly detection algorithms rely on specific assumptions of normal patterns and fail to model complex feature interactions and relations. Recently, existing deep learning based methods are promising for extracting representations from complex features. These methods train an auxiliary task, e.g., reconstruction and prediction, on normal samples. They further assume that anomalies fail to perform well on the auxiliary task since they are never trained during the model optimization. However, the assumption does not always hold in practice. Deep models may also perform the auxiliary task well on anomalous samples, leading to the failure detection of anomalies. To effectively detect anomalies for multivariate data, this paper introduces a teacher-student distillation based framework Distillated Teacher-Student Network Ensemble (DTSNE). The paradigm of the teacher-student distillation is able to deal with high-dimensional complex features. In addition, an ensemble of student networks provides a better capability to avoid generalizing the auxiliary task performance on anomalous samples. To validate the effectiveness of our model, we conduct extensive experiments on real-world datasets. Experimental results show superior performance of DTSNE over competing methods. Analysis and discussion towards the behavior of our model are also provided in the experiment section.

Download Full-text

Hybrid Internal Anomaly Detection System for IoT: Reactive Nodes with Cross-Layer Operation

Security and Communication Networks ◽

10.1155/2018/3672698 ◽

2018 ◽

Vol 2018 ◽

pp. 1-15 ◽

Cited By ~ 4

Author(s):

Nanda Kumar Thanigaivelan ◽

Ethiopia Nigussie ◽

Seppo Virtanen ◽

Jouni Isoaho

Keyword(s):

Anomaly Detection ◽

False Positive ◽

Detection System ◽

False Positive Rate ◽

Security Analysis ◽

Abnormal Behavior ◽

Test Bed ◽

Positive Rate ◽

Parent Node ◽

Anomaly Detection System

We present a hybrid internal anomaly detection system that shares detection tasks between router and nodes. It allows nodes to react instinctively against the anomaly node by enforcing temporary communication ban on it. Each node monitors its own neighbors and if abnormal behavior is detected, the node blocks the packets of the anomaly node at link layer and reports the incident to its parent node. A novel RPL control message, Distress Propagation Object (DPO), is formulated and used for reporting the anomaly and network activities to the parent node and subsequently to the router. The system has configurable profile settings and is able to learn and differentiate between the nodes normal and suspicious activities without a need for prior knowledge. It has different subsystems and operation phases that are distributed in both the nodes and router, which act on data link and network layers. The system uses network fingerprinting to be aware of changes in network topology and approximate threat locations without any assistance from a positioning subsystem. The developed system was evaluated using test-bed consisting of Zolertia nodes and in-house developed PandaBoard based gateway as well as emulation environment of Cooja. The evaluation revealed that the system has low energy consumption overhead and fast response. The system occupies 3.3 KB of ROM and 0.86 KB of RAM for its operations. Security analysis confirms nodes reaction against abnormal nodes and successful detection of packet flooding, selective forwarding, and clone attacks. The system’s false positive rate evaluation demonstrates that the proposed system exhibited 5% to 10% lower false positive rate compared to simple detection system.

Download Full-text

IoT Dataset Validation Using Machine Learning Techniques for Traffic Anomaly Detection

Electronics ◽

10.3390/electronics10222857 ◽

2021 ◽

Vol 10 (22) ◽

pp. 2857

Author(s):

Laura Vigoya ◽

Diego Fernandez ◽

Victor Carneiro ◽

Francisco Nóvoa

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

False Positive Rate ◽

Machine Learning Techniques ◽

Support Vector ◽

High Detection Rate ◽

Security Vulnerabilities ◽

Smart Systems ◽

Learning Techniques ◽

Positive Rate

With advancements in engineering and science, the application of smart systems is increasing, generating a faster growth of the IoT network traffic. The limitations due to IoT restricted power and computing devices also raise concerns about security vulnerabilities. Machine learning-based techniques have recently gained credibility in a successful application for the detection of network anomalies, including IoT networks. However, machine learning techniques cannot work without representative data. Given the scarcity of IoT datasets, the DAD emerged as an instrument for knowing the behavior of dedicated IoT-MQTT networks. This paper aims to validate the DAD dataset by applying Logistic Regression, Naive Bayes, Random Forest, AdaBoost, and Support Vector Machine to detect traffic anomalies in IoT. To obtain the best results, techniques for handling unbalanced data, feature selection, and grid search for hyperparameter optimization have been used. The experimental results show that the proposed dataset can achieve a high detection rate in all the experiments, providing the best mean accuracy of 0.99 for the tree-based models, with a low false-positive rate, ensuring effective anomaly detection.

Download Full-text

Spectral-Spatial Anomaly Detection of Hyperspectral Data Based on Improved Isolation Forest

10.36227/techrxiv.14428880.v3 ◽

2021 ◽

Author(s):

Xiangyu Song ◽

Sunil Aryal ◽

Kai Ming Ting ◽

zhen Liu ◽

Bin He

Keyword(s):

Anomaly Detection ◽

Spatial Information ◽

Hyperspectral Image ◽

Gabor Filter ◽

Poor Performance ◽

Hyperspectral Data ◽

Relative Mass ◽

Anomaly Score ◽

Spectral Anomaly ◽

Isolation Forest

Anomaly detection in hyperspectral image is affected by redundant bands and the limited utilization capacity of spectral-spatial information. In this article, we propose a novel Improved Isolation Forest (IIF) algorithm based on the assumption that anomaly pixels are more susceptible to isolation than the background pixels. The proposed IIF is a modified version of the Isolation Forest (iForest) algorithm, which addresses the poor performance of iForest in detecting local anomalies and anomaly detection in high-dimensional data. Further, we propose a spectral-spatial anomaly detector based on IIF (SSIIFD) to make full use of global and local information, as well as spectral and spatial information. To be specific, first, we apply the Gabor filter to extract spatial features, which are then employed as input to the Relative Mass Isolation Forest (ReMass-iForest) detector to obtain the spatial anomaly score. Next, original images are divided into several homogeneous regions via the Entropy Rate Segmentation (ERS) algorithm, and the preprocessed images are then employed as input to the proposed IIF detector to obtain the spectral anomaly score. Finally, we fuse the spatial and spectral anomaly scores by combining them linearly to predict anomaly pixels. The experimental results on four real hyperspectral data sets demonstrate that the proposed detector outperforms other state-of-the-art methods.

Download Full-text

Error Distribution-based Anomaly Score for Forecasting-based Anomaly Detection of PV Systems

10.1109/ictc52510.2021.9620808 ◽

2021 ◽

Author(s):

HyunYong Lee ◽

Nac-Woo Kim ◽

Jun-Gi Lee ◽

Byung-Tak Lee

Keyword(s):

Anomaly Detection ◽

Error Distribution ◽

Pv Systems ◽

Anomaly Score

Download Full-text

A Latent State-Based Multimodal Execution Monitor with Anomaly Detection and Classification for Robot Introspection

Applied Sciences ◽

10.3390/app9061072 ◽

2019 ◽

Vol 9 (6) ◽

pp. 1072 ◽

Cited By ~ 7

Author(s):

Hongmin Wu ◽

Yisheng Guan ◽

Juan Rojas

Keyword(s):

Anomaly Detection ◽

Multivariate Time Series ◽

Sensory Data ◽

Shared Workspaces ◽

Autonomous Manipulation ◽

Log Likelihood ◽

Anomaly Detector ◽

Anomaly Classification ◽

Robot Cooperation ◽

Multiclass Classifier

Robot introspection is expected to greatly aid longer-term autonomy of autonomous manipulation systems. By equipping robots with abilities that allow them to assess the quality of their sensory data, robots can detect and classify anomalies and recover appropriately from common anomalies. This work builds on our previous Sense-Plan-Act-Introspect-Recover (SPAIR) system. We introduce an improved anomaly detector that exploits latent states to monitor anomaly occurrence when robots collaborate with humans in shared workspaces, but also present a multiclass classifier that is activated with anomaly detection. Both implementations are derived from Bayesian non-parametric methods with strong modeling capabilities for learning and inference of multivariate time series with complex and uncertain behavior patterns. In particular, we explore the use of a hierarchical Dirichlet stochastic process prior to learning a Hidden Markov Model (HMM) with a switching vector auto-regressive observation model (sHDP-VAR-HMM). The detector uses a dynamic log-likelihood threshold that varies by latent state for anomaly detection and the anomaly classifier is implemented by calculating the cumulative log-likelihood of testing observation based on trained models. The purpose of our work is to equip the robot with anomaly detection and anomaly classification for the full set of skills associated with a given manipulation task. We consider a human–robot cooperation task to verify our work and measure the robustness and accuracy of each skill. Our improved detector succeeded in detecting 136 common anomalies and 368 nominal executions with a total accuracy of 91.0%. An overall anomaly classification accuracy of 97.1% is derived by performing the anomaly classification on an anomaly dataset that consists of 7 kinds of detected anomalies from a total of 136 anomalies samples.

Download Full-text

Gene set enrichment for reproducible science: comparison of CERNO and eight other algorithms

Bioinformatics ◽

10.1093/bioinformatics/btz447 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5146-5154 ◽

Cited By ~ 19

Author(s):

Joanna Zyla ◽

Michal Marczyk ◽

Teresa Domaszewska ◽

Stefan H E Kaufmann ◽

Joanna Polanska ◽

...

Keyword(s):

False Positive Rate ◽

R Package ◽

Supplementary Information ◽

Computational Time ◽

P Value ◽

Gene Set ◽

Related Data ◽

Novel Approach ◽

Positive Rate ◽

Real World Datasets

Abstract Motivation Analysis of gene set (GS) enrichment is an essential part of functional omics studies. Here, we complement the established evaluation metrics of GS enrichment algorithms with a novel approach to assess the practical reproducibility of scientific results obtained from GS enrichment tests when applied to related data from different studies. Results We evaluated eight established and one novel algorithm for reproducibility, sensitivity, prioritization, false positive rate and computational time. In addition to eight established algorithms, we also included Coincident Extreme Ranks in Numerical Observations (CERNO), a flexible and fast algorithm based on modified Fisher P-value integration. Using real-world datasets, we demonstrate that CERNO is robust to ranking metrics, as well as sample and GS size. CERNO had the highest reproducibility while remaining sensitive, specific and fast. In the overall ranking Pathway Analysis with Down-weighting of Overlapping Genes, CERNO and over-representation analysis performed best, while CERNO and GeneSetTest scored high in terms of reproducibility. Availability and implementation tmod package implementing the CERNO algorithm is available from CRAN (cran.r-project.org/web/packages/tmod/index.html) and an online implementation can be found at http://tmod.online/. The datasets analyzed in this study are widely available in the KEGGdzPathwaysGEO, KEGGandMetacoreDzPathwaysGEO R package and GEO repository. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text