Characterizing and understanding ensemble-based anomaly-detection
Anomaly Detection (AD) has grown in importance in recent years as a result of the increasing digitalization of services and data storage, and detecting abnormal behavior has become a key task. However, finding anomalous records within the large volumes of data now available is a daunting problem, and the efficacy of current methods depends on a wide range of assumptions. One effective strategy for detecting anomalies is to combine multiple models into so-called "ensembles", but the factors that govern their performance are often hard to identify, which makes calibrating and improving them challenging. In this paper, we address these problems with a four-step method for characterizing and understanding the ensemble-based anomaly-detection task. We start by characterizing several datasets and analyzing the factors that make their anomalies hard to detect. We then evaluate to what extent existing algorithms are able to detect anomalies in the same datasets. On the basis of both analyses, we propose a stacking-based ensemble that outperforms a state-of-the-art baseline, Isolation Forest. Finally, we examine the benefits and drawbacks of our proposal.
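As a rough illustration of what a stacking-based ensemble for anomaly detection can look like, the sketch below feeds the anomaly scores of several unsupervised detectors into a supervised meta-learner and compares the result with Isolation Forest alone. It is not the ensemble proposed in this paper. The base detectors (Isolation Forest, Local Outlier Factor, One-Class SVM), the logistic-regression meta-learner, the synthetic data from the hypothetical helper `make_toy_data`, the three-way split, and all hyperparameters are assumptions made for the example, which uses scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def make_toy_data(n_normal=600, n_anomalies=60, seed=0):
    """Hypothetical data: a Gaussian blob (label 0) plus scattered anomalies (label 1)."""
    rng = np.random.default_rng(seed)
    normal = rng.normal(0.0, 1.0, size=(n_normal, 2))
    anomalies = rng.uniform(-6.0, 6.0, size=(n_anomalies, 2))
    X = np.vstack([normal, anomalies])
    y = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_anomalies, dtype=int)])
    return X, y


def base_scores(detectors, X):
    """Return one column of anomaly scores per detector (higher = more anomalous)."""
    # score_samples is higher for inliers in scikit-learn, so it is negated here.
    return np.column_stack([-d.score_samples(X) for d in detectors])


X, y = make_toy_data()

# Three stratified parts: fit the base detectors, fit the meta-learner, evaluate.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
X_base, X_meta, y_base, y_meta = train_test_split(
    X_rest, y_rest, test_size=0.4, stratify=y_rest, random_state=0
)

# Assumed base detectors; any scorer exposing score_samples would fit this sketch.
detectors = [
    IsolationForest(n_estimators=100, random_state=0),
    LocalOutlierFactor(n_neighbors=20, novelty=True),
    OneClassSVM(nu=0.1, gamma="scale"),
]
for detector in detectors:
    detector.fit(X_base)  # unsupervised: the labels are not used at this level

# Level-1 features are the base detectors' scores on data they were not fitted on.
meta_train = base_scores(detectors, X_meta)
meta_test = base_scores(detectors, X_test)

# The meta-learner combines the scores; scaling puts them on comparable ranges.
meta_learner = make_pipeline(StandardScaler(), LogisticRegression())
meta_learner.fit(meta_train, y_meta)

auc_stack = roc_auc_score(y_test, meta_learner.predict_proba(meta_test)[:, 1])
auc_iforest = roc_auc_score(y_test, meta_test[:, 0])  # column 0 is Isolation Forest
print(f"stacked ensemble ROC AUC: {auc_stack:.3f}")
print(f"Isolation Forest ROC AUC: {auc_iforest:.3f}")
```

In this sketch the meta-learner needs labeled anomalies, which is a design choice of the example rather than a property of every stacking scheme. On such easy synthetic data the two AUC values may be close, so the comparison only shows how an evaluation could be set up and says nothing about the results reported in this paper.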