A Random Fourier Features based Streaming Algorithm for Anomaly Detection in Large Datasets

Computing and networking systems traditionally record their activity in log files, which have been used for multiple purposes, such as troubleshooting, accounting, post-incident analysis of security breaches, capacity planning and anomaly detection. In earlier systems those log files were processed manually by system administrators, or with the support of basic applications for filtering, compiling and pre-processing the logs for specific purposes. However, as the volume of these log files continues to grow (more logs per system, more systems per domain), it is becoming increasingly difficult to process those logs using traditional tools, especially for less straightforward purposes such as anomaly detection. On the other hand, as systems continue to become more complex, the potential of using large datasets built from logs of heterogeneous sources for detecting anomalies without prior domain knowledge becomes higher. Anomaly detection tools for such scenarios face two challenges. The first is devising appropriate data analysis solutions for effectively detecting anomalies from large data sources, possibly without prior domain knowledge. The second is adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes. In this paper we address those challenges by proposing an integrated scalable framework that aims at efficiently detecting anomalous events in large amounts of unlabeled log data. Detection is supported by clustering and classification methods that take advantage of parallel computing environments. We validate our approach using the well-known NASA Hypertext Transfer Protocol (HTTP) log datasets. Fourteen features were extracted in order to train a k-means model for separating anomalous and normal events into highly coherent clusters. A second model, which uses the XGBoost system implementing a gradient tree boosting algorithm, takes the resulting binary clustered data as input and produces a set of simple interpretable rules. These rules provide the rationale for generalizing the model's application to a massive number of unseen events in a distributed computing environment. The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management.
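The two-stage idea in this abstract (cluster unlabeled events with k-means, then fit a tree model on the cluster labels to obtain readable rules) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data are synthetic, a single shallow scikit-learn decision tree stands in for XGBoost, the function name `cluster_then_explain` is hypothetical, and treating the smaller cluster as anomalous is an assumption made here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text


def cluster_then_explain(X, max_depth=3, seed=0):
    """Cluster rows into two groups, then fit a shallow tree on the cluster labels."""
    X_scaled = StandardScaler().fit_transform(X)
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X_scaled)
    # Assumption: the smaller of the two clusters is the anomalous one.
    counts = np.bincount(km.labels_, minlength=2)
    anomalous_cluster = int(np.argmin(counts))
    y = (km.labels_ == anomalous_cluster).astype(int)
    # Stand-in for XGBoost: one shallow tree, fitted on the raw features
    # so the extracted thresholds stay in the original units.
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=seed).fit(X, y)
    return y, tree


# Synthetic stand-in for fourteen extracted log features.
rng = np.random.default_rng(0)
normal_events = rng.normal(0.0, 1.0, size=(950, 14))
odd_events = rng.normal(6.0, 1.0, size=(50, 14))
X = np.vstack([normal_events, odd_events])

y, tree = cluster_then_explain(X)
rules = export_text(tree, feature_names=[f"f{i}" for i in range(14)])
print(rules)
```

The printed rules are threshold tests on individual features, which is the kind of artifact that can then be applied to unseen events without rerunning the clustering step.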

Regression Tree Based Explanation for Anomaly Detection Algorithm

Proceedings ◽

10.3390/proceedings2020054007 ◽

2020 ◽

Vol 54 (1) ◽

pp. 7

Author(s):

Iñigo López-Riobóo Botana ◽

Carlos Eiras-Franco ◽

Amparo Alonso-Betanzos

Keyword(s):

Anomaly Detection ◽

Input Data ◽

Regression Tree ◽

Detection Algorithm ◽

Large Datasets ◽

Network Intrusion Detection ◽

Homogeneous Groups ◽

Network Intrusion ◽

Novel Approach ◽

Classification And Regression

This work presents EADMNC (Explainable Anomaly Detection on Mixed Numerical and Categorical spaces), a novel approach to explanation built on an anomaly detection algorithm, ADMNC, which provides accurate detections on mixed numerical and categorical input spaces. Our improved algorithm leverages the formulation of the ADMNC model to offer pre-hoc explainability based on CART (Classification and Regression Trees). The explanation is presented as a segmentation of the input data into homogeneous groups that can be described with a few variables, offering supervisors novel information for justifications. To demonstrate scalability and interpretability, we report experimental results on large real-world datasets from the network intrusion detection domain.
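The general idea of describing anomaly scores through a regression tree, so that each leaf becomes a homogeneous group defined by a few variables, can be sketched as below. This is not the EADMNC algorithm: ADMNC is replaced by scikit-learn's IsolationForest as a stand-in detector, the tree is fitted after scoring rather than inside the model, and the data and feature names are synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: 2000 points, the first 40 shifted along feature x0.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
X[:40, 0] += 8.0

# Stand-in detector (not ADMNC). score_samples returns lower values for
# more abnormal points, so negate it: higher score = more anomalous.
detector = IsolationForest(random_state=0).fit(X)
scores = -detector.score_samples(X)

# A shallow regression tree on the scores segments the input space;
# each leaf is a group described by at most two threshold tests.
explainer = DecisionTreeRegressor(
    max_depth=2, min_samples_leaf=20, random_state=0
).fit(X, scores)

leaf_ids = explainer.apply(X)
segments = {
    int(leaf): float(scores[leaf_ids == leaf].mean())
    for leaf in np.unique(leaf_ids)
}

print(export_text(explainer, feature_names=["x0", "x1", "x2", "x3"]))
print(segments)  # mean anomaly score per leaf
```

Leaves with a high mean score point a supervisor to the region of the input space, and the variables defining it, where anomalies concentrate.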

An evolving Takagi-Sugeno model based on aggregated trapezium clouds for anomaly detection in large datasets

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-16254 ◽

2017 ◽

Vol 32 (3) ◽

pp. 2295-2308 ◽

Cited By ~ 12

Author(s):

Meng-Xian Wang ◽

Jian-Qiang Wang

Keyword(s):

Anomaly Detection ◽

Large Datasets ◽

Model Based ◽

Sugeno Model ◽

Takagi Sugeno

Machine learning classification algorithms and anomaly detection in conventional meters and Tunisian electricity consumption large datasets

Computers & Electrical Engineering ◽

10.1016/j.compeleceng.2021.107329 ◽

2021 ◽

Vol 94 ◽

pp. 107329

Author(s):

Simona-Vasilica Oprea ◽

Adela Bâra

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

Electricity Consumption ◽

Large Datasets ◽

Classification Algorithms ◽

Machine Learning Classification

A distributed data streaming algorithm for network-wide traffic anomaly detection

ACM SIGMETRICS Performance Evaluation Review ◽

10.1145/1639562.1639596 ◽

2009 ◽

Vol 37 (2) ◽

pp. 81-82 ◽

Cited By ~ 11

Author(s):

Yang Liu ◽

Linfeng Zhang ◽

Yong Guan

Keyword(s):

Anomaly Detection ◽

Distributed Data ◽

Data Streaming ◽

Streaming Algorithm ◽

Traffic Anomaly ◽

Traffic Anomaly Detection

Considerations in the Interpretation of Cosmological Anomalies

Proceedings of the International Astronomical Union ◽

10.1017/s1743921314011132 ◽

2014 ◽

Vol 10 (S306) ◽

pp. 124-130 ◽

Cited By ~ 2

Author(s):

Hiranya V. Peiris

Keyword(s):

Experimental Design ◽

Anomaly Detection ◽

Ad Hoc ◽

Scientific Discovery ◽

New Physics ◽

Large Datasets ◽

Signal To Noise ◽

Self Consistent ◽

Known Unknowns ◽

The Look

Abstract: Anomalies drive scientific discovery: they are associated with the cutting edge of the research frontier, and thus typically exploit data in the low signal-to-noise regime. In astronomy, the prevalence of systematics (both “known unknowns” and “unknown unknowns”), combined with increasingly large datasets, the widespread use of ad hoc estimators for anomaly detection, and the “look-elsewhere” effect, can lead to spurious detections. In this informal note, I argue that anomaly detection leading to discoveries of new physics requires a combination of physical understanding, careful experimental design to avoid confirmation bias, and self-consistent statistical methods. These points are illustrated with several concrete examples from cosmology.
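The “look-elsewhere” effect mentioned in the abstract can be illustrated with the standard trials-factor calculation. This is a textbook idealization added for illustration, not a formula taken from the paper: it assumes $N$ independent tests, each with the same local p-value $p_{\mathrm{local}}$, and the numbers are chosen only as an example.

```latex
% Probability of at least one detection at least this extreme among N independent tests
p_{\mathrm{global}} = 1 - \left(1 - p_{\mathrm{local}}\right)^{N}
  \approx N \, p_{\mathrm{local}} \quad \text{when } N \, p_{\mathrm{local}} \ll 1

% Illustrative numbers: a two-sided 3-sigma local result searched for in 100 places
p_{\mathrm{local}} = 0.0027, \quad N = 100
  \;\Rightarrow\; p_{\mathrm{global}} = 1 - 0.9973^{100} \approx 0.24
```

Under these assumptions, a result that looks like a 3-sigma anomaly in one place has roughly a one-in-four chance of appearing somewhere by chance alone. Correlated tests change the effective number of trials, so the independent case is only a rough guide.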

An Efficient Anomaly Detection Based on Optimal Deep Belief Network in Big Data

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.f9178.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 708-716

Keyword(s):

Machine Learning ◽

Big Data ◽

Anomaly Detection ◽

Computer Network ◽

Large Datasets ◽

Support Vector ◽

The Internet ◽

Data Sets ◽

Data Generation ◽

Detection Techniques

The number of internet and network service users keeps growing, and data is being generated at very high speed. At the same time, security threats to the internet, enterprise networks and websites are increasing. Anomalies are recognized as one of the most significant cyber threats on the internet; they are increasing exponentially and are outpacing the commonly used approaches for anomaly detection and classification. Anomaly detection is used in big data analytics to recognize unexpected behaviour. Size and dimensionality are the most prominent characteristics of data in network environments; they result in big datasets and make it hard to recognize useful patterns, for example when identifying network traffic anomalies in large datasets. Given the enormous growth of computer-network-based facilities, performing fast and efficient anomaly detection is a challenge. Anomaly recognition in big datasets is particularly useful for discovering fraud and abnormal activity. Here we focus on the problems of anomaly detection and introduce a novel machine-learning-based anomaly detection technique. The machine learning approach is used to increase anomaly detection speed, which is valuable when detecting anomalies in large datasets. We evaluate the proposed framework through experiments on larger datasets and compare it with several existing techniques, such as fuzzy, SVM (Support Vector Machine) and PSO (Particle Swarm Optimization). The proposed classifier achieves 98% accuracy and a false rate of 0.002%. The experimental results show better performance than existing anomaly detection techniques in a big data environment.
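For readers comparing the reported figures, the sketch below shows how accuracy and a false positive rate are usually computed from binary predictions. The abstract does not define its "false rate"; reading it as the false positive rate is an assumption made here, and the labels, predictions and function name are made up for illustration.

```python
def accuracy_and_fpr(y_true, y_pred):
    """Return (accuracy, false positive rate) for binary labels, 1 = anomaly."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    # FPR: share of truly normal events that were flagged as anomalies.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Made-up example labels and predictions.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]

acc, fpr = accuracy_and_fpr(y_true, y_pred)
print(f"accuracy={acc:.3f}, false positive rate={fpr:.3f}")
```

When anomalies are rare, accuracy alone can look high even for a weak detector, which is why an error rate on the normal class is usually reported alongside it.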
