COMPARATIVE ANALYSIS OF SYSTEM LOGS AND STREAMING DATA ANOMALY DETECTION ALGORITHMS

This paper describes and compares several commonly used approaches to analyzing system logs and streaming data generated in large volumes by company IT infrastructure, with a focus on unattended anomaly detection. The importance of anomaly detection is driven by the growing cost of system downtime caused by events that could have been predicted from log entries reporting abnormal data. Anomaly detection systems follow a standard workflow of data collection, parsing, information extraction, and detection. Most of the paper concerns the detection step and algorithms such as regression, decision trees, SVM, clustering, principal component analysis, invariants mining, and the hierarchical temporal memory (HTM) model. Model-based anomaly detection algorithms and HTM algorithms were used to process the HDFS, BGL, and NAB datasets, comprising ~16 million log messages and 365k streaming data points. The data was manually labeled to enable model training and accuracy calculation. According to the results, supervised anomaly detection systems achieve high precision but require significant training effort, while the HTM-based algorithm shows the highest detection precision with zero training. Detection of abnormal system behavior plays an important role in large-scale incident management systems. Timely detection allows IT administrators to identify issues quickly and resolve them promptly, which reduces system downtime dramatically. Most IT systems generate logs with detailed information about their operations, so logs are an ideal data source for anomaly detection solutions. The volume of logs makes manual analysis impossible and calls for automated approaches.
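The workflow named in the abstract (collection, parsing, information extraction, detection) can be illustrated with a minimal sketch. The log lines, the masking rules, and the rarity threshold `min_share` are all hypothetical. The detection rule is a deliberately simple stand-in (flag sessions that contain a rarely seen event type), not one of the algorithms the paper evaluates.

```python
import re
from collections import Counter, defaultdict

# Hypothetical HDFS-style log lines; the block id (blk_...) is the session key.
LOGS = [
    "Receiving block blk_1 src: /10.0.0.1:5000 dest: /10.0.0.2:5001",
    "PacketResponder 1 for block blk_1 terminating",
    "Receiving block blk_2 src: /10.0.0.3:5000 dest: /10.0.0.2:5001",
    "PacketResponder 0 for block blk_2 terminating",
    "Receiving block blk_3 src: /10.0.0.4:5000 dest: /10.0.0.2:5001",
    "Exception in receiveBlock for block blk_3 java.io.IOException",
]

BLOCK_ID = re.compile(r"blk_-?\d+")


def to_template(line):
    """Parsing: mask variable fields so lines of one event type coincide."""
    line = BLOCK_ID.sub("<BLK>", line)
    line = re.sub(r"/\d+\.\d+\.\d+\.\d+:\d+", "<ADDR>", line)
    return re.sub(r"\b\d+\b", "<NUM>", line)


def event_counts(lines):
    """Information extraction: one event-count vector per session (block id)."""
    sessions = defaultdict(Counter)
    for line in lines:
        match = BLOCK_ID.search(line)
        if match:
            sessions[match.group()][to_template(line)] += 1
    return sessions


def detect(sessions, min_share=0.5):
    """Detection (toy rule): flag sessions containing an event type that
    occurs in fewer than min_share of all sessions."""
    doc_freq = Counter(t for counts in sessions.values() for t in counts)
    n = len(sessions)
    return sorted(
        sid
        for sid, counts in sessions.items()
        if any(doc_freq[t] / n < min_share for t in counts)
    )


print(detect(event_counts(LOGS)))
```

In a real system the count vectors produced by `event_counts` would be the input to a trained model such as PCA or a decision tree; the rarity rule only marks where that model would plug in.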


Valid Probabilistic Anomaly Detection Models for System Logs

Wireless Communications and Mobile Computing ◽

10.1155/2020/8827185 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Chunbo Liu ◽

Lanlan Pan ◽

Zhaojun Gu ◽

Jialiang Wang ◽

Yitong Ren ◽

...

Keyword(s):

Anomaly Detection ◽

Large Scale ◽

Learning Algorithm ◽

Recall Rate ◽

Support Vector ◽

Fusion Algorithm ◽

Flexible Tool ◽

System Logs ◽

Output Only ◽

Better Than

System logs record system status and important events during operation in detail. Detecting anomalies in system logs is a common practice for modern large-scale distributed systems. Yet threshold-based classification models used for anomaly detection output only two values, normal or abnormal, and give no probability estimate of whether a prediction is correct. In this paper, a statistical learning algorithm, the Venn-Abers predictor, is adopted to evaluate the confidence of prediction results in system log anomaly detection. It can calculate the probability distribution of labels for a set of samples and provide, to some extent, a quality assessment of the predicted labels. Two Venn-Abers predictors, LR-VA and SVM-VA, are implemented on top of Logistic Regression and Support Vector Machine, respectively. The differences among the algorithms are then used to build a multimodel fusion algorithm by Stacking, and a Venn-Abers predictor based on the Stacking algorithm, called Stacking-VA, is implemented. The performance of four types of algorithms (unimodel, Venn-Abers predictor based on unimodel, multimodel, and Venn-Abers predictor based on multimodel) is compared in terms of validity and accuracy. Experiments are carried out on a log dataset of the Hadoop Distributed File System (HDFS). For the comparative experiments on unimodels, the results show that the validity of LR-VA and SVM-VA is better than that of the two corresponding underlying models. Compared with the underlying model, the accuracy of the SVM-VA predictor is better than that of the LR-VA predictor and, more significantly, the recall rate increases from 81% to 94%. In the experiments on multiple models, the algorithm based on Stacking multimodel fusion is significantly superior to the underlying classifier. The average accuracy of Stacking-VA is larger than 0.95, and its predictions are more stable than those of LR-VA and SVM-VA. Experimental results show that the Venn-Abers predictor is a flexible tool that can make accurate and valid probability predictions in the field of system log anomaly detection.
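To make the idea of a Venn-Abers predictor concrete, the following is a minimal sketch of the inductive variant. It assumes an underlying classifier has already produced scores for a labeled calibration set. The calibration data, function names, and test score are hypothetical, and this is not the authors' implementation. For a test score `s`, isotonic regression is fitted twice, once with `s` tentatively labeled 0 and once labeled 1, which gives a probability interval `[p0, p1]` for the "abnormal" label.

```python
from itertools import groupby


def isotonic_value_at(points, s):
    """Fit non-decreasing isotonic regression to (score, label) pairs with the
    pool-adjacent-violators algorithm and return the fitted value at score s."""
    blocks = []  # each block: [label_sum, count, highest_score_in_block]
    for score, group in groupby(sorted(points), key=lambda p: p[0]):
        labels = [y for _, y in group]
        blocks.append([sum(labels), len(labels), score])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count, high = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = high
    for total, count, high in blocks:
        if s <= high:
            return total / count
    raise ValueError("s must be one of the scores in points")


def venn_abers(calibration, s):
    """Return (p0, p1): calibrated probabilities of label 1 when the test
    score s is tentatively labeled 0 and 1, respectively."""
    p0 = isotonic_value_at(calibration + [(s, 0)], s)
    p1 = isotonic_value_at(calibration + [(s, 1)], s)
    return p0, p1


# Hypothetical calibration set: (classifier score, true label), 1 = abnormal.
CALIBRATION = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]

p0, p1 = venn_abers(CALIBRATION, 0.85)
# One common way to merge the interval into a single probability.
p = p1 / (1 - p0 + p1)
print(p0, p1, p)
```

The width `p1 - p0` can be read as a confidence indicator: a narrow interval means the calibration data constrain the probability tightly, while a wide one signals an unreliable prediction.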


Evaluation of Distributed Machine Learning Algorithms for Anomaly Detection from Large-Scale System Logs: A Case Study

2018 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2018.8621967 ◽

2018 ◽

Cited By ~ 1

Author(s):

Merve Astekin ◽

Harun Zengin ◽

Hasan Sozer

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

Large Scale ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Large Scale System ◽

System Logs ◽

Distributed Machine Learning


Threat Hunting in Windows Using Big Security Log Data

Security, Privacy, and Forensics Issues in Big Data - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-5225-9742-1.ch007 ◽

2020 ◽

pp. 168-188 ◽

Cited By ~ 1

Author(s):

Mohammad Rasool Fatemi ◽

Ali A. Ghorbani

Keyword(s):

Intrusion Detection ◽

Anomaly Detection ◽

Detection System ◽

Intrusion Detection Systems ◽

Detection Methods ◽

Sources Of Information ◽

Detection Systems ◽

System Logs ◽

Analysis System ◽

Anomaly Detection System

System logs are one of the most important sources of information for anomaly and intrusion detection systems. In a general log-based anomaly detection system, network, device, and host logs are all collected and analyzed together to detect anomalies. However, the ever-increasing volume of logs remains one of the main challenges that anomaly detection tools face. Based on Sysmon, this chapter proposes a host-based log analysis system that detects anomalies without using network logs, both to reduce the volume and to show the importance of host-based logs. The authors implement a Sysmon parser to parse the logs and extract features, which are then used as input to the detection methods. The valuable information is successfully retained after two extensive volume reduction steps. An anomaly detection system is proposed and evaluated on five different datasets with up to 55,000 events, detecting the attacks using the preserved logs. The analysis results demonstrate the significance of host-based logs in auditing, security monitoring, and intrusion detection systems.
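The chapter's parser is not described in this abstract, so the following is only a sketch of what parsing a Sysmon event and reducing volume might look like. It assumes events have been exported in the standard Windows event XML layout. The sample event, the `KEEP_IDS` filter, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Hypothetical sample event (Sysmon event ID 1 = process creation).
SAMPLE_EVENT = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>1</EventID>
    <Computer>host-01</Computer>
  </System>
  <EventData>
    <Data Name="Image">C:\Windows\System32\cmd.exe</Data>
    <Data Name="CommandLine">cmd.exe /c whoami</Data>
    <Data Name="ParentImage">C:\Windows\explorer.exe</Data>
  </EventData>
</Event>"""

# Assumed volume-reduction rule: keep only selected event types.
KEEP_IDS = {1}


def parse_event(xml_text):
    """Extract the event ID and all named EventData fields into a flat dict."""
    root = ET.fromstring(xml_text)
    event_id = int(root.findtext("e:System/e:EventID", namespaces=NS))
    fields = {
        data.get("Name"): data.text or ""
        for data in root.findall("e:EventData/e:Data", NS)
    }
    return {"event_id": event_id, **fields}


def reduce_volume(events, keep_ids=KEEP_IDS):
    """Drop parsed events whose ID is not in keep_ids."""
    return [ev for ev in events if ev["event_id"] in keep_ids]


parsed = parse_event(SAMPLE_EVENT)
print(parsed["event_id"], parsed["Image"])
```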


A Comparative Analysis of Traditional and Deep Learning-Based Anomaly Detection Methods for Streaming Data

2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA) ◽

10.1109/icmla.2019.00105 ◽

2019 ◽

Cited By ~ 2

Author(s):

Mohsin Munir ◽

Muhammad Ali Chattha ◽

Andreas Dengel ◽

Sheraz Ahmed

Keyword(s):

Deep Learning ◽

Comparative Analysis ◽

Anomaly Detection ◽

Streaming Data ◽

Detection Methods


Towards Large-Scale, Heterogeneous Anomaly Detection Systems in Industrial Networks: A Survey of Current Trends

Security and Communication Networks ◽

10.1155/2017/9150965 ◽

2017 ◽

Vol 2017 ◽

pp. 1-17 ◽

Cited By ~ 4

Author(s):

Mikel Iturbe ◽

Iñaki Garitano ◽

Urko Zurutuza ◽

Roberto Uribeetxeberria

Keyword(s):

Big Data ◽

Anomaly Detection ◽

Cyber Security ◽

Large Scale ◽

Big Data Analytics ◽

Research Field ◽

Open Problems ◽

Detection Systems ◽

Industrial Networks ◽

Active Research

Industrial Networks (INs) are widespread environments where heterogeneous devices collaborate to control and monitor physical processes. Some of the controlled processes belong to Critical Infrastructures (CIs), and, as such, IN protection is an active research field. Among different types of security solutions, IN Anomaly Detection Systems (ADSs) have received wide attention from the scientific community. While INs have grown in size and in complexity, requiring the development of novel, Big Data solutions for data processing, IN ADSs have not evolved at the same pace. In parallel, the development of Big Data frameworks such as Hadoop or Spark has led the way for applying Big Data Analytics to the field of cyber-security, mainly focusing on the Information Technology (IT) domain. However, due to the particularities of INs, it is not feasible to directly apply IT security mechanisms in INs, as IN ADSs face unique characteristics. In this work we introduce three main contributions. First, we survey the area of Big Data ADSs that could be applicable to INs and compare the surveyed works. Second, we develop a novel taxonomy to classify existing IN-based ADSs. And, finally, we present a discussion of open problems in the field of Big Data ADSs for INs that can lead to further development.


Incremental Analysis of Large-Scale System Logs for Anomaly Detection

2019 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata47090.2019.9006593 ◽

2019 ◽

Author(s):

Merve Astekin ◽

Selim Ozcan ◽

Hasan Sozer

Keyword(s):

Anomaly Detection ◽

Large Scale ◽

Incremental Analysis ◽

Large Scale System ◽

System Logs


DILAF: A framework for distributed analysis of large-scale system logs for anomaly detection

Software Practice and Experience ◽

10.1002/spe.2653 ◽

2018 ◽

Vol 49 (2) ◽

pp. 153-170 ◽

Cited By ~ 2

Author(s):

Merve Astekin ◽

Harun Zengin ◽

Hasan Sözer

Keyword(s):

Anomaly Detection ◽

Large Scale ◽

Large Scale System ◽

System Logs ◽

Distributed Analysis


Efficiency of Telematics Systems in Management of Operational Activities in Road Transport Enterprises

Energies ◽

10.3390/en13184906 ◽

2020 ◽

Vol 13 (18) ◽

pp. 4906

Author(s):

Ryszard K. Miler ◽

Marcin J. Kisielewski ◽

Anna Brzozowska ◽

Antonina Kalinichenko

Keyword(s):

Information Technology ◽

Energy Consumption ◽

Comparative Analysis ◽

Large Scale ◽

Research Problem ◽

Road Transport ◽

Original Matrix ◽

It Systems ◽

Operational Management ◽

Operational Activities

Implemented in road transport enterprises (RTEs) on a large scale, telematics systems are dedicated both to particular aspects of their operation and to the integrated fields of the total operational functioning of such entities. Hence, a research problem can be defined as the identification of their efficiency levels in the context of operational activities undertaken by RTEs (including more holistic effects, e.g., lowering fuel/energy consumption and negative environmental impacts). Current research studies refer to the efficiency of some particular modules, but there have not been any publications focused on describing the efficiency of telematics systems in a more integrated (holistic) way, due to the lack of a universal tool that could be applied to provide this type of measurement. In this paper, an attempt at filling the identified cognitive gap is presented through empirical research analysing the original matrix developed by the authors that refers to the efficiency rates of organisational activities undertaken by RTEs. The purpose of this paper is to present a tool that has been designed to provide a holistic evaluation of the efficiency of telematics systems in RTE operational management. The results are presented in the form of an individual (ontogenetic) matrix for each of the analysed companies, for which a determinant was calculated with the use of Sarrus' rule. The determinants obtained in this way for the subsequent ontogenetic matrices formed an arithmetic progression that characterised the scope and the level of the influence exerted by the implemented IT (information technology) systems on the organisational efficiency of operational activities undertaken by the analysed RTEs. We present a hypothesis stating that the originally developed matrix can be viewed as a reliable tool for comparative analysis of the efficiency of telematics systems in RTEs, and this hypothesis was positively verified during the research. The obtained results prove the significant potential for the wide application of the discussed matrix, which can be used as a universal tool for the analysis and comparison of the efficiency of integrated IT systems in the operational activities undertaken by RTEs.
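For reference, Sarrus' rule computes the determinant of a 3x3 matrix as the sum of the three "downward" diagonal products minus the three "upward" ones. The sketch below shows the calculation. The matrix values are arbitrary placeholders, not efficiency rates from the paper.

```python
def sarrus_determinant(m):
    """Determinant of a 3x3 matrix by Sarrus' rule."""
    if len(m) != 3 or any(len(row) != 3 for row in m):
        raise ValueError("Sarrus' rule applies to 3x3 matrices only")
    (a, b, c), (d, e, f), (g, h, i) = m
    # Sum of downward diagonals minus sum of upward diagonals.
    return a * e * i + b * f * g + c * d * h - c * e * g - b * d * i - a * f * h


# Arbitrary placeholder matrix (hypothetical values).
example = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 10],
]
print(sarrus_determinant(example))  # prints -3
```

The rule is specific to 3x3 matrices, which is why the function rejects any other shape.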
