Extending Isolation Forest for Anomaly Detection in Big Data via K-Means

2021 ◽  
Vol 5 (4) ◽  
pp. 1-26
Author(s):  
Md Tahmid Rahman Laskar ◽  
Jimmy Xiangji Huang ◽  
Vladan Smetana ◽  
Chris Stewart ◽  
Kees Pouw ◽  
...  

Industrial Information Technology infrastructures are often vulnerable to cyberattacks. To secure the computer systems in an industrial environment, effective intrusion detection systems are required to monitor the cyber-physical systems (e.g., computer networks) in the industry for malicious activities. This article aims to build such intrusion detection systems to protect computer networks from cyberattacks. More specifically, we propose a novel unsupervised machine learning approach that combines the K-Means algorithm with the Isolation Forest for anomaly detection in industrial big data scenarios. Since our objective is to build the intrusion detection system for the big data scenario in the industrial domain, we utilize the Apache Spark framework to implement our proposed model, which was trained on large network traffic data (about 123 million instances of network traffic) stored in Elasticsearch. Moreover, we evaluate our proposed model on live streaming data and find that our proposed system can be used for real-time anomaly detection in the industrial setup. In addition, we address different challenges that we faced while training our model on large datasets and explicitly describe how these issues were resolved. Based on our empirical evaluation in different use cases for anomaly detection in real-world network traffic data, we observe that our proposed system is effective at detecting anomalies in big data scenarios. Finally, we evaluate our proposed model on several academic datasets to compare it with other models and find that it provides performance comparable to other state-of-the-art approaches.

2021 ◽  
Vol 11 (4) ◽  
pp. 1674
Author(s):  
Nuno Oliveira ◽  
Isabel Praça ◽  
Eva Maia ◽  
Orlando Sousa

With the latest advances in information and communication technologies, greater amounts of sensitive user and corporate information are shared continuously across the network, making it susceptible to an attack that can compromise data confidentiality, integrity, and availability. Intrusion Detection Systems (IDS) are important security mechanisms that can perform the timely detection of malicious events through the inspection of network traffic or host-based logs. Many machine learning techniques have proven to be successful at conducting anomaly detection throughout the years, but only a few considered the sequential nature of data. This work proposes a sequential approach and evaluates the performance of a Random Forest (RF), a Multi-Layer Perceptron (MLP), and a Long Short-Term Memory (LSTM) on the CIDDS-001 dataset. The resulting performance measures of this particular approach are compared with the ones obtained from a more traditional one, which only considers individual flow information, in order to determine which methodology best suits the concerned scenario. The experimental outcomes suggest that anomaly detection can be better addressed from a sequential perspective. The LSTM is a highly reliable model for acquiring sequential patterns in network traffic data, achieving an accuracy of 99.94% and an F1-score of 91.66%.
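
The contrast between the per-flow view and the sequential view can be illustrated with a minimal sketch. It assumes that flows are already sorted by time, that a fixed-length sliding window is used, and that each window takes the label of its last flow. The function name `make_windows`, the window length, and the labelling rule are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def make_windows(
    flows: Sequence[Sequence[float]],
    labels: Sequence[int],
    window: int,
) -> List[Tuple[List[Sequence[float]], int]]:
    """Turn time-ordered flows into overlapping fixed-length sequences.

    Each sample holds `window` consecutive flows and is labelled with the
    label of its final flow (an assumption made for this sketch).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(flows) != len(labels):
        raise ValueError("flows and labels must have the same length")

    samples = []
    # `end` is the exclusive end index of each window.
    for end in range(window, len(flows) + 1):
        samples.append((list(flows[end - window:end]), labels[end - 1]))
    return samples
```

With `window=1` this reduces to the traditional per-flow setting, so the same helper could feed both kinds of model in a comparison.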


2021 ◽  
pp. 1-18
Author(s):  
Satish Kumar ◽  
Sunanda Gupta ◽  
Sakshi Arora

Network intrusion detection systems (NIDS) detect malicious and intrusive information in computer networks. Presently, commercial NIDS are based on machine learning approaches that use complex algorithms to increase intrusion detection efficiency and efficacy. These machine learning-based NIDS use high-dimensional network traffic data from which intrusive information is to be detected. This high-dimensional network traffic data needs to be preprocessed and normalized to make it suitable for machine learning tools. A machine learning approach with appropriate normalization and preprocessing increases NIDS performance. This paper presents an empirical study of various normalization methods implemented on a benchmark network traffic dataset, KDD Cup’99, that has been used to evaluate the NIDS model. The present study shows that decimal normalization yields better prediction performance than non-normalized traffic data when traffic is categorized into ‘normal’ or ‘intrusive’ classes.
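
For reference, decimal normalization (often called decimal scaling) is commonly defined as dividing every value of a feature by the smallest power of ten that brings all magnitudes below 1. The sketch below follows that textbook definition. Whether the study uses exactly this variant is an assumption, and `decimal_scaling` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def decimal_scaling(values: Sequence[float]) -> Tuple[List[float], int]:
    """Scale one feature column so that every magnitude is below 1.

    Returns the scaled values and the exponent j, where each value has
    been divided by 10**j and j is the smallest non-negative integer
    for which max(|v|) / 10**j < 1.
    """
    max_abs = max((abs(v) for v in values), default=0)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    scale = 10 ** j
    return [v / scale for v in values], j
```

In practice the exponent would be computed on the training split only and then reused for test data, so that both splits share the same scale.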


Author(s):  
Nuno Oliveira ◽  
Isabel Praça ◽  
Eva Maia ◽  
Orlando Sousa

With the latest advances in information and communication technologies, greater amounts of sensitive user and corporate information are constantly shared across the network, making it susceptible to an attack that can compromise data confidentiality, integrity and availability. Intrusion Detection Systems (IDS) are important security mechanisms that can perform a timely detection of malicious events through the inspection of network traffic or host-based logs. Throughout the years, many machine learning techniques have proven to be successful at conducting anomaly detection, but only a few considered the sequential nature of data. This work proposes a sequential approach and evaluates the performance of a Random Forest (RF), a Multi-Layer Perceptron (MLP) and a Long Short-Term Memory (LSTM) on the CIDDS-001 dataset. The resulting performance measures of this particular approach are compared with the ones obtained from a more traditional one, which only considers individual flow information, in order to determine which methodology best suits the concerned scenario. The experimental outcomes suggest that anomaly detection can be better addressed from a sequential perspective and that the LSTM is a very reliable model for acquiring sequential patterns in network traffic data, achieving an accuracy of 99.94% and an F1-score of 91.66%.


Attackers around the world have become a major threat to SCADA systems since these systems began using open-standard networks, integrating with corporate networks, and accessing the Internet. Although many security solutions and techniques are available, such as firewalls, encryption, and network traffic analysis, intruders still manage to gain access to and control sensitive systems. Regarded as a non-invasive solution, intrusion detection systems (IDS) can monitor and report anomalous activity or strange patterns. However, due to the lack of SCADA network traffic data, such IDS solutions are still primitive and based only on well-known vulnerabilities and attacks, whereas a dedicated IDS is necessary to properly protect SCADA in water distribution systems. This study highlights SCADA vulnerabilities and security issues through a qualitative approach, using known attacks and security examples as case studies, and aims to present scenarios on this issue as well as an overview of today’s SCADA vulnerabilities and main threats. Results show that Intrusion Detection Systems (IDS), with their approaches and types, which are also widely implemented in regular IT networks, help provide a higher security level and identify abnormal traffic data. Such systems have indeed shown a good success rate in identifying malicious traffic in SCADA networks, mainly because these networks have evolved toward Ethernet and open communication protocols. Given these distinctive characteristics, studying SCADA networks and their communication protocols is seen as a major factor in properly developing robust security mechanisms and tools.


2014 ◽  
Vol 5 (2) ◽  
pp. 39-53 ◽  
Author(s):  
Bachir Bahamida ◽  
Dalila Boughaci

Due to a growing number of intrusion events, organizations are increasingly implementing various intrusion detection systems that classify network traffic data as normal or anomalous. In this paper, three intrusion detection systems based on fuzzy meta-heuristics are proposed. The first one is a fuzzy stochastic local search (FSLS), the second is a fuzzy tabu search (FTS), and the third is a fuzzy differential evolution (FDE). These classifiers are built on a knowledge base modelled as fuzzy “if-then” rules. The main purpose of these methods is to obtain the highest-quality solutions by optimizing the generation of fuzzy rules. The proposed classifiers FSLS, FTS and FDE are tested on the benchmark KDD'99 intrusion dataset and compared with some well-known existing techniques for intrusion detection. The results show the efficiency of the proposed approaches in the intrusion detection field.
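
To make the idea of a fuzzy “if-then” rule concrete, here is a minimal sketch that assumes triangular membership functions and the min operator for AND. The feature names, the fuzzy set boundaries, and the helper names are hypothetical and are not taken from the paper. The meta-heuristics (FSLS, FTS, FDE) that search for good rules are not shown.

```python
def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Triangular membership: 0 at or outside left/right, 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def rule_strength(duration: float, byte_count: float) -> float:
    """Firing strength of one illustrative rule.

    IF duration is SHORT AND byte_count is HIGH THEN traffic is anomalous.
    The AND is modelled with min, a common (assumed) choice.
    """
    short_duration = triangular(duration, 0.0, 1.0, 5.0)
    high_bytes = triangular(byte_count, 1_000.0, 10_000.0, 20_000.0)
    return min(short_duration, high_bytes)
```

A classifier built this way would typically evaluate several such rules per class and pick the class whose rules fire most strongly; the search methods then tune which rules and boundaries are kept.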


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Sydney M. Kasongo ◽  
Yanxia Sun

Computer network intrusion detection systems (IDSs) and intrusion prevention systems (IPSs) are critical aspects that contribute to the success of an organization. Over the past years, IDSs and IPSs using different approaches have been developed and implemented to ensure that computer networks within enterprises are secure, reliable and available. In this paper, we focus on IDSs that are built using machine learning (ML) techniques. IDSs based on ML methods are effective and accurate in detecting network attacks. However, the performance of these systems decreases for high-dimensional data spaces. Therefore, it is crucial to implement an appropriate feature extraction method that can prune the features that do not have a great impact on the classification process. Moreover, many ML-based IDSs suffer from an increased false positive rate and low detection accuracy when the models are trained on highly imbalanced datasets. In this paper, we present an analysis of the UNSW-NB15 intrusion detection dataset, which is used for training and testing our models. Moreover, we apply a filter-based feature reduction technique using the XGBoost algorithm. We then implement the following ML approaches using the reduced feature space: Support Vector Machine (SVM), k-Nearest-Neighbour (kNN), Logistic Regression (LR), Artificial Neural Network (ANN) and Decision Tree (DT). In our experiments, we considered both the binary and multiclass classification configurations. The results demonstrated that the XGBoost-based feature selection method allows methods such as the DT to increase their test accuracy from 88.13% to 90.85% for the binary classification scheme.
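
The filter step can be sketched as follows, assuming that per-feature importance scores have already been obtained from a trained gradient-boosted model and that features are kept when their score reaches a fixed cut-off. The function name, the threshold, and the example feature names are hypothetical; the abstract does not state how the cut-off was chosen.

```python
from typing import Dict, List


def select_features(importances: Dict[str, float], threshold: float) -> List[str]:
    """Keep features whose importance is at least `threshold`.

    `importances` maps feature name to an importance score (assumed to
    come from a previously trained model). The result is ordered from
    most to least important.
    """
    kept = [(name, score) for name, score in importances.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


# Made-up scores, for illustration only.
example_scores = {"feature_a": 0.50, "feature_b": 0.05, "feature_c": 0.30}
selected = select_features(example_scores, threshold=0.10)
```

The downstream classifiers (SVM, kNN, LR, ANN, DT) would then be trained only on the columns named in `selected`.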

