Machine Learning Techniques for Intrusion Detection

This chapter proposes a hybrid classifier for network intrusion detection systems that combines Random Forest classification with the K-Means and Gaussian Mixture clustering algorithms. In misuse detection, the Random Forest builds patterns of intrusion from training data, while in anomaly detection, intrusions are identified by an outlier-detection mechanism. The proposed method is implemented and simulated under varying threshold values, and its effectiveness is evaluated with metrics such as precision, recall, accuracy rate, false alarm rate, and detection rate. Various existing algorithms are analyzed extensively. Experiments show that the proposed method gives superior results compared to existing simpler classifiers as well as existing hybrid classifier techniques: it achieves an accuracy of 99.84%, a false alarm rate of 0.09%, and a detection rate of 99.7%.
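
The evaluation metrics named in the abstract above can be computed from confusion-matrix counts. The following is a minimal sketch under the common definitions, with attack traffic as the positive class. The function name `ids_metrics` and the sample counts are illustrative assumptions, and the chapter itself may define detection rate differently from recall.

```python
def ids_metrics(tp, fp, tn, fn):
    """Compute common IDS metrics from confusion-matrix counts.

    Attack traffic is treated as the positive class (an assumption).
    """
    def ratio(num, den):
        # Avoid division by zero when a class is absent from the data.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),  # often reported as detection rate
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "false_alarm_rate": ratio(fp, fp + tn),
    }


# Illustrative counts only, not results from the chapter.
metrics = ids_metrics(tp=90, fp=5, tn=95, fn=10)
```

Note that the false alarm rate is normalized by the amount of normal traffic (`fp + tn`), so a very low value can still mean many alerts when normal traffic dominates.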


Security of Things Intrusion Detection System for Smart Healthcare

Electronics ◽

10.3390/electronics10121375 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1375

Author(s):

Celestine Iwendi ◽

Joseph Henry Anajemba ◽

Cresantus Biamba ◽

Desire Ngabo

Keyword(s):

Genetic Algorithm ◽

Intrusion Detection ◽

False Alarm ◽

False Alarm Rate ◽

Detection Rate ◽

Web Security ◽

Intrusion Detection Systems ◽

High Detection Rate ◽

Detection Systems ◽

Smart Healthcare

Web security plays a crucial role in the Security of Things (SoT) paradigm for smart healthcare and will continue to have an impact on medical infrastructures in the near future. This paper addresses a key component of security, intrusion detection systems, because the number of web security attacks in healthcare, as well as privacy issues, has increased dramatically in recent years. Various intrusion detection systems have been proposed in different works to detect cyber threats in smart healthcare and to identify network-based attacks and privacy violations. This study was motivated by the limitations of existing intrusion detection systems in responding to attacks and in implementing privacy control in the smart healthcare industry. The research proposes a machine learning support system that combines a Random Forest (RF) with a genetic algorithm, a feature optimization method, to build new intrusion detection systems with a high detection rate and a more accurate false alarm rate. To optimize the approach, a weighted genetic algorithm and RF were combined to generate the best feature subset, one that achieves a high detection rate and a low false alarm rate. The study used the NSL-KDD dataset to evaluate RF, Naive Bayes (NB) and logistic regression classifiers side by side. The results confirmed the importance of feature optimization, which gave better results in terms of false alarm rate, precision, detection rate, recall and F1 metrics. The combination of the genetic algorithm and RF models achieved a detection rate of 98.81% and a false alarm rate of 0.8%. This research raised awareness of privacy and authentication in the smart healthcare domain, wireless communications and privacy control, and developed the necessary intelligent and efficient web system. Furthermore, the proposed algorithm was applied to examine F1-score and precision performance on the NSL-KDD and CSE-CIC-IDS2018 datasets using different scaling factors. The results showed that the proposed GA improved the average precision by 5.65% and the average F1-score by 8.2%.
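
The idea of searching for a feature subset with a genetic algorithm can be sketched as below. This is a generic GA over 0/1 feature masks, not the authors' weighted algorithm. The names `ga_select` and `toy_fitness` are hypothetical, and the toy fitness function is a stand-in for what would, in a setup like the paper's, be a score obtained by training a Random Forest on the selected features.

```python
import random


def ga_select(n_features, fitness, pop_size=20, generations=30,
              mutation_rate=0.1, seed=0):
    """Search for a 0/1 feature mask that maximises `fitness`.

    Requires n_features >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    # Start from the full feature set plus random masks.
    population = [[1] * n_features]
    population += [[rng.randint(0, 1) for _ in range(n_features)]
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]        # truncation selection
        children = [ranked[0][:]]               # elitism: keep the best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


def toy_fitness(mask):
    """Stand-in score: reward two 'informative' features, penalise size."""
    informative = (0, 2)  # hypothetical indices
    return sum(mask[i] for i in informative) - 0.1 * sum(mask)


best_mask = ga_select(6, toy_fitness)
```

Because the best mask of each generation is carried over unchanged, the returned mask never scores worse than the full feature set that seeds the population.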


RFCell: A Gene Selection Approach for scRNA-seq Clustering Based on Permutation and Random Forest

Frontiers in Genetics ◽

10.3389/fgene.2021.665843 ◽

2021 ◽

Vol 12 ◽

Author(s):

Yuan Zhao ◽

Zhao-Yu Fang ◽

Cui-Xiang Lin ◽

Chao Deng ◽

Yun-Pei Xu ◽

...

Keyword(s):

Random Forest ◽

Single Cell ◽

Gene Selection ◽

Clustering Algorithms ◽

Selection Methods ◽

Clustering Methods ◽

Cell Type ◽

Cell Type Specificity ◽

Random Forest Classification ◽

Forest Classification

In recent years, single-cell RNA-seq (scRNA-seq) has become increasingly popular in fields such as biology and medical research. Analyzing scRNA-seq data can reveal complex cell populations and infer single-cell trajectories in cell development. Clustering is one of the most important methods for analyzing scRNA-seq data. In this paper, we focus on improving scRNA-seq clustering through gene selection, which also reduces the dimensionality of scRNA-seq data. Studies have shown that gene selection for scRNA-seq data can improve clustering accuracy. Therefore, it is important to select genes with cell type specificity. Gene selection not only helps to reduce the dimensionality of scRNA-seq data, but can also improve cell type identification in combination with clustering methods. Here, we propose RFCell, a supervised gene selection method based on permutation and random forest classification. We first use RFCell and three existing gene selection methods to select gene sets on 10 scRNA-seq data sets. Then, three classical clustering algorithms are used to cluster the cells using the gene sets obtained by these gene selection methods. We found that the gene selection performance of RFCell was better than that of the other gene selection methods.


An Intelligent Approach for Intrusion Detection using Convolutional Neural Network

Journal of Network Security Computer Networks ◽

10.46610/jonscn.2022.v08i01.001 ◽

2022 ◽

Vol 8 (1) ◽

Author(s):

P. Manoj Kumar ◽

M. Parvathy ◽

C. Abinaya Devi

Keyword(s):

Neural Network ◽

Intrusion Detection ◽

False Alarm ◽

Convolutional Neural Network ◽

False Alarm Rate ◽

Real Time ◽

Detection Rate ◽

Performance Metrics ◽

Classification Model ◽

Real Time Traffic

Intrusion Detection Systems (IDS) are an important aspect of cyber security that can detect anomalies in network traffic. IDS form part of the second line of defense of a system and can be deployed along with other security measures such as access control, authentication mechanisms and encryption techniques to secure systems against cyber-attacks. However, IDS struggle to handle large volumes of data and to detect zero-day attacks (new types of attacks) in a real-time traffic environment. To overcome this problem, an intelligent Deep Learning approach for intrusion detection based on a Convolutional Neural Network (CNN-IDS) is proposed. Initially, the model is trained and tested on a new real-time traffic dataset, the CSE-CIC-IDS 2018 dataset. Then, the performance of the CNN-IDS model is studied based on three important performance metrics, namely accuracy / training time, detection rate and false alarm rate. Finally, the experimental results are compared with those of various deep discriminative models proposed for IDS on the same dataset, including the Recurrent Neural Network (RNN) and the Deep Neural Network (DNN). The comparative results show that the proposed CNN-IDS model is well suited to both binary and multi-class classification, with a higher detection rate, higher accuracy, and a lower false alarm rate. The CNN-IDS model improves the accuracy of intrusion detection and provides a new research method for intrusion detection.


CHIRPS: Explaining random forest classification

Artificial Intelligence Review ◽

10.1007/s10462-020-09833-6 ◽

2020 ◽

Vol 53 (8) ◽

pp. 5747-5788

Author(s):

Julian Hatwell ◽

Mohamed Medhat Gaber ◽

R. Muhammad Atif Azad

Keyword(s):

Random Forest ◽

Pattern Mining ◽

Frequent Pattern Mining ◽

Training Data ◽

Frequent Pattern ◽

Data Sets ◽

Random Forest Classification ◽

Human In The Loop ◽

Forest Classification ◽

Unseen Data

Modern machine learning methods typically produce “black box” models that are opaque to interpretation. Yet, demand for them has been increasing in Human-in-the-Loop processes, that is, those processes that require a human agent to verify, approve or reason about the automated decisions before they can be applied. To facilitate this interpretation, we propose Collection of High Importance Random Path Snippets (CHIRPS), a novel algorithm for explaining random forest classification per data instance. CHIRPS extracts a decision path from each tree in the forest that contributes to the majority classification, and then uses frequent pattern mining to identify the most commonly occurring split conditions. Then a simple, conjunctive form rule is constructed where the antecedent terms are derived from the attributes that had the most influence on the classification. This rule is returned alongside estimates of the rule’s precision and coverage on the training data, along with counter-factual details. An experimental study involving nine data sets shows that classification rules returned by CHIRPS have a precision at least as high as the state of the art when evaluated on unseen data (0.91–0.99) and offer a much greater coverage (0.04–0.54). Furthermore, CHIRPS uniquely controls against under- and over-fitting solutions by maximising novel objective functions that are better suited to the local (per instance) explanation setting.
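
The precision and coverage of a conjunctive rule, as mentioned in the abstract above, can be estimated as in the following sketch. It is not the CHIRPS implementation. The rule encoding as `(feature, operator, threshold)` triples, the function names, and the tiny data set are all assumptions made for illustration.

```python
import operator

# Split conditions as they typically appear in decision-tree paths.
OPS = {"<=": operator.le, ">": operator.gt}


def rule_covers(rule, x):
    """True if instance `x` (a dict) satisfies every term of the rule."""
    return all(OPS[op](x[feat], thr) for feat, op, thr in rule)


def precision_and_coverage(rule, X, y, target):
    """Precision: share of covered instances labelled `target`.
    Coverage: share of all instances that the rule covers."""
    covered = [label for x, label in zip(X, y) if rule_covers(rule, x)]
    coverage = len(covered) / len(X)
    if not covered:
        return 0.0, coverage
    precision = sum(1 for label in covered if label == target) / len(covered)
    return precision, coverage


# Hypothetical data: four instances with two numeric attributes.
X = [{"a": 1, "b": 5}, {"a": 2, "b": 1}, {"a": 3, "b": 6}, {"a": 0, "b": 7}]
y = ["p", "n", "p", "n"]
rule = [("b", ">", 4)]  # one-term conjunctive rule
precision, coverage = precision_and_coverage(rule, X, y, target="p")
```

Adding terms to the antecedent generally trades coverage for precision, which is why the two estimates are reported together.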


The Use of Ensemble Models for Multiple Class and Binary Class Classification for Improving Intrusion Detection Systems

Sensors ◽

10.3390/s20092559 ◽

2020 ◽

Vol 20 (9) ◽

pp. 2559 ◽

Cited By ~ 9

Author(s):

Celestine Iwendi ◽

Suleman Khan ◽

Joseph Henry Anajemba ◽

Mohit Mittal ◽

Mamdouh Alenezi ◽

...

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

False Alarm ◽

False Alarm Rate ◽

Detection Rate ◽

Detection System ◽

Denial Of Service ◽

Intrusion Detection Systems ◽

Ensemble Models ◽

Detection Systems

The pursuit of spotting abnormal behavior in and out of a network system led to intrusion detection systems for soft computing, and many researchers have applied machine learning in this area. A single classifier alone appears insufficient to control network intruders. This limitation led us to perform dimensionality reduction by means of a correlation-based feature selection approach (CFS approach) in addition to a refined ensemble model. The paper aims to improve the Intrusion Detection System (IDS) by proposing a CFS + Ensemble Classifiers (Bagging and Adaboost) model that has high accuracy, a high packet detection rate, and a low false alarm rate. Machine learning ensemble models with base classifiers (J48, Random Forest, and Reptree) were built. Binary and multiclass classification were performed on the KDD99 and NSLKDD datasets; for binary classification, all attacks were labeled as anomaly and the remaining traffic as normal. For multiclass classification, the class labels consisted of four major attack types, namely Denial of Service (DoS), Probe, User-to-Root (U2R) and Remote-to-Local (R2L), plus the Normal class. Results from the experiment showed that the proposed model produces a false alarm rate (FAR) of 0 and a detection rate (DR) of 99.90% for the KDD99 dataset, and a FAR of 0.5% and a DR of 98.60% for the NSLKDD dataset, when working with 6 and 13 selected features.
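
Correlation-based feature selection is usually described with a merit score that rewards features correlated with the class and penalises features correlated with each other. The abstract above does not state the formula, so the sketch below assumes the standard merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The function name and the input values are illustrative.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a k-feature subset under the standard CFS heuristic."""
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Four features, each moderately predictive of the class (made-up values).
uncorrelated = cfs_merit(4, 0.5, 0.0)  # features carry independent signal
redundant = cfs_merit(4, 0.5, 1.0)     # features duplicate one another
```

With fully redundant features the merit falls back to that of a single feature, which is the behaviour that lets the search discard duplicates.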


Predicting the outcrop of pre-Quaternary formations in the Dorog Basin (Hungary) using random forest classification

10.5194/egusphere-egu2020-7255 ◽

2020 ◽

Author(s):

Reka Pogacsas ◽

Gaspar Albert

Keyword(s):

Remote Sensing ◽

Random Forest ◽

Slope Angle ◽

Morphological Characteristics ◽

Training Data ◽

Topographic Wetness Index ◽

Unique Region ◽

Random Forest Classification ◽

Forest Classification ◽

Geological Map

The Dorog Basin is a morphologically unique region of the Transdanubian Mountains revealing the combined work of tectonic forces and erosion. Overprinted by the forms of fluvial erosion, numerous NW-SE striking half-graben and horst structures are present. The surface is dominantly covered by loose, 1–15 m thick Quaternary sediments (aeolian loess, and siliciclastic alluvial and colluvial formations), while the lithified bedrock consists of Mesozoic carbonates, Paleogene limestones, marls and sandstones and limnic coal sequences. The rheological difference between the Quaternary and pre-Quaternary formations is so pronounced that the morphological characteristics of the outcrops also differ significantly. The area has been the focus of geologists for many decades due to its Eocene coal beds, and a renewal of the geological map of the region is in progress. The current research aims to assist the mapping with multivariate methods based on geomorphological attributes, such as slope angle, aspect, profile curvature, height, and topographic wetness index. We perform a random forest classification (RFC) using these variables to predict the outcrops of pre-Quaternary formations in the study area. Random forest is a powerful tool for multivariate classification that uses several decision trees, each producing a prediction; the most frequent prediction becomes the overall result [1]. It is becoming popular in spatial prediction because of its high accuracy in classifying raster-type objects [2]. We used raster-type spatial data as the subject of the RFC, predicting a result for each pixel. The geology of the study area was known from previous geological mapping [3]. Morphological information was derived from the MERIT DEM. Our model used a raster with multiple bands containing geomorphological variables, and training data from the digitalized geological map. The number of random samples of data was 2500. After testing several combinations of the bands and several spacings of the study areas, the best prediction has approximately 80% accuracy. Model validation is based on the rate of correctly predicted pixels in the same rasterized geological map that was used for training. Our aim was to use exact data, which is completely true for remotely sensed images, but not for geological maps. That means the accuracy can still be improved by field observation or borehole data. References: [1] Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18-22. [2] Belgiu, M., & Drăguţ, L. (2016). Random forest in remote sensing: A review of applications and future directions. ISPRS Journal of Photogrammetry and Remote Sensing, 114, 24-31. [3] Gidai, L., Nagy, G., & Siposs, Z. (1981). Geological map of the Dorog Basin 1: 25 000. [in Hungarian] Geological Institute of Hungary, Budapest.
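
The voting step described in the abstract above, where each tree predicts a class for a pixel and the most frequent class wins, can be illustrated as follows. This is a minimal sketch of majority voting only, not of tree training or of the authors' R workflow. The class labels `"Q"` and `"pQ"` and the per-tree predictions are made up.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree predictions into one class label per pixel.

    `tree_predictions` holds one list per tree, each with one label per pixel.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*tree_predictions)]


# Three hypothetical trees voting on three pixels
# ("Q" = Quaternary cover, "pQ" = pre-Quaternary outcrop).
trees = [
    ["Q", "pQ", "Q"],
    ["Q", "pQ", "pQ"],
    ["pQ", "pQ", "Q"],
]
pixel_classes = majority_vote(trees)
```

Using an odd number of trees avoids ties in a two-class problem such as this one.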


Analysis of NSL KDD Dataset Using Classification Algorithms for Intrusion Detection System

Recent Patents on Engineering ◽

10.2174/1872212112666180402122150 ◽

2019 ◽

Vol 13 (2) ◽

pp. 142-147

Author(s):

Srishti Sharma ◽

Yogita Gigras ◽

Rita Chhikara ◽

Anuradha Dhull

Keyword(s):

Random Forest ◽

Intrusion Detection ◽

Detection System ◽

Random Trees ◽

Attribute Selection ◽

Classification Algorithms ◽

Random Forest Classification ◽

Detection Systems ◽

Forest Classification ◽

Feature Attribute

Background: Intrusion detection systems are responsible for detecting anomalies and network attacks. Building an effective IDS depends on a readily available dataset, which is used to train and test the intelligent IDS. In this research, the NSL KDD dataset (an improvement over the original KDD Cup 1999 dataset) is used, because KDD’99 contains a huge number of redundant records, which makes it difficult to process the data accurately. Methods: The classification techniques applied to this dataset are decision-tree methods, namely J48, Random Forest and Random Trees. Results: In a comparison of these three classification algorithms, Random Forest produced the best results, and the Random Forest classification method was therefore used to analyze the data further. The results are analyzed and presented in this paper with the help of feature/attribute selection by applying all possible combinations. Conclusion: A total of eight significant attributes were selected after applying various attribute selection methods to the NSL KDD dataset.


On the Detection Capabilities of Signature-Based Intrusion Detection Systems in the Context of Web Attacks

Applied Sciences ◽

10.3390/app12020852 ◽

2022 ◽

Vol 12 (2) ◽

pp. 852

Author(s):

Jesús Díaz-Verdejo ◽

Javier Muñoz-Calle ◽

Antonio Estepa Alonso ◽

Rafael Estepa Alonso ◽

Germán Madinabeitia

Keyword(s):

Intrusion Detection ◽

False Alarm ◽

False Alarm Rate ◽

Detection Rate ◽

Real Life ◽

Intrusion Detection Systems ◽

Operational Environment ◽

Detection Systems ◽

Web Attacks ◽

Trade Offs

Signature-based Intrusion Detection Systems (SIDS) play a crucial role within the arsenal of security components of most organizations. They can find traces of known attacks in the network traffic or host events for which patterns or signatures have been pre-established. SIDS include standard packages of detection rulesets, but only those rules suited to the operational environment should be activated for optimal performance. However, some organizations might skip this tuning process and instead activate default off-the-shelf rulesets without understanding their implications and trade-offs. In this work, we help gain insight into the consequences of using predefined rulesets for the performance of SIDS. We experimentally explore the performance of three SIDS in the context of web attacks. In particular, we gauge the detection rate obtained with predefined subsets of rules for Snort, ModSecurity and Nemesida using seven attack datasets. We also determine the precision and rate of alerts generated by each detector in a real-life case using a large trace from a public webserver. Results show that the maximum detection rate achieved by the SIDS under test is insufficient to protect systems effectively and is lower than expected for known attacks. Our results also indicate that the choice of predefined settings activated on each detector strongly influences its detection capability and false alarm rate. Snort and ModSecurity scored either a very poor detection rate (activating the less-sensitive predefined ruleset) or a very poor precision (activating the full ruleset). We also found that using various SIDS for a cooperative decision can improve the precision or the detection rate, but not both. Consequently, it is necessary to reflect upon the role of these open-source SIDS with default configurations as core elements for protection in the context of web attacks. Finally, we provide an efficient method for systematically determining which rules to deactivate from a ruleset to significantly reduce the false alarm rate for a target operational environment. We tested our approach using Snort’s ruleset on our real-life trace, increasing the precision from 0.015 to 1 in less than 16 h of work.


Intrusion Detection Based on Approximate Information Entropy for Random Forest Classification

Proceedings of the 2019 4th International Conference on Big Data and Computing - ICBDC 2019 ◽

10.1145/3335484.3335488 ◽

2019 ◽

Author(s):

Le Yang ◽

Manchun Cai ◽

Yongcheng Duan ◽

Xue Yang

Keyword(s):

Random Forest ◽

Intrusion Detection ◽

Information Entropy ◽

Random Forest Classification ◽

Forest Classification


Random forest classification of morphology in the northern Gerecse (Hungary) to predict landslide-prone slopes

10.5194/egusphere-egu2020-8365 ◽

2020 ◽

Author(s):

Gáspár Albert ◽

Dávid Gerzsenyi

Keyword(s):

Random Forest ◽

Decision Trees ◽

Training Data ◽

Danube River ◽

Predictor Variables ◽

Geological Data ◽

Random Forest Classification ◽

Fluvial Terraces ◽

Forest Classification ◽

Geological Map

The morphology of the Gerecse Hills bears the imprints of fluvial terraces of the Danube River, Neogene tectonism and Quaternary erosion. The solid bedrocks are composed of Mesozoic and Paleogene limestones, marls, and sandstones, and are covered by 115 m thick layers of unconsolidated Quaternary fluvial, lacustrine, and aeolian sediments. Hillslopes, stream valleys, and loessy riverside bluffs are prone to landslides, which have caused serious damage in inhabited and agricultural areas in the past. Attempts to map these landslides have been made, and the observations have been documented in the National Landslide Cadastre (NLC) inventory since the 1970s. These records are sporadic, concentrating only on certain locations, and they often refer inaccurately to the state and extent of the landslides. The aim of the present study was to complete and correct the landslide inventory by using quantitative modelling. In the 480 sq. km study area, all records of the inventory were revisited and corrected. Using objective criteria, the renewed records and additional sample locations were sorted into one of the following morphological categories: scarps, debris, transitional area, stable accumulation areas, stable hilltops, and stable slopes. The categorized map of these observations served as training data for the random forest classification (RFC). Random forest is a powerful tool for multivariate classification that uses several decision trees. In our case, the predictions were made for each pixel of medium-resolution (~10 m) rasters. The predictor variables of the decision trees were morphometric and geological indices. The terrain indices were derived from the MERIT DEM with SAGA routines, and the categorized geological data are from a medium-scale geological map [1]. The predictor variables were packed in a multi-band raster, and the RFC method was executed using R 3.5 with RStudio. After testing several combinations of the predictor variables and two different categorisations of the geological data, the best prediction has approximately 80% accuracy. The validation of the model is based on the rate of well-predicted pixels compared to the total cell count of the training data. The results showed that the probable location of landslide-prone slopes is not restricted to the areas recorded in the National Landslide Cadastre inventory. Based on the model, only ~6% of the estimated location of the highly unstable slopes (scarps) falls within the NLC polygons in the study area. The project was partly supported by the Thematic Excellence Program, Industry and Digitization Subprogram, NRDI Office, project no. ED_18-1-2019-0030 (from the part of G. Albert) and the ÚNKP-19-3 New National Excellence Program of the Ministry for Innovation and Technology (from the part of D. Gerzsenyi). Reference: [1] Gyalog L., and Síkhegyi F., eds. Geological map of Hungary (scale: 1:100 000). Budapest, Hungary, Geological Institute of Hungary, 2005.
