Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced Dataset

2020 ◽  
Vol 10 (2) ◽  
pp. 1-26
Author(s):  
Naghmeh Moradpoor Sheykhkanloo ◽  
Adam Hall

An insider threat can take many forms and fall under different categories. These include the malicious insider, the careless, unaware, uneducated, or naïve employee, and the third-party contractor. Machine learning techniques have been studied in published literature as a promising solution for such threats. However, they can be biased and/or inaccurate when the associated dataset is hugely imbalanced. Therefore, this article addresses insider threat detection on an extremely imbalanced dataset, which includes employing a popular balancing technique known as spread subsample. The results show that although balancing the dataset using this technique did not improve performance metrics, it did improve the time taken to build the model and the time taken to test the model. Additionally, the authors found that running the chosen classifiers with parameters other than the default ones has an impact in both balanced and imbalanced scenarios, but the impact is significantly stronger when using the imbalanced dataset.
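
The abstract names spread subsample without describing its mechanics. The following is a minimal sketch, assuming the technique behaves as a random undersampler that caps every class at a multiple of the rarest class's size. The function name, the `max_spread` parameter, and the toy data are hypothetical and are not taken from the article.

```python
import random
from collections import defaultdict


def spread_subsample(rows, labels, max_spread=1.0, seed=0):
    """Randomly undersample so that no class keeps more than
    max_spread times as many rows as the rarest class (assumed behaviour)."""
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # The rarest class sets the cap for every other class.
    smallest = min(len(members) for members in by_class.values())
    cap = int(smallest * max_spread)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Classes at or under the cap are kept whole; larger ones are sampled.
        kept = members if len(members) <= cap else rng.sample(members, cap)
        out_rows.extend(kept)
        out_labels.extend([label] * len(kept))
    return out_rows, out_labels


# Hypothetical toy data: 1000 "normal" events and 10 "malicious" ones.
rows = list(range(1010))
labels = ["normal"] * 1000 + ["malicious"] * 10
balanced_rows, balanced_labels = spread_subsample(rows, labels, max_spread=1.0)
```

With `max_spread=1.0` both classes end up the same size. The much smaller training set is consistent with the shorter build and test times the abstract reports, although the sketch itself measures nothing.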

Author(s):  
Mohsen Kamyab ◽  
Stephen Remias ◽  
Erfan Najmi ◽  
Kerrick Hood ◽  
Mustafa Al-Akshar ◽  
...  

According to the Federal Highway Administration (FHWA), US work zones on freeways account for nearly 24% of nonrecurring freeway delays and 10% of overall congestion. Historically, there have been limited scalable datasets to investigate the specific causes of congestion due to work zones or to improve work zone planning processes to characterize the impact of work zone congestion. In recent years, third-party data vendors have provided scalable speed data from Global Positioning System (GPS) devices and cell phones which can be used to characterize mobility on all roadways. Each work zone has unique characteristics and varying mobility impacts which are predicted during the planning and design phases, but can realistically be quite different from what is ultimately experienced by the traveling public. This paper uses these datasets to introduce a scalable Work Zone Mobility Audit (WZMA) template. Additionally, the paper uses metrics developed for individual work zones to characterize the impact of more than 250 work zones varying in length and duration from Southeast Michigan. The authors make recommendations to work zone engineers on useful data to collect for improving the WZMA. As more systematic work zone data are collected, improved analytical assessment techniques, such as machine learning processes, can be used to identify the factors that will predict future work zone impacts. The paper concludes by demonstrating two machine learning algorithms, Random Forest and XGBoost, which show historical speed variation is a critical component when predicting the mobility impact of work zones.


2021 ◽  
Vol 11 (7) ◽  
pp. 3130
Author(s):  
Janka Kabathova ◽  
Martin Drlik

Early and precise prediction of student dropout from available educational data is a widespread research topic in learning analytics. Despite the amount of research already carried out, progress is not significant, and this holds on all educational data levels. Even though various features have already been researched, it remains an open question which features can be considered appropriate for different machine learning classifiers applied to the typically scarce set of educational data at the e-learning course level. Therefore, the main goal of the research is to emphasize the importance of the data understanding and data gathering phases, stress the limitations of the available datasets of educational data, compare the performance of several machine learning classifiers, and show that even a limited set of features available to teachers in an e-learning course can predict student dropout with sufficient accuracy if the performance metrics are thoroughly considered. The data collected from four academic years were analyzed. The features selected in this study proved to be applicable in predicting course completers and non-completers. The prediction accuracy varied between 77 and 93% on unseen data from the next academic year. In addition to the frequently used performance metrics, the homogeneity of the machine learning classifiers was compared to overcome the impact of the limited size of the dataset on the high values of performance metrics obtained. The results showed that several machine learning algorithms could be successfully applied to a scarce dataset of educational data. At the same time, classification performance metrics should be thoroughly considered before deciding to deploy the best-performing classification model to predict potential dropout cases and design beneficial intervention mechanisms.


2021 ◽  
Vol 10 (7) ◽  
pp. 436
Author(s):  
Amerah Alghanim ◽  
Musfira Jilani ◽  
Michela Bertolotto ◽  
Gavin McArdle

Volunteered Geographic Information (VGI) is often collected by non-expert users. This raises concerns about the quality and veracity of such data. There has been much effort to understand and quantify the quality of VGI. Extrinsic measures which compare VGI to authoritative data sources such as National Mapping Agencies are common but the cost and slow update frequency of such data hinder the task. On the other hand, intrinsic measures which compare the data to heuristics or models built from the VGI data are becoming increasingly popular. Supervised machine learning techniques are particularly suitable for intrinsic measures of quality where they can infer and predict the properties of spatial data. In this article we are interested in assessing the quality of semantic information, such as the road type, associated with data in OpenStreetMap (OSM). We have developed a machine learning approach which utilises new intrinsic input features collected from the VGI dataset. Specifically, using our proposed novel approach we obtained an average classification accuracy of 84.12%. This result outperforms existing techniques on the same semantic inference task. The trustworthiness of the data used for developing and training machine learning models is important. To address this issue we have also developed a new measure for this using direct and indirect characteristics of OSM data such as its edit history along with an assessment of the users who contributed the data. An evaluation of the impact of data determined to be trustworthy within the machine learning model shows that the trusted data collected with the new approach improves the prediction accuracy of our machine learning technique. Specifically, our results demonstrate that the classification accuracy of our developed model is 87.75% when applied to a trusted dataset and 57.98% when applied to an untrusted dataset. 
Consequently, such results can be used to assess the quality of OSM and suggest improvements to the data set.


2021 ◽  
Vol 35 (1) ◽  
pp. 11-21
Author(s):  
Himani Tyagi ◽  
Rajendra Kumar

IoT is characterized by communication between things (devices) that constantly share data, analyze, and make decisions while connected to the internet. This interconnected architecture attracts cyber criminals who seek to expose IoT systems to failure. It is therefore imperative to develop a system that can accurately and automatically detect anomalies and attacks occurring in IoT networks. In this paper, an Intrusion Detection System (IDS) based on a novel feature set extracted by synthesizing the BoT-IoT dataset is developed that can swiftly, accurately, and automatically differentiate benign and malicious traffic. Instead of using available feature reduction techniques like PCA, which can change the core meaning of variables, a unique feature set consisting of only seven lightweight features is developed that is also IoT specific and attack traffic independent. The results of the study demonstrate the effectiveness of the seven fabricated features in detecting four types of attack, namely DDoS, DoS, Reconnaissance, and Information Theft. Furthermore, this study also proves the applicability and efficiency of supervised machine learning algorithms (KNN, LR, SVM, MLP, DT, RF) in IoT security. The performance of the proposed system is validated using performance metrics such as accuracy, precision, recall, F-score, and ROC. Although the accuracy of the Decision Tree (99.9%) and Random Forest (99.9%) classifiers is the same, other metrics such as training and testing time show Random Forest to be comparatively better.
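
The abstract validates its classifiers with accuracy, precision, recall, and F-score. As a rough illustration of how these four metrics relate to one another, the sketch below derives them from binary predictions. The function name and the toy labels are hypothetical and do not come from the BoT-IoT experiments.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = attack traffic, 0 = benign traffic.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.75 while precision, recall, and F1 are each 2/3. This shows why two classifiers with equal accuracy can still differ on the other metrics.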


Author(s):  
M. M. Ata ◽  
K. M. Elgamily ◽  
M. A. Mohamed

The presented paper proposes an algorithm for palmprint recognition using seven different machine learning algorithms. First, we propose a region of interest (ROI) extraction methodology based on a two key points technique. Second, we apply image enhancement techniques such as edge detection and morphological operations to make the ROI image more suitable for the Hough transform. We then apply the Hough transform to extract all the possible principal lines on the ROI images, and extract the most salient morphological features of those lines: slope and length. Furthermore, we apply the invariant moments algorithm to produce 7 appropriate hues of interest. Finally, after forming the complete hybrid feature vectors, we apply different machine learning algorithms to recognize palmprints effectively. Recognition performance has been tested by calculating precision, sensitivity, specificity, accuracy, Dice and Jaccard coefficients, correlation coefficients, and training time. Seven different supervised machine learning algorithms have been implemented and utilized. The effect of forming the proposed hybrid feature vectors from the Hough transform and invariant moments has been tested. Experimental results show that the feed forward neural network with back propagation achieved the best result among all tested machine learning techniques, with about 99.99% recognition accuracy.


Sales forecasting is important for companies engaged in retailing, logistics, manufacturing, marketing, and wholesaling. It allows companies to allocate resources efficiently, to estimate sales revenue, and to plan strategies that are better for the company’s future. In this paper, product sales from a particular store are predicted in a way that produces better performance compared to the other machine learning algorithms considered. The dataset used for this project is the Big Mart Sales data of 2013. Nowadays, shopping malls and supermarkets keep track of the sales data of each individual item to predict future customer demand. This data includes a large amount of customer data and item attributes. Frequent patterns are detected by mining the data from the data warehouse. The data can then be used to predict future sales with the help of several machine learning techniques (algorithms) for companies like Big Mart. In this project, we propose a model using the XGBoost algorithm for predicting sales of companies like Big Mart and found that it produces better performance compared to other existing models. An analysis of this model against other models in terms of their performance metrics is made in this project. Big Mart is an online marketplace where people can buy, sell, or advertise merchandise at low cost. The goal of the paper is to make Big Mart a shopping paradise for buyers and a marketing solution for sellers as well. The ultimate aim is the complete satisfaction of the customers. The project “SUPERMARKET SALES PREDICTION” builds a predictive model and finds out the sales of each product at a particular store. Big Mart uses this model to understand the properties of the products that play a major role in increasing sales. This can also be done on the basis of hypotheses that should be formed before looking at the data.


Materials ◽  
2022 ◽  
Vol 15 (2) ◽  
pp. 647
Author(s):  
Meijun Shang ◽  
Hejun Li ◽  
Ayaz Ahmad ◽  
Waqas Ahmad ◽  
Krzysztof Adam Ostrowski ◽  
...  

Environment-friendly concrete is gaining popularity these days because it consumes less energy and causes less damage to the environment. Rapid increases in the population and demand for construction throughout the world lead to a significant deterioration or reduction in natural resources. Meanwhile, construction waste continues to grow at a high rate as older buildings are destroyed and demolished. As a result, the use of recycled materials may contribute to improving the quality of life and preventing environmental damage. Additionally, the application of recycled coarse aggregate (RCA) in concrete is essential for minimizing environmental issues. The compressive strength (CS) and splitting tensile strength (STS) of concrete containing RCA are predicted in this article using decision tree (DT) and AdaBoost machine learning (ML) techniques. A total of 344 data points with nine input variables (water, cement, fine aggregate, natural coarse aggregate, RCA, superplasticizers, water absorption of RCA, maximum size of RCA, and density of RCA) were used to run the models. The data was validated using k-fold cross-validation and the coefficient of determination (R2), mean square error (MSE), mean absolute error (MAE), and root mean square error (RMSE) values. The models’ performance was also assessed using statistical checks. Additionally, sensitivity analysis was used to determine the impact of each variable on the forecasting of mechanical properties.
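
The abstract lists R2, MSE, MAE, and RMSE as validation metrics. The sketch below shows how these are commonly computed, assuming R2 means the coefficient of determination. The function name and the strength values are hypothetical and are not drawn from the article's 344 data points.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSE and R2 for paired measured/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(mse)

    # R2 = 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical compressive strengths in MPa (measured vs. predicted).
measured = [30.0, 40.0, 50.0, 60.0]
predicted = [32.0, 38.0, 53.0, 59.0]
scores = regression_metrics(measured, predicted)
```

For these toy values the MSE is 4.5, the MAE is 2.0, and R2 is 0.964. In a k-fold setting the same function would be applied to each held-out fold and the scores averaged.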


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1258
Author(s):  
Taher Al-Shehari ◽  
Rakan A. Alsowail

Insider threats are malicious acts that can be carried out by an authorized employee within an organization. Insider threats represent a major cybersecurity challenge for private and public organizations, as an insider attack can cause far more extensive damage to organization assets than external attacks. Most existing approaches in the field of insider threat have focused on detecting general insider attack scenarios. However, insider attacks can be carried out in different ways, and the most dangerous one is a data leakage attack that can be executed by a malicious insider before leaving an organization. This paper proposes a machine learning-based model for detecting such serious insider threat incidents. The proposed model addresses the possible bias of detection results that can occur due to an inappropriate encoding process by employing feature scaling and one-hot encoding techniques. Furthermore, the imbalance issue of the utilized dataset is addressed using the synthetic minority oversampling technique (SMOTE). Well-known machine learning algorithms are employed to identify the most accurate classifier for detecting data leakage events executed by malicious insiders during the sensitive period before they leave an organization. We provide a proof of concept for our model by applying it to the CMU-CERT Insider Threat Dataset and comparing its performance with the ground truth. The experimental results show that our model detects insider data leakage events with an AUC-ROC value of 0.99, outperforming the existing approaches that are validated on the same dataset. The proposed model provides effective methods to address possible bias and class imbalance issues, with the aim of devising an effective insider data leakage detection system.
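
The abstract addresses class imbalance with SMOTE. The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea only; it is not the authors' pipeline or a full SMOTE implementation. The function name, parameters, and toy points are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a base point by index so duplicates are handled correctly.
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]

        # Keep the k closest other minority points (Euclidean distance).
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)

        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class feature vectors (already scaled to [0, 1]).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
```

Every synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region the minority class already occupies. Unlike undersampling, no majority-class rows are discarded.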


Author(s):  
Karthik R. ◽  
Ifrah Alam ◽  
Bandaru Umamadhuri ◽  
Bharath K. P. ◽  
Rajesh Kumar M.

In this chapter, the authors use various signal processing techniques to analyze and gain insights into how ECG signals of patients suffering from sleep apnea vary with respect to a normal person's ECG. Sleep apnea, or obstructive sleep apnea, occurs when the muscles that support the soft tissues in the throat, such as the tongue and soft palate, relax temporarily. The work has three stages: first, identifying the waves, complexes, and morphology in an ECG that reflect the presence of the disease; second, applying feature extraction techniques to extract ECG features such as wave duration, amplitude distribution, and morphology classes; and third, a detailed analysis of (unsupervised) clustering algorithms on the extracted features, with efficient feature reduction methodologies such as PCA and LDA. Finally, the authors use supervised machine learning algorithms (SVM, naive Bayes classifier, feed forward neural network, and decision tree) to distinguish between ECG signals with sleep apnea and normal ECG signals.


2020 ◽  
Vol 10 (15) ◽  
pp. 5208
Author(s):  
Mohammed Nasser Al-Mhiqani ◽  
Rabiah Ahmad ◽  
Z. Zainal Abidin ◽  
Warusia Yassin ◽  
Aslinda Hassan ◽  
...  

Insider threat has become a widely accepted issue and one of the major challenges in cybersecurity. This phenomenon indicates that threats require special detection systems, methods, and tools, which entail the ability to facilitate accurate and fast detection of a malicious insider. Several studies on insider threat detection and related areas in dealing with this issue have been proposed. Various studies aimed to deepen the conceptual understanding of insider threats. However, there are many limitations, such as a lack of real cases, biases in making conclusions, which are a major concern and remain unclear, and the lack of a study that surveys insider threats from many different perspectives and focuses on the theoretical, technical, and statistical aspects of insider threats. The survey aims to present a taxonomy of contemporary insider types, access, level, motivation, insider profiling, effect security property, and methods used by attackers to conduct attacks and a review of notable recent works on insider threat detection, which covers the analyzed behaviors, machine-learning techniques, dataset, detection methodology, and evaluation metrics. Several real cases of insider threats have been analyzed to provide statistical information about insiders. In addition, this survey highlights the challenges faced by other researchers and provides recommendations to minimize obstacles.

