Improving Detection of False Data Injection Attacks Using Machine Learning with Feature Selection and Oversampling

Energies ◽  
2021 ◽  
Vol 15 (1) ◽  
pp. 212
Author(s):  
Ajit Kumar ◽  
Neetesh Saxena ◽  
Souhwan Jung ◽  
Bong Jun Choi

Critical infrastructures have recently been integrated with digital controls to support intelligent decision making. Although this integration provides various benefits and improvements, it also exposes the system to new cyberattacks. In particular, the injection of false data and commands into communication is one of the most common and fatal cyberattacks in critical infrastructures. Hence, in this paper, we investigate the effectiveness of machine-learning algorithms in detecting False Data Injection Attacks (FDIAs). Specifically, we focus on two of the most widely used critical infrastructures, namely power systems and water treatment plants. This study focuses on tackling two key technical issues: (1) finding the set of best features under a different combination of techniques and (2) resolving the class imbalance problem using oversampling methods. We evaluate the performance of each algorithm in terms of time complexity and detection accuracy to meet the time-critical requirements of critical infrastructures. Moreover, we address the inherent skewed distribution problem and the data imbalance problem commonly found in many critical infrastructure datasets. Our results show that the considered minority oversampling techniques can improve the Area Under Curve (AUC) of GradientBoosting, AdaBoost, and kNN by 10–12%.
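As a rough illustration of the comparison described in the abstract above, the following is a minimal sketch of measuring AUC with and without minority oversampling for GradientBoosting, AdaBoost, and kNN. The synthetic attack/normal data, the choice of SMOTE, and all hyperparameters are assumptions for illustration only, not the paper's actual datasets or setup.

```python
# Hedged sketch: compare ROC AUC with and without minority oversampling.
# The synthetic data and SMOTE variant below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Heavily skewed binary problem standing in for attack vs. normal traffic.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training split so the test set stays untouched.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

for name, clf in [("GradientBoosting", GradientBoostingClassifier()),
                  ("AdaBoost", AdaBoostClassifier()),
                  ("kNN", KNeighborsClassifier())]:
    auc_raw = roc_auc_score(y_te, clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    auc_os = roc_auc_score(y_te, clf.fit(X_res, y_res).predict_proba(X_te)[:, 1])
    print(f"{name}: AUC without oversampling {auc_raw:.3f}, with SMOTE {auc_os:.3f}")
```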

Author(s):  
D. Duarte ◽  
U. Andriolo ◽  
G. Gonçalves

Abstract. Unmanned Aerial Systems (UAS) have recently been used for mapping marine litter in beach-dune environments. Machine learning algorithms have been applied to UAS-derived images and orthophotos for the automated detection of marine litter items. As sand and vegetation dominate the orthophoto, marine litter items constitute a small portion of the data and are therefore a much less represented class in the image scene. This communication aims to analyse the class imbalance issue on orthophotos for automated marine litter item detection. In the dataset used, the percentage of patches containing marine litter is close to 1% of the total number of patches, representing a clear class imbalance issue. This problem has previously been indicated as detrimental to machine learning frameworks. Three different approaches were tested to address this imbalance, namely class weighting, oversampling, and classifier thresholding. Oversampling had the best performance, with an f1-score of 0.68, while the other methods had an average f1-score of 0.56. The results indicate that future work on UAS-based automated marine litter detection should take the oversampling method into consideration, which improved the results by about 7% in the specific case shown in this paper.
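Below is a minimal sketch of the three imbalance strategies compared in the abstract above (class weighting, oversampling, and thresholding), applied to a generic patch-classification feature set. The logistic-regression stand-in and the synthetic data are assumptions, not the authors' UAS image pipeline.

```python
# Hedged sketch: three generic ways to handle a ~1% minority class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# 1) Class weighting: penalize minority-class errors more heavily.
f1_weight = f1_score(y_te, LogisticRegression(class_weight="balanced", max_iter=1000)
                     .fit(X_tr, y_tr).predict(X_te))

# 2) Oversampling: replicate minority patches before training.
X_os, y_os = RandomOverSampler(random_state=1).fit_resample(X_tr, y_tr)
f1_over = f1_score(y_te, LogisticRegression(max_iter=1000).fit(X_os, y_os).predict(X_te))

# 3) Thresholding: keep the plain model, lower the decision threshold.
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
f1_thr = max(f1_score(y_te, probs >= t) for t in np.linspace(0.05, 0.5, 10))

print(f"class weighting {f1_weight:.2f}, oversampling {f1_over:.2f}, "
      f"thresholding {f1_thr:.2f}")
```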


Processes ◽  
2021 ◽  
Vol 9 (10) ◽  
pp. 1713
Author(s):  
Mohd Adil ◽  
Mohd Faizan Ansari ◽  
Ahmad Alahmadi ◽  
Jei-Zheng Wu ◽  
Ripon K. Chakrabortty

The cancelation of bookings puts a considerable strain on management decisions in the hospitality industry. Booking cancelations limit precise demand prediction and are therefore a critical factor in revenue management performance. In recent times, however, thanks to the considerable computing power available to machine learning (ML) approaches, it has become possible to create models that predict booking cancelations more accurately than traditional methods. Previous studies have used several ML approaches, such as support vector machine (SVM), neural network (NN), and decision tree (DT) models, for predicting hotel cancelations. However, they are yet to address the class imbalance problem that exists in the prediction of hotel cancelations. In this study, we narrow this gap by introducing an oversampling technique to address the class imbalance problem, in conjunction with machine learning algorithms, to better predict hotel booking cancelations. A combination of the synthetic minority oversampling technique and the edited nearest neighbors (SMOTE-ENN) algorithm is proposed to address the problem of class imbalance. Class imbalance is a general problem that occurs in classification when one class has far more examples than the others. Our research shows that, after addressing the class imbalance problem, the performance of a machine learning classifier improves significantly.
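A minimal sketch of combining SMOTE-ENN resampling with a classifier, in the general spirit of the abstract above, follows. The synthetic booking-like data and the random-forest choice are illustrative assumptions rather than the study's actual data or model selection.

```python
# Hedged sketch: SMOTE-ENN resampling ahead of a classifier.
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=8000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# SMOTE synthesizes minority samples, then ENN removes noisy/borderline points.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te)))
```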


Author(s):  
Vanessa Faria De Souza ◽  
Gabriela Perry

This paper presents the results of a literature review carried out with the objective of identifying prevalent research goals and challenges in the prediction of student behavior in MOOCs using machine learning. The results allowed the recognition of two main goals: (1) student classification and (2) dropout prediction. Regarding the challenges, five items were identified: (1) incompatibility of virtual learning environments (AVAs), (2) complexity of data manipulation, (3) the class imbalance problem, (4) influence of external factors, and (5) difficulty in manipulating data by untrained personnel.


2016 ◽  
Vol 7 (2) ◽  
pp. 43-71 ◽  
Author(s):  
Sangeeta Lal ◽  
Neetu Sardana ◽  
Ashish Sureka

Logging is an important yet tough decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code constructs. The prediction performance of these models is limited by the class-imbalance problem, since the number of logged code constructs is small compared to that of non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performance of J48, RF, and SVM classifiers for logged code construct prediction in catch-blocks and if-blocks on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, the LogIm model improves the performance of the baseline classifiers, J48, RF, and SVM, by 7.38%, 9.24%, and 4.6% for catch-blocks, and 12.11%, 14.95%, and 19.13% for if-blocks logging prediction.
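The following is a minimal sketch of a threshold-based ensemble in the general spirit of the abstract above; it is not the LogIm implementation. Averaging class probabilities from decision-tree, random-forest, and SVM baselines and then tuning the decision cutoff is an assumption about how such a model could be wired together.

```python
# Hedged sketch: probability-averaging ensemble with threshold moving.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

models = [DecisionTreeClassifier(random_state=7),
          RandomForestClassifier(random_state=7),
          SVC(probability=True, random_state=7)]
probs = np.mean([m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models], axis=0)

# Pick the probability cutoff that maximizes F1 instead of the default 0.5.
# (In practice the threshold should be tuned on a held-out validation split.)
best_t = max(np.linspace(0.1, 0.9, 17), key=lambda t: f1_score(y_te, probs >= t))
print(f"best threshold {best_t:.2f}, F1 {f1_score(y_te, probs >= best_t):.3f}")
```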


2014 ◽  
Vol 701-702 ◽  
pp. 453-458
Author(s):  
Feng Huang ◽  
Yun Liang ◽  
Li Huang ◽  
Ji Ming Yao ◽  
Wen Feng Tian

Image classification is an important branch of image processing. Traditional research on image classification is usually based on the following assumptions: the goal is overall classification accuracy, samples of different categories have the same importance in the data set, and all misclassifications incur the same cost. Unfortunately, class imbalance and cost sensitivity are ubiquitous in real-world classification: the sample size of a specific category may be much larger than that of others, and misclassification costs differ sharply between categories. High-dimensional feature vectors caused by the diverse content of images and the large gap in the complexity of distinguishing different categories are common problems in image classification; therefore, a single machine learning algorithm is not sufficient for complex image classification tasks with the above characteristics. To address these problems, a layered cascade image classification method based on cost sensitivity and class imbalance is proposed: a set of cascading learners is built, the inner patterns of images of a specific category are learned at different stages, and a cost function is introduced, so that the method can effectively respond to the cost-sensitive and class-imbalanced nature of image classification. Moreover, the structure of the method is flexible, as the number of cascade layers and the algorithm at each stage can be readjusted according to the business requirements of the classification task. Application to sensitive image classification for the smart grid indicates that this cost-sensitive layered cascade learning approach obtains better classification performance than existing methods.
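As a rough illustration of the layered, cost-sensitive idea described above, here is a minimal sketch of a two-stage cascade in which misclassification costs enter through per-sample weights. The stage models, the cost values, and the synthetic features are assumptions, not the paper's method; the sketch is evaluated on the training data only to keep it short.

```python
# Hedged sketch: two-stage cascade with cost-weighted training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=6000, weights=[0.97, 0.03], random_state=3)

# Stage 1: a cheap, high-recall filter; costs enter as per-sample weights.
cost = np.where(y == 1, 10.0, 1.0)              # missing a sensitive image is costly
stage1 = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=cost)
suspect = stage1.predict_proba(X)[:, 1] > 0.1   # pass anything remotely positive

# Stage 2: a heavier model re-examines only the suspect subset.
stage2 = GradientBoostingClassifier().fit(X[suspect], y[suspect],
                                          sample_weight=cost[suspect])
final = np.zeros_like(y)
final[suspect] = stage2.predict(X[suspect])
print("flagged:", int(final.sum()), "actual positives:", int(y.sum()))
```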


Author(s):  
Hartono Hartono ◽  
Opim Salim Sitompul ◽  
Tulus Tulus ◽  
Erna Budhiarti Nababan

Class imbalance occurs when the number of instances in one class is much higher than in other classes. This major machine learning problem can affect prediction accuracy. The Support Vector Machine (SVM) is a robust and precise method for handling the class imbalance problem but is weak against biased data distributions, so the Biased Support Vector Machine (BSVM) has become a popular choice to address this issue. BSVM provides better control of sensitivity yet lacks accuracy compared to the general SVM. This study proposes the integration of BSVM and SMOTEBoost to handle the class imbalance problem. Non-Support Vector (NSV) sets from negative samples and Support Vector (SV) sets from positive samples undergo a Weighted-SMOTE process. The results indicate that the integration of the Biased Support Vector Machine and Weighted-SMOTE achieves better accuracy and sensitivity.
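Below is a minimal sketch of pairing a class-weighted SVM (one common way to realize a "biased" SVM via per-class penalties) with plain SMOTE oversampling. This only approximates the idea of combining BSVM with SMOTE-style synthesis; it is not the authors' Weighted-SMOTE procedure, and the data and weights are illustrative assumptions.

```python
# Hedged sketch: class-weighted SVM on SMOTE-resampled data.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

X_res, y_res = SMOTE(random_state=5).fit_resample(X_tr, y_tr)

# class_weight biases the misclassification penalty toward the minority class.
svm = SVC(class_weight={0: 1, 1: 3}).fit(X_res, y_res)
pred = svm.predict(X_te)
print(f"sensitivity {recall_score(y_te, pred):.3f}, "
      f"balanced accuracy {balanced_accuracy_score(y_te, pred):.3f}")
```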


Author(s):  
Khyati Ahlawat ◽  
Anuradha Chug ◽  
Amit Prakash Singh

The expansion of data along the dimensions of volume, variety, and velocity is leading to big data. Learning from this big data is challenging and beyond the capacity of conventional machine learning methods and techniques. Generally, big data generated from real-time scenarios is imbalanced in nature, with an uneven distribution of classes. This adds complexity to learning from big data, since the underrepresented class is often the more important one and its correct classification becomes more critical than that of the overrepresented class. This chapter addresses the imbalance problem and its solutions in the context of big data, along with a detailed survey of work done in this area. It also presents an experimental view of solving the imbalanced classification problem and a comparative analysis of the different methodologies.


Classification is a major challenge in machine learning in general, and particularly when tackling the class imbalance problem. A dataset is said to be imbalanced if the class of interest is the minority class and appears scanty compared to the majority class; the minority class is also known as the positive class, while the majority class is also known as the negative class. Class imbalance has been a major bottleneck for machine learning scientists, as it often leads to using the wrong model for a given purpose. This survey guides researchers in choosing the right model and the best strategies for handling imbalanced datasets when tackling machine learning problems. Proper handling of a class-imbalanced dataset can lead to accurate and reliable results. Handling class-imbalanced data in a conventional manner, especially when the level of imbalance is high, may lead to the accuracy paradox (reporting, say, 99% accuracy during evaluation simply because the class distribution is highly imbalanced). An imbalanced class distribution therefore requires special consideration, and for this purpose we deal extensively with approaches for handling and solving the class imbalance problem in machine learning, namely the data sampling approach, the cost-sensitive learning approach, and the ensemble approach.
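A minimal sketch of the accuracy paradox mentioned above: a trivial majority-class predictor scores roughly 99% accuracy on a 99:1 dataset while detecting none of the minority examples. The synthetic data is an illustrative assumption.

```python
# Hedged sketch: the accuracy paradox on a highly imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = majority.predict(X)
print(f"accuracy {accuracy_score(y, pred):.3f}")        # ~0.99, looks excellent
print(f"minority recall {recall_score(y, pred):.3f}")   # 0.0, useless in practice
```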


2021 ◽  
Vol 12 (1) ◽  
pp. 1-17
Author(s):  
Swati V. Narwane ◽  
Sudhir D. Sawarkar

Class imbalance is a major hurdle for machine learning-based systems. The data set is the backbone of machine learning and must be studied to handle class imbalance. The purpose of this paper is to investigate the effect of class imbalance on data sets. The proposed methodology determines model accuracy as a function of class distribution. To find possible solutions, the behaviour of an imbalanced data set was investigated. The study considers two case studies, with each data set varied from a balanced to an unbalanced class distribution. Each data set was split into training and test data and evaluated with standard machine learning algorithms. Model accuracy for each class distribution was measured with the training data set. Further, the built model was tested on each binary class individually. The results show that addressing the class imbalance problem is essential for improving system performance. The study concludes that the system produces biased results due to the majority class. In the future, the multiclass imbalance problem can be studied using advanced algorithms.
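The following is a minimal sketch of the kind of experiment described above: the same classifier is trained at several class ratios and then checked on each class separately. The synthetic data and the decision-tree choice are illustrative assumptions, not the paper's case studies.

```python
# Hedged sketch: per-class recall as the class distribution becomes skewed.
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

for minority in (0.5, 0.3, 0.1, 0.02):
    X, y = make_classification(n_samples=5000,
                               weights=[1 - minority, minority], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    pred = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
    print(f"minority share {minority:.2f}: "
          f"majority recall {recall_score(y_te, pred, pos_label=0):.2f}, "
          f"minority recall {recall_score(y_te, pred, pos_label=1):.2f}")
```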


2017 ◽  
Vol 26 (03) ◽  
pp. 1750009 ◽  
Author(s):  
Dionisios N. Sotiropoulos ◽  
George A. Tsihrintzis

This paper focuses on a special category of machine learning problems arising in cases where the set of available training instances is significantly biased towards a particular class of patterns. Our work addresses the so-called Class Imbalance Problem through the utilization of an Artificial Immune System (AIS)-based classification algorithm which encodes the inherent ability of the Adaptive Immune System to mediate the exceptionally imbalanced "self" / "non-self" discrimination process. From a computational point of view, this process constitutes an extremely imbalanced pattern classification task, since the vast majority of molecular patterns pertain to the "non-self" space. Our work focuses on investigating the effect of the class imbalance problem on the AIS-based classification algorithm by assessing its relative ability to deal with extremely skewed datasets when compared against two state-of-the-art machine learning paradigms, namely Support Vector Machines (SVMs) and Multi-Layer Perceptrons (MLPs). To this end, we conducted a series of experiments on a music-related dataset where a small fraction of positive samples was to be recognized against a vast volume of negative samples. The results obtained indicate that the utilized bio-inspired classifier outperforms SVMs in detecting patterns from the minority class, while its performance on the same task is competitively close to that exhibited by MLPs. Our findings suggest that the AIS-based classifier relies on its intrinsic resampling and class-balancing functionality in order to address the class imbalance problem.
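Below is a minimal sketch of the baseline comparison described above: SVM and MLP classifiers scored on minority-class recall for a heavily skewed dataset. The AIS-based classifier itself is not reproduced here, and the data, features, and settings are illustrative assumptions rather than the paper's music-related dataset.

```python
# Hedged sketch: minority-class recall of SVM vs. MLP baselines on skewed data.
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=4000, weights=[0.97, 0.03], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

for name, clf in [("SVM", SVC()),
                  ("MLP", MLPClassifier(max_iter=1000, random_state=2))]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: minority-class recall {recall_score(y_te, pred):.3f}")
```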

