Solving the Problem of Class Imbalance in the Prediction of Hotel Cancelations: A Hybridized Machine Learning Approach

Booking cancelations place considerable strain on management decisions in the hospitality industry. They hinder precise forecasting, so predicting them is critical to revenue management performance. In recent times, thanks to the considerable computing power available to machine learning (ML) approaches, it has become possible to build models that predict booking cancelations more accurately than traditional methods. Previous studies have used several ML approaches, such as support vector machine (SVM), neural network (NN), and decision tree (DT) models, to predict hotel cancelations. However, they have yet to address the class imbalance problem that exists in the prediction of hotel cancelations. In this study, we narrow this gap by introducing an oversampling technique to address class imbalance, in conjunction with machine learning algorithms, to better predict hotel booking cancelations. A combination of the synthetic minority oversampling technique and the edited nearest neighbors algorithm (SMOTE-ENN) is proposed to address the problem of class imbalance. Class imbalance is a general problem that arises in classification when one class has many more examples than the others. Our research shows that, after addressing the class imbalance problem, the performance of a machine learning classifier improves significantly.
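
To make the resampling idea named in this abstract concrete, here is a minimal pure-Python sketch of SMOTE followed by edited-nearest-neighbours cleaning. It is not the authors' implementation. The function names (`smote`, `edited_nearest_neighbours`, `smote_enn`), k = 3, Euclidean distance, the majority-vote editing rule, and the restriction to two classes are all assumptions made for illustration.

```python
import math
import random


def k_nearest(point, pool, k):
    """Return the indices of the (at most) k points in pool closest to point."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbour = others[rng.choice(k_nearest(base, others, k))]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def edited_nearest_neighbours(X, y, k=3):
    """Keep only points whose label matches the majority of their k nearest neighbours."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        votes = [rest_y[j] for j in k_nearest(x, rest_X, k)]
        if votes.count(label) * 2 > len(votes):
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y


def smote_enn(X, y, minority_label, k=3, seed=0):
    """Oversample the minority class up to the majority size (binary labels
    assumed), then clean the combined set with the editing rule above."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)
    new_points = smote(minority, n_new, k, seed)
    X_bal = list(X) + new_points
    y_bal = list(y) + [minority_label] * len(new_points)
    return edited_nearest_neighbours(X_bal, y_bal, k)
```

The sketch expects `X` as a list of numeric tuples and at least two minority points. In practice a maintained library implementation would normally be preferred over hand-rolled code like this.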

Biased support vector machine and weighted-smote in handling class imbalance problem

International Journal of Advances in Intelligent Informatics ◽

10.26555/ijain.v4i1.146 ◽

2018 ◽

Vol 4 (1) ◽

pp. 21 ◽

Cited By ~ 21

Author(s):

Hartono Hartono ◽

Opim Salim Sitompul ◽

Tulus Tulus ◽

Erna Budhiarti Nababan

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Data Distribution ◽

Class Imbalance ◽

Support Vector ◽

Class Imbalance Problem ◽

Precise Method ◽

Imbalance Problem

Class imbalance occurs when one class has far more instances than the other classes. This major machine learning problem can affect prediction accuracy. The Support Vector Machine (SVM) is a robust and precise method for handling the class imbalance problem but is weak when the data distribution is biased; the Biased Support Vector Machine (BSVM) has become a popular choice for solving this problem. BSVM provides better control of sensitivity yet lacks accuracy compared to the general SVM. This study proposes the integration of BSVM and SMOTEBoost to handle the class imbalance problem. Non Support Vector (NSV) sets from negative samples and Support Vector (SV) sets from positive samples undergo a Weighted-SMOTE process. The results indicate that the implementation of Biased Support Vector Machine and Weighted-SMOTE achieves better accuracy and sensitivity.

Improving Detection of False Data Injection Attacks Using Machine Learning with Feature Selection and Oversampling

Energies ◽

10.3390/en15010212 ◽

2021 ◽

Vol 15 (1) ◽

pp. 212

Author(s):

Ajit Kumar ◽

Neetesh Saxena ◽

Souhwan Jung ◽

Bong Jun Choi

Keyword(s):

Machine Learning ◽

Class Imbalance ◽

Machine Learning Algorithms ◽

Skewed Distribution ◽

Critical Infrastructures ◽

Detection Accuracy ◽

Class Imbalance Problem ◽

False Data Injection ◽

Injection Attacks ◽

Imbalance Problem

Critical infrastructures have recently been integrated with digital controls to support intelligent decision making. Although this integration provides various benefits and improvements, it also exposes the system to new cyberattacks. In particular, the injection of false data and commands into communication is one of the most common and fatal cyberattacks in critical infrastructures. Hence, in this paper, we investigate the effectiveness of machine-learning algorithms in detecting False Data Injection Attacks (FDIAs). In particular, we focus on two of the most widely used critical infrastructures, namely power systems and water treatment plants. This study focuses on tackling two key technical issues: (1) finding the set of best features under a different combination of techniques and (2) resolving the class imbalance problem using oversampling methods. We evaluate the performance of each algorithm in terms of time complexity and detection accuracy to meet the time-critical requirements of critical infrastructures. Moreover, we address the inherent skewed distribution problem and the data imbalance problem commonly found in many critical infrastructure datasets. Our results show that the considered minority oversampling techniques can improve the Area Under Curve (AUC) of GradientBoosting, AdaBoost, and kNN by 10–12%.
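
The Area Under Curve figure reported in this abstract can be read as the probability that a randomly chosen attack sample receives a higher score than a randomly chosen normal sample. The sketch below computes AUC directly from that pairwise definition. The function name `roc_auc` and the convention that label 1 marks the positive class are assumptions for illustration, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    sample has the higher score; ties count as half a win."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
```

Because AUC depends only on how samples are ranked, it is less sensitive to the class ratio than plain accuracy, which is one reason it is often reported for skewed datasets.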

Artificial Immune System-Based Classification in Extremely Imbalanced Classification Problems

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213017500099 ◽

2017 ◽

Vol 26 (03) ◽

pp. 1750009 ◽

Cited By ~ 1

Author(s):

Dionisios N. Sotiropoulos ◽

George A. Tsihrintzis

Keyword(s):

Machine Learning ◽

Immune System ◽

Artificial Immune System ◽

Class Imbalance ◽

Classification Algorithm ◽

Support Vector ◽

Artificial Immune ◽

Class Imbalance Problem ◽

Minority Class ◽

Imbalance Problem

This paper focuses on a special category of machine learning problems arising in cases where the set of available training instances is significantly biased towards a particular class of patterns. Our work addresses the so-called Class Imbalance Problem through the utilization of an Artificial Immune System (AIS)-based classification algorithm which encodes the inherent ability of the Adaptive Immune System to mediate the exceptionally imbalanced “self” / “non-self” discrimination process. From a computational point of view, this process constitutes an extremely imbalanced pattern classification task since the vast majority of molecular patterns pertain to the “non-self” space. Our work focuses on investigating the effect of the class imbalance problem on the AIS-based classification algorithm by assessing its relative ability to deal with extremely skewed datasets when compared against two state-of-the-art machine learning paradigms, namely Support Vector Machines (SVMs) and Multi-Layer Perceptrons (MLPs). To this end, we conducted a series of experiments on a music-related dataset where a small fraction of positive samples was to be recognized against the vast volume of negative samples. The results obtained indicate that the utilized bio-inspired classifier outperforms SVMs in detecting patterns from the minority class, while its performance on the same task is close to that exhibited by MLPs. Our findings suggest that the AIS-based classifier relies on its intrinsic resampling and class-balancing functionality in order to address the class imbalance problem.

ADDRESSING THE CLASS IMBALANCE PROBLEM IN THE AUTOMATIC IMAGE CLASSIFICATION OF COASTAL LITTER FROM ORTHOPHOTOS DERIVED FROM UAS IMAGERY

ISPRS Annals of Photogrammetry Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-annals-v-3-2020-439-2020 ◽

2020 ◽

Vol V-3-2020 ◽

pp. 439-445

Author(s):

D. Duarte ◽

U. Andriolo ◽

G. Gonçalves

Keyword(s):

Machine Learning ◽

Class Imbalance ◽

Machine Learning Algorithms ◽

Unmanned Aerial Systems ◽

Class Imbalance Problem ◽

Marine Litter ◽

Imbalance Problem ◽

Small Set ◽

Aerial Systems

Abstract. Unmanned Aerial Systems (UAS) have recently been used for mapping marine litter in beach-dune environments. Machine learning algorithms have been applied to UAS-derived images and orthophotos for automated detection of marine litter items. As sand and vegetation are predominant in the orthophoto, marine litter items constitute a small set of data, and thus a class that is much less represented in the image scene. This communication aims to analyse the class imbalance issue on orthophotos for automated detection of marine litter items. In the dataset used, patches containing marine litter make up close to 1% of the total number of patches, which represents a clear class imbalance issue. This problem has previously been indicated as detrimental for machine learning frameworks. Three different approaches were tested to address this imbalance, namely class weighting, oversampling and classifier thresholding. Oversampling had the best performance with an F1-score of 0.68, while the other methods had an F1-score of 0.56 on average. The results indicate that future works devoted to UAS-based automated marine litter detection should take into consideration the use of the oversampling method, which helped to improve the results by about 7% in the specific case shown in this paper.
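
The three remedies compared in this abstract can each be sketched in a few lines of plain Python. The code below is an illustration under stated assumptions, not the paper's pipeline. The function names are hypothetical, the class-weight formula is the common inverse-frequency heuristic, oversampling is plain random duplication, and thresholding picks the cutoff with the highest F1-score from a list of candidates.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for c, m in counts.items():
        pool = [x for x, label in zip(X, y) if label == c]
        extra = [rng.choice(pool) for _ in range(target - m)]
        X_out.extend(extra)
        y_out.extend([c] * len(extra))
    return X_out, y_out


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, candidates):
    """Return the candidate cutoff whose predictions give the highest F1."""
    return max(candidates,
               key=lambda t: f1_score(y_true, [int(s >= t) for s in scores]))
```

With 1 positive patch among 100, `balanced_class_weights` assigns the positive class a weight of 50.0, which shows how strongly a 1% minority has to be up-weighted to count equally in a loss function.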

Integration of Sentinel-1 and Sentinel-2 Data with the G-SMOTE Technique for Boosting Land Cover Classification Accuracy

Applied Sciences ◽

10.3390/app112110309 ◽

2021 ◽

Vol 11 (21) ◽

pp. 10309

Author(s):

Hamid Ebrahimy ◽

Amin Naboureh ◽

Bakhtiar Feizizadeh ◽

Jagannath Aryal ◽

Omid Ghorbanzadeh

Keyword(s):

Machine Learning ◽

Land Cover ◽

Global Climate ◽

Class Imbalance ◽

Sampling Technique ◽

Machine Learning Algorithms ◽

Recursive Feature Elimination ◽

Support Vector ◽

Class Imbalance Problem ◽

Sentinel 2

The importance of Land Cover (LC) classification is recognized by an increasing number of scholars who employ LC information in various applications (e.g., addressing global climate change and achieving sustainable development). However, the roles of data balancing, image integration, and the performance of different machine learning algorithms in various landscapes have not received as much attention from scientists. Therefore, the present study investigates the performance of three frequently used Machine Learning (ML) algorithms, namely Extreme Learning Machines (ELM), Support Vector Machines (SVM), and Random Forest (RF), in LC mapping across six different landscapes. Moreover, the Geometric Synthetic Minority Over-sampling Technique (G-SMOTE) was adopted to deal with the class imbalance problem. In this work, time series of Sentinel-1 and Sentinel-2 data were integrated to improve LC mapping accuracy, taking advantage of both data sources. Moreover, Support Vector Machine-Recursive Feature Elimination (SVM-RFE) was implemented to identify the most informative features. Based on the results, RF integrated with G-SMOTE showed the best result for four landscapes (coastal, cropland, desert, and semi-arid). SVM integrated with G-SMOTE had the highest accuracy in the remaining two landscapes (plain and mountain). The applied ML algorithms performed well in the various landscapes, with Overall Accuracy (OA) ranging from 85% to 93% for RF, 83% to 94% for SVM, and 84% to 92% for ELM. The outcomes show that although applying G-SMOTE may slightly decrease OA values, it generally boosts LC classification accuracy in various landscapes, particularly for minority classes.

Fault detection for air conditioning system using machine learning

IAES International Journal of Artificial Intelligence (IJ-AI) ◽

10.11591/ijai.v9.i1.pp109-116 ◽

2020 ◽

Vol 9 (1) ◽

pp. 109

Author(s):

Noor Asyikin Sulaiman ◽

Md Pauzi Abdullah ◽

Hayati Abdullah ◽

Muhammad Noorazlan Shah Zainudin ◽

Azdiana Md Yusop

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Air Conditioning ◽

Machine Learning Algorithms ◽

Coefficient Of Performance ◽

Support Vector ◽

Air Conditioning System ◽

Learning Classifier ◽

Negative Impacts ◽

The Impact

An air conditioning system is a complex system and consumes the most energy in a building. Any fault in the system operation, such as a faulty cooling tower fan, compressor failure, or a stuck damper, could lead to energy wastage and a reduction in the system’s coefficient of performance (COP). Due to the complexity of the air conditioning system, detecting those faults is hard as it requires exhaustive inspections. This paper consists of two parts: i) investigating the impact of different faults related to the air conditioning system on COP and ii) analysing the performance of machine learning algorithms in classifying those faults. Three supervised learning classifier models were developed: deep learning, support vector machine (SVM) and multi-layer perceptron (MLP). The performance of each classifier was investigated in terms of six different classes of faults. Results showed that different faults have different negative impacts on the COP. Also, the three supervised learning classifier models were able to classify all faults at rates above 94%, and MLP produced the highest accuracy and precision of all.

Identifying student behavior in MOOCs using Machine Learning

International Journal for Innovation Education and Research ◽

10.31686/ijier.vol7.iss3.1318 ◽

2019 ◽

Vol 7 (3) ◽

pp. 30-39 ◽

Cited By ~ 1

Author(s):

Vanessa Faria De Souza ◽

Gabriela Perry

Keyword(s):

Machine Learning ◽

Literature Review ◽

Student Behavior ◽

Class Imbalance ◽

External Factors ◽

Class Imbalance Problem ◽

Data Manipulation ◽

Imbalance Problem ◽

Student Classification

This paper presents the results of a literature review, carried out with the objective of identifying prevalent research goals and challenges in the prediction of student behavior in MOOCs using Machine Learning. The results allowed recognizing three goals: 1. Student Classification and 2. Dropout prediction. Regarding the challenges, five items were identified: 1. Incompatibility of AVAs, 2. Complexity of data manipulation, 3. Class Imbalance Problem, 4. Influence of External Factors and 5. Difficulty in manipulating data by untrained personnel.

Improving Logging Prediction on Imbalanced Datasets

International Journal of Open Source Software and Processes ◽

10.4018/ijossp.2016040103 ◽

2016 ◽

Vol 7 (2) ◽

pp. 43-71 ◽

Cited By ~ 3

Author(s):

Sangeeta Lal ◽

Neetu Sardana ◽

Ashish Sureka

Keyword(s):

Machine Learning ◽

Open Source ◽

Class Imbalance ◽

Learning Model ◽

Learning Models ◽

Class Imbalance Problem ◽

Imbalanced Datasets ◽

Imbalance Problem ◽

Machine Learning Model ◽

Machine Learning Models

Logging is an important yet tough decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code constructs. The prediction performance of these models is limited by the class-imbalance problem, since the number of logged code constructs is small compared to the number of non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performance of J48, RF, and SVM classifiers in predicting logged catch-blocks and if-blocks on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, the LogIm model improves the performance of the baseline classifiers, J48, RF, and SVM, by 7.38%, 9.24%, and 4.6% for catch-blocks, and 12.11%, 14.95%, and 19.13% for if-blocks logging prediction.

Image Classifying Based on Cost-Sensitive Layered Cascade Learning

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.701-702.453 ◽

2014 ◽

Vol 701-702 ◽

pp. 453-458

Author(s):

Feng Huang ◽

Yun Liang ◽

Li Huang ◽

Ji Ming Yao ◽

Wen Feng Tian

Keyword(s):

Image Classification ◽

Class Imbalance ◽

Classification Performance ◽

Machine Learning Algorithms ◽

Class Imbalance Problem ◽

Data Set ◽

Misclassification Cost ◽

Imbalance Problem ◽

Specific Category ◽

The Cost

Image classification is an important means of image processing. Traditional research on image classification usually rests on the following assumptions: the goal is overall classification accuracy, samples of different categories have the same importance in the data set, and all misclassifications incur the same cost. Unfortunately, class imbalance and cost sensitivity are ubiquitous in real-world classification: the sample size of a specific category in a data set may be much larger than that of the others, and misclassification costs differ sharply between categories. The high dimension of feature vectors caused by the diverse content of images, and the large gap in difficulty between distinguishing different categories of images, are common problems in image classification; therefore, a single machine learning algorithm is not sufficient for complex image classification with the above characteristics. To address these problems, a layered cascade image classification method based on cost sensitivity and class imbalance is proposed. A set of cascaded learners is built, the inner patterns of images of a specific category are learned in different stages, and a cost function is introduced, so that the method can respond effectively to the cost-sensitive and class-imbalance problems of image classification. Moreover, the structure of the method is flexible, as the number of cascade layers and the algorithm used in every stage can be readjusted based on the business requirements of image classification. Application to sensitive image classification for the smart grid indicates that image classification based on cost-sensitive layered cascade learning obtains better classification performance than existing methods.
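
The cost-sensitive part of this idea can be illustrated with a generic minimum-expected-cost decision rule: instead of predicting the most probable class, predict the class whose expected misclassification cost is lowest. This is a textbook rule, not the cascade method of the paper. The function name `min_cost_class`, the class names, and the cost values are hypothetical.

```python
def min_cost_class(probabilities, cost):
    """Pick the prediction with the lowest expected cost.

    probabilities: dict mapping each class to P(class | sample)
    cost: nested dict where cost[true_class][predicted_class] is the penalty
    """
    classes = list(probabilities)

    def expected_cost(predicted):
        return sum(probabilities[true] * cost[true][predicted] for true in classes)

    return min(classes, key=expected_cost)


# Hypothetical costs: missing a sensitive image is 10 times worse than a false alarm.
cost = {
    "normal": {"normal": 0, "sensitive": 1},
    "sensitive": {"normal": 10, "sensitive": 0},
}
probs = {"normal": 0.8, "sensitive": 0.2}
print(min_cost_class(probs, cost))  # prints sensitive
```

In the example, predicting "normal" has an expected cost of 0.2 × 10 = 2.0, while predicting "sensitive" costs 0.8 × 1 = 0.8, so the rarer class is chosen even though it is less probable.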

A machine learning approach to predict ethnicity using personal name and census location in Canada

PLoS ONE ◽

10.1371/journal.pone.0241239 ◽

2020 ◽

Vol 15 (11) ◽

pp. e0241239

Author(s):

Kai On Wong ◽

Osmar R. Zaïane ◽

Faith G. Davis ◽

Yutaka Yasui

Keyword(s):

Machine Learning ◽

First Nations ◽

Predictive Value ◽

Large Scale ◽

Performance Metrics ◽

Characteristic Curve ◽

Machine Learning Algorithms ◽

Support Vector ◽

Learning Approach ◽

Machine Learning Approach

Background: Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features. Methods: Using the 1901 census, multiclass and binary classification machine learning pipelines were developed. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all-combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double-metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, Area Under the Curve for the Receiver Operating Characteristic curve, and accuracy. Results: The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classifications for Chinese, French, Italian, Japanese, Russian, and others, the F1 ranged from 68% to 95% (median 87%). The lower performance for English, Irish, and Scottish (F1 ranged from 63% to 67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved the prediction in Aboriginal classification (F1 increased from 50% to 84%).
Conclusions: The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians, with performance varying by ethnic category.
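
Most of the binary performance metrics listed in this abstract follow directly from the four cells of a confusion matrix. The sketch below shows the standard formulas. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts.

    A ratio with a zero denominator is reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value (precision)
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# A classifier that never predicts the rare class on a 1%-positive dataset:
print(binary_metrics(tp=0, fp=0, fn=1, tn=99))
```

In the usage line, accuracy comes out at 0.99 while sensitivity and F1 are both 0.0, which is why accuracy alone is a weak summary when one class is rare.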
