A Novel Technique on Class Imbalance Big Data using Analogous under Sampling Approach

2018 · Vol 179 (33) · pp. 18-21
Author(s): Mohammad Imran, Vaddi Srinivasa
2021
Author(s): Timothy Oladunni, Sourou Tossou, Yayehyrad Haile, Adonias Kidane

The COVID-19 pandemic that broke out in late 2019 has spread across the globe. The disease has infected millions of people, and thousands of lives have been lost. The momentum of the disease has been slowed by the introduction of vaccines; however, some countries are still recording high numbers of casualties. The focus of this work is to design, develop, and evaluate a machine learning county-level COVID-19 severity classifier. The proposed model predicts the severity of the disease in a county as low, moderate, or high. Policymakers will find the work useful in the distribution of vaccines. Four learning algorithms (two ensembles and two non-ensembles) were trained and evaluated. Class imbalance was addressed using NearMiss under-sampling of the majority classes. The results of our experiment show that the ensemble models outperformed the non-ensemble models by a considerable margin.
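NearMiss balances a dataset by keeping only the majority-class samples that lie closest to the minority class. A minimal NumPy sketch of the NearMiss-1 variant on synthetic three-class data (the labels, feature values, and `k=3` neighbor count here are illustrative assumptions, not the paper's county-level data):

```python
import numpy as np

def near_miss_1(X, y, minority_label, k=3):
    """NearMiss-1 sketch: for each larger class, keep the samples whose
    mean distance to their k nearest minority samples is smallest,
    shrinking every class to the minority class size."""
    X_min = X[y == minority_label]
    n_keep = len(X_min)
    keep_idx = list(np.where(y == minority_label)[0])
    for label in np.unique(y):
        if label == minority_label:
            continue
        idx = np.where(y == label)[0]
        # pairwise distances from this class's samples to minority samples
        d = np.linalg.norm(X[idx][:, None, :] - X_min[None, :, :], axis=2)
        # mean distance to the k nearest minority samples
        mean_k = np.sort(d, axis=1)[:, :k].mean(axis=1)
        keep_idx.extend(idx[np.argsort(mean_k)[:n_keep]])
    keep_idx = np.array(keep_idx)
    return X[keep_idx], y[keep_idx]

# Synthetic imbalanced data: 200 / 70 / 30 samples per class
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.array([0] * 200 + [1] * 70 + [2] * 30)
X_res, y_res = near_miss_1(X, y, minority_label=2)
counts = {int(c): int((y_res == c).sum()) for c in np.unique(y_res)}
print(counts)  # every class reduced to the minority size, 30
```

In practice the `imbalanced-learn` library provides a production implementation (`imblearn.under_sampling.NearMiss`); the hand-rolled version above is only meant to show the selection rule.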


2019 · Vol 6 (1)
Author(s): Tawfiq Hasanin, Taghi M. Khoshgoftaar, Joffrey L. Leevy, Richard A. Bauder

Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics. However, it should be noted that the Random Undersampling approach performs adequately in the first case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1, SMOTE-borderline2, ADAptive SYNthetic) when measuring performance with Area Under the Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice as it results in models with a significantly smaller number of samples, thus reducing computational burden and training time.
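The evaluation loop described above can be approximated in a few lines: randomly undersample the majority class to a chosen ratio, train a learner, then score it with AUC and Geometric Mean. The sketch below swaps the paper's Spark learners for scikit-learn, uses synthetic data, and fixes the sampled ratio at 50:50 (one of the paper's five ratios); none of it reproduces the Medicare or SlowlorisBig experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic severely imbalanced binary task (~1% positives)
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random Undersampling: shrink the negative class to the positive count
rng = np.random.default_rng(0)
pos = np.where(y_tr == 1)[0]
neg = rng.choice(np.where(y_tr == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])

clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Geometric Mean = sqrt(sensitivity * specificity)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
gmean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
print(f"AUC={auc:.3f}  G-Mean={gmean:.3f}")
```

The undersampled training set here holds only a few hundred rows, which illustrates the paper's closing point: Random Undersampling cuts training cost dramatically because the model never sees most of the majority class.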


2020 · Vol 10 (14) · pp. 4945
Author(s): R. G. Gayathri, Atul Sajjanhar, Yong Xiang

Cybersecurity attacks can arise from internal and external sources. The attacks perpetrated by internal sources are also referred to as insider threats. These are a cause of serious concern to organizations because of the significant damage that can be inflicted by malicious insiders. In this paper, we propose an approach for insider threat classification which is motivated by the effectiveness of pre-trained deep convolutional neural networks (DCNNs) for image classification. In the proposed approach, we extract features from the usage patterns of insiders and represent these features as images. Hence, images are used to represent the resource access patterns of the employees within an organization. After construction of the images, we use pre-trained DCNNs for anomaly detection, with the aim of identifying malicious insiders. Random under-sampling is used to mitigate the class imbalance issue. The proposed approach is evaluated using the MobileNetV2, VGG19, and ResNet50 pre-trained models, and a benchmark dataset. Experimental results show that the proposed method is effective and outperforms other state-of-the-art methods.
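The key transformation is mapping tabular usage features into an image a pre-trained DCNN can consume. A minimal sketch of that step, assuming a per-hour, per-resource access-count layout (the 24×8 grid and the count ranges are illustrative assumptions; the paper's actual feature construction is not reproduced here):

```python
import numpy as np

def usage_to_image(counts, shape=(24, 8)):
    """Arrange a flat vector of resource-access counts as a 2D grid and
    scale it to 8-bit grayscale, ready for a pre-trained DCNN input
    pipeline (after resizing / channel replication as the model needs)."""
    grid = np.asarray(counts, dtype=float).reshape(shape)
    if grid.max() > 0:
        grid = grid / grid.max()          # normalize to [0, 1]
    return (grid * 255).astype(np.uint8)  # 8-bit grayscale image

# One synthetic "user-day": access counts for 24 hours x 8 resources
rng = np.random.default_rng(1)
img = usage_to_image(rng.integers(0, 50, size=24 * 8))
print(img.shape, img.dtype)
```

From here, each user-day image would be resized and fed to a frozen MobileNetV2/VGG19/ResNet50 backbone, with only a small classification head trained on the labeled insider data.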


Author(s): Shaojian Qiu, Lu Lu, Siyu Jiang, Yang Guo

Machine-learning-based software defect prediction (SDP) methods are receiving great attention from researchers in intelligent software engineering. Most existing SDP methods are performed under a within-project setting. However, there is usually little to no within-project training data available to learn a supervised prediction model for a new SDP task. Therefore, cross-project defect prediction (CPDP), which uses labeled data from source projects to learn a defect predictor for a target project, was proposed as a practical SDP solution. In real CPDP tasks, the class imbalance problem is ubiquitous and has a great impact on the performance of CPDP models. Unlike previous studies that focus on subsampling and individual methods, this study investigated 15 imbalanced learning methods for CPDP tasks, especially to assess the effectiveness of imbalanced ensemble learning (IEL) methods. We evaluated the 15 methods through extensive experiments on 31 open-source projects derived from five datasets. Through analyzing a total of 37,504 results, we found that in most cases, the IEL method that combines under-sampling and bagging approaches is more effective than the other investigated methods.
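The "under-sampling plus bagging" combination the study found most effective can be sketched as follows: each ensemble member trains on all minority (defective) samples plus an equal-size random draw of majority samples, and predictions are majority-voted. The decision-tree base learner and synthetic data below are assumptions for brevity, not the study's exact setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def under_bagging_fit(X, y, n_estimators=10, seed=0):
    """Fit an under-bagging ensemble: every member sees a balanced
    subsample (all positives + an equal number of random negatives)."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    members = []
    for _ in range(n_estimators):
        sampled_neg = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sampled_neg])
        members.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return members

def under_bagging_predict(members, X):
    """Majority vote across ensemble members."""
    votes = np.mean([m.predict(X) for m in members], axis=0)
    return (votes >= 0.5).astype(int)

# Synthetic imbalanced "defect" data (~10% positives)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
ens = under_bagging_fit(X, y)
pred = under_bagging_predict(ens, X)
print("predicted positives:", int(pred.sum()))
```

Because each member discards a different random slice of the majority class, the ensemble as a whole still covers most of the majority data while every individual learner trains on a balanced set.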


2016 · Vol 216 · pp. 250-260
Author(s): Chandresh Kumar Maurya, Durga Toshniwal, Gopalan Vijendran Venkoparao
