The impact of data difficulty factors on classification of imbalanced and concept drifting data streams

Abstract: Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and the presence of unsafe types of examples. Although such factors are often present in real-world data, their interactions with concept drifts have not yet been investigated in concept drifting data streams. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations, on predictions of representative online classifiers. Experimental results reveal the high influence of the newly considered factors and their local drifts, as well as differences in existing classifiers' reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address challenges posed by imbalanced data streams.

VFC-SMOTE: very fast continuous synthetic minority oversampling for evolving data streams

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00786-0 ◽

2021 ◽

Author(s):

Alessio Bernardo ◽

Emanuele Della Valle

Keyword(s):

Data Streams ◽

Concept Drift ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Minority Class ◽

Machine Learning Classification ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

Better Than

Abstract: The world is constantly changing, and so is the massive amount of data produced. However, only a few studies deal with online class imbalance learning that combines the challenges of class-imbalanced data streams and concept drift. In this paper, we propose the very fast continuous synthetic minority oversampling technique (VFC-SMOTE). It is a novel meta-strategy to be prepended to any streaming machine learning classification algorithm, aiming at oversampling the minority class using a new version of SMOTE and Borderline-SMOTE inspired by Data Sketching. We benchmarked VFC-SMOTE pipelines on synthetic and real data streams containing different concept drifts, imbalance levels, and class distributions. We bring statistical evidence that VFC-SMOTE pipelines learn models whose minority class performances are better than the state of the art. Moreover, we analyze the time/memory consumption and the concept drift recovery speed.

VP07 Collaboratively Modelling The Impact Of Interventions Retrospectively

International Journal of Technology Assessment in Health Care ◽

10.1017/s026646231700304x ◽

2017 ◽

Vol 33 (S1) ◽

pp. 149-149

Author(s):

Gordon Bache ◽

Sukh Tatla ◽

Deborah Simpson

Keyword(s):

Real World ◽

Real Data ◽

System Level ◽

Local Data ◽

Real World Data ◽

Total Saving ◽

Retrospective Assessment ◽

Administration Time ◽

The Individual ◽

The Impact

INTRODUCTION: A conventional approach to communicating value is to model the budget impact of a medicine and the associated formulations in which it is available to be prescribed. However, such an approach does not demonstrate the actual realization of the proposed impact. This abstract outlines an approach to presenting retrospective data back to healthcare professionals (HCP) that blends assumptions and real-world data. For illustrative purposes, we present the results of an application of the model for subcutaneously delivered trastuzumab in an anonymized trust in Yorkshire and Humber.

METHODS: The authors developed a model that examined one calendar year (from April 2014) of redistributed sales data for both the intravenous and subcutaneous formulations of trastuzumab for every National Health Service (NHS) trust in England. A series of baseline assumptions (1) were used to model the resource impact of different formulations such as chair time, HCP time, pharmacy preparation time, consumables, wastage, and other considerations. Impacts were estimated at the individual attendance level and scaled to the caseload. These baseline assumptions could then be overwritten by the individual trust using local data.

RESULTS: The site delivered approximately 985 doses of subcutaneous trastuzumab over a period of 12 months from April 2014, which represented about 76 percent of the total number of doses delivered. Chair time is estimated to have reduced by 22 minutes per attendance, resulting in a total saving of 361 hours. HCP administration time is estimated to have reduced by 23 minutes per attendance, resulting in a total saving of 378 hours based on changing 985 IV doses to SC therapy.

CONCLUSIONS: Blending real data and assumptions to provide a retrospective assessment of actual benefits realized back to HCPs is a powerful tool for demonstrating real-world value at both an individual trust and system level.
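
The scaling from a per-attendance saving to a caseload total can be checked with a few lines of arithmetic. The sketch below is only an illustration of that step, not the authors' model: the function name total_hours_saved is hypothetical, and it assumes one attendance per dose, which is how the reported totals appear to have been derived.

```python
# Scale a per-attendance time saving (in minutes) to a caseload total (in hours).
# Assumption: one attendance per dose; the helper name is illustrative only.
def total_hours_saved(doses: int, minutes_saved_per_attendance: float) -> float:
    return doses * minutes_saved_per_attendance / 60.0


doses = 985  # subcutaneous doses reported for the site
chair_hours = total_hours_saved(doses, 22)  # chair time: 22 minutes per attendance
hcp_hours = total_hours_saved(doses, 23)    # HCP administration time: 23 minutes

print(round(chair_hours), round(hcp_hours))  # prints "361 378"
```

Both rounded values agree with the totals reported in the results.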

Oversampling Imbalanced Data Based on Convergent WGAN for Network Threat Detection

Security and Communication Networks ◽

10.1155/2021/9206440 ◽

2021 ◽

Vol 2021 ◽

pp. 1-14

Author(s):

Yanping Xu ◽

Xiaoyu Zhang ◽

Zhenliang Qiu ◽

Xia Zhang ◽

Jian Qiu ◽

...

Keyword(s):

Nash Equilibrium ◽

Loss Function ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Threat Detection ◽

Training Process ◽

Generative Adversarial Network ◽

Minority Class ◽

Adversarial Network

Class imbalance is a common problem in network threat detection. Oversampling the minority class by generating enough new minority samples is regarded as a popular countermeasure. The generative adversarial network (GAN) is a typical generative model that can generate any number of artificial minority samples that are close to the real data. However, it is difficult to train a GAN, and the Nash equilibrium is almost impossible to achieve. Therefore, in order to improve the training stability of GAN-based oversampling for network threat detection, a convergent WGAN-based oversampling model called convergent WGAN (CWGAN) is proposed in this paper. The training process of CWGAN contains multiple iterations. In each iteration, the number of training epochs of the discriminator is dynamic and is determined by the convergence of the discriminator loss function in the last two iterations. When the discriminator has been trained to convergence, the generator is then trained to generate new minority samples. The experimental results show that CWGAN not only improves the training stability of WGAN, making the loss smoother and closer to 0, but also improves the performance on the minority class through oversampling, which means that CWGAN can improve the performance of network threat detection.

Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Foundations of Computing and Decision Sciences ◽

10.1515/fcds-2017-0007 ◽

2017 ◽

Vol 42 (2) ◽

pp. 149-176 ◽

Cited By ~ 7

Author(s):

Szymon Wojciechowski ◽

Szymon Wilk

Keyword(s):

Experimental Study ◽

Class Imbalance ◽

Imbalanced Data ◽

Classification Performance ◽

Data Sets ◽

Artificial Data ◽

Minority Class ◽

Imbalanced Data Sets ◽

The Impact

Abstract: In this paper we describe results of an experimental study where we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically check factors such as dimensionality, class imbalance ratio or distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the latter factor was the most critical one and it exacerbated other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors - 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods - SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.
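
The abstract names four types of minority examples but does not say how they are assigned. As a rough illustration only, the sketch below uses a neighbourhood-based rule that is common in the imbalanced-learning literature: look at the k = 5 nearest neighbours of an example and count how many share its class. Whether this study used exactly this rule and these thresholds is an assumption, and the function name example_type is hypothetical.

```python
import math


def example_type(index, X, y, k=5):
    """Label one example by the classes of its k nearest neighbours.

    Assumed thresholds for k = 5: 4-5 same-class neighbours -> safe,
    2-3 -> borderline, 1 -> rare, 0 -> outlier.
    """
    # Distances from the chosen example to every other example.
    dists = sorted(
        (math.dist(X[index], X[j]), j) for j in range(len(X)) if j != index
    )
    n_same = sum(1 for _, j in dists[:k] if y[j] == y[index])
    if n_same >= 4:
        return "safe"
    if n_same >= 2:
        return "borderline"
    if n_same == 1:
        return "rare"
    return "outlier"


# Toy 1-D data: six minority points (class 1) on the left, six majority
# points (class 0) on the right, and one minority point among the majority.
X = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.5,),
     (5.0,), (5.1,), (5.2,), (5.3,), (5.4,), (5.5,), (5.25,)]
y = [1] * 6 + [0] * 6 + [1]

print(example_type(0, X, y))   # minority point surrounded by its own class
print(example_type(12, X, y))  # minority point surrounded by the majority
```

In this toy data the first call lands in the safe category and the second in the outlier category, since the isolated minority point has no same-class neighbour among its five nearest.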

A Heterogeneous Ensemble Learning Framework for Spam Detection in Social Networks with Imbalanced Data

Applied Sciences ◽

10.3390/app10030936 ◽

2020 ◽

Vol 10 (3) ◽

pp. 936 ◽

Cited By ~ 3

Author(s):

Chensu Zhao ◽

Yang Xin ◽

Xuefeng Li ◽

Yixian Yang ◽

Yuling Chen

Keyword(s):

Social Networks ◽

Ensemble Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

Spam Detection ◽

Real World Data ◽

Learning Framework ◽

The Impact ◽

Base Module

The popularity of social networks provides people with many conveniences, but their rapid growth has also attracted many attackers. In recent years, the malicious behavior of social network spammers has seriously threatened the information security of ordinary users. To reduce this threat, many researchers have mined the behavior characteristics of spammers and have obtained good results by applying machine learning algorithms to identify spammers in social networks. However, most of these studies overlook class imbalance situations that exist in real world data. In this paper, we propose a heterogeneous stacking-based ensemble learning framework to ameliorate the impact of class imbalance on spam detection in social networks. The proposed framework consists of two main components, a base module and a combining module. In the base module, we adopt six different base classifiers and utilize this classifier diversity to construct new ensemble input members. In the combination module, we introduce cost sensitive learning into deep neural network training. By setting different costs for misclassification and dynamically adjusting the weights of the prediction results of the base classifiers, we can integrate the input members and aggregate the classification results. The experimental results show that our framework effectively improves the spam detection rate on imbalanced datasets.
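
Cost-sensitive learning, as mentioned above, typically means that the training loss penalizes the two kinds of misclassification differently. The abstract does not give the cost values or the network architecture, so the following is only a minimal sketch of the general idea using a cost-weighted log loss. The function name and the costs (5.0 for a missed spammer, 1.0 for a false alarm) are hypothetical.

```python
import math


def cost_weighted_log_loss(y_true, p_spam, cost_fn=5.0, cost_fp=1.0):
    """Mean binary log loss with class-dependent misclassification costs.

    y_true: 1 for spammer (minority class), 0 for legitimate user.
    p_spam: predicted probability of the spammer class.
    cost_fn / cost_fp: hypothetical costs of missing a spammer and of
    flagging a legitimate user; the paper's actual costs are not stated.
    """
    eps = 1e-12  # keep probabilities away from 0 and 1 before taking logs
    total = 0.0
    for label, p in zip(y_true, p_spam):
        p = min(max(p, eps), 1.0 - eps)
        if label == 1:
            total += -cost_fn * math.log(p)
        else:
            total += -cost_fp * math.log(1.0 - p)
    return total / len(y_true)


# An equally confident mistake costs more on the minority class.
missed_spammer = cost_weighted_log_loss([1], [0.1])
false_alarm = cost_weighted_log_loss([0], [0.9])
print(missed_spammer > false_alarm)
```

With these costs, a missed spammer contributes five times the loss of an equally confident false alarm, which pushes a learner trained on this loss toward higher recall on the minority class.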

Kernel Based Data-Adaptive Support Vector Machines for Multi-Class Classification

Mathematics ◽

10.3390/math9090936 ◽

2021 ◽

Vol 9 (9) ◽

pp. 936

Author(s):

Jianli Shao ◽

Xin Liu ◽

Wenqing He

Keyword(s):

Machine Learning ◽

Spatial Association ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Kernel Functions ◽

Support Vector ◽

Classification Problems ◽

Rare Class ◽

Data Adaptive

Imbalanced data exist in many classification problems, and classifying them remains a considerable challenge in machine learning. The support vector machine (SVM) and its variants are popular among classifiers thanks to their flexibility and interpretability. However, the performance of SVMs is impacted when the data are imbalanced, which is a typical data structure in the multi-category classification problem. In this paper, we employ the data-adaptive SVM with scaled kernel functions to classify instances for a multi-class population. We propose a multi-class data-dependent kernel function for the SVM by considering class imbalance and the spatial association among instances so that the classification accuracy is enhanced. Simulation studies demonstrate the superb performance of the proposed method, and a real multi-class prostate cancer image dataset is employed as an illustration. Not only does the proposed method outperform the competitor methods in terms of commonly used accuracy measures such as the F-score and G-means, but it also successfully detects more than 60% of instances from the rare class in the real data, while the competitors detect less than 20% of the rare class instances. The proposed method will benefit other scientific research fields, such as multiple region boundary detection.
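
The F-score and G-mean are used for imbalanced problems because plain accuracy can look good while the rare class is mostly missed. The sketch below shows one common multi-class formulation, assumed here rather than taken from the paper: the G-mean as the geometric mean of per-class recalls and the F-score as the macro-averaged F1. The function names and the toy labels are illustrative.

```python
import math
from collections import Counter


def _counts(y_true, y_pred):
    """Per-class true positives, false positives and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    return tp, fp, fn


def g_mean(y_true, y_pred):
    """Geometric mean of the recalls of all classes present in y_true."""
    tp, _, fn = _counts(y_true, y_pred)
    recalls = [tp[c] / (tp[c] + fn[c]) for c in sorted(set(y_true))]
    return math.prod(recalls) ** (1.0 / len(recalls))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = _counts(y_true, y_pred)
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: 8 common and 2 rare instances; one rare instance is missed.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 8 + ["common", "rare"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, g_mean(y_true, y_pred), macro_f1(y_true, y_pred))
```

Here accuracy is 0.9, while the G-mean drops to about 0.71 because recall on the rare class is only 0.5.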

Scalable kernel-based SVM classification algorithm on imbalance air quality data for proficient healthcare

Complex & Intelligent Systems ◽

10.1007/s40747-021-00435-5 ◽

2021 ◽

Author(s):

Shwet Ketu ◽

Pramod Kumar Mishra

Keyword(s):

Air Pollution ◽

Air Quality ◽

Class Imbalance ◽

Imbalanced Data ◽

Classification Algorithm ◽

Quality Data ◽

Pollution Level ◽

Classification Problems ◽

Chi Square ◽

The Impact

Abstract: In the last decade, we have seen drastic changes in the air pollution level, which has become a critical environmental issue. It must be handled carefully in order to support proficient healthcare. Reducing the impact of air pollution on human health is possible only if the data is correctly classified. Numerous classification problems face the class imbalance issue. Learning from imbalanced data is always a challenging task, and researchers have developed possible solutions over time. In this paper, we focus on dealing with the imbalanced class distribution in such a way that the classification algorithm does not compromise its performance. The proposed algorithm is based on the adjusting kernel scaling (AKS) method to deal with the multi-class imbalanced dataset. The selection of the kernel function has been evaluated with the help of weighting criteria and the chi-square test. All the experimental evaluation has been performed on the sensor-based Indian Central Pollution Control Board (CPCB) dataset. The proposed algorithm, with the highest accuracy of 99.66%, outperforms all the compared classification algorithms, namely AdaBoost (59.72%), Multi-Layer Perceptron (95.71%), GaussianNB (80.87%), and SVM (96.92%). The results of the proposed algorithm are also better than those of existing methods in the literature. These results also indicate that the proposed algorithm deals efficiently with class imbalance problems while enhancing performance. Thus, accurate classification of air quality through the proposed algorithm will be useful for improving the existing preventive policies and will also help in enhancing the capabilities of effective emergency response in the worst pollution situations.

A two-stage clustering-based cold-start method for active learning

Intelligent Data Analysis ◽

10.3233/ida-205393 ◽

2021 ◽

Vol 25 (5) ◽

pp. 1169-1185

Author(s):

Deniu He ◽

Hong Yu ◽

Guoyin Wang ◽

Jie Li

Keyword(s):

Active Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Cold Start ◽

Classification Performance ◽

The Novel ◽

Two Stage ◽

Minority Class ◽

Novel Method ◽

Multiple Clusters

This paper considers the problem of initializing active learning. In particular, it studies the problem in an imbalanced data scenario, which is called class-imbalance active learning cold-start. The novel method is a two-stage clustering-based active learning cold-start (ALCS). In the first stage, to separate the instances of the minority class from those of the majority class, a multi-center clustering is constructed based on a new inter-cluster tightness measure, so that the data is grouped into multiple clusters. Then, in the second stage, the initial training instances are selected from each cluster based on an adaptive candidate representative instances determination mechanism and a clusters-cyclic instance query mechanism. Comprehensive experiments demonstrate the effectiveness of the proposed method in terms of class coverage, classification performance, and impact on active learning.

Imbalanced Data Classification Based on Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.443.741 ◽

2013 ◽

Vol 443 ◽

pp. 741-745

Author(s):

Hu Li ◽

Peng Zou ◽

Wei Hong Han ◽

Rong Ze Xia

Keyword(s):

Real World ◽

Imbalanced Data ◽

Data Classification ◽

Comprehensive Analysis ◽

Classification Method ◽

Classification Methods ◽

Real World Data ◽

Minority Class ◽

Imbalanced Data Classification ◽

Traditional Classification

Much real-world data is imbalanced, i.e. one category contains significantly more samples than the other categories. Traditional classification methods treat different categories equally and are often ineffective. Based on a comprehensive analysis of existing research, we propose a new imbalanced data classification method based on clustering. The method first clusters both the majority class and the minority class. Then, the clustered minority class is over-sampled by SMOTE while the clustered majority class is under-sampled randomly. Through clustering, the proposed method can avoid the loss of useful information while resampling. Experiments on several UCI datasets show that the proposed method can effectively improve the classification results on imbalanced data.
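
The resampling step described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the clusters have already been produced by some clustering algorithm (the abstract does not say which), and the function names, the neighbour count k, and the sample counts are all hypothetical. The oversampling follows the basic SMOTE idea of interpolating between a minority point and one of its nearest neighbours, restricted here to a single cluster.

```python
import math
import random


def smote_cluster(points, n_new, k=3, rng=None):
    """Create n_new synthetic points by SMOTE-style interpolation in one cluster."""
    rng = rng or random.Random()
    if len(points) < 2:
        # Nothing to interpolate with: fall back to duplicating the only point.
        return [tuple(points[0]) for _ in range(n_new)] if points else []
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(points))
        base = points[i]
        others = [p for j, p in enumerate(points) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def undersample_cluster(points, n_keep, rng=None):
    """Randomly keep at most n_keep points of one majority cluster."""
    rng = rng or random.Random()
    return rng.sample(points, min(n_keep, len(points)))


rng = random.Random(0)
minority_cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority_cluster = [(float(i), 5.0) for i in range(20)]

new_minority = smote_cluster(minority_cluster, n_new=5, k=2, rng=rng)
kept_majority = undersample_cluster(majority_cluster, n_keep=8, rng=rng)
print(len(new_minority), len(kept_majority))  # prints "5 8"
```

Because every synthetic point lies on a segment between two points of the same cluster, oversampling stays inside that cluster's region, and sampling each majority cluster separately keeps every majority sub-concept represented after undersampling.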

Science for Everyone

Advances in Information and Communication Technology Education - ICTs for Modern Educational and Instructional Advancement ◽

10.4018/978-1-60566-936-6.ch028 ◽

2010 ◽

pp. 344-354

Author(s):

Charles A. Wood

Keyword(s):

Data Streams ◽

Emerging Technologies ◽

Data Use ◽

Real Data ◽

Scientific Instruments ◽

Culture Of Learning ◽

Directed Learning

Recent and emerging technologies offer many opportunities for exploration and learning. These technologies allow learners (of any age) to work with real data, use authentic scientific instruments, explore immersive simulations and act as scientists. The capabilities soon to be available raise questions about the role of schools and do rely on directed learning traditionally supplied by teachers. The prevalence of new tools and data streams can transform society, not just kids, into a culture of learning.
