Classification of Imbalanced Data Using Deep Learning with Adding Noise

2021 ◽  
Vol 2021 ◽  
pp. 1-18
Author(s):  
Wan-Wei Fan ◽  
Ching-Hung Lee

This paper proposes a method for classifying imbalanced data by adding noise to the feature space of a convolutional neural network (CNN) without changing the data set (i.e., the ratio of majority to minority data). In addition, a hybrid loss function combining cross-entropy and KL divergence is proposed. The proposed approach improves the accuracy of the minority class on the testing data. A simple design method for selecting the CNN structure is first introduced; noise is then added to the feature space of the CNN to obtain proper features through training and to improve the classification results. The comparison results show that the proposed method extracts suitable features that improve the accuracy of the minority class. Finally, illustrative examples of multiclass classification problems and a corresponding discussion of the balance ratio are presented. Our approach performs well with a smaller network structure compared with other deep models. In addition, the noise-adding approach improves defective-class accuracy by over 40%. Finally, the accuracy remains higher than 96% even when the imbalance ratio (IR) is one hundred.
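The abstract does not give implementation details; the following PyTorch sketch shows one plausible reading of the idea, where Gaussian noise is injected into an intermediate CNN feature map during training and the loss combines cross-entropy with a KL-divergence term. The layer sizes, noise level `sigma`, the uniform reference distribution, and the weighting `lambda_kl` are all illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyFeatureCNN(nn.Module):
    """Small CNN that adds Gaussian noise to its feature space during training."""
    def __init__(self, num_classes=2, sigma=0.1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 inputs
        self.sigma = sigma

    def forward(self, x):
        feats = self.features(x)
        if self.training:                      # inject noise only while training
            feats = feats + self.sigma * torch.randn_like(feats)
        return self.classifier(feats.flatten(1))

def hybrid_loss(logits, targets, lambda_kl=0.5):
    """Cross-entropy plus a KL term between the predictions and a uniform prior
    (one possible form of the 'hybrid' loss; the paper's exact form may differ)."""
    ce = F.cross_entropy(logits, targets)
    log_probs = F.log_softmax(logits, dim=1)
    uniform = torch.full_like(log_probs.exp(), 1.0 / logits.size(1))
    kl = F.kl_div(log_probs, uniform, reduction="batchmean")
    return ce + lambda_kl * kl
```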

2018 ◽  
Vol 2018 ◽  
pp. 1-15 ◽  
Author(s):  
Huaping Guo ◽  
Xiaoyu Diao ◽  
Hongbing Liu

Rotation Forest is an ensemble learning approach that achieves better performance than Bagging and Boosting by building accurate and diverse classifiers in rotated feature spaces. However, like other conventional classifiers, Rotation Forest does not work well on imbalanced data, which are characterized by having far fewer examples of one class (the minority class) than of the other (the majority class), while the cost of misclassifying minority-class examples is often much higher than that of the contrary cases. This paper proposes a novel method called Embedding Undersampling Rotation Forest (EURF) to handle this problem by (1) sampling subsets from the majority class and learning a projection matrix from each subset and (2) obtaining training sets by projecting re-undersampled subsets of the original data set onto the new spaces defined by these matrices and constructing an individual classifier from each training set. In the first step, undersampling forces the rotation matrix to better capture the features of the minority class without harming the diversity between individual classifiers. In the second step, the undersampling technique aims to improve the performance of the individual classifiers on the minority class. The experimental results show that EURF achieves significantly better performance than other state-of-the-art methods.
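A minimal sketch of this two-step idea follows, under explicit assumptions: PCA stands in for the rotation/projection matrix, decision trees are the base classifiers, labels are binary integers with the minority class labeled 1, and predictions are combined by majority vote. None of these choices is prescribed by the abstract.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def undersample(X, y, rng, minority=1):
    """Keep all minority examples plus an equal-sized random subset of the majority."""
    min_idx = np.where(y == minority)[0]
    maj_idx = np.where(y != minority)[0]
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, keep])
    return X[idx], y[idx]

def fit_eurf_like(X, y, n_members=10, seed=0):
    """EURF-like ensemble: each member learns a projection (here PCA) on one
    undersampled subset, then trains a tree on a re-undersampled projected set."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        Xs, _ = undersample(X, y, rng)            # subset used to learn the rotation
        proj = PCA(n_components=min(X.shape[1], Xs.shape[0])).fit(Xs)
        Xt, yt = undersample(X, y, rng)           # re-undersampled training set
        clf = DecisionTreeClassifier(random_state=seed).fit(proj.transform(Xt), yt)
        members.append((proj, clf))
    return members

def predict_eurf_like(members, X):
    votes = np.stack([clf.predict(proj.transform(X)) for proj, clf in members])
    # majority vote across ensemble members
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```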


Author(s):  
Ferdinand Bollwein ◽  
Stephan Westphal

Abstract Univariate decision tree induction methods for multiclass classification problems such as CART, C4.5 and ID3 continue to be very popular in the context of machine learning due to their major benefit of being easy to interpret. However, as these trees only consider a single attribute per node, they often get quite large, which lowers their explanatory value. Oblique decision tree building algorithms, which divide the feature space by multidimensional hyperplanes, often produce much smaller trees, but the individual splits are hard to interpret. Moreover, the effort of finding optimal oblique splits is so high that heuristics have to be applied to determine locally optimal solutions. In this work, we introduce an effective branch and bound procedure to determine globally optimal bivariate oblique splits for concave impurity measures. Decision trees based on these bivariate oblique splits remain fairly interpretable due to the restriction to two attributes per split. The resulting trees are significantly smaller and more accurate than their univariate counterparts due to their ability to adapt better to the underlying data and to capture interactions of attribute pairs. Moreover, our evaluation shows that our algorithm even outperforms algorithms based on heuristically obtained multivariate oblique splits, despite the fact that we are focusing on two attributes only.
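The branch and bound procedure itself is the paper's contribution and is not reproduced here; the sketch below only illustrates what a bivariate oblique split is and how a concave impurity measure (Gini, in this assumed example) scores one candidate split over a pair of attributes. The attribute indices, weights and threshold in the usage comment are hypothetical.

```python
import numpy as np

def gini(y):
    """Gini impurity, a concave impurity measure over class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def score_bivariate_split(X, y, i, j, w, t):
    """Score the oblique split  w[0]*x_i + w[1]*x_j <= t  by weighted child impurity.
    A branch-and-bound search (as in the paper) would look for the (w, t) pair
    minimising this score; here we only evaluate a single candidate."""
    left = X[:, i] * w[0] + X[:, j] * w[1] <= t
    n, n_left = len(y), left.sum()
    if n_left == 0 or n_left == n:          # degenerate split, nothing separated
        return gini(y)
    return (n_left / n) * gini(y[left]) + ((n - n_left) / n) * gini(y[~left])

# Example usage with a hypothetical candidate split on attributes 0 and 1:
# score = score_bivariate_split(X, y, i=0, j=1, w=(1.0, -0.5), t=0.2)
```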


2021 ◽  
Author(s):  
Seyed Navid Roohani Isfahani ◽  
Vinicius M. Sauer ◽  
Ingmar Schoegl

Abstract Micro-combustion has shown significant potential for studying and characterizing the combustion behavior of hydrocarbon fuels. Among several experimental approaches based on this method, the most prominent one employs an externally heated micro-channel. Three distinct combustion regimes are reported for this device, namely weak flames, flames with repetitive extinction and ignition (FREI), and normal flames, which are formed at low, moderate, and high flow rates, respectively. Within each flame regime, noticeable differences exist in both shape and luminosity, and transition points can be used to obtain insights into fuel characteristics. In this study, flame images are obtained using a monochrome camera equipped with a 430 nm bandpass filter to capture the chemiluminescence signal emitted by the flame. Sequences of conventional flame photographs are taken during the experiment and computationally merged to generate high dynamic range (HDR) images. In a highly diluted fuel/oxidizer mixture, it is observed that FREI disappear and are replaced by a gradual and direct transition between weak and normal flames, which makes it hard to identify the different combustion regimes. To resolve this issue, a convolutional neural network (CNN) is introduced to classify the flame regime. The accuracy of the model is calculated to be 99.34%, 99.66%, and 99.83% for the training, validation, and testing data sets, respectively. This level of accuracy is achieved by conducting a grid search to acquire optimized parameters for the CNN. Furthermore, a data augmentation technique based on different experimental scenarios is used to generate additional flame images and increase the size of the data set.
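The abstract does not describe the network architecture; the snippet below is only a minimal PyTorch sketch of a three-class flame-regime classifier operating on single-channel (chemiluminescence) images. The layer sizes, the 64x64 input resolution, and the class count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FlameRegimeCNN(nn.Module):
    """Minimal CNN classifying flame images into weak / FREI / normal regimes.
    Input: 1-channel 64x64 images (an assumed resolution)."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.net(x)

# The grid search mentioned in the abstract could then simply retrain this model
# over a loop of candidate learning rates and batch sizes, keeping the best one.
```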


Author(s):  
Yilin Yan ◽  
Min Chen ◽  
Saad Sadiq ◽  
Mei-Ling Shyu

The classification of imbalanced datasets has recently attracted significant attention due to its implications in several real-world use cases. Classifiers developed on datasets with skewed distributions tend to favor the majority classes and are biased against the minority class. Despite extensive research interest, imbalanced data classification remains a challenge in data mining research, especially for multimedia data. Our attempt to overcome this hurdle is to develop a convolutional neural network (CNN) based deep learning solution integrated with a bootstrapping technique. Considering that convolutional neural networks are very computationally expensive, especially when coupled with big training datasets, we propose to extract features from pre-trained convolutional neural network models and feed those features to a separate fully connected neural network. A Spark implementation shows the promising performance of our model in handling big datasets with respect to feasibility and scalability.
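A minimal PyTorch sketch of the feature-extraction idea follows: freeze a pre-trained backbone, compute features once, and train only a small fully connected head on them. The choice of ResNet-18, the two-class head, and the 224x224 input size are assumptions; the Spark deployment and bootstrapping steps from the abstract are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained backbone used purely as a frozen feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()            # drop the final classification layer
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Small fully connected network trained on the extracted 512-d features.
head = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 2),                 # assumed two-class problem
)

def extract_features(images):
    """images: tensor of shape (N, 3, 224, 224), already normalised."""
    with torch.no_grad():
        return backbone(images)

# Training then only updates `head`, which is far cheaper than fine-tuning the CNN.
```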


2002 ◽  
Vol 30 (4) ◽  
pp. 239-247
Author(s):  
S. M. Shamsuddin ◽  
M. Darus ◽  
M. N. Sulaiman

Data reduction is a process of feature extraction that transforms the data space into a feature space of much lower dimension than the original data space, yet retains most of the intrinsic information content of the data. This can be done using a number of methods, such as principal component analysis (PCA), factor analysis, and feature clustering. Principal components are extracted from a collection of multivariate cases as a way of accounting for as much of the variation in that collection as possible by means of as few variables as possible. On the other hand, the backpropagation network has been used extensively in classification problems such as the XOR problem, share price prediction, and pattern recognition. This paper proposes an improved error signal for the backpropagation network for classification of the reduction invariants obtained using principal component analysis, which extracts the bulk of the useful information present in the moment invariants of handwritten digits and leaves the redundant information behind. Higher-order centralised scale invariants are used to extract features of handwritten digits before PCA, and the reduction invariants are sent to the improved backpropagation model for classification purposes.
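The paper's improved error signal and moment-invariant features are specific to its model and are not reproduced; the sketch below only shows the general pipeline, assuming the standard scikit-learn digits data and an ordinary backpropagation-trained MLP: reduce the features with PCA, then classify the reduced vectors. The component count and hidden layer size are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA keeps the components explaining most of the variance; the MLP is then
# trained with standard backpropagation on the reduced feature space.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```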


2017 ◽  
Vol 17 (1) ◽  
pp. 45-62 ◽  
Author(s):  
Lincy Meera Mathews ◽  
Hari Seetha

Abstract Mining of imbalanced data is a challenging task due to its complex inherent characteristics. Conventional classifiers such as the nearest neighbor are severely biased towards the majority class, as minority-class data are under-represented and outnumbered. This paper focuses on building an improved nearest neighbor classifier for two-class imbalanced data. Three oversampling techniques are presented for generating artificial instances of the minority class to balance the distribution among the classes. Experimental results showed that the proposed methods outperformed the conventional classifier.
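The abstract does not define its three oversampling techniques; the sketch below shows one generic SMOTE-style interpolation oversampler feeding a nearest-neighbour classifier, purely to illustrate the balancing idea. The neighbour counts, the label convention (minority class = 1), and the synthesis scheme are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

def interpolate_minority(X_min, n_new, k=5, seed=0):
    """Create synthetic minority points by interpolating towards minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # column 0 is the point itself
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

def balance_and_fit(X, y, minority=1):
    """Oversample the minority class until the classes are balanced, then fit kNN."""
    X_min, X_maj = X[y == minority], X[y != minority]
    X_new = interpolate_minority(X_min, n_new=len(X_maj) - len(X_min))
    Xb = np.vstack([X, X_new])
    yb = np.concatenate([y, np.full(len(X_new), minority)])
    return KNeighborsClassifier(n_neighbors=3).fit(Xb, yb)
```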


2020 ◽  
Vol 10 (3) ◽  
pp. 973 ◽  
Author(s):  
Hsien-I Lin ◽  
Mihn Cong Nguyen

Data imbalance during the training of deep networks can cause the network to skip over minority classes during learning. This paper presents a novel framework for training segmentation networks using imbalanced point cloud data. PointNet, an early deep network used for the segmentation of point cloud data, proved effective in the point-wise classification of balanced data; however, performance degraded when imbalanced data were used. The proposed approach involves removing between-class data point imbalances and guiding the network to pay more attention to minority classes. Data imbalance is alleviated using a hybrid-sampling method involving undersampling, to decrease the amount of data in the majority classes, and oversampling, to increase the amount of data in the minority classes. A balanced focal loss function is also used to emphasize the minority classes through the automated assignment of costs to the various classes based on their density in the point cloud. Experiments demonstrate the effectiveness of the proposed training framework on a point cloud dataset pertaining to six objects. The mean intersection over union (mIoU) test accuracy results obtained with standard PointNet training were as follows: XYZRGB data (91%) and XYZ data (86%). The mIoU test accuracy results obtained using the proposed scheme were as follows: XYZRGB data (98%) and XYZ data (93%).
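The hybrid-sampling step is dataset-specific, so the sketch below covers only the loss: a focal-style loss whose per-class weights come from inverse class density in the point cloud, which is one plausible reading of the "balanced focal loss" described above. The gamma value and the exact weighting scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def balanced_focal_loss(logits, targets, class_counts, gamma=2.0):
    """Focal-style loss with per-class weights inversely proportional to class density.
    logits: (N, C) per-point scores; targets: (N,) integer labels;
    class_counts: (C,) tensor with the number of points per class in the training set."""
    weights = class_counts.sum() / (len(class_counts) * class_counts.float())
    weights = weights / weights.sum()                 # normalise the class weights
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log prob of true class
    focal = (1.0 - log_pt.exp()) ** gamma             # down-weight easy, confident points
    return -(weights[targets] * focal * log_pt).mean()
```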


Author(s):  
M. Aldiki Febriantono ◽  
Sholeh Hadi Pramono ◽  
Rahmadwati Rahmadwati ◽  
Golshah Naghdy

Multiclass imbalanced data problems in data mining are currently an interesting topic of study. These problems influence the classification process in machine learning. In some cases, the minority class in a dataset carries more important information than the majority class. When the minority class is misclassified, it affects the accuracy value and the classifier performance. In this research, a cost-sensitive C5.0 decision tree is used to solve multiclass imbalanced data problems. In the first stage, the decision tree model is built using the C5.0 algorithm; cost-sensitive learning then uses the MetaCost method to obtain the minimum-cost model. The test results show that the C5.0 algorithm performs better than the C4.5 and ID3 algorithms, with performance percentages of 40.91%, 40.24%, and 19.23%, respectively.
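C5.0 and MetaCost are not available in scikit-learn, so the sketch below is only a rough analogue: per-class misclassification costs are applied through `class_weight` in a standard decision tree, to illustrate how cost sensitivity pushes the tree toward the minority classes. The toy data and cost values are illustrative assumptions, not the paper's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Imbalanced three-class toy problem standing in for the paper's datasets.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Higher weights mean that misclassifying that class is treated as more costly.
costs = {0: 1.0, 1: 3.0, 2: 10.0}
tree = DecisionTreeClassifier(class_weight=costs, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, tree.predict(X_te)))
```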


2021 ◽  
Vol 11 (11) ◽  
pp. 4970
Author(s):  
Łukasz Rybak ◽  
Janusz Dudczyk

The history of gravitational classification started in 1977. Over the years, gravitational approaches have gained many extensions, which have been adapted to different classification problems. This article is the next stage of research concerning algorithms that create data particles by geometrical division. Previous analyses established that the Geometrical Divide (GD) method outperforms the algorithm that creates data particles based on classes with a compound of 1 ÷ 1 cardinality. This occurs in the classification of balanced data sets in which class centroids are close to each other and the groups of objects described by different labels overlap. The purpose of this article is to examine the efficiency of the Geometrical Divide method in the classification of unbalanced data sets, using the real-world example of occupancy detection. In addition, the concept of the Unequal Geometrical Divide (UGD) is developed. The evaluation was conducted on 26 unbalanced data sets: 16 with the features of the Moons and Circles data sets and 10 created from a real occupancy data set. In the experiment, the GD method and its unbalanced variant (UGD), as well as the 1CT1P approach, were compared. Each method was combined with three data particle mass determination algorithms: the n-Mass Model (n-MM), the Stochastic Learning Algorithm (SLA), and the Batch-update Algorithm (BLA). The k-fold cross-validation method, precision, recall, F-measure, and the number of data particles used were applied in the evaluation process. The obtained results showed that the methods based on geometrical division outperform the 1CT1P approach in the classification of imbalanced data sets. The article's conclusion describes the observations and indicates potential directions for further research and development of methods that create data particles through geometrical division.
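The GD and UGD algorithms themselves are not reproduced here; the sketch below only illustrates the evaluation protocol named above: stratified k-fold cross-validation reporting precision, recall, and F-measure on an imbalanced Moons-style data set. The stand-in classifier, the imbalance ratio, and the data generation are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Imbalanced variant of the Moons data set used as a stand-in for the paper's sets:
# keep all of class 0 but only ~20% of class 1.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
keep = (y == 0) | (np.random.default_rng(0).random(len(y)) < 0.2)
X, y = X[keep], y[keep]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(KNeighborsClassifier(), X, y, cv=cv,
                        scoring=("precision", "recall", "f1"))
for metric in ("test_precision", "test_recall", "test_f1"):
    print(metric, scores[metric].mean())
```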

