Classification of Imbalanced Data Using Deep Learning with Adding Noise

2021 ◽  
Vol 2021 ◽  
pp. 1-18
Author(s):  
Wan-Wei Fan ◽  
Ching-Hung Lee

This paper proposes a method for classifying imbalanced data by adding noise to the feature space of a convolutional neural network (CNN) without changing the data set (i.e., the ratio of majority to minority data). In addition, a hybrid loss function combining cross-entropy and KL divergence is proposed. The proposed approach improves the accuracy of the minority class on the testing data. A simple design method for selecting the CNN structure is first introduced; noise is then added to the feature space of the CNN to obtain proper features through training and to improve the classification results. The comparison results show that the proposed method extracts suitable features that improve the accuracy of the minority class. Finally, illustrative examples of multiclass classification problems and a corresponding discussion of the balance ratio are presented. Our approach performs well with a smaller network structure compared with other deep models. In addition, the noise-adding approach improves defective-class accuracy by over 40%. Finally, the accuracy remains higher than 96% even when the imbalance ratio (IR) is one hundred.
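The abstract does not give implementation details; the following PyTorch sketch shows one plausible reading of the idea, where Gaussian noise is injected into an intermediate CNN feature map during training and the loss combines cross-entropy with a KL-divergence term. The layer sizes, noise level `sigma`, the uniform reference distribution, and the weighting `lambda_kl` are all illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyFeatureCNN(nn.Module):
    """Small CNN that adds Gaussian noise to its feature space during training."""
    def __init__(self, num_classes=2, sigma=0.1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 inputs
        self.sigma = sigma

    def forward(self, x):
        feats = self.features(x)
        if self.training:                      # inject noise only while training
            feats = feats + self.sigma * torch.randn_like(feats)
        return self.classifier(feats.flatten(1))

def hybrid_loss(logits, targets, lambda_kl=0.5):
    """Cross-entropy plus a KL term between the predictions and a uniform prior
    (one possible form of the 'hybrid' loss; the paper's exact form may differ)."""
    ce = F.cross_entropy(logits, targets)
    log_probs = F.log_softmax(logits, dim=1)
    uniform = torch.full_like(log_probs.exp(), 1.0 / logits.size(1))
    kl = F.kl_div(log_probs, uniform, reduction="batchmean")
    return ce + lambda_kl * kl
```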

2018 ◽  
Vol 2018 ◽  
pp. 1-15 ◽  
Author(s):  
Huaping Guo ◽  
Xiaoyu Diao ◽  
Hongbing Liu

Rotation Forest is an ensemble learning approach that achieves better performance than Bagging and Boosting by building accurate and diverse classifiers in rotated feature spaces. However, like other conventional classifiers, Rotation Forest does not work well on imbalanced data, which are characterized by having far fewer examples of one class (the minority class) than of the other (the majority class), while the cost of misclassifying minority-class examples is often much higher than that of the contrary cases. This paper proposes a novel method called Embedding Undersampling Rotation Forest (EURF) to handle this problem by (1) sampling subsets from the majority class and learning a projection matrix from each subset and (2) obtaining training sets by projecting re-undersampled subsets of the original data set onto the new spaces defined by these matrices and constructing an individual classifier from each training set. In the first step, undersampling forces the rotation matrix to better capture the features of the minority class without harming the diversity between individual classifiers. In the second step, the undersampling technique aims to improve the performance of the individual classifiers on the minority class. The experimental results show that EURF achieves significantly better performance than other state-of-the-art methods.
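A minimal sketch of this two-step idea follows, under explicit assumptions: PCA stands in for the rotation/projection matrix, decision trees are the base classifiers, labels are binary integers with the minority class labeled 1, and predictions are combined by majority vote. None of these choices is prescribed by the abstract.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def undersample(X, y, rng, minority=1):
    """Keep all minority examples plus an equal-sized random subset of the majority."""
    min_idx = np.where(y == minority)[0]
    maj_idx = np.where(y != minority)[0]
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, keep])
    return X[idx], y[idx]

def fit_eurf_like(X, y, n_members=10, seed=0):
    """EURF-like ensemble: each member learns a projection (here PCA) on one
    undersampled subset, then trains a tree on a re-undersampled projected set."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        Xs, _ = undersample(X, y, rng)            # subset used to learn the rotation
        proj = PCA(n_components=min(X.shape[1], Xs.shape[0])).fit(Xs)
        Xt, yt = undersample(X, y, rng)           # re-undersampled training set
        clf = DecisionTreeClassifier(random_state=seed).fit(proj.transform(Xt), yt)
        members.append((proj, clf))
    return members

def predict_eurf_like(members, X):
    votes = np.stack([clf.predict(proj.transform(X)) for proj, clf in members])
    # majority vote across ensemble members
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```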


Author(s):  
Ferdinand Bollwein ◽  
Stephan Westphal

Abstract Univariate decision tree induction methods for multiclass classification problems such as CART, C4.5 and ID3 continue to be very popular in the context of machine learning due to their major benefit of being easy to interpret. However, as these trees only consider a single attribute per node, they often get quite large, which lowers their explanatory value. Oblique decision tree building algorithms, which divide the feature space by multidimensional hyperplanes, often produce much smaller trees, but the individual splits are hard to interpret. Moreover, the effort of finding optimal oblique splits is so high that heuristics have to be applied to determine locally optimal solutions. In this work, we introduce an effective branch and bound procedure to determine globally optimal bivariate oblique splits for concave impurity measures. Decision trees based on these bivariate oblique splits remain fairly interpretable due to the restriction to two attributes per split. The resulting trees are significantly smaller and more accurate than their univariate counterparts due to their ability to adapt better to the underlying data and to capture interactions of attribute pairs. Moreover, our evaluation shows that our algorithm even outperforms algorithms based on heuristically obtained multivariate oblique splits, despite the fact that we are focusing on two attributes only.
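The branch and bound procedure itself is the paper's contribution and is not reproduced here; the sketch below only illustrates what a bivariate oblique split is and how a concave impurity measure (Gini, in this assumed example) scores one candidate split over a pair of attributes. The attribute indices, weights and threshold in the usage comment are hypothetical.

```python
import numpy as np

def gini(y):
    """Gini impurity, a concave impurity measure over class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def score_bivariate_split(X, y, i, j, w, t):
    """Score the oblique split  w[0]*x_i + w[1]*x_j <= t  by weighted child impurity.
    A branch-and-bound search (as in the paper) would look for the (w, t) pair
    minimising this score; here we only evaluate a single candidate."""
    left = X[:, i] * w[0] + X[:, j] * w[1] <= t
    n, n_left = len(y), left.sum()
    if n_left == 0 or n_left == n:          # degenerate split, nothing separated
        return gini(y)
    return (n_left / n) * gini(y[left]) + ((n - n_left) / n) * gini(y[~left])

# Example usage with a hypothetical candidate split on attributes 0 and 1:
# score = score_bivariate_split(X, y, i=0, j=1, w=(1.0, -0.5), t=0.2)
```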


2021 ◽  
Author(s):  
Seyed Navid Roohani Isfahani ◽  
Vinicius M. Sauer ◽  
Ingmar Schoegl

Abstract Micro-combustion has shown significant potential for studying and characterizing the combustion behavior of hydrocarbon fuels. Among several experimental approaches based on this method, the most prominent one employs an externally heated micro-channel. Three distinct combustion regimes are reported for this device, namely weak flames, flames with repetitive extinction and ignition (FREI), and normal flames, which are formed at low, moderate, and high flow rates, respectively. Within each flame regime, noticeable differences exist in both shape and luminosity, and transition points can be used to obtain insights into fuel characteristics. In this study, flame images are obtained using a monochrome camera equipped with a 430 nm bandpass filter to capture the chemiluminescence signal emitted by the flame. Sequences of conventional flame photographs are taken during the experiment and computationally merged to generate high dynamic range (HDR) images. In a highly diluted fuel/oxidizer mixture, it is observed that FREI disappear and are replaced by a gradual and direct transition between weak and normal flames, which makes it hard to identify the different combustion regimes. To resolve this issue, a convolutional neural network (CNN) is introduced to classify the flame regime. The accuracy of the model is calculated to be 99.34%, 99.66%, and 99.83% for the training, validation, and testing data sets, respectively. This level of accuracy is achieved by conducting a grid search to acquire optimized parameters for the CNN. Furthermore, a data augmentation technique based on different experimental scenarios is used to generate additional flame images and increase the size of the data set.
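The abstract does not describe the network architecture; the snippet below is only a minimal PyTorch sketch of a three-class flame-regime classifier operating on single-channel (chemiluminescence) images. The layer sizes, the 64x64 input resolution, and the class count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FlameRegimeCNN(nn.Module):
    """Minimal CNN classifying flame images into weak / FREI / normal regimes.
    Input: 1-channel 64x64 images (an assumed resolution)."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.net(x)

# The grid search mentioned in the abstract could then simply retrain this model
# over a loop of candidate learning rates and batch sizes, keeping the best one.
```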


Author(s):  
Yilin Yan ◽  
Min Chen ◽  
Saad Sadiq ◽  
Mei-Ling Shyu

The classification of imbalanced datasets has recently attracted significant attention due to its implications in several real-world use cases. Classifiers developed on datasets with skewed distributions tend to favor the majority classes and are biased against the minority class. Despite extensive research interest, imbalanced data classification remains a challenge in data mining research, especially for multimedia data. Our attempt to overcome this hurdle is to develop a convolutional neural network (CNN) based deep learning solution integrated with a bootstrapping technique. Considering that convolutional neural networks are very computationally expensive, especially when coupled with big training datasets, we propose to extract features from pre-trained convolutional neural network models and feed those features to a separate fully connected neural network. A Spark implementation shows the promising performance of our model in handling big datasets with respect to feasibility and scalability.
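A minimal PyTorch sketch of the feature-extraction idea follows: freeze a pre-trained backbone, compute features once, and train only a small fully connected head on them. The choice of ResNet-18, the two-class head, and the 224x224 input size are assumptions; the Spark deployment and bootstrapping steps from the abstract are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained backbone used purely as a frozen feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()            # drop the final classification layer
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Small fully connected network trained on the extracted 512-d features.
head = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 2),                 # assumed two-class problem
)

def extract_features(images):
    """images: tensor of shape (N, 3, 224, 224), already normalised."""
    with torch.no_grad():
        return backbone(images)

# Training then only updates `head`, which is far cheaper than fine-tuning the CNN.
```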


2002 ◽  
Vol 30 (4) ◽  
pp. 239-247
Author(s):  
S. M. Shamsuddin ◽  
M. Darus ◽  
M. N. Sulaiman

Data reduction is a process of feature extraction that transforms the data space into a feature space of much lower dimension than the original data space, yet retains most of the intrinsic information content of the data. This can be done using a number of methods, such as principal component analysis (PCA), factor analysis, and feature clustering. Principal components are extracted from a collection of multivariate cases as a way of accounting for as much of the variation in that collection as possible by means of as few variables as possible. On the other hand, the backpropagation network has been used extensively in classification problems such as the XOR problem, share price prediction, and pattern recognition. This paper proposes an improved error signal for the backpropagation network for classification of the reduction invariants obtained using principal component analysis, which extracts the bulk of the useful information present in the moment invariants of handwritten digits and leaves the redundant information behind. Higher-order centralised scale invariants are used to extract features of handwritten digits before PCA, and the reduction invariants are sent to the improved backpropagation model for classification purposes.
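The paper's improved error signal and moment-invariant features are specific to its model and are not reproduced; the sketch below only shows the general pipeline, assuming the standard scikit-learn digits data and an ordinary backpropagation-trained MLP: reduce the features with PCA, then classify the reduced vectors. The component count and hidden layer size are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA keeps the components explaining most of the variance; the MLP is then
# trained with standard backpropagation on the reduced feature space.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```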


2017 ◽  
Vol 17 (1) ◽  
pp. 45-62 ◽  
Author(s):  
Lincy Meera Mathews ◽  
Hari Seetha

Abstract Mining of imbalanced data is a challenging task due to its complex inherent characteristics. Conventional classifiers such as the nearest neighbor are severely biased towards the majority class, as minority-class data are under-represented and outnumbered. This paper focuses on building an improved nearest neighbor classifier for two-class imbalanced data. Three oversampling techniques are presented for generating artificial instances of the minority class to balance the distribution among the classes. Experimental results showed that the proposed methods outperformed the conventional classifier.
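The abstract does not define its three oversampling techniques; the sketch below shows one generic SMOTE-style interpolation oversampler feeding a nearest-neighbour classifier, purely to illustrate the balancing idea. The neighbour counts, the label convention (minority class = 1), and the synthesis scheme are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

def interpolate_minority(X_min, n_new, k=5, seed=0):
    """Create synthetic minority points by interpolating towards minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # column 0 is the point itself
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

def balance_and_fit(X, y, minority=1):
    """Oversample the minority class until the classes are balanced, then fit kNN."""
    X_min, X_maj = X[y == minority], X[y != minority]
    X_new = interpolate_minority(X_min, n_new=len(X_maj) - len(X_min))
    Xb = np.vstack([X, X_new])
    yb = np.concatenate([y, np.full(len(X_new), minority)])
    return KNeighborsClassifier(n_neighbors=3).fit(Xb, yb)
```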


2020 ◽  
Vol 10 (3) ◽  
pp. 973 ◽  
Author(s):  
Hsien-I Lin ◽  
Mihn Cong Nguyen

Data imbalance during the training of deep networks can cause the network to skip over minority classes during learning. This paper presents a novel framework for training segmentation networks using imbalanced point cloud data. PointNet, an early deep network used for the segmentation of point cloud data, proved effective in the point-wise classification of balanced data; however, performance degraded when imbalanced data were used. The proposed approach involves removing between-class data point imbalances and guiding the network to pay more attention to minority classes. Data imbalance is alleviated using a hybrid-sampling method involving undersampling, to decrease the amount of data in the majority classes, and oversampling, to increase the amount of data in the minority classes. A balanced focal loss function is also used to emphasize the minority classes through the automated assignment of costs to the various classes based on their density in the point cloud. Experiments demonstrate the effectiveness of the proposed training framework on a point cloud dataset pertaining to six objects. The mean intersection over union (mIoU) test accuracy results obtained with standard PointNet training were as follows: XYZRGB data (91%) and XYZ data (86%). The mIoU test accuracy results obtained using the proposed scheme were as follows: XYZRGB data (98%) and XYZ data (93%).
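The hybrid-sampling step is dataset-specific, so the sketch below covers only the loss: a focal-style loss whose per-class weights come from inverse class density in the point cloud, which is one plausible reading of the "balanced focal loss" described above. The gamma value and the exact weighting scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def balanced_focal_loss(logits, targets, class_counts, gamma=2.0):
    """Focal-style loss with per-class weights inversely proportional to class density.
    logits: (N, C) per-point scores; targets: (N,) integer labels;
    class_counts: (C,) tensor with the number of points per class in the training set."""
    weights = class_counts.sum() / (len(class_counts) * class_counts.float())
    weights = weights / weights.sum()                 # normalise the class weights
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log prob of true class
    focal = (1.0 - log_pt.exp()) ** gamma             # down-weight easy, confident points
    return -(weights[targets] * focal * log_pt).mean()
```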


Author(s):  
M. Aldiki Febriantono ◽  
Sholeh Hadi Pramono ◽  
Rahmadwati Rahmadwati ◽  
Golshah Naghdy

Multiclass imbalanced data problems in data mining are currently an interesting topic of study. These problems influence the classification process in machine learning. In some cases, the minority class in a dataset carries more important information than the majority class. When the minority class is misclassified, it affects the accuracy value and the classifier performance. In this research, a cost-sensitive C5.0 decision tree is used to solve multiclass imbalanced data problems. In the first stage, the decision tree model is built using the C5.0 algorithm; cost-sensitive learning then uses the MetaCost method to obtain the minimum-cost model. The test results show that the C5.0 algorithm performs better than the C4.5 and ID3 algorithms, with performance percentages of 40.91%, 40.24%, and 19.23%, respectively.
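C5.0 and MetaCost are not available in scikit-learn, so the sketch below is only a rough analogue: per-class misclassification costs are applied through `class_weight` in a standard decision tree, to illustrate how cost sensitivity pushes the tree toward the minority classes. The toy data and cost values are illustrative assumptions, not the paper's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Imbalanced three-class toy problem standing in for the paper's datasets.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Higher weights mean that misclassifying that class is treated as more costly.
costs = {0: 1.0, 1: 3.0, 2: 10.0}
tree = DecisionTreeClassifier(class_weight=costs, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, tree.predict(X_te)))
```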


2021 ◽  
Vol 11 (11) ◽  
pp. 4970
Author(s):  
Łukasz Rybak ◽  
Janusz Dudczyk

The history of gravitational classification started in 1977. Over the years, gravitational approaches have gained many extensions, which have been adapted to different classification problems. This article is the next stage of research concerning algorithms that create data particles by geometrical division. Previous analyses established that the Geometrical Divide (GD) method outperforms the algorithm that creates data particles based on classes with a compound of 1 ÷ 1 cardinality. This occurs in the classification of balanced data sets in which class centroids are close to each other and the groups of objects described by different labels overlap. The purpose of this article is to examine the efficiency of the Geometrical Divide method in the classification of unbalanced data sets, using the real-world example of occupancy detection. In addition, the concept of the Unequal Geometrical Divide (UGD) is developed. The evaluation was conducted on 26 unbalanced data sets: 16 with the features of the Moons and Circles data sets and 10 created from a real occupancy data set. In the experiment, the GD method and its unbalanced variant (UGD), as well as the 1CT1P approach, were compared. Each method was combined with three data particle mass determination algorithms: the n-Mass Model (n-MM), the Stochastic Learning Algorithm (SLA), and the Batch-update Algorithm (BLA). The k-fold cross-validation method, precision, recall, F-measure, and the number of data particles used were applied in the evaluation process. The obtained results showed that the methods based on geometrical division outperform the 1CT1P approach in the classification of imbalanced data sets. The article's conclusion describes the observations and indicates potential directions for further research and development of methods that create data particles through geometrical division.
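The GD and UGD algorithms themselves are not reproduced here; the sketch below only illustrates the evaluation protocol named above: stratified k-fold cross-validation reporting precision, recall, and F-measure on an imbalanced Moons-style data set. The stand-in classifier, the imbalance ratio, and the data generation are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Imbalanced variant of the Moons data set used as a stand-in for the paper's sets:
# keep all of class 0 but only ~20% of class 1.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
keep = (y == 0) | (np.random.default_rng(0).random(len(y)) < 0.2)
X, y = X[keep], y[keep]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(KNeighborsClassifier(), X, y, cv=cv,
                        scoring=("precision", "recall", "f1"))
for metric in ("test_precision", "test_recall", "test_f1"):
    print(metric, scores[metric].mean())
```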

