Imbalanced Data Handling in Multi-label Aspect Categorization using Oversampling and Ensemble Learning

Author(s): Wildan Dicky Alnatara, Masayu Leylia Khodra
2017, Vol 19 (1), pp. 42-49
Author(s): Divya Agrawal, Padma Bonde

Prediction using classification techniques is a fundamental capability widely applied in various fields. Classification accuracy remains a great challenge because of the data imbalance problem. The increased volume of data also poses a challenge for data handling and prediction, particularly when technology is used as the interface between customers and the company. As the data imbalance increases, it directly affects the classification accuracy of the entire system. AUC (area under the curve) and lift proved to be good evaluation metrics. Classification techniques help to improve classification accuracy, but with an imbalanced dataset classification does not perform well, and other techniques, such as oversampling, must be used. This paper presents a voting-based ensemble technique to improve classification accuracy on imbalanced data. The ensemble takes the votes on the best class obtained by three classification techniques, namely Logistic Regression, Classification Trees and Discriminant Analysis. The observed results show an improvement in classification accuracy when the voting ensemble technique is used.
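The voting step described in this abstract can be illustrated with a minimal sketch. It assumes the three classifiers have already been trained and have each produced one predicted label per sample; the prediction lists, the function names (majority_vote, vote_ensemble) and the tie-breaking rule are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most classifiers for a single sample.

    Assumption: ties go to the label that appears first in `predictions`.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def vote_ensemble(per_model_predictions):
    """Combine prediction lists (one list per model) sample by sample."""
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


# Hypothetical outputs of three trained classifiers on five samples.
logistic_preds = [0, 0, 1, 0, 1]
tree_preds = [0, 1, 1, 0, 0]
discriminant_preds = [0, 1, 0, 0, 1]

ensemble_preds = vote_ensemble([logistic_preds, tree_preds, discriminant_preds])
print(ensemble_preds)  # [0, 1, 1, 0, 1]
```

With an odd number of models and two classes a tie cannot occur, so the tie-breaking rule only matters in multi-class settings.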


Classification is a supervised learning task that assigns items to groups on the basis of class labels. Algorithms are trained on labeled datasets to accomplish this task, so datasets play an important role. If instances of one class (the majority class) far outnumber instances of another class (the minority class), to the point that a classifier struggles to learn the characteristics of the minority class, the dataset is termed imbalanced. Such datasets cause biased prediction or misclassification in the real world: models built on them may show very high accuracy during training but, being unfamiliar with minority-class instances, fail to predict the minority class and therefore perform poorly. This paper presents a survey of techniques proposed by researchers for handling imbalanced data, and compares and discusses the techniques on the basis of F-measure.
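The gap between high accuracy and poor minority-class prediction can be shown with a small sketch. It assumes a made-up dataset of 95 majority and 5 minority samples and a degenerate model that always predicts the majority class; the data and function names are illustrative and do not come from the survey.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced data: 95 majority (0) and 5 minority (1) samples.
y_true = [0] * 95 + [1] * 5
# A model that ignores the minority class entirely.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
f1 = f_measure(y_true, y_pred, positive=1)
print(acc, f1)  # 0.95 0.0
```

The model scores 95% accuracy while never identifying a minority instance, which is why a minority-class F-measure is a more informative basis for comparing imbalance-handling techniques.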

