The Effects of Class Label Noise on Highly-Imbalanced Big Data

Author(s):  
Robert K. L. Kennedy ◽  
Justin M. Johnson ◽  
Taghi M. Khoshgoftaar
Keyword(s):  
Big Data

2018 ◽  
Vol 275 ◽  
pp. 2374-2383 ◽  
Author(s):  
Maryam Sabzevari ◽  
Gonzalo Martínez-Muñoz ◽  
Alberto Suárez

Text classification and clustering approaches are essential in big data environments. Many classification algorithms have been proposed for supervised learning applications, and in the era of big data a large volume of training data is available for many machine learning tasks. However, some of this data may be mislabeled or not labeled properly. Incorrect labels result in label noise, which in turn degrades the learning performance of a classifier. A common way to address label noise is to apply noise filtering techniques that identify and remove noisy instances before learning, and a range of such filters have been developed to improve classifier performance. This paper proposes a noise filtering approach for text data that is applied during the training phase. Many supervised learning algorithms produce high error rates when the training dataset is noisy; our work eliminates such noise and provides a more accurate classification system.
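As a rough illustration of the general idea described above, not the authors' specific method, the Python sketch below filters suspected label noise before training: examples whose out-of-fold prediction disagrees with their given label are treated as noisy and dropped. The classifier, features, and toy data are illustrative assumptions.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def filter_label_noise(texts, labels, n_splits=5):
    """Return indices of examples whose given label agrees with an
    out-of-fold prediction; disagreeing examples are treated as noisy."""
    labels = np.asarray(labels)
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    # Each sample is predicted by a model that never saw it during training.
    predicted = cross_val_predict(model, texts, labels, cv=n_splits)
    return np.where(predicted == labels)[0]

# Toy usage: identify the kept subset; a final classifier would then be
# trained only on the retained (noise-filtered) examples.
texts = ["cheap pills offer", "free prize winner", "meeting moved to noon",
         "cheap prize pills", "lunch at noon?", "project meeting notes"]
labels = [1, 1, 0, 1, 0, 0]   # 1 = spam, 0 = ham (labels possibly noisy)
keep = filter_label_noise(texts, labels, n_splits=3)
print("kept indices:", keep)

The key design point is that the filtering model must not have seen an example when predicting it, otherwise it would simply reproduce the (possibly wrong) training label; cross-validated predictions give that separation cheaply.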


2017 ◽  
Vol 9 (2) ◽  
pp. 173 ◽  
Author(s):  
Charlotte Pelletier ◽  
Silvia Valero ◽  
Jordi Inglada ◽  
Nicolas Champion ◽  
Claire Marais Sicre ◽  
...  

2014 ◽  
Vol 2014 ◽  
pp. 1-14 ◽  
Author(s):  
Shehzad Khalid ◽  
Sannia Arshad ◽  
Sohail Jabbar ◽  
Seungmin Rho

We present a classification framework that combines multiple heterogeneous classifiers in the presence of class label noise. An extension of m-Mediods-based modeling is presented that generates models of the various classes while identifying and filtering noisy training data. The noise-free data is then used to learn models for other classifiers such as GMM and SVM. A weight learning method is introduced to learn per-class weights for the different classifiers and construct an ensemble. For this purpose, we apply a genetic algorithm to search for the weight vector on which the classifier ensemble is expected to give the best accuracy. The proposed approach is evaluated on a variety of real-life datasets and compared with standard ensemble techniques such as AdaBoost, Bagging, and Random Subspace methods. Experimental results show the superiority of the proposed ensemble method over its competitors, especially in the presence of class label noise and class imbalance.
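As a minimal sketch of the weight-learning step, not the authors' exact algorithm, the Python code below uses a simple genetic algorithm to search for per-classifier, per-class weights that maximize the weighted-vote accuracy of an ensemble on a validation set. The population size, selection, crossover, mutation scheme, and the synthetic predictions are all assumptions made for the sketch.

import numpy as np

rng = np.random.default_rng(0)

def ensemble_accuracy(weights, probas, y_val):
    """Accuracy of the ensemble that combines base-classifier probabilities
    with per-classifier, per-class weights (shape: n_clf x n_classes)."""
    # Weighted sum over classifiers, then argmax over classes.
    combined = np.einsum('kc,knc->nc', weights, probas)
    return np.mean(np.argmax(combined, axis=1) == y_val)

def ga_learn_weights(probas, y_val, pop_size=40, generations=100, mut_std=0.1):
    n_clf, _, n_classes = probas.shape
    # Initial population of random non-negative weight matrices.
    pop = rng.random((pop_size, n_clf, n_classes))
    for _ in range(generations):
        fitness = np.array([ensemble_accuracy(w, probas, y_val) for w in pop])
        # Selection: keep the better half of the population (elitism).
        order = np.argsort(fitness)[::-1]
        parents = pop[order[: pop_size // 2]]
        # Crossover: average two random parents; mutation: Gaussian noise.
        idx_a = rng.integers(0, len(parents), pop_size - len(parents))
        idx_b = rng.integers(0, len(parents), pop_size - len(parents))
        children = (parents[idx_a] + parents[idx_b]) / 2.0
        children += rng.normal(0.0, mut_std, children.shape)
        pop = np.concatenate([parents, np.clip(children, 0.0, None)])
    fitness = np.array([ensemble_accuracy(w, probas, y_val) for w in pop])
    return pop[np.argmax(fitness)]

# Usage with synthetic predictions: 3 classifiers, 200 samples, 4 classes.
n_clf, n_samples, n_classes = 3, 200, 4
y_val = rng.integers(0, n_classes, n_samples)
probas = rng.random((n_clf, n_samples, n_classes))
probas /= probas.sum(axis=2, keepdims=True)   # normalize to probabilities
best_w = ga_learn_weights(probas, y_val)
print("validation accuracy:", ensemble_accuracy(best_w, probas, y_val))

In practice the probability arrays would come from the trained GMM, SVM, and m-Mediods models rather than random data, and the fitness would be evaluated on a held-out validation set to avoid overfitting the weights.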

