Notes on the H-measure of classifier performance

Author(s):  
D. J. Hand ◽  
C. Anagnostopoulos

Abstract: The H-measure is a classifier performance measure that takes the context of application into account without requiring a single fixed value of the relative misclassification costs to be set. Since its introduction in 2009 it has become widely adopted. This paper answers various queries that users have raised since its introduction, including questions about its interpretation, the choice of a weighting function, whether it is strictly proper, and its coherence, and relates the measure to other work.
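The H-measure can be sketched numerically: misclassification cost is integrated over a cost distribution (Hand's default is a symmetric Beta(2,2)) rather than fixed at one value, and the result is normalised against the loss of the best trivial classifier. The following is a minimal illustrative sketch, not the reference implementation; the function name and grid-based quadrature are choices made here for clarity:

```python
import numpy as np
from math import gamma

def beta_pdf(c, a, b):
    """Density of the Beta(a, b) cost-weighting distribution."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * c ** (a - 1) * (1 - c) ** (b - 1)

def h_measure_approx(y, scores, a=2.0, b=2.0, grid=1000):
    """Grid approximation of the H-measure for binary labels y (0/1)
    and classifier scores (higher = more likely class 1)."""
    y = np.asarray(y)
    s = np.asarray(scores, dtype=float)
    pi1 = y.mean()                      # class priors
    pi0 = 1.0 - pi1
    # candidate thresholds: -inf ("predict all 1") plus each observed score
    thr = np.concatenate(([-np.inf], np.unique(s)))
    F0 = np.array([(s[y == 0] <= t).mean() for t in thr])  # class-0 score cdf
    F1 = np.array([(s[y == 1] <= t).mean() for t in thr])  # class-1 score cdf
    c = (np.arange(grid) + 0.5) / grid  # cost grid on (0, 1)
    w = beta_pdf(c, a, b) / grid        # quadrature weights
    # expected loss for every (cost, threshold) pair; minimise over thresholds
    loss = (c[:, None] * pi0 * (1 - F0[None, :])
            + (1 - c[:, None]) * pi1 * F1[None, :])
    L = (loss.min(axis=1) * w).sum()
    # reference loss of the best trivial (single-class) classifier
    L_max = (np.minimum(c * pi0, (1 - c) * pi1) * w).sum()
    return 1.0 - L / L_max
```

Under this construction H lies in [0, 1]: a classifier whose scores perfectly separate the classes scores 1, while one that never beats a trivial always-one-class rule scores 0.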

2016 ◽  
Author(s):  
Julian Zubek ◽  
Dariusz M Plewczynski

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. Unlike some popular measures, it is focused not on the shape of the decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed as a graphical plot, which we call the complexity curve. We use it to propose a new variant of the learning curve plot called the generalisation curve. The generalisation curve is a standard learning curve with the x-axis rescaled according to the data set's complexity curve. It is a classifier performance measure which shows how well the information present in the data is utilised. We perform a theoretical and experimental examination of the properties of the introduced complexity measure and show its relation to the variance component of classification error. We compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. We then apply our methodology to a panel of benchmarks of standard machine learning algorithms on typical data sets, demonstrating how it can be used in practice to gain insights into data characteristics and classifier behaviour. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing the learning process to be significantly sped up without reducing classification accuracy. Associated code is available to download at: https://github.com/zubekj/complexity_curve (open source Python implementation).
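The core idea can be sketched as follows: for each subsample size, measure the Hellinger distance between the attribute distribution estimated from the subsample and the one estimated from the full data set, and plot distance against sample fraction. This sketch uses per-attribute histograms and treats attributes independently, which are simplifying assumptions made here; the authors' estimator in the linked repository may differ in detail:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def complexity_curve(X, fractions=None, bins=10, reps=20, seed=0):
    """Average Hellinger distance between subsample and full-data
    per-attribute histograms, as a function of the subsample fraction."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if fractions is None:
        fractions = np.linspace(0.1, 1.0, 10)
    # fixed bin edges per attribute, estimated once from the full data
    edges = [np.histogram_bin_edges(X[:, j], bins=bins) for j in range(d)]
    full = [np.histogram(X[:, j], bins=edges[j])[0] / n for j in range(d)]
    curve = []
    for f in fractions:
        k = max(1, int(round(f * n)))
        dists = []
        for _ in range(reps):
            idx = rng.choice(n, size=k, replace=False)
            sub = [np.histogram(X[idx, j], bins=edges[j])[0] / k
                   for j in range(d)]
            dists.append(np.mean([hellinger(full[j], sub[j])
                                  for j in range(d)]))
        curve.append(np.mean(dists))
    return np.asarray(fractions), np.asarray(curve)
```

The curve falls toward zero as the fraction approaches 1 (the subsample's distribution converges to the full-data one); a curve that drops quickly suggests a small subsample already captures the attribute structure, which is the intuition behind using it for data pruning.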


2007 ◽  
Vol 04 (04) ◽  
pp. 339-346
Author(s):  
TSANG-LONG PAO ◽  
YUN-MAW CHENG ◽  
YU-TE CHEN ◽  
JUN-HENG YEH

Since emotion is important in influencing cognition and the perception of daily activities such as learning, communication, and even rational decision-making, it must be considered in human-computer interaction. In this paper, we compare four different weighting functions in weighted KNN-based classifiers for recognizing five emotions, namely anger, happiness, sadness, neutrality and boredom, from Mandarin emotional speech. The classifiers studied include weighted KNN, weighted CAP, and weighted D-KNN. We use the result of the traditional KNN classifier as the baseline performance measure. The experimental results show that the Fibonacci weighting function outperforms the others in all of the weighted classifiers. The highest accuracy, 81.4%, is achieved with the weighted D-KNN classifier.
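A Fibonacci-weighted KNN vote can be sketched as below. The convention assumed here (not stated in the abstract) is that the nearest of the k neighbours receives the largest Fibonacci number as its weight, so nearer neighbours dominate the vote; function names and the Euclidean metric are illustrative choices:

```python
import numpy as np

def fibonacci_weights(k):
    """Weights for the k ranked neighbours: the nearest neighbour gets
    the largest Fibonacci number (an assumed convention)."""
    fib = [1, 1]
    while len(fib) < k:
        fib.append(fib[-1] + fib[-2])
    return np.array(fib[:k][::-1], dtype=float)

def weighted_knn_predict(X_train, y_train, x, k=5, weights=None):
    """Predict a class label by a weighted vote over the k nearest
    training points under Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    order = np.argsort(dists)[:k]          # indices of the k nearest points
    if weights is None:
        weights = fibonacci_weights(k)
    votes = {}
    for w, label in zip(weights, y_train[order]):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)       # class with the largest weighted vote
```

With uniform weights this reduces to the traditional KNN baseline; swapping in a different `weights` vector reproduces the comparison of weighting functions described above.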




2011 ◽  
Author(s):  
Yih-teen Lee ◽  
Alfred Stettler ◽  
John Antonakis

2019 ◽  
Author(s):  
Erick Pusck Wilke ◽  
Benny Kramer Costa ◽  
Otávio Bandeira De Lamônica Freire ◽  
Manuel Portugal Ferreira

2004 ◽  
Vol 80 (3) ◽  
pp. 408
Author(s):  
Roberto Marangoni ◽  
Fabio Marroni ◽  
Domenico Gioffré ◽  
Francesco Ghetti ◽  
Giuliano Colombetti

CFA Digest ◽  
2003 ◽  
Vol 33 (1) ◽  
pp. 51-52
Author(s):  
Frank T. Magiera
