Complexity curve: a graphical measure of data complexity and classifier performance

Author(s):  
Julian Zubek ◽  
Dariusz M Plewczynski

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. In contrast to some popular measures, it is not focused on the shape of the decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in the form of a graphical plot, which we call the complexity curve. We use it to propose a new variant of the learning curve plot called the generalisation curve. The generalisation curve is a standard learning curve with its x-axis rescaled according to the data set complexity curve. It is a classifier performance measure that shows how well the information present in the data is utilised. We perform a theoretical and experimental examination of the properties of the introduced complexity measure and show its relation to the variance component of classification error. We compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. We then apply our methodology to a panel of benchmarks of standard machine learning algorithms on typical data sets, demonstrating how it can be used in practice to gain insights into data characteristics and classifier behaviour. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing one to significantly speed up the learning process without reducing classification accuracy. The associated code (an open-source Python implementation) is available at: https://github.com/zubekj/complexity_curve

2016 ◽  
Vol 2 ◽  
pp. e76 ◽  
Author(s):  
Julian Zubek ◽  
Dariusz M. Plewczynski

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. In contrast to some popular complexity measures, it is not focused on the shape of a decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in the form of a graphical plot, which we call the complexity curve. It demonstrates the relative increase of available information with the growth of sample size. We perform a theoretical and experimental examination of the properties of the introduced complexity measure and show its relation to the variance component of classification error. We then compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. We also apply our methodology to a panel of simple benchmark data sets, demonstrating how it can be used in practice to gain insights into data characteristics. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing one to significantly speed up the learning process without compromising classification accuracy. The associated code (an open-source Python implementation) is available at: https://github.com/zubekj/complexity_curve
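As a rough illustration of the idea, the sketch below estimates a complexity curve by comparing per-attribute histograms of a random subsample against those of the full data set using the Hellinger distance. Treating attributes independently and the specific function and parameter names are our own simplifications, not the reference implementation.

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete probability distributions, in [0, 1]
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def histogram_dist(values, bins, range_):
    # Normalised histogram as a discrete distribution estimate
    counts, _ = np.histogram(values, bins=bins, range=range_)
    return counts / counts.sum()

def complexity_curve(X, sizes, bins=10, n_repeats=20, seed=0):
    """For each subsample size, average the Hellinger distance between
    per-attribute subsample histograms and full-data histograms
    (attributes treated independently -- a simplifying assumption)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ranges = [(X[:, j].min(), X[:, j].max()) for j in range(d)]
    full = [histogram_dist(X[:, j], bins, ranges[j]) for j in range(d)]
    curve = []
    for s in sizes:
        dists = []
        for _ in range(n_repeats):
            idx = rng.choice(n, size=s, replace=False)
            h = np.mean([hellinger(histogram_dist(X[idx, j], bins, ranges[j]),
                                   full[j]) for j in range(d)])
            dists.append(h)
        curve.append(float(np.mean(dists)))
    return curve
```

Larger subsamples approximate the full distribution better, so the curve decreases with sample size; its shape reflects how quickly the data's information content saturates.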


Author(s):  
Aska E. Mehyadin ◽  
Adnan Mohsin Abdulazeez ◽  
Dathar Abas Hasan ◽  
Jwan N. Saeed

The bird classifier is a system that uses machine learning to store and classify bird calls. A bird species can be identified from a recording of its call alone, which makes the system easier to manage. The system also provides species-classification resources to support automated species detection from observations, teaching a machine to recognise and classify species. Undesirable noises are filtered out and the recordings are sorted into data sets: each sound is passed through a noise-suppression filter and a separate classification procedure so that the most useful data set can be processed easily. Mel-frequency cepstral coefficients (MFCC) are used as features and tested with several algorithms, namely Naïve Bayes, J4.8 and the Multilayer Perceptron (MLP), to classify bird species. J4.8 achieves the highest accuracy (78.40%), with an elapsed time of 39.4 seconds.
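The classification step can be sketched with a minimal Gaussian Naive Bayes over precomputed MFCC-style feature vectors. This is not the study's exact pipeline: the MFCC extraction itself (which in practice would come from an audio library) is omitted, and the class name and data here are invented for illustration.

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes, standing in for the Naive Bayes
    classifier in the study. Inputs are assumed to be fixed-length
    feature vectors (e.g. averaged MFCCs per recording)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        # Per-class feature means and variances (small floor avoids zero variance)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # log p(c) + sum_j log N(x_j; mu_cj, var_cj), maximised over classes
        ll = -0.5 * (np.log(2 * np.pi * self.var[:, None, :])
                     + (X[None, :, :] - self.mu[:, None, :]) ** 2
                     / self.var[:, None, :]).sum(-1)
        return self.classes[np.argmax(np.log(self.prior)[:, None] + ll, axis=0)]
```

A decision-tree learner such as J4.8 (C4.5) or an MLP would slot into the same fit/predict interface for the comparison the abstract describes.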


Author(s):  
Li Yang ◽  
Qi Wang ◽  
Yu Rao

Abstract Film cooling is an important and widely used technology for protecting the hot sections of gas turbines. The last decades witnessed fast growth of research and publications in the field of film cooling. However, apart from the correlations for single-row film cooling and the Sellers correlation for cooling superposition, there were rarely generalized models for film cooling under superposition conditions. Meanwhile, the numerous data obtained for complex hole distributions were not merged or integrated across sources, and recent new data had no avenue to contribute to a compatible model. The technical barriers obstructing the generalization of film cooling models are: a) the lack of a generalizable model form; b) the large number of input variables needed to describe film cooling. The present study aimed at establishing a generalizable model to describe multiple-row film cooling over a large parameter space, including hole locations, hole size, hole angles, blowing ratios, etc. The method allowed data measured within different streamwise lengths and different surface areas to be integrated in a single model, in the form of 1-D sequences. A Long Short-Term Memory (LSTM) model was designed to model the local behavior of film cooling. Careful training, testing and validation were conducted to regress the model. The presented results showed that the method was accurate within the CFD data set generated in this study. The presented method could serve as a base model that allows past and future film cooling research to contribute to a common data base. Meanwhile, the model could also be transferred from simulation data sets to experimental data sets using advanced machine learning algorithms in the future.
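As a sketch of the sequence model, the following is the forward pass of a single LSTM cell over a 1-D sequence of feature vectors. The inputs are hypothetical stand-ins for streamwise station descriptors (hole geometry, blowing ratio, etc.); the paper's actual architecture and trained weights are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x_seq, W, U, b, h0=None, c0=None):
    """Forward pass of one LSTM layer over a 1-D sequence.
    x_seq: (T, D) sequence of feature vectors (e.g. streamwise stations).
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases.
    Gate order in the stacked weight matrices: input, forget, output, candidate."""
    H = U.shape[1]
    h = np.zeros(H) if h0 is None else h0
    c = np.zeros(H) if c0 is None else c0
    outputs = []
    for x in x_seq:
        z = W @ x + U @ h + b
        i = sigmoid(z[:H])          # input gate
        f = sigmoid(z[H:2 * H])     # forget gate
        o = sigmoid(z[2 * H:3 * H]) # output gate
        g = np.tanh(z[3 * H:])      # candidate cell update
        c = f * c + i * g           # cell state carries upstream history
        h = o * np.tanh(c)          # hidden state: local representation
        outputs.append(h)
    return np.array(outputs)
```

The recurrent cell state is what lets the model accumulate the influence of upstream cooling rows on each downstream location, which is the superposition behaviour the abstract targets.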


Author(s):  
André M. Carrington ◽  
Paul W. Fieguth ◽  
Hammad Qazi ◽  
Andreas Holzinger ◽  
Helen H. Chen ◽  
...  

Abstract Background In classification and diagnostic testing, the receiver operating characteristic (ROC) plot and the area under the ROC curve (AUC) describe how an adjustable threshold causes changes in two types of error: false positives and false negatives. However, only part of the ROC curve and AUC are informative when they are used with imbalanced data. Hence, alternatives to the AUC have been proposed, such as the partial AUC and the area under the precision-recall curve. However, these alternatives cannot be as fully interpreted as the AUC, in part because they ignore some information about actual negatives. Methods We derive and propose a new concordant partial AUC and a new partial c statistic for ROC data—as foundational measures and methods to help understand and explain parts of the ROC plot and AUC. Our partial measures are continuous and discrete versions of the same measure, are derived from the AUC and c statistic respectively, are validated as equal to each other, and validated as equal in summation to whole measures where expected. Our partial measures are tested for validity on a classic ROC example from Fawcett, a variation thereof, and two real-life benchmark data sets in breast cancer: the Wisconsin and Ljubljana data sets. Interpretation of an example is then provided. Results Results show the expected equalities between our new partial measures and the existing whole measures. The example interpretation illustrates the need for our newly derived partial measures. Conclusions The concordant partial area under the ROC curve was proposed and, unlike previous partial measure alternatives, it maintains the characteristics of the AUC. The first partial c statistic for ROC plots was also proposed as an unbiased interpretation for part of an ROC curve. The expected equalities among and between our newly derived partial measures and their existing full measure counterparts are confirmed.
These measures may be used with any data set, but this paper focuses on imbalanced data with low prevalence. Future work Future work with our proposed measures may: demonstrate their value for imbalanced data with high prevalence; compare them to other measures not based on areas; and combine them with other ROC measures and techniques.
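The whole-curve c statistic and a naive partial variant can be sketched as follows. The subset-of-negatives restriction below is a simplification for illustration; the paper's concordant partial AUC and partial c statistic are defined more carefully over a range on both ROC axes. It does, however, show the summation-to-whole property the abstract validates.

```python
def c_statistic(pos_scores, neg_scores):
    """Full c statistic: fraction of (positive, negative) pairs ranked
    concordantly, with ties counting one half. Equals the AUC."""
    n_conc = sum(1.0 if p > n else 0.5 if p == n else 0.0
                 for p in pos_scores for n in neg_scores)
    return n_conc / (len(pos_scores) * len(neg_scores))

def partial_c(pos_scores, neg_scores, neg_subset):
    """A partial c statistic restricted to a subset of negatives (a rough
    stand-in for restricting the ROC to an FPR range). Normalised by the
    total pair count so that partial values over a partition of the
    negatives sum to the whole c statistic."""
    n_conc = sum(1.0 if p > n else 0.5 if p == n else 0.0
                 for p in pos_scores for n in neg_subset)
    return n_conc / (len(pos_scores) * len(neg_scores))
```

Partitioning the negatives into disjoint subsets and summing the partial values recovers the full c statistic, mirroring the "equal in summation to whole measures" check described above.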


Author(s):  
Brendan Juba ◽  
Hai S. Le

Practitioners of data mining and machine learning have long observed that the imbalance of classes in a data set negatively impacts the quality of classifiers trained on that data. Numerous techniques for coping with such imbalances have been proposed, but nearly all lack any theoretical grounding. By contrast, the standard theoretical analysis of machine learning admits no dependence on the imbalance of classes at all. The basic theorems of statistical learning establish the number of examples needed to estimate the accuracy of a classifier as a function of its complexity (VC-dimension) and the confidence desired; the class imbalance does not enter these formulas anywhere. In this work, we consider measures of classifier performance in terms of precision and recall, measures that are widely suggested as more appropriate to the classification of imbalanced data. We observe that whenever the precision is moderately large, the worse of the precision and recall is within a small constant factor of the accuracy weighted by the class imbalance. A corollary of this observation is that a larger number of examples is necessary and sufficient to address class imbalance, a finding we also illustrate empirically.
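A toy confusion matrix makes the point concrete. The counts below are invented for illustration: with roughly 1% positives, overall accuracy can look excellent while precision and recall remain poor, which is why precision and recall are the preferred lens for imbalanced data.

```python
def prf_from_counts(tp, fp, fn, tn):
    """Precision, recall and plain accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of predicted positives, how many are real
    recall = tp / (tp + fn)      # of real positives, how many were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical imbalanced test set: 10 positives among 1000 examples.
# The classifier finds half the positives but also raises 10 false alarms.
p, r, a = prf_from_counts(tp=5, fp=10, fn=5, tn=980)
```

Here accuracy is 98.5% even though precision is only one third, illustrating how the minority class dominates the picture that precision and recall reveal.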


2016 ◽  
Vol 2016 ◽  
pp. 1-7
Author(s):  
Zhizheng Liang

Feature scaling has attracted considerable attention during the past several decades because of its important role in feature selection. In this paper, a novel algorithm for learning scaling factors of features is proposed. It first assigns a nonnegative scaling factor to each feature of the data and then adopts a generalized performance measure to learn the optimal scaling factors. It is of interest to note that the proposed model can be transformed into a convex optimization problem: second-order cone programming (SOCP). Thus, the scaling factors of features in our method are globally optimal in some sense. Several experiments on simulated data, UCI data sets, and a gene data set are conducted to demonstrate that the proposed method is more effective than previous methods.
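The effect of per-feature scaling factors can be illustrated with a small sketch. The factors here are hand-set rather than learned by SOCP as in the paper; the point is only that nonnegative scaling reweights each feature's contribution to distance-based classification.

```python
import numpy as np

def scale_features(X, w):
    """Apply nonnegative per-feature scaling factors -- the quantities the
    paper learns via second-order cone programming. Here the factors are
    supplied by hand, not optimized."""
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0), "scaling factors must be nonnegative"
    return X * w

def nn_predict(Xtr, ytr, Xte):
    # 1-nearest-neighbour prediction under Euclidean distance
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[np.argmin(d, axis=1)]
```

With a large-scale noise feature present, unscaled 1-NN follows the noise; zeroing that feature's factor restores the informative dimension, which is exactly the behaviour a learned scaling vector exploits.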


2019 ◽  
Vol 16 (1) ◽  
pp. 155-178 ◽  
Author(s):  
Kristina Andric ◽  
Damir Kalpic ◽  
Zoran Bohacek

In this paper we investigate the role of sample size and class distribution in credit risk assessments, focusing on real-life imbalanced data sets. Choosing the optimal sample is of utmost importance for the quality of predictive models and has become an increasingly important topic with the recent advances in automating lending decision processes and the ever-growing richness of data collected by financial institutions. To address the observed research gap, a large-scale experimental evaluation of real-life data sets with different characteristics was performed, using several classification algorithms and performance measures. Results indicate that various factors play a role in determining the optimal class distribution, namely the performance measure, classification algorithm and data set characteristics. The study also provides valuable insight on how to design the training sample to maximize prediction performance and on the suitability of different classification algorithms, by assessing their sensitivity to class imbalance and sample size.
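Varying the training-set class distribution in such experiments is typically done by undersampling the majority class. A minimal sketch (the function name and ratio handling are our own, not from the paper):

```python
import random

def undersample(data, labels, target_pos_ratio, seed=0):
    """Undersample the majority (negative) class so that positives make up
    roughly target_pos_ratio of the training sample -- one common way to
    sweep class distributions in experiments like those described above."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives needed for the requested positive ratio
    n_neg = int(len(pos) * (1 - target_pos_ratio) / target_pos_ratio)
    keep = pos + rng.sample(neg, min(n_neg, len(neg)))
    rng.shuffle(keep)
    return [data[i] for i in keep], [labels[i] for i in keep]
```

Repeating model training over a grid of `target_pos_ratio` values and comparing performance measures is the experimental pattern the study applies at scale.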


2020 ◽  
Vol 17 (6) ◽  
pp. 916-925
Author(s):  
Niyati Behera ◽  
Guruvayur Mahalakshmi

Attributes, whether qualitative or non-qualitative, are the formal description of any real-world entity and are crucial in modern knowledge representation models such as ontologies. While the literature offers ample evidence of research on mining non-qualitative attributes (like the part-of relation) from text and the Web, comparatively little research addresses qualitative attribute mining (i.e., size, color, taste, etc.). In this research article an analytical framework is proposed to retrieve qualitative attribute values from unstructured domain text. The research objective covers two aspects of information retrieval: (1) acquiring quality values (adjectives) from unstructured text and (2) assigning attributes to them by comparing the Google-derived meaning or context of the attributes and the quality values. The goal is accomplished by a framework that integrates Vector Space Modelling (VSM) with a probabilistic Multinomial Naive Bayes (MNB) classifier. Performance evaluation has been carried out on two data sets: (1) the HeiPLAS development data set (106 adjective-noun exemplary phrases) and (2) a text data set in the Medicinal Plant Domain (MPD). The system is found to perform better with the probabilistic approach than with the existing pattern-based framework in the state of the art.
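A minimal Multinomial Naive Bayes over phrase tokens sketches the classification step. Plain token counts stand in for the Google-derived context features used in the paper, and the example labels and phrases below are invented.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Minimal Multinomial Naive Bayes with Laplace smoothing, of the kind
    used to assign an attribute class (e.g. COLOR, SIZE) to an
    adjective-noun phrase represented as a bag of tokens."""

    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d}
        self.counts = defaultdict(Counter)   # class -> token counts
        self.class_n = Counter(labels)       # class -> document counts
        for d, y in zip(docs, labels):
            self.counts[y].update(d)
        return self

    def predict(self, doc):
        total = sum(self.class_n.values())
        best, best_lp = None, -math.inf
        for c in self.class_n:
            lp = math.log(self.class_n[c] / total)          # log prior
            denom = sum(self.counts[c].values()) + len(self.vocab)
            for w in doc:
                # add-one smoothed log likelihood of each token
                lp += math.log((self.counts[c][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

In the paper's setting the token features would be replaced by the VSM context vectors, but the decision rule (maximum posterior class) is the same.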


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 825 ◽  
Author(s):  
Fadi Al Machot ◽  
Mohammed R. Elkobaisi ◽  
Kyandoghere Kyamakya

Due to significant advances in sensor technology, studies towards activity recognition have gained interest and maturity in the last few years. Existing machine learning algorithms have demonstrated promising results by classifying activities whose instances have already been seen during training. Activity recognition methods based on real-life settings should cover a growing number of activities in various domains, whereby a significant part of the instances will not be present in the training data set. However, covering all possible activities in advance is a complex and expensive task. Concretely, we need a method that can extend the learning model to detect unseen activities without prior knowledge of sensor readings for those previously unseen activities. In this paper, we introduce an approach to leverage sensor data in discovering new unseen activities which were not present in the training set. We show that sensor readings can lead to promising results for zero-shot learning, whereby the necessary knowledge can be transferred from seen to unseen activities by using semantic similarity. The evaluation conducted on two data sets extracted from the well-known CASAS datasets shows that the proposed zero-shot learning approach achieves high performance in recognizing unseen (i.e., not present in the training dataset) new activities.
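The semantic-similarity transfer at the core of zero-shot recognition can be sketched as follows. The linear sensor-to-semantic projection and the activity embeddings below are hypothetical placeholders for whatever a trained model would provide; the key point is that unseen activities only need an embedding, not training sensor data.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_predict(x, sensor_to_sem, sem_embeddings):
    """Project a sensor reading into the semantic space with a pre-trained
    (here hypothetical) linear mapping, then choose the activity whose
    semantic embedding is most similar -- including activities never seen
    during training, provided they have an embedding."""
    s = sensor_to_sem @ x
    return max(sem_embeddings, key=lambda a: cosine(s, sem_embeddings[a]))
```

Adding a new activity at test time only requires adding its semantic embedding to `sem_embeddings`, which is what makes the approach zero-shot.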

