Credible Intervals for Precision and Recall Based on a K-Fold Cross-Validated Beta Distribution

2016 ◽  
Vol 28 (8) ◽  
pp. 1694-1722 ◽  
Author(s):  
Yu Wang ◽  
Jihong Li

In typical machine learning applications such as information retrieval, precision and recall are two commonly used measures for assessing an algorithm's performance. Symmetrical confidence intervals based on K-fold cross-validated t distributions are widely used for the inference of precision and recall measures. As we confirmed through simulated experiments, however, these confidence intervals often exhibit lower degrees of confidence, which may easily lead to liberal inference results. Thus, it is crucial to construct faithful confidence (credible) intervals for precision and recall with a high degree of confidence and a short interval length. In this study, we propose two posterior credible intervals for precision and recall based on K-fold cross-validated beta distributions. The first credible interval for precision (or recall) is constructed from the beta posterior distribution inferred from all K data sets corresponding to the K confusion matrices of a K-fold cross-validation. Second, considering that each data set corresponding to a confusion matrix from a K-fold cross-validation can be used to infer a beta posterior distribution of precision (or recall), the second proposed credible interval is constructed from the average of the K beta posterior distributions. Experimental results on simulated and real data sets demonstrate that the first credible interval proposed in this study almost always achieved degrees of confidence greater than 95%. With an acceptable degree of confidence, both proposed credible intervals have shorter interval lengths than those based on a corrected K-fold cross-validated t distribution. Meanwhile, in all 27 cases of the simulated and real data experiments, the average ranks of these two credible intervals are superior to that of the confidence interval based on a K-fold cross-validated t distribution in terms of degree of confidence, and superior to that of the confidence interval based on a corrected K-fold cross-validated t distribution in terms of interval length. The confidence intervals based on the K-fold and corrected K-fold cross-validated t distributions, by contrast, fall at the two extremes. Thus, when the reliability of the inference for precision and recall is the focus, the proposed methods are preferable, especially the first credible interval.
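
A rough sketch of the two constructions, assuming a Beta(1, 1) prior, hypothetical per-fold true/false positive counts, and reading "the average of K beta posterior distributions" as an equal-weight mixture whose quantiles are found numerically; the paper's exact prior and averaging scheme may differ.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import beta

tp = np.array([40, 38, 42, 41, 39])   # hypothetical true positives per fold
fp = np.array([10, 12, 9, 11, 10])    # hypothetical false positives per fold

# Interval 1: a single Beta posterior from all K confusion matrices pooled
post = beta(1 + tp.sum(), 1 + fp.sum())
print("interval 1:", post.ppf([0.025, 0.975]))

# Interval 2: average the K fold-wise Beta posteriors (a mixture), then take
# the 2.5% and 97.5% quantiles of the averaged distribution numerically
folds = [beta(1 + t, 1 + f) for t, f in zip(tp, fp)]
mix_cdf = lambda x: np.mean([d.cdf(x) for d in folds])
lo = brentq(lambda x: mix_cdf(x) - 0.025, 1e-9, 1 - 1e-9)
hi = brentq(lambda x: mix_cdf(x) - 0.975, 1e-9, 1 - 1e-9)
print("interval 2:", (lo, hi))
```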

2020 ◽  
Author(s):  
Matthias Flor ◽  
Michael Weiß ◽  
Thomas Selhorst ◽  
Christine Müller-Graf ◽  
Matthias Greiner

Abstract Background: Various methods exist for statistical inference about a prevalence that consider misclassifications due to an imperfect diagnostic test. However, traditional methods are known to suffer from truncation of the prevalence estimate and of the confidence intervals constructed around the point estimate, as well as from under-performance of the confidence intervals' coverage. Methods: In this study, we used simulated data sets to validate a Bayesian prevalence estimation method and compare its performance to frequentist methods, i.e., the Rogan-Gladen estimate (RGE) for prevalence in combination with several methods of confidence interval construction. Our performance measures are (i) the error distribution of the point estimate against the simulated true prevalence and (ii) the coverage and length of the confidence interval, or credible interval in the case of the Bayesian method. Results: Across all data sets, the Bayesian point estimate and the RGE produced similar error distributions, with slight advantages of the former over the latter. In addition, the Bayesian estimate did not suffer from the RGE's truncation problem at zero or unity. With respect to the coverage performance of the confidence and credible intervals, all of the traditional frequentist methods exhibited strong under-coverage, whereas the Bayesian credible interval as well as a newly developed frequentist method by Lang and Reiczigel performed as desired, with the Bayesian method having a very slight advantage in terms of interval length. Conclusion: The Bayesian prevalence estimation method should be preferred over traditional frequentist methods. An acceptable alternative is to combine the Rogan-Gladen point estimate with the Lang-Reiczigel confidence interval.
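
As a hedged illustration of the two estimation routes (not the paper's exact model, which may place priors on the test accuracies), the sketch below computes the Rogan-Gladen estimate, with its characteristic truncation into [0, 1], and a grid-based Bayesian posterior for the true prevalence under known sensitivity and specificity; all counts and accuracy values are hypothetical.

```python
import numpy as np
from scipy.stats import binom

k, n = 12, 100            # test-positive count, sample size (hypothetical)
se, sp = 0.9, 0.95        # assumed known sensitivity and specificity

# Rogan-Gladen: correct the apparent prevalence, truncating into [0, 1]
ap = k / n
rge = np.clip((ap + sp - 1) / (se + sp - 1), 0, 1)

# Bayesian: uniform prior on the true prevalence, binomial likelihood for
# the apparent positives, posterior evaluated on a grid
grid = np.linspace(0, 1, 10_001)
p_apparent = se * grid + (1 - sp) * (1 - grid)
post = binom.pmf(k, n, p_apparent)    # flat prior, so posterior ∝ likelihood
post /= post.sum()
cdf = post.cumsum()
lo, hi = grid[np.searchsorted(cdf, 0.025)], grid[np.searchsorted(cdf, 0.975)]
print(f"RGE: {rge:.3f}  Bayesian 95% CrI: [{lo:.3f}, {hi:.3f}]")
```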


2013 ◽  
Vol 284-287 ◽  
pp. 3111-3114 ◽  
Author(s):  
Hsiang Chuan Liu ◽  
Wei Sung Chen ◽  
Ben Chang Shia ◽  
Chia Chen Lee ◽  
Shang Ling Ou ◽  
...  

In this paper, a novel fuzzy measure, the high order lambda measure, is proposed. Based on the Choquet integral with respect to this new measure, a novel composition forecasting model is also proposed, combining the GM(1,1) forecasting model, a time series model, and an exponential smoothing model. To evaluate the efficiency of this improved composition forecasting model, an experiment on real data using five-fold cross-validated mean squared error was conducted. The Choquet integral composition forecasting model with the P-measure, lambda-measure, L-measure, and high order lambda measure, respectively, was compared against a ridge regression composition forecasting model, a multiple linear regression composition forecasting model, and the traditional linear weighted composition forecasting model. The experimental results showed that the Choquet integral composition forecasting model with respect to the high order lambda measure has the best performance.
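
As a rough sketch of the aggregation machinery (the high order lambda measure itself is not specified in the abstract, so the standard Sugeno λ-measure stands in for it), the code below combines three hypothetical model forecasts via a discrete Choquet integral; the singleton densities are illustrative.

```python
import numpy as np
from scipy.optimize import brentq

def sugeno_lambda(g):
    # Solve prod(1 + lam * g_i) = 1 + lam for the unique lam in (-1, inf)\{0}
    g = np.asarray(g, dtype=float)
    f = lambda lam: np.prod(1 + lam * g) - (1 + lam)
    if np.isclose(g.sum(), 1.0):
        return 0.0                        # measure is already additive
    return brentq(f, -1 + 1e-9, -1e-9) if g.sum() > 1 else brentq(f, 1e-9, 1e9)

def lambda_measure(g_subset, lam):
    # g(A u {x}) = g(A) + g(x) + lam * g(A) * g(x), built up from singletons
    m = 0.0
    for gi in g_subset:
        m = m + gi + lam * m * gi
    return m

def choquet(values, g):
    # Discrete Choquet integral of `values` w.r.t. the lambda-measure from `g`
    values, g = np.asarray(values, float), np.asarray(g, float)
    lam = sugeno_lambda(g)
    order = np.argsort(values)            # ascending by forecast value
    v, d = values[order], g[order]
    total, prev = 0.0, 0.0
    for i in range(len(v)):
        # measure of the set of models whose forecast is at least v[i]
        total += (v[i] - prev) * lambda_measure(d[i:], lam)
        prev = v[i]
    return total

# Combine hypothetical forecasts from GM(1,1), a time series model, and
# exponential smoothing, with illustrative singleton densities
print(choquet([102.0, 98.0, 100.0], [0.40, 0.30, 0.35]))
```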


Author(s):  
Wasif Afzal ◽  
Richard Torkar ◽  
Robert Feldt

In the presence of a number of algorithms for classification and prediction in software engineering, there is a need for a systematic way of assessing their performance. The performance assessment is typically done by some form of partitioning or resampling of the original data to alleviate biased estimation. For predictive and classification studies in software engineering, there is a lack of definitive advice on the most appropriate resampling method to use. This is seen as one of the contributing factors for not being able to draw general conclusions on which modeling technique or set of predictor variables is the most appropriate. Furthermore, the use of a variety of resampling methods makes it impossible to perform any formal meta-analysis of the primary study results. Therefore, it is desirable to examine the influence of various resampling methods and to quantify possible differences. Objective and method: This study empirically compares five common resampling methods (hold-out validation, repeated random sub-sampling, 10-fold cross-validation, leave-one-out cross-validation, and non-parametric bootstrapping) using eight publicly available data sets, with genetic programming (GP) and multiple linear regression (MLR) as software quality classification approaches. The location of (PF, PD) pairs in the ROC (receiver operating characteristic) space and the area under the ROC curve (AUC) are used as accuracy indicators. Results: The results show that in terms of the location of (PF, PD) pairs in the ROC space, bootstrapping results are in the preferred region for 3 of the 8 data sets for GP and for 4 of the 8 data sets for MLR. Based on the AUC measure, there are no significant differences between the different resampling methods using GP and MLR. Conclusion: Certain data set properties may be responsible for the insignificant differences between the resampling methods based on AUC, including imbalanced data sets, insignificant predictor variables, and high-dimensional data sets. With the current selection of data sets and classification techniques, bootstrapping is the preferred method based on the location of (PF, PD) pair data in the ROC space. Hold-out validation is not a good choice for comparatively smaller data sets, where leave-one-out cross-validation (LOOCV) performs better. For comparatively larger data sets, 10-fold cross-validation performs better than LOOCV.
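
The sketch below mimics the comparison on synthetic data, with logistic regression standing in for GP and MLR; note that LOOCV cannot yield a per-fold AUC (one test point per fold), so out-of-fold scores are pooled before a single AUC is computed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_predict,
                                     train_test_split)

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Hold-out validation: 2/3 train, 1/3 test
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=1/3, random_state=0)
print("hold-out:", roc_auc_score(yte, model.fit(Xtr, ytr).predict_proba(Xte)[:, 1]))

# 10-fold CV and LOOCV: pool out-of-fold scores, then compute a single AUC
for name, cv in [("10-fold", KFold(10, shuffle=True, random_state=0)),
                 ("LOOCV", LeaveOneOut())]:
    scores = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    print(name + ":", roc_auc_score(y, scores))

# Non-parametric bootstrap: train on a resample, test on out-of-bag rows
rng = np.random.default_rng(0)
aucs = []
for _ in range(100):
    idx = rng.integers(0, len(y), len(y))
    oob = np.setdiff1d(np.arange(len(y)), idx)
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    aucs.append(roc_auc_score(y[oob], m.predict_proba(X[oob])[:, 1]))
print("bootstrap:", np.mean(aucs))
```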


Author(s):  
G. G. Hamedani ◽  
Mahdi Rasekhi ◽  
Sayed Najibi ◽  
Haitham M. Yousof ◽  
Morad Alizadeh

In this paper, a new class of continuous distributions with two extra positive parameters is introduced and called the Type II General Exponential (TIIGE) distribution. Some special models are presented. Asymptotics, explicit expressions for the ordinary and incomplete moments, moments of residual life and reversed residual life, quantile and generating functions, and the stress-strength reliability function are derived. Characterizations of this family based on truncated moments, the hazard function, and the conditional expectation of certain functions of the random variable are obtained. The performance of the maximum likelihood estimators in terms of bias, mean squared error, and confidence interval length is examined by means of a simulation study. Two real data sets are used to illustrate the application of the proposed class.
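
The abstract does not reproduce the TIIGE density, so the following sketch only illustrates the simulation protocol such studies use (bias, mean squared error, and average confidence interval length of the MLE, with coverage as a sanity check), using a stand-in exponential model whose MLE has a closed form; it is not the TIIGE itself.

```python
import numpy as np

rng = np.random.default_rng(1)
rate_true, n, reps = 2.0, 50, 2000
est, ci_len, cover = np.empty(reps), np.empty(reps), 0
for r in range(reps):
    x = rng.exponential(1 / rate_true, n)
    mle = 1 / x.mean()                  # closed-form MLE of the rate
    se = mle / np.sqrt(n)               # asymptotic standard error
    lo, hi = mle - 1.96 * se, mle + 1.96 * se
    est[r], ci_len[r] = mle, hi - lo
    cover += lo <= rate_true <= hi

print("bias:", est.mean() - rate_true)
print("MSE:", ((est - rate_true) ** 2).mean())
print("mean CI length:", ci_len.mean(), " coverage:", cover / reps)
```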


2018 ◽  
Vol 41 (2) ◽  
pp. 251-267 ◽  
Author(s):  
Abbas Pak ◽  
Arjun Kumar Gupta ◽  
Nayereh Bagheri Khoolenjani

In this paper, we study the reliability of a multicomponent stress-strength model, assuming that the components follow the power Lindley model. The maximum likelihood estimate of the reliability parameter and its asymptotic confidence interval are obtained. Applying the parametric bootstrap technique, an interval estimate of the reliability is presented. Also, the Bayes estimate and the highest posterior density credible interval of the reliability parameter are derived using suitable priors on the parameters. Because there is no closed form for the Bayes estimate, we use the Markov chain Monte Carlo method to obtain an approximate Bayes estimate of the reliability. To evaluate the performance of the different procedures, simulation studies are conducted and a real data example is provided.
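
A minimal Monte Carlo sketch of the reliability quantity itself, assuming hypothetical parameter values and a shape parameter shared by stress and strength (the paper's estimation procedures, MLE, bootstrap, and MCMC, are not reproduced here); it exploits the fact that a power Lindley variate is a Lindley variate, an exponential/gamma mixture, raised to the power 1/α.

```python
import numpy as np

rng = np.random.default_rng(7)

def rpower_lindley(alpha, beta, size):
    # Lindley(beta) is Exp(beta) w.p. beta/(beta+1), else Gamma(2, rate beta);
    # raising it to the power 1/alpha gives a power Lindley variate
    mix = rng.random(size) < beta / (beta + 1)
    y = np.where(mix, rng.exponential(1 / beta, size), rng.gamma(2, 1 / beta, size))
    return y ** (1 / alpha)

def reliability_mc(s, k, alpha, beta_strength, beta_stress, reps=200_000):
    # R_{s,k} = P(at least s of the k iid strengths exceed the common stress)
    strengths = rpower_lindley(alpha, beta_strength, (reps, k))
    stress = rpower_lindley(alpha, beta_stress, (reps, 1))
    return np.mean((strengths > stress).sum(axis=1) >= s)

print(reliability_mc(s=2, k=4, alpha=1.5, beta_strength=0.8, beta_stress=1.2))
```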


2018 ◽  
Vol 7 (2.15) ◽  
pp. 136 ◽  
Author(s):  
Rosaida Rosly ◽  
Mokhairi Makhtar ◽  
Mohd Khalid Awang ◽  
Mohd Isa Awang ◽  
Mohd Nordin Abdul Rahman

This paper analyses the performance of classification models using single classifiers and combinations of ensemble methods, with the Breast Cancer Wisconsin and Hepatitis data sets as training data. It presents a comparison of different classifiers based on 10-fold cross-validation using a data mining tool. In this experiment, various classifiers are implemented, including three popular ensemble methods for the combinations: boosting, bagging, and stacking. The results show that for the classification of the Breast Cancer Wisconsin data set, the single Naïve Bayes (NB) classifier and the bagging+NB combination displayed the highest accuracy at the same percentage (97.51%) compared to other combinations of ensemble classifiers. For the classification of the Hepatitis data set, the combination of stacking and a Multi-Layer Perceptron (MLP) achieved a higher accuracy of 86.25%. By using ensemble classifiers, the results may be improved. In future work, a multi-classifier approach will be proposed by introducing fusion at the classification level between these classifiers to obtain higher classification accuracies.
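
A sketch of comparable setups in scikit-learn (the paper used a separate data mining tool, so base learners, stacking members, and accuracies will differ): single Naïve Bayes, bagging over Naïve Bayes, and stacking with an MLP meta-learner, each scored by 10-fold cross-validation on the Breast Cancer Wisconsin data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "NB": GaussianNB(),
    "bagging+NB": BaggingClassifier(GaussianNB(), n_estimators=10, random_state=0),
    "stacking+MLP": StackingClassifier(
        estimators=[("nb", GaussianNB()),
                    ("tree", DecisionTreeClassifier(random_state=0))],
        final_estimator=MLPClassifier(max_iter=2000, random_state=0)),
}
for name, model in models.items():
    # mean accuracy over 10 folds, mirroring the paper's evaluation setup
    print(f"{name}: {cross_val_score(model, X, y, cv=10).mean():.4f}")
```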


2012 ◽  
Vol 433-440 ◽  
pp. 3959-3963 ◽  
Author(s):  
Bayram Akdemir ◽  
Nurettin Çetinkaya

In power distribution systems, load forecasting is one of the major management problems, affecting energy flow management, system protection, and economic operation. In order to manage the system, the next step of the load characteristic must be inferred from historical data sets. For forecasting, not only historical parameters are used; external parameters such as weather conditions, seasons, and population also matter greatly for predicting the next behavior of the load characteristic. Holidays and weekdays have different effects on energy consumption in any country. In this study, the target is to forecast the peak energy level for the next hour and to compare the effects of weekdays and holidays on peak energy needs. Energy consumption data sets have nonlinear characteristics, and it is not easy to fit any curve to them due to this nonlinearity and the large number of parameters. In order to forecast the peak energy level, an adaptive neuro-fuzzy inference system (ANFIS) is used, and the hourly effects of holidays and weekdays on the peak energy level are examined. The outputs of the artificial intelligence model are evaluated with two-fold cross-validation and mean absolute percentage error (MAPE). The obtained two-fold cross-validation error, as MAPE, is 3.51, and the data set that includes holidays is more accurate than the data set without holidays; total accuracy increased by 2.4%.
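
A minimal sketch of the evaluation protocol, two-fold cross-validated MAPE for an hourly peak-load forecaster; ANFIS is not available in scikit-learn, so a gradient-boosted regressor stands in for it, and the features (hour, weekday, holiday flag, prior load) and load series are synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def mape(y_true, y_pred):
    # mean absolute percentage error, in percent
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

rng = np.random.default_rng(0)
# Hypothetical features: hour of day, weekday index, holiday flag, prior load
X = np.column_stack([rng.integers(0, 24, 500), rng.integers(0, 7, 500),
                     rng.integers(0, 2, 500), rng.uniform(50, 150, 500)])
y = X[:, 3] * (1 + 0.1 * X[:, 2]) + rng.normal(0, 5, 500)  # synthetic peak load

errors = []
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    model = GradientBoostingRegressor(random_state=0).fit(X[train], y[train])
    errors.append(mape(y[test], model.predict(X[test])))
print("two-fold CV MAPE: %.2f%%" % np.mean(errors))
```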

