When to consult precision-recall curves

Author(s):  
Jonathan Cook ◽  
Vikram Ramadas

Receiver operating characteristic (ROC) curves are commonly used to evaluate predictions of binary outcomes. When there is a small percentage of items of interest (as would be the case with fraud detection, for example), ROC curves can provide an inflated view of performance. This can cause challenges in determining which set of predictions is better. In this article, we discuss the conditions under which precision-recall curves may be preferable to ROC curves. As an illustrative example, we compare two commonly used fraud predictors (Beneish’s [1999, Financial Analysts Journal 55: 24–36] M score and Dechow et al.’s [2011, Contemporary Accounting Research 28: 17–82] F score) using both ROC and precision-recall curves. To aid the reader with using precision-recall curves, we also introduce the command prcurve to plot them.
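The effect described in this abstract can be illustrated with a small sketch. It is written in Python and is not the authors' prcurve command; the scores and the helper names roc_auc and average_precision are made up for illustration. The same scores are evaluated twice, once with a balanced sample and once with the negatives replicated so that items of interest become rare. The ROC area is unchanged, while the precision-recall summary drops.

```python
def roc_auc(pos, neg):
    """Area under the ROC curve, computed as the share of (positive, negative)
    pairs in which the positive scores higher; ties count as one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(pos, neg):
    """Step-wise area under the precision-recall curve: the mean of the
    precision values observed at the rank of each positive item."""
    # Sort by descending score; on tied scores negatives come first,
    # which is the pessimistic ordering.
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg],
                    key=lambda t: (-t[0], t[1]))
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / len(pos)


# Hypothetical scores, chosen only for illustration.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.5, 0.3, 0.2, 0.1, 0.05]
rare_neg = neg * 20  # same score distribution, far lower prevalence of positives

balanced = (roc_auc(pos, neg), average_precision(pos, neg))
imbalanced = (roc_auc(pos, rare_neg), average_precision(pos, rare_neg))

print("balanced:   ROC AUC = %.3f, average precision = %.3f" % balanced)
print("imbalanced: ROC AUC = %.3f, average precision = %.3f" % imbalanced)
```

Replicating the negatives leaves the true and false positive rates untouched, so the ROC curve cannot register the change in prevalence. Precision depends on the absolute number of false positives, so it does.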

1978 ◽  
Vol 17 (03) ◽  
pp. 157-161 ◽  
Author(s):  
F. T. De Dombal ◽  
Jane C. Horrocks

This paper uses simple receiver operating characteristic (ROC) curves (i) to study the effect of varying computer confidence threshold levels and (ii) to evaluate clinical performance in the diagnosis of acute appendicitis. Over 1300 patients presenting to five centres with abdominal pain of short duration were studied in varying detail. Clinical and computer-aided diagnostic predictions were compared with the »final« diagnosis. From these studies it is concluded that the simplistic setting of a 50/50 confidence threshold for the computer program is as »good« as any other. The proximity of a computer-aided system changed clinical behaviour patterns; a higher overall performance level was achieved and clinicians' performance levels became associated with the »mildly conservative« end of the computer's ROC curve. Prior forecasts of over-confidence or ultra-caution amongst clinicians using the computer-aided system have not been fulfilled.


Diagnostics ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. 949
Author(s):  
Cecil J. Weale ◽  
Don M. Matshazi ◽  
Saarah F. G. Davids ◽  
Shanel Raghubeer ◽  
Rajiv T. Erasmus ◽  
...  

This cross-sectional study investigated the association of miR-1299, -126-3p and -30e-3p with dysglycaemia, and their diagnostic capability for it, in 1273 (men, n = 345) South Africans aged >20 years. Glycaemic status was assessed by oral glucose tolerance test (OGTT). Whole blood microRNA (miRNA) expressions were assessed using TaqMan-based reverse transcription quantitative-PCR (RT-qPCR). Receiver operating characteristic (ROC) curves assessed the ability of each miRNA to discriminate dysglycaemia, while multivariable logistic regression analyses linked expression with dysglycaemia. In all, 207 (16.2%) and 94 (7.4%) participants had prediabetes and type 2 diabetes mellitus (T2DM), respectively. All three miRNAs were expressed at significantly higher levels in individuals with prediabetes than in normotolerant patients, p < 0.001. miR-30e-3p and miR-126-3p were also significantly more expressed in T2DM versus normotolerant patients, p < 0.001. In multivariable logistic regressions, the three miRNAs were consistently and continuously associated with prediabetes, while only miR-126-3p was associated with T2DM. The ROC analysis indicated all three miRNAs had a significant overall predictive ability to diagnose prediabetes, diabetes and the combination of both (dysglycaemia), with the area under the receiver operating characteristic curve (AUC) being significantly higher for miR-126-3p in prediabetes. For prediabetes diagnosis, miR-126-3p (AUC = 0.760) outperformed HbA1c (AUC = 0.695), p = 0.042. These results suggest that miR-1299, -126-3p and -30e-3p are associated with prediabetes, and measuring miR-126-3p could potentially contribute to diabetes risk screening strategies.


2008 ◽  
Vol 18 (02) ◽  
pp. 349-367
Author(s):  
CHRISTOPHER GITTINS ◽  
DAISEI KONNO ◽  
MICHAEL HOKE ◽  
ANTHONY RATKOWSKI

In this paper we assess the effect that clustering pixels into spectrally-similar background types (for example, soil, vegetation, and water in hyperspectral visible/near-IR/SWIR imagery) prior to applying a detection methodology has on material detection statistics. Specifically, we examine the effects of data segmentation on two statistically-based detection metrics, the Subspace Generalized Likelihood Ratio Test (Subspace GLRT) and the Adaptive Cosine Estimator (ACE), applied to a publicly-available AVIRIS datacube augmented with a synthetic material spectrum in selected pixels. The use of synthetic spectrum-augmented data enables quantitative comparison of Subspace GLRT and ACE using Receiver Operating Characteristic (ROC) curves. For all cases investigated, ROC curves generated using ACE were as good as or superior to those generated using Subspace GLRT. The favorability of ACE over Subspace GLRT was more pronounced as the synthetic spectrum mixing fraction decreased. For probabilities of detection in the range of 50–80%, segmentation reduced the probability of false alarm by a factor of 3–5 when using ACE. In contrast, segmentation had no apparent effect on detection statistics using Subspace GLRT in this example.


Author(s):  
Mario A. Cleves

The area under the receiver operating characteristic (ROC) curve is often used to summarize and compare the discriminatory accuracy of a diagnostic test or modality, and to evaluate the predictive power of statistical models for binary outcomes. Parametric maximum likelihood methods for fitting of the ROC curve provide direct estimates of the area under the ROC curve and its variance. Nonparametric methods, on the other hand, provide estimates of the area under the ROC curve, but do not directly estimate its variance. Three algorithms for computing the variance for the area under the nonparametric ROC curve are commonly used, although ambiguity exists about their behavior under diverse study conditions. Using simulated data, we found similar asymptotic performance between these algorithms when the diagnostic test produces results on a continuous scale, but found notable differences in small samples, and when the diagnostic test yields results on a discrete diagnostic scale.
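As a rough illustration of the problem this abstract addresses, the sketch below computes the nonparametric area under the ROC curve and attaches a variance to it by bootstrap resampling. The bootstrap is used here only as a generic stand-in: the abstract does not name the three algorithms it compares, and this sketch should not be read as one of them. The scores and the function names are hypothetical.

```python
import random


def roc_auc(pos, neg):
    """Nonparametric AUC: the share of (positive, negative) pairs in which the
    positive scores higher, with ties counted as one half. Ties matter most
    when the test yields results on a discrete scale."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_variance(pos, neg, n_boot=500, seed=0):
    """Sample variance of the AUC over bootstrap resamples, drawing positives
    and negatives separately so that group sizes stay fixed."""
    rng = random.Random(seed)  # seeded for reproducibility
    draws = []
    for _ in range(n_boot):
        p = rng.choices(pos, k=len(pos))
        n = rng.choices(neg, k=len(neg))
        draws.append(roc_auc(p, n))
    mean = sum(draws) / n_boot
    return sum((d - mean) ** 2 for d in draws) / (n_boot - 1)


# Hypothetical test results on a five-point discrete scale (1 = normal, 5 = abnormal).
diseased = [5, 4, 4, 3, 5, 2, 4, 3]
healthy = [1, 2, 2, 3, 1, 4, 2, 1, 3, 2]

auc = roc_auc(diseased, healthy)
var = bootstrap_auc_variance(diseased, healthy, n_boot=500, seed=1)
print("AUC = %.3f, bootstrap variance = %.5f" % (auc, var))
```

With samples this small and a discrete scale, the estimate can shift noticeably with the resampling scheme, which fits the abstract's finding that variance methods differ under these conditions.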

