Estimating the Area under the ROC Curve with Modified Profile Likelihoods

2016 ◽  
Vol 6 (1) ◽  
pp. 1
Author(s):  
Giuliana Cortese

Receiver operating characteristic (ROC) curves are a frequent tool to study the discriminating ability of a certain characteristic. The area under the ROC curve (AUC) is a widely used measure of statistical accuracy of continuous markers for diagnostic tests, and has the advantage of providing a single summary index of the overall performance of the test. Recent studies have shown some critical issues related to traditional point and interval estimates for the AUC, especially for small samples, more complex models, unbalanced samples, or values near the boundary of the parameter space, i.e., when the AUC approaches the values 0.5 or 1. Parametric models for the AUC have been shown to be powerful when the underlying distributional assumptions are not misspecified. However, in the above circumstances parametric inference may not be accurate, sometimes yielding misleading conclusions. The objective of the paper is to propose an alternative inferential approach based on modified profile likelihoods, which provides more accurate statistical results in any parametric setting, including the above circumstances. The proposed method is illustrated for the binormal model, but can potentially be used in any other complex model and for any other parametric distribution. We report simulation studies to show the improved performance of the proposed approach, when compared to classical first-order likelihood theory. An application to real-life data in a small-sample setting is also discussed, to provide practical guidelines.
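Under the binormal model mentioned above, the AUC point estimate has a simple closed form: AUC = Φ((μ₁ − μ₀)/√(σ₀² + σ₁²)), where Φ is the standard normal CDF. A minimal sketch of that formula (just the point estimate, not the paper's modified-profile-likelihood machinery; function names are illustrative):

```python
from math import erf, sqrt

def std_normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def binormal_auc(mu0, sigma0, mu1, sigma1):
    # Closed-form AUC under the binormal model:
    # AUC = Phi((mu1 - mu0) / sqrt(sigma0**2 + sigma1**2))
    return std_normal_cdf((mu1 - mu0) / sqrt(sigma0**2 + sigma1**2))

# Equal unit variances, one-standard-deviation separation between groups
print(round(binormal_auc(0.0, 1.0, 1.0, 1.0), 3))  # -> 0.76
```

Plugging estimated means and variances into this expression is what makes misspecified distributional assumptions propagate directly into the AUC estimate.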

Author(s):  
Bob Uttl ◽  
Victoria C. Violo

In a widely cited and widely talked about study, MacNell et al. (2015) [1] examined SET ratings of one female and one male instructor, each teaching two sections of the same online course, one section under their true gender and the other section under false/opposite gender. MacNell et al. concluded that students rated perceived female instructors more harshly than perceived male instructors, demonstrating gender bias against perceived female instructors. Boring, Ottoboni, and Stark (2016) [2] re-analyzed MacNell et al.’s data and confirmed their conclusions. However, the design of MacNell et al.’s study is fundamentally flawed. First, MacNell et al.’s section sample sizes were extremely small, ranging from 8 to 12 students. Second, MacNell et al. included only one female and one male instructor. Third, MacNell et al.’s findings depend on three outliers – three unhappy students (all in perceived female conditions) who gave their instructors the lowest possible ratings on all or nearly all SET items. We re-analyzed MacNell et al.’s data with and without the three outliers. Our analyses showed that the gender bias against perceived female instructors disappeared. Instead, students rated the actual female vs. male instructor higher, regardless of perceived gender. MacNell et al.’s study is a real-life demonstration that conclusions based on extremely small sample-sized studies are unwarranted and uninterpretable.
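The leverage a single extreme rating exerts in a section of 8 to 12 students can be illustrated with made-up numbers (hypothetical ratings, NOT MacNell et al.'s actual data):

```python
# Hypothetical ratings on a 1-5 scale for a ten-student section
# (illustrative only; NOT MacNell et al.'s actual data)
ratings = [4, 5, 4, 4, 5, 3, 4, 5, 4, 1]  # one disgruntled outlier at the end

mean_all = sum(ratings) / len(ratings)
mean_without_outlier = sum(ratings[:-1]) / len(ratings[:-1])

# One student moves the section mean from ~4.2 down to 3.9
print(mean_all, mean_without_outlier)
```

With n around ten, a single lowest-possible rating shifts the mean by roughly a third of a point, which is why a handful of outliers can drive an entire between-condition comparison.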


2021 ◽  
Vol 2 (1) ◽  
pp. 148-179
Author(s):  
Mohammad Jahangir Hossain Mojumder

Nowadays, demands are growing for outcome-based and transferable learning, particularly in higher education. As the terminal stage of formal schooling, higher education requires teachers to facilitate pupils’ achievement of problem-solving skills for real life. To this end, this qualitative research employs a case study approach, which is suitable for examining an event with small samples, and a phenomenological method to analyze respondents’ perceptions and activities thematically and descriptively to assess changes. In-depth interviews, focus group discussions, and class observations are used to collect data from two selected colleges to examine the extent of professional development and methodological shift in teaching as effects of training to include active learning strategies for better learning outcomes. Although the data reveal that the selected flagship training program offers a range of pedagogical methods (not need-based) to imbibe, they reject the idea that the nationally arranged training is a successful effort to increase trainees’ knowledge and skills and polish their attitudes, beyond disseminating a few concepts superficially. Moreover, trainees lack the motivation to shift their teaching habits and are unconvinced that applying these newly learned strategies will transform anything. Likewise, they are discontented with the training contents and unenthusiastic, holding somewhat unfavorable opinions about the training procedures and trainers. Therefore, the results suggest limited or no significant professional development and modification in teaching practice; rather, teachers continue with conventional teacher-centered methods, and the effort stays insufficient, extraneous, ‘fragmented’, and ‘intellectually superficial’. Additionally, at the colleges, large class sizes, inappropriate seating arrangements, pervasive traditionality, absenteeism, and other analogous challenges limited their ability to change their practice.
Considering all this, the study suggests that alterations should be initiated at the micro level (teachers and colleges) and the macro level (training providers and policymakers) to offer tailor-made, autonomous, and need-based training. Last but not least, this endeavor is limited by being entirely qualitative, by its small sample size, and by not eliciting the views of any of the trainers or policymakers, which indicates points of departure for future study.


2015 ◽  
Vol 26 (6) ◽  
pp. 2603-2621 ◽  
Author(s):  
Dai Feng ◽  
Giuliana Cortese ◽  
Richard Baumgartner

The receiver operating characteristic (ROC) curve is frequently used as a measure of accuracy of continuous markers in diagnostic tests. The area under the ROC curve (AUC) is arguably the most widely used summary index for the ROC curve. Although the small sample size scenario is common in medical tests, a comprehensive study of small sample size properties of various methods for the construction of the confidence/credible interval (CI) for the AUC has been by and large missing in the literature. In this paper, we describe and compare 29 non-parametric and parametric methods for the construction of the CI for the AUC when the number of available observations is small. The methods considered include not only those that have been widely adopted, but also those that have been less frequently mentioned or, to our knowledge, never applied to the AUC context. To compare different methods, we carried out a simulation study with data generated from binormal models with equal and unequal variances and from exponential models with various parameters and with equal and unequal small sample sizes. We found that the larger the true AUC value and the smaller the sample size, the larger the discrepancy among the results of different approaches. When the model is correctly specified, the parametric approaches tend to outperform the non-parametric ones. Moreover, in the non-parametric domain, we found that a method based on the Mann–Whitney statistic is in general superior to the others. We further elucidate potential issues and provide possible solutions, along with general guidance on CI construction for the AUC when the sample size is small. Finally, we illustrate the utility of different methods through real-life examples.
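The Mann–Whitney estimate of the AUC, paired with one simple Wald-type interval (here the Hanley–McNeil variance approximation, one of many variants such comparisons cover), can be sketched as follows; function names and the toy marker values are illustrative:

```python
import math

def auc_mann_whitney(neg, pos):
    """Nonparametric AUC: fraction of (negative, positive) marker pairs
    correctly ordered, counting ties as 1/2 (the Mann-Whitney statistic)."""
    wins = sum(1.0 if y > x else 0.5 if y == x else 0.0
               for x in neg for y in pos)
    return wins / (len(neg) * len(pos))

def hanley_mcneil_ci(auc, n_neg, n_pos, z=1.96):
    """Wald-type 95% CI using the Hanley-McNeil variance approximation,
    truncated to [0, 1]."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1.0) * (q1 - auc * auc)
           + (n_neg - 1.0) * (q2 - auc * auc)) / (n_neg * n_pos)
    se = math.sqrt(var)
    return max(0.0, auc - z * se), min(1.0, auc + z * se)

neg = [0.2, 0.5, 1.1, 1.3]  # marker values, non-diseased
pos = [0.9, 1.4, 1.8, 2.0]  # marker values, diseased
a = auc_mann_whitney(neg, pos)
print(a, hanley_mcneil_ci(a, len(neg), len(pos)))
```

With samples this small and a large true AUC, the interval runs into the upper boundary and gets truncated at 1, which is exactly the kind of misbehavior near the boundary that motivates comparing many CI constructions.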


2006 ◽  
Vol 361 (1475) ◽  
pp. 2023-2037 ◽  
Author(s):  
Thomas P Curtis ◽  
Ian M Head ◽  
Mary Lunn ◽  
Stephen Woodcock ◽  
Patrick D Schloss ◽  
...  

The extent of microbial diversity is an intrinsically fascinating subject of profound practical importance. The term ‘diversity’ may allude to the number of taxa or species richness as well as their relative abundance. There is uncertainty about both, primarily because sample sizes are too small. Non-parametric diversity estimators make gross underestimates if used with small sample sizes on unevenly distributed communities. One can make richness estimates over many scales using small samples by assuming a species/taxa-abundance distribution. However, no one knows what the underlying taxa-abundance distributions are for bacterial communities. Latterly, diversity has been estimated by fitting data from gene clone libraries and extrapolating from this to taxa-abundance curves to estimate richness. However, since sample sizes are small, we cannot be sure that such samples are representative of the community from which they were drawn. It is however possible to formulate, and calibrate, models that predict the diversity of local communities and of samples drawn from that local community. The calibration of such models suggests that migration rates are small and decrease as the community gets larger. The preliminary predictions of the model are qualitatively consistent with the patterns seen in clone libraries in ‘real life’. The validation of this model is also confounded by small sample sizes. However, if such models were properly validated, they could form invaluable tools for the prediction of microbial diversity and a basis for the systematic exploration of microbial diversity on the planet.
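One widely used non-parametric richness estimator of the kind discussed here is Chao1. A sketch shows how heavily it leans on the rare (singleton and doubleton) taxa, which is exactly why small, uneven samples lead it to gross underestimates:

```python
from collections import Counter

def chao1(abundances):
    """Chao1 richness estimate from a vector of per-taxon counts.
    S_chao1 = S_obs + f1**2 / (2 * f2), where f1 and f2 are the numbers of
    taxa seen exactly once (singletons) and exactly twice (doubletons).
    Falls back to the bias-corrected form when there are no doubletons."""
    counts = [c for c in abundances if c > 0]
    s_obs = len(counts)
    freq = Counter(counts)
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    return s_obs + f1 * (f1 - 1) / 2.0  # bias-corrected form when f2 == 0

# Six observed taxa, two singletons, one doubleton
print(chao1([5, 5, 3, 1, 1, 2]))  # -> 8.0
```

Because the correction term depends only on f1 and f2, a small sample from a highly uneven community, where most rare taxa are simply unseen, yields a correction far too small relative to the true richness.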


2010 ◽  
Vol 30 (4) ◽  
pp. 509-517 ◽  
Author(s):  
Mithat Gönen ◽  
Glenn Heller

Receiver operating characteristic (ROC) curves evaluate the discriminatory power of a continuous marker to predict a binary outcome. The most popular parametric model for an ROC curve is the binormal model, which assumes that the marker, after a monotone transformation, is normally distributed conditional on the outcome. Here, the authors present an alternative to the binormal model based on the Lehmann family, also known as the proportional hazards specification. The resulting ROC curve and its functionals (such as the area under the curve and the sensitivity at a given level of specificity) have simple analytic forms. Closed-form expressions for the functional estimates and their corresponding asymptotic variances are derived. This family accommodates the comparison of multiple markers, covariate adjustments, and clustered data through a regression formulation. Evaluation of the underlying assumptions, model fitting, and model selection can be performed using any off-the-shelf proportional hazards statistical software package.
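Under the Lehmann (proportional hazards) specification, the ROC curve and its AUC indeed take simple analytic forms: ROC(t) = t^θ and, by integrating over t in [0, 1], AUC = 1/(1 + θ), where θ is the hazard ratio between the two outcome groups. A minimal sketch of those closed forms:

```python
def lehmann_roc(t, theta):
    # ROC curve under the Lehmann family: ROC(t) = t ** theta,
    # where theta is the proportional-hazards ratio between outcome groups
    return t ** theta

def lehmann_auc(theta):
    # Integrating t ** theta over [0, 1] gives AUC = 1 / (1 + theta)
    return 1.0 / (1.0 + theta)

print(lehmann_auc(1.0))   # -> 0.5 (theta = 1: marker carries no information)
print(lehmann_auc(0.25))  # -> 0.8 (smaller theta: better discrimination)
```

In practice θ would be estimated from a proportional hazards fit, which is what lets off-the-shelf software handle model fitting and diagnostics for this family.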


1994 ◽  
Vol 33 (02) ◽  
pp. 180-186 ◽  
Author(s):  
H. Brenner ◽  
O. Gefeller

Abstract: The traditional concept of describing the validity of a diagnostic test neglects the presence of chance agreement between test result and true (disease) status. Sensitivity and specificity, as the fundamental measures of validity, can thus only be considered in conjunction with each other to provide an appropriate basis for the evaluation of the capacity of the test to discriminate truly diseased from truly undiseased subjects. In this paper, chance-corrected analogues of sensitivity and specificity are presented as supplemental measures of validity, which pay attention to the problem of chance agreement and offer the opportunity to be interpreted separately. While recent proposals of chance-correction techniques, suggested by several authors in this context, lead to measures which are dependent on disease prevalence, our method does not share this major disadvantage. We discuss the extension of the conventional ROC-curve approach to chance-corrected measures of sensitivity and specificity. Furthermore, point and asymptotic interval estimates of the parameters of interest are derived under different sampling frameworks for validation studies. The small sample behavior of the estimates is investigated in a simulation study, leading to a logarithmic modification of the interval estimate in order to hold the nominal confidence level for small samples.
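The paper's specific chance-corrected analogues are not reproduced here, but the underlying complaint, that sensitivity or specificity taken alone can look good purely by chance, is easy to demonstrate with Youden's J (J = Se + Sp − 1), a familiar prevalence-independent summary that likewise requires the two measures jointly:

```python
def se_sp_youden(tp, fn, fp, tn):
    # Sensitivity, specificity, and Youden's J = Se + Sp - 1 from a 2x2 table.
    # J = 0 for a test no better than chance, J = 1 for a perfect test.
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return se, sp, se + sp - 1.0

# A "test" that always returns positive: perfect sensitivity, useless overall
print(se_sp_youden(tp=100, fn=0, fp=100, tn=0))  # -> (1.0, 0.0, 0.0)
```

The degenerate always-positive test reaches Se = 1 while J = 0, which is the kind of chance-level agreement the chance-corrected measures are designed to strip out.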


2021 ◽  
Vol 9 (1) ◽  
pp. 172-189
Author(s):  
David Benkeser ◽  
Jialu Ran

Abstract Understanding the pathways whereby an intervention has an effect on an outcome is a common scientific goal. A rich body of literature provides various decompositions of the total intervention effect into pathway-specific effects. Interventional direct and indirect effects provide one such decomposition. Existing estimators of these effects are based on parametric models with confidence interval estimation facilitated via the nonparametric bootstrap. We provide theory that allows for more flexible, possibly machine learning-based, estimation techniques to be considered. In particular, we establish weak convergence results that facilitate the construction of closed-form confidence intervals and hypothesis tests and prove multiple robustness properties of the proposed estimators. Simulations show that inference based on large-sample theory has adequate small-sample performance. Our work thus provides a means of leveraging modern statistical learning techniques in estimation of interventional mediation effects.


2021 ◽  
Vol 28 (Supplement_1) ◽  
Author(s):  
M Santos ◽  
S Paula ◽  
I Almeida ◽  
H Santos ◽  
H Miranda ◽  
...  

Abstract Funding Acknowledgements Type of funding sources: None. Introduction Patients (P) with acute heart failure (AHF) are a heterogeneous population. Risk stratification at admission may help predict in-hospital complications and needs. The Get With The Guidelines Heart Failure score (GWTG-HF) predicts in-hospital mortality (M) of P admitted with AHF. The ACTION ICU score is validated to estimate the risk of complications requiring ICU care in non-ST elevation acute coronary syndromes. Objective To validate the ACTION-ICU score in AHF and to compare ACTION-ICU to GWTG-HF as predictors of in-hospital M (IHM), early M [1-month mortality (1mM)] and 1-month readmission (1mRA), using real-life data. Methods This was a single-center retrospective study of data collected from P admitted to the Cardiology department with AHF between 2010 and 2017. P without data on previous cardiovascular history or with incomplete clinical data were excluded. Statistical analysis used chi-square, non-parametric tests, logistic regression analysis and ROC curve analysis. Results Among the 300 P admitted with AHF who were included, mean age was 67.4 ± 12.6 years and 72.7% were male. Systolic blood pressure (SBP) was 131.2 ± 37.0 mmHg and glomerular filtration rate (GFR) was 57.1 ± 23.5 ml/min. 35.3% were admitted in Killip-Kimball class (KKC) 4. The ACTION-ICU score was 10.4 ± 2.3 and GWTG-HF was 41.7 ± 9.6. Inotropes’ usage was necessary in 32.7% of the P, 11.3% of the P needed non-invasive ventilation (NIV), and 8% needed invasive ventilation (IV). The IHM rate was 5% and 1mM was 8%. 6.3% of the P were readmitted 1 month after discharge. Older age (p < 0.001), lower SBP (p = 0.035) and need of inotropes (p < 0.001) were predictors of IHM in our population. As expected, patients presenting in KKC 4 had higher IHM (OR 8.13, p < 0.001).
Older age (OR 1.06, p = 0.002, CI 1.02-1.10), lower SBP (OR 1.01, p = 0.05, CI 1.00-1.02) and lower left ventricle ejection fraction (LVEF) (OR 1.06, p < 0.001, CI 1.03-1.09) were predictors of need of NIV. None of the variables were predictive of IV. LVEF (OR 0.924, p < 0.001, CI 0.899-0.949), lower SBP (OR 0.80, p < 0.001, CI 0.971-0.988), higher urea (OR 1.01, p < 0.001, CI 1.005-1.018) and lower sodium (OR 0.92, p = 0.002, CI 0.873-0.971) were predictors of inotropes’ usage. Logistic regression showed that GWTG-HF predicted IHM (OR 1.12, p < 0.001, CI 1.05-1.19), 1mM (OR 1.10, CI 1.04-1.16) and inotropes’ usage (OR 1.06, p < 0.001, CI 1.03-1.10); however, it was not predictive of 1mRA, need of IV or NIV. Similarly, ACTION-ICU predicted IHM (OR 1.51, p = 0.02, CI 1.158-1.977), 1mM (OR 1.45, p = 0.002, CI 1.15-1.81) and inotropes’ usage (OR 1.22, p = 0.002, CI 1.08-1.39), but not 1mRA, the need of IV or NIV. ROC curve analysis revealed that the GWTG-HF score performed better than ACTION-ICU regarding IHM (AUC 0.774, CI 0.46-0.90 vs AUC 0.731, CI 0.59-0.88) and 1mM (AUC 0.727, CI 0.60-0.85 vs AUC 0.707, CI 0.58-0.84). Conclusion In our population, both scores were able to predict IHM, 1mM and inotropes’ usage.
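The odds ratios and Wald confidence intervals quoted above come straight from exponentiated logistic-regression coefficients; a minimal sketch of that conversion (the coefficient and standard-error values are illustrative, not those of the study):

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    # OR and Wald 95% CI from a logistic-regression coefficient and its SE:
    # OR = exp(beta), CI = exp(beta -/+ z * se)
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Illustrative values: beta ~ 0.113 corresponds to an OR of about 1.12
or_, lo, hi = odds_ratio_ci(0.113, 0.032)
print(round(or_, 2), round(lo, 2), round(hi, 2))  # -> 1.12 1.05 1.19
```

Reading the reported intervals this way also explains why a CI for an OR that excludes 1 corresponds to a significant predictor.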


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Florent Le Borgne ◽  
Arthur Chatton ◽  
Maxime Léger ◽  
Rémi Lenain ◽  
Yohann Foucher

Abstract In clinical research, there is a growing interest in the use of propensity score-based methods to estimate causal effects. G-computation is an alternative because of its high statistical power. Machine learning is also increasingly used because of its possible robustness to model misspecification. In this paper, we aimed to propose an approach that combines machine learning and G-computation when both the outcome and the exposure status are binary and that is able to deal with small samples. We evaluated the performances of several methods, including penalized logistic regressions, a neural network, a support vector machine, boosted classification and regression trees, and a super learner through simulations. We proposed six different scenarios characterised by various sample sizes, numbers of covariates, and relationships between covariates, exposure statuses, and outcomes. We have also illustrated the application of these methods, in which they were used to estimate the efficacy of barbiturates prescribed during the first 24 h of an episode of intracranial hypertension. In the context of G-computation, for estimating the individual outcome probabilities in two counterfactual worlds, we reported that the super learner tended to outperform the other approaches in terms of both bias and variance, especially for small sample sizes. The support vector machine performed well, but its mean bias was slightly higher than that of the super learner. In the investigated scenarios, G-computation associated with the super learner was a performant method for drawing causal inferences, even from small sample sizes.
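G-computation itself is simple once an outcome model is in hand: fit E[Y | A, W], predict every subject's outcome under both exposure levels, and average the difference. The sketch below substitutes a saturated stratum-mean outcome model for the paper's machine-learning learners, with made-up data:

```python
from collections import defaultdict

def g_computation(data):
    # Marginal risk difference by G-computation with a saturated outcome model.
    # data: list of (w, a, y) tuples -- binary confounder w, exposure a, outcome y.
    tot, cnt = defaultdict(float), defaultdict(int)
    for w, a, y in data:
        tot[(a, w)] += y
        cnt[(a, w)] += 1
    mean = {k: tot[k] / cnt[k] for k in cnt}  # fitted E[Y | A=a, W=w]
    # Predict each subject's outcome under a=1 and a=0, standardizing over W
    risk1 = sum(mean[(1, w)] for w, _, _ in data) / len(data)
    risk0 = sum(mean[(0, w)] for w, _, _ in data) / len(data)
    return risk1 - risk0

# Made-up data: exposure raises risk by 0.5 within each stratum of w
data = [(0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 0),
        (1, 0, 1), (1, 0, 0), (1, 1, 1), (1, 1, 1)]
print(g_computation(data))  # -> 0.5
```

Replacing the stratum means with a flexible learner (the paper's super learner) is what allows the same standardization step to work when W is high-dimensional and strata are empty.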

