How to Correct for Chance Agreement in the Estimation of Sensitivity and Specificity of Diagnostic Tests

1994 ◽  
Vol 33 (02) ◽  
pp. 180-186 ◽  
Author(s):  
H. Brenner ◽  
O. Gefeller

Abstract: The traditional concept of describing the validity of a diagnostic test neglects the presence of chance agreement between test result and true (disease) status. Sensitivity and specificity, as the fundamental measures of validity, can thus only be considered in conjunction with each other to provide an appropriate basis for the evaluation of the capacity of the test to discriminate truly diseased from truly undiseased subjects. In this paper, chance-corrected analogues of sensitivity and specificity are presented as supplemental measures of validity, which pay attention to the problem of chance agreement and offer the opportunity to be interpreted separately. While recent proposals of chance-correction techniques, suggested by several authors in this context, lead to measures which are dependent on disease prevalence, our method does not share this major disadvantage. We discuss the extension of the conventional ROC-curve approach to chance-corrected measures of sensitivity and specificity. Furthermore, point and asymptotic interval estimates of the parameters of interest are derived under different sampling frameworks for validation studies. The small sample behavior of the estimates is investigated in a simulation study, leading to a logarithmic modification of the interval estimate in order to hold the nominal confidence level for small samples.
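
The abstract does not reproduce the authors' estimators, so the following Python sketch only illustrates the general idea of correcting an observed proportion for chance agreement; the kappa-style formula, the coin-flip reference level of 0.5, and the counts are assumptions for illustration, not the paper's definitions.

```python
# A minimal sketch, not the authors' exact estimator: ordinary sensitivity and
# specificity from a 2x2 validation table plus a generic kappa-style
# chance-corrected analogue (observed - chance) / (1 - chance). The reference
# chance level of 0.5 (a coin-flip test) is an assumption of this sketch and,
# unlike prevalence-based corrections, does not depend on disease prevalence.

def sensitivity_specificity(tp: int, fp: int, fn: int, tn: int):
    """Return (sensitivity, specificity) from the counts of a 2x2 table."""
    se = tp / (tp + fn)   # P(test positive | diseased)
    sp = tn / (tn + fp)   # P(test negative | not diseased)
    return se, sp


def chance_corrected(value: float, chance: float = 0.5) -> float:
    """Kappa-style correction: excess of `value` over the chance level,
    rescaled so that a perfect test still scores 1."""
    return (value - chance) / (1.0 - chance)


if __name__ == "__main__":
    tp, fp, fn, tn = 90, 20, 10, 80          # hypothetical validation counts
    se, sp = sensitivity_specificity(tp, fp, fn, tn)
    print(f"Se = {se:.2f}, Sp = {sp:.2f}")
    print(f"chance-corrected Se = {chance_corrected(se):.2f}")
    print(f"chance-corrected Sp = {chance_corrected(sp):.2f}")
```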

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Florent Le Borgne ◽  
Arthur Chatton ◽  
Maxime Léger ◽  
Rémi Lenain ◽  
Yohann Foucher

Abstract: In clinical research, there is a growing interest in the use of propensity score-based methods to estimate causal effects. G-computation is an alternative because of its high statistical power. Machine learning is also increasingly used because of its possible robustness to model misspecification. In this paper, we aimed to propose an approach that combines machine learning and G-computation when both the outcome and the exposure status are binary and that is able to deal with small samples. Through simulations, we evaluated the performance of several methods, including penalized logistic regressions, a neural network, a support vector machine, boosted classification and regression trees, and a super learner. We proposed six different scenarios characterised by various sample sizes, numbers of covariates, and relationships between covariates, exposure statuses, and outcomes. We also illustrated the application of these methods by using them to estimate the efficacy of barbiturates prescribed during the first 24 h of an episode of intracranial hypertension. In the context of G-computation, for estimating the individual outcome probabilities in the two counterfactual worlds, we found that the super learner tended to outperform the other approaches in terms of both bias and variance, especially for small sample sizes. The support vector machine also performed well, but its mean bias was slightly higher than that of the super learner. In the investigated scenarios, G-computation combined with the super learner was a performant method for drawing causal inferences, even from small sample sizes.
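
As an illustration of the G-computation step described above, here is a minimal sketch, not the authors' code: a flexible outcome model is fitted, individual outcome probabilities are predicted in the two counterfactual worlds, and their averages are contrasted. A gradient-boosting classifier stands in for the super learner, and the data are simulated.

```python
# Minimal G-computation sketch under stated assumptions: the outcome model is a
# gradient-boosting classifier (a stand-in for the super learner), the data are
# simulated, and the estimand is the marginal risk difference between the two
# counterfactual worlds (everyone exposed vs. no one exposed).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 200                                                  # deliberately small sample
X = rng.normal(size=(n, 3))                              # baseline covariates
a = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))          # exposure depends on X1
logit_y = -0.5 + 0.8 * a + 0.6 * X[:, 0] - 0.4 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit_y)))          # binary outcome

# 1) Fit the outcome model Q(A, X) = P(Y = 1 | A, X).
q_model = GradientBoostingClassifier(random_state=0)
q_model.fit(np.column_stack([a, X]), y)

# 2) Predict individual outcome probabilities in the two counterfactual worlds.
p1 = q_model.predict_proba(np.column_stack([np.ones(n), X]))[:, 1]   # set A = 1
p0 = q_model.predict_proba(np.column_stack([np.zeros(n), X]))[:, 1]  # set A = 0

# 3) Average over the sample to obtain the marginal causal contrast.
risk_difference = p1.mean() - p0.mean()
print(f"G-computation marginal risk difference: {risk_difference:.3f}")
```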


Medicina ◽  
2021 ◽  
Vol 57 (5) ◽  
pp. 503
Author(s):  
Thomas F. Monaghan ◽  
Syed N. Rahman ◽  
Christina W. Agudelo ◽  
Alan J. Wein ◽  
Jason M. Lazar ◽  
...  

Sensitivity, which denotes the proportion of subjects correctly given a positive assignment out of all subjects who are actually positive for the outcome, indicates how well a test can classify subjects who truly have the outcome of interest. Specificity, which denotes the proportion of subjects correctly given a negative assignment out of all subjects who are actually negative for the outcome, indicates how well a test can classify subjects who truly do not have the outcome of interest. Positive predictive value reflects the proportion of subjects with a positive test result who truly have the outcome of interest. Negative predictive value reflects the proportion of subjects with a negative test result who truly do not have the outcome of interest. Sensitivity and specificity are inversely related, such that one increases as the other decreases, but both are generally considered stable for a given test, whereas positive and negative predictive values inherently vary with pre-test probability (e.g., changes in population disease prevalence). This article will further detail the concepts of sensitivity, specificity, and predictive values using a recent real-world example from the medical literature.
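
A short worked sketch of these definitions, using hypothetical counts: sensitivity and specificity are computed from a 2x2 table, and the predictive values are then recomputed at several pre-test probabilities to show how they, unlike sensitivity and specificity, shift with prevalence.

```python
# Hypothetical 2x2 table and Bayes-rule predictive values; all numbers are
# illustrative, not taken from the article's real-world example.

def ppv_npv(se: float, sp: float, prevalence: float):
    """Predictive values from sensitivity, specificity, and prevalence."""
    ppv = se * prevalence / (se * prevalence + (1 - sp) * (1 - prevalence))
    npv = sp * (1 - prevalence) / (sp * (1 - prevalence) + (1 - se) * prevalence)
    return ppv, npv


tp, fp, fn, tn = 45, 15, 5, 135            # hypothetical counts
se = tp / (tp + fn)                        # sensitivity = 0.90
sp = tn / (tn + fp)                        # specificity = 0.90
for prev in (0.01, 0.10, 0.50):
    ppv, npv = ppv_npv(se, sp, prev)
    print(f"prevalence {prev:4.2f}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```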


2011 ◽  
Vol 6 (2) ◽  
pp. 252-277 ◽  
Author(s):  
Stephen T. Ziliak

Abstract: Student's exacting theory of errors, both random and real, marked a significant advance over ambiguous reports of plant life and fermentation asserted by chemists from Priestley and Lavoisier down to Pasteur and Johannsen, working at the Carlsberg Laboratory. One reason seems to be that William Sealy Gosset (1876–1937), aka "Student" – he of Student's t-table and test of statistical significance – rejected artificial rules about sample size, experimental design, and the level of significance, and took instead an economic approach to the logic of decisions made under uncertainty. In his job as Apprentice Brewer, Head Experimental Brewer, and finally Head Brewer of Guinness, Student produced small samples of experimental barley, malt, and hops, seeking guidance for industrial quality control and maximum expected profit at the large-scale brewery. In the process Student invented or inspired half of modern statistics. This article draws on original archival evidence, shedding light on several core yet neglected aspects of Student's methods, that is, Guinnessometrics, not discussed by Ronald A. Fisher (1890–1962). The focus is on Student's small-sample, economic approach to real error minimization, particularly in field and laboratory experiments he conducted on barley and malt, 1904 to 1937. Balanced designs of experiments, he found, are more efficient than random designs and have higher power to detect large and real treatment differences in a series of repeated and independent experiments. Student's world-class achievement poses a challenge to every science. Should statistical methods – such as the choice of sample size, experimental design, and level of significance – follow the purpose of the experiment, rather than the other way around? (JEL classification codes: C10, C90, C93, L66)


2016 ◽  
Vol 41 (5) ◽  
pp. 472-505 ◽  
Author(s):  
Elizabeth Tipton ◽  
Kelly Hallberg ◽  
Larry V. Hedges ◽  
Wendy Chan

Background: Policy makers and researchers are frequently interested in understanding how effective a particular intervention may be for a specific population. One approach is to assess the degree of similarity between the sample in an experiment and the population. Another approach is to combine information from the experiment and the population to estimate the population average treatment effect (PATE). Method: Several methods for assessing the similarity between a sample and population currently exist, as do methods for estimating the PATE. In this article, we investigate properties of six of these methods and statistics in the small sample sizes common in education research (i.e., 10–70 sites), evaluating the utility of rules of thumb developed from observational studies in the generalization case. Results: In small random samples, large differences between the sample and population can arise simply by chance, and many of the statistics commonly used in generalization are a function of both sample size and the number of covariates being compared. The rules of thumb developed in observational studies (which are commonly applied in generalization) are much too conservative given the small sample sizes found in generalization. Conclusion: This article implies that sharp inferences to large populations from small experiments are difficult even with probability sampling. Features of random samples should be kept in mind when evaluating the extent to which results from experiments conducted on nonrandom samples might generalize.
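
As a rough illustration of the first result, the sketch below computes one statistic commonly used to compare a sample with its target population, the absolute standardized mean difference per covariate. It is not necessarily one of the six methods studied in the article, and the data are simulated; even a genuinely random sample of 30 sites can show "large" differences by chance alone.

```python
# Illustrative sketch: absolute standardized mean differences between a small
# random sample and the population it was drawn from. Population size, number
# of covariates, and sample size are arbitrary choices for demonstration.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(size=(10_000, 4))                          # population covariates
sample = population[rng.choice(10_000, size=30, replace=False)]    # small random sample


def standardized_mean_differences(sample: np.ndarray, population: np.ndarray) -> np.ndarray:
    """|mean_sample - mean_population| / SD_population, per covariate."""
    return np.abs(sample.mean(axis=0) - population.mean(axis=0)) / population.std(axis=0)


smd = standardized_mean_differences(sample, population)
print(np.round(smd, 2))   # some SMDs look sizable despite true random sampling
```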


PEDIATRICS ◽  
1989 ◽  
Vol 83 (3) ◽  
pp. A72-A72
Author(s):  
Student

The believer in the law of small numbers practices science as follows:
1. He gambles his research hypotheses on small samples without realizing that the odds against him are unreasonably high. He overestimates power.
2. He has undue confidence in early trends (e.g., the data of the first few subjects) and in the stability of observed patterns (e.g., the number and identity of significant results). He overestimates significance.
3. In evaluating replications, his or others', he has unreasonably high expectations about the replicability of significant results. He underestimates the breadth of confidence intervals.
4. He rarely attributes a deviation of results from expectations to sampling variability, because he finds a causal "explanation" for any discrepancy. Thus, he has little opportunity to recognize sampling variation in action. His belief in the law of small numbers, therefore, will forever remain intact.


2018 ◽  
Vol 2018 ◽  
pp. 1-10
Author(s):  
Lifeng Wu ◽  
Yan Chen

To deal with forecasting from small samples in the supply chain, three grey models with fractional-order accumulation are presented. Human judgment of future trends is incorporated through the order of accumulation. The output of the proposed models provides decision-makers in the supply chain with more forecasting information for short time periods. Results on practical examples demonstrate that the models deliver remarkable prediction performance compared with the traditional forecasting model.
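
The abstract does not specify the three models, so the following is only a generic sketch of a grey model with fractional-order accumulation (an FGM(1,1)-type model) fitted to a hypothetical demand series; the order r is the quantity into which a forecaster's judgment of the future trend would be incorporated, and its value here is an arbitrary example.

```python
# Generic FGM(1,1)-style sketch: accumulate the series with fractional order r,
# fit the usual GM(1,1) least-squares step, evaluate the time-response
# function, and invert the accumulation to recover forecasts.
import numpy as np


def fractional_accumulation(x: np.ndarray, r: float) -> np.ndarray:
    """r-order accumulated generating operation: convolve x with the
    coefficients of (1 - z)^(-r). r = 1 is the ordinary cumulative sum;
    a negative r inverts the accumulation of order |r|."""
    n = len(x)
    c = np.empty(n)
    c[0] = 1.0
    for j in range(1, n):
        c[j] = c[j - 1] * (r + j - 1) / j
    return np.array([np.dot(c[:k + 1][::-1], x[:k + 1]) for k in range(n)])


def fgm11_forecast(x0: np.ndarray, r: float, horizon: int) -> np.ndarray:
    """Fit a GM(1,1)-type model on the r-order accumulated series and return
    fitted values plus `horizon` out-of-sample forecasts."""
    n = len(x0)
    x1 = fractional_accumulation(x0, r)
    z = 0.5 * (x1[1:] + x1[:-1])                       # background values
    B = np.column_stack([-z, np.ones(n - 1)])
    Y = x1[1:] - x1[:-1]
    a, b = np.linalg.lstsq(B, Y, rcond=None)[0]        # development coefficient, grey input
    k = np.arange(n + horizon)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a  # time-response function
    return fractional_accumulation(x1_hat, -r)         # invert the accumulation


demand = np.array([102.0, 108.0, 117.0, 121.0, 130.0, 138.0])  # hypothetical small sample
print(np.round(fgm11_forecast(demand, r=0.5, horizon=3), 1))
```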


2010 ◽  
Vol 9 ◽  
pp. CIN.S4020 ◽  
Author(s):  
Chen Zhao ◽  
Michael L. Bittner ◽  
Robert S. Chapkin ◽  
Edward R. Dougherty

When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliable because training-data-based error estimators typically perform poorly on small samples, exhibiting optimistic bias or high variance. One way around the problem is to limit the number of features being considered, restrict feature sets to sizes such that all feature sets can be examined by exhaustive search, and report a list of the best-performing feature sets. If the list is short, then it greatly restricts the possible feature sets to be considered as candidates; however, one can expect the lowest error estimates obtained to be optimistically biased, so that there may not be a close-to-optimal feature set on the list. This paper provides a power analysis of this methodology; in particular, it examines the kind of results one should expect to obtain relative to the length of the list and the number of discriminating features among those considered. Two measures are employed. The first is the probability that there is at least one feature set on the list whose true classification error is within some given tolerance of the best feature set; the second is the expected number of feature sets on the list whose true errors are within the given tolerance of the best feature set. These values are plotted as functions of the list length to generate power curves. The results show that, if the number of discriminating features is not too small (that is, the prior biological knowledge is not too poor), then one should expect, with high probability, to find good feature sets. Availability: companion website at http://gsp.tamu.edu/Publications/supplementary/zhao09a/
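
A rough sketch of the restricted exhaustive-search strategy described above, not the authors' code: every feature set up to a small size is scored by a cross-validated error estimate on a small simulated sample, and the L best-scoring sets are reported as the candidate list. The sample size, dimensionality, classifier, and list length L are illustrative choices.

```python
# Exhaustive search over small feature sets with a top-L candidate list.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 40, 20                                   # small sample, many candidate features
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n) > 0).astype(int)  # 2 discriminating features

max_set_size = 2    # keeps exhaustive search over all feature sets feasible
list_length = 5     # L: number of top-ranked feature sets to report

scored = []
for size in range(1, max_set_size + 1):
    for features in combinations(range(d), size):
        accuracy = cross_val_score(LogisticRegression(max_iter=1000),
                                   X[:, list(features)], y, cv=5).mean()
        scored.append((1.0 - accuracy, features))   # store the estimated error

scored.sort()                                       # smallest estimated error first
for error, features in scored[:list_length]:
    print(f"features {features}: estimated error = {error:.3f}")
```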


Entropy ◽  
2018 ◽  
Vol 20 (8) ◽  
pp. 601 ◽  
Author(s):  
Paul Darscheid ◽  
Anneli Guthke ◽  
Uwe Ehret

When constructing discrete (binned) distributions from samples of a data set, applications exist where it is desirable to assure that all bins of the sample distribution have nonzero probability. This is the case, for example, if the sample distribution is part of a predictive model that must return a response over the entire codomain, or if the Kullback–Leibler divergence is used to measure the (dis-)agreement between the sample distribution and the original distribution of the variable, which in that case becomes inconveniently infinite. Several sample-based distribution estimators exist that assure nonzero bin probability, such as adding one counter to each zero-probability bin of the sample histogram, adding a small probability to the sample pdf, smoothing methods such as kernel-density smoothing, or Bayesian approaches based on the Dirichlet and multinomial distributions. Here, we suggest and test an approach based on the Clopper–Pearson method, which makes use of the binomial distribution. Based on the sample distribution, confidence intervals for the bin-occupation probability are calculated. The mean of each confidence interval is a strictly positive estimator of the true bin-occupation probability and is convergent with increasing sample size. For small samples, it converges towards a uniform distribution, i.e., the method effectively applies a maximum entropy approach. We apply this nonzero method and four alternative sample-based distribution estimators to a range of typical distributions (uniform, Dirac, normal, multimodal, and irregular) and measure the effect with the Kullback–Leibler divergence. While the performance of each method strongly depends on the distribution type it is applied to, on average, and especially for small sample sizes, the nonzero method, the simple "add one counter" method, and the Bayesian Dirichlet-multinomial model show very similar behavior and perform best. We conclude that, when estimating distributions without an a priori idea of their shape, applying one of these methods is favorable.
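
A small sketch of the Clopper–Pearson idea as described in the abstract: per bin, an exact binomial confidence interval for the occupation probability is computed and its midpoint taken as a strictly positive estimate. The final renormalization so that the estimates sum to one is an assumption of this sketch, not a detail taken from the abstract.

```python
# Strictly positive bin-probability estimates from Clopper-Pearson interval
# midpoints; the example counts and the renormalization step are illustrative.
import numpy as np
from scipy.stats import beta


def clopper_pearson(k: float, n: float, alpha: float = 0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lo, hi


def nonzero_bin_probabilities(counts, alpha: float = 0.05) -> np.ndarray:
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    midpoints = np.array([sum(clopper_pearson(k, n, alpha)) / 2 for k in counts])
    return midpoints / midpoints.sum()     # assumed: renormalize to a distribution


sample_counts = np.array([7, 0, 2, 1, 0])  # small sample with two empty bins
print(np.round(nonzero_bin_probabilities(sample_counts), 3))
```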


2020 ◽  
Vol 30 (1) ◽  
pp. e38066
Author(s):  
Jimmie Leppink

Research in education is often associated with comparing group averages and linear relations in sufficiently large samples, and evidence-based practice is about using the outcomes of that research in the practice of education. However, there are questions that are important for the practice of education that cannot really be addressed by comparisons of group averages and linear relations, no matter how large the samples. Besides, different types of constraints, including logistical, financial, and ethical ones, may make larger-sample research unfeasible or at least questionable. What has remained less known in many fields is that there are study designs and statistical methods for research involving small samples or even individuals that allow us to address questions of importance for the practice of education. This article discusses one type of such situation and provides a simple, coherent statistical approach that yields point and interval estimates of differences of interest regardless of the type of outcome variable and that is also of use in other types of studies involving large samples, small samples, and single individuals.

