Estimating Completion Rates from Small Samples Using Binomial Confidence Intervals: Comparisons and Recommendations

Author(s):  
Jeff Sauro ◽  
James R. Lewis

The completion rate — the proportion of participants who successfully complete a task — is a common usability measurement. As is true for any point measurement, practitioners should compute appropriate confidence intervals for completion rate data. For proportions such as the completion rate, the appropriate interval is a binomial confidence interval. The most widely-taught method for calculating binomial confidence intervals (the “Wald Method,” discussed both in introductory statistics texts and in the human factors literature) grossly understates the width of the true interval when sample sizes are small. Alternative “exact” methods over-correct the problem by providing intervals that are too conservative. This can result in practitioners unintentionally accepting interfaces that are unusable or rejecting interfaces that are usable. We examined alternative methods for building confidence intervals from small sample completion rates, using Monte Carlo methods to sample data from a number of real, large-sample usability tests. It appears that the best method for practitioners to compute 95% confidence intervals for small-sample completion rates is to add two successes and two failures to the observed completion rate, then compute the confidence interval using the Wald method (the “Adjusted Wald Method”). This simple approach provides the best coverage, is fairly easy to compute, and agrees with other analyses in the statistics literature.
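For practitioners who want to apply the recommendation directly, the computation is short. The Python sketch below (with an illustrative function name) adds roughly two successes and two failures (more precisely, z^2/2 of each) to the observed data before applying the ordinary Wald formula.

    import math

    def adjusted_wald_ci(successes, n, z=1.96):
        """Adjusted Wald (Agresti-Coull) interval for a completion rate.

        Adds z^2/2 successes and z^2/2 failures (about 2 and 2 for a 95%
        interval) and then applies the ordinary Wald formula.
        """
        adj_n = n + z ** 2                          # roughly n + 4
        adj_p = (successes + z ** 2 / 2.0) / adj_n  # adjusted completion rate
        half_width = z * math.sqrt(adj_p * (1 - adj_p) / adj_n)
        return max(0.0, adj_p - half_width), min(1.0, adj_p + half_width)

    # Example: 4 of 5 participants completed the task.
    print(adjusted_wald_ci(4, 5))   # roughly (0.36, 0.98)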

PEDIATRICS ◽  
1989 ◽  
Vol 83 (3) ◽  
pp. A72-A72
Author(s):  
Student

The believer in the law of small numbers practices science as follows: 1. He gambles his research hypotheses on small samples without realizing that the odds against him are unreasonably high. He overestimates power. 2. He has undue confidence in early trends (e.g., the data of the first few subjects) and in the stability of observed patterns (e.g., the number and identity of significant results). He overestimates significance. 3. In evaluating replications, his or others', he has unreasonably high expectations about the replicability of significant results. He underestimates the breadth of confidence intervals. 4. He rarely attributes a deviation of results from expectations to sampling variability, because he finds a causal "explanation" for any discrepancy. Thus, he has little opportunity to recognize sampling variation in action. His belief in the law of small numbers, therefore, will forever remain intact.


2021 ◽  
pp. 001316442110338
Author(s):  
Zhehan Jiang ◽  
Mark Raymond ◽  
Christine DiStefano ◽  
Dexin Shi ◽  
Ren Liu ◽  
...  

Computing confidence intervals around generalizability coefficients has long been a challenging task in generalizability theory. This is a serious practical problem because generalizability coefficients are often computed from designs where some facets have small sample sizes, and researchers have little guidance regarding the trustworthiness of the coefficients. Because generalizability theory can be framed as a linear mixed-effects model (LMM), bootstrap and simulation techniques from the LMM paradigm can be used to construct the confidence intervals. The purpose of this research is to examine four different LMM-based methods for computing the confidence intervals that have been proposed and to determine their accuracy under six simulated conditions based on the type of test scores (normal, dichotomous, and polytomous data) and the measurement design (p × i × r and p × [i:r]). A bootstrap technique called “parametric methods with spherical random effects” consistently produced more accurate confidence intervals than the three other LMM-based methods. Furthermore, the selected technique was compared with a model-based approach in a second simulation study, in which the numbers of examinees, raters, and items were varied, to investigate performance at the level of the variance components. We conclude with the recommendation that, when reporting a generalizability coefficient, the confidence interval should accompany the point estimate.
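As a concrete illustration of the bootstrap idea (though not the paper's preferred “parametric methods with spherical random effects” procedure), the Python sketch below runs a plain parametric bootstrap for the relative G coefficient of a fully crossed persons-by-items design with normally distributed scores. Variance components are estimated from ANOVA mean squares, and the data and sample sizes are invented for illustration.

    import numpy as np

    def variance_components(x):
        """ANOVA estimates of person, item, and residual variance for a
        fully crossed persons-by-items score matrix x."""
        n_p, n_i = x.shape
        grand = x.mean()
        p_means, i_means = x.mean(axis=1), x.mean(axis=0)
        ms_p = n_i * np.sum((p_means - grand) ** 2) / (n_p - 1)
        ms_i = n_p * np.sum((i_means - grand) ** 2) / (n_i - 1)
        resid = x - p_means[:, None] - i_means[None, :] + grand
        ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))
        return (max((ms_p - ms_res) / n_i, 0.0),
                max((ms_i - ms_res) / n_p, 0.0),
                ms_res)

    def g_coefficient(x):
        """Relative G coefficient: var_p / (var_p + var_residual / n_items)."""
        var_p, _, var_res = variance_components(x)
        return var_p / (var_p + var_res / x.shape[1])

    def parametric_bootstrap_ci(x, n_boot=2000, alpha=0.05, seed=0):
        """Percentile interval for the G coefficient, obtained by re-simulating
        the crossed design from its estimated variance components."""
        rng = np.random.default_rng(seed)
        n_p, n_i = x.shape
        var_p, var_i, var_res = variance_components(x)
        stats = []
        for _ in range(n_boot):
            sim = (x.mean()
                   + rng.normal(0, np.sqrt(var_p), (n_p, 1))       # person effects
                   + rng.normal(0, np.sqrt(var_i), (1, n_i))       # item effects
                   + rng.normal(0, np.sqrt(var_res), (n_p, n_i)))  # residuals
            stats.append(g_coefficient(sim))
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

    # Invented example: 20 persons scored on 6 items.
    rng = np.random.default_rng(1)
    scores = (rng.normal(0, 1.0, (20, 1)) + rng.normal(0, 0.5, (1, 6))
              + rng.normal(0, 1.0, (20, 6)))
    print(g_coefficient(scores), parametric_bootstrap_ci(scores))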


1994 ◽  
Vol 33 (02) ◽  
pp. 180-186 ◽  
Author(s):  
H. Brenner ◽  
O. Gefeller

Abstract: The traditional concept of describing the validity of a diagnostic test neglects the presence of chance agreement between test result and true (disease) status. Sensitivity and specificity, as the fundamental measures of validity, can thus only be considered in conjunction with each other to provide an appropriate basis for the evaluation of the capacity of the test to discriminate truly diseased from truly undiseased subjects. In this paper, chance-corrected analogues of sensitivity and specificity are presented as supplemental measures of validity, which address the problem of chance agreement and can be interpreted separately. While recent proposals of chance-correction techniques, suggested by several authors in this context, lead to measures which are dependent on disease prevalence, our method does not share this major disadvantage. We discuss the extension of the conventional ROC-curve approach to chance-corrected measures of sensitivity and specificity. Furthermore, point and asymptotic interval estimates of the parameters of interest are derived under different sampling frameworks for validation studies. The small-sample behavior of the estimates is investigated in a simulation study, leading to a logarithmic modification of the interval estimate in order to hold the nominal confidence level for small samples.
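The prevalence-independent corrections the authors derive are not spelled out in the abstract. Purely to illustrate what correcting a validity measure for chance agreement means, the snippet below applies the generic rescaling behind Cohen's kappa (observed minus chance, divided by one minus chance); the numbers, and the chance level in particular, are arbitrary assumptions, and this is not the specific measure proposed in the paper.

    def chance_corrected(observed, chance):
        """Generic chance correction (the rescaling used by Cohen's kappa):
        the excess of the observed proportion over the chance proportion,
        expressed as a fraction of the maximum possible excess."""
        return (observed - chance) / (1.0 - chance)

    # Hypothetical numbers: observed sensitivity 0.85 against an assumed
    # chance level of 0.40 yields a chance-corrected value of 0.75.
    print(chance_corrected(0.85, 0.40))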


Genetics ◽  
1998 ◽  
Vol 148 (1) ◽  
pp. 525-535
Author(s):  
Claude M Lebreton ◽  
Peter M Visscher

Abstract: Several nonparametric bootstrap methods are tested to obtain better confidence intervals for the quantitative trait loci (QTL) positions, i.e., with minimal width and unbiased coverage probability. Two selective resampling schemes are proposed as a means of conditioning the bootstrap on the number of genetic factors in our model inferred from the original data. The selection is based on criteria related to the estimated number of genetic factors, and only the retained bootstrapped samples will contribute a value to the empirically estimated distribution of the QTL position estimate. These schemes are compared with a nonselective scheme across a range of simple configurations of one QTL on a one-chromosome genome. In particular, the effect of the chromosome length and the relative position of the QTL are examined for a given experimental power, which determines the confidence interval size. With the test protocol used, it appears that the selective resampling schemes are either unbiased or least biased when the QTL is situated near the middle of the chromosome. When the QTL is closer to one end, the likelihood curve of its position along the chromosome becomes truncated, and the nonselective scheme then performs better inasmuch as the percentage of estimated confidence intervals that actually contain the real QTL's position is closer to expectation. The nonselective method, however, produces larger confidence intervals. Hence, we advocate use of the selective methods, regardless of the QTL position along the chromosome (to reduce confidence interval sizes), but we leave the problem open as to how the method should be altered to take into account the bias of the original estimate of the QTL's position.
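The selective-resampling idea (keep only those bootstrap samples that satisfy a criterion tied to the inferred genetic model, and build the empirical distribution of the position estimate from them) can be sketched generically. In the Python below, the estimate and keep callbacks and the toy data are hypothetical stand-ins for the QTL position estimator and selection criteria actually used in the paper.

    import numpy as np

    def selective_bootstrap_ci(data, estimate, keep, n_boot=2000,
                               alpha=0.05, max_tries=50000, seed=0):
        """Percentile interval from a bootstrap in which only resamples that
        pass a selection criterion contribute to the empirical distribution."""
        rng = np.random.default_rng(seed)
        kept, tries = [], 0
        while len(kept) < n_boot and tries < max_tries:
            tries += 1
            resample = rng.choice(data, size=len(data), replace=True)
            if keep(resample):                 # condition on the criterion
                kept.append(estimate(resample))
        return np.quantile(kept, [alpha / 2, 1 - alpha / 2])

    # Toy usage: a location estimate, retaining only resamples whose estimate
    # is positive (a stand-in for "a single QTL inferred" in the real setting).
    data = np.random.default_rng(1).normal(loc=1.0, scale=2.0, size=30)
    print(selective_bootstrap_ci(data, np.mean, lambda r: r.mean() > 0))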


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Florent Le Borgne ◽  
Arthur Chatton ◽  
Maxime Léger ◽  
Rémi Lenain ◽  
Yohann Foucher

Abstract: In clinical research, there is a growing interest in the use of propensity score-based methods to estimate causal effects. G-computation is an alternative because of its high statistical power. Machine learning is also increasingly used because of its possible robustness to model misspecification. In this paper, we aimed to propose an approach that combines machine learning and G-computation when both the outcome and the exposure status are binary and that is able to deal with small samples. We evaluated the performances of several methods, including penalized logistic regressions, a neural network, a support vector machine, boosted classification and regression trees, and a super learner through simulations. We proposed six different scenarios characterised by various sample sizes, numbers of covariates and relationships between covariates, exposure statuses, and outcomes. We have also illustrated the application of these methods, in which they were used to estimate the efficacy of barbiturates prescribed during the first 24 h of an episode of intracranial hypertension. In the context of G-computation, for estimating the individual outcome probabilities in two counterfactual worlds, we reported that the super learner tended to outperform the other approaches in terms of both bias and variance, especially for small sample sizes. The support vector machine performed well, but its mean bias was slightly higher than that of the super learner. In the investigated scenarios, G-computation combined with the super learner was an effective method for drawing causal inferences, even from small sample sizes.
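For readers unfamiliar with G-computation, the sketch below shows its mechanics on synthetic data, using a plain scikit-learn logistic regression as a stand-in for the super learner the authors favour: fit an outcome model, predict every subject's outcome probability under exposure and under non-exposure, and average the two sets of predictions. The data-generating values are invented, and the bootstrap normally used for confidence intervals is omitted.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic data: two covariates, a binary exposure, a binary outcome.
    n = 200
    X = rng.normal(size=(n, 2))
    a = rng.binomial(1, 1 / (1 + np.exp(-0.5 * X[:, 0])))          # exposure
    logit_y = -0.5 + 1.0 * a + 0.8 * X[:, 0] - 0.4 * X[:, 1]
    y = rng.binomial(1, 1 / (1 + np.exp(-logit_y)))                # outcome

    # 1. Fit an outcome model for P(Y = 1 | A, X).
    outcome_model = LogisticRegression().fit(np.column_stack([a, X]), y)

    # 2. Predict each subject's outcome probability in the two counterfactual
    #    worlds: everyone exposed versus everyone unexposed.
    p1 = outcome_model.predict_proba(np.column_stack([np.ones(n), X]))[:, 1]
    p0 = outcome_model.predict_proba(np.column_stack([np.zeros(n), X]))[:, 1]

    # 3. Average over the sample to obtain marginal causal contrasts.
    print("marginal risk difference:", p1.mean() - p0.mean())
    print("marginal odds ratio:",
          (p1.mean() / (1 - p1.mean())) / (p0.mean() / (1 - p0.mean())))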


2011 ◽  
Vol 6 (2) ◽  
pp. 252-277 ◽  
Author(s):  
Stephen T. Ziliak

Abstract: Student's exacting theory of errors, both random and real, marked a significant advance over ambiguous reports of plant life and fermentation asserted by chemists from Priestley and Lavoisier down to Pasteur and Johannsen, working at the Carlsberg Laboratory. One reason seems to be that William Sealy Gosset (1876–1937), aka “Student” – he of Student's t-table and test of statistical significance – rejected artificial rules about sample size, experimental design, and the level of significance, and took instead an economic approach to the logic of decisions made under uncertainty. In his job as Apprentice Brewer, Head Experimental Brewer, and finally Head Brewer of Guinness, Student produced small samples of experimental barley, malt, and hops, seeking guidance for industrial quality control and maximum expected profit at the large-scale brewery. In the process Student invented or inspired half of modern statistics. This article draws on original archival evidence, shedding light on several core yet neglected aspects of Student's methods, that is, Guinnessometrics, not discussed by Ronald A. Fisher (1890–1962). The focus is on Student's small-sample, economic approach to real error minimization, particularly in field and laboratory experiments he conducted on barley and malt from 1904 to 1937. Balanced designs of experiments, he found, are more efficient than random designs and have higher power to detect large and real treatment differences in a series of repeated and independent experiments. Student's world-class achievement poses a challenge to every science. Should statistical methods – such as the choice of sample size, experimental design, and level of significance – follow the purpose of the experiment, rather than the other way around? (JEL classification codes: C10, C90, C93, L66)


2016 ◽  
Vol 41 (5) ◽  
pp. 472-505 ◽  
Author(s):  
Elizabeth Tipton ◽  
Kelly Hallberg ◽  
Larry V. Hedges ◽  
Wendy Chan

Background: Policy makers and researchers are frequently interested in understanding how effective a particular intervention may be for a specific population. One approach is to assess the degree of similarity between the sample in an experiment and the population. Another approach is to combine information from the experiment and the population to estimate the population average treatment effect (PATE). Method: Several methods for assessing the similarity between a sample and a population currently exist, as well as methods for estimating the PATE. In this article, we investigate the properties of six of these methods and statistics in the small sample sizes common in education research (i.e., 10–70 sites), evaluating the utility of rules of thumb developed from observational studies in the generalization case. Result: In small random samples, large differences between the sample and population can arise simply by chance, and many of the statistics commonly used in generalization are a function of both the sample size and the number of covariates being compared. The rules of thumb developed in observational studies (which are commonly applied in generalization) are much too conservative given the small sample sizes found in generalization. Conclusion: This article implies that sharp inferences to large populations from small experiments are difficult even with probability sampling. Features of random samples should be kept in mind when evaluating the extent to which results from experiments conducted on nonrandom samples might generalize.
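A common ingredient in such sample-population comparisons is the covariate-by-covariate standardized mean difference, the quantity behind the usual observational-study rules of thumb; whether it is exactly one of the six statistics studied here is not stated in the abstract, so the Python sketch below (with made-up site-level covariates) is only a generic illustration of the article's central caution: a genuinely random sample of 30 sites can still show noticeable imbalance by chance.

    import numpy as np

    def standardized_mean_differences(sample, population):
        """Absolute standardized mean difference for each covariate column,
        scaled by the population standard deviation."""
        sample, population = np.asarray(sample), np.asarray(population)
        pop_sd = population.std(axis=0, ddof=1)
        return np.abs(sample.mean(axis=0) - population.mean(axis=0)) / pop_sd

    # Made-up example: 6 site-level covariates, a population of 5,000 sites,
    # and a true simple random sample of 30 sites.
    rng = np.random.default_rng(0)
    population = rng.normal(size=(5000, 6))
    sample = population[rng.choice(5000, size=30, replace=False)]
    print(standardized_mean_differences(sample, population).round(2))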


2018 ◽  
Vol 2018 ◽  
pp. 1-10
Author(s):  
Lifeng Wu ◽  
Yan Chen

To deal with forecasting from small samples in the supply chain, three grey models with fractional-order accumulation are presented. Human judgment about future trends is incorporated through the order of accumulation. The output of the proposed models provides decision-makers in the supply chain with more forecasting information for short time periods. Results on practical examples demonstrate that the models deliver remarkable prediction performance compared with the traditional forecasting model.
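The core device in these models is the fractional-order accumulated generating operation; the accumulation order r, the quantity into which the authors fold human judgment about future trends, controls how strongly older observations are weighted. Below is a minimal Python sketch of the standard r-order accumulation; the demand figures are invented for illustration.

    import math

    def fractional_accumulation(x, r):
        """r-order accumulated generating operation used in fractional grey
        models: each accumulated value is a weighted sum of the raw values,
        with generalized binomial weights C(k - i + r - 1, k - i)."""
        def gen_binom(top, bottom):
            # Generalized binomial coefficient via the gamma function.
            return math.gamma(top + 1) / (math.gamma(bottom + 1)
                                          * math.gamma(top - bottom + 1))
        return [sum(gen_binom(k - i + r - 1, k - i) * x[i] for i in range(k + 1))
                for k in range(len(x))]

    # r = 1 recovers the ordinary cumulative sums of the classical GM(1,1)
    # model (82, 172, 271, 376, 488); a fractional order such as r = 0.5
    # gives older observations progressively smaller weights.
    demand = [82, 90, 99, 105, 112]
    print(fractional_accumulation(demand, 1.0))
    print(fractional_accumulation(demand, 0.5))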

