Power, Effect Size, P-Values, and Estimating Required Sample Size Using Python

2019 ◽  
Vol 3 ◽  
Author(s):  
Jessica K. Witt

What is the best criterion for determining statistical significance? In psychology, the criterion has been p < .05. This criterion has been criticized since its inception, and the criticisms have been rejuvenated by recent failures to replicate studies published in top psychology journals. Several replacement criteria have been suggested, including reducing the alpha level to .005 or switching to other types of criteria such as Bayes factors or effect sizes. Here, various decision criteria for statistical significance were evaluated using signal detection analysis on the outcomes of simulated data. The signal detection measure of area under the curve (AUC) is a measure of discriminability, with a value of 1 indicating perfect discriminability and 0.5 indicating chance performance. Applied to criteria for statistical significance, it provides an estimate of the decision criterion's performance in discriminating real effects from null effects. AUCs were high (M = .96, median = .97) for p values, suggesting merit in using p values to discriminate real from null effects. AUCs can be used to assess methodological questions such as how much improvement will be gained with increased sample size, how much discriminability will be lost to questionable research practices, and whether it is better to run a single high-powered study or a study plus a replication, each at lower power. AUCs were also used to compare performance across p values, Bayes factors, and effect size (Cohen's d). AUCs were equivalent for p values and Bayes factors and were slightly higher for effect size. Signal detection analysis provides separate measures of discriminability and bias. With respect to bias, the specific thresholds that produced maximal utility depended on sample size, although this dependency was particularly notable for p values and less so for Bayes factors.
The application of signal detection theory to the issue of statistical significance highlights the need to focus on both false alarms and misses, rather than false alarms alone.
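The discriminability idea generalizes to a short simulation. The sketch below is a minimal illustration, not the paper's actual design: the effect size (d = 0.5), group size (n = 30), and simulation count are assumed values chosen for speed. It generates p-values under a real effect and under the null, then computes the AUC as the probability that a randomly chosen "real" study yields a smaller p-value than a randomly chosen "null" study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_p_values(effect, n_per_group, n_sims):
    """p-values from n_sims two-sample t-tests at the given effect size."""
    ps = np.empty(n_sims)
    for i in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect, 1.0, n_per_group)
        ps[i] = stats.ttest_ind(a, b).pvalue
    return ps

p_real = simulate_p_values(0.5, 30, 2000)  # real effect, d = 0.5
p_null = simulate_p_values(0.0, 30, 2000)  # null effect

# AUC: probability that a random "real" study produces a smaller
# p-value than a random "null" study (1.0 = perfect, 0.5 = chance)
auc = (p_real[:, None] < p_null[None, :]).mean()
print(f"AUC = {auc:.3f}")
```

Note that the AUC computed this way is well above the power of any single study at p < .05, since it summarizes discriminability across all possible thresholds rather than at one fixed criterion.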


Mathematics ◽  
2021 ◽  
Vol 9 (6) ◽  
pp. 603
Author(s):  
Leonid Hanin

I uncover previously underappreciated systematic sources of false and irreproducible results in the natural, biomedical, and social sciences that are rooted in statistical methodology. They include the inevitably occurring deviations from the basic assumptions behind statistical analyses and the use of various approximations. I show through a number of examples that (a) arbitrarily small deviations from distributional homogeneity can lead to arbitrarily large deviations in the outcomes of statistical analyses; (b) samples of random size may violate the Law of Large Numbers and thus are generally unsuitable for conventional statistical inference; (c) the same is true, in particular, when random sample size and observations are stochastically dependent; and (d) the use of the Gaussian approximation based on the Central Limit Theorem has dramatic implications for p-values and statistical significance, essentially making the pursuit of small significance levels and p-values at a fixed sample size meaningless. The latter is proven rigorously in the case of the one-sided Z-test. This article could serve as cautionary guidance to scientists and practitioners employing statistical methods in their work.
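Point (d) can be made concrete with a small simulation. This is an illustrative sketch, not the paper's rigorous proof: the exponential data, sample size n = 50, and z thresholds are all assumed values. For skewed data the one-sided Z-test p-value from the Gaussian approximation and the true p-value (here estimated from the simulated null distribution of the sample mean) diverge sharply in the far tail, which is exactly where very small p-values live.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50

# Sampling distribution of the mean of n i.i.d. Exponential(1)
# observations: true mean 1, sd 1, but noticeably right-skewed.
null_means = rng.exponential(1.0, size=(100_000, n)).mean(axis=1)

# Compare the one-sided Z-test p-value (CLT/Gaussian approximation)
# with the p-value taken from the simulated null distribution.
results = []
for z_crit in (2.0, 3.0, 4.0):
    threshold = 1.0 + z_crit / np.sqrt(n)   # sample mean at this z-score
    p_gauss = stats.norm.sf(z_crit)
    p_true = (null_means >= threshold).mean()
    results.append((z_crit, p_gauss, p_true))
    print(f"z = {z_crit}: Gaussian p = {p_gauss:.1e}, simulated p = {p_true:.1e}")
```

The absolute error of the Gaussian approximation is small everywhere, but its relative error grows without bound in the tail, so a reported p of, say, 10⁻⁵ may be off by an order of magnitude at this sample size.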


2019 ◽  
Author(s):  
Rob Cribbie ◽  
Nataly Beribisky ◽  
Udi Alter

Many authoritative bodies recommend that a sample planning procedure, such as a traditional NHST a priori power analysis, be conducted during the planning stages of a study. Power analysis allows the researcher to estimate how many participants are required in order to detect a minimally meaningful effect size at a specific level of power and Type I error rate. However, there are several drawbacks to the procedure that render it "a mess." Specifically, the identification of the minimally meaningful effect size is often difficult yet unavoidable if the procedure is to be conducted properly, the procedure is not precision-oriented, and it does not guide the researcher to collect as many participants as feasibly possible. In this study, we explore how these three theoretical issues are reflected in applied psychological research in order to better understand whether these issues are concerns in practice. To investigate how power analysis is currently used, this study reviewed the reporting of 443 power analyses in high-impact psychology journals in 2016 and 2017. It was found that researchers rarely use the minimally meaningful effect size as a rationale for the chosen effect in a power analysis. Further, precision-based approaches and collecting the maximum sample size feasible are almost never used in tandem with power analyses. In light of these findings, we suggest that researchers focus on tools beyond traditional power analysis when sample planning, such as collecting the maximum sample size feasible.
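For readers unfamiliar with what such an a priori power analysis actually computes, here is a minimal sketch with assumed inputs (d = 0.4, two-sided alpha = .05, 80% power; these are illustrative values, not figures from the review). It finds the smallest per-group n for a two-sample t-test by evaluating the exact power function via the noncentral t distribution.

```python
import numpy as np
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test at effect size d
    (equal group sizes, exact via the noncentral t distribution)."""
    df = 2 * n_per_group - 2
    nc = d * np.sqrt(n_per_group / 2)          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

# Smallest n per group that detects d = 0.4 with at least 80% power
n = 2
while power_two_sample_t(0.4, n) < 0.80:
    n += 1
print(f"n per group = {n}, power = {power_two_sample_t(0.4, n):.3f}")
```

The drawbacks the abstract lists are visible here: the whole calculation hinges on the chosen d, and the output is a minimum n, which says nothing about the precision of the resulting estimate or about collecting more participants when feasible.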


2021 ◽  
Vol 3 (1) ◽  
pp. 61-89
Author(s):  
Stefan Geiß

Abstract This study uses Monte Carlo simulation techniques to estimate the minimum required levels of intercoder reliability in content analysis data for testing correlational hypotheses, depending on sample size, effect size, and coder behavior under uncertainty. The ensuing procedure is analogous to power calculations for experimental designs. In the most widespread sample size/effect size settings, the rule of thumb that chance-adjusted agreement should be ≥.800 or ≥.667 corresponds to the simulation results, yielding acceptable α and β error rates. However, this simulation allows making precise power calculations that can consider the specifics of each study's context, moving beyond one-size-fits-all recommendations. Studies with low sample sizes and/or low expected effect sizes may need coder agreement above .800 to test a hypothesis with sufficient statistical power. In studies with high sample sizes and/or high expected effect sizes, coder agreement below .667 may suffice. Such calculations can help both in evaluating and in designing studies. Particularly in pre-registered research, higher sample sizes may be used to compensate for low expected effect sizes and/or borderline coding reliability (e.g., when constructs are hard to measure). I supply equations, easy-to-use tables, and R functions to facilitate use of this framework, along with example code as an online appendix.
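The mechanism behind these thresholds can be sketched with a small Monte Carlo of the same flavor (in Python here, although the article supplies R functions; the true correlation rho = .3, n = 100, and the reliability values are illustrative assumptions, not the article's settings). Unreliable coding adds noise to the coded variable, which attenuates the observed correlation and therefore the power of the significance test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def power_with_reliability(rho, reliability, n, n_sims=2000, alpha=0.05):
    """Monte Carlo power for detecting a true correlation rho when
    reliability = share of the coded variable's variance that is
    true-score variance (rest is coder noise)."""
    noise_sd = np.sqrt((1 - reliability) / reliability)
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(size=n)
        y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
        x_coded = x + noise_sd * rng.normal(size=n)  # imperfect coding
        r, p = stats.pearsonr(x_coded, y)
        hits += p < alpha
    return hits / n_sims

powers = [power_with_reliability(0.3, rel, 100) for rel in (1.0, 0.8, 0.667)]
print(powers)  # power drops as coding reliability drops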


2018 ◽  
Vol 7 (3) ◽  
pp. 63-69
Author(s):  
Suzanne L. Havstad ◽  
George W. Divine

ABSTRACT In this first of a two-part series on introductory biostatistics, we briefly describe common designs. The advantages and disadvantages of six design types are highlighted. The randomized clinical trial is the gold standard to which other designs are compared. We present the benefits of randomization and discuss the importance of power and sample size. Sample size and power calculations for any design need to be based on meaningful effects of interest. We give examples of how the effect of interest and the sample size interrelate. We also define concepts helpful to the statistical inference process. When drawing conclusions from a completed study, P values, point estimates, and confidence intervals will all assist the researcher. Finally, the issue of multiple comparisons is briefly explored. The second paper in this series will describe basic analytical techniques and discuss some common mistakes in the interpretation of data.
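One way the effect of interest and the sample size interrelate can be shown with a back-of-the-envelope calculation (a sketch using the standard normal-approximation formula for a two-group comparison, with assumed alpha = .05 and 80% power; the effect sizes are illustrative): because the required n scales with 1/d², halving the effect size roughly quadruples the sample size.

```python
import numpy as np
from scipy import stats

# Approximate per-group sample size for a two-group comparison:
# n ≈ 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2
alpha, power = 0.05, 0.80
z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
for d in (0.8, 0.4, 0.2):
    n = 2 * z**2 / d**2
    print(f"d = {d}: about {np.ceil(n):.0f} per group")
```

This is why basing the calculation on a meaningful effect of interest matters: an overoptimistic d makes the study look feasible while leaving it underpowered for the effect that actually exists.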

