Estimating Population Mean Power Under Conditions of Heterogeneity and Selection for Significance

2020, Vol 4
Author(s): Jerry Brunner, Ulrich Schimmack

In scientific fields that use significance tests, statistical power is important for successful replication of significant results because it is the long-run success rate in a series of exact replication studies. For any population of significant results, there is a corresponding population of power values of the statistical tests on which the conclusions are based. We give exact theoretical results showing how selection for significance affects the distribution of statistical power in a heterogeneous population of significance tests. In a set of large-scale simulation studies, we compare four methods for estimating the population mean power of a set of studies selected for significance (a maximum likelihood model, extensions of p-curve and p-uniform, and z-curve). The p-uniform and p-curve methods performed well with a fixed effect size and varying sample sizes. However, when there was substantial variability in effect sizes as well as sample sizes, both methods systematically overestimated mean power. With heterogeneity in effect sizes, the maximum likelihood model produced the most accurate estimates when the distribution of effect sizes matched the assumptions of the model, but z-curve produced more accurate estimates when those assumptions were not met. We recommend z-curve for estimating the typical power of significant results, which has implications for the replicability of significant results in psychology journals.
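
As a rough illustration of the selection effect described in this abstract, the following sketch (not the authors' code; the effect-size and sample-size distributions and the z-test approximation are assumptions) simulates a heterogeneous population of studies and compares mean true power before and after keeping only the significant ones.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k = 100_000                                   # size of the hypothetical study population
d = rng.gamma(shape=2.0, scale=0.15, size=k)  # heterogeneous true effect sizes (assumed)
n = rng.integers(20, 200, size=k)             # per-group sample sizes (assumed)

# True power of a two-sided, two-sample z-test at alpha = .05
ncp = d * np.sqrt(n / 2)
z_crit = stats.norm.ppf(0.975)
power = stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

# A study lands in the "significant" population with probability equal to its true power
significant = rng.random(k) < power

print(f"mean power, all studies:      {power.mean():.3f}")
print(f"mean power, significant only: {power[significant].mean():.3f}")
```

Because higher-powered studies are more likely to pass the significance filter, the second mean exceeds the first; this is the distributional shift the paper characterizes exactly.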

2017
Author(s): Ulrich Schimmack, Jerry Brunner

In recent years, the replicability of original findings published in psychology journals has been questioned. A key concern is that selection for significance inflates observed effect sizes and observed power. If selection bias is severe, replication studies are unlikely to reproduce a significant result. We introduce z-curve as a new method that can estimate the average true power of sets of studies that are selected for significance. We compare this method with p-curve, which has been used for the same purpose. Simulation studies show that both methods perform well when all studies have the same power, but p-curve overestimates power when power varies across studies. Based on these findings, we recommend z-curve for estimating the power of sets of studies that are heterogeneous and selected for significance. Application of z-curve to various datasets suggests that the average replicability of published results in psychology is approximately 50%, but there is substantial heterogeneity, and many psychological studies remain underpowered and are likely to produce false negative results. To increase the replicability and credibility of published results, it is important to reduce selection bias and to increase statistical power.
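
A hedged sketch of the inflation problem this abstract describes (this is not the z-curve implementation; the true effect size and per-group sample size are assumptions): conditioning on significance inflates observed effects, so "observed power" computed from published significant results overstates the true power that z-curve tries to recover.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
d_true, n, alpha = 0.3, 50, 0.05              # assumed true effect and per-group sample size
z_crit = stats.norm.ppf(1 - alpha / 2)
se = np.sqrt(2 / n)                           # SE of the standardized effect (z-test approximation)

d_obs = rng.normal(d_true, se, size=200_000)  # observed effects in many exact replications
z_obs = d_obs / se
sig = np.abs(z_obs) > z_crit                  # studies that pass the significance filter

true_power = stats.norm.sf(z_crit - d_true / se) + stats.norm.cdf(-z_crit - d_true / se)
# "Observed power" plugs the published (significant) effect back into the power formula
obs_power = (stats.norm.sf(z_crit - np.abs(z_obs[sig]))
             + stats.norm.cdf(-z_crit - np.abs(z_obs[sig]))).mean()

print(f"true power:                            {true_power:.3f}")
print(f"mean observed power, significant only: {obs_power:.3f}")
```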


2021, Vol 3 (1), pp. 61-89
Author(s): Stefan Geiß

Abstract. This study uses Monte Carlo simulation techniques to estimate the minimum required levels of intercoder reliability in content analysis data for testing correlational hypotheses, depending on sample size, effect size, and coder behavior under uncertainty. The resulting procedure is analogous to power calculations for experimental designs. In the most widespread sample size/effect size settings, the rule of thumb that chance-adjusted agreement should be ≥.800 or ≥.667 corresponds to the simulation results, yielding acceptable α and β error rates. However, the simulation approach allows precise power calculations that take the specifics of each study's context into account, moving beyond one-size-fits-all recommendations. Studies with low sample sizes and/or low expected effect sizes may need coder agreement above .800 to test a hypothesis with sufficient statistical power, whereas in studies with high sample sizes and/or high expected effect sizes, coder agreement below .667 may suffice. Such calculations can help in both evaluating and designing studies. Particularly in pre-registered research, higher sample sizes may be used to compensate for low expected effect sizes and/or borderline coding reliability (e.g., when constructs are hard to measure). I supply equations, easy-to-use tables, and R functions to facilitate use of this framework, along with example code as an online appendix.
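
The study's supplementary R functions are not reproduced here; the following sketch (with an assumed true correlation, sample size, and a simple continuous-measurement model of coder error) illustrates the core mechanism: unreliable coding attenuates the true correlation, which in turn lowers the power of the hypothesis test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def sim_power(rho=0.2, n=100, reliability=0.8, alpha=0.05, reps=2000):
    """Proportion of simulated studies in which the correlation test is significant."""
    err_sd = np.sqrt((1 - reliability) / reliability)  # coder noise that yields the target reliability
    hits = 0
    for _ in range(reps):
        x = rng.standard_normal(n)                                  # true construct scores
        y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # criterion correlated at rho
        x_coded = x + err_sd * rng.standard_normal(n)               # imperfectly coded version of x
        _, p = stats.pearsonr(x_coded, y)
        hits += p < alpha
    return hits / reps

for rel in (1.0, 0.8, 0.667):
    print(f"reliability {rel:.3f}: power ≈ {sim_power(reliability=rel):.2f}")
```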


1995, Vol 55 (5), pp. 773-776
Author(s): Bernard S. Gorman, Louis H. Primavera, David B. Allison

2020, Vol 63 (5), pp. 1572-1580
Author(s): Laura Gaeta, Christopher R. Brydges

Purpose: The purpose was to examine and determine the effect size distributions reported in published audiology and speech-language pathology research in order to provide researchers and clinicians with more relevant guidelines for the interpretation of potentially clinically meaningful findings. Method: Cohen's d, Hedges' g, Pearson r, and sample sizes (n = 1,387) were extracted from 32 meta-analyses in speech-language pathology and audiology journals. Percentile ranks (25th, 50th, 75th) were calculated to provide estimates for small, medium, and large effect sizes, respectively. The median sample size was also used to explore statistical power for small, medium, and large effect sizes. Results: For individual differences research, effect sizes of Pearson r = .24, .41, and .64 were found. For group differences, Cohen's d/Hedges' g = 0.25, 0.55, and 0.93. These values can be interpreted as small, medium, and large effect sizes in speech-language pathology and audiology. The majority of published research was inadequately powered to detect a medium effect size. Conclusions: Effect size interpretations from published research in audiology and speech-language pathology were found to be underestimated based on Cohen's (1988, 1992) guidelines. Researchers in the field should consider using Pearson r = .25, .40, and .65 and Cohen's d/Hedges' g = 0.25, 0.55, and 0.95 as small, medium, and large effect sizes, respectively, and should collect larger samples to ensure that both significant and nonsignificant findings are robust and replicable.
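
A minimal sketch of the percentile approach (the effect-size values and the median sample size below are placeholders, not the article's data): field-specific small/medium/large benchmarks are taken as the 25th/50th/75th percentiles of published effect sizes, and power at the median sample size shows whether typical studies can detect them.

```python
import numpy as np
from scipy import stats

# Placeholder distribution standing in for the extracted Pearson r values
rng = np.random.default_rng(0)
effect_sizes = rng.beta(2, 4, size=1_387)
small, medium, large = np.percentile(effect_sizes, [25, 50, 75])

def power_r(r, n, alpha=0.05):
    """Approximate two-sided power of a Pearson correlation test via Fisher's z."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ncp = np.arctanh(r) * np.sqrt(n - 3)
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

median_n = 40  # assumed median sample size, not the value reported in the article
for label, r in zip(("small", "medium", "large"), (small, medium, large)):
    print(f"{label}: r = {r:.2f}, power at n = {median_n}: {power_r(r, median_n):.2f}")
```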


Author(s): Marc J. Lajeunesse

The common justification for meta-analysis is its increased statistical power to detect effects relative to individual studies. For ecologists and evolutionary biologists, the statistical power of meta-analysis is important because effect sizes in these fields are usually relatively small and experimental sample sizes are often limited for logistical reasons. Consequently, many studies lack sufficient power to detect an experimental effect should it exist. This chapter provides a brief overview of the factors that determine the statistical power of meta-analysis. It presents statistics for calculating the power of pooled effect sizes to evaluate nonzero effects and the power of within- and between-study homogeneity tests. It also surveys ways to improve the statistical power of meta-analysis, and it ends with a discussion of the overall utility of power statistics for meta-analysis.
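
A hedged sketch of the chapter's starting point (the study sizes and the true effect below are assumptions, and the standard error of d is the usual large-sample approximation): because the inverse-variance pooled estimate has a much smaller standard error than any single study, the meta-analytic test of a nonzero effect has higher power.

```python
import numpy as np
from scipy import stats

d_true = 0.2                               # assumed small true standardized mean difference
study_n = np.array([15, 20, 25, 30, 40])   # assumed per-group sample sizes of the primary studies
alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

def power_from_se(effect, se):
    ncp = effect / se
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

se_single = np.sqrt(2 / study_n)           # large-sample SE of d for each primary study
weights = 1 / se_single**2                 # fixed-effect (inverse-variance) weights
se_pooled = np.sqrt(1 / weights.sum())     # SE of the pooled effect size

print("power of each primary study:", np.round(power_from_se(d_true, se_single), 2))
print("power of the pooled effect: ", float(np.round(power_from_se(d_true, se_pooled), 2)))
```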


2021, Vol 11 (1)
Author(s): Anderson Souza Oliveira, Cristina Ioana Pirscoveanu

Abstract. Low reproducibility and non-optimal sample sizes are current concerns in scientific research, especially within human movement studies. This study therefore aimed to examine the implications of different sample sizes and numbers of steps on data variability and statistical outcomes for kinematic and kinetic running biomechanics variables. Forty-four participants ran overground using their preferred technique (normal) and while minimizing contact sound volume (silent). Running speed, peak vertical and braking forces, and vertical average loading rate were extracted from >40 steps per runner. Data stability was computed using a sequential estimation technique. Statistical outcomes (p values and effect sizes) for the normal vs silent comparison were extracted from 100,000 random samples, using various combinations of sample size (from 10 to 40 runners) and number of steps (from 5 to 40 steps). The results showed that only 35% of the study sample reached average stability across all biomechanical variables using up to 10 steps. The loading rate was consistently and significantly lower during silent running than during normal running, with large effect sizes across all combinations. However, variables presenting small or medium effect sizes (running speed and peak braking force) required >20 runners to reach significant differences. Varying sample sizes and numbers of steps thus influence the statistical outcomes of the normal vs silent comparison in a variable-dependent manner. Based on our results, we recommend that studies analyzing traditional running biomechanics variables use a minimum of 25 participants and 25 steps from each participant to provide appropriate data stability and statistical power.
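
A minimal sketch of the sequential estimation idea used to assess data stability (the tolerance band and the simulated step data are assumptions, not the study's dataset): find the smallest number of steps after which the cumulative mean stays within a band around the mean of all available steps.

```python
import numpy as np

def steps_to_stability(values, band_sd=0.25):
    """Smallest step count after which the cumulative mean never leaves
    mean(all steps) +/- band_sd * sd(all steps)."""
    values = np.asarray(values, dtype=float)
    overall_mean, overall_sd = values.mean(), values.std(ddof=1)
    cum_mean = np.cumsum(values) / np.arange(1, len(values) + 1)
    inside = np.abs(cum_mean - overall_mean) <= band_sd * overall_sd
    for i in range(len(values)):
        if inside[i:].all():       # cumulative mean stays inside the band from step i+1 onward
            return i + 1
    return len(values)

# Example with simulated loading-rate-like values for one runner (40 steps, assumed)
rng = np.random.default_rng(3)
steps = rng.normal(80, 8, size=40)
print("steps needed to reach stability:", steps_to_stability(steps))
```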


2019, Vol 50 (5-6), pp. 292-304
Author(s): Mario Wenzel, Marina Lind, Zarah Rowland, Daniela Zahn, Thomas Kubiak

Abstract. Evidence on the existence of the ego depletion phenomenon, as well as on the size of the effect and its potential moderators and mediators, is ambiguous. Building on a crossover design that enables superior statistical power within a single study, we investigated the robustness of the ego depletion effect between and within subjects, as well as moderating and mediating influences of the ego depletion manipulation checks. Our results, based on a sample of 187 participants, demonstrated that (a) the between- and within-subject ego depletion effects had only negligible effect sizes, (b) there was large interindividual variability, and (c) this variability could not be explained by differences in the ego depletion manipulation checks. We discuss the implications of these results and outline a future research agenda.
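
A hedged sketch of why a crossover design gains statistical power (the effect size, cross-condition correlation, and z-test approximation are assumptions; only the sample size of 187 comes from the abstract): the paired contrast removes between-person variance, so the same participants yield a much larger noncentrality than a two-group comparison.

```python
import numpy as np
from scipy import stats

d, n, rho, alpha = 0.2, 187, 0.7, 0.05     # assumed effect size and cross-condition correlation
z_crit = stats.norm.ppf(1 - alpha / 2)

def power(ncp):
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

ncp_between = d * np.sqrt(n / 4)                        # two independent groups of n/2 participants
ncp_within = d * np.sqrt(n) / np.sqrt(2 * (1 - rho))    # paired (crossover) contrast on n participants

print(f"between-subject power: {power(ncp_between):.2f}")
print(f"within-subject power:  {power(ncp_within):.2f}")
```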

