An Examination of Effect Sizes and Statistical Power in Speech, Language, and Hearing Research

2020 ◽  
Vol 63 (5) ◽  
pp. 1572-1580
Author(s):  
Laura Gaeta ◽  
Christopher R. Brydges

Purpose: The purpose was to examine and determine effect size distributions reported in published audiology and speech-language pathology research in order to provide researchers and clinicians with more relevant guidelines for the interpretation of potentially clinically meaningful findings. Method: Cohen's d, Hedges' g, Pearson r, and sample sizes (n = 1,387) were extracted from 32 meta-analyses in journals in speech-language pathology and audiology. Percentile ranks (25th, 50th, 75th) were calculated to determine estimates for small, medium, and large effect sizes, respectively. The median sample size was also used to explore statistical power for small, medium, and large effect sizes. Results: For individual differences research, effect sizes of Pearson r = .24, .41, and .64 were found. For group differences, Cohen's d/Hedges' g = 0.25, 0.55, and 0.93. These values can be interpreted as small, medium, and large effect sizes in speech-language pathology and audiology. The majority of published research was inadequately powered to detect a medium effect size. Conclusions: Effect size interpretations from published research in audiology and speech-language pathology were found to be underestimated based on Cohen's (1988, 1992) guidelines. Researchers in the field should consider using Pearson r = .25, .40, and .65 and Cohen's d/Hedges' g = 0.25, 0.55, and 0.95 as small, medium, and large effect sizes, respectively, and collect larger sample sizes to ensure that both significant and nonsignificant findings are robust and replicable.
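The percentile-rank approach described above is straightforward to reproduce. The sketch below is purely illustrative: a randomly generated pool of absolute correlations stands in for the 1,387 extracted effect sizes, and the 25th/50th/75th percentiles are taken as the small/medium/large benchmarks; it is not the authors' data or code.

```python
# Illustrative sketch (hypothetical values, not the authors' data): small/medium/large
# benchmarks defined as the 25th/50th/75th percentiles of a pool of absolute effect sizes.
import numpy as np

rng = np.random.default_rng(1)
pooled_r = np.abs(rng.normal(0.4, 0.2, size=1387))   # stand-in for the extracted correlations
small, medium, large = np.percentile(pooled_r, [25, 50, 75])
print(f"small r ≈ {small:.2f}, medium r ≈ {medium:.2f}, large r ≈ {large:.2f}")
```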

2021 ◽  
Vol 3 (1) ◽  
pp. 61-89
Author(s):  
Stefan Geiß

This study uses Monte Carlo simulation techniques to estimate the minimum required levels of intercoder reliability in content analysis data for testing correlational hypotheses, depending on sample size, effect size, and coder behavior under uncertainty. The ensuing procedure is analogous to power calculations for experimental designs. In the most widespread sample size/effect size settings, the rule of thumb that chance-adjusted agreement should be ≥ .80 or ≥ .667 corresponds to the simulation results, yielding acceptable α and β error rates. However, the simulation allows precise power calculations that take the specifics of each study's context into account, moving beyond one-size-fits-all recommendations. Studies with low sample sizes and/or low expected effect sizes may need coder agreement above .80 to test a hypothesis with sufficient statistical power, whereas in studies with high sample sizes and/or high expected effect sizes, coder agreement below .667 may suffice. Such calculations can help both in evaluating and in designing studies. Particularly in pre-registered research, higher sample sizes may be used to compensate for low expected effect sizes and/or borderline coding reliability (e.g., when constructs are hard to measure). I supply equations, easy-to-use tables, and R functions to facilitate use of this framework, along with example code as an online appendix.
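The core idea, that unreliable coding attenuates an underlying correlation and therefore lowers power, can be illustrated with a much-simplified Monte Carlo sketch. This is not the author's simulation: reliability is modeled here as measurement error on a continuous coded variable rather than as categorical coder agreement, and the effect size, sample size, and reliability values are assumptions.

```python
# Minimal Monte Carlo sketch (a simplification, not the author's simulation code):
# coding unreliability attenuates the true correlation, which lowers test power.
import numpy as np
from scipy.stats import pearsonr

def simulated_power(rho=0.20, n=200, reliability=0.80, sims=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    error_var = (1 - reliability) / reliability   # so var(true)/var(observed) = reliability
    hits = 0
    for _ in range(sims):
        x_true = rng.normal(size=n)
        y = rho * x_true + np.sqrt(1 - rho**2) * rng.normal(size=n)
        x_coded = x_true + rng.normal(scale=np.sqrt(error_var), size=n)
        if pearsonr(x_coded, y)[1] < alpha:       # count significant results
            hits += 1
    return hits / sims

for rel in (1.00, 0.80, 0.667):
    print(f"reliability={rel:.3f}: power ≈ {simulated_power(reliability=rel):.2f}")
```

Running this shows power dropping as reliability falls, which is the trade-off the article quantifies against sample size and expected effect size.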


2013 ◽  
Vol 112 (3) ◽  
pp. 835-844 ◽  
Author(s):  
M. T. Bradley ◽  
A. Brand

Tables of alpha values as a function of sample size, effect size, and desired power were presented. The tables indicated the expected alphas for small, medium, and large effect sizes given a variety of sample sizes. It was evident that sample sizes for most psychological studies are adequate for large effect sizes, defined as .8. The typical alpha level of .05 and desired power of 90% can be achieved with 70 participants split across two groups. It is doubtful, however, whether these ideal levels of alpha and power have generally been achieved for medium effect sizes in actual research, since 170 participants would be required. Small effect sizes have rarely been tested with an adequate number of participants or power. Implications were discussed.
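The sample size figures quoted above can be checked directly from the noncentral t distribution. The sketch below (not the authors' tables) searches for the smallest per-group n that reaches 90% power at two-sided alpha = .05 for two equal groups; the d = 0.2/0.5/0.8 values are Cohen's conventional small/medium/large benchmarks.

```python
# Rough check of the figures above via the noncentral t distribution
# (assumed: two equal groups, two-sided alpha = .05, target power = .90).
import numpy as np
from scipy.stats import t, nct

def power_two_sample(d, n_per_group, alpha=0.05):
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)            # d * sqrt(n1*n2/(n1+n2)) with n1 = n2
    t_crit = t.ppf(1 - alpha / 2, df)
    return (1 - nct.cdf(t_crit, df, ncp)) + nct.cdf(-t_crit, df, ncp)

def required_n(d, target=0.90):
    n = 2
    while power_two_sample(d, n) < target:
        n += 1
    return n

for d in (0.8, 0.5, 0.2):
    n = required_n(d)
    print(f"d={d}: {n} per group, {2*n} total")
# Expected output: roughly 68 and 172 in total for d=0.8 and d=0.5,
# in line with the ~70 and ~170 participants cited above.
```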


2020 ◽  
Author(s):  
Daniel J. Dunleavy

Background. In recent years, the veracity of scientific findings has come under intense scrutiny in what has been called the "replication crisis" (sometimes called the "reproducibility crisis" or "crisis of confidence"). This crisis is marked by the propagation of scientific claims which were subsequently contested, found to be exaggerated, or deemed false. The causes of this crisis are many, but include poor research design, inappropriate statistical analysis, and the manipulation of study results. Though it is uncertain if social work is in the midst of a similar crisis, it is not unlikely, given parallels between the field and adjacent disciplines in crisis. Objective. This dissertation aims to articulate these problems, as well as foundational issues in statistical theory, in order to scrutinize statistical practice in social work research. In doing so, it parallels recent work in psychology, neuroscience, medicine, ecology, and other scientific disciplines, while introducing a new program of meta-research to the social work profession. Method. Five leading social work journals were analyzed across a five-year period (2014-2018). In all, 1,906 articles were reviewed, with 310 meeting inclusion criteria. The study was divided into three complementary parts. Statistical reporting practices were coded and analyzed in Part 1 of the study (n = 310). Using reported sample sizes from these articles, a power survey was performed, in Part 2, for small, medium, and large effect sizes (n = 207). A novel statistical tool, the p-curve, was used in Part 3 to evaluate the evidential value of results from one journal (Research on Social Work Practice) and to assess for bias. Results from 39 of the 78 eligible articles were included in the analysis. Data and materials are available at https://osf.io/45z3h/. Results. Part 1: Notably, 86.1% of articles reviewed did not report an explicit alpha level. A power analysis was performed in only 7.4% of articles. Use of p-values was common, being reported in 96.8% of articles, but only 29% of articles reported them in exact form. Only 36.5% of articles reported confidence intervals, with the 95% coverage rate being the most common (reported in 31.3% of all studies). Effect sizes were explicitly reported in the results section or tables in a little more than half of articles (55.2%). Part 2: The mean statistical power for articles was 57% for small effects, 88% for medium effects, and 95% for large effects. 61% of studies did not have adequate power (.80) to detect a small effect, 19% did not have adequate power to detect a medium effect, and 7% a large effect. A robustness test yielded similar but more conservative estimates for these findings. Part 3: Both the primary p-curve and the robustness test yielded right-skewed curves, indicating evidential value for the included set of results and no evidence of bias. Conclusion. Overall, these findings provide a snapshot of the status of contemporary social work research. The results are preliminary but indicate areas where statistical design and reporting can be improved in published research. The results of the power survey suggest that the field has increased mean statistical power compared to prior decades, though these findings are tentative and have numerous limitations. The results of the p-curve demonstrate its potential as a tool for investigating bias within published research, while suggesting that the results included from Research on Social Work Practice have evidential value. In all, this study provides a first step towards a broader and more comprehensive assessment of the field.
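One ingredient of the p-curve procedure used in Part 3 is a right-skew test on the distribution of significant p-values. The sketch below shows only the simple binomial version of that test on hypothetical p-values; the full p-curve method also includes continuous (Stouffer-type) tests and is not reproduced here.

```python
# Minimal sketch of the binomial right-skew test from p-curve analysis,
# applied to hypothetical significant p-values (not the dissertation's data).
from scipy.stats import binomtest

significant_p = [0.003, 0.011, 0.024, 0.001, 0.041, 0.008, 0.019, 0.002]  # hypothetical
low = sum(p < 0.025 for p in significant_p)          # "low" half of the p < .05 range
result = binomtest(low, n=len(significant_p), p=0.5, alternative="greater")
print(f"{low}/{len(significant_p)} p-values below .025, "
      f"binomial p = {result.pvalue:.3f} (right skew suggests evidential value)")
```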


2021 ◽  
Author(s):  
Kleber Neves ◽  
Pedro Batista Tan ◽  
Olavo Bohrer Amaral

Diagnostic screening models for the interpretation of null hypothesis significance test (NHST) results have been influential in highlighting the effect of selective publication on the reproducibility of the published literature, leading to John Ioannidis' much-cited claim that most published research findings are false. These models, however, are typically based on the assumption that hypotheses are dichotomously true or false, without considering that effect sizes for different hypotheses are not the same. To address this limitation, we develop a simulation model that represents effect sizes explicitly, using different continuous distributions, while retaining other aspects of previous models such as publication bias and the pursuit of statistical significance. Our results show that the combination of selective publication, bias, low statistical power, and unlikely hypotheses consistently leads to high proportions of false positives, irrespective of the effect size distribution assumed. Using continuous effect sizes also allows us to evaluate the degree of effect size overestimation and the prevalence of estimates with the wrong sign in the literature, showing that the same factors that drive false-positive results also lead to errors in estimating effect size direction and magnitude. Nevertheless, the relative influence of these factors on different metrics varies depending on the distribution assumed for effect sizes. The model is made available as an R ShinyApp interface, allowing one to explore features of the literature in various scenarios.
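A much-simplified simulation in the spirit of this model (not the authors' ShinyApp code) is sketched below: true effects for non-null hypotheses are drawn from a continuous distribution, studies are run at low power, only significant results are "published", and the resulting false-positive rate, sign errors, and effect size exaggeration are tallied. The proportion of null hypotheses, the effect size SD, and the per-group n are all assumed values.

```python
# Simplified sketch of publication bias with continuous effect sizes
# (assumed settings; not the authors' model or parameter values).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_studies, n_per_group = 50_000, 20                  # hypothetical settings
p_null, effect_sd = 0.6, 0.3                         # 60% null hypotheses; SD of true d

is_null = rng.random(n_studies) < p_null
true_d = np.where(is_null, 0.0, rng.normal(0.0, effect_sd, n_studies))

se = np.sqrt(2 / n_per_group)                        # SE of d, normal approximation
d_hat = rng.normal(true_d, se)                       # observed effect in each study
significant = np.abs(d_hat / se) > norm.ppf(0.975)   # two-sided test, alpha = .05

published = significant                              # extreme selective publication
false_pos = np.mean(is_null[published])
nonnull_pub = published & ~is_null
sign_error = np.mean(np.sign(d_hat[nonnull_pub]) != np.sign(true_d[nonnull_pub]))
exaggeration = np.median(np.abs(d_hat[nonnull_pub]) / np.abs(true_d[nonnull_pub]))

print(f"false positives among published results: {false_pos:.2f}")
print(f"sign errors among published non-null effects: {sign_error:.2f}")
print(f"median exaggeration factor: {exaggeration:.1f}x")
```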


2018 ◽  
Author(s):  
Michele B. Nuijten ◽  
Marcel A. L. M. van Assen ◽  
Hilde Augusteijn ◽  
Elise Anne Victoire Crompvoets ◽  
Jelte M. Wicherts

In this meta-study, we analyzed 2,442 effect sizes from 131 meta-analyses in intelligence research, published from 1984 to 2014, to estimate the average effect size, median power, and evidence for bias. We found that the average effect size in intelligence research was a Pearson’s correlation of .26, and the median sample size was 60. Furthermore, across primary studies, we found a median power of 11.9% to detect a small effect, 54.5% to detect a medium effect, and 93.9% to detect a large effect. We documented differences in average effect size and median estimated power between different types of intelligence studies (correlational studies, studies of group differences, experiments, toxicology, and behavior genetics). On average, across all meta-analyses (but not in every meta-analysis), we found evidence for small-study effects, potentially indicating publication bias and overestimated effects. We found no differences in small-study effects between different study types. We also found no convincing evidence for the decline effect, US effect, or citation bias across meta-analyses. We conclude that intelligence research does show signs of low power and publication bias, but that these problems seem less severe than in many other scientific fields.
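A back-of-the-envelope check using the Fisher z approximation shows why power is so low at the reported median sample size of 60. Note that the article reports median power across studies, which power at the median n only roughly approximates; the r = .10/.30/.50 values below are Cohen's conventional benchmarks, not the authors' computations.

```python
# Approximate power for a correlation test via Fisher's z
# (rough check at the reported median n = 60; not the authors' code).
import numpy as np
from scipy.stats import norm

def power_r(r, n, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = np.arctanh(r) * np.sqrt(n - 3)             # noncentrality on the z scale
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)

def n_for_power(r, target=0.80):
    n = 4
    while power_r(r, n) < target:
        n += 1
    return n

for r in (0.10, 0.30, 0.50):
    print(f"r={r}: power at n=60 ≈ {power_r(r, 60):.2f}, n for 80% power ≈ {n_for_power(r)}")
```

At n = 60, a small correlation of .10 yields power of roughly 11%, close to the 11.9% median reported above.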


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Anderson Souza Oliveira ◽  
Cristina Ioana Pirscoveanu

Low reproducibility and non-optimal sample sizes are current concerns in scientific research, especially within human movement studies. Therefore, this study aimed to examine the implications of different sample sizes and numbers of steps on data variability and statistical outcomes for kinematic and kinetic running biomechanical variables. Forty-four participants ran overground using their preferred technique (normal) and while minimizing the contact sound volume (silent). Running speed, peak vertical force, peak braking force, and vertical average loading rate were extracted from >40 steps per runner. Data stability was computed using a sequential estimation technique. Statistical outcomes (p values and effect sizes) from the comparison of normal vs silent running were extracted from 100,000 random samples, using various combinations of sample size (from 10 to 40 runners) and number of steps (from 5 to 40 steps). The results showed that only 35% of the study sample could reach average stability using up to 10 steps across all biomechanical variables. The loading rate was consistently significantly lower during silent running compared to normal running, with large effect sizes across all combinations. However, variables presenting small or medium effect sizes (running speed and peak braking force) required >20 runners to reach significant differences. Therefore, varying the sample size and number of steps influences the statistical outcomes of the normal vs silent comparison in a variable-dependent manner. Based on our results, we recommend that studies analyzing traditional running biomechanical variables use a minimum of 25 participants and 25 steps from each participant to provide appropriate data stability and statistical power.
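The abstract names a sequential estimation technique without detail. One common implementation in movement science declares a variable stable once its cumulative mean stays within a band of ±0.25 SD around the overall mean for all remaining trials; the sketch below uses that criterion on hypothetical per-step values, and the authors' exact bandwidth and criterion may differ.

```python
# Sketch of a common sequential-estimation stability criterion
# (assumed band of 0.25 SD; hypothetical data, not the authors' implementation).
import numpy as np

def steps_to_stability(values, bandwidth=0.25):
    values = np.asarray(values, dtype=float)
    overall_mean, overall_sd = values.mean(), values.std(ddof=1)
    cum_mean = np.cumsum(values) / np.arange(1, len(values) + 1)
    inside = np.abs(cum_mean - overall_mean) <= bandwidth * overall_sd
    for i in range(len(values)):
        if inside[i:].all():                         # stays inside the band from step i on
            return i + 1                             # number of steps needed for stability
    return len(values)

rng = np.random.default_rng(3)
loading_rate = rng.normal(80, 8, size=40)            # hypothetical per-step loading rates
print(f"steps needed for a stable mean: {steps_to_stability(loading_rate)}")
```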


2021 ◽  
Author(s):  
Ymkje Anna de Vries ◽  
Robert A Schoevers ◽  
Julian Higgins ◽  
Marcus Munafo ◽  
Jojanneke Bastiaansen

Background: Previous research has suggested that statistical power is suboptimal in many biomedical disciplines, but it is unclear whether power is better in trials for particular interventions, disorders, or outcome types. We therefore performed a detailed examination of power in trials of psychotherapy, pharmacotherapy, and complementary and alternative medicine (CAM) for mood, anxiety, and psychotic disorders. Methods: We extracted data from the Cochrane Database of Systematic Reviews (Mental Health). We focused on continuous efficacy outcomes and estimated power to detect standardized effect sizes (SMD=0.20-0.80, primary effect size SMD=0.40) and the meta-analytic effect size (ESMA). We performed meta-regression to estimate the influence of including underpowered studies in meta-analyses. Results: We included 216 reviews with 8,809 meta-analyses and 36,540 studies. Statistical power for continuous efficacy outcomes was very low across intervention and disorder types (overall median [IQR] power for SMD=0.40: 0.33 [0.19-0.54]; for ESMA: 0.15 [0.07-0.44]), only reaching conventionally acceptable levels (80%) for SMD=0.80. Median power to detect the ESMA was higher in TAU/waitlist-controlled (0.54-0.66) or placebo-controlled (0.15-0.40) trials than in trials comparing active treatments (0.07-0.10). Meta-regression indicated that adequately-powered studies produced smaller effect sizes than underpowered studies (B=-0.06, p=0.008). Conclusions: Power to detect both fixed and meta-analytic effect sizes in clinical trials in psychiatry was low across all interventions and disorders examined. As underpowered studies produced larger effect sizes than adequately-powered studies, these results confirm the need to increase sample sizes and to reduce reporting bias against studies reporting null results to improve the reliability of the published literature.
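The meta-regression finding (B=-0.06) can be illustrated with a simplified, inverse-variance-weighted regression of study effect sizes on an "adequately powered" indicator. The sketch below uses synthetic data in which a 0.06 overestimation by underpowered studies is deliberately built in to mirror the reported coefficient; the paper's actual analysis is more elaborate (e.g., handling dependence of studies within reviews).

```python
# Simplified fixed-effect meta-regression sketch on synthetic data
# (the -0.06 difference is an assumption built in to mirror the reported B).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
k = 200                                              # hypothetical number of studies
adequately_powered = rng.random(k) < 0.3
var_smd = np.where(adequately_powered, 0.02, 0.10)   # bigger studies -> smaller variance
smd = rng.normal(0.35, np.sqrt(var_smd))
smd[~adequately_powered] += 0.06                     # underpowered studies overestimate (assumed)

X = sm.add_constant(adequately_powered.astype(float))
fit = sm.WLS(smd, X, weights=1.0 / var_smd).fit()    # inverse-variance weights
print(fit.params)                                    # slope near -0.06: smaller effects when powered
```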


2020 ◽  
Vol 8 (4) ◽  
pp. 36
Author(s):  
Michèle B. Nuijten ◽  
Marcel A. L. M. van Assen ◽  
Hilde E. M. Augusteijn ◽  
Elise A. V. Crompvoets ◽  
Jelte M. Wicherts

In this meta-study, we analyzed 2442 effect sizes from 131 meta-analyses in intelligence research, published from 1984 to 2014, to estimate the average effect size, median power, and evidence for bias. We found that the average effect size in intelligence research was a Pearson’s correlation of 0.26, and the median sample size was 60. Furthermore, across primary studies, we found a median power of 11.9% to detect a small effect, 54.5% to detect a medium effect, and 93.9% to detect a large effect. We documented differences in average effect size and median estimated power between different types of intelligence studies (correlational studies, studies of group differences, experiments, toxicology, and behavior genetics). On average, across all meta-analyses (but not in every meta-analysis), we found evidence for small-study effects, potentially indicating publication bias and overestimated effects. We found no differences in small-study effects between different study types. We also found no convincing evidence for the decline effect, US effect, or citation bias across meta-analyses. We concluded that intelligence research does show signs of low power and publication bias, but that these problems seem less severe than in many other scientific fields.


2021 ◽  
pp. 016327872110243
Author(s):  
Donna Chen ◽  
Matthew S. Fritz

Although the bias-corrected (BC) bootstrap is an often-recommended method for testing mediation due to its higher statistical power relative to other tests, it has also been found to have elevated Type I error rates with small sample sizes. Under limitations on participant recruitment, obtaining a larger sample size is not always feasible. Thus, this study examines whether alternative corrections for bias in the BC bootstrap test of mediation can achieve equal levels of statistical power at small sample sizes without the associated increase in Type I error. A simulation study was conducted to compare Efron and Tibshirani’s original correction for bias, z₀, to six alternative corrections: (a) mean, (b–e) Winsorized mean with 10%, 20%, 30%, and 40% trimming in each tail, and (f) medcouple (a robust skewness measure). Most of the variation in Type I error (with a medium effect size for one regression slope and zero for the other) and in power (small effect sizes for both regression slopes) was found at small sample sizes. Recommendations for applied researchers are made based on the results. An empirical example using data from the ATLAS drug prevention intervention study is presented to illustrate these results. Limitations and future directions are discussed.
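For readers unfamiliar with z₀: in the standard BC bootstrap it is the normal quantile of the proportion of bootstrap estimates falling below the original point estimate, and it shifts the percentiles used for the confidence interval. The sketch below applies that standard correction (z₀ only, no acceleration) to the indirect effect a*b in a simulated mediation model; the path values and sample size are assumptions, and the alternative corrections studied in the article are not implemented.

```python
# Compact sketch of the standard BC bootstrap test of the indirect effect a*b
# (z0 correction only; simulated data with assumed medium paths of 0.39).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
n, B, alpha = 50, 2000, 0.05
x = rng.normal(size=n)
m = 0.39 * x + rng.normal(size=n)                    # a path (assumed)
y = 0.39 * m + rng.normal(size=n)                    # b path (assumed, no direct effect)

def indirect(x, m, y):
    a = np.polyfit(x, m, 1)[0]                       # slope of m on x
    b = np.linalg.lstsq(np.column_stack([np.ones_like(x), x, m]), y, rcond=None)[0][2]
    return a * b                                     # indirect effect estimate

ab_hat = indirect(x, m, y)
boot = np.empty(B)
for i in range(B):
    idx = rng.integers(0, n, n)                      # resample cases with replacement
    boot[i] = indirect(x[idx], m[idx], y[idx])

z0 = norm.ppf(np.mean(boot < ab_hat))                # Efron & Tibshirani's bias correction
lo = norm.cdf(2 * z0 + norm.ppf(alpha / 2))          # shifted percentile levels
hi = norm.cdf(2 * z0 + norm.ppf(1 - alpha / 2))
ci = np.quantile(boot, [lo, hi])
print(f"ab = {ab_hat:.3f}, BC {100*(1-alpha):.0f}% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

The test rejects the null of no mediation when the BC interval excludes zero; the article's alternatives replace z₀ with corrections based on the mean, Winsorized means, or the medcouple of the bootstrap distribution.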


2017 ◽  
Vol 83 (4) ◽  
pp. 428-445 ◽  
Author(s):  
Nicholas A. Gage ◽  
Bryan G. Cook ◽  
Brian Reichow

Publication bias involves the disproportionate representation of studies with large and significant effects in the published research. Among other problems, publication bias results in inflated omnibus effect sizes in meta-analyses, giving the impression that interventions have stronger effects than they actually do. Although evidence suggests that publication bias exists in other fields, research has not examined the issue in special education. In this study, we examined the inclusion of gray literature, testing for publication bias, the extent to which publication bias exists, the relation of including gray literature to the presence of publication bias, and differences in effect size magnitude for gray literature and published studies among 109 meta-analyses published in special education journals. We found the following: (a) 42% of meta-analyses included gray literature, (b) 33% examined publication bias, (c) meta-analyses not including gray literature were more likely to reflect publication bias, and (d) published studies had larger effect sizes than gray literature. We discuss implications and recommendations for research and practice.
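The abstract does not specify which publication-bias tests the reviewed meta-analyses used. One widely used option is Egger's regression test for funnel-plot asymmetry, sketched below on hypothetical data with a small-study effect deliberately built in; other methods (e.g., trim-and-fill or fail-safe N) are also common.

```python
# Sketch of Egger's regression test for funnel-plot asymmetry
# (hypothetical data; asymmetry is built in for illustration).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
k = 40                                               # hypothetical number of studies
se = rng.uniform(0.05, 0.40, size=k)                 # standard errors of study effects
effect = rng.normal(0.30, se) + 0.8 * se             # smaller studies report larger effects

X = sm.add_constant(1.0 / se)                        # Egger: regress effect/se on precision
fit = sm.OLS(effect / se, X).fit()
print(f"Egger intercept = {fit.params[0]:.2f}, p = {fit.pvalues[0]:.3f}")
# An intercept clearly different from zero suggests asymmetry, consistent with publication bias.
```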

