Effect size, confidence intervals and statistical power in psychological research

2015 ◽  
Vol 8 (3) ◽  
pp. 27-46 ◽  
Author(s):  
Arnoldo Téllez ◽  
Cirilo H. García ◽  
Victor Corral-Verdugo

2019 ◽  
Vol 227 (4) ◽  
pp. 261-279 ◽  
Author(s):  
Frank Renkewitz ◽  
Melanie Keiner

Abstract. Publication biases and questionable research practices are assumed to be two of the main causes of low replication rates. Both of these problems lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect such bias in meta-analytic results. We present an evaluation of the performance of six of these tools. To assess the Type I error rate and the statistical power of these methods, we simulated a large variety of literatures that differed with regard to true effect size, heterogeneity, number of available primary studies, and sample sizes of these primary studies; furthermore, simulated studies were subjected to different degrees of publication bias. Our results show that across all simulated conditions, no method consistently outperformed the others. Additionally, all methods performed poorly when true effect sizes were heterogeneous or primary studies had a small chance of being published, irrespective of their results. This suggests that in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.
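The inflation mechanism this abstract describes can be illustrated with a minimal Monte Carlo sketch (in Python; the parameter values, such as a true d of 0.2 and n = 30 per group, are illustrative assumptions, not figures from the paper): simulate many studies, "publish" only the significant ones, and compare the published mean with the true effect.

```python
import math
import random

random.seed(1)

def simulate_publication_bias(true_d=0.2, n_per_group=30, n_studies=10_000):
    """Simulate study-level effect estimates and keep only the
    'published' (statistically significant) ones."""
    se = math.sqrt(2 / n_per_group)           # approximate SE of Cohen's d
    all_d, published_d = [], []
    for _ in range(n_studies):
        d_hat = random.gauss(true_d, se)      # sampling error around the true effect
        all_d.append(d_hat)
        if abs(d_hat) / se > 1.96:            # two-sided p < .05 publication filter
            published_d.append(d_hat)
    return sum(all_d) / len(all_d), sum(published_d) / len(published_d)

mean_all, mean_published = simulate_publication_bias()
print(round(mean_all, 2))        # close to the true effect of 0.2
print(round(mean_published, 2))  # severely inflated
```

Under these assumptions the published mean comes out roughly three times the true effect, which is the kind of inflation the bias-detection methods evaluated here are meant to flag.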



2005 ◽  
Vol 62 (12) ◽  
pp. 2716-2726 ◽  
Author(s):  
Michael J Bradford ◽  
Josh Korman ◽  
Paul S Higgins

There is considerable uncertainty about the effectiveness of fish habitat restoration programs, and reliable monitoring programs are needed to evaluate them. Statistical power analysis based on traditional hypothesis tests is usually used for monitoring program design, but here we argue that effect size estimates and their associated confidence intervals are more informative, because results can be compared with both the null hypothesis of no effect and effect sizes of interest, such as restoration goals. We used a stochastic simulation model to compare alternative monitoring strategies for a habitat alteration that would change the productivity and capacity of a stream producing coho salmon (Oncorhynchus kisutch). Estimates of the effect size using a freshwater stock–recruit model were more precise than those from monitoring the abundance of either spawners or smolts. Less-than-ideal monitoring programs can produce ambiguous results: cases in which the confidence interval includes both the null hypothesis and the effect size of interest. Our model is a useful planning tool because it allows evaluation of the utility of different types of monitoring data, which should stimulate discussion on how the results will ultimately inform decision-making.
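The "ambiguous result" notion from this abstract — a confidence interval that includes both the null and the restoration goal — can be captured in a few lines. This is a hypothetical sketch; the interval endpoints and the goal of 0.5 are chosen purely for illustration:

```python
def classify_ci(lo, hi, null=0.0, goal=0.5):
    """Compare a confidence interval with both the null hypothesis of
    no effect and an effect size of interest (e.g., a restoration goal)."""
    contains_null = lo <= null <= hi
    contains_goal = lo <= goal <= hi
    if contains_null and contains_goal:
        return "ambiguous"                    # consistent with both hypotheses
    if contains_goal:
        return "supports restoration goal"    # null excluded, goal retained
    if contains_null:
        return "supports no effect"           # goal excluded, null retained
    return "effect differs from both null and goal"

verdict_wide = classify_ci(-0.1, 0.8)   # imprecise monitoring spans both values
verdict_tight = classify_ci(0.2, 0.7)   # precise enough to exclude the null
print(verdict_wide, verdict_tight)
```

A conventional significance test reports only whether the null is excluded; the third and fourth branches are what the CI-based comparison adds.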



2020 ◽  
pp. 28-63
Author(s):  
A. G. Vinogradov

The article belongs to a special modern genre of scholarly publication, the so-called tutorial – an article devoted to presenting the latest methods of design, modeling, or analysis in an accessible format in order to disseminate best practices. It acquaints Ukrainian psychologists with the basics of using the R programming language to analyze empirical research data. The article discusses the current state of world psychology in connection with the Crisis of Confidence, which arose due to the low reproducibility of empirical research. This problem is caused by the poor quality of psychological measurement tools, insufficient attention to adequate sample planning, typical statistical hypothesis testing practices, and so-called “questionable research practices.” The tutorial demonstrates methods for determining sample size depending on the expected magnitude of the effect size and the desired statistical power, performing basic variable transformations, and carrying out statistical analysis of psychological research data using the R language and environment. It presents the minimal system of R functions required to carry out a modern analysis of the reliability of measurement scales, sample size calculation, and point and interval estimation of effect size for the four designs most widespread in psychology for analyzing the interdependence of two variables. These typical problems include finding differences between means and variances in two or more samples and correlations between continuous and categorical variables. Practical information on data preparation, import, basic transformations, and the application of basic statistical methods in the cloud version of RStudio is provided.
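The tutorial itself works in R; as a language-neutral illustration of the kind of sample-size calculation it demonstrates, here is a sketch using the standard normal-approximation formula for a two-sample comparison of means (the default alpha and power are conventional assumptions, not values from the article):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sample comparison of means,
    given a standardized effect size d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided Type I error criterion
    z_beta = z.inv_cdf(power)            # desired power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.5))  # medium effect: 63 per group (exact t-test tables give ~64)
print(n_per_group(0.2))  # small effect: 393 per group
```

The steep growth of n as d shrinks is why the expected effect size is the decisive input in any such calculation.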



2019 ◽  
Author(s):  
Curtis David Von Gunten ◽  
Bruce D Bartholow

A primary psychometric concern with laboratory-based inhibition tasks has been their reliability. However, a reliable measure may be neither necessary nor sufficient for reliably detecting effects (statistical power). The current study used a bootstrap sampling approach to systematically examine how the number of participants, the number of trials, the magnitude of an effect, and study design (between- vs. within-subject) jointly contribute to power in five commonly used inhibition tasks. The results demonstrate the shortcomings of relying solely on measurement reliability when determining the number of trials to use in an inhibition task: high internal reliability can be accompanied by low power, and low reliability can be accompanied by high power. For instance, adding trials once sufficient reliability has been reached can still yield large gains in power. The dissociation between reliability and power was particularly apparent in between-subject designs, where the number of participants contributed greatly to power but little to reliability, and where the number of trials contributed greatly to reliability but only modestly (depending on the task) to power. For between-subject designs, the probability of detecting small-to-medium-sized effects with 150 participants (total) was generally less than 55%. However, effect size was positively associated with the number of trials. Thus, researchers have some control over effect size, and this needs to be considered when conducting power analyses using analytic methods that take such effect sizes as an argument. Results are discussed in the context of recent claims regarding the role of inhibition tasks in experimental and individual-difference designs.
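The reliability/power dissociation reported here can be illustrated with a small Monte Carlo sketch (a simplified stand-in for the paper's bootstrap approach; all variance components and sample sizes below are invented for illustration). With participants fixed, adding trials shrinks trial-level noise in each subject's mean score and raises power, even after reliability has plateaued:

```python
import math
import random

random.seed(2)

def power_sim(n_subj=30, n_trials=10, delta=10, sd_subj=30, sd_trial=80, reps=2000):
    """Monte Carlo power of a one-sample t-test on subject-mean difference
    scores; trial noise shrinks with n_trials, subject noise does not."""
    total_sd = math.sqrt(sd_subj ** 2 + sd_trial ** 2 / n_trials)
    hits = 0
    for _ in range(reps):
        scores = [random.gauss(delta, total_sd) for _ in range(n_subj)]
        m = sum(scores) / n_subj
        s = math.sqrt(sum((x - m) ** 2 for x in scores) / (n_subj - 1))
        if abs(m) / (s / math.sqrt(n_subj)) > 2.045:  # critical t, df = 29
            hits += 1
    return hits / reps

power_few_trials = power_sim(n_trials=10)
power_many_trials = power_sim(n_trials=80)
print(power_few_trials, power_many_trials)  # more trials, noticeably higher power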



Circulation ◽  
2007 ◽  
Vol 116 (suppl_16) ◽  
Author(s):  
George A Diamond ◽  
Sanjay Kaul

Background A highly publicized meta-analysis of 42 clinical trials comprising 27,844 diabetics ignited a firestorm of controversy by charging that treatment with rosiglitazone was associated with a “…worrisome…” 43% greater risk of myocardial infarction (p = 0.03) and a 64% greater risk of cardiovascular death (p = 0.06). Objective The investigators excluded 4 trials from the infarction analysis and 19 trials from the mortality analysis in which no events were observed. We sought to determine whether these exclusions biased the results. Methods We compared the index study to a Bayesian meta-analysis of the full set of 42 trials (using the odds ratio as the measure of effect size) and to fixed-effects and random-effects analyses with and without a continuity correction that adjusts for values of zero. Results The odds ratios and confidence intervals for the analyses are summarized in the Table. Odds ratios for infarction ranged from 1.43 to 1.22 and for death from 1.64 to 1.13. Corrected models resulted in substantially smaller odds ratios and narrower confidence intervals than did uncorrected models. Although corrected risks remain elevated, none are statistically significant (p < 0.05). Conclusions Given the fragility of the effect sizes and confidence intervals, the charge that rosiglitazone increases the risk of adverse events is not supported by these additional analyses. The exaggerated values observed in the index study are likely the result of excluding the zero-event trials from analysis. Continuity adjustments mitigate this error and provide more consistent and reliable assessments of the true effect size. Transparent sensitivity analyses should therefore be performed over a realistic range of the operative assumptions to verify the stability of such assessments, especially when outcome events are rare. Given the relatively wide confidence intervals, additional data will be required to adjudicate these inconclusive results.
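The continuity correction at issue can be sketched with a minimal inverse-variance fixed-effects pooling (the 2×2 counts below are hypothetical, not data from the rosiglitazone trials): adding 0.5 to each cell of a zero-event table lets that trial enter the analysis, where its near-null odds ratio pulls the pooled estimate toward 1.

```python
import math

def pooled_or(trials, cc=0.5):
    """Inverse-variance fixed-effects pooled odds ratio; a continuity
    correction `cc` is added to every cell of a trial with a zero cell."""
    num = den = 0.0
    for a, b, c, d in trials:  # (events, non-events) in treatment, then control
        if 0 in (a, b, c, d):
            a, b, c, d = a + cc, b + cc, c + cc, d + cc
        log_or = math.log((a * d) / (b * c))
        weight = 1 / (1 / a + 1 / b + 1 / c + 1 / d)  # reciprocal of log-OR variance
        num += weight * log_or
        den += weight
    return math.exp(num / den)

with_zero_trial = pooled_or([(5, 95, 2, 98), (0, 100, 0, 100)])
without_zero_trial = pooled_or([(5, 95, 2, 98)])
print(round(without_zero_trial, 2), round(with_zero_trial, 2))
```

Excluding the zero-event trial yields the larger pooled odds ratio, mirroring the direction of bias the authors attribute to the index study.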



2018 ◽  
Author(s):  
Kathleen Wade Reardon ◽  
Avante J Smack ◽  
Kathrin Herzhoff ◽  
Jennifer L Tackett

Although an emphasis on adequate sample size and statistical power has a long history in clinical psychological science (Cohen, 1992), increased attention to the replicability of scientific findings has again underscored the importance of statistical power (Bakker, van Dijk, & Wicherts, 2012). These recent efforts have not yet circled back to modern clinical psychological research, despite the continued importance of sample size and power in producing a credible body of evidence. As one step in this process of scientific self-examination, the present study estimated an N-pact Factor (the statistical power of published empirical studies to detect typical effect sizes; Fraley & Vazire, 2014) in two leading clinical journals (the Journal of Abnormal Psychology, JAP, and the Journal of Consulting and Clinical Psychology, JCCP) for the years 2000, 2005, 2010, and 2015. Study sample size, as one proxy for statistical power, is a useful focus because it allows direct comparisons with other subfields and may highlight some of the core methodological differences between clinical and other areas (e.g., hard-to-reach populations, greater emphasis on correlational designs). We found that, across all years examined, the average median sample size in clinical research is 179 participants (175 for JAP and 182 for JCCP). The power to detect a small-to-medium effect size of .20 is just below 80% for both journals. Although the clinical N-pact Factor was higher than that estimated for social psychology, the statistical power in clinical journals is still too limited to detect many effects of interest to clinical psychologists, with little evidence of improvement in sample sizes over time.
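The headline figure can be reproduced approximately with a Fisher-z power calculation (a sketch using the abstract's N = 179 and r = .20; the paper's exact numbers may differ slightly from this normal approximation):

```python
import math
from statistics import NormalDist

def power_for_r(r, n, alpha=0.05):
    """Approximate power to detect a correlation of size r with n cases,
    via the Fisher z transformation (two-sided test)."""
    z = NormalDist()
    noncentrality = math.atanh(r) * math.sqrt(n - 3)
    return z.cdf(noncentrality - z.inv_cdf(1 - alpha / 2))

print(round(power_for_r(0.20, 179), 2))  # ~0.77: "just below 80%", as reported
```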



2019 ◽  
Author(s):  
Rob Cribbie ◽  
Nataly Beribisky ◽  
Udi Alter

Many scientific bodies recommend that a sample planning procedure, such as a traditional NHST a priori power analysis, be conducted during the planning stages of a study. Power analysis allows the researcher to estimate how many participants are required in order to detect a minimally meaningful effect size at a specific level of power and Type I error rate. However, there are several drawbacks to the procedure that render it “a mess.” Specifically, identifying the minimally meaningful effect size is often difficult but unavoidable if the procedure is to be conducted properly; the procedure is not precision oriented; and it does not guide the researcher to collect as many participants as feasibly possible. In this study, we explore how these three theoretical issues are reflected in applied psychological research in order to better understand whether they are concerns in practice. To investigate how power analysis is currently used, this study reviewed the reporting of 443 power analyses in high-impact psychology journals in 2016 and 2017. It was found that researchers rarely use the minimally meaningful effect size as the rationale for the effect size chosen in a power analysis. Further, precision-based approaches and collecting the maximum feasible sample size are almost never used in tandem with power analyses. In light of these findings, we suggest that researchers focus on tools beyond traditional power analysis when planning samples, such as collecting the maximum sample size feasible.
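For contrast with traditional power analysis, a precision-oriented plan fixes the desired confidence-interval width rather than power. A minimal sketch (the ±0.20 target half-width is an arbitrary illustrative choice, and the small-d approximation to the standard error is an assumption):

```python
import math
from statistics import NormalDist

def n_for_precision(half_width, alpha=0.05):
    """Per-group n so that the (1 - alpha) CI for a standardized mean
    difference is no wider than +/- half_width (SE of d ~ sqrt(2/n))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(2 * (z / half_width) ** 2)

print(n_for_precision(0.2))  # CI of +/-0.20 around d: 193 per group
```

Note that no minimally meaningful effect size is needed as an input, which sidesteps the identification problem the abstract highlights.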



2021 ◽  
Vol 3 (1) ◽  
pp. 61-89
Author(s):  
Stefan Geiß

Abstract This study uses Monte Carlo simulation techniques to estimate the minimum required levels of intercoder reliability in content analysis data for testing correlational hypotheses, depending on sample size, effect size, and coder behavior under uncertainty. The resulting procedure is analogous to power calculations for experimental designs. In the most widespread sample size/effect size settings, the rule of thumb that chance-adjusted agreement should be ≥.800 or ≥.667 corresponds to the simulation results, yielding acceptable α and β error rates. However, the simulation also allows precise power calculations that take the specifics of each study’s context into account, moving beyond one-size-fits-all recommendations. Studies with low sample sizes and/or low expected effect sizes may need coder agreement above .800 to test a hypothesis with sufficient statistical power; in studies with high sample sizes and/or high expected effect sizes, coder agreement below .667 may suffice. Such calculations can help both in evaluating and in designing studies. Particularly in pre-registered research, higher sample sizes may be used to compensate for low expected effect sizes and/or borderline coding reliability (e.g., when constructs are hard to measure). I supply equations, easy-to-use tables, and R functions to facilitate use of this framework, along with example code as an online appendix.
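The mechanism linking coder agreement to power can be sketched analytically rather than by full Monte Carlo (a deliberate simplification of the paper's approach, using the classical attenuation formula for one unreliably measured variable; all numbers below are illustrative assumptions):

```python
import math
from statistics import NormalDist

def attenuated_power(true_r, reliability, n, alpha=0.05):
    """Approximate power after unreliable coding attenuates the true
    correlation (r_observed = r_true * sqrt(reliability)), via Fisher z."""
    r_obs = true_r * math.sqrt(reliability)
    z = NormalDist()
    return z.cdf(math.atanh(r_obs) * math.sqrt(n - 3) - z.inv_cdf(1 - alpha / 2))

power_high_rel = attenuated_power(0.3, 0.9, 100)
power_low_rel = attenuated_power(0.3, 0.5, 100)
print(round(power_high_rel, 2), round(power_low_rel, 2))  # lower agreement, lower power
```

This also shows why a larger n can compensate for borderline reliability, as the abstract notes for pre-registered research: raising n restores the noncentrality that attenuation removed.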


