Low statistical power in biomedical science: a review of three human research domains

2017 ◽  
Vol 4 (2) ◽  
pp. 160254 ◽  
Author(s):  
Estelle Dumas-Mallet ◽  
Katherine S. Button ◽  
Thomas Boraud ◽  
Francois Gonon ◽  
Marcus R. Munafò

Studies with low statistical power increase the likelihood that a statistically significant finding represents a false positive result. We conducted a review of meta-analyses of studies investigating the association of biological, environmental or cognitive parameters with neurological, psychiatric and somatic diseases, excluding treatment studies, in order to estimate the average statistical power across these domains. Taking the effect size indicated by a meta-analysis as the best estimate of the likely true effect size, and assuming a threshold for declaring statistical significance of 5%, we found that approximately 50% of studies have statistical power in the 0–10% or 11–20% range, well below the minimum of 80% that is often considered conventional. Studies with low statistical power appear to be common in the biomedical sciences, at least in the specific subject areas captured by our search strategy. However, we also observe evidence that this depends in part on research methodology, with candidate gene studies showing very low average power and studies using cognitive/behavioural measures showing high average power. This warrants further investigation.
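The power calculation behind such a review can be sketched as follows. This is a minimal illustration using a normal approximation for a two-sample comparison; the effect size and sample size below are hypothetical, not taken from the paper:

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sample_power(d, n_per_group, alpha_z=1.959964):
    """Approximate power of a two-sided, two-sample comparison to detect
    a standardized effect size d (normal approximation)."""
    se = math.sqrt(2.0 / n_per_group)   # approx. SE of Cohen's d
    ncp = d / se                        # non-centrality parameter
    return (1.0 - normal_cdf(alpha_z - ncp)) + normal_cdf(-alpha_z - ncp)

# A small effect (d = 0.2) studied with 30 subjects per group lands in
# the 11-20% power band the review describes; the conventional 80%
# would require roughly 64 subjects per group at d = 0.5.
power_small = two_sample_power(0.2, 30)
power_conventional = two_sample_power(0.5, 64)
```

Taking the meta-analytic effect size as the plausible true effect, as the review does, amounts to plugging that estimate in as `d` for each primary study's sample size.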

2019 ◽  
Vol 227 (4) ◽  
pp. 261-279 ◽  
Author(s):  
Frank Renkewitz ◽  
Melanie Keiner

Abstract. Publication biases and questionable research practices are assumed to be two of the main causes of low replication rates. Both of these problems lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect such bias in meta-analytic results. We present an evaluation of the performance of six of these tools. To assess the Type I error rate and the statistical power of these methods, we simulated a large variety of literatures that differed with regard to true effect size, heterogeneity, number of available primary studies, and sample sizes of these primary studies; furthermore, simulated studies were subjected to different degrees of publication bias. Our results show that across all simulated conditions, no method consistently outperformed the others. Additionally, all methods performed poorly when true effect sizes were heterogeneous or primary studies had a small chance of being published, irrespective of their results. This suggests that in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.
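The core mechanism that such simulated literatures exercise, selective publication inflating meta-analytic estimates, can be sketched with a toy Monte Carlo. The parameters below (true effect, sample size, publication probability) are assumptions for illustration, not the authors' simulation design:

```python
import math
import random

random.seed(1)

TRUE_D = 0.2                          # hypothetical true standardized effect
N_PER_GROUP = 30
SE = math.sqrt(2.0 / N_PER_GROUP)     # approx. SE of the observed effect
Z_CRIT = 1.96

def simulate_literature(k, pub_prob_nonsig):
    """Simulate k studies; significant results are always published,
    non-significant ones only with probability pub_prob_nonsig."""
    published = []
    for _ in range(k):
        d_obs = random.gauss(TRUE_D, SE)
        significant = abs(d_obs / SE) > Z_CRIT
        if significant or random.random() < pub_prob_nonsig:
            published.append(d_obs)
    return published

# Fixed-effect (inverse-variance) pooling; all SEs are equal here, so
# the pooled estimate reduces to the mean of the published effects.
biased = simulate_literature(500, pub_prob_nonsig=0.1)
unbiased = simulate_literature(500, pub_prob_nonsig=1.0)
pooled_biased = sum(biased) / len(biased)
pooled_unbiased = sum(unbiased) / len(unbiased)
```

With only one in ten non-significant studies reaching print, the pooled estimate roughly doubles the true effect; this inflated literature is the input that the six detection tools must then diagnose.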


1990 ◽  
Vol 24 (3) ◽  
pp. 405-415 ◽  
Author(s):  
Nathaniel McConaghy

Meta-analysis replaced statistical significance with effect size in the hope of resolving controversy concerning the evaluation of treatment effects. Statistical significance measured the reliability of a treatment effect, not its efficacy, and was strongly influenced by the number of subjects investigated. Effect size, as originally assessed, eliminated this influence, but by standardizing the size of the treatment effect it could distort it. Meta-analyses that combine the results of studies employing different subject types, outcome measures, treatment aims, no-treatment rather than placebo controls, or therapists with varying experience can be misleading. To ensure discussion of these variables, meta-analyses should be used as an aid to, rather than a substitute for, literature review. While meta-analyses produce contradictory findings, it seems unwise to rely on the conclusions of an individual analysis. Their consistent finding that placebo treatments obtain markedly higher effect sizes than no treatment will, it is hoped, render the use of untreated control groups obsolete.


2016 ◽  
Vol 46 (11) ◽  
pp. 2287-2297 ◽  
Author(s):  
A. F. Carvalho ◽  
C. A. Köhler ◽  
B. S. Fernandes ◽  
J. Quevedo ◽  
K. W. Miskowiak ◽  
...  

Background: To date, no comprehensive evaluation has appraised the likelihood of bias or the strength of the evidence for peripheral biomarkers of bipolar disorder (BD). Here we performed an umbrella review of meta-analyses of peripheral non-genetic biomarkers for BD. Method: The PubMed/Medline, EMBASE and PsycInfo electronic databases were searched up to May 2015. Two independent authors conducted searches, examined references for eligibility, and extracted data. Meta-analyses in any language examining peripheral non-genetic biomarkers in participants with BD (across different mood states) compared to unaffected controls were included. Results: Six references, which examined 13 biomarkers across 20 meta-analyses (5474 BD cases and 4823 healthy controls), met inclusion criteria. Evidence for excess-significance bias (i.e. bias favoring publication of ‘positive’, nominally significant results) was observed in 11 meta-analyses. Heterogeneity was high (I² ⩾ 50%) for 16 meta-analyses. Only two biomarkers met criteria for suggestive evidence, namely the soluble IL-2 receptor and morning cortisol. The median power of the included studies, using the effect size of the largest dataset as the plausible true effect size of each meta-analysis, was 15.3%. Conclusions: Our findings suggest that there is an excess of statistically significant results in the literature on peripheral biomarkers for BD. Selective publication of ‘positive’ results and selective reporting of outcomes are possible mechanisms.
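The excess-significance reasoning can be illustrated with a small sketch: compare the observed number of 'positive' studies with the number expected given each study's power, taking the largest dataset's effect size as the plausible true effect. All numbers below are hypothetical, and the normal-approximation power formula is an assumption, not the authors' exact procedure:

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def study_power(d, n_per_group, alpha_z=1.96):
    """Approximate power of one study to detect standardized effect d."""
    ncp = d / math.sqrt(2.0 / n_per_group)
    return (1.0 - normal_cdf(alpha_z - ncp)) + normal_cdf(-alpha_z - ncp)

# Hypothetical meta-analysis: per-group sample sizes of 10 studies,
# with the largest dataset's effect size as the plausible true effect.
sample_sizes = [20, 25, 30, 30, 35, 40, 45, 50, 60, 80]
d_largest = 0.20

powers = [study_power(d_largest, n) for n in sample_sizes]
expected_sig = sum(powers)        # expected number of 'positive' studies
observed_sig = 9                  # hypothetical observed count

# A large standardized gap between observed and expected significant
# results signals an excess of significance.
z_excess = (observed_sig - expected_sig) / math.sqrt(
    sum(p * (1 - p) for p in powers))
median_power = sorted(powers)[len(powers) // 2]   # (approximate) median
```

Under these assumed numbers the median study power comes out near 15%, in line with the figure reported above, while nine observed 'positive' results against fewer than two expected would be a strong excess-significance signal.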


2021 ◽  
Vol 18 (1) ◽  
Author(s):  
Lawrence M. Paul

Abstract. Background: The use of meta-analysis to aggregate the results of multiple studies has increased dramatically over the last 40 years. For homogeneous meta-analyses, the Mantel–Haenszel technique has typically been utilized; in such meta-analyses, the effect sizes of the contributing studies differ only by statistical error. If homogeneity cannot be assumed or established, the most popular technique developed to date is the inverse-variance DerSimonian and Laird (DL) technique (DerSimonian and Laird, Control Clin Trials 7(3):177–88, 1986). However, both of these techniques rest on large-sample, asymptotic assumptions; at best they are approximations, especially when the number of cases observed in any cell of the corresponding contingency tables is small. Results: This research develops an exact, non-parametric test for evaluating statistical significance, and a related method for estimating effect size, in the meta-analysis of k 2 × 2 tables at any level of heterogeneity, as an alternative to the asymptotic techniques. Monte Carlo simulations show that even for large values of heterogeneity, the Enhanced Bernoulli Technique (EBT) is far superior to the DL technique at maintaining the pre-specified level of Type I Error. A fully tested implementation in the R statistical language is freely available from the author, as is a second, related exact test for estimating the effect size. Conclusions: This research has developed two exact tests for the meta-analysis of dichotomous, categorical data. The EBT was strongly superior to the DL technique in maintaining a pre-specified level of Type I Error even at extremely high levels of heterogeneity; the DL technique demonstrated many large violations of this level. Given the various biases towards finding statistical significance prevalent in epidemiology today, a strong focus on maintaining a pre-specified level of Type I Error would seem critical.
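For contrast with the exact EBT (whose construction is not given here), the asymptotic DL technique that the paper benchmarks against can be sketched for k 2 × 2 tables. The tables below are hypothetical, and the Woolf variance formula is the standard large-sample approximation:

```python
import math

def log_odds_ratios(tables):
    """Woolf log odds ratio and its large-sample variance for each
    2x2 table (a, b, c, d) = (exposed cases, exposed non-cases,
    unexposed cases, unexposed non-cases)."""
    out = []
    for a, b, c, d in tables:
        lor = math.log((a * d) / (b * c))
        var = 1/a + 1/b + 1/c + 1/d
        out.append((lor, var))
    return out

def dersimonian_laird(tables):
    """Inverse-variance random-effects pooled log OR with the
    DerSimonian-Laird moment estimator of tau^2."""
    studies = log_odds_ratios(tables)
    w = [1 / v for _, v in studies]
    y_bar = sum(wi * y for (y, _), wi in zip(studies, w)) / sum(w)
    q = sum(wi * (y - y_bar) ** 2 for (y, _), wi in zip(studies, w))
    k = len(studies)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)          # truncated at zero
    w_star = [1 / (v + tau2) for _, v in studies]
    pooled = sum(wi * y for (y, _), wi in zip(studies, w_star)) / sum(w_star)
    return pooled, tau2

# Hypothetical data: three 2x2 tables with visible heterogeneity
tables = [(15, 85, 10, 90), (20, 80, 25, 75), (30, 70, 12, 88)]
pooled_lor, tau2 = dersimonian_laird(tables)
```

The cell counts here are moderate; the paper's point is that when any cell is small, the normal approximation behind these weights breaks down and the nominal Type I Error level is no longer maintained.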


2020 ◽  
Author(s):  
Robbie Cornelis Maria van Aert ◽  
Joris Mulder

Meta-analysis methods are used to synthesize the results of multiple studies on the same topic. The most frequently used statistical model in meta-analysis is the random-effects model, which contains parameters for the average effect, the between-study variance in the primary studies' true effect sizes, and random effects for the study-specific effects. We propose Bayesian hypothesis testing and estimation methods using the marginalized random-effects meta-analysis (MAREMA) model, in which the study-specific true effects are regarded as nuisance parameters and integrated out of the model. A flat prior distribution is placed on the overall effect size in the case of estimation, and a proper unit-information prior for the overall effect size is proposed in the case of hypothesis testing. For the between-study variance in true effect size, a proper uniform prior is placed on the proportion of total variance that can be attributed to between-study variability. Bayes factors are used for hypothesis testing, allowing tests of point and one-sided hypotheses. The proposed methodology has several attractive properties. First, the MAREMA model encompasses models with a zero, negative, and positive between-study variance, which enables testing a zero between-study variance because it is not a boundary problem. Second, the methodology is suitable for default Bayesian meta-analyses, as it requires no prior information about the unknown parameters. Third, the methodology can be used even in the extreme case when only two studies are available, because Bayes factors are not based on large-sample theory. We illustrate the developed methods by applying them to two meta-analyses and introduce easy-to-use software in the R package BFpack to compute the proposed Bayes factors.


2019 ◽  
Author(s):  
Francesco Margoni ◽  
Martin Shepperd

Infant research is making considerable progress. However, among infant researchers there is growing concern regarding the widespread habit of undertaking studies that have small sample sizes and employ tests with low statistical power (to detect a wide range of possible effects). For many researchers, issues of confidence may be partially resolved by relying on replications. Here, we bring further evidence that the classical logic of confirmation, according to which the result of a replication study confirms the original finding when it reaches statistical significance, could usefully be abandoned. With real examples taken from the infant literature and Monte Carlo simulations, we show that a very wide range of possible replication results would, in a formal statistical sense, constitute confirmation, as they can be explained simply by sampling error. Thus, often no useful conclusion can be derived from a single replication study or a small number of them. We suggest that, in order to accumulate and generate new knowledge, the dichotomous view of replication as confirmatory/disconfirmatory be replaced by an approach that emphasizes the estimation of effect sizes via meta-analysis. Moreover, we discuss possible solutions for reducing problems affecting the validity of conclusions drawn from meta-analyses in infant research.
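The sampling-error point can be illustrated with a toy Monte Carlo (all parameter values below are hypothetical, not taken from the paper): even when a small effect is real, single replication estimates scatter so widely that a non-significant or even negative replication result is entirely compatible with the original finding:

```python
import math
import random

random.seed(7)

TRUE_D = 0.15                         # hypothetical small true effect
N_PER_GROUP = 25
SE = math.sqrt(2.0 / N_PER_GROUP)     # approx. SE of the observed effect size

# Sampling distribution of a replication's effect size estimate:
# draw many hypothetical replications around the true effect.
reps = sorted(random.gauss(TRUE_D, SE) for _ in range(10_000))
q_low, q_high = reps[250], reps[9750]   # central 95% interval
```

With 25 infants per group, the central 95% of replication outcomes spans more than a full standard deviation of effect size, from clearly negative to several times the true effect, which is why a single replication's significance status carries so little confirmatory information.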


2018 ◽  
Author(s):  
Frank Renkewitz ◽  
Melanie Keiner

Publication biases and questionable research practices are assumed to be two of the main causes of the low replication rates observed in the social sciences. Both of these problems not only increase the proportion of false positives in the literature but can also lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect and correct such bias in meta-analytic results. We present an evaluation of the performance of six of these tools in detecting bias. To assess the Type I error rate and the statistical power of these tools, we simulated a large variety of literatures that differed with regard to underlying true effect size, heterogeneity, number of available primary studies, and variation of sample sizes in these primary studies. Furthermore, simulated primary studies were subjected to different degrees of publication bias. Our results show that the power of the detection methods follows a complex pattern. Across all simulated conditions, no method consistently outperformed all others. Hence, choosing an optimal method would require knowledge about parameters (e.g., true effect size, heterogeneity) that meta-analysts cannot have. Additionally, all methods performed badly when true effect sizes were heterogeneous or primary studies had a small chance of being published irrespective of their results. This suggests that, in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.


F1000Research ◽  
2016 ◽  
Vol 4 ◽  
pp. 1188 ◽  
Author(s):  
Daryl Bem ◽  
Patrizio E. Tressoldi ◽  
Thomas Rabeyron ◽  
Michael Duggan

In 2011, one of the authors (DJB) published a report of nine experiments in the Journal of Personality and Social Psychology purporting to demonstrate that an individual’s cognitive and affective responses can be influenced by randomly selected stimulus events that do not occur until after his or her responses have already been made and recorded, a generalized variant of the phenomenon traditionally denoted by the term precognition. To encourage replications, all materials needed to conduct them were made available on request. We here report a meta-analysis of 90 experiments from 33 laboratories in 14 countries which yielded an overall effect greater than 6 sigma, z = 6.40, p = 1.2 × 10⁻¹⁰, with an effect size (Hedges’ g) of 0.09. A Bayesian analysis yielded a Bayes factor of 5.1 × 10⁹, greatly exceeding the criterion value of 100 for “decisive evidence” in support of the experimental hypothesis. When DJB’s original experiments are excluded, the combined effect size for replications by independent investigators is 0.06, z = 4.16, p = 1.1 × 10⁻⁵, and the Bayes factor is 3,853, again exceeding the criterion for “decisive evidence.” The number of potentially unretrieved experiments required to reduce the overall effect size of the complete database to a trivial value of 0.01 is 544, and seven of eight additional statistical tests support the conclusion that the database is not significantly compromised by either selection bias or by intense “p-hacking”: the selective suppression of findings or analyses that failed to yield statistical significance. P-curve analysis, a recently introduced statistical technique, estimates the true effect size of the experiments to be 0.20 for the complete database and 0.24 for the independent replications, virtually identical to the effect size of DJB’s original experiments (0.22) and the closely related “presentiment” experiments (0.21).
We discuss the controversial status of precognition and other anomalous effects collectively known as psi.



2018 ◽  
Author(s):  
Robbie Cornelis Maria van Aert

More and more scientific research is published nowadays, calling for statistical methods that enable researchers to get an overview of the literature in a particular research field. For that purpose, meta-analysis methods were developed that statistically combine the effect sizes of independent primary studies on the same topic. My dissertation focuses on two issues that are crucial when conducting a meta-analysis: publication bias and heterogeneity in the primary studies' true effect sizes. Accurate estimation of both the meta-analytic effect size and the between-study variance in true effect size is crucial, since the results of meta-analyses are often used for policy making. Publication bias, which refers to situations where publication of a primary study depends on its results, distorts the results of a meta-analysis. We developed new meta-analysis methods, p-uniform and p-uniform*, which estimate effect sizes corrected for publication bias and also test for publication bias. Although the methods perform well in many conditions, these and the other existing methods are shown not to perform well when researchers use questionable research practices. Additionally, when publication bias is absent or limited, traditional methods that do not correct for publication bias outperform p-uniform and p-uniform*. Surprisingly, we found no strong evidence for the presence of publication bias in our pre-registered study of a large-scale data set consisting of 83 meta-analyses and 499 systematic reviews published in the fields of psychology and medicine. We also developed two methods for meta-analyzing a statistically significant published original study and a replication of that study, a situation often encountered by researchers. One method is frequentist, whereas the other is Bayesian.
Both methods are shown to perform better than traditional meta-analytic methods that do not take the statistical significance of the original study into account. Analytical studies of both methods also show that the original study is sometimes better discarded for optimal estimation of the true effect size. Furthermore, we developed a program for determining the required sample size in a replication, analogous to power analysis in null hypothesis testing. Computing the required sample size with this method revealed that large sample sizes (approximately 650 participants) are needed to distinguish a zero from a small true effect. Finally, in the last two chapters, we derived a new multi-step estimator for the between-study variance in the primary studies' true effect sizes and examined the statistical properties of two methods (the Q-profile and generalized Q-statistic methods) for computing the confidence interval of the between-study variance in true effect size. We proved that the multi-step estimator converges to the Paule-Mandel estimator, which is nowadays one of the recommended methods for estimating the between-study variance in true effect sizes. Two Monte Carlo simulation studies showed that the coverage probabilities of the Q-profile and generalized Q-statistic methods can be substantially below the nominal coverage rate if the assumptions underlying the random-effects meta-analysis model are violated.

