Obtaining Evidence for No Effect

2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Zoltan Dienes

Obtaining evidence that something does not exist requires knowing how big it would be were it to exist. Testing a theory that predicts an effect thus entails specifying the range of effect sizes consistent with the theory, in order to know when the evidence counts against the theory. Indeed, a theoretically relevant effect size must be specified for power calculations, equivalence testing, and Bayes factors in order that the inferential statistics test the theory. Specifying relevant effect sizes for power, or the equivalence region for equivalence testing, or the scale factor for Bayes factors, is necessary for many journal formats, such as registered reports, and should be necessary for all articles that use hypothesis testing. Yet there is little systematic advice on how to approach this problem. This article offers some principles and practical advice for specifying theoretically relevant effect sizes for hypothesis testing.
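As a concrete illustration of how a specified effect size enters the inference, the sketch below computes a Bayes factor in the style Dienes describes: a point null is compared against a model of H1 represented by a half-normal distribution whose scale is set to the effect size the theory predicts. The observed mean difference, its standard error, and the predicted effect are illustrative values, not taken from the article.

```python
import numpy as np
from scipy import stats, integrate

def bayes_factor_halfnormal(mean_obs, se_obs, predicted_effect):
    """Bayes factor for H1 (half-normal prior scaled by the predicted effect)
    against H0 (point null), using a normal likelihood for the observed
    effect estimate."""
    # Likelihood of the observed estimate given a population effect delta
    def likelihood(delta):
        return stats.norm.pdf(mean_obs, loc=delta, scale=se_obs)

    # Marginal likelihood under H1: integrate the likelihood over the
    # half-normal model of plausible effect sizes (scale = predicted effect)
    def integrand(delta):
        prior = 2 * stats.norm.pdf(delta, loc=0.0, scale=predicted_effect)
        return likelihood(delta) * prior

    marginal_h1, _ = integrate.quad(integrand, 0.0, np.inf)
    marginal_h0 = likelihood(0.0)   # point null: delta = 0
    return marginal_h1 / marginal_h0

# Illustrative numbers: an observed difference of 2 units (SE = 3) tested
# against a theory that predicts effects of roughly 5 units.
print(bayes_factor_halfnormal(mean_obs=2.0, se_obs=3.0, predicted_effect=5.0))
```

A Bayes factor well below 1 here would count as evidence against the theory precisely because the theory committed to a range of plausible effect sizes in advance.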


2020 ◽  
Author(s):  
Robbie Cornelis Maria van Aert ◽  
Joris Mulder

Meta-analysis methods are used to synthesize results of multiple studies on the same topic. The most frequently used statistical model in meta-analysis is the random-effects model, containing parameters for the average effect, the between-study variance in the primary studies' true effect sizes, and random effects for the study-specific effects. We propose Bayesian hypothesis testing and estimation methods using the marginalized random-effects meta-analysis (MAREMA) model, in which the study-specific true effects are regarded as nuisance parameters and integrated out of the model. A flat prior distribution is placed on the overall effect size for estimation, and a proper unit-information prior for the overall effect size is proposed for hypothesis testing. For the between-study variance in true effect size, a proper uniform prior is placed on the proportion of total variance that can be attributed to between-study variability. Bayes factors are used for hypothesis testing, allowing both point and one-sided hypotheses to be tested. The proposed methodology has several attractive properties. First, the MAREMA model encompasses models with a zero, negative, and positive between-study variance, which enables testing a zero between-study variance because it is not a boundary problem. Second, the methodology is suitable for default Bayesian meta-analyses as it requires no prior information about the unknown parameters. Third, the methodology can be used even in the extreme case when only two studies are available, because Bayes factors are not based on large-sample theory. We illustrate the developed methods by applying them to two meta-analyses and introduce easy-to-use software in the R package BFpack to compute the proposed Bayes factors.
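A minimal sketch of the marginalized likelihood the MAREMA model builds on: once the study-specific effects are integrated out, each observed effect y_i is normal with mean mu and variance v_i + tau^2. The code below computes a Bayes factor for mu = 0 against a normal prior on mu, holding tau^2 fixed at a plug-in value for simplicity; the authors' full method instead places a prior on the proportion of total variance due to between-study variability and is implemented in the R package BFpack. The effect sizes, sampling variances, and prior scale here are illustrative.

```python
import numpy as np
from scipy import stats, integrate

# Illustrative data: observed effect sizes and their sampling variances
y = np.array([0.30, 0.10, 0.45, 0.25])
v = np.array([0.04, 0.06, 0.05, 0.03])
tau2 = 0.02        # plug-in between-study variance (fixed for this sketch)

def marginal_loglik(mu):
    """Log-likelihood of the marginalized model: y_i ~ N(mu, v_i + tau2)."""
    return stats.norm.logpdf(y, loc=mu, scale=np.sqrt(v + tau2)).sum()

prior_sd = 0.5     # illustrative prior scale on the overall effect size

def integrand(mu):
    # Likelihood weighted by the prior on the overall effect under H1
    return np.exp(marginal_loglik(mu)) * stats.norm.pdf(mu, 0.0, prior_sd)

marginal_h1, _ = integrate.quad(integrand, -np.inf, np.inf)
marginal_h0 = np.exp(marginal_loglik(0.0))   # point null: mu = 0
print("BF10 =", marginal_h1 / marginal_h0)
```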


2019 ◽  
Vol 28 (4) ◽  
pp. 468-485 ◽  
Author(s):  
Paul HP Hanel ◽  
David MA Mehler

Transparent communication of research is key to fostering understanding within and beyond the scientific community. An increased focus on reporting effect sizes, in addition to p value–based significance statements or Bayes factors, may improve scientific communication with the general public. Across three studies (N = 652), we compared subjective informativeness ratings for five effect sizes, the Bayes factor, and commonly used significance statements. Results showed that Cohen's U3 was rated as most informative. For example, 440 participants (69%) found U3 more informative than Cohen's d, while 95 (15%) found d more informative than U3, and 99 participants (16%) found both effect sizes equally informative. This effect was not moderated by level of education. We therefore suggest that, in general, Cohen's U3 be used when scientific findings are communicated. However, the choice of effect size may vary depending on what a researcher wants to highlight (e.g. differences or similarities).
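For reference, Cohen's U3 is the proportion of the treatment group scoring above the mean of the control group; under normality with equal variances it has the closed form U3 = Φ(d). A small sketch with an illustrative d:

```python
from scipy.stats import norm

def cohens_u3(d):
    """Cohen's U3: proportion of the treatment group above the control-group
    mean, assuming normal distributions with equal variances."""
    return norm.cdf(d)

# Example: d = 0.5 corresponds to roughly 69% of the treated group scoring
# above the control-group mean.
print(f"U3 for d = 0.5: {cohens_u3(0.5):.2%}")
```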


2016 ◽  
Author(s):  
Felix D. Schönbrodt ◽  
Eric-Jan Wagenmakers ◽  
Michael Zehetleitner ◽  
Marco Perugini

Unplanned optional stopping rules have been criticized for inflating Type I error rates under the null hypothesis significance testing (NHST) paradigm. Despite these criticisms, this research practice is not uncommon, probably because it appeals to researchers' intuition to collect more data in order to push an indecisive result into a decisive region. In this contribution we investigate the properties of a procedure for Bayesian hypothesis testing that allows optional stopping with unlimited multiple testing, even after each participant. In this procedure, which we call Sequential Bayes Factors (SBF), Bayes factors are computed until an a priori defined level of evidence is reached. This allows flexible sampling plans and does not depend on correct effect size guesses in an a priori power analysis. We investigated the long-term rate of misleading evidence, the average expected sample sizes, and the bias of effect size estimates when an SBF design is applied to a test of mean differences between two groups. Compared to optimal NHST, the SBF design typically needs 50% to 70% smaller samples to reach a conclusion about the presence of an effect, while having the same or a lower long-term rate of wrong inference.
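The procedure can be sketched as follows: after each added observation (or batch), recompute the Bayes factor and stop once it crosses a pre-set evidence threshold in either direction. The sketch below uses a simple normal-approximation Bayes factor on the standardized mean difference rather than the default Bayes factor used in the article; the thresholds, prior scale, safety cap, and simulated data are all illustrative.

```python
import numpy as np
from scipy import stats, integrate

rng = np.random.default_rng(1)

def bf10_mean_difference(x, y, prior_sd=1.0):
    """Normal-approximation Bayes factor for a two-group mean difference:
    H1 places a normal prior (SD = prior_sd) on the standardized difference."""
    nx, ny = len(x), len(y)
    sp = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                 / (nx + ny - 2))
    d_obs = (np.mean(x) - np.mean(y)) / sp     # observed standardized difference
    se = np.sqrt(1 / nx + 1 / ny)              # approximate standard error of d_obs
    like = lambda d: stats.norm.pdf(d_obs, loc=d, scale=se)
    marg_h1, _ = integrate.quad(lambda d: like(d) * stats.norm.pdf(d, 0, prior_sd),
                                -np.inf, np.inf)
    return marg_h1 / like(0.0)

# Sequential design: add one participant per group until the Bayes factor
# crosses a symmetric evidence threshold (or a safety cap on n is reached).
x = list(rng.normal(0.4, 1, 10))   # illustrative true effect of d = 0.4
y = list(rng.normal(0.0, 1, 10))
bf = bf10_mean_difference(np.array(x), np.array(y))
while 1 / 10 < bf < 10 and len(x) < 1000:
    x.append(rng.normal(0.4, 1))
    y.append(rng.normal(0.0, 1))
    bf = bf10_mean_difference(np.array(x), np.array(y))
print(f"Stopped at n = {len(x)} per group with BF10 = {bf:.2f}")
```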


1998 ◽  
Vol 21 (2) ◽  
pp. 210-211 ◽  
Author(s):  
Stephan Lewandowsky ◽  
Murray Maybery

We take up two issues discussed by Chow: the claim by critics of hypothesis testing that the null hypothesis (H0) is always false, and the claim that reporting effect sizes is more appropriate than relying on statistical significance. Concerning the former, we agree with Chow's sentiment despite noting serious shortcomings in his discussion. Concerning the latter, we agree with Chow that effect size need not translate into scientific relevance, and furthermore reiterate that with small samples effect size measures cannot substitute for significance.


Author(s):  
Robbie C. M. van Aert ◽  
Joris Mulder

Abstract. Meta-analysis methods are used to synthesize results of multiple studies on the same topic. The most frequently used statistical model in meta-analysis is the random-effects model, containing parameters for the overall effect, the between-study variance in the primary studies' true effect sizes, and random effects for the study-specific effects. We propose Bayesian hypothesis testing and estimation methods using the marginalized random-effects meta-analysis (MAREMA) model, in which the study-specific true effects are regarded as nuisance parameters and integrated out of the model. We propose using a flat prior distribution on the overall effect size for estimation and a proper unit-information prior for the overall effect size for hypothesis testing. For the between-study variance (which can attain negative values under the MAREMA model), a proper uniform prior is placed on the proportion of total variance that can be attributed to between-study variability. Bayes factors are used for hypothesis testing, allowing both point and one-sided hypotheses to be tested. The proposed methodology has several attractive properties. First, the MAREMA model encompasses models with a zero, negative, and positive between-study variance, which enables testing a zero between-study variance because it is not a boundary problem. Second, the methodology is suitable for default Bayesian meta-analyses as it requires no prior information about the unknown parameters. Third, the proposed Bayes factors can be used even in the extreme case when only two studies are available, because Bayes factors are not based on large-sample theory. We illustrate the developed methods by applying them to two meta-analyses and introduce easy-to-use software in the R package BFpack to compute the proposed Bayes factors.


Methodology ◽  
2019 ◽  
Vol 15 (3) ◽  
pp. 97-105
Author(s):  
Rodrigo Ferrer ◽  
Antonio Pardo

Abstract. In a recent paper, Ferrer and Pardo (2014) tested several distribution-based methods designed to assess when test scores obtained before and after an intervention reflect a statistically reliable change. However, we still do not know how these methods perform with respect to false negatives. For this purpose, we simulated change scenarios (different effect sizes in a pre-post-test design) with distributions of different shapes and with different sample sizes. For each simulated scenario, we generated 1,000 samples. In each sample, we recorded the false-negative rate of the five distribution-based methods that had performed best with respect to false positives. Our results reveal unacceptable rates of false negatives even with very large effects, ranging from 31.8% in an optimistic scenario (effect size of 2.0 and a normal distribution) to 99.9% in the worst scenario (effect size of 0.2 and a highly skewed distribution). Therefore, our results suggest that the widely used distribution-based methods must be applied with caution in a clinical context, because they need huge effect sizes to detect a true change. However, we offer some considerations regarding the effect size and the commonly used cut-off points that allow our estimates to be more precise.
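One widely cited distribution-based method, useful here to illustrate what such methods do, is the Jacobson–Truax reliable change index (RCI): the pre–post difference divided by the standard error of the difference implied by the measure's reliability. The scores, baseline SD, and reliability below are illustrative, and the article evaluates several such methods, not this one alone.

```python
import math

def reliable_change_index(pre, post, sd_baseline, reliability):
    """Jacobson-Truax RCI: the pre-post difference divided by the standard
    error of a difference score implied by the measure's reliability."""
    sem = sd_baseline * math.sqrt(1 - reliability)   # standard error of measurement
    se_diff = math.sqrt(2) * sem                     # SE of a difference score
    return (post - pre) / se_diff

# Illustrative values: a 6-point improvement on a scale with SD = 10 and
# test-retest reliability of .80; |RCI| > 1.96 is the usual criterion.
rci = reliable_change_index(pre=30, post=24, sd_baseline=10, reliability=0.80)
print(f"RCI = {rci:.2f}")
```

In this illustrative case the 6-point change falls short of the 1.96 criterion, which echoes the article's point that distribution-based methods can require very large effects to flag a true change.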


2018 ◽  
Author(s):  
Nataly Beribisky ◽  
Heather Davidson ◽  
Rob Cribbie

Researchers often need to consider the practical significance of a relationship. For example, interpreting the magnitude of an effect size or establishing bounds in equivalence testing requires knowledge of how meaningful a relationship is. However, there has been little research exploring the degree of relationship among variables (e.g., correlation, mean difference) necessary for an association to be interpreted as meaningful or practically significant. In this study, we presented statistically trained and untrained participants with a collection of figures that displayed varying degrees of mean difference between groups or correlation among variables, and participants indicated whether or not each relationship was meaningful. The results suggest that statistically trained and untrained participants differ in how they qualify a meaningful relationship, and that there is substantial variability in how large a relationship must be before it is labeled meaningful. The results also shed some light on what degree of relationship is considered meaningful by individuals in a context-free setting.
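To make concrete how a judgment of meaningfulness feeds into equivalence testing, the sketch below runs two one-sided tests (TOST) on a mean difference against bounds chosen to represent the smallest difference considered meaningful; the bounds and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

def tost_mean_difference(x, y, low, high):
    """Two one-sided tests (TOST): the mean difference is declared equivalent
    to zero if it is significantly above `low` AND significantly below `high`."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    diff = np.mean(x) - np.mean(y)
    df = nx + ny - 2
    p_lower = 1 - stats.t.cdf((diff - low) / se, df)   # H0: diff <= low
    p_upper = stats.t.cdf((diff - high) / se, df)      # H0: diff >= high
    return max(p_lower, p_upper)   # reject both one-sided nulls to claim equivalence

rng = np.random.default_rng(0)
x, y = rng.normal(0.0, 1, 80), rng.normal(0.05, 1, 80)
# Bounds of +/- 0.4 encode the judgment that smaller differences are not meaningful.
print("TOST p-value:", round(tost_mean_difference(x, y, low=-0.4, high=0.4), 4))
```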

