Improving Inferences About Null Effects With Bayes Factors and Equivalence Tests

2018 ◽  
Vol 75 (1) ◽  
pp. 45-57 ◽  
Author(s):  
Daniël Lakens ◽  
Neil McLatchie ◽  
Peder M Isager ◽  
Anne M Scheel ◽  
Zoltan Dienes

Researchers often conclude an effect is absent when a null-hypothesis significance test yields a nonsignificant p value. However, it is neither logically nor statistically correct to conclude an effect is absent when a hypothesis test is not significant. We present two methods to evaluate the presence or absence of effects: Equivalence testing (based on frequentist statistics) and Bayes factors (based on Bayesian statistics). In four examples from the gerontology literature, we illustrate different ways to specify alternative models that can be used to reject the presence of a meaningful or predicted effect in hypothesis tests. We provide detailed explanations of how to calculate, report, and interpret Bayes factors and equivalence tests. We also discuss how to design informative studies that can provide support for a null model or for the absence of a meaningful effect. The conceptual differences between Bayes factors and equivalence tests are discussed, and we also note when and why they might lead to similar or different inferences in practice. It is important that researchers are able to falsify predictions or can quantify the support for predicted null effects. Bayes factors and equivalence tests provide useful statistical tools to improve inferences about null effects.
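A minimal sketch of the two tools the abstract describes, computed from summary statistics: a TOST equivalence test and a simple normal-prior Bayes factor. All numbers (means, SDs, group sizes, equivalence bounds, prior scale) are hypothetical, and the Bayes factor uses a normal approximation rather than the default priors discussed in the paper.

```python
# Sketch: TOST equivalence test and a simple Bayes factor for a
# two-sample comparison. All numbers below are hypothetical.
import math
from scipy import stats

m1, s1, n1 = 0.12, 1.0, 50   # group 1: mean, SD, n (hypothetical)
m2, s2, n2 = 0.02, 1.0, 50   # group 2
low, high = -0.5, 0.5        # equivalence bounds in raw units (the SESOI)

df = n1 + n2 - 2
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)  # pooled SD
se = sp * math.sqrt(1 / n1 + 1 / n2)
diff = m1 - m2

# TOST: two one-sided tests against each bound; the larger p decides.
t_low = (diff - low) / se    # tests H0: diff <= low
t_high = (diff - high) / se  # tests H0: diff >= high
p_tost = max(1 - stats.t.cdf(t_low, df), stats.t.cdf(t_high, df))
print(f"TOST p = {p_tost:.3f} (reject non-equivalence if below alpha)")

# Simple Bayes factor: H0: delta = 0 vs H1: delta ~ Normal(0, tau^2),
# using a normal approximation to the sampling distribution of diff.
tau = 0.5  # prior SD for the effect under H1 (an assumption)
like_h0 = stats.norm.pdf(diff, 0, se)
like_h1 = stats.norm.pdf(diff, 0, math.sqrt(se**2 + tau**2))
print(f"BF01 = {like_h0 / like_h1:.2f} (evidence for the null if > 1)")
```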


2018 ◽  
Author(s):  
Christopher Harms ◽  
Daniel Lakens

Being able to interpret 'null effects' is important for cumulative knowledge generation in science. To draw informative conclusions from null effects, researchers need to move beyond the incorrect interpretation of a non-significant result in a null-hypothesis significance test as evidence of the absence of an effect. We explain how to statistically evaluate null results using equivalence tests, Bayesian estimation, and Bayes factors. A worked example demonstrates how to apply these statistical tools and interpret the results. Finally, we explain how no statistical approach can actually prove that the null hypothesis is true, and briefly discuss the philosophical differences between statistical approaches to examine null effects. The increasing availability of easy-to-use software and online tools to perform equivalence tests, Bayesian estimation, and Bayes factor calculations makes it timely and feasible to complement or move beyond traditional null-hypothesis tests, and allows researchers to draw more informative conclusions about null effects.
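Of the three tools this abstract lists, equivalence tests and Bayes factors are sketched above, so here is the Bayesian-estimation route: the posterior probability that the effect falls in a region of practical equivalence (ROPE). The conjugate normal model and all numbers are assumptions for illustration.

```python
# Sketch: Bayesian estimation of a null result via a ROPE.
# Conjugate normal model; prior, data, and ROPE bounds are hypothetical.
from scipy import stats

prior_mean, prior_sd = 0.0, 1.0   # weakly informative prior (assumption)
obs_effect, obs_se = 0.05, 0.10   # observed effect and its standard error
rope = (-0.2, 0.2)                # region of practical equivalence

# Normal-normal conjugate update.
prior_prec = 1 / prior_sd**2
data_prec = 1 / obs_se**2
post_sd = (prior_prec + data_prec) ** -0.5
post_mean = (prior_prec * prior_mean + data_prec * obs_effect) * post_sd**2

post = stats.norm(post_mean, post_sd)
p_in_rope = post.cdf(rope[1]) - post.cdf(rope[0])
print(f"Posterior P(effect in ROPE) = {p_in_rope:.3f}")
```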


2017 ◽  
Vol 43 (4) ◽  
pp. 407-439 ◽  
Author(s):  
Jodi M. Casabianca ◽  
Charles Lewis

The null hypothesis test used in differential item functioning (DIF) detection tests for a subgroup difference in item-level performance—if the null hypothesis of “no DIF” is rejected, the item is flagged for DIF. Conversely, an item is kept in the test form if there is insufficient evidence of DIF. We present frequentist and empirical Bayes approaches for implementing statistical equivalence testing for DIF using the Mantel–Haenszel (MH) DIF statistic. With these approaches, rejection of the null hypothesis of “DIF” allows the conclusion of statistical equivalence, a more stringent criterion for keeping items. In other words, the roles of the null and alternative hypotheses are interchanged in order to have positive evidence that the DIF of an item is small. A simulation study compares the equivalence testing approaches to the traditional MH DIF detection method with the Educational Testing Service classification system. We illustrate the methods with item response data from the 2012 Programme for International Student Assessment.
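A rough sketch of the role-reversal idea on the Mantel-Haenszel log-odds-ratio scale: instead of testing the null of "no DIF", two one-sided tests are run against an equivalence bound (here the ETS category-A bound of 1 on the delta scale, i.e. 1/2.35 in log-odds units). The estimate and standard error are hypothetical, and the paper's frequentist and empirical Bayes procedures are more elaborate than this.

```python
# Sketch: equivalence test for negligible DIF on the MH log-odds scale.
# Estimate, SE, and the bound mapping are illustrative assumptions.
from scipy import stats

log_or, se = 0.15, 0.12  # MH log-odds-ratio estimate and SE (hypothetical)
bound = 1.0 / 2.35       # |MH D-DIF| < 1 on the ETS delta scale

z_low = (log_or + bound) / se   # tests H0: log_or <= -bound
z_high = (log_or - bound) / se  # tests H0: log_or >= +bound
p_equiv = max(1 - stats.norm.cdf(z_low), stats.norm.cdf(z_high))
print(f"Equivalence p = {p_equiv:.3f}; small p supports 'negligible DIF'")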


2020 ◽  
Vol 3 (3) ◽  
pp. 300-308
Author(s):  
Bence Palfi ◽  
Zoltan Dienes

Psychologists are often interested in whether an independent variable has a different effect in condition A than in condition B. To test such a question, one needs to directly compare the effect of that variable in the two conditions (i.e., test the interaction). Yet many researchers tend to stop when they find a significant test in one condition and a nonsignificant test in the other condition, deeming this as sufficient evidence for a difference between the two conditions. In this Tutorial, we aim to raise awareness of this inferential mistake when Bayes factors are used with conventional cutoffs to draw conclusions. For instance, some researchers might falsely conclude that there must be good-enough evidence for the interaction if they find good-enough Bayesian evidence for the alternative hypothesis, H1, in condition A and good-enough Bayesian evidence for the null hypothesis, H0, in condition B. The case study we introduce highlights that ignoring the test of the interaction can lead to unjustified conclusions and demonstrates that the principle that any assertion about the existence of an interaction necessitates the direct comparison of the conditions is as true for Bayesian as it is for frequentist statistics. We provide an R script of the analyses of the case study and a Shiny app that can be used with a 2 × 2 design to develop intuitions on this issue, and we introduce a rule of thumb with which one can estimate the sample size one might need to have a well-powered design.
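A small simulation makes the fallacy concrete: with a true effect in both conditions, one per-condition test can come out significant and the other not, while a direct test of the interaction contrast is far from significant. The cell means, SDs, and the pooled-df approximation are illustrative; the authors' own materials are an R script and a Shiny app, so this Python analogue is only a sketch.

```python
# Sketch: "significant in A, nonsignificant in B" is not a test of the
# interaction; the interaction contrast must be tested directly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 40  # per cell
# True effect of the independent variable: 0.5 in A, 0.25 in B.
a1, a2 = rng.normal(0.5, 1, n), rng.normal(0.0, 1, n)   # condition A
b1, b2 = rng.normal(0.25, 1, n), rng.normal(0.0, 1, n)  # condition B

print(f"A: p = {stats.ttest_ind(a1, a2).pvalue:.3f}")  # may be significant
print(f"B: p = {stats.ttest_ind(b1, b2).pvalue:.3f}")  # may be nonsignificant

# Interaction contrast: (mean a1 - mean a2) - (mean b1 - mean b2).
contrast = (a1.mean() - a2.mean()) - (b1.mean() - b2.mean())
se = np.sqrt(a1.var(ddof=1)/n + a2.var(ddof=1)/n +
             b1.var(ddof=1)/n + b2.var(ddof=1)/n)
t = contrast / se
df = 4 * (n - 1)  # simple pooled-df approximation
print(f"Interaction: t = {t:.2f}, p = {2 * stats.t.sf(abs(t), df):.3f}")
```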


Author(s):  
Richard McCleary ◽  
David McDowall ◽  
Bradley J. Bartos

Chapter 6 addresses the sub-category of internal validity defined by Shadish et al. as statistical conclusion validity, or "validity of inferences about the correlation (covariance) between treatment and outcome." The common threats to statistical conclusion validity can arise, or become plausible, through either model misspecification or hypothesis testing. The risk of a serious model misspecification is inversely proportional to the length of the time series, for example, and so is the risk of misstating the Type I and Type II error rates. Threats to statistical conclusion validity arise from both the classical and the modern hybrid significance testing structures; the serious threats that weigh heavily on p-value tests are shown to be undefined in Bayesian tests. While the particularly vexing threats raised by modern null hypothesis testing are resolved by eliminating the modern null hypothesis test, threats to statistical conclusion validity would inevitably persist and new threats would arise.


Author(s):  
Carlos A. de B. Pereira ◽  
Adriano Polpo ◽  
Eduardo Y. Nakano

The main objective of this paper is to find a close link between the adaptive level of significance, presented here, and the sample size. We statisticians know of the inconsistency, or paradox, in the current classical tests of significance that are based on p-value statistics compared to the canonical significance levels (10%, 5%, and 1%): "Raise the sample size to reject the null hypothesis" is the recommendation of some ill-advised scientists! This paper will show that it is possible to eliminate this problem of significance tests. The Bayesian Lindley's paradox – "increase the sample size to accept the hypothesis" – also disappears. Obviously, we present here only the beginning of a promising line of research. The intention is to extend its use to more complex applications such as survival analysis, reliability tests and other areas. The main tools used here are the Bayes Factor and the extended Neyman-Pearson Lemma.
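The paradox the authors describe is easy to reproduce numerically: hold the z statistic at 1.96 (so p is approximately .05) while n grows, and a Bayes factor for the point null swings toward H0. The prior scale tau and the known sigma below are assumptions.

```python
# Sketch of Lindley's paradox: p stays at ~.05 while BF01 grows with n.
# H0: mu = 0 vs H1: mu ~ Normal(0, tau^2); sigma assumed known.
import math

tau, sigma = 1.0, 1.0  # prior SD under H1 and data SD (assumptions)
for n in (10, 100, 1000, 10000):
    se = sigma / math.sqrt(n)
    xbar = 1.96 * se  # sample mean tuned so that p ~= .05 at every n
    # BF01 = Normal(xbar; 0, se^2) / Normal(xbar; 0, tau^2 + se^2)
    var0, var1 = se**2, tau**2 + se**2
    bf01 = math.sqrt(var1 / var0) * math.exp(-0.5 * xbar**2 * (1/var0 - 1/var1))
    print(f"n = {n:>6}: p ~= .05, BF01 = {bf01:.1f}")
```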


Author(s):  
D. Brynn Hibbert ◽  
J. Justin Gooding

• To understand the concept of the null hypothesis and the role of Type I and Type II errors.
• To test that data are normally distributed and whether a datum is an outlier.
• To determine whether there is systematic error in the mean of measurement results.
• To perform tests to compare the means of two sets of data.

One of the uses to which data analysis is put is to answer questions about the data, or about the system that the data describe. In the former category are "is the data normally distributed?" and "are there any outliers in the data?" (see the discussions in chapter 1). Questions about the system might be "is the level of alcohol in the suspect's blood greater than 0.05 g/100 mL?" or "does the new sensor give the same results as the traditional method?" In answering these questions we determine the probability of finding the data given the truth of a stated hypothesis—hence "hypothesis testing." A hypothesis is a statement that might, or might not, be true. Usually the hypothesis is set up in such a way that it is possible to calculate the probability (P) of the data (or the test statistic calculated from the data) given the hypothesis, and then to make a decision about whether the hypothesis is to be accepted (high P) or rejected (low P). A particular case of a hypothesis test is one that determines whether or not the difference between two values is significant—a significance test. For this case we actually put forward the hypothesis that there is no real difference and the observed difference arises from random effects: it is called the null hypothesis (H0). If the probability that the data are consistent with the null hypothesis falls below a predetermined low value (say 0.05 or 0.01), then the hypothesis is rejected at that probability. Therefore, p < 0.05 means that if the null hypothesis were true we would find the observed data (or more accurately the value of the statistic, or greater, calculated from the data) in less than 5% of repeated experiments.
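The two worked questions translate directly into standard tests; a sketch with invented measurement values:

```python
# Sketch of the chapter's two example questions as significance tests.
# All measurement values are hypothetical.
from scipy import stats

# "Is blood alcohol greater than 0.05 g/100 mL?" -> one-sample, one-sided.
replicates = [0.053, 0.057, 0.052, 0.055]  # hypothetical replicate measurements
res = stats.ttest_1samp(replicates, popmean=0.05, alternative='greater')
print(f"one-sided p = {res.pvalue:.4f}")  # small p: reject H0 (mu <= 0.05)

# "Does the new sensor agree with the traditional method?" -> two-sample.
new = [1.02, 0.98, 1.01, 1.00, 0.99]
trad = [1.00, 1.01, 0.97, 1.02, 1.00]
res2 = stats.ttest_ind(new, trad)
print(f"two-sided p = {res2.pvalue:.4f}")  # large p: no evidence of bias
```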


2018 ◽  
Author(s):  
Daniel Lakens ◽  
Marie Delacre

To move beyond the limitations of null-hypothesis tests, statistical approaches have been developed in which the observed data are compared against a range of values that are equivalent to the absence of a meaningful effect. Specifying a range of values around zero allows researchers to statistically reject the presence of effects large enough to matter, and prevents practically insignificant effects from being interpreted as a statistically significant difference. We compare the behavior of the recently proposed second generation p-value (Blume, D'Agostino McGowan, Dupont, & Greevy, 2018) with the more established Two One-Sided Tests (TOST) equivalence testing procedure (Schuirmann, 1987). We show that the two approaches yield almost identical results under optimal conditions. Under suboptimal conditions (e.g., when the confidence interval is wider than the equivalence range, or when confidence intervals are asymmetric) the second generation p-value becomes difficult to interpret as a descriptive statistic. The second generation p-value is interpretable in a dichotomous manner (i.e., when the SGPV equals 0 or 1 because the confidence interval lies completely outside or within the equivalence range), but this dichotomous interpretation does not require calculations. We conclude that equivalence tests yield more consistent p-values, distinguish between datasets that yield the same second generation p-value, and allow for easier control of Type I and Type II error rates.
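A sketch of both quantities for one hypothetical dataset, following the interval-overlap definition of the SGPV in Blume et al. (2018) with its small-sample correction, next to the TOST p value for the same estimate. Estimate, SE, df, and equivalence range are invented:

```python
# Sketch: second generation p-value (SGPV) and TOST for the same interval.
import math
from scipy import stats

est, se, df = 0.10, 0.15, 58        # effect estimate, SE, df (hypothetical)
eq_low, eq_high = -0.3, 0.3         # equivalence range (the H0 interval)

# SGPV uses a full (here 95%) CI; TOST at alpha = .05 uses a 90% CI.
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = est - t_crit * se, est + t_crit * se

overlap = max(0.0, min(ci_high, eq_high) - max(ci_low, eq_low))
ci_len, h0_len = ci_high - ci_low, eq_high - eq_low
# Small-sample correction caps the SGPV at 1/2 for very wide intervals.
sgpv = (overlap / ci_len) * max(ci_len / (2 * h0_len), 1)
print(f"SGPV = {sgpv:.3f}")  # 1 supports equivalence, 0 rejects it

# TOST p-value for the same data.
p_tost = max(1 - stats.t.cdf((est - eq_low) / se, df),
             stats.t.cdf((est - eq_high) / se, df))
print(f"TOST p = {p_tost:.4f}")
```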


2021 ◽  
Author(s):  
Alexander Ly ◽  
Eric-Jan Wagenmakers

The “Full Bayesian Significance Test e-value”, henceforth FBST ev, has received increasing attention across a range of disciplines including psychology. We show that the FBST ev leads to four problems: (1) the FBST ev cannot quantify evidence in favor of a null hypothesis and therefore also cannot discriminate “evidence of absence” from “absence of evidence”; (2) the FBST ev is susceptible to sampling to a foregone conclusion; (3) the FBST ev violates the principle of predictive irrelevance, such that it is affected by data that are equally likely to occur under the null hypothesis and the alternative hypothesis; (4) the FBST ev suffers from the Jeffreys-Lindley paradox in that it does not include a correction for selection. These problems also plague the frequentist p-value. We conclude that although the FBST ev may be an improvement over the p-value, it does not provide a reasonable measure of evidence against the null hypothesis.
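For a normal posterior and a point null, the FBST e-value has a closed form, which makes problem (1) easy to see: ev is a tail-area-like quantity that can only measure conflict with H0, never support for it. A sketch with hypothetical posterior numbers:

```python
# Sketch: FBST e-value for a point null under a normal posterior.
# Posterior numbers are hypothetical.
from scipy import stats

post_mean, post_sd = 0.20, 0.10  # posterior for theta (hypothetical)
theta0 = 0.0                     # point null H0: theta = theta0

# Tangential set T = {theta : posterior density > density at theta0}.
# For a normal posterior, P(T) = P(|theta - m| < |theta0 - m|).
z = abs(theta0 - post_mean) / post_sd
ev = 2 * (1 - stats.norm.cdf(z))  # e-value = 1 - P(T)
print(f"ev = {ev:.4f}")  # small ev casts doubt on H0; a large ev does NOT
                         # quantify support for H0, only lack of conflict
```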


2017 ◽  
Author(s):  
Daniel Lakens ◽  
Anne M. Scheel ◽  
Peder Mortvedt Isager

Psychologists must be able to test both for the presence of an effect and for the absence of an effect. In addition to testing against zero, researchers can use the Two One-Sided Tests (TOST) procedure to test for equivalence and reject the presence of a smallest effect size of interest (SESOI). TOST can be used to determine if an observed effect is surprisingly small, given that a true effect at least as large as the SESOI exists. We explain a range of approaches to determine the SESOI in psychological science, and provide detailed examples of how equivalence tests should be performed and reported. Equivalence tests are an important extension of statistical tools psychologists currently use, and enable researchers to falsify predictions about the presence, and declare the absence, of meaningful effects.
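For planning, a normal-approximation sample-size calculation for TOST (true effect zero, SESOI expressed as Cohen's d) can be sketched in a few lines; the alpha, power, and SESOI below are illustrative choices, not the paper's recommendations:

```python
# Sketch: approximate n per group for a TOST equivalence test to have the
# desired power when the true effect is zero. Inputs are illustrative.
import math
from scipy import stats

alpha, power, sesoi_d = 0.05, 0.90, 0.5  # SESOI as Cohen's d (assumption)
z_a = stats.norm.ppf(1 - alpha)
z_b = stats.norm.ppf(1 - (1 - power) / 2)  # beta split over the two bounds
n_per_group = 2 * ((z_a + z_b) / sesoi_d) ** 2
print(f"n per group ~ {math.ceil(n_per_group)}")  # ~87 for these inputs
```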

