The Importance of Random Slopes in Mixed Models for Bayesian Hypothesis Testing

2021 ◽  
Author(s):  
Klaus Oberauer

Mixed models are gaining popularity in psychology. For frequentist mixed models, Barr, Levy, Scheepers, and Tily (2013) showed that excluding random slopes – differences between individuals in the direction and size of an effect – from a model when they are present in the data can lead to a substantial increase in false-positive conclusions in null-hypothesis tests. Here I demonstrate through five simulations that the same is true for Bayesian hypothesis testing with mixed models, which often yields Bayes factors reflecting very strong evidence for a mean effect at the population level even when there is no such effect. Including random slopes in the model largely eliminates the risk of strong false positives, but it reduces the chance of obtaining strong evidence for true effects. I recommend starting the analysis by testing the support for random slopes in the data, and removing them from the model only if there is clear evidence against them.
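
The recommended workflow lends itself to a model comparison in which the random-slopes structure is itself tested before the fixed effect. Below is a minimal sketch of that idea using the BayesFactor package's lmBF(), not the paper's own code; the data frame dat and the columns rt, cond, and subject are hypothetical:

    library(BayesFactor)

    dat$subject <- factor(dat$subject)
    dat$cond    <- factor(dat$cond)

    # Model with random intercepts and random slopes for condition
    bf_slopes <- lmBF(rt ~ cond + subject + cond:subject, data = dat,
                      whichRandom = c("subject", "cond:subject"))

    # Model with random intercepts only
    bf_intercepts <- lmBF(rt ~ cond + subject, data = dat,
                          whichRandom = "subject")

    # Evidence for random slopes: keep them in the model unless this
    # Bayes factor clearly favors the intercepts-only model
    bf_slopes / bf_intercepts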


Epilepsy ◽  
2011 ◽  
pp. 241-248
Author(s):  
Ralph Andrzejak ◽  
Daniel Chicharro ◽  
Florian Mormann

2017 ◽  
Author(s):  
Mirko Thalmann ◽  
Marcel Niklaus ◽  
Klaus Oberauer

In recent years, statisticians have advocated the use of mixed-effects models and Bayesian statistics. Mixed-effects models allow researchers to adequately account for the structure in the data. Bayesian statistics, in contrast to frequentist statistics, can quantify the evidence in favor of or against an effect of interest. For frequentist statistical methods, it is known that mixed models can lead to serious over-estimation of evidence in favor of an effect (i.e., an inflated Type-I error rate) when they fail to include individual differences in the effect sizes of predictors ("random slopes") that are actually present in the data. Here, we show through simulation that the same problem exists for Bayesian mixed models. Yet, at present there is no easy-to-use application that allows for the estimation of Bayes factors for mixed models with random slopes on continuous predictors. We close this gap by introducing a new R package called BayesRS. We tested its functionality in four simulation studies, which show that BayesRS offers a reliable and valid tool to compute Bayes factors. BayesRS also allows users to account for correlations between random effects. In a fifth simulation study we show, however, that doing so leads to slight underestimation of the evidence in favor of an actually present effect. We therefore recommend modeling correlations between random effects only when they are of primary interest and when the sample size is large enough. BayesRS is available at https://cran.r-project.org/web/packages/BayesRS/, and R code for all simulations is available at https://osf.io/nse5x/?view_only=b9a7caccd26a4764a084de3b8d459388.
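
The abstract does not show the BayesRS interface itself, so the sketch below illustrates the same kind of Bayes factor, for a fixed effect of a continuous predictor with random slopes, by a different route: bridge sampling with the brms package. The data frame dat and the columns y, x, and id are hypothetical:

    library(brms)

    # Full model: fixed effect of x plus by-participant random intercepts
    # and slopes. A proper prior on the fixed slope is needed for a
    # meaningful Bayes factor, and save_pars is required for bridge sampling.
    m_full <- brm(y ~ x + (1 + x | id), data = dat,
                  prior = prior(normal(0, 1), class = "b"),
                  save_pars = save_pars(all = TRUE))
    m_null <- brm(y ~ 1 + (1 + x | id), data = dat,
                  save_pars = save_pars(all = TRUE))

    # Bridge-sampling Bayes factor for the fixed effect of x, with the
    # random slopes retained in both models
    bayes_factor(m_full, m_null)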


2019 ◽  
Vol 3 (Supplement_1) ◽  
pp. S773-S773
Author(s):  
Christopher Brydges ◽  
Allison A Bielak

Objective: Non-significant p values derived from null hypothesis significance testing do not distinguish between true null effects and cases where the data are insensitive in distinguishing the hypotheses. This study aimed to investigate the prevalence in gerontological psychology of Bayesian analyses, a statistical technique that can distinguish between conclusive and inconclusive non-significant results, and to use Bayes factors (BFs) to reanalyze non-significant results from published gerontological research. Method: Non-significant results mentioned in abstracts of articles published in 2017 volumes of ten top gerontological psychology journals were extracted (N = 409) and categorized based on whether Bayesian analyses were conducted. BFs were calculated from non-significant t-tests within this sample to determine how frequently the null hypothesis was strongly supported. Results: Non-significant results were directly tested with Bayes factors in 1.22% of studies. Bayesian reanalyses of 195 non-significant t-tests found that only 7.69% of the findings provided strong evidence in support of the null hypothesis. Conclusions: Bayesian analyses are rarely used in gerontological research, and a large proportion of null findings were deemed inconclusive when reanalyzed with BFs. Researchers are encouraged to use BFs to test the validity of non-significant results and to ensure that sufficient sample sizes are used so that the meaningfulness of null findings can be evaluated.
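
A hedged sketch of the kind of reanalysis described: the BayesFactor package can compute a Bayes factor directly from a reported t statistic and the sample sizes. The numbers below are illustrative, not taken from the study:

    library(BayesFactor)

    # Reanalyze a reported non-significant result, e.g. t(58) = 1.10 from
    # an independent-samples design with n1 = n2 = 30 (illustrative values).
    # simple = TRUE returns BF10 on the raw (not log) scale.
    bf10 <- ttest.tstat(t = 1.10, n1 = 30, n2 = 30, simple = TRUE)
    bf01 <- 1 / bf10   # evidence for the null over the alternative

    # Common rule of thumb: BF01 > 3 is moderate and BF01 > 10 is strong
    # evidence for the null; values near 1 are inconclusive.
    bf01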


2019 ◽  
Author(s):  
Christopher Brydges

Objective: Non-significant p values derived from null hypothesis significance testing do not distinguish between true null effects and cases where the data are insensitive in distinguishing the hypotheses. This study aimed to investigate the prevalence in gerontological psychology of Bayesian analyses, a statistical technique that can distinguish between conclusive and inconclusive non-significant results, and to use Bayes factors (BFs) to reanalyze non-significant results from published gerontological research. Method: Non-significant results mentioned in abstracts of articles published in 2017 volumes of ten top gerontological psychology journals were extracted (N = 409) and categorized based on whether Bayesian analyses were conducted. BFs were calculated from non-significant t-tests within this sample to determine how frequently the null hypothesis was strongly supported. Results: Non-significant results were directly tested with Bayes factors in 1.22% of studies. Bayesian reanalyses of 195 non-significant t-tests found that only 7.69% of the findings provided strong evidence in support of the null hypothesis. Conclusions: Bayesian analyses are rarely used in gerontological research, and a large proportion of null findings were deemed inconclusive when reanalyzed with BFs. Researchers are encouraged to use BFs to test the validity of non-significant results and to ensure that sufficient sample sizes are used so that the meaningfulness of null findings can be evaluated.


2019 ◽  
Vol 62 (12) ◽  
pp. 4544-4553 ◽  
Author(s):  
Christopher R. Brydges ◽  
Laura Gaeta

Purpose Null hypothesis significance testing is commonly used in audiology research to determine the presence of an effect. Knowledge of study outcomes, including nonsignificant findings, is important for evidence-based practice. Nonsignificant p values obtained from null hypothesis significance testing cannot differentiate between true null effects and underpowered studies. Bayes factors (BFs) are a statistical technique that can distinguish between conclusive and inconclusive nonsignificant results, and quantify the strength of evidence in favor of one hypothesis over another. This study aimed to investigate the prevalence of BFs in nonsignificant results in audiology research and the strength of evidence in favor of the null hypothesis in these results. Method Nonsignificant results mentioned in abstracts of articles published in 2018 volumes of 4 prominent audiology journals were extracted (N = 108) and categorized based on whether BFs were calculated. BFs were calculated from nonsignificant t tests within this sample to determine how frequently the null hypothesis was strongly supported. Results Nonsignificant results were not directly tested with BFs in any study. Bayesian re-analysis of 93 nonsignificant t tests found that only 40.86% of findings provided moderate evidence in favor of the null hypothesis, and none provided strong evidence. Conclusion BFs are underutilized in audiology research, and a large proportion of null findings were deemed inconclusive when re-analyzed with BFs. Researchers are encouraged to use BFs to test the validity and strength of evidence of nonsignificant results and to ensure that sufficient sample sizes are used so that conclusive findings (significant or not) are observed more frequently. Supplemental Material https://osf.io/b4kc7/
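
As a rough illustration of the evidence categories referred to here (a conventional rule of thumb, not the authors' code): BF01 values between 3 and 10 count as moderate evidence for the null, values above 10 as strong, and values near 1 as inconclusive:

    # Categorize illustrative BF01 values (not the study's data) by the
    # conventional thresholds of 1, 3, and 10
    bf01 <- c(0.8, 2.1, 3.5, 6.0, 12.4)
    cut(bf01, breaks = c(0, 1, 3, 10, Inf),
        labels = c("favors alternative", "inconclusive",
                   "moderate for null", "strong for null"))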


2017 ◽  
Vol 122 (1) ◽  
pp. 91-95 ◽  
Author(s):  
Douglas Curran-Everett

Statistics is essential to the process of scientific discovery. An inescapable tenet of statistics, however, is the notion of uncertainty, which has reared its head within the arena of reproducibility of research. The Journal of Applied Physiology’s recent initiative, “Cores of Reproducibility in Physiology,” is designed to improve the reproducibility of research: each article elucidates the principles and nuances of using some piece of scientific equipment or some experimental technique so that other researchers can obtain reproducible results. But other researchers can use some piece of equipment or some technique with expert skill and still fail to replicate an experimental result if they neglect the fundamental statistical concepts of hypothesis testing and estimation and their inescapable connection to the reproducibility of research. If we want to improve the reproducibility of our research, then we want to minimize the chance of a false positive and, at the same time, minimize the chance of a false negative. In this review I outline strategies to accomplish each of these things. These strategies are intimately related to fundamental concepts of statistics and the inherent uncertainty embedded in them.
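
One concrete strategy that controls both error rates at once is an a priori power analysis: fix the false-positive rate through the significance level, fix the false-negative rate through the desired power, and solve for the sample size. A minimal sketch in base R, with illustrative numbers:

    # Solve for the per-group sample size needed to detect a
    # physiologically meaningful difference (numbers illustrative)
    power.t.test(delta = 5,          # smallest meaningful difference
                 sd = 10,            # expected standard deviation
                 sig.level = 0.05,   # false-positive (Type I) rate
                 power = 0.80)       # 1 - false-negative (Type II) rate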


Author(s):  
Heinrich A. Backmann ◽  
Marthe Larsen ◽  
Anders S. Danielsen ◽  
Solveig Hofvind

Objective To analyze the association between radiologists’ performance and image position within a batch in screen reading of mammograms in Norway. Method We described true and false positives and true and false negatives by groups of image positions and batch sizes for 2,937,312 screen readings performed from 2012 to 2018. Mixed-effects models were used to obtain adjusted proportions of true and false positives, true and false negatives, sensitivity, and specificity for different image positions. We adjusted for time of day and weekday and included the individual variation between the radiologists as random effects. Time spent reading was included in an additional model to explore a possible mediation effect. Results True and false positives were negatively associated with image position within the batch, while the rates of true and false negatives were positively associated. In the adjusted analyses, the rate of true positives was 4.0 per 1000 readings (95% CI: 3.8–4.2) for image position 10 and 3.9 (95% CI: 3.7–4.1) for image position 60. The rate of true negatives was 94.4% (95% CI: 94.0–94.8) for image position 10 and 94.8% (95% CI: 94.4–95.2) for image position 60. Per 1000 readings, the rate of false negatives was 0.60 (95% CI: 0.53–0.67) for image position 10 and 0.62 (95% CI: 0.55–0.69) for image position 60. Conclusion The radiologists’ sensitivity decreased throughout the batch; although this effect was small, our results may be clinically relevant at a population level, or when the differences are multiplied by the number of screen readings performed by individual radiologists. Key Points • True and false positive reading scores were negatively associated with image position within a batch. • A decreasing trend of positive scores indicated a beneficial effect of a certain number of screen readings within a batch. • False negative scores increased throughout the batch, but the association was not statistically significant.
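
A minimal sketch (not the authors' code) of the kind of adjusted mixed-effects model described: a logistic regression of a positive score on image position, adjusted for time of day and weekday, with a random intercept per radiologist. The data frame readings and its column names are assumptions:

    library(lme4)

    # Mixed-effects logistic regression: probability of a positive score
    # as a function of image position, adjusted for time of day and
    # weekday, with radiologist-level variation as a random intercept
    fit <- glmer(positive ~ position + time_of_day + weekday +
                   (1 | radiologist),
                 data = readings, family = binomial)
    summary(fit)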


2020 ◽  
Author(s):  
Robbie Cornelis Maria van Aert ◽  
Joris Mulder

Meta-analysis methods are used to synthesize the results of multiple studies on the same topic. The most frequently used statistical model in meta-analysis is the random-effects model, which contains parameters for the average effect, the between-study variance in the primary studies’ true effect sizes, and random effects for the study-specific effects. We propose Bayesian hypothesis testing and estimation methods using the marginalized random-effects meta-analysis (MAREMA) model, in which the study-specific true effects are regarded as nuisance parameters and integrated out of the model. A flat prior distribution is placed on the overall effect size in the case of estimation, and a proper unit-information prior for the overall effect size is proposed in the case of hypothesis testing. For the between-study variance in true effect size, a proper uniform prior is placed on the proportion of total variance that can be attributed to between-study variability. Hypothesis testing is carried out with Bayes factors, which allow testing of point and one-sided hypotheses. The proposed methodology has several attractive properties. First, the MAREMA model encompasses models with a zero, negative, and positive between-study variance, which makes it possible to test for a zero between-study variance because zero is not on the boundary of the parameter space. Second, the methodology is suitable for default Bayesian meta-analyses, as it requires no prior information about the unknown parameters. Third, the methodology can be used even in the extreme case when only two studies are available, because Bayes factors are not based on large-sample theory. We illustrate the developed methods by applying them to two meta-analyses and introduce easy-to-use software in the R package BFpack to compute the proposed Bayes factors.
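
A hedged sketch of how such an analysis might look in practice: fit a standard random-effects meta-analysis with metafor and pass the fit to BFpack's BF(), which the abstract points to for the proposed Bayes factors. That BF() accepts an rma() fit, and the column names yi and vi, are assumptions here:

    library(metafor)
    library(BFpack)

    # Random-effects meta-analysis of effect sizes yi with sampling
    # variances vi (dat is a hypothetical data frame of primary studies)
    fit <- rma(yi = yi, vi = vi, data = dat)

    # By default, BF() reports exploratory tests of each parameter being
    # zero, negative, or positive, covering both the point and one-sided
    # hypotheses on the overall effect
    BF(fit)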


2021 ◽  
Author(s):  
Alexander Ly ◽  
Eric-Jan Wagenmakers

The “Full Bayesian Significance Test e-value”, henceforth FBST ev, has received increasing attention across a range of disciplines including psychology. We show that the FBST ev leads to four problems: (1) the FBST ev cannot quantify evidence in favor of a null hypothesis and therefore also cannot discriminate “evidence of absence” from “absence of evidence”; (2) the FBST ev is susceptible to sampling to a foregone conclusion; (3) the FBST ev violates the principle of predictive irrelevance, such that it is affected by data that are equally likely to occur under the null hypothesis and the alternative hypothesis; (4) the FBST ev suffers from the Jeffreys-Lindley paradox in that it does not include a correction for selection. These problems also plague the frequentist p-value. We conclude that although the FBST ev may be an improvement over the p-value, it does not provide a reasonable measure of evidence against the null hypothesis.
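
Problem (2), sampling to a foregone conclusion, is easy to demonstrate for the p-value, which the abstract argues shares this flaw with the FBST ev: a researcher who keeps testing after each batch of new observations will, with probability growing toward 1, eventually cross any fixed significance threshold even when the null hypothesis is true. A minimal simulation sketch:

    set.seed(1)
    x <- rnorm(5000)                      # data generated under H0: mu = 0
    p <- sapply(seq(10, 5000, by = 10),   # re-test after every 10 observations
                function(n) t.test(x[1:n])$p.value)
    any(p < .05)                          # often TRUE despite H0 being true;
                                          # the probability approaches 1 as
                                          # sampling continues indefinitely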

