Simulation data for the analysis of Bayesian posterior significance and effect size indices for the two-sample t-test to support reproducible medical research

2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Riko Kelter

Abstract Objectives: The data presented herein represent the simulated datasets of a recently conducted larger study which investigated the behaviour of Bayesian indices of significance and effect size as alternatives to traditional p-values. The study considered the setting of Student’s and Welch’s two-sample t-tests, which are often used in medical research. It investigated the influence of the sample size, the noise, the selected prior hyperparameters, and the sensitivity to type I errors. The posterior indices used included the Bayes factor, the region of practical equivalence (ROPE), the probability of direction, the MAP-based p-value, and the e-value in the Full Bayesian Significance Test. The simulation study was conducted in the statistical programming language R. Data description: The R script files for simulating the datasets used in the study are presented in this article. These script files can both simulate the raw datasets and run the analyses. As researchers may face different effect sizes, noise levels, or priors in their domain than the ones studied in the original paper, the scripts extend the original results by allowing all analyses of interest to be recreated in different contexts. They should therefore be relevant to other researchers.
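
As a rough illustration of the kind of simulation the scripts perform, the following base-R sketch (illustrative only, not the published scripts; all settings are assumptions) draws two groups under a known effect size and records how often Welch's t-test rejects the null. The published scripts additionally compute the Bayesian indices (Bayes factor, ROPE, probability of direction, MAP-based p-value, e-value) on each simulated dataset.

# Illustrative simulation sketch (not the published scripts): draw two groups
# under a known effect size, run Welch's t-test, and record rejection rates.
set.seed(42)

simulate_rejection_rate <- function(n, delta, sd = 1, n_sim = 1000, alpha = 0.05) {
  rejections <- replicate(n_sim, {
    x <- rnorm(n, mean = 0,     sd = sd)   # control group
    y <- rnorm(n, mean = delta, sd = sd)   # treatment group, shifted by delta
    t.test(x, y)$p.value < alpha           # Welch's t-test is the default
  })
  mean(rejections)
}

# Type I error (delta = 0) and power (delta = 0.5) for n = 30 per group
simulate_rejection_rate(n = 30, delta = 0)
simulate_rejection_rate(n = 30, delta = 0.5)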

2015 ◽  
Vol 46 (3) ◽  
pp. 586-603 ◽  
Author(s):  
Ma Dolores Hidalgo ◽  
Isabel Benítez ◽  
Jose-Luis Padilla ◽  
Juana Gómez-Benito

The growing use of scales in survey questionnaires warrants the need to address how polytomous differential item functioning (DIF) affects observed scale score comparisons. The aim of this study is to investigate the impact of DIF on the type I error and effect size of the independent-samples t-test on the observed total scale scores. A simulation study was conducted, focusing on potential variables related to DIF in polytomous items, such as DIF pattern, sample size, DIF magnitude, and percentage of DIF items. The results showed that the DIF pattern and the number of DIF items affected the type I error rates and effect sizes of the t-test. The results highlighted the need to analyze DIF before making comparative group interpretations.
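
A crude illustration of the mechanism (not the study's graded-response design; the item model, shift size, and category cut-points below are assumptions of this sketch): a constant uniform-DIF shift on a few polytomous items inflates the focal group's total score even when both groups share the same latent trait distribution, so the t-test on total scores rejects a true null more often than the nominal 5%.

# Illustrative DIF sketch: identical latent traits, but n_dif items are shifted
# for the focal group; the t-test on total scores then over-rejects.
set.seed(1)

sim_once <- function(n = 200, n_items = 20, n_dif = 4, dif_shift = 0.5) {
  score_item <- function(theta, shift) {
    raw <- theta + shift + rnorm(length(theta))
    pmin(pmax(round(raw + 2), 0), 4)              # five ordered categories, 0..4
  }
  total_score <- function(theta, shifts) {
    rowSums(sapply(shifts, function(s) score_item(theta, s)))
  }
  shifts_ref   <- rep(0, n_items)
  shifts_focal <- c(rep(dif_shift, n_dif), rep(0, n_items - n_dif))
  t.test(total_score(rnorm(n), shifts_ref),
         total_score(rnorm(n), shifts_focal))$p.value < 0.05
}

# Empirical type I error rate of the t-test on total scores under DIF
mean(replicate(500, sim_once()))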


Methodology ◽  
2016 ◽  
Vol 12 (2) ◽  
pp. 44-51 ◽  
Author(s):  
José Manuel Caperos ◽  
Ricardo Olmos ◽  
Antonio Pardo

Abstract. Correlation analysis is one of the most widely used methods to test hypotheses in social and health sciences; however, its use is not completely error free. We explored the frequency of inconsistencies between reported p-values and the associated test statistics in 186 papers published in four Spanish journals of psychology (1,950 correlation tests); we also collected information about the use of one- versus two-tailed tests in the presence of directional hypotheses, and about the use of adjustments to control Type I errors due to simultaneous inference. Of the reported correlation tests, 83.8% are incomplete and 92.5% include an inexact p-value. Gross inconsistencies, which are liable to alter the statistical conclusions, appear in 4% of the reviewed tests, and 26.9% of the inconsistencies found were large enough to bias the results of a meta-analysis. The use of one-tailed tests and of adjustments to control the Type I error rate is negligible. We therefore urge authors, reviewers, and editorial boards to pay particular attention to these issues in order to prevent inconsistencies in statistical reports.
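
A check of this kind can be scripted. The sketch below (illustrative, not the authors' procedure; the example values are made up) recomputes the two-tailed p-value implied by a reported correlation r and sample size n and compares it with the p-value stated in the paper.

# Recompute the p-value implied by a reported correlation and compare it with
# the reported p-value; a mismatch beyond the tolerance flags an inconsistency.
check_r_pvalue <- function(r, n, reported_p, tol = 0.005) {
  t_stat     <- r * sqrt((n - 2) / (1 - r^2))     # t statistic with n - 2 df
  recomputed <- 2 * pt(-abs(t_stat), df = n - 2)  # two-tailed p-value
  c(recomputed = recomputed,
    consistent = abs(recomputed - reported_p) <= tol)
}

# Hypothetical reported values: r = .31, n = 120, p = .001
check_r_pvalue(r = 0.31, n = 120, reported_p = 0.001)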


Author(s):  
Riko Kelter

Abstract Testing differences between a treatment and a control group is common practice in biomedical research such as randomized controlled trials (RCTs). The standard two-sample t-test relies on null hypothesis significance testing (NHST) via p-values, which has several drawbacks. Bayesian alternatives were recently introduced using the Bayes factor, which has its own limitations. This paper introduces an alternative to current Bayesian two-sample t-tests by interpreting the underlying model as a two-component Gaussian mixture in which the effect size is the quantity of interest, which is most relevant in clinical research. Unlike p-values or the Bayes factor, the proposed method focuses on estimation under uncertainty instead of explicit hypothesis testing. Via a Gibbs sampler, the posterior of the effect size is produced and used subsequently either for estimation under uncertainty or for explicit hypothesis testing based on the region of practical equivalence (ROPE). An illustrative example, theoretical results, and a simulation study show the usefulness of the proposed method, and the test is made available as an R package. In sum, the new Bayesian two-sample t-test provides a solution to the Behrens–Fisher problem based on Gaussian mixture modelling.
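
The following sketch illustrates the general workflow with a deliberately simplified model: two independent normal groups with semi-conjugate priors rather than the paper's two-component Gaussian mixture, a plain Gibbs sampler for the posterior of the effect size, and a ROPE summary. The priors, the ROPE of [-0.1, 0.1], and all function names are assumptions of this sketch, not the published method.

# Minimal Gibbs-sampler sketch (simplified two-group normal model, not the
# paper's mixture model): posterior draws of the effect size plus a ROPE summary.
set.seed(7)

rope_t_test <- function(x, y, n_iter = 5000, rope = c(-0.1, 0.1),
                        mu0 = 0, tau2 = 100, a = 1, b = 1) {
  draw_group <- function(z) {
    n  <- length(z)
    mu <- mean(z); s2 <- var(z)                 # starting values
    mu_draws <- s2_draws <- numeric(n_iter)
    for (i in seq_len(n_iter)) {
      # mu | sigma2: conjugate normal update
      v  <- 1 / (n / s2 + 1 / tau2)
      m  <- v * (n * mean(z) / s2 + mu0 / tau2)
      mu <- rnorm(1, m, sqrt(v))
      # sigma2 | mu: inverse-gamma update, sampled via the precision
      s2 <- 1 / rgamma(1, a + n / 2, b + 0.5 * sum((z - mu)^2))
      mu_draws[i] <- mu; s2_draws[i] <- s2
    }
    list(mu = mu_draws, s2 = s2_draws)          # no burn-in, for brevity
  }
  gx <- draw_group(x); gy <- draw_group(y)
  delta <- (gx$mu - gy$mu) / sqrt((gx$s2 + gy$s2) / 2)   # effect size draws
  c(mean_delta = mean(delta),
    p_in_rope  = mean(delta > rope[1] & delta < rope[2]))
}

x <- rnorm(40, 0.3); y <- rnorm(40, 0.0)
rope_t_test(x, y)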


Genetics ◽  
2001 ◽  
Vol 158 (2) ◽  
pp. 875-883
Author(s):  
Luis E Montoya-Delgado ◽  
Telba Z Irony ◽  
Carlos A de B. Pereira ◽  
Martin R Whittle

Abstract Much forensic inference based upon DNA evidence is made assuming that Hardy-Weinberg equilibrium (HWE) is valid for the genetic loci being used. Several statistical tests to detect and measure deviation from HWE have been devised, each having advantages and limitations. The limitations become more obvious when testing for deviation within multiallelic DNA loci is attempted. Here we present an exact test for HWE in the biallelic case, based on the ratio of weighted likelihoods under the null and alternative hypotheses, the Bayes factor. This test does not depend on asymptotic results and minimizes a linear combination of type I and type II errors. By ordering the sample space using the Bayes factor, we also define a significance (evidence) index, the P-value, using the weighted likelihood under the null hypothesis. We compare it to the conditional exact test for the case of sample size n = 10. Using the idea underlying the method of χ² partitioning, the test is then applied sequentially to test equilibrium in the multiple-allele case and to two short tandem repeat loci from a real Caucasian data bank, demonstrating its usefulness.
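
In the biallelic case a Bayes factor of this kind has a closed form. The sketch below assumes uniform priors, Beta(1, 1) on the allele frequency under HWE and Dirichlet(1, 1, 1) on the genotype probabilities under the alternative, which need not match the weighting used in the paper.

# Illustrative Bayes factor for HWE at a biallelic locus.
# Data: genotype counts (n_AA, n_Aa, n_aa).
# H0 (HWE): genotype probabilities are (p^2, 2pq, q^2) with p ~ Beta(1, 1).
# H1:       genotype probabilities follow a Dirichlet(1, 1, 1) prior.
hwe_bayes_factor <- function(n_AA, n_Aa, n_aa) {
  n <- n_AA + n_Aa + n_aa
  # Marginal likelihood under H0 (multinomial coefficient omitted; it cancels):
  # integral over p of (p^2)^n_AA * (2pq)^n_Aa * (q^2)^n_aa
  log_m0 <- n_Aa * log(2) + lbeta(2 * n_AA + n_Aa + 1, 2 * n_aa + n_Aa + 1)
  # Marginal likelihood under H1: Dirichlet-multinomial with alpha = (1, 1, 1)
  log_m1 <- lgamma(3) + lgamma(n_AA + 1) + lgamma(n_Aa + 1) + lgamma(n_aa + 1) -
            lgamma(n + 3)
  exp(log_m0 - log_m1)          # BF > 1 favours HWE
}

# Hypothetical genotype counts close to equilibrium
hwe_bayes_factor(n_AA = 30, n_Aa = 50, n_aa = 20)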


Author(s):  
Christopher D. Green

The “replication crisis” may well be the single most important challenge facing empirical psychological research today. It appears that highly trained scientists, often without understanding the potentially dire long-term implications, have been mishandling standard statistical procedures in the service of attaining statistical “significance.” Exacerbating the problem, most academic journals do not publish research that has not produced a “significant” result. This toxic combination has resulted in journals apparently publishing many Type I errors and declining to publish many true failures to reject H0. In response, there has been an urgent call from some psychologists that studies be registered in advance, so that their rationales, hypotheses, variables, sample sizes, and statistical analyses are recorded before data collection, leaving less room for post hoc manipulation. In this chapter, I argue that this “open science” approach, though laudable, will prove insufficient because the null hypothesis significance test (NHST) is a poor criterion for scientific truth, even when it is handled correctly. The root of the problem is that, whatever statistical problems psychology may have, the discipline never developed the theoretical maturity that precise quantitative prediction requires. For decades it has been satisfied testing weak theories that predict, at best, only the direction of an effect rather than its size. Indeed, uncritical acceptance of NHST by the discipline may have served to stunt psychology’s theoretical growth by giving researchers a way of building a successful career without having to develop models that make precise predictions. Improving our statistical “hygiene” would be a good thing, to be sure, but it is unlikely to resolve psychology’s growing credibility problem until our theoretical practices mature considerably.


2021 ◽  
Author(s):  
Qianrao Fu

It is a tradition going back to Jacob Cohen to calculate the sample size before collecting data. The most commonly asked question is: "How many subjects do we need to obtain a significant result, evaluated with the p-value, if an effect of a given size exists?" In the Bayesian framework, we may instead ask how many subjects are needed to obtain convincing evidence when the hypothesis is evaluated with the Bayes factor. This paper proposes a solution to this question by pursuing two goals: first, that the Bayes factor reaches a given threshold, and second, that the probability of the Bayes factor exceeding that threshold reaches a required value. Researchers can express their expectations through order or sign constraints on the parameters of a linear regression model. For example, they may expect the regression coefficients to be ordered as $\beta_1>\beta_2>\beta_3$, which is an order-constrained hypothesis, or they may expect a regression coefficient to satisfy $\beta_1>0$, which is a sign hypothesis. The greatest advantage of using such an informative hypothesis is that, to achieve the same probability that the Bayes factor exceeds a given threshold, the required sample size is smaller than for an unconstrained hypothesis. This article provides sample size tables for the null hypothesis, order hypothesis, sign hypothesis, complement hypothesis, and unconstrained hypothesis. To enhance applicability, an R package based on Monte Carlo simulation is developed, which helps psychologists plan the sample size even if they have no statistical programming background.
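
A minimal Monte Carlo sketch of this planning logic follows. It is not the paper's package: the normal approximation to the posterior of the regression coefficient, the Bayes factor of the sign hypothesis against its complement, and all settings below are assumptions of the sketch.

# Monte Carlo sample-size planning for the sign hypothesis H1: beta1 > 0
# versus its complement, using a normal approximation to the posterior.
set.seed(123)

prob_bf_exceeds <- function(n, beta1 = 0.3, bf_threshold = 3, n_sim = 500) {
  hits <- replicate(n_sim, {
    x   <- rnorm(n)
    y   <- beta1 * x + rnorm(n)
    est <- summary(lm(y ~ x))$coefficients["x", c("Estimate", "Std. Error")]
    post_prob <- 1 - pnorm(0, mean = est["Estimate"], sd = est["Std. Error"])
    bf <- post_prob / (1 - post_prob)     # H1: beta1 > 0 vs. its complement
    bf > bf_threshold
  })
  mean(hits)
}

# Smallest n on a crude grid with P(BF > 3) of at least 0.8 (NA if none qualifies)
ns    <- seq(20, 200, by = 20)
probs <- sapply(ns, prob_bf_exceeds)
ns[which(probs >= 0.8)[1]]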


2020 ◽  
Vol 3 (2) ◽  
pp. 216-228
Author(s):  
Hannes Rosenbusch ◽  
Leon P. Hilbert ◽  
Anthony M. Evans ◽  
Marcel Zeelenberg

Sometimes interesting statistical findings are produced by a small number of “lucky” data points within the tested sample. To address this issue, researchers and reviewers are encouraged to investigate outliers and influential data points. Here, we present StatBreak, an easy-to-apply method, based on a genetic algorithm, that identifies the observations that most strongly contributed to a finding (e.g., effect size, model fit, p value, Bayes factor). Within a given sample, StatBreak searches for the largest subsample in which a previously observed pattern is not present or is reduced below a specifiable threshold. Thus, it answers the following question: “Which (and how few) ‘lucky’ cases would need to be excluded from the sample for the data-based conclusion to change?” StatBreak consists of a simple R function and flags the luckiest data points for any form of statistical analysis. Here, we demonstrate the effectiveness of the method with simulated and real data across a range of study designs and analyses. Additionally, we describe StatBreak’s R function and explain how researchers and reviewers can apply the method to the data they are working with.
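
A much-simplified version of the underlying idea can be sketched with a greedy search rather than StatBreak's genetic algorithm, applied here to a correlation test; the function and variable names are illustrative, not the package's interface.

# Greedy illustration of the "luckiest points" idea: repeatedly drop the single
# observation whose removal weakens the finding the most, until the correlation
# test is no longer significant (or a removal budget is exhausted).
set.seed(99)

luckiest_points <- function(x, y, alpha = 0.05, max_remove = 10) {
  removed <- integer(0)
  keep    <- seq_along(x)
  for (k in seq_len(max_remove)) {
    if (cor.test(x[keep], y[keep])$p.value > alpha) break   # conclusion changed
    p_after <- sapply(keep, function(i) {
      idx <- setdiff(keep, i)
      cor.test(x[idx], y[idx])$p.value
    })
    drop    <- keep[which.max(p_after)]   # removing this point hurts the most
    removed <- c(removed, drop)
    keep    <- setdiff(keep, drop)
  }
  removed
}

x <- rnorm(100); y <- 0.2 * x + rnorm(100)
luckiest_points(x, y)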


2019 ◽  
Vol 28 (4) ◽  
pp. 468-485 ◽  
Author(s):  
Paul HP Hanel ◽  
David MA Mehler

Transparent communication of research is key to fostering understanding within and beyond the scientific community. An increased focus on reporting effect sizes, in addition to p value–based significance statements or Bayes factors, may improve scientific communication with the general public. Across three studies (N = 652), we compared subjective informativeness ratings for five effect sizes, the Bayes factor, and commonly used significance statements. Results showed that Cohen’s U3 was rated as most informative. For example, 440 participants (69%) found U3 more informative than Cohen’s d, while 95 (15%) found d more informative than U3, and 99 participants (16%) found both effect sizes equally informative. This effect was not moderated by level of education. We therefore suggest that, in general, Cohen’s U3 be used when scientific findings are communicated. However, the choice of effect size may vary depending on what a researcher wants to highlight (e.g. differences or similarities).
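
For reference, under normality and equal variances Cohen's U3 follows directly from Cohen's d as U3 = Φ(d), the share of the treated distribution lying above the control mean; a one-line R illustration:

# Cohen's U3 from Cohen's d (normality and equal variances assumed):
# d = 0.5 means roughly 69% of treated scores exceed the average control score.
cohens_u3 <- function(d) pnorm(d)
cohens_u3(c(0.2, 0.5, 0.8))
# approximately 0.579, 0.691, 0.788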

