Systematic Review of the Use of “Magnitude-Based Inference” in Sports Science and Medicine

Author(s):  
Keith Lohse ◽  
Kristin Sainani ◽  
J. Andrew Taylor ◽  
Michael Lloyd Butson ◽  
Emma Knight ◽  
...  

Magnitude-based inference (MBI) is a controversial statistical method that has been used in hundreds of papers in sports science despite criticism from statisticians. To better understand how this method has been applied in practice, we systematically reviewed 232 papers that used MBI. We extracted data on study design, sample size, and choice of MBI settings and parameters. Median sample size was 10 per group (interquartile range, IQR: 8–15) for multi-group studies and 14 (IQR: 10–24) for single-group studies; few studies (15%) reported a priori sample size calculations. Authors predominantly applied MBI’s default settings and chose “mechanistic/non-clinical” rather than “clinical” MBI even when testing clinical interventions (only 14 of 232 studies used clinical MBI). Using these data, we estimated the Type I error rates for the typical MBI study. Authors frequently made dichotomous claims about effects based on the MBI criterion of a “likely” effect and sometimes based on the MBI criterion of a “possible” effect. With sample sizes of n = 8 to 15 per group, these inferences have Type I error rates of 12%–22% and 22%–45%, respectively. High Type I error rates were compounded by multiple testing: authors reported results from a median of 30 outcome-related tests, and few studies (14%) specified a primary outcome. We conclude that MBI has promoted small studies, promulgated a “black box” approach to statistics, and led to numerous papers whose conclusions are not supported by the data. Amid debates over the role of p-values and significance testing in science, MBI also provides an important natural experiment: we find no evidence that moving researchers away from p-values or null hypothesis significance testing makes them less prone to dichotomization or over-interpretation of findings.
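The Type I error figures above come from evaluating MBI’s decision rules under a true null. As a rough, hedged illustration (not the paper’s own code), the base-R sketch below estimates how often a simplified mechanistic-MBI-style rule declares an effect at least “likely” when the true between-group difference is zero; the smallest worthwhile change of 0.2 SD, the 75%/5% probability thresholds, and the flat-prior t-based probabilities are assumptions chosen for illustration.

```r
## Simplified MBI-style "likely" claim under a true null, n = 10 per group
set.seed(1)
n_per_group <- 10     # typical sample size reported in the review
swc         <- 0.2    # assumed smallest worthwhile change, in SD units
n_sims      <- 20000

likely_claim <- replicate(n_sims, {
  x <- rnorm(n_per_group)            # "control" arm, true effect is zero
  y <- rnorm(n_per_group)            # "treatment" arm, true effect is zero
  d  <- mean(y) - mean(x)
  se <- sqrt(var(x) / n_per_group + var(y) / n_per_group)
  df <- 2 * (n_per_group - 1)
  p_pos <- pt((d - swc) / se, df)    # estimated chance the true effect exceeds +SWC
  p_neg <- pt((-swc - d) / se, df)   # estimated chance the true effect is below -SWC
  (p_pos > 0.75 && p_neg < 0.05) ||  # "likely positive"
  (p_neg > 0.75 && p_pos < 0.05)     # "likely negative"
})

mean(likely_claim)   # empirical Type I error rate under this simplified rule
```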

2019 ◽  
Vol 3 (Supplement_1) ◽  
Author(s):  
Keisuke Ejima ◽  
Andrew Brown ◽  
Daniel Smith ◽  
Ufuk Beyaztas ◽  
David Allison

Objectives: Rigor, reproducibility and transparency (RRT) awareness has expanded over the last decade. Although RRT can be improved from various aspects, we focused on the Type I error rates and power of commonly used statistical analyses for testing mean differences between two groups at small (n ≤ 5) to moderate sample sizes.
Methods: We compared data from five distinct, homozygous, monogenic, murine models of obesity with non-mutant controls of both sexes. Baseline weight (7–11 weeks old) was the outcome. To examine whether the Type I error rate could be affected by the choice of statistical test, we adjusted the empirical distributions of weights to enforce the null hypothesis (i.e., no mean difference) in two ways: Case 1) center both weight distributions on the same mean weight; Case 2) combine data from the control and mutant groups into one distribution. From these cases, 3 to 20 mice were resampled to create a ‘plasmode’ dataset. We performed five common tests (Student's t-test, Welch's t-test, Wilcoxon test, permutation test and bootstrap test) on the plasmodes and computed Type I error rates. Power was assessed using plasmodes in which the distribution of the control group was shifted by adding a constant value, as in Case 1, but so as to realize nominal effect sizes.
Results: Type I error rates were markedly higher than the nominal significance level (Type I error rate inflation) for Student's t-test, Welch's t-test and the permutation test, especially when the sample size was small, in Case 1, whereas inflation was observed only for the permutation test in Case 2. Deflation was noted for the bootstrap test with small samples. Increasing the sample size mitigated the inflation and deflation, except for the Wilcoxon test in Case 1, because heterogeneity of the weight distributions between groups violated the assumptions required for testing mean differences. For power, a departure from the reference value was observed with small samples. Compared with the other tests, the bootstrap was underpowered with small samples as a tradeoff for maintaining Type I error rates.
Conclusions: With small samples (n ≤ 5), the bootstrap avoided Type I error rate inflation, but often at the cost of lower power. To avoid Type I error rate inflation for the other tests, the sample size should be increased. The Wilcoxon test should be avoided because of the heterogeneity of weight distributions between mutant and control mice.
Funding Sources: This study was supported in part by NIH and a Japan Society for the Promotion of Science (JSPS) KAKENHI grant.
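As a rough illustration of the plasmode approach for Case 1 (not the authors’ code or data), the base-R sketch below recenters two hypothetical weight vectors on a common mean so that the null holds, resamples small groups, and estimates empirical Type I error rates for three of the five tests; the weight values, group size, and number of simulations are made-up illustrative choices.

```r
## Case 1 plasmode-style Type I error estimate with hypothetical weights (g)
set.seed(1)
control <- c(24.1, 26.3, 25.0, 27.8, 23.5, 26.9, 25.7, 24.8)
mutant  <- c(38.2, 41.5, 44.0, 39.7, 42.8, 40.3, 43.1, 41.0)

grand_mean <- mean(c(control, mutant))
control_0  <- control - mean(control) + grand_mean   # recenter each group on the
mutant_0   <- mutant  - mean(mutant)  + grand_mean   # same mean, so the null holds

n_per_group <- 5
n_sims      <- 10000
alpha       <- 0.05

reject <- replicate(n_sims, {
  x <- sample(control_0, n_per_group, replace = TRUE)   # resampled "plasmode" data
  y <- sample(mutant_0,  n_per_group, replace = TRUE)
  c(student = t.test(x, y, var.equal = TRUE)$p.value < alpha,
    welch   = t.test(x, y)$p.value < alpha,
    wilcox  = wilcox.test(x, y, exact = FALSE)$p.value < alpha)
})

rowMeans(reject)   # empirical Type I error rate for each test
```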


2015 ◽  
Vol 9 (13) ◽  
pp. 1
Author(s):  
Tobi Kingsley Ochuko ◽  
Suhaida Abdullah ◽  
Zakiyah Binti Zain ◽  
Sharipah Soaad Syed Yahaya

This study centres on the comparison of independent-group tests in terms of power, using parametric methods such as the Alexander-Govern test. The Alexander-Govern (AG) test uses the mean as its measure of central tendency. It is a better alternative to the Welch test, the James test and the ANOVA because it produces high power and gives good control of Type I error rates for normal data under variance heterogeneity, but it is not robust for non-normal data. When the trimmed mean was applied as the test's measure of central tendency under non-normality, the test was robust only in the two-group condition; as the number of groups increased beyond two, it was no longer robust. As a result, a highly robust estimator known as the MOM estimator was applied as the test's measure of central tendency. The resulting test is not affected by the number of groups, but it could not control Type I error rates under skewed, heavy-tailed distributions. In this study, the Winsorized MOM estimator was applied as the central tendency measure of the AG test. A simulation of 5,000 data sets was generated and analysed with the SAS package. The results show that, for the pairing of unbalanced sample sizes (15:15:20:30) with equal variances (1:1:1:1) and the pairing of unbalanced sample sizes (15:15:20:30) with unequal variances (1:1:1:36) at an effect size index of f = 0.8, only the AGWMOM test produced high power values, 0.9562 and 0.8336 respectively, compared with the AG test, the AGMOM test and the ANOVA, and the test is considered to be sufficient.
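The abstract does not define the estimators in detail, so the base-R sketch below only illustrates the general idea under common definitions: a MOM-type estimate that discards observations flagged by a 2.24 × MADN cutoff around the median, and a Winsorized variant that pulls such observations to the cutoffs instead. The cutoff constant and the Winsorization rule are assumptions for illustration, not necessarily those used in the AGWMOM test.

```r
## MOM-type estimate and a Winsorized variant (assumed definitions)
mom_estimate <- function(x, k = 2.24) {
  madn <- mad(x)                        # MAD rescaled to be consistent for the SD
  keep <- abs(x - median(x)) <= k * madn
  mean(x[keep])                         # average after discarding flagged outliers
}

winsorized_mom_estimate <- function(x, k = 2.24) {
  madn  <- mad(x)
  lower <- median(x) - k * madn
  upper <- median(x) + k * madn
  mean(pmin(pmax(x, lower), upper))     # pull extreme values in to the cutoffs
}

x <- c(2.1, 2.4, 2.2, 2.8, 2.5, 9.7)    # hypothetical skewed sample
c(mean = mean(x), mom = mom_estimate(x), wins_mom = winsorized_mom_estimate(x))
```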


1988 ◽  
Vol 13 (3) ◽  
pp. 281-290 ◽  
Author(s):  
James Algina ◽  
Kezhen L. Tang

For Yao’s and James’ tests, Type I error rates were estimated for various combinations of the number of variables (p), sample-size ratio (n1:n2), sample-size-to-variables ratio, and degree of heteroscedasticity. These tests are alternatives to Hotelling’s T² and are intended for use when the variance-covariance matrices are not equal in a study using two independent samples. The performance of Yao’s test was superior to that of James’. Yao’s test had appropriate Type I error rates when p ≥ 10, (n1 + n2)/p ≥ 10, and 1:2 ≤ n1:n2 ≤ 2:1. When (n1 + n2)/p = 20, Yao’s test was robust when n1:n2 was 5:1, 3:1, and 4:1 and p was 2, 6, and 10, respectively.
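For context, the base-R sketch below shows why such alternatives are needed: it estimates the empirical Type I error rate of the standard two-sample Hotelling's T² when the covariance matrices are unequal and the smaller group has the larger covariance. The dimension, sample sizes, and covariance inflation factor are illustrative choices, not conditions from this study.

```r
## Empirical Type I error of Hotelling's T^2 under unequal covariance matrices
set.seed(1)
p  <- 2
n1 <- 30; n2 <- 10
S1 <- diag(p)            # identity covariance in the larger group
S2 <- 9 * diag(p)        # inflated covariance in the smaller group
n_sims <- 10000

hotelling_p <- function(x1, x2) {
  n1 <- nrow(x1); n2 <- nrow(x2); p <- ncol(x1)
  d  <- colMeans(x1) - colMeans(x2)
  Sp <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)  # pooled covariance
  T2 <- (n1 * n2 / (n1 + n2)) * drop(t(d) %*% solve(Sp) %*% d)
  Fstat <- (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
  pf(Fstat, p, n1 + n2 - p - 1, lower.tail = FALSE)
}

reject <- replicate(n_sims, {
  x1 <- matrix(rnorm(n1 * p), n1) %*% chol(S1)   # both group means are zero (null true)
  x2 <- matrix(rnorm(n2 * p), n2) %*% chol(S2)
  hotelling_p(x1, x2) < 0.05
})
mean(reject)   # well above 0.05 when the smaller group has the larger covariance
```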


1996 ◽  
Vol 21 (2) ◽  
pp. 169-178 ◽  
Author(s):  
William T. Coombs ◽  
James Algina

Type I error rates for the Johansen test were estimated using simulated data for a variety of conditions. The design of the experiment was a 2 × 2 × 2 × 3 × 9 × 3 factorial. The factors were (a) type of distribution, (b) number of dependent variables, (c) number of groups, (d) ratio of the smallest sample size to the number of dependent variables, (e) sample size ratios, and (f) degree of heteroscedasticity. The results indicate that Type I error rates for the Johansen test depend heavily on the number of groups and the ratio of the smallest sample size to the number of dependent variables. Type I error rates depend to a lesser extent on the distribution types used in the study. Based on the results, sample size guidelines are presented.
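As a sketch of how such a factorial simulation can be organized, the base-R snippet below enumerates a 2 × 2 × 2 × 3 × 9 × 3 grid of conditions matching the factor structure described above; every level value shown is an illustrative placeholder rather than a condition from the original study.

```r
## Enumerating the factorial simulation conditions (placeholder level values)
conditions <- expand.grid(
  distribution    = c("normal", "nonnormal"),          # (a) type of distribution
  n_dvs           = c(2, 4),                           # (b) number of dependent variables
  n_groups        = c(3, 6),                           # (c) number of groups
  n_to_p_ratio    = c(5, 10, 20),                      # (d) smallest n / number of DVs
  n_ratio_pattern = paste0("ratio_", 1:9),             # (e) sample-size ratio patterns
  heteroscedast   = c("none", "moderate", "severe")    # (f) degree of heteroscedasticity
)
nrow(conditions)   # 2 * 2 * 2 * 3 * 9 * 3 = 648 simulation cells
```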


2021 ◽  
Author(s):  
Daniel Lakens ◽  
Friedrich Pahlke ◽  
Gernot Wassmer

This tutorial illustrates how to design, analyze, and report group sequential designs. In these designs, groups of observations are collected and repeatedly analyzed while controlling error rates. Compared to a fixed sample size design, where the data are analyzed only once, group sequential designs offer the possibility to stop the study at interim looks at the data, either for efficacy or for futility. Hence, they provide greater flexibility and are more efficient in the sense that, because of early stopping, the expected sample size is smaller than in a design with no interim looks. In this tutorial we illustrate how to use the R package 'rpact' and the associated Shiny app to design studies that control the Type I error rate when repeatedly analyzing data, even when neither the number of looks at the data nor their exact timing is specified. Specifically for t-tests, we illustrate how to perform an a priori power analysis for group sequential designs and explain how to stop data collection for futility by rejecting the presence of an effect of interest based on a beta-spending function. Finally, we discuss how to report adjusted effect size estimates and confidence intervals. The recent availability of accessible software such as 'rpact' makes it possible for psychologists to benefit from the efficiency gains provided by group sequential designs.
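A minimal sketch of such a design specification, assuming the getDesignGroupSequential() and getSampleSizeMeans() interface described in the rpact vignettes; the number of looks, spending functions, effect size, and standard deviation are illustrative choices rather than recommendations from the tutorial.

```r
library(rpact)

# Three looks, one-sided 2.5% Type I error spread by an O'Brien-Fleming-type
# alpha-spending function, 90% power, and an O'Brien-Fleming-type beta-spending
# function so data collection can stop early for futility.
design <- getDesignGroupSequential(
  kMax             = 3,
  alpha            = 0.025, sided = 1,
  beta             = 0.10,
  typeOfDesign     = "asOF",
  typeBetaSpending = "bsOF"
)

# A priori sample size for a two-group comparison of means (t-test setting),
# assuming a true difference of 0.5 SD.
sample_size <- getSampleSizeMeans(design, alternative = 0.5, stDev = 1)
summary(sample_size)
```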


2018 ◽  
Vol 8 (2) ◽  
pp. 58-71
Author(s):  
Richard L. Gorsuch ◽  
Curtis Lehmann

Approximations for the Chi-square and F distributions can both be computed to provide a p-value, or probability of Type I error, for evaluating statistical significance. Although the Chi-square has traditionally been used for tests of count data and nominal or categorical criterion variables (such as contingency tables) and F ratios for tests of non-nominal or continuous criterion variables (such as regression and analysis of variance), we demonstrate that either statistic can be applied in both situations. We used data simulation studies to examine when one statistic may be more accurate than the other for estimating Type I error rates across different types of analysis (count data/contingencies, dichotomous, and non-nominal) and across sample sizes (Ns) ranging from 20 to 160, using 25,000 replications to simulate p-values derived from either Chi-squares or F ratios. Our results showed that p-values derived from F ratios were generally closer to nominal Type I error rates than those derived from Chi-squares, and were more consistent for contingency-table count data. The further N fell below 100, the more the p-values derived from Chi-squares departed from the nominal p-value; only when N was greater than 80 did p-values from Chi-square tests become as accurate as those derived from F ratios in reproducing the nominal p-values. Thus, there was no evidence of any need for special treatment of dichotomous dependent variables. The most accurate and/or consistent p-values were derived from F ratios. We conclude that the Chi-square should generally be replaced with the F ratio as the statistic of choice and that the Chi-square test should only be taught as history.
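As a rough illustration of this kind of comparison (not the authors' simulation), the base-R sketch below estimates empirical Type I error rates for a Pearson Chi-square test of independence and for the F test from a linear model fitted to the same dichotomous outcome, with group membership truly unrelated to the outcome; the sample size and outcome probability are arbitrary choices.

```r
## Chi-square vs. F-based p-values for a dichotomous outcome under a true null
set.seed(1)
N      <- 40                 # 20 per group (small-N condition)
n_sims <- 25000
alpha  <- 0.05
group  <- rep(c(0, 1), each = N / 2)

reject <- replicate(n_sims, {
  y <- rbinom(N, 1, 0.5)                                   # outcome unrelated to group
  p_chisq <- suppressWarnings(chisq.test(table(group, y), correct = FALSE)$p.value)
  p_f     <- anova(lm(y ~ group))[["Pr(>F)"]][1]           # F test on the 0/1 outcome
  c(chisq = p_chisq < alpha, f = p_f < alpha)
})

rowMeans(reject)   # empirical Type I error rate for each statistic
```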

