Exact Type I Error Rates for Robustness of Student's t Test with Unequal Variances

1980 ◽  
Vol 5 (4) ◽  
pp. 337-349 ◽  
Author(s):  
Philip H. Ramsey

It is noted that disagreements have arisen in the literature about the robustness of the t test in normal populations with unequal variances. Hsu's procedure is applied to determine exact Type I error rates for t. Employing fairly liberal but objective standards for assessing robustness, it is shown that the t test is not always robust to violations of the assumption of equal population variances, even when sample sizes are equal. Several guidelines are suggested, including the point that applying t at α = .05 without regard for unequal variances would require equal sample sizes of at least 15 by one of the standards considered. In many cases, especially those with unequal N's, an alternative such as Welch's procedure is recommended.
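
As an illustrative companion to this result, the Python sketch below approximates, by Monte Carlo simulation rather than Hsu's exact procedure, the actual Type I error of Student's t test when population variances are unequal but sample sizes are equal; the variance ratio and sample sizes are arbitrary assumptions, not values from the article.

```python
# Monte Carlo approximation of the actual Type I error of Student's t test
# when population variances are unequal but sample sizes are equal.
# Illustrative stand-in for Hsu's exact procedure; the variance ratio and
# sample sizes below are arbitrary choices, not the paper's conditions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
var_ratio = 4.0          # sigma1^2 / sigma2^2
n_sims = 20_000

for n in (5, 15, 30):    # equal sample sizes per group
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, np.sqrt(var_ratio), n)    # both means are 0: H0 is true
        y = rng.normal(0.0, 1.0, n)
        _, p = stats.ttest_ind(x, y, equal_var=True)  # classical Student's t
        rejections += p < alpha
    print(f"n = {n:2d} per group: estimated Type I error = {rejections / n_sims:.3f}")
```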

1994 ◽  
Vol 19 (3) ◽  
pp. 275-291 ◽  
Author(s):  
James Algina ◽  
T. C. Oshima ◽  
Wen-Ying Lin

Type I error rates were estimated for three tests that compare means using data from two independent samples: the independent samples t test, Welch's approximate degrees of freedom test, and James's second-order test. Type I error rates were estimated for skewed distributions, equal and unequal variances, equal and unequal sample sizes, and a range of total sample sizes. Welch's test and James's test have very similar Type I error rates and tend to control the Type I error rate as well as or better than the independent samples t test does. The results provide guidance about the total sample sizes required for controlling Type I error rates.
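
A minimal sketch of the kind of comparison described above, assuming a skew-normal population, one variance ratio, and one pair of unequal sample sizes (none of which are the article's conditions); James's second-order test is omitted because it has no standard SciPy implementation.

```python
# Sketch: Type I error for Student's t vs. Welch's t with skewed populations,
# unequal variances, and unequal sample sizes. All parameter values are
# illustrative assumptions, not the conditions simulated in the article.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n_sims = 0.05, 20_000
n1, n2 = 10, 30                      # unequal sample sizes
sd1, sd2 = 2.0, 1.0                  # unequal standard deviations
skew = 4.0                           # skew-normal shape parameter

def draw(n, sd):
    # Standardize the skew-normal draws so both group means equal 0 (H0 true).
    z = stats.skewnorm.rvs(skew, size=n, random_state=rng)
    z = (z - stats.skewnorm.mean(skew)) / stats.skewnorm.std(skew)
    return sd * z

hits = {"student": 0, "welch": 0}
for _ in range(n_sims):
    x, y = draw(n1, sd1), draw(n2, sd2)
    hits["student"] += stats.ttest_ind(x, y, equal_var=True).pvalue < alpha
    hits["welch"]   += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha

for name, k in hits.items():
    print(f"{name:7s}: estimated Type I error = {k / n_sims:.3f}")
```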


1977 ◽  
Vol 14 (4) ◽  
pp. 493-498 ◽  
Author(s):  
Joanne C. Rogan ◽  
H. J. Keselman

Numerous investigations have examined the effects of variance heterogeneity on the empirical probability of a Type I error for the analysis of variance (ANOVA) F-test, and the prevailing conclusion has been that when sample sizes are equal, the ANOVA is robust to variance heterogeneity. However, Box (1954) reported a Type I error rate of .12, at a 5% nominal level, when unequal variances were paired with equal sample sizes. The present paper explored this finding, examining varying degrees and patterns of variance heterogeneity for varying sample sizes and numbers of treatment groups. The data indicate that the Type I error rate varies as a function of the degree of variance heterogeneity and, consequently, it should not be assumed that the ANOVA F-test is always robust to variance heterogeneity when sample sizes are equal.
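
The sketch below illustrates the same phenomenon with an assumed five-group design in which one group has a much larger variance; the variance pattern and group size are illustrative, not the Box (1954) or Rogan-Keselman configurations.

```python
# Sketch: empirical Type I error of the ANOVA F test with equal group sizes
# but heterogeneous variances. The variance pattern below is an illustrative
# assumption; it is not the exact Box (1954) configuration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, n_sims = 0.05, 20_000
n_per_group = 5
sds = [1.0, 1.0, 1.0, 1.0, 3.0]      # five groups, one with a much larger variance

rejections = 0
for _ in range(n_sims):
    groups = [rng.normal(0.0, sd, n_per_group) for sd in sds]  # all means equal: H0 true
    _, p = stats.f_oneway(*groups)
    rejections += p < alpha

print(f"Estimated Type I error of ANOVA F: {rejections / n_sims:.3f}")
```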


1995 ◽  
Vol 77 (1) ◽  
pp. 155-159 ◽  
Author(s):  
John E. Overall ◽  
Robert S. Atlas ◽  
Janet M. Gibson

Welch (1947) proposed an adjusted t test that can be used to correct the serious bias in Type I error protection that is otherwise present when both sample sizes and variances are unequal. The implications of the Welch adjustment for the power of tests for the difference between two treatments across k levels of a concomitant factor are evaluated in this article for k × 2 designs with unequal sample sizes and unequal variances. Analyses confirm that, although Type I error is uniformly controlled, the power of the Welch test of significance for the main effect of treatments remains rather seriously dependent on the direction of the correlation between the unequal variances and the unequal sample sizes. Nevertheless, considering that analysis of variance is not an acceptable option in such cases, the Welch t test appears to have an important role to play in the analysis of experimental data.
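
A two-group simplification of the pairing issue discussed above: the sketch estimates the power of Welch's t test when the larger variance is paired with the larger sample versus the smaller sample. The effect size, sample sizes, and variances are assumed values, not those of the k × 2 designs in the article.

```python
# Sketch: how the direction of the pairing between unequal sample sizes and
# unequal variances affects the power of Welch's t test. Two-group
# simplification of the k x 2 designs studied in the article; the effect size
# and variance values are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha, n_sims, effect = 0.05, 10_000, 1.0

def power(n1, sd1, n2, sd2):
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, sd1, n1)
        y = rng.normal(effect, sd2, n2)
        hits += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha
    return hits / n_sims

# Positive pairing: the larger sample gets the larger variance.
print("positive pairing:", power(n1=10, sd1=1.0, n2=30, sd2=2.0))
# Negative pairing: the smaller sample gets the larger variance.
print("negative pairing:", power(n1=10, sd1=2.0, n2=30, sd2=1.0))
```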


2021 ◽  
Author(s):  
Josue E. Rodriguez ◽  
Donald Ray Williams ◽  
Paul-Christian Bürkner

Categorical moderators are often included in mixed-effects meta-analysis to explain heterogeneity in effect sizes. An assumption in tests of moderator effects is that of a constant between-study variance across all levels of the moderator. Although it rarely receives serious thought, imposing this assumption can have drastic ramifications. We propose that researchers should instead assume unequal between-study variances by default. To achieve this, we suggest using a mixed-effects location-scale model (MELSM) to allow group-specific estimates for the between-study variances. In two extensive simulation studies, we show that in terms of Type I error and statistical power, nearly nothing is lost by using the MELSM for moderator tests, but there can be serious costs when a mixed-effects model with equal variances is used. Most notably, in scenarios with balanced sample sizes or equal between-study variance, the Type I error and power rates are nearly identical between the mixed-effects model and the MELSM. On the other hand, with imbalanced sample sizes and unequal variances, the Type I error rate under the mixed-effects model can be grossly inflated or overly conservative, whereas the MELSM controls the Type I error rate well across all scenarios. With respect to power, the MELSM had comparable or higher power than the mixed-effects model in all conditions where the latter produced valid (i.e., not inflated) Type I error rates. Altogether, our results strongly support assuming unequal between-study variances as the default strategy when testing categorical moderators.
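
The MELSM itself is typically fit with dedicated software (e.g., brms); as a crude, hedged stand-in for the idea of group-specific between-study variances, the sketch below estimates a separate DerSimonian-Laird tau² within each level of a hypothetical categorical moderator instead of a single pooled tau². The data are fabricated for illustration.

```python
# Crude stand-in for group-specific between-study variances: a separate
# DerSimonian-Laird tau^2 per moderator level instead of one pooled tau^2.
# This is NOT the mixed-effects location-scale model (MELSM) of the article;
# the effect sizes and sampling variances below are fabricated.
import numpy as np

def dersimonian_laird_tau2(effects, variances):
    """DerSimonian-Laird moment estimator of the between-study variance."""
    w = 1.0 / np.asarray(variances)
    y = np.asarray(effects)
    ybar = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - ybar) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (q - (len(y) - 1)) / c)

# Hypothetical effect sizes and sampling variances for two moderator levels.
group_a = dict(effects=[0.10, 0.35, 0.22, 0.05], variances=[0.02, 0.03, 0.02, 0.04])
group_b = dict(effects=[0.60, -0.10, 0.90, 0.15], variances=[0.03, 0.02, 0.05, 0.03])

for label, g in (("A", group_a), ("B", group_b)):
    print(f"level {label}: tau^2 = {dersimonian_laird_tau2(**g):.3f}")
```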


2019 ◽  
Vol 47 (6) ◽  
pp. 3031-3045 ◽
Author(s):  
Michael Siebert ◽  
David Ellenberger

Automatic passenger counting (APC) in public transport was introduced in the 1970s and has been expanding rapidly in recent years. Still, real-world applications continue to face events that are difficult to classify. The induced imprecision needs to be handled as statistical noise, and thus methods have been defined to ensure that measurement errors do not exceed certain bounds. Various recommendations for such an APC validation have been made to establish criteria that limit the bias and the variability of the measurement errors. In those works, the misinterpretation of non-significance in statistical hypothesis tests for the detection of differences (e.g., Student's t-test) proves to be prevalent, although existing methods developed in biostatistics under the term equivalence testing (i.e., bioequivalence trials; Schuirmann, J Pharmacokinet Pharmacodyn 15(6):657–680, 1987) would be appropriate instead. This heavily affects the calibration and validation process of APC systems and has been the reason for unexpected results when sample sizes were not suitably chosen: large sample sizes were assumed to improve the assessment of systematic measurement errors of the devices from a user's perspective as well as from a manufacturer's perspective, but the regular t-test fails to achieve that. We introduce a variant of the t-test, the revised t-test, which addresses both Type I and Type II errors appropriately and allows a comprehensible transition from the long-established t-test in a widely used industrial recommendation. This test is appealing, but it is susceptible to numerical instability. Finally, we analytically reformulate it as a numerically stable equivalence test, which is thus easier to use. Our results therefore make it possible to derive an equivalence test from a t-test and increase the comparability of the two tests, especially for decision makers.
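
A minimal two one-sided tests (TOST) sketch in the spirit of the Schuirmann-style equivalence testing the article builds on, assuming a pooled-variance test, a hypothetical ±1 passenger equivalence margin, and fabricated count data; this is not the authors' revised t-test or their numerically stable reformulation.

```python
# Minimal two one-sided tests (TOST) sketch for equivalence of two means within
# a margin. The equivalence margin and the data are illustrative assumptions.
import numpy as np
from scipy import stats

def tost_ind(x, y, low, high):
    """Pooled-variance TOST for the difference in means lying in (low, high)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / df
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = x.mean() - y.mean()
    p_lower = stats.t.sf((diff - low) / se, df)    # H0: diff <= low
    p_upper = stats.t.cdf((diff - high) / se, df)  # H0: diff >= high
    return max(p_lower, p_upper)                   # reject both to claim equivalence

rng = np.random.default_rng(5)
manual = rng.normal(20.0, 3.0, 40)     # hypothetical manual reference counts
automatic = rng.normal(20.3, 3.0, 40)  # hypothetical APC device counts
p = tost_ind(automatic, manual, low=-1.0, high=1.0)
print(f"TOST p-value = {p:.3f} (equivalence claimed if p < alpha)")
```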


2015 ◽  
Vol 46 (3) ◽  
pp. 586-603 ◽  
Author(s):  
Ma Dolores Hidalgo ◽  
Isabel Benítez ◽  
Jose-Luis Padilla ◽  
Juana Gómez-Benito

The growing use of scales in survey questionnaires warrants the need to address how polytomous differential item functioning (DIF) affects observed scale score comparisons. The aim of this study is to investigate the impact of DIF on the Type I error and effect size of the independent samples t-test on the observed total scale scores. A simulation study was conducted, focusing on potential variables related to DIF in polytomous items, such as DIF pattern, sample size, magnitude, and percentage of DIF items. The results showed that DIF patterns and the number of DIF items affected the Type I error rates and effect size of t-test values. The results highlighted the need to analyze DIF before making comparative group interpretations.
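
A simplified sketch of the mechanism studied above: uniform DIF in a subset of polytomous items can inflate the Type I error of a t-test on observed total scores even when the two groups have identical latent trait distributions. The item-response generator, number of items, and DIF shift are illustrative assumptions, not the study's simulation design.

```python
# Simplified sketch: uniform DIF in some polytomous items inflates the Type I
# error of a t test on total scores despite identical latent trait distributions.
# The item generator, number of items, and DIF shift are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
alpha, n_sims = 0.05, 2_000
n_per_group, n_items, n_dif = 100, 10, 3
cutpoints = np.array([-1.5, -0.5, 0.5, 1.5])   # 5-point Likert-style scoring
dif_shift = 0.5                                 # DIF items are "easier" for the focal group

def total_scores(theta, shift_items):
    scores = np.zeros(len(theta))
    for j in range(n_items):
        cuts = cutpoints - (dif_shift if j in shift_items else 0.0)
        latent = theta[:, None] + rng.normal(0.0, 1.0, (len(theta), 1))
        scores += (latent > cuts).sum(axis=1)   # item score 0..4
    return scores

rejections = 0
for _ in range(n_sims):
    theta1 = rng.normal(0.0, 1.0, n_per_group)  # identical latent distributions,
    theta2 = rng.normal(0.0, 1.0, n_per_group)  # so any mean difference is DIF-induced
    s1 = total_scores(theta1, shift_items=())             # reference group: no shift
    s2 = total_scores(theta2, shift_items=range(n_dif))   # focal group: DIF items shifted
    rejections += stats.ttest_ind(s1, s2).pvalue < alpha

print(f"Rejection rate of the t test on total scores: {rejections / n_sims:.3f}")
```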


Methodology ◽  
2009 ◽  
Vol 5 (2) ◽  
pp. 60-70 ◽  
Author(s):  
W. Holmes Finch ◽  
Teresa Davenport

Permutation testing has been suggested as an alternative to the standard approximate F tests used in multivariate analysis of variance (MANOVA). These approximate tests, such as Wilks' Lambda and Pillai's Trace, have been shown to perform poorly when the assumptions of normally distributed dependent variables and homogeneity of group covariance matrices are violated. Because Monte Carlo permutation tests do not rely on distributional assumptions, they may be expected to work better than their approximate cousins when the data do not conform to the assumptions described above. The current simulation study compared the performance of four standard MANOVA test statistics with their Monte Carlo permutation-based counterparts under a variety of conditions with small samples, including conditions in which the assumptions were met and conditions in which they were not. Results suggest that for sample sizes of 50 subjects, power is very low for all the statistics. In addition, Type I error rates for both the approximate F and Monte Carlo tests were inflated under the condition of nonnormal data and unequal covariance matrices. In general, the performance of the Monte Carlo permutation tests was slightly better in terms of Type I error rates and power when the assumptions of normality and homogeneous covariance matrices were both violated. It should be noted that these simulations were based on the case with three groups only, and as such the results presented in this study can only be generalized to similar situations.
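
A minimal sketch of a Monte Carlo permutation test for one MANOVA statistic, Pillai's trace, computed from the between- and within-group SSCP matrices, with the null distribution built by randomly permuting group labels. The group sizes, number of dependent variables, and number of permutations are assumed values, not the study's conditions.

```python
# Monte Carlo permutation test for Pillai's trace in a one-way MANOVA layout.
# Group sizes, dimensionality, and permutation count are illustrative assumptions.
import numpy as np

def pillai_trace(X, labels):
    grand = X.mean(axis=0)
    B = np.zeros((X.shape[1], X.shape[1]))
    W = np.zeros_like(B)
    for g in np.unique(labels):
        Xg = X[labels == g]
        d = (Xg.mean(axis=0) - grand)[:, None]
        B += len(Xg) * d @ d.T                                   # between-group SSCP
        W += (Xg - Xg.mean(axis=0)).T @ (Xg - Xg.mean(axis=0))   # within-group SSCP
    return np.trace(B @ np.linalg.inv(B + W))

rng = np.random.default_rng(7)
n_per_group, n_groups, n_dv, n_perms = 15, 3, 4, 2_000

# Hypothetical data: three groups, four dependent variables, no true group effect.
X = rng.normal(size=(n_per_group * n_groups, n_dv))
labels = np.repeat(np.arange(n_groups), n_per_group)

observed = pillai_trace(X, labels)
null = np.array([pillai_trace(X, rng.permutation(labels)) for _ in range(n_perms)])
p_value = (1 + np.sum(null >= observed)) / (1 + n_perms)
print(f"Pillai's trace = {observed:.3f}, permutation p = {p_value:.3f}")
```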


2019 ◽  
Vol 3 (Supplement_1) ◽  
Author(s):  
Keisuke Ejima ◽  
Andrew Brown ◽  
Daniel Smith ◽  
Ufuk Beyaztas ◽  
David Allison

Objectives: Rigor, reproducibility, and transparency (RRT) awareness has expanded over the last decade. Although RRT can be improved in various respects, we focused on the Type I error rates and power of commonly used statistical analyses testing mean differences between two groups, using small (n ≤ 5) to moderate sample sizes. Methods: We compared data from five distinct, homozygous, monogenic, murine models of obesity with non-mutant controls of both sexes. Baseline weight (7–11 weeks old) was the outcome. To examine whether the Type I error rate could be affected by the choice of statistical test, we adjusted the empirical distributions of weights to ensure the null hypothesis (i.e., no mean difference) in two ways: Case 1) center both weight distributions on the same mean weight; Case 2) combine data from the control and mutant groups into one distribution. From these cases, 3 to 20 mice were resampled to create a 'plasmode' dataset. We performed five common tests (Student's t-test, Welch's t-test, Wilcoxon test, permutation test, and bootstrap test) on the plasmodes and computed Type I error rates. Power was assessed using plasmodes, where the distribution of the control group was shifted by adding a constant value as in Case 1, but to realize nominal effect sizes. Results: Type I error rates were unreasonably high compared with the nominal significance level (Type I error rate inflation) for Student's t-test, Welch's t-test, and the permutation test, especially when sample sizes were small, for Case 1, whereas inflation was observed only for the permutation test for Case 2. Deflation was noted for the bootstrap test with small samples. Increasing sample size mitigated inflation and deflation, except for the Wilcoxon test in Case 1, because heterogeneity of the weight distributions between groups violated the assumptions needed for testing mean differences. For power, a departure from the reference value was observed with small samples. Compared with the other tests, the bootstrap test was underpowered with small samples as a tradeoff for maintaining Type I error rates. Conclusions: With small samples (n ≤ 5), the bootstrap test avoided Type I error rate inflation, but often at the cost of lower power. To avoid Type I error rate inflation for the other tests, sample size should be increased. The Wilcoxon test should be avoided because of the heterogeneity of weight distributions between mutant and control mice. Funding Sources: This study was supported in part by the NIH and a Japan Society for the Promotion of Science (JSPS) KAKENHI grant.
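
A sketch of the Case 1 plasmode idea under stated assumptions: two fabricated "empirical" weight distributions are recentered onto a common mean so the null hypothesis holds, small groups are resampled with replacement, and the rejection rates of Student's t, Welch's t, and the Wilcoxon (Mann-Whitney) test are tallied. The data are not the murine data used in the study, and the permutation and bootstrap tests are omitted for brevity.

```python
# Case 1 plasmode sketch: recenter two empirical weight distributions onto a
# common mean so H0 holds, resample small groups with replacement, and tally
# how often each test rejects. The "empirical" weights are fabricated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
alpha, n_resamples, n = 0.05, 5_000, 5   # n <= 5 per group, as in the abstract

# Hypothetical baseline weights (grams) for control and mutant mice.
control = rng.normal(25.0, 2.0, 60)
mutant = rng.gamma(shape=9.0, scale=4.0, size=60)   # heavier and more variable

# Case 1: center both distributions on the same mean so any rejection is a Type I error.
common_mean = np.concatenate([control, mutant]).mean()
control = control - control.mean() + common_mean
mutant = mutant - mutant.mean() + common_mean

hits = {"student": 0, "welch": 0, "wilcoxon": 0}
for _ in range(n_resamples):
    x = rng.choice(control, n, replace=True)
    y = rng.choice(mutant, n, replace=True)
    hits["student"]  += stats.ttest_ind(x, y, equal_var=True).pvalue < alpha
    hits["welch"]    += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha
    hits["wilcoxon"] += stats.mannwhitneyu(x, y).pvalue < alpha

for name, k in hits.items():
    print(f"{name:8s}: estimated Type I error = {k / n_resamples:.3f}")
```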


2017 ◽  
Author(s):  
Marie Delacre ◽  
Daniel Lakens ◽  
Christophe Leys

When comparing two independent groups, researchers in psychology commonly use Student's t-test, which rests on the assumptions of normality and homogeneity of variance. When these conditions are not met, Student's t-test is often severely biased and leads to invalid statistical inferences. Moreover, we argue that the assumption of equal variances will seldom hold in psychological research, and that choosing between Student's t-test and Welch's t-test based on the outcome of a test of the equality of variances often fails to provide an appropriate answer. We show that Welch's t-test provides better control of Type I error rates when the assumption of homogeneity of variance is not met, and that it loses little robustness compared to Student's t-test when the assumptions are met. We argue that Welch's t-test should be used as a default strategy.
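
In SciPy terms, adopting Welch's t-test as the default simply means calling ttest_ind with equal_var=False rather than the pooled-variance test; a minimal usage sketch with fabricated data follows.

```python
# Welch's t test as the default in SciPy: pass equal_var=False to ttest_ind.
# The data below are fabricated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
group_a = rng.normal(0.0, 1.0, 25)
group_b = rng.normal(0.4, 2.5, 60)   # larger mean, larger variance, larger n

student = stats.ttest_ind(group_a, group_b, equal_var=True)   # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)    # default strategy argued for
print(f"Student: t = {student.statistic:.2f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
```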

