Type I error rates and test power for some variance components estimation methods: one-way random effect model

Abstract. This study was conducted to compare Type I error and test power of ANOVA, REML and ML methods by Monte Carlo simulation technique under different experimental conditions. Simulation results indicated that the variance ratios, sample size and number of groups were important factors in determining appropriate methods which were used to estimate variance components. The ML method was found slightly superior when compared to ANOVA and REML methods. On the other hand, ANOVA and REML methods generated similar results in general. As a results, regardless of distribution shapes and number of groups and if n<15; ML, REML methods might be preferred to the ANOVA. However, when either number of groups or sample size was increased (n≥15) ANOVA method may also be used along with ML and REML.

Download Full-text

Another Warning about Median Reaction Time --- Version of 11 Feb 2020

10.31234/osf.io/3q5np ◽

2020 ◽

Author(s):

Jeff Miller

Keyword(s):

Type I Error ◽

Reaction Times ◽

Error Rates ◽

Type I ◽

Experimental Conditions ◽

Type I Error Rates ◽

Correction Technique ◽

Inflated Type ◽

The Mean ◽

Rt Distributions

Contrary to the warning of Miller (1988), Rousselet and Wilcox (2020) argued that it is better to summarize each participant’s single-trial reaction times (RTs) in a given condition with the median than with the mean when comparing the central tendencies of RT distributions across experimental conditions. They acknowledged that median RTs can produce inflated Type I error rates when conditions differ in the number of trials tested, consistent with Miller’s warning, but they showed that the bias responsible for this error rate inflation could be eliminated with a bootstrap bias correction technique. The present simulations extend their analysis by examining the power of bias-corrected medians to detect true experimental effects and by comparing this power with the power of analyses using means and regular medians. Unfortunately, although bias-corrected medians solve the problem of inflated Type I error rates, their power is lower than that of means or regular medians in many realistic situations. In addition, even when conditions do not differ in the number of trials tested, the power of tests (e.g., t-tests) is generally lower using medians rather than means as the summary measures. Thus, the present simulations demonstrate that summary means will often provide the most powerful test for differences between conditions, and they show what aspects of the RT distributions determine the size of the power advantage for means.

Download Full-text

Using Bayesian Nonparametric Item Response Function Estimation to Check Parametric Model Fit

Applied Psychological Measurement ◽

10.1177/0146621620909906 ◽

2020 ◽

Vol 44 (5) ◽

pp. 331-345

Author(s):

Wenhao Wang ◽

Neal Kingston

Keyword(s):

Item Response ◽

Type I Error ◽

Error Rates ◽

Estimation Methods ◽

Type I ◽

Nonparametric Method ◽

Bayesian Nonparametric ◽

Lower Type ◽

Type I Error Rates ◽

Nonparametric Item Response

Previous studies indicated that the assumption of logistic form of parametric item response functions (IRFs) is violated often enough to be worth checking. Using nonparametric item response theory (IRT) estimation methods with the posterior predictive model checking method can obtain significance probabilities of fit statistics in a Bayesian framework by accounting for the uncertainty of the parameter estimation and can indicate the location and magnitude of misfit for an item. The purpose of this study is to check the performance of the Bayesian nonparametric method to assess the IRF fit of parametric IRT models for mixed-format tests and compare it with the existing bootstrapping nonparametric method under various conditions. The simulation study results show that the Bayesian nonparametric method can detect misfit items with higher power and lower type I error rates when the sample size is large and with lower type I error rates compared with the bootstrapping method for the conditions with nonmonotonic items. In the real-data study, several dichotomous and polytomous misfit items were identified and the location and magnitude of misfit were indicated.

Download Full-text

Comparison of a Two-Stage and Three-Stage Interim-Analysis Procedure

Psychological Reports ◽

10.2466/pr0.1992.71.1.3 ◽

1992 ◽

Vol 71 (1) ◽

pp. 3-14 ◽

Cited By ~ 1

Author(s):

John E. Overall ◽

Robert S. Atlas

Keyword(s):

Sample Size ◽

Interim Analysis ◽

Type I Error ◽

Substantial Reduction ◽

Error Rates ◽

Sampling Plan ◽

Type I ◽

Two Stage ◽

Expected Sample Size ◽

Analysis Strategy

A statistical model for combining p values from multiple tests of significance is used to define rejection and acceptance regions for two-stage and three-stage sampling plans. Type I error rates, power, frequencies of early termination decisions, and expected sample sizes are compared. Both the two-stage and three-stage procedures provide appropriate protection against Type I errors. The two-stage sampling plan with its single interim analysis entails minimal loss in power and provides substantial reduction in expected sample size as compared with a conventional single end-of-study test of significance for which power is in the adequate range. The three-stage sampling plan with its two interim analyses introduces somewhat greater reduction in power, but it compensates with greater reduction in expected sample size. Either interim-analysis strategy is more efficient than a single end-of-study analysis in terms of power per unit of sample size.

Download Full-text

Use of interval estimations in design and evaluation of multiregional clinical trials with continuous outcomes

Statistical Methods in Medical Research ◽

10.1177/0962280217751277 ◽

2018 ◽

Vol 28 (7) ◽

pp. 2179-2195 ◽

Cited By ~ 1

Author(s):

Chieh Chiang ◽

Chin-Fu Hsiao

Keyword(s):

Clinical Trials ◽

Sample Size ◽

Type I Error ◽

Interval Estimation ◽

Error Rates ◽

New Drugs ◽

Sample Size Determination ◽

Type I ◽

Size Determination ◽

Interval Estimators

Multiregional clinical trials have been accepted in recent years as a useful means of accelerating the development of new drugs and abridging their approval time. The statistical properties of multiregional clinical trials are being widely discussed. In practice, variance of a continuous response may be different from region to region, but it leads to the assessment of the efficacy response falling into a Behrens–Fisher problem—there is no exact testing or interval estimator for mean difference with unequal variances. As a solution, this study applies interval estimations of the efficacy response based on Howe’s, Cochran–Cox’s, and Satterthwaite’s approximations, which have been shown to have well-controlled type I error rates. However, the traditional sample size determination cannot be applied to the interval estimators. The sample size determination to achieve a desired power based on these interval estimators is then presented. Moreover, the consistency criteria suggested by the Japanese Ministry of Health, Labour and Welfare guidance to decide whether the overall results from the multiregional clinical trial obtained via the proposed interval estimation were also applied. A real example is used to illustrate the proposed method. The results of simulation studies indicate that the proposed method can correctly determine the required sample size and evaluate the assurance probability of the consistency criteria.

Download Full-text

The effect of number of clusters and cluster size on statistical power and Type I error rates when testing random effects variance components in multilevel linear and logistic regression models

Journal of Statistical Computation and Simulation ◽

10.1080/00949655.2018.1504945 ◽

2018 ◽

Vol 88 (16) ◽

pp. 3151-3163 ◽

Cited By ~ 8

Author(s):

Peter C. Austin ◽

George Leckie

Keyword(s):

Variance Components ◽

Cluster Size ◽

Regression Models ◽

Statistical Power ◽

Type I Error ◽

Error Rates ◽

Type I ◽

Number Of Clusters ◽

Logistic Regression Models ◽

Type I Error Rates

Download Full-text

Assessment of Type I Error Rates and Power of Common Analysis Methods in Murine Obesity-Related Study: ‘Plasmode-Based’ Simulation (P13-011-19)

Current Developments in Nutrition ◽

10.1093/cdn/nzz036.p13-011-19 ◽

2019 ◽

Vol 3 (Supplement_1) ◽

Author(s):

Keisuke Ejima ◽

Andrew Brown ◽

Daniel Smith ◽

Ufuk Beyaztas ◽

David Allison

Keyword(s):

Sample Size ◽

Error Rate ◽

Type I Error ◽

Error Rates ◽

T Test ◽

Small Samples ◽

Type I ◽

Type I Error Rates ◽

Type I Error Rate ◽

Weight Distributions

Abstract Objectives Rigor, reproducibility and transparency (RRT) awareness has expanded over the last decade. Although RRT can be improved from various aspects, we focused on type I error rates and power of commonly used statistical analyses testing mean differences of two groups, using small (n ≤ 5) to moderate sample sizes. Methods We compared data from five distinct, homozygous, monogenic, murine models of obesity with non-mutant controls of both sexes. Baseline weight (7–11 weeks old) was the outcome. To examine whether type I error rate could be affected by choice of statistical tests, we adjusted the empirical distributions of weights to ensure the null hypothesis (i.e., no mean difference) in two ways: Case 1) center both weight distributions on the same mean weight; Case 2) combine data from control and mutant groups into one distribution. From these cases, 3 to 20 mice were resampled to create a ‘plasmode’ dataset. We performed five common tests (Student's t-test, Welch's t-test, Wilcoxon test, permutation test and bootstrap test) on the plasmodes and computed type I error rates. Power was assessed using plasmodes, where the distribution of the control group was shifted by adding a constant value as in Case 1, but to realize nominal effect sizes. Results Type I error rates were unreasonably higher than the nominal significance level (type I error rate inflation) for Student's t-test, Welch's t-test and permutation especially when sample size was small for Case 1, whereas inflation was observed only for permutation for Case 2. Deflation was noted for bootstrap with small sample. Increasing sample size mitigated inflation and deflation, except for Wilcoxon in Case 1 because heterogeneity of weight distributions between groups violated assumptions for the purposes of testing mean differences. For power, a departure from the reference value was observed with small samples. Compared with the other tests, bootstrap was underpowered with small samples as a tradeoff for maintaining type I error rates. Conclusions With small samples (n ≤ 5), bootstrap avoided type I error rate inflation, but often at the cost of lower power. To avoid type I error rate inflation for other tests, sample size should be increased. Wilcoxon should be avoided because of heterogeneity of weight distributions between mutant and control mice. Funding Sources This study was supported in part by NIH and Japan Society for Promotion of Science (JSPS) KAKENHI grant.

Download Full-text

The Modification and Evaluation of the Alexander-Govern Test in Terms of Power

Modern Applied Science ◽

10.5539/mas.v9n13p1 ◽

2015 ◽

Vol 9 (13) ◽

pp. 1

Author(s):

Tobi Kingsley Ochuko ◽

Suhaida Abdullah ◽

Zakiyah Binti Zain ◽

Sharipah Soaad Syed Yahaya

Keyword(s):

Sample Size ◽

High Power ◽

Type I Error ◽

Error Rates ◽

Robust Estimator ◽

Type I ◽

Central Tendency ◽

Unequal Variance ◽

Central Tendency Measure ◽

Type I Error Rates

This study centres on the comparison of independent group tests in terms of power, by using parametric method, such as the Alexander-Govern test. The Alexander-Govern (AG) test uses mean as its central tendency measure. It is a better alternative compared to the Welch test, the James test and the ANOVA, because it produces high power and gives good control of Type I error rates for a normal data under variance heterogeneity. But this test is not robust for a non-normal data. When trimmed mean was applied on the test as its central tendency measure under non-normality, the test was only robust for two group condition, but as the number of groups increased more than two groups, the test was no more robust. As a result, a highly robust estimator known as the MOM estimator was applied on the test, as its central tendency measure. This test is not affected by the number of groups, but could not control Type I error rates under skewed heavy tailed distribution. In this study, the Winsorized MOM estimator was applied in the AG test, as its central tendency measure. A simulation of 5,000 data sets were generated and analysed on the test, using the SAS package. The result of the analysis, shows that with the pairing of unbalanced sample size of (15:15:20:30) with equal variance of (1:1:1:1) and the pairing of unbalanced sample size of (15:15:20:30) with unequal variance of (1:1:1:36) with effect size index (f = 0.8), the AGWMOM test only produced a high power value of 0.9562 and 0.8336 compared to the AG test, the AGMOM test and the ANOVA respectively and the test is considered to be sufficient.

Download Full-text

Consequences of power transforms as a statistical solution in linear mixed-effects models of chronometric data

10.31234/osf.io/f73mh ◽

2018 ◽

Cited By ~ 1

Author(s):

Van Rynald T Liceralde ◽

Peter C. Gordon

Keyword(s):

Type I Error ◽

Response Times ◽

Random Effect ◽

Mixed Effects ◽

Error Rates ◽

Mixed Effects Models ◽

Type I ◽

Statistical Solution ◽

Linear Mixed Effects Models ◽

Linear Mixed Effects

Power transforms have been increasingly used in linear mixed-effects models (LMMs) of chronometric data (e.g., response times [RTs]) as a statistical solution to preempt violating the assumption of residual normality. However, differences in results between LMMs fit to raw RTs and transformed RTs have reignited discussions on issues concerning the transformation of RTs. Here, we analyzed three word-recognition megastudies and performed Monte Carlo simulations to better understand the consequences of transforming RTs in LMMs. Within each megastudy, transforming RTs produced different fixed- and random-effect patterns; across the megastudies, RTs were optimally normalized by different power transforms, and results were more consistent among LMMs fit to raw RTs. Moreover, the simulations showed that LMMs fit to optimally normalized RTs had greater power for main effects in smaller samples, but that LMMs fit to raw RTs had greater power for interaction effects as sample sizes increased, with negligible differences in Type I error rates between the two models. Based on these results, LMMs should be fit to raw RTs when there is no compelling reason beyond nonnormality to transform RTs and when the interpretive framework mapping the predictors and RTs treats RT as an interval scale.

Download Full-text

A Machine Learning Approach to Assess Differential Item Functioning in Psychometric Questionnaires Using the Elastic Net Regularized Ordinal Logistic Regression in Small Sample Size Groups

BioMed Research International ◽

10.1155/2021/6854477 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Vahid Ebrahimi ◽

Zahra Bagheri ◽

Zahra Shayan ◽

Peyman Jafari

Keyword(s):

Sample Size ◽

Type I Error ◽

Small Sample Size ◽

Error Rates ◽

Small Sample ◽

Elastic Net ◽

Small Samples ◽

Type I ◽

Item Functioning ◽

Size Groups

Assessing differential item functioning (DIF) using the ordinal logistic regression (OLR) model highly depends on the asymptotic sampling distribution of the maximum likelihood (ML) estimators. The ML estimation method, which is often used to estimate the parameters of the OLR model for DIF detection, may be substantially biased with small samples. This study is aimed at proposing a new application of the elastic net regularized OLR model, as a special type of machine learning method, for assessing DIF between two groups with small samples. Accordingly, a simulation study was conducted to compare the powers and type I error rates of the regularized and nonregularized OLR models in detecting DIF under various conditions including moderate and severe magnitudes of DIF ( DIF = 0.4 and 0.8 ), sample size ( N ), sample size ratio ( R ), scale length ( I ), and weighting parameter ( w ). The simulation results revealed that for I = 5 and regardless of R , the elastic net regularized OLR model with w = 0.1 , as compared with the nonregularized OLR model, increased the power of detecting moderate uniform DIF ( DIF = 0.4 ) approximately 35% and 21% for N = 100 and 150 , respectively. Moreover, for I = 10 and severe uniform DIF ( DIF = 0.8 ), the average power of the elastic net regularized OLR model with 0.03 ≤ w ≤ 0.06 , as compared with the nonregularized OLR model, increased approximately 29.3% and 11.2% for N = 100 and 150 , respectively. In these cases, the type I error rates of the regularized and nonregularized OLR models were below or close to the nominal level of 0.05. In general, this simulation study showed that the elastic net regularized OLR model outperformed the nonregularized OLR model especially in extremely small sample size groups. Furthermore, the present research provided a guideline and some recommendations for researchers who conduct DIF studies with small sample sizes.

Download Full-text

Type I Error Rates for Yao’s and James Tests of Equality of Mean Vectors Under VarianceCovariance Heteroscedasticity

Journal of Educational Statistics ◽

10.3102/10769986013003281 ◽

1988 ◽

Vol 13 (3) ◽

pp. 281-290 ◽

Cited By ~ 4

Author(s):

James Algina ◽

Kezhen L. Tang

Keyword(s):

Sample Size ◽

Type I Error ◽

Error Rates ◽

Covariance Matrices ◽

Type I ◽

Type I Error Rates ◽

Mean Vectors ◽

Equality Of Mean Vectors

For Yao’s and James’ tests, Type I error rates were estimated for various combinations of the number of variables (p), samplesize ratio (n1: n2), sample-size-to-variables ratio, and degree of heteroscedasticity. These tests are alternatives to Hotelling’s T2 and are intended for use when the variance-covariance matrices are not equal in a study using two independent samples. The performance of Yao’s test was superior to that of James’. Yao’s test had appropriate Type I error rates when p ≥ 10, (n1 + n2)/p ≥ 10, and 1:2 ≤ n1:n2 ≤ 2:1. When (n1 + n2)/p = 20, Yao’s test was robust when n1: n2 was 5:1, 3:1, and 4:1 and p was 2, 6, and 10, respectively.

Download Full-text