Taking Parametric Assumptions Seriously Arguments for the Use of Welch’s F-test instead of the Classical F-test in One-way ANOVA (in press for the International Review of Social Psychology)

Student's t-test and classical F-test ANOVA rely on the assumptions that two or more samples are independent, and that independent and identically distributed residuals are normal and have equal variances between groups. We focus on the assumptions of normality and equality of variances, and argue that these assumptions are often unrealistic in the field of psychology. We underline the current lack of attention to these assumptions through an analysis of researchers' practices. Through Monte Carlo simulations we illustrate the consequences of performing the classic parametric F-test for ANOVA when the test assumptions are not met on the Type I error rate and statistical power. Under realistic deviations from the assumption of equal variances the classic F-test can yield severely biased results and lead to invalid statistical inferences. We examine two common alternatives to the F-test, namely the Welch's ANOVA (W-test) and the Brown-Forsythe test (F*-test). Our simulations show that under a range of realistic scenariosthe W-test is a better alternative and we therefore recommend using the W-test by default when comparing means. We provide a detailed example explaining how to perform the W-test in SPSS and R. We summarize our conclusions in practical recommendations that researchers can use to improve their statistical practices.

Download Full-text

Evaluating the Efficacy of Conditional Analysis of Variance under Heterogeneity and Non-Normality

Journal of Modern Applied Statistical Methods ◽

10.22237/jmasm/1555340224 ◽

2019 ◽

Vol 17 (2) ◽

Author(s):

Yan Wang ◽

Thanh Pham ◽

Diep Nguyen ◽

Eun Sook Kim ◽

Yi-Hsin Chen ◽

...

Keyword(s):

Analysis Of Variance ◽

Simulation Study ◽

Statistical Power ◽

Error Control ◽

Type I Error ◽

Conditional Analysis ◽

Type I ◽

F Test ◽

Homogeneity Of Variance ◽

Robust Anova

A simulation study was conducted to examine the efficacy of conditional analysis of variance (ANOVA) methods where the initial homogeneity of variance screening leads to the choice between the ANOVA F test and robust ANOVA methods. Type I error control and statistical power were investigated under various conditions.

Download Full-text

Additional file to "Why Psychologists Should by Default Use Welch's t-test Instead of Student's t-test." (in press for the International Review of Social Psychology)

10.31219/osf.io/dqck7 ◽

2017 ◽

Cited By ~ 1

Author(s):

Marie Delacre ◽

Daniel Lakens ◽

Christophe Leys

Keyword(s):

Psychological Research ◽

Error Rates ◽

T Test ◽

International Review ◽

Type 1 Error ◽

Statistical Inferences ◽

Homogeneity Of Variance ◽

Student’S T ◽

Student’S T Test

When comparing two independent groups, researchers in Psychology commonly use Student’s t-test. Assumptions of normality and of homogeneity of variance underlie this test. More often than not, when these conditions are not met, Student’s t-test can be severely biased, and leads to invalid statistical inferences. Moreover, we argue that the assumption of equal variances will seldom hold in psychological research and that choosing between Student’s t-test or Welch’s t-test based on the outcomes of a test of the equality of variances often fails to provide an appropriate answer. We show that the Welch’s t-test provides a better control of Type 1 error rates when the assumption of homogeneity of variance is not met, and loses little robustness compared to Student’s t-test when the assumptions are met. We argue that Welch’s t-test should be used as a default strategy.

Download Full-text

Why Psychologists Should by Default Use Welch's t-test Instead of Student's t-test (in press for the International Review of Social Psychology).

10.31219/osf.io/sbp6k ◽

2017 ◽

Cited By ~ 1

Author(s):

Marie Delacre ◽

Daniel Lakens ◽

Christophe Leys

Keyword(s):

Psychological Research ◽

Error Rates ◽

T Test ◽

International Review ◽

Type 1 Error ◽

Statistical Inferences ◽

Homogeneity Of Variance ◽

Student’S T ◽

Student’S T Test

Download Full-text

Detecting Unit of Analysis Problems in Nested Designs: Statistical Power and Type I Error Rates of the F Test for Groups-within-Treatments Effects

Educational and Psychological Measurement ◽

10.1177/0013164496056002003 ◽

1996 ◽

Vol 56 (2) ◽

pp. 215-231 ◽

Cited By ~ 9

Author(s):

Jeffrey D. Kromrey ◽

Wendy B. Dickinson

Keyword(s):

Statistical Power ◽

Type I Error ◽

Error Rates ◽

Type I ◽

F Test ◽

Unit Of Analysis ◽

Type I Error Rates ◽

Nested Designs

Download Full-text

Application of the Hierarchical Bootstrap to Multi-Level Data in Neuroscience

10.1101/819334 ◽

2019 ◽

Cited By ~ 14

Author(s):

Varun Saravanan ◽

Gordon J. Berman ◽

Samuel J. Sober

Keyword(s):

Error Rate ◽

Statistical Power ◽

Type I Error ◽

Statistical Tests ◽

False Positive Rate ◽

Type I ◽

Type I Error Rate ◽

The Hierarchical Structure ◽

Positive Rate ◽

Student’S T

AbstractA common feature in many neuroscience datasets is the presence of hierarchical data structures, most commonly recording the activity of multiple neurons in multiple animals across multiple trials. Accordingly, the measurements constituting the dataset are not independent, even though the traditional statistical analyses often applied in such cases (e.g. Student’s t-test) treat them as such. The hierarchical bootstrap has been shown to be an effective tool to accurately analyze such data and while it has been used extensively in the statistical literature, its use is not widespread in neuroscience - despite the ubiquity of hierarchical datasets. In this paper, we illustrate the intuitiveness and utility of this approach to analyze hierarchically nested datasets. We use simulated neural data to show that traditional statistical tests can result in a false positive rate of over 45%, even if the Type-I error rate is set at 5%. While summarizing data across non-independent points (or lower levels) can potentially fix this problem, this approach greatly reduces the statistical power of the analysis. The hierarchical bootstrap, when applied sequentially over the levels of the hierarchical structure, keeps the Type-I error rate within the intended bound and retains more statistical power than summarizing methods. We conclude by demonstrating the effectiveness of the method in two real-world examples, first analyzing singing data in male Bengalese finches (Lonchura striata var. domestica) and second quantifying changes in behavior under optogenetic control in flies (Drosophila melanogaster).

Download Full-text

How to Detect Publication Bias in Psychological Research

Zeitschrift für Psychologie ◽

10.1027/2151-2604/a000386 ◽

2019 ◽

Vol 227 (4) ◽

pp. 261-279 ◽

Cited By ~ 2

Author(s):

Frank Renkewitz ◽

Melanie Keiner

Keyword(s):

Publication Bias ◽

Effect Size ◽

Statistical Power ◽

Type I Error ◽

Psychological Research ◽

Type I ◽

True Effect Size ◽

Questionable Research Practices ◽

True Effect ◽

Meta Analyses

Abstract. Publication biases and questionable research practices are assumed to be two of the main causes of low replication rates. Both of these problems lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect such bias in meta-analytic results. We present an evaluation of the performance of six of these tools. To assess the Type I error rate and the statistical power of these methods, we simulated a large variety of literatures that differed with regard to true effect size, heterogeneity, number of available primary studies, and sample sizes of these primary studies; furthermore, simulated studies were subjected to different degrees of publication bias. Our results show that across all simulated conditions, no method consistently outperformed the others. Additionally, all methods performed poorly when true effect sizes were heterogeneous or primary studies had a small chance of being published, irrespective of their results. This suggests that in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.

Download Full-text

Comparison of Means: An F Test Followed by a Modified Multiple Range Procedure

Journal of Educational Statistics ◽

10.3102/10769986004001014 ◽

1979 ◽

Vol 4 (1) ◽

pp. 14-23 ◽

Cited By ~ 9

Author(s):

Juliet Popper Shaffer

Keyword(s):

Error Control ◽

Type I Error ◽

Critical Value ◽

The Other ◽

Type I ◽

F Test ◽

Range Test

If used only when a preliminary F test yields significance, the usual multiple range procedures can be modified to increase the probability of detecting differences without changing the control of Type I error. The modification consists of a reduction in the critical value when comparing the largest and smallest means. Equivalence of modified and unmodified procedures in error control is demonstrated. The modified procedure is also compared with the alternative of using the unmodified range test without a preliminary F test, and it is shown that each has advantages over the other under some circumstances.

Download Full-text

A Multi-faceted Mess: A Review of Statistical Power Analysis in Psychology Journal Articles

10.31234/osf.io/3bdfu ◽

2019 ◽

Cited By ~ 2

Author(s):

Rob Cribbie ◽

Nataly Beribisky ◽

Udi Alter

Keyword(s):

Sample Size ◽

Effect Size ◽

Power Analysis ◽

Statistical Power ◽

Type I Error ◽

A Priori ◽

Type I ◽

Specific Level ◽

Maximum Sample Size ◽

Power Analyses

Many bodies recommend that a sample planning procedure, such as traditional NHST a priori power analysis, is conducted during the planning stages of a study. Power analysis allows the researcher to estimate how many participants are required in order to detect a minimally meaningful effect size at a specific level of power and Type I error rate. However, there are several drawbacks to the procedure that render it “a mess.” Specifically, the identification of the minimally meaningful effect size is often difficult but unavoidable for conducting the procedure properly, the procedure is not precision oriented, and does not guide the researcher to collect as many participants as feasibly possible. In this study, we explore how these three theoretical issues are reflected in applied psychological research in order to better understand whether these issues are concerns in practice. To investigate how power analysis is currently used, this study reviewed the reporting of 443 power analyses in high impact psychology journals in 2016 and 2017. It was found that researchers rarely use the minimally meaningful effect size as a rationale for the chosen effect in a power analysis. Further, precision-based approaches and collecting the maximum sample size feasible are almost never used in tandem with power analyses. In light of these findings, we offer that researchers should focus on tools beyond traditional power analysis when sample planning, such as collecting the maximum sample size feasible.

Download Full-text

Cognitive tests used in chronic adult human randomised controlled trial micronutrient and phytochemical intervention studies

Nutrition Research Reviews ◽

10.1017/s0954422410000119 ◽

2010 ◽

Vol 23 (2) ◽

pp. 200-229 ◽

Cited By ~ 25

Author(s):

Anna L. Macready ◽

Laurie T. Butler ◽

Orla B. Kennedy ◽

Judi A. Ellis ◽

Claire M. Williams ◽

...

Keyword(s):

Randomised Controlled Trial ◽

Statistical Power ◽

Type I Error ◽

Spatial Working Memory ◽

Controlled Trial ◽

Type I ◽

Cognitive Tests ◽

Cognitive Domains ◽

Positive Effects ◽

Randomised Controlled

In recent years there has been a rapid growth of interest in exploring the relationship between nutritional therapies and the maintenance of cognitive function in adulthood. Emerging evidence reveals an increasingly complex picture with respect to the benefits of various food constituents on learning, memory and psychomotor function in adults. However, to date, there has been little consensus in human studies on the range of cognitive domains to be tested or the particular tests to be employed. To illustrate the potential difficulties that this poses, we conducted a systematic review of existing human adult randomised controlled trial (RCT) studies that have investigated the effects of 24 d to 36 months of supplementation with flavonoids and micronutrients on cognitive performance. There were thirty-nine studies employing a total of 121 different cognitive tasks that met the criteria for inclusion. Results showed that less than half of these studies reported positive effects of treatment, with some important cognitive domains either under-represented or not explored at all. Although there was some evidence of sensitivity to nutritional supplementation in a number of domains (for example, executive function, spatial working memory), interpretation is currently difficult given the prevailing ‘scattergun approach’ for selecting cognitive tests. Specifically, the practice means that it is often difficult to distinguish between a boundary condition for a particular nutrient and a lack of task sensitivity. We argue that for significant future progress to be made, researchers need to pay much closer attention to existing human RCT and animal data, as well as to more basic issues surrounding task sensitivity, statistical power and type I error.

Download Full-text

Required sample size for comparing two independent means

Marine Medicine ◽

10.22328/2413-5747-2020-6-2-106-113 ◽

2020 ◽

Vol 6 (2) ◽

pp. 106-113

Author(s):

A. M. Grjibovski ◽

M. A. Gorbatova ◽

A. N. Narkevich ◽

K. A. Vinogradov

Keyword(s):

Sample Size ◽

Statistical Power ◽

Type I Error ◽

Sample Size Calculation ◽

Biomedical Literature ◽

Type I ◽

Research Practice ◽

False Null Hypothesis ◽

Different Levels ◽

Russian Research

Sample size calculation in a planning phase is still uncommon in Russian research practice. This situation threatens validity of the conclusions and may introduce Type I error when the false null hypothesis is accepted due to lack of statistical power to detect the existing difference between the means. Comparing two means using unpaired Students’ ttests is the most common statistical procedure in the Russian biomedical literature. However, calculations of the minimal required sample size or retrospective calculation of the statistical power were observed only in very few publications. In this paper we demonstrate how to calculate required sample size for comparing means in unpaired samples using WinPepi and Stata software. In addition, we produced tables for minimal required sample size for studies when two means have to be compared and body mass index and blood pressure are the variables of interest. The tables were constructed for unpaired samples for different levels of statistical power and standard deviations obtained from the literature.

Download Full-text