Evaluating Research in Personality and Social Psychology: Considerations of Statistical Power and Concerns About False Findings

2021 ◽  
pp. 014616722110308
Author(s):  
Duane T. Wegener ◽  
Leandre R. Fabrigar ◽  
Jolynn Pek ◽  
Kathryn Hoisington-Shaw

Traditionally, statistical power was viewed as relevant to research planning but not to the evaluation of completed research. However, following discussions of the high false finding rates (FFRs) associated with low statistical power, the assumed level of statistical power has become a key criterion for research acceptability. Yet the links between power and false findings are not as straightforward as often described. The assumptions underlying FFR calculations do not reflect research realities in personality and social psychology. Even granting those assumptions, the FFR calculations themselves identify important limits on any general influence of statistical power. The limited role of statistical power in inflating false findings can also be illustrated by using FFR calculations to (a) update beliefs about the null or alternative hypothesis and (b) assess the relative support for the null versus alternative hypothesis when evaluating a set of studies. Taken together, these considerations suggest that statistical power should be given less weight in research evaluation than it currently receives.
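
As background for the FFR arguments discussed here, the sketch below implements the standard false finding rate calculation, in which the FFR depends on the alpha level, the statistical power, and the prior probability that the alternative hypothesis is true. The prior values and function names are illustrative assumptions, not figures from the article.

```python
# A minimal sketch of the standard false finding rate (FFR) calculation that
# discussions like this one build on. The prior probability that the
# alternative hypothesis is true (prior_h1) is an illustrative assumption.

def false_finding_rate(alpha, power, prior_h1):
    """Proportion of significant results that are false positives."""
    false_positives = alpha * (1 - prior_h1)  # significant results under H0
    true_positives = power * prior_h1         # significant results under H1
    return false_positives / (false_positives + true_positives)

if __name__ == "__main__":
    for power in (0.35, 0.50, 0.80, 0.95):
        ffr = false_finding_rate(alpha=0.05, power=power, prior_h1=0.5)
        print(f"power = {power:.2f} -> FFR = {ffr:.3f}")
```

With a 50% prior on the alternative, the FFR changes only modestly once power is moderate, which is the kind of limit on power's general influence that the abstract refers to.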

Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for claiming a scientific finding leads to considerable distortion of the scientific process (American Statistical Association statement; Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values tell us little about the reliability of research, because they are themselves hardly replicable even if an alternative hypothesis is true. Significance (p ≤ 0.05) is also hardly replicable: at a realistic statistical power of 40%, and given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will conflict, in terms of significance, in one third of cases when there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions about the replicability and practical importance of a finding can only be drawn from cumulative evidence across multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that, with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. Yet current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis; doing so falsely concludes that 'there is no effect'. Information on the possible true effect sizes that are compatible with the data must be obtained from the observed effect size (e.g., a sample average) together with a measure of uncertainty, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease', or 'we need to get rid of p-values'.
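
The two replication figures quoted above follow from treating the studies as independent, each reaching significance with probability equal to the statistical power when a true effect exists. A minimal sketch under that assumption:

```python
# A quick check of the two replication figures quoted above, assuming each
# study independently reaches significance with probability equal to its
# statistical power when a true effect exists.

def both_significant(power):
    """P(original AND replication significant | true effect)."""
    return power ** 2

def conflicting(power):
    """P(exactly one of two studies significant | true effect)."""
    return 2 * power * (1 - power)

print(both_significant(0.40))  # ~0.16, i.e. about one in six
print(conflicting(0.80))       # 0.32, i.e. about one third of cases
```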


2021 ◽  
Author(s):  
Maximilian Maier ◽  
Daniel Lakens

The default use of an alpha level of 0.05 is suboptimal for two reasons. First, decisions based on data can be made more efficiently by choosing an alpha level that minimizes the combined Type 1 and Type 2 error rate. Second, in studies with very high statistical power, p-values lower than the alpha level can be more likely when the null hypothesis is true than when the alternative hypothesis is true (i.e., Lindley's paradox). This manuscript explains two approaches that can be used to justify a better choice of alpha level than the default threshold of 0.05. The first approach is based on the idea of either minimizing or balancing Type 1 and Type 2 error rates. The second approach lowers the alpha level as a function of the sample size to prevent Lindley's paradox. An R package and Shiny app are provided to perform the required calculations. Both approaches have their limitations (e.g., the challenge of specifying relative costs and priors), but they can offer an improvement over current practices, especially when sample sizes are large. Using alpha levels that have a better justification should improve statistical inferences and can increase the efficiency and informativeness of scientific research.
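
The first approach can be sketched in a few lines. The example below is a simple illustration, not the authors' R package: it chooses the alpha level that minimizes a weighted sum of Type 1 and Type 2 error rates for a two-sided one-sample z-test, with the effect size, sample sizes, and cost weight as assumed values.

```python
# A minimal sketch (not the authors' R package) of choosing an alpha level
# that minimizes a weighted sum of Type 1 and Type 2 error rates for a
# two-sided one-sample z-test. Effect size, sample sizes, and the cost weight
# are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def error_rates(alpha, effect_size, n):
    """Return (alpha, beta) for a two-sided z-test of H0: effect = 0."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = effect_size * np.sqrt(n)  # noncentrality when the alternative is true
    power = norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)
    return alpha, 1 - power

def optimal_alpha(effect_size, n, weight_type1=0.5):
    """Alpha minimizing weight*alpha + (1 - weight)*beta over a grid."""
    alphas = np.linspace(0.0005, 0.5, 1000)
    costs = [weight_type1 * a + (1 - weight_type1) * error_rates(a, effect_size, n)[1]
             for a in alphas]
    return alphas[int(np.argmin(costs))]

print(optimal_alpha(effect_size=0.2, n=100))  # optimum well above 0.05 (low power)
print(optimal_alpha(effect_size=0.2, n=500))  # optimum well below 0.05 (high power)
```

The same grid search also shows the sample-size dependence that motivates the second approach: as n grows, the error-minimizing alpha falls below the conventional 0.05.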


2021 ◽  
pp. 194855062110240
Author(s):  
Rotem Kahalon ◽  
Verena Klein ◽  
Inna Ksenofontov ◽  
Johannes Ullrich ◽  
Stephen C. Wright

Psychology research from Western, educated, industrialized, rich, and democratic (WEIRD) countries, especially from the United States, receives more scientific attention than research from non-WEIRD countries. We investigate one structural way that this inequality might be enacted: mentioning the sample's country in the article title. Analyzing the current publication practice of four leading social psychology journals (Study 1) and conducting two experiments with U.S. American and German students (Study 2), we show that the country is more often mentioned in articles with samples from non-WEIRD countries than those with samples from WEIRD countries (especially the United States) and that this practice is associated with less scientific attention. We propose that this phenomenon represents a (perhaps unintentional) form of structural discrimination, which can lead to underrepresentation and reduced impact of social psychological research done with non-WEIRD samples. We outline possible changes in the publication process that could challenge this phenomenon.


1990 ◽  
Vol 132 (Suppl 1) ◽  
pp. 156-166 ◽  
Author(s):  
Daniel Wartenberg ◽  
Michael Greenberg

A variety of methods and models have been proposed for the statistical analysis of disease excesses, yet these methods are rarely compared with respect to their ability to detect possible clusters. Evaluating statistical power is one approach for comparing different methods. In this paper, the authors study the probability that a test will reject the null hypothesis given that the null hypothesis is indeed false. They discuss some considerations involved in power studies of cluster methods and review two methods for detecting space-time clusters of disease, one based on cell occupancy models and the other based on interevent distance comparisons. The authors compare these approaches with respect to: 1) the sensitivity to detect disease excesses (false negatives); 2) the likelihood of detecting clusters that do not exist (false positives); and 3) the structure of a cluster in a given investigation (the alternative hypothesis). The methods chosen, which are two of the most commonly used, are specific to different hypotheses. Both show low power for the small numbers of cases that are typical of citizen reports to health departments.
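
The kind of power evaluation described here can be illustrated with a generic Monte Carlo sketch. The example below is not the cell-occupancy or interevent-distance method reviewed by the authors, but a simple one-sided Poisson test for an excess of cases, with the baseline expectations and relative risk chosen (as assumptions) to mimic the small clusters typical of citizen reports.

```python
# A generic Monte Carlo illustration (not the authors' methods) of how power
# is estimated for cluster tests: simulate data under a specified alternative
# and count how often the test rejects. Baseline expectations and the relative
# risk are assumed values.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)

def power_poisson_excess(expected_cases, relative_risk, alpha=0.05, n_sim=20_000):
    """Power of a one-sided exact Poisson test for an excess of observed cases."""
    observed = rng.poisson(expected_cases * relative_risk, size=n_sim)
    # p-value: P(X >= observed) under the null expectation
    p_values = poisson.sf(observed - 1, mu=expected_cases)
    return np.mean(p_values <= alpha)

for expected in (2, 5, 20):
    print(expected, power_poisson_excess(expected, relative_risk=2.0))
```

Even with a doubled risk, power stays low until the expected case count grows well beyond the handful of cases in a typical citizen report, which mirrors the authors' conclusion.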


2017 ◽  
Author(s):  
Daniel Lakens ◽  
Alexander Etz

Psychology journals rarely publish non-significant results. At the same time, it is often very unlikely (or ‘too good to be true’) that a set of studies yields exclusively significant results. Here, we use likelihood ratios to explain when sets of studies that contain a mix of significant and non-significant results are likely to be true, or ‘too true to be bad’. As we show, mixed results are not only likely to be observed in lines of research but, when observed, often provide evidence for the alternative hypothesis, given reasonable levels of statistical power and an adequately controlled low Type 1 error rate. Researchers should feel comfortable submitting such lines of research, with an internal meta-analysis, for publication. A better understanding of probabilities, accompanied by more realistic expectations of what real lines of studies look like, might be an important step in mitigating publication bias in the scientific literature.
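
The likelihood ratio logic can be made concrete with a short calculation. The sketch below (with illustrative numbers, not values from the paper) compares how probable a given mix of significant and non-significant results is under the alternative hypothesis with a stated power versus under the null hypothesis with alpha = 0.05.

```python
# A minimal sketch of the likelihood ratio reasoning described above: compare
# how probable "k significant results out of n studies" is when every study
# has the stated power (alternative true) versus when only the Type 1 error
# rate alpha produces significance (null true). Numbers are illustrative.
from scipy.stats import binom

def mixed_result_likelihood_ratio(n_studies, n_significant, power, alpha=0.05):
    """P(observed mix | H1 with given power) / P(observed mix | H0 with given alpha)."""
    p_h1 = binom.pmf(n_significant, n_studies, power)
    p_h0 = binom.pmf(n_significant, n_studies, alpha)
    return p_h1 / p_h0

# Two significant and one non-significant result out of three studies:
print(mixed_result_likelihood_ratio(3, 2, power=0.8))  # strongly favors H1
# Three out of three significant at 80% power is less probable than many expect:
print(binom.pmf(3, 3, 0.8))                            # ~0.51
```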


Author(s):  
Daniel Berner ◽  
Valentin Amrhein

A paradigm shift away from null hypothesis significance testing seems to be in progress. Based on simulations, we illustrate some of the underlying motivations. First, p-values vary strongly from study to study; hence dichotomous inference using significance thresholds is usually unjustified. Second, statistically significant results have overestimated effect sizes, a bias that declines with increasing statistical power. Third, statistically non-significant results have underestimated effect sizes, a bias that gets stronger with higher statistical power. Fourth, the tested statistical hypotheses generally lack biological justification and are often uninformative. Despite these problems, a screen of 48 papers from the 2020 volume of the Journal of Evolutionary Biology shows that significance testing is still used almost universally in evolutionary biology. All screened studies tested the default null hypothesis of zero effect with the default significance threshold of p = 0.05, none presented a pre-planned alternative hypothesis, and none calculated statistical power or the probability of ‘false negatives’ (beta error). The papers reported 49 significance tests on average. Of the 41 papers that contained verbal descriptions of a ‘statistically non-significant’ result, 26 (63%) falsely claimed the absence of an effect. We conclude that our studies in ecology and evolutionary biology are mostly exploratory and descriptive. We should thus shift from claiming to “test” specific hypotheses statistically to describing and discussing the many hypotheses (effect sizes) that are most compatible with our data, given our statistical model. We already have the means for doing so, because we routinely present compatibility (“confidence”) intervals covering these hypotheses.
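
The effect-size biases in the second and third points can be reproduced with a few lines of simulation. The sketch below is an illustration in the same spirit as the simulations described, not the authors' code; the true effect, standard deviation, and sample size are assumed values.

```python
# A small simulation (not the authors' code) showing that significant results
# overestimate and non-significant results underestimate a true effect.
# True effect, standard deviation, and sample size are assumed values.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)
true_effect, n_per_study, n_studies = 0.3, 30, 10_000

estimates, significant = [], []
for _ in range(n_studies):
    sample = rng.normal(true_effect, 1.0, size=n_per_study)
    estimates.append(sample.mean())
    significant.append(ttest_1samp(sample, 0.0).pvalue < 0.05)

estimates, significant = np.array(estimates), np.array(significant)
print("true effect:                           ", true_effect)
print("mean estimate, significant studies:    ", estimates[significant].mean())
print("mean estimate, non-significant studies:", estimates[~significant].mean())
```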


2017 ◽  
Vol 6 (6) ◽  
pp. 158
Author(s):  
Louis Mutter ◽  
Steven B. Kim

There are numerous statistical hypothesis tests for categorical data, including Pearson's chi-square goodness-of-fit test and other discrete versions of goodness-of-fit tests. For these hypothesis tests, the null hypothesis is simple, and the alternative hypothesis is the composite hypothesis that negates the simple null. For a power calculation, a researcher specifies a significance level, a sample size, a simple null hypothesis, and a simple alternative hypothesis. In practice, an experienced researcher may have deep and broad scientific knowledge yet suffer from a lack of statistical power because only a small sample is available. In such cases, the test can be formulated against a simple alternative hypothesis instead of the composite alternative hypothesis. In this article, we investigate how much statistical power can be gained via a correctly specified simple alternative hypothesis and how much can be lost under a misspecified alternative hypothesis, particularly when the available sample size is small.
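
The comparison can be illustrated with a small simulation. The sketch below (with assumed category probabilities and sample size, not the authors' study design) estimates the power of Pearson's chi-square test against the composite alternative and of a Neyman-Pearson likelihood-ratio test built for one simple alternative, both when that simple alternative is correctly specified and when it is misspecified.

```python
# A minimal simulation sketch of the comparison described above: Pearson's
# chi-square goodness-of-fit test (composite alternative) versus a
# Neyman-Pearson likelihood-ratio test for one simple alternative, under a
# correctly specified and a misspecified alternative. The category
# probabilities and sample size are illustrative assumptions.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
n, alpha, n_sim = 30, 0.05, 10_000

p_null = np.array([1/3, 1/3, 1/3])
p_true = np.array([0.50, 0.30, 0.20])   # data-generating probabilities
p_wrong = np.array([0.20, 0.30, 0.50])  # misspecified simple alternative

def np_test_power(p_alt, p_data):
    """Power of the Neyman-Pearson test of the simple null vs the simple p_alt."""
    log_ratio = np.log(p_alt / p_null)
    # Calibrate the critical value of the log likelihood ratio under the null.
    null_stats = rng.multinomial(n, p_null, size=n_sim) @ log_ratio
    crit = np.quantile(null_stats, 1 - alpha)
    data_stats = rng.multinomial(n, p_data, size=n_sim) @ log_ratio
    return np.mean(data_stats > crit)

def pearson_power(p_data):
    """Power of Pearson's chi-square goodness-of-fit test of the simple null."""
    counts = rng.multinomial(n, p_data, size=n_sim)
    p_values = np.array([chisquare(c, f_exp=n * p_null).pvalue for c in counts])
    return np.mean(p_values < alpha)

print("Pearson chi-square:               ", pearson_power(p_true))
print("NP test, correct alternative:     ", np_test_power(p_true, p_true))
print("NP test, misspecified alternative:", np_test_power(p_wrong, p_true))
```

With a correctly specified simple alternative the likelihood-ratio test gains power over the chi-square test at this small sample size, while a badly misspecified alternative can drive power below the nominal alpha level.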

