Problems in using text-mining and p-curve analysis to detect rate of p-hacking

Author(s):  
Dorothy V Bishop ◽  
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. It has been used to estimate the frequency of bias in the selection of variables and analyses for publication, p-hacking. A recent study by Head et al. (2015) combined this approach with automated text-mining of p-values from over 100,000 published papers and concluded that although there was evidence of p-hacking, it was not common enough to cause serious distortions in the literature. Methods: P-hacking can take various forms. For the current paper, we developed R code to simulate the use of ghost variables, where an experimenter gathers data on numerous variables but reports only those with statistically significant effects. In addition, we examined the dataset used by Head et al. to assess its suitability for investigating p-hacking. This consisted of a set of open access papers that reported at least one p-value below .05; where more than one p-value was less than .05, one was randomly sampled per paper. Results: For uncorrelated variables, simulated p-hacked data do not give the signature left-skewed p-curve that Head et al. took as evidence of p-hacking. A right-skewed p-curve is obtained, as expected, when there is a true difference between groups, but it was also obtained in p-hacked datasets containing a high proportion of cases with a true null effect. The automated text mining used by Head et al. detected any p-value mentioned in the Results or Abstract of a paper, including those reported in the course of validation of materials or methods, or confirmation of well-established facts, as opposed to hypothesis-testing. There was no information on the statistical power of studies, nor on the statistical test conducted. In addition, Head et al. excluded p-values in tables, p-values reported as 'less than' rather than 'equal to' a given value, and those reported using scientific notation or in ranges. Conclusions: Use of ghost variables, a form of p-hacking where the experimenter tests many variables and reports only those with the largest effect sizes, does not give the kind of p-curve with left-skewing around .05 that Head et al. focused on. Furthermore, to interpret a p-curve we need to know whether the p-values were testing a specific hypothesis, and to be confident that if any p-values are excluded, the effect on the p-curve is random rather than systematic. It is inevitable that with automated text-mining there will be some inaccuracies in the data: the key question is whether the advantages of having very large amounts of extracted data compensate for these inaccuracies. The analysis presented here suggests that the potential for systematic bias in mined data is substantial and invalidates conclusions about p-hacking based on p-values obtained by text-mining.
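A minimal R sketch of the ghost-variable scenario described above (not the authors' code; the sample size, number of ghost variables, and number of simulated experiments are illustrative):

```r
# Ghost-variable p-hacking with uncorrelated variables and a true null effect:
# for each simulated experiment, test k independent dependent variables and
# "report" only the smallest p-value.
set.seed(1)
n_experiments <- 5000   # simulated studies
n_per_group   <- 20     # cases per group
k_variables   <- 5      # dependent variables measured ("ghost" variables)

reported_p <- replicate(n_experiments, {
  p_vals <- replicate(k_variables,
    t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value)
  min(p_vals)            # experimenter reports the best result
})

significant_p <- reported_p[reported_p < .05]

# p-curve: distribution of reported p-values below .05
hist(significant_p, breaks = seq(0, .05, by = .005),
     main = "p-curve for simulated ghost-variable p-hacking",
     xlab = "reported p-value")
```

With independent variables and a true null, the reported p-values stay roughly flat below .05, consistent with the absence of the left-skew signature described above.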

2015 ◽  
Author(s):  
Dorothy V Bishop ◽  
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. It has been used to identify bias in the selection of variables and analyses for publication, p-hacking. A recent study by Head et al. (2015) combined this approach with automated text-mining of p-values from a large corpus of published papers and concluded that although there was evidence of p-hacking, its effect was weak in relation to real effect sizes, and not likely to cause serious distortions in the literature. We argue that the methods used by these authors do not support this inference. Methods: P-hacking can take various forms. For the current paper, we developed R code to simulate the use of ghost variables, where an experimenter gathers data on numerous variables but reports only those with statistically significant effects. We also examined the text-mined dataset used by Head et al. to assess its suitability for investigating p-hacking. Results: For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" that is taken as evidence of p-hacking. The p-curve develops a positive slope when simulated variables are highly intercorrelated, but does not show the excess of p-values just below .05 that has been regarded as indicative of extreme p-hacking. A right-skewed p-curve is obtained, as expected, when there is a true difference between groups, but it was also obtained in p-hacked datasets containing a high proportion of cases with a true null effect. The results of Head et al. are further compromised because their automated text mining detected any p-value mentioned in the Results or Abstract of a paper, including those reported in the course of validation of materials or methods, or confirmation of well-established facts, as opposed to hypothesis-testing. There was no information on the statistical power of studies, nor on the statistical test conducted. Conclusions: We find two problems with the analysis by Head et al. First, though a significant bump in the p-curve just below .05 is good evidence of p-hacking, lack of a bump is not indicative of lack of p-hacking. Second, while studies with evidential value will generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value. This is particularly the case when there is no control over the type of p-values entered into the analysis. The analysis presented here suggests that the potential for systematic bias is substantial. We conclude that the study by Head et al. provides evidence of p-hacking in the scientific literature, but it cannot be used to estimate the extent and consequences of p-hacking. Analysis of meta-analysed datasets avoids some of these problems, but will still miss an important type of p-hacking.
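The mixture result described above can be sketched in the same style (again not the authors' code; the proportion of true effects, the effect size, and the sample sizes are assumptions made for illustration):

```r
# Mix of studies: some test a genuine group difference and report one p-value;
# the rest are true nulls p-hacked over k ghost variables. The pooled p-curve
# can still come out right-skewed despite the p-hacking.
set.seed(1)
n_studies   <- 5000
prop_true   <- 0.3    # assumed proportion of studies with a real effect
effect_size <- 0.5    # assumed standardised group difference
n_per_group <- 20
k_variables <- 5

one_study <- function(has_effect) {
  if (has_effect) {
    t.test(rnorm(n_per_group, mean = effect_size), rnorm(n_per_group))$p.value
  } else {
    # ghost-variable p-hacking on a true null: report the smallest of k p-values
    min(replicate(k_variables,
                  t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value))
  }
}

p_reported <- sapply(runif(n_studies) < prop_true, one_study)

hist(p_reported[p_reported < .05], breaks = seq(0, .05, by = .005),
     main = "p-curve: true effects mixed with p-hacked nulls",
     xlab = "reported p-value")
```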


2015 ◽  
Author(s):  
Dorothy V Bishop ◽  
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. We argue that binomial tests on the p-curve are not robust enough to be used for this purpose. Methods: P-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results: We first show that a p-curve suggestive of p-hacking can be obtained if researchers misapply parametric tests to data that depart from normality, even when no p-hacking occurs. We go on to show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions: A significant bump in the p-curve just below .05 is not necessarily evidence of p-hacking, and lack of a bump is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis.
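The binomial test in question compares counts of p-values in two narrow bins at the upper end of the p-curve. A rough sketch follows; the bin boundaries are illustrative rather than those of any particular published analysis:

```r
# Test for a "p-hacking bump": is the bin just below .05 over-represented
# relative to the adjacent bin? Under no p-hacking (and continuous p-values)
# the two bins should be roughly equally populated.
p_curve_bump_test <- function(p_values) {
  upper <- sum(p_values >= .045 & p_values < .050)  # just below .05
  lower <- sum(p_values >= .040 & p_values < .045)  # adjacent bin
  binom.test(upper, upper + lower, p = 0.5, alternative = "greater")
}

# Example with uniformly distributed p-values (true null, no p-hacking)
set.seed(2)
p_curve_bump_test(runif(10000, 0, .05))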


2021 ◽  
pp. 39-55
Author(s):  
R. Barker Bausell

This chapter explores three empirical concepts (the p-value, the effect size, and statistical power) integral to the avoidance of false positive scientific results. Their relationship to reproducibility is explained in a nontechnical manner without formulas or statistical jargon, with p-values and statistical power presented in terms of probabilities ranging from zero to 1.0, the values of most interest to scientists being 0.05 (synonymous with a positive, hence publishable, result) and 0.80 (the most commonly recommended probability that a positive result will be obtained if the hypothesis that generated it is correct and the study is properly designed and conducted). Unfortunately, many scientists circumvent both by artifactually inflating the 0.05 criterion, overstating the available statistical power, and engaging in a number of other questionable research practices. These issues are discussed via statistical models from the genetic and psychological fields and then extended to a range of p-values, statistical power levels, effect sizes, and prevalences of "true" effects expected to exist in the research literature. Among the basic conclusions of these modeling efforts is that more stringent p-values and larger sample sizes constitute the most effective statistical approaches for increasing the reproducibility of published results in all empirically based scientific literatures. This chapter thus lays the necessary foundation for understanding and appreciating the effects of appropriate p-values, sufficient statistical power, realistic effect sizes, and the avoidance of questionable research practices upon the production of reproducible results.
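A worked example of the kind of modeling the chapter describes, computing the probability that a positive (significant) result reflects a true effect under assumed values of alpha, power, and the prevalence of true effects; the specific numbers are illustrative only:

```r
# Probability that a statistically significant result is a true positive,
# given the alpha level, statistical power, and the prior prevalence of
# true effects among tested hypotheses.
positive_predictive_value <- function(alpha, power, prevalence) {
  true_positives  <- prevalence * power
  false_positives <- (1 - prevalence) * alpha
  true_positives / (true_positives + false_positives)
}

# Conventional thresholds, with 10% of tested hypotheses truly non-null
positive_predictive_value(alpha = .05,  power = .80, prevalence = .10)  # ~0.64

# A more stringent alpha markedly improves the reproducibility of positives
positive_predictive_value(alpha = .005, power = .80, prevalence = .10)  # ~0.95
```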


2016 ◽  
Author(s):  
Dorothy V Bishop ◽  
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods: P-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results: We first show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions: The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
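The intercorrelated case can be sketched by drawing the ghost variables from a multivariate normal distribution (a sketch only; the correlation of 0.8 and the other settings are assumptions, with MASS::mvrnorm used for the correlated draws):

```r
library(MASS)  # mvrnorm() for correlated multivariate normal draws

set.seed(3)
n_experiments <- 5000
n_per_group   <- 20
k_variables   <- 5
rho           <- 0.8   # assumed correlation among dependent variables
sigma <- matrix(rho, k_variables, k_variables)
diag(sigma) <- 1

reported_p <- replicate(n_experiments, {
  g1 <- mvrnorm(n_per_group, mu = rep(0, k_variables), Sigma = sigma)
  g2 <- mvrnorm(n_per_group, mu = rep(0, k_variables), Sigma = sigma)  # true null
  min(sapply(seq_len(k_variables),
             function(i) t.test(g1[, i], g2[, i])$p.value))
})

# Compare this histogram with the uncorrelated sketch given earlier
hist(reported_p[reported_p < .05], breaks = seq(0, .05, by = .005),
     main = "p-curve: ghost p-hacking with correlated variables",
     xlab = "reported p-value")
```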


Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell us little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Statistical significance (p ≤ 0.05) itself is also hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting it, i.e., as licensing the false conclusion that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease', or 'we need to get rid of p-values'.
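The two replication probabilities quoted above follow from elementary calculations, assuming two independent studies of equal power and a genuinely non-null effect:

```r
# Probability that two independent studies both reach significance,
# and probability that exactly one does (a "conflicting" pair),
# as a function of each study's statistical power.
p_both_significant <- function(power) power^2
p_conflicting      <- function(power) 2 * power * (1 - power)

p_both_significant(0.40)  # 0.16, roughly "one in six" at 40% power
p_conflicting(0.80)       # 0.32, roughly "one third" conflicting at 80% power
```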


2016 ◽  
Vol 156 (6) ◽  
pp. 978-980 ◽  
Author(s):  
Peter M. Vila ◽  
Melanie Elizabeth Townsend ◽  
Neel K. Bhatt ◽  
W. Katherine Kao ◽  
Parul Sinha ◽  
...  

Effect sizes and confidence intervals are underreported in the current biomedical literature. The objective of this article is to discuss the recent paradigm shift encouraging the reporting of effect sizes and confidence intervals. Whereas P values help to inform us about whether an observed effect is likely to be due to chance, effect sizes inform us about the magnitude of the effect (clinical significance), and confidence intervals inform us about the range of plausible estimates for the general population mean (precision). Reporting effect sizes and confidence intervals is a necessary addition to the biomedical literature, and these concepts are reviewed in this article.
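A small illustration of reporting all three quantities for a two-group comparison (simulated data; the group sizes, means, and standard deviation are invented):

```r
set.seed(4)
treatment <- rnorm(30, mean = 1, sd = 2)
control   <- rnorm(30, mean = 0, sd = 2)

tt <- t.test(treatment, control)
tt$p.value   # is the observed difference plausibly due to chance?
tt$conf.int  # precision: 95% CI for the mean difference

# Cohen's d: magnitude of the difference in pooled-SD units (clinical relevance)
pooled_sd <- sqrt(((length(treatment) - 1) * var(treatment) +
                   (length(control)   - 1) * var(control)) /
                  (length(treatment) + length(control) - 2))
cohens_d <- (mean(treatment) - mean(control)) / pooled_sd
cohens_d
```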


Author(s):  
Michael D. Jennions ◽  
Christopher J. Lortie ◽  
Julia Koricheva

This chapter begins with a brief review of why effect sizes and their variances are more informative than P-values. It then discusses how meta-analysis promotes “effective thinking” that can change approaches to several commonplace problems. Specifically, it addresses the issues of (1) exemplar studies versus average trends, (2) resolving “conflict” between specific studies, (3) presenting results, (4) deciding on the level at which to replicate studies, (5) understanding the constraints imposed by low statistical power, and (6) asking broad-scale questions that cannot be resolved in a single study. The chapter focuses on estimating effect sizes as a key outcome of meta-analysis, but acknowledges that other outcomes might be of more interest in other situations.
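A minimal fixed-effect sketch of the point that effect sizes and their variances are the raw material of meta-analysis; the effect sizes and variances below are invented for illustration:

```r
# Fixed-effect meta-analysis by inverse-variance weighting:
# studies with smaller sampling variance get more weight.
effect   <- c(0.42, 0.10, 0.35, -0.05, 0.28)  # hypothetical study effect sizes
variance <- c(0.04, 0.02, 0.09,  0.03, 0.05)  # hypothetical sampling variances

w         <- 1 / variance
pooled    <- sum(w * effect) / sum(w)          # weighted mean effect
pooled_se <- sqrt(1 / sum(w))
ci_95     <- pooled + c(-1.96, 1.96) * pooled_se

pooled
ci_95
```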


2017 ◽  
Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell us little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Statistical significance (p ≤ 0.05) itself is also hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication therefore cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting it, i.e., as licensing the false conclusion that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should instead be made more stringent, that sample sizes could decrease, or that p-values should be abandoned altogether. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
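The claim that p-values are hardly replicable even when the alternative hypothesis is true can be illustrated with a small simulation; the sample size is chosen to give roughly 80% power for a standardised difference of 0.5, and all settings are assumptions:

```r
set.seed(5)
n_per_group <- 64   # ~80% power for d = 0.5 at two-sided alpha = .05
d           <- 0.5

# p-values from 10,000 exact replications of the same true-effect study
p_reps <- replicate(10000,
  t.test(rnorm(n_per_group, mean = d), rnorm(n_per_group))$p.value)

quantile(p_reps, c(.1, .25, .5, .75, .9))  # p varies over orders of magnitude
mean(p_reps < .05)                         # empirical power, close to 0.80
```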


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1715 ◽  
Author(s):  
Dorothy V.M. Bishop ◽  
Paul A. Thompson

Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions. The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.

