Problems in using text-mining and p-curve analysis to detect rate of p-hacking
Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. It has been used to estimate the frequency of p-hacking, the biased selection of variables and analyses for publication. A recent study by Head et al. (2015) combined this approach with automated text-mining of p-values from over 100 000 published papers and concluded that although there was evidence of p-hacking, it was not common enough to cause serious distortion of the literature.

Methods: P-hacking can take various forms. For the current paper, we developed R code to simulate the use of ghost variables, where an experimenter gathers data on numerous variables but reports only those with statistically significant effects. In addition, we examined the dataset used by Head et al. to assess its suitability for investigating p-hacking. It consisted of a set of open-access papers that reported at least one p-value below .05; where a paper contained more than one such p-value, one was randomly sampled.

Results: For uncorrelated variables, simulated p-hacked data do not give the signature left-skewed p-curve that Head et al. took as evidence of p-hacking. A right-skewed p-curve is obtained, as expected, when there is a true difference between groups, but it was also obtained in p-hacked datasets containing a high proportion of cases with a true null effect. The automated text-mining used by Head et al. detected any p-value mentioned in the Results or Abstract of a paper, including those reported in the course of validating materials or methods, or confirming well-established facts, rather than testing hypotheses. There was no information on the statistical power of studies, nor on the statistical tests conducted. In addition, Head et al. excluded p-values in tables, p-values reported as 'less than' rather than 'equal to' a given value, and those reported in scientific notation or as ranges.
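The ghost-variable mechanism described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code, and the number of variables per study and the bin width are illustrative assumptions: each simulated study measures several independent variables with a true null effect (so each p-value is uniform on (0,1)), and only the smallest p-value is reported, provided it falls below .05.

```python
import random

def ghost_variable_pcurve(n_studies=50_000, n_vars=10, seed=1):
    """Simulate 'ghost variable' p-hacking under a true null effect.

    Each study measures n_vars independent null variables, so each
    p-value is Uniform(0, 1). Only the smallest p-value per study is
    reported, and only if it is below .05 (selective reporting).
    Returns counts of reported p-values in five bins of width .01,
    covering (0, .01), (.01, .02), ..., (.04, .05).
    """
    rng = random.Random(seed)
    bins = [0] * 5
    for _ in range(n_studies):
        p = min(rng.random() for _ in range(n_vars))  # smallest of many nulls
        if p < 0.05:
            bins[int(p / 0.01)] += 1  # tally into .01-wide bins
    return bins
```

Because the minimum of several uniform p-values has a density that decreases with p, the counts do not rise toward .05: the curve is flat or mildly right-skewed, not the left-skewed shape (a bump just below .05) that Head et al. treated as the signature of p-hacking.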
Conclusions: Use of ghost variables, a form of p-hacking in which the experimenter tests many variables and reports only those with the largest effect sizes, does not give the kind of p-curve with left-skewing around .05 that Head et al. focused on. Furthermore, to interpret a p-curve we need to know whether the p-values were testing a specific hypothesis, and to be confident that any exclusion of p-values affects the p-curve randomly rather than systematically. It is inevitable that automated text-mining will introduce some inaccuracies into the data: the key question is whether the advantages of extracting very large amounts of data compensate for these inaccuracies. The analysis presented here suggests that the potential for systematic bias in mined data is substantial and invalidates conclusions about p-hacking based on p-values obtained by text-mining.