Problems in using p-curve analysis and text-mining to detect rate of p-hacking

Author(s):  
Dorothy V Bishop
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. We argue that binomial tests on the p-curve are not robust enough to be used for this purpose. Methods: P-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results: We first show that a p-curve suggestive of p-hacking can be obtained if researchers misapply parametric tests to data that depart from normality, even when no p-hacking occurs. We go on to show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions: A significant bump in the p-curve just below .05 is not necessarily evidence of p-hacking, and lack of a bump is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis.
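The ghost-variable scenario described above is straightforward to reproduce. The following is a minimal sketch, not the authors' published code: it assumes five uncorrelated dependent variables measured on two groups under a true null, with only the smallest p-value "reported" when it falls below .05; the sample sizes and simulation counts are illustrative.

```r
# Minimal sketch (not the authors' published code): ghost-variable p-hacking
# with uncorrelated dependent variables. Each simulated experiment measures
# n_dv independent outcomes under a true null and "reports" only the smallest
# p-value when it falls below .05.
set.seed(1)
n_sims      <- 5000   # illustrative number of simulated experiments
n_per_group <- 20     # illustrative group size
n_dv        <- 5      # number of ghost dependent variables

reported_p <- replicate(n_sims, {
  p_all <- replicate(n_dv,
                     t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value)
  p_min <- min(p_all)
  if (p_min < .05) p_min else NA_real_  # non-significant experiments go unreported
})

p_curve <- reported_p[!is.na(reported_p)]
hist(p_curve, breaks = seq(0, .05, by = .005),
     main = "p-curve under ghost-variable p-hacking (uncorrelated DVs)",
     xlab = "reported p-value")
```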

2016
Author(s):
Dorothy V Bishop
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods: P-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results: We first show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions: The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
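For the inter-correlated case, a similar sketch can draw the ghost variables from a multivariate normal distribution. The correlation of 0.8, sample sizes, and simulation counts below are illustrative assumptions, not the authors' exact settings.

```r
# Sketch of the same ghost-variable scenario with inter-correlated dependent
# variables (rho, sample sizes, and simulation counts are illustrative).
library(MASS)  # for mvrnorm

set.seed(1)
n_sims      <- 5000
n_per_group <- 20
n_dv        <- 5
rho         <- 0.8
Sigma       <- matrix(rho, n_dv, n_dv)
diag(Sigma) <- 1

reported_p <- replicate(n_sims, {
  g1 <- mvrnorm(n_per_group, mu = rep(0, n_dv), Sigma = Sigma)
  g2 <- mvrnorm(n_per_group, mu = rep(0, n_dv), Sigma = Sigma)
  p_all <- vapply(seq_len(n_dv),
                  function(j) t.test(g1[, j], g2[, j])$p.value, numeric(1))
  p_min <- min(p_all)
  if (p_min < .05) p_min else NA_real_
})

hist(reported_p[!is.na(reported_p)], breaks = seq(0, .05, by = .005),
     main = "p-curve under ghost-variable p-hacking (correlated DVs)",
     xlab = "reported p-value")
```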


PeerJ
2016
Vol 4
pp. e1715
Author(s):
Dorothy V.M. Bishop
Paul A. Thompson

Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the “p-hacking bump” just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions. The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
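The "bump" just below .05 is typically assessed by comparing counts of significant p-values in adjacent bins near .05. A simplified binomial test of that general form might look as follows; the bin boundaries and function name are illustrative rather than taken from any particular published pipeline.

```r
# Simplified binomial "bump" test: compare counts of significant p-values in
# the bin just below .05 with the adjacent lower bin. Bin boundaries and the
# function name are illustrative, not taken from any specific pipeline.
bump_test <- function(p, lower_bin = c(0.040, 0.045), upper_bin = c(0.045, 0.050)) {
  n_low  <- sum(p >  lower_bin[1] & p <= lower_bin[2])
  n_high <- sum(p >  upper_bin[1] & p <  upper_bin[2])
  # Under a flat (uniform) p-curve a significant p-value is equally likely to
  # land in either bin; test for an excess in the upper (just-below-.05) bin.
  binom.test(n_high, n_low + n_high, p = 0.5, alternative = "greater")
}

set.seed(1)
bump_test(runif(1000, 0, 0.05))  # uniform null p-values: no bump expected
```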


2015
Author(s):
Dorothy V Bishop
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. It has been used to identify bias in the selection of variables and analyses for publication, p-hacking. A recent study by Head et al. (2015) combined this approach with automated text-mining of p-values from a large corpus of published papers and concluded that although there was evidence of p-hacking, its effect was weak in relation to real effect sizes, and not likely to cause serious distortions in the literature. We argue that the methods used by these authors do not support this inference. Methods: P-hacking can take various forms. For the current paper, we developed R code to simulate the use of ghost variables, where an experimenter gathers data on numerous variables but reports only those with statistically significant effects. We also examined the text-mined dataset used by Head et al. to assess its suitability for investigating p-hacking. Results: For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" that is regarded as evidence of p-hacking. The p-curve develops a positive slope when simulated variables are highly intercorrelated, but does not show the excess of p-values just below .05 that has been regarded as indicative of extreme p-hacking. A right-skewed p-curve is obtained, as expected, when there is a true difference between groups, but it was also obtained in p-hacked datasets containing a high proportion of cases with a true null effect. The results of Head et al. are further compromised because their automated text mining detected any p-value mentioned in the Results or Abstract of a paper, including those reported in the course of validation of materials or methods, or confirmation of well-established facts, as opposed to hypothesis-testing. There was no information on the statistical power of studies, nor on the statistical test conducted. Conclusions: We find two problems with the analysis by Head et al. First, though a significant bump in the p-curve just below .05 is good evidence of p-hacking, lack of a bump is not indicative of lack of p-hacking. Second, while studies with evidential value will generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value. This is particularly the case when there is no control over the type of p-values entered into the analysis. The analysis presented here suggests that the potential for systematic bias is substantial. We conclude that the study by Head et al. provides evidence of p-hacking in the scientific literature, but it cannot be used to estimate the extent and consequences of p-hacking. Analysis of meta-analysed datasets avoids some of these problems, but will still miss an important type of p-hacking.
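To see why a right-skewed p-curve can coexist with extensive p-hacking, one can mix p-hacked null studies with studies of a genuine effect and inspect the aggregate curve. The sketch below does this under illustrative settings; the effect size, sample sizes, and study counts are assumptions, not values from the paper.

```r
# Sketch: mix p-hacked null studies with studies of a genuine group difference
# and inspect the aggregate p-curve. Effect size, sample sizes, and study
# counts are illustrative assumptions, not values from the paper.
set.seed(1)
n_per_group <- 20

hacked_null_p <- replicate(3000, {
  min(replicate(5, t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value))
})

true_effect_p <- replicate(1000,
  t.test(rnorm(n_per_group), rnorm(n_per_group, mean = 0.8))$p.value)

p_mix <- c(hacked_null_p, true_effect_p)
p_mix <- p_mix[p_mix < .05]
hist(p_mix, breaks = seq(0, .05, by = .005),
     main = "Right-skewed p-curve despite many p-hacked null studies",
     xlab = "p-value")
```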


2015
Author(s):
Dorothy V Bishop
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. It has been used to estimate the frequency of bias in the selection of variables and analyses for publication, p-hacking. A recent study by Head et al. (2015) combined this approach with automated text-mining of p-values from over 100,000 published papers and concluded that although there was evidence of p-hacking, it was not common enough to cause serious distortions in the literature. Methods: P-hacking can take various forms. For the current paper, we developed R code to simulate the use of ghost variables, where an experimenter gathers data on numerous variables but reports only those with statistically significant effects. In addition, we examined the dataset used by Head et al. to assess its suitability for investigating p-hacking. This consisted of a set of open access papers that reported at least one p-value below .05; where more than one p-value was less than .05, one was randomly sampled per paper. Results: For uncorrelated variables, simulated p-hacked data do not give the signature left-skewed p-curve that Head et al. took as evidence of p-hacking. A right-skewed p-curve is obtained, as expected, when there is a true difference between groups, but it was also obtained in p-hacked datasets containing a high proportion of cases with a true null effect. The automated text mining used by Head et al. detected any p-value mentioned in the Results or Abstract of a paper, including those reported in the course of validation of materials or methods, or confirmation of well-established facts, as opposed to hypothesis-testing. There was no information on the statistical power of studies, nor on the statistical test conducted. In addition, Head et al. excluded p-values in tables, p-values reported as 'less than' rather than 'equal to' a given value, and those reported using scientific notation or in ranges. Conclusions: Use of ghost variables, a form of p-hacking where the experimenter tests many variables and reports only those with the largest effect sizes, does not give the kind of p-curve with left-skewing around .05 that Head et al. focused on. Furthermore, to interpret a p-curve we need to know whether the p-values were testing a specific hypothesis, and to be confident that if any p-values are excluded, the effect on the p-curve is random rather than systematic. It is inevitable that with automated text-mining there will be some inaccuracies in the data: the key question is whether the advantages of having very large amounts of extracted data compensate for these inaccuracies. The analysis presented here suggests that the potential for systematic bias in mined data is substantial and invalidates conclusions about p-hacking based on p-values obtained by text-mining.
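The context problem with text-mined p-values can be illustrated with a toy regex extractor. This is not Head et al.'s actual pipeline, and the pattern and example text are invented for illustration: the extractor picks up every "p = ..." it finds, whether or not the test bears on the study hypothesis, and misses values reported as "less than" a threshold.

```r
# Toy regex extractor for p-values (not Head et al.'s actual pipeline; the
# pattern and example text are invented). It captures every "p = ..." it
# finds, with no record of whether the test addresses the study hypothesis,
# and it misses values reported as "p < .05", in tables, or in ranges.
extract_p <- function(text) {
  m <- gregexpr("[Pp]\\s*=\\s*0?\\.\\d+", text, perl = TRUE)
  matches <- regmatches(text, m)[[1]]
  as.numeric(sub("^[Pp]\\s*=\\s*", "", matches))
}

snippet <- paste(
  "The manipulation check confirmed the groups differed in age (p = .001).",
  "The predicted interaction was significant (p = .048).",
  "A floor effect was ruled out (p = .003); a further contrast gave p < .05."
)
extract_p(snippet)  # 0.001 0.048 0.003 -- context and the "p < .05" result are lost
```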


2017
Author(s):
Rachael Smyth
Daniel Ansari

Research demonstrating that infants discriminate between small and large numerosities is central to theories concerning the origins of human numerical abilities. To date, there has been no quantitative meta-analysis of the infant numerical competency data. Here, we quantitatively synthesize the evidential value of the available literature on infant numerosity discrimination using a meta-analytic tool called p-curve, in which the distribution of available p-values is analyzed to determine whether the published literature examining particular hypotheses contains evidential value. P-curves demonstrated evidential value for the hypotheses that infants can discriminate between both small and large numerosities. However, the analyses also revealed that the data on infants’ ability to discriminate between large numerosities are less robust and less statistically powered than the data on their ability to discriminate between small numerosities. We argue there is a need for adequately powered replication studies to enable stronger inferences when grounding theories concerning the ontogenesis of numerical cognition.
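A simplified version of the right-skew ("evidential value") test applied to such p-curves converts each significant p-value to pp = p/.05, which is uniform under a true null, and combines the pp-values with Stouffer's method. The sketch below is a didactic simplification, not the full p-curve procedure, and the example p-values are invented.

```r
# Simplified right-skew ("evidential value") test for a set of significant
# p-values: under a true null, p-values below .05 are uniform on (0, .05),
# so pp = p / .05 is uniform on (0, 1); Stouffer's method combines them.
# This is a didactic simplification of the p-curve approach, and the example
# p-values below are invented.
right_skew_test <- function(p_sig) {
  stopifnot(all(p_sig > 0 & p_sig < .05))
  pp <- p_sig / .05
  z  <- sum(qnorm(pp)) / sqrt(length(pp))  # Stouffer's Z
  c(z = z, p = pnorm(z))                   # strongly negative z => right skew
}

right_skew_test(c(.001, .004, .012, .020, .031))
```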

