Problems in using text-mining and p-curve analysis to detect rate of p-hacking

Author(s):  
Dorothy V Bishop ◽  
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. It has been used to identify bias in the selection of variables and analyses for publication, p-hacking. A recent study by Head et al. (2015) combined this approach with automated text-mining of p-values from a large corpus of published papers and concluded that although there was evidence of p-hacking, its effect was weak in relation to real effect sizes, and not likely to cause serious distortions in the literature. We argue that the methods used by these authors do not support this inference. Methods: P-hacking can take various forms. For the current paper, we developed R code to simulate the use of ghost variables, where an experimenter gathers data on numerous variables but reports only those with statistically significant effects. We also examined the text-mined dataset used by Head et al. to assess its suitability for investigating p-hacking. Results: For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" that is regarded as evidence of p-hacking. The p-curve develops a positive slope when simulated variables are highly intercorrelated, but does not show the excess of p-values just below .05 that has been regarded as indicative of extreme p-hacking. A right-skewed p-curve is obtained, as expected, when there is a true difference between groups, but it was also obtained in p-hacked datasets containing a high proportion of cases with a true null effect. The results of Head et al. are further compromised because their automated text mining detected any p-value mentioned in the Results or Abstract of a paper, including those reported in the course of validation of materials or methods, or confirmation of well-established facts, as opposed to hypothesis-testing. There was no information on the statistical power of studies, nor on the statistical test conducted. Conclusions: We find two problems with the analysis by Head et al. First, though a significant bump in the p-curve just below .05 is good evidence of p-hacking, lack of a bump is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value. This is particularly the case when there is no control over the type of p-values entered into the analysis. The analysis presented here suggests that the potential for systematic bias is substantial. We conclude that the study by Head et al. provides evidence of p-hacking in the scientific literature, but it cannot be used to estimate the extent and consequences of p-hacking. Analysis of meta-analysed datasets avoids some of these problems, but will still miss an important type of p-hacking.
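
The ghost-variable simulation described in this abstract can be sketched in a few lines of R. The sketch below is illustrative rather than the authors' actual code: the sample size, number of dependent variables, and number of simulated experiments are assumed values. It simulates a true null effect, reports only the smallest of several p-values per experiment, and plots the resulting p-curve.

```r
# Illustrative sketch of "ghost variable" p-hacking with uncorrelated outcomes,
# under a true null effect (all parameter values are assumptions).
set.seed(1)

n_sims  <- 5000   # simulated experiments
n_per_g <- 20     # participants per group
n_dvs   <- 5      # dependent variables measured; only the "best" is reported

reported_p <- replicate(n_sims, {
  # p-values for n_dvs independent outcomes, two-group t-tests, null true
  ps <- replicate(n_dvs, t.test(rnorm(n_per_g), rnorm(n_per_g))$p.value)
  min(ps)         # the experimenter reports only the smallest p-value
})

# p-curve: distribution of reported p-values below .05
sig_p <- reported_p[reported_p < .05]
hist(sig_p, breaks = seq(0, .05, by = .005),
     main = "p-curve: ghost-variable hacking, uncorrelated DVs, null effect",
     xlab = "reported p-value")
```

Because the smallest of several independent p-values is itself slightly right-skewed under the null, the sub-.05 distribution shows no excess just below .05, consistent with the result described above.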

2015 ◽  
Author(s):  
Dorothy V Bishop ◽  
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. It has been used to estimate the frequency of bias in the selection of variables and analyses for publication, p-hacking. A recent study by Head et al. (2015) combined this approach with automated text-mining of p-values from over 100,000 published papers and concluded that although there was evidence of p-hacking, it was not common enough to cause serious distortions in the literature. Methods: P-hacking can take various forms. For the current paper, we developed R code to simulate the use of ghost variables, where an experimenter gathers data on numerous variables but reports only those with statistically significant effects. In addition, we examined the dataset used by Head et al. to assess its suitability for investigating p-hacking. This consisted of a set of open access papers that reported at least one p-value below .05; where more than one p-value was less than .05, one was randomly sampled per paper. Results: For uncorrelated variables, simulated p-hacked data do not give the signature left-skewed p-curve that Head et al. took as evidence of p-hacking. A right-skewed p-curve is obtained, as expected, when there is a true difference between groups, but it was also obtained in p-hacked datasets containing a high proportion of cases with a true null effect. The automated text mining used by Head et al. detected any p-value mentioned in the Results or Abstract of a paper, including those reported in the course of validation of materials or methods, or confirmation of well-established facts, as opposed to hypothesis-testing. There was no information on the statistical power of studies, nor on the statistical test conducted. In addition, Head et al. excluded p-values in tables, p-values reported as 'less than' rather than 'equal to' a given value, and those reported using scientific notation or in ranges. Conclusions: Use of ghost variables, a form of p-hacking where the experimenter tests many variables and reports only those with the largest effect sizes, does not give the kind of p-curve with left-skewing around .05 that Head et al. focused on. Furthermore, to interpret a p-curve we need to know whether the p-values were testing a specific hypothesis, and to be confident that if any p-values are excluded, the effect on the p-curve is random rather than systematic. It is inevitable that with automated text-mining there will be some inaccuracies in the data: the key question is whether the advantages of having very large amounts of extracted data compensate for these inaccuracies. The analysis presented here suggests that the potential for systematic bias in mined data is substantial and invalidates conclusions about p-hacking based on p-values obtained by text-mining.


2015 ◽  
Author(s):  
Dorothy V Bishop ◽  
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. We argue that binomial tests on the p-curve are not robust enough to be used for this purpose. Methods: P-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results: We first show that a p-curve suggestive of p-hacking can be obtained if researchers misapply parametric tests to data that depart from normality, even when no p-hacking occurs. We go on to show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions: A significant bump in the p-curve just below .05 is not necessarily evidence of p-hacking, and lack of a bump is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis.
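
The point about misapplied parametric tests can be illustrated with a short R sketch. The choice of a log-normal distribution and the sample size below are assumptions made for illustration, not necessarily the settings simulated by the authors; both groups are drawn from the same distribution, so there is no true effect and no p-hacking.

```r
# Illustrative sketch: a t-test misapplied to strongly skewed data with no
# group difference and no p-hacking (distribution and n are assumptions).
set.seed(2)

n_sims  <- 5000
n_per_g <- 10   # small samples accentuate the effect of non-normality

p_vals <- replicate(n_sims,
                    t.test(rlnorm(n_per_g), rlnorm(n_per_g))$p.value)

# With a misapplied test the sub-.05 distribution may depart from the flat
# curve expected for a well-behaved test under the null.
hist(p_vals[p_vals < .05], breaks = seq(0, .05, by = .005),
     main = "p-curve: t-test on skewed data, no effect, no hacking",
     xlab = "p-value")
```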


2016 ◽  
Author(s):  
Dorothy V Bishop ◽  
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods: P-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results: We first show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions: The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
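
To show how intercorrelation changes the picture, the ghost-variable simulation can be extended with multivariate normal data. The correlation, sample size, and number of variables below are illustrative assumptions rather than the parameters used in the paper.

```r
# Illustrative sketch: ghost-variable hacking where the dependent variables
# are highly intercorrelated, under a true null effect (assumed parameters).
library(MASS)   # for mvrnorm()
set.seed(3)

n_sims  <- 2000
n_per_g <- 20
n_dvs   <- 5
rho     <- 0.8                        # assumed correlation between DVs
Sigma   <- matrix(rho, n_dvs, n_dvs)
diag(Sigma) <- 1

reported_p <- replicate(n_sims, {
  g1 <- mvrnorm(n_per_g, mu = rep(0, n_dvs), Sigma = Sigma)
  g2 <- mvrnorm(n_per_g, mu = rep(0, n_dvs), Sigma = Sigma)
  # only the smallest of the n_dvs p-values is reported
  min(sapply(seq_len(n_dvs), function(i) t.test(g1[, i], g2[, i])$p.value))
})

hist(reported_p[reported_p < .05], breaks = seq(0, .05, by = .005),
     main = "p-curve: ghost-variable hacking, correlated DVs, null effect",
     xlab = "reported p-value")
```

Comparing this histogram with the uncorrelated case illustrates how the degree of intercorrelation changes the shape of the sub-.05 p-curve.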


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1715 ◽  
Author(s):  
Dorothy V.M. Bishop ◽  
Paul A. Thompson

Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the “p-hacking bump” just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions. The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
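
The "bump" just below .05 is typically assessed with a binomial test on the two highest bins of the p-curve. The sketch below is a generic version of such a test; the bin boundaries are assumptions and may differ from those used by Head et al. or in the paper above.

```r
# Illustrative binomial "bump" test comparing the two bins immediately
# below .05 (bin boundaries are an assumption, not the published procedure).
bump_test <- function(p) {
  lower <- sum(p >= 0.040 & p < 0.045)   # bin further from .05
  upper <- sum(p >= 0.045 & p < 0.050)   # bin just below .05
  # A "bump" would show up as more p-values in the upper bin than expected
  # by chance; the test is undefined if both bins are empty.
  binom.test(upper, upper + lower, p = 0.5, alternative = "greater")
}

# Example usage with simulated reported p-values from the sketches above:
# bump_test(reported_p)
```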


2015 ◽  
Author(s):  
Inti Inal Pedroso ◽  
Michael R Barnes ◽  
Anbarasu Lourdusamy ◽  
Ammar Al-Chalabi ◽  
Gerome Breen

Genome-wide association studies (GWAS) have proven to be a valuable tool to explore the genetic basis of many traits. However, many GWAS lack statistical power, and the commonly used single-point analysis method needs to be complemented to enhance power and interpretation. Multivariate region-wide or gene-wide association analyses are an alternative, allowing identification of disease genes in a manner more robust to allelic heterogeneity. Gene-based association also facilitates systems biology analyses by generating a single p-value per gene. We have designed and implemented FORGE, a software suite which implements a range of methods for the combination of p-values for the individual genetic variants within a gene or genomic region. The software can be used with summary statistics (marker IDs and p-values) and accepts as input the result file formats of commonly used genetic association software. When applied to a study of Crohn's disease (CD) susceptibility, it identified all genes found by single-SNP analysis, as well as additional genes identified by a large independent meta-analysis. FORGE p-values used in gene-set analyses highlighted association with the Jak-STAT and cytokine signalling pathways, both previously associated with CD. We highlight the software's main features and future development directions, and provide a comparison with alternative available software tools. FORGE can be freely accessed at https://github.com/inti/FORGE.
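
As a simple illustration of the kind of p-value combination FORGE performs, the sketch below applies Fisher's method to hypothetical per-SNP p-values for one gene. This is not FORGE's own code: FORGE implements a range of combination methods, some of which account for correlation (linkage disequilibrium) between markers, which plain Fisher's method ignores.

```r
# Generic Fisher's-method sketch for turning per-variant p-values into a
# single gene-wide p-value (assumes independent markers; illustrative only).
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))                         # Fisher's combined statistic
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Hypothetical per-SNP p-values for one gene:
snp_p <- c(0.03, 0.20, 0.004, 0.55)
fisher_combine(snp_p)                              # single gene-wide p-value
```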


2021 ◽  
pp. 39-55
Author(s):  
R. Barker Bausell

This chapter explores three empirical concepts (the p-value, the effect size, and statistical power) integral to the avoidance of false positive scientific findings. Their relationship to reproducibility is explained in a nontechnical manner without formulas or statistical jargon, with p-values and statistical power presented in terms of probabilities from zero to 1.0, the values of most interest to scientists being 0.05 (synonymous with a positive, hence publishable, result) and 0.80 (the most commonly recommended probability that a positive result will be obtained if the hypothesis that generated it is correct and the study is properly designed and conducted). Unfortunately, many scientists circumvent both by artifactually inflating the 0.05 criterion, overstating the available statistical power, and engaging in a number of other questionable research practices. These issues are discussed via statistical models from the genetic and psychological fields and then extended to a number of different p-values, statistical power levels, effect sizes, and prevalences of "true" effects expected to exist in the research literature. Among the basic conclusions of these modeling efforts are that employing more stringent p-values and larger sample sizes constitute the most effective statistical approaches for increasing the reproducibility of published results in all empirically based scientific literatures. This chapter thus lays the necessary foundation for understanding and appreciating the effects of appropriate p-values, sufficient statistical power, realistic effect sizes, and the avoidance of questionable research practices upon the production of reproducible results.
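
A minimal R sketch of the kind of modeling the chapter describes: the probability that a statistically significant result reflects a true effect, as a function of the prior prevalence of true effects, statistical power, and the p-value criterion. The specific numbers are illustrative assumptions, not values taken from the chapter.

```r
# Probability that a "positive" (p < alpha) result reflects a true effect,
# given the prevalence of true effects, power, and alpha (illustrative values).
ppv <- function(prevalence, power, alpha) {
  true_pos  <- prevalence * power          # true effects correctly detected
  false_pos <- (1 - prevalence) * alpha    # null effects declared significant
  true_pos / (true_pos + false_pos)
}

ppv(prevalence = 0.10, power = 0.80, alpha = 0.05)   # ~0.64
ppv(prevalence = 0.10, power = 0.80, alpha = 0.005)  # ~0.95: stricter alpha helps
ppv(prevalence = 0.10, power = 0.40, alpha = 0.05)   # ~0.47: low power hurts
```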


Author(s):  
Zafar Iqbal ◽  
Lubna Waheed ◽  
Waheed Muhammad ◽  
Rajab Muhammad

Purpose: Quality Function Deployment (QFD) is a methodology that helps to satisfy customer requirements through the selection of appropriate Technical Attributes (TAs). The rationale of this article is to provide a method lending statistical support to the selection of TAs. The purpose is to determine the statistical significance of TAs by deriving associated significance (P) values. Design/Methodology/Approach: We demonstrate our methodology with reference to an original QFD case study aimed at improving the educational system in high schools in Pakistan, and then with five further published case studies obtained from the literature. Mean weights of TAs are determined. Treating each TA's mean weight as a test statistic, a weighted matrix is generated from the VOCs' (voices of the customer) importance ratings and the ratings in the relationship matrix. Finally, using R, P-values for the original TA means are determined with reference to a hypothetical population of TA means. Findings: Each TA's P-value evaluates its significance or insignificance in terms of its distance from the grand mean. These P-values indirectly set the prioritization of TAs. Implications/Originality/Value: The novel aspect of this study is the extension of TA mean weights to also provide P-values for TAs. TAs of significant importance can be addressed on a priority basis, while the others can be dealt with as appropriate.
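
A hypothetical R sketch of the weighting-and-testing idea described above. The ratings, rating scales, and bootstrap resampling scheme are illustrative assumptions, not the authors' data or their exact procedure for constructing the hypothetical population of TA means.

```r
# Hypothetical QFD weighting sketch: VOC importance ratings and a VOC x TA
# relationship matrix are combined, TA mean weights serve as test statistics,
# and bootstrap resampling supplies a reference distribution (all assumptions).
set.seed(7)

importance   <- c(5, 3, 4)                 # VOC importance ratings (assumed 1-5 scale)
relationship <- matrix(c(9, 3, 1, 0,       # VOC x TA relationship ratings
                         1, 9, 3, 3,       # (assumed 0/1/3/9 QFD scale)
                         3, 1, 9, 1),
                       nrow = 3, byrow = TRUE)

weighted   <- importance * relationship    # each row scaled by its VOC importance
ta_means   <- colMeans(weighted)           # mean weight per TA: the test statistics
grand_mean <- mean(weighted)

# Bootstrap a "population" of TA means by resampling weighted cells, then ask
# how far each observed TA mean sits from the grand mean (two-sided).
boot_means <- replicate(10000,
                        mean(sample(weighted, nrow(weighted), replace = TRUE)))
p_values <- sapply(ta_means, function(m)
  mean(abs(boot_means - grand_mean) >= abs(m - grand_mean)))
round(p_values, 3)
```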


2020 ◽  
Vol 228 (1) ◽  
pp. 43-49 ◽  
Author(s):  
Michael Kossmeier ◽  
Ulrich S. Tran ◽  
Martin Voracek

Abstract. Currently, dedicated graphical displays to depict study-level statistical power in the context of meta-analysis are unavailable. Here, we introduce the sunset (power-enhanced) funnel plot to visualize this relevant information for assessing the credibility, or evidential value, of a set of studies. The sunset funnel plot highlights the statistical power of primary studies to detect an underlying true effect of interest in the well-known funnel display, with color-coded power regions and a second power axis. This graphical display allows meta-analysts to incorporate power considerations into classic funnel plot assessments of small-study effects. Nominally significant, but low-powered, studies might be seen as less credible and as more likely to be affected by selective reporting. We exemplify the application of the sunset funnel plot with two published meta-analyses from medicine and psychology. Software to create this variation of the funnel plot is provided via a tailored R function. In conclusion, the sunset (power-enhanced) funnel plot is a novel and useful graphical display to critically examine and to present study-level power in the context of meta-analysis.
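
The quantity colour-coded in a sunset funnel plot, each study's power to detect an assumed underlying true effect, can be approximated with a short R sketch. The standard errors and the assumed effect below are hypothetical, and this is not the authors' tailored R function.

```r
# Illustrative study-level power calculation (normal approximation) of the
# kind underlying a power-enhanced funnel plot; values are hypothetical.
study_power <- function(se, true_effect, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  # two-sided power of a Wald-type test given the study's standard error
  pnorm(true_effect / se - z_crit) + pnorm(-true_effect / se - z_crit)
}

# Hypothetical standard errors for a set of primary studies and an assumed
# true effect of 0.30 on the same scale:
se <- c(0.40, 0.25, 0.15, 0.10)
round(study_power(se, true_effect = 0.30), 2)
```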


2020 ◽  
Vol 132 (2) ◽  
pp. 662-670
Author(s):  
Minh-Son To ◽  
Alistair Jukes

OBJECTIVE: The objective of this study was to evaluate the trends in reporting of p values in the neurosurgical literature from 1990 through 2017. METHODS: All abstracts from the Journal of Neurology, Neurosurgery, and Psychiatry (JNNP), Journal of Neurosurgery (JNS) collection (including Journal of Neurosurgery: Spine and Journal of Neurosurgery: Pediatrics), Neurosurgery (NS), and Journal of Neurotrauma (JNT) available on PubMed from 1990 through 2017 were retrieved. Automated text mining was performed to extract p values from relevant abstracts. Extracted p values were analyzed for temporal trends and characteristics. RESULTS: The search yielded 47,889 relevant abstracts. A total of 34,324 p values were detected in 11,171 abstracts. Since 1990 there has been a steady, proportionate increase in the number of abstracts containing p values. There were average absolute year-on-year increases of 1.2% (95% CI 1.1%–1.3%; p < 0.001), 0.93% (95% CI 0.75%–1.1%; p < 0.001), 0.70% (95% CI 0.57%–0.83%; p < 0.001), and 0.35% (95% CI 0.095%–0.60%; p = 0.0091) of abstracts reporting p values in JNNP, JNS, NS, and JNT, respectively. There have also been average year-on-year increases of 0.045 (95% CI 0.031–0.059; p < 0.001), 0.052 (95% CI 0.037–0.066; p < 0.001), 0.042 (95% CI 0.030–0.054; p < 0.001), and 0.041 (95% CI 0.026–0.056; p < 0.001) p values reported per abstract for these respective journals. The distribution of p values showed a positive skew and strong clustering of values at rounded decimals (i.e., 0.01, 0.02, etc.). Between 83.2% and 89.8% of all reported p values were at or below the “significance” threshold of 0.05 (i.e., p ≤ 0.05). CONCLUSIONS: Trends in reporting of p values and the distribution of p values suggest publication bias remains in the neurosurgical literature.
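
A simplified R sketch of the kind of automated p-value extraction described above. The regular expression is an illustrative assumption, not the pattern used in the study, and it will not capture every reporting format (e.g., scientific notation or ranges).

```r
# Illustrative p-value extraction from abstract text using a simple regex
# (pattern is an assumption; real pipelines handle many more formats).
extract_p <- function(text) {
  pattern <- "[Pp]\\s*[<=>]\\s*(0?\\.\\d+)"
  m <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  data.frame(reported = m,
             value    = as.numeric(sub(pattern, "\\1", m, perl = TRUE)),
             stringsAsFactors = FALSE)
}

# Hypothetical abstract sentence:
abstract <- "Mortality was lower in group A (p = 0.03) but not group B (p > 0.05)."
extract_p(abstract)
```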

