Reporting trends of p values in the neurosurgical literature

2020 ◽  
Vol 132 (2) ◽  
pp. 662-670
Author(s):  
Minh-Son To ◽  
Alistair Jukes

OBJECTIVE: The objective of this study was to evaluate trends in the reporting of p values in the neurosurgical literature from 1990 through 2017.
METHODS: All abstracts from the Journal of Neurology, Neurosurgery, and Psychiatry (JNNP), the Journal of Neurosurgery (JNS) collection (including Journal of Neurosurgery: Spine and Journal of Neurosurgery: Pediatrics), Neurosurgery (NS), and the Journal of Neurotrauma (JNT) available on PubMed from 1990 through 2017 were retrieved. Automated text mining was performed to extract p values from relevant abstracts. Extracted p values were analyzed for temporal trends and characteristics.
RESULTS: The search yielded 47,889 relevant abstracts. A total of 34,324 p values were detected in 11,171 abstracts. Since 1990 there has been a steady, proportionate increase in the number of abstracts containing p values. There were average absolute year-on-year increases of 1.2% (95% CI 1.1%–1.3%; p < 0.001), 0.93% (95% CI 0.75%–1.1%; p < 0.001), 0.70% (95% CI 0.57%–0.83%; p < 0.001), and 0.35% (95% CI 0.095%–0.60%; p = 0.0091) in the proportion of abstracts reporting p values in JNNP, JNS, NS, and JNT, respectively. There were also average year-on-year increases of 0.045 (95% CI 0.031–0.059; p < 0.001), 0.052 (95% CI 0.037–0.066; p < 0.001), 0.042 (95% CI 0.030–0.054; p < 0.001), and 0.041 (95% CI 0.026–0.056; p < 0.001) in the number of p values reported per abstract for these respective journals. The distribution of p values showed a positive skew and strong clustering of values at rounded decimals (i.e., 0.01, 0.02, etc.). Between 83.2% and 89.8% of all reported p values were at or below the "significance" threshold of 0.05 (i.e., p ≤ 0.05).
CONCLUSIONS: Trends in the reporting of p values and the distribution of reported p values suggest that publication bias remains present in the neurosurgical literature.
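
The automated extraction step described in the Methods can be illustrated with a minimal regex-based sketch; the pattern, helper function, and sample sentence below are assumptions for illustration, not the pipeline used in the study.

```python
import re

# Minimal sketch of regex-based p-value extraction from abstract text.
# The pattern and the sample abstract sentence are illustrative assumptions.
P_VALUE = re.compile(
    r"\bp\s*(<=|>=|=|<|>|≤|≥)\s*(\d*\.\d+|\d+)",
    flags=re.IGNORECASE,
)

def extract_p_values(text: str) -> list[tuple[str, float]]:
    """Return (comparator, value) pairs for every p value found in text."""
    return [(m.group(1), float(m.group(2))) for m in P_VALUE.finditer(text)]

sample = ("Median survival improved in the treated group (p < 0.001), "
          "whereas complication rates did not differ (p = 0.21).")
print(extract_p_values(sample))  # [('<', 0.001), ('=', 0.21)]
```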

Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell us little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance itself (p ≤ 0.05) is also hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis; they cannot be interpreted as supporting it, which would amount to the false conclusion that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease', or 'we need to get rid of p-values'.
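
The replication figures quoted above follow from simple probability arithmetic; the sketch below assumes two independent studies of identical power and a true effect.

```python
# Probability arithmetic behind the replication claims above, assuming two
# independent studies of identical power and a true (non-null) effect.

def both_significant(power: float) -> float:
    """Probability that a pair of studies both reach p <= 0.05."""
    return power * power

def conflicting(power: float) -> float:
    """Probability that exactly one of the two studies is significant."""
    return 2 * power * (1 - power)

print(f"{both_significant(0.40):.2f}")  # 0.16 -> roughly one pair in six
print(f"{conflicting(0.80):.2f}")       # 0.32 -> conflicting about one third of the time
```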


2015 ◽  
Author(s):  
Dorothy V Bishop ◽  
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication (p-hacking). We argue that binomial tests on the p-curve are not robust enough to be used for this purpose.
Methods: P-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking.
Results: We first show that a p-curve suggestive of p-hacking can be obtained if researchers misapply parametric tests to data that depart from normality, even when no p-hacking occurs. We go on to show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are intercorrelated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers.
Conclusions: A significant bump in the p-curve just below .05 is not necessarily evidence of p-hacking, and lack of a bump is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis.
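
A minimal sketch of the ghost-variable scenario is shown below; the authors' simulations were written in R, and the sample size, number of variables, and absence of a true effect used here are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# "Ghost variable" p-hacking: several dependent variables are measured, but
# only the smallest p value is reported. Sample size, number of variables,
# and the absence of a true effect are illustrative assumptions; the
# original simulations were written in R.
rng = np.random.default_rng(1)

def ghost_variable_p(n: int = 20, k: int = 5) -> float:
    """Run k two-group t tests on null data and return the smallest p value."""
    pvals = []
    for _ in range(k):
        a = rng.normal(size=n)
        b = rng.normal(size=n)  # no true group difference
        pvals.append(stats.ttest_ind(a, b).pvalue)
    return min(pvals)

reported = [ghost_variable_p() for _ in range(5000)]
frac_sig = sum(p < 0.05 for p in reported) / len(reported)
print(f"{frac_sig:.2%} of 'reported' p values fall below .05 despite a true null")
```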


1996 ◽  
Vol 21 (4) ◽  
pp. 299-332 ◽  
Author(s):  
Larry V. Hedges ◽  
Jack L. Vevea

When there is publication bias, studies yielding large p values, and hence small effect estimates, are less likely to be published, which leads to biased estimates of effects in meta-analysis. We investigate a selection model based on one-tailed p values in the context of a random effects model. The procedure both models the selection process and corrects for the consequences of selection on estimates of the mean and variance of effect parameters. A test of the statistical significance of selection is also provided. The small sample properties of the method are evaluated by means of simulations, and the asymptotic theory is found to be reasonably accurate under correct model specification and plausible conditions. The method substantially reduces bias due to selection when model specification is correct, but the variance of estimates is increased; thus mean squared error is reduced only when selection produces substantial bias. The robustness of the method to violations of assumptions about the form of the distribution of the random effects is also investigated via simulation, and the model-corrected estimates of the mean effect are generally found to be much less biased than the uncorrected estimates. The significance test for selection bias, however, is found to be highly nonrobust, rejecting at up to 10 times the nominal rate when there is no selection but the distribution of the effects is incorrectly specified.
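
The upward bias that such selection induces in a naive meta-analytic mean can be illustrated with a short simulation; the sketch below is not the selection model proposed in the paper, and the true effect, heterogeneity, and study sizes are assumptions.

```python
import numpy as np
from scipy import stats

# Illustration of how selecting studies on one-tailed significance inflates
# the naive meta-analytic mean. This is NOT the selection model proposed in
# the paper; true effect, heterogeneity, and per-arm n are assumptions.
rng = np.random.default_rng(0)

true_mean, tau, n_studies, n_per_arm = 0.2, 0.1, 2000, 30

thetas = rng.normal(true_mean, tau, n_studies)   # study-specific true effects
se = np.sqrt(2 / n_per_arm)                      # approx. SE of a standardized mean difference
estimates = rng.normal(thetas, se)               # observed effect sizes
one_tailed_p = stats.norm.sf(estimates / se)     # H1: effect > 0

published = estimates[one_tailed_p <= 0.05]      # only 'significant' studies survive
print(f"mean of all simulated studies: {estimates.mean():.3f}")
print(f"mean of 'published' studies:   {published.mean():.3f}  (biased upward)")
```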


2012 ◽  
Vol 65 (11) ◽  
pp. 2271-2279 ◽  
Author(s):  
E.J. Masicampo ◽  
Daniel R. Lalande

In null hypothesis significance testing (NHST), p values are judged relative to an arbitrary threshold for significance (.05). The present work examined whether that standard influences the distribution of p values reported in the psychology literature. We examined a large subset of papers from three highly regarded journals. Distributions of p were found to be similar across the different journals. Moreover, p values were much more common immediately below .05 than would be expected based on the number of p values occurring in other ranges. This prevalence of p values just below the arbitrary criterion for significance was observed in all three journals. We discuss potential sources of this pattern, including publication bias and researcher degrees of freedom.
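
The comparison the authors describe can be sketched by counting p values in narrow bins on either side of .05; the bin width and the toy p values below are assumptions for illustration, not data from the paper.

```python
import numpy as np

# Compare the counts of p values in narrow bins just below and just above
# the .05 threshold. The bin width (.005) and the toy sample of p values are
# illustrative assumptions, not data from the paper.
p_values = np.array([0.003, 0.012, 0.021, 0.033, 0.041, 0.046, 0.047,
                     0.048, 0.049, 0.049, 0.052, 0.061, 0.140, 0.300])

just_below = np.sum((p_values >= 0.045) & (p_values < 0.050))
just_above = np.sum((p_values >= 0.050) & (p_values < 0.055))
print(f"p in [0.045, 0.050): {just_below}")  # 5
print(f"p in [0.050, 0.055): {just_above}")  # 1
```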


2017 ◽  
Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell us little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance itself (p ≤ 0.05) is also hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication therefore cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis; they cannot be interpreted as supporting it, which would amount to the false conclusion that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should instead be made more stringent, that sample sizes could decrease, or that p-values should be abandoned altogether. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.


2015 ◽  
Author(s):  
Dorothy V Bishop ◽  
Paul A Thompson

Background: The p-curve is a plot of the distribution of p-values below .05 reported in a set of scientific studies. It has been used to identify bias in the selection of variables and analyses for publication (p-hacking). A recent study by Head et al. (2015) combined this approach with automated text-mining of p-values from a large corpus of published papers and concluded that although there was evidence of p-hacking, its effect was weak in relation to real effect sizes and not likely to cause serious distortions in the literature. We argue that the methods used by these authors do not support this inference.
Methods: P-hacking can take various forms. For the current paper, we developed R code to simulate the use of ghost variables, where an experimenter gathers data on numerous variables but reports only those with statistically significant effects. We also examined the text-mined dataset used by Head et al. to assess its suitability for investigating p-hacking.
Results: For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" that is taken as evidence of p-hacking. The p-curve develops a positive slope when simulated variables are highly intercorrelated, but does not show the excess of p-values just below .05 that has been regarded as indicative of extreme p-hacking. A right-skewed p-curve is obtained, as expected, when there is a true difference between groups, but it was also obtained in p-hacked datasets containing a high proportion of cases with a true null effect. The results of Head et al. are further compromised because their automated text mining detected any p-value mentioned in the Results or Abstract of a paper, including those reported in the course of validation of materials or methods, or confirmation of well-established facts, as opposed to hypothesis-testing. There was no information on the statistical power of studies, nor on the statistical test conducted.
Conclusions: We find two problems with the analysis by Head et al. First, though a significant bump in the p-curve just below .05 is good evidence of p-hacking, lack of a bump is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value. This is particularly the case when there is no control over the type of p-values entered into the analysis. The analysis presented here suggests that the potential for systematic bias is substantial. We conclude that the study by Head et al. provides evidence of p-hacking in the scientific literature, but it cannot be used to estimate the extent and consequences of p-hacking. Analysis of meta-analysed datasets avoids some of these problems, but will still miss an important type of p-hacking.


2018 ◽  
Author(s):  
Christopher Brydges

Objectives: Research has found evidence of publication bias, questionable research practices (QRPs), and low statistical power in published psychological journal articles. Isaacowitz's (2018) editorial in the Journals of Gerontology Series B, Psychological Sciences called for investigation of these issues in gerontological research. The current study presents meta-research findings based on published research to explore whether there is evidence of these practices in gerontological research.
Method: A total of 14,481 test statistics and p values were extracted from articles published in eight top gerontological psychology journals since 2000. Frequentist and Bayesian caliper tests were used to test for publication bias and QRPs (specifically, p-hacking and incorrect rounding of p values). A z-curve analysis was used to estimate average statistical power across studies.
Results: Strong evidence of publication bias was observed, and average statistical power was approximately .70, below the recommended .80 level. Evidence of p-hacking was mixed. Evidence of incorrect rounding of p values was inconclusive.
Discussion: Gerontological research is not immune to publication bias, QRPs, and low statistical power. Researchers, journals, institutions, and funding bodies are encouraged to adopt open and transparent research practices and to use Registered Reports as an alternative article type in order to minimize publication bias and QRPs and to increase statistical power.
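
A frequentist caliper test of the kind described in the Method section can be sketched as a binomial test on counts just below versus just above a threshold; the counts and the caliper width below are made up for illustration and are not figures from the study.

```python
from scipy import stats

# Sketch of a frequentist caliper test: compare how many p values fall just
# below vs. just above the .05 threshold. Absent publication bias or QRPs,
# the two counts should be roughly equal. The counts and the caliper width
# are illustrative assumptions.
just_below = 120   # p values in (0.045, 0.050]
just_above = 60    # p values in (0.050, 0.055]

result = stats.binomtest(just_below, n=just_below + just_above, p=0.5,
                         alternative="greater")
print(f"caliper test p value: {result.pvalue:.2g}")
```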


2021 ◽  
Vol 13 (12) ◽  
pp. 6589
Author(s):  
Amir Karami ◽  
Melek Yildiz Spinel ◽  
C. Nicole White ◽  
Kayla Ford ◽  
Suzanne Swan

Sexual harassment has been the topic of thousands of research articles in the 20th and 21st centuries. Several review papers have been written to synthesize the literature on sexual harassment. While traditional literature reviews provide valuable insights, they have limitations: they analyze a limited number of papers, are time-consuming and labor-intensive, focus on a few topics, and lack temporal trend analysis. To address these limitations, this paper employs both computational and qualitative approaches to identify major research topics, explore temporal trends in sexual harassment topics over the past few decades, and point to possible future directions in sexual harassment studies. We collected 5320 research papers published between 1977 and 2020, identified and analyzed sexual harassment topics, and explored the temporal trends of those topics. Our findings indicate that sexual harassment in the workplace was the most popular research theme, and that sexual harassment was investigated in a wide range of settings, from schools to the military. Our analysis shows that 62.5% of the topics with a significant trend had an increasing ('hot') temporal trend and are expected to be studied more in the coming years. This study offers a bird's-eye view of the sexual harassment literature using text mining, qualitative analysis, and temporal trend analysis. This research could be beneficial to researchers, educators, publishers, and policymakers by providing a broad overview of the sexual harassment field.
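
The computational side of such a study, extracting topics from a corpus and tracking their prevalence by year, can be sketched generically; the abstract does not name a specific topic-modeling method, so the use of LDA here, together with the tiny corpus and years, is purely an assumption for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Generic sketch: extract topics from a corpus, then group papers by dominant
# topic and publication year to examine temporal trends. LDA, the tiny corpus,
# and the years are illustrative assumptions, not the authors' actual method.
docs = [
    "harassment in the workplace and employer liability",
    "sexual harassment of students in school settings",
    "harassment reporting and policy in military settings",
    "workplace harassment training and prevention programs",
]
years = np.array([1998, 2005, 2012, 2019])

X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

dominant_topic = lda.transform(X).argmax(axis=1)   # dominant topic per paper
for t in range(2):
    print(f"topic {t}: papers from years {years[dominant_topic == t].tolist()}")
```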

