How the Maximal Evidence of P-Values Against Point Null Hypotheses Depends on Sample Size

2016 ◽  
Vol 70 (4) ◽  
pp. 335-341 ◽  
Author(s):  
Leonhard Held ◽  
Manuela Ott
Mathematics ◽  
2021 ◽  
Vol 9 (6) ◽  
pp. 603
Author(s):  
Leonid Hanin

I uncover previously underappreciated systematic sources of false and irreproducible results in the natural, biomedical and social sciences that are rooted in statistical methodology. They include the inevitably occurring deviations from basic assumptions behind statistical analyses and the use of various approximations. I show through a number of examples that (a) arbitrarily small deviations from distributional homogeneity can lead to arbitrarily large deviations in the outcomes of statistical analyses; (b) samples of random size may violate the Law of Large Numbers and thus are generally unsuitable for conventional statistical inference; (c) the same is true, in particular, when random sample size and observations are stochastically dependent; and (d) the use of the Gaussian approximation based on the Central Limit Theorem has dramatic implications for p-values and statistical significance, essentially making the pursuit of small significance levels and p-values for a fixed sample size meaningless. The latter is proven rigorously in the case of the one-sided Z test. This article could serve as cautionary guidance to scientists and practitioners employing statistical methods in their work.
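
As a hedged illustration of point (d), not taken from the paper, the sketch below compares the exact one-sided p-value for a binomial proportion with its Gaussian (CLT) approximation at a fixed sample size: the approximation already misstates the p-value at moderate significance levels and is off by orders of magnitude further into the tail, exactly where very small p-values are sought. All settings (H0: p = 0.05, n = 100, the observed counts) are illustrative only.

```python
# Compare the exact binomial p-value with the CLT-based one-sided Z-test p-value
# at a fixed sample size; the relative error blows up in the far tail.
import numpy as np
from scipy import stats

n, p0 = 100, 0.05
se = np.sqrt(n * p0 * (1 - p0))            # null standard error of the count
for k in [10, 13, 16, 19]:                 # observed counts, increasingly extreme
    p_exact = stats.binom.sf(k - 1, n, p0)  # exact P(X >= k) under H0
    z = (k - n * p0) / se                   # one-sided Z statistic
    p_normal = stats.norm.sf(z)             # Gaussian-approximation p-value
    print(f"k={k}: exact={p_exact:.2e}, normal={p_normal:.2e}, "
          f"normal/exact={p_normal / p_exact:.3f}")
```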


2018 ◽  
Vol 7 (3) ◽  
pp. 63-69
Author(s):  
Suzanne L. Havstad ◽  
George W. Divine

ABSTRACT In this first of a two-part series on introductory biostatistics, we briefly describe common designs. The advantages and disadvantages of six design types are highlighted. The randomized clinical trial is the gold standard to which other designs are compared. We present the benefits of randomization and discuss the importance of power and sample size. Sample size and power calculations for any design need to be based on meaningful effects of interest. We give examples of how the effect of interest and the sample size interrelate. We also define concepts helpful to the statistical inference process. When drawing conclusions from a completed study, P values, point estimates, and confidence intervals will all assist the researcher. Finally, the issue of multiple comparisons is briefly explored. The second paper in this series will describe basic analytical techniques and discuss some common mistakes in the interpretation of data.
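
As a hedged illustration of how the effect of interest and the sample size interrelate (our own example, not the authors'), the sketch below solves for the per-group sample size of a two-sample comparison at 80% power and a 5% two-sided alpha: halving the detectable standardized effect roughly quadruples the required n.

```python
# Per-group sample size for a two-sample t-test at 80% power, for several
# standardized effect sizes (Cohen's d); values of d are illustrative.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in [0.8, 0.5, 0.25]:
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                       alternative='two-sided')
    print(f"Cohen's d = {d}: about {n_per_group:.0f} participants per group")
```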


2015 ◽  
Vol 52 (2) ◽  
pp. 85-93 ◽  
Author(s):  
Zofia Hanusz ◽  
Joanna Tarasińska

Abstract Two very well-known tests for normality, the Kolmogorov-Smirnov and the Shapiro-Wilk tests, are considered. Both of them may be normalized using Johnson’s (1949) SB distribution. In this paper, functions for normalizing constants, dependent on the sample size, are given. These functions eliminate the need to use non-standard statistical tables with normalizing constants, and make it easy to obtain p-values for testing normality.
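
The paper's normalizing functions are not reproduced here; purely as a point of reference, the sketch below shows how p-values for the same two normality tests are obtained in standard software (scipy), which handles the sample-size-dependent null distributions internally. Note that the plain Kolmogorov-Smirnov p-value is only approximate when the mean and standard deviation are estimated from the data (a Lilliefors-type correction would be needed).

```python
# Shapiro-Wilk and Kolmogorov-Smirnov tests of normality on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=2, size=50)

w_stat, w_p = stats.shapiro(x)                    # Shapiro-Wilk
# KS test with parameters estimated from the sample (p-value only approximate)
ks_stat, ks_p = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))
print(f"Shapiro-Wilk: W={w_stat:.3f}, p={w_p:.3f}")
print(f"Kolmogorov-Smirnov: D={ks_stat:.3f}, p={ks_p:.3f}")
```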


2021 ◽  
pp. bmjebm-2020-111603
Author(s):  
John Ferguson

Commonly accepted statistical advice dictates that clinical trials with large sample sizes and high power generate more reliable evidence than trials with smaller sample sizes. This advice is generally sound: treatment effect estimates from larger trials tend to be more accurate, as witnessed by tighter confidence intervals in addition to reduced publication biases. Consider then two clinical trials testing the same treatment which result in the same p value, the trials being identical apart from their sample sizes. Assuming statistical significance, one might at first suspect that the larger trial offers stronger evidence that the treatment in question is truly effective. Yet, often precisely the opposite will be true. Here, we illustrate and explain this somewhat counterintuitive result and suggest some ramifications regarding the interpretation and analysis of clinical trial results.
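
One way to see why is sketched below under assumptions of our own choosing (not the article's analysis): a point null H0: theta = 0 is compared against H1: theta ~ N(0, tau^2), with the prior scale tau and the within-trial SD set for illustration. Holding the p-value fixed while the sample size grows, the Bayes factor in favour of the null increases, the Jeffreys-Lindley effect.

```python
# Fixed one-sided p = 0.025; the Bayes factor BF01 = sqrt(1 + n*tau^2/sigma^2)
# * exp(-z^2/2 * n*tau^2/(sigma^2 + n*tau^2)) grows with n, so the same p-value
# is weaker evidence against H0 in a larger trial.
import numpy as np
from scipy import stats

sigma, tau = 1.0, 0.5            # illustrative within-trial SD and prior SD
z = stats.norm.isf(0.025)        # z-statistic corresponding to p = 0.025, one-sided

for n in [50, 500, 5000, 50000]:
    shrink = n * tau**2 / (sigma**2 + n * tau**2)
    bf01 = np.sqrt(1 + n * tau**2 / sigma**2) * np.exp(-0.5 * z**2 * shrink)
    print(f"n={n:>6}: same p = 0.025, BF in favour of H0 = {bf01:.2f}")
```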


Methodology ◽  
2015 ◽  
Vol 11 (2) ◽  
pp. 65-79 ◽  
Author(s):  
Geert H. van Kollenburg ◽  
Joris Mulder ◽  
Jeroen K. Vermunt

The application of latent class (LC) analysis involves evaluating the LC model using goodness-of-fit statistics. To assess the misfit of a specified model, say with the Pearson chi-squared statistic, a p-value can be obtained using an asymptotic reference distribution. However, asymptotic p-values are not valid when the sample size is not large and/or the analyzed contingency table is sparse. Another problem is that for various other conceivable global and local fit measures, asymptotic distributions are not readily available. An alternative way to obtain the p-value for the statistic of interest is by constructing its empirical reference distribution using resampling techniques such as the parametric bootstrap or the posterior predictive check (PPC). In the current paper, we show how to apply the parametric bootstrap and two versions of the PPC to obtain empirical p-values for a number of commonly used global and local fit statistics within the context of LC analysis. The main difference between the PPC using test statistics and the parametric bootstrap is that the former takes into account parameter uncertainty. The PPC using discrepancies has the advantage that it is computationally much less intensive than the other two resampling methods. In a Monte Carlo study we evaluated Type I error rates and power of these resampling methods when used for global and local goodness-of-fit testing in LC analysis. Results show that both the bootstrap and the PPC using test statistics are generally good alternatives to asymptotic p-values and can also be used when (asymptotic) distributions are not known. Nominal Type I error rates were not met when sample size was small and the contingency table had many cells. Overall, the PPC using test statistics was somewhat more conservative than the parametric bootstrap. We have also replicated previous research suggesting that the Pearson χ2 statistic should in many cases be preferred over the likelihood-ratio G2 statistic. Power to reject a model for which the number of LCs was one less than in the population was very high, unless sample size was small. When the contingency tables were very sparse, the total bivariate residual (TBVR) statistic, which is based on bivariate relationships, still had very high power, signifying its usefulness in assessing model fit.
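
The parametric-bootstrap p-value described here can be sketched generically. The snippet below is a minimal, hedged illustration (our own toy example, not the paper's LC code): a deliberately simple stand-in model, a Poisson fit assessed with a Pearson chi-squared discrepancy, plays the role of the LC model, and in the LC setting the `fit` and `statistic` functions would be replaced by the corresponding LC estimation and fit-statistic routines.

```python
# Parametric bootstrap p-value: fit the model, simulate B datasets from the
# fitted model, refit and recompute the statistic on each, and compare.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def fit(y):
    return y.mean()                                  # MLE of the Poisson rate

def statistic(y, lam, max_k=7):
    # Pearson chi-squared over the cells {0, 1, ..., max_k-1, >= max_k}
    obs = np.bincount(np.minimum(y, max_k), minlength=max_k + 1)
    probs = stats.poisson.pmf(np.arange(max_k + 1), lam)
    probs[-1] = stats.poisson.sf(max_k - 1, lam)     # lump the upper tail
    expected = len(y) * probs
    return ((obs - expected) ** 2 / expected).sum()

y_obs = rng.poisson(2.0, size=200)
lam_hat = fit(y_obs)
t_obs = statistic(y_obs, lam_hat)

B = 999
t_boot = np.empty(B)
for b in range(B):
    y_b = rng.poisson(lam_hat, size=len(y_obs))      # simulate under the fitted model
    t_boot[b] = statistic(y_b, fit(y_b))             # refit, recompute the statistic
p_boot = (1 + (t_boot >= t_obs).sum()) / (B + 1)     # empirical reference distribution
print(f"observed statistic = {t_obs:.2f}, bootstrap p-value = {p_boot:.3f}")
```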


Author(s):  
Holly Schaafsma ◽  
Holly Laasanen ◽  
Jasna Twynstra ◽  
Jamie A. Seabrook

Despite the widespread use of statistical techniques in quantitative research, methodological flaws and inadequate statistical reporting persist. The objective of this study was to evaluate the quality of statistical reporting and procedures in all original, quantitative articles published in the Canadian Journal of Dietetic Practice and Research (CJDPR) from 2010 to 2019, using a checklist created by our research team. In total, 107 articles were independently evaluated by 2 raters. The hypothesis or objective(s) was clearly stated in 97.2% of the studies. Over half (51.4%) of the articles reported the study design, and 57.9% adequately described the statistical techniques used. Only 21.2% of the studies that required a prestudy sample size calculation reported one. Of the 281 statistical tests conducted, 88.3% were correct. P values between 0.05 and 0.10 were reported as “statistically significant” and/or a “trend” in 11.4% of studies. While this evaluation reveals both strengths and areas for improvement in the quality of statistical reporting in CJDPR, we encourage dietitians to pursue additional statistical training and/or seek the assistance of a statistician. Future research should consider validating this new checklist and using it to evaluate the statistical quality of studies published in other nutrition journals and disciplines.


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Hwan-ho Cho ◽  
Ho Yun Lee ◽  
Eunjin Kim ◽  
Geewon Lee ◽  
Jonghoon Kim ◽  
...  

Abstract Deep learning (DL) is a breakthrough technology for medical imaging, but it comes with high sample size requirements and interpretability issues. Using a pretrained DL model through a radiomics-guided approach, we propose a methodology for stratifying the prognosis of lung adenocarcinomas based on pretreatment CT. Our approach allows us to apply DL with smaller sample size requirements and enhanced interpretability. Baseline radiomics and DL models for the prognosis of lung adenocarcinomas were developed and tested using a local cohort (n = 617). The DL models were further tested in an external validation cohort (n = 70). The local cohort was divided into training and test cohorts. A radiomics risk score (RRS) was developed using Cox-LASSO. Three pretrained DL networks derived from natural images were used to extract the DL features. The features were further guided using radiomics by retaining those DL features whose correlations with the radiomics features were high and whose Bonferroni-corrected p-values were low. The retained DL features were subjected to Cox-LASSO to construct DL risk scores (DRS). The risk groups stratified by the RRS and DRS showed significant differences in the training, test, and validation cohorts. The DL features were interpreted using existing radiomics features, and the texture features explained the DL features well.
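
A hedged sketch of the radiomics-guided screening step as we read it from this abstract (not the authors' code): a DL feature is retained if it correlates strongly with at least one radiomics feature and the Bonferroni-corrected p-value of that correlation is small. The feature matrices, the correlation cutoff, and the alpha level below are illustrative stand-ins.

```python
# Keep a DL feature if |r| with some radiomics feature exceeds a cutoff and the
# Bonferroni-corrected p-value of that correlation is below alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_patients, n_dl, n_rad = 300, 64, 20
dl = rng.normal(size=(n_patients, n_dl))              # stand-in DL features
rad = rng.normal(size=(n_patients, n_rad))            # stand-in radiomics features
dl[:, :5] = rad[:, :5] + 0.3 * rng.normal(size=(n_patients, 5))  # make a few related

alpha, r_min = 0.05, 0.6
n_tests = n_dl * n_rad                                # Bonferroni denominator
keep = []
for j in range(n_dl):
    for k in range(n_rad):
        r, p = stats.pearsonr(dl[:, j], rad[:, k])
        if abs(r) >= r_min and p * n_tests < alpha:   # high correlation, small corrected p
            keep.append(j)
            break
print(f"retained {len(keep)} of {n_dl} DL features")
```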


2019 ◽  
Vol 3 ◽  
Author(s):  
Jessica K. Witt

What is the best criterion for determining statistical significance? In psychology, the criterion has been p < .05. This criterion has been criticized since its inception, and the criticisms have been rejuvenated with recent failures to replicate studies published in top psychology journals. Several replacement criteria have been suggested, including reducing the alpha level to .005 or switching to other types of criteria such as Bayes factors or effect sizes. Here, various decision criteria for statistical significance were evaluated using signal detection analysis on the outcomes of simulated data. The signal detection measure of area under the curve (AUC) is a measure of discriminability with a value of 1 indicating perfect discriminability and 0.5 indicating chance performance. Applied to criteria for statistical significance, it provides an estimate of the decision criterion’s performance in discriminating real effects from null effects. AUCs were high (M = .96, median = .97) for p values, suggesting merit in using p values to discriminate real from null effects. AUCs can be used to assess methodological questions such as how much improvement will be gained with increased sample size, how much discriminability will be lost with questionable research practices, and whether it is better to run a single high-powered study or a study plus a replication at lower powers. AUCs were also used to compare performance across p values, Bayes factors, and effect size (Cohen’s d). AUCs were equivalent for p values and Bayes factors and were slightly higher for effect size. Signal detection analysis provides separate measures of discriminability and bias. With respect to bias, the specific thresholds that produced maximal utility depended on sample size, although this dependency was particularly notable for p values and less so for Bayes factors. The application of signal detection theory to the issue of statistical significance highlights the need to focus on both false alarms and misses, rather than false alarms alone.
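
The simulation below is a small sketch in the spirit of this analysis, with illustrative settings of our own choosing (two-sample t-tests, d = 0.5, 64 per group): p-values from simulated null and real-effect studies are scored by how well they discriminate the two, summarized by the AUC.

```python
# AUC of the p-value as a discriminator between simulated real and null effects.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_sims, n_per_group, d = 2000, 64, 0.5

p_vals, labels = [], []
for real in (0, 1):                      # 0 = null effect, 1 = real effect
    for _ in range(n_sims):
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(d * real, 1, n_per_group)
        p_vals.append(stats.ttest_ind(a, b).pvalue)
        labels.append(real)

# smaller p-values should indicate real effects, so use -p as the score
auc = roc_auc_score(labels, -np.asarray(p_vals))
print(f"AUC of the p-value as a discriminator: {auc:.3f}")
```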


2013 ◽  
Vol 04 (03) ◽  
pp. 434-444 ◽  
Author(s):  
B.J. McMorris ◽  
L.A. Raynor ◽  
K. A. Monsen ◽  
K. E. Johnson

Summary Background: The Omaha System is a standardized interface terminology that is used extensively by public health nurses in community settings to document interventions and client outcomes. Researchers using Omaha System data to analyze the effectiveness of interventions have typically calculated p-values to determine whether significant client changes occurred between admission and discharge. However, p-values are highly dependent on sample size, making it difficult to distinguish statistically significant changes from clinically meaningful changes. Effect sizes can help identify practical differences but have not yet been applied to Omaha System data. Methods: We compared p-values and effect sizes (Cohen’s d) for mean differences between admission and discharge for 13 client problems documented in the electronic health records of 1,016 young low-income parents. Client problems were documented anywhere from 6 (Health Care Supervision) to 906 (Caretaking/parenting) times. Results: On a scale from 1 to 5, the mean change needed to yield a large effect size (Cohen’s d ≥ 0.80) was approximately 0.60 (range = 0.50 – 1.03) regardless of p-value or sample size (i.e., the number of times a client problem was documented in the electronic health record). Conclusions: Researchers using the Omaha System should report effect sizes to help readers determine which differences are practical and meaningful. Such disclosures will allow for increased recognition of effective interventions.
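
The following sketch, with made-up numbers rather than Omaha System data, illustrates the point: for the same underlying admission-to-discharge change, Cohen's d stays roughly constant while the p-value shrinks as the number of documented problems grows.

```python
# Same true mean change on a 1-5 rating scale; d is stable across n, p is not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_change, sd = 0.30, 1.0              # illustrative change and SD

for n in [30, 300, 3000]:
    change = rng.normal(true_change, sd, size=n)   # discharge minus admission
    d = change.mean() / change.std(ddof=1)         # Cohen's d for the paired change
    p = stats.ttest_1samp(change, 0).pvalue
    print(f"n={n:>4}: Cohen's d = {d:.2f}, p = {p:.2g}")
```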

