Interpreting Statistical Significance and Meaningfulness in Adapted Physical Activity Research

1998 ◽  
Vol 15 (2) ◽  
pp. 103-118 ◽  
Author(s):  
Vinson H. Sutlive ◽  
Dale A. Ulrich

The unqualified use of statistical significance tests for interpreting the results of empirical research has been called into question by researchers in a number of behavioral disciplines. This paper reviews what statistical significance tells us and what it does not, with particular attention paid to criticisms of using the results of these tests as the sole basis for evaluating the overall significance of research findings. In addition, implications for adapted physical activity research are discussed. Based on recent literature in other disciplines, several recommendations for evaluating and reporting research findings are made: calculating and reporting effect sizes, selecting an alpha level larger than the conventional .05 level, placing greater emphasis on replication of results, evaluating results in a sample size context, and employing simple research designs. Adapted physical activity researchers are encouraged to use specific modifiers when describing findings as significant.
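One of the recommendations above, reporting an effect size alongside the test statistic, is straightforward to put into practice. Below is a minimal Python sketch with hypothetical data, computing Cohen's d next to the usual t-test; it illustrates the practice and is not code from the paper.

```python
# Hedged sketch: report an effect size (Cohen's d) alongside the p-value
# instead of relying on the significance test alone. Data are made up.
import numpy as np
from scipy import stats

treatment = np.array([12.1, 14.3, 11.8, 15.0, 13.2, 14.8])
control   = np.array([10.9, 12.0, 11.1, 12.5, 10.4, 11.7])

t, p = stats.ttest_ind(treatment, control)

# Cohen's d using the pooled standard deviation
n1, n2 = len(treatment), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
d = (treatment.mean() - control.mean()) / pooled_sd

print(f"t = {t:.2f}, p = {p:.3f}, Cohen's d = {d:.2f}")
```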

Circulation ◽  
2021 ◽  
Vol 143 (Suppl_1) ◽  
Author(s):  
Kimberly L Savin ◽  
Linda C Gallo ◽  
Britta A Larsen

Introduction: Pregnant women with diabetes often show low levels of physical activity (PA) and high sedentary behavior (SED). Longitudinal studies with objective measures are needed to understand the relationships of daily PA with daily and next-day blood glucose (BG).

Hypothesis: Increased steps or moderate to vigorous PA (MVPA) and decreased SED are linked with lower post-meal BG and next-day fasting BG in pregnant women.

Methods: Participants were 10 pregnant women with diabetes [mean age= 29.3 (SD= 3.6); mean gestational age= 21.9 (SD= 3.9); 90% (9 of 10) Latina] enrolled in a 12-week pilot PA intervention. Participants self-reported demographic and BG data (morning fasting BG, up to 3 daily post-meal BGs). Steps, MVPA (mins/day), and SED (mins/day) were measured using a Fitbit Alta HR. Participants had on average 49 (range: 21 to 77) days with valid PA and BG data, for a total of 469 observations. Multi-level models (MLMs) were fit to examine mean (between-individual) and day-level (within-individual) effects of steps, MVPA, and SED on post-meal and next-day fasting BG after adjusting for age, gestational age, education, and participant mean PA or SED. Due to the small sample size, effect sizes are emphasized in the results instead of statistical significance.

Results: The mean post-meal BG was 122.5 mg/dL and the mean fasting BG was 92.81 mg/dL. After adjustment, a 1,000-step increase in mean steps was linked to a lower mean post-meal BG by 11.79 mg/dL (p=0.22) and a lower fasting BG by 7.26 mg/dL (p=0.54), though neither between-individual effect was statistically significant. The within-individual effects of daily steps on post-meal and fasting BG were very small and non-significant (b=-1.78, p=0.59 and b=0.72, p=0.30, respectively). A 1-minute increase in mean MVPA was associated with a slight increase in mean post-meal BG of 1.53 mg/dL (p=0.07). The within-individual effect of daily MVPA on daily post-meal BG was negligible and non-significant (b=-0.39, p=0.51). Between-individual effects showed SED had small, positive, non-significant associations with post-meal BG: per 60-minute increase in mean SED, mean post-meal BG increased by 1.02 mg/dL (p=0.44). Within-individual daily SED increases of 60 minutes were associated with increases of 1.87 mg/dL (p=0.63) in daily post-meal BG. MVPA and SED were not associated with fasting BG.

Conclusions: Greater mean steps were linked to lower post-meal and fasting BG, while greater mean SED and MVPA were linked to greater post-meal BG. However, within-individual daily increases in MVPA and decreases in SED were protective for post-meal BG while controlling for individual mean MVPA and SED. Most effect sizes were small, and results were not statistically significant, in part due to the small sample size. Participants generally had well-controlled post-meal and fasting BGs, so results may not be generalizable to larger populations.
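The within- versus between-individual distinction in models like these typically comes from person-mean centering the predictors. A hedged sketch of that modeling approach, using simulated data and illustrative variable names (not the study's data or code), might look like this with statsmodels:

```python
# Sketch of a multilevel model separating between-person and within-person
# effects of daily steps on post-meal blood glucose. All data are simulated;
# "pid", "steps", and "bg" are placeholder names, not the study's variables.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for pid in range(10):                      # 10 participants
    mean_steps = rng.normal(6000, 1500)    # person-level mean steps
    for day in range(49):                  # ~49 valid days each
        steps = mean_steps + rng.normal(0, 1200)
        bg = 130 - 0.002 * steps + rng.normal(0, 10)
        rows.append({"pid": pid, "steps": steps, "bg": bg})
df = pd.DataFrame(rows)

# Person-mean centering: the person mean carries the between effect,
# the daily deviation from it carries the within effect.
df["steps_mean"] = df.groupby("pid")["steps"].transform("mean")
df["steps_dev"] = df["steps"] - df["steps_mean"]

model = smf.mixedlm("bg ~ steps_mean + steps_dev", df, groups=df["pid"]).fit()
print(model.summary())
```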


Author(s):  
Scott B. Morris ◽  
Arash Shokri

To understand and communicate research findings, it is important for researchers to consider two types of information provided by research results: the magnitude of the effect and the degree of uncertainty in the outcome. Statistical significance tests have long served as the mainstream method for statistical inference. However, the widespread misinterpretation and misuse of significance tests have led critics to question their usefulness in evaluating research findings and to raise concerns about the far-reaching effects of this practice on scientific progress. An alternative approach involves reporting and interpreting measures of effect size along with confidence intervals. An effect size is an indicator of the magnitude and direction of a statistical observation. Effect size statistics have been developed to represent a wide range of research questions, including indicators of the mean difference between groups, the relative odds of an event, and the degree of correlation among variables. Effect sizes play a key role in evaluating practical significance, conducting power analysis, and conducting meta-analysis. While effect sizes summarize the magnitude of an effect, confidence intervals represent the degree of uncertainty in the result. By presenting a range of plausible alternative values that might have occurred due to sampling error, confidence intervals provide an intuitive indicator of how strongly researchers should rely on the results from a single study.
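As a concrete illustration of pairing an effect size with a confidence interval, the sketch below reports a correlation with a 95% interval from the Fisher z transformation. The data are simulated and the procedure is a standard textbook one, not something taken from this article.

```python
# Hedged sketch: a correlation as the effect size, with a 95% confidence
# interval from the Fisher z transformation. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)

r, p = stats.pearsonr(x, y)

# Fisher z CI: transform r, add a normal-quantile margin, back-transform
z = np.arctanh(r)
se = 1 / np.sqrt(len(x) - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], p = {p:.3f}")
```

The interval, unlike the p-value alone, shows the whole range of effect magnitudes compatible with the data.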


2021 ◽  
pp. 1-6
Author(s):  
Jeffrey Martin ◽  
Drew Martin

In the current study, articles (N = 196) published over a 20-year span across 80 issues of Adapted Physical Activity Quarterly (APAQ) were examined. The authors sought to determine whether quantitative research published in APAQ was, based on sample size, underpowered, creating the potential for false-positive results and findings that may not be reproducible. The median sample size, also known as the N-Pact Factor (NF), for all quantitative research published in APAQ was coded for correlational-type, quasi-experimental, and experimental research. The overall median sample sizes over the 20-year period examined were as follows: correlational type, NF = 112; quasi-experimental, NF = 40; and experimental, NF = 48. Four 5-year blocks were also analyzed to show historical trends. As the authors show, these results suggest that much of the quantitative research published in APAQ over the last 20 years was underpowered to detect small to moderate population effect sizes.
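What these median sample sizes imply for power is easy to check. The sketch below is an illustration, not the authors' analysis: it assumes the experimental median NF = 48 splits into two equal groups of 24 and computes the power of a two-sided two-sample t-test at Cohen's conventional benchmarks.

```python
# Hedged illustration: power of a two-sample t-test with the median
# experimental sample size reported above (NF = 48, assumed 24 per group).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):   # Cohen's small/medium/large benchmarks
    power = analysis.power(effect_size=d, nobs1=24, ratio=1.0, alpha=0.05)
    print(f"d = {d}: power = {power:.2f}")
```

Under these assumptions, power is roughly .10 for small and .40 for medium effects, consistent with the authors' conclusion that such designs are underpowered for small to moderate effect sizes.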


Author(s):  
H. S. Steyn ◽  
S. M. Ellis

The determination of the significance of differences in means and of relationships between variables is of importance in many empirical studies. Usually only statistical significance is reported, which does not necessarily indicate an important (practically significant) difference or relationship. In studies based on probability samples, effect size indices should be reported in addition to statistical significance tests in order to comment on practical significance. Where complete populations or convenience samples are worked with, the determination of statistical significance is, strictly speaking, no longer relevant, while effect size indices can still be used as a basis for judging practical significance. In this article, attention is paid to the use of effect size indices to establish practical significance. It is also shown how these indices are utilized in a few fields of statistical application, and how they receive attention in the statistical literature and in computer packages. The use of effect sizes is illustrated by a few examples from the research literature.
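As a toy illustration of judging practical significance directly from an effect size index, consider the sketch below. The cutoffs are Cohen's conventional benchmarks for d, used here purely for illustration; they are not thresholds prescribed by this article, and in practice such labels should be calibrated to the field.

```python
# Hedged sketch: map an effect size index to a qualitative judgment of
# practical significance, using Cohen's conventional benchmarks for d.
def practical_significance(d: float) -> str:
    d = abs(d)
    if d >= 0.8:
        return "large (practically significant)"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "negligible"

# Useful even when a whole population or convenience sample makes
# statistical significance tests strictly speaking irrelevant.
print(practical_significance(0.9))
```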


2020 ◽  
Author(s):  
Michael W. Beets ◽  
R. Glenn Weaver ◽  
John P.A. Ioannidis ◽  
Alexis Jones ◽  
Lauren von Klinggraeff ◽  
...  

Background: Pilot/feasibility studies, or studies with small sample sizes, may be associated with inflated effects. This study explores the vibration of effect sizes (VoE) in meta-analyses when considering different inclusion criteria based upon sample size or pilot/feasibility status.

Methods: Searches were conducted for meta-analyses of behavioral interventions on topics related to the prevention/treatment of childhood obesity, published from January 2016 to October 2019. The computed summary effect sizes (ES) were extracted from each meta-analysis. Individual studies included in the meta-analyses were classified into one of the following four categories: self-identified pilot/feasibility studies, or by sample size (N≤100, N>100, and N>370, the upper 75th percentile of sample size). The VoE was defined as the absolute difference (ABS) between re-estimations of the summary ES restricted to each study classification and the originally reported summary ES. Concordance (kappa) of statistical significance between summary ES was assessed. Fixed- and random-effects models and meta-regressions were estimated. Three case studies are presented to illustrate the impact of including pilot/feasibility and N≤100 studies on the estimated summary ES.

Results: A total of 1,602 effect sizes, representing 145 reported summary ES, were extracted from 48 meta-analyses containing 603 unique studies (avg. 22 per meta-analysis, range 2-108) and 227,217 participants. Pilot/feasibility and N≤100 studies comprised 22% (range 0-58%) and 21% (range 0-83%) of studies, respectively. Meta-regression indicated that the ABS between the re-estimated and original summary ES was 0.29 where N≤100 studies made up ≥40% of a summary ES, and 0.46 where pilot/feasibility and N≤100 studies together made up >80%. Where ≤40% of the studies comprising a summary ES had N>370, the ABS ES ranged from 0.20 to 0.30. Concordance was low when removing both pilot/feasibility and N≤100 studies (kappa=0.53) and when restricting analyses to the largest studies (N>370, kappa=0.35), with 20% and 26% of the originally reported statistically significant ES rendered non-significant, respectively. Reanalysis of the three case study meta-analyses resulted in re-estimated ES that were either rendered non-significant or reduced to half of the originally reported ES.

Conclusions: When meta-analyses of behavioral interventions include a substantial proportion of both pilot/feasibility and N≤100 studies, summary ES can be affected markedly and should be interpreted with caution.
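The core re-estimation idea is simple to sketch: compute an inverse-variance summary effect from all studies, recompute it after excluding small studies, and take the absolute difference. The code below is an illustrative toy with simulated study effects, not the authors' analysis pipeline.

```python
# Hedged sketch of "vibration of effects": how much does the summary ES
# move when small (N<=100) studies are excluded? All inputs are simulated.
import numpy as np

rng = np.random.default_rng(2)
n = rng.integers(20, 400, size=25)            # simulated study sample sizes
true_es = 0.2
var = 2 / n                                    # rough variance of a std. mean diff.
es = true_es + rng.normal(0, np.sqrt(var))     # simulated study effect sizes

def summary_es(es, var):
    w = 1 / var                                # fixed-effect inverse-variance weights
    return np.sum(w * es) / np.sum(w)

full = summary_es(es, var)
large_only = summary_es(es[n > 100], var[n > 100])
print(f"all studies: {full:.3f}, N>100 only: {large_only:.3f}, "
      f"absolute difference (ABS): {abs(full - large_only):.3f}")
```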


2013 ◽  
Vol 112 (3) ◽  
pp. 835-844 ◽  
Author(s):  
M. T. Bradley ◽  
A. Brand

Tables of alpha values as a function of sample size, effect size, and desired power were presented. The tables indicated the alphas to be expected for small, medium, and large effect sizes given a variety of sample sizes. It was evident that sample sizes for most psychological studies are adequate for large effect sizes, defined as .8. The typical alpha level of .05 and desired power of 90% can be achieved with 70 participants in two groups. It is doubtful whether these ideal levels of alpha and power have generally been achieved for medium effect sizes in actual research, since 170 participants would be required. Small effect sizes have rarely been tested with an adequate number of participants or power. Implications were discussed.
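The cited sample sizes can be roughly reproduced with a standard power routine. The sketch below is an independent check rather than the paper's code: it solves for the per-group n of a two-sided two-sample t-test at alpha = .05 and 90% power.

```python
# Hedged check of the sample sizes cited above (two-sample t-test,
# two-sided, alpha = .05, power = .90).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d, label in ((0.8, "large"), (0.5, "medium"), (0.2, "small")):
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.90)
    print(f"{label} (d={d}): ~{2 * round(n_per_group)} participants total")
```

This yields roughly 68-70 total participants for a large effect and 170 for a medium effect, matching the figures above, and over 1,000 for a small effect, which is why small effects are so rarely tested with adequate power.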


1993 ◽  
Vol 10 (4) ◽  
pp. 371-391 ◽  
Author(s):  
Marcel Bouffard

This paper is a criticism of typical group research designs in which the data are analyzed by using standard analysis of variance structural models. A distinction is made between lawful relationships about averages and lawful relationships about people. It is argued that propositions about people cannot necessarily be derived from propositions about the mean of people because the patterns found by aggregating data across people do not necessarily apply to individuals. To find lawful universal relationships about people, data analysis strategies should recognize the person as a basic unit of analysis. Implications of this view for research conducted in adapted physical activity are outlined.
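A toy simulation can make the aggregation argument concrete. Below is an entirely hypothetical dataset (not from the paper) in which the relationship between two variables is negative within every single person yet positive when the data are pooled across people, so the aggregate "law" describes no individual.

```python
# Hedged illustration: a pattern in aggregated data that reverses within
# every individual (a Simpson's-paradox-style construction).
import numpy as np

rng = np.random.default_rng(3)
xs, ys = [], []
for person in range(5):
    base = person * 10.0                  # people differ in their baselines
    x = base + rng.uniform(0, 5, size=20)
    y = base - 0.8 * (x - base) + rng.normal(0, 0.5, size=20)
    xs.append(x); ys.append(y)
    print(f"person {person}: r = {np.corrcoef(x, y)[0, 1]:+.2f}")  # negative

x_all, y_all = np.concatenate(xs), np.concatenate(ys)
print(f"aggregated:  r = {np.corrcoef(x_all, y_all)[0, 1]:+.2f}")   # positive
```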


2021 ◽  
Author(s):  
Kleber Neves ◽  
Pedro Batista Tan ◽  
Olavo Bohrer Amaral

Diagnostic screening models for the interpretation of null hypothesis significance test (NHST) results have been influential in highlighting the effect of selective publication on the reproducibility of the published literature, leading to John Ioannidis' much-cited claim that most published research findings are false. These models, however, are typically based on the assumption that hypotheses are dichotomously true or false, without considering that effect sizes for different hypotheses are not the same. To address this limitation, we develop a simulation model that represents effect sizes explicitly, drawing them from different continuous distributions, while retaining other aspects of previous models such as publication bias and the pursuit of statistical significance. Our results show that the combination of selective publication, bias, low statistical power, and unlikely hypotheses consistently leads to high proportions of false positives, irrespective of the effect size distribution assumed. Using continuous effect sizes also allows us to evaluate the degree of effect size overestimation and the prevalence of estimates with the wrong sign in the literature, showing that the same factors that drive false-positive results also lead to errors in estimating effect size direction and magnitude. Nevertheless, the relative influence of these factors on different metrics varies depending on the distribution assumed for effect sizes. The model is made available as an R ShinyApp interface, allowing one to explore features of the literature in various scenarios.
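A heavily simplified miniature of this kind of simulation (not the authors' model, which is the R ShinyApp mentioned above) fits in a few lines: draw continuous true effects, add sampling error from underpowered studies, "publish" only significant results, and inspect sign and magnitude errors among the published estimates.

```python
# Hedged toy version of a continuous-effect-size screening simulation.
# Parameter choices (effect distribution, n per group) are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_per_group, n_studies = 20, 100_000
true_d = rng.normal(0, 0.1, size=n_studies)      # mostly tiny true effects

se = np.sqrt(2 / n_per_group)                     # SE of a std. mean difference
observed = true_d + rng.normal(0, se, size=n_studies)
pvals = 2 * stats.norm.sf(np.abs(observed / se))

published = pvals < 0.05                          # selective publication
sign_error = np.sign(observed) != np.sign(true_d)
exaggeration = np.median(np.abs(observed[published]) /
                         np.abs(true_d[published]))
print(f"published: {published.mean():.1%}")
print(f"sign errors among published: {sign_error[published].mean():.1%}")
print(f"median exaggeration among published: {exaggeration:.1f}x")
```

Even this crude version reproduces the qualitative result: with low power and selective publication, published estimates are systematically exaggerated and a non-trivial share point in the wrong direction.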


2012 ◽  
Vol 36 (2) ◽  
pp. 104-121 ◽  
Author(s):  
Christine E. DeMars

A testlet is a cluster of items that share a common passage, scenario, or other context. These items might measure something in common beyond the trait measured by the test as a whole; if so, the model for the item responses should allow for this testlet trait. But modeling testlet effects that are negligible makes the model unnecessarily complicated and risks capitalizing on chance, increasing the error in parameter estimates. Checking each testlet to see whether its items share something beyond the primary trait could therefore be useful. This study included (a) comparison between a model with no testlets and a model with testlet g, (b) comparison between a model with all suspected testlets and a model with all suspected testlets except testlet g, and (c) a test of essential unidimensionality. Overall, Comparison (b) was most useful for detecting testlet effects. Model comparisons based on information criteria, specifically the sample-size adjusted Bayesian Information Criterion (SSA-BIC) and the BIC, resulted in fewer false alarms than statistical significance tests. The test of essential unidimensionality had hit rates and false alarm rates similar to the SSA-BIC when the testlet effect was zero for all testlets except the studied testlet. But the presence of additional testlet effects in the partitioning test led to higher false alarm rates for the test of essential unidimensionality.
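For reference, the two information criteria named above differ only in the sample-size term: the SSA-BIC replaces n with (n + 2)/24 in the penalty (Sclove's adjustment). A minimal sketch, with hypothetical log-likelihoods, parameter counts, and N standing in for fitted testlet models:

```python
# Hedged sketch of the information criteria used for model comparison.
# Inputs are placeholders, not values from the study.
import math

def bic(loglik: float, n_params: int, n: int) -> float:
    return -2 * loglik + n_params * math.log(n)

def ssa_bic(loglik: float, n_params: int, n: int) -> float:
    # Sample-size adjusted BIC: penalty uses (n + 2) / 24 instead of n
    return -2 * loglik + n_params * math.log((n + 2) / 24)

# Mirror of Comparison (b): keep testlet g only if the model that
# includes it achieves the lower criterion value.
print("without testlet g:", bic(-5210.4, 42, 1000), ssa_bic(-5210.4, 42, 1000))
print("with testlet g:   ", bic(-5198.7, 45, 1000), ssa_bic(-5198.7, 45, 1000))
```

Because (n + 2)/24 grows much more slowly than n, the SSA-BIC penalizes extra testlet parameters less harshly than the BIC, which is consistent with the two criteria sometimes disagreeing near the detection threshold.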

