The case for default point-H1-hypotheses: a theory-construction perspective

2021 ◽  
Author(s):  
Frank Zenker ◽  
Erich H. Witte

The development of an empirically adequate theoretical construct for a given phenomenon of interest requires an estimate of the population effect size, aka the true effect. Arriving at this estimate in evidence-based ways presupposes access to robust experimental or observational findings, defined as statistically significant test results with high statistical power. In the behavioral sciences, however, even the best journals typically publish statistically significant test results with insufficient statistical power, entailing that such findings have insufficient replication probability. Whereas a robust finding formally requires that an empirical study engage with point-specific H0- and H1-hypotheses, behavioral scientists today typically point-specify only the H0 and instead engage a composite (directional) H1. This mismatch renders the prospects for theory-construction poor, because the population effect size—the very parameter that is to be modelled—regularly remains unknown. This can only hinder the development of empirically adequate theoretical constructs. Based on the research program strategy (RPS), a sophisticated integration of elements of Frequentist and Bayesian statistical inference, we claim here that theoretical progress requires engaging with point-H1-hypotheses by default.
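To make the point-H1 requirement concrete: once the H1 is point-specified (say, Cohen's d = 0.4), the power of a planned study is exactly computable. A minimal Python sketch, with all parameter values assumed for illustration rather than taken from the paper:

```python
# Sketch: exact power of a two-sided two-sample t-test under a point H1.
# Assumptions (not from the paper): alpha = .05, equal group sizes, d = 0.4.
import numpy as np
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power to detect a point-specified effect d in a two-sample t-test."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
    # Probability of landing in either rejection region under the point H1
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

for n in (50, 100, 200):
    print(f"n = {n:>3} per group, d = 0.4: power = {power_two_sample_t(0.4, n):.2f}")
```

Under these assumptions, roughly 100 participants per group are needed for 80% power at d = 0.4; without a point H1, no such calculation is possible.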

Author(s):  
Neal M. Krause

The literature on the relationship between religion and health is vast, but it is in a state of disarray. One empirical study has been piled upon another, while little effort has been made to integrate them into a more tightly knit theoretical whole. This book was designed to address this problem. It is the product of 40 years of empirical research, hundreds of peer-reviewed publications, and countless hours of deep reflection. This volume contributes to the literature in three ways: (1) a unique approach to theory construction and model development is presented that is designed to produce a conceptual scheme that is evidence-based and empirically verifiable; (2) a new construct—communities of faith—that has largely been overlooked in empirical studies on religion is introduced; and (3) the need is highlighted for a no-holds-barred discussion of how to practice one's research craft.


2020 ◽  
Author(s):  
John Protzko ◽  
Jon Krosnick ◽  
Leif D. Nelson ◽  
Brian A. Nosek ◽  
Jordan Axt ◽  
...  

Failures to replicate evidence of new discoveries have forced scientists to ask whether this unreliability is due to suboptimal implementation of optimal methods or whether presumptively optimal methods are not, in fact, optimal. This paper reports an investigation by four coordinated laboratories of the prospective replicability of 16 novel experimental findings using current optimal practices: high statistical power, preregistration, and complete methodological transparency. In contrast to past systematic replication efforts that reported replication rates averaging 50%, replication attempts here produced the expected effects with significance testing (p < .05) in 86% of attempts, slightly exceeding the maximum expected replicability based on observed effect size and sample size. When one lab attempted to replicate an effect discovered by another lab, the effect size in the replications was 97% that of the original study. This high replication rate justifies confidence in rigor-enhancing methods and suggests that past failures to replicate may be attributable to departures from optimal procedures.
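For intuition on the "maximum expected replicability" benchmark: given an observed effect size and the replication's sample size, the expected replication rate is essentially the replication's power at that effect. A Monte Carlo sketch with illustrative numbers, not the paper's data:

```python
# Sketch: Monte Carlo estimate of expected replication rate. A "discovery"
# with true effect d is replicated at the same n; we count p < .05 outcomes.
# d = 0.5 and n = 100 per group are assumed values for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_replication_rate(d, n_per_group, n_sims=10_000, alpha=0.05):
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(d, 1.0, n_per_group)    # treatment group
        b = rng.normal(0.0, 1.0, n_per_group)  # control group
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

print(f"{simulated_replication_rate(0.5, 100):.2f}")  # ~0.94 for d = 0.5, n = 100
```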


2017 ◽  
Author(s):  
Herm J. Lamberink ◽  
Willem M. Otte ◽  
Michel R.T. Sinke ◽  
Daniël Lakens ◽  
Paul P. Glasziou ◽  
...  

Abstract
Background: Biomedical studies with low statistical power are a major concern in the scientific community and are one of the underlying reasons for the reproducibility crisis in science. If randomized clinical trials, which are considered the backbone of evidence-based medicine, also suffer from low power, this could affect medical practice.
Methods: We analysed the statistical power of 137 032 clinical trials published between 1975 and 2017, extracted from meta-analyses in the Cochrane Database of Systematic Reviews. We determined study power to detect standardized effect sizes according to Cohen, and in meta-analyses with a p-value below 0.05 we based power on the meta-analysed effect size. Average power, effect size, and temporal patterns were examined.
Results: The number of trials with power ≥80% was low but increased over time: from 9% in 1975–1979 to 15% in 2010–2014. This increase was mainly due to increasing sample sizes, whilst effect sizes remained stable, with a median Cohen's h of 0.21 (IQR 0.12–0.36) and a median Cohen's d of 0.31 (0.19–0.51). The proportion of trials with power of at least 80% to detect a standardized effect size of 0.2 (small), 0.5 (moderate), and 0.8 (large) was 7%, 48%, and 81%, respectively.
Conclusions: This study demonstrates that sufficient power in clinical trials is still problematic, although the situation is slowly improving. Our data encourage further efforts to increase statistical power in clinical trials to guarantee rigorous and reproducible evidence-based medicine.
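As an illustration of the power metrics used here, the sketch below computes Cohen's h for binary outcomes and normal-approximation power for the standardized effect sizes 0.2, 0.5, and 0.8; the proportions and sample size are assumed examples, not values from the Cochrane data:

```python
# Sketch: Cohen's h for two proportions and approximate two-arm trial power.
import numpy as np
from scipy import stats

def cohens_h(p1, p2):
    """Arcsine-transformed difference between two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

def power_from_h(h, n_per_arm, alpha=0.05):
    """Normal-approximation power of a two-sided two-proportion test."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.sf(z_crit - abs(h) * np.sqrt(n_per_arm / 2))

print(f"h = {cohens_h(0.60, 0.50):.2f}")  # ≈ 0.20, a 'small' effect
for es in (0.2, 0.5, 0.8):
    print(f"effect {es}: power with 100/arm = {power_from_h(es, 100):.2f}")
```

With 100 patients per arm, power is roughly 0.29, 0.94, and 1.00 for small, moderate, and large effects, which mirrors why so few trials are adequately powered for small effects.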


2019 ◽  
Vol 227 (4) ◽  
pp. 261-279 ◽  
Author(s):  
Frank Renkewitz ◽  
Melanie Keiner

Abstract. Publication biases and questionable research practices are assumed to be two of the main causes of low replication rates. Both of these problems lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect such bias in meta-analytic results. We present an evaluation of the performance of six of these tools. To assess the Type I error rate and the statistical power of these methods, we simulated a large variety of literatures that differed with regard to true effect size, heterogeneity, number of available primary studies, and sample sizes of these primary studies; furthermore, simulated studies were subjected to different degrees of publication bias. Our results show that across all simulated conditions, no method consistently outperformed the others. Additionally, all methods performed poorly when true effect sizes were heterogeneous or primary studies had a small chance of being published, irrespective of their results. This suggests that in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.
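For readers unfamiliar with such tools, the sketch below simulates a publication-biased literature and applies one classic regression-based detector, Egger's test (which may or may not be among the six methods evaluated here); all simulation settings are assumptions for illustration:

```python
# Sketch: simulate selective publication (true d = 0, nonsignificant results
# published with probability 0.2), then run Egger's regression test.
# Requires SciPy >= 1.7 for the intercept_stderr attribute.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_biased_literature(k=40, d_true=0.0, p_pub_nonsig=0.2):
    effects, ses = [], []
    while len(effects) < k:
        n = rng.integers(20, 100)        # per-group sample size
        se = np.sqrt(2 / n)              # approximate SE of Cohen's d
        d = rng.normal(d_true, se)       # observed effect
        significant = abs(d / se) > 1.96
        if significant or rng.random() < p_pub_nonsig:  # publication filter
            effects.append(d); ses.append(se)
    return np.array(effects), np.array(ses)

def egger_test(effects, ses):
    """Egger's test: regress z = d/se on precision = 1/se; test the intercept."""
    z, precision = effects / ses, 1 / ses
    res = stats.linregress(precision, z)
    t = res.intercept / res.intercept_stderr
    return res.intercept, 2 * stats.t.sf(abs(t), len(effects) - 2)

d, se = simulate_biased_literature()
intercept, p = egger_test(d, se)
print(f"Egger intercept = {intercept:.2f}, p = {p:.3f}")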


2019 ◽  
Author(s):  
Curtis David Von Gunten ◽  
Bruce D Bartholow

A primary psychometric concern with laboratory-based inhibition tasks has been their reliability. However, a reliable measure may be neither necessary nor sufficient for reliably detecting effects (statistical power). The current study used a bootstrap sampling approach to systematically examine how the number of participants, the number of trials, the magnitude of an effect, and study design (between- vs. within-subject) jointly contribute to power in five commonly used inhibition tasks. The results demonstrate the shortcomings of relying solely on measurement reliability when determining the number of trials to use in an inhibition task: high internal reliability can be accompanied by low power, and low reliability can be accompanied by high power. For instance, adding trials once sufficient reliability has been reached can still yield large gains in power. The dissociation between reliability and power was particularly apparent in between-subject designs, where the number of participants contributed greatly to power but little to reliability, and where the number of trials contributed greatly to reliability but only modestly (depending on the task) to power. For between-subject designs, the probability of detecting small-to-medium-sized effects with 150 participants (total) was generally less than 55%. However, effect size was positively associated with the number of trials. Thus, researchers have some control over effect size, and this needs to be considered when conducting power analyses using analytic methods that take such effect sizes as an argument. Results are discussed in the context of recent claims regarding the role of inhibition tasks in experimental and individual-difference designs.
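The trial/participant trade-off can be sketched with a generative simulation (rather than the paper's bootstrap resampling of real task data); all parameter values below are assumptions chosen to mimic a between-subject inhibition-task design:

```python
# Sketch: simulated power for a between-subject design where each subject's
# score is a mean over noisy trials. d, variances, and ns are assumed values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def simulate_power(n_subjects, n_trials, d=0.3, subj_sd=1.0, trial_sd=2.0,
                   n_sims=2000, alpha=0.05):
    hits = 0
    for _ in range(n_sims):
        # Subject-level true scores; group difference is d * subj_sd
        g1 = rng.normal(d * subj_sd, subj_sd, n_subjects)
        g2 = rng.normal(0.0, subj_sd, n_subjects)
        # Averaging trials shrinks measurement noise by sqrt(n_trials)
        g1 += rng.normal(0, trial_sd / np.sqrt(n_trials), n_subjects)
        g2 += rng.normal(0, trial_sd / np.sqrt(n_trials), n_subjects)
        if stats.ttest_ind(g1, g2).pvalue < alpha:
            hits += 1
    return hits / n_sims

for n_trials in (20, 80):
    print(f"75 per group, {n_trials} trials: power ≈ {simulate_power(75, n_trials):.2f}")
```

Quadrupling the trial count barely moves power here because the subject-level variance, which extra trials cannot reduce, dominates in a between-subject design; this is the dissociation the abstract describes.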


2017 ◽  
Vol 10 (2) ◽  
pp. 94
Author(s):  
Ji Meng

This research compared the effects of cooperative learning and lecture-based teaching in Comprehensive English classes at a Chinese independent college. An empirical study spanning two semesters was carried out using a pretest, a posttest, a questionnaire, and interviews. While the control class was taught in the conventional way, the experimental class was instructed in cooperative base groups with purposefully structured positive interdependence. Compared with traditional instruction, cooperative learning as a pedagogy improved students' performance on course exams, but not necessarily their language competence, as shown by national English competency tests taken before and after the experiment. Test results also indicate that more students from the experimental class than from the control class excelled in the competency test, suggesting that cooperative learning has positive effects especially on students at a relatively higher academic level. Questionnaire results show that students most readily agreed that they had more chances to practice the language in a cooperative environment.
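The standard analysis for such a pretest-posttest control-group design compares gain scores across classes; a minimal sketch with invented placeholder scores, not the study's data:

```python
# Sketch: independent-samples t-test on gain scores (posttest - pretest).
# All score arrays are hypothetical placeholders.
import numpy as np
from scipy import stats

pre_ctrl, post_ctrl = np.array([62, 58, 71]), np.array([66, 60, 73])  # control class
pre_exp,  post_exp  = np.array([60, 63, 69]), np.array([70, 71, 78])  # experimental class

gain_ctrl = post_ctrl - pre_ctrl
gain_exp = post_exp - pre_exp
res = stats.ttest_ind(gain_exp, gain_ctrl)
print(f"mean gain (exp vs ctrl): {gain_exp.mean():.1f} vs {gain_ctrl.mean():.1f}, "
      f"p = {res.pvalue:.3f}")
```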


2008 ◽  
Vol 54 (11) ◽  
pp. 1872-1882 ◽  
Author(s):  
Eva Nagy ◽  
Joseph Watine ◽  
Peter S Bunting ◽  
Rita Onody ◽  
Wytze P Oosterhuis ◽  
...  

Abstract
Background: Although the methodological quality of therapeutic guidelines (GLs) has been criticized, little is known regarding the quality of GLs that make diagnostic recommendations. Therefore, we assessed the methodological quality of GLs providing diagnostic recommendations for managing diabetes mellitus (DM) and explored several reasons for differences in quality across these GLs.
Methods: After systematic searches of published and electronic resources dated between 1999 and 2007, 26 DM GLs, published in English, were selected and scored for methodological quality using the AGREE Instrument. Subgroup analyses were performed based on the source, scope, length, origin, and date and type of publication of GLs. Using a checklist, we collected laboratory-specific items within GLs thought to be important for interpretation of test results.
Results: The 26 diagnostic GLs had significant shortcomings in methodological quality according to the AGREE criteria. GLs from agencies that had clear procedures for GL development, were longer than 50 pages, or were published in electronic databases were of higher quality. Diagnostic GLs contained more preanalytical or analytical information than combined (i.e., diagnostic and therapeutic) recommendations, but the overall quality was not significantly different. The quality of GLs did not show much improvement over the time period investigated.
Conclusions: The methodological shortcomings of diagnostic GLs in DM raise questions regarding the validity of recommendations in these documents that may affect their implementation in practice. Our results suggest the need for standardization of GL terminology and for higher-quality, systematically developed recommendations based on explicit guideline development and reporting standards in laboratory medicine.
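For context on the scoring: AGREE-style appraisal typically standardizes each domain score as (obtained minus minimum possible) divided by (maximum possible minus minimum possible) across appraisers and items. A small sketch with invented ratings and an assumed 1–4 item scale:

```python
# Sketch: AGREE-style standardized domain score. Ratings are hypothetical;
# the 1-4 item scale is an assumption about the instrument version used.
def agree_domain_score(ratings, scale_min=1, scale_max=4):
    """ratings: one list of item scores per appraiser, for a single domain."""
    n_appraisers, n_items = len(ratings), len(ratings[0])
    obtained = sum(sum(r) for r in ratings)
    min_possible = scale_min * n_appraisers * n_items
    max_possible = scale_max * n_appraisers * n_items
    return (obtained - min_possible) / (max_possible - min_possible)

# Two appraisers rating a three-item domain:
print(f"{agree_domain_score([[3, 4, 2], [4, 4, 3]]):.0%}")  # 78%
```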

