Assessing Evidence for Replication: A Likelihood-Based Approach

2020 ◽  
Author(s):  
Peter Dixon ◽  
Scott Glover

How to evaluate replications is a fundamental issue in experimental methodology. We develop a likelihood-based approach to assessing evidence for replication. In this approach, the design of the original study is used to derive an estimate of a theoretically interesting effect size. A likelihood ratio is then calculated to contrast the fit of two models to the data from the replication attempt: (1) a model based on the derived theoretically interesting effect size, and (2) a null model. This approach provides new insights not available with existing methods of assessing replication. When applied to data from the Reproducibility Project (Open Science Collaboration, 2015), the procedure indicates that a large portion of the replications failed to find evidence for a theoretically interesting effect.
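
The central calculation lends itself to a short illustration. Below is a minimal sketch, not the authors' implementation, of contrasting a model at a theoretically interesting effect size against a null model via a likelihood ratio, assuming the replication yields an observed standardized effect with an approximately normal sampling distribution; the function name and the example numbers are hypothetical.

```python
from scipy.stats import norm

def replication_likelihood_ratio(d_obs, se, d_theory):
    """Likelihood ratio comparing (1) a model at the theoretically
    interesting effect size d_theory with (2) a null model (d = 0),
    given the replication's observed effect d_obs and its standard
    error se, under an approximate normal sampling distribution."""
    like_theory = norm.pdf(d_obs, loc=d_theory, scale=se)  # model 1
    like_null = norm.pdf(d_obs, loc=0.0, scale=se)         # model 2
    return like_theory / like_null

# Hypothetical numbers: the replication observes d = 0.10 (SE = 0.15)
# against a derived theoretically interesting effect of d = 0.40.
lr = replication_likelihood_ratio(d_obs=0.10, se=0.15, d_theory=0.40)
print(f"LR (theory vs. null) = {lr:.2f}")  # values well below 1 favour the null model
```

A ratio well below 1 counts as evidence against the theoretically interesting effect, which is the pattern the authors report for a large portion of the replications.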

2017 ◽  
Author(s):  
Robbie Cornelis Maria van Aert ◽  
Marcel A. L. M. van Assen

The unrealistically high rate of positive results in psychology has increased attention to replication research. Researchers who conduct a replication and want to statistically combine its results with those of a statistically significant original study encounter problems when using traditional meta-analysis techniques: the original study's effect size is most probably overestimated because it is statistically significant, and this bias is not taken into account in traditional meta-analysis. We developed a hybrid method that does take the statistical significance of the original study into account and enables (a) accurate effect size estimation, (b) estimation of a confidence interval, and (c) testing of the null hypothesis of no effect. We analytically approximate the performance of the hybrid method and describe its good statistical properties. Applying the hybrid method to the data of the Reproducibility Project: Psychology (Open Science Collaboration, 2015) demonstrated that the conclusions based on the hybrid method are often in line with those of the replication, suggesting that many published psychological studies have smaller effect sizes than reported in the original study and that some effects may even be absent. We offer hands-on guidelines for how to statistically combine an original study and a replication, and we developed a web-based application (https://rvanaert.shinyapps.io/hybrid) for applying the hybrid method.
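
The key statistical ingredient can be illustrated briefly. The sketch below shows the principle the hybrid method builds on: under the null hypothesis, the original study's one-sided p-value, conditional on the study being significant, is uniform on (0, α), so it can be rescaled and combined with the replication's p-value. The combination via Fisher's method here is an illustration of that principle, not the authors' exact estimator; their paper and the linked web application implement the full method, and the example numbers are hypothetical.

```python
import math
from scipy.stats import chi2

def hybrid_style_null_test(p_orig_onesided, p_rep_onesided, alpha_onesided=0.025):
    """Illustration of the principle behind the hybrid method: under H0,
    the original's one-sided p-value, conditional on significance, is
    uniform on (0, alpha). Rescale it to a uniform(0, 1) variable and
    combine it with the replication's p-value via Fisher's method.
    This is a sketch of the idea, not the authors' exact estimator."""
    assert p_orig_onesided < alpha_onesided, "original study must be significant"
    q_orig = p_orig_onesided / alpha_onesided          # conditional p-value, U(0, 1) under H0
    fisher_stat = -2.0 * (math.log(q_orig) + math.log(p_rep_onesided))
    return chi2.sf(fisher_stat, df=4)                  # combined p-value for H0: no effect

# Hypothetical example: original one-sided p = .02, replication one-sided p = .30.
print(f"combined p = {hybrid_style_null_test(0.02, 0.30):.2f}")
```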


2020 ◽  
Vol 3 (3) ◽  
pp. 309-331 ◽  
Author(s):  
Charles R. Ebersole ◽  
Maya B. Mathur ◽  
Erica Baranski ◽  
Diane-Jo Bart-Plange ◽  
Nicholas R. Buttrick ◽  
...  

Replication studies in psychological science sometimes fail to reproduce prior findings. If these studies use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the protocol rather than a challenge to the original finding. Formal pre-data-collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replication studies from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) for which the original authors had expressed concerns about the replication designs before data collection; only one of these studies had yielded a statistically significant effect (p < .05). Commenters suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these RP:P studies failed to replicate the original effects. We revised the replication protocols and received formal peer review prior to conducting new replication studies. We administered the RP:P and revised protocols in multiple laboratories (median number of laboratories per original study = 6.5, range = 3–9; median total sample = 1,279.5, range = 276–3,512) for high-powered tests of each original finding with both protocols. Overall, following the preregistered analysis plan, we found that the revised protocols produced effect sizes similar to those of the RP:P protocols (Δr = .002 or .014, depending on analytic approach). The median effect size for the revised protocols (r = .05) was similar to that of the RP:P protocols (r = .04) and the original RP:P replications (r = .11), and smaller than that of the original studies (r = .37). Analysis of the cumulative evidence across the original studies and the corresponding three replication attempts provided very precise estimates of the 10 tested effects and indicated that their effect sizes (median r = .07, range = .00–.15) were 78% smaller, on average, than the original effect sizes (median r = .37, range = .19–.50).
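
The lab-level comparisons summarized above rest on pooling correlations across laboratories. The sketch below shows one standard way to do that, a fixed-effect meta-analysis on Fisher-z-transformed correlations; the lab-level numbers are hypothetical, and the article's preregistered analyses are more elaborate than this.

```python
import numpy as np

def pooled_r(rs, ns):
    """Fixed-effect pooling of correlations across labs: Fisher-z
    transform each r, weight by inverse variance (n - 3), then
    back-transform the weighted mean to the r metric."""
    rs, ns = np.asarray(rs, dtype=float), np.asarray(ns, dtype=float)
    z = np.arctanh(rs)              # Fisher z transform
    w = ns - 3.0                    # 1 / var(z), since var(z) = 1 / (n - 3)
    z_bar = np.sum(w * z) / np.sum(w)
    return np.tanh(z_bar)           # back-transform to r

# Hypothetical lab-level results for one finding under the two protocols.
r_rpp = pooled_r([0.02, 0.08, 0.05, 0.01], [200, 150, 300, 250])
r_revised = pooled_r([0.06, 0.03, 0.07, 0.04], [220, 180, 260, 240])
print(f"RP:P protocol pooled r = {r_rpp:.3f}; revised protocol pooled r = {r_revised:.3f}")
```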


2020 ◽  
Author(s):  
Hidde Jelmer Leplaa ◽  
Charlotte Rietbergen ◽  
Herbert Hoijtink

In this paper, a method is proposed to determine whether the result from an original study is corroborated in a replication study. The method is illustrated using data from the Reproducibility Project: Psychology by the Open Science Collaboration. It emphasizes the need to determine what one wants to replicate: the hypotheses as formulated in the introduction of the original paper, or hypotheses derived from the research results presented in the original paper. The Bayes factor is used to determine whether the hypotheses evaluated in, or resulting from, the original study are corroborated by the replication study. Our method for assessing the success of a replication will better fit the needs and desires of researchers in fields that use replication studies.
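
A heavily simplified sketch of the Bayes-factor step: compare a hypothesis derived from the original result (an effect centred on the original estimate, with the original's uncertainty as prior spread) against a null model, given the replication estimate, assuming approximately normal likelihoods. This is in the spirit of a replication Bayes factor comparing a point null with an original-study-informed alternative, not the authors' procedure for evaluating hypotheses formulated in or derived from the original paper; all names and numbers are hypothetical.

```python
from scipy.stats import norm

def replication_bayes_factor(d_rep, se_rep, d_orig, se_orig):
    """BF10 comparing H1 (effect ~ Normal(d_orig, se_orig^2), i.e. a
    hypothesis derived from the original result) against H0 (effect = 0),
    given the replication estimate d_rep with standard error se_rep.
    The marginal likelihood under H1 is Normal(d_orig, se_rep^2 + se_orig^2)."""
    m1 = norm.pdf(d_rep, loc=d_orig, scale=(se_rep**2 + se_orig**2) ** 0.5)
    m0 = norm.pdf(d_rep, loc=0.0, scale=se_rep)
    return m1 / m0

# Hypothetical example: original d = 0.50 (SE = 0.20), replication d = 0.10 (SE = 0.12).
bf10 = replication_bayes_factor(d_rep=0.10, se_rep=0.12, d_orig=0.50, se_orig=0.20)
print(f"BF10 = {bf10:.2f}")  # values below 1 mean the replication data favour the null
```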


2019 ◽  
Author(s):  
Charles R. Ebersole ◽  
Maya B Mathur ◽  
Erica Baranski ◽  
Diane-Jo Bart-Plange ◽  
Nick Buttrick ◽  
...  

Replications in psychological science sometimes fail to reproduce prior findings. If replications use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the protocol rather than a challenge to the original finding. Formal pre-data collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replications from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) in which the original authors had expressed concerns about the replication designs before data collection and only one of which was “statistically significant” (p < .05). Commenters suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these RP:P studies failed to replicate (Gilbert et al., 2016). We revised the replication protocols and received formal peer review prior to conducting new replications. We administered the RP:P and Revised protocols in multiple laboratories (Median number of laboratories per original study = 6.5; Range 3 to 9; Median total sample = 1279.5; Range 276 to 3512) for high-powered tests of each original finding with both protocols. Overall, Revised protocols produced similar effect sizes as RP:P protocols following the preregistered analysis plan (Δr = .002 or .014, depending on analytic approach). The median effect size for Revised protocols (r = .05) was similar to RP:P protocols (r = .04) and the original RP:P replications (r = .11), and smaller than the original studies (r = .37). The cumulative evidence of original study and three replication attempts suggests that effect sizes for all 10 (median r = .07; range .00 to .15) are 78% smaller on average than original findings (median r = .37; range .19 to .50), with very precisely estimated effects.


2020 ◽  
Author(s):  
Charles R. Ebersole ◽  
Brian A. Nosek ◽  
Mallory Kidwell ◽  
Nick Buttrick ◽  
Erica Baranski ◽  
...  

Replications in psychological science sometimes fail to reproduce prior findings. If replications use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the protocol rather than a challenge to the original finding. Formal pre-data collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replications from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) in which the original authors had expressed concerns about the replication designs before data collection and only one of which was “statistically significant” (p < .05). Commenters suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these RP:P studies failed to replicate (Gilbert et al., 2016). We revised the replication protocols and received formal peer review prior to conducting new replications. We administered the RP:P and Revised protocols in multiple laboratories (Median number of laboratories per original study = 6.5; Range 3 to 9; Median total sample = 1279.5; Range 276 to 3512) for high-powered tests of each original finding with both protocols. Overall, Revised protocols produced similar effect sizes as RP:P protocols following the preregistered analysis plan (Δr = .002 or .014, depending on analytic approach). The median effect size for Revised protocols (r = .05) was similar to RP:P protocols (r = .04) and the original RP:P replications (r = .11), and smaller than the original studies (r = .37). The cumulative evidence of original study and three replication attempts suggests that effect sizes for all 10 (median r = .07; range .00 to .15) are 78% smaller on average than original findings (median r = .37; range .19 to .50), with very precisely estimated effects.


Diagnosis ◽  
2018 ◽  
Vol 5 (4) ◽  
pp. 205-214 ◽  
Author(s):  
Matthew L. Rubinstein ◽  
Colleen S. Kraft ◽  
J. Scott Parrott

Background: Diagnostic test accuracy (DTA) systematic reviews (SRs) characterize a test’s potential for diagnostic quality and safety. However, interpreting DTA measures in the context of SRs is challenging. Further, some evidence grading methods (e.g. the Centers for Disease Control and Prevention, Division of Laboratory Systems, Laboratory Medicine Best Practices method) require determination of qualitative effect size ratings as a contributor to practice recommendations. This paper describes a recently developed effect size rating approach for assessing a DTA evidence base. Methods: A likelihood ratio scatter matrix plots positive and negative likelihood ratio pairings for DTA studies. Pairings are graphed as single point estimates with confidence intervals, positioned in one of four quadrants derived from established thresholds for test clinical validity. These quadrants support defensible judgments on “substantial”, “moderate”, or “minimal” effect size ratings for each plotted study. The approach is flexible in relation to a priori determinations of the relative clinical importance of false positive and false negative test results. Results and conclusions: This qualitative effect size rating approach was operationalized in a recent SR that assessed the effectiveness of test practices for the diagnosis of Clostridium difficile. The relevance of this approach to other methods of grading evidence, and to efforts to measure diagnostic quality and safety, is described. Limitations of the approach arise from the understanding that a diagnostic test is not an isolated element in the diagnostic process, but provides information in clinical context towards diagnostic quality and safety.
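
The quadrant logic admits a compact sketch: compute LR+ and LR− from sensitivity and specificity and place each study in a quadrant defined by clinical-validity thresholds. The thresholds used below (LR+ ≥ 10 and LR− ≤ 0.1) are common rules of thumb assumed for illustration, not necessarily the cut-offs adopted in the review, and the rating labels are likewise illustrative.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

def quadrant_rating(lr_pos, lr_neg, pos_threshold=10.0, neg_threshold=0.1):
    """Place a study in one of four quadrants of the LR scatter matrix.
    Thresholds are illustrative rules of thumb, not the review's cut-offs."""
    rule_in = lr_pos >= pos_threshold    # strong at confirming disease
    rule_out = lr_neg <= neg_threshold   # strong at excluding disease
    if rule_in and rule_out:
        return "rule-in and rule-out (substantial)"
    if rule_in or rule_out:
        return "rule-in or rule-out only (moderate)"
    return "neither (minimal)"

# Hypothetical DTA study: sensitivity 0.95, specificity 0.96.
lr_pos, lr_neg = likelihood_ratios(0.95, 0.96)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f} -> {quadrant_rating(lr_pos, lr_neg)}")
```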


2016 ◽  
Author(s):  
Frank Bosco ◽  
Joshua Carp ◽  
James G. Field ◽  
Hans IJzerman ◽  
Melissa Lewis ◽  
...  

Open Science Collaboration (in press). Maximizing the reproducibility of your research. In S. O. Lilienfeld & I. D. Waldman (Eds.), Psychological Science Under Scrutiny: Recent Challenges and Proposed Solutions. New York, NY: Wiley.


2021 ◽  
pp. 152-172
Author(s):  
R. Barker Bausell

The “mass” replications of multiple studies, some employing dozens of investigators distributed among myriad sites, are unique to the reproducibility movement. The most impressive of these initiatives was conducted by the Open Science Collaboration, directed by Brian Nosek, who recruited 270 investigators to participate in the replication of 100 psychological experiments via a very carefully structured, prespecified protocol that avoided questionable research practices. Just before this Herculean effort, two huge biotech firms (Amgen and Bayer HealthCare) conducted 53 and 67 preclinical replications, respectively, of promising published studies to ascertain which results were worth pursuing for commercial applications. Amazingly, in less than a 10-year period, a number of other diverse multistudy replications were also conducted, involving hundreds of effects. Among these were the three “Many Labs” multistudy replications based on the Open Science model (but also designed to ascertain whether potential confounders of the approach itself existed, such as differences in participant types, settings, and timing), replications of social science studies published in Science and Nature, experimental economics studies, and even self-reported replications ascertained from a survey. Somewhat surprisingly, the overall successful replication percentage for this diverse collection of 811 studies was 46%, mirroring the modeling results discussed in Chapter 3 and supporting John Ioannidis’s pejorative and often quoted conclusion that most scientific results are incorrect.


2020 ◽  
Vol 7 (4) ◽  
pp. 191795 ◽  
Author(s):  
Laura Schlingloff ◽  
Gergely Csibra ◽  
Denis Tatone

Hamlin et al. found in 2007 that preverbal infants displayed a preference for helpers over hinderers. The robustness of this finding, and the conditions under which infant sociomoral evaluation can be elicited, have since been debated. Here, we conducted a replication of the original study in which we tested 14- to 16-month-olds using a familiarization procedure with three-dimensional animated video stimuli. Unlike previous replication attempts, ours uniquely benefited from detailed procedural advice from Hamlin. In contrast with the original results, only 16 out of 32 infants (50%) in our study reached for the helper; thus, we were not able to replicate the findings. A possible reason for this failure is that infants' preference for prosocial agents may not be reliably elicited with the procedure and stimuli adopted. Alternatively, the effect size of infants' preference may be smaller than originally estimated. The study addresses ongoing methodological debates on the replicability of influential findings in infant cognition.
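
The headline result (16 of 32 infants reaching for the helper) can be examined with a simple binomial test against the .5 chance level of the two-choice design; the sketch below, using scipy, is illustrative only and not the authors' analysis.

```python
from scipy.stats import binomtest

# 16 of 32 infants reached for the helper; chance level for the two-choice task is .5.
result = binomtest(k=16, n=32, p=0.5, alternative="two-sided")
ci = result.proportion_ci(confidence_level=0.95)
print(f"observed proportion = {result.statistic:.2f}, two-sided p = {result.pvalue:.2f}")
print(f"95% CI for the helper-preference rate: [{ci.low:.2f}, {ci.high:.2f}]")
```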

