Predicting replication outcomes in the Many Labs 2 study

2018 ◽  
Author(s):  
Eskil Forsell ◽  
Domenico Viganola ◽  
Thomas Pfeiffer ◽  
Johan Almenberg ◽  
Brad Wilson ◽  
...  

Understanding and improving reproducibility is crucial for scientific progress. Prediction markets and related methods of eliciting peer beliefs are promising tools for predicting replication outcomes. We invited researchers in the field of psychology to judge the replicability of 24 studies replicated in the large-scale Many Labs 2 project. We elicited peer beliefs in prediction markets and surveys about two replication success metrics: the probability that the replication yields a statistically significant effect in the original direction (p < 0.001), and the relative effect size of the replication. The prediction markets correctly predicted 75% of the replication outcomes and were highly correlated with the replication outcomes. Survey beliefs were also significantly correlated with replication outcomes, but had higher prediction errors. The prediction markets for relative effect sizes attracted little trading and thus did not work well. The survey beliefs about relative effect sizes performed better and were significantly correlated with observed relative effect sizes. These results suggest that replication outcomes can be predicted and that the elicitation of peer beliefs can increase our knowledge about scientific reproducibility and the dynamics of hypothesis testing.
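A minimal sketch of how such market-elicited beliefs can be scored against binary replication outcomes: prediction accuracy at the 0.5 threshold, belief-outcome correlation, and a Brier score as the prediction error. The arrays below are hypothetical placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

market_prices = np.array([0.85, 0.30, 0.65, 0.15, 0.70])  # elicited P(replication)
outcomes = np.array([1, 0, 1, 0, 0])  # 1 = significant effect in original direction

accuracy = np.mean((market_prices > 0.5) == outcomes)  # share predicted correctly
r, _ = stats.pearsonr(market_prices, outcomes)         # belief-outcome correlation
brier = np.mean((market_prices - outcomes) ** 2)       # mean squared prediction error

print(f"accuracy={accuracy:.2f}, r={r:.2f}, Brier={brier:.3f}")
```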

2018 ◽  
Author(s):  
Colin Camerer ◽  
Anna Dreber ◽  
Felix Holzmeister ◽  
Teck Hua Ho ◽  
Juergen Huber ◽  
...  

Being able to replicate scientific findings is crucial for scientific progress. We replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science between 2010 and 2015. The replications follow analysis plans reviewed by the original authors and pre-registered prior to the replications. The replications are high powered, with sample sizes on average about five times higher than in the original studies. We find a significant effect in the same direction as the original study for 13 (62%) studies, and the effect size of the replications is on average about 50% of the original effect size. Replicability varies between 12 (57%) and 14 (67%) studies for complementary replicability indicators. Consistent with these results, the estimated true positive rate is 67% in a Bayesian analysis. The relative effect size of true positives is estimated to be 71%, suggesting that both false positives and inflated effect sizes of true positives contribute to imperfect reproducibility. Furthermore, we find that peer beliefs of replicability are strongly related to replicability, suggesting that the research community could predict which results would replicate and that failures to replicate were not the result of chance alone.
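The two headline replication indicators described above reduce to simple computations; this sketch uses hypothetical per-study values rather than the project's data.

```python
import numpy as np

def replicates(p_rep, sign_orig, sign_rep, alpha=0.05):
    """Binary indicator: replication is significant and same-signed as the original."""
    return (p_rep < alpha) and (sign_orig == sign_rep)

def relative_effect_size(es_rep, es_orig):
    """Replication effect size as a fraction of the original effect size."""
    return es_rep / es_orig

print(replicates(p_rep=0.003, sign_orig=+1, sign_rep=+1))  # True
print(relative_effect_size(es_rep=0.21, es_orig=0.42))     # 0.5
```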


2019 ◽  
Author(s):  
Adam Altmejd ◽  
Anna Dreber ◽  
Eskil Forsell ◽  
Teck Hua Ho ◽  
Juergen Huber ◽  
...  

We measure how accurately replication of experimental results can be predicted by a black-box statistical model. With data from four large-scale replication projects in experimental psychology and economics, and techniques from machine learning, we train a predictive model and study which variables drive predictable replication. The model predicts binary replication with a cross-validated accuracy rate of 70% (AUC of 0.79) and relative effect size with a Spearman ρ of 0.38. The accuracy level is similar to the market-aggregated beliefs of peer scientists (Camerer et al., 2016; Dreber et al., 2015). The predictive power is validated in a pre-registered out-of-sample test on the outcomes of Camerer et al. (2018b), where 71% (AUC of 0.73) of replications are predicted correctly and effect size correlations amount to ρ = 0.25. Basic features such as the sample and effect sizes of the original papers, and whether reported effects are single-variable main effects or two-variable interactions, are predictive of successful replication. The models presented in this paper are simple tools for producing cheap, prognostic replicability metrics. These models could be useful in institutionalizing the evaluation of new findings and in guiding resources toward those direct replications that are likely to be most informative.
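A hedged sketch of this kind of black-box pipeline: a classifier trained on simple paper-level features, scored with cross-validated accuracy and AUC, plus a Spearman correlation for a continuous target. The features and randomly generated labels are illustrative placeholders, not the paper's data or feature set.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 120
X = np.column_stack([
    rng.integers(20, 500, n),    # original-study sample size
    rng.uniform(0.1, 1.0, n),    # original effect size
    rng.integers(0, 2, n),       # 1 = two-variable interaction, 0 = main effect
])
y = rng.integers(0, 2, n)        # 1 = replicated (placeholder labels)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

accuracy = np.mean((proba > 0.5) == y)
auc = roc_auc_score(y, proba)
rho, _ = spearmanr(proba, y)     # the paper correlates against relative effect size
print(f"CV accuracy={accuracy:.2f}, AUC={auc:.2f}, Spearman rho={rho:.2f}")
```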


2020 ◽  
Author(s):  
Alexander Bowring ◽  
Fabian Telschow ◽  
Armin Schwartzman ◽  
Thomas E. Nichols

Abstract. Current statistical inference methods for task-fMRI suffer from two fundamental limitations. First, the focus is solely on detection of non-zero signal or signal change, a problem that is exacerbated for large-scale studies (e.g. UK Biobank, N = 40,000+) where the 'null hypothesis fallacy' causes even trivial effects to be declared significant. Second, for any sample size, widely used cluster inference methods only indicate regions where a null hypothesis can be rejected, without providing any notion of spatial uncertainty about the activation. In this work, we address these issues by developing spatial Confidence Sets (CSs) on clusters found in thresholded Cohen's d effect size images. We produce an upper and lower CS to make confidence statements about brain regions where Cohen's d effect sizes have exceeded and fallen short of a non-zero threshold, respectively. The CSs convey information about the magnitude and reliability of effect sizes that is usually given separately in a t-statistic map and an effect estimate map. We expand the theory developed in our previous work on CSs for %BOLD change effect maps (Bowring et al., 2019) using recent results from the bootstrapping literature. By assessing the empirical coverage with 2D and 3D Monte Carlo simulations resembling fMRI data, we find our method is accurate in sample sizes as low as N = 60. We compute Cohen's d CSs for the Human Connectome Project working memory task-fMRI data, illustrating the brain regions with a reliable Cohen's d response for a given threshold. By comparing the CSs with results obtained from traditional voxelwise statistical inference, we highlight the improvement in activation localization that can be gained with the Confidence Sets.
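A simplified 1D illustration of the Confidence Set idea under strong assumptions: bootstrap the maximum deviation of the Cohen's d estimate to obtain a simultaneous margin, then form upper and lower sets around a threshold c. This is a didactic sketch only; the authors' method handles spatial structure and boundary estimation far more carefully.

```python
import numpy as np

rng = np.random.default_rng(1)
N, V = 60, 200                                # subjects, "voxels"
signal = np.where(np.arange(V) % 50 < 10, 0.8, 0.0)
data = signal + rng.standard_normal((N, V))   # one-sample design

d_hat = data.mean(0) / data.std(0, ddof=1)    # voxelwise Cohen's d

# Bootstrap the sup-norm error of d_hat to obtain a simultaneous margin.
B = 500
sups = np.empty(B)
for b in range(B):
    boot = data[rng.integers(0, N, N)]
    d_b = boot.mean(0) / boot.std(0, ddof=1)
    sups[b] = np.max(np.abs(d_b - d_hat))
margin = np.quantile(sups, 0.95)

c = 0.5                                   # effect size threshold of interest
upper_set = d_hat - margin > c            # confidently above the threshold
lower_set = d_hat + margin > c            # cannot be ruled out as above threshold
print(f"upper set: {upper_set.sum()} voxels, lower set: {lower_set.sum()} voxels")
```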


2019 ◽  
Author(s):  
Han Bossier ◽  
Thomas E. Nichols ◽  
Beatrijs Moerkerke

Abstract. Scientific progress is based on the ability to compare opposing theories and thereby develop consensus among existing hypotheses or create new ones. We argue that data aggregation (i.e., combining data across studies or research groups) is an important tool in this process for neuroscience. An important prerequisite is the ability to directly compare fMRI results across studies. In this paper, we discuss how an observed effect size in an fMRI data analysis can be transformed into a standardized effect size. We demonstrate how standardized effect sizes enable direct comparison and data aggregation across studies. Furthermore, we discuss the influence of key parameters in the design of an fMRI experiment (such as the number of scans and the sample size) on the statistical properties of standardized effect sizes. In the second part of the paper, we give an overview of two approaches to aggregating fMRI results over studies. The first extends the two-level general linear model approach typically used in individual fMRI studies with a third level. This requires the parameter estimates of the group models from each study together with estimated variances and meta-data. Unfortunately, there is a risk of unit mismatches when the primary studies use different scales to measure the BOLD response. To circumvent this, it is possible to aggregate (unitless) standardized effect sizes, which can be derived from summary statistics. We discuss a general model for aggregating these and different approaches to dealing with between-study heterogeneity. Finally, we hope to further promote the use of standardized effect sizes in fMRI research.
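A sketch of the two steps discussed above, assuming a one-sample group design: convert each study's group-level t-statistic to a unitless standardized effect size, then pool across studies with a DerSimonian-Laird random-effects model. The study inputs are hypothetical.

```python
import numpy as np

def cohens_d_from_t(t, n):
    """One-sample design: d = t / sqrt(n)."""
    return t / np.sqrt(n)

def dl_random_effects(d, var):
    """DerSimonian-Laird pooled effect and between-study variance tau^2."""
    w = 1 / var
    d_fixed = np.sum(w * d) / np.sum(w)
    Q = np.sum(w * (d - d_fixed) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(d) - 1)) / c)
    w_star = 1 / (var + tau2)
    return np.sum(w_star * d) / np.sum(w_star), tau2

t_stats = np.array([3.1, 2.4, 4.0])          # group-level t per study
n_subj = np.array([20, 35, 28])              # sample size per study
d = cohens_d_from_t(t_stats, n_subj)
var_d = 1 / n_subj + d ** 2 / (2 * n_subj)   # approximate sampling variance of d
pooled, tau2 = dl_random_effects(d, var_d)
print(f"pooled d={pooled:.2f}, tau^2={tau2:.3f}")
```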


Author(s):  
Scott B. Morris ◽  
Arash Shokri

To understand and communicate research findings, it is important for researchers to consider two types of information provided by research results: the magnitude of the effect and the degree of uncertainty in the outcome. Statistical significance tests have long served as the mainstream method for statistical inference. However, the widespread misinterpretation and misuse of significance tests has led critics to question their usefulness in evaluating research findings and to raise concerns about the far-reaching effects of this practice on scientific progress. An alternative approach involves reporting and interpreting measures of effect size along with confidence intervals. An effect size is an indicator of the magnitude and direction of an observed effect. Effect size statistics have been developed to represent a wide range of research questions, including indicators of the mean difference between groups, the relative odds of an event, and the degree of correlation among variables. Effect sizes play a key role in evaluating practical significance, conducting power analysis, and conducting meta-analysis. While effect sizes summarize the magnitude of an effect, confidence intervals represent the degree of uncertainty in the result. By presenting a range of plausible alternative values that might have occurred due to sampling error, confidence intervals provide an intuitive indicator of how strongly researchers should rely on the results of a single study.
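A short sketch pairing the two quantities the passage recommends reporting: a two-sample Cohen's d and a normal-approximation confidence interval around it. The data are hypothetical.

```python
import numpy as np
from scipy import stats

g1 = np.array([5.1, 4.8, 6.0, 5.5, 4.9, 5.7])   # treatment group scores
g2 = np.array([4.2, 4.6, 4.1, 4.9, 4.4, 4.3])   # control group scores

n1, n2 = len(g1), len(g2)
sp = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1))
             / (n1 + n2 - 2))                   # pooled standard deviation
d = (g1.mean() - g2.mean()) / sp                # Cohen's d

# Normal-approximation standard error and 95% CI for d.
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
z = stats.norm.ppf(0.975)
print(f"d={d:.2f}, 95% CI [{d - z * se_d:.2f}, {d + z * se_d:.2f}]")
```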


2016 ◽  
Vol 12 (2) ◽  
pp. 210-220 ◽  
Author(s):  
Kenes Beketayev ◽  
Mark A. Runco

Divergent thinking (DT) tests are useful for the assessment of creative potential. This article reports a semantics-based algorithmic (SBA) method for assessing DT. The algorithm is fully automated: examinees receive DT questions on a computer or mobile device, and their ideas are immediately compared with norms and semantic networks. This investigation compared the scores generated by the SBA method with traditional methods of scoring DT (i.e., fluency, originality, and flexibility). Data were collected from 250 examinees using the "Many Uses Test" of DT. The most important finding involved the flexibility scores from the two scoring methods. This was critical because semantic networks are based on conceptual structures, and thus a high SBA score should be highly correlated with the traditional flexibility score from DT tests. Results confirmed this correlation (r = .74). This supports the use of algorithmic scoring of DT. The near-immediate computation time required by the SBA method may make it the method of choice, especially for moderate- and large-scale DT assessment investigations. Correlations between SBA scores and GPA were not statistically significant, providing evidence of the discriminant and construct validity of SBA scores. Limitations of the present study and directions for future research are offered.
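One way such an algorithmic score could be built is by averaging semantic distances among an examinee's responses; a higher mean distance suggests more conceptual categories, which is what flexibility scoring rewards. The word vectors below are random placeholders standing in for a real semantic network or embedding model, so this is a hedged illustration rather than the SBA method itself.

```python
import numpy as np

rng = np.random.default_rng(2)
# Placeholder semantic vectors; a real system would use a trained model.
vectors = {w: rng.standard_normal(50)
           for w in ["doorstop", "weapon", "paperweight", "hammer", "art"]}

def sba_like_score(responses):
    """Mean pairwise cosine distance among responses (higher = more diverse)."""
    vs = [vectors[r] / np.linalg.norm(vectors[r]) for r in responses]
    dists = [1 - vs[i] @ vs[j]
             for i in range(len(vs)) for j in range(i + 1, len(vs))]
    return float(np.mean(dists))

print(sba_like_score(["doorstop", "weapon", "art"]))
```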


2020 ◽  
Author(s):  
Malte Friese ◽  
Julius Frankenbach

Science depends on trustworthy evidence. A biased scientific record is therefore of questionable value: it impedes scientific progress, and the public receives advice on the basis of unreliable evidence that can have far-reaching detrimental consequences. Meta-analysis is a valid and reliable technique for summarizing research evidence. However, meta-analytic effect size estimates may themselves be biased, threatening the validity and usefulness of meta-analyses for promoting scientific progress. Here, we offer a large-scale simulation study to elucidate how p-hacking and publication bias distort meta-analytic effect size estimates under a broad array of circumstances reflecting the realities of a variety of research areas. The results revealed that, first, very high levels of publication bias can severely distort the cumulative evidence. Second, p-hacking and publication bias interact: at relatively high and low levels of publication bias, p-hacking does comparatively little harm, but at medium levels of publication bias, p-hacking can contribute considerably to bias, especially when the true effects are very small or approach zero. Third, p-hacking can severely increase the rate of false positives. A key implication is that, in addition to preventing p-hacking, policies in research institutions, funding agencies, and scientific journals need to make the prevention of publication bias a top priority to ensure a trustworthy base of evidence.
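A minimal sketch of the simulation logic: generate many primary studies around a small true effect, "publish" nonsignificant results only occasionally, and compare the naive mean of published estimates with the unfiltered mean. The parameters are illustrative and far simpler than the full simulation design.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_d, n, k = 0.1, 40, 2000        # true effect, per-group n, number of studies
pub_prob_nonsig = 0.1               # chance a nonsignificant study is published

d_hat = true_d + rng.standard_normal(k) * np.sqrt(2 / n)  # two-group d estimates
t = d_hat * np.sqrt(n / 2)
p = 2 * stats.t.sf(np.abs(t), df=2 * n - 2)

published = (p < 0.05) | (rng.random(k) < pub_prob_nonsig)
print(f"true d={true_d}, all studies={d_hat.mean():.3f}, "
      f"published only={d_hat[published].mean():.3f}")
```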


Methodology ◽  
2019 ◽  
Vol 15 (3) ◽  
pp. 97-105
Author(s):  
Rodrigo Ferrer ◽  
Antonio Pardo

Abstract. In a recent paper, Ferrer and Pardo (2014) tested several distribution-based methods designed to assess when test scores obtained before and after an intervention reflect a statistically reliable change. However, we still do not know how these methods perform with respect to false negatives. For this purpose, we simulated change scenarios (different effect sizes in a pre-post-test design) with distributions of different shapes and with different sample sizes. For each simulated scenario, we generated 1,000 samples. In each sample, we recorded the false-negative rate of the five distribution-based methods with the best performance in terms of false positives. Our results reveal unacceptable rates of false negatives even for very large effects, ranging from 31.8% in an optimistic scenario (effect size of 2.0 and a normal distribution) to 99.9% in the worst scenario (effect size of 0.2 and a highly skewed distribution). Our results therefore suggest that the widely used distribution-based methods must be applied with caution in clinical contexts, because they require huge effect sizes to detect a true change. We also offer some considerations regarding effect sizes and the commonly used cut-off points that allow for more precise estimates.
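For concreteness, one widely used distribution-based method is the Jacobson-Truax Reliable Change Index; the sketch below simulates its false-negative rate under a true pre-post change, with illustrative reliability and effect size values rather than the paper's exact scenarios.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_cases = 1000, 50
sd, rxx = 1.0, 0.8                        # score SD and test-retest reliability
true_effect = 2.0 * sd                    # a very large true change

sem = sd * np.sqrt(1 - rxx)               # standard error of measurement
s_diff = np.sqrt(2) * sem                 # SE of the pre-post difference score

fn_rate = 0.0
for _ in range(n_samples):
    true_pre = rng.standard_normal(n_cases) * sd
    pre = true_pre + rng.standard_normal(n_cases) * sem
    post = true_pre + true_effect + rng.standard_normal(n_cases) * sem
    rci = (post - pre) / s_diff           # Reliable Change Index per case
    fn_rate += np.mean(np.abs(rci) < 1.96)  # true change not flagged as reliable
print(f"false-negative rate ≈ {fn_rate / n_samples:.3f}")
```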


2019 ◽  
Author(s):  
Amanda Kvarven ◽  
Eirik Strømland ◽  
Magnus Johannesson

Andrews & Kasy (2019) propose an approach for adjusting effect sizes in meta-analysis for publication bias. We use the Andrews-Kasy estimator to adjust the result of 15 meta-analyses and compare the adjusted results to 15 large-scale multiple labs replication studies estimating the same effects. The pre-registered replications provide precisely estimated effect sizes, which do not suffer from publication bias. The Andrews-Kasy approach leads to a moderate reduction of the inflated effect sizes in the meta-analyses. However, the approach still overestimates effect sizes by a factor of about two or more and has an estimated false positive rate of between 57% and 100%.


1984 ◽  
Vol 16 (1-2) ◽  
pp. 281-295 ◽  
Author(s):  
Donald C Gordon

Large-scale tidal power development in the Bay of Fundy has been given serious consideration for over 60 years. There has been a long history of productive interaction between environmental scientists and engineers during the many feasibility studies undertaken. Until recently, tidal power proposals were dropped on economic grounds. However, large-scale development in the upper reaches of the Bay of Fundy now appears to be economically viable, and a pre-commitment design program is highly likely in the near future. A large number of basic scientific research studies have been and are being conducted by government and university scientists. Likely environmental impacts have been examined by scientists and engineers together in a preliminary fashion on several occasions. A full environmental assessment will be conducted before a final decision is made, and the results will definitely influence the outcome.

