Evaluating and implementing block jackknife resampling Mendelian randomization to mitigate bias induced by overlapping samples
Abstract
Participant overlap has been thought to induce overfitting bias into Mendelian randomization (MR) and polygenic risk score (PRS) studies. This hinders potential research into many unique traits and disease outcomes from large-scale biobanks. Here, we evaluated a block jackknife resampling framework for genome-wide association studies (GWAS) and PRS construction to mitigate the influence of overfitting bias on MR analyses compared to alternative approaches, and implemented this study design in a causal inference setting using data from the UK Biobank.
We simulated PRS and MR under three scenarios: (1) using weighted SNP estimates from an external GWAS, (2) using weighted SNP estimates from an overlapping GWAS sample and (3) using a block jackknife resampling framework. Based on a conventional P-value threshold to derive genetic instruments for MR studies (P<5×10⁻⁸), our block-jackknifing PRS did not suffer from overfitting bias (mean R²=0.034) compared to the externally weighted PRS (mean R²=0.040). In contrast, genetic instruments derived from overlapping samples explained a higher proportion of variance (mean R²=0.048) compared to the externally derived score. The detrimental impact of overfitting bias became considerably larger when using a more liberal P-value threshold to construct PRS (e.g., P<0.05, mean R²=0.103), whereas estimates using the jackknife score remained robust to overfitting (mean R²=0.084).
In an applied setting, we examined (A) the effects of body mass index on circulating biomarkers and (B) the effect of childhood body size on levels of testosterone in adulthood using the methods described above. In the first applied analysis, the overlapping sample PRS and the block jackknife resampled PRS led to comparable effect sizes, whereas narrower confidence intervals were identified when using the overlapping sample instrument.
In the second example, through sex-stratified multivariable and bi-directional MR, we demonstrate that childhood body size indirectly leads to lower testosterone levels in adulthood in males, an effect mediated through adult body size.
Author summary
Using genetic variants as instrumental variables for risk factors, Mendelian randomization (MR) provides an approach to explore the genetically predicted effects of modifiable risk factors on disease which is robust to confounding and reverse causation. Genetic instrumental variables are conventionally selected from the results of genome-wide association studies conducted in an independent dataset whose sample does not overlap with the dataset being analysed using MR, because such overlap can lead to overfitting bias. However, overlap can often be challenging to avoid entirely, as such association studies are increasingly being performed by meta-analysing several biobanks to achieve the maximum power to detect variants with smaller effect sizes. Moreover, when investigating exposures and outcomes which only a single biobank has measured in sufficiently large samples, avoiding participant overlap requires splitting the study population into subgroups, which can limit statistical power. Block jackknife resampling MR provides a solution to conduct causal inference under these circumstances with the maximum statistical power while avoiding bias due to overlapping participants. In this study, we evaluated this study design using simulated datasets in comparison to MR using genetic variants discovered from an external dataset or from one with overlapping samples. We applied this approach using UK Biobank data to investigate the effect of body mass index on circulating biomarkers, as well as the causal relationship between childhood adiposity and testosterone levels in adulthood.
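The block jackknife resampling idea summarised above can be illustrated with a minimal sketch. It rests on several assumptions not taken from the text: the function names `marginal_gwas` and `block_jackknife_prs` are hypothetical, the number of blocks (10) is an arbitrary default, the GWAS is reduced to a per-SNP simple linear regression without covariates or linkage disequilibrium clumping, and the threshold |z| > 5.45 is a normal approximation to a two-sided P<5×10⁻⁸. The point it shows is that each participant's score is built only from SNP weights estimated in the other blocks, so no individual contributes to the weights used in their own score.

```python
import numpy as np


def marginal_gwas(G, y):
    """Per-SNP simple linear regression of y on each genotype column.

    Illustrative stand-in for a real GWAS (no covariates, no mixed model).
    Returns (beta, z): effect estimates and their z-statistics.
    """
    n = G.shape[0]
    Gc = G - G.mean(axis=0)              # centre genotypes
    yc = y - y.mean()                    # centre phenotype
    sxx = (Gc ** 2).sum(axis=0)          # per-SNP sum of squares
    beta = Gc.T @ yc / sxx               # least-squares slope per SNP
    resid_var = ((yc ** 2).sum() - beta ** 2 * sxx) / (n - 2)
    se = np.sqrt(resid_var / sxx)
    return beta, beta / se


def block_jackknife_prs(G, y, n_blocks=10, z_threshold=5.45, seed=0):
    """Build a PRS for every individual using weights from the other blocks.

    n_blocks and z_threshold are assumed defaults for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = G.shape[0]
    # Randomly partition individuals into roughly equal blocks.
    blocks = np.array_split(rng.permutation(n), n_blocks)
    prs = np.zeros(n)
    for held_out in blocks:
        # GWAS on everyone except the held-out block.
        train = np.setdiff1d(np.arange(n), held_out)
        beta, z = marginal_gwas(G[train], y[train])
        # Select instruments and weight them using the training blocks only.
        selected = np.abs(z) > z_threshold
        prs[held_out] = G[held_out][:, selected] @ beta[selected]
    return prs
```

Under these assumptions, calling `block_jackknife_prs(G, y)` with an individuals-by-SNPs genotype matrix `G` (coded 0/1/2) and a phenotype vector `y` returns one score per individual. The resulting score can then be used as the instrument in a downstream MR analysis of the same sample. A more liberal threshold such as P<0.05 corresponds to roughly `z_threshold=1.96`.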