LDpred-funct: incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets

2018 ◽  
Author(s):  
Carla Márquez-Luna ◽  
Steven Gazal ◽  
Po-Ru Loh ◽  
Samuel S. Kim ◽  
Nicholas Furlotte ◽  
...  

Abstract: Genetic variants in functional regions of the genome are enriched for complex trait heritability. Here, we introduce a new method for polygenic prediction, LDpred-funct, that leverages trait-specific functional priors to increase prediction accuracy. We fit priors using the recently developed baseline-LD model, which includes coding, conserved, regulatory and LD-related annotations. We analytically estimate posterior mean causal effect sizes and then use cross-validation to regularize these estimates, improving prediction accuracy for sparse architectures. LDpred-funct attained higher prediction accuracy than other polygenic prediction methods in simulations using real genotypes. We applied LDpred-funct to predict 21 highly heritable traits in the UK Biobank. We used association statistics from British-ancestry samples as training data (avg N = 373K) and samples of other European ancestries as validation data (avg N = 22K), to minimize confounding. LDpred-funct attained a +4.6% relative improvement in average prediction accuracy (avg prediction R² = 0.144; highest R² = 0.413 for height) compared to SBayesR (the best method that does not incorporate functional information). For height, meta-analyzing training data from UK Biobank and 23andMe cohorts (total N = 1107K; higher heritability in UK Biobank cohort) increased prediction R² to 0.431. Our results show that incorporating functional priors improves polygenic prediction accuracy, consistent with the functional architecture of complex traits.
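
As a rough illustration of how a functional prior can change an analytically computed posterior mean, the sketch below applies normal-normal shrinkage to a marginal effect estimate. It is a simplified illustration and not the LDpred-funct implementation: it assumes unlinked SNPs (no LD), standardized genotypes and phenotype so that the sampling variance is about 1/N, and a per-SNP prior variance supplied by the caller. The function name and the example effect size and prior variances are hypothetical; only N = 373K is taken from the abstract.

```python
def posterior_mean_effect(beta_hat, prior_variance, n_samples):
    """Posterior mean of beta when beta ~ N(0, prior_variance) and
    beta_hat | beta ~ N(beta, 1 / n_samples). Assumes no LD between SNPs."""
    sampling_variance = 1.0 / n_samples
    shrinkage = prior_variance / (prior_variance + sampling_variance)
    return shrinkage * beta_hat

# Same marginal estimate under two hypothetical prior variances:
# a SNP in an enriched annotation versus a SNP in a depleted one.
n = 373_000          # average training sample size quoted in the abstract
beta_hat = 0.005     # made-up marginal effect estimate
enriched = posterior_mean_effect(beta_hat, prior_variance=1e-5, n_samples=n)
depleted = posterior_mean_effect(beta_hat, prior_variance=1e-7, n_samples=n)
print(f"enriched prior: {enriched:.6f}, depleted prior: {depleted:.6f}")
```

Under these assumptions, the SNP with the larger prior variance keeps most of its estimated effect, while the SNP with the smaller prior variance is shrunk strongly toward zero.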


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Carla Márquez-Luna ◽  
Steven Gazal ◽  
Po-Ru Loh ◽  
Samuel S. Kim ◽  
Nicholas Furlotte ◽  
...  

Abstract: Polygenic risk prediction is a widely investigated topic because of its promising clinical applications. Genetic variants in functional regions of the genome are enriched for complex trait heritability. Here, we introduce a method for polygenic prediction, LDpred-funct, that leverages trait-specific functional priors to increase prediction accuracy. We fit priors using the recently developed baseline-LD model, including coding, conserved, regulatory, and LD-related annotations. We analytically estimate posterior mean causal effect sizes and then use cross-validation to regularize these estimates, improving prediction accuracy for sparse architectures. We applied LDpred-funct to predict 21 highly heritable traits in the UK Biobank (avg N = 373K as training data). LDpred-funct attained a +4.6% relative improvement in average prediction accuracy (avg prediction R² = 0.144; highest R² = 0.413 for height) compared to SBayesR (the best method that does not incorporate functional information). For height, meta-analyzing training data from UK Biobank and 23andMe cohorts (N = 1107K) increased prediction R² to 0.431. Our results show that incorporating functional priors improves polygenic prediction accuracy, consistent with the functional architecture of complex traits.



2021 ◽  
Author(s):  
Omer Weissbrod ◽  
Masahiro Kanai ◽  
Huwenbo Shi ◽  
Steven Gazal ◽  
Wouter Peyrot ◽  
...  

Abstract: Polygenic risk scores (PRS) based on European training data suffer reduced accuracy in non-European target populations, exacerbating health disparities. This loss of accuracy predominantly stems from LD differences, MAF differences (including population-specific SNPs), and/or causal effect size differences. Here, we propose PolyPred, a method that improves trans-ethnic polygenic prediction by combining two complementary predictors: a new predictor that leverages functionally informed fine-mapping to estimate causal effects (instead of tagging effects), addressing LD differences; and BOLT-LMM, a published predictor. In the special case where a large training sample is available in the non-European target population (or a closely related population), we propose PolyPred+, which further incorporates the non-European training data, addressing MAF differences and causal effect size differences. We applied PolyPred to 49 diseases and complex traits in 4 UK Biobank populations using UK Biobank British training data (average N = 325K), and observed statistically significant average relative improvements in prediction accuracy vs. BOLT-LMM ranging from +7% in South Asians to +32% in Africans (and vs. LD-pruning + P-value thresholding (P+T) ranging from +77% to +164%), consistent with simulations. We applied PolyPred+ to 23 diseases and complex traits in UK Biobank East Asians using both UK Biobank British (average N = 325K) and Biobank Japan (average N = 124K) training data, and observed statistically significant average relative improvements in prediction accuracy of +24% vs. BOLT-LMM and +12% vs. PolyPred. In conclusion, PolyPred and PolyPred+ improve trans-ethnic polygenic prediction accuracy, ameliorating health disparities.



2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Xuan Zhou ◽  
S. Hong Lee

Abstract: Complementary to the genome, the concept of the exposome has been proposed to capture the totality of human environmental exposures. While there has been some recent progress on the construction of the exposome, few tools exist that can integrate the genome and exposome for complex trait analyses. Here we propose a linear mixed model approach to bridge this gap, which jointly models the random effects of the two omics layers on phenotypes of complex traits. We illustrate our approach using traits from the UK Biobank (e.g., BMI and height for N ~ 35,000) with a small fraction of the exposome that comprises 28 lifestyle factors. The joint model of the genome and exposome explains substantially more phenotypic variance and significantly improves phenotypic prediction accuracy, compared to the model based on the genome alone. The additional phenotypic variance captured by the exposome includes its additive effects as well as non-additive effects such as genome–exposome (gxe) and exposome–exposome (exe) interactions. For example, 19% of variation in BMI is explained by additive effects of the genome, while an additional 7.2% is explained by additive effects of the exposome, 1.9% by exe interactions and 4.5% by gxe interactions. Correspondingly, the prediction accuracy for BMI, computed using Pearson’s correlation between the observed and predicted phenotypes, improves from 0.15 (based on the genome alone) to 0.35 (based on the genome and exposome). We also show, using established theories, that integrating genomic and exposomic data can be an effective way of attaining a clinically meaningful level of prediction accuracy for disease traits. In conclusion, the genomic and exposomic effects can contribute to phenotypic variation via their latent relationships, i.e. genome-exposome correlation, and gxe and exe interactions, and modelling these effects has the potential to improve phenotypic prediction accuracy and thus holds great promise for future clinical practice.
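
The BMI figures quoted in this abstract can be tallied with a few lines of arithmetic. The sketch below only sums the four components that the abstract lists (it does not model the genome-exposome correlation, which the abstract mentions but does not break out) and squares the two reported Pearson correlations. The variable names are illustrative.

```python
# Percent of BMI variance attributed to each component in the abstract.
bmi_variance_components = {
    "genome_additive": 19.0,
    "exposome_additive": 7.2,
    "exe_interaction": 1.9,
    "gxe_interaction": 4.5,
}

total_listed = sum(bmi_variance_components.values())
exposome_related = total_listed - bmi_variance_components["genome_additive"]
print(f"listed components total: {total_listed:.1f}%")
print(f"added beyond genome-only additive effects: {exposome_related:.1f}%")

# Squared correlation between observed and predicted phenotypes,
# for the genome-only and the joint model.
r_genome, r_joint = 0.15, 0.35
print(f"r^2 genome only: {r_genome ** 2:.4f}, r^2 joint: {r_joint ** 2:.4f}")
```

The listed components sum to 32.6% of BMI variance, of which 13.6 percentage points come from exposome-related terms.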



Author(s):  
Armin P. Schoech ◽  
Omer Weissbrod ◽  
Luke J. O’Connor ◽  
Nick Patterson ◽  
Huwenbo Shi ◽  
...  

Abstract: Most models of complex trait genetic architecture assume that signed causal effect sizes of each SNP (defined with respect to the minor allele) are uncorrelated with those of nearby SNPs, but it is currently unknown whether this is the case. We develop a new method, autocorrelation LD regression (ACLR), for estimating the genome-wide autocorrelation of causal minor allele effect sizes as a function of genomic distance. Our method estimates these autocorrelations by regressing the products of summary statistics on distance-dependent LD scores. We determined that ACLR robustly assesses the presence or absence of nonzero autocorrelation, producing unbiased estimates with well-calibrated standard errors in null simulations regardless of genetic architecture; if true autocorrelation is nonzero, ACLR correctly detects its sign, although estimates of the autocorrelation magnitude are susceptible to bias in cases of certain genetic architectures. We applied ACLR to 31 diseases and complex traits from the UK Biobank (average N = 331K), meta-analyzing results across traits. We determined that autocorrelations were significantly negative at distances of 1-50 bp (P = 8 × 10⁻⁶, point estimate −0.35 ± 0.08) and 50-100 bp (P = 2 × 10⁻³, point estimate −0.33 ± 0.11). We show that the autocorrelation is primarily driven by pairs of SNPs in positive LD, which is consistent with the expectation that linked SNPs with opposite effects are less impacted by natural selection. Our findings suggest that this mechanism broadly affects complex trait genetic architectures, and we discuss implications for association mapping, heritability estimation, and genetic risk prediction.



2021 ◽  
Author(s):  
Ildar I Sadreev ◽  
Benjamin L Elsworth ◽  
Ruth E Mitchell ◽  
Lavinia Paternoster ◽  
Eleanor Sanderson ◽  
...  

We performed GWAS on 2514 complex traits from the UK Biobank using a linear mixed model, identifying 40,620 independent significant associations (p < 5 × 10⁻⁸). We estimate that winner's curse incurs substantial overestimation of effect sizes in a mean of 35% of discovered associations per trait. We use these results to estimate that the polygenicity of most complex traits is below 10,000 common causal variants. We evaluated the impact of winner's curse on causal effect estimation and hypothesis testing in Mendelian randomization analyses. We show that winner's curse substantially amplifies the magnitude of weak instrument bias, though any inflation of false discovery rates tends to be low or modest. We designed a process of pseudo-replication within the UK Biobank data to generate GWAS estimates that minimise bias in MR studies using these data. Our resource is integrated into the OpenGWAS platform and enables a convenient framework for researchers to minimise bias or maximise precision of causal effect estimates.
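
Winner's curse, as discussed in this abstract, can be reproduced in a toy simulation: when effect estimates are kept only if they pass a significance threshold, the surviving estimates are biased away from zero. The sketch below is illustrative only and is not the authors' procedure. The true effect, standard error, and function name are made-up assumptions; the z-score cutoff of 5.45 approximates a two-sided p of 5 × 10⁻⁸.

```python
import random
import statistics

def simulate_winners_curse(true_beta=0.04, se=0.01, z_threshold=5.45,
                           n_sims=100_000, seed=0):
    """Draw noisy estimates of one true effect and average only those
    that pass the significance threshold (hypothetical toy setup)."""
    rng = random.Random(seed)
    selected = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_beta, se)      # estimate = truth + sampling noise
        if abs(estimate / se) > z_threshold:     # keep "genome-wide significant" hits
            selected.append(estimate)
    return statistics.mean(selected), len(selected)

mean_selected, n_selected = simulate_winners_curse()
print(f"true effect: 0.04, mean of significant estimates: {mean_selected:.4f} "
      f"({n_selected} of 100000 simulations passed)")
```

Because every retained estimate must exceed 5.45 standard errors in magnitude, the average of the retained estimates is necessarily larger than the true effect in this setup.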



Author(s):  
Xuan Zhou ◽  
S. Hong Lee

Abstract: Complementary to the genome, the concept of the exposome has been proposed to capture the totality of human environmental exposures. While there has been some recent progress on the construction of the exposome, few tools exist that can integrate the genome and exposome for complex trait analyses. Here we propose a linear mixed model approach to bridge this gap, which jointly models the random effects of the two omics layers on phenotypes of complex traits. We illustrate our approach using traits from the UK Biobank (e.g., BMI and height for N ~ 40,000) with a small fraction of the exposome that comprises 28 lifestyle factors. The joint model of the genome and exposome explains substantially more phenotypic variance and significantly improves phenotypic prediction accuracy, compared to the model based on the genome alone. The additional phenotypic variance captured by the exposome includes its additive effects as well as non-additive effects such as genome-exposome (gxe) and exposome-exposome (exe) interactions. For example, 19% of variation in BMI is explained by additive effects of the genome, while an additional 7.2% is explained by additive effects of the exposome, 1.9% by exe interactions and 4.5% by gxe interactions. Correspondingly, the prediction accuracy for BMI, computed using Pearson’s correlation between the observed and predicted phenotypes, improves from 0.15 (based on the genome alone) to 0.35 (based on the genome and exposome). We also show, using established theories, that integrating genomic and exposomic data is essential to attaining a clinically meaningful level of prediction accuracy for disease traits. In conclusion, the genomic and exposomic effects can contribute to phenotypic variation via their latent relationships, i.e. genome-exposome correlation, and gxe and exe interactions, and modelling these effects has great potential to improve phenotypic prediction accuracy and thus holds great promise for future clinical practice.



2019 ◽  
Author(s):  
Arunabha Majumdar ◽  
Claudia Giambartolomei ◽  
Na Cai ◽  
Tanushree Haldar ◽  
Tommer Schwarz ◽  
...  

Abstract: Genetic predisposition for complex traits often acts through multiple tissues at different time points during development. As a simple example, the genetic predisposition for obesity could be manifested either through inherited variants that control metabolism through regulation of genes expressed in the brain, or that control fat storage through dysregulation of genes expressed in adipose tissue, or both. Here we describe a statistical approach that leverages tissue-specific expression quantitative trait loci (eQTLs) corresponding to tissue-specific genes to prioritize a relevant tissue underlying the genetic predisposition of a given individual for a complex trait. Unlike existing approaches that prioritize relevant tissues for the trait in the population, our approach probabilistically quantifies the tissue-wise genetic contribution to the trait for a given individual. We hypothesize that for a subgroup of individuals the genetic contribution to the trait can be mediated primarily through a specific tissue. Through simulations using the UK Biobank, we show that our approach can predict the relevant tissue accurately and can cluster individuals according to their tissue-specific genetic architecture. We analyze body mass index (BMI) and waist-to-hip ratio adjusted for BMI (WHRadjBMI) in the UK Biobank to identify subgroups of individuals whose genetic predisposition acts primarily through brain versus adipose tissue, and adipose versus muscle tissue, respectively. Notably, we find that these individuals have specific phenotypic features beyond BMI and WHRadjBMI that distinguish them from random individuals in the data, suggesting biological effects of tissue-specific genetic contribution for these traits.



2020 ◽  
Vol 10 (12) ◽  
pp. 4599-4613
Author(s):  
Fabio Morgante ◽  
Wen Huang ◽  
Peter Sørensen ◽  
Christian Maltecca ◽  
Trudy F. C. Mackay

The ability to accurately predict complex trait phenotypes from genetic and genomic data is critical for the implementation of personalized medicine and precision agriculture; however, prediction accuracy for most complex traits is currently low. Here, we used data on whole genome sequences, deep RNA sequencing, and high quality phenotypes for three quantitative traits in the ∼200 inbred lines of the Drosophila melanogaster Genetic Reference Panel (DGRP) to compare the prediction accuracies of gene expression and genotypes. We found that expression levels (r = 0.28 and 0.38, for females and males, respectively) provided higher prediction accuracy than genotypes (r = 0.07 and 0.15, for females and males, respectively) for starvation resistance, similar prediction accuracy for chill coma recovery (null for both models and sexes), and lower prediction accuracy for startle response (r = 0.15 and 0.14 for female and male genotypes, respectively; and r = 0.12 and 0.11 for female and male transcripts, respectively). Models including both genotype and expression levels did not outperform the best single component model. However, accuracy increased considerably for all three traits when we included gene ontology (GO) category as an additional layer of information for both genomic variants and transcripts. We found strongly predictive GO terms for each of the three traits, some of which had a clear plausible biological interpretation. For example, for starvation resistance in females, GO:0033500 (r = 0.39 for transcripts) and GO:0032870 (r = 0.40 for transcripts) have been implicated in carbohydrate homeostasis and cellular response to hormone stimulus (including the insulin receptor signaling pathway), respectively. In summary, this study shows that integrating different sources of information improved prediction accuracy and helped elucidate the genetic architecture of three Drosophila complex phenotypes.



2021 ◽  
Author(s):  
Duncan S Palmer ◽  
Wei Zhou ◽  
Liam Abbott ◽  
Nik Baya ◽  
Claire Churchhouse ◽  
...  

In classical statistical genetic theory, a dominance effect is defined as the deviation from a purely additive genetic effect for a biallelic variant. Dominance effects are well documented in model organisms. However, evidence in humans is limited to a handful of traits, particularly those with strong single locus effects such as hair color. We carried out the largest systematic evaluation of dominance effects on phenotypic variance in the UK Biobank. We curated and tested over 1,000 phenotypes for dominance effects through GWAS scans, identifying 175 loci at genome-wide significance correcting for multiple testing (P < 4.7 × 10⁻¹¹). Power to detect non-additive loci is much lower than power to detect additive effects for complex traits: based on the relative effect sizes at genome-wide significant additive loci, we estimate a factor of 20-30 increase in sample size will be necessary to capture clear evidence of dominance similar to that currently observed for additive effects. However, these localised dominance hits do not extend to a significant aggregate contribution to phenotypic variance genome-wide. By deriving a version of LD-score regression to detect dominance effects tagged by common variation genome-wide (minor allele frequency > 0.05), we found no strong evidence of a contribution to phenotypic variance when accounting for multiple testing. Across the 267 continuous and 793 binary traits the median contribution was 5.73 × 10⁻⁴, with unbiased point estimates ranging from -0.261 to 0.131. Finally, we introduce dominance fine-mapping to explore whether the more rapid decay of dominance LD can be leveraged to find causal variants. These results provide the most comprehensive assessment of dominance trait variation in humans to date.
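
Two quantities in this abstract lend themselves to a small worked example. The sketch below first computes a dominance deviation from three genotype-class means using one common textbook parameterization (heterozygote mean minus the midpoint of the two homozygote means), which may differ from the encoding used in the study. It then checks, purely as an assumption, that dividing the conventional 5 × 10⁻⁸ threshold by the 267 + 793 traits reproduces the reported cutoff; the abstract itself says only that the threshold corrects for multiple testing. All names and the example means are hypothetical.

```python
def dominance_deviation(mean_aa, mean_ab, mean_bb):
    """Heterozygote mean minus the homozygote midpoint.
    Zero means the locus behaves purely additively."""
    midpoint = (mean_aa + mean_bb) / 2
    return mean_ab - midpoint

# Made-up genotype-class phenotype means.
print(dominance_deviation(0.0, 1.0, 2.0))   # additive locus: no deviation
print(dominance_deviation(0.0, 1.8, 2.0))   # heterozygote close to one homozygote

# Assumed Bonferroni-style correction across the traits in the abstract.
n_traits = 267 + 793
threshold = 5e-8 / n_traits
print(f"{n_traits} traits, corrected threshold: {threshold:.1e}")
```

Under that assumption the corrected threshold comes out at about 4.7 × 10⁻¹¹ across 1,060 traits, matching the figure quoted in the abstract.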



2020 ◽  
Author(s):  
David Curtis

Rare genetic variants in LDLR, APOB and PCSK9 are known causes of familial hypercholesterolaemia and it is expected that rare variants in other genes will also have effects on hyperlipidaemia risk although such genes remain to be identified. The UK Biobank consists of a sample of 500,000 volunteers and exome sequence data is available for 50,000 of them. 11,490 of these were classified as hyperlipidaemia cases on the basis of having a relevant diagnosis recorded and/or taking lipid-lowering medication while the remaining 38,463 were treated as controls. Variants in each gene were assigned weights according to rarity and predicted impact and overall weighted burden scores were compared between cases and controls, including population principal components as covariates. One biologically plausible gene, HUWE1, produced statistically significant evidence for association after correction for testing 22,028 genes with a signed log10 p value (SLP) of -6.15, suggesting a protective effect of variants in this gene. Other genes with uncorrected p<0.001 are arguably also of interest, including LDLR (SLP=3.67), RBP2 (SLP=3.14), NPFFR1 (SLP=3.02) and ACOT9 (SLP=-3.19). Gene set analysis indicated that rare variants in genes involved in metabolism and energy can influence hyperlipidaemia risk. Overall, the results provide some leads which might be followed up with functional studies and which could be tested in additional data sets as these become available. This research has been conducted using the UK Biobank Resource.
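
The signed log10 p values (SLPs) in this abstract can be compared against a multiple-testing threshold with a short sketch. It assumes that the SLP magnitude is −log10(p), that a positive sign means a higher burden score in cases and a negative sign a higher score in controls, and that the correction is Bonferroni at α = 0.05 over 22,028 genes. The abstract does not state the correction used, so these are illustrative assumptions, and the function names are hypothetical.

```python
import math

def signed_log_p(p_value, higher_in_cases):
    """Signed log10 p value: positive if the burden is higher in cases."""
    magnitude = -math.log10(p_value)
    return magnitude if higher_in_cases else -magnitude

def bonferroni_slp_threshold(n_tests, alpha=0.05):
    """Smallest |SLP| that survives a Bonferroni correction for n_tests."""
    return -math.log10(alpha / n_tests)

threshold = bonferroni_slp_threshold(22_028)
print(f"|SLP| needed for 22,028 genes: {threshold:.2f}")

# SLPs reported in the abstract.
reported = {"HUWE1": -6.15, "LDLR": 3.67, "RBP2": 3.14,
            "NPFFR1": 3.02, "ACOT9": -3.19}
significant = {gene: abs(slp) > threshold for gene, slp in reported.items()}
for gene, slp in reported.items():
    print(gene, slp, "passes" if significant[gene] else "does not pass")
```

With these assumptions the required magnitude is about 5.64, so only HUWE1 passes, which agrees with the abstract's statement that it is the one gene significant after correction.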


