Polygenic scores for UK Biobank scale data

Polygenic scores (PGS) are estimated scores representing the genetic tendency of an individual for a disease or trait and have become an indispensable tool in a variety of analyses. Typically they are a linear combination of the genotypes of a large number of SNPs, with the weights calculated from an external source, such as summary statistics from large meta-analyses. Recently, cohorts with genetic data have become so large that it would be a waste not to use their raw data in constructing PGS. Using raw data to calculate PGS, however, raises the problem of overfitting. Here we discuss the essence of overfitting as it applies to PGS calculations and highlight the difference between overfitting due to overlap between the target and the discovery data (OTD), and overfitting due to overlap between the target and the validation data (OTV). We propose two methods, cross prediction and split validation, to overcome OTD and OTV respectively. Using these two methods, PGS can be calculated using raw data without overfitting. We show that PGS thus calculated have better predictive power than those using summary statistics alone for six phenotypes in the UK Biobank data.
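The two ideas in this abstract, a PGS as a weighted sum of genotypes and cross prediction to avoid overlap between target and discovery data, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names are hypothetical, and the per-SNP marginal regression used to estimate weights is a simple stand-in, since the abstract does not say which estimator the authors use.

```python
import numpy as np

def polygenic_score(genotypes, weights):
    """PGS as a weighted sum of allele counts: one score per individual."""
    return genotypes @ weights

def marginal_weights(genotypes, phenotype):
    """Per-SNP marginal regression slopes (illustrative estimator, an assumption)."""
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    denom = (g ** 2).sum(axis=0)
    denom[denom == 0] = np.inf  # monomorphic SNPs end up with weight 0
    return (g.T @ y) / denom

def cross_predicted_scores(genotypes, phenotype, n_folds=5, seed=0):
    """Score each individual with weights estimated without that individual's fold."""
    rng = np.random.default_rng(seed)
    n = genotypes.shape[0]
    fold = rng.permutation(n) % n_folds  # roughly equal-sized random folds
    scores = np.empty(n)
    for k in range(n_folds):
        held_out = fold == k
        # Discovery step uses only the other folds, so target and discovery data never overlap.
        w = marginal_weights(genotypes[~held_out], phenotype[~held_out])
        scores[held_out] = polygenic_score(genotypes[held_out], w)
    return scores
```

Calling `cross_predicted_scores(G, y)` on an individuals-by-SNPs array `G` of allele counts and a phenotype vector `y` returns one out-of-fold score per individual. Because each fold is scored with a different weight vector, scores may need to be standardized within fold before they are pooled; that step is left out of this sketch.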

Contrasting Broad- and Clinically- defined Polygenic Indicators of Depression and Depression-related Phenotypes in Adults and Children

10.31234/osf.io/pn9vb ◽

2020 ◽

Author(s):

John E. McGeary ◽

Chelsie Benca-Bachman ◽

Victoria Risner ◽

Christopher G Beevers ◽

Brandon Gibb ◽

...

Keyword(s):

Suicidal Ideation ◽

Cognitive Reappraisal ◽

Twin Studies ◽

European Ancestry ◽

Summary Statistics ◽

Depression Severity ◽

Uk Biobank ◽

Polygenic Scores ◽

Adults And Children ◽

The Uk

Twin studies indicate that 30-40% of the disease liability for depression can be attributed to genetic differences. Here, we assess the explanatory ability of polygenic scores (PGS) based on broad- (PGSBD) and clinical- (PGSMDD) depression summary statistics from the UK Biobank using independent cohorts of adults (N=210; 100% European Ancestry) and children (N=728; 70% European Ancestry) who have been extensively phenotyped for depression and related neurocognitive phenotypes. PGS associations with depression severity and diagnosis were generally modest, and larger in adults than children. Polygenic prediction of depression-related phenotypes was mixed and varied by PGS. Higher PGSBD, in adults, was associated with a higher likelihood of having suicidal ideation, increased brooding and anhedonia, and lower levels of cognitive reappraisal; PGSMDD was positively associated with brooding and negatively related to cognitive reappraisal. Overall, PGS based on both broad and clinical depression phenotypes have modest utility in adult and child samples of depression.

How robust are cross-population signatures of polygenic adaptation in humans?

10.1101/2020.07.13.200030 ◽

2020 ◽

Author(s):

Alba Refoyo-Martínez ◽

Siyang Liu ◽

Anja Moltke Jørgensen ◽

Xin Jin ◽

Anders Albrechtsen ◽

...

Keyword(s):

Effect Size ◽

Association Studies ◽

Gwas Data ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Uk Biobank ◽

Polygenic Adaptation ◽

The Uk ◽

Meta Analyses ◽

Size Estimates

Over the past decade, summary statistics from genome-wide association studies (GWAS) have been used to detect and quantify polygenic adaptation in humans. Several studies have reported signatures of natural selection at sets of SNPs associated with complex traits, like height and body mass index. However, more recent studies suggest that some of these signals may be caused by biases from uncorrected population stratification in the GWAS data with which these tests are performed. Moreover, past studies have predominantly relied on SNP effect size estimates obtained from GWAS panels of European ancestries, which are known to be poor predictors of phenotypes in non-European populations. Here, we collated GWAS data from multiple anthropometric and metabolic traits that have been measured in more than one cohort around the world, including the UK Biobank, FINRISK, Chinese NIPT, Biobank Japan, APCDR and PAGE. We then evaluated how robust signals of polygenic adaptation are to the choice of GWAS cohort used to identify associated variants and their effect size estimates, while using the same panel to obtain population allele frequencies (The 1000 Genomes Project). We observe many discrepancies across tests performed on the same phenotype and find that association studies performed using multiple different cohorts, like meta-analyses, tend to produce scores with strong overdispersion across populations. This results in apparent signatures of polygenic adaptation which are not observed when using effect size estimates from biobank-based GWAS of homogeneous ancestries. Indeed, we were able to artificially create score overdispersion when taking the UK Biobank cohort and simulating a meta-analysis on multiple subsets of the cohort. This suggests that extreme caution should be taken in the execution and interpretation of future tests of polygenic adaptation based on population differentiation, especially when using summary statistics from GWAS meta-analyses.

Polygenic scores via penalized regression on summary statistics

10.1101/058214 ◽

2016 ◽

Author(s):

Timothy Shin Heng Mak ◽

Robert Milan Porsch ◽

Shing Wan Choi ◽

Xueya Zhou ◽

Pak Chung Sham

Keyword(s):

Prediction Accuracy ◽

Penalized Regression ◽

Tuning Parameter ◽

Summary Statistics ◽

Validation Data ◽

Polygenic Scores ◽

Pertinent Question ◽

Almost All ◽

Risk Categories ◽

General Method

Polygenic scores (PGS) summarize the genetic contribution of a person’s genotype to a disease or phenotype. They can be used to group participants into different risk categories for diseases, and are also used as covariates in epidemiological analyses. A number of possible ways of calculating polygenic scores have been proposed, and recently there has been much interest in methods that incorporate information available in published summary statistics. As there is no inherent information on linkage disequilibrium (LD) in summary statistics, a pertinent question is how we can make use of LD information available elsewhere to supplement such analyses. To answer this question we propose a method for constructing PGS using summary statistics and a reference panel in a penalized regression framework, which we call lassosum. We also propose a general method, pseudovalidation, for choosing the value of the tuning parameter in the absence of validation data. In our simulations, we showed that pseudovalidation often resulted in prediction accuracy that is comparable to using a dataset with validation phenotype and was clearly superior to the conservative option of setting the tuning parameter of lassosum to its lowest value. We also showed that lassosum achieved better prediction accuracy than simple clumping and p-value thresholding in almost all scenarios. It was also substantially faster and more accurate than the recently proposed LDpred.
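To make the idea of penalized regression on summary statistics concrete, here is a minimal coordinate-descent sketch. It is not the lassosum implementation. It assumes an objective of the form β'Rβ − 2β'r + 2λ‖β‖₁, where `r` holds SNP-phenotype correlations from summary statistics and `R` is an LD correlation matrix from a reference panel, shrunk toward the identity. The objective form, the function names and the parameters `lam` and `shrink` are all assumptions chosen for illustration and are not taken from the abstract.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become zero."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def lasso_from_summary(r, ld, lam, shrink=0.1, n_iter=200, tol=1e-8):
    """Coordinate descent for beta'R beta - 2 beta'r + 2 lam |beta|_1 (assumed objective).

    r      : SNP-phenotype correlations (length p)
    ld     : p x p LD correlation matrix from a reference panel
    shrink : weight on the identity when regularizing the LD matrix
    """
    p = len(r)
    R = (1.0 - shrink) * ld + shrink * np.eye(p)
    beta = np.zeros(p)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Correlation left for SNP j after removing the other SNPs' contributions.
            partial = r[j] - R[j] @ beta + R[j, j] * beta[j]
            new = soft_threshold(partial, lam) / R[j, j]
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta
```

With an identity LD matrix this reduces to soft-thresholding the correlations, which selects SNPs in a way loosely comparable to thresholding on association strength. With `lam=0` and `shrink=0` it solves the unpenalized system Rβ = r. In practice `lam` would be chosen with validation data or a substitute for it.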

LDpred-funct: incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets

10.1101/375337 ◽

2018 ◽

Cited By ~ 21

Author(s):

Carla Márquez-Luna ◽

Steven Gazal ◽

Po-Ru Loh ◽

Samuel S. Kim ◽

Nicholas Furlotte ◽

...

Keyword(s):

Complex Traits ◽

Prediction Accuracy ◽

Causal Effect ◽

Complex Trait ◽

Training Data ◽

Data Sets ◽

Uk Biobank ◽

Validation Data ◽

Functional Regions ◽

The Uk

Genetic variants in functional regions of the genome are enriched for complex trait heritability. Here, we introduce a new method for polygenic prediction, LDpred-funct, that leverages trait-specific functional priors to increase prediction accuracy. We fit priors using the recently developed baseline-LD model, which includes coding, conserved, regulatory and LD-related annotations. We analytically estimate posterior mean causal effect sizes and then use cross-validation to regularize these estimates, improving prediction accuracy for sparse architectures. LDpred-funct attained higher prediction accuracy than other polygenic prediction methods in simulations using real genotypes. We applied LDpred-funct to predict 21 highly heritable traits in the UK Biobank. We used association statistics from British-ancestry samples as training data (avg N=373K) and samples of other European ancestries as validation data (avg N=22K), to minimize confounding. LDpred-funct attained a +4.6% relative improvement in average prediction accuracy (avg prediction R²=0.144; highest R²=0.413 for height) compared to SBayesR (the best method that does not incorporate functional information). For height, meta-analyzing training data from UK Biobank and 23andMe cohorts (total N=1107K; higher heritability in UK Biobank cohort) increased prediction R² to 0.431. Our results show that incorporating functional priors improves polygenic prediction accuracy, consistent with the functional architecture of complex traits.

Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics

10.1101/2021.10.27.466078 ◽

2021 ◽

Author(s):

Florian Privé

Keyword(s):

Genetic Diversity ◽

Cohort Study ◽

Summary Statistics ◽

Uk Biobank ◽

Reference Groups ◽

Phenotypic Data ◽

The United Kingdom ◽

Four Corners ◽

The World ◽

The Uk

The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on almost 500,000 individuals from across the United Kingdom. Within this dataset, we carefully define 17 distinct ancestry groups from all four corners of the world. Using allele frequencies derived from these global reference groups, we are now able to effectively measure diversity from summary statistics of any genetic dataset. Measuring genetic diversity is an important problem because increasing genetic diversity is key to making new genetic discoveries, while also being a major source of confounding to be aware of in genetics studies.

Heterogeneous effects of genetic risk for Alzheimer’s disease on the phenome

Translational Psychiatry ◽

10.1038/s41398-021-01518-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Hei Man Wu ◽

Alison M. Goate ◽

Paul F. O’Reilly

Keyword(s):

Alzheimer's Disease ◽

Genetic Risk ◽

Familial Risk ◽

Psychosocial Health ◽

Uk Biobank ◽

Multiple Forms ◽

Heterogeneous Effects ◽

The Difference ◽

The Uk

Here we report how four major forms of Alzheimer’s disease (AD) genetic risk (APOE-ε4, APOE-ε2, polygenic risk and familial risk) are associated with 273 traits in ~500,000 individuals in the UK Biobank. The traits cover blood biochemistry and cell traits, metabolic and general health, psychosocial health, and cognitive function. The difference in the profile of traits associated with the different forms of AD risk is striking and may contribute to heterogeneous presentation of the disease. However, we also identify traits significantly associated with multiple forms of AD genetic risk, as well as traits showing significant changes across ages in those at high risk of AD, which may point to their potential roles in AD etiology. Finally, we highlight how survivor effects, in particular those relating to shared risks of cardiovascular disease and AD, can generate associations that may mislead interpretation in epidemiological AD studies. The UK Biobank provides a unique opportunity to powerfully compare the effects of different forms of AD genetic risk on the phenome in the same cohort.

Variable prediction accuracy of polygenic scores within an ancestry group

10.1101/629949 ◽

2019 ◽

Cited By ~ 14

Author(s):

Hakhamanesh Mostafavi ◽

Arbel Harpak ◽

Dalton Conley ◽

Jonathan K Pritchard ◽

Molly Przeworski

Keyword(s):

Prediction Accuracy ◽

Human Genetics ◽

Association Studies ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Gwas Study ◽

Ancestry Group ◽

Genome Wide ◽

Polygenic Scores ◽

The Uk

Fields as diverse as human genetics and sociology are increasingly using polygenic scores based on genome-wide association studies (GWAS) for phenotypic prediction. However, recent work has shown that polygenic scores have limited portability across groups of different genetic ancestries, restricting the contexts in which they can be used reliably and potentially creating serious inequities in future clinical applications. Using the UK Biobank data, we demonstrate that even within a single ancestry group, the prediction accuracy of polygenic scores depends on characteristics such as the age or sex composition of the individuals in which the GWAS and the prediction were conducted, and on the GWAS study design. Our findings highlight both the complexities of interpreting polygenic scores and underappreciated obstacles to their broad use.

Comparison of adopted and non-adopted individuals reveals gene-environment interplay for education in the UK Biobank

10.1101/707695 ◽

2019 ◽

Cited By ~ 12

Author(s):

Rosa Cheesman ◽

Avina Hunjan ◽

Jonathan R. I. Coleman ◽

Yasmin Ahmadzadeh ◽

Robert Plomin ◽

...

Keyword(s):

Home Environment ◽

Adoptive Parents ◽

Genetic Influences ◽

Uk Biobank ◽

Rearing Environment ◽

Individual Level ◽

Gene Environment ◽

Polygenic Scores ◽

The Uk ◽

Difference Test

Individual-level polygenic scores can now explain ∼10% of the variation in number of years of completed education. However, associations between polygenic scores and education capture not only genetic propensity but information about the environment that individuals are exposed to. This is because individuals passively inherit effects of parental genotypes, since their parents typically also provide the rearing environment. In other words, the strong correlation between offspring and parent genotypes results in an association between the offspring genotypes and the rearing environment. This is termed passive gene-environment correlation. We present an approach to test for the extent of passive gene-environment correlation for education without requiring intergenerational data. Specifically, we use information from 6311 individuals in the UK Biobank who were adopted in childhood to compare genetic influence on education between adoptees and non-adopted individuals. Adoptees’ rearing environments are less correlated with their genotypes, because they do not share genes with their adoptive parents. We find that polygenic scores are twice as predictive of years of education in non-adopted individuals compared to adoptees (R² = 0.074 vs 0.037, difference test p = 8.23 × 10^−24). We provide another kind of evidence for the influence of parental behaviour on offspring education: individuals in the lowest decile of education polygenic score attain significantly more education if they are adopted, possibly due to educationally supportive adoptive environments. Overall, these results suggest that genetic influences on education are mediated via the home environment. As such, polygenic prediction of educational attainment represents gene-environment correlations just as much as it represents direct genetic effects.

Polygenic Adaptation has Impacted Multiple Anthropometric Traits

10.1101/167551 ◽

2017 ◽

Cited By ~ 30

Author(s):

Jeremy J. Berg ◽

Xinjun Zhang ◽

Graham Coop

Keyword(s):

Complex Traits ◽

Association Studies ◽

Gwas Data ◽

Human Populations ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Link Type ◽

Polygenic Scores ◽

Polygenic Adaptation ◽

The Uk

Our understanding of the genetic basis of human adaptation is biased toward loci of large phenotypic effect. Genome wide association studies (GWAS) now enable the study of genetic adaptation in polygenic phenotypes. We test for polygenic adaptation among 187 world-wide human populations using polygenic scores constructed from GWAS of 34 complex traits. We identify signals of polygenic adaptation for anthropometric traits including height, infant head circumference (IHC), hip circumference and waist-to-hip ratio (WHR). Analysis of ancient DNA samples indicates that a north-south cline of height within Europe and a west-east cline across Eurasia can be traced to selection for increased height in two late Pleistocene hunter gatherer populations living in western and west-central Eurasia. Our observation that IHC and WHR follow a latitudinal cline in Western Eurasia supports the role of natural selection driving Bergmann’s Rule in humans, consistent with thermoregulatory adaptation in response to latitudinal temperature variation.

Author’s Note on Failure to Replicate

After this preprint was posted, the UK Biobank dataset was released, providing a new and open GWAS resource. When attempting to replicate the height selection results from this preprint using GWAS data from the UK Biobank, we discovered that we could not. In subsequent analyses, we determined that both the GIANT consortium height GWAS data, as well as another dataset that was used for replication, were impacted by stratification issues that created or at a minimum substantially inflated the height selection signals reported here. The results of this second investigation, written together with additional coauthors, have now been published (https://elifesciences.org/articles/39725 along with another paper by a separate group of authors, showing similar issues https://elifesciences.org/articles/39702). A preliminary investigation shows that the other non-height based results may suffer from similar issues. We stand by the theory and statistical methods reported in this paper, and the paper can be cited for these results. However, we have shown that the data on which the major empirical results were based are not sound, and so should be treated with caution until replicated.

A tool for translating polygenic scores onto the absolute scale using summary statistics

European Journal of Human Genetics ◽

10.1038/s41431-021-01028-z ◽

2022 ◽

Author(s):

Oliver Pain ◽

Alexandra C. Gillett ◽

Jehannine C. Austin ◽

Lasse Folkersen ◽

Cathryn M. Lewis

Keyword(s):

Genome Wide Association Study ◽

Absolute Risk ◽

Summary Statistics ◽

Clinical Implementation ◽

Polygenic Score ◽

Absolute Scale ◽

Polygenic Scores ◽

The Absolute ◽

The Uk ◽

Normally Distributed

There is growing interest in the clinical application of polygenic scores as their predictive utility increases for a range of health-related phenotypes. However, providing polygenic score predictions on the absolute scale is an important step for their safe interpretation. We have developed a method to convert polygenic scores to the absolute scale for binary and normally distributed phenotypes. This method uses summary statistics, requiring only the area-under-the-ROC curve (AUC) or variance explained (R²) by the polygenic score, and the prevalence of binary phenotypes, or mean and standard deviation of normally distributed phenotypes. Polygenic scores are converted using normal distribution theory. We also evaluate methods for estimating polygenic score AUC/R² from genome-wide association study (GWAS) summary statistics alone. We validate the absolute risk conversion and AUC/R² estimation using data for eight binary and three continuous phenotypes in the UK Biobank sample. When the AUC/R² of the polygenic score is known, the observed and estimated absolute values were highly concordant. Estimates of AUC/R² from the lassosum pseudovalidation method were most similar to the observed AUC/R² values, though estimated values deviated substantially from the observed for autoimmune disorders. This study enables accurate interpretation of polygenic scores using only summary statistics, providing a useful tool for educational and clinical purposes. Furthermore, we have created interactive webtools implementing the conversion to the absolute scale (https://opain.github.io/GenoPred/PRS_to_Abs_tool.html). Several further barriers must be addressed before clinical implementation of polygenic scores, such as ensuring target individuals are well represented by the GWAS sample.
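For a normally distributed phenotype, the normal-theory step can be sketched as follows. This is a textbook bivariate-normal calculation, not the authors' tool. It assumes the polygenic score is given as a z-score, is positively correlated with the phenotype, and explains a fraction R² (below 1) of its variance. The function name and arguments are hypothetical, and the binary-phenotype conversion from AUC and prevalence is not shown.

```python
from math import sqrt
from statistics import NormalDist

def pgs_to_absolute(z, r2, mean, sd, coverage=0.95):
    """Expected phenotype and central interval for someone with PGS z-score z.

    Assumes the PGS and phenotype are jointly normal with correlation sqrt(r2),
    and that r2 < 1 so the conditional distribution has non-zero spread.
    """
    cond_mean = mean + sd * sqrt(r2) * z   # regression of phenotype on the PGS
    cond_sd = sd * sqrt(1.0 - r2)          # spread not explained by the PGS
    dist = NormalDist(cond_mean, cond_sd)
    tail = (1.0 - coverage) / 2.0
    return cond_mean, dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)
```

For example, with a population mean of 170, a standard deviation of 10 and R² = 0.25, a score one standard deviation above average shifts the expected value by 10 × 0.5 × 1 = 5 units, to 175. The interval stays wide because three quarters of the variance remains unexplained.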
