EraSOR: Erase Sample Overlap in polygenic score analyses

Background: Polygenic risk score (PRS) analyses are now routinely applied in biomedical research, with great hope that they will aid in our understanding of disease aetiology and contribute to personalized medicine. The continued growth of multi-cohort genome-wide association studies (GWASs) and large-scale biobank projects has provided researchers with a wealth of GWAS summary statistics and individual-level data suitable for performing PRS analyses. However, as the size of these studies increases, so does the risk of inter-cohort sample overlap and close relatedness. Ideally, sample overlap would be identified and removed directly, but this is typically not possible due to privacy laws or consent agreements. This sample overlap, whether known or not, is a major problem in PRS analyses because it can lead to inflation of type 1 error and, thus, erroneous conclusions in published work. Results: Here, for the first time, we report the scale of the sample overlap problem for PRS analyses by generating known sample overlap across sub-samples of the UK Biobank data, which we then use to produce GWAS and target data to mimic the effects of inter-cohort sample overlap. We demonstrate that inter-cohort overlap results in a significant and often substantial inflation in the observed PRS-trait association, coefficient of determination (R2) and false-positive rate. This inflation can be high even when the absolute number of overlapping individuals is small, if they make up a notable fraction of the target sample. We develop and introduce EraSOR (Erase Sample Overlap and Relatedness), a software tool for adjusting inflation in PRS prediction and association statistics in the presence of sample overlap or close relatedness between the GWAS and target samples. 
A key component of the EraSOR approach is inference of the degree of sample overlap from the intercept of a bivariate LD score regression applied to the GWAS and target data, making it powered in settings where both have sample sizes over 1,000 individuals. Through extensive benchmarking using UK Biobank and HapGen2 simulated genotype-phenotype data, we demonstrate that PRSs calculated using EraSOR-adjusted GWAS summary statistics are robust to inter-cohort overlap in a wide range of realistic scenarios and are even robust to high levels of residual genetic and environmental stratification. Conclusion: The results of all PRS analyses for which sample overlap cannot be definitively ruled out should be considered with caution, given the high type 1 error observed in the presence of even low overlap between base and target cohorts. Given the strong performance of EraSOR in eliminating inflation caused by sample overlap in PRS studies with large (>5k) target samples, we recommend that EraSOR be used in all future such PRS studies to mitigate the potential effects of inter-cohort overlap and close relatedness.
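
To make the role of the regression intercept concrete, the sketch below uses a standard expectation from the cross-trait LD score regression literature rather than anything stated in this abstract: the intercept is approximately rho * N_s / sqrt(N_1 * N_2), where N_s is the number of shared individuals, N_1 and N_2 are the two sample sizes, and rho is the phenotypic correlation among shared individuals. The function name, the default rho of 1.0 (same trait measured in both samples) and the example numbers are assumptions for illustration; this is not EraSOR's implementation.

```python
import math


def expected_ldsc_intercept(n_overlap, n_base, n_target, pheno_corr=1.0):
    """Approximate expected bivariate LD score regression intercept.

    Uses the textbook relation rho * N_s / sqrt(N_1 * N_2).
    All argument names are hypothetical and chosen for this sketch.
    """
    if n_base <= 0 or n_target <= 0:
        raise ValueError("sample sizes must be positive")
    if n_overlap < 0 or n_overlap > min(n_base, n_target):
        raise ValueError("overlap must lie between 0 and the smaller sample size")
    return pheno_corr * n_overlap / math.sqrt(n_base * n_target)


# Hypothetical scenario: the same 500 shared individuals, two target sizes.
small_target = expected_ldsc_intercept(500, n_base=100_000, n_target=1_000)
large_target = expected_ldsc_intercept(500, n_base=100_000, n_target=50_000)
print(small_target, large_target)
```

Under these assumed numbers, the same 500 shared individuals give a larger intercept when they form half of a 1,000-person target sample than when they are 1% of a 50,000-person one, which is in line with the abstract's point that the overlapping fraction of the target sample matters, not only the absolute count.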

Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics

Bioinformatics ◽

10.1093/bioinformatics/bty999 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2495-2497 ◽

Cited By ~ 27

Author(s):

Gregory McInnes ◽

Yosuke Tanigawa ◽

Chris DeBoever ◽

Adam Lavertu ◽

Julia Eve Olivieri ◽

...

Keyword(s):

Association Studies ◽

Genetic Association Studies ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Uk Biobank ◽

Patient Privacy ◽

Web Based ◽

Genome Wide ◽

Wide Range ◽

The Uk

Summary: Large biobanks linking phenotype to genotype have led to an explosion of genetic association studies across a wide range of phenotypes. Sharing the knowledge generated by these resources with the scientific community remains a challenge due to patient privacy and the vast amount of data. Here, we present Global Biobank Engine (GBE), a web-based tool that enables exploration of the relationship between genotype and phenotype in biobank cohorts, such as the UK Biobank. GBE supports browsing for results from genome-wide association studies, phenome-wide association studies, gene-based tests and genetic correlation between phenotypes. We envision GBE as a platform that facilitates the dissemination of summary statistics from biobanks to the scientific and clinical communities. Availability and implementation: GBE currently hosts data from the UK Biobank and is freely available at biobankengine.stanford.edu.

Multi-trait Genome-Wide Analyses of the Brain Imaging Phenotypes in UK Biobank

Genetics ◽

10.1534/genetics.120.303242 ◽

2020 ◽

Vol 215 (4) ◽

pp. 947-958 ◽

Cited By ~ 1

Author(s):

Chong Wu

Keyword(s):

Association Studies ◽

Error Rates ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Association Analyses ◽

Trait Association ◽

Type 1 Error ◽

Genome Wide ◽

Inflation Factor

Many genetic variants identified in genome-wide association studies (GWAS) are associated with multiple, sometimes seemingly unrelated, traits. This motivates multi-trait association analyses, which have successfully identified novel associated loci for many complex diseases. While appealing, most existing methods focus on analyzing a relatively small number of traits, and may yield inflated Type 1 error rates when a large number of traits need to be analyzed jointly. As deep phenotyping data are rapidly becoming available, we develop a novel method, referred to as aMAT (adaptive multi-trait association test), for multi-trait analysis of any number of traits. We applied aMAT to GWAS summary statistics for a set of 58 volumetric imaging-derived phenotypes from the UK Biobank. aMAT had a genomic inflation factor of 1.04, indicating that the Type 1 error rate was well controlled. More importantly, aMAT identified 24 distinct risk loci, 13 of which were ignored by standard GWAS. In comparison, the competing methods either had a suspicious genomic inflation factor or identified far fewer risk loci. Finally, four additional sets of traits were analyzed and yielded similar conclusions.
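
For readers unfamiliar with the genomic inflation factor quoted above, the following is a minimal sketch of the conventional definition: the median of the observed 1-degree-of-freedom chi-square statistics divided by the median of the chi-square(1) distribution under the null (about 0.4549). The function name and input format are assumptions for illustration, and aMAT's own test statistic is not reproduced here.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(z_scores):
    """Return lambda_GC = median(z^2) / median of chi-square(1).

    z_scores: per-variant association z-scores (hypothetical input format).
    """
    chi2 = [z * z for z in z_scores]
    if not chi2:
        raise ValueError("need at least one test statistic")
    return median(chi2) / CHI2_1DF_MEDIAN
```

A value close to 1 suggests the bulk of the test statistics follows the null distribution; values well above 1 point to inflation from confounding or, for highly polygenic traits, from genuine signal.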

Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics

10.1101/304188 ◽

2018 ◽

Cited By ~ 4

Author(s):

Gregory McInnes ◽

Yosuke Tanigawa ◽

Chris DeBoever ◽

Adam Lavertu ◽

Julia Eve Olivieri ◽

...

Keyword(s):

Association Studies ◽

Genetic Association Studies ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Uk Biobank ◽

Patient Privacy ◽

Web Based ◽

Genome Wide ◽

Wide Range ◽

The Uk

Large biobanks linking phenotype to genotype have led to an explosion of genetic association studies across a wide range of phenotypes. Sharing the knowledge generated by these resources with the scientific community remains a challenge due to patient privacy and the vast amount of data. Here we present Global Biobank Engine (GBE), a web-based tool that enables the exploration of the relationship between genotype and phenotype in large biobank cohorts, such as the UK Biobank. GBE supports browsing for results from genome-wide association studies, phenome-wide association studies, gene-based tests, and genetic correlation between phenotypes. We envision GBE as a platform that facilitates the dissemination of summary statistics from biobanks to the scientific and clinical communities. GBE currently hosts data from the UK Biobank and can be found freely available at biobankengine.stanford.edu.

Polygenic Risk Scores for Kidney Function and Their Associations with Circulating Proteome, and Incident Kidney Diseases

Journal of the American Society of Nephrology ◽

10.1681/asn.2020111599 ◽

2021 ◽

pp. ASN.2020111599

Author(s):

Zhi Yu ◽

Jin Jin ◽

Adrienne Tin ◽

Anna Köttgen ◽

Bing Yu ◽

...

Keyword(s):

Kidney Function ◽

Kidney Diseases ◽

Association Studies ◽

Polygenic Risk Score ◽

Genome Wide Association Studies ◽

Plasma Proteome ◽

Uk Biobank ◽

Polygenic Risk ◽

Genome Wide ◽

A Genome

Background: Genome-wide association studies (GWAS) have revealed numerous loci for kidney function (estimated glomerular filtration rate, eGFR). The relationship of polygenic predictors of eGFR, risk of incident adverse kidney outcomes, and the plasma proteome is not known. Methods: We developed a genome-wide polygenic risk score (PRS) for eGFR by applying the LDpred algorithm to summary statistics generated from a multiethnic meta-analysis of CKDGen Consortium GWAS (N=765,348) and UK Biobank GWAS (90% of the cohort; N=451,508), followed by best parameter selection using the remaining 10% of UK Biobank (N=45,158). We then tested the association of the PRS in the Atherosclerosis Risk in Communities (ARIC) study (N=8,866) with incident chronic kidney disease, kidney failure, and acute kidney injury. We also examined associations between the PRS and 4,877 plasma proteins measured at middle age and older adulthood and evaluated mediation of PRS associations by eGFR. Results: The developed PRS showed significant associations with all outcomes, with hazard ratios (95% CI) per 1 SD lower PRS ranging from 1.06 (1.01, 1.11) to 1.33 (1.28, 1.37). The PRS was significantly associated with 132 proteins at both time points. The strongest associations were with cystatin-C, collagen alpha-1(XV) chain, and desmocollin-2. Most proteins were higher at lower kidney function, except for 5 proteins including testican-2. Most correlations of the genetic PRS with proteins were mediated by eGFR. Conclusions: A PRS for eGFR is now sufficiently strong to capture risk for a spectrum of incident kidney diseases and broadly influences the plasma proteome, primarily mediated by eGFR.

Controlling type 1 error rates in genome-wide association studies in plants

Heredity ◽

10.1038/hdy.2012.101 ◽

2012 ◽

Vol 111 (1) ◽

pp. 86-87 ◽

Cited By ~ 5

Author(s):

A W George

Keyword(s):

Association Studies ◽

Error Rates ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Type 1 Error ◽

Genome Wide

Genome-wide genetic data on ~500,000 UK Biobank participants

10.1101/166298 ◽

2017 ◽

Cited By ~ 303

Author(s):

Clare Bycroft ◽

Colin Freeman ◽

Desislava Petkova ◽

Gavin Band ◽

Lloyd T. Elliott ◽

...

Keyword(s):

Quality Control ◽

Allelic Variation ◽

Association Studies ◽

Genetic Data ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genotype Data ◽

Uk Biobank ◽

Genome Wide ◽

Wide Range

The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United Kingdom, aged 40-69 at recruitment. A rich variety of phenotypic and health-related information is available on each participant, making the resource unprecedented in its size and scope. Here we describe the genome-wide genotype data (~805,000 markers) collected on all individuals in the cohort and its quality control procedures. Genotype data on this scale offers novel opportunities for assessing quality issues, although the wide range of ancestries of the individuals in the cohort also creates particular challenges. We also conducted a set of analyses that reveal properties of the genetic data – such as population structure and relatedness – that can be important for downstream analyses. In addition, we phased and imputed genotypes into the dataset, using computationally efficient methods combined with the Haplotype Reference Consortium (HRC) and UK10K haplotype resource. This increases the number of testable variants by over 100-fold to ~96 million variants. We also imputed classical allelic variation at 11 human leukocyte antigen (HLA) genes, and as a quality control check of this imputation, we replicate signals of known associations between HLA alleles and many common diseases. We describe tools that allow efficient genome-wide association studies (GWAS) of multiple traits and fast phenome-wide association studies (PheWAS), which work together with a new compressed file format that has been used to distribute the dataset. As a further check of the genotyped and imputed datasets, we performed a test-case genome-wide association scan on a well-studied human trait, standing height.

emeraLD: Rapid Linkage Disequilibrium Estimation with Massive Data Sets

10.1101/301366 ◽

2018 ◽

Cited By ~ 1

Author(s):

Corbin Quick ◽

Christian Fuchsberger ◽

Daniel Taliun ◽

Gonçalo Abecasis ◽

Michael Boehnke ◽

...

Keyword(s):

Linkage Disequilibrium ◽

Association Studies ◽

Random Access ◽

Supplementary Information ◽

Data Sets ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Genome Wide ◽

Wide Range ◽

Supplementary Material

Summary: Estimating linkage disequilibrium (LD) is essential for a wide range of summary statistics-based association methods for genome-wide association studies (GWAS). Large genetic data sets, e.g. the TOPMed WGS project and UK Biobank, enable more accurate and comprehensive LD estimates, but increase the computational burden of LD estimation. Here, we describe emeraLD (Efficient Methods for Estimation and Random Access of LD), a computational tool that leverages sparsity and haplotype structure to estimate LD orders of magnitude faster than existing tools. Availability and implementation: emeraLD is implemented in C++, and is open source under GPLv3. Source code, documentation, an R interface, and utilities for analysis of summary statistics are freely available at http://github.com/statgen/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
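
As a point of reference for what "estimating LD" involves, here is a minimal sketch of the textbook pairwise r² computed from phased haplotypes coded 0/1. It is a naive per-pair calculation written for illustration, with a hypothetical function name and input format; it does not reproduce emeraLD's algorithm or exploit the sparsity and haplotype structure the abstract refers to.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation r^2 between two biallelic variants.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per
    phased haplotype (hypothetical input format for this sketch).
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                                   # allele frequency at variant A
    p_b = sum(hap_b) / n                                   # allele frequency at variant B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                   # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic variant")
    return d * d / denom
```

Only haplotypes carrying the 1 allele at both sites contribute to `p_ab`, which hints at why sparse representations can help when most alleles are rare.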

Cancer PRSweb – an Online Repository with Polygenic Risk Scores (PRS) for Major Cancer Traits and Their Phenome-wide Exploration in Two Independent Biobanks

10.1101/2020.01.22.915751 ◽

2020 ◽

Cited By ~ 1

Author(s):

Lars G. Fritsche ◽

Snehal Patil ◽

Lauren J. Beesley ◽

Peter VandeHaar ◽

Maxwell Salvatore ◽

...

Keyword(s):

Association Studies ◽

Predictive Performance ◽

Risk Scores ◽

P Value ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Uk Biobank ◽

Polygenic Risk ◽

Genome Wide ◽

Study Results

To facilitate scientific collaboration on polygenic risk scores (PRS) research, we created an extensive PRS online repository for 49 common cancer traits integrating freely available genome-wide association studies (GWAS) summary statistics from three sources: published GWAS, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWAS. Our framework condenses these summary statistics into PRS using various approaches such as linkage disequilibrium pruning / p-value thresholding (fixed or data-adaptively optimized thresholds) and penalized, genome-wide effect size weighting. We evaluated the PRS in two biobanks: the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and the population-based UK Biobank (UKB). For each PRS construct, we provide measures on predictive performance, calibration, and discrimination. Besides PRS evaluation, the Cancer-PRSweb platform features construct downloads and phenome-wide PRS association study results (PRS-PheWAS) for predictive PRS. We expect this integrated platform to accelerate PRS-related cancer research.
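
The pruning / p-value thresholding construction mentioned above can be illustrated with a minimal sketch: a score for one individual is the sum of effect-allele dosages weighted by GWAS effect sizes, restricted to variants whose p-value passes a threshold. The function name, argument layout and default threshold are assumptions for illustration, LD pruning is assumed to have been done beforehand, and this is not the Cancer PRSweb pipeline.

```python
def polygenic_score(dosages, betas, p_values, p_threshold=5e-8):
    """Weighted-sum polygenic score for a single individual.

    dosages:  effect-allele counts (0-2) per variant, already LD-pruned
    betas:    GWAS effect sizes for the same variants, same order
    p_values: GWAS p-values for the same variants, same order
    All names and the default threshold are hypothetical.
    """
    if not (len(dosages) == len(betas) == len(p_values)):
        raise ValueError("inputs must have the same length")
    return sum(
        beta * dosage
        for dosage, beta, p in zip(dosages, betas, p_values)
        if p <= p_threshold  # keep only variants passing the threshold
    )
```

A data-adaptive variant would evaluate several thresholds and keep the one that performs best in a tuning sample that is separate from the evaluation sample.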

Bayesian meta-analysis across genome-wide association studies of diverse phenotypes

10.1101/477828 ◽

2018 ◽

Author(s):

Holly Trochet ◽

Matti Pirinen ◽

Gavin Band ◽

Luke Jostins ◽

Gilean McVean ◽

...

Keyword(s):

Genetic Basis ◽

Association Studies ◽

Meta Analysis ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Computationally Efficient ◽

Genome Wide ◽

Wide Range ◽

Study Designs

Genome-wide association studies (GWAS) are a powerful tool for understanding the genetic basis of diseases and traits, but most studies have been conducted in isolation, with a focus on either a single or a set of closely related phenotypes. We describe MetABF, a simple Bayesian framework for performing integrative meta-analysis across multiple GWAS using summary statistics. The approach is applicable across a wide range of study designs and can increase the power by 50% compared to standard frequentist tests when only a subset of studies have a true effect. We demonstrate its utility in a meta-analysis of 20 diverse GWAS which were part of the Wellcome Trust Case-Control Consortium 2. The novelty of the approach is its ability to explore, and assess the evidence for, a range of possible true patterns of association across studies in a computationally efficient framework.

Systematic single-variant and gene-based association testing of 3,700 phenotypes in 281,850 UK Biobank exomes

10.1101/2021.06.19.21259117 ◽

2021 ◽

Author(s):

Konrad Karczewski ◽

Matthew Solomonson ◽

Katherine R Chao ◽

Julia K Goodrich ◽

Grace Tiao ◽

...

Keyword(s):

Sequence Data ◽

Association Studies ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Genetic Associations ◽

Allelic Series ◽

Association Analyses ◽

Wide Range ◽

The Uk ◽

The Impact

Genome-wide association studies have successfully discovered thousands of common variants associated with human diseases and traits, but the landscape of rare variation in human disease has not been explored at scale. Exome sequencing studies of population biobanks provide an opportunity to systematically evaluate the impact of rare coding variation across a wide range of phenotypes to discover genes and allelic series relevant to human health and disease. Here, we present results from systematic association analyses of 3,700 phenotypes using single-variant and gene tests of 281,850 individuals in the UK Biobank with exome sequence data. We find that the discovery of genetic associations is tightly linked to frequency as well as correlated with metrics of deleteriousness and natural selection. We highlight biological findings elucidated by these data and release the dataset as a public resource alongside a browser framework for rapidly exploring rare variant association results.
