Leveraging both individual-level genetic data and GWAS summary statistics increases polygenic prediction

Abstract The accuracy of polygenic risk scores (PRSs) in predicting complex diseases increases with the training sample size. PRSs are generally derived from summary statistics of large meta-analyses of multiple genome-wide association studies (GWAS). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how best to combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combine data is the meta-analysis of GWAS summary statistics (Meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using twelve real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare Meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and Meta-PRS. We find that, when large individual-level data is available, the linear combination of PRSs (Meta-PRS) is both a simple alternative to Meta-GWAS and often more accurate.
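The two combining strategies contrasted in this abstract can be illustrated with a minimal sketch. It assumes that Meta-GWAS is a fixed-effect inverse-variance-weighted meta-analysis of per-variant effects, and that the Meta-PRS weights are fitted by ordinary least squares in a validation set; both choices, and all function names, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def meta_gwas(beta_a, se_a, beta_b, se_b):
    """Fixed-effect inverse-variance-weighted meta-analysis of per-variant effects.

    Assumption: both studies report effects for the same variants and alleles.
    """
    w_a = 1.0 / np.square(se_a)
    w_b = 1.0 / np.square(se_b)
    beta = (w_a * beta_a + w_b * beta_b) / (w_a + w_b)
    se = np.sqrt(1.0 / (w_a + w_b))
    return beta, se

def meta_prs_weights(prs_a, prs_b, y):
    """Least-squares weights (intercept, w_a, w_b) for a linear combination of two PRSs.

    Assumption: prs_a, prs_b and y come from a validation set that is separate
    from the data used to derive either score.
    """
    X = np.column_stack([np.ones_like(prs_a), prs_a, prs_b])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def combine_prs(coef, prs_a, prs_b):
    """Apply the fitted linear combination to new individuals."""
    return coef[0] + coef[1] * prs_a + coef[2] * prs_b
```

In this sketch Meta-GWAS merges information at the variant level before any score is built, whereas Meta-PRS builds one score per data source and merges them at the individual level.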

Improved polygenic prediction by Bayesian multiple regression on summary statistics

Nature Communications ◽

10.1038/s41467-019-12653-0 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 34

Author(s):

Luke R. Lloyd-Jones ◽

Jian Zeng ◽

Julia Sidorenko ◽

Loïc Yengo ◽

Gerhard Moser ◽

...

Keyword(s):

Multiple Regression ◽

Association Studies ◽

Meta Analysis ◽

Multiple Regression Model ◽

Data Sets ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Individual Level ◽

Level Data ◽

The Uk

Abstract Accurate prediction of an individual’s phenotype from their DNA sequence is one of the great promises of genomics and precision medicine. We extend a powerful individual-level data Bayesian multiple regression model (BayesR) to one that utilises summary statistics from genome-wide association studies (GWAS), SBayesR. In simulation and cross-validation using 12 real traits and 1.1 million variants on 350,000 individuals from the UK Biobank, SBayesR improves prediction accuracy relative to commonly used state-of-the-art summary statistics methods at a fraction of the computational resources. Furthermore, using summary statistics for variants from the largest GWAS meta-analysis (n ≈ 700,000) on height and BMI, we show that, on average across traits and two independent data sets, SBayesR improves prediction R² by 5.2% relative to LDpred and by 26.5% relative to clumping and p value thresholding.

Exploiting collider bias to apply two-sample summary data Mendelian randomization methods to one-sample individual level data

10.1101/2020.10.20.20216358 ◽

2020 ◽

Author(s):

Ciarrah Barry ◽

Junxi Liu ◽

Rebecca Richmond ◽

Martin K Rutter ◽

Deborah A Lawlor ◽

...

Keyword(s):

Mendelian Randomization ◽

Association Studies ◽

General Procedure ◽

Meta Analysis ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Individual Level ◽

Level Data ◽

Summary Data ◽

Collider Bias

Abstract Over the last decade the availability of SNP-trait associations from genome-wide association studies has led to an array of methods for performing Mendelian randomization studies using only summary statistics. A common feature of these methods, besides their intuitive simplicity, is the ability to combine data from several sources, incorporate multiple variants and account for biases due to weak instruments and pleiotropy. With the advent of large and accessible fully-genotyped cohorts such as UK Biobank, there is now increasing interest in understanding how best to apply these well developed summary data methods to individual level data, and to explore the use of more sophisticated causal methods allowing for non-linearity and effect modification. In this paper we describe a general procedure for optimally applying any two sample summary data method using one sample data. Our procedure first performs a meta-analysis of summary data estimates that are intentionally contaminated by collider bias between the genetic instruments and unmeasured confounders, due to conditioning on the observed exposure. A weighted sum of these estimates is then used to correct the standard observational association between an exposure and outcome. Simulations are conducted to demonstrate the method’s performance against naive applications of two sample summary data MR. We apply the approach to the UK Biobank cohort to investigate the causal role of sleep disturbance on HbA1c levels, an important determinant of diabetes. Our approach is closely related to the work of Dudbridge et al. (Nat. Comm. 10: 1561), who developed a technique to adjust for index event bias when uncovering genetic predictors of disease progression based on case-only data. Our paper serves to clarify that in any one sample MR analysis, it can be advantageous to estimate causal relationships by artificially inducing and then correcting for collider bias.

Exploiting collider bias to apply two-sample summary data Mendelian randomization methods to one-sample individual level data

PLoS Genetics ◽

10.1371/journal.pgen.1009703 ◽

2021 ◽

Vol 17 (8) ◽

pp. e1009703

Author(s):

Ciarrah Barry ◽

Junxi Liu ◽

Rebecca Richmond ◽

Martin K. Rutter ◽

Deborah A. Lawlor ◽

...

Keyword(s):

Mendelian Randomization ◽

Association Studies ◽

General Procedure ◽

Meta Analysis ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Individual Level ◽

Level Data ◽

Summary Data ◽

Collider Bias

Over the last decade the availability of SNP-trait associations from genome-wide association studies has led to an array of methods for performing Mendelian randomization studies using only summary statistics. A common feature of these methods, besides their intuitive simplicity, is the ability to combine data from several sources, incorporate multiple variants and account for biases due to weak instruments and pleiotropy. With the advent of large and accessible fully-genotyped cohorts such as UK Biobank, there is now increasing interest in understanding how best to apply these well developed summary data methods to individual level data, and to explore the use of more sophisticated causal methods allowing for non-linearity and effect modification. In this paper we describe a general procedure for optimally applying any two sample summary data method using one sample data. Our procedure first performs a meta-analysis of summary data estimates that are intentionally contaminated by collider bias between the genetic instruments and unmeasured confounders, due to conditioning on the observed exposure. These estimates are then used to correct the standard observational association between an exposure and outcome. Simulations are conducted to demonstrate the method’s performance against naive applications of two sample summary data MR. We apply the approach to the UK Biobank cohort to investigate the causal role of sleep disturbance on HbA1c levels, an important determinant of diabetes. Our approach can be viewed as a generalization of Dudbridge et al. (Nat. Comm. 10: 1561), who developed a technique to adjust for index event bias when uncovering genetic predictors of disease progression based on case-only data. Our work serves to clarify that in any one sample MR analysis, it can be advantageous to estimate causal relationships by artificially inducing and then correcting for collider bias.

Estimating genetic correlation jointly using individual-level and summary-level GWAS data

10.1101/2021.08.18.456908 ◽

2021 ◽

Author(s):

Yiliang Zhang ◽

Youshu Cheng ◽

Yixuan Ye ◽

Wei Jiang ◽

Qiongshi Lu ◽

...

Keyword(s):

Genetic Correlation ◽

Association Studies ◽

Real Data ◽

Efficient Estimation ◽

Risk Scores ◽

Genome Wide Association Studies ◽

Individual Level ◽

Correlation Estimation ◽

Level Data ◽

Summary Data

Abstract With the increasing accessibility of individual-level data from genome wide association studies, it is now common for researchers to have individual-level data of some traits in one specific population. For some traits, we can only access publicly released summary-level data due to privacy and safety concerns. The current methods to estimate genetic correlation can only be applied when the input data type of the two traits of interest is either both individual-level or both summary-level. When researchers have access to individual-level data for one trait and summary-level data for the other, they have to transform the individual-level data to summary-level data first and then apply summary data-based methods to estimate the genetic correlation. This procedure is computationally and statistically inefficient and introduces information loss. We introduce GENJI (Genetic correlation EstimatioN Jointly using Individual-level and summary data), a method that can estimate within-population or transethnic genetic correlation based on individual-level data for one trait and summary-level data for another trait. Through extensive simulations and analyses of real data on within-population and transethnic genetic correlation estimation, we show that GENJI produces more reliable and efficient estimation than summary data-based methods. Moreover, when individual-level data are available for both traits, GENJI can achieve performance comparable to that of individual-level data-based methods. Downstream applications of genetic correlation can benefit from more accurate estimates. In particular, we show that more accurate genetic correlation estimation facilitates the predictability of cross-population polygenic risk scores.

Cancer PRSweb – an Online Repository with Polygenic Risk Scores (PRS) for Major Cancer Traits and Their Phenome-wide Exploration in Two Independent Biobanks

10.1101/2020.01.22.915751 ◽

2020 ◽

Cited By ~ 1

Author(s):

Lars G. Fritsche ◽

Snehal Patil ◽

Lauren J. Beesley ◽

Peter VandeHaar ◽

Maxwell Salvatore ◽

...

Keyword(s):

Association Studies ◽

Predictive Performance ◽

Risk Scores ◽

P Value ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Uk Biobank ◽

Polygenic Risk ◽

Genome Wide ◽

Study Results

Abstract To facilitate scientific collaboration on polygenic risk scores (PRS) research, we created an extensive PRS online repository for 49 common cancer traits integrating freely available genome-wide association studies (GWAS) summary statistics from three sources: published GWAS, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWAS. Our framework condenses these summary statistics into PRS using various approaches such as linkage disequilibrium pruning / p-value thresholding (fixed or data-adaptively optimized thresholds) and penalized, genome-wide effect size weighting. We evaluated the PRS in two biobanks: the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and the population-based UK Biobank (UKB). For each PRS construct, we provide measures on predictive performance, calibration, and discrimination. Besides PRS evaluation, the Cancer-PRSweb platform features construct downloads and phenome-wide PRS association study results (PRS-PheWAS) for predictive PRS. We expect this integrated platform to accelerate PRS-related cancer research.

metaCCA: Summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis

10.1101/022665 ◽

2015 ◽

Cited By ~ 1

Author(s):

Anna Cichonska ◽

Juho Rousu ◽

Pekka Marttinen ◽

Antti J Kangas ◽

Pasi Soininen ◽

...

Keyword(s):

Correlation Analysis ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

Statistical Power ◽

Association Studies ◽

Meta Analysis ◽

Original Data ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Individual Level

A dominant approach to genetic association studies is to perform univariate tests between genotype-phenotype pairs. However, analysing related traits together increases statistical power, and certain complex associations become detectable only when several variants are tested jointly. Currently, modest sample sizes of individual cohorts and restricted availability of individual-level genotype-phenotype data across the cohorts limit conducting multivariate tests. We introduce metaCCA, a computational framework for summary statistics-based analysis of a single or multiple studies that allows multivariate representation of both genotype and phenotype. It extends the statistical technique of canonical correlation analysis to the setting where original individual-level records are not available, and employs a covariance shrinkage algorithm to achieve robustness. Multivariate meta-analysis of two Finnish studies of nuclear magnetic resonance metabolomics by metaCCA, using standard univariate output from the program SNPTEST, shows an excellent agreement with the pooled individual-level analysis of original data. Motivated by strong multivariate signals in the lipid genes tested, we envision that multivariate association testing using metaCCA has a great potential to provide novel insights from already published summary statistics from high-throughput phenotyping technologies.

Bayesian large-scale multiple regression with summary statistics from genome-wide association studies

10.1101/042457 ◽

2016 ◽

Cited By ~ 5

Author(s):

Xiang Zhu ◽

Matthew Stephens

Keyword(s):

Multiple Regression ◽

Large Scale ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Individual Level ◽

Genome Wide ◽

Level Data ◽

Wide Range

Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a “Regression with Summary Statistics” (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which also can be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously-proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations RSS performs similarly to analyses using the individual data, both for estimating heritability and detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously-unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss.
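For orientation, the RSS likelihood described in this abstract is usually written in the form below. The notation is an assumption made here for illustration and is not taken from the abstract itself: $\hat{\beta}$ is the vector of univariate effect estimates, $\hat{S}$ the diagonal matrix of their standard errors, $\hat{R}$ the estimated SNP correlation (LD) matrix, and $\beta$ the vector of multiple regression coefficients.

```latex
\hat{\beta} \mid \beta \;\sim\; \mathcal{N}\!\left( \hat{S}\,\hat{R}\,\hat{S}^{-1}\beta,\;\; \hat{S}\,\hat{R}\,\hat{S} \right)
```

Read this way, the univariate estimates are the joint effects blurred by the correlation among SNPs, which is why only summary statistics and a reference LD matrix are needed.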

PUMAS: fine-tuning polygenic risk scores with GWAS summary statistics

Genome Biology ◽

10.1186/s13059-021-02479-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Zijie Zhao ◽

Yanyao Yi ◽

Jie Song ◽

Yuchang Wu ◽

Xiaoyuan Zhong ◽

...

Keyword(s):

Statistical Power ◽

Human Genetics ◽

Association Studies ◽

Fine Tuning ◽

Risk Scores ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Polygenic Risk ◽

Individual Level ◽

Genetics Research

Abstract Polygenic risk scores (PRSs) have wide applications in human genetics research, but often include tuning parameters which are difficult to optimize in practice due to limited access to individual-level data. Here, we introduce PUMAS, a novel method to fine-tune PRS models using summary statistics from genome-wide association studies (GWASs). Through extensive simulations, external validations, and analysis of 65 traits, we demonstrate that PUMAS can perform various model-tuning procedures using GWAS summary statistics and effectively benchmark and optimize PRS models under diverse genetic architecture. Furthermore, we show that fine-tuned PRSs will significantly improve statistical power in downstream association analysis.

S173. GENOME-WIDE ASSOCIATION STUDIES OF SCHIZOPHRENIA AND BIPOLAR DISORDER IN A DIVERSE COHORT OF US VETERANS

Schizophrenia Bulletin ◽

10.1093/schbul/sbaa031.239 ◽

2020 ◽

Vol 46 (Supplement_1) ◽

pp. S103-S103

Author(s):

Tim Bigdeli ◽

Ayman Fanous ◽

Nallakkandi Rajeevan ◽

Frederick Sayward ◽

Yuli Li ◽

...

Keyword(s):

Bipolar Disorder ◽

Association Studies ◽

Meta Analysis ◽

Genome Wide Association ◽

Risk Scores ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Polygenic Risk ◽

Genome Wide ◽

Us Veterans

Abstract Background: Schizophrenia and bipolar disorder are debilitating neuropsychiatric illnesses collectively affecting 2% of the world’s population, and which cause tremendous human suffering that impacts patients, their families and their communities. Recognizing the major impact of these disorders on the psychosocial function of more than 200,000 US Veterans, the Department of Veterans Affairs (VA) recently completed genotyping of nearly 9,000 veterans with schizophrenia or bipolar I disorder in Cooperative Studies Program (CSP) #572: “Genetics of Functional Disability in Schizophrenia and Bipolar Illness”, all of whom were extensively assessed for neurocognitive function and disability, and genotyped using a custom Affymetrix Axiom Biobank array. Methods: Primary genome-wide association studies (GWAS) of schizophrenia and bipolar disorder were performed across and within ancestry groups, with attempted replication in matched subjects from the PGC and Genomic Psychiatry Cohort (GPC). We combined results for CSP#572 with available summary statistics from the PGC, Indonesia Schizophrenia Consortium and Genetic REsearch on schizophreniA neTwork-China and Netherland (GREAT-CN) study, and multi-ethnic GPC cohorts, achieving among the largest and most diverse studies of these disorders to date. Results: Polygenic risk scores based on published PGC summary statistics for schizophrenia or bipolar disorder were significantly associated with case status among European-American (EA) (P<10^-30) and African-American (AA) (P<0.0005) participants in CSP#572. Our primary analyses of schizophrenia yielded a single genome-wide significant association with variants in CHD7 at 8q12.2 for EA participants, which remained significant in a joint analysis of EA and AA subjects (P=4.62e-08). While no genome-wide significant associations were detected by our within-ancestry analyses of bipolar disorder, a cross-ancestry meta-analysis of CSP#572 participants yielded a significant finding at 10q25 with variants in SORCS3 (P=2.62e-08). Among loci attaining P<0.0001 in our within-ancestry analyses, 4 and 8 subsequently achieved genome-wide significance, respectively, when jointly analyzed with matched subjects from the PGC and GPC. Combining our results with published summary statistics, we performed a cross-ancestry GWAS meta-analysis of 69,280 schizophrenia cases and 138,379 controls, identifying 200 genome-wide significant loci of which 76 are newly reported here. Cross-ancestry analysis of 28,326 bipolar cases and 90,570 controls identified 24 genome-wide significant loci, including novel associations with common variants in PAX5, DOCK2, MACROD2, BRE, KCNG1, and LINC01378. Discussion: We newly describe genome-wide analyses in a diverse cohort of US Veterans with schizophrenia or bipolar disorder, benchmarking the predictive value of polygenic risk scores based on published GWAS findings. Leveraging available summary statistics from studies of global populations, we add to burgeoning lists of genomic loci implicated in the etiologies of these disorders.

Estimating genetic correlation jointly using individual-level and summary-level GWAS data

10.21203/rs.3.rs-830770/v1 ◽

2021 ◽

Author(s):

Hongyu Zhao ◽

Yiliang Zhang ◽

Youshu Cheng ◽

Yixuan Ye ◽

Wei Jiang ◽

...

Keyword(s):

Genetic Correlation ◽

Association Studies ◽

Real Data ◽

Efficient Estimation ◽

Risk Scores ◽

Genome Wide Association Studies ◽

Individual Level ◽

Correlation Estimation ◽

Level Data ◽

Summary Data

Abstract With the increasing accessibility of individual-level data from genome wide association studies, it is now common for researchers to have individual-level data of some traits in one specific population. For some traits, we can only access publicly released summary-level data due to privacy and safety concerns. The current methods to estimate genetic correlation can only be applied when the input data type of the two traits of interest is either both individual-level or both summary-level. When researchers have access to individual-level data for one trait and summary-level data for the other, they have to transform the individual-level data to summary-level data first and then apply summary data-based methods to estimate the genetic correlation. This procedure is computationally and statistically inefficient and introduces information loss. We introduce GENJI (Genetic correlation EstimatioN Jointly using Individual-level and summary data), a method that can estimate within-population or transethnic genetic correlation based on individual-level data for one trait and summary-level data for another trait. Through extensive simulations and analyses of real data on within-population and transethnic genetic correlation estimation, we show that GENJI produces more reliable and efficient estimation than summary data-based methods. Moreover, when individual-level data are available for both traits, GENJI can achieve performance comparable to that of individual-level data-based methods. Downstream applications of genetic correlation can benefit from more accurate estimates. In particular, we show that more accurate genetic correlation estimation facilitates the predictability of cross-population polygenic risk scores.
