Estimating genetic correlation jointly using individual-level and summary-level GWAS data

DOI: 10.21203/rs.3.rs-830770/v1, 2021

Author(s): Hongyu Zhao, Yiliang Zhang, Youshu Cheng, Yixuan Ye, Wei Jiang, ...

Keyword(s): Genetic Correlation, Association Studies, Real Data, Efficient Estimation, Risk Scores, Genome Wide Association Studies, Individual Level, Correlation Estimation, Level Data, Summary Data

Abstract: With the increasing accessibility of individual-level data from genome-wide association studies, it is now common for researchers to have individual-level data for some traits in one specific population. For other traits, only publicly released summary-level data are accessible due to privacy and safety concerns. Current methods for estimating genetic correlation can only be applied when the input data for the two traits of interest are either both individual-level or both summary-level. When researchers have access to individual-level data for one trait and summary-level data for the other, they have to transform the individual-level data to summary-level data first and then apply summary data-based methods to estimate the genetic correlation. This procedure is computationally and statistically inefficient and introduces information loss. We introduce GENJI (Genetic correlation EstimatioN Jointly using Individual-level and summary data), a method that can estimate within-population or transethnic genetic correlation based on individual-level data for one trait and summary-level data for another trait. Through extensive simulations and analyses of real data on within-population and transethnic genetic correlation estimation, we show that GENJI produces more reliable and efficient estimates than summary data-based methods. Moreover, when individual-level data are available for both traits, GENJI can achieve performance comparable to that of individual-level data-based methods. Downstream applications of genetic correlation can benefit from more accurate estimates. In particular, we show that more accurate genetic correlation estimation improves the predictive performance of cross-population polygenic risk scores.
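
The abstract refers to transforming individual-level data into summary-level data before applying summary data-based methods. As a rough illustration of what that conversion usually involves, the sketch below computes per-SNP marginal association z-scores for a quantitative trait. It is a generic single-SNP regression, not the GENJI procedure, and the function name, simulated genotypes, and effect sizes are all hypothetical.

```python
import numpy as np


def marginal_z_scores(genotypes, phenotype):
    """Per-SNP marginal association z-scores from individual-level data.

    genotypes: (n_samples, n_snps) array of allele counts (no monomorphic SNPs).
    phenotype: (n_samples,) quantitative trait.
    """
    n = genotypes.shape[0]
    # Standardize each SNP and the phenotype to mean 0 and variance 1.
    g = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    y = (phenotype - phenotype.mean()) / phenotype.std()
    # With standardized inputs the marginal regression slope equals the correlation.
    beta = g.T @ y / n
    # Standard error of a simple-regression slope with n - 2 degrees of freedom.
    se = np.sqrt((1.0 - beta**2) / (n - 2))
    return beta / se


# Hypothetical simulated cohort: 2000 people, 50 independent SNPs.
rng = np.random.default_rng(0)
n_samples, n_snps = 2000, 50
genotypes = rng.binomial(2, 0.3, size=(n_samples, n_snps)).astype(float)
effects = np.zeros(n_snps)
effects[0] = 0.3  # assumption: only the first SNP is causal
phenotype = genotypes @ effects + rng.normal(size=n_samples)

z = marginal_z_scores(genotypes, phenotype)
```

Once only `z` is retained, the joint genotype-phenotype information in `genotypes` and `phenotype` is no longer available to downstream methods, which is the information loss the abstract describes.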

Comparison of methods for estimating genetic correlation between complex traits using GWAS summary statistics

DOI: 10.1101/2020.10.12.336867, 2020, Cited By ~ 1

Author(s): Yiliang Zhang, Youshu Cheng, Wei Jiang, Yixuan Ye, Qiongshi Lu, ...

Keyword(s): Genetic Correlation, Complex Traits, Association Studies, Genetic Correlations, Real Data, Estimation Methods, Easy Access, Genome Wide Association Studies, Summary Statistics, Correlation Estimation

Abstract: Genetic correlation is the correlation of additive genetic effects on two phenotypes. It is an informative metric for quantifying the overall genetic similarity between complex traits, and it provides insights into their polygenic genetic architecture. Several methods have been proposed to estimate genetic correlations based on data collected from genome-wide association studies (GWAS). Because GWAS summary statistics are easy to access and computationally efficient to work with, methods that require only GWAS summary statistics as input have become more popular than methods that use individual-level genotype data. Here, we present a benchmark study of different summary-statistics-based genetic correlation estimation methods through simulation and real data applications. We focus on two major technical challenges in estimating genetic correlation: marker dependency caused by linkage disequilibrium (LD) and sample overlap between different studies. To assess the performance of different methods in the presence of these two challenges, we first conducted comprehensive simulations with diverse LD patterns and sample overlaps. We then applied these methods to real GWAS summary statistics for a wide spectrum of complex traits. Based on these experiments, we conclude that methods relying on accurate LD estimation are less robust in real data applications than other methods, because LD obtained from reference panels is imprecise. Our findings offer guidance on how to choose an appropriate method for genetic correlation estimation in post-GWAS analysis and interpretation.
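
To make the opening definition concrete, the sketch below simulates per-SNP additive effects on two traits and computes their correlation directly. This is only an illustration of the quantity being estimated: the SNP count and the target correlation of 0.6 are hypothetical, and in practice the true effects are unobserved, which is why the estimation methods benchmarked in this study are needed.

```python
import numpy as np


def effect_correlation(beta1, beta2):
    """Correlation between per-SNP additive effects on two traits."""
    return np.corrcoef(beta1, beta2)[0, 1]


rng = np.random.default_rng(1)
n_snps = 5000
true_rg = 0.6  # assumption: target genetic correlation for the simulation

# Draw each SNP's pair of effects from a bivariate normal whose correlation
# is true_rg; the 1 / n_snps scaling keeps total genetic variance near 1.
cov = np.array([[1.0, true_rg], [true_rg, 1.0]]) / n_snps
effects = rng.multivariate_normal(np.zeros(2), cov, size=n_snps)
beta_trait1, beta_trait2 = effects[:, 0], effects[:, 1]

rg = effect_correlation(beta_trait1, beta_trait2)
```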

Exploiting collider bias to apply two-sample summary data Mendelian randomization methods to one-sample individual level data

DOI: 10.1101/2020.10.20.20216358, 2020

Author(s): Ciarrah Barry, Junxi Liu, Rebecca Richmond, Martin K Rutter, Deborah A Lawlor, ...

Keyword(s): Mendelian Randomization, Association Studies, General Procedure, Meta Analysis, Genome Wide Association Studies, UK Biobank, Individual Level, Level Data, Summary Data, Collider Bias

Abstract: Over the last decade the availability of SNP-trait associations from genome-wide association studies data has led to an array of methods for performing Mendelian randomization studies using only summary statistics. A common feature of these methods, besides their intuitive simplicity, is the ability to combine data from several sources, incorporate multiple variants and account for biases due to weak instruments and pleiotropy. With the advent of large and accessible fully-genotyped cohorts such as UK Biobank, there is now increasing interest in understanding how best to apply these well-developed summary data methods to individual-level data, and to explore the use of more sophisticated causal methods allowing for non-linearity and effect modification. In this paper we describe a general procedure for optimally applying any two-sample summary data method using one-sample data. Our procedure first performs a meta-analysis of summary data estimates that are intentionally contaminated by collider bias between the genetic instruments and unmeasured confounders, due to conditioning on the observed exposure. A weighted sum of these estimates is then used to correct the standard observational association between an exposure and outcome. Simulations are conducted to demonstrate the method’s performance against naive applications of two-sample summary data MR. We apply the approach to the UK Biobank cohort to investigate the causal role of sleep disturbance on HbA1c levels, an important determinant of diabetes. Our approach is closely related to the work of Dudbridge et al. (Nat. Comm. 10: 1561), who developed a technique to adjust for index event bias when uncovering genetic predictors of disease progression based on case-only data. Our paper serves to clarify that in any one-sample MR analysis, it can be advantageous to estimate causal relationships by artificially inducing and then correcting for collider bias.

Exploiting collider bias to apply two-sample summary data Mendelian randomization methods to one-sample individual level data

PLoS Genetics, DOI: 10.1371/journal.pgen.1009703, 2021, Vol 17 (8), pp. e1009703

Author(s): Ciarrah Barry, Junxi Liu, Rebecca Richmond, Martin K. Rutter, Deborah A. Lawlor, ...

Keyword(s): Mendelian Randomization, Association Studies, General Procedure, Meta Analysis, Genome Wide Association Studies, UK Biobank, Individual Level, Level Data, Summary Data, Collider Bias

Abstract: Over the last decade the availability of SNP-trait associations from genome-wide association studies has led to an array of methods for performing Mendelian randomization studies using only summary statistics. A common feature of these methods, besides their intuitive simplicity, is the ability to combine data from several sources, incorporate multiple variants and account for biases due to weak instruments and pleiotropy. With the advent of large and accessible fully-genotyped cohorts such as UK Biobank, there is now increasing interest in understanding how best to apply these well-developed summary data methods to individual-level data, and to explore the use of more sophisticated causal methods allowing for non-linearity and effect modification. In this paper we describe a general procedure for optimally applying any two-sample summary data method using one-sample data. Our procedure first performs a meta-analysis of summary data estimates that are intentionally contaminated by collider bias between the genetic instruments and unmeasured confounders, due to conditioning on the observed exposure. These estimates are then used to correct the standard observational association between an exposure and outcome. Simulations are conducted to demonstrate the method’s performance against naive applications of two-sample summary data MR. We apply the approach to the UK Biobank cohort to investigate the causal role of sleep disturbance on HbA1c levels, an important determinant of diabetes. Our approach can be viewed as a generalization of Dudbridge et al. (Nat. Comm. 10: 1561), who developed a technique to adjust for index event bias when uncovering genetic predictors of disease progression based on case-only data. Our work serves to clarify that in any one-sample MR analysis, it can be advantageous to estimate causal relationships by artificially inducing and then correcting for collider bias.

Leveraging both individual-level genetic data and GWAS summary statistics increases polygenic prediction

DOI: 10.1101/2020.11.27.401141, 2020

Author(s): Clara Albiñana, Jakob Grove, John J. McGrath, Esben Agerbo, Naomi R. Wray, ...

Keyword(s): Association Studies, Meta Analysis, Training Sample, Risk Scores, Large Individual, Genome Wide Association Studies, Summary Statistics, UK Biobank, Individual Level, Level Data

Abstract: The accuracy of polygenic risk scores (PRSs) in predicting complex diseases increases with the training sample size. PRSs are generally derived from summary statistics from large meta-analyses of multiple genome-wide association studies (GWAS). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how best to combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combining data is the meta-analysis of GWAS summary statistics (Meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using twelve real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare Meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and Meta-PRS. We find that, when large individual-level data is available, the linear combination of PRSs (Meta-PRS) is both a simple alternative to Meta-GWAS and often more accurate.
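
The abstract describes Meta-PRS only as a linear combination of PRSs. One plausible way to obtain such a combination, sketched below, is to regress the phenotype on the individual PRSs in a training subset and apply the fitted weights elsewhere. The least-squares weighting, the function names, and the simulated scores are assumptions made for illustration, not details confirmed by the source.

```python
import numpy as np


def fit_meta_prs_weights(prs_matrix, phenotype):
    """Least-squares weights (intercept first) for combining PRS columns."""
    design = np.column_stack([np.ones(len(phenotype)), prs_matrix])
    weights, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return weights


def apply_meta_prs(prs_matrix, weights):
    """Combined score: intercept plus weighted sum of the PRS columns."""
    return weights[0] + prs_matrix @ weights[1:]


# Hypothetical data: two noisy PRSs for the same underlying genetic value.
rng = np.random.default_rng(2)
n = 4000
genetic_value = rng.normal(size=n)
prs_external = genetic_value + rng.normal(scale=1.0, size=n)  # from summary statistics
prs_internal = genetic_value + rng.normal(scale=1.5, size=n)  # from individual-level data
phenotype = genetic_value + rng.normal(size=n)
prs = np.column_stack([prs_external, prs_internal])

# Fit the weights on one half of the sample and evaluate on the other half.
train, test = slice(0, 2000), slice(2000, None)
weights = fit_meta_prs_weights(prs[train], phenotype[train])
combined = apply_meta_prs(prs[test], weights)
r2_combined = np.corrcoef(combined, phenotype[test])[0, 1] ** 2
```

Fitting the weights on a subset separate from the evaluation subset avoids overstating the accuracy of the combined score.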

Cigarette smoking and personality: interrogating causality using Mendelian randomisation

Psychological Medicine, DOI: 10.1017/s0033291718003069, 2018, Vol 49 (13), pp. 2197-2205, Cited By ~ 1

Author(s): Hannah M. Sallis, George Davey Smith, Marcus R. Munafò

Keyword(s): Personality Traits, Association Studies, Smoking Initiation, Mendelian Randomisation, Genome Wide Association Studies, Individual Level, Causal Pathways, Genome Wide, Level Data, Causal Nature

Abstract: Background: Despite the well-documented association between smoking and personality traits such as neuroticism and extraversion, little is known about the potential causal nature of these findings. If it were possible to unpick the association between personality and smoking, it may be possible to develop tailored smoking interventions that could lead to both improved uptake and efficacy. Methods: Recent genome-wide association studies (GWAS) have identified variants robustly associated with both smoking phenotypes and personality traits. Here we use publicly available GWAS summary statistics in addition to individual-level data from UK Biobank to investigate the link between smoking and personality. We first estimate genetic overlap between traits using LD score regression and then use bidirectional Mendelian randomisation methods to unpick the nature of this relationship. Results: We found clear evidence of a modest genetic correlation between smoking behaviours and both neuroticism and extraversion. We found some evidence that personality traits are causally linked to certain smoking phenotypes: among current smokers each additional neuroticism risk allele was associated with smoking an additional 0.07 cigarettes per day (95% CI 0.02–0.12, p = 0.009), and each additional extraversion effect allele was associated with an elevated odds of smoking initiation (OR 1.015, 95% CI 1.01–1.02, p = 9.6 × 10⁻⁷). Conclusion: We found some evidence for specific causal pathways from personality to smoking phenotypes, and weaker evidence of an association from smoking initiation to personality. These findings could be used to inform future smoking interventions or to tailor existing schemes.

Estimating Heritability and Genetic Correlation in Case Control Studies Directly and with Summary Statistics

DOI: 10.1101/256388, 2018

Author(s): Omer Weissbrod, Jonathan Flint, Saharon Rosset

Keyword(s): Genetic Correlation, Association Studies, Genetic Correlations, Large Data, Case Control, Data Sets, Genome Wide Association Studies, Summary Statistics, Case Control Studies, Individual Level

Abstract: Methods that estimate heritability and genetic correlations from genome-wide association studies have proven to be powerful tools for investigating the genetic architecture of common diseases and exposing unexpected relationships between disorders. Many relevant studies employ a case-control design, yet most methods are primarily geared towards analyzing quantitative traits. Here we investigate the validity of three common methods for estimating heritability and genetic correlation. We find that the Phenotype-Correlation-Genotype-Correlation (PCGC) approach is the only method that can estimate both quantities accurately in the presence of important non-genetic risk factors, such as age and sex. We extend PCGC to work with summary statistics that take the case-control sampling into account, and demonstrate that our new method, PCGC-s, accurately estimates both heritability and genetic correlations and can be applied to large data sets without requiring individual-level genotypic or phenotypic information. Finally, we use PCGC-s to estimate the genetic correlation between schizophrenia and bipolar disorder, and demonstrate that previous estimates are biased due to incorrect handling of sex as a strong risk factor. PCGC-s is available at https://github.com/omerwe/PCGCs.

A unifying framework for summary statistic imputation

DOI: 10.1101/292664, 2018

Author(s): Yue Wu, Eleazar Eskin, Sriram Sankararaman

Keyword(s): Association Studies, Causal Variant, Genome Wide Association Studies, Multivariate Normal, Summary Statistic, Computationally Efficient, Imputation Methods, Individual Level, Genome Wide, Level Data

Abstract: Imputation has been widely utilized to aid and interpret the results of Genome-Wide Association Studies (GWAS). Imputation can increase the power to identify associations when the causal variant was not directly observed or typed in the GWAS. There are two broad classes of methods for imputation. The first class imputes the genotypes at the untyped variants given the genotypes at the typed variants and then performs a statistical test of association at the imputed variants. The second class of methods, summary statistic imputation, directly imputes the association statistics at the untyped variants given the association statistics observed at the typed variants. This second class is appealing because it tends to be computationally efficient and requires only the summary statistics from a study, whereas the first class requires access to individual-level data that can be difficult to obtain. The statistical properties of these two classes of imputation methods have not been fully understood. In this paper, we show that the two classes of imputation methods are equivalent, i.e., have identical asymptotic multivariate normal distributions with zero mean and minor variations in the covariance matrix, under some reasonable assumptions. Using this equivalence, we can understand the effect of imputation methods on power. We show that a commonly employed modification of summary statistic imputation, which we term summary statistic imputation with variance re-weighting, generally leads to a loss in power. On the other hand, our proposed method, summary statistic imputation without performing variance re-weighting, fully accounts for imputation uncertainty while achieving better power.
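
The abstract gives no formulas, so the sketch below falls back on the standard conditional multivariate normal result that summary statistic imputation is commonly based on: the z-score at an untyped variant is predicted from the typed z-scores through the LD matrix. It is a generic illustration under that assumption, not necessarily the exact estimator proposed in the paper, and the three-variant LD pattern is hypothetical. No variance re-weighting is applied.

```python
import numpy as np


def impute_summary_statistics(z_typed, ld_typed, ld_untyped_typed):
    """Conditional-mean imputation of z-scores at untyped variants.

    z_typed:          (n_typed,) observed z-scores.
    ld_typed:         (n_typed, n_typed) LD correlation among typed variants.
    ld_untyped_typed: (n_untyped, n_typed) LD between untyped and typed variants.
    Returns the imputed z-scores and the conditional variance of each one.
    """
    # Imputation weights: ld_untyped_typed @ inverse(ld_typed).
    weights = np.linalg.solve(ld_typed, ld_untyped_typed.T).T
    z_imputed = weights @ z_typed
    # Conditional variance: 1 minus the variance explained by the typed variants.
    cond_var = 1.0 - np.sum(weights * ld_untyped_typed, axis=1)
    return z_imputed, cond_var


# Hypothetical example: three variants in a row with neighbouring LD of 0.8;
# the middle variant is untyped and the two outer variants are typed.
ld_typed = np.array([[1.0, 0.64], [0.64, 1.0]])
ld_untyped_typed = np.array([[0.8, 0.8]])
z_typed = np.array([3.0, 2.0])

z_imputed, cond_var = impute_summary_statistics(z_typed, ld_typed, ld_untyped_typed)
```

In practice the LD matrix is estimated from a reference panel and is often ill-conditioned, so a regularization term is usually added before solving; that step is omitted here to keep the sketch short.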

Improved polygenic prediction by Bayesian multiple regression on summary statistics

Nature Communications, DOI: 10.1038/s41467-019-12653-0, 2019, Vol 10 (1), Cited By ~ 34

Author(s): Luke R. Lloyd-Jones, Jian Zeng, Julia Sidorenko, Loïc Yengo, Gerhard Moser, ...

Keyword(s): Multiple Regression, Association Studies, Meta Analysis, Multiple Regression Model, Data Sets, Genome Wide Association Studies, Summary Statistics, Individual Level, Level Data, The UK

Abstract: Accurate prediction of an individual’s phenotype from their DNA sequence is one of the great promises of genomics and precision medicine. We extend a powerful individual-level data Bayesian multiple regression model (BayesR) to one that utilises summary statistics from genome-wide association studies (GWAS), SBayesR. In simulation and cross-validation using 12 real traits and 1.1 million variants on 350,000 individuals from the UK Biobank, SBayesR improves prediction accuracy relative to commonly used state-of-the-art summary statistics methods at a fraction of the computational resources. Furthermore, using summary statistics for variants from the largest GWAS meta-analysis (n ≈ 700,000) on height and BMI, we show that, on average across traits and two independent data sets, SBayesR improves prediction R² by 5.2% relative to LDpred and by 26.5% relative to clumping and p-value thresholding.

Large uncertainty in individual PRS estimation impacts PRS-based risk stratification

DOI: 10.1101/2020.11.30.403188, 2020

Author(s): Yi Ding, Kangcheng Hou, Kathryn S. Burch, Sandra Lapinska, Florian Privé, ...

Keyword(s): Large Scale, Association Studies, Probabilistic Approach, Risk Scores, Genome Wide Association Studies, Single Individual, Individual Level, Credible Intervals, The UK, Genetic Value

Abstract: Large-scale genome-wide association studies have enabled polygenic risk scores (PRS), which estimate the genetic value of an individual for a given trait. Since PRS accuracy is typically assessed using cohort-level metrics (e.g., R²), uncertainty in PRS estimates at the individual level remains underexplored. Here we show that Bayesian PRS methods can estimate the variance of individual PRS and can yield well-calibrated credible intervals for the genetic value of a single individual. For real traits in the UK Biobank (N=291,273 unrelated “white British”), we observe large variance in individual PRS estimates, which impacts interpretation of PRS-based stratification; for example, averaging across 11 traits, only 1.8% (s.d. 2.4%) of individuals with PRS point estimates in the top decile have their 95% credible intervals fully contained in the top decile. To account for this uncertainty, we propose a probabilistic approach to PRS-based stratification that estimates the probability that an individual’s genetic value is above a prespecified threshold. Our results showcase the importance of incorporating uncertainty in individual PRS estimates into subsequent analyses.
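
The probabilistic stratification described above can be illustrated with posterior draws of each individual's genetic value. The sketch below assumes such draws are already available from some Bayesian PRS method and computes equal-tailed credible intervals and the probability of exceeding a threshold. The function name, the simulated normal posterior, and the threshold are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np


def summarize_individual_prs(posterior_samples, threshold, level=0.95):
    """Credible intervals and exceedance probabilities per individual.

    posterior_samples: (n_draws, n_individuals) posterior draws of genetic value.
    threshold:         prespecified cutoff on the genetic-value scale.
    """
    tail = (1.0 - level) / 2.0
    # Equal-tailed credible interval for each individual (one per column).
    lower, upper = np.quantile(posterior_samples, [tail, 1.0 - tail], axis=0)
    # Fraction of draws above the threshold estimates P(genetic value > threshold).
    prob_above = (posterior_samples > threshold).mean(axis=0)
    return lower, upper, prob_above


# Hypothetical posterior: three individuals with posterior means 0, 1 and 2
# and a common posterior standard deviation of 0.5.
rng = np.random.default_rng(3)
posterior = rng.normal(loc=[0.0, 1.0, 2.0], scale=0.5, size=(5000, 3))

lower, upper, prob_above = summarize_individual_prs(posterior, threshold=1.0)
```

An individual whose point estimate sits exactly at the threshold gets a probability near one half, which is the kind of uncertainty that a hard cutoff on point estimates hides.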
