Genomic prediction using individual-level data and summary statistics from multiple populations

ABSTRACT: This study presents a method for genomic prediction that uses individual-level data and summary statistics from multiple populations. Genome-wide markers are now widely used to predict complex traits, and genomic prediction using multi-population data is an appealing approach to achieve higher prediction accuracies. However, sharing individual-level data across populations is not always possible. We present a method that enables integration of summary statistics from separate analyses with the available individual-level data. The data can consist of individuals with either single or multiple (weighted) phenotype records. We developed a method based on a hypothetical joint analysis model and absorption of population-specific information. We show that population-specific information is fully captured by estimated allele substitution effects and the accuracy of those estimates, i.e., the summary statistics. The method gives results identical to those of a joint analysis of all individual-level data when complete summary statistics are available. We provide a series of easy-to-use approximations that can be used when complete summary statistics are not available or are impractical to share. Simulations show that the approximations enable integration of different sources of information across a wide range of settings, yielding accurate predictions. The method can be readily extended to multiple traits. In summary, the developed method enables integration of genome-wide data in individual-level or summary-statistics form from multiple populations to obtain more accurate estimates of allele substitution effects and genomic predictions.
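The equivalence claimed in this abstract can be illustrated with a minimal sketch. It assumes a plain ridge-regression (SNP-BLUP) model with a shared shrinkage parameter `lam`, no fixed effects, and toy genotypes and phenotypes invented for the example; the helper names are hypothetical and this is not necessarily the authors' exact formulation. Population 2 shares only its estimated effects `b2` and the coefficient matrix `C2` that carries their accuracy, and the integrated solution matches the joint analysis of both data sets.

```python
from fractions import Fraction  # exact arithmetic, so the final comparison is exact

def xtx(X):
    """X'X for a genotype matrix stored as a list of rows."""
    p = len(X[0])
    return [[sum(Fraction(r[i]) * r[j] for r in X) for j in range(p)] for i in range(p)]

def xty(X, y):
    """X'y for a genotype matrix X and a phenotype list y."""
    p = len(X[0])
    return [sum(Fraction(r[i]) * v for r, v in zip(X, y)) for i in range(p)]

def add_ridge(A, lam):
    """Add lam to the diagonal of A (the shrinkage term of SNP-BLUP)."""
    n = len(A)
    return [[A[i][j] + (lam if i == j else 0) for j in range(n)] for i in range(n)]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A assumed nonsingular)."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [u - factor * w for u, w in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

# Toy data (assumed): genotypes coded 0/1/2 at two markers, integer phenotypes.
X1, y1 = [[0, 1], [1, 2], [2, 0], [1, 1]], [1, 3, 2, 2]
X2, y2 = [[2, 1], [0, 0], [1, 2], [2, 2], [1, 0]], [3, 0, 3, 4, 1]
lam = 2  # assumed common shrinkage parameter

# Reference: joint analysis of all individual-level data.
C_joint = add_ridge(mat_add(xtx(X1), xtx(X2)), lam)
b_joint = solve(C_joint, vec_add(xty(X1, y1), xty(X2, y2)))

# Population 2 analysed separately; only (b2, C2) are shared as summary statistics.
C2 = add_ridge(xtx(X2), lam)
b2 = solve(C2, xty(X2, y2))

# Integration: population 1 individual-level data plus population 2 summaries.
# C2 already contains lam * I, and C2 @ b2 recovers X2'y2.
C_int = mat_add(xtx(X1), C2)
b_integrated = solve(C_int, vec_add(xty(X1, y1), mat_vec(C2, b2)))

print(b_integrated == b_joint)
```

The sketch uses the full matrix `C2`, which corresponds to the case of complete summary statistics; the approximations mentioned in the abstract are not reproduced here.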



Genetics ◽ 10.1534/genetics.118.301109 ◽ 2018 ◽ Vol 210 (1) ◽ pp. 53-69 ◽ Cited By ~ 7

Author(s): Jeremie Vandenplas ◽ Mario P. L. Calus ◽ Gregor Gorjanc

Keyword(s): Genomic Prediction ◽ Summary Statistics ◽ Multiple Populations ◽ Individual Level ◽ Level Data


Approximate conditional phenotype analysis based on genome wide association summary statistics

Scientific Reports ◽ 10.1038/s41598-021-82000-1 ◽ 2021 ◽ Vol 11 (1)

Author(s): Peitao Wu ◽ Biqi Wang ◽ Steven A. Lubitz ◽ Emelia J. Benjamin ◽ James B. Meigs ◽ ...

Keyword(s): Large Scale ◽ Genome Wide Association Study ◽ Genome Wide Association ◽ Summary Statistics ◽ Phenotypic Data ◽ Individual Level ◽ Genome Wide ◽ Level Data ◽ A Genome ◽ Phenotype Analysis

Abstract: Because single genetic variants may have pleiotropic effects, one trait can be a confounder in a genome-wide association study (GWAS) that aims to identify loci associated with another trait. A typical approach to address this issue is to perform an additional analysis adjusting for the confounder. However, obtaining conditional results can be time-consuming. We propose an approximate conditional phenotype analysis based on GWAS summary statistics, the covariance between outcome and confounder, and the variant minor allele frequency (MAF). GWAS summary statistics and MAF are taken from GWAS meta-analysis results, while the trait covariance may be estimated by two strategies: (i) estimates from a subset of the phenotypic data; or (ii) estimates from published studies. We compare our two strategies with estimates using individual-level data from the full GWAS sample (gold standard). A simulation study for both binary and continuous traits demonstrates that our approximate approach is accurate. We apply our method to the Framingham Heart Study (FHS) GWAS and to large-scale cardiometabolic GWAS results. We observed a high consistency of genetic effect size estimates between our method and individual-level data analysis. Our approach leads to an efficient way to perform approximate conditional analysis using large-scale GWAS summary statistics.
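To make the three inputs named in this abstract concrete, here is a minimal sketch of how a conditional effect can be recovered from marginal quantities for continuous traits. It uses standard large-sample least-squares algebra for the model y ~ g + c, with the genotype variance taken as 2·MAF·(1 − MAF) under Hardy-Weinberg equilibrium. The function name and the example numbers are assumptions, and the estimator published by the authors (including its treatment of binary traits and standard errors) may differ.

```python
def conditional_effect(beta_y, beta_c, maf, var_c, cov_yc):
    """Approximate effect of a variant on outcome y adjusted for confounder c.

    beta_y : marginal GWAS effect of the variant on y
    beta_c : marginal GWAS effect of the variant on c
    maf    : minor allele frequency of the variant
    var_c  : variance of the confounder
    cov_yc : covariance between outcome and confounder
    """
    var_g = 2.0 * maf * (1.0 - maf)  # genotype variance, assuming Hardy-Weinberg
    numerator = beta_y * var_c - beta_c * cov_yc
    denominator = var_c - beta_c ** 2 * var_g
    return numerator / denominator


# Constructed check (assumed values): y = a*g + b*c + noise and c = d*g + u,
# with Var(u) = 1, so the true conditional effect of g on y is a.
a, b, d, maf = 0.1, 0.5, 0.2, 0.3
var_g = 2.0 * maf * (1.0 - maf)
var_c = d ** 2 * var_g + 1.0
cov_yc = a * d * var_g + b * var_c
beta_y_marginal = a + b * d  # marginal effect absorbs the path through c

estimate = conditional_effect(beta_y_marginal, d, maf, var_c, cov_yc)
print(round(estimate, 6))
```

In practice `var_c` and `cov_yc` would come from a phenotype subset or from published studies, the two strategies the abstract describes.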


Bayesian large-scale multiple regression with summary statistics from genome-wide association studies

10.1101/042457 ◽ 2016 ◽ Cited By ~ 5

Author(s): Xiang Zhu ◽ Matthew Stephens

Keyword(s): Multiple Regression ◽ Large Scale ◽ Association Studies ◽ Genome Wide Association ◽ Genome Wide Association Studies ◽ Summary Statistics ◽ Individual Level ◽ Genome Wide ◽ Level Data ◽ Wide Range

Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a “Regression with Summary Statistics” (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which can also be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations RSS performs similarly to analyses using the individual data, both for estimating heritability and detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss.
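The RSS likelihood described above is commonly written in the following form. The abstract itself gives no symbols, so the notation here is an assumption: \hat{\beta} is the vector of univariate effect estimates, S is the diagonal matrix of their standard errors, R is the SNP correlation (LD) matrix, and \beta is the vector of multiple regression coefficients.

```latex
\hat{\beta} \mid S, R \;\sim\; \mathcal{N}\!\left( S R S^{-1} \beta,\; S R S \right),
\qquad S = \operatorname{diag}(s_1, \dots, s_p)
```

When R is the identity matrix (no LD), the mean reduces to \beta and the covariance to S^2, so each univariate estimate is centred on its own multiple regression coefficient.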


Bayesian Multi-SNP Genetic Association Analysis: Control of FDR and Use of Summary Statistics

10.1101/316471 ◽ 2018 ◽ Cited By ~ 14

Author(s): Yeji Lee ◽ Francesca Luca ◽ Roger Pique-Regi ◽ Xiaoquan Wen

Keyword(s): Genetic Association ◽ Association Analysis ◽ Quantitative Trait ◽ Snp Analysis ◽ Summary Statistics ◽ Genetic Association Analysis ◽ Computational Approaches ◽ Individual Level ◽ Genome Wide ◽ Level Data

Abstract: Multi-SNP genetic association analysis has become increasingly important in analyzing data from genome-wide association studies (GWASs) and molecular quantitative trait loci (QTL) mapping studies. In this paper, we propose novel computational approaches to address two outstanding issues in Bayesian multi-SNP genetic association analysis: namely, the control of false positive discoveries of identified association signals and the maximization of the efficiency of statistical inference by utilizing summary statistics. Quantifying the strength and uncertainty of genetic association signals has been a long-standing theme in statistical genetics. However, there is a lack of formal statistical procedures that can rigorously control type I errors in multi-SNP analysis. We propose an intuitive hierarchical representation of genetic association signals based on Bayesian posterior probabilities, which subsequently enables rigorous control of false discovery rate (FDR) and construction of Bayesian credible sets. From the perspective of statistical data reduction, we examine the computational approaches of multi-SNP analysis using z-statistics from single-SNP association testing and conclude that they likely yield conservative results compared with using individual-level data. Building on this result, we propose a set of sufficient summary statistics that can lead to results identical to those from individual-level data without sacrificing power. Our novel computational approaches are implemented in the software package DAP-G (https://github.com/xqwen/dap), which applies to both GWASs and genome-wide molecular QTL mapping studies. It is highly computationally efficient and approximately 20 times faster than the state-of-the-art implementation of Bayesian multi-SNP analysis software. We demonstrate the proposed computational approaches using carefully constructed simulation studies and illustrate a complete workflow for multi-SNP analysis of cis expression quantitative trait loci using the whole blood data from the GTEx project.
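The link between posterior probabilities and FDR control mentioned in this abstract can be sketched with the generic Bayesian FDR rule: treat one minus the posterior probability as a local false discovery rate, rank signals by it, and keep the largest set whose average local FDR stays at or below the target. The function name and the example probabilities are assumptions; this is not the DAP-G implementation, which works on its own hierarchical signal representation.

```python
def bayesian_fdr_select(posterior_probs, alpha=0.05):
    """Return indices of signals selected at an expected FDR of at most alpha.

    posterior_probs : posterior probability that each candidate signal is real
    alpha           : target false discovery rate
    """
    # Rank candidates from most to least convincing.
    order = sorted(range(len(posterior_probs)),
                   key=lambda i: posterior_probs[i], reverse=True)
    selected, cumulative_lfdr = [], 0.0
    for rank, idx in enumerate(order, start=1):
        cumulative_lfdr += 1.0 - posterior_probs[idx]  # local FDR of this signal
        # The running mean of sorted local FDRs never decreases, so stop at
        # the first violation.
        if cumulative_lfdr / rank > alpha:
            break
        selected.append(idx)
    return selected


# Hypothetical posterior probabilities for four candidate signals.
probs = [0.99, 0.50, 0.97, 0.20]
chosen = bayesian_fdr_select(probs, alpha=0.05)
print(chosen)
```

With these inputs the two strongest signals are kept, because adding the third would raise the average local FDR above 5%.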


Risk Projection for Time-to-event Outcome Leveraging Summary Statistics With Source Individual-level Data

Journal of the American Statistical Association ◽ 10.1080/01621459.2021.1895810 ◽ 2021 ◽ pp. 1-34

Author(s): Jiayin Zheng ◽ Yingye Zheng ◽ Li Hsu

Keyword(s): Summary Statistics ◽ Time To Event ◽ Individual Level ◽ Level Data ◽ Risk Projection


Cigarette smoking and personality: interrogating causality using Mendelian randomisation

Psychological Medicine ◽ 10.1017/s0033291718003069 ◽ 2018 ◽ Vol 49 (13) ◽ pp. 2197-2205 ◽ Cited By ~ 1

Author(s): Hannah M. Sallis ◽ George Davey Smith ◽ Marcus R. Munafò

Keyword(s): Personality Traits ◽ Association Studies ◽ Smoking Initiation ◽ Mendelian Randomisation ◽ Genome Wide Association Studies ◽ Individual Level ◽ Causal Pathways ◽ Genome Wide ◽ Level Data ◽ Causal Nature

Abstract. Background: Despite the well-documented association between smoking and personality traits such as neuroticism and extraversion, little is known about the potential causal nature of these findings. If it were possible to unpick the association between personality and smoking, it may be possible to develop tailored smoking interventions that could lead to both improved uptake and efficacy. Methods: Recent genome-wide association studies (GWAS) have identified variants robustly associated with both smoking phenotypes and personality traits. Here we use publicly available GWAS summary statistics in addition to individual-level data from UK Biobank to investigate the link between smoking and personality. We first estimate genetic overlap between traits using LD score regression and then use bidirectional Mendelian randomisation methods to unpick the nature of this relationship. Results: We found clear evidence of a modest genetic correlation between smoking behaviours and both neuroticism and extraversion. We found some evidence that personality traits are causally linked to certain smoking phenotypes: among current smokers each additional neuroticism risk allele was associated with smoking an additional 0.07 cigarettes per day (95% CI 0.02–0.12, p = 0.009), and each additional extraversion effect allele was associated with an elevated odds of smoking initiation (OR 1.015, 95% CI 1.01–1.02, p = 9.6 × 10⁻⁷). Conclusion: We found some evidence for specific causal pathways from personality to smoking phenotypes, and weaker evidence of an association from smoking initiation to personality. These findings could be used to inform future smoking interventions or to tailor existing schemes.
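The abstract does not say which Mendelian randomisation estimators were applied. As an illustration of how such an analysis can run on summary statistics alone, the sketch below implements the widely used fixed-effect inverse-variance weighted (IVW) estimator. The function name and the input numbers are assumptions and are not taken from the study.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted Mendelian randomisation estimate.

    beta_exposure : per-variant effects on the exposure
    beta_outcome  : per-variant effects on the outcome
    se_outcome    : standard errors of the outcome effects
    Returns (causal_estimate, standard_error).
    """
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics for two instruments whose outcome effects
# are exactly half their exposure effects.
estimate, std_err = ivw_estimate([0.1, 0.2], [0.05, 0.10], [0.01, 0.01])
print(round(estimate, 6), round(std_err, 6))
```

A bidirectional analysis applies the same estimator twice, swapping which trait is treated as the exposure and using instruments selected for that exposure each time.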


Equivalence of LD-Score Regression and Individual-Level-Data Methods

10.1101/211821 ◽ 2017 ◽ Cited By ~ 8

Author(s): Ronald de Vlaming ◽ Magnus Johannesson ◽ Patrik K.E. Magnusson ◽ M. Arfan Ikram ◽ Peter M. Visscher

Keyword(s): Maximum Likelihood ◽ Recent Work ◽ Principal Components ◽ Population Stratification ◽ Summary Statistics ◽ Test Statistics ◽ Ceteris Paribus ◽ Individual Level ◽ Level Data ◽ Genomic Relatedness

Abstract: LD-score (LDSC) regression disentangles the contribution of polygenic signal, in terms of SNP-based heritability, and population stratification, in terms of a so-called intercept, to GWAS test statistics. Whereas LDSC regression uses summary statistics, methods like Haseman-Elston (HE) regression and genomic-relatedness-matrix (GRM) restricted maximum likelihood infer parameters such as SNP-based heritability from individual-level data directly. Therefore, these two types of methods are typically considered to be profoundly different. Nevertheless, recent work has revealed that LDSC and HE regression yield near-identical SNP-based heritability estimates when confounding stratification is absent. We now extend the equivalence; under the stratification assumed by LDSC regression, we show that the intercept can be estimated from individual-level data by transforming the coefficients of a regression of the phenotype on the leading principal components from the GRM. Using simulations, considering various degrees and forms of population stratification, we find that intercept estimates obtained from individual-level data are nearly equivalent to estimates from LDSC regression (R² > 99%). An empirical application corroborates these findings. Hence, LDSC regression is not profoundly different from methods using individual-level data; parameters that are identified by LDSC regression are also identified by methods using individual-level data. In addition, our results indicate that, under strong stratification, there is misattribution of stratification to the slope of LDSC regression, inflating estimates of SNP-based heritability from LDSC regression ceteris paribus. Hence, the intercept is not a panacea for population stratification. Consequently, LDSC-regression estimates should be interpreted with caution, especially when the intercept estimate is significantly greater than one.
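The slope and intercept discussed in this abstract can be illustrated with a deliberately simplified sketch of LD-score regression. It assumes the usual expectation E[χ²] = (N·h²/M)·ℓ + intercept and fits it by unweighted ordinary least squares; production LDSC software uses weighted regression with block-jackknife standard errors, which are omitted here. The function name and the noiseless toy data are assumptions.

```python
def ldsc_fit(chi2, ld_scores, n_samples, n_snps):
    """Unweighted LD-score regression of chi-square statistics on LD scores.

    Returns (h2_estimate, intercept), where the slope is rescaled to
    SNP-based heritability via h2 = slope * n_snps / n_samples.
    """
    k = len(chi2)
    mean_l = sum(ld_scores) / k
    mean_c = sum(chi2) / k
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    return slope * n_snps / n_samples, intercept


# Noiseless toy data generated with h2 = 0.4 and intercept = 1.05.
N, M = 10_000, 1_000
ld = [1.0, 2.0, 5.0, 10.0]
stats = [1.05 + (N * 0.4 / M) * l for l in ld]

h2_hat, intercept_hat = ldsc_fit(stats, ld, N, M)
print(round(h2_hat, 6), round(intercept_hat, 6))
```

An intercept above one, as in this toy example, is what the abstract associates with population stratification.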


Genetic Predictors of Response to Serotonergic and Noradrenergic Antidepressants in Major Depressive Disorder: A Genome-Wide Analysis of Individual-Level Data and a Meta-Analysis

PLoS Medicine ◽ 10.1371/journal.pmed.1001326 ◽ 2012 ◽ Vol 9 (10) ◽ pp. e1001326 ◽ Cited By ~ 88

Author(s): Katherine E. Tansey ◽ Michel Guipponi ◽ Nader Perroud ◽ Guido Bondolfi ◽ Enrico Domenici ◽ ...

Keyword(s): Major Depressive Disorder ◽ Meta Analysis ◽ Major Depressive ◽ Individual Level ◽ Genome Wide Analysis ◽ Predictors Of Response ◽ Genetic Predictors ◽ Genome Wide ◽ Level Data ◽ A Genome


A data harmonization pipeline to leverage external controls and boost power in GWAS

10.1101/2020.11.30.405415 ◽ 2020

Author(s): Danfeng Chen ◽ Katherine Tashman ◽ Duncan S. Palmer ◽ Benjamin Neale ◽ Kathryn Roeder ◽ ...

Keyword(s): Genome Wide Association Study ◽ Control Sample ◽ Summary Statistics ◽ Batch Effects ◽ Multiple Sources ◽ Individual Level ◽ Data Harmonization ◽ Genome Wide ◽ Before And After ◽ Spurious Results

Abstract: The use of external controls in a genome-wide association study (GWAS) can significantly increase the size and diversity of the control sample, enabling high-resolution ancestry matching and enhancing the power to detect association signals. However, the aggregation of controls from multiple sources is challenging due to batch effects, difficulty in identifying genotyping errors, and the use of different genotyping platforms. These obstacles have impeded the use of external controls in GWAS and can lead to spurious results if not carefully addressed. We propose a unified data harmonization pipeline that includes an iterative approach to quality control (QC) and imputation, implemented before and after merging cohorts and arrays. We apply this harmonization pipeline to aggregate 27,517 European control samples from 16 collections within dbGaP. We leverage these harmonized controls to conduct a GWAS of Crohn’s disease. We demonstrate a boost in power over using the cohort samples alone, and show that our procedure results in summary statistics free of any significant batch effects. This harmonization pipeline for aggregating genotype data from multiple sources can also serve other applications where individual-level genotypes, rather than summary statistics, are required.


MTAG: Multi-Trait Analysis of GWAS

10.1101/118810 ◽ 2017 ◽ Cited By ~ 19

Author(s): Patrick Turley ◽ Raymond K. Walters ◽ Omeed Maghzian ◽ Aysu Okbay ◽ James J. Lee ◽ ...

Keyword(s): Depressive Symptoms ◽ Well Being ◽ Joint Analysis ◽ Summary Statistics ◽ Subjective Well Being ◽ Bioinformatics Analyses ◽ Trait Analysis ◽ Genome Wide ◽ Polygenic Scores ◽ Variance Explained

ABSTRACT: We introduce Multi-Trait Analysis of GWAS (MTAG), a method for joint analysis of summary statistics from GWASs of different traits, possibly from overlapping samples. We apply MTAG to summary statistics for depressive symptoms (N_eff = 354,862), neuroticism (N = 168,105), and subjective well-being (N = 388,538). Compared to 32, 9, and 13 genome-wide significant loci in the single-trait GWASs (most of which are themselves novel), MTAG increases the number of loci to 64, 37, and 49, respectively. Moreover, association statistics from MTAG yield more informative bioinformatics analyses and increase variance explained by polygenic scores by approximately 25%, matching theoretical expectations.
