A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics

PLoS Genetics ◽  
2021 ◽  
Vol 17 (7) ◽  
pp. e1009697
Author(s):  
Geyu Zhou ◽  
Hongyu Zhao

Genetic prediction of complex traits has great promise for disease prevention, monitoring, and treatment. The development of accurate risk prediction models is hindered by the wide diversity of genetic architecture across different traits, limited access to individual-level data for training and parameter tuning, and the demand for computational resources. To overcome the limitations of most existing methods, which make explicit assumptions about the underlying genetic architecture and need a separate validation data set for parameter tuning, we develop a summary statistics-based nonparametric method that does not rely on validation datasets to tune parameters. In our implementation, we refine the commonly used likelihood assumption to deal with the discrepancy between summary statistics and the external reference panel. We also leverage the block structure of the reference linkage disequilibrium matrix to implement a parallel algorithm. Through simulations and applications to twelve traits, we show that our method is adaptive to different genetic architectures, statistically robust, and computationally efficient. Our method is available at https://github.com/eldronzhou/SDPR.
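The point about block structure can be illustrated with a small sketch. Because a block-diagonal linkage disequilibrium (LD) matrix has no correlation between blocks, each block can be processed independently and the results concatenated. The code below is only an illustration under stated assumptions: the ridge-style shrinkage `(R_b + lam * I)^-1 * beta_b` is a stand-in chosen for brevity and is not SDPR's sampler, and the function names and the `lam` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def shrink_block(block, lam=1.0):
    """Ridge-style shrinkage of one block's marginal effects (illustrative stand-in)."""
    ld, beta = block
    n = len(beta)
    regularised = [[ld[i][j] + (lam if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
    return solve(regularised, beta)


def shrink_all(blocks, lam=1.0, workers=4):
    """Process every LD block independently, then concatenate in the original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_block = list(pool.map(lambda b: shrink_block(b, lam), blocks))
    return [w for block in per_block for w in block]


if __name__ == "__main__":
    # Two hypothetical blocks: (LD matrix, marginal effect estimates).
    blocks = [
        ([[1.0, 0.5], [0.5, 1.0]], [0.10, 0.08]),
        ([[1.0]], [-0.05]),
    ]
    print(shrink_all(blocks, lam=0.5))
```

Threads are used here only to show that the blocks are independent; in CPython, CPU-bound work of this kind would need processes or native code to run truly in parallel.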

2020 ◽  
Author(s):  
Geyu Zhou ◽  
Hongyu Zhao

Genetic prediction of complex traits has great promise for disease prevention, monitoring, and treatment. The development of accurate risk prediction models is hindered by the wide diversity of genetic architecture across different traits, limited access to individual-level data for training and parameter tuning, and the demand for computational resources. To overcome the limitations of most existing methods, which make explicit assumptions about the underlying genetic architecture and need a separate validation data set for parameter tuning, we develop a summary statistics-based nonparametric method that does not rely on specific parametric assumptions about genetic architecture or on validation datasets to tune parameters. In our implementation, we refine the commonly used likelihood assumption to deal with the discrepancy between summary statistics and the external reference panel. We also leverage the block structure of the reference linkage disequilibrium matrix to implement a parallel algorithm. Through simulations and applications to twelve traits, we show that our method is adaptive to different genetic architectures, statistically robust, and computationally efficient. Our method is available at https://github.com/eldronzhou/SDPR.


2020 ◽  
Author(s):  
Qianqian Zhang ◽  
Florian Privé ◽  
Bjarni Vilhjálmsson ◽  
Doug Speed

At present, most tools for constructing genetic prediction models begin with the assumption that all genetic variants contribute equally towards the phenotype. However, this represents a sub-optimal model for how heritability is distributed across the genome. Here we construct prediction models for 14 phenotypes from the UK Biobank (200,000 individuals per phenotype) using four of the most popular prediction tools: lasso, ridge regression, Bolt-LMM and BayesR. When we improve the assumed heritability model, prediction accuracy always improves (i.e., for all four tools and for all 14 phenotypes). When we construct prediction models using individual-level data, the best-performing tool is Bolt-LMM; if we replace its default heritability model with the most realistic model currently available, the average proportion of phenotypic variance explained increases by 19% (s.d. 2), equivalent to increasing the sample size by about a quarter. When we construct prediction models using summary statistics, the best tool depends on the phenotype. Therefore, we develop MegaPRS, a summary statistic prediction tool for constructing lasso, ridge regression, Bolt-LMM and BayesR prediction models, that allows the user to specify the heritability model.


2018 ◽  
Author(s):  
Doug Speed ◽  
David J Balding

LD Score Regression (LDSC) has been widely applied to the results of genome-wide association studies. However, its estimates of SNP heritability are derived from an unrealistic model in which each SNP is expected to contribute equal heritability. As a consequence, LDSC tends to over-estimate confounding bias, under-estimate the total phenotypic variation explained by SNPs, and provide misleading estimates of the heritability enrichment of SNP categories. Therefore, we present SumHer, software for estimating SNP heritability from summary statistics using more realistic heritability models. After demonstrating its superiority over LDSC, we apply SumHer to the results of 24 large-scale association studies (average sample size 121 000). First we show that these studies have tended to substantially over-correct for confounding, and as a result the number of genome-wide significant loci has been under-reported by about 20%. Next we estimate enrichment for 24 categories of SNPs defined by functional annotations. A previous study using LDSC reported that conserved regions were 13-fold enriched, and found a further twelve categories with above 2-fold enrichment. By contrast, our analysis using SumHer finds that conserved regions are only 1.6-fold (SD 0.06) enriched, and that no category has enrichment above 1.7-fold. SumHer provides an improved understanding of the genetic architecture of complex traits, which enables more efficient analysis of future genetic data.
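The equal-heritability model that the abstract criticizes can be made concrete with a small sketch. In the standard LDSC regression, the expected chi-square statistic of SNP j is modelled as 1 + N*a + (N*h2/M)*l_j, where N is the sample size, M the number of SNPs, l_j the LD score of SNP j, h2 the SNP heritability, and a the contribution of confounding. The term h2/M is where every SNP is assumed to contribute the same expected heritability. The code below fits that line by plain least squares; real LDSC uses a weighted regression with block jackknife standard errors, so this is only a simplified illustration, and the function name and toy numbers are hypothetical.

```python
def ldsc_fit(chi2, ld_scores, n_samples, n_snps):
    """Unweighted least-squares fit of chi2_j = intercept + slope * l_j.

    Returns (h2, intercept), where h2 = slope * M / N. An intercept above 1
    is what LDSC attributes to confounding.
    """
    k = len(chi2)
    mean_l = sum(ld_scores) / k
    mean_c = sum(chi2) / k
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    h2 = slope * n_snps / n_samples
    return h2, intercept


if __name__ == "__main__":
    # Toy, noise-free inputs generated from the model itself (not real data).
    ld_scores = [10.0, 50.0, 100.0, 200.0]
    chi2 = [1.02 + 0.05 * l for l in ld_scores]
    print(ldsc_fit(chi2, ld_scores, n_samples=100_000, n_snps=1_000_000))
```

Replacing the constant h2/M with a per-SNP expected heritability is, in broad terms, the kind of change the abstract refers to as a more realistic heritability model.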


2016 ◽  
Author(s):  
Han Zhang ◽  
William Wheeler ◽  
Paula L Hyland ◽  
Yifan Yang ◽  
Jianxin Shi ◽  
...  

Meta-analysis of multiple genome-wide association studies (GWAS) has become an effective approach for detecting single nucleotide polymorphism (SNP) associations with complex traits. However, it is difficult to integrate the readily accessible SNP-level summary statistics from a meta-analysis into more powerful multi-marker testing procedures, which generally require individual-level genetic data. We developed a general procedure called Summary based Adaptive Rank Truncated Product (sARTP) for conducting gene and pathway meta-analysis that uses only SNP-level summary statistics in combination with genotype correlation estimated from a panel of individual-level genetic data. We demonstrated the validity and power advantage of sARTP through empirical and simulated data. We conducted a comprehensive pathway-based meta-analysis with sARTP on type 2 diabetes (T2D) by integrating SNP-level summary statistics from two large studies consisting of 19,809 T2D cases and 111,181 controls with European ancestry. Among 4,713 candidate pathways from which genes in neighborhoods of 170 GWAS established T2D loci were excluded, we detected 43 T2D globally significant pathways (with Bonferroni corrected p-values < 0.05), which included the insulin signaling pathway and T2D pathway defined by KEGG, as well as the pathways defined according to specific gene expression patterns on pancreatic adenocarcinoma, hepatocellular carcinoma, and bladder carcinoma. Using summary data from 8 eastern Asian T2D GWAS with 6,952 cases and 11,865 controls, we showed that 7 out of the 43 pathways identified in European populations remained significant in eastern Asians at a false discovery rate of 0.1. We created an R package and a web-based tool for sARTP with the capability to analyze pathways with thousands of genes and tens of thousands of SNPs.
Author Summary: As GWAS continue to grow in sample size, it is evident that these studies need to be utilized more effectively for detecting individual susceptibility variants, and more importantly to provide insight into the global genetic architecture of complex traits. Towards this goal, identifying association with respect to a collection of variants in biological pathways can be particularly insightful for understanding how networks of genes might be affecting the pathophysiology of diseases. Here we present a new pathway analysis procedure that can be conducted using summary-level association statistics, which have become the main vehicle for performing meta-analysis of individual genetic variants across studies in large consortia. Through simulation studies we showed the proposed method was more powerful than the existing state-of-the-art method. We carried out a comprehensive pathway analysis of 4,713 candidate pathways on their association with T2D using two large studies with European ancestry and identified 43 T2D-associated pathways. Further examination of those 43 pathways in 8 Asian studies showed that some pathways were trans-ethnically associated with T2D. This analysis clearly highlights novel T2D-associated pathways beyond what has been known from single-variant association analysis reported from the largest GWAS to date.


2019 ◽  
Author(s):  
Yi Yang ◽  
Xingjie Shi ◽  
Yuling Jiao ◽  
Jian Huang ◽  
Min Chen ◽  
...  

Motivation: Although genome-wide association studies (GWAS) have deepened our understanding of the genetic architecture of complex traits, the mechanistic links that underlie how genetic variants cause complex traits remain elusive. To advance our understanding of the underlying mechanistic links, various consortia have collected a vast volume of genomic data that enable us to investigate the role that genetic variants play in gene expression regulation. Recently, a collaborative mixed model (CoMM) [42] was proposed to jointly interrogate genome on complex traits by integrating both the GWAS dataset and the expression quantitative trait loci (eQTL) dataset. Although CoMM is a powerful approach that leverages regulatory information while accounting for the uncertainty in using an eQTL dataset, it requires individual-level GWAS data and cannot fully make use of widely available GWAS summary statistics. Therefore, statistically efficient methods that leverage transcriptome information using only summary statistics from GWAS data are required.
Results: In this study, we propose a novel probabilistic model, CoMM-S2, to examine the mechanistic role that genetic variants play, by using only GWAS summary statistics instead of individual-level GWAS data. Similar to CoMM, which uses individual-level GWAS data, CoMM-S2 combines two models: the first model examines the relationship between gene expression and genotype, while the second model examines the relationship between the phenotype and the predicted gene expression from the first model. Distinct from CoMM, CoMM-S2 requires only GWAS summary statistics. Using both simulation studies and real data analysis, we demonstrate that even though CoMM-S2 utilizes GWAS summary statistics, its performance is comparable to that of CoMM, which uses individual-level GWAS data.
Availability and implementation: The implementation of CoMM-S2 is included in the CoMM package, which can be downloaded from https://github.com/gordonliu810822/CoMM.
Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Jie Zheng ◽  
Tom G. Richardson ◽  
Louise A. C. Millard ◽  
Gibran Hemani ◽  
Christopher Raistrick ◽  
...  

Background: Identifying phenotypic correlations between complex traits and diseases can provide useful etiological insights. Restricted access to individual-level phenotype data makes it difficult to estimate large-scale phenotypic correlation across the human phenome. State-of-the-art methods, metaCCA and LD score regression, provide an alternative approach to estimate phenotypic correlation using genome-wide association study (GWAS) summary statistics.
Results: Here, we present an integrated R toolkit, PhenoSpD, to 1) apply metaCCA (or LD score regression) to estimate phenotypic correlations using GWAS summary statistics; and 2) utilize the estimated phenotypic correlations to inform correction of multiple testing for complex human traits using the spectral decomposition of matrices (SpD). The simulations suggest it is possible to estimate phenotypic correlation using samples with only a partial overlap, but as overlap decreases correlations will attenuate towards zero and multiple testing correction will be more stringent than in perfectly overlapping samples. In a case study, PhenoSpD using GWAS results suggested 324.4 independent tests among 452 metabolites, which is close to the 296 independent tests estimated using true phenotypic correlation. We further applied PhenoSpD to estimate 7,503 pair-wise phenotypic correlations among 123 metabolites using GWAS summary statistics from Kettunen et al., and PhenoSpD suggested 44.9 independent tests for these metabolites.
Conclusion: PhenoSpD integrates existing methods and provides a simple and conservative way to reduce dimensionality for complex human traits using GWAS summary statistics, which is particularly valuable for post-GWAS analysis of complex molecular traits.
Availability: R code and documentation for PhenoSpD V1.0.0 are available online (https://github.com/MRCIEU/PhenoSpD).
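The idea of counting independent tests by spectral decomposition can be sketched briefly. One widely used SpD estimator (Nyholt, 2004) sets the effective number of tests to 1 + (M - 1) * (1 - Var(lambda) / M), where lambda are the eigenvalues of the M x M trait correlation matrix. Whether PhenoSpD's reported figures come from this estimator or from a variant is not stated in the abstract, so the code is an assumption-laden illustration, and the function name is hypothetical. The sketch avoids an explicit eigendecomposition: for a symmetric correlation matrix the eigenvalues sum to M and their squares sum to the sum of squared entries.

```python
from math import comb


def nyholt_meff(corr):
    """Effective number of independent tests from a symmetric correlation matrix.

    Uses sum(lambda) = M and sum(lambda^2) = sum of squared matrix entries,
    so the sample variance of the eigenvalues is (sum_sq - M) / (M - 1).
    """
    m = len(corr)
    if m == 1:
        return 1.0
    sum_sq = sum(r * r for row in corr for r in row)
    var_lambda = (sum_sq - m) / (m - 1)
    return 1 + (m - 1) * (1 - var_lambda / m)


# Number of distinct trait pairs among 123 metabolites, as quoted in the abstract.
n_pairs = comb(123, 2)

if __name__ == "__main__":
    uncorrelated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    identical = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    print(nyholt_meff(uncorrelated), nyholt_meff(identical), n_pairs)
```

Fully uncorrelated traits give an effective count equal to M, while perfectly correlated traits collapse to a single test; real correlation matrices fall in between.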


Author(s):  
Alicia R. Martin ◽  
Solomon Teferra ◽  
Marlo Möller ◽  
Eileen G. Hoal ◽  
Mark J. Daly

Human genetic studies have long been vastly Eurocentric, raising a key question about the generalizability of these study findings to other populations. Because humans originated in Africa, these populations retain more genetic diversity, and yet individuals of African descent have been tremendously underrepresented in genetic studies. The diversity in Africa affords ample opportunities to improve fine-mapping resolution for associated loci, discover novel genetic associations with phenotypes, build more generalizable genetic risk prediction models, and better understand the genetic architecture of complex traits and diseases subject to varying environmental pressures. Thus, it is both ethically and scientifically imperative that geneticists globally surmount challenges that have limited progress in African genetic studies to date while meaningfully including African investigators, as greater inclusivity and enhanced research capacity afford enormous opportunities to accelerate genomic discoveries that translate more effectively to all populations. We review the advantages and challenges of studying the genetic architecture of complex traits and diseases in Africa. For example, with greater genetic diversity comes greater ancestral heterogeneity; this higher level of understudied diversity can yield novel genetic findings, but some methods that assume homogeneous population structure and work well in European populations may work less well in the presence of greater diversity and heterogeneity in African populations. Consequently, we advocate for methodological development that will accelerate studies important for all populations, especially those currently underrepresented in genetics.


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Adriaan van der Graaf ◽  
◽  
Annique Claringbould ◽  
Antoine Rimbert ◽  
Harm-Jan Westra ◽  
...  

Inference of causality between gene expression and complex traits using Mendelian randomization (MR) is confounded by pleiotropy and linkage disequilibrium (LD) of gene-expression quantitative trait loci (eQTL). Here, we propose an MR method, MR-link, that accounts for unobserved pleiotropy and LD by leveraging information from individual-level data, even when only one eQTL variant is present. In simulations, MR-link shows false-positive rates close to expectation (median 0.05) and high power (up to 0.89), outperforming all other tested MR methods and coloc. Application of MR-link to low-density lipoprotein cholesterol (LDL-C) measurements in 12,449 individuals with expression and protein QTL summary statistics from blood and liver identifies 25 genes causally linked to LDL-C. These include the known SORT1 and ApoE genes as well as PVRL2, located in the APOE locus, for which a causal role in liver was not known. Our results showcase the strength of MR-link for transcriptome-wide causal inferences.
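For context, the simplest single-variant MR estimator is the Wald ratio: the variant's effect on the outcome divided by its effect on the exposure. The sketch below shows that baseline only. It is not MR-link, and it has none of MR-link's adjustment for pleiotropy or LD, which is exactly why a single-eQTL setting is hard for it. The function name and the example numbers are hypothetical, and the standard error uses the common first-order approximation that ignores uncertainty in the exposure effect.

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-instrument Wald ratio estimate of the exposure-to-outcome effect.

    beta_exposure: variant effect on the exposure (e.g. gene expression)
    beta_outcome:  variant effect on the outcome (e.g. LDL-C)
    se_outcome:    standard error of beta_outcome
    Returns (estimate, standard_error) using a first-order approximation.
    """
    if beta_exposure == 0:
        raise ValueError("the instrument must have a non-zero effect on the exposure")
    estimate = beta_outcome / beta_exposure
    standard_error = se_outcome / abs(beta_exposure)
    return estimate, standard_error


if __name__ == "__main__":
    # Made-up effect sizes purely for illustration.
    print(wald_ratio(beta_exposure=0.5, beta_outcome=0.1, se_outcome=0.02))
```

If the variant also affects the outcome through another pathway, or tags a nearby causal variant through LD, this ratio is biased, which is the problem the abstract describes.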


2018 ◽  
Author(s):  
Andrew D. Grotzinger ◽  
Mijke Rhemtulla ◽  
Ronald de Vlaming ◽  
Stuart J. Ritchie ◽  
Travis T. Mallard ◽  
...  

Methods for using GWAS to estimate genetic correlations between pairwise combinations of traits have produced “atlases” of genetic architecture. Genetic atlases reveal pervasive pleiotropy, and genome-wide significant loci are often shared across different phenotypes. We introduce genomic structural equation modeling (Genomic SEM), a multivariate method for analyzing the joint genetic architectures of complex traits. Using formal methods for modeling covariance structure, Genomic SEM synthesizes genetic correlations and SNP-heritabilities inferred from GWAS summary statistics of individual traits from samples with varying and unknown degrees of overlap. Genomic SEM can be used to identify variants with effects on general dimensions of cross-trait liability, boost power for discovery, and calculate more predictive polygenic scores. Finally, Genomic SEM can be used to identify loci that cause divergence between traits, aiding the search for what uniquely differentiates highly correlated phenotypes. We demonstrate several applications of Genomic SEM, including a joint analysis of GWAS summary statistics from five genetically correlated psychiatric traits. We identify 27 independent SNPs not previously identified in the univariate GWASs, 5 of which have been reported in other published GWASs of the included traits. Polygenic scores derived from Genomic SEM consistently outperform polygenic scores derived from GWASs of the individual traits. Genomic SEM is flexible, open ended, and allows for continuous innovations in how multivariate genetic architecture is modeled.


2021 ◽  
Vol 118 (25) ◽  
pp. e2023184118
Author(s):  
Yuchang Wu ◽  
Xiaoyuan Zhong ◽  
Yunong Lin ◽  
Zijie Zhao ◽  
Jiawen Chen ◽  
...  

Marginal effect estimates in genome-wide association studies (GWAS) are mixtures of direct and indirect genetic effects. Existing methods to dissect these effects require family-based, individual-level genetic and phenotypic data with large samples, which are difficult to obtain in practice. Here, we propose a statistical framework to estimate direct and indirect genetic effects using summary statistics from GWAS conducted on own and offspring phenotypes. Applied to birth weight, our method showed results nearly identical to those obtained using individual-level data. We also decomposed direct and indirect genetic effects of educational attainment (EA), which showed distinct patterns of genetic correlations with 45 complex traits. The known genetic correlations between EA and higher height, lower body mass index, less-active smoking behavior, and better health outcomes were mostly explained by the indirect genetic component of EA. In contrast, the consistently identified genetic correlation of autism spectrum disorder (ASD) with higher EA resides in the direct genetic component. A polygenic transmission disequilibrium test showed a significant overtransmission of the direct component of EA from healthy parents to ASD probands. Taken together, we demonstrate that traditional GWAS approaches, in conjunction with offspring phenotypic data collection in existing cohorts, could greatly benefit studies on genetic nurture and shed important light on the interpretation of genetic associations for human complex traits.

