DL-PRS: a novel deep learning approach to polygenic risk scores

Author(s):  
Sijia Huang ◽  
Xiao Ji ◽  
Michael Cho ◽  
Jaehyun Joo ◽  
Jason Moore

Abstract. Background: Chronic obstructive pulmonary disease (COPD) is a complex heterogeneous disease influenced by both environmental and genetic risk factors. Traditional genome-wide association studies (GWAS) have been successful in identifying many reproducible risk variants of moderate to small effect. Polygenic risk scores (PRS) were developed as a way to aggregate risk alleles weighted by their effect size to produce a score which could be used in clinical practice to identify individuals at high risk of disease. A limitation of both GWAS and PRS is that they make the important assumption that the effect of each allele is independent and not modified by other genetic or environmental factors. Machine learning methods such as deep learning (DL) neural networks complement the GWAS and PRS paradigm by making fewer assumptions about the nature of the genetic effects being modeled. For example, the hidden layers of a DL model have the potential to model gene-gene interactions with non-additive effects on disease risk. The goal of the present study was to develop a DL neural network approach to GWAS and PRS and to compare it to the prevailing paradigm based on modeling independent effects. We applied our DL-PRS method to genetic association data from several GWAS of COPD.
Results: We developed a DL algorithm for modeling the relationship between genetic variation from GWAS and risk of COPD in several population-based studies. We then developed a DL-PRS based on nodes and associated weights from the first and second layers of the DL neural network. Our DL-PRS framework performs satisfactorily overall in predicting COPD and contributes significantly to prediction beyond current PRS methods. Moreover, in terms of clinical relevance, the DL-PRS deciles show a more consistent and closer relationship with lung function measures such as FEV1/FVC and predicted FEV1%.
Conclusions: Not only does DL-PRS show favorable predictive performance compared with current benchmark PRS methods, but it also extends the range of PRS deciles in predicting different stages of COPD. Moreover, our DL-PRS results were replicated in an independent cohort. This study opens the door to the use of machine learning for developing risk scores from models developed using fewer assumptions about the nature of the genetic effects.
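
For orientation, the additive baseline that DL-PRS is compared against can be sketched as a weighted sum of risk-allele counts. This is a minimal illustration, not the authors' implementation; the variant IDs, effect sizes, and genotype below are hypothetical.

```python
def additive_prs(dosages, weights):
    """Sum risk-allele dosages (0, 1, or 2) weighted by per-allele effect size."""
    # Variants absent from the genotype record contribute nothing to the score.
    return sum(beta * dosages.get(variant, 0) for variant, beta in weights.items())


# Hypothetical per-allele effect sizes (log odds ratios), not taken from any study.
weights = {"rs_a": 0.10, "rs_b": -0.05, "rs_c": 0.20}
# Hypothetical risk-allele counts for one individual.
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

score = additive_prs(person, weights)
print(round(score, 2))
```

Because every term depends on a single variant, a score of this form cannot represent gene-gene interactions, which is the limitation the abstract attributes to standard GWAS and PRS.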

2021 ◽  
Author(s):  
Ying Wang ◽  
Shinichi Namba ◽  
Esteban Lopera ◽  
Sini Kerminen ◽  
Kristin Tsuo ◽  
...  

Summary: With the increasing availability of biobank-scale datasets that incorporate both genomic data and electronic health records, many associations between genetic variants and phenotypes of interest have been discovered. Polygenic risk scores (PRS), which are being widely explored in precision medicine, use the results of association studies to predict the genetic component of disease risk by accumulating risk alleles weighted by their effect sizes. However, few studies have thoroughly investigated best practices for PRS in global populations across different diseases. In this study, we utilize data from the Global-Biobank Meta-analysis Initiative (GBMI), which consists of individuals from diverse ancestries and across continents, to explore methodological considerations and PRS prediction performance in 9 different biobanks for 14 disease endpoints. Specifically, we constructed PRS using heuristic (pruning and thresholding, P+T) and Bayesian (PRS-CS) methods. We found that the genetic architecture, such as SNP-based heritability and polygenicity, varied greatly among endpoints. For both PRS construction methods, using a European ancestry LD reference panel resulted in comparable or higher prediction accuracy compared to several other non-European based panels; this is largely attributable to European descent populations still comprising the majority of GBMI participants. PRS-CS overall outperformed the classic P+T method, especially for endpoints with higher SNP-based heritability. For example, substantial improvements are observed in East-Asian ancestry (EAS) using PRS-CS compared to P+T for heart failure (HF) and chronic obstructive pulmonary disease (COPD). Notably, prediction accuracy is heterogeneous across endpoints, biobanks, and ancestries, especially for asthma which has known variation in disease prevalence across global populations.
Overall, we provide lessons for PRS construction, evaluation, and interpretation using the GBMI and highlight the importance of best practices for PRS in the biobank-scale genomics era.
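
The P+T heuristic named above can be sketched as a greedy selection: keep variants that pass a p-value threshold and are not in strong LD with an already kept, more significant variant. This is a simplified illustration under stated assumptions: the variant IDs, statistics, and r² values are hypothetical, the thresholds `p_max` and `r2_max` are illustrative rather than the study's settings, and real tools restrict LD comparisons to a genomic window.

```python
def prune_and_threshold(variants, r2, p_max=5e-8, r2_max=0.1):
    """Greedy P+T sketch.

    variants: dict of id -> (p_value, beta)
    r2: dict of frozenset({id1, id2}) -> LD r^2; missing pairs are treated as 0
    """
    kept = []
    # Visit variants from most to least significant.
    for vid in sorted(variants, key=lambda v: variants[v][0]):
        if variants[vid][0] > p_max:
            break  # every remaining variant also fails the p-value threshold
        # Keep the variant only if it is not in LD with any stronger kept variant.
        if all(r2.get(frozenset((vid, k)), 0.0) <= r2_max for k in kept):
            kept.append(vid)
    return {vid: variants[vid][1] for vid in kept}


# Hypothetical summary statistics: id -> (p-value, effect size).
variants = {
    "rs1": (1e-12, 0.30),
    "rs2": (1e-9, 0.25),   # in high LD with rs1, so it is pruned
    "rs3": (1e-8, -0.10),
    "rs4": (0.02, 0.05),   # fails the p-value threshold
}
r2 = {frozenset(("rs1", "rs2")): 0.8}

selected = prune_and_threshold(variants, r2)
print(selected)
```

In practice the p-value and r² thresholds are tuned on validation data, and the r² values come from an LD reference panel, which is why the choice of panel ancestry discussed in the summary matters.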


PLoS Genetics ◽  
2021 ◽  
Vol 17 (9) ◽  
pp. e1009670
Author(s):  
Lars G. Fritsche ◽  
Ying Ma ◽  
Daiwei Zhang ◽  
Maxwell Salvatore ◽  
Seunggeun Lee ◽  
...  

Polygenic risk scores (PRS) can provide useful information for personalized risk stratification and disease risk assessment, especially when combined with non-genetic risk factors. However, their construction depends on the availability of summary statistics from genome-wide association studies (GWAS) independent from the target sample. For best compatibility, it was reported that GWAS and the target sample should match in terms of ancestries. Yet, GWAS, especially in the field of cancer, often lack diversity and are predominated by European ancestry. This bias is a limiting factor in PRS research. By using electronic health records and genetic data from the UK Biobank, we contrast the utility of breast and prostate cancer PRS derived from external European-ancestry-based GWAS across African, East Asian, European, and South Asian ancestry groups. We highlight differences in the PRS distributions of these groups that are amplified when PRS methods condense hundreds of thousands of variants into a single score. While European-GWAS-derived PRS were not directly transferrable across ancestries on an absolute scale, we establish their predictive potential when considering them separately within each group. For example, the top 10% of the breast cancer PRS distributions within each ancestry group each revealed significant enrichments of breast cancer cases compared to the bottom 90% (odds ratio of 2.81 [95%CI: 2.69,2.93] in European, 2.88 [1.85, 4.48] in African, 2.60 [1.25, 5.40] in East Asian, and 2.33 [1.55, 3.51] in South Asian individuals). Our findings highlight a compromise solution for PRS research to compensate for the lack of diversity in well-powered European GWAS efforts while recruitment of diverse participants in the field catches up.
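
As a rough illustration of the top 10% versus bottom 90% comparison, an unadjusted odds ratio with a Wald 95% confidence interval can be computed from a 2×2 table. The abstract does not state how its odds ratios were estimated, so treating them as unadjusted is an assumption, and the counts below are hypothetical.

```python
import math


def odds_ratio_ci(cases_top, controls_top, cases_rest, controls_rest, z=1.96):
    """Unadjusted odds ratio (top group vs. rest) with a Wald confidence interval."""
    odds_ratio = (cases_top * controls_rest) / (controls_top * cases_rest)
    # Standard error of the log odds ratio from the four cell counts.
    se = math.sqrt(
        1 / cases_top + 1 / controls_top + 1 / cases_rest + 1 / controls_rest
    )
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts: top PRS decile vs. the remaining 90% of one ancestry group.
estimate, lower, upper = odds_ratio_ci(
    cases_top=50, controls_top=950, cases_rest=180, controls_rest=8820
)
print(f"OR = {estimate:.2f} [{lower:.2f}, {upper:.2f}]")
```

The interval widens as cell counts shrink, which is consistent with the wider intervals reported for the smaller non-European groups.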


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Kit K. Elam ◽  
Thao Ha ◽  
Zoe Neale ◽  
Fazil Aliev ◽  
Danielle Dick ◽  
...  

Abstract: Genetic effects on alcohol use can vary over time but are often examined using longitudinal models that predict a distal outcome at a single time point. The vast majority of these studies predominantly examine effects using White, European American (EA) samples or examine the etiology of genetic variants identified from EA samples in other racial/ethnic populations, leading to inconclusive findings about genetic effects on alcohol use. The current study examined how genetic influences on alcohol use varied by age across a 15-year period within a diverse ethnic/racial sample of adolescents. Using a multi-ethnic approach, polygenic risk scores were created for African American (AA, n = 192) and EA samples (n = 271) based on racially/ethnically aligned genome-wide association studies. Age-varying associations between polygenic scores and alcohol use were examined from age 16 to 30 using time-varying effect models separately for AA and EA samples. Polygenic risk for alcohol use was found to be associated with alcohol use from age 22 to 27 in the AA sample and from age 24.50 to 29 in the EA sample. Results are discussed relative to the intersection of alcohol use and developmental genetic effects in diverse populations.


2021 ◽  
Author(s):  
Sophia Gunn ◽  
Michael Wainberg ◽  
Zeyuan Song ◽  
Stacy Andersen ◽  
Robert Boudreau ◽  
...  

Background: A surprising and well-replicated result in genetic studies of human longevity is that centenarians appear to carry disease-associated variants in numbers similar to the general population. With the proliferation of large genome-wide association studies (GWAS) in recent years, investigators have turned to polygenic scores to leverage GWAS results into a measure of genetic risk that can better predict risk of disease than individual significant variants alone. Methods: We selected 54 polygenic risk scores (PRSs) developed for a variety of outcomes and we calculated their values in individuals from the New England Centenarian Study (NECS, N = 4886) and the Long Life Family Study (LLFS, N = 4577). We compared the distribution of these PRSs among exceptionally long-lived individuals (ELLI), their offspring and controls and we also examined their predictive values, using t-tests and regression models adjusting for sex and principal components reflecting ancestral background of the individuals (PCs). In our analyses we controlled for multiple testing using a Bonferroni-adjusted threshold for 54 traits. Results: We found that only 4 of the 54 PRSs differed between ELLIs and controls in both cohorts. ELLIs had significantly lower mean PRSs for Alzheimer's disease (AD), coronary artery disease (CAD) and systemic lupus than controls, suggesting genetic predisposition to extreme longevity may be mediated by reduced susceptibility to these traits. ELLIs also had significantly higher mean PRSs for improved cognitive function. In addition, the PRS for AD was associated with higher risk of dementia among controls but not ELLIs (p = 0.0004, 0.3 in NECS, p = 0.03, 0.93 in LLFS respectively). 
Interestingly, ELLIs did not have a larger number of homozygous risk genotypes for AD (T_NECS = -1.72, T_LLFS = 0.83) and CAD (T_NECS = -5.08, T_LLFS = -0.31) in both cohorts, but did have a significantly larger number of homozygous protective genotypes than controls for the two traits (AD: T_NECS = 3.10, T_LLFS = 2.2; CAD: T_NECS = 6.57, T_LLFS = 2.36). Conclusions: ELLIs have a similar burden of genetic disease risk as the general population for most traits, but have significantly lower genetic risk of AD, CAD, and lupus. The lack of association between AD PRS and dementia among ELLIs suggests that their genetic risk for AD is somehow buffered by protective genetic or environmental factors.


2022 ◽  
Author(s):  
Tianyuan Lu ◽  
Vincenzo Forgetta ◽  
J. Brent Richards ◽  
Celia Greenwood

Abstract: Genomic risk prediction is on the emerging path towards personalized medicine. However, the accuracy of polygenic prediction varies strongly in different individuals. In this study, based on up to 352,277 White British participants in the UK Biobank, we constructed polygenic risk scores for 15 physiological and biochemical quantitative traits after performing genome-wide association studies (GWASs). We identified 185 polygenic prediction variability quantitative trait loci (pvQTLs) for 11 traits by Levene’s test among 254,376 unrelated individuals. We validated the effects of pvQTLs using an independent test set of 58,927 individuals. A score aggregating 51 pvQTL SNPs for triglycerides had the strongest Spearman correlation of 0.185 (p-value < 1.0 × 10^−300) with the squared prediction errors. We found a strong enrichment of complex genetic effects conferred by pvQTLs compared to risk loci identified in GWASs, including 89 pvQTLs exhibiting dominance effects. Incorporation of dominance effects into polygenic risk scores significantly improved polygenic prediction for triglycerides, low-density lipoprotein cholesterol, vitamin D, and platelet. After including 87 dominance effects for triglycerides, the adjusted R² for the polygenic risk score had an 8.1% increase on the test set. In addition, 108 pvQTLs had significant interaction effects with measured environmental or lifestyle exposures. In conclusion, we have discovered and validated genetic determinants of polygenic prediction variability for 11 quantitative biomarkers, and partially profiled the underlying complex genetic effects. These findings may assist interpretation of genomic risk prediction in various contexts, and encourage novel approaches for constructing polygenic risk scores with complex genetic effects.
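
To make the variance test concrete, the sketch below computes Levene's W statistic for groups of values. It assumes, for illustration only, that the groups are trait values (or prediction residuals) split by genotype at one SNP; the abstract does not give this level of detail. It uses the mean-centered form of the test, and the data are hypothetical.

```python
def levene_w(groups):
    """Levene's W statistic (mean-centered) for equality of variances across groups."""
    # Absolute deviation of each value from its own group's mean.
    z = [[abs(y - sum(g) / len(g)) for y in g] for g in groups]
    n = sum(len(g) for g in z)
    k = len(z)
    z_means = [sum(g) / len(g) for g in z]
    z_grand = sum(sum(g) for g in z) / n
    # Between-group and within-group sums of squares of the deviations.
    between = sum(len(g) * (m - z_grand) ** 2 for g, m in zip(z, z_means))
    within = sum((v - m) ** 2 for g, m in zip(z, z_means) for v in g)
    # Undefined (division by zero) if all deviations within every group are identical.
    return (n - k) / (k - 1) * between / within


# Hypothetical trait values grouped by genotype (0, 1, or 2 copies of an allele).
same_spread = [[1, 2, 3], [11, 12, 13], [21, 22, 23]]
growing_spread = [[9, 10, 11], [5, 10, 15], [0, 10, 20]]

w_same = levene_w(same_spread)
w_growing = levene_w(growing_spread)
print(w_same, w_growing)
```

Under the null hypothesis W follows an F distribution with (k − 1, N − k) degrees of freedom, so a p-value requires an F-distribution routine; `scipy.stats.levene` offers a complete implementation, including the median-centered variant.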


Author(s):  
Niccolo’ Tesi ◽  
Sven J van der Lee ◽  
Marc Hulsman ◽  
Iris E Jansen ◽  
Najada Stringa ◽  
...  

Abstract: Studying the genome of centenarians may give insights into the molecular mechanisms underlying extreme human longevity and the escape of age-related diseases. Here, we set out to construct polygenic risk scores (PRSs) for longevity and to investigate the functions of longevity-associated variants. Using a cohort of centenarians with maintained cognitive health (N = 343), a population-matched cohort of older adults from 5 cohorts (N = 2905), and summary statistics data from genome-wide association studies on parental longevity, we constructed a PRS including 330 variants that significantly discriminated between centenarians and older adults. This PRS was also associated with longer survival in an independent sample of younger individuals (p = .02), leading up to a 4-year difference in survival based on common genetic factors only. We show that this PRS was, in part, able to compensate for the deleterious effect of the APOE-ε4 allele. Using an integrative framework, we annotated the 330 variants included in this PRS by the genes they associate with. We find that they are enriched with genes associated with cellular differentiation, developmental processes, and cellular response to stress. Together, our results indicate that an extended human life span is, in part, the result of a constellation of variants each exerting small advantageous effects on aging-related biological mechanisms that maintain overall health and decrease the risk of age-related diseases.


2021 ◽  
Author(s):  
Jielin Xu ◽  
Yuan Hou ◽  
Yadi Zhou ◽  
Ming Hu ◽  
Feixiong Cheng

Human genome sequencing studies have identified numerous loci associated with complex diseases, including Alzheimer's disease (AD). Translating human genetic findings (i.e., genome-wide association studies [GWAS]) to pathobiology and therapeutic discovery, however, remains a major challenge. To address this critical problem, we present a network topology-based deep learning framework to identify disease-associated genes (NETTAG). NETTAG is capable of integrating multi-genomics data along with the protein-protein interactome to infer putative risk genes and drug targets impacted by GWAS loci. Specifically, we leverage non-coding GWAS loci effects on expression quantitative trait loci (eQTLs), histone-QTLs, and transcription factor binding-QTLs, enhancers and CpG islands, promoter regions, open chromatin, and promoter flanking regions. The key premises of NETTAG are that disease risk genes exhibit distinct functional characteristics compared to non-risk genes and therefore can be distinguished by their aggregated genomic features under the human protein interactome. Applying NETTAG to the latest AD GWAS data, we identified 156 putative AD-risk genes (e.g., APOE, BIN1, GSK3B, MARK4, and PICALM). We showed that predicted risk genes are: 1) significantly enriched in AD-related pathobiological pathways, 2) more likely to be differentially expressed regarding transcriptome and proteome of AD brains, and 3) enriched in druggable targets with approved medicines (e.g., choline and ibudilast). In summary, our findings suggest that understanding of human pathobiology and therapeutic development could benefit from a network-based deep learning methodology that utilizes GWAS findings under the multimodal genomic analyses.


2018 ◽  
Author(s):  
Tom G. Richardson ◽  
Sean Harrison ◽  
Gibran Hemani ◽  
George Davey Smith

Abstract: The age of large-scale genome-wide association studies (GWAS) has provided us with an unprecedented opportunity to evaluate the genetic liability of complex disease using polygenic risk scores (PRS). In this study, we have analysed 162 PRS (P < 5 × 10^−05) derived from GWAS and 551 heritable traits from the UK Biobank study (N = 334,398). Findings can be investigated using a web application (http://mrcieu.mrsoftware.org/PRS_atlas/), which we envisage will help uncover both known and novel mechanisms which contribute towards disease susceptibility. To demonstrate this, we have investigated the results from a phenome-wide evaluation of schizophrenia genetic liability. Amongst the findings were inverse associations with measures of cognitive function, for which extensive follow-up analyses using Mendelian randomization (MR) provided evidence of a causal relationship. We have also investigated the effect of multiple risk factors on disease using mediation and multivariable MR frameworks. Our atlas provides a resource for future endeavours seeking to unravel the causal determinants of complex disease.


2019 ◽  
Author(s):  
Zijie Zhao ◽  
Yanyao Yi ◽  
Yuchang Wu ◽  
Xiaoyuan Zhong ◽  
Yupei Lin ◽  
...  

Abstract: Polygenic risk scores (PRSs) have wide applications in human genetics research. Notably, most PRS models include tuning parameters which improve predictive performance when properly selected. However, existing model-tuning methods require individual-level genetic data as the training dataset or as a validation dataset independent from both training and testing samples. These data rarely exist in practice, creating a significant gap between PRS methodology and applications. Here, we introduce PUMAS (Parameter-tuning Using Marginal Association Statistics), a novel method to fine-tune PRS models using summary statistics from genome-wide association studies (GWASs). Through extensive simulations, external validations, and analysis of 65 traits, we demonstrate that PUMAS can perform a variety of model-tuning procedures (e.g. cross-validation) using GWAS summary statistics and can effectively benchmark and optimize PRS models under diverse genetic architecture. On average, PUMAS improves the predictive R² by 205.6% and 62.5% compared to PRSs with arbitrary p-value cutoffs of 0.01 and 1, respectively. Applied to 211 neuroimaging traits and Alzheimer’s disease, we show that fine-tuned PRSs will significantly improve statistical power in downstream association analysis. We believe our method resolves a fundamental problem without a current solution and will greatly benefit genetic prediction applications.

