29.3 IMPROVING MACHINE LEARNING PREDICTION OF ADHD USING GENE SET POLYGENIC RISK SCORES AND RISK SCORES FROM GENETICALLY CORRELATED DISORDERS

Author(s):  
Stephen V. Faraone
2022 ◽  
Author(s):  
Eric J Barnett ◽  
Yanli Zhang-James ◽  
Stephen V Faraone

Background: Polygenic risk scores (PRSs), which sum the effects of SNPs throughout the genome to measure risk afforded by common genetic variants, have improved our ability to estimate disorder risk for Attention-Deficit/Hyperactivity Disorder (ADHD), but the accuracy of risk prediction is rarely investigated. Methods: With the goal of improving risk prediction, we performed gene set analysis of GWAS data to select gene sets associated with ADHD within a training subset. For each selected gene set, we generated gene set polygenic risk scores (gsPRSs), which sum the effects of SNPs for each selected gene set. We created gsPRSs for ADHD and for phenotypes having a high genetic correlation with ADHD. These gsPRSs were added to the standard PRS as input to machine learning models predicting ADHD. We used feature importance scores to select gsPRSs for a final model and to generate a ranking of the most consistently predictive gsPRSs. Results: For a test subset that had not been used for training or validation, a random forest (RF) model using PRSs from ADHD and genetically correlated phenotypes and an optimized group of 20 gsPRSs had an area under the receiver operating characteristic curve (AUC) of 0.72 (95% CI: 0.70 to 0.74). This AUC was a statistically significant improvement over logistic regression models and RF models using only PRSs from ADHD and genetically correlated phenotypes. Conclusions: Summing risk at the gene set level and incorporating genetic risk from disorders with high genetic correlations with ADHD improved the accuracy of predicting ADHD. Learning curves suggest that additional improvements would be expected with larger study sizes. Our study suggests that better accounting of genetic risk and the genetic context of allelic differences results in more predictive models.
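As a rough illustration of the scores described in this abstract, the sketch below computes a genome-wide PRS and per-gene-set gsPRSs as weighted sums of risk-allele dosages, then stacks them into one feature vector of the kind a classifier could take as input. It is not the authors' pipeline: the SNP identifiers, effect sizes, gene-set assignments, and the `polygenic_score` helper are all hypothetical.

```python
from typing import Iterable, Mapping, Optional


def polygenic_score(
    dosages: Mapping[str, float],
    weights: Mapping[str, float],
    snps: Optional[Iterable[str]] = None,
) -> float:
    """Sum of effect-size weight times risk-allele dosage over the chosen SNPs."""
    selected = list(weights) if snps is None else list(snps)
    return sum(
        weights[s] * dosages[s] for s in selected if s in weights and s in dosages
    )


# Hypothetical per-SNP effect sizes (for example, log odds ratios from GWAS summary statistics).
weights = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20, "rs4": 0.02}

# Hypothetical gene sets, each mapped to the SNPs assigned to its genes.
gene_sets = {"set_A": ["rs1", "rs3"], "set_B": ["rs2", "rs4"]}

# One individual's risk-allele dosages (0, 1, or 2 copies).
dosages = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}

# Genome-wide PRS: all SNPs that have a weight.
prs = polygenic_score(dosages, weights)

# Gene set PRSs: the same sum, restricted to each gene set's SNPs.
gs_prs = {name: polygenic_score(dosages, weights, snps) for name, snps in gene_sets.items()}

# Feature vector: standard PRS followed by the gsPRSs in a fixed (sorted) order.
features = [prs] + [gs_prs[name] for name in sorted(gs_prs)]
```

In this toy setup the two gene sets partition the SNPs, so the gsPRSs add up to the genome-wide PRS; with overlapping or partial gene sets that would not hold.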


2017 ◽  
Author(s):  
Guillaume Paré ◽  
Shihong Mao ◽  
Wei Q. Deng

Abstract Machine-learning techniques have helped solve a broad range of prediction problems, yet are not widely used to build polygenic risk scores for the prediction of complex traits. We propose a novel heuristic based on machine-learning techniques (GraBLD) to boost the predictive performance of polygenic risk scores. Gradient boosted regression trees were first used to optimize the weights of SNPs included in the score, followed by a novel regional adjustment for linkage disequilibrium. A calibration set with sample size of ~200 individuals was sufficient for optimal performance. GraBLD yielded prediction R^2 of 0.239 and 0.082 using GIANT summary association statistics for height and BMI in the UK Biobank study (N=130K; 1.98M SNPs), explaining 46.9% and 32.7% of the overall polygenic variance, respectively. For diabetes status, the area under the receiver operating characteristic curve was 0.602 in the UK Biobank study using summary-level association statistics from the DIAGRAM consortium. GraBLD outperformed other polygenic score heuristics for the prediction of height (p < 2.2 × 10^-16) and BMI (p < 1.57 × 10^-4), and was equivalent to LDpred for diabetes. Results were independently validated in the Health and Retirement Study (N=8,292; 688,398 SNPs). Our report demonstrates the use of machine-learning techniques, coupled with summary-level data from large genome-wide meta-analyses, to improve the prediction of polygenic traits.
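The area under the receiver operating characteristic curve reported for diabetes status can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen control. The sketch below computes it that way by pairwise comparison, counting ties as one half. It is a generic illustration with made-up labels and scores, not code from GraBLD, and `roc_auc` is a hypothetical helper name.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    correct = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                correct += 1.0
            elif c == k:
                correct += 0.5
    return correct / (len(cases) * len(controls))


# Made-up example: two controls (0) and two cases (1) with their risk scores.
example_labels = [0, 0, 1, 1]
example_scores = [0.10, 0.40, 0.35, 0.80]
example_auc = roc_auc(example_labels, example_scores)
```

The pairwise loop is quadratic in sample size, which is fine for illustration; a rank-based formulation gives the same value more efficiently on large cohorts.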


2019 ◽  
Vol 29 ◽  
pp. S1126
Author(s):  
Paul O'Reilly ◽  
Yunfeng Ruan ◽  
Shing Wan Choi

2019 ◽  
Vol 29 ◽  
pp. S229-S230
Author(s):  
Georgia Panagiotaropoulou ◽  
Stephan Ripke ◽  
Emily Baker ◽  
Valentina Escott-Price

2020 ◽  
Vol 44 (2) ◽  
pp. 125-138 ◽  
Author(s):  
Damian Gola ◽  
Jeannette Erdmann ◽  
Bertram Müller‐Myhsok ◽  
Heribert Schunkert ◽  
Inke R. König

2021 ◽  
Author(s):  
Sijia Huang ◽  
Xiao Ji ◽  
Michael Cho ◽  
Jaehyun Joo ◽  
Jason Moore

Abstract Background: COPD is a complex heterogeneous disease influenced by both environmental and genetic risk factors. Traditional genome-wide association studies (GWAS) have been successful in identifying many reproducible risk variants of moderate to small effect. Polygenic risk scores (PRS) were developed as a way to aggregate risk alleles weighted by their effect size to produce a score which could be used in clinical practice to identify individuals at high risk of disease. A limitation of both GWAS and PRS is that they make the important assumption that the effect of each allele is independent and not modified by other genetic or environmental factors. Machine learning methods such as deep learning (DL) neural networks complement the GWAS and PRS paradigm by making fewer assumptions about the nature of the genetic effects being modeled. For example, the hidden layers of a DL model have the potential to model gene-gene interactions with non-additive effects on disease risk. The goal of the present study was to develop a DL neural network approach to GWAS and PRS and to compare it to the prevailing paradigm based on modeling independent effects. We applied our DL-PRS method to genetic association data from several GWAS of chronic obstructive pulmonary disease (COPD). Results: We developed a DL algorithm for modeling the relationship between genetic variation from GWAS and risk of COPD in several population-based studies. We then developed a DL-PRS based on nodes and associated weights from the first and second layers of the DL neural network. Our DL-PRS framework has overall satisfactory performance in the prediction of COPD and contributes significantly to prediction beyond current PRS methods. Moreover, regarding the clinical relevance of COPD, our DL-PRS has a more consistent and closer relationship between individual deciles and lung function measures such as FEV1/FVC and predicted FEV1%. Conclusions: Not only does DL-PRS show favorable predictive performance compared with current benchmark PRS methods, but it also extends the range of PRS deciles in predicting different stages of COPD. Moreover, our DL-PRS results were replicated in an independent cohort. This study opens the door to the use of machine learning for developing risk scores from models that make fewer assumptions about the nature of the genetic effects.


2020 ◽  
Author(s):  
Joseph D. Deak

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI-COLUMBIA AT REQUEST OF AUTHOR.] The current study aimed to extend previous genetic studies of level of response (LR) to alcohol by conducting the largest genome-wide association study (GWAS) of LR to date through the meta-analysis of multiple samples with extant SRE (Self-rating of the Effects of Alcohol) and GWAS data. A second aim was to use summary data from the described GWAS of LR to create polygenic risk scores (PRS) in an independent sample in order to determine whether, and to what extent, the genetic influences underlying LR to alcohol serve as a risk factor for alcohol use disorder (AUD). Towards these aims, datasets were processed according to standard quality control (QC) procedures allowing for genotype imputation and GWA analysis using methods appropriate for the individual study designs. Following individual study-level GWAS analysis, results were meta-analyzed utilizing an inverse-variance weighted fixed-effects model in METAL, resulting in a final sample size of N=10,635. GWAS summary statistics from the SRE meta-analysis were then used to conduct gene-based and gene-set analyses, as well as compute PRS in an independent target sample to examine the predictive ability of the LR to alcohol PRS for DSM-IV AD symptom counts. No individual variants, genes, or gene-sets achieved study-level significance, although multiple genetic loci of interest achieved suggestive significance. The top single variant association was in an intergenic region on chromosome 2 located near the FUNDC2P2 gene (rs12463481; p = 6.35 × 10^-8), the top gene-based association was with the PRR16 gene on chromosome 5 (p = 6.72 × 10^-6), and the top gene-set was a set of genes associated with NFE2L2 targets (p = 1.21 × 10^-5). No results from the PRS analysis approached significance. These findings suggest that, similar to other alcohol use outcomes, larger sample sizes will be required for the robust detection of genetic influences contributing to level of response to alcohol.
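For readers unfamiliar with the inverse-variance weighted fixed-effects model mentioned in this abstract, the sketch below shows the standard calculation for a single variant: each study's effect estimate is weighted by the reciprocal of its squared standard error, and the pooled standard error is the square root of the reciprocal of the summed weights. This is a generic textbook version with made-up numbers, not METAL's code or the study's data, and `ivw_fixed_effect` is a hypothetical helper name.

```python
import math
from typing import Sequence, Tuple


def ivw_fixed_effect(
    betas: Sequence[float], ses: Sequence[float]
) -> Tuple[float, float, float]:
    """Pool per-study effects for one variant; returns (beta, standard error, z)."""
    if len(betas) != len(ses) or not betas:
        raise ValueError("need one standard error per effect estimate")
    # Weight each study by the inverse of its sampling variance.
    weights = [1.0 / (se ** 2) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


# Made-up per-study estimates for a single variant.
study_betas = [0.10, 0.30]
study_ses = [0.10, 0.10]
pooled_beta, pooled_se, pooled_z = ivw_fixed_effect(study_betas, study_ses)
```

Because more precise studies get larger weights, the pooled estimate is pulled toward studies with smaller standard errors; with equal standard errors, as here, it reduces to the simple mean.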


2021 ◽  
Author(s):  
Baoshan Ma ◽  
Jianqiao Pan ◽  
Xiaoyu Hou ◽  
Chongyang Li ◽  
Tong Xiong ◽  
...  

Abstract Background: Breast cancer accounts for a large proportion of cancer-related deaths in women. Polygenic risk scores (PRS) derived from single nucleotide polymorphism (SNP) data can evaluate the individual-level genetic risk of breast cancer and have been widely applied for risk stratification. However, standalone SNP data used for PRS may not provide satisfactory prediction accuracy. Additionally, current PRS models based on linear regression have insufficient power to leverage non-linear effects from thousands of associated SNPs. Methods: In this study, multiple omics data (DNA methylation data, miRNA data, mRNA data and lncRNA data) and clinical data of breast invasive carcinoma (BRCA) were collected from The Cancer Genome Atlas (TCGA). First, we developed a novel PRS model utilizing single omic data and a machine learning algorithm (LightGBM). Subsequently, we built a combination model of PRS derived from each omic data type to explore whether multiple omics data can further improve the prediction accuracy of PRS. Finally, we performed association analysis and prognosis prediction of breast cancer to evaluate the utility of the PRS generated by our method. Results: Our PRS model based on single omic data and the LightGBM algorithm achieved better predictive performance than the linear models and other machine learning models. Moreover, combining the PRS derived from each omic data type further improved prediction accuracy. The analysis of prevalence and the associations of the PRS with phenotypes including case-control and cancer stage status indicated that the risk of breast cancer increases as the PRS increases. The survival analysis also suggested that PRS for the cancer stage is an effective prognostic metric for breast cancer patients. Conclusion: Our proposed model expanded the current definition of PRS from standalone SNP data to multiple omics data and outperformed the state-of-the-art PRS models, which may provide a powerful tool for diagnostic and prognostic prediction of breast cancer.

