Deep integrative models for large-scale human genomics
Polygenic risk scores (PRSs) are expected to play a critical role in achieving precision medicine. PRS predictors are generally based on linear models using summary statistics and, more recently, individual-level data. However, these predictors generally capture only additive relationships and are limited in the types of data they can use. Here, we develop a deep learning framework (EIR) for PRS prediction, which includes a model, genome-local-net (GLN), that we designed specifically for large-scale genomics data. The framework supports multi-task (MT) learning, automatic integration of clinical and biochemical data, and model explainability. GLN outperforms LASSO for a wide range of diseases, particularly autoimmune diseases, which have been studied for interaction effects. We showcase the flexibility of the framework by training one MT model to predict 338 diseases simultaneously. Furthermore, we find that incorporating measurement data for PRSs improves performance for virtually all (93%) diseases considered (ROC-AUC improvement up to 0.36) and that including genotype data provides better model calibration compared to measurements alone. We use the framework to analyse what our models learn and find that they learn both relevant disease variants and clinical measurements. EIR is open source and available at https://github.com/arnor-sigurdsson/EIR.
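To make the additive baseline concrete, a linear PRS can be sketched as a weighted sum of per-variant allele counts. The sketch below is illustrative only: the function name `additive_prs` and the toy effect sizes are hypothetical and are not part of EIR or GLN. It shows why such a score cannot represent interactions between variants, since each variant contributes independently of the others.

```python
from typing import Sequence


def additive_prs(genotypes: Sequence[int], effect_sizes: Sequence[float]) -> float:
    """Return an additive polygenic score for one individual.

    genotypes: allele counts per variant (0, 1, or 2).
    effect_sizes: one per-variant weight, e.g. from summary statistics.
    """
    if len(genotypes) != len(effect_sizes):
        raise ValueError("genotypes and effect_sizes must have the same length")
    # Each variant contributes independently; there are no interaction terms.
    return sum(g * b for g, b in zip(genotypes, effect_sizes))


# Hypothetical toy values for three variants (assumed for illustration).
effect_sizes = [0.5, -0.25, 1.0]
individual = [2, 1, 0]
score = additive_prs(individual, effect_sizes)
print(score)
```

A model that captures non-additive structure would need terms that depend on combinations of variants, which this weighted sum lacks by construction.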