Polygenic scores via penalized regression on summary statistics

Abstract: Polygenic scores (PGS) summarize the genetic contribution of a person’s genotype to a disease or phenotype. They can be used to group participants into different risk categories for diseases, and are also used as covariates in epidemiological analyses. A number of possible ways of calculating polygenic scores have been proposed, and recently there is much interest in methods that incorporate information available in published summary statistics. As there is no inherent information on linkage disequilibrium (LD) in summary statistics, a pertinent question is how we can make use of LD information available elsewhere to supplement such analyses. To answer this question we propose a method for constructing PGS using summary statistics and a reference panel in a penalized regression framework, which we call lassosum. We also propose a general method for choosing the value of the tuning parameter in the absence of validation data. In our simulations, we showed that pseudovalidation often resulted in prediction accuracy that is comparable to using a dataset with validation phenotype and was clearly superior to the conservative option of setting the tuning parameter of lassosum to its lowest value. We also showed that lassosum achieved better prediction accuracy than simple clumping and p-value thresholding in almost all scenarios. It was also substantially faster and more accurate than the recently proposed LDpred.
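
The idea of fitting a lasso from summary statistics plus a reference panel can be illustrated with a minimal sketch. This is not the lassosum implementation. It assumes standardized genotypes, a vector `r` of SNP-phenotype correlations derived from summary statistics, and an LD correlation matrix `R_ref` estimated from a reference panel. The objective used here, β'Rβ − 2β'r + 2λ‖β‖₁ with R = (1 − s)·R_ref + s·I, the shrinkage parameter `s`, and all function names are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator used by lasso coordinate descent."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_from_summary(r, R_ref, lam, s=0.5, n_iter=200, tol=1e-8):
    """Minimize b'Rb - 2 b'r + 2*lam*||b||_1 by coordinate descent.

    r     : SNP-phenotype correlations (assumed input from summary statistics)
    R_ref : LD correlation matrix from a reference panel (assumed input)
    s     : hypothetical shrinkage of R_ref towards the identity
    """
    r = np.asarray(r, dtype=float)
    R = (1.0 - s) * np.asarray(R_ref, dtype=float) + s * np.eye(len(r))
    beta = np.zeros(len(r))
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(len(r)):
            # r_j minus the LD-weighted contribution of all other SNPs
            partial = r[j] - R[j] @ beta + R[j, j] * beta[j]
            new = soft_threshold(partial, lam) / R[j, j]
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

# Toy usage with three SNPs, the first two in LD with each other.
r = np.array([0.3, -0.05, 0.1])
R_ref = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
weights = lasso_from_summary(r, R_ref, lam=0.08)
```

The resulting `weights` would then be multiplied by an individual's standardized genotypes to obtain a score. In this sketch the tuning parameter `lam` is simply supplied by the caller.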

Polygenic scores for UK Biobank scale data

10.1101/252270 ◽

2018 ◽

Cited By ~ 5

Author(s):

Timothy Shin Heng Mak ◽

Robert Milan Porsch ◽

Shing Wan Choi ◽

Pak Chung Sham

Keyword(s):

External Source ◽

Summary Statistics ◽

UK Biobank ◽

Validation Data ◽

Raw Data ◽

Cross Prediction ◽

Polygenic Scores ◽

The Difference ◽

The UK ◽

Meta Analyses

Abstract: Polygenic scores (PGS) are estimated scores representing the genetic tendency of an individual for a disease or trait and have become an indispensable tool in a variety of analyses. Typically they are a linear combination of the genotypes of a large number of SNPs, with the weights calculated from an external source, such as summary statistics from large meta-analyses. Recently, cohorts with genetic data have become very large, such that it would be a waste if the raw data were not made use of in constructing PGS. Making use of raw data in calculating PGS, however, presents us with problems of overfitting. Here we discuss the essence of overfitting as applied in PGS calculations and highlight the difference between overfitting due to the overlap between the target and the discovery data (OTD), and overfitting due to the overlap between the target and the validation data (OTV). We propose two methods, cross prediction and split validation, to overcome OTD and OTV respectively. Using these two methods, PGS can be calculated using raw data without overfitting. We show that PGSs thus calculated have better predictive power than those using summary statistics alone for six phenotypes in the UK Biobank data.
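
The general idea of avoiding overlap between target and discovery data can be sketched with a generic K-fold scheme: each individual's score is a linear combination of their genotypes, with weights estimated only from individuals in the other folds. This is a sketch under stated assumptions, not the authors' procedure. The marginal regression weights, the fold assignment, and the function names `marginal_weights` and `cross_predict` are all hypothetical choices made for illustration.

```python
import numpy as np

def marginal_weights(X, y):
    """Per-SNP marginal regression slopes (an illustrative choice of weights)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return Xc.T @ yc / (Xc ** 2).sum(axis=0)

def cross_predict(X, y, n_folds=5, seed=0):
    """Score each individual with weights learned from the other folds only."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds   # balanced random fold labels
    pgs = np.empty(len(y))
    for k in range(n_folds):
        target = folds == k
        # Discovery data excludes the target fold, so the two never overlap.
        w = marginal_weights(X[~target], y[~target])
        pgs[target] = X[target] @ w             # PGS = weighted sum of genotypes
    return pgs

# Toy usage on simulated data (200 individuals, 10 SNP-like predictors).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X @ np.full(10, 0.5) + rng.normal(size=200)
pgs = cross_predict(X, y)
```

Every individual receives a score, yet no one's phenotype contributes to the weights used to score them.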

Polygenic scores via penalized regression on summary statistics

Genetic Epidemiology ◽

10.1002/gepi.22050 ◽

2017 ◽

Vol 41 (6) ◽

pp. 469-480 ◽

Cited By ~ 58

Author(s):

Timothy Shin Heng Mak ◽

Robert Milan Porsch ◽

Shing Wan Choi ◽

Xueya Zhou ◽

Pak Chung Sham

Keyword(s):

Penalized Regression ◽

Summary Statistics ◽

Polygenic Scores

Faculty Opinions recommendation of Polygenic scores via penalized regression on summary statistics.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.727593680.793563010 ◽

2019 ◽

Author(s):

John Nurnberger

Keyword(s):

Penalized Regression ◽

Summary Statistics ◽

Polygenic Scores

Penalized regression and model selection methods for polygenic scores on summary statistics

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008271 ◽

2020 ◽

Vol 16 (10) ◽

pp. e1008271

Author(s):

Jack Pattee ◽

Wei Pan

Keyword(s):

Model Selection ◽

Penalized Regression ◽

Summary Statistics ◽

Selection Methods ◽

Polygenic Scores

Contrasting Broad- and Clinically- defined Polygenic Indicators of Depression and Depression-related Phenotypes in Adults and Children

10.31234/osf.io/pn9vb ◽

2020 ◽

Author(s):

John E. McGeary ◽

Chelsie Benca-Bachman ◽

Victoria Risner ◽

Christopher G Beevers ◽

Brandon Gibb ◽

...

Keyword(s):

Suicidal Ideation ◽

Cognitive Reappraisal ◽

Twin Studies ◽

European Ancestry ◽

Summary Statistics ◽

Depression Severity ◽

UK Biobank ◽

Polygenic Scores ◽

Adults And Children ◽

The UK

Twin studies indicate that 30-40% of the disease liability for depression can be attributed to genetic differences. Here, we assess the explanatory ability of polygenic scores (PGS) based on broad- (PGSBD) and clinical- (PGSMDD) depression summary statistics from the UK Biobank using independent cohorts of adults (N=210; 100% European Ancestry) and children (N=728; 70% European Ancestry) who have been extensively phenotyped for depression and related neurocognitive phenotypes. PGS associations with depression severity and diagnosis were generally modest, and larger in adults than children. Polygenic prediction of depression-related phenotypes was mixed and varied by PGS. Higher PGSBD, in adults, was associated with a higher likelihood of having suicidal ideation, increased brooding and anhedonia, and lower levels of cognitive reappraisal; PGSMDD was positively associated with brooding and negatively related to cognitive reappraisal. Overall, PGS based on both broad and clinical depression phenotypes have modest utility in adult and child samples of depression.

LDpred-funct: incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets

10.1101/375337 ◽

2018 ◽

Cited By ~ 21

Author(s):

Carla Márquez-Luna ◽

Steven Gazal ◽

Po-Ru Loh ◽

Samuel S. Kim ◽

Nicholas Furlotte ◽

...

Keyword(s):

Complex Traits ◽

Prediction Accuracy ◽

Causal Effect ◽

Complex Trait ◽

Training Data ◽

Data Sets ◽

UK Biobank ◽

Validation Data ◽

Functional Regions ◽

The UK

Abstract: Genetic variants in functional regions of the genome are enriched for complex trait heritability. Here, we introduce a new method for polygenic prediction, LDpred-funct, that leverages trait-specific functional priors to increase prediction accuracy. We fit priors using the recently developed baseline-LD model, which includes coding, conserved, regulatory and LD-related annotations. We analytically estimate posterior mean causal effect sizes and then use cross-validation to regularize these estimates, improving prediction accuracy for sparse architectures. LDpred-funct attained higher prediction accuracy than other polygenic prediction methods in simulations using real genotypes. We applied LDpred-funct to predict 21 highly heritable traits in the UK Biobank. We used association statistics from British-ancestry samples as training data (avg N=373K) and samples of other European ancestries as validation data (avg N=22K), to minimize confounding. LDpred-funct attained a +4.6% relative improvement in average prediction accuracy (avg prediction R²=0.144; highest R²=0.413 for height) compared to SBayesR (the best method that does not incorporate functional information). For height, meta-analyzing training data from UK Biobank and 23andMe cohorts (total N=1107K; higher heritability in UK Biobank cohort) increased prediction R² to 0.431. Our results show that incorporating functional priors improves polygenic prediction accuracy, consistent with the functional architecture of complex traits.

Performance analysis on least absolute shrinkage selection operator, elastic net and correlation adjusted elastic net regression methods

International Journal of Advanced Statistics and Probability ◽

10.14419/ijasp.v3i1.4364 ◽

2015 ◽

Vol 3 (1) ◽

pp. 93

Author(s):

Pascalis Kadaro Matthew ◽

Abubakar Yahaya

Keyword(s):

Linear Regression ◽

Prediction Accuracy ◽

Penalized Regression ◽

Ordinary Least Squares ◽

Complex Model ◽

Elastic Net ◽

Data Set ◽

Regression Methods ◽

Regression Techniques ◽

Selection Operator

Over the past few decades, penalized regression techniques for linear regression have been developed specifically to reduce the flaws inherent in the prediction accuracy of the classical ordinary least squares (OLS) regression technique. In this paper, we used a diabetes data set obtained from previous literature to compare three of these well-known techniques, namely: Least Absolute Shrinkage Selection Operator (LASSO), Elastic Net and Correlation Adjusted Elastic Net (CAEN). After thorough analysis, it was observed that CAEN generated a less complex model.

A penalized regression approach for DNA copy number study using the sequencing data

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2018-0001 ◽

2019 ◽

Vol 18 (4) ◽

Cited By ~ 1

Author(s):

Jaeeun Lee ◽

Jie Chen

Keyword(s):

Copy Number ◽

Information Criterion ◽

Penalized Regression ◽

Parameter Selection ◽

Information Criteria ◽

Tuning Parameter ◽

Dimensional Structure ◽

Sequencing Data ◽

DNA Copy Number ◽

Regression Approach

Abstract: Modeling high-throughput next-generation sequencing (NGS) data from experiments that profile tumor and control samples to study DNA copy number variants (CNVs) remains a challenge in various ways. In this application work, we provide an efficient method for detecting multiple CNVs using NGS reads ratio data. This method is based on a multiple statistical change-points model with the penalized regression approach, 1d fused LASSO, that is designed for ordered data in a one-dimensional structure. In addition, since the path algorithm traces the solution as a function of a tuning parameter, the number and locations of potential CNV region boundaries can be estimated simultaneously in an efficient way. For tuning parameter selection, we then propose a new modified Bayesian information criterion, called JMIC, and compare the proposed JMIC with three different Bayes information criteria used in the literature. Simulation results show that JMIC performs better for tuning parameter selection than the other three criteria. We applied our approach to the sequencing data of reads ratio between the breast tumor cell line HCC1954 and its matched normal cell line BL 1954, and the results are in line with those reported in the literature.
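
For orientation, the standard one-dimensional fused lasso (signal approximator) objective is written out below. The abstract does not state the exact formulation used, so the symbols here are assumptions: $y_i$ denotes the ordered observations (for example, reads ratios along the genome), $\theta_i$ the fitted piecewise-constant signal, and $\lambda$ the tuning parameter.

```latex
\hat{\theta} \;=\; \arg\min_{\theta \in \mathbb{R}^n}\;
  \frac{1}{2}\sum_{i=1}^{n} (y_i - \theta_i)^2
  \;+\; \lambda \sum_{i=2}^{n} \lvert \theta_i - \theta_{i-1} \rvert
```

Positions where $\hat{\theta}_i \neq \hat{\theta}_{i-1}$ are the estimated change points. Larger values of $\lambda$ yield fewer change points, which is why the choice of tuning parameter determines the number of candidate region boundaries.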

A Survey of Tuning Parameter Selection for High-Dimensional Regression

Annual Review of Statistics and Its Application ◽

10.1146/annurev-statistics-030718-105038 ◽

2020 ◽

Vol 7 (1) ◽

pp. 209-226 ◽

Cited By ~ 2

Author(s):

Yunan Wu ◽

Lan Wang

Keyword(s):

Standard Technique ◽

Penalized Regression ◽

Optimal Choice ◽

Parameter Selection ◽

Tuning Parameter ◽

High Dimensional ◽

Design Matrix ◽

High Dimensional Regression ◽

Selection For ◽

Sparsity Level

Penalized (or regularized) regression, as represented by lasso and its variants, has become a standard technique for analyzing high-dimensional data when the number of variables substantially exceeds the sample size. The performance of penalized regression relies crucially on the choice of the tuning parameter, which determines the amount of regularization and hence the sparsity level of the fitted model. The optimal choice of tuning parameter depends on both the structure of the design matrix and the unknown random error distribution (variance, tail behavior, etc.). This article reviews the current literature of tuning parameter selection for high-dimensional regression from both the theoretical and practical perspectives. We discuss various strategies that choose the tuning parameter to achieve prediction accuracy or support recovery. We also review several recently proposed methods for tuning-free high-dimensional regression.
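
K-fold cross-validation is one common way to choose a tuning parameter for prediction accuracy; the abstract does not name the specific strategies it reviews, so this is offered only as a generic illustration. The sketch below assumes centered predictors and response with no intercept, and uses the objective (1/2n)‖y − Xb‖² + λ‖b‖₁. The function names `lasso_cd` and `cv_choose_lambda` are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]            # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * b[j]            # add the updated coordinate back
    return b

def cv_choose_lambda(X, y, lambdas, n_folds=5, seed=0):
    """Return the lambda with the lowest mean held-out squared error."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds
    mse = np.zeros(len(lambdas))
    for k in range(n_folds):
        test = folds == k
        for i, lam in enumerate(lambdas):
            b = lasso_cd(X[~test], y[~test], lam)
            mse[i] += np.mean((y[test] - X[test] @ b) ** 2) / n_folds
    return lambdas[int(np.argmin(mse))], mse

# Toy usage: one true predictor among twenty.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 0] + rng.normal(size=200)
best_lam, cv_mse = cv_choose_lambda(X, y, [0.01, 0.1, 1.0, 5.0])
```

Cross-validation targets prediction error. A criterion aimed at support recovery may select a different value of the tuning parameter.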

MTAG: Multi-Trait Analysis of GWAS

10.1101/118810 ◽

2017 ◽

Cited By ~ 19

Author(s):

Patrick Turley ◽

Raymond K. Walters ◽

Omeed Maghzian ◽

Aysu Okbay ◽

James J. Lee ◽

...

Keyword(s):

Depressive Symptoms ◽

Well Being ◽

Joint Analysis ◽

Summary Statistics ◽

Subjective Well Being ◽

Bioinformatics Analyses ◽

Trait Analysis ◽

Genome Wide ◽

Polygenic Scores ◽

Variance Explained

Abstract: We introduce Multi-Trait Analysis of GWAS (MTAG), a method for joint analysis of summary statistics from GWASs of different traits, possibly from overlapping samples. We apply MTAG to summary statistics for depressive symptoms (Neff = 354,862), neuroticism (N = 168,105), and subjective well-being (N = 388,538). Compared to 32, 9, and 13 genome-wide significant loci in the single-trait GWASs (most of which are themselves novel), MTAG increases the number of loci to 64, 37, and 49, respectively. Moreover, association statistics from MTAG yield more informative bioinformatics analyses and increase variance explained by polygenic scores by approximately 25%, matching theoretical expectations.
