S23 MACHINE LEARNING METHODS TO PREDICT UNMEASURED PHENOTYPES IN LARGE-SCALE BIOBANK STUDIES: PROOF OF PRINCIPLE USING AUDIT IN THE UK BIOBANK

Erectile dysfunction affects millions of men worldwide. Twin studies support the role of genetic risk factors underlying erectile dysfunction, but no specific genetic variants have been identified. We conducted a large-scale genome-wide association study of erectile dysfunction in 36,649 men in the multiethnic Kaiser Permanente Northern California Genetic Epidemiology Research in Adult Health and Aging cohort. We also undertook replication analyses in 222,358 men from the UK Biobank. In the discovery cohort, we identified a single locus (rs17185536-T) on chromosome 6 near the single-minded family basic helix-loop-helix transcription factor 1 (SIM1) gene that was significantly associated with the risk of erectile dysfunction (odds ratio = 1.26, P = 3.4 × 10⁻²⁵). The association replicated in the UK Biobank sample (odds ratio = 1.25, P = 6.8 × 10⁻¹⁴), and the effect is independent of known erectile dysfunction risk factors, including body mass index (BMI). The risk locus resides on the same topologically associating domain as SIM1 and interacts with the SIM1 promoter, and the rs17185536-T risk allele showed differential enhancer activity. SIM1 is part of the leptin–melanocortin system, which has an established role in body weight homeostasis and sexual function. Because the variants associated with erectile dysfunction are not associated with differences in BMI, our findings suggest a mechanism that is specific to sexual function.


Large-scale analysis of iliopsoas muscle volumes in the UK Biobank

Scientific Reports ◽

10.1038/s41598-020-77351-0 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Julie A. Fitzpatrick ◽

Nicolas Basty ◽

Madeleine Cule ◽

Yi Liu ◽

Jimmy D. Bell ◽

...

Keyword(s):

Large Scale ◽

Large Population ◽

Psoas Muscle ◽

Magnetic Resonance Images ◽

Muscle Volume ◽

Iliopsoas Muscle ◽

UK Biobank ◽

Cross Sectional ◽

Automated Method ◽

The UK

Abstract: Psoas muscle measurements are frequently used as markers of sarcopenia and predictors of health. Manually measured cross-sectional areas are most commonly used, but the position of the measurement is inconsistent across studies, and manual annotation is not practical for large population studies. We have developed a fully automated method to measure iliopsoas muscle volume (comprising the psoas and iliacus muscles) using a convolutional neural network. Magnetic resonance images were obtained from the UK Biobank for 5000 participants, balanced for age, gender and BMI. Ninety manual annotations were available for model training and validation. The model showed excellent performance against out-of-sample data (average Dice score coefficient of 0.9046 ± 0.0058 for six-fold cross-validation). Iliopsoas muscle volumes were successfully measured in all 5000 participants. Iliopsoas volume was greater in male than in female subjects. There was a small but significant asymmetry between left and right iliopsoas muscle volumes. We also found that iliopsoas volume was significantly related to height, BMI and age, and that the decrease in muscle volume accelerated with age in men. Our method provides a robust technique for measuring iliopsoas muscle volume that can be applied to large cohorts.
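For readers unfamiliar with the Dice score coefficient reported above, the following is a minimal sketch of how it is usually computed for a pair of binary segmentation masks. The function name and the toy masks are illustrative assumptions, not the authors' evaluation code; real masks would be 3D voxel arrays flattened to one label per voxel.

```python
def dice_coefficient(pred, truth):
    """Dice score for two equal-length binary masks (one label per voxel).

    Dice = 2 * |pred AND truth| / (|pred| + |truth|), ranging from 0 to 1.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * overlap / total


# Toy masks (made up for illustration): 1 = muscle voxel, 0 = background.
predicted = [1, 1, 0, 0, 1, 0]
manual = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(predicted, manual)
print(round(score, 4))
```

In a k-fold setting such as the six-fold cross-validation mentioned in the abstract, a score like this would typically be computed per held-out scan and then averaged.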


Polynomial Mendelian Randomization reveals widespread non-linear causal effects in the UK Biobank

10.1101/2021.12.08.471751 ◽

2021 ◽

Author(s):

Jonathan Sulc ◽

Jenny Sjaarda ◽

Zoltan Kutalik

Keyword(s):

Causal Inference ◽

Large Scale ◽

Causal Effect ◽

Causal Effects ◽

Mendelian Randomisation ◽

UK Biobank ◽

Individual Level ◽

Glucose Levels ◽

Causal Function ◽

The UK

Causal inference is a critical step in improving our understanding of biological processes and Mendelian randomisation (MR) has emerged as one of the foremost methods to efficiently interrogate diverse hypotheses using large-scale, observational data from biobanks. Although many extensions have been developed to address the three core assumptions of MR-based causal inference (relevance, exclusion restriction, and exchangeability), most approaches implicitly assume that any putative causal effect is linear. Here we propose PolyMR, an MR-based method which provides a polynomial approximation of an (arbitrary) causal function between an exposure and an outcome. We show that this method infers the shape and magnitude of causal functions with greater accuracy than existing methods. We applied this method to data from the UK Biobank, testing for effects between anthropometric traits and continuous health-related phenotypes and found most of these (84%) to have causal effects which deviate significantly from linear. These deviations ranged from slight attenuation at the extremes of the exposure distribution, to large changes in the magnitude of the effect across the range of the exposure (e.g. a 1 kg/m² change in BMI having stronger effects on glucose levels if the initial BMI was higher), to non-monotonic causal relationships (e.g. the effects of BMI on cholesterol forming an inverted U shape). Finally, we show that the linearity assumption of the causal effect may lead to the misinterpretation of health risks at the individual level or heterogeneous effect estimates when using cohorts with differing average exposure levels.
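To make the idea of a polynomial causal function concrete, here is a small simulated sketch. It uses a generic control-function regression (first-stage residual added as a covariate), which is one common way to estimate non-linear effects with an instrument; it is not claimed to be the PolyMR implementation. All variable names, effect sizes and the simulated genetic score are assumptions chosen for illustration.

```python
import random


def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, y):
    """Ordinary least squares via the normal equations; each row is a feature list."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Simulated data (all numbers hypothetical): g is a genetic score used as the
# instrument, u an unobserved confounder, and the true causal function is
# f(x) = 0.3 * x + 0.2 * x**2.
rng = random.Random(0)
n = 40000
g = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
x = [0.5 * gi + ui + rng.gauss(0, 1) for gi, ui in zip(g, u)]
y = [0.3 * xi + 0.2 * xi ** 2 + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

# Stage 1: regress the exposure on the instrument and keep the residual,
# which carries the confounded part of the exposure.
a0, a1 = ols([[1.0, gi] for gi in g], x)
v = [xi - a0 - a1 * gi for xi, gi in zip(x, g)]

# Stage 2: polynomial regression of the outcome on the exposure, adjusting
# for the first-stage residual (the "control function").
cf = ols([[1.0, xi, xi ** 2, vi] for xi, vi in zip(x, v)], y)

# Naive polynomial regression that ignores confounding, for comparison.
naive = ols([[1.0, xi, xi ** 2] for xi in x], y)

print("control function: linear=%.2f quadratic=%.2f" % (cf[1], cf[2]))
print("naive regression: linear=%.2f quadratic=%.2f" % (naive[1], naive[2]))
```

In this toy setup the naive fit overstates the linear term because the confounder raises both exposure and outcome, while the adjusted fit should land near the simulated coefficients. The sketch relies on a linear first stage and a linear relationship between the confounder and the first-stage residual, which real data need not satisfy.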


A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank

PLoS Genetics ◽

10.1371/journal.pgen.1009141 ◽

2020 ◽

Vol 16 (10) ◽

pp. e1009141

Author(s):

Junyang Qian ◽

Yosuke Tanigawa ◽

Wenfei Du ◽

Matthew Aguirre ◽

Chris Chang ◽

...

Keyword(s):

Large Scale ◽

UK Biobank ◽

Sparse Regression ◽

The UK


A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank

10.1101/630079 ◽

2019 ◽

Cited By ~ 5

Author(s):

Junyang Qian ◽

Yosuke Tanigawa ◽

Wenfei Du ◽

Matthew Aguirre ◽

Chris Chang ◽

...

Keyword(s):

Body Mass Index ◽

Variable Selection ◽

Multiple Regression ◽

Large Scale ◽

Prediction Performance ◽

UK Biobank ◽

High Cholesterol ◽

Computational Framework ◽

Regression Methods ◽

The UK

Abstract: The UK Biobank (Bycroft et al., 2018) is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with GWAS, have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso (Tibshirani, 1996) has, since its first proposal in statistics, proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports ℓ1-penalized linear, logistic, and Cox models, and also extends to the elastic net with an ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve superior predictive performance on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol.

Author Summary: With the advent and evolution of large-scale and comprehensive biobanks, unprecedented opportunities arise for researchers to further uncover the complex landscape of human genetics. One major direction that attracts long-standing interest is the investigation of the relationships between genotypes and phenotypes. This includes, but is not limited to, the identification of genotypes that are significantly associated with the phenotypes, and the prediction of phenotypic values based on the genotypic information. Genome-wide association studies (GWAS) are a very powerful and widely used framework for the former task, having produced a number of very impactful discoveries. However, when it comes to the latter, their performance is fairly limited by their univariate nature. To address this, multiple regression methods have been suggested to fill the gap. That said, challenges emerge as both the dimension and the size of datasets grow large. In this paper, we present a novel computational framework that enables us to efficiently solve the entire lasso or elastic-net solution path on large-scale and ultrahigh-dimensional data, and therefore perform simultaneous variable selection and prediction. Our approach can build on any existing lasso solver for small or moderate-sized problems, scale it up to a big-data solution, and incorporate other extensions easily. We provide a package, snpnet, that extends the glmnet package in R and optimizes for large phenotype-genotype data. On the UK Biobank, we observe improved prediction performance on height, body mass index (BMI), asthma and high cholesterol by the lasso over other univariate and multiple regression methods. That said, the scope of our approach goes beyond genetic studies. It can be applied to general sparse regression problems and build scalable solutions for a variety of distribution families based on existing solvers.
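The screen, fit, and check pattern that the name "batch screening iterative lasso" suggests can be sketched as follows. This is a rough single-penalty illustration under stated assumptions: the helper names, the batch rule (add the strongest violators first) and the toy data are hypothetical, the inner solver is a plain coordinate descent standing in for "any existing lasso solver", and none of it reproduces the snpnet or glmnet code or the solution-path logic of the paper.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    return z - t if z > t else z + t if z < -t else 0.0


def lasso_cd(cols, y, lam, n_iter=200):
    """Plain coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    `cols` is a list of feature columns; returns (coefficients, residuals).
    """
    n = len(y)
    beta = [0.0] * len(cols)
    resid = list(y)
    norms = [sum(v * v for v in c) / n for c in cols]
    for _ in range(n_iter):
        for j, c in enumerate(cols):
            if norms[j] == 0.0:
                continue
            rho = sum(ci * ri for ci, ri in zip(c, resid)) / n + norms[j] * beta[j]
            new = soft_threshold(rho, lam) / norms[j]
            if new != beta[j]:
                delta = new - beta[j]
                resid = [ri - delta * ci for ri, ci in zip(resid, c)]
                beta[j] = new
    return beta, resid


def batch_screen_lasso(cols, y, lam, batch=2):
    """Fit the lasso on a growing subset of features, checking the rest.

    Each round: (1) score every feature by |x_j' r| / n, (2) add up to `batch`
    features that violate the optimality (KKT) condition |x_j' r| / n <= lam,
    (3) refit on the current subset only. Stops when no feature violates it.
    """
    n, p = len(y), len(cols)
    active, beta_sub, resid = [], [], list(y)
    while True:
        grad = [abs(sum(ci * ri for ci, ri in zip(c, resid))) / n for c in cols]
        violators = [j for j in range(p) if j not in active and grad[j] > lam + 1e-9]
        if not violators:
            break
        violators.sort(key=lambda j: -grad[j])
        active += violators[:batch]
        beta_sub, resid = lasso_cd([cols[j] for j in active], y, lam)
    beta = [0.0] * p
    for j, b in zip(active, beta_sub):
        beta[j] = b
    return beta


# Toy data (made up): 12 features, of which only features 0 and 3 matter.
rng = random.Random(1)
n, p = 60, 12
cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * cols[0][i] - 1.5 * cols[3][i] + 0.1 * rng.gauss(0, 1) for i in range(n)]

beta_screen = batch_screen_lasso(cols, y, lam=0.2)
beta_full, _ = lasso_cd(cols, y, lam=0.2)
print([round(b, 3) for b in beta_screen])
```

The point of the pattern is that the inner solver only ever sees the screened columns, while the check step needs just one pass over all columns per round, which is what makes it plausible for data that do not fit in memory.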


Phenome-wide Heritability Analysis of the UK Biobank

10.1101/070177 ◽

2016 ◽

Author(s):

Tian Ge ◽

Chia-Yen Chen ◽

Benjamin M. Neale ◽

Mert R. Sabuncu ◽

Jordan W. Smoller

Keyword(s):

Complex Traits ◽

Large Scale ◽

Prediction Models ◽

Population Characteristics ◽

UK Biobank ◽

Multiple Traits ◽

Common Genetic Variants ◽

Heritability Estimation ◽

Heritability Analysis ◽

The UK

Heritability estimation provides important information about the relative contribution of genetic and environmental factors to phenotypic variation, and provides an upper bound for the utility of genetic risk prediction models. Recent technological and statistical advances have enabled the estimation of additive heritability attributable to common genetic variants (SNP heritability) across a broad phenotypic spectrum. However, assessing the comparative heritability of multiple traits estimated in different cohorts may be misleading due to the population-specific nature of heritability. Here we report the SNP heritability for 551 complex traits derived from the large-scale, population-based UK Biobank, comprising both quantitative phenotypes and disease codes, and examine the moderating effect of three major demographic variables (age, sex and socioeconomic status) on the heritability estimates. Our study represents the first comprehensive phenome-wide heritability analysis in the UK Biobank, and underscores the importance of considering population characteristics in comparing and interpreting heritability.


CNest: A Novel Copy Number Association Discovery Method Uncovers 862 New Associations from 200,629 Whole Exome Sequence Datasets in the UK Biobank

10.1101/2021.08.19.456963 ◽

2021 ◽

Author(s):

Tomas W Fitzgerald ◽

Ewan Birney

Keyword(s):

Copy Number ◽

Large Scale ◽

Association Studies ◽

Genomic Variation ◽

Next Generation Sequencing Data ◽

Genome Wide Association Studies ◽

UK Biobank ◽

Genome Wide ◽

The UK ◽

NGS Data

Copy number variation (CNV) has long been known to influence human traits and has a rich history of research into common and rare genetic disease. Although CNV is accepted as an important class of genomic variation, progress on copy number (CN) phenotype associations from next-generation sequencing (NGS) data has been limited, in part due to the relative difficulty of CNV detection and an enrichment for large numbers of false positives. To date, most successful CN genome-wide association studies (CN-GWAS) have focused on using predictive measures of dosage intolerance or gene burden tests to gain sufficient power for detecting CN effects. Here we present a novel method for large-scale CN analysis from NGS data that generates robust CN estimates and allows CN-GWAS to be performed genome-wide in discovery mode. We provide a detailed analysis in the large-scale UK Biobank resource and a specifically designed software package for deriving CN estimates from NGS data that are robust enough to be used for CN-GWAS. We use these methods to perform genome-wide CN-GWAS analysis across 78 human traits, discovering 862 genetic associations that are likely to contribute strongly to trait distributions based solely on their CN or by acting in concert with other genetic variation. Finally, we undertake an analysis comparing CNV and SNP association signals across the same traits and samples, defining specific CNV association classes based on whether they could be detected using standard SNP-GWAS in the UK Biobank.


Hippocampal volume across age: Nomograms derived from over 19,700 people in UK Biobank

10.1101/562678 ◽

2019 ◽

Cited By ~ 2

Author(s):

Lisa Nobis ◽

Sanjay G. Manohar ◽

Stephen M. Smith ◽

Fidel Alfaro-Almagro ◽

Mark Jenkinson ◽

...

Keyword(s):

Temporal Lobe ◽

Large Scale ◽

Objective Evaluation ◽

Hippocampal Volume ◽

Volume Status ◽

Normative Values ◽

Grey Matter Volume ◽

UK Biobank ◽

Age Related ◽

The UK

Abstract: Measurement of hippocampal volume has proven useful to diagnose and track progression in several brain disorders, most notably in Alzheimer’s disease (AD). For example, an objective evaluation of a patient’s hippocampal volume status may provide important information that can assist diagnosis or risk stratification of AD. However, clinicians and researchers require access to age-related normative percentiles to reliably categorise a patient’s hippocampal volume as being pathologically small. Here we analysed effects of age, sex, and hemisphere on the hippocampus and neighbouring temporal lobe volumes, in 19,793 generally healthy participants in the UK Biobank. A key finding of the current study is a significant acceleration in the rate of hippocampal volume loss in middle age, more pronounced in females than in males. In this report, we provide normative values for hippocampal and total grey matter volume as a function of age for reference in clinical and research settings. These normative values may be used in combination with our online, automated percentile estimation tool to provide a rapid, objective evaluation of an individual’s hippocampal volume status. The data provide a large-scale normative database to facilitate easy age-adjusted determination of where an individual’s hippocampal and temporal lobe volume lies within the normal distribution.
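As a rough illustration of what an age-adjusted percentile lookup involves, the sketch below ranks one individual's volume against reference participants of similar age. The sliding age window, the function name and every number in the toy reference set are assumptions made up for this example; they are not values from the study, and the authors' online tool may work differently.

```python
def age_adjusted_percentile(reference, age, volume, half_window=2.5):
    """Percentile rank of `volume` among reference participants of similar age.

    `reference` is a list of (age_in_years, volume) pairs. Participants within
    `half_window` years of `age` form the comparison group; ties count half.
    """
    peers = [v for a, v in reference if abs(a - age) <= half_window]
    if not peers:
        raise ValueError("no reference participants in this age window")
    below = sum(1 for v in peers if v < volume)
    equal = sum(1 for v in peers if v == volume)
    return 100.0 * (below + 0.5 * equal) / len(peers)


# Hypothetical reference set: (age, hippocampal volume in arbitrary units).
reference = [
    (60, 4.0), (61, 3.9), (62, 3.8), (63, 3.7), (64, 3.6),
    (70, 3.2), (71, 3.1),
]

# A 62-year-old with volume 3.75 is compared with the five peers aged 60 to 64.
pct = age_adjusted_percentile(reference, age=62, volume=3.75)
print(pct)
```

With a reference cohort of the size described in the abstract, a window like this would contain many participants per age band; sex-specific reference sets would be a natural refinement given the sex differences reported.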


Phenome-Wide Association Study of Actigraphy in the UK Biobank

10.1101/2021.12.09.21267558 ◽

2021 ◽

Author(s):

Thomas G. Brooks ◽

Nicholas F. Lahens ◽

Gregory R. Grant ◽

Yvette I. Sheline ◽

Garret A. FitzGerald ◽

...

Keyword(s):

Physical Activity ◽

Airway Obstruction ◽

Large Scale ◽

Diurnal Rhythms ◽

UK Biobank ◽

Daily Lives ◽

Individualized Approach ◽

Using Data ◽

The UK ◽

Chronic Airway

Abstract: Wrist-worn accelerometer actigraphy devices present the opportunity for large-scale data collection from people during their daily lives. Using data from approximately 100,000 participants in the UK Biobank, actigraphy-derived measures of physical activity, sleep, and diurnal rhythms were associated in exploration and validation cohorts with a full phenome-wide set of diagnoses, biomarkers and metadata. Rhythmicity was captured by two independent models based on accelerometer and skin temperature, harnessing behavioral (diurnal) and molecular (circadian) components. We found that robust rhythms were significantly associated with biomarkers, survival, and phenotypes including diabetes, hypertension, mood disorders, and chronic airway obstruction; these associations were comparable to those with physical activity and sleep. Surprisingly, associations were mostly consistent between the sexes, while modulation by age was significant. More importantly, rhythms were found to be powerful predictors of future diseases: a two standard deviation difference in wrist temperature rhythms corresponded to increases in rate of diagnosis of 61% in diabetes, 38% in chronic airway obstruction, 27% in anxiety disorders, and 22% in hypertension. Our PheWAS of actigraphy data in the UK Biobank establishes that rhythmicity is fundamental to modeling disease trajectories, as are physical activity and sleep. Integration of long-term remote biosensing into patient care could thus afford an individualized approach to risk management.
