Rapid detection of identity-by-descent tracts for mega-scale datasets

Abstract. The ability to identify segments of genomes identical-by-descent (IBD) is a part of standard workflows in both statistical and population genetics. However, traditional methods for finding local IBD across all pairs of individuals scale poorly, leading to a lack of adoption in very large-scale datasets. Here, we present iLASH, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to current leading methods and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for millions of individuals. We apply iLASH to the PAGE dataset of ~52,000 multi-ethnic participants, including several founder populations with elevated IBD sharing, identifying IBD segments in ~3 minutes per chromosome compared to over 6 days for a state-of-the-art algorithm. iLASH enables efficient analysis of very large-scale datasets, as we demonstrate by computing IBD across the UK Biobank (~500,000 individuals), detecting 12.9 billion pairwise connections.


Rapid detection of identity-by-descent tracts for mega-scale datasets

10.1101/749507 ◽

2019 ◽

Cited By ~ 6

Author(s):

Ruhollah Shemirani ◽

Gillian M. Belbin ◽

Christy L. Avery ◽

Eimear E. Kenny ◽

Christopher R. Gignoux ◽

...

Keyword(s):

Population Genetics ◽

Large Scale ◽

State Of The Art ◽

Locality Sensitive Hashing ◽

UK Biobank ◽

Identity By Descent ◽

Detection Techniques ◽

Identical By Descent ◽

Founder Populations ◽

Improved Accuracy

The ability to identify segments of genomes identical-by-descent (IBD) is a part of standard workflows in both statistical and population genetics. However, traditional methods for finding local IBD across all pairs of individuals scale poorly, leading to a lack of adoption in very large-scale datasets. Here, we present iLASH, IBD by LocAlity-Sensitive Hashing, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to the current leading method and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for hundreds of thousands to millions of individuals. We applied iLASH to the Population Architecture using Genomics and Epidemiology (PAGE) dataset of ∼52,000 multi-ethnic participants, including several founder populations with elevated IBD sharing, and identified IBD segments on a single machine in an hour (∼3 minutes per chromosome compared to over 6 days per chromosome for a state-of-the-art algorithm). iLASH is able to efficiently estimate IBD tracts in very large-scale datasets, as demonstrated via IBD estimation across the entire UK Biobank (∼500,000 individuals), detecting nearly 13 billion pairwise IBD tracts shared between ∼11% of participants. In summary, iLASH enables fast and accurate detection of IBD, an upstream step in applications of IBD for population genetics and trait mapping.
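To make the idea of locality-sensitive hashing for IBD candidates concrete, the following is a minimal sketch and not iLASH's actual implementation. It assumes haplotypes are given as equal-length strings of 0/1 alleles; the function names, window size, shingle length, and banding parameters are all hypothetical choices for illustration. Each haplotype window is turned into position-tagged shingles, summarized by a MinHash signature, and split into bands; haplotypes that land in the same band bucket become candidate pairs for that window, so no all-pairs comparison is needed.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def shingles(window, k=4):
    """Position-tagged k-mers of one haplotype window (a string of alleles)."""
    return {f"{i}:{window[i:i + k]}" for i in range(len(window) - k + 1)}


def minhash(shingle_set, n_hashes=8):
    """MinHash signature: for each seeded hash, keep the minimum hash value."""
    return tuple(
        min(zlib.crc32(f"{seed}|{s}".encode()) for s in shingle_set)
        for seed in range(n_hashes)
    )


def candidate_pairs(haplotypes, window=16, k=4, n_hashes=8, band=4):
    """Map each colliding pair of haplotype ids to the window starts where they collide."""
    hits = defaultdict(set)
    n_sites = min(len(h) for h in haplotypes.values())
    for start in range(0, n_sites - window + 1, window):
        buckets = defaultdict(list)
        for hap_id, hap in haplotypes.items():
            sig = minhash(shingles(hap[start:start + window], k), n_hashes)
            # Banding: two haplotypes collide if any band of their signatures matches.
            for b in range(0, n_hashes, band):
                buckets[(b, sig[b:b + band])].append(hap_id)
        for ids in buckets.values():
            for a, b in combinations(sorted(ids), 2):
                hits[(a, b)].add(start)
    return {pair: sorted(starts) for pair, starts in hits.items()}


# Two identical toy haplotypes of 32 sites collide in both 16-site windows.
haps = {"A": "0110100110010110" * 2, "B": "0110100110010110" * 2}
print(candidate_pairs(haps))
```

In a real pipeline, runs of consecutive colliding windows would still need to be verified and extended into segments measured in genetic distance; that step is omitted here.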


Accuracy of Electronic Health Record Data for Identifying Stroke Cases in Large-Scale Epidemiological Studies: A Systematic Review from the UK Biobank Stroke Outcomes Group

PLoS ONE ◽

10.1371/journal.pone.0140533 ◽

2015 ◽

Vol 10 (10) ◽

pp. e0140533 ◽

Cited By ~ 39

Author(s):

Rebecca Woodfield ◽

Ian Grant ◽

Cathie L. M. Sudlow ◽


Keyword(s):

Systematic Review ◽

Electronic Health Record ◽

Large Scale ◽

Epidemiological Studies ◽

Health Record ◽

UK Biobank ◽

Electronic Health Record Data ◽

Stroke Outcomes ◽

Record Data ◽

The UK


S23 MACHINE LEARNING METHODS TO PREDICT UNMEASURED PHENOTYPES IN LARGE-SCALE BIOBANK STUDIES: PROOF OF PRINCIPLE USING AUDIT IN THE UK BIOBANK

European Neuropsychopharmacology ◽

10.1016/j.euroneuro.2019.08.024 ◽

2019 ◽

Vol 29 ◽

pp. S125-S126

Author(s):

Amanda Gentry ◽

Roseann Peterson ◽

Alexis Edwards ◽

Brien Riley ◽

B. Todd Webb

Keyword(s):

Large Scale ◽

UK Biobank ◽

Learning Methods ◽

The UK ◽

Proof Of Principle


Genetic variation in the SIM1 locus is associated with erectile dysfunction

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1809872115 ◽

2018 ◽

Vol 115 (43) ◽

pp. 11018-11023 ◽

Cited By ~ 14

Author(s):

Eric Jorgenson ◽

Navneet Matharu ◽

Melody R. Palmer ◽

Jie Yin ◽

Jun Shan ◽

...

Keyword(s):

Risk Factors ◽

Erectile Dysfunction ◽

Sexual Function ◽

Odds Ratio ◽

Large Scale ◽

Genome Wide Association Study ◽

Adult Health ◽

UK Biobank ◽

Melanocortin System ◽

The UK

Erectile dysfunction affects millions of men worldwide. Twin studies support the role of genetic risk factors underlying erectile dysfunction, but no specific genetic variants have been identified. We conducted a large-scale genome-wide association study of erectile dysfunction in 36,649 men in the multiethnic Kaiser Permanente Northern California Genetic Epidemiology Research in Adult Health and Aging cohort. We also undertook replication analyses in 222,358 men from the UK Biobank. In the discovery cohort, we identified a single locus (rs17185536-T) on chromosome 6 near the single-minded family basic helix-loop-helix transcription factor 1 (SIM1) gene that was significantly associated with the risk of erectile dysfunction (odds ratio = 1.26, P = 3.4 × 10^−25). The association replicated in the UK Biobank sample (odds ratio = 1.25, P = 6.8 × 10^−14), and the effect is independent of known erectile dysfunction risk factors, including body mass index (BMI). The risk locus resides on the same topologically associating domain as SIM1 and interacts with the SIM1 promoter, and the rs17185536-T risk allele showed differential enhancer activity. SIM1 is part of the leptin–melanocortin system, which has an established role in body weight homeostasis and sexual function. Because the variants associated with erectile dysfunction are not associated with differences in BMI, our findings suggest a mechanism that is specific to sexual function.


Large-scale analysis of iliopsoas muscle volumes in the UK Biobank

Scientific Reports ◽

10.1038/s41598-020-77351-0 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Julie A. Fitzpatrick ◽

Nicolas Basty ◽

Madeleine Cule ◽

Yi Liu ◽

Jimmy D. Bell ◽

...

Keyword(s):

Large Scale ◽

Large Population ◽

Psoas Muscle ◽

Magnetic Resonance Images ◽

Muscle Volume ◽

Iliopsoas Muscle ◽

UK Biobank ◽

Cross Sectional ◽

Automated Method ◽

The UK

Abstract. Psoas muscle measurements are frequently used as markers of sarcopenia and predictors of health. Manually measured cross-sectional areas are most commonly used, but there is a lack of consistency regarding the position of the measurement, and manual annotations are not practical for large population studies. We have developed a fully automated method to measure iliopsoas muscle volume (comprising the psoas and iliacus muscles) using a convolutional neural network. Magnetic resonance images were obtained from the UK Biobank for 5000 participants, balanced for age, gender and BMI. Ninety manual annotations were available for model training and validation. The model showed excellent performance against out-of-sample data (average Dice score coefficient of 0.9046 ± 0.0058 for six-fold cross-validation). Iliopsoas muscle volumes were successfully measured in all 5000 participants. Iliopsoas volume was greater in male compared with female subjects. There was a small but significant asymmetry between left and right iliopsoas muscle volumes. We also found that iliopsoas volume was significantly related to height, BMI and age, and that there was an acceleration in muscle volume decrease in men with age. Our method provides a robust technique for measuring iliopsoas muscle volume that can be applied to large cohorts.
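For readers unfamiliar with the Dice score quoted in this abstract, here is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), for two binary segmentation masks. It is not the authors' evaluation code; the function name and the flat 0/1 list representation of a mask are assumptions made for illustration.

```python
def dice_coefficient(pred, truth):
    """Dice overlap of two binary masks given as flat sequences of 0/1 voxels."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * overlap / total


# Toy masks: 2 voxels overlap, 3 voxels are labeled in each mask.
predicted = [1, 1, 1, 0, 0, 0]
manual = [0, 1, 1, 1, 0, 0]
print(round(dice_coefficient(predicted, manual), 3))  # 0.667
```

A score of 1 means the automated and manual masks coincide exactly, and 0 means they share no voxels.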


Polynomial Mendelian Randomization reveals widespread non-linear causal effects in the UK Biobank

10.1101/2021.12.08.471751 ◽

2021 ◽

Author(s):

Jonathan Sulc ◽

Jenny Sjaarda ◽

Zoltan Kutalik

Keyword(s):

Causal Inference ◽

Large Scale ◽

Causal Effect ◽

Causal Effects ◽

Mendelian Randomisation ◽

UK Biobank ◽

Individual Level ◽

Glucose Levels ◽

Causal Function ◽

The UK

Causal inference is a critical step in improving our understanding of biological processes and Mendelian randomisation (MR) has emerged as one of the foremost methods to efficiently interrogate diverse hypotheses using large-scale, observational data from biobanks. Although many extensions have been developed to address the three core assumptions of MR-based causal inference (relevance, exclusion restriction, and exchangeability), most approaches implicitly assume that any putative causal effect is linear. Here we propose PolyMR, an MR-based method which provides a polynomial approximation of an (arbitrary) causal function between an exposure and an outcome. We show that this method provides accurate inference of the shape and magnitude of causal functions with greater accuracy than existing methods. We applied this method to data from the UK Biobank, testing for effects between anthropometric traits and continuous health-related phenotypes and found most of these (84%) to have causal effects which deviate significantly from linear. These deviations ranged from slight attenuation at the extremes of the exposure distribution, to large changes in the magnitude of the effect across the range of the exposure (e.g. a 1 kg/m^2 change in BMI having stronger effects on glucose levels if the initial BMI was higher), to non-monotonic causal relationships (e.g. the effects of BMI on cholesterol forming an inverted U shape). Finally, we show that the linearity assumption of the causal effect may lead to the misinterpretation of health risks at the individual level or heterogeneous effect estimates when using cohorts with differing average exposure levels.


A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank

PLoS Genetics ◽

10.1371/journal.pgen.1009141 ◽

2020 ◽

Vol 16 (10) ◽

pp. e1009141

Author(s):

Junyang Qian ◽

Yosuke Tanigawa ◽

Wenfei Du ◽

Matthew Aguirre ◽

Chris Chang ◽

...

Keyword(s):

Large Scale ◽

UK Biobank ◽

Sparse Regression ◽

The UK


A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank

10.1101/630079 ◽

2019 ◽

Cited By ~ 5

Author(s):

Junyang Qian ◽

Yosuke Tanigawa ◽

Wenfei Du ◽

Matthew Aguirre ◽

Chris Chang ◽

...

Keyword(s):

Body Mass Index ◽

Variable Selection ◽

Multiple Regression ◽

Large Scale ◽

Prediction Performance ◽

UK Biobank ◽

High Cholesterol ◽

Computational Framework ◽

Regression Methods ◽

The UK

Abstract. The UK Biobank (Bycroft et al., 2018) is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with GWAS, have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso (Tibshirani, 1996) has, since its first proposal in statistics, proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and the Cox model, and also extends to the elastic net with ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve superior predictive performance on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol.

Author Summary. With the advent and evolution of large-scale and comprehensive biobanks, unprecedented opportunities arise for researchers to further uncover the complex landscape of human genetics. One major direction of long-standing interest is the investigation of the relationships between genotypes and phenotypes. This includes, but is not limited to, the identification of genotypes that are significantly associated with the phenotypes, and the prediction of phenotypic values based on the genotypic information. Genome-wide association studies (GWAS) are a very powerful and widely used framework for the former task, having produced a number of very impactful discoveries. However, when it comes to the latter, their performance is fairly limited by their univariate nature. To address this, multiple regression methods have been suggested to fill the gap. That said, challenges emerge as both the dimension and the size of datasets grow. In this paper, we present a novel computational framework that enables us to efficiently solve the entire lasso or elastic-net solution path on large-scale and ultrahigh-dimensional data, and therefore perform simultaneous variable selection and prediction. Our approach can build on any existing lasso solver for small or moderate-sized problems, scale it up to a big-data solution, and incorporate other extensions easily. We provide a package, snpnet, that extends the glmnet package in R and is optimized for large phenotype-genotype data. On the UK Biobank, we observe improved prediction performance on height, body mass index (BMI), asthma and high cholesterol by the lasso over other univariate and multiple regression methods. That said, the scope of our approach goes beyond genetic studies. It can be applied to general sparse regression problems and can build scalable solutions for a variety of distribution families based on existing solvers.


IBDkin: fast estimation of kinship coefficients from identity by descent segments

Bioinformatics ◽

10.1093/bioinformatics/btaa569 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4519-4520

Author(s):

Ying Zhou ◽

Sharon R Browning ◽

Brian L Browning

Keyword(s):

Software Package ◽

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

UK Biobank ◽

Identity By Descent ◽

Fast Estimation ◽

Kinship Coefficients ◽

Related Individuals ◽

The UK

Abstract. Motivation: Estimation of pairwise kinship coefficients in large datasets is computationally challenging because the number of related individuals increases quadratically with sample size. Results: We present IBDkin, a software package written in C for estimating kinship coefficients from identity by descent (IBD) segments. We use IBDkin to estimate kinship coefficients for 7.95 billion pairs of individuals in the UK Biobank who share at least one detected IBD segment with length ≥ 4 cM. Availability and implementation: https://github.com/YingZhou001/IBDkin. Supplementary information: Supplementary data are available at Bioinformatics online.
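As a rough illustration of how a kinship coefficient can be derived from IBD segments, the sketch below uses the textbook relation φ = ¼·(fraction of genome shared IBD1) + ½·(fraction shared IBD2). This is an assumption-laden toy, not IBDkin's algorithm or interface: the function name, the tuple layout of a segment, and the `genome_cm` parameter (total genome length in cM) are hypothetical.

```python
from collections import defaultdict


def kinship_from_segments(segments, genome_cm):
    """Estimate kinship per pair from (id_a, id_b, length_cm, ibd_state) tuples.

    ibd_state is 1 when one haplotype is shared and 2 when both are shared.
    """
    shared = defaultdict(lambda: [0.0, 0.0])  # pair -> [IBD1 cM, IBD2 cM]
    for a, b, length_cm, state in segments:
        if state not in (1, 2):
            raise ValueError("ibd_state must be 1 or 2")
        pair = (a, b) if a <= b else (b, a)  # treat (a, b) and (b, a) as one pair
        shared[pair][state - 1] += length_cm
    return {
        pair: 0.25 * ibd1 / genome_cm + 0.5 * ibd2 / genome_cm
        for pair, (ibd1, ibd2) in shared.items()
    }


# Toy genome of 100 cM: half shared IBD1 and a quarter shared IBD2,
# the expected sharing pattern for full siblings.
segs = [("s1", "s2", 50.0, 1), ("s2", "s1", 25.0, 2)]
print(kinship_from_segments(segs, genome_cm=100.0))  # {('s1', 's2'): 0.25}
```

The toy sibling pair comes out at 0.25, the expected kinship for first-degree relatives under this definition.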


Novel Approach for Parallelizing Pairwise Comparison Problems as Applied to Detecting Segments Identical By Decent in Whole-Genome Data

Bioinformatics ◽

10.1093/bioinformatics/btab084 ◽

2021 ◽

Author(s):

Emmanuel Sapin ◽

Matthew C Keller

Keyword(s):

Pairwise Comparison ◽

Pairwise Comparisons ◽

UK Biobank ◽

Large Problem ◽

Genome Data ◽

Full Dataset ◽

Identical By Descent ◽

Novel Approach ◽

The UK ◽

Massive Parallelization

Abstract. Motivation: Pairwise comparison problems arise in many areas of science. In genomics, datasets are already large and getting larger, and so operations that require pairwise comparisons—either on pairs of SNPs or pairs of individuals—are extremely computationally challenging. We propose a generic algorithm for addressing pairwise comparison problems that breaks a large problem (of order n^2 comparisons) into multiple smaller ones (each of order n comparisons), allowing for massive parallelization. Results: We demonstrated that this approach is very efficient for calling identical by descent (IBD) segments between all pairs of individuals in the UK Biobank dataset, with a 250-fold savings in time and 750-fold savings in memory over the standard approach to detecting such segments across the full dataset. This efficiency should extend to other methods of IBD calling and, more generally, to other pairwise comparison tasks in genomics or other areas of science.
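The abstract does not spell out how the large problem is partitioned, so the following is only a generic block-tiling sketch of the same general idea, not the authors' scheme. The function names and the choice of contiguous index blocks are assumptions. Individuals are split into blocks, and each independent job compares one block against another (or against itself); with a block size near √n, each job covers on the order of n pairs.

```python
from itertools import combinations


def block_jobs(n_items, block_size):
    """Yield (rows, cols) index ranges; each job is one tile of the upper triangle."""
    starts = range(0, n_items, block_size)
    for i in starts:
        for j in starts:
            if j >= i:
                yield (range(i, min(i + block_size, n_items)),
                       range(j, min(j + block_size, n_items)))


def pairs_in_job(rows, cols):
    """All unordered pairs (a, b) with a < b that a single job must compare."""
    if rows == cols:
        return list(combinations(rows, 2))  # diagonal tile: pairs within one block
    return [(a, b) for a in rows for b in cols]  # off-diagonal tile: every cross pair


# 10 individuals in blocks of 4 give 6 independent jobs covering all 45 pairs once.
jobs = list(block_jobs(10, 4))
covered = [p for rows, cols in jobs for p in pairs_in_job(rows, cols)]
print(len(jobs), len(covered), len(set(covered)))  # 6 45 45
```

Because the jobs share no state, each one could be dispatched to a separate worker and would need to hold only its two blocks in memory.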
