Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics

Abstract. Summary: Large biobanks linking phenotype to genotype have led to an explosion of genetic association studies across a wide range of phenotypes. Sharing the knowledge generated by these resources with the scientific community remains a challenge due to patient privacy and the vast amount of data. Here, we present Global Biobank Engine (GBE), a web-based tool that enables exploration of the relationship between genotype and phenotype in biobank cohorts, such as the UK Biobank. GBE supports browsing for results from genome-wide association studies, phenome-wide association studies, gene-based tests and genetic correlation between phenotypes. We envision GBE as a platform that facilitates the dissemination of summary statistics from biobanks to the scientific and clinical communities. Availability and implementation: GBE currently hosts data from the UK Biobank and is freely available at biobankengine.stanford.edu.

10.1101/304188 ◽

2018 ◽

Cited By ~ 4

Author(s):

Gregory McInnes ◽

Yosuke Tanigawa ◽

Chris DeBoever ◽

Adam Lavertu ◽

Julia Eve Olivieri ◽

...

Keyword(s):

Association Studies ◽

Genetic Association Studies ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Uk Biobank ◽

Patient Privacy ◽

Web Based ◽

Genome Wide ◽

Wide Range ◽

The Uk

Evaluation and application of summary statistic imputation to discover new height-associated loci

10.1101/204560 ◽

2017 ◽

Author(s):

Sina Rüeger ◽

Aaron McDaid ◽

Zoltán Kutalik

Keyword(s):

Genetic Variants ◽

Association Studies ◽

Low Frequency ◽

Cost Effective ◽

Genotype Imputation ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Uk Biobank ◽

Genome Wide ◽

The Uk

Abstract: As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation. Its performance relative to genotype imputation and its practical utility have not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that, while genotype imputation boasts a 2- to 5-fold lower root-mean-square error, summary statistics imputation better distinguishes true associations from null ones. We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01 and 0.05, using summary statistics imputation yielded an increase in statistical power of 15%, 10% and 3%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression.

Author summary: Genome-wide association studies (GWASs) quantify the effect of genetic variants on traits, such as height. Such estimates are called association summary statistics and are typically publicly shared through publication. Typically, GWASs are carried out by genotyping ~500,000 SNVs for each individual, which are then combined with sequenced reference panels to infer untyped SNVs in each individual's genome. This process of genotype imputation is resource intensive and can therefore be a limitation when combining many GWASs. An alternative approach is to bypass the use of individual data and directly impute summary statistics. In our work we compare the performance of summary statistics imputation to genotype imputation. Although we observe a 2- to 5-fold lower RMSE for genotype imputation compared to summary statistics imputation, summary statistics imputation better distinguishes true associations from null results. Furthermore, we demonstrate the potential of summary statistics imputation by presenting 34 novel height-associated loci, 19 of which were confirmed in UK Biobank. Our study demonstrates that given current reference panels, summary statistics imputation is a very efficient and cost-effective way to identify common or low-frequency trait-associated loci.
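
As an illustration of what "directly imputing the summary statistics" can look like, here is a minimal sketch in Python. It assumes the commonly used multivariate-normal conditional expectation, in which the Z-score of an untyped variant is estimated as c^T (C + λI)^-1 z, with C the LD (correlation) matrix among typed variants, c the LD between the untyped and typed variants, and z the typed Z-scores. The function names, the ridge value λ = 0.1 and the toy numbers are assumptions for illustration; the abstract does not specify the exact estimator the authors used.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's inputs are left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("LD matrix is singular; increase the ridge term")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def impute_z(z_typed, ld_typed, ld_untyped_typed, ridge=0.1):
    """Impute one untyped variant's Z-score as c^T (C + ridge*I)^-1 z.

    z_typed:          Z-scores of the typed variants (length n)
    ld_typed:         n x n LD correlation matrix C from a reference panel
    ld_untyped_typed: LD correlations c between the untyped and typed variants
    ridge:            hypothetical regularisation constant (an assumption)
    """
    n = len(z_typed)
    regularised = [
        [ld_typed[i][j] + (ridge if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # C is symmetric, so c^T (C + ridge*I)^-1 z == ((C + ridge*I)^-1 c) . z
    weights = solve_linear(regularised, ld_untyped_typed)
    return sum(w * z for w, z in zip(weights, z_typed))


# Toy example: the untyped variant is in perfect LD with the second typed one,
# so without regularisation its imputed Z-score equals that variant's Z-score.
z_imputed = impute_z([3.0, 4.0], [[1.0, 0.5], [0.5, 1.0]], [0.5, 1.0], ridge=0.0)
```

With a positive ridge term the imputed Z-score is shrunk toward zero, which is one reason imputed statistics are usually reported together with an imputation quality measure.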

Genome-wide association study of liking of physical activity in the UK Biobank

10.1101/2021.10.13.21264969 ◽

2021 ◽

Author(s):

Yann C. Klimentidis ◽

Michelle Newell ◽

Matthijs D. van der Zee ◽

Victoria L. Bland ◽

Sebastian May-Wilson ◽

...

Keyword(s):

Physical Activity ◽

Genome Wide Association Study ◽

Association Studies ◽

Genome Wide Association ◽

Risk Scores ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Genome Wide ◽

Wide Range ◽

The Uk

A lack of physical activity (PA) is one of the most pressing health issues facing society today. Our individual propensity for PA is partly influenced by genetic factors. Stated liking of various PA behaviors may capture additional dimensions of PA behavior that are not captured by other measures, and contribute to our understanding of the genetics of PA behavior. Here, in over 157,000 individuals from the UK Biobank, we sought to complement and extend previous findings on the genetics of PA behavior by performing genome-wide association studies of self-reported liking of several PA-related behaviors plus an additional derived trait of overall PA-liking. We identified a total of 19 unique genome-wide significant loci across all traits, only four of which overlap with loci previously identified for PA behavior. The PA-liking traits were genetically correlated with self-reported (rg: 0.38 to 0.80) and accelerometry-derived (rg: 0.26 to 0.49) PA measures, and with a wide range of health-related traits and dietary behaviors. Replication in the Netherlands Twin Register (NTR; n>7,300) and the TwinsUK (n>1,300) study revealed directionally consistent associations. Polygenic risk scores (PRS) were then trained in UKB for each PA-liking trait and for self-reported PA behavior. The PA-liking PRS significantly predicted the same liking trait in NTR. The PRS for liking of going to the gym predicted PA behavior in NTR (r2 = 0.40%) nearly as well as the one constructed based on self-reported PA behavior (r2 = 0.42%). Combining the two PRS into a single model increased the r2 to 0.59%, suggesting that although these PRS correlate with each other, they are also capturing distinct dimensions of PA behavior. In conclusion, we have identified the first loci associated with PA-liking, and extended and refined our understanding of the genetic basis of PA behavior.

Reproducibility in the UK Biobank of Genome-Wide Significant Signals Discovered in Earlier Genome-wide Association Studies

10.1101/2020.06.24.20139576 ◽

2020 ◽

Cited By ~ 1

Author(s):

Jack W. O’Sullivan ◽

John P. A. Ioannidis

Keyword(s):

Effect Size ◽

Association Studies ◽

Genome Wide Association ◽

P Value ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Single Nucleotide ◽

Genome Wide ◽

The Uk ◽

Open Question

Abstract: With the establishment of large biobanks, discovery of single nucleotide polymorphisms (SNPs) that are associated with various phenotypes has been accelerated. An open question is whether SNPs identified with genome-wide significance in earlier genome-wide association studies (GWAS) are also replicated in later GWAS conducted in biobanks. To address this question, the authors examined a publicly available GWAS database and identified two independent GWAS on the same phenotype (an earlier “discovery” GWAS and a later replication GWAS done in the UK Biobank). The analysis evaluated 136,318,924 SNPs (of which 6,289 had reached p<5e-8 in the discovery GWAS) from 4,397,962 participants across nine phenotypes. The overall replication rate was 85.0% and it was lower for binary than for quantitative phenotypes (58.1% versus 94.8%, respectively). There was an 18.0% decrease in SNP effect size for binary phenotypes, but a 12.0% increase for quantitative phenotypes. Using the discovery SNP effect size, phenotype trait (binary or quantitative), and discovery p-value, we built and validated a model that predicted SNP replication with an area under the receiver operating characteristic curve of 0.90. While non-replication may often reflect lack of power rather than genuine false-positive findings, these results provide insights about which discovered associations are likely to be seen again across subsequent GWAS.

The evolution of skin pigmentation associated variation in West Eurasia

10.1101/2020.05.08.085274 ◽

2020 ◽

Author(s):

Dan Ju ◽

Iain Mathieson

Keyword(s):

Genetic Variants ◽

Association Studies ◽

Skin Pigmentation ◽

Directional Selection ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Genome Wide ◽

Light Skin ◽

The Uk

Abstract: Skin pigmentation is a classic example of a polygenic trait that has experienced directional selection in humans. Genome-wide association studies have identified well over a hundred pigmentation-associated loci, and genomic scans in present-day and ancient populations have identified selective sweeps for a small number of light pigmentation-associated alleles in Europeans. It is unclear whether selection has operated on all the genetic variation associated with skin pigmentation as opposed to just a small number of large-effect variants. Here, we address this question using ancient DNA from 1158 individuals from West Eurasia covering a period of 40,000 years combined with genome-wide association summary statistics from the UK Biobank. We find a robust signal of directional selection in ancient West Eurasians on skin pigmentation variants ascertained in the UK Biobank, but find this signal is driven mostly by a limited number of large-effect variants. Consistent with this observation, we find that a polygenic selection test in present-day populations fails to detect selection with the full set of variants; rather, only the top five show strong evidence of selection. Our data allow us to disentangle the effects of admixture and selection. Most notably, a large-effect variant at SLC24A5 was introduced to Europe by migrations of Neolithic farming populations but continued to be under selection post-admixture. This study shows that the response to selection for light skin pigmentation in West Eurasia was driven by a relatively small proportion of the variants that are associated with present-day phenotypic variation.

Significance: Some of the genes responsible for the evolution of light skin pigmentation in Europeans show signals of positive selection in present-day populations. Recently, genome-wide association studies have highlighted the highly polygenic nature of skin pigmentation. It is unclear whether selection has operated on all of these genetic variants or just a subset. By studying variation in over a thousand ancient genomes from West Eurasia covering 40,000 years we are able to study both the aggregate behavior of pigmentation-associated variants and the evolutionary history of individual variants. We find that the evolution of light skin pigmentation in Europeans was driven by frequency changes in a relatively small fraction of the genetic variants that are associated with variation in the trait today.

Genome-wide association study of circulating liver enzymes reveals an expanded role for manganese transporter SLC30A10 in liver health

10.1101/2020.05.19.104570 ◽

2020 ◽

Author(s):

Lucas D. Ward ◽

Ho-Chou Tu ◽

Chelsea Quenneville ◽

Alexander O. Flynn-Carroll ◽

Margaret M. Parker ◽

...

Keyword(s):

Extrahepatic Bile Duct ◽

Association Studies ◽

Genome Wide Association ◽

Detectable Effect ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Extrahepatic Bile Duct Cancer ◽

Genome Wide ◽

Liver Health ◽

The Uk

Abstract: To better understand molecular pathways underlying liver health and disease, we performed genome-wide association studies (GWAS) on circulating levels of alanine aminotransferase (ALT) and aspartate aminotransferase (AST) across 408,300 subjects from four ethnic groups in the UK Biobank, focusing on variants associating with both enzymes. Of these variants, the strongest effect is a rare (MAF in White British = 0.12%) missense variant in the gene encoding manganese efflux transporter SLC30A10, Thr95Ile (rs188273166), associating with a 5.9% increase in ALT and a 4.2% increase in AST. Carriers have higher prevalence of all-cause liver disease (OR = 1.70; 95% CI = 1.24 to 2.34) and higher prevalence of extrahepatic bile duct cancer (OR = 23.8; 95% CI = 9.1 to 62.1) compared to non-carriers. Over 4% of the cases of extrahepatic cholangiocarcinoma in the UK Biobank carry SLC30A10 Thr95Ile. Unlike variants in SLC30A10 known to cause the recessive syndrome hypermanganesemia with dystonia-1 (HMNDYT1), the Thr95Ile variant has a detectable effect even in the heterozygous state. Also unlike HMNDYT1-causing variants, Thr95Ile results in a protein that is properly trafficked to the plasma membrane when expressed in HeLa cells. These results suggest that coding variation in SLC30A10 impacts liver health in more individuals than the small population of HMNDYT1 patients.

Genome-wide genetic data on ~500,000 UK Biobank participants

10.1101/166298 ◽

2017 ◽

Cited By ~ 303

Author(s):

Clare Bycroft ◽

Colin Freeman ◽

Desislava Petkova ◽

Gavin Band ◽

Lloyd T. Elliott ◽

...

Keyword(s):

Quality Control ◽

Allelic Variation ◽

Association Studies ◽

Genetic Data ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genotype Data ◽

Uk Biobank ◽

Genome Wide ◽

Wide Range

Abstract: The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. A rich variety of phenotypic and health-related information is available on each participant, making the resource unprecedented in its size and scope. Here we describe the genome-wide genotype data (~805,000 markers) collected on all individuals in the cohort and its quality control procedures. Genotype data on this scale offers novel opportunities for assessing quality issues, although the wide range of ancestries of the individuals in the cohort also creates particular challenges. We also conducted a set of analyses that reveal properties of the genetic data – such as population structure and relatedness – that can be important for downstream analyses. In addition, we phased and imputed genotypes into the dataset, using computationally efficient methods combined with the Haplotype Reference Consortium (HRC) and UK10K haplotype resource. This increases the number of testable variants by over 100-fold to ~96 million variants. We also imputed classical allelic variation at 11 human leukocyte antigen (HLA) genes, and as a quality control check of this imputation, we replicate signals of known associations between HLA alleles and many common diseases. We describe tools that allow efficient genome-wide association studies (GWAS) of multiple traits and fast phenome-wide association studies (PheWAS), which work together with a new compressed file format that has been used to distribute the dataset. As a further check of the genotyped and imputed datasets, we performed a test-case genome-wide association scan on a well-studied human trait, standing height.

emeraLD: Rapid Linkage Disequilibrium Estimation with Massive Data Sets

10.1101/301366 ◽

2018 ◽

Cited By ~ 1

Author(s):

Corbin Quick ◽

Christian Fuchsberger ◽

Daniel Taliun ◽

Gonçalo Abecasis ◽

Michael Boehnke ◽

...

Keyword(s):

Linkage Disequilibrium ◽

Association Studies ◽

Random Access ◽

Supplementary Information ◽

Data Sets ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Genome Wide ◽

Wide Range ◽

Supplementary Material

Abstract. Summary: Estimating linkage disequilibrium (LD) is essential for a wide range of summary statistics-based association methods for genome-wide association studies (GWAS). Large genetic data sets, e.g. the TOPMed WGS project and UK Biobank, enable more accurate and comprehensive LD estimates, but increase the computational burden of LD estimation. Here, we describe emeraLD (Efficient Methods for Estimation and Random Access of LD), a computational tool that leverages sparsity and haplotype structure to estimate LD orders of magnitude faster than existing tools. Availability and implementation: emeraLD is implemented in C++, and is open source under GPLv3. Source code, documentation, an R interface, and utilities for analysis of summary statistics are freely available at http://github.com/statgen/[email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
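
To make the idea of "leveraging sparsity" concrete, the following Python sketch computes the LD correlation r between two biallelic variants from the indices of haplotypes carrying the minor allele, rather than from dense 0/1 vectors. It uses the textbook definition r = (p_AB - p_A p_B) / sqrt(p_A (1 - p_A) p_B (1 - p_B)). This is an illustration only, not emeraLD's actual C++ implementation or API; the function name `ld_r` and the toy carrier lists are hypothetical.

```python
from math import sqrt


def ld_r(carriers_a, carriers_b, n_haplotypes):
    """LD correlation r between two variants from sparse carrier lists.

    carriers_a, carriers_b: indices of haplotypes carrying the minor allele
    n_haplotypes:           total number of haplotypes in the panel
    """
    a, b = set(carriers_a), set(carriers_b)
    p_a = len(a) / n_haplotypes
    p_b = len(b) / n_haplotypes
    # Only carriers are touched, so the cost scales with the minor allele
    # count rather than with the full panel size.
    p_ab = len(a & b) / n_haplotypes
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    if denom == 0:
        raise ValueError("r is undefined for a monomorphic variant")
    return (p_ab - p_a * p_b) / denom


# Toy panel of 10 haplotypes (made-up data).
r_same = ld_r([0, 1], [0, 1], 10)      # identical carrier sets
r_disjoint = ld_r([0, 1], [2, 3], 10)  # no shared carriers
r_squared = r_disjoint ** 2
```

For rare variants the carrier sets are small, so the set intersection is cheap even when the panel contains hundreds of thousands of haplotypes.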

Cancer PRSweb – an Online Repository with Polygenic Risk Scores (PRS) for Major Cancer Traits and Their Phenome-wide Exploration in Two Independent Biobanks

10.1101/2020.01.22.915751 ◽

2020 ◽

Cited By ~ 1

Author(s):

Lars G. Fritsche ◽

Snehal Patil ◽

Lauren J. Beesley ◽

Peter VandeHaar ◽

Maxwell Salvatore ◽

...

Keyword(s):

Association Studies ◽

Predictive Performance ◽

Risk Scores ◽

P Value ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Uk Biobank ◽

Polygenic Risk ◽

Genome Wide ◽

Study Results

Abstract: To facilitate scientific collaboration on polygenic risk score (PRS) research, we created an extensive PRS online repository for 49 common cancer traits integrating freely available genome-wide association study (GWAS) summary statistics from three sources: published GWAS, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWAS. Our framework condenses these summary statistics into PRS using various approaches such as linkage disequilibrium pruning / p-value thresholding (fixed or data-adaptively optimized thresholds) and penalized, genome-wide effect size weighting. We evaluated the PRS in two biobanks: the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and the population-based UK Biobank (UKB). For each PRS construct, we provide measures on predictive performance, calibration, and discrimination. Besides PRS evaluation, the Cancer-PRSweb platform features construct downloads and phenome-wide PRS association study results (PRS-PheWAS) for predictive PRS. We expect this integrated platform to accelerate PRS-related cancer research.
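
The p-value thresholding approach mentioned above can be sketched in a few lines of Python. The score for one individual is the sum of allele dosages weighted by GWAS effect sizes, restricted to variants whose p-value passes a threshold. The sketch assumes the variant list has already been LD-pruned; the function name, the variant identifiers and all numbers are made up for illustration and are not taken from Cancer PRSweb.

```python
def polygenic_score(dosages, variants, p_threshold=5e-8):
    """Weighted allele-count score with p-value thresholding.

    dosages:     dict mapping variant id -> effect-allele dosage (0 to 2)
    variants:    iterable of (variant id, effect size beta, GWAS p-value),
                 assumed to be LD-pruned already
    p_threshold: variants with p >= threshold are left out of the score
    """
    score = 0.0
    for variant_id, beta, p_value in variants:
        # Skip variants that fail the threshold or were not genotyped.
        if p_value < p_threshold and variant_id in dosages:
            score += beta * dosages[variant_id]
    return score


# Hypothetical summary statistics and one person's dosages.
summary_stats = [("v1", 0.2, 1e-9), ("v2", -0.1, 1e-3), ("v3", 0.5, 1e-10)]
person = {"v1": 2, "v2": 1, "v3": 1}

strict = polygenic_score(person, summary_stats, p_threshold=5e-8)
lenient = polygenic_score(person, summary_stats, p_threshold=0.01)
```

A data-adaptive variant of this scheme would evaluate the score over a grid of thresholds and keep the one that predicts best in held-out data.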

Body size and composition and site-specific cancers in UK Biobank: a Mendelian randomisation study

10.1101/2020.02.28.970459 ◽

2020 ◽

Cited By ~ 1

Author(s):

Mathew Vithayathil ◽

Paul Carter ◽

Siddhartha Kar ◽

Amy M. Mason ◽

Stephen Burgess ◽

...

Keyword(s):

Instrumental Variables ◽

Association Studies ◽

Genome Wide Association ◽

Mendelian Randomisation ◽

Genome Wide Association Studies ◽

Uk Biobank ◽

Site Specific ◽

Genome Wide ◽

Increased Risk ◽

The Uk

Abstract

Objectives: To investigate the causal role of body mass index, body fat composition and height in cancer.

Design: Two-stage Mendelian randomisation study.

Setting: Previous genome-wide association studies and the UK Biobank.

Participants: Genetic instrumental variables for body mass index (BMI), fat mass index (FMI), fat free mass index (FFMI) and height from previous genome-wide association studies and UK Biobank. Cancer outcomes from 367 586 participants of European descent from the UK Biobank.

Main outcome measures: Overall cancer risk and risk of 22 site-specific cancers for genetic instrumental variables for BMI, FMI, FFMI and height.

Results: Genetically predicted BMI (per 1 kg/m²) was not associated with overall cancer risk (OR 0.99; 95% confidence interval (CI) 0.98-1.00; p=0.105). Elevated BMI was associated with increased risk of stomach cancer (OR 1.15, 95% CI 1.05-1.26; p=0.003) and melanoma (OR 0.96, 95% CI 0.92-1.00; p=0.044). For sex-specific cancers, BMI was positively associated with uterine cancer (OR 1.08, 95% CI 1.01-1.14; p=0.015) but inversely associated with breast (OR 0.95, 95% CI 0.92-0.98; p=0.001), prostate (OR 0.95, 95% CI 0.92-0.99; p=0.007) and testicular cancer (OR 0.89, 95% CI 0.81-0.98; p=0.017). Elevated FMI (per 1 kg/m²) was associated with gastrointestinal cancer (stomach cancer OR 4.23, 95% CI 1.18-15.13; p=0.027; colorectal cancer OR 1.94, 95% CI 1.23-3.07; p=0.004). Increased height (per 1 standard deviation, approximately 6.5 cm) was associated with increased risk of overall cancer (OR 1.06; 95% CI 1.04-1.09; p = 2.97×10⁻⁸) and most site-specific cancers, with the strongest estimates for kidney, non-Hodgkin lymphoma, colorectal, lung, melanoma and breast cancer.

Conclusions: There is little evidence for BMI as a causal risk factor for cancer. BMI may have a causal role for sex-specific cancers, although with inconsistent directions of effect, and FMI for gastrointestinal malignancies. Elevated height is a risk factor for overall cancer and multiple site cancers.
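
For readers unfamiliar with how genetic instruments turn into an odds ratio, here is a minimal Python sketch of the inverse-variance weighted (IVW) estimator, a widely used summary-data Mendelian randomisation method. It is swapped in here for illustration; the abstract describes a two-stage design and does not state that IVW was the estimator used. The function name and all numbers are hypothetical.

```python
from math import exp, sqrt


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect and its standard error.

    beta_exposure: per-variant associations with the exposure (e.g. BMI)
    beta_outcome:  per-variant associations with the outcome (log odds)
    se_outcome:    standard errors of the outcome associations
    """
    # Equivalent to averaging the per-variant Wald ratios by/bx
    # with weights bx^2 / se^2.
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator, sqrt(1.0 / denominator)


# Made-up instruments whose Wald ratios both equal 0.5.
estimate, std_error = ivw_estimate([0.1, 0.2], [0.05, 0.10], [0.01, 0.02])

# Convert the log-odds estimate to an odds ratio with a 95% interval.
odds_ratio = exp(estimate)
ci_low = exp(estimate - 1.96 * std_error)
ci_high = exp(estimate + 1.96 * std_error)
```

The estimate is on the log-odds scale per unit of the exposure, which is why the results above are reported as odds ratios per 1 kg/m² or per standard deviation of height.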
