A framework for research into continental ancestry groups of the UK Biobank

2021 ◽  
Author(s):  
Andrei-Emil Constantinescu ◽  
Ruth E Mitchell ◽  
Jie Zheng ◽  
Caroline J Bull ◽  
Nicholas J Timpson ◽  
...  

The UK Biobank is a large prospective cohort, based in the United Kingdom, that has deep phenotypic and genomic data on roughly half a million individuals. Included in this resource are data on approximately 78,000 individuals with "non-white British ancestry". Whilst most epidemiology studies have focused predominantly on populations of European ancestry, there is an opportunity to contribute to the study of health and disease for a broader segment of the population by making use of the UK Biobank's "non-white British ancestry" samples. Here we present an empirical description of the continental ancestry and population structure among the individuals in this UK Biobank subset. Reference populations from the 1000 Genomes Project for Africa, Europe, East Asia, and South Asia were used to estimate ancestry for each individual. Those with at least 80% ancestry in one of these four continental ancestry groups were taken forward (N=62,484). Principal component and K-means clustering analyses were used to identify and characterize population structure within each ancestry group. Of the approximately 78,000 individuals in the UK Biobank that are of "non-white British" ancestry, 50,685, 6,653, 2,782, and 2,364 individuals were assigned to the European, African, South Asian, and East Asian continental ancestry groups, respectively. Each continental ancestry group exhibits prominent population structure that is consistent with self-reported country of birth data and geography. Methods outlined here provide an avenue to leverage UK Biobank's deeply phenotyped data, allowing researchers to maximise its potential in the study of health and disease in individuals of non-white British ancestry.
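The 80% retention rule described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the group labels (`AFR`, `EUR`, `EAS`, `SAS`), the function name `assign_group`, and the per-individual ancestry fractions are all hypothetical, and the step that estimates those fractions from the 1000 Genomes reference populations is not shown. The only figures taken from the abstract are the threshold and the reported group counts.

```python
from typing import Dict, Optional

# Hypothetical labels for the four continental groups named in the abstract.
GROUPS = ("AFR", "EUR", "EAS", "SAS")


def assign_group(fractions: Dict[str, float], threshold: float = 0.80) -> Optional[str]:
    """Return the group whose ancestry fraction meets the threshold, else None."""
    best = max(GROUPS, key=lambda g: fractions.get(g, 0.0))
    return best if fractions.get(best, 0.0) >= threshold else None


# Made-up ancestry fractions for three illustrative individuals.
samples = {
    "id1": {"AFR": 0.92, "EUR": 0.05, "EAS": 0.01, "SAS": 0.02},
    "id2": {"AFR": 0.40, "EUR": 0.55, "EAS": 0.02, "SAS": 0.03},
    "id3": {"AFR": 0.00, "EUR": 0.10, "EAS": 0.05, "SAS": 0.85},
}

assigned = {sid: assign_group(f) for sid, f in samples.items()}
retained = {sid: g for sid, g in assigned.items() if g is not None}

# Group counts as reported in the abstract; they sum to the stated N.
reported = {"EUR": 50_685, "AFR": 6_653, "SAS": 2_782, "EAS": 2_364}
total_retained = sum(reported.values())
```

In this toy input, `id2` has no group at or above 80% and is therefore left unassigned, which mirrors how admixed individuals would fall outside the four groups under such a rule.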

2019 ◽  
Vol 116 (21) ◽  
pp. 10430-10434 ◽  
Author(s):  
Gaspard Kerner ◽  
Noe Ramirez-Alejo ◽  
Yoann Seeleuthner ◽  
Rui Yang ◽  
Masato Ogishi ◽  
...  

The human genetic basis of tuberculosis (TB) has long remained elusive. We recently reported a high level of enrichment in homozygosity for the common TYK2 P1104A variant in a heterogeneous cohort of patients with TB from non-European countries in which TB is endemic. This variant is homozygous in ∼1/600 Europeans and ∼1/5,000 people from other countries outside East Asia and sub-Saharan Africa. We report a study of this variant in the UK Biobank cohort. The frequency of P1104A homozygotes was much higher in patients with TB (6/620, 1%) than in controls (228/114,473, 0.2%), with an odds ratio (OR) adjusted for ancestry of 5.0 [95% confidence interval (CI): 1.96–10.31, P = 2 × 10⁻³]. Conversely, we did not observe enrichment for P1104A heterozygosity, or for TYK2 I684S or V362F homozygosity or heterozygosity. Moreover, it is unlikely that more than 10% of controls were infected with Mycobacterium tuberculosis, as 97% were of European genetic ancestry, born between 1939 and 1970, and resided in the United Kingdom. Had all of them been infected, the OR for developing TB upon infection would be higher. These findings suggest that homozygosity for TYK2 P1104A may account for ∼1% of TB cases in Europeans.
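As a rough check on the counts quoted above, the following Python sketch computes the homozygote frequencies and a crude (unadjusted) odds ratio from the 2×2 table. The function name `odds_ratio` is illustrative. Note that the abstract's OR of 5.0 is adjusted for ancestry, so the crude value is only expected to be in the same neighbourhood, not identical.

```python
def odds_ratio(case_exposed: int, case_total: int,
               control_exposed: int, control_total: int) -> float:
    """Crude odds ratio from a 2x2 table given exposed counts and group totals."""
    a = case_exposed                      # homozygous cases
    b = case_total - case_exposed         # non-homozygous cases
    c = control_exposed                   # homozygous controls
    d = control_total - control_exposed   # non-homozygous controls
    return (a * d) / (b * c)


# Counts as quoted in the abstract.
case_freq = 6 / 620             # about 1% of TB patients are P1104A homozygotes
control_freq = 228 / 114_473    # about 0.2% of controls
crude_or = odds_ratio(6, 620, 228, 114_473)  # roughly 4.9, unadjusted
```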


2020 ◽  
Vol 36 (16) ◽  
pp. 4449-4457 ◽  
Author(s):  
Florian Privé ◽  
Keurcien Luu ◽  
Michael G B Blum ◽  
John J McGrath ◽  
Bjarni J Vilhjálmsson

Motivation: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. Results: For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data. Availability and implementation: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code. Supplementary information: Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Aman Agrawal ◽  
Alec M. Chiu ◽  
Minh Le ◽  
Eran Halperin ◽  
Sriram Sankararaman

Abstract: Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in less than thirty minutes. Leveraging the population structure inferred by ProPCA within the White British individuals in the UK Biobank, we scanned for SNPs that are not well-explained by the PCs to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4. Author Summary: Principal component analysis is a commonly used technique for understanding population structure and genetic variation. With the advent of large-scale datasets that contain the genetic information of hundreds of thousands of individuals, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. In this study, we present ProPCA, a highly scalable statistical method to compute genetic PCs efficiently. We systematically evaluate the accuracy and robustness of our method on large-scale simulated data and apply it to the UK Biobank. Leveraging the population structure inferred by ProPCA within the White British individuals in the UK Biobank, we identify several novel signals of putative recent selection.


2017 ◽  
Vol 115 (1) ◽  
pp. 151-156 ◽  
Author(s):  
Jaleal S. Sanjak ◽  
Julia Sidorenko ◽  
Matthew R. Robinson ◽  
Kevin R. Thornton ◽  
Peter M. Visscher

Modern molecular genetic datasets, primarily collected to study the biology of human health and disease, can be used to directly measure the action of natural selection and reveal important features of contemporary human evolution. Here we leverage the UK Biobank data to test for the presence of linear and nonlinear natural selection in a contemporary population of the United Kingdom. We obtain phenotypic and genetic evidence consistent with the action of linear/directional selection. Phenotypic evidence suggests that stabilizing selection, which acts to reduce variance in the population without necessarily modifying the population mean, is widespread and relatively weak in comparison with estimates from other species.


2019 ◽  
Author(s):  
Florian Privé ◽  
Keurcien Luu ◽  
Michael G.B. Blum ◽  
John J. McGrath ◽  
Bjarni J. Vilhjálmsson

Abstract: Principal Component Analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (1) capturing Linkage Disequilibrium (LD) structure instead of population structure, (2) projected PCs that suffer from shrinkage bias, (3) detecting sample outliers, and (4) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. For example, we find that PC19 to PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.


2020 ◽  
Author(s):  
John E. McGeary ◽  
Chelsie Benca-Bachman ◽  
Victoria Risner ◽  
Christopher G Beevers ◽  
Brandon Gibb ◽  
...  

Twin studies indicate that 30-40% of the disease liability for depression can be attributed to genetic differences. Here, we assess the explanatory ability of polygenic scores (PGS) based on broad- (PGS_BD) and clinical- (PGS_MDD) depression summary statistics from the UK Biobank using independent cohorts of adults (N=210; 100% European Ancestry) and children (N=728; 70% European Ancestry) who have been extensively phenotyped for depression and related neurocognitive phenotypes. PGS associations with depression severity and diagnosis were generally modest, and larger in adults than children. Polygenic prediction of depression-related phenotypes was mixed and varied by PGS. Higher PGS_BD, in adults, was associated with a higher likelihood of having suicidal ideation, increased brooding and anhedonia, and lower levels of cognitive reappraisal; PGS_MDD was positively associated with brooding and negatively related to cognitive reappraisal. Overall, PGS based on both broad and clinical depression phenotypes have modest utility in adult and child samples of depression.
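For readers unfamiliar with polygenic scores, a PGS is conventionally a weighted sum of an individual's effect-allele dosages, with per-variant weights taken from GWAS summary statistics. The Python sketch below shows only that generic definition. The variant identifiers, weights, and dosages are invented for illustration, and the abstract does not state which scoring method or variant selection the authors used.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of effect-allele dosages over variants present in both inputs."""
    return sum(weights[snp] * dosages[snp] for snp in weights if snp in dosages)


# Hypothetical per-variant effect sizes (e.g. from summary statistics).
weights = {"rs_a": 0.10, "rs_b": -0.05, "rs_c": 0.20}

# Hypothetical dosages: number of effect alleles carried (0, 1 or 2).
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

score = polygenic_score(person, weights)  # 2*0.10 + 1*(-0.05) + 0*0.20
```

Variants missing from an individual's genotype data are simply skipped here; real pipelines typically handle missingness and score standardisation more carefully.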


Genes ◽  
2021 ◽  
Vol 12 (7) ◽  
pp. 991
Author(s):  
Erik Widen ◽  
Timothy G. Raben ◽  
Louis Lello ◽  
Stephen D. H. Hsu

We use UK Biobank data to train predictors for 65 blood and urine markers such as HDL, LDL, lipoprotein A, glycated haemoglobin, etc. from SNP genotype. For example, our Polygenic Score (PGS) predictor correlates ∼0.76 with lipoprotein A level, which is highly heritable and an independent risk factor for heart disease. This may be the most accurate genomic prediction of a quantitative trait that has yet been produced (specifically, for European ancestry groups). We also train predictors of common disease risk using blood and urine biomarkers alone (no DNA information); we call these predictors biomarker risk scores, BMRS. Individuals who are at high risk (e.g., odds ratio of >5× population average) can be identified for conditions such as coronary artery disease (AUC∼0.75), diabetes (AUC∼0.95), hypertension, liver and kidney problems, and cancer using biomarkers alone. Our atherosclerotic cardiovascular disease (ASCVD) predictor uses ∼10 biomarkers and performs in UKB evaluation as well as or better than the American College of Cardiology ASCVD Risk Estimator, which uses quite different inputs (age, diagnostic history, BMI, smoking status, statin usage, etc.). We compare polygenic risk scores (risk conditional on genotype: PRS) for common diseases to the risk predictors which result from the concatenation of learned functions BMRS and PGS, i.e., applying the BMRS predictors to the PGS output.


2021 ◽  
pp. 1-9
Author(s):  
Janice L. Atkins ◽  
Luke C. Pilling ◽  
Christine J. Heales ◽  
Sharon Savage ◽  
Chia-Ling Kuo ◽  
...  

Background: Brain iron deposition occurs in dementia. In European ancestry populations, the HFE p.C282Y variant can cause iron overload and hemochromatosis, mostly in homozygous males. Objective: To estimate p.C282Y associations with brain MRI features plus incident dementia diagnoses during follow-up in a large community cohort. Methods: UK Biobank participants with follow-up hospitalization records (mean 10.5 years). MRI in 206 p.C282Y homozygotes versus 23,349 without variants, including T2* measures (lower values indicating more iron). Results: European ancestry participants included 2,890 p.C282Y homozygotes. Male p.C282Y homozygotes had lower T2* measures in areas including the putamen, thalamus, and hippocampus, compared to those with no HFE mutations. Incident dementia was more common in p.C282Y homozygous men (hazard ratio [HR] = 1.83; 95% CI 1.23 to 2.72, p = 0.003), as was delirium. There were no associations in homozygote women or in heterozygotes. Conclusion: Studies are needed of whether early iron reduction prevents or slows related brain pathologies in male HFE p.C282Y homozygotes.


2021 ◽  
Vol 50 (Supplement_1) ◽  
Author(s):  
Joshua Sutherland ◽  
Ang Zhou ◽  
Matthew Leach ◽  
Elina Hyppönen

Background: While controversy remains regarding optimal vitamin D status, the public health relevance of true vitamin D deficiency is undisputed. There are few contemporary cross-ethnic studies investigating the prevalence and determinants of very low 25-hydroxyvitamin D [25(OH)D] concentrations. Methods: We used data from 440,581 UK Biobank participants, of whom 415,903 identified as white European, 7,880 Asian, 7,602 black African, 1,383 Chinese, and 6,473 of mixed ancestry. 25(OH)D concentrations were measured by DiaSorin Liaison XL and deficiency defined as ≤ 25 nmol/L 25(OH)D. Results: The prevalence of 25(OH)D deficiency was highest among participants of Asian ancestry (57.2% in winter/spring and 50.8% in summer/autumn), followed by black African (38.47%/30.78%), mixed ancestry (36.53%/22.48%), Chinese (33.12%/20.68%) and white European (17.45%/5.90%) (P < 1.0E-300). Participants with higher socioeconomic deprivation were more likely to have 25(OH)D deficiency compared to less deprived (P < 1.0E-300 for all comparisons), with the pattern being more apparent among those of white European ancestry and in summer (P-interaction < 6.4E-5 for both). In fully-adjusted analyses, regular consumption of oily fish was effective in mitigating ≤ 25 nmol/L 25(OH)D deficiency across all ethnicities, whilst outdoor-summer time was less effective for black Africans than white Europeans (OR: 0.89; 95% CI: 0.70, 1.12 and OR: 0.40; 95% CI: 0.38, 0.42, respectively). Conclusions: Vitamin D deficiency remains an issue throughout the UK, particularly in lower socioeconomic areas and the UK Asian population, half of whom have vitamin D deficiency across seasons. Key messages: The prevalence of 25(OH)D deficiency in the UK is alarming, with certain ethnic and socioeconomic groups considered particularly vulnerable.


2020 ◽  
Author(s):  
Aliya Sarmanova ◽  
Tim Morris ◽  
Daniel John Lawson

Abstract: Population stratification has recently been demonstrated to bias genetic studies even in relatively homogeneous populations such as within the British Isles. A key component to correcting for stratification in genome-wide association studies (GWAS) is accurately identifying and controlling for the underlying structure present in the sample. Meta-analysis across cohorts is increasingly important for achieving very large sample sizes, but comes with the major disadvantage that each individual cohort corrects for different population stratification. Here we demonstrate that correcting for structure against an external reference adds significant value to meta-analysis. We treat the UK Biobank as a collection of smaller studies, each of which is geographically localised. We provide software to standardize an external dataset against a reference, provide the UK Biobank principal component loadings for this purpose, and demonstrate the value of this with an analysis of the geographically sampled ALSPAC cohort.

