Scalable probabilistic PCA for large-scale genetic variation data

Abstract: Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in less than thirty minutes. Leveraging the population structure inferred by ProPCA within the White British individuals in the UK Biobank, we scanned for SNPs that are not well-explained by the PCs to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.

Author Summary: Principal component analysis is a commonly used technique for understanding population structure and genetic variation. With the advent of large-scale datasets that contain the genetic information of hundreds of thousands of individuals, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. In this study, we present ProPCA, a highly scalable statistical method to compute genetic PCs efficiently. We systematically evaluate the accuracy and robustness of our method on large-scale simulated data and apply it to the UK Biobank. Leveraging the population structure inferred by ProPCA within the White British individuals in the UK Biobank, we identify several novel signals of putative recent selection.
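To make the idea of computing PCs from a probabilistic generative model more concrete, the following is a minimal sketch of the classic EM iteration for PCA in the zero-noise limit of probabilistic PCA. It is an illustration under stated assumptions, not ProPCA's implementation: the abstract does not give algorithmic details, and the function name `em_pca`, the dense NumPy arrays, and the fixed iteration count are hypothetical choices made for this example.

```python
import numpy as np

def em_pca(Y, k, n_iter=50, seed=0):
    """Estimate the top-k principal axes of Y (n_samples x n_features) by EM.

    Hypothetical helper for illustration only; not the ProPCA code.
    Returns (axes, singular_values), with axes of shape (n_features, k).
    """
    rng = np.random.default_rng(seed)
    D = (Y - Y.mean(axis=0)).T                  # centered data, features x samples
    C = rng.standard_normal((D.shape[0], k))    # random initial loading matrix
    for _ in range(n_iter):
        # E-step: latent coordinates X = (C^T C)^-1 C^T D
        X = np.linalg.solve(C.T @ C, C.T @ D)
        # M-step: loadings C = D X^T (X X^T)^-1
        C = np.linalg.solve(X @ X.T, X @ D.T).T
    # C spans the principal subspace; rotate an orthonormal basis of it
    # onto the principal axes with a small k x n SVD.
    Q, _ = np.linalg.qr(C)
    U, s, _ = np.linalg.svd(Q.T @ D, full_matrices=False)
    return Q @ U, s
```

Each iteration touches the data only through products with thin matrices of width k, which is why iterations of this kind can be attractive when a full eigendecomposition is too expensive.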


Efficient toolkit implementing best practices for principal component analysis of population genetic data

Bioinformatics ◽

10.1093/bioinformatics/btaa520 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4449-4457 ◽

Cited By ~ 4

Author(s):

Florian Privé ◽

Keurcien Luu ◽

Michael G B Blum ◽

John J McGrath ◽

Bjarni J Vilhjálmsson

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Best Practices ◽

Principal Component ◽

Genetic Data ◽

UK Biobank ◽

1000 Genomes Project ◽

1000 Genomes ◽

R Packages ◽

The UK

Abstract

Motivation: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr.

Results: For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.

Availability and implementation: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code.

Supplementary information: Supplementary data are available at Bioinformatics online.


Efficient toolkit implementing best practices for principal component analysis of population genetic data

10.1101/841452 ◽

2019 ◽

Cited By ~ 2

Author(s):

Florian Privé ◽

Keurcien Luu ◽

Michael G.B. Blum ◽

John J. McGrath ◽

Bjarni J. Vilhjálmsson

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Best Practices ◽

Principal Component ◽

Genetic Data ◽

Component Analysis ◽

UK Biobank ◽

1000 Genomes Project ◽

1000 Genomes ◽

The UK

Abstract: Principal Component Analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (1) capturing Linkage Disequilibrium (LD) structure instead of population structure, (2) projected PCs that suffer from shrinkage bias, (3) detecting sample outliers, and (4) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr.

For example, we find that PC19 to PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr.

Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.


Fast Principal Component Analysis of Large-Scale Genome-Wide Data

PLoS ONE ◽

10.1371/journal.pone.0093766 ◽

2014 ◽

Vol 9 (4) ◽

pp. e93766 ◽

Cited By ~ 145

Author(s):

Gad Abraham ◽

Michael Inouye

Keyword(s):

Principal Component Analysis ◽

Large Scale ◽

Principal Component ◽

Component Analysis ◽

Genome Wide ◽

Genome Wide Data


A framework for research into continental ancestry groups of the UK Biobank

10.1101/2021.12.14.472589 ◽

2021 ◽

Author(s):

Andrei-Emil Constantinescu ◽

Ruth E Mitchell ◽

Jie Zheng ◽

Caroline J Bull ◽

Nicholas J Timpson ◽

...

Keyword(s):

Population Structure ◽

Principal Component ◽

European Ancestry ◽

UK Biobank ◽

Ancestry Group ◽

The United Kingdom ◽

Health And Disease ◽

Birth Data ◽

The UK ◽

Epidemiology Studies

The UK Biobank is a large prospective cohort, based in the United Kingdom, that has deep phenotypic and genomic data on roughly half a million individuals. Included in this resource are data on approximately 78,000 individuals with "non-white British ancestry". Whilst most epidemiology studies have focused predominantly on populations of European ancestry, there is an opportunity to contribute to the study of health and disease for a broader segment of the population by making use of the UK Biobank's "non-white British ancestry" samples. Here we present an empirical description of the continental ancestry and population structure among the individuals in this UK Biobank subset. Reference populations from the 1000 Genomes Project for Africa, Europe, East Asia, and South Asia were used to estimate ancestry for each individual. Those with at least 80% ancestry in one of these four continental ancestry groups were taken forward (N=62,484). Principal component and K-means clustering analyses were used to identify and characterize population structure within each ancestry group. Of the approximately 78,000 individuals in the UK Biobank that are of "non-white British" ancestry, 50,685, 6,653, 2,782, and 2,364 individuals were assigned to the European, African, South Asian, and East Asian continental ancestry groups, respectively. Each continental ancestry group exhibits prominent population structure that is consistent with self-reported country of birth data and geography. The methods outlined here provide an avenue to leverage the UK Biobank's deeply phenotyped data, allowing researchers to maximise its potential in the study of health and disease in individuals of non-white British ancestry.
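The 80% rule described in the abstract can be illustrated with a short sketch. It assumes each individual already has estimated ancestry proportions for the four continental groups. The column order in `GROUPS`, the function name `assign_ancestry`, and the example proportions are hypothetical, and the abstract does not say how the proportions themselves were estimated. As a consistency check on the reported figures, the four group counts add up to the retained total: 50,685 + 6,653 + 2,782 + 2,364 = 62,484.

```python
import numpy as np

# Hypothetical column order for the ancestry-proportion matrix.
GROUPS = ("European", "African", "South Asian", "East Asian")

def assign_ancestry(proportions, threshold=0.80):
    """Label each individual with the group holding >= threshold ancestry.

    proportions: array-like of shape (n_individuals, 4), rows summing to ~1.
    Individuals below the threshold in every group are labelled None,
    i.e. not taken forward.
    """
    proportions = np.asarray(proportions, dtype=float)
    best = proportions.argmax(axis=1)             # index of the largest share
    keep = proportions.max(axis=1) >= threshold   # does it reach the cut-off?
    return [GROUPS[b] if k else None for b, k in zip(best, keep)]

# Reported group sizes from the abstract and their total.
reported_counts = {"European": 50685, "African": 6653,
                   "South Asian": 2782, "East Asian": 2364}
total_retained = sum(reported_counts.values())
```

In this sketch an individual with proportions such as (0.5, 0.3, 0.1, 0.1) would be excluded, because no single group reaches the threshold.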


Fine-scale population structure in the UK Biobank: implications for genome-wide association studies

Human Molecular Genetics ◽

10.1093/hmg/ddaa157 ◽

2020 ◽

Vol 29 (16) ◽

pp. 2803-2811

Author(s):

James P Cook ◽

Anubha Mahajan ◽

Andrew P Morris

Keyword(s):

Population Structure ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Fine Scale ◽

UK Biobank ◽

Genome Wide ◽

Scale Population ◽

The UK ◽

The Impact

Abstract The UK Biobank is a prospective study of more than 500 000 participants, which has aggregated data from questionnaires, physical measures, biomarkers, imaging and follow-up for a wide range of health-related outcomes, together with genome-wide genotyping supplemented with high-density imputation. Previous studies have highlighted fine-scale population structure in the UK on a North-West to South-East cline, but the impact of unmeasured geographical confounding on genome-wide association studies (GWAS) of complex human traits in the UK Biobank has not been investigated. We considered 368 325 white British individuals from the UK Biobank and performed GWAS of their birth location. We demonstrate that widely used approaches to adjust for population structure, including principal component analysis and mixed modelling with a random effect for a genetic relationship matrix, cannot fully account for the fine-scale geographical confounding in the UK Biobank. We observe significant genetic correlation of birth location with a range of lifestyle-related traits, including body-mass index and fat mass, hypertension and lung function, even after adjustment for population structure. Variants driving associations with birth location are also strongly associated with many of these lifestyle-related traits after correction for population structure, indicating that there could be environmental factors that are confounded with geography that have not been adequately accounted for. Our findings highlight the need for caution in the interpretation of lifestyle-related trait GWAS in UK Biobank, particularly in loci demonstrating strong residual association with birth location.


CNest: A Novel Copy Number Association Discovery Method Uncovers 862 New Associations from 200,629 Whole Exome Sequence Datasets in the UK Biobank

10.1101/2021.08.19.456963 ◽

2021 ◽

Author(s):

Tomas W Fitzgerald ◽

Ewan Birney

Keyword(s):

Copy Number ◽

Large Scale ◽

Association Studies ◽

Genomic Variation ◽

Next Generation Sequencing Data ◽

Genome Wide Association Studies ◽

UK Biobank ◽

Genome Wide ◽

The UK ◽

NGS Data

Copy number variation (CNV) has long been known to influence human traits, with a rich history of research into common and rare genetic disease. Although CNV is accepted as an important class of genomic variation, progress on copy number (CN) phenotype associations from next-generation sequencing (NGS) data has been limited, in part due to the relative difficulty of CNV detection and an enrichment for large numbers of false positives. To date, most successful CN genome-wide association studies (CN-GWAS) have focused on using predictive measures of dosage intolerance or gene burden tests to gain sufficient power for detecting CN effects. Here we present a novel method for large-scale CN analysis from NGS data, generating robust CN estimates and allowing CN-GWAS to be performed genome-wide in discovery mode. We provide a detailed analysis in the large-scale UK Biobank resource and a specifically designed software package for deriving CN estimates from NGS data that are robust enough to be used for CN-GWAS. We use these methods to perform genome-wide CN-GWAS analysis across 78 human traits, discovering 862 genetic associations that are likely to contribute strongly to trait distributions based solely on their CN or by acting in concert with other genetic variation. Finally, we undertake an analysis comparing CNV and SNP association signals across the same traits and samples, defining specific CNV association classes based on whether they could be detected using standard SNP-GWAS in the UK Biobank.


An efficient and accurate frailty model approach for genome-wide survival association analysis controlling for population structure and relatedness in large-scale biobanks

10.1101/2020.10.31.358234 ◽

2020 ◽

Author(s):

Rounak Dey ◽

Wei Zhou ◽

Tuomo Kiiskinen ◽

Aki Havulinna ◽

Amanda Elliott ◽

...

Keyword(s):

Population Structure ◽

Association Analysis ◽

Large Scale ◽

Computational Cost ◽

Low Frequency ◽

Saddlepoint Approximation ◽

Frailty Model ◽

UK Biobank ◽

Genome Wide ◽

Model Approach

Abstract: With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We developed an efficient and accurate frailty (random effects) model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes in large biobanks by accounting for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost. The saddlepoint approximation is used to allow for analysis of heavily censored phenotypes (>90%) and low frequency variants (down to minor allele count 20). We demonstrated the performance of our method through extensive simulation studies and analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants with white British ancestry and ~180,000 samples in FinnGen, respectively. We further performed genome-wide association analysis for 871 TTE phenotypes in UK Biobank and presented the genome-wide scale phenome-wide association (PheWAS) results with the PheWeb browser.


Large-Scale, Genome-Wide Gene-Diet Interaction Testing for HbA1c Using Derived Dietary Patterns in the UK Biobank

Current Developments in Nutrition ◽

10.1093/cdn/nzaa058_038 ◽

2020 ◽

Vol 4 (Supplement_2) ◽

pp. 1280-1280

Author(s):

Kenneth Westerman ◽

Ye Chen ◽

Han Chen ◽

Jose Florez ◽

Joanne Cole ◽

...

Keyword(s):

Principal Components ◽

Dietary Patterns ◽

Genetic Variants ◽

Interaction Analysis ◽

European Ancestry ◽

UK Biobank ◽

Genome Wide ◽

A Genome ◽

Gene Level ◽

The UK

Abstract

Objectives: Gene-diet interaction analysis can inform the development of precision nutrition for diabetes by uncovering genetic variants whose effects on glycemic traits vary across dietary behaviors. However, due to noise in dietary datasets and the low statistical power inherent in interaction analysis, there is a lack of confident, well-replicated gene-diet interactions for glycemic traits. Emerging computationally-efficient software tools have made it feasible to conduct well-powered, genome-wide interaction analysis in hundreds of thousands of individuals. Here, our objective was to conduct a genome-wide gene-diet interaction analysis for glycated hemoglobin (HbA1c; a measure of hyperglycemia), leveraging the large sample size of the UK Biobank cohort and data-driven dietary patterns to discover genetic variants whose effect is modulated by diet.

Methods: Food frequency questionnaires were previously used to derive empirical dietary patterns using principal components analysis (FFQ-PCs) in the UK Biobank. FFQ-PCs were used in genome-wide interaction analysis for HbA1c levels in unrelated, non-diabetic individuals of European ancestry (N = 331,610), adjusting for age, sex, and 10 genetic principal components. P-values were calculated for both the interaction (P-int) and a joint test (significance of the variant-HbA1c association combining the main and interaction effects) and the MAGMA tool was used to calculate gene-level enrichment statistics.

Results: Preliminary results from the first two FFQ-PCs confirmed known genetic loci for HbA1c using the joint test, such as at G6PC2 and GCK. Though no interaction tests reached genome-wide significance, suggestive signals (P-int < 1e-5) emerged at the variant level (including one near TPSD1, which codes for a tryptase and has been linked to red blood cell traits) and the gene level (such as for GTF3C2, which has previously been shown to interact with sleep in impacting lipid traits).

Conclusions: We have conducted the largest genome-wide study of gene-diet interactions for glycemic traits to-date and identified regions in the genome whose effect on HbA1c may be modulated by dietary intake, suggesting that this approach has the potential to reveal new insights into the genetics of glycemic traits and inform individualized dietary guidelines for diabetes prevention and management.

Funding Sources: NHLBI.
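As a rough illustration of what a single-variant interaction test (P-int) involves, the sketch below fits an ordinary least-squares model with a genotype term, an exposure term, their product, and covariates, then tests the product coefficient. It is a generic textbook formulation under stated assumptions, not the software used in the study: the function name `interaction_test` is hypothetical, the p-value uses a normal approximation, and the joint test mentioned in the abstract is not implemented here.

```python
import math
import numpy as np

def interaction_test(y, g, e, covariates):
    """Test the genotype-by-exposure term in y ~ 1 + g + e + g*e + covariates.

    Hypothetical helper for illustration only.
    y, g, e: 1-D arrays of length n; covariates: (n, c) array
    (for example age, sex, and genetic PCs).
    Returns (interaction coefficient, z statistic, two-sided p-value).
    """
    n = len(y)
    X = np.column_stack([np.ones(n), g, e, g * e, covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients
    resid = y - X @ beta
    dof = n - X.shape[1]
    sigma2 = resid @ resid / dof                   # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)          # coefficient covariance
    z = beta[3] / math.sqrt(cov[3, 3])             # column 3 is the g*e term
    p = math.erfc(abs(z) / math.sqrt(2))           # two-sided, normal approximation
    return beta[3], z, p
```

In a genome-wide scan a test like this would be repeated for every variant, which is where the computationally efficient tools mentioned in the abstract become important.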


Fast Principal Component Analysis of Large-Scale Genome-Wide Data

10.1101/002238 ◽

2014 ◽

Cited By ~ 2

Author(s):

Gad Abraham ◽

Michael Inouye

Keyword(s):

Principal Component Analysis ◽

Large Scale ◽

Principal Component ◽

Component Analysis ◽

Single Nucleotide ◽

SNP Data ◽

Genome Wide ◽

Genome Wide Data ◽

Eigen Decomposition ◽

Traditional Approaches

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
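For readers unfamiliar with randomized algorithms for PCA, the sketch below shows one standard variant: a Gaussian random projection to find an approximate range of the data, a few power iterations to sharpen it, and an exact SVD of the resulting small matrix. Whether flashpca uses exactly this variant is not stated in the abstract, so the function name `randomized_pca` and its parameters (`oversample`, `n_power`) should be read as assumptions made for this example.

```python
import numpy as np

def randomized_pca(X, k, oversample=10, n_power=4, seed=0):
    """Approximate the top-k PCs of X (n_samples x n_features).

    Hypothetical helper for illustration only; not the flashpca code.
    Returns (scores, axes, singular_values) with shapes
    (n_samples, k), (k, n_features), and (k,).
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                         # center each feature
    # Random test matrix: project onto k + oversample random directions.
    omega = rng.standard_normal((Xc.shape[1], k + oversample))
    Q, _ = np.linalg.qr(Xc @ omega)
    # Power iterations with re-orthonormalization for numerical stability.
    for _ in range(n_power):
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)
    # Exact SVD of the small projected matrix, then map back.
    B = Q.T @ Xc
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k] * s[:k], Vt[:k], s[:k]
```

The expensive steps are a handful of passes over the data multiplied by thin matrices, so the cost grows with k rather than with the full rank of the genotype matrix.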


Revealing the coexistence of differentiation and communication in an endemic hare, Lepus yarkandensis (Mammalia, Leporidae) using specific-length amplified fragment sequencing

Frontiers in Zoology ◽

10.1186/s12983-021-00432-x ◽

2021 ◽

Vol 18 (1) ◽

Author(s):

Buweihailiqiemu Ababaikeri ◽

Yucong Zhang ◽

Huiying Dai ◽

Wenjuan Shan

Keyword(s):

Genetic Diversity ◽

Population Structure ◽

Genetic Variation ◽

Genetic Differentiation ◽

Tarim Basin ◽

Principal Component ◽

Northwest China ◽

The North ◽

Genome Wide ◽

Specific Length

Abstract

Background: The Yarkand hare (Lepus yarkandensis Günther, 1875) is endemic to oasis and desert areas around the Tarim Basin in the Xinjiang Uyghur Autonomous Region of northwest China; however, genome-wide information for this species remains limited. Moreover, the genetic variation, genetic structure, and phylogenetic relationships of Yarkand hare from the plateau mountain regions have not been reported. Thus, we used specific-length amplified fragment sequencing (SLAF-seq) technology to evaluate the genetic diversity of 76 Yarkand hares from seven geographic populations in the northern and southwestern parts of the Tarim Basin to investigate single-nucleotide polymorphism (SNP) marker-based population differentiation and evolutionary processes. Selective sweep analysis was conducted to identify genetic differences between populations.

Results: Using SLAF-seq, a total of 1,835,504 SNPs were initially obtained, of which 308,942 high-confidence SNPs were selected for further analysis. Yarkand hares exhibited a relatively high degree of genetic diversity at the SNP level. Based on pairwise FST estimates, the north and southwest groups showed a moderate level of genetic differentiation. Phylogenetic tree and population structure analyses demonstrated evident systematic phylogeographical structure patterns consistent with the geographical distribution of the hares. Hierarchical analysis of molecular variation further indicated that genetic variation was mainly observed within populations. Low to moderate genetic differentiation also occurred among populations despite a common genomic background, likely due to geographical barriers, genetic drift, and differential selection pressure of distinct environments. Nevertheless, the observed lineage-mixing pattern, as indicated by the evolutionary tree, principal component analysis, population structure, and TreeMix analyses, suggests a certain degree of gene flow between the north and southwest groups. This may be related to the migration of hares to high-altitude water sources southwest of the basin during glacial climatic oscillations, as well as river re-diffusion and oasis restoration in the basin following the glacial period. We also identified candidate genes, and their associated gene ontology terms and pathways, related to the adaptation of Yarkand hares to different environmental habitats.

Conclusions: The identified genome-wide SNPs, genetic diversity, and population structure of Yarkand hares expand our understanding of the genetic background of this endemic species and provide valuable insights into its environmental adaptation, allowing for further exploration of the underlying mechanisms.
