Efficient toolkit implementing best practices for principal component analysis of population genetic data

2020 ◽  
Vol 36 (16) ◽  
pp. 4449-4457 ◽  
Author(s):  
Florian Privé ◽  
Keurcien Luu ◽  
Michael G B Blum ◽  
John J McGrath ◽  
Bjarni J Vilhjálmsson

ABSTRACT
Motivation: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr.
Results: For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work will be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.
Availability and implementation: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code.
Supplementary information: Supplementary data are available at Bioinformatics online.
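The core workflow described in the abstract (standardize genotypes by allele frequency, compute top PCs, then project new samples onto the SNP loadings) can be sketched in a few lines. This is a minimal NumPy illustration, not the bigsnpr implementation; the function names are hypothetical, and the naive projection shown is precisely the one that exhibits the shrinkage bias the authors correct.

```python
import numpy as np

def pca_genotypes(G, k=2):
    """Top-k PCs of an (n_samples x n_snps) genotype dosage matrix.

    Columns are standardized by allele frequency, as is conventional
    for genetic PCA; this sketch uses a full SVD rather than the
    partial SVD a real biobank-scale tool would use.
    """
    p = G.mean(axis=0) / 2.0                  # per-SNP allele frequencies
    sd = np.sqrt(2.0 * p * (1.0 - p))
    Z = (G - 2.0 * p) / sd                    # standardized matrix
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :k] * S[:k]                 # PC scores of the samples
    loadings = Vt[:k].T                       # SNP loadings
    return scores, loadings, p, sd

def project_naive(G_new, loadings, p, sd):
    """Naive projection of new genotypes onto existing loadings.

    This simple product suffers from the shrinkage bias described
    above for higher PCs; bigsnpr applies a correction instead.
    """
    Z_new = (G_new - 2.0 * p) / sd
    return Z_new @ loadings
```

Projecting the training samples themselves reproduces their PC scores exactly; the bias only appears for held-out individuals on deeper PCs.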


2020 ◽  
Author(s):  
Florian Privé

Abstract
Here we propose a simple, robust and effective method for global ancestry inference and grouping from Principal Component Analysis (PCA) of genetic data. The proposed approach is particularly useful for methods that need to be applied in homogeneous samples. First, we show that Euclidean distances in the PCA space are proportional to FST between populations. Then, we show how to use this PCA-based distance to infer ancestry in the UK Biobank and the POPRES datasets. We propose two solutions, either relying on projection of PCs to reference populations such as from the 1000 Genomes Project, or by directly using the internal data. Finally, we conclude that our method and the community would benefit from easy access to a reference dataset with even better coverage of worldwide genetic diversity than the 1000 Genomes Project.
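The grouping idea sketched in the abstract, assigning each individual to the nearest reference population in PC space, with distant samples left unassigned, can be illustrated as follows. This is a hypothetical sketch (the distance cutoff rule here is illustrative, not the paper's exact criterion).

```python
import numpy as np

def assign_ancestry(pcs, centers, labels, max_dist=None):
    """Assign each sample to the nearest population center in PC space.

    pcs:     (n_samples x k) PC scores
    centers: (n_pops x k) per-population mean PC scores
    labels:  population names, one per center
    Samples farther than max_dist from every center are labeled None,
    illustrating how a homogeneous subset can be carved out.
    """
    # pairwise Euclidean distances, (n_samples x n_pops)
    d = np.linalg.norm(pcs[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    out = [labels[j] for j in nearest]
    if max_dist is not None:
        out = [lab if d[i, j] <= max_dist else None
               for i, (lab, j) in enumerate(zip(out, nearest))]
    return out
```

Because PC-space distance is proportional to FST, this nearest-center rule effectively groups individuals by genetic similarity to the references.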


2019 ◽  
Author(s):  
Aman Agrawal ◽  
Alec M. Chiu ◽  
Minh Le ◽  
Eran Halperin ◽  
Sriram Sankararaman

Abstract
Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in less than thirty minutes. Leveraging the population structure inferred by ProPCA within the White British individuals in the UK Biobank, we scanned for SNPs that are not well-explained by the PCs to identify several novel genome-wide signals of recent putative selection, including missense mutations in RPGRIP1L and TLR4.
Author Summary
Principal component analysis is a commonly used technique for understanding population structure and genetic variation. With the advent of large-scale datasets that contain the genetic information of hundreds of thousands of individuals, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. In this study, we present ProPCA, a highly scalable statistical method to compute genetic PCs efficiently. We systematically evaluate the accuracy and robustness of our method on large-scale simulated data and apply it to the UK Biobank. Leveraging the population structure inferred by ProPCA within the White British individuals in the UK Biobank, we identify several novel signals of putative recent selection.
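The appeal of a probabilistic model here is that the principal subspace can be found with an EM-style iteration built entirely from matrix products, which scales to biobank-sized data. The sketch below is a minimal EM-for-PCA iteration (Roweis-style, zero-noise limit) in NumPy; it illustrates the flavor of the approach but is not ProPCA's actual algorithm or code.

```python
import numpy as np

def em_pca(Y, k=2, n_iter=50, seed=0):
    """Top-k principal subspace of Y (features x samples) via the
    EM algorithm for PCA in the zero-noise limit.

    Each iteration needs only matrix multiplications with Y plus
    small (k x k) solves, which is what makes EM-based PCA scalable.
    """
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    Y = Y - Y.mean(axis=1, keepdims=True)        # center each feature
    C = rng.standard_normal((d, k))              # random initial loadings
    for _ in range(n_iter):
        # E-step: latent scores given current loadings
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: loadings given current scores
        C = Y @ X.T @ np.linalg.inv(X @ X.T)
    Q, _ = np.linalg.qr(C)                       # orthonormal basis
    return Q
```

On exactly low-rank data the iteration lands in the true subspace almost immediately; on real genotype data the convergence rate depends on the eigenvalue gaps.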


2021 ◽  
Author(s):  
Andrei-Emil Constantinescu ◽  
Ruth E Mitchell ◽  
Jie Zheng ◽  
Caroline J Bull ◽  
Nicholas J Timpson ◽  
...  

The UK Biobank is a large prospective cohort, based in the United Kingdom, with deep phenotypic and genomic data on roughly half a million individuals. Included in this resource are data on approximately 78,000 individuals with "non-white British ancestry". Whilst most epidemiology studies have focused predominantly on populations of European ancestry, there is an opportunity to contribute to the study of health and disease for a broader segment of the population by making use of the UK Biobank's "non-white British ancestry" samples. Here we present an empirical description of the continental ancestry and population structure among the individuals in this UK Biobank subset. Reference populations from the 1000 Genomes Project for Africa, Europe, East Asia, and South Asia were used to estimate ancestry for each individual. Those with at least 80% ancestry in one of these four continental ancestry groups were taken forward (N=62,484). Principal component and K-means clustering analyses were used to identify and characterize population structure within each ancestry group. Of the approximately 78,000 individuals in the UK Biobank who are of "non-white British" ancestry, 50,685, 6,653, 2,782, and 2,364 individuals were assigned to the European, African, South Asian, and East Asian continental ancestry groups, respectively. Each continental ancestry group exhibits prominent population structure that is consistent with self-reported country of birth data and geography. The methods outlined here provide an avenue to leverage the UK Biobank's deeply phenotyped data, allowing researchers to maximise its potential in the study of health and disease in individuals of non-white British ancestry.
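The at-least-80% rule described above reduces to a simple thresholding of estimated ancestry proportions. A minimal sketch (the group labels and function name are illustrative; estimating the proportions against 1000 Genomes references is a separate step, out of scope here):

```python
import numpy as np

ANCESTRIES = ["AFR", "EUR", "EAS", "SAS"]       # four continental groups

def continental_groups(props, threshold=0.8):
    """Assign samples to a continental group when one estimated
    ancestry proportion reaches `threshold`; other samples are left
    unassigned (None) and dropped from downstream analyses.

    props: (n_samples x 4) matrix of ancestry proportions, each row
    summing to ~1.
    """
    best = props.argmax(axis=1)                 # dominant ancestry
    keep = props.max(axis=1) >= threshold       # >= 80% rule
    return [ANCESTRIES[j] if ok else None for j, ok in zip(best, keep)]
```

Samples passing the threshold can then be clustered (e.g. with K-means on their PCs) within each group, as in the study.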


2020 ◽  
Author(s):  
Florian Privé ◽  
Bjarni J. Vilhjálmsson ◽  
Hugues Aschard

Abstract
Both R packages snpnet and bigstatsr allow for fitting penalized regressions on individual-level genetic data as large as the UK Biobank. Here we benchmark bigstatsr against snpnet for fitting penalized regressions on large genetic data. We find bigstatsr to be an order of magnitude faster than snpnet when applied to the UK Biobank data (from 4.5x to 35x). We also discuss the similarities and differences between the two packages, provide theoretical insights, and make recommendations on how to fit penalized regressions in the context of genetic data.
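The penalized regressions both packages fit are, at their core, L1-penalized least squares solved by coordinate descent. A toy NumPy version of that solver is sketched below; the real implementations add cross-validation, early stopping, and out-of-core access to biobank-scale genotype matrices, none of which is shown here.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """L1-penalized least squares via cyclic coordinate descent,
    minimizing (1/2n)||y - Xb||^2 + lam * ||b||_1.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n          # per-column curvature
    r = y - X @ b                              # current residuals
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                # remove j's contribution
            rho = X[:, j] @ r / n
            # soft-thresholding update for coordinate j
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b
```

With a large enough penalty every coefficient is set exactly to zero, which is what makes the lasso useful for sparse genetic effects.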


Life Science ◽  
2019 ◽  
Vol 8 (1) ◽  
pp. 54-64
Author(s):  
Mohamad Ikhsan Nurulloh ◽  
Yustinus Ulung Anggraito ◽  
Hidayat Trimarsanto ◽  
Endah Peniati ◽  
R. Susanti

Plasmodium, the pathogen that causes malaria, has high genetic diversity and resistance to antimalarial drugs. Information on the population structure of Plasmodium can be obtained from molecular markers such as Single Nucleotide Polymorphisms (SNPs). SNP markers are numerous, but not all of them are informative. Existing methods have not been effective in producing informative SNPs, so an effective SNP selection method needs to be developed. The SNP selection method was developed using FST as the main filter, combined with Linkage Disequilibrium (LD). The population structure of the SNPs was assessed using Principal Component Analysis (PCA), Principal Coordinate Analysis (PCoA), pairwise FST, and neighbor-joining population trees. Informative SNP criteria were identified by calculating FST and Minor Allele Frequency (MAF). The statistical methods were tested for their effectiveness in producing informative SNPs using simulated genetic data of Plasmodium populations. The results show that the statistical method is effective in producing informative SNPs. The informative SNP criteria are SNPs with MAF 0.2-0.4 and FST 0.1-0.4 or 0.8-1.0.
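The MAF and FST windows reported above translate into a simple per-SNP filter. The sketch below uses a Wright-style two-population FST estimator, (H_T - H_S)/H_T with expected heterozygosities; this is one common estimator among several, shown for illustration only, and the function names are hypothetical.

```python
import numpy as np

def maf(freq):
    """Minor allele frequency from an allele frequency in [0, 1]."""
    return np.minimum(freq, 1.0 - freq)

def fst_two_pops(p1, p2):
    """Per-SNP FST for two equally weighted populations:
    FST = (H_T - H_S) / H_T, H = expected heterozygosity 2p(1-p).
    Monomorphic SNPs (H_T = 0) get FST = 0.
    """
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return np.where(h_t > 0, (h_t - h_s) / np.where(h_t > 0, h_t, 1.0), 0.0)

def informative_snps(p1, p2, maf_rng=(0.2, 0.4),
                     fst_rngs=((0.1, 0.4), (0.8, 1.0))):
    """Boolean mask of SNPs passing the reported windows:
    MAF in 0.2-0.4 and FST in 0.1-0.4 or 0.8-1.0."""
    m = maf((p1 + p2) / 2.0)
    f = fst_two_pops(p1, p2)
    ok_f = np.zeros_like(f, dtype=bool)
    for lo, hi in fst_rngs:
        ok_f |= (f >= lo) & (f <= hi)
    return (m >= maf_rng[0]) & (m <= maf_rng[1]) & ok_f
```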


Author(s):  
Taedong Yun ◽  
Helen Li ◽  
Pi-Chuan Chang ◽  
Michael F. Lin ◽  
Andrew Carroll ◽  
...  

Abstract
Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready variants remains challenging. Here we introduce an open-source cohort variant-calling method using the highly-accurate caller DeepVariant and scalable merging tool GLnexus. We optimized callset quality based on benchmark samples and Mendelian consistency across many sample sizes and sequencing specifications, resulting in substantial quality improvements and cost savings over existing best practices. We further evaluated our pipeline in the 1000 Genomes Project (1KGP) samples, showing superior quality metrics and imputation performance. We publicly release the 1KGP callset to foster development of broad studies of genetic variation.
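Mendelian consistency, one of the quality criteria mentioned above, is the check that a child's genotype is achievable from the parents' genotypes. A minimal sketch for biallelic sites (simplified: missing data and sex chromosomes are ignored, and this is not the pipeline's actual metric code):

```python
def mendelian_consistent(child, mother, father):
    """Check a biallelic trio genotype (0/1/2 alt-allele counts) for
    Mendelian consistency: the child must be able to inherit exactly
    one allele from each parent.
    """
    # possible alleles (0 = ref, 1 = alt) transmissible by each parent
    transmissible = {0: {0}, 1: {0, 1}, 2: {1}}
    from_mother = transmissible[mother]
    from_father = transmissible[father]
    possible = {a + b for a in from_mother for b in from_father}
    return child in possible
```

The fraction of trio sites violating this rule serves as a proxy for genotyping error when no benchmark truth set is available.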


2020 ◽  
Author(s):  
Aliya Sarmanova ◽  
Tim Morris ◽  
Daniel John Lawson

Abstract
Population stratification has recently been demonstrated to bias genetic studies even in relatively homogeneous populations such as within the British Isles. A key component to correcting for stratification in genome-wide association studies (GWAS) is accurately identifying and controlling for the underlying structure present in the sample. Meta-analysis across cohorts is increasingly important for achieving very large sample sizes, but comes with the major disadvantage that each individual cohort corrects for different population stratification. Here we demonstrate that correcting for structure against an external reference adds significant value to meta-analysis. We treat the UK Biobank as a collection of smaller studies, each of which is geographically localised. We provide software to standardize an external dataset against a reference, provide the UK Biobank principal component loadings for this purpose, and demonstrate the value of this with an analysis of the geographically sampled ALSPAC cohort.
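Standardizing against an external reference means scaling each SNP with the reference allele frequencies, not the in-sample ones, before applying the supplied loadings, so that PCs are comparable across cohorts. A minimal sketch of that step (a NumPy illustration, not the released software):

```python
import numpy as np

def project_to_reference(G, ref_freq, loadings):
    """Project a cohort's genotypes onto externally supplied PC loadings.

    G:        (n_samples x n_snps) genotype dosages
    ref_freq: per-SNP allele frequencies from the reference cohort
    loadings: (n_snps x k) PC loadings released with the reference
    Each SNP is centered and scaled with the reference frequencies,
    so every cohort lands in the same PC coordinate system.
    """
    sd = np.sqrt(2.0 * ref_freq * (1.0 - ref_freq))
    Z = (G - 2.0 * ref_freq) / sd
    return Z @ loadings
```

Each cohort in a meta-analysis can then correct for the same, shared axes of structure instead of cohort-specific ones.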

