Efficient toolkit implementing best practices for principal component analysis of population genetic data

2020 ◽  
Vol 36 (16) ◽  
pp. 4449-4457 ◽  
Author(s):  
Florian Privé ◽  
Keurcien Luu ◽  
Michael G B Blum ◽  
John J McGrath ◽  
Bjarni J Vilhjálmsson

ABSTRACT
Motivation: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in the R packages bigsnpr and bigutilsr.
Results: For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.
Availability and implementation: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code.
Supplementary information: Supplementary data are available at Bioinformatics online.
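The projection pitfall described above can be made concrete with a toy sketch of "simple projection": a new sample's score on each PC is the dot product of its centered genotype vector with the reference loadings. This is not bigsnpr's corrected projection, and all loadings, means and genotypes below are invented toy numbers purely for illustration.

```python
# Toy sketch: "simple projection" of a new sample onto reference PC loadings.
# loadings[k][j] = weight of SNP j on PC k, computed on a reference panel.
# In practice these come from software such as bigsnpr; here they are
# made-up toy values.

loadings = [
    [0.6, 0.2, -0.5, 0.6],   # PC1 loadings over 4 SNPs (toy)
    [0.1, -0.7, 0.4, 0.5],   # PC2 loadings (toy)
]
ref_means = [1.0, 0.5, 1.5, 0.2]  # per-SNP allele-count means in the reference

def project(genotypes):
    """Center a genotype vector and project it onto each PC."""
    centered = [g - m for g, m in zip(genotypes, ref_means)]
    return [sum(c * w for c, w in zip(centered, pc)) for pc in loadings]

new_sample = [2, 0, 1, 1]  # allele counts at the 4 SNPs
scores = project(new_sample)
print(scores)
# Shrinkage bias: for higher PCs these projected scores are systematically
# pulled toward 0 relative to the samples used to fit the PCA, which is why
# unbiased methods rescale or re-solve the projection.
```

The abstract's point is that this naive dot product is cheap but increasingly biased for higher PCs (PC5 and beyond on the 1000 Genomes data), which bigsnpr corrects efficiently.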


2020 ◽  
Author(s):  
Florian Privé

Abstract
Here we propose a simple, robust and effective method for global ancestry inference and grouping from Principal Component Analysis (PCA) of genetic data. The proposed approach is particularly useful for methods that need to be applied in homogeneous samples. First, we show that Euclidean distances in the PCA space are proportional to FST between populations. Then, we show how to use this PCA-based distance to infer ancestry in the UK Biobank and the POPRES datasets. We propose two solutions, either relying on projection of PCs to reference populations such as those from the 1000 Genomes Project, or directly using the internal data. Finally, we conclude that our method and the community would benefit from easy access to a reference dataset with an even better coverage of worldwide genetic diversity than the 1000 Genomes Project.
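The grouping idea in this abstract, assigning an individual to the closest population center in PC space, can be sketched in a few lines. The centers and distance threshold below are invented toy values, not the paper's calibrated ones.

```python
import math

# Toy sketch: assign an individual to the nearest population center in PC
# space, mirroring the idea that Euclidean distance on PCs tracks FST.
# Population centers and the threshold are made-up toy values.

centers = {
    "POP_A": (10.0, 2.0),
    "POP_B": (-4.0, 8.0),
    "POP_C": (0.0, -6.0),
}
THRESHOLD = 6.0  # beyond this distance, leave the individual unassigned

def assign(pc_scores):
    """Return the nearest population label, or None if too far from all."""
    pop, d = min(
        ((name, math.dist(pc_scores, c)) for name, c in centers.items()),
        key=lambda t: t[1],
    )
    return pop if d <= THRESHOLD else None

print(assign((9.0, 3.0)))    # close to POP_A
print(assign((20.0, 20.0)))  # far from every center -> unassigned
```

Leaving distant individuals unassigned is what makes the approach usable for restricting analyses to homogeneous samples.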


2019 ◽  
Author(s):  
Aman Agrawal ◽  
Alec M. Chiu ◽  
Minh Le ◽  
Eran Halperin ◽  
Sriram Sankararaman

Abstract
Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in less than thirty minutes. Leveraging the population structure inferred by ProPCA within the White British individuals in the UK Biobank, we scanned for SNPs that are not well-explained by the PCs to identify several novel genome-wide signals of recent putative selection, including missense mutations in RPGRIP1L and TLR4.
Author Summary
Principal component analysis is a commonly used technique for understanding population structure and genetic variation. With the advent of large-scale datasets that contain the genetic information of hundreds of thousands of individuals, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. In this study, we present ProPCA, a highly scalable statistical method to compute genetic PCs efficiently. We systematically evaluate the accuracy and robustness of our method on large-scale simulated data and apply it to the UK Biobank. Leveraging the population structure inferred by ProPCA within the White British individuals in the UK Biobank, we identify several novel signals of putative recent selection.
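ProPCA's EM updates for its probabilistic model are, in spirit, close to repeatedly multiplying a candidate direction by the data covariance. The following is a plain power-iteration sketch of that core idea on a tiny hand-made covariance matrix; it is not ProPCA itself, which operates on the genotype matrix directly at biobank scale.

```python
import math

# Toy sketch: power iteration for the leading eigenvector/eigenvalue of a
# small covariance matrix. Scalable PCA methods iterate an analogous
# multiply-and-renormalize step on the full genotype data.

cov = [
    [4.0, 1.0],
    [1.0, 3.0],
]

def power_iteration(mat, iters=200):
    v = [1.0, 0.0]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(2)) for i in range(2)]
        norm = math.hypot(*w)
        v = [x / norm for x in w]          # renormalize each step
    # Rayleigh quotient gives the matching eigenvalue.
    mv = [sum(mat[i][j] * v[j] for j in range(2)) for i in range(2)]
    eigval = sum(v[i] * mv[i] for i in range(2))
    return eigval, v

val, vec = power_iteration(cov)
print(round(val, 4))  # largest eigenvalue of cov
```

Convergence is geometric in the ratio of the top two eigenvalues, which is why such iterative schemes scale to hundreds of thousands of samples.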


2021 ◽  
Author(s):  
Andrei-Emil Constantinescu ◽  
Ruth E Mitchell ◽  
Jie Zheng ◽  
Caroline J Bull ◽  
Nicholas J Timpson ◽  
...  

The UK Biobank is a large prospective cohort, based in the United Kingdom, that has deep phenotypic and genomic data on roughly half a million individuals. Included in this resource are data on approximately 78,000 individuals with "non-white British ancestry". Whilst most epidemiology studies have focused predominantly on populations of European ancestry, there is an opportunity to contribute to the study of health and disease for a broader segment of the population by making use of the UK Biobank's "non-white British ancestry" samples. Here we present an empirical description of the continental ancestry and population structure among the individuals in this UK Biobank subset. Reference populations from the 1000 Genomes Project for Africa, Europe, East Asia, and South Asia were used to estimate ancestry for each individual. Those with at least 80% ancestry in one of these four continental ancestry groups were taken forward (N=62,484). Principal component and K-means clustering analyses were used to identify and characterize population structure within each ancestry group. Of the approximately 78,000 individuals in the UK Biobank that are of "non-white British" ancestry, 50,685, 6,653, 2,782, and 2,364 individuals were assigned to the European, African, South Asian, and East Asian continental ancestry groups, respectively. Each continental ancestry group exhibits prominent population structure that is consistent with self-reported country of birth data and geography. Methods outlined here provide an avenue to leverage UK Biobank's deeply phenotyped data, allowing researchers to maximise its potential in the study of health and disease in individuals of non-white British ancestry.
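The first filtering step described in this abstract, keeping only individuals whose estimated ancestry fraction in a single continental group reaches 80%, can be sketched directly. The ancestry fractions below are invented toy values, not UK Biobank estimates.

```python
# Toy sketch of the 80%-ancestry filter: keep individuals whose estimated
# fraction in one continental group is at least 0.8, and record that group.
# Fractions are made-up toy values.

ANCESTRIES = ("AFR", "EUR", "EAS", "SAS")

individuals = {
    "id1": (0.02, 0.95, 0.02, 0.01),
    "id2": (0.40, 0.35, 0.10, 0.15),   # admixed: no group reaches 80%
    "id3": (0.85, 0.05, 0.05, 0.05),
}

def assign_group(fractions, threshold=0.8):
    """Return the dominant ancestry label, or None below the threshold."""
    best = max(range(len(fractions)), key=lambda i: fractions[i])
    return ANCESTRIES[best] if fractions[best] >= threshold else None

kept = {iid: assign_group(f) for iid, f in individuals.items()}
kept = {iid: grp for iid, grp in kept.items() if grp is not None}
print(kept)  # id2 is dropped as admixed
```

Only the retained individuals would then go forward to the per-group PCA and K-means clustering the abstract describes.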


2008 ◽  
Vol 40 (5) ◽  
pp. 491-492 ◽  
Author(s):  
David Reich ◽  
Alkes L Price ◽  
Nick Patterson

2016 ◽  
Author(s):  
Keurcien Luu ◽  
Eric Bazin ◽  
Michael G. B. Blum

Abstract
The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. In contrast to population-based approaches, the package handles admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved both in terms of statistical approach and software implementation. We present results obtained with a robust Mahalanobis distance, which is a new statistic for genome scans available in the 2.0 and later versions of the package. When hierarchical population structure occurs, the Mahalanobis distance is more powerful than the communality statistic that was implemented in the first version of the package. Using simulated data, we compare pcadapt to other software for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is around a nominal false discovery rate set at 10%, with the exception of BayeScan, which generates 40% false discoveries. We also find that the power of BayeScan is severely impacted by the presence of admixed individuals whereas pcadapt is not. Finally, we find that pcadapt and hapflk are the most powerful software in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.
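The statistic in this abstract, a Mahalanobis-type distance flagging SNPs that relate unusually to the PCs, can be caricatured as follows. pcadapt uses a proper robust multivariate Mahalanobis distance; this simplified sketch uses per-axis median/MAD z-scores and assumes uncorrelated axes, so that with two PCs the chi-square survival function reduces to exp(-d²/2). All SNP scores are invented toy values.

```python
import math
from statistics import median

# Toy sketch of a Mahalanobis-style genome scan on two PCs. Real pcadapt
# computes a robust multivariate Mahalanobis distance; here we use a
# simplified per-axis robust z-score (median / 1.4826*MAD) and treat the
# axes as uncorrelated, giving d2 a chi-square(2 df) null whose survival
# function is simply exp(-d2 / 2).

snp_scores = {           # per-SNP scores on PC1 and PC2 (made-up values)
    "rs1": (0.1, -0.2),
    "rs2": (-0.3, 0.1),
    "rs3": (0.2, 0.2),
    "rs4": (-0.1, 0.0),
    "rs5": (0.0, -0.1),
    "rs6": (0.3, 0.1),
    "rs7": (4.0, 3.5),   # candidate under selection
}

cols = list(zip(*snp_scores.values()))
centers = [median(c) for c in cols]
scales = [1.4826 * median(abs(x - m) for x in c) for c, m in zip(cols, centers)]

def pvalue(scores):
    d2 = sum(((s - m) / sd) ** 2 for s, m, sd in zip(scores, centers, scales))
    return math.exp(-d2 / 2)  # chi-square survival function, 2 df

outliers = sorted(snp for snp, s in snp_scores.items() if pvalue(s) < 0.01)
print(outliers)
```

Robust center and scale estimates matter here: a plain mean/variance would be inflated by the very outliers the scan is trying to detect.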


2017 ◽  
Author(s):  
Kridsadakorn Chaichoompu ◽  
Fentaw Abegaz Yazew ◽  
Sissades Tongsima ◽  
Philip James Shaw ◽  
Anavaj Sakuntabhai ◽  
...  

Abstract
Background: Resolving population genetic structure is challenging, especially when dealing with closely related or geographically confined populations. Although Principal Component Analysis (PCA)-based methods and genomic variation with single nucleotide polymorphisms (SNPs) are widely used to describe shared genetic ancestry, improvements can be made, especially when fine-scale population structure is the target.
Results: This work presents an R package called IPCAPS, which uses SNP information for resolving possibly fine-scale population structure. The IPCAPS routines are built on the iterative pruning Principal Component Analysis (ipPCA) framework that systematically assigns individuals to genetically similar subgroups. In each iteration, our tool is able to detect and eliminate outliers, thereby avoiding severe misclassification errors.
Conclusions: IPCAPS supports different measurement scales for variables used to identify substructure. Hence, panels of gene expression and methylation data can be accommodated as well. The tool can also be applied in patient sub-phenotyping contexts. IPCAPS is developed in R and is freely available from bio3.giga.ulg.ac.be/ipcaps.
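The iterative-pruning recursion behind ipPCA/IPCAPS can be illustrated with a deliberately crude 1-D stand-in: repeatedly split a sample into two groups along a PC axis, stopping when a group looks homogeneous. Real IPCAPS recomputes the PCA inside every subgroup and uses a principled stopping statistic; here we split a fixed 1-D score at its widest gap, purely to show the recursion. All scores are invented.

```python
# Toy sketch of the iterative-pruning idea: recursively bisect individuals
# along a (here fixed, 1-D) PC score until groups look homogeneous.

def split_once(scores):
    """Split sorted scores at the largest gap; return halves and gap width."""
    s = sorted(scores)
    gaps = [(s[i + 1] - s[i], i) for i in range(len(s) - 1)]
    widest, i = max(gaps)
    return s[: i + 1], s[i + 1 :], widest

def iterative_prune(scores, min_gap=1.0, min_size=2):
    """Recursively bisect until the widest gap is small (homogeneous)."""
    if len(scores) < 2 * min_size:
        return [scores]
    left, right, widest = split_once(scores)
    if widest < min_gap:
        return [scores]  # homogeneous enough: stop recursing
    return iterative_prune(left, min_gap, min_size) + iterative_prune(
        right, min_gap, min_size
    )

# Three clear clusters on a toy 1-D PC axis:
pc1 = [0.0, 0.2, 0.1, 5.0, 5.3, 5.1, 11.0, 10.8, 11.2]
groups = iterative_prune(pc1)
print(groups)
```

The recursion terminates on its own once no subgroup contains further separable structure, which is the framework's way of deciding how many subpopulations exist.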


2019 ◽  
Author(s):  
Daiwei Zhang ◽  
Rounak Dey ◽  
Seunggeun Lee

Abstract
Population stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Simple projection (SP) based on principal component loadings and recently developed augmentation, decomposition and Procrustes transformation (ADP) methods, such as LASER and TRACE, are popular approaches for predicting PC scores. However, they are either biased or computationally expensive. The predicted PC scores from SP can be biased toward the null. On the other hand, since ADP requires running PCA separately for each study sample on the augmented dataset, its computational cost is high. To address these problems, we develop and propose two alternative approaches, bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses computationally efficient online singular value decomposition, which can greatly reduce the computational cost of ADP. We carried out extensive simulation studies to show that these alternative approaches are unbiased and that their computation times can be 10-100 times faster than ADP. We applied our approaches to UK Biobank data of 488,366 study samples with 2,492 samples from the 1000 Genomes data as the reference. AP and OADP required 7 and 75 CPU hours, respectively, while the projected computation time of ADP is 2,534 CPU hours. Furthermore, when we only used the European reference samples in the 1000 Genomes to infer sub-European ancestry, SP clearly showed bias, unlike the proposed approaches. By using AP and OADP, we can infer ancestry and adjust for PS robustly and efficiently.
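The contrast between SP and AP can be caricatured as a per-PC rescaling: AP estimates, via random matrix theory, how much each projected score shrinks and divides that factor out. The loadings, genotypes and shrinkage factors below are all made-up placeholders, not the paper's estimator.

```python
# Caricature of simple projection (SP) vs bias-adjusted projection (AP).
# SP computes scores as genotype . loading; AP divides each PC score by an
# estimated per-PC shrinkage factor. All numbers here are toy values; the
# real factors come from the random-matrix-theory estimator in the paper.

loadings = [
    [0.5, 0.5, 0.5, 0.5],    # PC1 loadings over 4 SNPs (toy)
    [0.5, -0.5, 0.5, -0.5],  # PC2 loadings (toy)
]
shrinkage = [0.98, 0.60]     # toy factors: higher PCs shrink more

def simple_projection(genotypes):
    return [sum(g * w for g, w in zip(genotypes, pc)) for pc in loadings]

def adjusted_projection(genotypes):
    return [s / f for s, f in zip(simple_projection(genotypes), shrinkage)]

g = [1.0, -1.0, 2.0, 0.0]    # centered genotype vector (toy)
sp = simple_projection(g)
ap = adjusted_projection(g)
print(sp, ap)
```

The point of the caricature: SP's bias grows on the later PCs (small shrinkage factors), which is exactly where sub-continental ancestry signal lives, hence the sub-European bias the abstract reports.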


2020 ◽  
Author(s):  
Kristiina Ausmees ◽  
Carl Nettelblad

ABSTRACT
Dimensionality reduction is a data transformation technique widely used in various fields of genomics research, with principal component analysis one of the most frequently employed methods. Application of principal component analysis to genotype data is known to capture genetic similarity between individuals, and is used for visualization of genetic variation, identification of population structure as well as ancestry mapping. However, the method is based on a linear model that is sensitive to characteristics of data such as correlation of single-nucleotide polymorphisms due to linkage disequilibrium, resulting in limitations in its ability to capture complex population structure.
Deep learning models are a type of nonlinear machine learning method in which the features used in data transformation are decided by the model in a data-driven manner, rather than by the researcher, and have been shown to present a promising alternative to traditional statistical methods for various applications in omics research. In this paper, we propose a deep learning model based on a convolutional autoencoder architecture for dimensionality reduction of genotype data.
Using a highly diverse cohort of human samples, we demonstrate that the model can identify population clusters and provide richer visual information in comparison to principal component analysis, and also yield a more accurate population classification model. We also discuss the use of the methodology for more general characterization of genotype data, showing that models of a similar architecture can be used as a genetic clustering method, comparing results to the ADMIXTURE software frequently used in population genetic studies.
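The autoencoder idea, compress through a bottleneck and reconstruct, can be shown at miniature scale. The model above is a convolutional autoencoder over genotype matrices; the tied-weight linear toy below, trained by plain gradient descent on invented 2-D data, merely recovers a PCA-like subspace, which is precisely why the nonlinear version can capture structure PCA cannot.

```python
import random

# Minimal sketch of an autoencoder: compress 2-D points through a 1-D
# bottleneck with tied weights (z = w.x, x_hat = z*w) and minimize the
# mean squared reconstruction error by full-batch gradient descent.
# All data are synthetic toy values.

random.seed(0)
# Correlated 2-D toy data (stand-ins for two SNP dosages).
data = [(x, 0.8 * x + random.gauss(0, 0.1)) for x in
        [random.gauss(0, 1) for _ in range(200)]]

w = [0.5, 0.1]   # tied encoder/decoder weights
lr = 0.01

def loss():
    total = 0.0
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]
        total += (x[0] - z * w[0]) ** 2 + (x[1] - z * w[1]) ** 2
    return total / len(data)

before = loss()
for _ in range(300):
    g = [0.0, 0.0]
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]
        r = [x[0] - z * w[0], x[1] - z * w[1]]  # reconstruction residual
        # gradient of ||r||^2 w.r.t. the tied weights:
        for i in range(2):
            g[i] += -2 * r[i] * z + sum(-2 * r[j] * w[j] for j in range(2)) * x[i]
    w = [wi - lr * gi / len(data) for wi, gi in zip(w, g)]
after = loss()
print(round(before, 3), round(after, 3))  # reconstruction error drops
```

With a linear bottleneck the learned direction aligns with the leading principal component; adding nonlinear layers and convolutions, as the paper does, is what lets the bottleneck encode structure a linear PCA misses.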

