Ancestry inference and grouping from principal component analysis of genetic data

10.1101/2020.10.06.328203 ◽

2020 ◽

Author(s):

Florian Privé

Keyword(s):

Principal Component Analysis ◽

Principal Component ◽

Genetic Data ◽

Component Analysis ◽

Easy Access ◽

1000 Genomes Project ◽

1000 Genomes ◽

Euclidean Distances ◽

Ancestry Inference ◽

The UK

Abstract: Here we propose a simple, robust and effective method for global ancestry inference and grouping from Principal Component Analysis (PCA) of genetic data. The proposed approach is particularly useful for methods that need to be applied in homogeneous samples. First, we show that Euclidean distances in the PCA space are proportional to FST between populations. Then, we show how to use this PCA-based distance to infer ancestry in the UK Biobank and the POPRES datasets. We propose two solutions, either relying on projection of PCs to reference populations such as from the 1000 Genomes Project, or directly using the internal data. Finally, we conclude that our method and the community would benefit from having easy access to a reference dataset with even better coverage of worldwide genetic diversity than the 1000 Genomes Project.
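
The distance-based grouping described in this abstract can be illustrated with a minimal sketch. It assumes PC scores are already available for both reference and study samples, and it assigns each study sample to the reference group whose centroid is nearest in squared Euclidean distance. The function name `assign_ancestry`, the `max_sq_dist` cutoff and the toy labels are hypothetical, chosen for illustration; this is not the author's implementation.

```python
import numpy as np

def assign_ancestry(pc_scores, ref_scores, ref_labels, max_sq_dist=None):
    """Assign each sample to the nearest reference-group centroid in PC space.

    pc_scores:   (n_samples, n_pcs) PC scores of the samples to classify
    ref_scores:  (n_ref, n_pcs) PC scores of the reference samples
    ref_labels:  length n_ref group label for each reference sample
    max_sq_dist: optional cutoff (hypothetical parameter); samples farther
                 than this from every centroid are left unassigned (None)
    """
    pc_scores = np.asarray(pc_scores, dtype=float)
    ref_scores = np.asarray(ref_scores, dtype=float)
    labels = np.asarray(ref_labels)
    groups = np.unique(labels)  # sorted unique group names

    # One centroid per reference group.
    centers = np.stack([ref_scores[labels == g].mean(axis=0) for g in groups])

    # Squared Euclidean distance from every sample to every centroid.
    diff = pc_scores[:, None, :] - centers[None, :, :]
    sq_dist = np.einsum("ijk,ijk->ij", diff, diff)

    nearest = sq_dist.argmin(axis=1)
    assigned = groups[nearest].astype(object)
    if max_sq_dist is not None:
        too_far = sq_dist[np.arange(len(nearest)), nearest] > max_sq_dist
        assigned[too_far] = None
    return assigned, sq_dist

# Toy usage with two made-up reference groups in a 2-PC space.
ref = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
ref_groups = ["EUR", "EUR", "EAS", "EAS"]
queries = [[0.2, 0.4], [9.0, 10.0], [5.0, 50.0]]
assigned, _ = assign_ancestry(queries, ref, ref_groups, max_sq_dist=25.0)
print(list(assigned))
```

Leaving distant samples unassigned rather than forcing a label is one way to keep the resulting groups homogeneous; the cutoff itself would need to be chosen for the data at hand.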

Efficient toolkit implementing best practices for principal component analysis of population genetic data

10.1101/841452 ◽

2019 ◽

Cited By ~ 2

Author(s):

Florian Privé ◽

Keurcien Luu ◽

Michael G.B. Blum ◽

John J. McGrath ◽

Bjarni J. Vilhjálmsson

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Best Practices ◽

Principal Component ◽

Genetic Data ◽

Component Analysis ◽

UK Biobank ◽

1000 Genomes Project ◽

1000 Genomes ◽

The UK

Abstract: Principal Component Analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (1) capturing Linkage Disequilibrium (LD) structure instead of population structure, (2) projected PCs that suffer from shrinkage bias, (3) detecting sample outliers, and (4) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. For example, we find that PC19 to PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.

Efficient toolkit implementing best practices for principal component analysis of population genetic data

Bioinformatics ◽

10.1093/bioinformatics/btaa520 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4449-4457 ◽

Cited By ~ 4

Author(s):

Florian Privé ◽

Keurcien Luu ◽

Michael G B Blum ◽

John J McGrath ◽

Bjarni J Vilhjálmsson

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Best Practices ◽

Principal Component ◽

Genetic Data ◽

UK Biobank ◽

1000 Genomes Project ◽

1000 Genomes ◽

R Packages ◽

The UK

Motivation: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. Results: For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data. Availability and implementation: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code. Supplementary information: Supplementary data are available at Bioinformatics online.
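
Restricting an analysis to individuals of homogeneous ancestry, as mentioned in this abstract, can be illustrated with a rough sketch. It assumes one generic approach: keep samples whose squared Mahalanobis distance to the cohort centre in PC space is below a cutoff. The function name `homogeneous_subset`, the cutoff value and the simulated scores are hypothetical. The sketch uses the plain mean and covariance, which outliers can distort, and it does not reproduce the bigsnpr or bigutilsr implementations.

```python
import numpy as np

def homogeneous_subset(pc_scores, max_sq_dist):
    """Flag samples close to the cohort centre in PC space.

    Returns a boolean mask (True = keep) and the squared Mahalanobis
    distance of each sample. Uses the ordinary mean and covariance,
    which is an assumption of this sketch; a robust estimator would be
    less sensitive to the very outliers it is meant to remove.
    """
    scores = np.asarray(pc_scores, dtype=float)
    center = scores.mean(axis=0)
    cov = np.cov(scores, rowvar=False)
    diff = scores - center
    # d_i^2 = diff_i^T cov^{-1} diff_i for every sample i.
    sq_dist = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return sq_dist <= max_sq_dist, sq_dist

# Toy usage: 200 simulated "main cluster" samples plus 5 distant ones.
rng = np.random.default_rng(1)
scores = np.vstack([rng.standard_normal((200, 2)), np.full((5, 2), 10.0)])
keep, sq_dist = homogeneous_subset(scores, max_sq_dist=9.21)
print(keep.sum(), "of", len(keep), "samples kept")
```

The cutoff 9.21 is roughly the 0.99 quantile of a chi-squared distribution with 2 degrees of freedom, which is a natural reference only if the main cluster is approximately Gaussian in the two PCs used.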

Principal component analysis reveals the 1000 Genomes Project does not sufficiently cover the human genetic diversity in Asia

Frontiers in Genetics ◽

10.3389/fgene.2013.00127 ◽

2013 ◽

Vol 4 ◽

Cited By ~ 18

Author(s):

Dongsheng Lu ◽

Shuhua Xu

Keyword(s):

Genetic Diversity ◽

Principal Component Analysis ◽

Principal Component ◽

Component Analysis ◽

1000 Genomes Project ◽

1000 Genomes

Principal component analysis of genetic data

Nature Genetics ◽

10.1038/ng0508-491 ◽

2008 ◽

Vol 40 (5) ◽

pp. 491-492 ◽

Cited By ~ 132

Author(s):

David Reich ◽

Alkes L Price ◽

Nick Patterson

Keyword(s):

Principal Component Analysis ◽

Principal Component ◽

Genetic Data ◽

Component Analysis

A survey of epicuticular waxes among genera of Triticeae. III. Synthesis and conclusion

Canadian Journal of Botany ◽

10.1139/b82-223 ◽

1982 ◽

Vol 60 (9) ◽

pp. 1761-1770 ◽

Cited By ~ 6

Author(s):

Bernard R. Baum ◽

A. P. Tulloch

Keyword(s):

Principal Component Analysis ◽

Correlation Matrix ◽

Principal Component ◽

Component Analysis ◽

Principal Coordinate Analysis ◽

Epicuticular Waxes ◽

Whole Plant ◽

Close Relationship ◽

Euclidean Distances ◽

Highly Correlated

Characteristics of ultrastructural morphology and chemical composition of epicuticular waxes on glumes of Triticeae were combined for two series of numerical taxonomic analyses. The first, incorporating within-genus variability, utilized frequencies and information radius. The information radius matrix was subjected to Jardine–Sibson Bk clustering, then transformed to Euclidean distances for distance Wagner and principal-coordinate analyses. The second series employed a table of average character values for each genus which was subjected to four ordinations: (i) principal-component analysis of the correlation matrix, (ii) principal-component analysis of the variance-covariance matrix, (iii) principal-coordinate analysis, and (iv) nonmetric multidimensional scaling. The results are compared and general inferences are drawn. Occurrence of wax filaments on the glumes was highly correlated with presence of appreciable amounts of β-diketones in wax from the whole plant. While some genera, such as Triticum and Aegilops, appeared less closely related than expected from classification based on morphology, this procedure has suggested relationships between other genera, such as Roegneria and Hordeum and Secale and Elymus. The genera Leymus, Elymus, and Aneurolepidium were also closely related to each other and more distantly to Elytrigia, Triticum, and Agropyron. A relatively close relationship was also shown between the seven genera, Crithopsis, Eremopyron, Heteranthelium, Hordelymus, Psathyrostachys, Sitanion, and Taeniatherum, which have waxes which do not contain any β-diketones.

Fast and robust ancestry prediction using principal component analysis

10.1101/713172 ◽

2019 ◽

Cited By ~ 1

Author(s):

Daiwei Zhang ◽

Rounak Dey ◽

Seunggeun Lee

Keyword(s):

Principal Component Analysis ◽

Matrix Theory ◽

Association Studies ◽

Principal Component ◽

Component Analysis ◽

European Ancestry ◽

Genome Wide Association Studies ◽

Data Set ◽

1000 Genomes ◽

Alternative Approaches

Abstract: Population stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Simple projection (SP) based on principal component loading and recently developed data augmentation-decomposition-transformation (ADP), such as LASER and TRACE, are popular methods for predicting PC scores. However, they are either biased or computationally expensive. The predicted PC scores from SP can be biased toward NULL. On the other hand, since ADP requires running PCA separately for each study sample on the augmented data set, its computational cost is high. To address these problems, we develop and propose two alternative approaches, bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses computationally efficient online singular value decomposition, which can greatly reduce the computation cost of ADP. We carried out extensive simulation studies to show that these alternative approaches are unbiased and the computation times can be 10-100 times faster than ADP. We applied our approaches to UK-Biobank data of 488,366 study samples with 2,492 samples from the 1000 Genomes data as the reference. AP and OADP required 7 and 75 CPU hours, respectively, while the projected computation time of ADP is 2,534 CPU hours. Furthermore, when we only used the European reference samples in the 1000 Genomes to infer sub-European ancestry, SP clearly showed bias, unlike the proposed approaches. By using AP and OADP, we can infer ancestry and adjust for PS robustly and efficiently.
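
The bias of simple projection toward zero can be seen in a small simulation. The sketch below assumes a toy setting that is not taken from the paper: two populations that differ by a small mean shift on each of 2,000 Gaussian "features", 100 reference samples and 200 new samples. It computes PC1 loadings from the reference set by SVD, projects both sets onto them, and compares the separation between the two populations. All sizes, the shift and the variable names are illustrative assumptions, and no bias correction such as AP or OADP is attempted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ref, n_new, n_features = 100, 200, 2000
shift = 4.0 / np.sqrt(n_features)  # per-feature mean shift (assumed value)

def simulate(n):
    """Simulate n samples, half from each of two populations (+1 / -1)."""
    pop = np.repeat([1.0, -1.0], n // 2)
    x = rng.standard_normal((n, n_features)) + pop[:, None] * shift
    return x, pop

x_ref, pop_ref = simulate(n_ref)
x_new, pop_new = simulate(n_new)

# PCA of the reference samples only: PC1 loadings from the SVD.
mean_ref = x_ref.mean(axis=0)
_, _, vt = np.linalg.svd(x_ref - mean_ref, full_matrices=False)
pc1_loading = vt[0]

# Simple projection: centre with the reference mean, multiply by loadings.
ref_pc1 = (x_ref - mean_ref) @ pc1_loading
new_pc1 = (x_new - mean_ref) @ pc1_loading

def separation(scores, pop):
    """Difference between the mean PC1 scores of the two populations."""
    return scores[pop > 0].mean() - scores[pop < 0].mean()

# A ratio below 1 means projected samples sit closer to zero than
# the reference samples the PCA was fitted on.
ratio = separation(new_pc1, pop_new) / separation(ref_pc1, pop_ref)
print(f"projected / reference separation on PC1: {ratio:.2f}")
```

Because the loadings are fitted to the noise of the reference samples as well as to the population signal, the reference scores are inflated relative to those of independent samples, which is why the ratio falls below 1 when there are many more features than reference samples.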

Detecting Genomic Signatures of Natural Selection with Principal Component Analysis: Application to the 1000 Genomes Data

Molecular Biology and Evolution ◽

10.1093/molbev/msv334 ◽

2015 ◽

Vol 33 (4) ◽

pp. 1082-1093 ◽

Cited By ~ 71

Author(s):

Nicolas Duforet-Frebourg ◽

Keurcien Luu ◽

Guillaume Laval ◽

Eric Bazin ◽

Michael G.B. Blum

Keyword(s):

Principal Component Analysis ◽

Natural Selection ◽

Principal Component ◽

Component Analysis ◽

1000 Genomes ◽

Genomic Signatures ◽

Analysis Application

Principal Component Analysis of the UK Term Structure & Its Application to Immunisation Strategies

SSRN Electronic Journal ◽

10.2139/ssrn.1678844 ◽

2010 ◽

Author(s):

Sadeeptha S. Jayathilaka

Keyword(s):

Principal Component Analysis ◽

Term Structure ◽

Principal Component ◽

Component Analysis ◽

The UK

Ancestry inference using principal component analysis and spatial analysis: a distance-based analysis to account for population substructure

BMC Genomics ◽

10.1186/s12864-017-4166-8 ◽

2017 ◽

Vol 18 (1) ◽

Cited By ~ 6

Author(s):

Jinyoung Byun ◽

Younghun Han ◽

Ivan P. Gorlov ◽

Jonathan A. Busam ◽

Michael F. Seldin ◽

...

Keyword(s):

Principal Component Analysis ◽

Spatial Analysis ◽

Principal Component ◽

Component Analysis ◽

Population Substructure ◽

Ancestry Inference

A German version of the Intermittent Claudication Questionnaire (ICQ): cultural adaptation and validation

VASA ◽

10.1024/0301-1526/a000218 ◽

2012 ◽

Vol 41 (5) ◽

pp. 333-342 ◽

Cited By ~ 3

Author(s):

Kirchberger ◽

Finger ◽

Müller-Bühl

Keyword(s):

Principal Component Analysis ◽

Intermittent Claudication ◽

Completion Time ◽

Short Form ◽

Principal Component ◽

Component Analysis ◽

German Version ◽

Average Completion Time ◽

SF-36 ◽

Related Quality

Background: The Intermittent Claudication Questionnaire (ICQ) is a short questionnaire for the assessment of health-related quality of life (HRQOL) in patients with intermittent claudication (IC). The objective of this study was to translate the ICQ into German and to investigate the psychometric properties of the German ICQ version in patients with IC. Patients and methods: The original English version was translated using a forward-backward method. The resulting German version was reviewed by the author of the original version and an experienced clinician. Finally, it was tested for clarity with 5 German patients with IC. The German ICQ was administered to a sample of 81 patients. The sample consisted of 58.0 % male patients with a median age of 71 years and a median IC duration of 36 months. Tests of feasibility included completeness of questionnaires, completion time, and ratings of clarity, length and relevance. Reliability was assessed through a retest in 13 patients at 14 days, and analysis of Cronbach's alpha for internal consistency. Construct validity was investigated using principal component analysis. Concurrent validity was assessed by correlating the ICQ scores with the Short Form 36 Health Survey (SF-36) as well as clinical measures. Results: The ICQ was completely filled in by 73 subjects (90.1 %) with an average completion time of 6.3 minutes. Cronbach's alpha coefficient reached 0.75. Intra-class correlation for test-retest reliability was r = 0.88. Principal component analysis resulted in a 3 factor solution. The first factor explained 51.5 % of the total variation and all items had loadings of at least 0.65 on it. The ICQ was significantly associated with the SF-36 and treadmill-walking distances whereas no association was found for resting ABPI. Conclusions: The German version of the ICQ demonstrated good feasibility, satisfactory reliability and good validity. Responsiveness should be investigated in further validation studies.
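
The internal-consistency measure reported in this abstract, Cronbach's alpha, follows the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), where k is the number of items. The sketch below applies that formula to a respondents-by-items score matrix. The function name `cronbach_alpha` and the toy scores are illustrative and are not taken from the study.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                                   # number of items
    item_var_sum = x.var(axis=0, ddof=1).sum()       # sum of item variances
    total_var = x.sum(axis=1).var(ddof=1)            # variance of total score
    return k / (k - 1) * (1.0 - item_var_sum / total_var)

# Toy usage: 3 respondents answering 2 items (made-up scores).
toy = [[1, 2],
       [2, 1],
       [3, 3]]
print(round(cronbach_alpha(toy), 3))
```

For the toy matrix, each item has variance 1 and the total score has variance 3, so alpha = 2 * (1 - 2/3), about 0.667.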
