Fast and robust ancestry prediction using principal component analysis

Daiwei Zhang; Rounak Dey; Seunggeun Lee

doi:10.1093/bioinformatics/btaa152

Bioinformatics ◽

10.1093/bioinformatics/btaa152 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3439-3446 ◽

Cited By ~ 1

Author(s):

Daiwei Zhang ◽

Rounak Dey ◽

Seunggeun Lee

Keyword(s):

Principal Component Analysis ◽

Matrix Theory ◽

Association Studies ◽

Principal Component ◽

Component Analysis ◽

Supplementary Information ◽

European Ancestry ◽

Genome Wide Association Studies ◽

Computation Cost ◽

Alternative Approaches

Abstract

Motivation: Population stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false-positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Simple projection (SP) based on principal component loadings and the recently developed data augmentation, decomposition and Procrustes (ADP) transformation, such as LASER and TRACE, are popular methods for predicting PC scores. However, the predicted PC scores from SP can be biased toward NULL. On the other hand, ADP has a high computation cost because it requires running PCA separately for each study sample on the augmented dataset.

Results: We develop and propose two alternative approaches: bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses a computationally efficient online singular value decomposition algorithm, which can greatly reduce the computation cost of ADP. We carried out extensive simulation studies to show that these alternative approaches are unbiased and the computation speed can be 16–16 000 times faster than ADP. We applied our approaches to the UK Biobank data of 488 366 study samples with 2492 samples from the 1000 Genomes data as the reference. AP and OADP required 0.82 and 21 CPU hours, respectively, while the projected computation time of ADP was 1628 CPU hours. Furthermore, when inferring sub-European ancestry, SP clearly showed bias, unlike the proposed approaches.

Availability and implementation: The OADP and AP methods, as well as SP and ADP, have been implemented in the open-source Python software FRAPOSA, available at github.com/daviddaiweizhang/fraposa.

Supplementary information: Supplementary data are available at Bioinformatics online.
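The bias of simple projection mentioned in the abstract above can be illustrated with a small simulation. The following is a minimal sketch, not the FRAPOSA implementation: it assumes Gaussian features instead of genotypes, two hypothetical populations whose feature means differ by a small shift, and a leading PC obtained by plain power iteration. All function names and parameter values are illustrative assumptions.

```python
import random

def column_means(X):
    """Per-feature means of a row-major matrix (list of rows)."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]

def center(X, means):
    """Subtract the given feature means from every row."""
    return [[x - m for x, m in zip(row, means)] for row in X]

def top_loading(Xc, iters=100, seed=0):
    """Leading PC loading (unit vector) via power iteration on Xc^T Xc."""
    rng = random.Random(seed)
    p = len(Xc[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(row, v)) for row in Xc]  # Xc v
        w = [0.0] * p
        for s, row in zip(scores, Xc):                               # Xc^T (Xc v)
            for j, a in enumerate(row):
                w[j] += s * a
        norm = sum(a * a for a in w) ** 0.5
        v = [a / norm for a in w]
    return v

def simple_projection(sample, means, loading):
    """SP: center with the reference means, then project on the reference loading."""
    return sum((g - m) * l for g, m, l in zip(sample, means, loading))

def simulate(n_per_pop, p, shift, rng):
    """Two populations with feature means +shift and -shift (an assumption)."""
    rows = []
    for sign in (1.0, -1.0):
        for _ in range(n_per_pop):
            rows.append([sign * shift + rng.gauss(0.0, 1.0) for _ in range(p)])
    return rows

rng = random.Random(1)
P, SHIFT = 400, 0.2                      # many features, few reference samples
reference = simulate(10, P, SHIFT, rng)  # 20 reference samples
study = simulate(50, P, SHIFT, rng)      # 100 independent study samples

means = column_means(reference)
loading = top_loading(center(reference, means))

ref_scores = [simple_projection(r, means, loading) for r in reference]
study_scores = [simple_projection(s, means, loading) for s in study]

# Compare the first population's mean PC1 score in both groups.
ref_mean_pop1 = sum(ref_scores[:10]) / 10
study_mean_pop1 = sum(study_scores[:50]) / 50
shrinkage_ratio = study_mean_pop1 / ref_mean_pop1
print("reference mean PC1 score:", ref_mean_pop1)
print("study mean PC1 score:    ", study_mean_pop1)
print("ratio (below 1 indicates shrinkage):", shrinkage_ratio)
```

Because the number of features far exceeds the number of reference samples in this setup, the reference scores absorb each sample's own noise while the study scores do not, so the study samples land closer to the origin than reference samples from the same population. AP, as described in the abstract, estimates and corrects this kind of bias; the correction itself is not sketched here.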

Fast and robust ancestry prediction using principal component analysis

10.1101/713172 ◽

2019 ◽

Cited By ~ 1

Author(s):

Daiwei Zhang ◽

Rounak Dey ◽

Seunggeun Lee

Keyword(s):

Principal Component Analysis ◽

Matrix Theory ◽

Association Studies ◽

Principal Component ◽

Component Analysis ◽

European Ancestry ◽

Genome Wide Association Studies ◽

Data Set ◽

1000 Genomes ◽

Alternative Approaches

Abstract

Population stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Simple projection (SP) based on principal component loadings and the recently developed data augmentation-decomposition-transformation (ADP), such as LASER and TRACE, are popular methods for predicting PC scores. However, they are either biased or computationally expensive. The predicted PC scores from SP can be biased toward NULL. On the other hand, since ADP requires running PCA separately for each study sample on the augmented data set, its computational cost is high. To address these problems, we develop and propose two alternative approaches, bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses computationally efficient online singular value decomposition, which can greatly reduce the computation cost of ADP. We carried out extensive simulation studies to show that these alternative approaches are unbiased and the computation times can be 10-100 times faster than ADP. We applied our approaches to UK-Biobank data of 488,366 study samples with 2,492 samples from the 1000 Genomes data as the reference. AP and OADP required 7 and 75 CPU hours, respectively, while the projected computation time of ADP is 2,534 CPU hours. Furthermore, when we only used the European reference samples in the 1000 Genomes to infer sub-European ancestry, SP clearly showed bias, unlike the proposed approaches. By using AP and OADP, we can infer ancestry and adjust for PS robustly and efficiently.

Maximizing the Power of Principal-Component Analysis of Correlated Phenotypes in Genome-wide Association Studies

The American Journal of Human Genetics ◽

10.1016/j.ajhg.2014.03.016 ◽

2014 ◽

Vol 94 (5) ◽

pp. 662-676 ◽

Cited By ~ 85

Author(s):

Hugues Aschard ◽

Bjarni J. Vilhjálmsson ◽

Nicolas Greliche ◽

Pierre-Emmanuel Morange ◽

David-Alexandre Trégouët ◽

...

Keyword(s):

Principal Component Analysis ◽

Association Studies ◽

Principal Component ◽

Component Analysis ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Sparse Principal Component Analysis for Identifying Ancestry-Informative Markers in Genome-Wide Association Studies

Genetic Epidemiology ◽

10.1002/gepi.21621 ◽

2012 ◽

Vol 36 (4) ◽

pp. 293-302 ◽

Cited By ~ 26

Author(s):

Seokho Lee ◽

Michael P. Epstein ◽

Richard Duncan ◽

Xihong Lin

Keyword(s):

Principal Component Analysis ◽

Association Studies ◽

Principal Component ◽

Component Analysis ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Ancestry Informative Markers ◽

Sparse Principal Component Analysis ◽

Genome Wide

Evaluation of methods for adjusting population stratification in genome‐wide association studies: Standard versus categorical principal component analysis

Annals of Human Genetics ◽

10.1111/ahg.12339 ◽

2019 ◽

Vol 83 (6) ◽

pp. 454-464

Author(s):

Asuman S. Turkmen ◽

Yuan Yuan ◽

Nedret Billor

Keyword(s):

Principal Component Analysis ◽

Population Stratification ◽

Association Studies ◽

Principal Component ◽

Component Analysis ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Categorical Principal Component Analysis ◽

Evaluation Of Methods

Supervised logistic principal component analysis for pathway based genome-wide association studies

Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine - BCB '12 ◽

10.1145/2382936.2382943 ◽

2012 ◽

Cited By ~ 3

Author(s):

Meng Lu ◽

Jianhua Z. Huang ◽

Xiaoning Qian

Keyword(s):

Principal Component Analysis ◽

Association Studies ◽

Principal Component ◽

Component Analysis ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Principal Component Analysis Characterizes Shared Pathogenetics from Genome-Wide Association Studies

PLoS Computational Biology ◽

10.1371/journal.pcbi.1003820 ◽

2014 ◽

Vol 10 (9) ◽

pp. e1003820 ◽

Cited By ~ 8

Author(s):

Diana Chang ◽

Alon Keinan

Keyword(s):

Principal Component Analysis ◽

Association Studies ◽

Principal Component ◽

Component Analysis ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Direct Phenotyping and Principal Component Analysis of Type Traits Implicate Novel QTL in Bovine Mastitis through Genome-Wide Association

Animals ◽

10.3390/ani11041147 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1147

Author(s):

Asha M. Miles ◽

Christian J. Posbergh ◽

Heather J. Huson

Keyword(s):

Principal Component Analysis ◽

Association Studies ◽

Principal Component ◽

Component Analysis ◽

Pleiotropic Effects ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Type Traits ◽

Genomic Regions

Our objectives were to robustly characterize a cohort of Holstein cows for udder and teat type traits and perform high-density genome-wide association studies for those traits within the same group of animals, thereby improving the accuracy of the phenotypic measurements and genomic association study. Additionally, we sought to identify a novel udder and teat trait composite risk index to determine loci with potential pleiotropic effects related to mastitis. This approach was aimed at improving the biological understanding of the genetic factors influencing mastitis. Cows (N = 471) were genotyped on the Illumina BovineHD777k beadchip and scored for front and rear teat length, width, end shape, and placement; fore udder attachment; udder cleft; udder depth; rear udder height; and rear udder width. We used principal component analysis to create a single composite measure describing type traits previously linked to high odds of developing mastitis within our cohort of cows. Genome-wide associations were performed, and 28 genomic regions were significantly associated (Bonferroni-corrected p < 0.05). Interrogation of these genomic regions revealed a number of biologically plausible genes that may contribute to the development of mastitis and whose functions range from regulating cell proliferation to immune system signaling, including ZNF683, DHX9, CUX1, TNNT1, and SPRY1. Genetic investigation of the risk composite trait implicated a novel locus and candidate genes that have potentially pleiotropic effects related to mastitis.

Tropical principal component analysis on the space of phylogenetic trees

Bioinformatics ◽

10.1093/bioinformatics/btaa564 ◽

2020 ◽

Vol 36 (17) ◽

pp. 4590-4598

Author(s):

Robert Page ◽

Ruriko Yoshida ◽

Leon Zhang

Keyword(s):

Machine Learning ◽

Principal Component Analysis ◽

Phylogenetic Trees ◽

Principal Component ◽

Component Analysis ◽

Fixed Number ◽

Supplementary Information ◽

Gene Trees ◽

Learning Methods ◽

Machine Learning Methods

Abstract

Motivation: Due to new technology for efficiently generating genome data, machine learning methods are urgently needed to analyze large sets of gene trees over the space of phylogenetic trees. However, the space of phylogenetic trees is not Euclidean, so ordinary machine learning methods cannot be directly applied. In 2019, Yoshida et al. introduced the notion of tropical principal component analysis (PCA), a statistical method for visualization and dimensionality reduction using a tropical polytope with a fixed number of vertices that minimizes the sum of tropical distances between each data point and its tropical projection. However, their work focused on the tropical projective space rather than the space of phylogenetic trees. We focus here on tropical PCA for dimension reduction and visualization over the space of phylogenetic trees.

Results: Our main results are 2-fold: (i) theoretical interpretations of the tropical principal components over the space of phylogenetic trees, namely, the existence of a tropical cell decomposition into regions of fixed tree topology; and (ii) the development of a stochastic optimization method to estimate tropical PCs over the space of phylogenetic trees using a Markov Chain Monte Carlo approach. This method performs well with simulation studies, and it is applied to three empirical datasets: Apicomplexa and African coelacanth genomes as well as sequences of hemagglutinin for influenza from New York.

Availability and implementation: Dataset: http://polytopes.net/Data.tar.gz. Code: http://polytopes.net/tropica_MCMC_codes.tar.gz.

Supplementary information: Supplementary data are available at Bioinformatics online.
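The objective described in this abstract, the sum of tropical distances between each data point and its projection onto a tropical polytope, can be sketched in a few lines. This is an illustrative sketch only, not the authors' MCMC code: it assumes the max-plus convention, the usual tropical metric (maximum minus minimum of the coordinate differences), and the standard closed-form projection onto a polytope given by its vertices. The function names and toy coordinates are hypothetical.

```python
def tropical_distance(u, v):
    """Tropical metric: max minus min of the coordinate-wise differences."""
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)

def tropical_projection(x, vertices):
    """Project x onto the max-plus tropical polytope spanned by `vertices`."""
    # lambda_l = min_i (x_i - v_{l,i}): the largest shift keeping v_l below x.
    lambdas = [min(a - b for a, b in zip(x, v)) for v in vertices]
    # Tropical linear combination: coordinate-wise max of the shifted vertices.
    return [max(lam + v[i] for lam, v in zip(lambdas, vertices))
            for i in range(len(x))]

def tropical_pca_objective(points, vertices):
    """Sum of tropical distances from each point to its projection."""
    return sum(tropical_distance(x, tropical_projection(x, vertices))
               for x in points)

# Toy data: a polytope with two vertices and three points in three coordinates.
vertices = [(0, 0, 0), (0, 2, 5)]
points = [(0, 3, 1), (0, 2, 5), (1, 1, 1)]
print(tropical_pca_objective(points, vertices))
```

A tropical PCA procedure would search over the vertex set to minimize this objective. The abstract describes a stochastic search restricted to the space of phylogenetic trees, which this sketch does not attempt.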

PopCluster: an algorithm to identify genetic variants with ethnicity-dependent effects

Bioinformatics ◽

10.1093/bioinformatics/btz017 ◽

2019 ◽

Vol 35 (17) ◽

pp. 3046-3054 ◽

Cited By ~ 2

Author(s):

Anastasia Gurinovich ◽

Harold Bae ◽

John J Farrell ◽

Stacy L Andersen ◽

Stefano Monti ◽

...

Keyword(s):

Genetic Variants ◽

Association Studies ◽

False Positive Rate ◽

Principal Component ◽

True Positive Rate ◽

Genome Wide Association ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Positive Rate

Abstract

Motivation: Over the last decade, more diverse populations have been included in genome-wide association studies. If a genetic variant has a varying effect on a phenotype in different populations, genome-wide association studies applied to a dataset as a whole may not pinpoint such differences. It is especially important to be able to identify population-specific effects of genetic variants in studies that would eventually lead to development of diagnostic tests or drug discovery.

Results: In this paper, we propose PopCluster: an algorithm to automatically discover subsets of individuals in which the genetic effects of a variant are statistically different. PopCluster provides a simple framework to directly analyze genotype data without prior knowledge of subjects’ ethnicities. PopCluster combines logistic regression modeling, principal component analysis, hierarchical clustering and a recursive bottom-up tree parsing procedure. The evaluation of PopCluster suggests that the algorithm has a stable low false positive rate (∼4%) and high true positive rate (>80%) in simulations with large differences in allele frequencies between cases and controls. Application of PopCluster to data from genetic studies of longevity discovers ethnicity-dependent heterogeneity in the association of rs3764814 (USP42) with the phenotype.

Availability and implementation: PopCluster was implemented using the R programming language, PLINK and Eigensoft software, and can be found at the following GitHub repository: https://github.com/gurinovich/PopCluster with instructions on its installation and usage.

Supplementary information: Supplementary data are available at Bioinformatics online.

TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes

Bioinformatics ◽

10.1093/bioinformatics/btz157 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3679-3683 ◽

Cited By ~ 8

Author(s):

Aritra Bose ◽

Vassilis Kalantzis ◽

Eugenia-Maria Kontopoulou ◽

Mai Elkady ◽

Peristera Paschou ◽

...

Keyword(s):

Principal Component Analysis ◽

Large Scale ◽

Human Genetics ◽

Random Access ◽

Principal Component ◽

Component Analysis ◽

Supplementary Information ◽

Subspace Iteration ◽

System Memory ◽

Traditional Approaches

Abstract

Motivation: Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets become increasingly larger in size, traditional approaches based on loading the entire dataset in the system memory (Random Access Memory) become impractical and out-of-core implementations are the only viable alternative.

Results: We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able to successfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and only requires a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task.

Availability and implementation: Source code and documentation are both available at https://github.com/aritra90/TeraPCA.

Supplementary information: Supplementary data are available at Bioinformatics online.
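The general idea behind randomized subspace iteration, which this abstract names as the basis of TeraPCA, can be sketched as follows. This is a small in-memory illustration in pure Python under simplifying assumptions (no oversampling, Gram-Schmidt in place of a library QR, a tiny hypothetical test matrix); it is not TeraPCA's C++ code and says nothing about its out-of-core blocking.

```python
import random

def transpose(A):
    """Transpose of a row-major matrix."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Dense matrix product of row-major matrices."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def orthonormalize(Q):
    """Modified Gram-Schmidt on the columns of Q."""
    basis = []
    for c in (list(col) for col in zip(*Q)):
        for b in basis:
            dot = sum(x * y for x, y in zip(c, b))
            c = [x - dot * y for x, y in zip(c, b)]
        norm = sum(x * x for x in c) ** 0.5
        basis.append([x / norm for x in c])
    return transpose(basis)

def randomized_subspace_iteration(A, k, iters=50, seed=0):
    """Approximate the top-k right singular vectors and singular values of A."""
    rng = random.Random(seed)
    p = len(A[0])
    At = transpose(A)
    # Start from a random p x k block with orthonormal columns.
    Q = orthonormalize([[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(p)])
    for _ in range(iters):
        # Multiply by A^T A, then re-orthonormalize to keep the block stable.
        Q = orthonormalize(matmul(At, matmul(A, Q)))
    AQ = matmul(A, Q)
    singular_values = [sum(row[j] ** 2 for row in AQ) ** 0.5 for j in range(k)]
    return Q, singular_values

# Hypothetical 4 x 3 matrix whose singular values are 3, 2 and 1.
A = [[3.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
Q, sv = randomized_subspace_iteration(A, k=2)
print(sv)
```

Each pass touches the data only through the products with A and its transpose. That access pattern is what allows a method of this kind to stream blocks of a genotype matrix from disk instead of holding the whole matrix in memory.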
