scholarly journals Fast and robust ancestry prediction using principal component analysis

2020 ◽  
Vol 36 (11) ◽  
pp. 3439-3446 ◽  
Author(s):  
Daiwei Zhang ◽  
Rounak Dey ◽  
Seunggeun Lee

Abstract Motivation Population stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false-positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Simple projection (SP) based on principal component loadings and the recently developed data augmentation, decomposition and Procrustes (ADP) transformation, such as LASER and TRACE, are popular methods for predicting PC scores. However, the predicted PC scores from SP can be biased toward NULL. On the other hand, ADP has a high computation cost because it requires running PCA separately for each study sample on the augmented dataset. Results We develop and propose two alternative approaches: bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses a computationally efficient online singular value decomposition algorithm, which can greatly reduce the computation cost of ADP. We carried out extensive simulation studies to show that these alternative approaches are unbiased and the computation speed can be 16–16 000 times faster than ADP. We applied our approaches to the UK Biobank data of 488 366 study samples with 2492 samples from the 1000 Genomes data as the reference. AP and OADP required 0.82 and 21 CPU hours, respectively, while the projected computation time of ADP was 1628 CPU hours. Furthermore, when inferring sub-European ancestry, SP clearly showed bias, unlike the proposed approaches. Availability and implementation The OADP and AP methods, as well as SP and ADP, have been implemented in the open-source Python software FRAPOSA, available at github.com/daviddaiweizhang/fraposa. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Author(s):  
Daiwei Zhang ◽  
Rounak Dey ◽  
Seunggeun Lee

AbstractPopulation stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Simple projection (SP) based on principal component loading and recently developed data augmentation-decomposition-transformation (ADP), such as LASER and TRACE, are popular methods for predicting PC scores. However, they are either biased or computationally expensive. The predicted PC scores from SP can be biased toward NULL. On the other hand, since ADP requires running PCA separately for each study sample on the augmented data set, its computational cost is high. To address these problems, we develop and propose two alternative approaches, bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses computationally efficient online singular value decomposition, which can greatly reduce the computation cost of ADP. We carried out extensive simulation studies to show that these alternative approaches are unbiased and the computation times can be 10-100 times faster than ADP. We applied our approaches to UK-Biobank data of 488,366 study samples with 2,492 samples from the 1000 Genomes data as the reference. AP and OADP required 7 and 75 CPU hours, respectively, while the projected computation time of ADP is 2,534 CPU hours. Furthermore, when we only used the European reference samples in the 1000 Genomes to infer sub-European ancestry, SP clearly showed bias, unlike the proposed approaches. By using AP and OADP, we can infer ancestry and adjust for PS robustly and efficiently.


2014 ◽  
Vol 94 (5) ◽  
pp. 662-676 ◽  
Author(s):  
Hugues Aschard ◽  
Bjarni J. Vilhjálmsson ◽  
Nicolas Greliche ◽  
Pierre-Emmanuel Morange ◽  
David-Alexandre Trégouët ◽  
...  

Animals ◽  
2021 ◽  
Vol 11 (4) ◽  
pp. 1147
Author(s):  
Asha M. Miles ◽  
Christian J. Posbergh ◽  
Heather J. Huson

Our objectives were to robustly characterize a cohort of Holstein cows for udder and teat type traits and perform high-density genome-wide association studies for those traits within the same group of animals, thereby improving the accuracy of the phenotypic measurements and genomic association study. Additionally, we sought to identify a novel udder and teat trait composite risk index to determine loci with potential pleiotropic effects related to mastitis. This approach was aimed at improving the biological understanding of the genetic factors influencing mastitis. Cows (N = 471) were genotyped on the Illumina BovineHD777k beadchip and scored for front and rear teat length, width, end shape, and placement; fore udder attachment; udder cleft; udder depth; rear udder height; and rear udder width. We used principal component analysis to create a single composite measure describing type traits previously linked to high odds of developing mastitis within our cohort of cows. Genome-wide associations were performed, and 28 genomic regions were significantly associated (Bonferroni-corrected p < 0.05). Interrogation of these genomic regions revealed a number of biologically plausible genes whicht may contribute to the development of mastitis and whose functions range from regulating cell proliferation to immune system signaling, including ZNF683, DHX9, CUX1, TNNT1, and SPRY1. Genetic investigation of the risk composite trait implicated a novel locus and candidate genes that have potentially pleiotropic effects related to mastitis.


2020 ◽  
Vol 36 (17) ◽  
pp. 4590-4598
Author(s):  
Robert Page ◽  
Ruriko Yoshida ◽  
Leon Zhang

Abstract Motivation Due to new technology for efficiently generating genome data, machine learning methods are urgently needed to analyze large sets of gene trees over the space of phylogenetic trees. However, the space of phylogenetic trees is not Euclidean, so ordinary machine learning methods cannot be directly applied. In 2019, Yoshida et al. introduced the notion of tropical principal component analysis (PCA), a statistical method for visualization and dimensionality reduction using a tropical polytope with a fixed number of vertices that minimizes the sum of tropical distances between each data point and its tropical projection. However, their work focused on the tropical projective space rather than the space of phylogenetic trees. We focus here on tropical PCA for dimension reduction and visualization over the space of phylogenetic trees. Results Our main results are 2-fold: (i) theoretical interpretations of the tropical principal components over the space of phylogenetic trees, namely, the existence of a tropical cell decomposition into regions of fixed tree topology; and (ii) the development of a stochastic optimization method to estimate tropical PCs over the space of phylogenetic trees using a Markov Chain Monte Carlo approach. This method performs well with simulation studies, and it is applied to three empirical datasets: Apicomplexa and African coelacanth genomes as well as sequences of hemagglutinin for influenza from New York. Availability and implementation Dataset: http://polytopes.net/Data.tar.gz. Code: http://polytopes.net/tropica_MCMC_codes.tar.gz. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (17) ◽  
pp. 3046-3054 ◽  
Author(s):  
Anastasia Gurinovich ◽  
Harold Bae ◽  
John J Farrell ◽  
Stacy L Andersen ◽  
Stefano Monti ◽  
...  

Abstract Motivation Over the last decade, more diverse populations have been included in genome-wide association studies. If a genetic variant has a varying effect on a phenotype in different populations, genome-wide association studies applied to a dataset as a whole may not pinpoint such differences. It is especially important to be able to identify population-specific effects of genetic variants in studies that would eventually lead to development of diagnostic tests or drug discovery. Results In this paper, we propose PopCluster: an algorithm to automatically discover subsets of individuals in which the genetic effects of a variant are statistically different. PopCluster provides a simple framework to directly analyze genotype data without prior knowledge of subjects’ ethnicities. PopCluster combines logistic regression modeling, principal component analysis, hierarchical clustering and a recursive bottom-up tree parsing procedure. The evaluation of PopCluster suggests that the algorithm has a stable low false positive rate (∼4%) and high true positive rate (>80%) in simulations with large differences in allele frequencies between cases and controls. Application of PopCluster to data from genetic studies of longevity discovers ethnicity-dependent heterogeneity in the association of rs3764814 (USP42) with the phenotype. Availability and implementation PopCluster was implemented using the R programming language, PLINK and Eigensoft software, and can be found at the following GitHub repository: https://github.com/gurinovich/PopCluster with instructions on its installation and usage. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (19) ◽  
pp. 3679-3683 ◽  
Author(s):  
Aritra Bose ◽  
Vassilis Kalantzis ◽  
Eugenia-Maria Kontopoulou ◽  
Mai Elkady ◽  
Peristera Paschou ◽  
...  

Abstract Motivation Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets become increasingly larger in size, traditional approaches based on loading the entire dataset in the system memory (Random Access Memory) become impractical and out-of-core implementations are the only viable alternative. Results We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able to successfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and only requires a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires &lt;5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task. Availability and implementation Source code and documentation are both available at https://github.com/aritra90/TeraPCA. Supplementary information Supplementary data are available at Bioinformatics online.


Sign in / Sign up

Export Citation Format

Share Document