Modelling complex population structure using F-statistics and Principal Component Analysis

Mapping Intimacies ◽

10.1101/2021.07.13.452141 ◽

2021 ◽

Author(s):

Benjamin Marco Peter

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Principal Component ◽

Component Analysis ◽

Human Genetic Variation ◽

Human Diversity ◽

Orthogonal Projections ◽

Complex Population ◽

Discrete Populations ◽

F Statistics

Human genetic diversity is shaped by our complex history. Population genetic tools to understand this variation can broadly be classified into data-driven methods such as Principal Component Analysis (PCA), and model-based approaches such as F-statistics. Here, I show that these two perspectives are closely related, and I derive explicit connections between the two approaches. I show that F-statistics have a simple geometrical interpretation in the context of PCA, and that orthogonal projections are the key concept to establish this link. I illustrate my results on two examples, one of local, and one of global human diversity. In both examples, I find that population structure is sparse, and only a few components contribute to most statistics. Based on these results, I develop novel visualizations that allow for investigating specific hypotheses and checking the assumptions of more sophisticated models. My results extend F-statistics to non-discrete populations, moving towards more complete and less biased descriptions of human genetic variation.
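The geometric reading mentioned in the abstract can be illustrated with a small sketch. It assumes the common textbook definitions in which F2 is a mean squared difference of allele frequencies and F3 and F4 are mean inner products of frequency differences; it uses naive estimators without any sampling-bias correction, and the toy frequencies and helper names (`f2`, `f3`, `f4`, `rotate`) are hypothetical. It is not the paper's implementation. Because these quantities are distances and inner products, an orthogonal change of basis, of which the PCA rotation is one example, leaves them unchanged.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def f2(a, b):
    """Mean squared allele-frequency difference (naive, no bias correction)."""
    d = sub(a, b)
    return dot(d, d) / len(a)

def f3(c, a, b):
    """Mean inner product of (a - c) and (b - c)."""
    return dot(sub(a, c), sub(b, c)) / len(c)

def f4(a, b, c, d):
    """Mean inner product of (a - b) and (c - d); zero means orthogonal."""
    return dot(sub(a, b), sub(c, d)) / len(a)

def rotate(v, i, j, theta):
    """Rotate coordinates i and j of v by theta (an orthogonal map)."""
    w = list(v)
    w[i] = math.cos(theta) * v[i] - math.sin(theta) * v[j]
    w[j] = math.sin(theta) * v[i] + math.cos(theta) * v[j]
    return w

# Hypothetical allele frequencies of four populations at four SNPs.
A = [0.10, 0.50, 0.90, 0.30]
B = [0.20, 0.40, 0.70, 0.35]
C = [0.15, 0.60, 0.80, 0.10]
D = [0.30, 0.55, 0.60, 0.25]

# Apply the same rotation to every population, as a change of basis would.
rA, rB, rC, rD = (rotate(p, 0, 2, 0.7) for p in (A, B, C, D))

print(f2(A, B), f2(rA, rB))              # equal up to rounding
print(f4(A, B, C, D), f4(rA, rB, rC, rD))  # equal up to rounding
```

The same definitions also give the identity 2·F3(C; A, B) = F2(C, A) + F2(C, B) − F2(A, B), which is the law of cosines written in terms of allele frequencies.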


Pcadapt: An R Package to Perform Genome Scans for Selection Based on Principal Component Analysis

10.1101/056135 ◽

2016 ◽

Cited By ~ 6

Author(s):

Keurcien Luu ◽

Eric Bazin ◽

Michael G. B. Blum

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Mahalanobis Distance ◽

Principal Component ◽

R Package ◽

Component Analysis ◽

Population Divergence ◽

Sequencing Data ◽

Genome Scans ◽

False Discoveries

Abstract: The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. In contrast to population-based approaches, the package handles admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved both in terms of statistical approach and software implementation. We present results obtained with robust Mahalanobis distance, which is a new statistic for genome scans available in the 2.0 and later versions of the package. When hierarchical population structure occurs, Mahalanobis distance is more powerful than the communality statistic that was implemented in the first version of the package. Using simulated data, we compare pcadapt to other software for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is around a nominal false discovery rate set at 10%, with the exception of BayeScan, which generates 40% of false discoveries. We also find that the power of BayeScan is severely impacted by the presence of admixed individuals whereas pcadapt is not impacted. Last, we find that pcadapt and hapflk are the most powerful software in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.
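As a rough illustration of the outlier idea, the sketch below computes a squared Mahalanobis distance for each marker from a pair of per-marker scores, one per principal component. It is a plain, non-robust version using the ordinary sample mean and covariance, whereas the abstract describes a robust variant; the input values and the function name `mahalanobis_sq` are hypothetical, and this is not the pcadapt API.

```python
def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean.

    Uses the ordinary (non-robust) sample covariance with an n - 1 divisor.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    out = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) S^{-1} (dx, dy)^T with the closed-form 2x2 inverse.
        out.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return out

# Hypothetical scores of six markers on two principal components;
# the last marker is strongly associated with both components.
scores = [(0.1, -0.2), (-0.3, 0.4), (0.5, 0.1),
          (-0.2, -0.5), (0.0, 0.3), (4.0, 3.5)]

d2 = mahalanobis_sq(scores)
candidate = max(range(len(d2)), key=d2.__getitem__)
print(candidate, d2)
```

With the non-robust covariance, an extreme marker inflates the very covariance it is measured against, which bounds how large its distance can become; this is one motivation for using robust estimates of the mean and covariance instead.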


IPCAPS: an R package for iterative pruning to capture population structure

10.1101/186874 ◽

2017 ◽

Cited By ~ 3

Author(s):

Kridsadakorn Chaichoompu ◽

Fentaw Abegaz Yazew ◽

Sissades Tongsima ◽

Philip James Shaw ◽

Anavaj Sakuntabhai ◽

...

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Principal Component ◽

R Package ◽

Component Analysis ◽

Genomic Variation ◽

Fine Scale ◽

Nucleotide Polymorphisms ◽

Measurement Scales ◽

Scale Population

Abstract: Background: Resolving population genetic structure is challenging, especially when dealing with closely related or geographically confined populations. Although Principal Component Analysis (PCA)-based methods and genomic variation with single nucleotide polymorphisms (SNPs) are widely used to describe shared genetic ancestry, improvements can be made, especially when fine-scale population structure is the target. Results: This work presents an R package called IPCAPS, which uses SNP information for resolving possibly fine-scale population structure. The IPCAPS routines are built on the iterative pruning Principal Component Analysis (ipPCA) framework that systematically assigns individuals to genetically similar subgroups. In each iteration, our tool is able to detect and eliminate outliers, thereby avoiding severe misclassification errors. Conclusions: IPCAPS supports different measurement scales for variables used to identify substructure. Hence, panels of gene expression and methylation data can be accommodated as well. The tool can also be applied in patient sub-phenotyping contexts. IPCAPS is developed in R and is freely available from bio3.giga.ulg.ac.be/ipcaps


A deep learning framework for characterization of genotype data

10.1101/2020.09.30.320994 ◽

2020 ◽

Author(s):

Kristiina Ausmees ◽

Carl Nettelblad

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Deep Learning ◽

Dimensionality Reduction ◽

Principal Component ◽

Data Transformation ◽

Component Analysis ◽

Classification Model ◽

Genotype Data

Abstract: Dimensionality reduction is a data transformation technique widely used in various fields of genomics research, with principal component analysis being one of the most frequently employed methods. Application of principal component analysis to genotype data is known to capture genetic similarity between individuals, and is used for visualization of genetic variation, identification of population structure as well as ancestry mapping. However, the method is based on a linear model that is sensitive to characteristics of data such as correlation of single-nucleotide polymorphisms due to linkage disequilibrium, resulting in limitations in its ability to capture complex population structure. Deep learning models are a type of nonlinear machine learning method in which the features used in data transformation are decided by the model in a data-driven manner, rather than by the researcher, and have been shown to present a promising alternative to traditional statistical methods for various applications in omics research. In this paper, we propose a deep learning model based on a convolutional autoencoder architecture for dimensionality reduction of genotype data. Using a highly diverse cohort of human samples, we demonstrate that the model can identify population clusters and provide richer visual information in comparison to principal component analysis, and also yield a more accurate population classification model. We also discuss the use of the methodology for more general characterization of genotype data, showing that models of a similar architecture can be used as a genetic clustering method, comparing results to the ADMIXTURE software frequently used in population genetic studies.


Haplotype and Population Structure Inference using Neural Networks in Whole-Genome Sequencing Data

10.1101/2020.12.28.424587 ◽

2020 ◽

Author(s):

Jonas Meisner ◽

Anders Albrechtsen

Keyword(s):

Neural Networks ◽

Principal Component Analysis ◽

Population Structure ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Principal Component ◽

Component Analysis ◽

Whole Genome ◽

Sequencing Data ◽

Latent Space

Abstract: Accurate inference of population structure is important in many studies of population genetics. In this paper, we present HaploNet, a novel method for performing dimensionality reduction and clustering in genetic data. The method is based on local clustering of phased haplotypes using neural networks from whole-genome sequencing or genotype data. By utilizing a Gaussian mixture prior in a variational autoencoder framework, we are able to learn a low-dimensional latent space in which we cluster haplotypes along the genome in a highly scalable manner. We demonstrate that we can use encodings of the latent space to infer global population structure using principal component analysis with haplotype information. Additionally, we derive an expectation-maximization algorithm for estimating ancestry proportions based on the haplotype clustering and the neural networks in a likelihood framework. Using different examples of sequencing data, we demonstrate that our approach is better at distinguishing closely related populations than standard principal component analysis and admixture analysis. We show that HaploNet performs similarly to ChromoPainter for principal component analysis while being much faster and allowing for unsupervised clustering.


Inferring Population Structure and Admixture Proportions in Low Depth NGS Data

10.1101/302463 ◽

2018 ◽

Cited By ~ 5

Author(s):

Jonas Meisner ◽

Anders Albrechtsen

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Next Generation Sequencing ◽

Principal Component ◽

Component Analysis ◽

Allele Frequencies ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Abstract: We present two methods for inferring population structure and admixture proportions in low-depth next-generation sequencing data. Inference of population structure is essential in both population genetics and association studies and is often performed using principal component analysis or clustering-based approaches. Next-generation sequencing methods provide large amounts of genetic data but are associated with statistical uncertainty, especially for low-depth sequencing data. Models can account for this uncertainty by working directly on genotype likelihoods of the unobserved genotypes. We propose a method for inferring population structure through principal component analysis in an iterative approach of estimating individual allele frequencies, where we demonstrate improved accuracy in samples with low and variable sequencing depth for both simulated and real datasets. We also use the estimated individual allele frequencies in a fast non-negative matrix factorization method to estimate admixture proportions. Both methods have been implemented in the PCAngsd framework available at http://www.popgen.dk/software/.


Exploring Population Structure with Admixture Models and Principal Component Analysis

Methods in Molecular Biology - Statistical Population Genomics ◽

10.1007/978-1-0716-0199-0_4 ◽

2020 ◽

pp. 67-86

Author(s):

Chi-Chun Liu ◽

Suyash Shringarpure ◽

Kenneth Lange ◽

John Novembre

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Principal Component ◽

Component Analysis


Efficient toolkit implementing best practices for principal component analysis of population genetic data

10.1101/841452 ◽

2019 ◽

Cited By ~ 2

Author(s):

Florian Privé ◽

Keurcien Luu ◽

Michael G.B. Blum ◽

John J. McGrath ◽

Bjarni J. Vilhjálmsson

Keyword(s):

Principal Component Analysis ◽

Population Structure ◽

Best Practices ◽

Principal Component ◽

Genetic Data ◽

Component Analysis ◽

Uk Biobank ◽

1000 Genomes Project ◽

1000 Genomes ◽

The Uk

Abstract: Principal Component Analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (1) capturing Linkage Disequilibrium (LD) structure instead of population structure, (2) projected PCs that suffer from shrinkage bias, (3) detecting sample outliers, and (4) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. For example, we find that PC19 to PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.


A German version of the Intermittent Claudication Questionnaire (ICQ): cultural adaptation and validation

VASA ◽

10.1024/0301-1526/a000218 ◽

2012 ◽

Vol 41 (5) ◽

pp. 333-342 ◽

Cited By ~ 3

Author(s):

Kirchberger ◽

Finger ◽

Müller-Bühl

Keyword(s):

Principal Component Analysis ◽

Intermittent Claudication ◽

Completion Time ◽

Short Form ◽

Principal Component ◽

Component Analysis ◽

German Version ◽

Average Completion Time ◽

Sf 36 ◽

Related Quality

Background: The Intermittent Claudication Questionnaire (ICQ) is a short questionnaire for the assessment of health-related quality of life (HRQOL) in patients with intermittent claudication (IC). The objective of this study was to translate the ICQ into German and to investigate the psychometric properties of the German ICQ version in patients with IC. Patients and methods: The original English version was translated using a forward-backward method. The resulting German version was reviewed by the author of the original version and an experienced clinician. Finally, it was tested for clarity with 5 German patients with IC. The German ICQ was administered to a sample of 81 patients. The sample consisted of 58.0 % male patients with a median age of 71 years and a median IC duration of 36 months. Tests of feasibility included completeness of questionnaires, completion time, and ratings of clarity, length and relevance. Reliability was assessed through a retest in 13 patients at 14 days, and analysis of Cronbach's alpha for internal consistency. Construct validity was investigated using principal component analysis. Concurrent validity was assessed by correlating the ICQ scores with the Short Form 36 Health Survey (SF-36) as well as clinical measures. Results: The ICQ was completely filled in by 73 subjects (90.1 %) with an average completion time of 6.3 minutes. Cronbach's alpha coefficient reached 0.75. Intra-class correlation for test-retest reliability was r = 0.88. Principal component analysis resulted in a 3-factor solution. The first factor explained 51.5 % of the total variation and all items had loadings of at least 0.65 on it. The ICQ was significantly associated with the SF-36 and treadmill-walking distances whereas no association was found for resting ABPI. Conclusions: The German version of the ICQ demonstrated good feasibility, satisfactory reliability and good validity. Responsiveness should be investigated in further validation studies.
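For readers unfamiliar with the internal-consistency measure named above, the following minimal sketch computes Cronbach's alpha from the standard formula alpha = k/(k − 1) · (1 − sum of item variances / variance of total scores), where k is the number of items. The response matrix and the function name `cronbach_alpha` are hypothetical; none of the values come from the study.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical answers of five respondents to four items (1-5 scale).
answers = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
]

print(round(cronbach_alpha(answers), 3))
```

When every item gives identical scores for each respondent, the formula returns exactly 1; alpha falls as the items become less correlated with one another.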
