The effects of human population structure on large genetic association studies

Abstract: Modern genetic association studies require modeling population structure and family relatedness in order to calculate correct statistics. Principal Components Analysis (PCA) is one of the most common approaches for modeling this population structure, but the Linear Mixed-Effects Model (LMM) is now believed by many to be a superior model. However, previous comparisons have been limited by testing PCA without varying the number of principal components (PCs), by simulating unrealistically simple population structures, and by not always measuring both type-I error control and predictive power. In this work, we thoroughly evaluate PCA with varying numbers of PCs alongside LMM in various realistic scenarios, including admixture together with family structure, measuring both null p-value uniformity and the area under the precision-recall curve. We find that PCA performs as well as LMM when enough PCs are used and the sample size is large, and that it is remarkably robust to extreme numbers of PCs. However, PCA performs worse than LMM when sample sizes are small and when there is family structure, although LMM performance is highly variable. Altogether, our work suggests that PCA is a favorable approach for association studies when sample sizes are large and no close relatives exist in the data, and that a hybrid approach of LMM with PCs may be the best of both worlds.
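
To make the PCA approach concrete, the following is a minimal sketch, not the authors' code or simulation design. It assumes a toy two-population sample with no causal SNPs and a phenotype whose mean differs by population, then compares the fraction of null SNPs with p < 0.05 with and without the top PC as a covariate. The function names (`top_pcs`, `assoc_pvalues`), the simulation parameters, and the normal approximation for p-values are all illustrative assumptions.

```python
import math
import numpy as np

def top_pcs(X, n_pcs):
    """Top principal components of a samples x SNPs genotype matrix."""
    Z = X - X.mean(axis=0)             # center each SNP
    sd = Z.std(axis=0)
    Z = Z[:, sd > 0] / sd[sd > 0]      # drop monomorphic SNPs, scale the rest
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]

def assoc_pvalues(X, y, covars=None):
    """Per-SNP OLS test of y ~ intercept + covars + SNP (normal-approximation p-values)."""
    n, m = X.shape
    C = np.ones((n, 1)) if covars is None else np.column_stack([np.ones(n), covars])
    # Residualize phenotype and genotypes on the covariates (Frisch-Waugh-Lovell).
    Q, _ = np.linalg.qr(C)
    y_r = y - Q @ (Q.T @ y)
    X_r = X - Q @ (Q.T @ X)
    ssx = (X_r ** 2).sum(axis=0)
    ok = ssx > 1e-12                   # skip SNPs fully explained by the covariates
    beta = (X_r[:, ok].T @ y_r) / ssx[ok]
    df = n - C.shape[1] - 1
    rss = (y_r ** 2).sum() - beta ** 2 * ssx[ok]
    t = beta / np.sqrt(rss / df / ssx[ok])
    p = np.ones(m)
    p[ok] = [math.erfc(abs(v) / math.sqrt(2)) for v in t]
    return p

# Toy data (assumed parameters): two populations, all SNPs null.
rng = np.random.default_rng(1)
n_per, m = 200, 500
p1 = rng.uniform(0.1, 0.9, m)
p2 = np.clip(p1 + rng.normal(0, 0.15, m), 0.05, 0.95)
pop = np.repeat([0, 1], n_per)
X = np.vstack([rng.binomial(2, p1, (n_per, m)), rng.binomial(2, p2, (n_per, m))])
y = 1.0 * pop + rng.normal(size=2 * n_per)   # phenotype depends on ancestry only

frac_unadj = np.mean(assoc_pvalues(X, y) < 0.05)
frac_pca = np.mean(assoc_pvalues(X, y, top_pcs(X, 1)) < 0.05)
print(f"fraction of null SNPs with p < 0.05: unadjusted={frac_unadj:.3f}, 1 PC={frac_pca:.3f}")
```

In this toy setting a single PC is enough because there are only two discrete populations; the abstract's point is that realistic structures may need more PCs, and that family structure is a case PCs handle poorly.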

Genome-Wide Control of Population Structure and Relatedness in Genetic Association Studies via Linear Mixed Models with Orthogonally Partitioned Structure

10.1101/409953 ◽

2018 ◽

Author(s):

Matthew P. Conomos ◽

Alex P. Reiner ◽

Mary Sara McPeek ◽

Timothy A. Thornton

Keyword(s):

Population Structure ◽

Genetic Association ◽

Mixed Models ◽

Association Studies ◽

Linear Mixed Models ◽

Genetic Association Studies ◽

European Ancestry ◽

Type I ◽

Genome Wide ◽

WBC Count

Abstract: Linear mixed models (LMMs) have become the standard approach for genetic association testing in the presence of sample structure. However, the performance of LMMs has primarily been evaluated in relatively homogeneous populations of European ancestry, even though many recent genetic association studies include samples from worldwide populations with diverse ancestries. In this paper, we demonstrate that existing LMM methods can have systematic miscalibration of association test statistics genome-wide in samples with heterogeneous ancestry, resulting in both increased type-I error rates and a loss of power. Furthermore, we show that this miscalibration arises because allele frequency differences among populations vary across the genome. To overcome this problem, we developed LMM-OPS, an LMM approach which orthogonally partitions diverse genetic structure into two components: distant population structure and recent genetic relatedness. In simulation studies with real and simulated genotype data, we demonstrate that LMM-OPS is appropriately calibrated in the presence of ancestry heterogeneity and outperforms existing LMM approaches, including EMMAX, GCTA, and GEMMA. We conduct a GWAS of white blood cell (WBC) count in an admixed sample of 3,551 Hispanic/Latino American women from the Women’s Health Initiative SNP Health Association Resource, where LMM-OPS detects genome-wide significant associations with corresponding p-values that are one or more orders of magnitude smaller than those from competing LMM methods. We also identify a genome-wide significant association with regulatory variant rs2814778 in the DARC gene on chromosome 1, which generalizes to Hispanic/Latino Americans a previous association with reduced WBC count identified in African Americans.

Controlling population structure in human genetic association studies with samples of unrelated individuals

Statistics and Its Interface ◽

10.4310/sii.2011.v4.n3.a6 ◽

2011 ◽

Vol 4 (3) ◽

pp. 317-326 ◽

Cited By ~ 8

Author(s):

David B. Allison ◽

Nita A. Limdi ◽

Nianjun Liu ◽

Amit Patki ◽

Hongyu Zhao

Keyword(s):

Population Structure ◽

Genetic Association ◽

Association Studies ◽

Genetic Association Studies

Robust Population Structure Inference and Correction in the Presence of Known or Cryptic Relatedness

10.1101/008276 ◽

2014 ◽

Author(s):

Matthew P Conomos ◽

Michael B Miller ◽

Timothy A Thornton

Keyword(s):

Population Structure ◽

Genetic Association ◽

Association Studies ◽

Genetic Association Studies ◽

Real Data ◽

Substantial Improvement ◽

Genetic Ancestry ◽

Phase III ◽

Cryptic Relatedness ◽

Study Participants

Population structure inference with genetic data has been motivated by a variety of applications in population genetics and genetic association studies. Several approaches have been proposed for the identification of genetic ancestry differences in samples where study participants are assumed to be unrelated, including principal components analysis (PCA), multi-dimensional scaling (MDS), and model-based methods for proportional ancestry estimation. Many genetic studies, however, include individuals with some degree of relatedness, and existing methods for inferring genetic ancestry fail in related samples. We present a method, PC-AiR, for robust population structure inference in the presence of known or cryptic relatedness. PC-AiR utilizes genome-screen data and an efficient algorithm to identify a diverse subset of unrelated individuals that is representative of all ancestries in the sample. The PC-AiR method directly performs PCA on the identified ancestry representative subset and then predicts components of variation for all remaining individuals based on genetic similarities. In simulation studies and in applications to real data from Phase III of the HapMap Project, we demonstrate that PC-AiR provides a substantial improvement over existing approaches for population structure inference in related samples. We also demonstrate significant efficiency gains, where a single axis of variation from PC-AiR provides better prediction of ancestry in a variety of structure settings than using ten (or more) components of variation from widely used PCA and MDS approaches. Finally, we illustrate that PC-AiR can provide improved population stratification correction over existing methods in genetic association studies with population structure and relatedness.
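
The two-step idea described above (PCA on an unrelated subset, then prediction for everyone else) can be sketched as follows. This is a simplified illustration, not the PC-AiR implementation: it assumes the unrelated, ancestry-representative subset has already been chosen and is passed in as a boolean mask, and it predicts scores for the remaining individuals by projecting their standardized genotypes onto the SNP loadings, which is one common choice and may differ from what PC-AiR does. The function name `pcs_from_unrelated_subset` and the toy data are hypothetical.

```python
import numpy as np

def pcs_from_unrelated_subset(X, unrelated, n_pcs=2):
    """PCA on an unrelated reference subset, then projection of all samples.

    X         : samples x SNPs genotype matrix (0/1/2 allele counts)
    unrelated : boolean mask marking the reference subset (assumed given)
    """
    X = np.asarray(X, dtype=float)
    ref = X[unrelated]
    mu = ref.mean(axis=0)              # SNP means estimated from the reference subset only
    sd = ref.std(axis=0)
    keep = sd > 0                      # drop SNPs that are monomorphic in the reference
    Z_ref = (ref[:, keep] - mu[keep]) / sd[keep]
    Z_all = (X[:, keep] - mu[keep]) / sd[keep]
    # SNP loadings come from the reference subset alone.
    _, _, Vt = np.linalg.svd(Z_ref, full_matrices=False)
    loadings = Vt[:n_pcs].T
    # Scores for every sample, including those left out of the PCA.
    return Z_all @ loadings

# Toy data (assumed parameters): two populations, every 4th sample held out
# of the PCA as a stand-in for "related" individuals.
rng = np.random.default_rng(0)
m = 300
p1 = rng.uniform(0.1, 0.9, m)
p2 = np.clip(p1 + rng.normal(0, 0.2, m), 0.05, 0.95)
pop = np.repeat([0, 1], 40)
X = np.vstack([rng.binomial(2, p1, (40, m)), rng.binomial(2, p2, (40, m))])
unrelated = np.arange(80) % 4 != 3

pcs = pcs_from_unrelated_subset(X, unrelated, n_pcs=2)
held_out = ~unrelated
r = np.corrcoef(pcs[held_out, 0], pop[held_out])[0, 1]
print(f"|correlation| of projected PC1 with population label (held-out samples): {abs(r):.2f}")
```

Keeping relatives out of the decomposition is the point of the design: PCs computed on a sample that includes families can end up describing pedigree structure rather than ancestry.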