Bayestrat: Population Stratification Correction Using Bayesian Shrinkage Prior for Genetic Association Studies

2021 ◽  
Author(s):  
Zilu Liu ◽  
Asuman Turkmen ◽  
Shili Lin

In genetic association studies of common diseases, population stratification is a major source of confounding. Principal component regression (PCR) and linear mixed models (LMMs) are two commonly used approaches to account for population stratification. Previous studies have shown that an LMM can be interpreted as including all principal components (PCs) as random-effect covariates. However, including all PCs in an LMM may inflate type I error in some scenarios due to redundancy, while including only a few pre-selected PCs in PCR may fail to fully capture the genetic diversity. Here, we propose a statistical method under the Bayesian framework, Bayestrat, that utilizes appropriate shrinkage priors to shrink the effects of non- or minimally confounded PCs and improve the identification of highly confounded ones. Simulation results show that Bayestrat consistently achieves lower type I error rates yet higher power, especially when the number of PCs included in the model is large. We also apply our method to two real datasets, the Dallas Heart Studies (DHS) and the Multi-Ethnic Study of Atherosclerosis (MESA), and demonstrate the superiority of Bayestrat over commonly used methods.
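The idea of adjusting for PCs while shrinking their effects can be illustrated with a small simulation. The sketch below is not the Bayestrat method: it swaps in a plain ridge penalty on the PC coefficients, which corresponds to the posterior mode under a Gaussian prior and shrinks all PCs uniformly rather than selectively. The function names (`simulate`, `top_pcs`, `snp_z`, `mean_chi2`), the two-subpopulation setup, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def simulate(n=400, m=200, seed=0):
    """Two subpopulations; the phenotype depends on ancestry only, so every SNP is null."""
    rng = np.random.default_rng(seed)
    pop = np.repeat([0, 1], n // 2)                    # subpopulation label per individual
    base = rng.uniform(0.2, 0.8, size=m)               # shared baseline allele frequencies
    shift = rng.normal(0.0, 0.1, size=m)               # per-SNP frequency divergence
    freqs = np.clip(np.stack([base - shift, base + shift]), 0.05, 0.95)
    G = rng.binomial(2, freqs[pop]).astype(float)      # n x m genotype matrix (0/1/2)
    y = 1.0 * pop + rng.normal(size=n)                 # confounded, non-genetic phenotype
    return G, y

def top_pcs(G, k):
    """Scores of the top-k principal components of the standardized genotypes."""
    Z = G - G.mean(axis=0)
    sd = Z.std(axis=0)
    Z = Z / np.where(sd > 0, sd, 1.0)
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] * S[:k]

def snp_z(g, y, pcs, lam):
    """Z-statistic for one SNP; only the PC coefficients carry the ridge penalty."""
    n = len(y)
    X = np.column_stack([np.ones(n), g, pcs])
    penalty = np.zeros(X.shape[1])
    penalty[2:] = lam                                  # intercept and SNP are unpenalized
    A_inv = np.linalg.inv(X.T @ X + np.diag(penalty))
    beta = A_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])          # approximate df when lam > 0
    cov = sigma2 * A_inv @ (X.T @ X) @ A_inv           # sandwich variance of a ridge estimator
    return beta[1] / np.sqrt(cov[1, 1])

def mean_chi2(G, y, pcs, lam=0.0):
    """Average squared z-statistic over SNPs; values near 1 indicate no inflation."""
    z = np.array([snp_z(G[:, j], y, pcs, lam) for j in range(G.shape[1])])
    return float(np.mean(z ** 2))
```

Comparing `mean_chi2(G, y, np.empty((len(y), 0)))` with `mean_chi2(G, y, top_pcs(G, 10), lam=100.0)` on the simulated data shows how much of the inflation in the unadjusted test is removed once PCs enter the model. A selective shrinkage prior, as the abstract describes, would replace the single `lam` with PC-specific amounts of shrinkage.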

2019 ◽  
Vol 21 (3) ◽  
pp. 753-761 ◽  
Author(s):  
Regina Brinster ◽  
Dominique Scherer ◽  
Justo Lorenzo Bermejo

Population stratification is usually corrected relying on principal component analysis (PCA) of genome-wide genotype data, even in populations considered genetically homogeneous, such as Europeans. The need to genotype only a small number of genetic variants that show large differences in allele frequency among subpopulations, so-called ancestry-informative markers (AIMs), instead of the whole genome for stratification adjustment could represent an advantage for replication studies and candidate gene/pathway studies. Here we compare the correction performance of classical and robust principal components (PCs) with the use of AIMs selected according to four different methods: the informativeness for assignment measure ($IN$-AIMs), the combination of PCA and F-statistics, PCA-correlated measurement and the PCA weighted loadings for each genetic variant. We used real genotype data from the Population Reference Sample and The Cancer Genome Atlas to simulate European genetic association studies and to quantify type I error rate and statistical power in different case–control settings. In studies with the same numbers of cases and controls per country and control-to-case ratios reflecting actual rates of disease prevalence, no adjustment for population stratification was required. The unnecessary inclusion of the country of origin, PCs or AIMs as covariates in the regression models translated into increasing type I error rates. In studies with cases and controls from separate countries, no investigated method was able to adequately correct for population stratification. The first classical and the first two robust PCs achieved the lowest (although inflated) type I error, followed at some distance by the first eight $IN$-AIMs.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Matthieu Bouaziz ◽  
Jimmy Mullaert ◽  
Benedetta Bigio ◽  
Yoann Seeleuthner ◽  
Jean-Laurent Casanova ◽  
...  

Population stratification is a confounder of genetic association studies. In analyses of rare variants, corrections based on principal components (PCs) and linear mixed models (LMMs) yield conflicting conclusions. Studies evaluating these approaches generally focused on limited types of structure and large sample sizes. We investigated the properties of several correction methods through a large simulation study using real exome data, and several within- and between-continent stratification scenarios. We considered different sample sizes, with situations including as few as 50 cases, to account for the analysis of rare disorders. Large samples showed that accounting for stratification was more difficult with a continental than with a worldwide structure. When considering a sample of 50 cases, an inflation of type I errors was observed with PCs for small numbers of controls (≤ 100), and with LMMs for large numbers of controls (≥ 1000). We also tested a novel local permutation method (LocPerm), which maintained a correct type I error in all situations. Power was equivalent for all approaches, indicating that the key issue is to properly control type I errors. Finally, we found that the power of analyses including small numbers of cases can be increased by adding a large panel of external controls, provided an appropriate stratification correction is used.


2021 ◽  
Author(s):  
Debashree Ray ◽  
Candelaria I Vergara ◽  
Margaret I Taub ◽  
Genevieve L Wojcik ◽  
Christine Ladd-Acosta ◽  
...  

Genetic association studies of child health outcomes often employ family-based designs. One of the most popular family-based designs is the case-parent trio design that considers the smallest possible nuclear family consisting of two parents and their affected child. This trio design is particularly advantageous for studying relatively rare disorders because it is less prone to type 1 error inflation due to population stratification compared to population-based study designs (e.g., case-control studies). However, obtaining genetic data from both parents is difficult, from a practical perspective, and many large studies predominantly measure genetic variants in mother-child dyads. While some statistical methods for analyzing parent-child dyad data (most commonly involving mother-child pairs) exist, it is not clear if they provide the same advantage as trio methods in protecting against population stratification, or if a specific dyad design (e.g., case-mother dyads vs. case-mother/control-mother dyads) is more advantageous. In this article, we review existing statistical methods for analyzing genome-wide data on dyads and perform extensive simulation experiments to benchmark their type I errors and statistical power under different scenarios. We extend our evaluation to existing methods for analyzing a combination of case-parent trios and dyads together. We apply these methods on genotyped and imputed data from multi-ethnic mother-child pairs only, case-parent trios only or combinations of both dyads and trios from the Gene, Environment Association Studies consortium (GENEVA), where each family was ascertained through a child affected by nonsyndromic cleft lip with or without cleft palate. Results from the GENEVA study corroborate the findings from our simulation experiments. Finally, we provide recommendations for using statistical genetic association methods for dyads.
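The abstract does not name a specific trio test. As one well-known illustration of why case-parent trios resist population stratification, the classic transmission disequilibrium test (TDT) compares, among heterozygous parents only, how often an allele is transmitted to the affected child versus not transmitted. Because the comparison is made within families, differences in allele frequency between subpopulations cancel out. The sketch below is a minimal version under that assumption; the function name `tdt` and the example counts are hypothetical.

```python
import math

def tdt(transmitted, untransmitted):
    """McNemar-type TDT statistic from heterozygous-parent transmission counts.

    transmitted:   times the tested allele was passed to the affected child
    untransmitted: times the other allele was passed instead
    Returns (chi-square statistic with 1 df, p-value).
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: 60 transmissions vs. 40 non-transmissions
chi2, p_value = tdt(60, 40)
```

Dyad methods, which the article evaluates, lack one parent's genotype and therefore cannot form this within-family contrast directly; that is the reason their protection against stratification is an open question in the abstract.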


2021 ◽  
Author(s):  
Yongwen Zhuang ◽  
Brooke N. Wolford ◽  
Kisung Nam ◽  
Wenjian Bi ◽  
Wei Zhou ◽  
...  

In the genome-wide association analysis of population-based biobanks, most diseases have low prevalence, which results in low detection power. One approach to tackle the problem is using family disease history, yet existing methods are unable to address type I error inflation induced by increased correlation of phenotypes among closely related samples, as well as unbalanced phenotypic distribution. We propose a new method for genetic association test with family disease history, TAPE (mixed-model-based Test with Adjusted Phenotype and Empirical saddlepoint approximation), which controls for increased phenotype correlation by adopting a two-variance-component mixed model and accounts for case-control imbalance by using empirical saddlepoint approximation. We show through simulation studies and analysis of UK-Biobank data of white British samples and KoGES data of Korean samples that the proposed method is computationally efficient and gains greater power for detection of variant-phenotype associations than common GWAS with binary traits while yielding better calibration compared to existing methods.


2020 ◽  
Author(s):  
Matthieu Bouaziz ◽  
Jimmy Mullaert ◽  
Benedetta Bigio ◽  
Yoann Seeleuthner ◽  
Jean-Laurent Casanova ◽  
...  

Population stratification is a strong confounding factor in human genetic association studies. In analyses of rare variants, the main correction strategies, based on principal components (PCs) and linear mixed models (LMMs), may yield conflicting conclusions, due to both the specific type of structure induced by rare variants and the particular statistical features of association tests. Studies evaluating these approaches generally focused on specific situations with limited types of simulated structure and large sample sizes. We investigated the properties of several correction methods in the context of a large simulation study using real exome data, and several within- and between-continent stratification scenarios. We also considered different sample sizes, with situations including as few as 50 cases, to account for the analysis of rare disorders. In this context, we focused on a genetic model with a phenotype driven by rare deleterious variants well suited for a burden test. For analyses of large samples, we found that accounting for stratification was more difficult with a continental structure than with a worldwide structure. LMMs failed to maintain a correct type I error in many scenarios, whereas PCs based on common variants failed only in the presence of extreme continental stratification. When a sample of 50 cases was considered, an inflation of type I errors was observed with PCs for small numbers of controls (≤100), and with LMMs for large numbers of controls (≥1000). We also tested a promising novel adapted local permutation method (LocPerm), which maintained a correct type I error in all situations. All approaches capable of correcting for stratification properly had similar power for detecting actual associations, indicating that the key issue is to properly control type I errors. Finally, we found that adding a large panel of external controls (e.g., extracted from publicly available databases) was an efficient way to increase the power of analyses including small numbers of cases, provided an appropriate stratification correction was used.

Author Summary: Genetic association studies focusing on rare variants using next generation sequencing (NGS) data have become a common strategy to overcome the shortcomings of classical genome-wide association studies for the analysis of rare and common diseases. However, population stratification remains a substantial issue that has not been fully resolved when analyzing NGS data. In this work, we propose a comprehensive evaluation of the main strategies to account for stratification, namely principal components and linear mixed models, along with a novel approach based on local permutations (LocPerm). We compared these correction methods in many different settings, considering several types of population structures, sample sizes and types of variants. Our results highlighted important limitations of some classical methods, such as those using principal components (in particular in small samples) and linear mixed models (in several situations). In contrast, LocPerm maintained a correct type I error in all situations. We also showed that adding a large panel of external controls, e.g., coming from publicly available databases, is an efficient strategy to increase the power of an analysis including a low number of cases, as long as an appropriate stratification correction is used. Our findings provide helpful guidelines for many researchers working on rare variant association studies.

