Optimal selection of genetic variants for adjustment of population stratification in European association studies

2019 · Vol 21 (3) · pp. 753-761
Author(s): Regina Brinster, Dominique Scherer, Justo Lorenzo Bermejo

Abstract: Population stratification is usually corrected for by principal component analysis (PCA) of genome-wide genotype data, even in populations considered genetically homogeneous, such as Europeans. Genotyping only a small number of genetic variants that show large allele-frequency differences among subpopulations—so-called ancestry-informative markers (AIMs)—instead of the whole genome for stratification adjustment could be an advantage for replication studies and candidate gene/pathway studies. Here we compare the correction performance of classical and robust principal components (PCs) with the use of AIMs selected according to four different methods: the informativeness for assignment measure ($I_n$-AIMs), the combination of PCA and F-statistics, PCA-correlated measures and the PCA-weighted loadings for each genetic variant. We used real genotype data from the Population Reference Sample and The Cancer Genome Atlas to simulate European genetic association studies and to quantify type I error rate and statistical power in different case–control settings. In studies with the same numbers of cases and controls per country and with control-to-case ratios reflecting actual rates of disease prevalence, no adjustment for population stratification was required. The unnecessary inclusion of the country of origin, PCs or AIMs as covariates in the regression models translated into increased type I error rates. In studies with cases and controls from separate countries, no investigated method was able to adequately correct for population stratification. The first classical and the first two robust PCs achieved the lowest (although still inflated) type I error, followed at some distance by the first eight $I_n$-AIMs.
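To make the PC-adjustment concept concrete, the following sketch fits a case–control logistic regression for a single variant with and without the leading genotype PCs as covariates. It uses synthetic data and the numpy/statsmodels libraries; the variable names and the choice of two PCs are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch: adjusting a case-control association test with
# principal components (PCs) of the genotype matrix.
# Synthetic data; an illustration of the concept, not the study's code.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, m = 500, 1000                                      # samples, variants
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)   # 0/1/2 genotypes
y = rng.binomial(1, 0.5, size=n)                      # case-control status

# Classical PCs from the standardized genotype matrix, via SVD
Gs = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-12)
U, S, Vt = np.linalg.svd(Gs, full_matrices=False)
pcs = U[:, :2] * S[:2]                                # first two PCs

# Logistic regression on one candidate variant, with/without PC adjustment
g = G[:, 0]
p_unadj = sm.Logit(y, sm.add_constant(g)).fit(disp=0).pvalues[1]
X_adj = sm.add_constant(np.column_stack([g, pcs]))
p_adj = sm.Logit(y, X_adj).fit(disp=0).pvalues[1]
print(f"unadjusted p = {p_unadj:.3f}, PC-adjusted p = {p_adj:.3f}")
```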

2018 · Vol 20 (6) · pp. 2055-2065
Author(s): Johannes Brägelmann, Justo Lorenzo Bermejo

Abstract: Technological advances and the reduced costs of high-density methylation arrays have led to an increasing number of association studies on the possible relationship between human disease and epigenetic variability. DNA samples from peripheral blood or other tissue types are analyzed in epigenome-wide association studies (EWAS) to detect methylation differences related to a particular phenotype. Since information on the cell-type composition of the sample is generally not available and methylation profiles are cell-type specific, statistical methods have been developed to adjust for cell-type heterogeneity in EWAS. In this study we systematically compared five popular adjustment methods: the factored spectrally transformed linear mixed model (FaST-LMM-EWASher), the sparse principal component analysis algorithm ReFACTor, surrogate variable analysis (SVA), independent SVA (ISVA) and an optimized version of SVA (SmartSVA). We used real data and applied a multilayered simulation framework to assess the type I error rate, the statistical power and the quality of estimated methylation differences according to major study characteristics. While all five adjustment methods improved false-positive rates compared with unadjusted analyses, FaST-LMM-EWASher resulted in the lowest type I error rate at the expense of low statistical power. SVA efficiently corrected for cell-type heterogeneity in EWAS of up to 200 cases and 200 controls, but did not control type I error rates in larger studies. Results based on real data sets confirmed the simulation findings, with the strongest control of type I error rates by FaST-LMM-EWASher and SmartSVA. Overall, ReFACTor, ISVA and SmartSVA showed the best, and mutually comparable, statistical power, quality of estimated methylation differences and runtime.
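The following sketch illustrates the core idea shared by SVA-type methods: estimate latent components from the methylation matrix after removing the phenotype signal, then include them as covariates in site-wise tests. It is a simplified illustration on synthetic data, not the SVA, SmartSVA or ReFACTor algorithms themselves; all names and the choice of three latent factors are assumptions.

```python
# Sketch of SVA-style adjustment in EWAS: estimate latent components
# (e.g., cell-type composition) from residual methylation variation and
# include them as covariates. Simplified; not the actual SVA algorithm.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p, k = 200, 5000, 3                 # samples, CpG sites, latent factors
pheno = rng.binomial(1, 0.5, size=n).astype(float)
latent = rng.normal(size=(n, k))       # unobserved cell-type proportions
M = latent @ rng.normal(size=(k, p)) + rng.normal(scale=0.5, size=(n, p))

# Residualize methylation on the phenotype, then take the leading
# singular vectors of the residual matrix as surrogate variables
X = sm.add_constant(pheno)
R = M - (X @ np.linalg.pinv(X)) @ M    # projection removes phenotype signal
U, S, Vt = np.linalg.svd(R, full_matrices=False)
sv = U[:, :k]                          # surrogate variables

# Site-wise association test adjusted for the surrogate variables
Xa = sm.add_constant(np.column_stack([pheno, sv]))
res = sm.OLS(M[:, 0], Xa).fit()
print(f"adjusted p-value at site 0: {res.pvalues[1]:.3f}")
```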


2021
Author(s): Zilu Liu, Asuman Turkmen, Shili Lin

In genetic association studies of common diseases, population stratification is a major source of confounding. Principal component regression (PCR) and the linear mixed model (LMM) are two commonly used approaches to account for population stratification. Previous studies have shown that LMM can be interpreted as including all principal components (PCs) as random-effect covariates. However, including all PCs in LMM may inflate the type I error in some scenarios due to redundancy, while including only a few pre-selected PCs in PCR may fail to fully capture the genetic diversity. Here, we propose a statistical method under the Bayesian framework, Bayestrat, that utilizes appropriate shrinkage priors to shrink the effects of non- or minimally confounded PCs and improve the identification of highly confounded ones. Simulation results show that Bayestrat consistently achieves lower type I error rates yet higher power, especially when the number of PCs included in the model is large. We also apply our method to two real datasets, the Dallas Heart Study (DHS) and the Multi-Ethnic Study of Atherosclerosis (MESA), and demonstrate the superiority of Bayestrat over commonly used methods.
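As a rough illustration of the shrinkage idea, the sketch below places a horseshoe-type prior on the PC coefficients of a Bayesian logistic regression using PyMC. The exact prior specification of Bayestrat is not given in the abstract, so this is a stand-in shrinkage prior; the simulated data and all variable names are assumptions.

```python
# Sketch of the shrinkage idea: logistic regression of disease status on a
# candidate variant plus many PCs, with a horseshoe-type prior pulling
# weakly confounded PC effects toward zero. Illustrative only; not the
# Bayestrat prior itself.
import numpy as np
import pymc as pm

rng = np.random.default_rng(2)
n, k = 300, 20                          # samples, number of PCs in the model
g = rng.binomial(2, 0.3, size=n)        # candidate variant
pcs = rng.normal(size=(n, k))
logit = 0.5 * pcs[:, 0] + rng.normal(scale=0.1, size=n)   # PC1 confounds
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

with pm.Model():
    beta_g = pm.Normal("beta_g", 0.0, sigma=2.0)          # variant effect
    tau = pm.HalfCauchy("tau", beta=1.0)                  # global shrinkage
    lam = pm.HalfCauchy("lam", beta=1.0, shape=k)         # local shrinkage
    beta_pc = pm.Normal("beta_pc", 0.0, sigma=tau * lam, shape=k)
    intercept = pm.Normal("intercept", 0.0, sigma=2.0)
    eta = intercept + beta_g * g + pm.math.dot(pcs, beta_pc)
    pm.Bernoulli("obs", p=pm.math.sigmoid(eta), observed=y)
    idata = pm.sample(500, tune=500, chains=2, progressbar=False)

print("posterior mean variant effect:", idata.posterior["beta_g"].mean().item())
```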


2021 · Vol 2021 · pp. 1-11
Author(s): Li-Chu Chien

In genetic association analysis, several relevant phenotypes or multivariate traits with different types of components are usually collected to study complex or multifactorial diseases. Over the past few years, jointly testing for association between multivariate traits and multiple genetic variants has become more popular because it can increase statistical power to identify causal genes in pedigree- or population-based studies. However, most of the existing methods mainly focus on testing genetic variants associated with multiple continuous phenotypes. In this investigation, we develop a framework for identifying the pleiotropic effects of genetic variants on multivariate traits by using collapsing and kernel methods with pedigree- or population-structured data. The proposed framework is applicable to the burden test, the kernel test and the omnibus test, for both the autosomes and the X chromosome. The proposed multivariate trait association methods can accommodate continuous or binary phenotypes and can further adjust for covariates. Simulation studies show that the performance of our methods is satisfactory with respect to empirical type I error rates and power in comparison with existing methods.
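The two building blocks named above can be sketched for the simplest setting of a single continuous trait in unrelated samples: a burden statistic that collapses weighted minor-allele counts into one score, and a kernel (variance-component) statistic with a permutation p-value. The pedigree structure, multivariate traits and X-chromosome handling of the proposed framework are omitted here; the weights and simulated data are assumptions.

```python
# Simplified illustration of a burden test and a SKAT-style kernel test
# for one continuous trait in unrelated samples. Not the paper's
# pedigree-aware or multivariate machinery.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, m = 400, 30                          # samples, rare variants in a region
maf = rng.uniform(0.005, 0.05, size=m)
G = rng.binomial(2, maf, size=(n, m)).astype(float)
y = 0.4 * G[:, :3].sum(axis=1) + rng.normal(size=n)   # 3 causal variants

# Burden test: regress the trait on a weighted sum of minor-allele counts
w = 1.0 / np.sqrt(maf * (1 - maf))      # common up-weighting of rarer variants
burden = G @ w
r, p_burden = stats.pearsonr(burden, y)

# Kernel test: Q = (y - ybar)' K (y - ybar) with K = G diag(w^2) G';
# the p-value is obtained by permutation in this sketch
resid = y - y.mean()
Gw = G * w
K = Gw @ Gw.T
Q = resid @ K @ resid
exceed = 0
for _ in range(999):
    rp = rng.permutation(resid)
    if rp @ K @ rp >= Q:
        exceed += 1
p_kernel = (exceed + 1) / 1000
print(f"burden p = {p_burden:.4f}, kernel (permutation) p = {p_kernel:.4f}")
```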


2017 · Vol 284 (1851) · Article 20161850
Author(s): Nick Colegrave, Graeme D. Ruxton

A common approach to the analysis of experimental data across much of the biological sciences is test-qualified pooling. In this approach, non-significant terms are dropped from a statistical model, effectively pooling the variation associated with each removed term into the error term used to test hypotheses (or estimate effect sizes). This pooling is carried out only if statistical testing of the data under a previous, more complicated model provides motivation for the simplification; hence the pooling is test-qualified. In pooling, the researcher increases the degrees of freedom of the error term with the aim of increasing the statistical power to test their hypotheses of interest. Despite this approach being widely adopted and explicitly recommended by some of the most widely cited statistical textbooks aimed at biologists, here we argue that (except in highly specialized circumstances that we identify) the hoped-for improvement in statistical power will be small or non-existent, and the reliability of the statistical procedures is likely to be much reduced, with type I error rates deviating from their nominal levels. We thus call for greatly reduced use of test-qualified pooling across experimental biology, more careful justification of any use that continues, and a different philosophy for the initial selection of statistical models in the light of this change in procedure.
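The procedure under critique can be made explicit with a small simulation: fit a two-way ANOVA, drop the interaction if its test is non-significant (pooling its variation into the error term), and re-test the main effects in the simplified model. Repeating this on null data would expose the type I error behaviour the authors describe. The sketch below is a toy illustration in Python with statsmodels; the factor layout and the 0.05 threshold are assumptions.

```python
# Sketch of test-qualified pooling: fit a two-way ANOVA, drop the
# interaction if non-significant, and re-test main effects with the
# pooled error term. Toy illustration, not the authors' analysis.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "a": np.repeat(["a1", "a2"], 40),
    "b": np.tile(np.repeat(["b1", "b2"], 20), 2),
})
df["y"] = rng.normal(size=len(df))     # null data: no real effects

full = smf.ols("y ~ a * b", data=df).fit()
if anova_lm(full).loc["a:b", "PR(>F)"] > 0.05:
    # Interaction "non-significant": pool it into error and refit
    model = smf.ols("y ~ a + b", data=df).fit()
else:
    model = full
print(anova_lm(model))
```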


2018
Author(s): Tamar Sofer, Xiuwen Zheng, Stephanie M. Gogarten, Cecelia A. Laurie, Kelsey Grinde, ...

Abstract: When testing genotype–phenotype associations using linear regression, departure of the trait distribution from normality can impact both type I error rate control and statistical power, with worse consequences for rarer variants. While it has been shown that applying a rank-normalization transformation to trait values before testing may improve these statistical properties, the factor driving them is not the trait distribution itself but its residual distribution after regression on both covariates and genotype. Because genotype is expected to have a small effect (if any), investigators now routinely use a two-stage method in which they first regress the trait on covariates, obtain residuals and rank-normalize them, and then use the rank-normalized residuals in association analysis with the genotypes. Potential confounding signals are assumed to be removed in the first stage, so in practice no further adjustment is done in the second stage. Here, we show that this widely used approach can lead to tests with undesirable statistical properties, due to a combination of a mis-specified mean–variance relationship and remaining covariate associations between the rank-normalized residuals and the genotypes. We demonstrate these properties theoretically and in applications to genome-wide and whole-genome sequencing association studies. We further propose and evaluate an alternative, fully adjusted two-stage approach that adjusts for covariates both when the residuals are obtained and in the subsequent association test. This method can reduce excess type I errors and improve statistical power.
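The contrast between the two procedures can be sketched directly: stage one residualizes the trait on covariates and rank-normalizes the residuals; stage two tests the genotype either alone (the conventional approach) or with the covariates included again (the fully adjusted variant described above). The data-generating choices below, with a genotype correlated with the covariate, are assumptions made to show where the conventional approach can misbehave; this is a schematic, not the authors' implementation.

```python
# Two-stage rank-normalization: conventional vs. fully adjusted.
# Synthetic data with a heavy-tailed trait and a genotype-covariate link.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 1000
cov = rng.normal(size=n)                          # a covariate (e.g., age)
g = rng.binomial(2, 0.05 + 0.1 * (cov > 0))       # genotype correlated with it
trait = 1.0 * cov + rng.standard_t(df=3, size=n)  # heavy-tailed trait

def rank_normalize(x):
    """Map values to normal quantiles by rank (Blom-type offsets omitted)."""
    return stats.norm.ppf(stats.rankdata(x) / (len(x) + 1))

# Stage 1: residualize on covariates, then rank-normalize the residuals
stage1 = sm.OLS(trait, sm.add_constant(cov)).fit()
z = rank_normalize(stage1.resid)

# Stage 2, conventional: genotype only
p_conv = sm.OLS(z, sm.add_constant(g)).fit().pvalues[1]
# Stage 2, fully adjusted: genotype plus the covariates again
X_full = sm.add_constant(np.column_stack([g, cov]))
p_full = sm.OLS(z, X_full).fit().pvalues[1]
print(f"conventional p = {p_conv:.4f}, fully adjusted p = {p_full:.4f}")
```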


2017 · Vol 78 (3) · pp. 460-481
Author(s): Margarita Olivera-Aguilar, Samuel H. Rikoon, Oscar Gonzalez, Yasemin Kisbu-Sakarya, David P. MacKinnon

When testing a statistical mediation model, it is assumed that factorial measurement invariance holds for the mediating construct across levels of the independent variable X. The consequences of failing to address the violations of measurement invariance in mediation models are largely unknown. The purpose of the present study was to systematically examine the impact of mediator noninvariance on the Type I error rates, statistical power, and relative bias in parameter estimates of the mediated effect in the single mediator model. The results of a large simulation study indicated that, in general, the mediated effect was robust to violations of invariance in loadings. In contrast, most conditions with violations of intercept invariance exhibited severely positively biased mediated effects, Type I error rates above acceptable levels, and statistical power larger than in the invariant conditions. The implications of these results are discussed and recommendations are offered.
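For reference, the mediated effect at issue is the product of the a path (X → M) and the b path (M → Y adjusting for X); the sketch below estimates it with a first-order Sobel standard error on synthetic data. It illustrates the quantity being studied, not the measurement-invariance simulation design of the article; all numeric choices are assumptions.

```python
# Single mediator model: estimate the a path (X -> M), the b path
# (M -> Y given X), and the mediated effect a*b with a Sobel SE.
# Minimal numeric illustration on synthetic data.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n = 500
x = rng.binomial(1, 0.5, size=n).astype(float)   # independent variable
m = 0.5 * x + rng.normal(size=n)                 # mediator
y = 0.4 * m + 0.1 * x + rng.normal(size=n)       # outcome

fit_a = sm.OLS(m, sm.add_constant(x)).fit()
fit_b = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit()
a, se_a = fit_a.params[1], fit_a.bse[1]
b, se_b = fit_b.params[2], fit_b.bse[2]

ab = a * b                                        # mediated effect
se_ab = np.sqrt(a**2 * se_b**2 + b**2 * se_a**2)  # first-order Sobel SE
z = ab / se_ab
p = 2 * stats.norm.sf(abs(z))
print(f"mediated effect = {ab:.3f}, z = {z:.2f}, p = {p:.4f}")
```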

