LiMMBo: a simple, scalable approach for linear mixed models in high-dimensional genetic association studies

Abstract:

Genome-wide association studies have helped to shed light on the genetic architecture of complex traits and diseases. Deep phenotyping of population cohorts is increasingly applied, where multi- to high-dimensional phenotypes are recorded in the individuals. Whilst these rich datasets provide important opportunities to analyse complex trait structures and pleiotropic effects at a genome-wide scale, existing statistical methods for joint genetic analyses are hampered by computational limitations posed by high-dimensional phenotypes. Consequently, such multivariate analyses are currently limited to a moderate number of traits. Here, we introduce a method that combines linear mixed models with bootstrapping (LiMMBo) to enable computationally efficient joint genetic analysis of high-dimensional phenotypes. Our method builds on linear mixed models, thereby providing robust control for population structure and other confounding factors, and the model scales to larger datasets with up to hundreds of phenotypes. We first validate LiMMBo using simulations, demonstrating consistent covariance estimates at greatly reduced computational cost compared to existing methods. We also find LiMMBo yields consistent power advantages compared to univariate modelling strategies, where the advantage of multivariate mapping increases substantially with the phenotype dimensionality. Finally, we applied LiMMBo to 41 yeast growth traits to map their genetic determinants, finding previously known and novel pleiotropic relationships in this high-dimensional phenotype space. LiMMBo is accessible as open-source software (https://github.com/HannahVMeyer/limmbo).

Author summary:

In multi-trait genetic association studies one is interested in detecting genetic variants that are associated with one or multiple traits. Genetic variants that influence two or more traits are referred to as pleiotropic. Multivariate linear mixed models have been successfully applied to detect pleiotropic effects by jointly modelling association signals across traits. However, these models are currently limited to a moderate number of phenotypes, because the number of model parameters grows steeply with the number of phenotypes and creates a computational burden. We developed LiMMBo, a new approach for the joint analysis of high-dimensional phenotypes. Our method reduces the number of effective model parameters by introducing an intermediate subsampling step. We validate this strategy using simulations, where we apply LiMMBo for the genetic analysis of hundreds of phenotypes, detecting pleiotropic effects for a wide range of simulated genetic architectures. Finally, to illustrate LiMMBo in practice, we apply the model to a study of growth traits in yeast, where we identify pleiotropic effects for traits with previously known genetic effects and reveal connections between previously unconnected traits.
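
The subsampling idea in the summary above can be illustrated with a minimal sketch. It is not LiMMBo's actual implementation or API: the function name `bootstrap_covariance` and its parameters are hypothetical, and a plain sample covariance stands in for the variance-component estimation that a linear mixed model would perform on each subset. The sketch repeatedly draws small subsets of traits, estimates a covariance matrix on each subset, and averages the overlapping entries into one full trait-by-trait matrix.

```python
import numpy as np

def bootstrap_covariance(Y, subset_size, n_bootstraps, seed=0):
    """Assemble a full trait covariance matrix from estimates on trait subsets.

    Y is an (individuals x traits) array. The per-subset estimator used here
    (np.cov) is an assumption standing in for an LMM variance-component fit.
    """
    rng = np.random.default_rng(seed)
    n_traits = Y.shape[1]
    total = np.zeros((n_traits, n_traits))
    counts = np.zeros((n_traits, n_traits))
    for _ in range(n_bootstraps):
        # Draw a subset of distinct traits for this bootstrap
        idx = rng.choice(n_traits, size=subset_size, replace=False)
        sub_cov = np.cov(Y[:, idx], rowvar=False)
        # Accumulate the subset estimate into the matching block
        total[np.ix_(idx, idx)] += sub_cov
        counts[np.ix_(idx, idx)] += 1
    if np.any(counts == 0):
        raise ValueError("some trait pairs were never sampled; "
                         "increase n_bootstraps or subset_size")
    # Average every entry over the bootstraps that covered it
    return total / counts

rng = np.random.default_rng(1)
Y = rng.normal(size=(50, 10))  # 50 individuals, 10 simulated traits
C = bootstrap_covariance(Y, subset_size=4, n_bootstraps=200)
print(C.shape)
```

Each fit only involves `subset_size` traits, so the number of parameters per fit stays small however many traits the full dataset has; the price is that enough bootstraps are needed for every pair of traits to be sampled together at least once.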

Genome-Wide Control of Population Structure and Relatedness in Genetic Association Studies via Linear Mixed Models with Orthogonally Partitioned Structure

DOI: 10.1101/409953, 2018
Author(s): Matthew P. Conomos, Alex P. Reiner, Mary Sara McPeek, Timothy A. Thornton
Keyword(s): Population Structure, Genetic Association, Mixed Models, Association Studies, Linear Mixed Models, Genetic Association Studies, European Ancestry, Type I, Genome Wide, Wbc Count

Abstract:

Linear mixed models (LMMs) have become the standard approach for genetic association testing in the presence of sample structure. However, the performance of LMMs has primarily been evaluated in relatively homogeneous populations of European ancestry, despite many of the recent genetic association studies including samples from worldwide populations with diverse ancestries. In this paper, we demonstrate that existing LMM methods can have systematic miscalibration of association test statistics genome-wide in samples with heterogeneous ancestry, resulting in both increased type-I error rates and a loss of power. Furthermore, we show that this miscalibration arises due to varying allele frequency differences across the genome among populations. To overcome this problem, we developed LMM-OPS, an LMM approach which orthogonally partitions diverse genetic structure into two components: distant population structure and recent genetic relatedness. In simulation studies with real and simulated genotype data, we demonstrate that LMM-OPS is appropriately calibrated in the presence of ancestry heterogeneity and outperforms existing LMM approaches, including EMMAX, GCTA, and GEMMA. We conduct a GWAS of white blood cell (WBC) count in an admixed sample of 3,551 Hispanic/Latino American women from the Women’s Health Initiative SNP Health Association Resource, where LMM-OPS detects genome-wide significant associations with corresponding p-values that are one or more orders of magnitude smaller than those from competing LMM methods. We also identify a genome-wide significant association with regulatory variant rs2814778 in the DARC gene on chromosome 1, which generalizes to Hispanic/Latino Americans a previous association with reduced WBC count identified in African Americans.

Linear Score Tests for Variance Components in Linear Mixed Models and Applications to Genetic Association Studies

Biometrics, DOI: 10.1111/biom.12095, 2013, Vol 69 (4), pp. 883-892, Cited By ~ 20
Author(s): Long Qu, Tobias Guennel, Scott L. Marshall
Keyword(s): Genetic Association, Mixed Models, Variance Components, Association Studies, Linear Mixed Models, Genetic Association Studies, Score Tests

Ludicrous Speed Linear Mixed Models for Genome-Wide Association Studies

DOI: 10.1101/154682, 2017, Cited By ~ 3
Author(s): Carl Kadie, David Heckerman
Keyword(s): Mixed Models, Mixed Model, Linear Mixed Model, Association Studies, Linear Mixed Models, Genome Wide Association, Genome Wide Association Studies, Confounding Factors, Genome Wide, A Genome

Abstract:

We have developed Ludicrous Speed Linear Mixed Models, a version of FaST-LMM optimized for the cloud. The approach can perform a genome-wide association analysis on a dataset of one million SNPs across one million individuals at a cost of about 868 CPU days with an elapsed time on the order of two weeks. A Python implementation is available at https://fastlmm.github.io/.

Significance:

Identifying SNP-phenotype correlations using GWAS is difficult because effect sizes are so small for common, complex diseases. To address this issue, institutions are creating extremely large cohorts with sample sizes on the order of one million. Unfortunately, such cohorts are likely to contain confounding factors such as population structure and family/cryptic relatedness. The linear mixed model (LMM) can often correct for such confounding factors, but is too slow to use even with algebraic speedups known as FaST-LMM. We present a cloud implementation of FaST-LMM, called Ludicrous Speed LMM, that can process one million samples and one million test SNPs in a reasonable amount of time and at a reasonable cost.
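
As a rough back-of-the-envelope reading of the figures quoted above, the ratio of CPU time to elapsed time gives the average number of CPUs that must be busy at once. The sketch below assumes that "on the order of two weeks" means exactly 14 days and that the work parallelises perfectly; both are assumptions made for illustration, and the helper name `average_concurrency` is hypothetical.

```python
def average_concurrency(cpu_days: float, elapsed_days: float) -> float:
    """Average number of CPUs busy at once, assuming perfect parallelism."""
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    return cpu_days / elapsed_days

# 868 CPU days is taken from the abstract; 14 days is an assumed elapsed time
cores = average_concurrency(868, 14)
print(f"about {cores:.0f} CPUs busy on average")
```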

A re-formulation of generalized linear mixed models to fit family data in genetic association studies

Frontiers in Genetics, DOI: 10.3389/fgene.2015.00120, 2015, Vol 6, Cited By ~ 5
Author(s): Tao Wang, Peng He, Kwang Woo Ahn, Xujing Wang, Soumitra Ghosh, ...
Keyword(s): Genetic Association, Mixed Models, Generalized Linear Mixed Models, Association Studies, Linear Mixed Models, Genetic Association Studies, Family Data

Improved linear mixed models for genome-wide association studies

Nature Methods, DOI: 10.1038/nmeth.2037, 2012, Vol 9 (6), pp. 525-526, Cited By ~ 202
Author(s): Jennifer Listgarten, Christoph Lippert, Carl M Kadie, Robert I Davidson, Eleazar Eskin, ...
Keyword(s): Mixed Models, Association Studies, Linear Mixed Models, Genome Wide Association, Genome Wide Association Studies, Genome Wide

Fitting linear mixed models to a highly-structured dataset effectively controls for population structure in bacterial genome-wide association studies

Access Microbiology, DOI: 10.1099/acmi.ac2019.po0121, 2019, Vol 1 (1A)
Author(s): Samuel Kidman, Emem-Fong Ukor, Andres Floto, Julian Parkhill
Keyword(s): Population Structure, Mixed Models, Association Studies, Linear Mixed Models, Bacterial Genome, Genome Wide Association, Genome Wide Association Studies, Genome Wide

FaST linear mixed models for genome-wide association studies

Nature Methods, DOI: 10.1038/nmeth.1681, 2011, Vol 8 (10), pp. 833-835, Cited By ~ 610
Author(s): Christoph Lippert, Jennifer Listgarten, Ying Liu, Carl M Kadie, Robert I Davidson, ...
Keyword(s): Mixed Models, Association Studies, Linear Mixed Models, Genome Wide Association, Genome Wide Association Studies, Genome Wide

Further Improvements to Linear Mixed Models for Genome-Wide Association Studies

Scientific Reports, DOI: 10.1038/srep06874, 2014, Vol 4 (1), Cited By ~ 30
Author(s): Christian Widmer, Christoph Lippert, Omer Weissbrod, Nicolo Fusi, Carl Kadie, ...
Keyword(s): Mixed Models, Association Studies, Linear Mixed Models, Genome Wide Association, Genome Wide Association Studies, Genome Wide

Evaluation of genome-wide power of genetic association studies based on empirical data from the HapMap project

Human Molecular Genetics, DOI: 10.1093/hmg/ddm205, 2007, Vol 16 (20), pp. 2494-2505, Cited By ~ 23
Author(s): Yasuhito Nannya, Kenjiro Taura, Mineo Kurokawa, Shigeru Chiba, Seishi Ogawa
Keyword(s): Genetic Association, Empirical Data, Association Studies, Genetic Association Studies, Hapmap Project, Genome Wide

Lies, Gosh Darn Lies, and Not Enough Good Statistics: Why Epidemic Model Parameter Estimation Fails

DOI: 10.1101/2020.04.20.20071928, 2020
Author(s): Daniel E. Platt, Laxmi Parida, Pierre Zalloua
Keyword(s): Genetic Association, Transmission Rate, Association Studies, Genetic Association Studies, Personal Space, Model Parameters, Spread Model, Positive Growth, Positive Growth Rate, Rate Limiting

Abstract:

An opportunity exists in exploring epidemic modeling as a novel way to determine physiological and demic parameters for genetic association studies on a population/environmental (quasi) epidemiological study level. First, the spread of SARS-CoV-2 has produced population-specific lineages; second, epidemic spread model parameters are tied directly to these physiological and demic rates (e.g. incubation time, recovery time, transmission rate); and third, these parameters may serve as novel phenotypes to associate with region-specific genetic mutations as well as demic characteristics (e.g. age structure, cultural observance of personal space, crowdedness). Therefore, we sought to understand whether the parameters of epidemic models could be determined from the trajectory of infections, recovery, and hospitalizations prior to peak, and also to evaluate the quality and comparability of data between jurisdictions reporting the statistics necessary for the analysis of model parameters across populations. We found that, analytically, the pre-peak growth of an epidemic is limited by a subset of the model variates, and that the rate-limiting variables are dominated by the expanding eigenmode of their equations. The variates quickly converge to the ratio of eigenvector components of the positive growth rate, which determines the doubling time. There are 9 parameters and 4 independent components in the eigenmode, leaving 5 undetermined parameters. Those parameters can be strikingly population dependent, and can have significant impact on estimates of hospital loads downstream. Without a sound framework, measurements of infection rates and other parameters are highly corrupted by factors ranging from uneven testing rates to uneven counting and reporting of relevant values. From the standpoint of phenotype parameters, this means that structured experiments must be performed to estimate these parameters in order to perform genetic association studies, or to construct viable models that accurately predict critical quantities such as hospitalization loads.
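
The link between the expanding eigenmode, the doubling time, and undetermined parameters can be illustrated with a minimal sketch. It does not use the paper's 9-parameter model: it assumes a much smaller, hypothetical linearised exposed/infectious system with three rates (`beta` for transmission, `sigma` for leaving the exposed state, `gamma` for recovery), and all function names are illustrative. The positive eigenvalue of the linearised system is the growth rate, the doubling time is ln(2) divided by that rate, and different parameter sets can share the same growth rate, which is why pre-peak growth alone cannot pin down every parameter.

```python
import numpy as np

def dominant_growth_mode(beta, sigma, gamma):
    """Growth rate and eigenvector of the expanding mode.

    Assumed linearised early-epidemic dynamics for (E, I):
        dE/dt = beta * I - sigma * E
        dI/dt = sigma * E - gamma * I
    """
    J = np.array([[-sigma, beta],
                  [sigma, -gamma]])
    eigvals, eigvecs = np.linalg.eig(J)
    k = np.argmax(eigvals.real)           # index of the largest eigenvalue
    growth_rate = eigvals[k].real
    mode = eigvecs[:, k].real
    mode = mode / mode.sum()              # normalise so components sum to 1
    return growth_rate, mode

def doubling_time(growth_rate):
    """Doubling time of exponential growth at the given positive rate."""
    if growth_rate <= 0:
        raise ValueError("growth rate must be positive for doubling")
    return np.log(2) / growth_rate

# Hypothetical rates per day, chosen only for illustration
rate, mode = dominant_growth_mode(beta=0.5, sigma=0.2, gamma=0.1)
print("growth rate:", rate, "doubling time:", doubling_time(rate))
print("E:I ratio in the expanding mode:", mode[0] / mode[1])

# A different parameter set constructed to give the same growth rate:
# solve (rate + sigma2) * (rate + gamma2) = sigma2 * beta2 for beta2
sigma2, gamma2 = 0.4, 0.3
beta2 = (rate + sigma2) * (rate + gamma2) / sigma2
rate2, _ = dominant_growth_mode(beta2, sigma2, gamma2)
print("same growth rate from other parameters:", np.isclose(rate, rate2))
```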
