Improved inference and prediction of bacterial genotype-phenotype associations using pangenome-spanning regressions

2019 ◽  
Author(s):  
John A. Lees ◽  
T. Tien Mai ◽  
Marco Galardini ◽  
Nicole E. Wheeler ◽  
Jukka Corander

ABSTRACT Discovery of influential genetic variants and prediction of phenotypes such as antibiotic resistance are becoming routine tasks in bacterial genomics. Genome-wide association study (GWAS) methods can be applied to study bacterial populations, with a particular emphasis on alignment-free approaches, which are necessitated by the more plastic nature of bacterial genomes. Here we advance bacterial GWAS by introducing a computationally scalable joint modeling framework, where genetic variants covering the entire pangenome are compactly represented by unitigs, and the model fitting is achieved using elastic net penalization. In contrast to current leading GWAS approaches, which test each genotype-phenotype association separately for each variant, our joint modeling approach is shown to lead to increased statistical power while maintaining control of the false positive rate. Our inference procedure also delivers an estimate of the narrow-sense heritability, which is gaining considerable interest in studies of bacteria. Using an extensive set of state-of-the-art bacterial population genomic datasets we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. We expect that these advances will pave the way for the next generation of high-powered association and prediction studies for an increasing number of bacterial species.

mBio ◽  
2020 ◽  
Vol 11 (4) ◽  
Author(s):  
John A. Lees ◽  
T. Tien Mai ◽  
Marco Galardini ◽  
Nicole E. Wheeler ◽  
Samuel T. Horsfield ◽  
...  

ABSTRACT Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. Compared to those of previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold, the variants selected by our joint modeling approach overlap substantially.
IMPORTANCE Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.
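
The joint-modeling idea in the abstract above (all variants fitted at once under an elastic net penalty, rather than one test per variant) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy presence/absence matrix stands in for unitig calls, the phenotype values are invented, and the names `elastic_net`, `lam`, and `alpha` are assumptions chosen for illustration. The solver is plain coordinate descent on centered data, minimizing (1/2n)·||y − Xb||² + lam·(alpha·||b||₁ + (1 − alpha)/2·||b||²).

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 part of the penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def center(column):
    """Subtract the mean so no explicit intercept is needed."""
    mean = sum(column) / len(column)
    return [v - mean for v in column]


def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=100):
    """Coordinate descent for an elastic-net-penalized linear regression.

    X: list of rows (samples) of variant presence/absence values.
    y: list of phenotype values. Returns one coefficient per variant.
    """
    n, p = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(p)]
    yc = center(y)
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of variant j with the residual that ignores variant j.
            rho = 0.0
            for i in range(n):
                others = sum(cols[k][i] * beta[k] for k in range(p) if k != j)
                rho += cols[j][i] * (yc[i] - others)
            rho /= n
            z = sum(v * v for v in cols[j]) / n
            denom = z + lam * (1.0 - alpha)  # ridge part of the penalty
            beta[j] = soft_threshold(rho, lam * alpha) / denom if denom > 0 else 0.0
    return beta


# Toy data (invented): 8 isolates, 2 variants; only variant 0 tracks the phenotype.
X = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
y = [2.0, 2.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0]

beta = elastic_net(X, y, lam=0.1, alpha=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("coefficients:", beta)
print("selected variants:", selected)
```

In this toy case the penalty shrinks the coefficient of the informative variant below its unpenalized value and sets the uninformative one exactly to zero, which is the variant-selection behaviour the abstract relies on. Real analyses would also need to address population structure, which this sketch ignores.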


2021 ◽  
Vol 12 ◽  
Author(s):  
Liwan Fu ◽  
Yuquan Wang ◽  
Tingting Li ◽  
Yue-Qing Hu

As a pivotal research tool, the genome-wide association study has successfully identified numerous genetic variants underlying distinct diseases. However, these identified genetic variants explain only a small proportion of the phenotypic variation for certain diseases, suggesting that there are still more genetic signals to be detected. One of the reasons may be that a one-phenotype, one-variant association study is inefficient at detecting variants with weak effects. Joint analysis of multiple phenotypes may boost the statistical power to detect pathogenic variants with weak genetic effects on complex diseases, providing more clues to their underlying biological mechanisms. We therefore propose a Weighted Combination of multiple phenotypes following Hierarchical Clustering method (WCHC) for simultaneously analyzing multiple phenotypes in association studies. A series of simulations is conducted, and the results show that WCHC is either the most powerful method or comparable with the most powerful competitor in most of the simulation scenarios. Additionally, we evaluated the performance of WCHC in its application to the obesity-related phenotypes from Atherosclerosis Risk in Communities, and several associated variants are reported.


2019 ◽  
Vol 110 (2) ◽  
pp. 473-484 ◽  
Author(s):  
Hassan S Dashti ◽  
Jordi Merino ◽  
Jacqueline M Lane ◽  
Yanwei Song ◽  
Caren E Smith ◽  
...  

ABSTRACT Background Little is known about the contribution of genetic variation to food timing, and breakfast has been determined to exhibit the most heritable meal timing. As breakfast timing and skipping are not routinely measured in large cohort studies, alternative approaches include analyses of correlated traits. Objectives The aim of this study was to elucidate breakfast skipping genetic variants through a proxy-phenotype genome-wide association study (GWAS) for breakfast cereal skipping, a commonly assessed correlated trait. Methods We leveraged the statistical power of the UK Biobank (n = 193,860) to identify genetic variants related to breakfast cereal skipping as a proxy-phenotype for breakfast skipping and applied several in silico approaches to investigate mechanistic functions and links to traits/diseases. Next, we attempted validation of our approach in smaller breakfast skipping GWAS from the TwinUK (n = 2,006) and the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium (n = 11,963). Results In the UK Biobank, we identified 6 independent GWAS variants, including those implicated for caffeine (ARID3B/CYP1A1), carbohydrate metabolism (FGF21), schizophrenia (ZNF804A), and encoding enzymes important for N6-methyladenosine RNA transmethylation (METTL4, YWHAB, and YTHDF3), which regulates the pace of the circadian clock. Expression of identified genes was enriched in the cerebellum. Genome-wide correlation analyses indicated positive correlations with anthropometric traits. Through Mendelian randomization (MR), we observed causal links between genetically determined breakfast skipping and higher body mass index, more depressive symptoms, and smoking. In bidirectional MR, we demonstrated a causal link between being an evening person and skipping breakfast, but not vice versa. 
We observed association of our signals in an independent breakfast skipping GWAS in another British cohort (P = 0.032), TwinUK, but not in a meta-analysis of non-British cohorts from the CHARGE consortium (P = 0.095). Conclusions Our proxy-phenotype GWAS identified 6 genetic variants for breakfast skipping, linking clock regulation with food timing and suggesting a possible beneficial role of regular breakfast intake as part of a healthy lifestyle.


Author(s):  
Andrew A. Crawford ◽  
Sean Bankier ◽  
Elisabeth Altmaier ◽  
Catriona L. K. Barnes ◽  
...  

Abstract The stress hormone cortisol modulates fuel metabolism, cardiovascular homoeostasis, mood, inflammation and cognition. The CORtisol NETwork (CORNET) consortium previously identified a single locus associated with morning plasma cortisol. Identifying additional genetic variants that explain more of the variance in cortisol could provide new insights into cortisol biology and provide statistical power to test the causative role of cortisol in common diseases. The CORNET consortium extended its genome-wide association meta-analysis for morning plasma cortisol from 12,597 to 25,314 subjects and from ~2.2 M to ~7 M SNPs, in 17 population-based cohorts of European ancestries. We confirmed the genetic association with SERPINA6/SERPINA1. This locus contains genes encoding corticosteroid binding globulin (CBG) and α1-antitrypsin. Expression quantitative trait loci (eQTL) analyses undertaken in the STARNET cohort of 600 individuals showed that specific genetic variants within the SERPINA6/SERPINA1 locus influence expression of SERPINA6 rather than SERPINA1 in the liver. Moreover, trans-eQTL analysis demonstrated effects on adipose tissue gene expression, suggesting that variations in CBG levels have an effect on delivery of cortisol to peripheral tissues. Two-sample Mendelian randomisation analyses provided evidence that each genetically-determined standard deviation (SD) increase in morning plasma cortisol was associated with increased odds of chronic ischaemic heart disease (0.32, 95% CI 0.06–0.59) and myocardial infarction (0.21, 95% CI 0.00–0.43) in UK Biobank and similarly in CARDIoGRAMplusC4D. These findings reveal a causative pathway for CBG in determining cortisol action in peripheral tissues and thereby contributing to the aetiology of cardiovascular disease.
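
For readers unfamiliar with two-sample Mendelian randomisation, the sketch below shows one common estimator, the fixed-effect inverse-variance-weighted (IVW) combination of per-variant ratio estimates. The abstract does not say which estimator the authors used, so this is an assumption made purely for illustration; the function name `ivw_estimate` and all summary statistics in the example are invented and are not taken from the study.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW Mendelian randomisation estimate.

    beta_exposure: per-variant effects on the exposure (e.g. cortisol, in SD).
    beta_outcome:  per-variant effects on the outcome (e.g. log odds of disease).
    se_outcome:    standard errors of the outcome effects.
    Returns (causal estimate, standard error), using first-order weights.
    """
    # Each variant's ratio estimate is by/bx, weighted by bx^2 / se_y^2.
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    return estimate, math.sqrt(1.0 / total)


# Invented summary statistics for three hypothetical variants.
bx = [0.1, 0.2, 0.4]
by = [0.05, 0.10, 0.20]
se = [0.01, 0.02, 0.02]

estimate, std_err = ivw_estimate(bx, by, se)
print("IVW estimate:", estimate, "SE:", std_err)
```

Because every toy variant has the same ratio of outcome effect to exposure effect, the weighted average simply recovers that ratio. With real data the ratios differ, and the weights favour variants whose outcome effects are measured most precisely.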


2020 ◽  
Vol 4 (14) ◽  
pp. 3224-3233
Author(s):  
Paul J. Martin ◽  
David M. Levine ◽  
Barry E. Storer ◽  
Sarah C. Nelson ◽  
Xinyuan Dong ◽  
...  

Abstract Many studies have suggested that genetic variants in donors and recipients are associated with survival-related outcomes after allogeneic hematopoietic cell transplantation (HCT), but these results have not been confirmed. Therefore, the utility of testing genetic variants in donors and recipients for risk stratification or understanding mechanisms leading to mortality after HCT has not been established. We tested 122 recipient and donor candidate variants for association with nonrelapse mortality (NRM) and relapse mortality (RM) in a cohort of 2560 HCT recipients of European ancestry with related or unrelated donors. Associations discovered in this cohort were tested for replication in a separate cohort of 1710 HCT recipients. We found that the donor rs1051792 A allele in MICA was associated with a lower risk of NRM. Donor and recipient rs1051792 genotypes were highly correlated, making it statistically impossible to determine whether the donor or recipient genotype accounted for the association. Risks of grade 3 to 4 graft-versus-host disease (GVHD) and NRM in patients with grades 3 to 4 GVHD were lower with donor MICA-129Met but not with MICA-129Val, implicating MICA-129Met in the donor as an explanation for the decreased risk of NRM after HCT. Our analysis of candidate variants did not show any other association with NRM or RM. A genome-wide association study did not identify any other variants associated with NRM or RM.


Biostatistics ◽  
2017 ◽  
Vol 18 (3) ◽  
pp. 477-494 ◽  
Author(s):  
Jakub Pecanka ◽  
Marianne A. Jonker ◽  
Zoltan Bochdanovits ◽  
Aad W. Van Der Vaart ◽  

Summary For over a decade functional gene-to-gene interaction (epistasis) has been suspected to be a determinant in the “missing heritability” of complex traits. However, searching for epistasis on the genome-wide scale has been challenging due to the prohibitively large number of tests which result in a serious loss of statistical power as well as computational challenges. In this article, we propose a two-stage method applicable to existing case-control data sets, which aims to lessen both of these problems by pre-assessing whether a candidate pair of genetic loci is involved in epistasis before it is actually tested for interaction with respect to a complex phenotype. The pre-assessment is based on a two-locus genotype independence test performed in the sample of cases. Only the pairs of loci that exhibit non-equilibrium frequencies are analyzed via a logistic regression score test, thereby reducing the multiple testing burden. Since only the computationally simple independence tests are performed for all pairs of loci while the more demanding score tests are restricted to the most promising pairs, genome-wide association study (GWAS) for epistasis becomes feasible. By design our method provides strong control of the type I error. Its favourable power properties especially under the practically relevant misspecification of the interaction model are illustrated. Ready-to-use software is available. Using the method we analyzed Parkinson’s disease in four cohorts and identified possible interactions within several SNP pairs in multiple cohorts.
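
The first stage described in the summary above (a two-locus genotype independence test among cases, used to decide which pairs go on to the more expensive interaction test) can be sketched as follows. This is an illustration under stated assumptions, not the authors' software: it uses a plain Pearson chi-square statistic on the genotype contingency table, a single fixed threshold (9.488, the 5% critical value for 4 degrees of freedom), and invented genotype data. The second stage, the logistic regression score test, is not implemented here. The names `chi2_independence` and `screen_pairs` are hypothetical.

```python
from itertools import combinations


def chi2_independence(g1, g2):
    """Pearson chi-square statistic for independence of two genotype vectors."""
    n = len(g1)
    levels1, levels2 = sorted(set(g1)), sorted(set(g2))
    counts = {(a, b): 0 for a in levels1 for b in levels2}
    for a, b in zip(g1, g2):
        counts[(a, b)] += 1
    row = {a: sum(counts[(a, b)] for b in levels2) for a in levels1}
    col = {b: sum(counts[(a, b)] for a in levels1) for b in levels2}
    stat = 0.0
    for a in levels1:
        for b in levels2:
            expected = row[a] * col[b] / n  # never zero: levels are observed
            stat += (counts[(a, b)] - expected) ** 2 / expected
    return stat


def screen_pairs(case_genotypes, threshold):
    """Stage 1: keep only locus pairs whose genotypes are dependent in cases."""
    kept = []
    for i, j in combinations(range(len(case_genotypes)), 2):
        if chi2_independence(case_genotypes[i], case_genotypes[j]) > threshold:
            kept.append((i, j))
    return kept


# Invented genotypes (0/1/2 minor-allele counts) for 12 cases at 3 loci.
cases = [
    [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],  # locus 0
    [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],  # locus 1: tracks locus 0 exactly
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],  # locus 2: nearly independent
]

candidates = screen_pairs(cases, threshold=9.488)
print("pairs passed to stage 2:", candidates)
```

Only the pairs returned by `screen_pairs` would be carried forward to the interaction test, which is how the two-stage design reduces the multiple testing burden.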

