Forest and Trees: Exploring Bacterial Virulence with Genome-wide Association Studies and Machine Learning

Author(s):  
Jonathan P. Allen ◽  
Evan Snitkin ◽  
Nathan B. Pincus ◽  
Alan R. Hauser
2021 ◽  
Vol 25 (1/2) ◽  
pp. 17
Author(s):  
Lingling Jin ◽  
Randy Kutcher ◽  
Yan Yan ◽  
Lipu Wang ◽  
Longhai Li ◽  
...  

Author(s):  
Shanwen Sun ◽  
Benzhi Dong ◽  
Quan Zou

Abstract Over the last decade, genome-wide association studies (GWAS) have discovered thousands of genetic variants underlying complex human diseases and agriculturally important traits. These findings have been used to dissect the biological basis of diseases, to develop new drugs, to advance precision medicine and to boost breeding. However, the potential of GWAS is still underexploited due to methodological limitations. Many challenges have emerged, including detecting epistasis and single-nucleotide polymorphisms (SNPs) with small effects and distinguishing causal variants from other SNPs associated through linkage disequilibrium. These issues have motivated advances in GWAS analyses in two contrasting cultures: statistical modelling and machine learning. In this review, we systematically present the basic concepts, benefits and limitations of both approaches. We further discuss recent efforts to mitigate their weaknesses. Additionally, we summarize the state-of-the-art tools for detecting missed signals, ultrarare mutations and gene–gene interactions and for prioritizing SNPs. Our work can offer both theoretical and practical guidelines for performing GWAS analyses and for developing new, robust methods to fully exploit the potential of GWAS.


GigaScience ◽  
2020 ◽  
Vol 9 (8) ◽  
Author(s):  
Arash Bayat ◽  
Piotr Szul ◽  
Aidan R O’Brien ◽  
Robert Dunne ◽  
Brendan Hosking ◽  
...  

Abstract Background: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS consider only the additive effect of individual genes, not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions is found in small datasets, large datasets have not been processed yet owing to the high computational complexity of the search for epistatic interactions. Findings: We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to the whole genome of population-scale datasets with 100,000,000 genomic variants and 100,000 samples. Conclusions: Compared with traditional monogenic genome-wide association studies, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high-dimensional genomic data in a manageable time.

