DMWAS: Deep Machine learning omics Wide Association Study & Feature set optimization by clustering & univariate association for Biomarkers discovery as tested on GTEx pilot dataset for death due to heart-attack

2021 ◽  
Author(s):  
Abhishek Narain Singh

Abstract Univariate and multivariate methods for associating genomic variations with the end- or endo-phenotype have been widely used in genome-wide association studies. In addition to encoding the SNPs, we advocate the use of clustering as a novel method to encode the structural variations (SVs) in genomes, such as deletion and insertion polymorphisms (DIPs), copy number variations (CNVs), translocations, inversions, etc., so that each can be used as an independent feature variable value for downstream computation by artificial intelligence methods to predict the endo- or end-phenotype. We introduce a clustering-based encoding scheme for structural variations and omics-based analysis. We conducted a complete association of all genomic variants with the phenotype using deep learning and other machine learning techniques, though other methods such as genetic algorithms can also be applied. Applying this encoding of SVs and one-hot encoding of SNPs to the GTEx V7 pilot DNA variation dataset, we obtained high accuracy with various DMWAS methods, and found logistic regression to work best for the death due to heart attack (MHHRTATT) phenotype. The genomic variants acting as feature sets were then arranged in descending order of their power of impact on the disease or trait phenotype, a step we call optimization, which also takes the top univariate associations into account. Variant Id P1_M_061510_3_402_P at chromosome 3, position 192063195, was found to be most highly associated with MHHRTATT. We present here the top ten optimized genomic variant feature set for the MHHRTATT phenotypic cause of death.
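The encoding and ranking steps named in this abstract can be illustrated with a minimal sketch. It is not the DMWAS implementation: the abstract does not say which clustering algorithm or association statistic is used, so the gap-based 1-D clustering, the absolute Pearson correlation, and every function name below are assumptions chosen only to keep the example dependency-free.

```python
from math import sqrt

def one_hot_genotype(alt_count):
    """Map an alternate-allele count (0, 1 or 2) to a 3-element indicator vector."""
    if alt_count not in (0, 1, 2):
        raise ValueError("expected 0, 1 or 2 alternate alleles")
    return [1 if alt_count == k else 0 for k in range(3)]

def cluster_labels_1d(values, max_gap):
    """Hypothetical SV encoding: sorted neighbours within max_gap share a cluster id."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, idx in zip(order, order[1:]):
        if values[idx] - values[prev] > max_gap:
            current += 1  # a large gap starts a new cluster
        labels[idx] = current
    return labels

def abs_pearson(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in univariate association."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no association signal
    return abs(sxy) / sqrt(sxx * syy)

def rank_features(columns, phenotype):
    """Return feature names in descending order of univariate association."""
    scores = {name: abs_pearson(col, phenotype) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    # Toy data only: SV lengths in base pairs and a binary phenotype.
    print(one_hot_genotype(1))
    print(cluster_labels_1d([120, 130, 5000, 5100, 125, 90000], max_gap=1000))
    print(rank_features({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}, [0, 0, 1, 1]))
```

In a fuller pipeline the encoded columns would feed a classifier such as logistic regression, and the model-based impact would be combined with the univariate ranking. How that combination is done is not specified in the abstract.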

GigaScience ◽  
2020 ◽  
Vol 9 (8) ◽  
Author(s):  
Arash Bayat ◽  
Piotr Szul ◽  
Aidan R O’Brien ◽  
Robert Dunne ◽  
Brendan Hosking ◽  
...  

Abstract Background: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS consider only the additive effect of individual genes, not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions is found in small datasets, large datasets have not been processed yet owing to the high computational complexity of the search for epistatic interactions. Findings: We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to the whole genome of population-scale datasets with 100,000,000 genomic variants and 100,000 samples. Conclusions: Compared with traditional monogenic genome-wide association studies, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high-dimensional genomic data in a manageable time.
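The distinction this abstract draws between additive and epistatic effects can be made concrete with a toy sketch. It does not use VariantSpark or any real PRS tool. The function names, weights, and genotypes are hypothetical, and the product term is only one simple way to model a pairwise interaction.

```python
def additive_score(genotypes, weights):
    """Additive risk score: a weighted sum of per-variant allele counts."""
    return sum(w * g for w, g in zip(weights, genotypes))

def score_with_interaction(genotypes, weights, pair, pair_weight):
    """Additive score plus one pairwise (epistatic) product term for variants i and j."""
    i, j = pair
    return additive_score(genotypes, weights) + pair_weight * genotypes[i] * genotypes[j]

if __name__ == "__main__":
    genotypes = [2, 0, 1]        # toy alternate-allele counts for three variants
    weights = [0.5, 1.0, 0.25]   # hypothetical per-variant effect sizes
    print(additive_score(genotypes, weights))
    print(score_with_interaction(genotypes, weights, pair=(0, 2), pair_weight=0.5))
```

The number of candidate pairs grows quadratically with the number of variants, which is one reason an exhaustive search for interactions becomes expensive on whole-genome data.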


2011 ◽  
Vol 26 (S2) ◽  
pp. 2007-2007
Author(s):  
J. Mendlewicz

The lifetime prevalence of mood disorders is estimated at around 20% in the general population, making them a leading cause of disability worldwide and a major public health issue.1 The etiology of mood disorders is still unknown, but their various phenotypes are believed to be caused by multiple genetic variants interacting in a complex way with environmental vulnerability factors. Therefore, the identification of biomarkers and environmental markers is crucial to improve our understanding and diagnosis as well as our treatments. Despite more than two decades of intensive and costly research to unravel susceptibility genes, and although pathophysiological pathways of interest have been recognized, results have not been consistent so far, and not a single genetic biomarker of depression has been identified and replicated. More recent systematic genome-wide association studies (GWAS) have reported weak associations of some genetic variants in large samples, but multiple rare variants may concur to confer only part of the susceptibility to depression. Structural variations, such as copy-number variations (CNVs), may also be considered promising. Methodological issues and limitations will also be critically discussed in light of the complexity of gene-environment interactions (epigenetic modulation of gene expression)2 and in relation to future prospects for individualized pharmacotherapy of depressive illness.


Author(s):  
Jody Ye ◽  
Kathleen Gillespie ◽  
Santiago Rodriguez

Although genome-wide association studies (GWAS) have identified several hundred loci associated with autoimmune diseases, the underlying mechanisms are still poorly understood. The human genome is more complex than the common single nucleotide polymorphisms (SNPs) interrogated by GWAS arrays. Some structural variants, such as insertions-deletions, copy number variations, and minisatellites, are not well tagged by SNPs and cannot be fully explored by GWAS. Therefore, it is possible that some of these loci may have large effects on autoimmune disease risk. In addition, other layers of regulation, such as gene-gene interactions, epigenetic determinants, and gene-environment interactions, also contribute to the heritability of autoimmune diseases. This review discusses why studying these elements may allow us to gain a more comprehensive understanding of the aetiology of complex autoimmune traits.

