Identifying disease genes using machine learning and gene functional similarities, assessed through Gene Ontology

PLoS ONE ◽  
2018 ◽  
Vol 13 (12) ◽  
pp. e0208626 ◽  
Author(s):  
Muhammad Asif ◽  
Hugo F. M. C. M. Martiniano ◽  
Astrid M. Vicente ◽  
Francisco M. Couto

Abstract. Identifying disease genes from a vast amount of genetic data is one of the most challenging tasks in the post-genomic era. Moreover, complex diseases present highly heterogeneous genotypes, which makes biological marker identification difficult. Machine learning methods are widely used to identify these markers, but their performance is highly dependent upon the size and quality of available data. In this study, we demonstrated that machine learning classifiers trained on gene functional similarities, using Gene Ontology (GO), can improve the identification of genes involved in complex diseases. For this purpose, we developed a supervised machine learning methodology to predict complex disease genes. The proposed pipeline was assessed using Autism Spectrum Disorder (ASD) candidate genes. A quantitative measure of gene functional similarities was obtained by employing different semantic similarity measures. To infer the hidden functional similarities between ASD genes, various types of machine learning classifiers were built on quantitative semantic similarity matrices of ASD and non-ASD genes. The classifiers trained and tested on ASD and non-ASD gene functional similarities outperformed previously reported ASD classifiers. For example, a Random Forest (RF) classifier achieved an AUC of 0.80 for predicting new ASD genes, which was higher than the reported classifier (0.73). Additionally, this classifier was able to predict 73 novel ASD candidate genes that were enriched for core ASD phenotypes, such as autism and obsessive-compulsive behavior. In addition, predicted genes were also enriched for ASD co-occurring conditions, including Attention Deficit Hyperactivity Disorder (ADHD). We also developed a KNIME workflow with the proposed methodology, which allows users to configure and execute it without requiring machine learning and programming skills.
Machine learning is an effective and reliable technique to decipher ASD mechanisms by identifying novel disease genes, and this study further demonstrated that its performance can be improved by incorporating a quantitative measure of gene functional similarities. Source code and the workflow of the proposed methodology are available at https://github.com/Muh-Asif/ASD-genes-prediction.
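
The AUC values quoted in this abstract (0.80 versus 0.73) can be read as the probability that a classifier scores a randomly chosen disease gene above a randomly chosen non-disease gene. The following is a minimal sketch of that pairwise definition in plain Python. It is not the authors' code; the function name `auc` and all scores are made up for illustration.

```python
from itertools import product


def auc(pos_scores, neg_scores):
    """Pairwise AUC: fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half a win."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores (illustration only, not data from the study).
asd_scores = [0.9, 0.8, 0.4]      # scores assigned to known disease genes
non_asd_scores = [0.7, 0.3, 0.2]  # scores assigned to non-disease genes

example_auc = auc(asd_scores, non_asd_scores)
print(f"AUC = {example_auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a change from 0.73 to 0.80 means a larger share of gene pairs is ordered correctly.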


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Margot Gunning ◽  
Paul Pavlidis

Abstract. Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data, and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: can machine learning aid in the discovery of disease genes? We collected 13 published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, like number of association publications. We found that ML methods which do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not out-perform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.


2020 ◽  
Author(s):  
Jing Xu ◽  
Yuejing Yang

Abstract. Objective: To explore the molecular mechanism and search for candidate biomarkers with predictive and prognostic potential that are detectable in the whole blood of STEMI patients and post-STEMI HF patients. Methods: In this study, we downloaded the GSE60993, GSE61144, GSE66360, and GSE59867 datasets from the NCBI-GEO database. Differentially expressed genes (DEGs) of the datasets were investigated using R. Gene ontology and pathway enrichment analyses were performed via ClueGO, CluePedia, and the DAVID database. A protein interaction network was constructed via STRING. Enriched hub genes were analyzed with Cytoscape software. The LASSO logistic regression algorithm and ROC analysis were used to build machine learning models for predicting STEMI. Hub genes were further validated in post-STEMI HF patients from GSE59867. Results: We identified 90 up-regulated DEGs and 9 down-regulated DEGs shared across the three datasets (|log2FC| ≥ 0.8 and adjusted p value < 0.05). They were mainly enriched in Gene Ontology terms relating to cytokine secretion, pattern recognition receptor signaling pathways, and immune cell activation. A cluster of 8 genes including ITGAM, CLEC4D, SLC2A3, BST1, MCEMP1, PLAUR, GPR97, and MMP25 was found to be significant. A machine learning model built on SLC2A3, CLEC4D, GPR97, PLAUR, and BST1 showed great value for STEMI prediction. In addition, ITGAM and BST1 might be candidate prognostic biomarkers for post-STEMI HF. Conclusions: We re-analyzed the integrated transcriptomic signature of STEMI patients, which showed predictive potential, and revealed new insights and specific prospective biomarkers for STEMI risk stratification and HF development.
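
The DEG criterion stated in this abstract (|log2FC| ≥ 0.8 and adjusted p value < 0.05) can be sketched as a simple filter. The abstract reports that the original analysis was done in R; the Python below is only an illustrative stand-in. The gene symbols and values are hypothetical, and treating "convergence" across datasets as a set intersection is an assumption, not something the abstract specifies.

```python
def select_degs(results, min_abs_log2fc=0.8, max_adj_p=0.05):
    """Split genes into up- and down-regulated sets.

    results maps a gene symbol to a (log2 fold change, adjusted p value) pair.
    A gene passes if |log2FC| >= min_abs_log2fc and adj_p < max_adj_p.
    """
    up, down = set(), set()
    for symbol, (log2fc, adj_p) in results.items():
        if adj_p < max_adj_p and abs(log2fc) >= min_abs_log2fc:
            (up if log2fc > 0 else down).add(symbol)
    return up, down


# Hypothetical per-dataset results (placeholders, not values from GEO).
dataset_1 = {"GENE_A": (1.2, 0.01), "GENE_B": (-0.9, 0.03), "GENE_C": (0.5, 0.001)}
dataset_2 = {"GENE_A": (0.9, 0.02), "GENE_B": (-1.1, 0.04), "GENE_C": (1.0, 0.01)}
dataset_3 = {"GENE_A": (1.5, 0.001), "GENE_B": (-0.8, 0.20), "GENE_C": (1.3, 0.01)}

per_dataset = [select_degs(d) for d in (dataset_1, dataset_2, dataset_3)]

# Assumption: a "shared" DEG must pass the filter in every dataset.
shared_up = set.intersection(*(up for up, _ in per_dataset))
shared_down = set.intersection(*(down for _, down in per_dataset))

print("shared up-regulated:", sorted(shared_up))
print("shared down-regulated:", sorted(shared_down))
```

In this toy input, GENE_B drops out because it is not significant in the third dataset, and GENE_C drops out because its fold change is too small in the first.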


2014 ◽  
Vol 2014 ◽  
pp. 1-10 ◽  
Author(s):  
Jian Zhang ◽  
ZhiHao Xing ◽  
Mingming Ma ◽  
Ning Wang ◽  
Yu-Dong Cai ◽  
...  

Identifying disease genes is one of the most important topics in biomedicine and may facilitate studies on the mechanisms underlying disease. Age-related macular degeneration (AMD) is a serious eye disease; it typically affects older adults and results in a loss of vision due to retina damage. In this study, we attempt to develop an effective method for distinguishing AMD-related genes. Gene ontology and KEGG enrichment analyses of known AMD-related genes were performed, and a classification system was established. In detail, each gene was encoded into a vector of enrichment scores, computed between the gene set consisting of the gene and its direct neighbors in STRING and each gene ontology term or KEGG pathway. Then certain feature-selection methods, including minimum redundancy maximum relevance and incremental feature selection, were adopted to extract key features for the classification system. As a result, 720 GO terms and 11 KEGG pathways were deemed the most important factors for predicting AMD-related genes.


Author(s):  
Souhrid Mukherjee ◽  
Joy D. Cogan ◽  
John H. Newman ◽  
John A. Phillips ◽  
Rizwan Hamid ◽  
...  

2020 ◽  
Author(s):  
Margot Gunning ◽  
Paul Pavlidis

Abstract. Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data, and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: Can machine learning aid in the discovery of disease genes? We collected thirteen published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, like number of association publications. We found that ML methods which do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not out-perform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.

