Identifying disease genes using machine learning and gene functional similarities, assessed through Gene Ontology

Abstract: Identifying disease genes from a vast amount of genetic data is one of the most challenging tasks in the post-genomic era. Complex diseases also present highly heterogeneous genotypes, which makes biological marker identification difficult. Machine learning methods are widely used to identify these markers, but their performance is highly dependent upon the size and quality of available data. In this study, we demonstrated that machine learning classifiers trained on gene functional similarities, using Gene Ontology (GO), can improve the identification of genes involved in complex diseases. For this purpose, we developed a supervised machine learning methodology to predict complex disease genes. The proposed pipeline was assessed using Autism Spectrum Disorder (ASD) candidate genes. A quantitative measure of gene functional similarities was obtained by employing different semantic similarity measures. To infer the hidden functional similarities between ASD genes, various types of machine learning classifiers were built on quantitative semantic similarity matrices of ASD and non-ASD genes. The classifiers trained and tested on ASD and non-ASD gene functional similarities outperformed previously reported ASD classifiers. For example, a Random Forest (RF) classifier achieved an AUC of 0.80 for predicting new ASD genes, which was higher than that of the previously reported classifier (0.73). Additionally, this classifier was able to predict 73 novel ASD candidate genes that were enriched for core ASD phenotypes, such as autism and obsessive-compulsive behavior. The predicted genes were also enriched for ASD co-occurring conditions, including Attention Deficit Hyperactivity Disorder (ADHD). We also developed a KNIME workflow implementing the proposed methodology, which allows users to configure and execute it without machine learning or programming skills.
Machine learning is an effective and reliable technique for deciphering ASD mechanisms by identifying novel disease genes, and this study further demonstrated that its performance can be improved by incorporating a quantitative measure of gene functional similarities. Source code and the workflow of the proposed methodology are available at https://github.com/Muh-Asif/ASD-genes-prediction.
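
To make the idea of a gene functional similarity matrix concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain Jaccard overlap of GO term sets as a stand-in for the semantic similarity measures mentioned in the abstract (which are not named there), and the gene names and annotations are invented for illustration.

```python
# Toy GO annotations (gene -> set of GO term IDs). The genes and their
# annotations are hypothetical and are NOT taken from the study.
annotations = {
    "GENE_A": {"GO:0007268", "GO:0007399", "GO:0045202"},
    "GENE_B": {"GO:0007268", "GO:0045202"},
    "GENE_C": {"GO:0006412", "GO:0005840"},
    "GENE_D": {"GO:0007399", "GO:0007268"},
}


def jaccard(terms_a, terms_b):
    """Fraction of GO terms shared by two genes (0 = none, 1 = identical).

    Assumption: Jaccard overlap stands in for a true semantic similarity
    measure, which would also use the structure of the GO graph.
    """
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def similarity_matrix(annotations):
    """Return the sorted gene list and the gene-by-gene similarity matrix."""
    genes = sorted(annotations)
    matrix = [
        [jaccard(annotations[g], annotations[h]) for h in genes]
        for g in genes
    ]
    return genes, matrix


genes, matrix = similarity_matrix(annotations)
for gene, row in zip(genes, matrix):
    print(gene, [round(value, 2) for value in row])
```

Each row of such a matrix can then serve as the feature vector of one gene when training a classifier such as a Random Forest on labeled disease and non-disease genes.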

“Guilt by association” is not competitive with genetic association for identifying autism risk genes

Scientific Reports ◽

10.1038/s41598-021-95321-y ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Margot Gunning ◽

Paul Pavlidis

Keyword(s):

Machine Learning ◽

Genetic Association ◽

Gene Networks ◽

Rare Variants ◽

Association Studies ◽

Genetic Disorders ◽

Autism Spectrum ◽

Biological Data ◽

Disease Genes ◽

Risk Genes

Abstract: Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data, and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: can machine learning aid in the discovery of disease genes? We collected 13 published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, like number of association publications. We found that ML methods which do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not out-perform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.

Integrated Gene Expression Profiling Analysis Reveals Potential Molecular Mechanisms and Candidate Biomarkers for Early Risk Stratification and Prediction of STEMI and Post-STEMI Heart Failure Patients

10.21203/rs.3.rs-118025/v1 ◽

2020 ◽

Author(s):

Jing Xu ◽

Yuejing Yang

Keyword(s):

Machine Learning ◽

Gene Ontology ◽

Risk Stratification ◽

Molecular Mechanisms ◽

Interaction Network ◽

Hub Genes ◽

Machine Learning Model ◽

Transcriptomic Signature ◽

Logistic Regression Algorithm ◽

Geo Database

Abstract: Objective: To explore the molecular mechanisms and search for candidate biomarkers with predictive and prognostic potential that are detectable in the whole blood of STEMI patients and post-STEMI HF patients. Methods: In this study, we downloaded the GSE60993, GSE61144, GSE66360, and GSE59867 datasets from the NCBI-GEO database. Differentially expressed genes (DEGs) in the datasets were identified using R. Gene Ontology and pathway enrichment analyses were performed via ClueGO, CluePedia, and the DAVID database. A protein interaction network was constructed via STRING. Enriched hub genes were analyzed with Cytoscape. A LASSO logistic regression algorithm and ROC analysis were used to build machine learning models for predicting STEMI. Hub genes were further validated in post-STEMI HF patients from GSE59867. Results: We identified 90 up-regulated DEGs and 9 down-regulated DEGs that converged across the three datasets (|log2FC| ≥ 0.8 and adjusted p value < 0.05). They were mainly enriched in Gene Ontology terms relating to cytokine secretion, pattern recognition receptor signaling pathways, and immune cell activation. A cluster of 8 genes including ITGAM, CLEC4D, SLC2A3, BST1, MCEMP1, PLAUR, GPR97, and MMP25 was found to be significant. A machine learning model built from SLC2A3, CLEC4D, GPR97, PLAUR, and BST1 showed great value for STEMI prediction. In addition, ITGAM and BST1 might be candidate prognostic biomarkers for post-STEMI HF. Conclusions: We re-analyzed the integrated transcriptomic signature of STEMI patients, which showed predictive potential, and revealed new insights and specific prospective biomarkers for STEMI risk stratification and HF development.
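
The DEG selection rule quoted in the Results (|log2FC| ≥ 0.8 and adjusted p value < 0.05, shared across datasets) can be sketched as below. This is an illustrative sketch only: the authors worked in R, whereas Python is used here, and the gene names, fold changes, and p values are invented placeholders rather than values from the GEO datasets.

```python
def significant_genes(results, min_abs_log2fc=0.8, max_adj_p=0.05):
    """Genes passing the thresholds stated in the abstract.

    `results` maps gene -> (log2 fold change, adjusted p value).
    """
    return {
        gene
        for gene, (log2fc, adj_p) in results.items()
        if abs(log2fc) >= min_abs_log2fc and adj_p < max_adj_p
    }


def shared_degs(datasets):
    """Up- and down-regulated DEGs that are significant in every dataset."""
    up_sets, down_sets = [], []
    for results in datasets:
        sig = significant_genes(results)
        up_sets.append({g for g in sig if results[g][0] > 0})
        down_sets.append({g for g in sig if results[g][0] < 0})
    return set.intersection(*up_sets), set.intersection(*down_sets)


# Hypothetical differential-expression results for three datasets.
dataset_a = {"GENE_1": (1.2, 0.001), "GENE_2": (-0.9, 0.01),
             "GENE_3": (0.5, 0.001), "GENE_4": (1.0, 0.2)}
dataset_b = {"GENE_1": (0.8, 0.04), "GENE_2": (-1.1, 0.03),
             "GENE_3": (0.9, 0.01), "GENE_4": (1.5, 0.001)}
dataset_c = {"GENE_1": (2.0, 0.0001), "GENE_2": (-0.85, 0.049),
             "GENE_3": (1.1, 0.02)}

up, down = shared_degs([dataset_a, dataset_b, dataset_c])
print("shared up-regulated:", sorted(up))
print("shared down-regulated:", sorted(down))
```

Splitting by direction before intersecting keeps a gene that is up-regulated in one dataset and down-regulated in another out of both shared lists.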

Integrating Gene Ontology Based Grouping and Ranking into the Machine Learning Algorithm for Gene Expression Data Analysis

10.1007/978-3-030-87101-7_20 ◽

2021 ◽

pp. 205-214

Author(s):

Malik Yousef ◽

Ahmet Sayıcı ◽

Burcu Bakir-Gungor

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Gene Ontology ◽

Data Analysis ◽

Gene Expression Data ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Expression Data ◽

Gene Expression Data Analysis

Gene Ontology and KEGG Enrichment Analyses of Genes Related to Age-Related Macular Degeneration

BioMed Research International ◽

10.1155/2014/450386 ◽

2014 ◽

Vol 2014 ◽

pp. 1-10 ◽

Cited By ~ 10

Author(s):

Jian Zhang ◽

ZhiHao Xing ◽

Mingming Ma ◽

Ning Wang ◽

Yu-Dong Cai ◽

...

Keyword(s):

Gene Ontology ◽

Feature Selection ◽

Macular Degeneration ◽

Classification System ◽

Underlying Disease ◽

Age Related Macular Degeneration ◽

Disease Genes ◽

Kegg Pathways ◽

Age Related ◽

Go Terms

Identifying disease genes is one of the most important topics in biomedicine and may facilitate studies on the mechanisms underlying disease. Age-related macular degeneration (AMD) is a serious eye disease; it typically affects older adults and results in a loss of vision due to retina damage. In this study, we attempt to develop an effective method for distinguishing AMD-related genes. Gene ontology and KEGG enrichment analyses of known AMD-related genes were performed, and a classification system was established. In detail, each gene was encoded into a vector of enrichment scores, computed between the gene set consisting of the gene and its direct neighbors in STRING and each gene ontology term or KEGG pathway. Then feature-selection methods, including minimum redundancy maximum relevance and incremental feature selection, were adopted to extract key features for the classification system. As a result, 720 GO terms and 11 KEGG pathways were deemed the most important factors for predicting AMD-related genes.
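
Incremental feature selection, as named in the abstract above, can be sketched roughly as follows: features are added one at a time in ranked order, and the prefix that scores best is kept. The sketch makes several assumptions that are not in the source: the ranking is supplied by hand (in the paper it would come from minimum redundancy maximum relevance), a nearest-centroid classifier with leave-one-out accuracy stands in for the paper's unnamed classifier, and the data are invented.

```python
def loo_accuracy(samples, labels, feature_idx):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumption: nearest-centroid is a simple stand-in classifier; every
    class needs at least two samples so a centroid survives the hold-out.
    """
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        centroids = {}
        for cls in set(labels):
            rows = [s for j, (s, l) in enumerate(zip(samples, labels))
                    if j != i and l == cls]
            centroids[cls] = [sum(r[f] for r in rows) / len(rows)
                              for f in feature_idx]

        def squared_distance(centroid):
            return sum((x[f] - m) ** 2 for f, m in zip(feature_idx, centroid))

        predicted = min(sorted(centroids),
                        key=lambda cls: squared_distance(centroids[cls]))
        correct += predicted == y
    return correct / len(samples)


def incremental_feature_selection(samples, labels, ranked_features):
    """Try the top-1, top-2, ... ranked features; keep the best prefix."""
    best_k, best_score = 0, -1.0
    for k in range(1, len(ranked_features) + 1):
        score = loo_accuracy(samples, labels, ranked_features[:k])
        if score > best_score:  # strict: prefer the smaller set on ties
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score


# Hypothetical feature vectors (rows = genes, columns = enrichment scores).
samples = [
    [0.90, 0.10, 0.50],
    [0.80, 0.20, 0.40],
    [0.85, 0.90, 0.60],
    [0.10, 0.15, 0.50],
    [0.20, 0.80, 0.45],
    [0.15, 0.85, 0.55],
]
labels = [1, 1, 1, 0, 0, 0]   # 1 = disease-related, 0 = not (toy labels)
ranked = [1, 0, 2]            # hypothetical ranking, e.g. from mRMR

selected, score = incremental_feature_selection(samples, labels, ranked)
print(selected, score)
```

In this toy data the top-ranked feature alone misclassifies two genes, and adding the second-ranked feature separates the classes, so the search stops improving at two features.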

ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples

BMC Bioinformatics ◽

10.1186/1471-2105-12-389 ◽

2011 ◽

Vol 12 (1) ◽

Cited By ~ 93

Author(s):

Fantine Mordelet ◽

Jean-Philippe Vert

Keyword(s):

Machine Learning ◽

Disease Genes

Identifying Digenic Disease Genes with Machine Learning

American Journal of Medical Genetics Part A ◽

10.1002/ajmg.a.62268 ◽

2021 ◽

Vol 188 (1) ◽

pp. 10-11

Keyword(s):

Machine Learning ◽

Disease Genes

Identifying digenic disease genes via machine learning in the Undiagnosed Diseases Network

The American Journal of Human Genetics ◽

10.1016/j.ajhg.2021.08.010 ◽

2021 ◽

Author(s):

Souhrid Mukherjee ◽

Joy D. Cogan ◽

John H. Newman ◽

John A. Phillips ◽

Rizwan Hamid ◽

...

Keyword(s):

Machine Learning ◽

Disease Genes ◽

Undiagnosed Diseases ◽

Undiagnosed Diseases Network

Prediction of Candidate Primary Immunodeficiency Disease Genes Using a Support Vector Machine Learning Approach

DNA Research ◽

10.1093/dnares/dsp019 ◽

2009 ◽

Vol 16 (6) ◽

pp. 345-351 ◽

Cited By ~ 18

Author(s):

S. Keerthikumar ◽

S. Bhadra ◽

K. Kandasamy ◽

R. Raju ◽

Y.L. Ramachandra ◽

...

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Primary Immunodeficiency ◽

Primary Immunodeficiency Disease ◽

Support Vector ◽

Disease Genes ◽

Learning Approach ◽

Machine Learning Approach ◽

Immunodeficiency Disease

Can machine learning aid in identifying disease genes? The case of autism spectrum disorder

10.1101/2020.11.26.394676 ◽

2020 ◽

Author(s):

Margot Gunning ◽

Paul Pavlidis

Keyword(s):

Machine Learning ◽

Autism Spectrum Disorder ◽

Genetic Association ◽

Gene Networks ◽

Rare Variants ◽

Association Studies ◽

Autism Spectrum ◽

Biological Data ◽

Spectrum Disorder ◽

Disease Genes

Abstract: Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data, and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: Can machine learning aid in the discovery of disease genes? We collected thirteen published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, like number of association publications. We found that ML methods which do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not out-perform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.
