Identifying Digenic Disease Genes with Machine Learning

2021 ◽  
Vol 188 (1) ◽  
pp. 10-11
2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Margot Gunning ◽  
Paul Pavlidis

Abstract: Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: can machine learning aid in the discovery of disease genes? We collected 13 published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, such as the number of association publications. We found that ML methods that do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not outperform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.


2018 ◽  
Author(s):  
Muhammad Asif ◽  
Hugo F. M. C. M. Martiniano ◽  
Astrid M. Vicente ◽  
Francisco M. Couto

Abstract: Identifying disease genes from a vast amount of genetic data is one of the most challenging tasks in the post-genomic era. Moreover, complex diseases present highly heterogeneous genotypes, which complicates the identification of biological markers. Machine learning methods are widely used to identify these markers, but their performance is highly dependent upon the size and quality of available data. In this study, we demonstrated that machine learning classifiers trained on gene functional similarities, using Gene Ontology (GO), can improve the identification of genes involved in complex diseases. For this purpose, we developed a supervised machine learning methodology to predict complex disease genes. The proposed pipeline was assessed using Autism Spectrum Disorder (ASD) candidate genes. A quantitative measure of gene functional similarities was obtained by employing different semantic similarity measures. To infer the hidden functional similarities between ASD genes, various types of machine learning classifiers were built on quantitative semantic similarity matrices of ASD and non-ASD genes. The classifiers trained and tested on ASD and non-ASD gene functional similarities outperformed previously reported ASD classifiers. For example, a Random Forest (RF) classifier achieved an AUC of 0.80 for predicting new ASD genes, which was higher than the previously reported classifier (0.73). Additionally, this classifier was able to predict 73 novel ASD candidate genes that were enriched for core ASD phenotypes, such as autism and obsessive-compulsive behavior. In addition, predicted genes were also enriched for ASD co-occurring conditions, including Attention Deficit Hyperactivity Disorder (ADHD). We also developed a KNIME workflow implementing the proposed methodology, which allows users to configure and execute it without requiring machine learning or programming skills.
Machine learning is an effective and reliable technique for deciphering ASD mechanisms by identifying novel disease genes, and this study further demonstrated that classifier performance can be improved by incorporating a quantitative measure of gene functional similarities. Source code and the workflow of the proposed methodology are available at https://github.com/Muh-Asif/ASD-genes-prediction.
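
The AUC figures quoted in this abstract (0.80 versus 0.73) summarise how well a classifier ranks true disease genes above other genes. As a minimal sketch of what that number measures, the Python below computes a rank-based AUC: the fraction of (positive, negative) gene pairs in which the positive gene receives the higher score, with ties counted as half. The function name `roc_auc`, the labels, and the scores are invented for illustration and are not taken from the study or its repository.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: probability that a randomly chosen positive
    outscores a randomly chosen negative (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = known disease gene, 0 = other gene.
labels = [1, 1, 1, 0, 0, 0, 0]
# Hypothetical classifier scores for the same seven genes.
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a move from 0.73 to 0.80 is reported as an improvement.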


Author(s):  
Souhrid Mukherjee ◽  
Joy D. Cogan ◽  
John H. Newman ◽  
John A. Phillips ◽  
Rizwan Hamid ◽  
...  

PLoS ONE ◽  
2018 ◽  
Vol 13 (12) ◽  
pp. e0208626 ◽  
Author(s):  
Muhammad Asif ◽  
Hugo F. M. C. M. Martiniano ◽  
Astrid M. Vicente ◽  
Francisco M. Couto

2020 ◽  
Author(s):  
Margot Gunning ◽  
Paul Pavlidis

Abstract: Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: Can machine learning aid in the discovery of disease genes? We collected thirteen published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, such as the number of association publications. We found that ML methods that do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not outperform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.


BMC Neurology ◽  
2018 ◽  
Vol 18 (1) ◽  
Author(s):  
Xiaoyan Huang ◽  
Hankui Liu ◽  
Xinming Li ◽  
Liping Guan ◽  
Jiankang Li ◽  
...  

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Marta Correia ◽  
Eva Kagenaar ◽  
Daniël Bernardus van Schalkwijk ◽  
Mafalda Bourbon ◽  
Margarida Gama-Carvalho

Abstract: Familial hypercholesterolaemia (FH) increases circulating LDL-C levels and leads to premature cardiovascular disease when undiagnosed or untreated. Current guidelines support genetic testing in patients who meet clinical diagnostic criteria and cascade screening of their family members. However, most hyperlipidaemic subjects do not present pathogenic variants in the known disease genes and most likely suffer from polygenic hypercholesterolaemia, which translates into a relatively low yield of genetic screening programs. This study aims to identify new biomarkers and develop new approaches to improve the identification of individuals carrying monogenic causative variants. Using a machine-learning approach on a paediatric dataset of individuals who were tested for disease-causative genes and had an extended lipid profile, we developed new models able to classify familial hypercholesterolaemia patients with a much higher specificity than currently used methods. The best performing models incorporated parameters absent from the most common FH clinical criteria, namely apoB/apoA-I, TG/apoB and LDL1. These parameters were found to contribute to an improved identification of monogenic individuals. Furthermore, models using only TC and LDL-C levels presented a higher specificity of classification when compared to simple cut-offs. Our results can be applied towards improving the yield of genetic screening programs and reducing the corresponding costs.
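
The abstract above compares models and simple cut-offs in terms of specificity. As a minimal sketch of how that comparison is scored, the Python below applies a single-threshold rule to a toy cohort and computes sensitivity and specificity from the resulting confusion counts. Everything here is an assumption made for illustration: the LDL-C values, the carrier labels, the `CUTOFF` value, and the function name are invented and are neither clinical thresholds nor data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for boolean truth/prediction lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical cohort: LDL-C measurements (arbitrary units) and whether
# each individual carries a monogenic causative variant.
ldl_c = [210, 190, 175, 165, 160, 150, 140, 130]
carrier = [True, True, True, False, True, False, False, False]

CUTOFF = 155  # invented threshold, for illustration only
predicted = [value >= CUTOFF for value in ldl_c]

sens, spec = sensitivity_specificity(carrier, predicted)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A model with higher specificity at the same sensitivity would flag fewer non-carriers for genetic testing, which is the sense in which specificity relates to the yield of a screening program.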


2018 ◽  
Author(s):  
Sergio Picart-Armada ◽  
Steven J. Barrett ◽  
David R. Willé ◽  
Alexandre Perera-Lluna ◽  
Alex Gutteridge ◽  
...  

Abstract:
Background: In-silico identification of potential disease genes has become an essential aspect of drug target discovery. Recent studies suggest that one powerful way to identify successful targets is through the use of genetic and genomic information. Given a known disease gene, leveraging intermolecular connections via networks and pathways seems a natural way to identify other genes and proteins that are involved in similar biological processes, and that can therefore be analysed as additional targets.
Results: Here, we systematically tested the ability of 12 varied network-based algorithms to identify target genes and cross-validated these using gene-disease data from Open Targets on 22 common diseases. We considered two biological networks and six performance metrics, and compared two types of input gene-disease association scores. We also compared several cross-validation schemes and showed that different choices had a remarkable impact on the performance estimates. When seeding biological networks with known drug targets, we found that machine learning and diffusion-based methods are able to find novel targets, showing around 2-4 true hits in the top 20 suggestions. Seeding the networks with genes associated with disease by genetics resulted in poorer performance, below 1 true hit on average. We also observed that the use of a larger network, although noisier, improved overall performance.
Conclusions: We conclude that machine learning and diffusion-based prioritisers are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large effect of several factors on prediction performance, especially the validation strategy, input biological network, and definition of seed disease genes.
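
To make the idea of seeding a network and counting "true hits in the top 20" concrete, here is a minimal sketch of one common diffusion-style prioritiser, random walk with restart, on a toy graph. The abstract does not say which 12 algorithms were tested, so this is an illustrative stand-in rather than the authors' method; the graph, gene names, `restart` value, and held-out set are all invented.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, iters=100, tol=1e-10):
    """Diffuse probability mass from seed nodes over an adjacency-list graph.

    At each step a walker restarts at a seed with probability `restart`,
    otherwise it moves to a uniformly chosen neighbour.
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(iters):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            nbrs = adj[n]
            if not nbrs:
                # Isolated node: send its mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1.0 - restart) * p[n] / len(seeds)
                continue
            share = (1.0 - restart) * p[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical undirected gene network (each edge listed in both directions).
adj = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D", "F"],
    "F": ["E"],
}
seeds = {"A"}          # hypothetical known disease gene used as seed
held_out = {"B", "D"}  # hypothetical true genes hidden from the method

scores = random_walk_with_restart(adj, seeds)
# Rank only non-seed genes, highest score first.
ranking = sorted((n for n in adj if n not in seeds),
                 key=lambda n: scores[n], reverse=True)
k = 2
hits = sum(1 for n in ranking[:k] if n in held_out)
print(ranking, f"hits in top {k}: {hits}")
```

The same top-k hit count, computed on held-out genes under a chosen cross-validation scheme, is the kind of figure the abstract reports; as the authors note, the estimate depends heavily on how the seeds and the validation split are defined.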

