“Guilt by association” is not competitive with genetic association for identifying autism risk genes

Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: can machine learning aid in the discovery of disease genes? We collected 13 published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, such as the number of association publications. We found that ML methods which do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not out-perform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.
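The abstract above does not say which metric was used to evaluate the prioritization studies. As a minimal sketch, assuming the area under the ROC curve (AUROC) as the metric, the comparison between an ML ranking and a generic baseline could look like the following. The gene names, scores, and publication counts are invented toy values, not data from the paper.

```python
def auroc_from_scores(scores, positives):
    """AUROC of a gene ranking.

    Computed as the probability that a randomly chosen positive gene
    scores higher than a randomly chosen negative gene (ties count 0.5).
    `scores` maps gene -> score; `positives` is the set of known genes.
    """
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical genes and values, for illustration only).
known_genes = {"GENE_A", "GENE_C"}
ml_scores = {"GENE_A": 0.9, "GENE_B": 0.8, "GENE_C": 0.4, "GENE_D": 0.1}
publication_counts = {"GENE_A": 120, "GENE_B": 5, "GENE_C": 40, "GENE_D": 2}

ml_auroc = auroc_from_scores(ml_scores, known_genes)
baseline_auroc = auroc_from_scores(publication_counts, known_genes)
```

Scoring a generic annotation such as publication count with the same function gives a baseline that any gene-specific predictor should be expected to beat.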

Can machine learning aid in identifying disease genes? The case of autism spectrum disorder

10.1101/2020.11.26.394676 ◽

2020 ◽

Author(s):

Margot Gunning ◽

Paul Pavlidis

Keyword(s):

Machine Learning ◽

Autism Spectrum Disorder ◽

Genetic Association ◽

Gene Networks ◽

Rare Variants ◽

Association Studies ◽

Autism Spectrum ◽

Biological Data ◽

Spectrum Disorder ◽

Disease Genes

Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: Can machine learning aid in the discovery of disease genes? We collected thirteen published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, such as the number of association publications. We found that ML methods which do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not out-perform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.

Identification of disease-associated loci using machine learning for genotype and network data integration

Bioinformatics ◽

10.1093/bioinformatics/btz310 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5182-5190 ◽

Cited By ~ 4

Author(s):

Luis G Leal ◽

Alessia David ◽

Marjo-Riita Jarvelin ◽

Sylvain Sebert ◽

Minna Männikkö ◽

...

Keyword(s):

Machine Learning ◽

Gene Networks ◽

Association Studies ◽

R Package ◽

Biological Data ◽

Machine Learning Algorithms ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Omics Data ◽

Missing Heritability

Motivation: Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. Results: We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs. Availability and implementation: An R package (cnmtf) is available at https://lgl15.github.io/cnmtf_web/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.

Assessing the contribution of rare-to-common protein-coding variants to circulating metabolic biomarker levels via 412,394 UK Biobank exome sequences

10.1101/2021.12.24.21268381 ◽

2021 ◽

Author(s):

Abhishek Nag ◽

Lawrence Middleton ◽

Ryan S Dhindsa ◽

Dimitrios Vitsios ◽

Eleanor M Wigmore ◽

...

Keyword(s):

Gene Networks ◽

Rare Variants ◽

Association Studies ◽

Low Frequency ◽

Genome Wide Association Studies ◽

UK Biobank ◽

Protein Coding ◽

The UK ◽

Metabolic Biomarkers ◽

Coding Variants

Genome-wide association studies have established the contribution of common and low frequency variants to metabolic biomarkers in the UK Biobank (UKB); however, the role of rare variants remains to be assessed systematically. We evaluated rare coding variants for 198 metabolic biomarkers, including metabolites assayed by Nightingale Health, using exome sequencing in participants from four genetically diverse ancestries in the UKB (N=412,394). Gene-level collapsing analysis, which evaluated a range of genetic architectures, identified a total of 1,303 significant relationships between genes and metabolic biomarkers (p < 1×10⁻⁸), encompassing 207 distinct genes. These include associations between rare non-synonymous variants in GIGYF1 and glucose and lipid biomarkers, SYT7 and creatinine, and others, which may provide insights into novel disease biology. Compared to a previous microarray-based genotyping study in the same cohort, we observed that 40% of gene-biomarker relationships identified in the collapsing analysis were novel. Finally, we applied Gene-SCOUT, a novel tool that utilises the gene-biomarker association statistics from the collapsing analysis to identify genes having similar biomarker fingerprints and thus expand our understanding of gene networks.
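The abstract above names gene-level collapsing analysis without describing its steps. A rough sketch of the general idea, under the assumption that qualifying rare variants are collapsed into a per-gene carrier indicator and carriers are then compared with non-carriers, is shown below. The sample IDs, variant IDs, gene names, and the choice of a plain mean difference (with no covariates or significance test) are illustrative assumptions, not the study's pipeline.

```python
from statistics import mean


def collapse_carriers(genotypes, variant_to_gene, qualifying):
    """Collapse variant calls into gene -> set of carrier samples.

    `genotypes` maps sample -> {variant_id: alternate allele count}.
    Only variants listed in `qualifying` (e.g. rare, predicted damaging)
    contribute to a gene's carrier set.
    """
    carriers = {}
    for sample, calls in genotypes.items():
        for variant, alt_count in calls.items():
            if alt_count > 0 and variant in qualifying:
                gene = variant_to_gene[variant]
                carriers.setdefault(gene, set()).add(sample)
    return carriers


def carrier_mean_difference(carrier_samples, biomarker):
    """Mean biomarker level in carriers minus mean in non-carriers."""
    in_group = [v for s, v in biomarker.items() if s in carrier_samples]
    out_group = [v for s, v in biomarker.items() if s not in carrier_samples]
    return mean(in_group) - mean(out_group)


# Toy data (hypothetical identifiers and values).
genotypes = {"s1": {"var1": 1}, "s2": {"var2": 1}, "s3": {}, "s4": {"var3": 1}}
variant_to_gene = {"var1": "GENE_X", "var2": "GENE_X", "var3": "GENE_Y"}
qualifying = {"var1", "var2"}  # var3 fails the hypothetical filter
biomarker = {"s1": 5.0, "s2": 7.0, "s3": 3.0, "s4": 1.0}

carriers = collapse_carriers(genotypes, variant_to_gene, qualifying)
effect = carrier_mean_difference(carriers["GENE_X"], biomarker)
```

Collapsing pools carriers of different rare variants in the same gene, which is what gives a gene-level test power when each individual variant is seen in only a few participants.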

Random Forests for Genetic Association Studies

Statistical Applications in Genetics and Molecular Biology ◽

10.2202/1544-6115.1691 ◽

2011 ◽

Vol 10 (1) ◽

Cited By ~ 85

Author(s):

Benjamin A Goldstein ◽

Eric C Polley ◽

Farren B. S. Briggs

Keyword(s):

Machine Learning ◽

Genetic Association ◽

Random Forests ◽

Learning Algorithm ◽

Association Studies ◽

Genetic Association Studies ◽

Machine Learning Algorithms ◽

Computationally Efficient ◽

Genetic Studies ◽

Variable Importance Measures

The Random Forests (RF) algorithm has become a commonly used machine learning algorithm for genetic association studies. It is well suited for genetic applications since it is both computationally efficient and models genetic causal mechanisms well. With its growing ubiquity, there has been inconsistent and less than optimal use of RF in the literature. The purpose of this review is to break down the theoretical and statistical basis of RF so that practitioners are able to apply it in their work. An emphasis is placed on showing how the various components contribute to bias and variance, as well as discussing variable importance measures. Applications specific to genetic studies are highlighted. To provide context, RF is compared to other commonly used machine learning algorithms.

Linking Autism Risk Genes to Disruption of Cortical Development

Cells ◽

10.3390/cells9112500 ◽

2020 ◽

Vol 9 (11) ◽

pp. 2500

Author(s):

Marta Garcia-Forn ◽

Andrea Boitnott ◽

Zeynep Akpinar ◽

Silvia De Rubeis

Keyword(s):

Large Scale ◽

Association Studies ◽

Neurodevelopmental Disorder ◽

Cortical Development ◽

Autism Spectrum ◽

Repetitive Behaviors ◽

Genome Wide Association Studies ◽

Restricted Interests ◽

Risk Genes ◽

Whole Exome

Autism spectrum disorder (ASD) is a prevalent neurodevelopmental disorder characterized by impairments in social communication and social interaction, and the presence of repetitive behaviors and/or restricted interests. In the past few years, large-scale whole-exome sequencing and genome-wide association studies have made enormous progress in our understanding of the genetic risk architecture of ASD. While showing a complex and heterogeneous landscape, these studies have led to the identification of genetic loci associated with ASD risk. The intersection of genetic and transcriptomic analyses has also begun to shed light on functional convergences between risk genes, with the mid-fetal development of the cerebral cortex emerging as a critical nexus for ASD. In this review, we provide a concise summary of the latest genetic discoveries on ASD. We then discuss the studies in postmortem tissues, stem cell models, and rodent models that implicate recently identified ASD risk genes in cortical development.

Poster #S142 GENETIC ASSOCIATION STUDIES OF SCHIZOPHRENIA RISK GENES WITH COGNITIVE AND NEUROIMAGING TRAITS IN THE GENUS CONSORTIUM COLLECTION

Schizophrenia Research ◽

10.1016/s0920-9964(14)70421-9 ◽

2014 ◽

Vol 153 ◽

pp. S140

Author(s):

Tracey Petryshen ◽

Gabriella Blokland

Keyword(s):

Genetic Association ◽

Association Studies ◽

Genetic Association Studies ◽

Risk Genes

Open Community Challenge Reveals Molecular Network Modules with Key Roles in Diseases

10.1101/265553 ◽

2018 ◽

Cited By ~ 11

Author(s):

Sarvenaz Choobdar ◽

Mehmet E. Ahsen ◽

Jake Crawford ◽

Mattia Tomasoni ◽

Tao Fang ◽

...

Keyword(s):

Complex Traits ◽

Gene Networks ◽

Association Studies ◽

Molecular Network ◽

Molecular Networks ◽

Disease Genes ◽

Genome Wide Association Studies ◽

Identification Methods ◽

Module Identification ◽

Network Modules

Identification of modules in molecular networks is at the core of many current analysis methods in biomedical research. However, how well different approaches identify disease-relevant modules in different types of gene and protein networks remains poorly understood. We launched the “Disease Module Identification DREAM Challenge”, an open competition to comprehensively assess module identification methods across diverse protein-protein interaction, signaling, gene co-expression, homology, and cancer-gene networks. Predicted network modules were tested for association with complex traits and diseases using a unique collection of 180 genome-wide association studies (GWAS). Our critical assessment of 75 contributed module identification methods reveals novel top-performing algorithms, which recover complementary trait-associated modules. We find that most of these modules correspond to core disease-relevant pathways, which often comprise therapeutic targets and correctly prioritize candidate disease genes. This community challenge establishes benchmarks, tools and guidelines for molecular network analysis to study human disease biology (https://synapse.org/modulechallenge).

Rare and de novo variants in 827 congenital diaphragmatic hernia probands implicate LONP1 and ALYREF as new candidate risk genes

10.1101/2021.06.01.21257928 ◽

2021 ◽

Author(s):

Lu Qiao ◽

Le Xu ◽

Lan Yu ◽

Julia Wynn ◽

Rebecca Hernan ◽

...

Keyword(s):

Congenital Diaphragmatic Hernia ◽

Diaphragmatic Hernia ◽

Rare Variants ◽

De Novo ◽

Disease Genes ◽

Risk Genes ◽

Association Analyses ◽

Significant Enrichment ◽

Coding Variants

Congenital diaphragmatic hernia (CDH) is a severe congenital anomaly that is often accompanied by other anomalies. Although the role of genetics in the pathogenesis of CDH has been established, only a small number of disease genes have been identified. To further investigate the genetics of CDH, we analyzed de novo coding variants in 827 proband-parent trios and confirmed an overall significant enrichment of damaging de novo variants, especially in constrained genes. We identified LONP1 (Lon Peptidase 1, Mitochondrial) and ALYREF (Aly/REF Export Factor) as novel candidate CDH genes based on de novo variants at a false discovery rate below 0.05. We also performed ultra-rare variant association analyses in 748 cases and 11,220 ancestry-matched population controls and identified LONP1 as a risk gene contributing to CDH through both de novo and ultra-rare inherited, largely heterozygous variants that cluster in the core of the domains and segregate with CDH in familial cases. Approximately 3% of our CDH cohort was heterozygous for ultra-rare predicted damaging variants in LONP1; these individuals have a range of clinical phenotypes, including other anomalies in some individuals and higher mortality and requirement for extracorporeal membrane oxygenation. Mice with lung epithelium-specific deletion of Lonp1 die immediately after birth and have reduced lung growth and branching, which may at least partially explain the high mortality in humans. Our findings of both de novo and inherited rare variants in the same gene may have implications for the design and analysis of other genetic studies of congenital anomalies.
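The abstract above reports candidate genes at a false discovery rate below 0.05 but does not state which FDR procedure was used. As an illustrative sketch only, assuming the widely used Benjamini-Hochberg procedure, per-gene p-values can be converted to adjusted values (q-values) as follows. The p-values are made up for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    For the i-th smallest p-value (rank i of m), the raw adjustment is
    p * m / i; a running minimum taken from the largest rank downwards
    keeps the adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy per-gene p-values (hypothetical, for illustration only).
gene_pvalues = [0.001, 0.04, 0.03, 0.5]
qvalues = benjamini_hochberg(gene_pvalues)
significant = [i for i, q in enumerate(qvalues) if q < 0.05]
```

In this toy input only the first gene passes the 0.05 threshold, even though three of the raw p-values are below 0.05, which shows how the correction tightens the cutoff when many genes are tested.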

Chances and challenges of machine learning‐based disease classification in genetic association studies illustrated on age‐related macular degeneration

Genetic Epidemiology ◽

10.1002/gepi.22336 ◽

2020 ◽

Vol 44 (7) ◽

pp. 759-777

Author(s):

Felix Guenther ◽

Caroline Brandl ◽

Thomas W. Winkler ◽

Veronika Wanner ◽

Klaus Stark ◽

...

Keyword(s):

Machine Learning ◽

Macular Degeneration ◽

Genetic Association ◽

Association Studies ◽

Genetic Association Studies ◽

Disease Classification ◽

Age Related Macular Degeneration ◽

Age Related

A maximum flow-based network approach for identification of stable noncoding biomarkers associated with the multigenic neurological condition, autism

BioData Mining ◽

10.1186/s13040-021-00262-x ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Maya Varma ◽

Kelley M. Paskov ◽

Brianna S. Chrisman ◽

Min Woo Sun ◽

Jae-Yoon Jung ◽

...

Keyword(s):

Machine Learning ◽

Predictive Accuracy ◽

Disease Risk ◽

Genetic Disorders ◽

Maximum Flow ◽

Autism Spectrum ◽

Whole Genome Sequence ◽

Learning Approaches ◽

Model Stability ◽

Improve Model

Background: Machine learning approaches for predicting disease risk from high-dimensional whole genome sequence (WGS) data often result in unstable models that can be difficult to interpret, limiting the identification of putative sets of biomarkers. Here, we design and validate a graph-based methodology based on maximum flow, which leverages the presence of linkage disequilibrium (LD) to identify stable sets of variants associated with complex multigenic disorders. Results: We apply our method to a previously published logistic regression model trained to identify variants in simple repeat sequences associated with autism spectrum disorder (ASD); this L1-regularized model exhibits high predictive accuracy yet demonstrates great variability in the features selected from over 230,000 possible variants. In order to improve model stability, we extract the variants assigned non-zero weights in each of 5 cross-validation folds and then assemble the five sets of features into a flow network subject to LD constraints. The maximum flow formulation allowed us to identify 55 variants, which we show to be more stable than the features identified by the original classifier. Conclusion: Our method allows for the creation of machine learning models that can identify predictive variants. Our results help pave the way towards biomarker-based diagnosis methods for complex genetic disorders.
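The abstract above describes variability in the features selected across cross-validation folds but does not say how stability was measured. One common way to quantify it, offered here as an assumption rather than the authors' metric, is the mean pairwise Jaccard similarity between the per-fold sets of variants with non-zero weights. This sketch does not implement the maximum flow method itself; the variant names are placeholders and three folds are used instead of five for brevity.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(fold_features):
    """Average Jaccard similarity over all pairs of per-fold feature sets."""
    pairs = list(combinations(fold_features, 2))
    if not pairs:
        raise ValueError("need at least two folds to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy selections from three folds (hypothetical variant IDs).
folds = [
    {"v1", "v2", "v3"},
    {"v1", "v2", "v4"},
    {"v1", "v2", "v3"},
]
stability = mean_pairwise_jaccard(folds)
```

A value of 1.0 means every fold selected the same variants, while values near 0 mean the folds barely overlap.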
