Forecasting autism gene discovery with machine learning and genome-scale data

Abstract Background: Genes are one of the most powerful windows into the biology of autism, and it has been estimated that perhaps a thousand or more genes may confer risk. However, fewer than 100 genes are currently viewed as having robust enough evidence to be considered true "autism genes". Massive genetic studies are underway to produce data to implicate additional genes, but this approach, although necessary, is costly and slow-moving. Methods: We approach autism gene discovery as a machine learning problem, rather than a genetic association problem, and use genome-scale data as predictors for identifying further genes that have similar properties in the feature space compared to established autism risk genes. This approach, which we call forecASD, integrates spatiotemporal gene expression, heterogeneous network data, and previous gene-level predictors of autism association into an ensemble classifier that yields a single score that indexes each gene’s evidence for being involved in the etiology of autism. Results: We demonstrate that forecASD has substantially increased sensitivity and specificity compared to previous gene-level predictors of autism association, including genetic measures such as TADA. On an independent test set, consisting of newly released pilot data from the SPARK Genomics Consortium, we show that forecASD best predicts which genes will have an excess of likely gene-disrupting (LGD) de novo mutations. We further use independent data from a recent postmortem study of case/control gene expression to show that forecASD is also a significant predictor of genes implicated in ASD through differential expression. Using forecASD results, we show which molecular pathways are currently under-represented in the autism literature and likely represent under-appreciated biological mechanisms of autism. Finally, forecASD correctly predicted 12 of 16 genes implicated at FDR=0.2 by the latest ASD gene discovery study, while also identifying the most likely false positives among the candidate genes. Conclusions: These results demonstrate that forecASD bridges the gap between genetic- and expression-based ASD gene discovery, and provides a data-driven replacement for much of the manual filtering and curation that is a critical step in ensuring the robustness of gene discovery studies.
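
The idea of folding several gene-level predictors into a single score can be illustrated with a toy example. The sketch below is not the forecASD classifier; it assumes three hypothetical per-gene predictors and combines them by averaging percentile ranks, which is one of the simplest possible ensembling schemes. The predictor names, gene names, and score values are invented placeholders.

```python
from statistics import mean

def percentile_ranks(scores):
    """Map each gene's raw score to a rank in (0, 1]; a higher score gets a higher rank.
    Ties are broken by input order, which is good enough for a sketch."""
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}

def ensemble_score(predictors):
    """Average the percentile ranks across predictors to get one score per gene.
    Only genes scored by every predictor are kept."""
    ranked = [percentile_ranks(p) for p in predictors.values()]
    genes = set.intersection(*(set(r) for r in ranked))
    return {g: mean(r[g] for r in ranked) for g in genes}

# Hypothetical inputs: three predictors on different scales (all values are made up).
predictors = {
    "expression_model": {"GENE_A": 0.91, "GENE_B": 0.40, "GENE_C": 0.15},
    "network_model": {"GENE_A": 0.75, "GENE_B": 0.80, "GENE_C": 0.10},
    "prior_score": {"GENE_A": 2.3, "GENE_B": 1.1, "GENE_C": 0.2},
}

combined = ensemble_score(predictors)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Rank averaging is used here only because it needs no training labels; a supervised ensemble such as the one the abstract describes would instead learn how to weight its inputs from established risk genes.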

Forecasting risk gene discovery in autism with machine learning and genome-scale data

Scientific Reports ◽ 10.1038/s41598-020-61288-5 ◽ 2020 ◽ Vol 10 (1) ◽ Cited By ~ 2

Author(s): Leo Brueggeman ◽ Tanner Koomar ◽ Jacob J. Michaelson

Keyword(s): Machine Learning ◽ Gene Discovery ◽ Autism Spectrum ◽ Sufficient Evidence ◽ Pathway Enrichment Analysis ◽ Risk Genes ◽ Risk Gene ◽ New Genes ◽ Genome Scale ◽ Scale Data

Abstract: Genetics has been one of the most powerful windows into the biology of autism spectrum disorder (ASD). It is estimated that a thousand or more genes may confer risk for ASD when functionally perturbed; however, only around 100 genes currently have sufficient evidence to be considered true “autism risk genes”. Massive genetic studies are currently underway to produce data that implicate additional genes. This approach, although necessary, is costly and slow-moving, making identification of putative ASD risk genes with existing data vital. Here, we approach autism risk gene discovery as a machine learning problem, rather than a genetic association problem, by using genome-scale data as predictors to identify new genes with similar properties to established autism risk genes. This ensemble method, forecASD, integrates brain gene expression, heterogeneous network data, and previous gene-level predictors of autism association into an ensemble classifier that yields a single score indexing evidence of each gene’s involvement in the etiology of autism. We demonstrate that forecASD has substantially better performance than previous predictors of autism association in three independent trio-based sequencing studies. Studying forecASD-prioritized genes, we show that forecASD is a robust indicator of a gene’s involvement in ASD etiology, with diverse applications to gene discovery, differential expression analysis, eQTL prioritization, and pathway enrichment analysis.
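
Comparing gene-level predictors on an independent set of genes comes down to asking how well each score separates implicated genes from the rest. The abstract does not say which metric was used; the sketch below assumes ROC AUC as a common choice and computes it directly from its definition. The labels and scores are made-up illustrative values, and `predictor_a` and `predictor_b` are hypothetical names.

```python
def roc_auc(scores, labels):
    """ROC AUC from its definition: the probability that a randomly chosen
    positive gene outscores a randomly chosen negative gene (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical held-out genes: 1 = implicated in the independent study, 0 = not.
labels = [1, 1, 0, 0, 0]
predictor_a = [0.9, 0.7, 0.8, 0.2, 0.1]  # made-up scores from one predictor
predictor_b = [0.6, 0.3, 0.5, 0.4, 0.2]  # made-up scores from another predictor

auc_a = roc_auc(predictor_a, labels)
auc_b = roc_auc(predictor_b, labels)
print(f"predictor_a AUC={auc_a:.3f}, predictor_b AUC={auc_b:.3f}")
```

The pairwise form is quadratic in the number of genes, which is fine for a sketch; a rank-based formulation would be preferable at genome scale.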

Author Correction: Forecasting risk gene discovery in autism with machine learning and genome-scale data

Scientific Reports ◽ 10.1038/s41598-020-77832-2 ◽ 2020 ◽ Vol 10 (1)

Author(s): Leo Brueggeman ◽ Tanner Koomar ◽ Jacob J. Michaelson

Keyword(s): Machine Learning ◽ Gene Discovery ◽ Risk Gene ◽ Genome Scale ◽ Scale Data

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

A machine learning approach to predicting autism risk genes: Validation of known genes and discovery of new candidates

10.1101/463547 ◽ 2018 ◽ Cited By ~ 4

Author(s): Ying Lin ◽ Anjali M. Rajadhyaksha ◽ James B. Potash ◽ Shizhong Han

Keyword(s): Gene Expression ◽ Machine Learning ◽ Human Brain ◽ Candidate Genes ◽ De Novo ◽ Expression Patterns ◽ Autism Spectrum ◽ Gene Expression Patterns ◽ Risk Genes ◽ Gene Level

Abstract: Autism spectrum disorder (ASD) is a complex neurodevelopmental condition with a strong genetic basis. The role of de novo mutations in ASD has been well established, but the set of genes implicated to date is still far from complete. The current study employs a machine learning-based approach to predict ASD risk genes using features from spatiotemporal gene expression patterns in the human brain, gene-level constraint metrics, and other gene variation features. The genes identified through our prediction model were enriched for independent sets of ASD risk genes, and tended to be differentially expressed in ASD brains, especially in the frontal and parietal cortex. The highest-ranked genes not only included those with strong prior evidence for involvement in ASD (for example, TCF20 and FBOX11), but also indicated potentially novel candidates, such as DOCK3, MYCBP2 and CAND1, which are all involved in neuronal development. Through extensive validations, we also showed that our method outperformed state-of-the-art scoring systems for ranking ASD candidate genes. Gene ontology enrichment analysis of our predicted risk genes revealed biological processes clearly relevant to ASD, including neuronal signaling, neurogenesis, and chromatin remodeling, but also highlighted other potential mechanisms that might underlie ASD, such as regulation of RNA alternative splicing and the ubiquitination pathway related to protein degradation. Our study demonstrates that human brain spatiotemporal gene expression patterns and gene-level constraint metrics can help predict ASD risk genes. Our gene ranking system provides a useful resource for prioritizing ASD candidate genes.
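
A statement that predicted genes are "enriched" for an independent gene set is usually backed by an over-representation test. The abstract does not name the test used; the sketch below assumes a one-sided hypergeometric test, and all counts in the example are invented for illustration.

```python
from math import comb

def hypergeom_enrichment_p(universe, selected, reference, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    universe:  total number of genes considered
    selected:  number of top-ranked (predicted) genes
    reference: number of genes in the independent reference set
    overlap:   number of predicted genes that are also in the reference set
    """
    max_k = min(selected, reference)
    total = comb(universe, selected)
    tail = sum(
        comb(reference, k) * comb(universe - reference, selected - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total

# Made-up counts: 20 genes in total, 5 predicted, 4 in the reference set, 3 shared.
p_value = hypergeom_enrichment_p(universe=20, selected=5, reference=4, overlap=3)
print(f"enrichment p-value: {p_value:.4f}")
```

Exact integer arithmetic via `math.comb` keeps the sketch dependency-free; for genome-sized inputs a statistics library's hypergeometric survival function would be the more practical choice.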

A post-method condition analysis of using ensemble machine learning for cancer prognosis and diagnosis: a systematic review

10.21203/rs.2.18222/v1 ◽ 2019

Author(s): Leila Mirsadeghi ◽ Ali Mohammad Banaei-Moghaddam ◽ Seyed Reza Beh-Afarin ◽ Reza Haji Hosseini ◽ Kaveh Kavousi

Keyword(s): Machine Learning ◽ Empirical Studies ◽ Ensemble Methods ◽ Feature Space ◽ Ensemble Classifier ◽ Cancer Prognosis ◽ Learning Approaches ◽ Learning Methods ◽ Machine Learning Methods ◽ Different Types

Abstract Background: Ensemble methods are supervised learning approaches that integrate different types of data or multiple individual classifiers. It has been shown that these methods can improve professional performance. Methods: This study is an attempt to provide an in-depth review of the 45 most relevant articles and aims to introduce 42 ensemble classifier (EC) machine learning methods used for the detection of 18 different types of cancer. Compared to other types of cancer, breast cancer, along with the 22 ensemble methods introduced for its identification, is extensively investigated. The purpose of this study is to identify, map, and analyze the current academic discourse on EC machine learning methods in order to: 1. identify overarching themes emerging from empirical studies as regards EC methods, 2. determine their input data and decision-making strategies, and 3. evaluate relevant statistical procedures. Results: By comparing various approaches, we can introduce a Relevance Vector Machine (RVM)-based ensemble learning method that can provide optimal solutions for problems such as the curse of dimensionality and the high dimensionality of feature space without missing data values. Conclusions: To obtain robust performance and achieve better results, it is suggested to use multi-omics data integration, which has been shown to identify cancers and their subtypes more efficiently.
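
One common decision-making strategy for combining multiple individual classifiers is majority voting. The sketch below is a minimal illustration of that strategy only; it is not one of the reviewed methods and does not use an RVM. The three threshold rules, the feature names, and the sample values are hypothetical.

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Return the label predicted by the largest number of base classifiers.
    With an odd number of binary classifiers there can be no tie."""
    votes = [clf(sample) for clf in classifiers]
    label, _count = Counter(votes).most_common(1)[0]
    return label

# Three hypothetical base classifiers, each a simple threshold rule.
classifiers = [
    lambda s: "malignant" if s["feature_1"] > 0.5 else "benign",
    lambda s: "malignant" if s["feature_2"] > 0.3 else "benign",
    lambda s: "malignant" if s["feature_1"] + s["feature_2"] > 1.0 else "benign",
]

sample = {"feature_1": 0.7, "feature_2": 0.2}  # made-up feature values
prediction = majority_vote(classifiers, sample)
print(prediction)
```

Here the first rule votes "malignant" and the other two vote "benign", so the ensemble returns "benign" even though one base classifier disagrees.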

A Post-Method Condition Analysis of Using Ensemble Machine Learning for Cancer Prognosis and Diagnosis: a systematic review

10.21203/rs.2.10561/v1 ◽ 2019

Author(s): Kaveh Kavousi ◽ Leila Mirsadeghi ◽ Reza Haji Hosseini ◽ Ali Mohammad Banaei-Moghaddam ◽ Seyed Reza Beh-Afarin

Keyword(s): Machine Learning ◽ Empirical Studies ◽ Ensemble Methods ◽ Feature Space ◽ Ensemble Classifier ◽ Cancer Prognosis ◽ Learning Approaches ◽ Learning Methods ◽ Machine Learning Methods ◽ Different Types

Abstract Background: Ensemble methods are supervised learning approaches that integrate different types of data or multiple individual classifiers. It has been shown that these methods can improve professional performance. Methods: This study is an attempt to provide an in-depth review of the 45 most relevant articles and aims to introduce 42 ensemble classifier (EC) machine learning methods used for the detection of 18 different types of cancer. Compared to other types of cancer, breast cancer, along with the 22 ensemble methods introduced for its identification, is extensively investigated. The purpose of this study was to identify, map, and analyze the current academic discourse on EC machine learning methods in order to: 1. identify overarching themes emerging from empirical studies regarding EC methods, 2. determine their input data and decision-making strategies, and 3. evaluate relevant statistical procedures. Results: By comparing various approaches, we can introduce a Relevance Vector Machine (RVM)-based ensemble learning method that can provide optimal solutions for problems such as the curse of dimensionality and the high dimensionality of feature space without missing data values. Conclusions: To obtain robust performance and achieve better results, it is suggested to use multi-omics data integration, which has been shown to identify cancers and their subtypes more efficiently.

A Post-Method Condition Analysis of Using Ensemble Machine Learning for Cancer Prognosis and Diagnosis: a systematic review

10.21203/rs.2.11426/v1 ◽ 2019

Author(s): Kaveh Kavousi ◽ Leila Mirsadeghi ◽ Reza Haji Hosseini ◽ Ali Mohammad Banaei-Moghaddam ◽ Seyed Reza Beh-Afarin

Keyword(s): Machine Learning ◽ Empirical Studies ◽ Ensemble Methods ◽ Feature Space ◽ Ensemble Classifier ◽ Cancer Prognosis ◽ Learning Approaches ◽ Learning Methods ◽ Machine Learning Methods ◽ Different Types

Machine-learning from Pseudomonas putida transcriptomes reveals its transcriptional regulatory network

10.1101/2022.01.11.475908 ◽ 2022

Author(s): Hyun Gyu Lim ◽ Kevin Rychel ◽ Anand V. Sastry ◽ Joshua Mueller ◽ Wei Niu ◽ ...

Keyword(s): Gene Expression ◽ Machine Learning ◽ Pseudomonas Putida ◽ Growth Rates ◽ Regulatory Network ◽ Transcriptional Regulatory Network ◽ Bacterial Gene ◽ Bacterial Physiology ◽ Transcriptional Regulatory ◽ Genome Scale

Bacterial gene expression is orchestrated by numerous transcription factors (TFs). Elucidating how gene expression is regulated is fundamental to understanding bacterial physiology and engineering it for practical use. In this study, a machine-learning approach was applied to uncover the genome-scale transcriptional regulatory network (TRN) in Pseudomonas putida, an important organism for bioproduction. We performed independent component analysis of a compendium of 321 high-quality gene expression profiles, which were previously published or newly generated in this study. We identified 84 groups of independently modulated genes (iModulons) that explain 75.7% of the total variance in the compendium. With these iModulons, we (i) expand our understanding of the regulatory functions of 39 iModulon-associated TFs (e.g., HexR, Zur) by systematic comparison with 1,993 previously reported TF-gene interactions; (ii) outline transcriptional changes after the transition from exponential growth to stationary phase; (iii) capture groups of genes required for utilizing diverse carbon sources and an increased stationary response with slower growth rates; (iv) unveil multiple evolutionary strategies of transcriptome reallocation to achieve fast growth rates; and (v) define an osmotic stimulon, which includes the Type VI secretion system, as coordination of multiple iModulon activity changes. Taken together, this study provides the first quantitative genome-scale TRN for P. putida and a basis for a comprehensive understanding of its complex transcriptome changes in a variety of physiological states.
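
A figure such as "explain 75.7% of the total variance" refers to how much of the expression compendium is recovered when it is rebuilt from the component matrices. The sketch below shows one way such a fraction can be computed, assuming the expression matrix X (genes by samples) has already been centered and is approximated by the product of a gene-weight matrix M and an activity matrix A. The abstract does not specify the exact formula the authors used, and the tiny matrices here are invented for illustration.

```python
def explained_variance(X, M, A):
    """Fraction of the total sum of squares in X captured by the product M @ A.

    X: genes x samples (assumed centered), M: genes x components,
    A: components x samples. Plain nested lists are used to avoid dependencies.
    """
    n_genes, n_samples, n_comp = len(X), len(X[0]), len(A)
    residual = 0.0
    total = 0.0
    for g in range(n_genes):
        for s in range(n_samples):
            recon = sum(M[g][k] * A[k][s] for k in range(n_comp))
            residual += (X[g][s] - recon) ** 2
            total += X[g][s] ** 2
    return 1.0 - residual / total

# Made-up example: 3 genes, 2 samples, 1 component.
X = [[2.0, -2.0], [1.0, -1.0], [-1.0, 1.0]]
M = [[2.0], [1.0], [0.0]]  # the third gene has no weight in the component
A = [[1.0, -1.0]]

ev = explained_variance(X, M, A)
print(f"explained variance: {ev:.3f}")
```

In this toy case the single component reproduces the first two genes exactly and misses the third, so it accounts for 10 of the 12 units of total squared signal.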
