Interpreting machine learning models to investigate circadian regulation and facilitate exploration of clock function

The circadian clock is an important adaptation to life on Earth. Here, we use machine learning to predict complex, temporal, and circadian gene expression patterns in Arabidopsis. Most significantly, we classify circadian genes using DNA sequence features generated de novo from public, genomic resources, facilitating downstream application of our methods with no experimental work or prior knowledge needed. We use local model explanation that is transcript specific to rank DNA sequence features, providing a detailed profile of the potential circadian regulatory mechanisms for each transcript. Furthermore, we can discriminate the temporal phase of transcript expression using the local, explanation-derived, and ranked DNA sequence features, revealing hidden subclasses within the circadian class. Model interpretation/explanation provides the backbone of our methodological advances, giving insight into biological processes and experimental design. Next, we use model interpretation to optimize sampling strategies when we predict circadian transcripts using reduced numbers of transcriptomic timepoints. Finally, we predict the circadian time from a single, transcriptomic timepoint, deriving marker transcripts that are most impactful for accurate prediction; this could facilitate the identification of altered clock function from existing datasets.
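As a rough illustration of what "DNA sequence features generated de novo" could look like, the sketch below counts overlapping k-mers in a sequence. The abstract does not state which feature type was used, so the choice of k-mer counts, the function names, and the example sequence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers, skipping windows that contain non-ACGT characters."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            counts[window] += 1
    return counts


def kmer_feature_vector(sequence, k=3):
    """Return counts for all 4**k possible k-mers in a fixed lexicographic order."""
    counts = kmer_counts(sequence, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


# Hypothetical promoter fragment, not taken from the paper.
promoter = "AAAATATCTAAAATATCT"
vector = kmer_feature_vector(promoter, k=2)
```

A fixed ordering of the k-mers means every transcript yields a vector of the same length, which is what a classifier needs as input.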

Interpreting machine learning models to investigate circadian regulation and facilitate exploration of clock function

10.1101/2021.02.04.429826 ◽

2021 ◽

Cited By ~ 1

Author(s):

Laura-Jayne Gardiner ◽

Rachel Rusholme-Pilcher ◽

Josh Colmer ◽

Hannah Rees ◽

Juan Manuel Crescente ◽

...

Keyword(s):

Machine Learning ◽

Dna Sequence ◽

Expression Patterns ◽

Regulatory Elements ◽

Sequence Features ◽

Model Interpretation ◽

Temporal Phase ◽

Circadian Time ◽

Expression Model ◽

Life On Earth

Abstract: The circadian clock is an important adaptation to life on Earth. Here, we use machine learning to predict complex temporal circadian gene expression patterns in Arabidopsis. Most significantly, we classify circadian genes using DNA sequence features generated from public genomic resources, with no experimental work or prior knowledge needed. We use model explanation to rank DNA sequence features, observing transcript-specific combinations of potential circadian regulatory elements that discriminate temporal phase of expression. Model interpretation/explanation provides the backbone of our methodological advances, giving insight into biological processes and experimental design. Next, we use model interpretation to optimize sampling strategies when we predict circadian transcripts using reduced numbers of transcriptomic timepoints, saving both time and money. Finally, we predict the circadian time from a single transcriptomic timepoint, deriving novel marker transcripts that are most impactful for accurate prediction; this could facilitate the identification of altered clock function from existing datasets.

A machine learning approach to predicting autism risk genes: Validation of known genes and discovery of new candidates

10.1101/463547 ◽

2018 ◽

Cited By ~ 4

Author(s):

Ying Lin ◽

Anjali M. Rajadhyaksha ◽

James B. Potash ◽

Shizhong Han

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Human Brain ◽

Candidate Genes ◽

De Novo ◽

Expression Patterns ◽

Autism Spectrum ◽

Gene Expression Patterns ◽

Risk Genes ◽

Gene Level

Abstract: Autism spectrum disorder (ASD) is a complex neurodevelopmental condition with a strong genetic basis. The role of de novo mutations in ASD has been well established, but the set of genes implicated to date is still far from complete. The current study employs a machine learning-based approach to predict ASD risk genes using features from spatiotemporal gene expression patterns in human brain, gene-level constraint metrics, and other gene variation features. The genes identified through our prediction model were enriched for independent sets of ASD risk genes, and tended to be differentially expressed in ASD brains, especially in the frontal and parietal cortex. The highest-ranked genes not only included those with strong prior evidence for involvement in ASD (for example, TCF20 and FBOX11), but also indicated potentially novel candidates, such as DOCK3, MYCBP2 and CAND1, which are all involved in neuronal development. Through extensive validations, we also showed that our method outperformed state-of-the-art scoring systems for ranking ASD candidate genes. Gene ontology enrichment analysis of our predicted risk genes revealed biological processes clearly relevant to ASD, including neuronal signaling, neurogenesis, and chromatin remodeling, but also highlighted other potential mechanisms that might underlie ASD, such as regulation of RNA alternative splicing and ubiquitination pathway related to protein degradation. Our study demonstrates that human brain spatiotemporal gene expression patterns and gene-level constraint metrics can help predict ASD risk genes. Our gene ranking system provides a useful resource for prioritizing ASD candidate genes.

An Unbiased Predictive Model to Detect DNA Methylation Propensity of CpG Islands in the Human Genome

Current Bioinformatics ◽

10.2174/1574893615999200724145835 ◽

2020 ◽

Vol 15 ◽

Author(s):

Dicle Yalcin ◽

Hasan H. Otu

Keyword(s):

Model Building ◽

De Novo ◽

Cpg Islands ◽

Treatment Strategies ◽

Area Under The Curve ◽

Global Methylation ◽

Sequence Features ◽

A Genome ◽

Combined Features ◽

Epigenetic Repression

Background: Epigenetic repression mechanisms play an important role in gene regulation, specifically in cancer development. In many cases, a CpG island’s (CGI) susceptibility or resistance to methylation has been shown to depend on local DNA sequence features. Objective: To develop unbiased machine learning models, individually and combined for different biological features, that predict the methylation propensity of a CGI. Methods: We developed our model consisting of CGI sequence features on a dataset of 75 sequences (28 prone, 47 resistant) representing a genome-wide methylation structure. We tested our model on two independent datasets that are chromosome (132 sequences) and disease (70 sequences) specific. Results: We provided improvements in prediction accuracy over previous models. Our results indicate that combined features better predict the methylation propensity of a CGI (area under the curve (AUC) ~0.81). Our global methylation classifier performs well on independent datasets, reaching an AUC of ~0.82 for the complete model and an AUC of ~0.88 for the model using select sequences that better represent their classes in the training set. We report certain de novo motifs and transcription factor binding site (TFBS) motifs that are consistently better in separating prone and resistant CGIs. Conclusion: Predictive models for the methylation propensity of CGIs lead to a better understanding of disease mechanisms and can be used to classify genes based on their tendency to contain methylation prone CGIs, which may lead to preventative treatment strategies. MATLAB and Python™ scripts used for model building, prediction, and downstream analyses are available at https://github.com/dicleyalcin/methylProp_predictor.
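For readers unfamiliar with the AUC figures quoted above, the following minimal sketch computes the area under the ROC curve as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name, labels, and scores are illustrative assumptions; this is not the authors' evaluation code, which is in the linked repository.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels: iterable of 0/1 class labels (1 = positive, e.g. methylation prone).
    scores: iterable of classifier scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values around 0.8 indicate that most prone/resistant pairs are ordered correctly.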

Integration of transcriptomic data identifies key hallmark genes in hypertrophic cardiomyopathy

BMC Cardiovascular Disorders ◽

10.1186/s12872-021-02147-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Jing Xu ◽

Xiangdong Liu ◽

Qiming Dai

Keyword(s):

Machine Learning ◽

Hypertrophic Cardiomyopathy ◽

Heart Diseases ◽

Expression Patterns ◽

Support Vector ◽

Rna Seq ◽

Ppi Network ◽

Learning Methods ◽

Transcriptomic Data ◽

Machine Learning Methods

Background: Hypertrophic cardiomyopathy (HCM) represents one of the most common inherited heart diseases. To identify key molecules involved in the development of HCM, gene expression patterns of the heart tissue samples in HCM patients from multiple microarray and RNA-seq platforms were investigated. Methods: The significant genes were obtained through the intersection of two gene sets, corresponding to the identified differentially expressed genes (DEGs) within the microarray data and within the RNA-seq data. Those genes were further ranked using the minimum-Redundancy Maximum-Relevance feature selection algorithm. Moreover, the genes were assessed by three different machine learning methods for classification: support vector machines, random forest and k-Nearest Neighbor. Results: Outstanding results were achieved by taking exclusively the top eight genes of the ranking into consideration. Since the eight genes were identified as candidate HCM hallmark genes, the interactions between them and known HCM disease genes were explored through the protein–protein interaction (PPI) network. Most candidate HCM hallmark genes were found to have direct or indirect interactions with known HCM disease genes in the PPI network, particularly the hub genes JAK2 and GADD45A. Conclusions: This study highlights the value of transcriptomic data integration, in combination with machine learning methods, in providing insight into the key hallmark genes in the genetic etiology of HCM.
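The greedy idea behind minimum-redundancy maximum-relevance ranking can be sketched as follows. Note the substitution: the original mRMR formulation uses mutual information, whereas this sketch uses absolute Pearson correlation as a simpler stand-in for both relevance and redundancy. The function names, gene names, and expression values are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def mrmr_rank(features, target, n_select):
    """Greedily pick features with high relevance to target and low redundancy.

    features: dict mapping feature name -> list of values (one per sample).
    Score = |corr(feature, target)| - mean |corr(feature, already selected)|.
    """
    relevance = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    selected = []
    remaining = list(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            abs(pearson(features[name], features[s])) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < n_select:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical expression values for six samples (three controls, three cases).
target = [0, 0, 0, 1, 1, 1]
features = {
    "gene_a": [1, 2, 1, 5, 6, 5],
    "gene_b": [2, 4, 2, 10, 12, 10],  # exact multiple of gene_a, so fully redundant
    "gene_c": [3, 1, 2, 3, 2, 4],
}
ranking = mrmr_rank(features, target, n_select=3)
```

In this toy data the fully redundant copy is pushed behind the less relevant but independent gene, which is the behavior that distinguishes this kind of ranking from sorting by relevance alone.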

PredNTS: Improved and Robust Prediction of Nitrotyrosine Sites by Integrating Multiple Sequence Features

International Journal of Molecular Sciences ◽

10.3390/ijms22052704 ◽

2021 ◽

Vol 22 (5) ◽

pp. 2704

Author(s):

Andi Nur Nilamyani ◽

Firda Nurul Auliah ◽

Mohammad Ali Moni ◽

Watshara Shoombuatong ◽

Md Mehedi Hasan ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Web Application ◽

Computational Prediction ◽

Vital Role ◽

Machine Learning Algorithms ◽

Recursive Feature Elimination ◽

Post Translational Modification ◽

Multiple Sequence ◽

Sequence Features

Nitrotyrosine, which is generated by numerous reactive nitrogen species, is a type of protein post-translational modification. Identification of site-specific nitration modification on tyrosine is a prerequisite to understanding the molecular function of nitrated proteins. Thanks to the progress of machine learning, computational prediction can play a vital role before biological experimentation. Herein, we developed a computational predictor, PredNTS, by integrating multiple sequence features including K-mer, composition of k-spaced amino acid pairs (CKSAAP), AAindex, and binary encoding schemes. The important features were selected by the recursive feature elimination approach using a random forest classifier. Finally, we linearly combined the successive random forest (RF) probability scores generated by the different, single encoding-employing RF models. The resultant PredNTS predictor achieved an area under the curve (AUC) of 0.910 using five-fold cross validation. It outperformed the existing predictors on a comprehensive and independent dataset. Furthermore, we investigated several machine learning algorithms to demonstrate the superiority of the employed RF algorithm. PredNTS is a useful computational resource for the prediction of nitrotyrosine sites. The web application with the curated datasets of PredNTS is publicly available.
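To make the CKSAAP encoding mentioned above more concrete, here is a minimal sketch of the general technique: for each gap size, it computes the normalized frequency of every ordered amino acid pair separated by that many residues. The function name, the default maximum gap, and the handling of non-standard residues are assumptions for illustration and may differ from the PredNTS implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(peptide, max_gap=2):
    """Composition of k-spaced amino acid pairs for gaps 0..max_gap.

    Returns 400 * (max_gap + 1) values. Within each gap, frequencies are
    normalized by the number of valid pairs, so they sum to 1 when any exist.
    Pairs containing a non-standard residue are ignored (an assumption).
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    seq = peptide.upper()
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        # Residue i is paired with residue i + gap + 1.
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


# Hypothetical three-residue window centered on a tyrosine.
encoded = cksaap("AYA", max_gap=1)
```

Each gap contributes its own block of 400 features, so the encoding captures short-range residue context around the candidate site without requiring an alignment.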

Identification and Expression Analysis of the Genes Involved in the Raffinose Family Oligosaccharides Pathway of Phaseolus vulgaris and Glycine max

Plants ◽

10.3390/plants10071465 ◽

2021 ◽

Vol 10 (7) ◽

pp. 1465

Author(s):

Ramon de Koning ◽

Raphaël Kiekens ◽

Mary Esther Muyoka Toili ◽

Geert Angenon

Keyword(s):

Common Bean ◽

Seed Development ◽

Expression Analysis ◽

De Novo ◽

Expression Patterns ◽

Gene Families ◽

Rna Seq ◽

Raffinose Family Oligosaccharides ◽

Specific Expression ◽

Raffinose Synthase

Raffinose family oligosaccharides (RFO) play an important role in plants but are also considered to be antinutritional factors. A profound understanding of the galactinol and RFO biosynthetic gene families and the expression patterns of the individual genes is a prerequisite for the sustainable reduction of the RFO content in the seeds, without compromising normal plant development and functioning. In this paper, an overview of the annotation and genetic structure of all galactinol- and RFO biosynthesis genes is given for soybean and common bean. In common bean, three galactinol synthase genes, two raffinose synthase genes and one stachyose synthase gene were identified for the first time. To discover the expression patterns of these genes in different tissues, two expression atlases have been created through re-analysis of publicly available RNA-seq data. De novo expression analysis through an RNA-seq study during seed development of three varieties of common bean gave more insight into the expression patterns of these genes during the seed development. The results of the expression analysis suggest that different classes of galactinol- and RFO synthase genes have tissue-specific expression patterns in soybean and common bean. With the obtained knowledge, important galactinol- and RFO synthase genes that specifically play a key role in the accumulation of RFOs in the seeds are identified. These candidate genes may play a pivotal role in reducing the RFO content in the seeds of important legumes which could improve the nutritional quality of these beans and would solve the discomforts associated with their consumption.

MED12-Related (Neuro)Developmental Disorders: A Question of Causality

Genes ◽

10.3390/genes12050663 ◽

2021 ◽

Vol 12 (5) ◽

pp. 663

Author(s):

Stijn van de Plassche ◽

Arjan PM de Brouwer

Keyword(s):

Developmental Disorders ◽

De Novo ◽

Expression Patterns ◽

Mediator Complex ◽

Gene Expression Patterns ◽

Facial Dysmorphism ◽

Regulation Of Transcription ◽

Feeding Difficulties ◽

Missense Variants ◽

Pathogenic Variants

MED12 is a member of the Mediator complex that is involved in the regulation of transcription. Missense variants in MED12 cause FG syndrome, Lujan-Fryns syndrome, and Ohdo syndrome, as well as non-syndromic intellectual disability (ID) in hemizygous males. Recently, female patients with de novo missense variants and de novo protein truncating variants in MED12 were described, resulting in a clinical spectrum centered around ID and Hardikar syndrome without ID. The missense variants are found throughout MED12, whether they are inherited in hemizygous males or de novo in females. They can result in syndromic or nonsyndromic ID. The de novo nonsense variants resulting in Hardikar syndrome, which is characterized by facial clefting, pigmentary retinopathy, biliary anomalies, and intestinal malrotation, are found more N-terminally, whereas the more C-terminally positioned variants are de novo protein truncating variants that cause a severe, syndromic phenotype consisting of ID, facial dysmorphism, short stature, skeletal abnormalities, feeding difficulties, and variable other abnormalities. This broad range of distinct phenotypes calls for a method to distinguish between pathogenic and non-pathogenic variants in MED12. We propose an isogenic iNeuron model to establish the unique gene expression patterns that are associated with the specific MED12 variants. The discovery of these patterns would help in future diagnostics and determine the causality of the MED12 variants.

Machine learning differentiates enzymatic and non-enzymatic metals in proteins

Nature Communications ◽

10.1038/s41467-021-24070-3 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Ryan Feehan ◽

Meghan W. Franklin ◽

Joanna S. G. Slusky

Keyword(s):

Machine Learning ◽

Metal Binding ◽

Binding Sites ◽

Active Sites ◽

De Novo ◽

Enzyme Design ◽

Metal Binding Sites ◽

Ensemble Machine Learning ◽

Machine Learning Model ◽

Physicochemical Features

Abstract: Metalloenzymes are 40% of all enzymes and can perform all seven classes of enzyme reactions. Because of the physicochemical similarities between the active sites of metalloenzymes and inactive metal binding sites, it is challenging to differentiate between them. Yet distinguishing these two classes is critical for the identification of both native and designed enzymes. Because of similarities between catalytic and non-catalytic metal binding sites, finding physicochemical features that distinguish these two types of metal sites can indicate aspects that are critical to enzyme function. In this work, we develop the largest structural dataset of enzymatic and non-enzymatic metalloprotein sites to date. We then use a decision-tree ensemble machine learning model to classify metals bound to proteins as enzymatic or non-enzymatic with 92.2% precision and 90.1% recall. Our model scores electrostatic and pocket lining features as more important than pocket volume, despite the fact that volume is the most quantitatively different feature between enzyme and non-enzymatic sites. Finally, we find our model has overall better performance in a side-to-side comparison against other methods that differentiate enzymatic from non-enzymatic sequences. We anticipate that our model’s ability to correctly identify which metal sites are responsible for enzymatic activity could enable identification of new enzymatic mechanisms and de novo enzyme design.
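The precision and recall figures quoted above follow the standard definitions: precision is the fraction of predicted enzymatic sites that are truly enzymatic, and recall is the fraction of truly enzymatic sites that the model finds. A minimal sketch, using made-up labels rather than the paper's data:

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 1 = enzymatic metal site, 0 = non-enzymatic.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Reporting both numbers matters when classes are imbalanced, because a model can score well on one while doing poorly on the other.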

Genome Wide Analysis of the Transcriptional Profiles in Different Regions of the Developing Rice Grains

Rice ◽

10.1186/s12284-020-00421-4 ◽

2020 ◽

Vol 13 (1) ◽

Author(s):

Ting-Ying Wu ◽

Marlen Müller ◽

Wilhelm Gruissem ◽

Navreet K. Bhullar

Keyword(s):

Gene Expression ◽

Dna Sequence ◽

Expression Patterns ◽

Grain Filling ◽

Aleurone Layer ◽

Rice Grain ◽

Sequence Motifs ◽

Grain Development ◽

Spatio Temporal ◽

Dna Sequence Motifs

Background: Rice is an important food source for humans worldwide. Because of its nutritional and agricultural significance, a number of studies addressed various aspects of rice grain development and grain filling. Nevertheless, the molecular processes underlying grain filling and development, and in particular the contributions of different grain tissues to these processes, are not understood. Main Text: Using RNA-sequencing, we profiled gene expression activity in grain tissues comprised of cross cells (CC), the nucellar epidermis (NE), ovular vascular trace (OVT), endosperm (EN) and the aleurone layer (AL). These tissues were dissected using laser capture microdissection (LCM) at three distinct grain development stages. The mRNA expression datasets offer comprehensive and new insights into the gene expression patterns in different rice grain tissues and their contributions to grain development. Comparative analysis of the different tissues revealed their similar and/or unique functions, as well as the spatio-temporal regulation of common and tissue-specific genes. The expression patterns of genes encoding hormones and transporters indicate an important role of the OVT tissue in metabolite transport during grain development. Gene co-expression network prediction on OVT-specific genes identified several distinct and common development-specific transcription factors. Further analysis of enriched DNA sequence motifs proximal to OVT-specific genes revealed known and novel DNA sequence motifs relevant to rice grain development. Conclusion: Together, the dataset of gene expression in rice grain tissues is a novel and useful resource for further work to dissect the molecular and metabolic processes during rice grain development.

Viral Gene Expression Patterns in Human Herpesvirus 6B-Infected T Cells

Journal of Virology ◽

10.1128/jvi.76.15.7578-7586.2002 ◽

2002 ◽

Vol 76 (15) ◽

pp. 7578-7586 ◽

Cited By ~ 34

Author(s):

Bodil Øster ◽

Per Höllsberg

Keyword(s):

Gene Expression ◽

T Cells ◽

Protein Synthesis ◽

De Novo ◽

Expression Patterns ◽

Human Herpesvirus ◽

Viral Gene Expression ◽

Polymerase Activity ◽

Viral Gene ◽

De Novo Protein Synthesis

Abstract: Herpesvirus gene expression is divided into immediate-early (IE) or α genes, early (E) or β genes, and late (L) or γ genes on the basis of temporal expression and dependency on other gene products. By using real-time PCR, we have investigated the expression of 35 human herpesvirus 6B (HHV-6B) genes in T cells infected by strain PL-1. Kinetic analysis and dependency on de novo protein synthesis and viral DNA polymerase activity suggest that the HHV-6B genes segregate into six separate kinetic groups. The genes expressed early (groups I and II) and late (groups V and VI) corresponded well with IE and L genes, whereas the intermediate groups III and IV contained E and L genes. Although HHV-6B has characteristics similar to those of other roseoloviruses in its overall gene regulation, we detected three B-variant-specific IE genes. Moreover, genes that were independent of de novo protein synthesis clustered in an area of the viral genome that has the lowest identity to the HHV-6A variant. The organization of IE genes in an area of the genome that differs from that of HHV-6A underscores the distinct differences between HHV-6B and HHV-6A and may provide a basis for further molecular and immunological analyses to elucidate their different biological behaviors.
