Predicting which genes will respond to perturbations of a TF: TF-independent properties of genes are major determinants of their responsiveness

2020 ◽  
Author(s):  
Yiming Kang ◽  
Michael Brent

Background: The ability to predict which genes will respond to perturbation of a TF's activity serves as a benchmark for our systems-level understanding of transcriptional regulatory networks. In previous work, machine learning models have been trained to predict static gene expression levels in a given sample by using data from the same or similar conditions, including data on TF binding locations, histone marks, or DNA sequence. We report on a different challenge -- training machine learning models that can predict which genes will respond to perturbation of a TF without using any data from the perturbed cells. Results: Existing TF location data (ChIP-Seq) from human K562 cells have no detectable utility for predicting which genes will respond to perturbation of the TF, but data obtained by newer methods in yeast cells are useful. TF-independent features of genes, including their preperturbation expression level and expression variation, are very useful for predicting responses to TF perturbations. This shows that some genes are poised to respond to TF perturbations and others are resistant, shedding significant light on why it has been so difficult to predict responses from binding locations. Certain histone marks (HMs), including H3K4me1 and H3K4me3, have some predictive power, especially when downstream of the transcription start site. In human, the predictive power of HMs is much less than that of gene expression level and variation. Code is available at https://github.com/yiming-kang/TFPertRespExplainer. Conclusions: Sequence-based or epigenetic properties of genes strongly influence their tendency to respond to direct TF perturbations, partially explaining the oft-noted difficulty of predicting responsiveness from TF binding location data. These molecular features are largely reflected in and summarized by the gene's expression level and expression variation.

2020 ◽  
Author(s):  
Emanuele Colonnelli ◽  
Jorge Gallego ◽  
Mounu Prem

The ability to predict corruption is crucial to policy. Using rich micro-data from Brazil, we show that multiple machine learning models display high levels of performance in predicting municipality-level corruption in public spending. We then quantify which individual municipality features and groups of similar characteristics have the highest predictive power. We find that measures of private sector activity, financial development, and human capital are the strongest predictors of corruption, while public sector and political features play a secondary role. Our findings have implications for the design and cost-effectiveness of various anti-corruption policies.


2020 ◽  
Author(s):  
Irene M. Kaplow ◽  
Morgan E. Wirthlin ◽  
Alyssa J. Lawler ◽  
Ashley R. Brown ◽  
Michael Kleyman ◽  
...  

Abstract: Many phenotypes have evolved through gene expression, meaning that differences between species are caused in part by differences in enhancers. Here, we demonstrate that we can accurately predict differences between species in open chromatin status at putative enhancers using machine learning models trained on genome sequence across species. We present a new set of criteria that we designed to explicitly demonstrate if models are useful for studying open chromatin regions whose orthologs are not open in every species. Our approach and evaluation metrics can be applied to any tissue or cell type with open chromatin data available from multiple species.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Jacob Schreiber ◽  
Ritambhara Singh ◽  
Jeffrey Bilmes ◽  
William Stafford Noble

Abstract: Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.
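The pitfall described in this abstract can be illustrated with a toy average-activity baseline. The sketch below is not the authors' code: the function names and the numbers are made up for illustration. It predicts a held-out cell type at each locus as the mean activity of the training cell types, then scores that prediction with a Pearson correlation. When loci behave similarly across cell types, the baseline alone scores highly, which is why a learned model should be compared against it.

```python
from statistics import mean


def average_activity_baseline(train_profiles):
    """Return the per-locus mean activity across the training cell types."""
    n_loci = len(train_profiles[0])
    return [mean(profile[i] for profile in train_profiles) for i in range(n_loci)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical activity values: rows are training cell types, columns are loci.
train = [
    [9.0, 1.0, 5.0, 0.5, 7.0],
    [8.5, 1.5, 4.0, 0.0, 6.5],
    [9.5, 0.5, 6.0, 1.0, 7.5],
]
# Hypothetical held-out cell type measured at the same five loci.
held_out = [9.2, 1.1, 3.0, 0.4, 8.0]

baseline = average_activity_baseline(train)
score = pearson(baseline, held_out)
print(f"baseline correlation with held-out cell type: {score:.3f}")
```

A model evaluated on the same loci is only informative to the extent that it beats this kind of baseline on the held-out cell type.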


Leukemia ◽  
2021 ◽  
Author(s):  
Adrián Mosquera Orgueira ◽  
Marta Sonia González Pérez ◽  
José Ángel Díaz Arias ◽  
Beatriz Antelo Rodríguez ◽  
Natalia Alonso Vence ◽  
...  

2021 ◽  
Author(s):  
Chayaporn Suphavilai ◽  
Hatairat Yingtaweesittikul

Background: Transcriptomic profiles have become crucial information in understanding diseases and improving treatments. While dysregulated gene sets are identified via pathway analysis, various machine learning models have been proposed for predicting phenotypes such as disease type and drug response based on gene expression patterns. However, these models still lack interpretability, as well as the ability to integrate prior knowledge from a protein-protein interaction network. Results: We propose Grandline, a graph convolutional neural network that can integrate gene expression data and the structure of the protein interaction network to predict a specific phenotype. Transforming the interaction network into a spectral domain enables convolution of neighbouring genes and pinpointing high-impact subnetworks, which allow better interpretability of deep learning models. Grandline achieves high phenotype prediction accuracy (67-85% in 8 use cases), comparable to state-of-the-art machine learning models while requiring a smaller number of parameters, allowing it to learn complex but interpretable gene expression patterns from biological datasets. Conclusion: To improve the interpretability of phenotype prediction based on gene expression patterns, we developed Grandline using a graph convolutional neural network technique to integrate protein interaction information. We focus on improving the ability to learn nonlinear relationships between gene expression patterns and a given phenotype and on incorporating prior knowledge, which are the main challenges of machine learning models for biological datasets. The graph convolution allows us to aggregate information from relevant genes and reduces the number of trainable parameters, facilitating model training for a small-sized biological dataset.
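To make the idea of aggregating information from neighbouring genes concrete, here is a minimal sketch of one graph-convolution propagation step using the widely used symmetric normalization with self-loops, D^(-1/2) (A + I) D^(-1/2) x. This is an assumption chosen for illustration: the abstract does not specify Grandline's exact layer, and the adjacency matrix, expression values, and function names below are hypothetical. Learned weights and nonlinearities are omitted.

```python
def normalized_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)] for i in range(n)]


def propagate(adj, expression):
    """One propagation step: each gene mixes its own value with its neighbours'."""
    norm = normalized_adjacency(adj)
    n = len(expression)
    return [sum(norm[i][j] * expression[j] for j in range(n)) for i in range(n)]


# Hypothetical interaction network: gene 0 - gene 1 - gene 2, gene 3 isolated.
adjacency = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
# Hypothetical expression values for the four genes.
expression = [1.0, 0.0, 0.0, 2.0]

smoothed = propagate(adjacency, expression)
print(smoothed)
```

In this toy network, gene 1 picks up signal from gene 0 because they interact, gene 2 stays at zero because none of its direct neighbours is expressed, and the isolated gene 3 keeps its own value.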


2021 ◽  
Author(s):  
Eike Caldeweyher ◽  
Christoph Bauer ◽  
Ali Soltani Tehrani

We present the open-source framework kallisto that enables the efficient and robust calculation of quantum mechanical features for atoms and molecules. For a benchmark set of 49 experimental molecular polarizabilities, the predictive power of the presented method competes against second-order perturbation theory in a converged atomic-orbital basis set at a fraction of its computational costs. Robustness tests within a diverse validation set of more than 80,000 molecules show that the calculation of isotropic molecular polarizabilities has a low failure rate of only 0.3 %. We furthermore present a generally applicable van der Waals radius model that is rooted in atomic static polarizabilities. Efficiency tests show that such radii can even be calculated for small- to medium-size proteins where the largest system (SARS-CoV-2 spike protein) has 42,539 atoms. Following the work of Domingo-Alemenara et al. [Domingo-Alemenara et al., Nat. Comm., 2019, 10, 5811], we present computational predictions for retention times for different chromatographic methods and describe how physicochemical features improve the predictive power of machine-learning models that otherwise only rely on two-dimensional features like molecular fingerprints. Additionally, we developed an internal benchmark set of experimental super-critical fluid chromatography retention times. For those methods, improvements of up to 17 % are obtained when combining molecular fingerprints with physicochemical descriptors. Shapley additive explanation values furthermore show that the physical nature of the applied features can be retained within the final machine-learning models. We generally recommend the kallisto framework as a robust, low-cost, and physically motivated featurizer for upcoming state-of-the-art machine-learning studies.


10.2196/24572 ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. e24572
Author(s):  
Juan Carlos Quiroz ◽  
You-Zhen Feng ◽  
Zhong-Yuan Cheng ◽  
Dana Rezazadegan ◽  
Ping-Kang Chen ◽  
...  

Background: COVID-19 has overwhelmed health systems worldwide. It is important to identify severe cases as early as possible, such that resources can be mobilized and treatment can be escalated. Objective: This study aims to develop a machine learning approach for automated severity assessment of COVID-19 based on clinical and imaging data. Methods: Clinical data (including demographics, signs, symptoms, comorbidities, and blood test results) and chest computed tomography scans of 346 patients from 2 hospitals in the Hubei Province, China, were used to develop machine learning models for automated severity assessment in diagnosed COVID-19 cases. We compared the predictive power of the clinical and imaging data from multiple machine learning models and further explored the use of four oversampling methods to address the imbalanced classification issue. Features with the highest predictive power were identified using the Shapley Additive Explanations framework. Results: Imaging features had the strongest impact on the model output, while a combination of clinical and imaging features yielded the best performance overall. The identified predictive features were consistent with those reported previously. Although oversampling yielded mixed results, it achieved the best model performance in our study. Logistic regression models differentiating between mild and severe cases achieved the best performance for clinical features (area under the curve [AUC] 0.848; sensitivity 0.455; specificity 0.906), imaging features (AUC 0.926; sensitivity 0.818; specificity 0.901), and a combination of clinical and imaging features (AUC 0.950; sensitivity 0.764; specificity 0.919). The synthetic minority oversampling method further improved the performance of the model using combined features (AUC 0.960; sensitivity 0.845; specificity 0.929).
Conclusions: Clinical and imaging features can be used for automated severity assessment of COVID-19 and can potentially help triage patients with COVID-19 and prioritize care delivery to those at a higher risk of severe disease.
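The synthetic minority oversampling idea mentioned in this abstract can be sketched in a few lines. The version below is a simplified illustration, not the study's pipeline or any library's implementation: the function name `smote_like`, the default `k`, and the feature values are all hypothetical. Each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its `k` nearest minority-class neighbours.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Requires at least two minority samples. Simplified for illustration.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature samples for the under-represented (severe) class.
severe_cases = [(0.9, 0.7), (0.8, 0.9), (1.0, 0.8), (0.7, 0.6)]

new_points = smote_like(severe_cases, n_new=6)
for point in new_points:
    print(point)
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region the minority class already occupies. Oversampling of this kind should be applied to the training split only, so that evaluation data contains no synthetic samples.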

