scholarly journals Explainable Deep Relational Networks for Predicting Compound-Protein Affinities and Contacts

2019 ◽  
Author(s):  
Mostafa Karimi ◽  
Di Wu ◽  
Zhangyang Wang ◽  
Yang Shen

AbstractPredicting compound-protein affinity is beneficial for accelerating drug discovery. Doing so without the often-unavailable structure data is gaining interest. However, recent progress in structure-free affinity prediction, made by machine learning, focuses on accuracy but leaves much to be desired for interpretability. Defining inter-molecular contacts underlying affinities as a vehicle for interpretability, our large-scale interpretability assessment finds previously-used attention mechanisms inadequate. We thus formulate a hierarchical multi-objective learning problem whose predicted contacts form the basis for predicted affinities. And we solve the problem by embedding protein sequences (by hierarchical recurrent neural networks) and compound graphs (by graph neural networks) with joint attentions between protein residues and compound atoms. We further introduce three methodological advances to enhance interpretability: (1) structure-aware regularization of attentions using protein sequence-predicted solvent exposure and residue-residue contact maps; (2) supervision of attentions using known inter-molecular contacts in training data; and (3) an intrinsically explainable architecture where atomic-level contacts or “relations” lead to molecular-level affinity prediction. The first two and all three advances result in DeepAffinity+ and DeepRelations, respectively. Our methods show generalizability in affinity prediction for molecules that are new and dissimilar to training examples. Moreover, they show superior interpretability compared to state-of-the-art interpretable methods: with similar or better affinity prediction, they boost the AUPRC of contact prediction by around 33, 35, 10, and 9-fold for the default test, new-compound, new-protein, and both-new sets, respectively. We further demonstrate their potential utilities in contact-assisted docking, structure-free binding site prediction, and structure-activity relationship studies without docking. Our study represents the first model development and systematic model assessment dedicated to interpretable machine learning for structure-free compound-protein affinity prediction.

2021 ◽  
Vol 11 (2) ◽  
pp. 472
Author(s):  
Hyeongmin Cho ◽  
Sangkyun Lee

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.


2020 ◽  
Vol 8 (Suppl 3) ◽  
pp. A62-A62
Author(s):  
Dattatreya Mellacheruvu ◽  
Rachel Pyke ◽  
Charles Abbott ◽  
Nick Phillips ◽  
Sejal Desai ◽  
...  

BackgroundAccurately identified neoantigens can be effective therapeutic agents in both adjuvant and neoadjuvant settings. A key challenge for neoantigen discovery has been the availability of accurate prediction models for MHC peptide presentation. We have shown previously that our proprietary model based on (i) large-scale, in-house mono-allelic data, (ii) custom features that model antigen processing, and (iii) advanced machine learning algorithms has strong performance. We have extended upon our work by systematically integrating large quantities of high-quality, publicly available data, implementing new modelling algorithms, and rigorously testing our models. These extensions lead to substantial improvements in performance and generalizability. Our algorithm, named Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), is integrated into the ImmunoID NeXT Platform®, our immuno-genomics and transcriptomics platform specifically designed to enable the development of immunotherapies.MethodsIn-house immunopeptidomic data was generated using stably transfected HLA-null K562 cells lines that express a single HLA allele of interest, followed by immunoprecipitation using W6/32 antibody and LC-MS/MS. Public immunopeptidomics data was downloaded from repositories such as MassIVE and processed uniformly using in-house pipelines to generate peptide lists filtered at 1% false discovery rate. Other metrics (features) were either extracted from source data or generated internally by re-processing samples utilizing the ImmunoID NeXT Platform.ResultsWe have generated large-scale and high-quality immunopeptidomics data by using approximately 60 mono-allelic cell lines that unambiguously assign peptides to their presenting alleles to create our primary models. Briefly, our primary ‘binding’ algorithm models MHC-peptide binding using peptide and binding pockets while our primary ‘presentation’ model uses additional features to model antigen processing and presentation. Both primary models have significantly higher precision across all recall values in multiple test data sets, including mono-allelic cell lines and multi-allelic tissue samples. To further improve the performance of our model, we expanded the diversity of our training set using high-quality, publicly available mono-allelic immunopeptidomics data. Furthermore, multi-allelic data was integrated by resolving peptide-to-allele mappings using our primary models. We then trained a new model using the expanded training data and a new composite machine learning architecture. The resulting secondary model further improves performance and generalizability across several tissue samples.ConclusionsImproving technologies for neoantigen discovery is critical for many therapeutic applications, including personalized neoantigen vaccines, and neoantigen-based biomarkers for immunotherapies. Our new and improved algorithm (SHERPA) has significantly higher performance compared to a state-of-the-art public algorithm and furthers this objective.


2021 ◽  
Author(s):  
Hyeyoung Koh ◽  
Hannah Beth Blum

This study presents a machine learning-based approach for sensitivity analysis to examine how parameters affect a given structural response while accounting for uncertainty. Reliability-based sensitivity analysis involves repeated evaluations of the performance function incorporating uncertainties to estimate the influence of a model parameter, which can lead to prohibitive computational costs. This challenge is exacerbated for large-scale engineering problems which often carry a large quantity of uncertain parameters. The proposed approach is based on feature selection algorithms that rank feature importance and remove redundant predictors during model development which improve model generality and training performance by focusing only on the significant features. The approach allows performing sensitivity analysis of structural systems by providing feature rankings with reduced computational effort. The proposed approach is demonstrated with two designs of a two-bay, two-story planar steel frame with different failure modes: inelastic instability of a single member and progressive yielding. The feature variables in the data are uncertainties including material yield strength, Young’s modulus, frame sway imperfection, and residual stress. The Monte Carlo sampling method is utilized to generate random realizations of the frames from published distributions of the feature parameters, and the response variable is the frame ultimate strength obtained from finite element analyses. Decision trees are trained to identify important features. Feature rankings are derived by four feature selection techniques including impurity-based, permutation, SHAP, and Spearman's correlation. Predictive performance of the model including the important features are discussed using the evaluation metric for imbalanced datasets, Matthews correlation coefficient. Finally, the results are compared with those from reliability-based sensitivity analysis on the same example frames to show the validity of the feature selection approach. As the proposed machine learning-based approach produces the same results as the reliability-based sensitivity analysis with improved computational efficiency and accuracy, it could be extended to other structural systems.


2019 ◽  
Author(s):  
Mojtaba Haghighatlari ◽  
Gaurav Vishwakarma ◽  
Mohammad Atif Faiz Afzal ◽  
Johannes Hachmann

<div><div><div><p>We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the remoter areas of the RI distribution: We bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. By adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds, and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions by model augmentation in the extrapolation region on such a large scale. These results underscore the tremendous potential of machine learning in facilitating molecular (hyper)screening approaches on a massive scale and in accelerating the discovery of new compounds and materials, such as organic molecules with high-RI for applications in opto-electronics.</p></div></div></div>


2020 ◽  
Author(s):  
Paul Francoeur ◽  
Tomohide Masuda ◽  
David R. Koes

One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and there does not exist a standard dataset of sufficient size to compare performance between models. We present a new dataset for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank and perform a comprehensive evaluation of grid-based convolutional neural network models on this dataset. We also demonstrate how the partitioning of the training data and test data can impact the results of models trained with the PDBbind dataset, how performance improves by adding more, lower-quality training data, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best performing model, an ensemble of 5 densely connected convolutional newtworks, achieves a root mean squared error of 1.42 and Pearson R of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and a 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized dataset for training machine learning models to recognize ligands in non-cognate target structures while also greatly expanding the number of poses available for training. In order to facilitate community adoption of this dataset for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.


2021 ◽  
Author(s):  
Aurore Lafond ◽  
Maurice Ringer ◽  
Florian Le Blay ◽  
Jiaxu Liu ◽  
Ekaterina Millan ◽  
...  

Abstract Abnormal surface pressure is typically the first indicator of a number of problematic events, including kicks, losses, washouts and stuck pipe. These events account for 60–70% of all drilling-related nonproductive time, so their early and accurate detection has the potential to save the industry billions of dollars. Detecting these events today requires an expert user watching multiple curves, which can be costly, and subject to human errors. The solution presented in this paper is aiming at augmenting traditional models with new machine learning techniques, which enable to detect these events automatically and help the monitoring of the drilling well. Today’s real-time monitoring systems employ complex physical models to estimate surface standpipe pressure while drilling. These require many inputs and are difficult to calibrate. Machine learning is an alternative method to predict pump pressure, but this alone needs significant labelled training data, which is often lacking in the drilling world. The new system combines these approaches: a machine learning framework is used to enable automated learning while the physical models work to compensate any gaps in the training data. The system uses only standard surface measurements, is fully automated, and is continuously retrained while drilling to ensure the most accurate pressure prediction. In addition, a stochastic (Bayesian) machine learning technique is used, which enables not only a prediction of the pressure, but also the uncertainty and confidence of this prediction. Last, the new system includes a data quality control workflow. It discards periods of low data quality for the pressure anomaly detection and enables to have a smarter real-time events analysis. The new system has been tested on historical wells using a new test and validation framework. The framework runs the system automatically on large volumes of both historical and simulated data, to enable cross-referencing the results with observations. In this paper, we show the results of the automated test framework as well as the capabilities of the new system in two specific case studies, one on land and another offshore. Moreover, large scale statistics enlighten the reliability and the efficiency of this new detection workflow. The new system builds on the trend in our industry to better capture and utilize digital data for optimizing drilling.


2022 ◽  
pp. 1559-1575
Author(s):  
Mário Pereira Véstias

Machine learning is the study of algorithms and models for computing systems to do tasks based on pattern identification and inference. When it is difficult or infeasible to develop an algorithm to do a particular task, machine learning algorithms can provide an output based on previous training data. A well-known machine learning model is deep learning. The most recent deep learning models are based on artificial neural networks (ANN). There exist several types of artificial neural networks including the feedforward neural network, the Kohonen self-organizing neural network, the recurrent neural network, the convolutional neural network, the modular neural network, among others. This article focuses on convolutional neural networks with a description of the model, the training and inference processes and its applicability. It will also give an overview of the most used CNN models and what to expect from the next generation of CNN models.


2020 ◽  
Vol 12 (11) ◽  
pp. 1743
Author(s):  
Artur M. Gafurov ◽  
Oleg P. Yermolayev

Transition from manual (visual) interpretation to fully automated gully detection is an important task for quantitative assessment of modern gully erosion, especially when it comes to large mapping areas. Existing approaches to semi-automated gully detection are based on either object-oriented selection based on multispectral images or gully selection based on a probabilistic model obtained using digital elevation models (DEMs). These approaches cannot be used for the assessment of gully erosion on the territory of the European part of Russia most affected by gully erosion due to the lack of national large-scale DEM and limited resolution of open source multispectral satellite images. An approach based on the use of convolutional neural networks for automated gully detection on the RGB-synthesis of ultra-high resolution satellite images publicly available for the test region of the east of the Russian Plain with intensive basin erosion has been proposed and developed. The Keras library and U-Net architecture of convolutional neural networks were used for training. Preliminary results of application of the trained gully erosion convolutional neural network (GECNN) allow asserting that the algorithm performs well in detecting active gullies, well differentiates gullies from other linear forms of slope erosion — rills and balkas, but so far has errors in detecting complex gully systems. Also, GECNN does not identify a gully in 10% of cases and in another 10% of cases it identifies not a gully. To solve these problems, it is necessary to additionally train the neural network on the enlarged training data set.


Author(s):  
A. A. Meldo ◽  
L. V. Utkin ◽  
T. N. Trofimova ◽  
M. A. Ryabinin ◽  
V. M. Moiseenko ◽  
...  

The relevance of developing an intelligent automated diagnostic system (IADS) for lung cancer (LC) detection stems from the social significance of this disease and its leading position among all cancer diseases. Theoretically, the use of IADS is possible at a stage of screening as well as at a stage of adjusted diagnosis of LC. The recent approaches to training the IADS do not take into account the clinical and radiological classification as well as peculiarities of the LC clinical forms, which are used by the medical community. This defines difficulties and obstacles of using the available IADS. The authors are of the opinion that the closeness of a developed IADS to the «doctor’s logic» contributes to a better reproducibility and interpretability of the IADS usage results. Most IADS described in the literature have been developed on the basis of neural networks, which have several disadvantages that affect reproducibility when using the system. This paper proposes a composite algorithm using machine learning methods such as Deep Forest and Siamese neural network, which can be regarded as a more efficient approach for dealing with a small amount of training data and optimal from the reproducibility point of view. The open datasets used for training IADS include annotated objects which in some cases are not confirmed morphologically. The paper provides a description of the LIRA dataset developed by using the diagnostic results of St. Petersburg Clinical Research Center of Specialized Types of Medical Care (Oncology), which includes only computed tomograms of patients with the verified diagnosis. The paper considers stages of the machine learning process on the basis of the shape features, of the internal structure features as well as a new developed system of differential diagnosis of LC based on the Siamese neural networks. A new approach to the feature dimension reduction is also presented in the paper, which aims more efficient and faster learning of the system.


2020 ◽  
Author(s):  
Paul Francoeur ◽  
Tomohide Masuda ◽  
David R. Koes

One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and there does not exist a standard dataset of sufficient size to compare performance between models. We present a new dataset for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank and perform a comprehensive evaluation of grid-based convolutional neural network models on this dataset. We also demonstrate how the partitioning of the training data and test data can impact the results of models trained with the PDBbind dataset, how performance improves by adding more, lower-quality training data, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best performing model, an ensemble of 5 densely connected convolutional newtworks, achieves a root mean squared error of 1.42 and Pearson R of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and a 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized dataset for training machine learning models to recognize ligands in non-cognate target structures while also greatly expanding the number of poses available for training. In order to facilitate community adoption of this dataset for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.


Sign in / Sign up

Export Citation Format

Share Document