SAMPL6 Challenge Results from pKa Predictions Based on a General Gaussian Process Model

2018 ◽  
Author(s):  
Caitlin C. Bannan ◽  
David L. Mobley ◽  
A. Geoff Skillman

A variety of fields would benefit from accurate pKa predictions, especially drug design, because a change in ionization state can affect a molecule's physicochemical properties. Participants in the recent SAMPL6 blind challenge were asked to submit predictions for microscopic and macroscopic pKa values of 24 drug-like small molecules. We recently built a general model for predicting pKa values using Gaussian process regression trained on physical and chemical features of each ionizable group. Our pipeline takes a molecular graph and uses the OpenEye Toolkits to calculate features describing the removal of a proton. These features are fed into a Scikit-learn Gaussian process to predict microscopic pKa values, which are then used to analytically determine macroscopic pKa values. Our Gaussian process is trained on a set of 2,700 macroscopic pKa values from monoprotic and select diprotic molecules. Here, we share our results for microscopic and macroscopic predictions in the SAMPL6 challenge. Overall, we ranked in the middle of the pack compared to other participants, but our fairly good agreement with experiment is still promising considering the challenge molecules are chemically diverse and often polyprotic while our training set is predominantly monoprotic. Of particular importance to us when building this model was including an uncertainty estimate, based on the chemistry of the molecule, that would reflect the likely accuracy of our prediction. Our model reports large uncertainties for the molecules that appear to have chemistry outside our domain of applicability, along with good agreement in quantile-quantile plots, indicating it can predict its own accuracy. The challenge highlighted a variety of means to improve our model, including adding more polyprotic molecules to our training set and more carefully considering what functional groups we do or do not identify as ionizable.
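
The abstract describes feeding per-site features into a Scikit-learn Gaussian process and reading off an uncertainty with each prediction. The following is a minimal sketch of that idea, not the authors' pipeline: the two-column feature matrix and the synthetic "pKa" targets are invented stand-ins (no OpenEye calls are made), and the kernel choice is an assumption. It only illustrates how a Gaussian process reports a larger standard deviation for a query far from its training data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-site features (NOT real OpenEye descriptors).
X_train = rng.uniform(-1.0, 1.0, size=(40, 2))
# Synthetic "pKa" targets: a smooth function of the features plus small noise.
y_train = (
    7.0
    + 3.0 * X_train[:, 0]
    - 2.0 * X_train[:, 1] ** 2
    + rng.normal(0.0, 0.1, size=40)
)

# Assumed kernel: scaled RBF for the smooth trend plus a white-noise term.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(X_train, y_train)

# One query inside the training domain and one far outside it.
X_query = np.array([[0.0, 0.0], [5.0, 5.0]])
mean, std = gp.predict(X_query, return_std=True)

for label, m, s in zip(["in-domain", "out-of-domain"], mean, std):
    print(f"{label}: predicted value {m:.2f}, reported std {s:.2f}")
```

In this toy setting the out-of-domain query receives the larger reported standard deviation, which is the behavior the abstract relies on when it flags molecules outside the model's domain of applicability.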


Author(s):  
Charley M. Wu ◽  
Eric Schulz ◽  
Mona M. Garvert ◽  
Björn Meder ◽  
Nicolas W. Schuck

Abstract: Learning and generalization in spatial domains is often thought to rely on a “cognitive map”, representing relationships between spatial locations. Recent research suggests that this same neural machinery is also recruited for reasoning about more abstract, conceptual forms of knowledge. Yet, to what extent do spatial and conceptual reasoning share common computational principles, and what are the implications for behavior? Using a within-subject design we studied how participants used spatial or conceptual distances to generalize and search for correlated rewards in successive multi-armed bandit tasks. Participant behavior indicated sensitivity to both spatial and conceptual distance, and was best captured using a Bayesian model of generalization that formalized distance-dependent generalization and uncertainty-guided exploration as a Gaussian Process regression with a radial basis function kernel. The same Gaussian Process model best captured human search decisions and judgments in both domains, and could simulate realistic learning curves, where we found equivalent levels of generalization in spatial and conceptual tasks. At the same time, we also find characteristic differences between domains. Relative to the spatial domain, participants showed reduced levels of uncertainty-directed exploration and increased levels of random exploration in the conceptual domain. Participants also displayed a one-directional transfer effect, where experience in the spatial task boosted performance in the conceptual task, but not vice versa. While confidence judgments indicated that participants were sensitive to the uncertainty of their knowledge in both tasks, they did not or could not leverage their estimates of uncertainty to guide exploration in the conceptual task. These results support the notion that value-guided learning and generalization recruit cognitive-map dependent computational mechanisms in spatial and conceptual domains. Yet both behavioral and model-based analyses suggest domain-specific differences in how these representations map onto actions.

Author summary: There is a resurgence of interest in “cognitive maps” based on recent evidence that the hippocampal-entorhinal system encodes both spatial and non-spatial relational information, with far-reaching implications for human behavior. Yet little is known about the commonalities and differences in the computational principles underlying human learning and decision making in spatial and non-spatial domains. We use a within-subject design to examine how humans search for either spatially or conceptually correlated rewards. Using a Bayesian learning model, we find evidence for the same computational mechanisms of generalization across domains. While participants were sensitive to expected rewards and uncertainty in both tasks, how they leveraged this knowledge to guide exploration was different: participants displayed less uncertainty-directed and more random exploration in the conceptual domain. Moreover, experience with the spatial task improved conceptual performance, but not vice versa. These results provide important insights about the degree of overlap between spatial and conceptual cognition.
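
The model class named in the abstract, Gaussian process regression with a radial basis function kernel combined with uncertainty-guided exploration, can be sketched on a toy bandit. Everything specific here is an assumption made for illustration: the 30-arm one-dimensional layout, the reward landscape, the length scale, the noise level, and the exploration weight `beta` are invented, and the upper-confidence-bound rule is one common way to express uncertainty-directed choice rather than the fitted model from the study. Random exploration (for example a softmax choice rule) is not included.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=2.0):
    """Radial basis function kernel between two 1-D arrays of arm positions."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def gp_posterior(x_obs, y_obs, x_all, noise=1e-2):
    """Zero-mean GP posterior mean and std at x_all given noisy observations."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_all, x_obs)
    mean = K_s @ np.linalg.solve(K, y_obs)
    # Prior variance is 1; subtract the variance explained by the observations.
    var = 1.0 - np.einsum("ij,ji->i", K_s, np.linalg.solve(K, K_s.T))
    return mean, np.sqrt(np.clip(var, 0.0, None))

def ucb_choice(mean, std, beta=0.5):
    """Pick the arm with the highest upper confidence bound."""
    return int(np.argmax(mean + beta * std))

arms = np.arange(30, dtype=float)
# Hypothetical smooth reward landscape: nearby arms have similar rewards.
true_reward = np.exp(-((arms - 20.0) ** 2) / 50.0)

rng = np.random.default_rng(1)
chosen = [5]
rewards = [true_reward[5] + rng.normal(0.0, 0.05)]
for _ in range(15):
    mean, std = gp_posterior(arms[chosen], np.array(rewards), arms)
    k = ucb_choice(mean, std)
    chosen.append(k)
    rewards.append(true_reward[k] + rng.normal(0.0, 0.05))

print("arms visited:", chosen)
```

Because the kernel ties the value of each arm to its neighbours, a single observation lowers the uncertainty of nearby arms as well, which is the distance-dependent generalization the abstract refers to.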


2019 ◽  
Author(s):  
Olli-Pekka Koistinen ◽  
Vilhjálmur Ásgeirsson ◽  
Aki Vehtari ◽  
Hannes Jónsson

The minimum mode following method can be used to find saddle points on an energy surface by following a direction guided by the lowest curvature mode. Such calculations are often started close to a minimum on the energy surface to find out which transitions can occur from an initial state of the system, but it is also common to start from the vicinity of a first-order saddle point, making use of an initial guess based on intuition or more approximate calculations. In systems where accurate evaluations of the energy and its gradient are computationally intensive, it is important to exploit the information of the previous evaluations to enhance the performance. Here, we show that the number of evaluations required for convergence to the saddle point can be significantly reduced by making use of an approximate energy surface obtained by a Gaussian process model based on inverse inter-atomic distances, evaluating accurate energy and gradient at the saddle point of the approximate surface and then correcting the model based on the new information. The performance of the method is tested with start points chosen randomly in the vicinity of saddle points for dissociative adsorption of an H2 molecule on the Cu(110) surface and three gas-phase chemical reactions.


2019 ◽  
Vol 32 (8) ◽  
pp. 3005-3028 ◽  
Author(s):  
Zexun Chen ◽  
Bo Wang ◽  
Alexander N. Gorban

Gaussian process models for vector-valued functions have been shown to be useful for multi-output prediction. The existing method for this model is to reformulate the matrix-variate Gaussian distribution as a multivariate normal distribution. Although it is effective in many cases, reformulation is not always workable and is difficult to apply to other distributions because not all matrix-variate distributions can be transformed to their respective multivariate distributions, as is the case for the matrix-variate Student-t distribution. In this paper, we propose a unified framework which is used not only to introduce a novel multivariate Student-t process regression model (MV-TPR) for multi-output prediction, but also to reformulate the multivariate Gaussian process regression (MV-GPR) in a way that overcomes some limitations of the existing methods. Both MV-GPR and MV-TPR have closed-form expressions for the marginal likelihoods and predictive distributions under this unified framework and thus can adopt the same optimization approaches as used in conventional GPR. The usefulness of the proposed methods is illustrated through several simulated and real-data examples. In particular, we verify empirically that MV-TPR performs better on the datasets considered, including air quality prediction and bike rental prediction. Finally, the proposed methods are shown to produce profitable investment strategies in the stock markets.


2016 ◽  
Vol 5 (1) ◽  
pp. 67-78
Author(s):  
Ali L. Yaumi ◽  
Ahmed M. Murtala ◽  
Habiba D. Muhd ◽  
Fatima M. Saleh

Gum arabic (GA) is an organic adhesive produced from a tree named Acacia senegal. The gum has a wide range of industrial uses, especially in feeds, textiles, and pharmaceuticals. It is used as an emulsifier and serves mostly as a stabilizer in cosmetic and food products that contain an oil-water interface. A GA sample was collected, formulated, and prepared at various concentrations ranging from 20% w/v to 85% w/v. The quality and applicability of well-characterized materials are directly related to their physical and chemical properties. The physicochemical analysis revealed that all the samples were slightly acidic (pH ranging from 4.81 to 6.41). This range is in good agreement with pH values reported by several authors for gum arabic and other Acacia gums. The binding strength increases with the number of days; for example, in sample F (50% w/v gum arabic concentration) it increases from 1.5 on the 1st day to 1.97 on the 28th day. The prepared samples are denser than water, the density increases with the percentage concentration of the samples, and the relative density of the gum solution is independent of time. The binding strength of sample G (75% w/v gum concentration) compared well to that of polyvinyl acetate (PVA). International Journal of Environment Vol. 5 (1) 2016, pp. 67-78


2020 ◽  
Vol 86 (1) ◽  
Author(s):  
Eric C. Howell ◽  
J. D. Hanson

A non-parametric Gaussian process regression model is developed in the three-dimensional equilibrium reconstruction code V3FIT. A Gaussian process is a normal distribution of functions that is uniquely defined by specifying a mean function and covariance kernel function. Gaussian process regression assumes that an unknown profile belongs to a particular Gaussian process and uses Bayesian analysis to select the function that gives the best fit to measured data. The implementation in V3FIT uses a hybrid representation where Gaussian processes are used to infer some of the equilibrium profiles and standard parametric techniques are used to infer the remaining profiles. The implementation of the Gaussian process is tested using both synthetic data and experimental data from multiple machines.
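
The statement that a Gaussian process is fixed by a mean function and a covariance kernel can be made concrete with a short numerical sketch. This is a generic illustration and is unrelated to the V3FIT code itself: the zero mean function, the squared-exponential kernel, its hyperparameters, and the three "measurements" are all assumptions chosen for the example. The code draws sample profiles from the prior and then computes the posterior mean profile conditioned on the data.

```python
import numpy as np

def mean_fn(x):
    """Assumed prior mean function: zero everywhere."""
    return np.zeros_like(x)

def sq_exp_kernel(a, b, sigma=1.0, length=0.2):
    """Assumed squared-exponential covariance kernel."""
    return sigma ** 2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

# Grid on which the unknown profile is represented.
x = np.linspace(0.0, 1.0, 51)

# Draw three sample profiles from the prior via a Cholesky factor.
K = sq_exp_kernel(x, x) + 1e-6 * np.eye(x.size)  # jitter for numerical stability
L = np.linalg.cholesky(K)
rng = np.random.default_rng(0)
prior_draws = mean_fn(x)[:, None] + L @ rng.standard_normal((x.size, 3))

# Hypothetical measurements of the profile at three locations.
x_obs = np.array([0.1, 0.5, 0.9])
y_obs = np.array([1.0, 0.6, 0.1])
noise_var = 1e-4

# Condition the Gaussian process on the measurements (posterior mean only).
K_oo = sq_exp_kernel(x_obs, x_obs) + noise_var * np.eye(x_obs.size)
K_xo = sq_exp_kernel(x, x_obs)
post_mean = mean_fn(x) + K_xo @ np.linalg.solve(K_oo, y_obs - mean_fn(x_obs))

print("posterior mean near x = 0.5:", round(float(post_mean[25]), 3))
```

Changing the kernel length scale changes how smooth the sampled and inferred profiles are, which is why the kernel choice acts as the modelling assumption in place of a fixed parametric form.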


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Le Zhou ◽  
Junghui Chen ◽  
Zhihuan Song

In chemical batch processes with slow responses and a long duration, it is time-consuming and expensive to obtain sufficient normal data for statistical analysis. As newly evolving data accumulate, the model gradually becomes adequate, while subsequent batches change slightly owing to the slow time-varying behavior. To make efficient use of the small amount of initial data and the newly evolving data sets, an adaptive monitoring scheme based on the recursive Gaussian process (RGP) model is designed in this paper. Based on the initial data, a Gaussian process model and the corresponding SPE statistic are first constructed. When new batches of data are included, a strategy based on the RGP model is used to choose the proper data for model updating. The performance of the proposed method is finally demonstrated on a penicillin fermentation batch process, and the result indicates that the proposed monitoring scheme is effective for adaptive modelling and online monitoring.

