scholarly journals Implicitly perturbed Hamiltonian: a class of versatile and general-purpose molecular representations for machine leaning

Author(s):  
Amin Alibakhshi ◽  
Bernd Hartke

Unraveling challenging problems by machine learning has recently become a hot topic in many scientific disciplines. For developing rigorous machine-learning models to study problems of interest in molecular sciences, translating molecular structures to quantitative representations as suitable machine-learning inputs plays a central role. Many different molecular representations and the state-ofthe- art ones, although efficient in studying numerous molecular features, still are sub-optimal in many challenging cases, as discussed in the context of present research. The main aim of the present study is to introduce the Implicitly Perturbed Hamiltonian (ImPerHam) as a class of versatile representations for more efficient machine learning of challenging problems in molecular sciences. ImPerHam representations are defined as energy attributes of the molecular Hamiltonian, implicitly perturbed by a number of hypothetic or real arbitrary solvents based on continuum solvation models. We demonstrate outstanding performance of machine-learning models based on ImPerHam representations for three diverse and challenging cases of predicting inhibition of the CYP450 enzyme, high precision and transferrable evaluation of conformational energy of molecular systems and accurately reproducing solvation free energies for large benchmark sets.

2021 ◽  
Author(s):  
Amin Alibakhshi ◽  
Bernd Hartke

Abstract Unraveling challenging problems by machine learning has recently become a hot topic in many scientific disciplines. For developing rigorous machine-learning models to study problems of interest in molecular sciences, translating molecular structures to quantitative representations as suitable machine-learning inputs play a central role. Many different molecular representations and the state-of-the-art ones, although efficient in studying numerous molecular features, still are suboptimal in many challenging cases, as discussed in the context of the present research. The main aim of the present study is to introduce the Implicitly Perturbed Hamiltonian (ImPerHam) as a class of versatile representations for more efficient machine learning of challenging problems in molecular sciences. ImPerHam representations are defined as energy attributes of the molecular Hamiltonian, implicitly perturbed by a number of hypothetic or real arbitrary solvents based on continuum solvation models. We demonstrate the outstanding performance of machine-learning models based on ImPerHam representations for three diverse and challenging cases of predicting inhibition of the CYP450 enzyme, high precision, and transferrable evaluation of conformational energy of molecular systems, and accurately reproducing solvation free energies for large benchmark sets.


Author(s):  
Daniel Elton ◽  
Zois Boukouvalas ◽  
Mark S. Butrico ◽  
Mark D. Fuge ◽  
Peter W. Chung

We present a proof of concept that machine learning techniques can be used to predict the properties of CNOHF energetic molecules from their molecular structures. We focus on a small but diverse dataset consisting of 109 molecular structures spread across ten compound classes. Up until now, candidate molecules for energetic materials have been screened using predictions from expensive quantum simulations and thermochemical codes. We present a comprehensive comparison of machine learning models and several molecular featurization methods - sum over bonds, custom descriptors, Coulomb matrices, bag of bonds, and fingerprints. The best featurization was sum over bonds (bond counting), and the best model was kernel ridge regression. Despite having a small data set, we obtain acceptable errors and Pearson correlations for the prediction of detonation pressure, detonation velocity, explosive energy, heat of formation, density, and other properties out of sample. By including another dataset with 309 additional molecules in our training we show how the error can be pushed lower, although the convergence with number of molecules is slow. Our work paves the way for future applications of machine learning in this domain, including automated lead generation and interpreting machine learning models to obtain novel chemical insights.


2021 ◽  
Author(s):  
Siddharth Ghule ◽  
Sayan Bagchi ◽  
Kumar Vanka

<div>Electricity generation is a major contributing factor for greenhouse gas emissions. Energy storage systems available today have a combined capacity to store less than 1% of the electricity being consumed worldwide. Redox Flow Batteries (RFBs) are promising candidates for green and efficient energy storage systems. RFBs are being used in renewable energy systems, but their widespread adoption is limited due to high production costs and toxicity associated with the transition-metal-based redox-active species. Therefore, cheaper and greener alternative organic redox-active species are being investigated. Recent reports have shown organic molecules based on phenazine are promising candidates for redox-active species in RFBs. However, the large number of available organic compounds makes the conventional experimental and DFT methods impractical to screen thousands of molecules in a reasonable amount of time. In contrast, machine-learning models have low development time, short prediction time, and high accuracy; thus, are being heavily investigated for virtual screening applications. In this work, we developed machine-learning models to predict the redox potential of phenazine derivatives in DME solvent using a small dataset of 185 molecules. 2D, 3D, and Molecular Fingerprint features were computed using readily available and easy-to-use python libraries, making our approach easily adaptable to similar work. Twenty linear and non-linear machine-learning models were investigated in this work. These models achieved excellent performance on the unseen data (i.e., R<sup>2</sup> > 0.98, MSE < 0.008 V2 and MAE < 0.07 V). Model performance was assessed in a consistent manner using the training and evaluation pipeline developed in this work. We showed that 2D molecular features are most informative and achieve the best prediction accuracy among four feature sets. We also showed that often less preferred but relatively faster linear models could perform better than non-linear models when the feature set contains different types of features (i.e., 2D, 3D, and Molecular Fingerprints). Further investigations revealed that it is possible to reduce the training and inference time without sacrificing prediction accuracy by using a small subset of features. Moreover, models were able to predict the previously reported promising redox-active compounds with high accuracy. Also, significantly low prediction errors were observed for the functional groups. Although some functional groups had only one compound in the training set, best-performing models could achieve errors (MAPE) less than 10%. The major source of error was a lack of data near-zero and in the positive region. Therefore, this work shows that it is possible to develop accurate machine-learning models that could potentially screen millions of compounds in a short amount of time with a small training set and limited number of easy to compute features. Thus, results obtained in this report would help in the adoption of green energy by accelerating the field of materials discovery for energy storage applications.</div>


Author(s):  
Eelke B. Lenselink ◽  
Pieter F. W. Stouten

AbstractAccurate prediction of lipophilicity—logP—based on molecular structures is a well-established field. Predictions of logP are often used to drive forward drug discovery projects. Driven by the SAMPL7 challenge, in this manuscript we describe the steps that were taken to construct a novel machine learning model that can predict and generalize well. This model is based on the recently described Directed-Message Passing Neural Networks (D-MPNNs). Further enhancements included: both the inclusion of additional datasets from ChEMBL (RMSE improvement of 0.03), and the addition of helper tasks (RMSE improvement of 0.04). To the best of our knowledge, the concept of adding predictions from other models (Simulations Plus logP and [email protected], respectively) as helper tasks is novel and could be applied in a broader context. The final model that we constructed and used to participate in the challenge ranked 2/17 ranked submissions with an RMSE of 0.66, and an MAE of 0.48 (submission: Chemprop). On other datasets the model also works well, especially retrospectively applied to the SAMPL6 challenge where it would have ranked number one out of all submissions (RMSE of 0.35). Despite the fact that our model works well, we conclude with suggestions that are expected to improve the model even further.


2019 ◽  
Vol 9 (6) ◽  
pp. 1248 ◽  
Author(s):  
Hyun-Joo Oh ◽  
Mutiara Syifa ◽  
Chang-Wook Lee ◽  
Saro Lee

To effectively prevent land subsidence over abandoned coal mines, it is necessary to quantitatively identify vulnerable areas. In this study, we evaluated the performance of predictive Bayesian, functional, and meta-ensemble machine learning models in generating land subsidence susceptibility (LSS) maps. All models were trained using half of a land subsidence inventory, and validated using the other half of the dataset. The model performance was evaluated by comparing the area under the receiver operating characteristic (ROC) curve of the resulting LSS map for each model. Among all models tested, the logit boost, which is a meta-ensemble machine leaning model, generated LSS maps with the highest accuracy (91.44%), i.e., higher than that of the other Bayesian and functional machine learning models, including the Bayes net (86.42%), naïve Bayes (85.39%), logistic (88.92%), and multilayer perceptron models (86.76%). The LSS maps produced in this study can be used to mitigate subsidence risk for people and important facilities within the study area, and as a foundation for further studies in other regions.


2018 ◽  
Author(s):  
Daniel Elton ◽  
Zois Boukouvalas ◽  
Mark S. Butrico ◽  
Mark D. Fuge ◽  
Peter W. Chung

We present a proof of concept that machine learning techniques can be used to predict the properties of CNOHF energetic molecules from their molecular structures. We focus on a small but diverse dataset consisting of 109 molecular structures spread across ten compound classes. Up until now, candidate molecules for energetic materials have been screened using predictions from expensive quantum simulations and thermochemical codes. We present a comprehensive comparison of machine learning models and several molecular featurization methods - sum over bonds, custom descriptors, Coulomb matrices, Bag of Bonds, and fingerprints. The best featurization was sum over bonds (bond counting), and the best model was kernel ridge regression. Despite having a small data set, we obtain acceptable errors and Pearson correlations for the prediction of detonation pressure, detonation velocity, explosive energy, heat of formation, density, and other properties out of sample. By including another dataset with 309 additional molecules in our training we show how the error can be pushed lower, although the convergence with number of molecules is slow. Our work paves the way for future applications of machine learning in this domain, including automated lead generation and interpreting machine learning models to obtain novel chemical insights.


2018 ◽  
Author(s):  
Daniel Elton ◽  
Zois Boukouvalas ◽  
Mark S. Butrico ◽  
Mark D. Fuge ◽  
Peter W. Chung

We present a proof of concept that machine learning techniques can be used to predict the properties of CNOHF energetic molecules from their molecular structures. We focus on a small but diverse dataset consisting of 109 molecular structures spread across ten compound classes. Up until now, candidate molecules for energetic materials have been screened using predictions from expensive quantum simulations and thermochemical codes. We present a comprehensive comparison of machine learning models and several molecular featurization methods - sum over bonds, custom descriptors, Coulomb matrices, bag of bonds, and fingerprints. The best featurization was sum over bonds (bond counting), and the best model was kernel ridge regression. Despite having a small data set, we obtain acceptable errors and Pearson correlations for the prediction of detonation pressure, detonation velocity, explosive energy, heat of formation, density, and other properties out of sample. By including another dataset with 309 additional molecules in our training we show how the error can be pushed lower, although the convergence with number of molecules is slow. Our work paves the way for future applications of machine learning in this domain, including automated lead generation and interpreting machine learning models to obtain novel chemical insights.


2021 ◽  
Author(s):  
Siddharth Ghule ◽  
Sayan Bagchi ◽  
Kumar Vanka

<div>Electricity generation is a major contributing factor for greenhouse gas emissions. Energy storage systems available today have a combined capacity to store less than 1% of the electricity being consumed worldwide. Redox Flow Batteries (RFBs) are promising candidates for green and efficient energy storage systems. RFBs are being used in renewable energy systems, but their widespread adoption is limited due to high production costs and toxicity associated with the transition-metal-based redox-active species. Therefore, cheaper and greener alternative organic redox-active species are being investigated. Recent reports have shown organic molecules based on phenazine are promising candidates for redox-active species in RFBs. However, the large number of available organic compounds makes the conventional experimental and DFT methods impractical to screen thousands of molecules in a reasonable amount of time. In contrast, machine-learning models have low development time, short prediction time, and high accuracy; thus, are being heavily investigated for virtual screening applications. In this work, we developed machine-learning models to predict the redox potential of phenazine derivatives in DME solvent using a small dataset of 185 molecules. 2D, 3D, and Molecular Fingerprint features were computed using readily available and easy-to-use python libraries, making our approach easily adaptable to similar work. Twenty linear and non-linear machine-learning models were investigated in this work. These models achieved excellent performance on the unseen data (i.e., R<sup>2</sup> > 0.98, MSE < 0.008 V2 and MAE < 0.07 V). Model performance was assessed in a consistent manner using the training and evaluation pipeline developed in this work. We showed that 2D molecular features are most informative and achieve the best prediction accuracy among four feature sets. We also showed that often less preferred but relatively faster linear models could perform better than non-linear models when the feature set contains different types of features (i.e., 2D, 3D, and Molecular Fingerprints). Further investigations revealed that it is possible to reduce the training and inference time without sacrificing prediction accuracy by using a small subset of features. Moreover, models were able to predict the previously reported promising redox-active compounds with high accuracy. Also, significantly low prediction errors were observed for the functional groups. Although some functional groups had only one compound in the training set, best-performing models could achieve errors (MAPE) less than 10%. The major source of error was a lack of data near-zero and in the positive region. Therefore, this work shows that it is possible to develop accurate machine-learning models that could potentially screen millions of compounds in a short amount of time with a small training set and limited number of easy to compute features. Thus, results obtained in this report would help in the adoption of green energy by accelerating the field of materials discovery for energy storage applications.</div>


2021 ◽  
Author(s):  
Feng Gao ◽  
Yike Shen ◽  
J. Brett Sallach ◽  
Hui Li ◽  
Cun Liu ◽  
...  

Abstract Root concentration factor is an important substance-specific characterization parameter for plant uptake of organic contaminants from soils in life cycle impact assessment (LCIA); however, the availability of a reliable dataset and building of robust predictive models remain challenging due to the complexity of chemical-soil-plant root interactions. Here we developed end-to-end machine learning models to devolve the interaction complexity by training on a unified dataset with 341 data points covering 72 chemicals. The gradient boosting regression tree (GBRT) model based on the extended connectivity fingerprints (ECFP) demonstrated a superior prediction performance with R-squared of 0.77 and Mean Absolute Error (MAE) of 0.22. In addition, partial dependence analysis was used to determine the nonlinear relationships in the chemical-soil-plant root system. Feature importance analysis revealed the relationship between and chemical topological structures. Stemming from its simplicity and universality, the GBRT-ECFP model provides a promising tool for LCIA to better characterize the human and ecological impacts of chemicals in the environment.


Sign in / Sign up

Export Citation Format

Share Document