A Diversified Machine Learning Strategy for Predicting and Understanding Molecular Melting Points

The ability to predict multi-molecule processes, using only knowledge of single molecule structure, stands as a grand challenge for molecular modeling. Methods capable of predicting melting points (MP) solely from chemical structure represent a canonical example, and are highly desirable in many crucial industrial applications. In this work, we explore a data-driven approach utilizing machine learning (ML) techniques to predict and understand the MP of molecules. Several experimental databases are aggregated from the literature to design a low-bias dataset that includes 3D structural and quantum-chemical properties. Using experimental and polymorph-induced uncertainties, we derive a tenable lower limit for MP prediction accuracy, and apply graph neural networks and Gaussian processes to predict MP competitive with these error bounds. To further understand how MP correlates with molecular structure, we employ several semi-supervised and unsupervised ML techniques. First, we use unsupervised clustering methods to identify classes of molecules, their common fragments, and expected errors for each data set. We then build molecular geometric spaces shaped by MP with a semi-supervised variational autoencoder and graph embedding spaces, and apply graph attribution methods to highlight atom-level contributions to MP within the datasets. Overall, this work serves as a case study of how to employ a diversified ML toolkit to predict and understand correlations between molecular structures and thermophysical properties of interest.

Exchange Spin Coupling from Gaussian Process Regression

10.26434/chemrxiv.12589541.v3 ◽

2020 ◽

Author(s):

Marc Philipp Bahlke ◽

Natnael Mogos ◽

Jonny Proppe ◽

Carmen Herrmann

Keyword(s):

Machine Learning ◽

Gaussian Process ◽

Gaussian Process Regression ◽

Molecular Magnets ◽

Molecular Structures ◽

Spin Coupling ◽

Structure Property ◽

Data Set ◽

Uncertainty Estimates

Heisenberg exchange spin coupling between metal centers is essential for describing and understanding the electronic structure of many molecular catalysts, metalloenzymes, and molecular magnets for potential application in information technology. We explore the machine-learnability of exchange spin coupling, which has not yet been studied. We employ Gaussian process regression since it can potentially deal with small training sets (as likely associated with the rather complex molecular structures required for exploring spin coupling) and since it provides uncertainty estimates (“error bars”) along with predicted values. We compare a range of descriptors and kernels for 257 small dicopper complexes and find that a simple descriptor based on chemical intuition, consisting only of copper-bridge angles and copper-copper distances, clearly outperforms several more sophisticated descriptors when it comes to extrapolating towards larger experimentally relevant complexes. Exchange spin coupling is about as easy to learn as the polarizability, while learning dipole moments is much harder. The strength of the sophisticated descriptors lies in their ability to linearize structure-property relationships, to the point that a simple linear ridge regression performs just as well as the kernel-based machine-learning model for our small dicopper data set. The superior extrapolation performance of the simple descriptor is unique to exchange spin coupling, reinforcing the crucial role of choosing a suitable descriptor, and highlighting the interesting question of the role of chemical intuition vs. systematic or automated selection of features for machine learning in chemistry and materials science.
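To make the idea of predictions with "error bars" concrete, here is a minimal sketch of Gaussian process regression on a two-component descriptor (a bridge angle and a metal-metal distance). It is not the authors' implementation: the squared-exponential kernel, the zero prior mean, the hyperparameter values, and all data points are assumptions chosen for illustration only.

```python
import math

def rbf(a, b, length=1.0, var=1.0):
    """Squared-exponential kernel between two descriptor vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return var * math.exp(-0.5 * d2 / length ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(X, y, x_star, length=1.0, var=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_star."""
    n = len(X)
    K = [[rbf(X[i], X[j], length, var) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, y))      # alpha = K^{-1} y
    k_star = [rbf(x, x_star, length, var) for x in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    variance = rbf(x_star, x_star, length, var) - sum(t * t for t in v)
    return mean, math.sqrt(max(variance, 0.0))       # clamp rounding noise

# Hypothetical descriptors: (bridge angle in degrees, metal-metal distance in angstrom)
X = [[95.0, 2.9], [100.0, 3.0], [105.0, 3.1]]
# Hypothetical coupling constants in arbitrary units
y = [-50.0, 20.0, 120.0]

mean, std = gp_predict(X, y, [102.0, 3.05], length=5.0, var=100.0 ** 2)
print(f"prediction = {mean:.1f} +/- {std:.1f}")
```

The returned standard deviation shrinks towards zero near training points and grows back to the prior value far from them, which is the behaviour that makes the error bars informative when extrapolating to complexes unlike those in the training set.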

Subspace Clustering of Physiological Data From Acute Traumatic Brain Injury Patients: Retrospective Analysis Based on the PROTECT III Trial

JMIR Biomedical Engineering ◽

10.2196/24698 ◽

2021 ◽

Vol 6 (1) ◽

pp. e24698

Author(s):

Sina Ehsani ◽

Chandan K Reddy ◽

Brandon Foreman ◽

Jonathan Ratcliff ◽

Vignesh Subbian

Keyword(s):

Machine Learning ◽

Traumatic Brain Injury ◽

Brain Injury ◽

Subspace Clustering ◽

Machine Learning Techniques ◽

Phase III ◽

Physiological Data ◽

Clustering Methods ◽

Data Set ◽

Health Technologies

Background: With advances in digital health technologies and proliferation of biomedical data in recent years, applications of machine learning in health care and medicine have gained considerable attention. While inpatient settings are equipped to generate rich clinical data from patients, there is a dearth of actionable information that can be used for pursuing secondary research for specific clinical conditions. Objective: This study focused on applying unsupervised machine learning techniques for traumatic brain injury (TBI), which is the leading cause of death and disability among children and adults aged less than 44 years. Specifically, we present a case study to demonstrate the feasibility and applicability of subspace clustering techniques for extracting patterns from data collected from TBI patients. Methods: Data for this study were obtained from the Progesterone for Traumatic Brain Injury, Experimental Clinical Treatment–Phase III (PROTECT III) trial, which included a cohort of 882 TBI patients. We applied subspace-clustering methods (density-based, cell-based, and clustering-oriented methods) to this data set and compared the performance of the different clustering methods. Results: The analyses showed the following three clusters of laboratory physiological data: (1) international normalized ratio (INR), (2) INR, chloride, and creatinine, and (3) hemoglobin and hematocrit. While all subclustering algorithms had a reasonable accuracy in classifying patients by mortality status, the density-based algorithm had a higher F1 score and coverage. Conclusions: Clustering approaches serve as an important step for phenotype definition and validation in clinical domains such as TBI, where patient and injury heterogeneity are among the major reasons for failure of clinical trials. The results from this study provide a foundation to develop scalable clustering algorithms for further research and validation.
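The two comparison metrics named in the results can be sketched as follows. The F1 score is the standard harmonic mean of precision and recall. The `coverage` function assumes that coverage means the fraction of patients assigned to at least one cluster, which may differ from the definition used in the study. The labels and assignments are made up for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def coverage(assignments):
    """Fraction of items assigned to some cluster (None means unassigned).

    Assumed definition, used here only for illustration.
    """
    if not assignments:
        return 0.0
    return sum(1 for a in assignments if a is not None) / len(assignments)

# Hypothetical mortality labels (1 = died) and cluster-derived predictions
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(f1_score(y_true, y_pred))

# Hypothetical cluster ids per patient; None marks a patient left unclustered
print(coverage([0, 1, None, 2]))
```

Reporting both numbers together matters because a density-based method can leave sparse points unassigned: a high F1 score computed on a small covered subset would say little about the full cohort.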

Applying Machine Learning Techniques to Predict the Properties of Energetic Materials

10.26434/chemrxiv.5883157.v2 ◽

2018 ◽

Cited By ~ 1

Author(s):

Daniel Elton ◽

Zois Boukouvalas ◽

Mark S. Butrico ◽

Mark D. Fuge ◽

Peter W. Chung

Keyword(s):

Machine Learning ◽

Energetic Materials ◽

Molecular Structures ◽

Machine Learning Techniques ◽

Small Data ◽

Detonation Pressure ◽

Learning Models ◽

Data Set ◽

Learning Techniques ◽

Machine Learning Models

We present a proof of concept that machine learning techniques can be used to predict the properties of CNOHF energetic molecules from their molecular structures. We focus on a small but diverse dataset consisting of 109 molecular structures spread across ten compound classes. Up until now, candidate molecules for energetic materials have been screened using predictions from expensive quantum simulations and thermochemical codes. We present a comprehensive comparison of machine learning models and several molecular featurization methods: sum over bonds, custom descriptors, Coulomb matrices, bag of bonds, and fingerprints. The best featurization was sum over bonds (bond counting), and the best model was kernel ridge regression. Despite having a small data set, we obtain acceptable errors and Pearson correlations for the out-of-sample prediction of detonation pressure, detonation velocity, explosive energy, heat of formation, density, and other properties. By including another dataset with 309 additional molecules in our training, we show how the error can be pushed lower, although the convergence with number of molecules is slow. Our work paves the way for future applications of machine learning in this domain, including automated lead generation and interpreting machine learning models to obtain novel chemical insights.
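The sum-over-bonds (bond counting) featurization can be illustrated with a short sketch: each molecule becomes a vector of counts over a shared vocabulary of bond types. The string labels for bond types and the hand-written bond lists below are assumptions for illustration. In practice the bonds would be extracted from a structure file with a cheminformatics toolkit, which is not shown here.

```python
from collections import Counter

def bond_vocabulary(molecules):
    """Sorted list of every bond type that occurs in any molecule."""
    return sorted({bond for bonds in molecules for bond in bonds})

def sum_over_bonds(bonds, vocabulary):
    """Count how often each vocabulary bond type occurs in one molecule."""
    counts = Counter(bonds)
    return [counts.get(bond, 0) for bond in vocabulary]

# Hand-written bond lists (one Lewis structure each), for illustration only
nitromethane = ["C-H", "C-H", "C-H", "C-N", "N=O", "N-O"]
methylamine = ["C-H", "C-H", "C-H", "C-N", "N-H", "N-H"]

vocab = bond_vocabulary([nitromethane, methylamine])
features = [sum_over_bonds(m, vocab) for m in (nitromethane, methylamine)]

print(vocab)
for row in features:
    print(row)
```

The resulting fixed-length vectors can be passed to any regression model. The abstract reports kernel ridge regression as the best performer on this kind of input.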

Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions

Nature Communications ◽

10.1038/s41467-019-12875-2 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 64

Author(s):

K. T. Schütt ◽

M. Gastegger ◽

A. Tkatchenko ◽

K.-R. Müller ◽

R. J. Maurer

Keyword(s):

Machine Learning ◽

Quantum Chemistry ◽

Degrees Of Freedom ◽

Large Scale ◽

Materials Science ◽

Chemical Space ◽

Chemical Properties ◽

Molecular Structures ◽

Learning Framework ◽

Molecular Wavefunctions

Machine learning advances chemistry and materials science by enabling large-scale exploration of chemical space based on quantum chemical calculations. While these models supply fast and accurate predictions of atomistic chemical properties, they do not explicitly capture the electronic degrees of freedom of a molecule, which limits their applicability for reactive chemistry and chemical analysis. Here we present a deep learning framework for the prediction of the quantum mechanical wavefunction in a local basis of atomic orbitals from which all other ground-state properties can be derived. This approach retains full access to the electronic structure via the wavefunction at force-field-like efficiency and captures quantum mechanics in an analytically differentiable representation. On several examples, we demonstrate that this opens promising avenues to perform inverse design of molecular structures for targeting electronic property optimisation and a clear path towards increased synergy of machine learning and quantum chemistry.

Exchange Spin Coupling from Gaussian Process Regression

10.26434/chemrxiv.12589541.v1 ◽

2020 ◽

Author(s):

Marc Philipp Bahlke ◽

Natnael Mogos ◽

Jonny Proppe ◽

Carmen Herrmann

Keyword(s):

Machine Learning ◽

Dipole Moments ◽

Gaussian Process Regression ◽

Molecular Magnets ◽

Molecular Structures ◽

Spin Coupling ◽

Structure Property ◽

Data Set ◽

Uncertainty Estimates

Heisenberg exchange spin coupling between metal centers is essential for describing and understanding the electronic structure of many molecular catalysts, metalloenzymes, and molecular magnets for potential application in information technology. We explore the machine-learnability of exchange spin coupling, which has not yet been studied. We employ Gaussian process regression since it can potentially deal with small training sets (as likely associated with the rather complex molecular structures required for exploring spin coupling) and since it provides uncertainty estimates (“error bars”) along with predicted values. We compare a range of descriptors and kernels for 257 small dicopper complexes and find that a simple descriptor based on chemical intuition, consisting only of copper-bridge angles and copper-copper distances, clearly outperforms several more sophisticated descriptors when it comes to extrapolating towards larger experimentally relevant complexes. Exchange spin coupling is about as easy to learn as the polarizability, while learning dipole moments is much harder. The strength of the sophisticated descriptors lies in their ability to linearize structure-property relationships, to the point that a simple linear ridge regression performs just as well as the kernel-based machine-learning model for our small dicopper data set. The superior extrapolation performance of the simple descriptor is unique to exchange spin coupling, reinforcing the crucial role of choosing a suitable descriptor, and highlighting the interesting question of the role of chemical intuition vs. systematic or automated selection of features for machine learning in chemistry and materials science.

Subspace Clustering of Physiological Data From Acute Traumatic Brain Injury Patients: Retrospective Analysis Based on the PROTECT III Trial (Preprint)

10.2196/preprints.24698 ◽

2020 ◽

Author(s):

Sina Ehsani ◽

Chandan K Reddy ◽

Brandon Foreman ◽

Jonathan Ratcliff ◽

Vignesh Subbian

Keyword(s):

Machine Learning ◽

Traumatic Brain Injury ◽

Brain Injury ◽

Subspace Clustering ◽

Machine Learning Techniques ◽

Phase III ◽

Physiological Data ◽

Clustering Methods ◽

Data Set ◽

Health Technologies

BACKGROUND With advances in digital health technologies and proliferation of biomedical data in recent years, applications of machine learning in health care and medicine have gained considerable attention. While inpatient settings are equipped to generate rich clinical data from patients, there is a dearth of actionable information that can be used for pursuing secondary research for specific clinical conditions. OBJECTIVE This study focused on applying unsupervised machine learning techniques for traumatic brain injury (TBI), which is the leading cause of death and disability among children and adults aged less than 44 years. Specifically, we present a case study to demonstrate the feasibility and applicability of subspace clustering techniques for extracting patterns from data collected from TBI patients. METHODS Data for this study were obtained from the Progesterone for Traumatic Brain Injury, Experimental Clinical Treatment–Phase III (PROTECT III) trial, which included a cohort of 882 TBI patients. We applied subspace-clustering methods (density-based, cell-based, and clustering-oriented methods) to this data set and compared the performance of the different clustering methods. RESULTS The analyses showed the following three clusters of laboratory physiological data: (1) international normalized ratio (INR), (2) INR, chloride, and creatinine, and (3) hemoglobin and hematocrit. While all subclustering algorithms had a reasonable accuracy in classifying patients by mortality status, the density-based algorithm had a higher F1 score and coverage. CONCLUSIONS Clustering approaches serve as an important step for phenotype definition and validation in clinical domains such as TBI, where patient and injury heterogeneity are among the major reasons for failure of clinical trials. The results from this study provide a foundation to develop scalable clustering algorithms for further research and validation.

Exchange Spin Coupling from Gaussian Process Regression

10.26434/chemrxiv.12589541 ◽

2020 ◽

Author(s):

Marc Philipp Bahlke ◽

Natnael Mogos ◽

Jonny Proppe ◽

Carmen Herrmann

Keyword(s):

Machine Learning ◽

Gaussian Process ◽

Gaussian Process Regression ◽

Molecular Magnets ◽

Molecular Structures ◽

Spin Coupling ◽

Structure Property ◽

Data Set ◽

Uncertainty Estimates

Heisenberg exchange spin coupling between metal centers is essential for describing and understanding the electronic structure of many molecular catalysts, metalloenzymes, and molecular magnets for potential application in information technology. We explore the machine-learnability of exchange spin coupling, which has not yet been studied. We employ Gaussian process regression since it can potentially deal with small training sets (as likely associated with the rather complex molecular structures required for exploring spin coupling) and since it provides uncertainty estimates (“error bars”) along with predicted values. We compare a range of descriptors and kernels for 257 small dicopper complexes and find that a simple descriptor based on chemical intuition, consisting only of copper-bridge angles and copper-copper distances, clearly outperforms several more sophisticated descriptors when it comes to extrapolating towards larger experimentally relevant complexes. Exchange spin coupling is about as easy to learn as the polarizability, while learning dipole moments is much harder. The strength of the sophisticated descriptors lies in their ability to linearize structure-property relationships, to the point that a simple linear ridge regression performs just as well as the kernel-based machine-learning model for our small dicopper data set. The superior extrapolation performance of the simple descriptor is unique to exchange spin coupling, reinforcing the crucial role of choosing a suitable descriptor, and highlighting the interesting question of the role of chemical intuition vs. systematic or automated selection of features for machine learning in chemistry and materials science.

Machine-learning for cluster analysis of localization microscopy data.

10.1101/505719 ◽

2018 ◽

Author(s):

David J Williamson ◽

Garth L Burn ◽

Juliette Griffie ◽

Daniel M Davis ◽

Dylan M Owen

Keyword(s):

Machine Learning ◽

Cluster Analysis ◽

Single Molecule ◽

Large Scale ◽

Supervised Machine Learning ◽

Point Pattern ◽

Data Set ◽

Localization Microscopy ◽

Human T Cell ◽

Microscopy Data

Quantifying the clustering of points within single-molecule localization microscopy data is useful for understanding the spatial relationships of the molecules in the underlying sample. The conversion of point pattern data into a meaningful description of clustering is difficult, especially for biologically derived data, as the definitions of clustering are often subjective or simplistic. Many existing computational approaches are also limited in their ability to process large-scale data sets or to deal effectively with inhomogeneities in clustering. Here we have developed a supervised machine-learning approach to cluster analysis which is fast and accurate. Trained on a variety of simulated clustered data, the network can then classify all points from a typical localization microscopy data set (several million points from the entire field of view) as being either clustered or not clustered, with the potential to include additional classifiers to describe different types of clusters. Clustered points can then be further refined into like-clusters for the measurement of cluster area, shape, and point density. We demonstrate the performance on simulated data and experimental data of the kinase Csk and the adaptor PAG in both naive and pre-stimulated primary human T cell synapses.

Applying Machine Learning Techniques to Predict the Properties of Energetic Materials

10.26434/chemrxiv.5883157.v1 ◽

2018 ◽

Author(s):

Daniel Elton ◽

Zois Boukouvalas ◽

Mark S. Butrico ◽

Mark D. Fuge ◽

Peter W. Chung

Keyword(s):

Machine Learning ◽

Energetic Materials ◽

Molecular Structures ◽

Machine Learning Techniques ◽

Small Data ◽

Detonation Pressure ◽

Learning Models ◽

Data Set ◽

Learning Techniques ◽

Machine Learning Models

We present a proof of concept that machine learning techniques can be used to predict the properties of CNOHF energetic molecules from their molecular structures. We focus on a small but diverse dataset consisting of 109 molecular structures spread across ten compound classes. Up until now, candidate molecules for energetic materials have been screened using predictions from expensive quantum simulations and thermochemical codes. We present a comprehensive comparison of machine learning models and several molecular featurization methods: sum over bonds, custom descriptors, Coulomb matrices, bag of bonds, and fingerprints. The best featurization was sum over bonds (bond counting), and the best model was kernel ridge regression. Despite having a small data set, we obtain acceptable errors and Pearson correlations for the out-of-sample prediction of detonation pressure, detonation velocity, explosive energy, heat of formation, density, and other properties. By including another dataset with 309 additional molecules in our training, we show how the error can be pushed lower, although the convergence with number of molecules is slow. Our work paves the way for future applications of machine learning in this domain, including automated lead generation and interpreting machine learning models to obtain novel chemical insights.
