Machine learning with physicochemical relationships: solubility prediction in organic solvents and water

2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Samuel Boobier ◽  
David R. J. Hose ◽  
A. John Blacker ◽  
Bao N. Nguyen

Abstract: Solubility prediction remains a critical challenge in drug development, synthetic route and chemical process design, extraction and crystallisation. Here we report a successful approach to solubility prediction in organic solvents and water using a combination of machine learning (ANN, SVM, RF, ExtraTrees, Bagging and GP) and computational chemistry. Rational translation of the dissolution process into a numerical problem led to a small set of selected descriptors and to predictions that are independent of the applied machine learning method. These models gave significantly more accurate predictions than benchmarked open-access and commercial tools, achieving accuracy close to the expected level of noise in the training data (LogS ± 0.7). Finally, they reproduced the physicochemical relationships between solubility and molecular properties in different solvents, which led to rational approaches to improving the accuracy of each model.
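The idea of mapping a small set of computed descriptors to LogS can be sketched with a plain ridge regression. This is a minimal illustration, not the paper's models (which use ANN, SVM, RF and others): the descriptor matrix, weights, and the 0.7 noise level standing in for experimental scatter are all synthetic.

```python
import numpy as np

# Synthetic stand-in for a descriptor-to-LogS fit. Descriptors and the
# linear relationship are invented for the sketch; the ~0.7 noise mimics
# the expected uncertainty of experimental LogS data.
rng = np.random.default_rng(0)
n_samples, n_desc = 40, 4
X = rng.normal(size=(n_samples, n_desc))                 # descriptor matrix
true_w = np.array([1.5, -0.8, 0.3, 0.0])                 # hypothetical relationship
y = X @ true_w + rng.normal(scale=0.7, size=n_samples)   # LogS + noise

# Closed-form ridge regression: w = (X^T X + alpha*I)^-1 X^T y
alpha = 1.0
w = np.linalg.solve(X.T @ X + alpha * np.eye(n_desc), X.T @ y)
rmse = float(np.sqrt(np.mean((X @ w - y) ** 2)))
print(f"training RMSE: {rmse:.2f}")
```

Because the noise floor of the labels is about 0.7 LogS units, a fit much tighter than that would indicate overfitting rather than genuine accuracy.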

2019 ◽  
Author(s):  
Ge Liu ◽  
Haoyang Zeng ◽  
Jonas Mueller ◽  
Brandon Carter ◽  
Ziheng Wang ◽  
...  

Abstract: The precise targeting of antibodies and other protein therapeutics is required for their proper function and the elimination of deleterious off-target effects. Often the molecular structure of a therapeutic target is unknown, and randomized methods are used to design antibodies without a model that relates antibody sequence to desired properties. Here we present a machine learning method that can design human Immunoglobulin G (IgG) antibodies with target affinities superior to candidates from phage display panning experiments within a limited design budget. We also demonstrate that machine learning can improve target specificity through the modular composition of models from different experimental campaigns, enabling a new integrative approach to antibody design. Our results suggest a new path for the discovery of therapeutic molecules by demonstrating that predictive and differentiable models of antibody binding can be learned from high-throughput experimental data without the need for target structural data.

Significance: Antibody-based therapeutics must meet both affinity and specificity metrics, and existing in vitro methods for meeting these metrics are based upon randomization and empirical testing. We demonstrate that, with sufficient target-specific training data, machine learning can suggest novel antibody variable domain sequences that are superior to those observed during training. Our machine learning method does not require any target structural information. We further show that data from disparate antibody campaigns can be combined by machine learning to improve antibody specificity.


2020 ◽  
Author(s):  
Yinxue Liu ◽  
Paul Bates ◽  
Jeffery Neal ◽  
Dai Yamazaki

<p>Precise representation of global terrain is of great significance for estimating global flood risk. Urban areas, the most vulnerable to flooding, need GDEMs of high quality. However, current Global Digital Elevation Models (GDEMs) are all Digital Surface Models (DSMs) in urban areas, which cause substantial blockages of flow pathways within flood inundation models. Taking GPS and LIDAR data as terrain observations, the errors of popular GDEMs (the SRTM 1” void-filled DEM - SRTM, the Multi-Error-Removed Improved-Terrain DEM - MERIT, and the TanDEM-X 3” resolution DEM - TDM3) were analysed in seven varied types of cities. It was found that the RMSE of GDEM errors is in the range of 2.3 m to 7.9 m, and that MERIT and TDM3 both outperformed SRTM. The error comparison between MERIT and TDM3 showed that the most accurate model varied among the studied cities. Generally, the error of TDM3 is slightly lower than that of MERIT, but TDM3 has more extreme errors (absolute value exceeding 15 m). For cities that have experienced rapid development in the past decade, the RMSE of MERIT is lower than that of TDM3, mainly because of the difference in acquisition time between the two models. A machine learning method was adopted to estimate MERIT error. Night-time light, world population density data, OpenStreetMap building data, slope, elevation and neighbourhood elevation values from widely available datasets, comprising 14 factors in total, were used in the regression. Models were trained on single cities and on combinations of cities, respectively, and then used to estimate error in a target city. With this approach, the RMSE of the corrected MERIT can decline by up to 75% with a model trained on the target city, though a less significant reduction of 35%-68% was achieved with combined models that excluded the target city from the training data. Further validation via flood simulation in a small-sized city showed improvements by the corrected MERIT over the original MERIT in terms of both flood extent and inundation depth. However, the corrected MERIT was not as good as TDM3 in this case. This method has the potential to generate a better bare-earth global DEM in urban areas, but the sensitivity of the model to extrapolative application needs investigation in more study sites.</p>


Author(s):  
N. A. K. Doan ◽  
W. Polifke ◽  
L. Magri

We propose a physics-constrained machine learning method—based on reservoir computing—to time-accurately predict extreme events and long-term velocity statistics in a model of chaotic flow. The method leverages the strengths of two different approaches: empirical modelling based on reservoir computing, which learns the chaotic dynamics from data only, and physical modelling based on conservation laws. This enables the reservoir computing framework to output physical predictions when training data are unavailable. We show that the combination of the two approaches is able to accurately reproduce the velocity statistics, and to predict the occurrence and amplitude of extreme events in a model of the self-sustaining process in turbulence. In this flow, the extreme events are abrupt transitions from turbulent to quasi-laminar states, which are deterministic phenomena that cannot be traditionally predicted because of chaos. Furthermore, the physics-constrained machine learning method is shown to be robust with respect to noise. This work opens up new possibilities for synergistically enhancing data-driven methods with physical knowledge for the time-accurate prediction of chaotic flows.
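The empirical half of such a hybrid method, the reservoir computer, can be sketched as a minimal echo state network: a fixed random recurrent layer with only the linear readout trained. Everything below is a toy stand-in (a sine signal instead of turbulence data), and the conservation-law penalty that would make it physics-constrained is only noted in a comment.

```python
import numpy as np

# Minimal echo state network sketch. Reservoir and input weights are
# fixed and random; only the readout W_out is fitted by ridge regression.
rng = np.random.default_rng(1)
n_res, washout = 200, 100
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1 for stability
W_in = rng.uniform(-0.5, 0.5, size=n_res)

u = np.sin(0.1 * np.arange(1200))                 # toy signal standing in for velocity data
x = np.zeros(n_res)
states = []
for t in range(len(u) - 1):
    x = np.tanh(W @ x + W_in * u[t])              # reservoir update
    states.append(x.copy())
S = np.array(states[washout:])                    # discard transient states
target = u[washout + 1:]                          # one-step-ahead targets

# Ridge regression for the readout; a physics-constrained variant would
# add a penalty here enforcing the flow's conservation laws.
beta = 1e-6
W_out = np.linalg.solve(S.T @ S + beta * np.eye(n_res), S.T @ target)
err = float(np.sqrt(np.mean((S @ W_out - target) ** 2)))
print(f"one-step prediction RMSE: {err:.4f}")
```

For chaotic data the trained network would then be run autonomously (feeding its own predictions back in) to forecast extreme events until the trajectories diverge.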


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Zhuyifan Ye ◽  
Defang Ouyang

Abstract: Rapid solvent selection is of great significance in chemistry, yet solubility prediction remains a crucial challenge. This study aimed to develop machine learning models that can accurately predict compound solubility in organic solvents. A dataset containing 5081 experimental temperature and solubility data points for compounds in organic solvents was extracted and standardized. Molecular fingerprints were selected to characterize structural features. LightGBM was compared with deep learning and traditional machine learning methods (PLS, Ridge regression, kNN, DT, ET, RF, SVM) to develop models for predicting solubility in organic solvents at different temperatures. Compared to the other models, LightGBM exhibited significantly better overall generalization (logS ± 0.20). For unseen solutes, our model gave a prediction accuracy (logS ± 0.59) close to the expected noise level of experimental solubility data. LightGBM also revealed the physicochemical relationship between solubility and structural features. Our method enables rapid solvent screening in chemistry and may be applied to solubility prediction in other solvents.


2017 ◽  
Vol 24 (14) ◽  
pp. 2012-2020 ◽  
Author(s):  
Akira Yasumura ◽  
Mikimasa Omori ◽  
Ayako Fukuda ◽  
Junichi Takahashi ◽  
Yukiko Yasumura ◽  
...  

Objective: To establish valid, objective biomarkers for ADHD using machine learning. Method: Machine learning with a support vector machine (SVM) was used to predict disorder severity from new brain function data. A multicenter approach was used to collect data for machine learning training, including behavioral and physiological indicators, age, and reverse Stroop task (RST) data from 108 children with ADHD and 108 typically developing (TD) children. Near-infrared spectroscopy (NIRS) was used to quantify the change in prefrontal cortex oxygenated hemoglobin during the RST. Verification data came from 62 children with ADHD and 37 TD children from six facilities in Japan. Results: The SVM's general performance showed a sensitivity of 88.71%, a specificity of 83.78%, and an overall discrimination rate of 86.25%. Conclusion: An SVM using an objective index from the RST may be useful as an auxiliary biomarker for the diagnosis of children with ADHD.
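The reported sensitivity, specificity, and discrimination rate all follow from a 2x2 confusion matrix. The counts below are invented for illustration (chosen only to be consistent with 62 ADHD and 37 TD verification cases), not the study's actual confusion matrix.

```python
# Sketch of how the three reported metrics are computed from
# confusion-matrix counts. tp/fn/tn/fp values here are hypothetical.
def discrimination_metrics(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)              # ADHD cases correctly flagged
    specificity = tn / (tn + fp)              # TD cases correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # overall discrimination rate
    return sensitivity, specificity, accuracy

sens, spec, acc = discrimination_metrics(tp=55, fn=7, tn=31, fp=6)
print(f"sensitivity={sens:.2%} specificity={spec:.2%} accuracy={acc:.2%}")
```

Reporting sensitivity and specificity separately matters here because the verification set is imbalanced (62 vs. 37), so overall accuracy alone would favour the majority class.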


Author(s):  
Miroslav Stampar ◽  
Kresimir Fertalj

Recognition of domain names generated by domain generation algorithms (DGAs) is an essential part of malware detection through inspection of network traffic. Besides basic heuristics (HE) and limited detection based on blacklists, the most promising course seems to be machine learning (ML). There is a lack of studies that extensively compare different ML models in the field of DGA binary classification, including both conventional and deep learning (DL) representatives. The few that exist are focused on a small set of models, use a poor set of features in their ML models, or fail to secure unbiased independence between training and evaluation samples. To overcome these limitations, we engineered a robust feature set, and accordingly trained and evaluated 14 ML, 9 DL, and 2 comparative models on two independent datasets. Results show that if ML features are properly engineered, there is a marginal difference in overall score between the top ML and DL representatives. This paper represents the first attempt to neutrally compare the performance of many different models for the recognition of DGA domain names, where the best models perform as well as the top representatives from the literature.
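Feature engineering for DGA classification typically turns a domain name into a small numeric vector before any model sees it. The features below (length, character entropy, digit and vowel ratios) are a common illustrative subset, not the authors' engineered feature set.

```python
import math
from collections import Counter

# Illustrative hand-engineered features for a domain name; a classifier
# (conventional ML or DL) would be trained on vectors like these.
def domain_features(domain: str) -> dict:
    name = domain.split(".")[0].lower()       # drop the TLD
    counts = Counter(name)
    # Character-level Shannon entropy: high for algorithmically generated names
    entropy = -sum(c / len(name) * math.log2(c / len(name)) for c in counts.values())
    return {
        "length": len(name),
        "entropy": entropy,
        "digit_ratio": sum(ch.isdigit() for ch in name) / len(name),
        "vowel_ratio": sum(ch in "aeiou" for ch in name) / len(name),
    }

# DGA-generated names tend to be longer, higher-entropy and vowel-poor
print(domain_features("google.com"))
print(domain_features("xj4k9qzv2mwp7.net"))
```

The unbiased-independence concern the paper raises would apply on top of this: feature statistics must be computed from training domains only, and evaluation domains must come from DGA families not seen during training.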


Author(s):  
Э. Д. Алисултанова ◽  
У. Р. Тасуев ◽  
Н. А. Моисеенко

This paper discusses machine learning algorithms that construct a mathematical model based on sample data, known as "training data," to make predictive decisions without an explicitly specified algorithm for the task at hand. Complex marketing problems are addressed through machine learning technologies, with primary attention to individual customer support and new product development. The proposed solutions, based on intelligent systems for the most complex business tasks, will make it possible to predict variations in customer behavior. In this case, machine learning algorithms for implementing business projects are used to solve problems for which it is difficult or impossible to develop a traditional algorithm that performs the task effectively. The applied machine learning technologies help systematize and extract information from huge sets of raw data.


2021 ◽  
Vol 8 (1) ◽  
pp. 13
Author(s):  
Stefano Sfarra ◽  
Gianfranco Gargiulo ◽  
Mohammed Omar

The use of infrared thermography offers unique perspectives in the imaging of artifacts, helping to interrogate their surface and subsurface characteristics, highlight deviations and detect contrast. This research capitalizes on active and passive thermal imagery along with advanced machine learning-based algorithms for pre- and post-processing of the acquired scans. Such codes operate efficiently (compressing the data) to help link the observed temperature variations to the thermophysical parameters of the targeted samples. One such processing modality is dictionary learning, which infers a "frame dictionary" so that the scans can be represented as sparse linear combinations of a small set of learned features. This technique (along with factorization- and component analysis-based methods) was used in the current research on ancient polychrome marquetries aimed at detecting aging anomalies. The presented research is unique in terms of the targeted samples and the applied approaches, and should provide specific guidance to similar domains.
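The sparse-representation step at the heart of dictionary learning can be sketched with greedy matching pursuit: approximate a signal by repeatedly picking the best-correlated dictionary atom. The dictionary and "scan" vector below are synthetic; a real pipeline would also learn the dictionary from the thermal data rather than fix it at random.

```python
import numpy as np

# Toy sparse coding against a fixed random dictionary. In dictionary
# learning proper, D itself would be fitted to the thermal scans.
rng = np.random.default_rng(2)
n_features, n_atoms, k = 32, 16, 3
D = rng.normal(size=(n_features, n_atoms))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms

# Build a "scan" that truly uses 3 atoms, plus small noise
signal = D[:, [1, 5, 9]] @ np.array([2.0, -1.5, 1.0]) + 0.01 * rng.normal(size=n_features)

# Greedy matching pursuit: k passes of best-atom selection
residual = signal.copy()
chosen = []
for _ in range(k):
    j = int(np.argmax(np.abs(D.T @ residual)))  # atom most correlated with residual
    chosen.append(j)
    coef = D[:, j] @ residual
    residual = residual - coef * D[:, j]

print("selected atoms:", sorted(set(chosen)))
print("residual norm: %.3f" % np.linalg.norm(residual))
```

The sparse codes (which atoms fire, and how strongly) are what compress the scans and expose anomalies: a region whose codes deviate from the rest of the panel is a candidate aging defect.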


2019 ◽  
Author(s):  
Andrew Medford ◽  
Shengchun Yang ◽  
Fuzhu Liu

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on high and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx and OHx species on the oxygen-vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N^1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.
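The uncertainty-driven selection loop can be sketched with a toy Gaussian process: fit on a few computed points, then query the candidate with the largest posterior variance as the next "DFT calculation". Everything here is a 1-D synthetic stand-in (a sine function for the adsorption energy, an RBF kernel with a guessed length scale), not the paper's model.

```python
import numpy as np

# Toy GP regression with an RBF kernel and an uncertainty-based
# active-learning step. Energies, kernel, and length scale are invented.
def rbf(A, B, ls=0.5):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ls**2)

f = lambda x: np.sin(3 * x)                    # stand-in for adsorption energy
X_train = np.array([0.1, 0.5, 2.0])            # configurations already computed
y_train = f(X_train)
X_cand = np.linspace(0.0, 3.0, 61)             # candidate configurations

K = rbf(X_train, X_train) + 1e-8 * np.eye(len(X_train))
K_s = rbf(X_cand, X_train)
mean = K_s @ np.linalg.solve(K, y_train)                       # posterior mean
var = 1.0 - np.einsum("ij,ji->i", K_s, np.linalg.solve(K, K_s.T))  # posterior variance

# Active-learning step: compute the most uncertain candidate next
i_next = int(np.argmax(var))
next_x = X_cand[i_next]
print(f"query next: x={next_x:.2f}, predicted {mean[i_next]:.2f}, "
      f"std {np.sqrt(max(var[i_next], 0.0)):.2f}")
```

Iterating this loop concentrates expensive calculations where the surrogate is least certain, which is how the framework reaches ~0.3 eV errors from a small fraction of the full DFT dataset.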


2018 ◽  
Vol 6 (2) ◽  
pp. 283-286
Author(s):  
M. Samba Siva Rao ◽  
M. Yaswanth ◽  
K. Raghavendra Swamy ◽  
...  
