Prediction of small-molecule compound solubility in organic solvents by machine learning algorithms

2021
Vol 13 (1)
Author(s):
Zhuyifan Ye
Defang Ouyang

Abstract: Rapid solvent selection is of great significance in chemistry. However, solubility prediction remains a crucial challenge. This study aimed to develop machine learning models that can accurately predict compound solubility in organic solvents. A dataset of 5081 experimental temperature and solubility data points for compounds in organic solvents was extracted and standardized. Molecular fingerprints were selected to characterize structural features. lightGBM was compared with deep learning and traditional machine learning methods (PLS, Ridge regression, kNN, DT, ET, RF, SVM) to develop models for predicting solubility in organic solvents at different temperatures. Compared to the other models, lightGBM exhibited significantly better overall generalization (logS ± 0.20). For unseen solutes, our model gave a prediction accuracy (logS ± 0.59) close to the expected noise level of experimental solubility data. lightGBM revealed the physicochemical relationship between solubility and structural features. Our method enables rapid solvent screening in chemistry and may be applied to solubility prediction in other solvents.
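The "unseen solutes" evaluation mentioned above implies that no solute in the test set also appears in the training set. The abstract does not say how the split was made, so the following is only a minimal sketch of one way to do it. The record layout (`solute`, `solvent`, `temperature_K`, `logS` keys), the function name `split_by_solute`, and the 20% test fraction are all assumptions made for illustration.

```python
import random


def split_by_solute(records, test_fraction=0.2, seed=0):
    """Split records so that every solute lands entirely in train or in test.

    `records` is a list of dicts with a "solute" key (hypothetical layout).
    """
    solutes = sorted({r["solute"] for r in records})
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    rng.shuffle(solutes)
    n_test = max(1, round(len(solutes) * test_fraction))
    test_solutes = set(solutes[:n_test])
    train = [r for r in records if r["solute"] not in test_solutes]
    test = [r for r in records if r["solute"] in test_solutes]
    return train, test


# Toy data: 5 made-up solutes, each measured in 2 solvents at one temperature.
records = [
    {"solute": f"solute_{i}", "solvent": s, "temperature_K": 298.15, "logS": 0.0}
    for i in range(5)
    for s in ("ethanol", "acetone")
]
train, test = split_by_solute(records)
```

A plain random split over rows would let the same solute appear on both sides, which would measure overall generalization rather than performance on unseen solutes.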

PLoS ONE
2017
Vol 12 (4)
pp. e0175683
Author(s):
Raymond Salvador
Joaquim Radua
Erick J. Canales-Rodríguez
Aleix Solanes
Salvador Sarró
...

2017
Author(s):
Anna Rychkova
MyMy C. Buu
Curt Scharfe
Martina I. Lefterova
Justin I. Odegaard
...

Abstract: Rapid, accurate, and inexpensive genome sequencing promises to transform medical care. However, a critical hurdle to enabling personalized genomic medicine is predicting the functional impact of novel genomic variation. Various methods for predicting the pathogenicity of missense variants have been developed. Here we present a new strategy for developing a pathogenicity predictor of improved accuracy by training a supervised machine learning model in a gene-specific manner. Our meta-predictor combines the outputs of various existing predictors, supplements them with an extended set of stability and structural features of the protein, as well as its physicochemical properties, and adds information about allele frequency from various datasets. We used this supervised gene-specific meta-predictor approach to train the model on the CFTR gene and to predict the pathogenicity of about 1,000 variants of unknown significance that we collected from various publicly available and internal resources. Our CFTR-specific meta-predictor based on the Random Forest model performs better than the other machine learning algorithms that we tested, and also outperforms other available tools, such as CADD, MutPred, SIFT, and PolyPhen-2. Our predicted pathogenicity probability correlates well with clinical measures of Cystic Fibrosis patients and experimental functional measures of mutated CFTR proteins. Training the model on one gene, in contrast to taking a genome-wide approach, allows structural features specific to a particular protein to be taken into account, thus increasing the overall accuracy of the predictor. Collecting data from several separate resources, on the other hand, makes it possible to accumulate allele frequency information, estimated as the most important feature by our approach, for a larger set of variants. Finally, our predictor will be hosted on the ClinGen Consortium database to make it available to CF researchers and to serve as a feasibility pilot study for other Mendelian diseases.


2020
Vol 10 (1)
Author(s):
Rinu Chacko
Deepak Jain
Manasi Patwardhan
Abhishek Puri
Shirish Karande
...

Abstract: Machine learning and data analytics are increasingly used for quantitative structure-property relationship (QSPR) applications in the chemical domain, where the traditional Edisonian approach to knowledge discovery has not been fruitful. The perception of odorant stimuli is one such application, as olfaction is the least understood of the senses. In this study, we employ machine learning algorithms and data analytics to assess the efficacy of a data-driven approach for predicting the perceptual attributes of an odorant, namely the odorant characters (OC) of “sweet” and “musky”. We first analyze a psychophysical dataset containing perceptual ratings from 55 subjects to reveal patterns in the ratings given by subjects. We then use the data to train several machine learning algorithms, such as random forest, gradient boosting and support vector machine, to predict the odor characters, and report the structural features that correlate well with the odor characters based on the optimal model. Furthermore, we analyze the impact of data quality on the performance of the models by comparing the semantic descriptors generally associated with a given odorant to its perception by the majority of the subjects. The study presents a methodology for developing models of odor perception and provides insights into the perception of odorants by untrained human subjects and the effect of the inherent bias in the perception data on model performance. The models and methodology developed here could be used to predict the odor characters of new odorants.


Author(s):  
Benjamin Stone ◽  
Erik Sapper

Biofilms are congregations of bacteria on a surface, and they grow into obstacles to the functioning of any device or machinery that involves anything biological. Biofilms develop through a biochemical system known as ‘Quorum Sensing’, which accounts for the chemical signaling that directs either biofilm formation or inhibition. Computational models that relate chemical and structural features of compounds to their performance properties have been used to aid in the discovery of active small molecules for many decades. These quantitative structure-activity relationship (QSAR) models are also important for predicting the activity of molecules that can have a range of effectiveness in biological systems. This study uses QSAR methodologies combined with different machine learning algorithms to predict and assess the performance of several different compounds acting in Quorum Sensing. Through computational probing of the quorum sensing molecular interaction, new design rules can be elucidated for countering biofilms.


2016
Author(s):
Noah Fleming
Benjamin Kinsella
Christopher Ing

Abstract: A large number of human diseases result from disruptions to protein structure and function caused by missense mutations. Computational methods are frequently employed to assist in the prediction of protein stability upon mutation. These methods utilize a combination of protein sequence data, protein structure data, empirical energy functions, and physicochemical properties of amino acids. In this work, we present the first use of dynamic protein structural features to improve stability predictions upon mutation. This is achieved through the use of a set of timeseries extracted from microsecond-timescale atomistic molecular dynamics simulations of proteins. Standard machine learning algorithms using the mean, variance, and histograms of these timeseries were found to be 60-70% accurate in stability classification based on experimental ΔΔG or protein-chaperone interaction measurements. A recurrent neural network with full treatment of the timeseries data was found to be 80% accurate according to the F1 score. The performance of our models was found to be equal to or better than that of two recently developed machine learning methods for binary classification as well as two industry-standard stability prediction algorithms. In addition to classification, understanding the molecular basis of protein stability disruption due to disease-causing mutations is a significant challenge that impedes the development of drugs and therapies that may be used to treat genetic diseases. The use of dynamic structural features allows for novel insight into the molecular basis of protein disruption by mutation in a diverse set of soluble proteins. To assist in the interpretation of machine learning results, we present a technique for determining the importance of features to a recurrent neural network using Garson’s method. We propose a novel extension of neural interpretation diagrams by implementing Garson’s method to scale each node in the neural interpretation diagram according to its relative importance to the network.
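For readers unfamiliar with Garson's method, here is a minimal sketch of one common formulation for a plain feedforward network with a single hidden layer and a single output. It is not the authors' recurrent-network extension, which the abstract does not describe in detail. The function name `garson_importance` and the weight layout are assumptions made for illustration.

```python
def garson_importance(w_in, w_out):
    """Relative importance of each input, in one common form of Garson's method.

    w_in[i][j] is the weight from input i to hidden unit j;
    w_out[j] is the weight from hidden unit j to the single output.
    Each input's share of a hidden unit's absolute incoming weight is scaled
    by that unit's absolute output weight, summed over hidden units, and then
    normalised so that the importances add up to 1.
    """
    n_inputs = len(w_in)
    n_hidden = len(w_out)
    scores = [0.0] * n_inputs
    for j in range(n_hidden):
        column_total = sum(abs(w_in[i][j]) for i in range(n_inputs))
        if column_total == 0.0:
            continue  # a hidden unit with no incoming weight contributes nothing
        for i in range(n_inputs):
            scores[i] += abs(w_in[i][j]) / column_total * abs(w_out[j])
    total = sum(scores)
    if total == 0.0:
        return [0.0] * n_inputs
    return [s / total for s in scores]


# Two inputs, two hidden units, made-up weights.
importance = garson_importance([[1.0, 0.0], [1.0, 2.0]], [1.0, 1.0])
```

Because the method works on absolute weights, it ranks inputs by magnitude of influence and says nothing about the direction of the effect.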


Author(s):  
Alane Lima ◽  
André Vignatti ◽  
Murilo Silva

The empirical study of large real-world networks over the last 20 years has shown that a variety of real-world graphs are power-law. There is evidence that optimization problems seem easier in these graphs; however, for a given graph, classifying it as power-law is a problem in itself. In this work, we propose using machine learning algorithms (KNN, SVM, Gradient Boosting and Random Forests) for this task. We suggest a graph representation based on [Canning et al. 2018], but using a much simplified set of structural properties of the graph. We compare the proposed representation with the one generated by the graph2vec framework. The experiments attained high accuracy, indicating that a reduced set of structural graph properties is enough for the presented problem.
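To make the idea of a graph representation built from a few structural properties concrete, the sketch below turns an edge list into a small feature dictionary that a classifier could consume. The abstract does not list the properties actually used, so the chosen features (node count, edge count, mean degree, maximum degree, density) and the function name `degree_features` are assumptions for illustration only.

```python
from collections import defaultdict


def degree_features(edges):
    """Compute simple structural features of an undirected graph.

    `edges` is an iterable of (u, v) pairs. Self-loops are ignored and
    repeated edges are counted once. Isolated nodes cannot be expressed in
    an edge list, so they are not counted.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue
        neighbours[u].add(v)
        neighbours[v].add(u)
    degrees = [len(adjacent) for adjacent in neighbours.values()]
    n = len(degrees)
    m = sum(degrees) // 2  # each edge contributes to two degrees
    return {
        "nodes": n,
        "edges": m,
        "mean_degree": sum(degrees) / n if n else 0.0,
        "max_degree": max(degrees, default=0),
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
    }


# Star graph: one hub connected to four leaves.
features = degree_features([(0, 1), (0, 2), (0, 3), (0, 4)])
```

Feature vectors of this kind, one per graph, could then be passed to any of the classifiers named in the abstract.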


2020
Vol 182
pp. 03007
Author(s):
John Lai
David Chao
Alvin Wu
Carl Wang

A novel way to apply machine learning algorithms to incremental capacity analysis (dQ/dV) is developed to identify battery cycling conditions under different temperatures and working SOC ranges. Batteries are cycled under each combination of temperature (-10 °C, 25 °C, 60 °C) and SOC range (0-10%, 25-75%, 90-100%, 0-100%) for up to 60 equivalent cycles. The discharge data are transformed into a dQ/dV-V curve, and the features of its peaks and valleys are then used for machine learning. Both unsupervised and supervised machine learning algorithms (PCA and LDA, respectively) are applied to classify batteries in terms of temperature or SOC range. The results reveal that batteries cycled under different temperatures can be identified separately regardless of the working SOC range. When the 60 samples are split with a training-set ratio of 0.85, the remaining test set gives an identification accuracy of 89% for temperature and 67% for working SOC range.
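The transformation of discharge data into a dQ/dV-V curve can be sketched with finite differences, followed by a simple search for local peaks and valleys. This is a minimal illustration, not the authors' processing pipeline: the function names, the `min_dv` threshold, and the toy data are assumptions, and real measurements would usually need smoothing first.

```python
def incremental_capacity(voltage, capacity, min_dv=1e-6):
    """Return (V_mid, dQ/dV) pairs from consecutive samples.

    Steps whose voltage change is smaller than `min_dv` are skipped to
    avoid dividing by (nearly) zero on voltage plateaus.
    """
    points = []
    for k in range(1, len(voltage)):
        dv = voltage[k] - voltage[k - 1]
        if abs(dv) < min_dv:
            continue
        dq = capacity[k] - capacity[k - 1]
        v_mid = 0.5 * (voltage[k] + voltage[k - 1])
        points.append((v_mid, dq / dv))
    return points


def peaks_and_valleys(points):
    """Find strict local maxima and minima of dQ/dV among interior points."""
    peaks, valleys = [], []
    for k in range(1, len(points) - 1):
        prev, cur, nxt = points[k - 1][1], points[k][1], points[k + 1][1]
        if cur > prev and cur > nxt:
            peaks.append(points[k])
        elif cur < prev and cur < nxt:
            valleys.append(points[k])
    return peaks, valleys


# Made-up samples (voltage in V, capacity in Ah), for illustration only.
voltage = [3.0, 3.1, 3.2, 3.3, 3.4]
capacity = [0.0, 0.1, 0.4, 0.5, 0.9]
curve = incremental_capacity(voltage, capacity)
peaks, valleys = peaks_and_valleys(curve)
```

The positions and heights of the detected peaks and valleys are the kind of features that could then be fed to PCA or LDA.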


2021
Vol 12 (1)
Author(s):
Yunan Luo
Guangde Jiang
Tianhao Yu
Yang Liu
Lam Vo
...

Abstract: Machine learning has been increasingly used for protein engineering. However, because the general sequence contexts they capture are not specific to the protein being engineered, the accuracy of existing machine learning algorithms is rather limited. Here, we report ECNet (evolutionary context-integrated neural network), a deep-learning algorithm that exploits evolutionary contexts to predict functional fitness for protein engineering. This algorithm integrates local evolutionary context from homologous sequences that explicitly model residue-residue epistasis for the protein of interest with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. As such, it enables accurate mapping from sequence to function and provides generalization from low-order mutants to higher-order mutants. Using ~50 deep mutational scanning and random mutagenesis datasets, we show that ECNet predicts the sequence-function relationship more accurately than existing machine learning algorithms. Moreover, we used ECNet to guide the engineering of TEM-1 β-lactamase and identified variants with improved ampicillin resistance with high success rates.


2020
Vol 11 (1)
Author(s):
Samuel Boobier
David R. J. Hose
A. John Blacker
Bao N. Nguyen

Abstract: Solubility prediction remains a critical challenge in drug development, synthetic route and chemical process design, extraction and crystallisation. Here we report a successful approach to solubility prediction in organic solvents and water using a combination of machine learning (ANN, SVM, RF, ExtraTrees, Bagging and GP) and computational chemistry. Rational interpretation of the dissolution process as a numerical problem led to a small set of selected descriptors and subsequent predictions that are independent of the applied machine learning method. These models gave significantly more accurate predictions than benchmarked open-access and commercial tools, achieving accuracy close to the expected level of noise in the training data (LogS ± 0.7). Finally, they reproduced the physicochemical relationship between solubility and molecular properties in different solvents, which led to rational approaches to improving the accuracy of each model.

