Prediction of small-molecule compound solubility in organic solvents by machine learning algorithms

Abstract: Rapid solvent selection is of great significance in chemistry. However, solubility prediction remains a crucial challenge. This study aimed to develop machine learning models that can accurately predict compound solubility in organic solvents. A dataset of 5081 experimental temperature and solubility data points for compounds in organic solvents was extracted and standardized. Molecular fingerprints were selected to characterize structural features. lightGBM was compared with deep learning and traditional machine learning methods (PLS, Ridge regression, kNN, DT, ET, RF, SVM) to develop models for predicting solubility in organic solvents at different temperatures. Compared to the other models, lightGBM exhibited significantly better overall generalization (logS ± 0.20). For unseen solutes, our model gave a prediction accuracy (logS ± 0.59) close to the expected noise level of experimental solubility data. lightGBM also revealed the physicochemical relationship between solubility and structural features. Our method enables rapid solvent screening in chemistry and may be applied to solubility prediction in other solvents.
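To give a sense of scale for the error bounds quoted in this abstract, the following minimal sketch converts a logS tolerance into a multiplicative factor on solubility. It assumes logS is a base-10 logarithm of solubility, which the abstract does not state explicitly; the function name `log_error_to_fold` is illustrative and not taken from the paper.

```python
def log_error_to_fold(delta_log_s: float) -> float:
    """Convert an error of +/- delta_log_s (base-10 log units) into a fold factor.

    Assumption: logS is log10 of solubility, so an error of d log units
    means the predicted solubility is within a factor of 10**d of the truth.
    """
    return 10.0 ** delta_log_s


# Error bounds quoted in the abstract (overall and unseen-solute cases).
for delta in (0.20, 0.59):
    fold = log_error_to_fold(delta)
    print(f"logS +/- {delta:.2f} -> within a factor of about {fold:.2f}")
```

Under that assumption, ± 0.20 log units corresponds to a factor of roughly 1.6 in solubility, and ± 0.59 to a factor of roughly 3.9.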

Evaluation of machine learning algorithms and structural features for optimal MRI-based diagnostic prediction in psychosis

PLoS ONE ◽

10.1371/journal.pone.0175683 ◽

2017 ◽

Vol 12 (4) ◽

pp. e0175683 ◽

Cited By ~ 38

Author(s):

Raymond Salvador ◽

Joaquim Radua ◽

Erick J. Canales-Rodríguez ◽

Aleix Solanes ◽

Salvador Sarró ◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Structural Features ◽

Machine Learning Algorithms

Developing Gene-Specific Meta-Predictor of Variant Pathogenicity

10.1101/115956 ◽

2017 ◽

Author(s):

Anna Rychkova ◽

MyMy C. Buu ◽

Curt Scharfe ◽

Martina I. Lefterova ◽

Justin I. Odegaard ◽

...

Keyword(s):

Machine Learning ◽

Allele Frequency ◽

Genomic Medicine ◽

Structural Features ◽

Machine Learning Algorithms ◽

Genomic Variation ◽

Supervised Machine Learning ◽

Internal Resources ◽

Variants Of Unknown Significance ◽

A Genome

Abstract: Rapid, accurate, and inexpensive genome sequencing promises to transform medical care. However, a critical hurdle to enabling personalized genomic medicine is predicting the functional impact of novel genomic variation. Various methods for predicting the pathogenicity of missense variants have already been developed. Here we present a new strategy for developing a pathogenicity predictor of improved accuracy by training and applying a supervised machine learning model in a gene-specific manner. Our meta-predictor combines the outputs of various existing predictors, supplements them with an extended set of stability and structural features of the protein, as well as its physicochemical properties, and adds information about allele frequency from various datasets. We used this supervised gene-specific meta-predictor approach to train the model on the CFTR gene and to predict the pathogenicity of about 1,000 variants of unknown significance that we collected from various publicly available and internal resources. Our CFTR-specific meta-predictor, based on the Random Forest model, performs better than the other machine learning algorithms that we tested, and also outperforms other available tools, such as CADD, MutPred, SIFT, and PolyPhen-2. Our predicted pathogenicity probability correlates well with clinical measures of Cystic Fibrosis patients and with experimental functional measures of mutated CFTR proteins. Training the model on one gene, in contrast to taking a genome-wide approach, makes it possible to take into account structural features specific to a particular protein, thus increasing the overall accuracy of the predictor. Collecting data from several separate resources, on the other hand, makes it possible to accumulate allele frequency information, estimated as the most important feature by our approach, for a larger set of variants. Finally, our predictor will be hosted on the ClinGen Consortium database to make it available to CF researchers and to serve as a feasibility pilot study for other Mendelian diseases.

Data based predictive models for odor perception

Scientific Reports ◽

10.1038/s41598-020-73978-1 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Rinu Chacko ◽

Deepak Jain ◽

Manasi Patwardhan ◽

Abhishek Puri ◽

Shirish Karande ◽

...

Keyword(s):

Machine Learning ◽

Data Analytics ◽

Model Performance ◽

Structural Features ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Structure Property ◽

Odor Perception ◽

The Impact

Abstract: Machine learning and data analytics are increasingly being used for quantitative structure property relation (QSPR) applications in the chemical domain, where the traditional Edisonian approach to knowledge discovery has not been fruitful. The perception of odorant stimuli is one such application, as olfaction is the least understood of the senses. In this study, we employ machine learning algorithms and data analytics to assess the efficacy of a data-driven approach for predicting the perceptual attributes of an odorant, namely the odorant characters (OC) “sweet” and “musky”. We first analyze a psychophysical dataset containing perceptual ratings from 55 subjects to reveal patterns in the ratings given by the subjects. We then use the data to train several machine learning algorithms, such as random forest, gradient boosting and support vector machine, to predict the odor characters, and report the structural features that correlate well with the odor characters based on the optimal model. Furthermore, we analyze the impact of data quality on model performance by comparing the semantic descriptors generally associated with a given odorant to its perception by the majority of the subjects. The study presents a methodology for developing models of odor perception and provides insights into the perception of odorants by untrained human subjects and the effect of the inherent bias in the perception data on model performance. The models and methodology developed here could be used for predicting the odor characters of new odorants.

Machine Learning for the Design and Development of Biofilm Regulators

10.20944/preprints201803.0118.v1 ◽

2018 ◽

Author(s):

Benjamin Stone ◽

Erik Sapper

Keyword(s):

Machine Learning ◽

Quorum Sensing ◽

Computational Models ◽

Quantitative Structure Activity Relationship ◽

Structural Features ◽

Machine Learning Algorithms ◽

Chemical Signaling ◽

Biochemical System ◽

Structure Activity ◽

Qsar Models

Biofilms are congregations of bacteria on a surface, and they grow into obstacles to the functioning of any device or machinery that involves anything biological. Biofilms develop through a biochemical system known as ‘Quorum Sensing’, which accounts for the chemical signaling that directs either biofilm formation or inhibition. Computational models that relate the chemical and structural features of compounds to their performance properties have been used to aid the discovery of active small molecules for many decades. These quantitative structure-activity relationship (QSAR) models are also important for predicting the activity of molecules that can have a range of effectiveness in biological systems. This study uses QSAR methodologies combined with different machine learning algorithms to predict and assess the performance of several different compounds acting in Quorum Sensing. Through computational probing of the quorum sensing molecular interaction, new design rules can be elucidated for countering biofilms.

Predicting Protein Thermostability Upon Mutation Using Molecular Dynamics Timeseries Data

10.1101/078246 ◽

2016 ◽

Cited By ~ 1

Author(s):

Noah Fleming ◽

Benjamin Kinsella ◽

Christopher Ing

Keyword(s):

Neural Network ◽

Machine Learning ◽

Molecular Dynamics ◽

Protein Structure ◽

Protein Stability ◽

Recurrent Neural Network ◽

Molecular Basis ◽

Sequence Data ◽

Structural Features ◽

Machine Learning Algorithms

Abstract: A large number of human diseases result from disruptions to protein structure and function caused by missense mutations. Computational methods are frequently employed to assist in the prediction of protein stability upon mutation. These methods utilize a combination of protein sequence data, protein structure data, empirical energy functions, and physicochemical properties of amino acids. In this work, we present the first use of dynamic protein structural features to improve stability predictions upon mutation. This is achieved through the use of a set of timeseries extracted from microsecond-timescale atomistic molecular dynamics simulations of proteins. Standard machine learning algorithms using the mean, variance, and histograms of these timeseries were found to be 60-70% accurate in stability classification based on experimental ΔΔG or protein-chaperone interaction measurements. A recurrent neural network with full treatment of the timeseries data was found to be 80% accurate according to the F1 score. The performance of our models was found to be equal to or better than that of two recently developed machine learning methods for binary classification, as well as two industry-standard stability prediction algorithms. In addition to classification, understanding the molecular basis of protein stability disruption due to disease-causing mutations is a significant challenge that impedes the development of drugs and therapies that may be used to treat genetic diseases. The use of dynamic structural features allows for novel insight into the molecular basis of protein disruption by mutation in a diverse set of soluble proteins. To assist in the interpretation of machine learning results, we present a technique for determining the importance of features to a recurrent neural network using Garson’s method. We propose a novel extension of neural interpretation diagrams by implementing Garson’s method to scale each node in the neural interpretation diagram according to its relative importance to the network.
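As a rough illustration of the kind of weight-based importance measure referred to here, the sketch below implements one commonly cited formulation of Garson's method for a feedforward network with a single hidden layer and a single output. It is not the recurrent-network extension described in the abstract, and the function name, argument layout, and toy weights are assumptions made for this example.

```python
from typing import List


def garson_importance(w_in: List[List[float]], w_out: List[float]) -> List[float]:
    """Relative importance of each input for a one-hidden-layer network.

    w_in[i][j] is the weight from input i to hidden unit j;
    w_out[j] is the weight from hidden unit j to the single output.
    Each hidden unit's output weight is shared among the inputs in
    proportion to their absolute input weights, then normalized to sum to 1.
    """
    n_inputs, n_hidden = len(w_in), len(w_out)
    scores = [0.0] * n_inputs
    for j in range(n_hidden):
        col_sum = sum(abs(w_in[i][j]) for i in range(n_inputs))
        if col_sum == 0.0:
            continue  # this hidden unit receives no input signal
        for i in range(n_inputs):
            scores[i] += abs(w_in[i][j]) / col_sum * abs(w_out[j])
    total = sum(scores)
    if total == 0.0:
        raise ValueError("all weights are zero; importance is undefined")
    return [s / total for s in scores]


# Toy weights (hypothetical): two inputs, two hidden units.
print(garson_importance([[1.0, 2.0], [3.0, 2.0]], [1.0, 1.0]))
```

Because only absolute weights are used, this measure reports the magnitude of each input's contribution but not its direction.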

Recognizing Power-law Graphs by Machine Learning Algorithms using a Reduced Set of Structural Features

10.5753/eniac.2019.9319 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alane Lima ◽

André Vignatti ◽

Murilo Silva

Keyword(s):

Machine Learning ◽

Power Law ◽

Real World ◽

Optimization Problems ◽

Learning Algorithms ◽

Structural Features ◽

Machine Learning Algorithms ◽

Graph Representation ◽

Gradient Boosting ◽

Graph Properties

The empirical study of large real-world networks over the last 20 years has shown that a variety of real-world graphs are power-law. There is evidence that optimization problems seem easier in these graphs; however, for a given graph, classifying it as power-law is a problem in itself. In this work, we propose using machine learning algorithms (KNN, SVM, Gradient Boosting and Random Forests) for this task. We suggest a graph representation based on [Canning et al. 2018], but using a much simplified set of structural properties of the graph. We compare the proposed representation with the one generated by the graph2vec framework. The experiments attained high accuracy, indicating that a reduced set of structural graph properties is enough for the presented problem.
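To make the idea of a small structural feature vector concrete, the following sketch computes a few degree-based properties from an edge list. The abstract does not list the features actually used, so the feature names and the choice of properties here are hypothetical; the sketch also assumes a simple undirected graph without self-loops and ignores isolated nodes.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def degree_features(edges: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Compute a small, illustrative set of degree-based graph features."""
    degree: Counter = Counter()
    n_edges = 0
    for u, v in edges:
        degree[u] += 1  # each undirected edge contributes to both endpoints
        degree[v] += 1
        n_edges += 1
    degs = list(degree.values())
    if not degs:
        raise ValueError("edge list is empty")
    n_nodes = len(degs)
    mean = sum(degs) / n_nodes
    variance = sum((d - mean) ** 2 for d in degs) / n_nodes
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "max_degree": max(degs),
        "mean_degree": mean,
        "degree_variance": variance,
    }


# A star graph: one hub connected to four leaves.
print(degree_features([(0, 1), (0, 2), (0, 3), (0, 4)]))
```

A vector like this could be passed to any of the classifiers named above; heavy-tailed degree distributions tend to show up as a large maximum degree and variance relative to the mean.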

Determination of Abraham model solute descriptors for monomeric 3,4,5-trimethoxybenzoic acid from experimental solubility data in organic solvents measured at 298.2 K

Physics and Chemistry of Liquids ◽

10.1080/00319104.2017.1346097 ◽

2017 ◽

Vol 56 (3) ◽

pp. 381-390 ◽

Cited By ~ 15

Author(s):

Erin Hart ◽

Alex Klein ◽

Olivia Zha ◽

Anisha Wadawadigi ◽

Ellen Qian ◽

...

Keyword(s):

Organic Solvents ◽

Solubility Data ◽

Abraham Model ◽

Experimental Solubility Data ◽

Experimental Solubility ◽

Solute Descriptors

Combining machine learning algorithms and an incremental capacity analysis on 18650 cell under different cycling temperature and SOC range

E3S Web of Conferences ◽

10.1051/e3sconf/202018203007 ◽

2020 ◽

Vol 182 ◽

pp. 03007

Author(s):

John Lai ◽

David Chao ◽

Alvin Wu ◽

Carl Wang

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Identification Accuracy ◽

Capacity Analysis ◽

Training Set ◽

Discharge Data ◽

Different Temperatures ◽

Incremental Capacity ◽

Cycling Temperature

A novel way of applying machine learning algorithms to incremental capacity analysis (dQ/dV) is developed to identify battery cycling conditions under different temperatures and working SOC ranges. Batteries are cycled under each combination of temperature (-10 °C, 25 °C, 60 °C) and SOC range (0-10%, 25-75%, 90-100%, 0-100%) for up to 60 equivalent cycles. The discharge data are transformed into dQ/dV-V curves, and features of their peaks and valleys are extracted for machine learning. Both supervised and unsupervised machine learning algorithms (PCA and LDA) are applied to classify batteries in terms of temperature or SOC range. The results reveal that batteries cycled under different temperatures can be identified separately regardless of the working SOC range. When the 60 samples are split with a training-set ratio of 0.85, the remaining test set gives an identification accuracy of 89% for temperature and 67% for working SOC range.
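The two numerical steps mentioned in this abstract can be sketched as follows: a finite-difference estimate of the dQ/dV curve from discharge samples, and the train/test sizes implied by a 0.85 split of 60 samples. This is a minimal illustration, not the authors' pipeline; the function names, the midpoint-difference scheme, and the toy data are assumptions.

```python
from typing import List, Tuple


def incremental_capacity(
    voltage: List[float], capacity: List[float]
) -> List[Tuple[float, float]]:
    """Return (midpoint voltage, dQ/dV) pairs from consecutive samples."""
    if len(voltage) != len(capacity):
        raise ValueError("voltage and capacity must have the same length")
    curve = []
    for k in range(1, len(voltage)):
        dv = voltage[k] - voltage[k - 1]
        if dv == 0.0:
            continue  # skip flat voltage steps to avoid division by zero
        dq = capacity[k] - capacity[k - 1]
        curve.append(((voltage[k] + voltage[k - 1]) / 2.0, dq / dv))
    return curve


def split_sizes(n_samples: int, train_ratio: float) -> Tuple[int, int]:
    """Return (train size, test size) for a given training-set ratio."""
    n_train = round(n_samples * train_ratio)
    return n_train, n_samples - n_train


# Toy discharge samples (hypothetical values, in V and Ah).
print(incremental_capacity([3.0, 3.5, 4.0], [0.0, 1.0, 3.0]))
print(split_sizes(60, 0.85))
```

With 60 samples and a ratio of 0.85, this gives 51 training and 9 test samples. The abstract does not state the test-set size, but 8 of 9 correct (88.9%) and 6 of 9 correct (66.7%) would be consistent with the reported 89% and 67%. Real discharge data are noisy, so a smoothing step before differencing is usually needed; none is shown here.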

ECNet is an evolutionary context-integrated deep learning framework for protein engineering

Nature Communications ◽

10.1038/s41467-021-25976-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Yunan Luo ◽

Guangde Jiang ◽

Tianhao Yu ◽

Yang Liu ◽

Lam Vo ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Protein Engineering ◽

Learning Algorithm ◽

Learning Algorithms ◽

Structural Features ◽

Machine Learning Algorithms ◽

Success Rates ◽

General Sequence ◽

Evolutionary Context

Abstract: Machine learning has been increasingly used for protein engineering. However, because the general sequence contexts that existing machine learning algorithms capture are not specific to the protein being engineered, their accuracy is rather limited. Here, we report ECNet (evolutionary context-integrated neural network), a deep-learning algorithm that exploits evolutionary contexts to predict functional fitness for protein engineering. This algorithm integrates local evolutionary context from homologous sequences that explicitly model residue-residue epistasis for the protein of interest with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. As such, it enables accurate mapping from sequence to function and provides generalization from low-order mutants to higher-order mutants. Using ~50 deep mutational scanning and random mutagenesis datasets, we show that ECNet predicts the sequence-function relationship more accurately than existing machine learning algorithms. Moreover, we used ECNet to guide the engineering of TEM-1 β-lactamase and identified variants with improved ampicillin resistance with high success rates.

Machine learning with physicochemical relationships: solubility prediction in organic solvents and water

Nature Communications ◽

10.1038/s41467-020-19594-z ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Samuel Boobier ◽

David R. J. Hose ◽

A. John Blacker ◽

Bao N. Nguyen

Keyword(s):

Machine Learning ◽

Organic Solvents ◽

Process Design ◽

Training Data ◽

Machine Learning Method ◽

Solubility Prediction ◽

Chemical Process Design ◽

Applied Machine Learning ◽

Small Set ◽

Successful Approach

Abstract: Solubility prediction remains a critical challenge in drug development, synthetic route and chemical process design, extraction and crystallisation. Here we report a successful approach to solubility prediction in organic solvents and water using a combination of machine learning (ANN, SVM, RF, ExtraTrees, Bagging and GP) and computational chemistry. Rational interpretation of the dissolution process as a numerical problem led to a small set of selected descriptors and to subsequent predictions that are independent of the applied machine learning method. These models gave significantly more accurate predictions than benchmarked open-access and commercial tools, achieving accuracy close to the expected level of noise in the training data (LogS ± 0.7). Finally, they reproduced the physicochemical relationship between solubility and molecular properties in different solvents, which led to rational approaches for improving the accuracy of each model.
