B3LYP-SVM METHOD FOR THE ESTIMATION OF MOLECULAR ENTHALPIES OF FORMATION

2007 ◽  
Vol 06 (03) ◽  
pp. 495-509 ◽  
Author(s):  
GUI-KAI YAN ◽  
JUN-JIE LI ◽  
BING-RUI LI ◽  
JIA HU ◽  
WEN-PING GUO

A support vector machine (SVM) is used to predict the enthalpies of formation at 298 K for 261 molecules based on B3LYP/6-311G(3df,2p) results. With the data randomly separated into two parts, 195 molecules for the training set and 66 for the test set, the resulting mean absolute deviation (MAD) and maximum deviation (MD) for the training set are 1.51 kcal/mol and 9.23 kcal/mol (correlation coefficient R = 0.9995), and for the test set they are 1.78 kcal/mol and 7.31 kcal/mol (R = 0.9990). The result is improved according to the G2 method.
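The three statistics quoted in this abstract (MAD, MD, and R) can be illustrated with a minimal sketch. The function names and the four sample values below are hypothetical and are not taken from the paper; the sketch only shows how such figures are commonly computed from predicted and reference enthalpies.

```python
import math

def mean_absolute_deviation(predicted, reference):
    """Average of |predicted - reference| over all molecules."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

def maximum_deviation(predicted, reference):
    """Largest single |predicted - reference|."""
    return max(abs(p - r) for p, r in zip(predicted, reference))

def correlation_coefficient(predicted, reference):
    """Pearson correlation coefficient R between the two series."""
    n = len(reference)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov / math.sqrt(var_p * var_r)

# Hypothetical enthalpies in kcal/mol (illustrative only, not data from the paper).
reference = [-17.9, 12.5, -57.8, 54.2]
predicted = [-16.9, 13.0, -59.8, 54.7]

mad = mean_absolute_deviation(predicted, reference)
md = maximum_deviation(predicted, reference)
r = correlation_coefficient(predicted, reference)
print(f"MAD = {mad:.2f} kcal/mol, MD = {md:.2f} kcal/mol, R = {r:.4f}")
```

Computing the same three numbers separately for the training and test subsets gives the kind of side-by-side comparison reported above.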

2021 ◽  
pp. 004051752199236
Author(s):  
Xinliang Yu ◽  
Hanlu Wang

Investigations of quantitative relationships between the structure and color fastness of dyes are crucial in seeking novel dyes. For the first time, this work reports a classification model based on a quantitative structure–property relationship to predict the color fastness to ironing of vat dyes. Using binary classification analysis based on a support vector machine (SVM) and a genetic algorithm, 56 vat dyes in the training set together with seven molecular descriptors were used to develop the classification model, which was validated with 59 vat dyes in the test set. The optimal SVM model (C = 208.465 and γ = 5.9692) achieves an overall accuracy of 91.1% for the training set and 83.1% for the test set, which is more accurate than the binary logistic regression model (87.5% and 81.4%, respectively). Furthermore, the mechanism by which the molecular descriptors correlate with color fastness to ironing of vat dyes is discussed.


2019 ◽  
Author(s):  
Golokesh Santra ◽  
Nitai Sylvetsky ◽  
Gershom Martin

We present a family of minimally empirical double-hybrid DFT functionals parametrized against the very large and diverse GMTKN55 benchmark. The very recently proposed wB97M(2) empirical double hybrid (with 16 empirical parameters) has the lowest WTMAD2 (weighted mean absolute deviation over GMTKN55) ever reported at 2.19 kcal/mol. However, our xrevDSD-PBEP86-D4 functional reaches a statistically equivalent WTMAD2=2.22 kcal/mol, using just a handful of empirical parameters, and the xrevDOD-PBEP86-D4 functional reaches 2.25 kcal/mol with just opposite-spin MP2 correlation, making it amenable to reduced-scaling algorithms. In general, the D4 empirical dispersion correction is clearly superior to D3BJ. If one eschews dispersion corrections of any kind, noDispSD-SCAN offers a viable alternative. Parametrization over the entire GMTKN55 dataset yields substantial improvement over the small training set previously employed in the DSD papers.
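For readers unfamiliar with the WTMAD2 metric, the definition commonly given for GMTKN55 can be sketched as follows. The abstract itself only expands the acronym, so the symbols here are assumptions: $N_i$ is the number of systems in subset $i$, $\overline{|\Delta E|}_i$ is the mean absolute reference energy of that subset, and $\mathrm{MAD}_i$ is its mean absolute deviation. The original GMTKN55 publication should be consulted for the authoritative form.

```latex
\mathrm{WTMAD2}
  = \frac{1}{\sum_{i=1}^{55} N_i}
    \sum_{i=1}^{55} N_i \,
    \frac{56.84\ \mathrm{kcal/mol}}{\overline{|\Delta E|}_i}\,
    \mathrm{MAD}_i
```

The constant 56.84 kcal/mol is the average of the mean absolute reference energies over all 55 subsets, so subsets with small energy scales are weighted up and those with large energy scales are weighted down.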


2018 ◽  
Vol 19 (11) ◽  
pp. 3423 ◽  
Author(s):  
Ting Wang ◽  
Lili Tang ◽  
Feng Luan ◽  
M. Natália D. S. Cordeiro

Organic compounds are often exposed to the environment, and they adversely affect the environment and human health as mixtures rather than as single chemicals. In this paper, we establish reliable classical quantitative structure–activity relationship (QSAR) models to evaluate the toxicity of 99 binary mixtures. The QSAR models were built by forward stepwise multiple linear regression (MLR) and by nonlinear radial basis function neural networks (RBFNNs), using hypothetical descriptors. The statistical parameters of the MLR model were N (number of compounds in the training set) = 79, R² (the squared correlation coefficient between the predicted and observed activities) = 0.869, LOO q² (leave-one-out correlation coefficient) = 0.864, F (Fisher’s test) = 165.494, and RMS (root mean square) = 0.599 for the training set, and Next (number of compounds in the external test set) = 20, R² = 0.853, q²ext (leave-one-out correlation coefficient for the test set) = 0.825, F = 30.861, and RMS = 0.691 for the external test set. The RBFNN model gave N = 79, R² = 0.925, LOO q² = 0.924, F = 950.686, and RMS = 0.447 for the training set, and Next = 20, R² = 0.896, q²ext = 0.890, F = 155.424, and RMS = 0.547 for the external test set. Both the MLR and RBFNN models were evaluated with several statistical parameters and methods. The results confirm that the models are acceptable and can be used to predict the toxicity of the binary mixtures.
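The difference between the fitted R² and the leave-one-out q² can be illustrated with a minimal sketch. It assumes a one-descriptor linear model and made-up data; the function names are hypothetical, and the paper's models use several descriptors and may define q² slightly differently (for example, with a per-fold training mean).

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """R^2 = 1 - SS_res / SS_tot, with the model fitted on all points."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def loo_q_squared(xs, ys):
    """q^2 = 1 - PRESS / SS_tot; each point is predicted by a model fitted without it."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot

# Hypothetical descriptor values and activities (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]

r2 = r_squared(xs, ys)
q2 = loo_q_squared(xs, ys)
print(f"R2 = {r2:.3f}, LOO q2 = {q2:.3f}")
```

For least-squares models q² never exceeds R², so a small gap between the two, as in the values reported above, is usually read as a sign that the fit does not depend heavily on any single compound.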


2020 ◽  
pp. 410-423
Author(s):  
Prabhu RV Shankar ◽  
Anupama Kesari ◽  
Priya Shalini ◽  
N. Kamalashree ◽  
Charan Bharadwaj ◽  
...  

As part of a data mining competition, a training set and a test set of laboratory test data about patients with and without surgical site infection (SSI) were provided. The task was to develop predictive models with the training set and identify patients with SSI in the unlabeled test set. Lab test results are vital resources that help healthcare providers make decisions about all aspects of surgical patient management. Many machine learning models were developed after pre-processing and imputing the lab test data, and only the top-performing methods are discussed. Overall, random forest (RF) algorithms performed better than support vector machine and logistic regression. Using a set of 74 lab tests, RF produced only 4 false positives in the training set and predicted 35 out of 50 SSI patients in the test set (accuracy 0.86, sensitivity 0.68, and specificity 0.91). Optimal ways to address healthcare data quality concerns and imputation methods, as well as newer generalizable algorithms, need to be explored further to decipher new associations and knowledge among laboratory biomarkers and SSI.


2018 ◽  
Vol 57 (05/06) ◽  
pp. 253-260 ◽  
Author(s):  
J. Patel ◽  
Z. Siddiqui ◽  
A. Krishnan ◽  
T. Thyvalikakath

Background: Smoking is an established risk factor for oral diseases and, therefore, dental clinicians routinely assess and record their patients' detailed smoking status. Researchers have successfully extracted smoking history from electronic health records (EHRs) using text mining methods. However, they could not retrieve patients' smoking intensity due to its limited availability in the EHR. The presence of detailed smoking information in the electronic dental record (EDR), often under a separate section, allows this information to be retrieved with less preprocessing. Objective: To determine patients' detailed smoking status based on smoking intensity from the EDR. Methods: First, the authors created a reference standard of 3,296 unique patients' smoking histories from the EDR that classified patients based on their smoking intensity. Next, they trained three machine learning classifiers (support vector machine, random forest, and naïve Bayes) using the training set (2,176) and evaluated performance on the test set (1,120) using precision (P), recall (R), and F-measure (F). Finally, they applied the best classifier to classify smoking status from an additional 3,114 patients' smoking histories. Results: Support vector machine performed best at classifying patients into smokers, nonsmokers, and unknowns (P, R, F: 98%); intermittent smoker (P: 95%, R: 98%, F: 96%); past smoker (P, R, F: 89%); light smoker (P, R, F: 87%); smokers with unknown intensity (P: 76%, R: 86%, F: 81%); and intermediate smoker (P: 90%, R: 88%, F: 89%). It performed moderately at differentiating heavy smokers (P: 90%, R: 44%, F: 60%). EDR could be a valuable source for obtaining patients' detailed smoking information. Conclusion: EDR data could serve as a valuable source for obtaining patients' detailed smoking information based on their smoking intensity that may not be readily available in the EHR.
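The F-measure reported alongside precision and recall is conventionally their harmonic mean. The sketch below assumes that balanced F1 form (the abstract does not spell out the formula) and applies it to the rounded P and R values quoted above; the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as quoted in the abstract, written as fractions.
reported = {
    "intermittent smoker": (0.95, 0.98),
    "unknown intensity": (0.76, 0.86),
    "intermediate smoker": (0.90, 0.88),
    "heavy smoker": (0.90, 0.44),
}

for label, (p, r) in reported.items():
    print(f"{label}: F = {f_measure(p, r):.2f}")
```

For the heavy-smoker class the rounded inputs give about 0.59 rather than the reported 60%, a gap that is presumably due to rounding of the published precision and recall. The low recall dominates the score, which is why this class stands out despite its high precision.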


SPE Journal ◽  
2018 ◽  
Vol 23 (04) ◽  
pp. 1075-1089 ◽  
Author(s):  
Jared Schuetter ◽  
Srikanta Mishra ◽  
Ming Zhong ◽  
Randy LaFollette (ret.)

Summary Considerable amounts of data are being generated during the development and operation of unconventional reservoirs. Statistical methods that can provide data-driven insights into production performance are gaining in popularity. Unfortunately, the application of advanced statistical algorithms remains somewhat of a mystery to petroleum engineers and geoscientists. The objective of this paper is to provide some clarity to this issue, focusing on how to build robust predictive models and how to develop decision rules that help identify factors separating good wells from poor performers. The data for this study come from wells completed in the Wolfcamp Shale Formation in the Permian Basin. Data categories used in the study included well location and assorted metrics capturing various aspects of well architecture, well completion, stimulation, and production. Predictive models for the production metric of interest are built using simple regression and other advanced methods such as random forests (RFs), support-vector regression (SVR), gradient-boosting machine (GBM), and multidimensional Kriging. The data-fitting process involves splitting the data into a training set and a test set, building a regression model on the training set and validating it with the test set. Repeated application of a “cross-validation” procedure yields valuable information regarding the robustness of each regression-modeling approach. Furthermore, decision rules that can identify extreme behavior in production wells (i.e., top x% of the wells vs. bottom x%, as ranked by the production metric) are generated using the classification and regression-tree algorithm. The resulting decision tree (DT) provides useful insights regarding what variables (or combinations of variables) can drive production performance into such extreme categories. 
The main contributions of this paper are to provide guidelines on how to build robust predictive models, and to demonstrate the utility of DTs for identifying factors responsible for good vs. poor wells.
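The train/test splitting and repeated cross-validation described in this summary can be sketched in a few lines. This is a minimal illustration and not the authors' workflow: the function names, the synthetic data, and the mean-value baseline model are all assumptions, and any regression method with a fit/predict pair could be substituted.

```python
import random

def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def repeated_cv_rmse(xs, ys, fit, predict, k=5, repeats=10, seed=0):
    """Return one held-out RMSE per fold per repeat."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for fold in k_fold_indices(len(xs), k, rng):
            held_out = set(fold)
            train = [i for i in range(len(xs)) if i not in held_out]
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            sq_err = [(ys[i] - predict(model, xs[i])) ** 2 for i in fold]
            scores.append((sum(sq_err) / len(sq_err)) ** 0.5)
    return scores

# Hypothetical baseline "model": always predict the training-set mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

# Synthetic data standing in for a well-level predictor and production metric.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]

scores = repeated_cv_rmse(xs, ys, fit_mean, predict_mean, k=5, repeats=3)
print(f"{len(scores)} held-out RMSE values, mean = {sum(scores) / len(scores):.2f}")
```

The spread of the held-out scores across repeats, not just their mean, is what indicates how robust a given modeling approach is to the choice of split.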


2020 ◽  
Vol 143 (2) ◽  
Author(s):  
Mawloud Guermoui ◽  
Kacem Gairaa ◽  
John Boland ◽  
Toufik Arrif

Abstract: This article proposes a new hybrid of the least squares-support vector machine and the artificial bee colony algorithm (ABC-LS-SVM) for multi-hour ahead forecasting of global solar radiation (GHI) data. The framework trains the least squares-support vector machine (LS-SVM) model by means of the ABC algorithm using the measured data. ABC is used to optimize the free parameters of the LS-SVM model in a search space so as to boost the forecasting performance. The developed ABC-LS-SVM approach is verified on an hourly scale on a database of five years of measurements. The measured data were collected from 2013 to 2017 at the Applied Research Unit for Renewable Energy (URAER) in Ghardaia, south of Algeria. Several combinations of input data have been tested to model the desired output. Forecasting GHI 12 h ahead with the ABC-LS-SVM model gave a root-mean-square error (RMSE) of 116.22 Wh/m² and a correlation coefficient r = 94.3%. With the classical LS-SVM, the RMSE is 117.73 Wh/m² and the correlation coefficient r = 92.42%; for the cuckoo search algorithm combined with LS-SVM, RMSE = 116.89 Wh/m² and r = 93.78%. The results reveal that the proposed hybridization scheme is more accurate than cuckoo search-LS-SVM and the stand-alone LS-SVM.
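To make the "free parameters" concrete, the standard LS-SVM regression formulation is sketched below. The abstract does not state the kernel, so the RBF kernel and the symbols $\gamma$ (regularization) and $\sigma$ (kernel width) are assumptions; other conventions for the kernel exponent exist.

```latex
\begin{bmatrix}
  0 & \mathbf{1}^{\top} \\
  \mathbf{1} & K + \gamma^{-1} I
\end{bmatrix}
\begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{y} \end{bmatrix},
\qquad
K_{ij} = k(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right),
\qquad
\hat{y}(x) = \sum_{i=1}^{n} \alpha_i \, k(x, x_i) + b
```

Under these assumptions, training reduces to solving one linear system for $(b, \boldsymbol{\alpha})$ once $\gamma$ and $\sigma$ are fixed, so a metaheuristic such as ABC or cuckoo search only has to search the low-dimensional $(\gamma, \sigma)$ space, scoring each candidate by its forecasting error.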


2013 ◽  
Vol 859 ◽  
pp. 315-321 ◽  
Author(s):  
Jing Cao ◽  
Chang Ning Sun ◽  
Hai Ming Liu

The correlation of failure modes needs to be considered in the reliability analysis of a foundation excavation system. Because the correlation coefficient of failure modes is difficult to calculate, the computational efficiency of the traditional method is low. In this paper, the response surface (RS) is established by using the uniform test and a support vector machine (SVM). On this basis, the response surface is evaluated on random parameters generated by Monte Carlo simulation to obtain the index of each failure mode. Combined with Pearson correlation analysis, the correlation coefficient of failure modes is obtained. Then the Breadth Border Method, Narrow Bounds Method, and PNET method are used to calculate the system failure probability of foundation excavations. A reliability analysis method for the foundation excavation system based on the response surface of the support vector machine (RSSVM) is put forward. The case analysis shows that the method is simple to calculate and provides a convenient approach to the system reliability theory of foundation excavations.
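The workflow described here (sample random parameters, evaluate each failure mode, correlate the modes, bound the system failure probability) can be sketched as follows. Everything specific in the sketch is an assumption: the two linear performance functions stand in for the SVM response surface, the parameter distributions are invented, and the bounds shown are the simple first-order bounds for a series system, not the Narrow Bounds or PNET methods used in the paper.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def simulate(n_samples=20000, seed=0):
    rng = random.Random(seed)
    g1, g2 = [], []
    for _ in range(n_samples):
        # Hypothetical random parameters (not values from the paper).
        resistance = rng.gauss(20.0, 4.0)
        load = rng.gauss(10.0, 3.0)
        # Hypothetical stand-ins for the response surface; a mode fails when g < 0.
        g1.append(resistance - load - 2.0)
        g2.append(1.5 * resistance - 2.0 * load - 1.0)
    p1 = sum(v < 0 for v in g1) / n_samples
    p2 = sum(v < 0 for v in g2) / n_samples
    # Direct Monte Carlo estimate: the series system fails if any mode fails.
    p_sys = sum(min(a, b) < 0 for a, b in zip(g1, g2)) / n_samples
    rho = pearson(g1, g2)
    # Simple bounds for positively correlated modes in a series system.
    lower = max(p1, p2)
    upper = 1.0 - (1.0 - p1) * (1.0 - p2)
    return p1, p2, p_sys, rho, lower, upper

p1, p2, p_sys, rho, lower, upper = simulate()
print(f"p1 = {p1:.4f}, p2 = {p2:.4f}, rho = {rho:.3f}")
print(f"{lower:.4f} <= P_system (MC estimate {p_sys:.4f}) <= {upper:.4f}")
```

When the modes are strongly correlated, as in this toy case, the system failure probability sits near the lower bound, which is why methods that use the correlation coefficient can give much tighter estimates than the simple bounds.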

