Can machine learning consistently improve the scoring power of classical scoring functions? Insights into the role of machine learning in scoring functions

Author(s):  
Chao Shen ◽  
Ye Hu ◽  
Zhe Wang ◽  
Xujun Zhang ◽  
Haiyang Zhong ◽  
...  

Abstract How to accurately estimate protein–ligand binding affinity remains a key challenge in computer-aided drug design (CADD). In many cases, the binding affinities predicted by classical scoring functions (SFs) have been shown not to correlate well with experimentally measured biological activities. In the past few years, machine learning (ML)-based SFs have gradually emerged as potential alternatives and have outperformed classical SFs in a series of studies. In this study, to better recognize the potential of classical SFs, we conducted a comparative assessment of 25 commonly used SFs. Their scoring power was systematically evaluated after refitting the individual energy terms with state-of-the-art ML methods in place of the original multiple linear regression. The results show that the newly developed ML-based SFs consistently performed better than the classical ones. In particular, gradient boosting decision tree (GBDT) and random forest (RF) achieved the best predictions in most cases. The newly developed ML-based SFs were also tested on another benchmark modified from PDBbind v2007, and the impacts of structural and sequence similarities were evaluated. The results indicated that the superiority of the ML-based SFs could be fully guaranteed when a sufficient number of similar targets was included in the training set. Moreover, the effect of combining features from multiple SFs was explored, and the results indicated that combining NNscore2.0 with one to four other classical SFs yielded the best scoring power. However, it was not feasible to derive a generic target-specific SF or SF combination.
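As an illustration of the refitting protocol described above (tree-based learners in place of multiple linear regression on a scoring function's individual energy terms), here is a minimal Python sketch. The feature matrix is a random placeholder for per-complex energy terms, not PDBbind data, and the script is not the authors' code.

```python
# Sketch of the refitting idea: individual energy terms of a classical scoring
# function (columns of X) are regressed against experimental binding affinity
# (y) with tree-based learners instead of multiple linear regression.
# The data below are random placeholders, not PDBbind energy terms.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))          # e.g. vdW, H-bond, desolvation terms
y = X @ rng.normal(size=8) + 0.5 * np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

models = {
    "MLR":  LinearRegression(),
    "RF":   RandomForestRegressor(n_estimators=500, random_state=0),
    "GBDT": GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, random_state=0),
}
for name, model in models.items():
    # Scoring power is approximated here by cross-validated R^2.
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```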

2021 ◽  
pp. 187-198
Author(s):  
Shima Zahmatkesh ◽  
Alessio Bernardo ◽  
Emanuele Falzone ◽  
Edgardo Di Nicola Carena ◽  
Emanuele Della Valle

Industries that sell products with short-term or seasonal life cycles must regularly introduce new products. Forecasting the demand for a New Product Introduction (NPI) can be challenging due to fluctuations in many factors, such as trend, seasonality, or other external and unpredictable phenomena (e.g., the COVID-19 pandemic). Traditionally, NPI is an expert-centric process. This paper presents a study on automating the forecast of NPI demand using statistical Machine Learning (namely, Gradient Boosting and XGBoost). We show how to overcome the shortcomings of the traditional data preparation that underpins the manual process. Moreover, we illustrate the role of cross-validation techniques in hyper-parameter tuning and model validation. Finally, we provide empirical evidence that statistical Machine Learning can forecast NPI demand better than experts.
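The following is a hedged Python sketch of the cross-validated hyper-parameter tuning described above, using XGBoost on synthetic demand features; the features, parameter grid, and time-ordered split are illustrative assumptions, not the paper's setup.

```python
# Illustrative hyper-parameter tuning with cross-validation for an NPI-style
# demand forecast. Features and targets are synthetic placeholders, not the
# paper's sales data.
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))                  # e.g. price, seasonality dummies
y = 50 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=3, size=300)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
# TimeSeriesSplit respects temporal ordering, which matters for demand data.
search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV MAE:", -search.best_score_)
```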


Information ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 332
Author(s):  
Ernest Kwame Ampomah ◽  
Zhiguang Qin ◽  
Gabriel Nyame

Forecasting the direction and trend of stock prices is an important task that helps investors make prudent financial decisions in the stock market. Investment in the stock market carries a large risk, and minimizing prediction error reduces the investment risk. Machine learning (ML) models typically perform better than statistical and econometric models, and ensemble ML models have been shown in the literature to produce superior performance to single ML models. In this work, we compare the effectiveness of tree-based ensemble ML models (Random Forest (RF), XGBoost Classifier (XG), Bagging Classifier (BC), AdaBoost Classifier (Ada), Extra Trees Classifier (ET), and Voting Classifier (VC)) in forecasting the direction of stock price movement. Eight different stock datasets from three stock exchanges (NYSE, NASDAQ, and NSE) were randomly collected and used for the study. Each dataset is split into a training set and a test set. Ten-fold cross-validation accuracy is used to evaluate the ML models on the training set. In addition, the ML models are evaluated on the test set using accuracy, precision, recall, F1-score, specificity, and area under the receiver operating characteristic curve (AUC-ROC). Kendall's W test of concordance is used to rank the performance of the tree-based ML algorithms. For the training set, the AdaBoost model performed better than the rest of the models. For the test set, the accuracy, precision, F1-score, and AUC metrics produced results significant enough to rank the models, and the Extra Trees classifier outperformed the other models in all of these rankings.
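A minimal Python sketch of the comparison protocol follows: ten-fold cross-validation accuracy on a training split, then hold-out metrics for several of the tree-based ensembles named above. The data are random stand-ins for the engineered stock features used in the study, and the Voting Classifier is omitted for brevity.

```python
# Ten-fold CV accuracy on the training set, then test accuracy and AUC for
# several tree-based ensemble classifiers on synthetic features.
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier, ExtraTreesClassifier)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "RF":  RandomForestClassifier(random_state=0),
    "XG":  XGBClassifier(eval_metric="logloss"),
    "BC":  BaggingClassifier(random_state=0),
    "Ada": AdaBoostClassifier(random_state=0),
    "ET":  ExtraTreesClassifier(random_state=0),
}
for name, model in models.items():
    cv_acc = cross_val_score(model, X_tr, y_tr, cv=10, scoring="accuracy").mean()
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    print(f"{name}: 10-fold CV acc={cv_acc:.3f}, "
          f"test acc={accuracy_score(y_te, model.predict(X_te)):.3f}, "
          f"AUC={roc_auc_score(y_te, proba):.3f}")
```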


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yeonhee Lee ◽  
Jiwon Ryu ◽  
Min Woo Kang ◽  
Kyung Ha Seo ◽  
Jayoun Kim ◽  
...  

Abstract The precise prediction of acute kidney injury (AKI) after nephrectomy for renal cell carcinoma (RCC) is an important issue because of its relationship with subsequent kidney dysfunction and high mortality. Herein we addressed whether machine learning (ML) algorithms could predict postoperative AKI risk better than conventional logistic regression (LR) models. A total of 4104 RCC patients who had undergone unilateral nephrectomy from January 2003 to December 2017 were reviewed. ML models such as support vector machine, random forest, extreme gradient boosting, and light gradient boosting machine (LightGBM) were developed, and their performance based on the area under the receiver operating characteristic curve, accuracy, and F1 score was compared with that of the LR-based scoring model. Postoperative AKI developed in 1167 patients (28.4%). All the ML models had higher performance index values than the LR-based scoring model. Among them, the LightGBM model had the highest value of 0.810 (0.783–0.837). The decision curve analysis demonstrated a greater net benefit of the ML models than the LR-based scoring model over all the ranges of threshold probabilities. The application of ML algorithms improves the predictability of AKI after nephrectomy for RCC, and these models perform better than conventional LR-based models.
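Below is a minimal sketch, assuming generic tabular perioperative features, of the LightGBM-versus-logistic-regression comparison by ROC AUC reported above; the synthetic data and feature names are illustrative, not the nephrectomy cohort.

```python
# LightGBM vs. a logistic-regression baseline on synthetic tabular data,
# compared by ROC AUC and F1, mirroring the reported comparison.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(3)
X = rng.normal(size=(4000, 12))        # e.g. age, eGFR, operative time
y = (0.8 * X[:, 0] - 0.6 * X[:, 1] + rng.normal(scale=1.0, size=4000) > 0.8).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, model in {"LR": LogisticRegression(max_iter=1000),
                    "LightGBM": LGBMClassifier()}.items():
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_te, p):.3f}, "
          f"F1={f1_score(y_te, p > 0.5):.3f}")
```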


2021 ◽  
Author(s):  
Yuki Kataoka ◽  
Yuya Kimura ◽  
Tatsuyoshi Ikenoue ◽  
Yoshinori Matsuoka ◽  
Junji Kumasawa ◽  
...  

Abstract Background We developed and validated a machine learning diagnostic model for novel coronavirus disease 2019 (COVID-19), integrating artificial-intelligence-based computed tomography (CT) imaging and clinical features. Methods We conducted a retrospective cohort study in 11 Japanese tertiary care facilities that treated COVID-19 patients. Participants were tested using both real-time reverse transcription polymerase chain reaction (RT-PCR) and chest CT between January 1 and May 30, 2020. We chronologically split the dataset in each hospital into training and test sets containing patients in a 7:3 ratio. A Light Gradient Boosting Machine (LightGBM) model was used for the analysis. Results A total of 703 patients were included, and two models, the full model and the A-blood model, were developed for their diagnosis. The A-blood model included eight variables (the Ali-M3 confidence, along with seven clinical features from blood counts and biochemistry markers). The areas under the receiver operating characteristic curve of both models (0.91, 95% confidence interval (CI) 0.86 to 0.95 for the full model and 0.90, 95% CI 0.86 to 0.94 for the A-blood model) were better than that of the Ali-M3 confidence alone (0.78, 95% CI 0.71 to 0.83) in the test set. Conclusions The A-blood model, a COVID-19 diagnostic model developed in this study, combines machine-learning-based CT evaluation with blood test data and performs better than the existing Ali-M3 framework. This could significantly aid physicians in making a quicker diagnosis of COVID-19.
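The chronological 7:3 split per facility described above can be sketched as follows; the column names (hospital, admission order, Ali-M3 confidence, blood markers) are hypothetical placeholders and the LightGBM fit uses synthetic data, not the study cohort.

```python
# Chronological 7:3 split per hospital, then LightGBM on the pooled training
# portion. All columns below are illustrative stand-ins.
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "hospital": rng.integers(0, 11, size=703),
    "admission_order": np.arange(703),
    "ali_m3_confidence": rng.uniform(0, 1, size=703),
    "lymphocyte_count": rng.normal(1.5, 0.5, size=703),
    "crp": rng.exponential(2.0, size=703),
})
df["covid_pcr_positive"] = (df["ali_m3_confidence"] + rng.normal(0, 0.3, 703) > 0.7).astype(int)

def split_hospital(g):
    # The earliest 70% of patients in each hospital go to the training set.
    cut = int(len(g) * 0.7)
    return g.iloc[:cut], g.iloc[cut:]

parts = [split_hospital(g.sort_values("admission_order"))
         for _, g in df.groupby("hospital")]
train = pd.concat(p[0] for p in parts)
test = pd.concat(p[1] for p in parts)

features = ["ali_m3_confidence", "lymphocyte_count", "crp"]
model = LGBMClassifier().fit(train[features], train["covid_pcr_positive"])
auc = roc_auc_score(test["covid_pcr_positive"], model.predict_proba(test[features])[:, 1])
print(f"test AUC = {auc:.3f}")
```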


2021 ◽  
Vol 11 ◽  
Author(s):  
Yinghao Meng ◽  
Hao Zhang ◽  
Qi Li ◽  
Fang Liu ◽  
Xu Fang ◽  
...  

Purpose To develop and validate a machine learning classifier based on multidetector computed tomography (MDCT) for the preoperative prediction of tumor–stroma ratio (TSR) expression in patients with pancreatic ductal adenocarcinoma (PDAC). Materials and Methods In this retrospective study, 227 patients with PDAC underwent an MDCT scan and surgical resection. We quantified the TSR by using hematoxylin and eosin staining and extracted 1409 radiomics features from the arterial and portal venous phases for each patient. We used the least absolute shrinkage and selection operator (LASSO) logistic regression algorithm to reduce the features. An extreme gradient boosting (XGBoost) classifier was developed using a training set consisting of 167 consecutive patients admitted between December 2016 and December 2017. The model was validated in 60 consecutive patients admitted between January 2018 and April 2018. We assessed the XGBoost classifier's performance based on its discriminative ability, calibration, and clinical utility. Results We observed low and high TSR in 91 (40.09%) and 136 (59.91%) patients, respectively. A log-rank test revealed significantly longer survival for patients in the TSR-low group than for those in the TSR-high group. The prediction model showed good discrimination in the training set (area under the curve [AUC] = 0.93) and moderate discrimination in the validation set (AUC = 0.63). The sensitivity, specificity, accuracy, positive predictive value, and negative predictive value for the training set were 94.06%, 81.82%, 0.89, 0.89, and 0.90, respectively; those for the validation set were 85.71%, 48.00%, 0.70, 0.70, and 0.71, respectively. Conclusions The CT radiomics-based XGBoost classifier provides a potentially valuable noninvasive tool to predict TSR in patients with PDAC and to optimize risk stratification.
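A hedged sketch of the two-step pipeline described above follows: an L1-penalised (LASSO-style) logistic regression shrinks the large radiomics feature set, and an XGBoost classifier is then trained on the retained features. The data are random surrogates for the 1409 radiomics features, and the regularisation strength is an assumption.

```python
# LASSO-style feature reduction followed by an XGBoost classifier on the
# retained features, evaluated by validation AUC. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(227, 1409))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=227) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=60, random_state=0)

# L1-penalised logistic regression plays the role of the LASSO selector;
# features are standardised before the penalised fit.
scaler = StandardScaler().fit(X_tr)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
lasso.fit(scaler.transform(X_tr), y_tr)
selected = np.flatnonzero(lasso.coef_.ravel())
print(f"{selected.size} features retained")

# Tree-based classifiers do not need scaling, so raw columns are reused here.
clf = XGBClassifier(eval_metric="logloss").fit(X_tr[:, selected], y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, selected])[:, 1])
print(f"validation AUC = {auc:.3f}")
```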


2019 ◽  
Vol 9 (1) ◽  
pp. 10
Author(s):  
Maéli M. F. Civa ◽  
Dirceu G. de Souza ◽  
Renata G. Silva ◽  
Dayany da S. A. Maciel ◽  
Ricardo L. Tranquilin ◽  
...  

The coordination of metal ions with flavonoids is applied to improve their pharmacological properties. To evaluate the role of the metal ions on diosmin, new complexes with Fe(II), Cu(II), and Co(II) were synthesized and characterized by UV, FT-IR, and XRD techniques, and their surface morphology was examined by SEM. The in vitro biological activity of the coordination complexes was analyzed, namely the antioxidant (ABTS), antibacterial (disc diffusion and MIC), and antitumoral (MTT) activities. When reacting with Fe(II) at 50 °C, diosmin loses the sugar molecule and becomes diosmetin (D), coordinated at a 1D:1Fe ratio. In the presence of Cu(II) and Co(II) under the same conditions, diosmin loses the sugar as well as the methyl group at C4' and the H at C3', producing a new ligand and complexes at a 1D:2Cu or 1D:2Co ratio, yielding DCu and DCo, respectively. The coordination of Cu and Fe improves the antioxidant activity of diosmin. DCo was the only complex that presented antibacterial activity. Additionally, a specific antitumor effect of diosmin and the metal complexes on human leukemia cells was demonstrated, suggesting an immune regulatory action. The anti-melanoma activity of DCo is 10 times higher than that of diosmin. Metal coordination could thus be used to improve drug activity and points toward new possibilities for the clinical use of diosmin.


2021 ◽  
Author(s):  
Ada Y. Chen ◽  
Juyong Lee ◽  
Ana Damjanovic ◽  
Bernard R. Brooks

We present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA. The overall RMSE for this model is 0.69, with surface and buried RMSE values of 0.56 and 0.78, respectively, considering six residue types (Asp, Glu, His, Lys, Cys, and Tyr), and 0.63 when considering Asp, Glu, His, and Lys only. We provide pKa predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observe that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to the physiological pH.
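An illustrative regression setup with the four tree ensembles named above, evaluated by RMSE, is sketched below; the features and targets are synthetic, whereas the real inputs would be structure-derived descriptors for titratable residues.

```python
# Four tree-based regressors compared by test-set RMSE on synthetic
# pKa-like targets.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 20))            # e.g. burial, H-bond counts, charges
y = 4.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "RandomForest": RandomForestRegressor(random_state=0),
    "ExtraTrees":   ExtraTreesRegressor(random_state=0),
    "XGBoost":      XGBRegressor(objective="reg:squarederror"),
    "LightGBM":     LGBMRegressor(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: RMSE = {rmse:.2f} pKa units")
```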


2021 ◽  
Vol 36 (Supplement_1) ◽  
Author(s):  
Sejoong Kim ◽  
Yeonhee Lee ◽  
Seung Seok Han

Abstract Background and Aims The precise prediction of acute kidney injury (AKI) after nephrectomy for renal cell carcinoma (RCC) is an important issue because of its relationship with subsequent kidney dysfunction and high mortality. Herein we addressed whether machine learning algorithms could predict postoperative AKI risk better than conventional logistic regression (LR) models. Method A total of 4,104 RCC patients who had undergone unilateral nephrectomy from January 2003 to December 2017 were reviewed. Machine learning models such as support vector machine, random forest, extreme gradient boosting, and light gradient boosting machine (LightGBM) were developed, and their performance based on the area under the receiver operating characteristic curve, accuracy, and F1 score was compared with that of the LR-based scoring model. Results Postoperative AKI developed in 1,167 patients (28.4%). All the machine learning models had higher performance index values than the LR-based scoring model. Among them, the LightGBM model had the highest value of 0.810 (0.783–0.837). The decision curve analysis demonstrated a greater net benefit of the machine learning models than the LR-based scoring model over all the ranges of threshold probabilities. The LightGBM and random forest models, but not others, were well calibrated. Conclusion The application of machine learning algorithms improves the predictability of AKI after nephrectomy for RCC, and these models perform better than conventional LR-based models.
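The decision curve analysis mentioned above compares net benefit across threshold probabilities. A small Python sketch of that computation on synthetic predictions follows; the formula is the standard net-benefit definition, and the data are not the study's.

```python
# Net benefit of predicted probabilities across threshold probabilities,
# as used in decision curve analysis. Synthetic labels and predictions.
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    # NB(t) = TP/n - FP/n * t / (1 - t)
    n = len(y_true)
    pred = y_prob >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return tp / n - (fp / n) * threshold / (1.0 - threshold)

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.4 + rng.uniform(0, 0.6, size=1000), 0, 1)

for t in (0.1, 0.2, 0.3, 0.4):
    print(f"threshold {t:.1f}: net benefit = {net_benefit(y_true, y_prob, t):.3f}")
```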


2019 ◽  
Vol 2019 ◽  
pp. 1-14 ◽  
Author(s):  
Shuai Sun ◽  
Jun Zhang ◽  
Jun Bi ◽  
Yongxing Wang

It is of great significance to improve driving range prediction accuracy in order to provide battery electric vehicle users with reliable information. A model built by the conventional multiple linear regression method is feasible for predicting the driving range, but its residual errors, ranging from -3.6975 km to 3.3865 km, are too large to be dependable in real-world driving. The study is innovative in its application of a machine learning method, the gradient boosting decision tree (GBDT) algorithm, to driving range prediction, which involves a very large number of factors that cannot be considered by conventional regression methods. The machine learning method yields a maximum prediction error of 1.58 km, a minimum prediction error of -1.41 km, and an average prediction error of about 0.7 km. The predictive accuracy of the gradient boosting decision tree is also compared against that of the conventional approaches.
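A minimal sketch contrasting multiple linear regression with gradient boosting on a driving-range-style regression appears below, reporting residual extremes in the spirit of the study; the inputs are random stand-ins for factors such as speed, temperature, and battery state.

```python
# MLR vs. GBDT on synthetic driving-range data, comparing residual ranges
# and mean absolute error on a hold-out set.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(2000, 7))
y = 120 + 8 * X[:, 0] - 4 * X[:, 1] + 3 * np.sin(2 * X[:, 2]) + rng.normal(scale=1.0, size=2000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, model in {"MLR": LinearRegression(),
                    "GBDT": GradientBoostingRegressor(random_state=0)}.items():
    model.fit(X_tr, y_tr)
    resid = y_te - model.predict(X_te)
    print(f"{name}: residuals in [{resid.min():.2f}, {resid.max():.2f}] km, "
          f"mean |error| = {np.abs(resid).mean():.2f} km")
```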


2020 ◽  
Vol 12 (18) ◽  
pp. 3104
Author(s):  
Gangqiang An ◽  
Minfeng Xing ◽  
Binbin He ◽  
Chunhua Liao ◽  
Xiaodong Huang ◽  
...  

Chlorophyll is an essential pigment for photosynthesis in crops, and leaf chlorophyll content can be used as an indicator of crop growth status and can help guide nitrogen fertilizer applications. Estimating crop chlorophyll content therefore plays an important role in precision agriculture. In this study, a variable, the rate of change in reflectance between wavelengths 'a' and 'b' (RCRWa-b), derived from in situ hyperspectral remote sensing data, was combined with four advanced machine learning techniques, Gaussian process regression (GPR), random forest regression (RFR), support vector regression (SVR), and gradient boosting regression tree (GBRT), to estimate the chlorophyll content (measured by a portable soil–plant analysis development meter) of rice. The performances of the four machine learning models were assessed and compared using root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R2). The results revealed that four features of RCRWa-b, RCRW551.0–565.6, RCRW739.5–743.5, RCRW684.4–687.1, and RCRW667.9–672.0, were effective in estimating the chlorophyll content of rice, and that the RFR model generated the highest prediction accuracy (training set: RMSE = 1.54, MAE = 1.23, and R2 = 0.95; validation set: RMSE = 2.64, MAE = 1.99, and R2 = 0.80). The GPR model was found to have the strongest generalization (training set: RMSE = 2.83, MAE = 2.16, and R2 = 0.77; validation set: RMSE = 2.97, MAE = 2.30, and R2 = 0.76). We conclude that RCRWa-b is a useful variable for estimating the chlorophyll content of rice, and that RFR and GPR are powerful machine learning algorithms for this task.
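The four-model comparison above can be sketched as follows, scored with RMSE, MAE, and R2; the spectral features are synthetic proxies for the RCRWa-b variables rather than field hyperspectral data, and the GPR noise level is an assumed setting.

```python
# GPR, RFR, SVR, and GBRT compared on a synthetic chlorophyll-style
# regression with RMSE, MAE, and R^2 on a hold-out set.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(9)
X = rng.uniform(size=(300, 4))                 # four RCRW-like band-ratio features
y = 30 + 15 * X[:, 0] - 8 * X[:, 1] + rng.normal(scale=1.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "GPR":  GaussianProcessRegressor(alpha=1.0, normalize_y=True),
    "RFR":  RandomForestRegressor(random_state=0),
    "SVR":  SVR(),
    "GBRT": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(f"{name}: RMSE={mean_squared_error(y_te, pred) ** 0.5:.2f}, "
          f"MAE={mean_absolute_error(y_te, pred):.2f}, R2={r2_score(y_te, pred):.2f}")
```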

