A Comparison of Scaling Methods to Obtain Calibrated Probabilities of Activity for Ligand-Target Predictions

2020 ◽  
Author(s):  
Lewis Mervin ◽  
Avid M. Afzal ◽  
Ola Engkvist ◽  
Andreas Bender

In the context of bioactivity prediction, the question of how to calibrate a score produced by a machine learning method into a reliable probability of binding to a protein target has not yet been satisfactorily addressed. In this study, we compared the performance of three such methods, namely Platt Scaling, Isotonic Regression and Venn-ABERS, in calibrating prediction scores for ligand-target prediction from the Naïve Bayes, Support Vector Machines and Random Forest algorithms, using bioactivity data available at AstraZeneca (40 million compound-target data points across 2112 targets). Performance was assessed using Stratified Shuffle Split (SSS) and Leave 20% of Scaffolds Out (L20SO) validation.
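
To make the idea of score calibration concrete, the following is a minimal sketch of two of the three methods named in the abstract, Platt Scaling and Isotonic Regression, written in plain Python. It is not the authors' implementation: the toy scores and labels are hypothetical, the Platt fit uses plain gradient descent without Platt's target smoothing, the isotonic fit returns a simple step function, and Venn-ABERS is omitted. In practice the calibrator would be fitted on held-out data rather than on the training scores.

```python
import math

def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = 1 / (1 + exp(-(a*s + b))) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def platt_predict(a, b, s):
    """Map a raw score to a probability with the fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(-(a * s + b)))

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: returns (upper_score_bound, value) blocks."""
    blocks = []  # each block holds [sum_of_labels, count, highest_score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]

def isotonic_predict(blocks, s):
    """Step-function lookup: value of the first block whose bound is >= s."""
    for top, value in blocks:
        if s <= top:
            return value
    return blocks[-1][1]

# Hypothetical uncalibrated classifier scores and activity labels (1 = active).
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 1, 0, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
blocks = fit_isotonic(scores, labels)
for s in (-2.0, 0.75, 2.0):
    print(s, round(platt_predict(a, b, s), 3), isotonic_predict(blocks, s))
```

Platt Scaling imposes a sigmoid shape on the score-to-probability map, while Isotonic Regression only requires the map to be non-decreasing, which is why the latter typically needs more calibration data to avoid overfitting.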


Author(s):  
Jeffrey S. Racine

This chapter covers two advanced topics: a machine learning method (support vector machines useful for classification) and nonparametric kernel regression.


2021 ◽  
Vol 13 (1) ◽  
pp. 133
Author(s):  
Hao Sun ◽  
Yajing Cui

Downscaling microwave remotely sensed soil moisture (SM) is an effective way to obtain spatially continuous, fine-resolution SM for hydrological and agricultural applications on a regional scale. Downscaling factors and functions are the two basic components of SM downscaling, and the former are particularly important in the era of big data. Using a machine learning approach, this study evaluated Land Surface Temperature (LST), Land surface Evaporative Efficiency (LEE), and geographical factors from Moderate Resolution Imaging Spectroradiometer (MODIS) products for downscaling SMAP (Soil Moisture Active and Passive) SM products. The study covers 2015 through the end of 2018 and is located in the central United States. Original SMAP SM and in-situ SM at sparse networks and core validation sites were used as reference. Experimental results indicated that (1) LEE performed comparably to LST as a downscaling factor; (2) adding geographical factors can significantly improve the performance of SM downscaling; (3) integrating LST, LEE, and geographical factors achieved the best performance; (4) these conclusions held whether Z-score or hyperbolic-tangent normalization was used, and whether support vector regression or a feed-forward neural network was used. This study demonstrates that LEE can serve as an alternative to LST for downscaling SM when LST is unavailable due to cloud contamination. It also provides experimental evidence for adding geographical factors in the downscaling process.
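
The two normalization schemes mentioned in conclusion (4) can be illustrated with a short sketch. The abstract does not give the exact formulas used, so this assumes the standard Z-score and one commonly cited form of hyperbolic-tangent normalization, 0.5 * (tanh(0.01 * z) + 1); the `scale` parameter and the sample LST values are hypothetical.

```python
import math
import statistics

def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def tanh_normalize(values, scale=0.01):
    """Squash Z-scores into (0, 1); `scale` is an assumed constant."""
    return [0.5 * (math.tanh(scale * z) + 1.0) for z in z_score(values)]

# Hypothetical land surface temperature samples in kelvin.
lst_kelvin = [285.0, 290.0, 295.0, 300.0, 305.0]

z = z_score(lst_kelvin)
t = tanh_normalize(lst_kelvin)
print([round(v, 3) for v in z])
print([round(v, 4) for v in t])
```

Both transforms preserve the ordering of the inputs; the tanh form additionally bounds the output, which limits the influence of extreme values.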


Author(s):  
Jian Yi

The stability of the economic market is an important factor in rapid economic development, especially for listed companies, whose financial stability affects the stability of the financial market. Accurate early warning of the financial condition of listed enterprises supports the healthy development of both enterprises and financial markets. This paper briefly introduces two machine learning algorithms, the support vector machine (SVM) and the back-propagation neural network (BPNN). To compensate for the defects of each algorithm, the two were combined and applied to enterprise financial early warning. A simulation experiment was carried out in MATLAB on three models: one based on SVM alone, one based on BPNN alone, and the combined SVM-BPNN model. The results show that the combined SVM-BPNN model converges faster and has higher precision, higher recall, and a larger area under the curve (AUC) than either single-algorithm model.


2020 ◽  
Vol 10 (11) ◽  
pp. 4016 ◽  
Author(s):  
Xudong Hu ◽  
Han Zhang ◽  
Hongbo Mei ◽  
Dunhui Xiao ◽  
Yuanyuan Li ◽  
...  

Landslide susceptibility mapping is considered a prerequisite for landslide prevention and mitigation. However, delineating the spatial occurrence pattern of landslides remains a challenge. This study investigates the potential application of the stacking ensemble learning technique for landslide susceptibility assessment. In particular, support vector machine (SVM), artificial neural network (ANN), logistic regression (LR), and naive Bayes (NB) were selected as base learners for the stacking ensemble method. A resampling scheme and Pearson’s correlation analysis were jointly used to evaluate the importance of these base learners. A total of 388 landslides and 12 conditioning factors in the Lushui area (Southwest China) were used as the dataset for landslide modeling. The landslides were randomly split into two parts, with 70% used for model training and 30% for model validation. Model performance was evaluated using the area under the receiver operating characteristic (ROC) curve (AUC) and statistical measures. The results showed that the stacking-based ensemble models achieved improved predictive accuracy compared with the single algorithms, and that the SVM-ANN-NB-LR (SANL), SVM-ANN-NB (SAN), and ANN-NB-LR (ANL) models performed equally well, with AUC values of 0.931, 0.940, and 0.932, respectively, in the validation stage. The correlation coefficient between LR and SVM was the highest across all resampling rounds, with an average value of 0.72. This indicates that LR and SVM played an almost equal role when the SANL ensemble was applied for landslide susceptibility analysis. Therefore, it is feasible to use the SAN model or the ANL model for the study area. The findings suggest that the stacking ensemble machine learning method is promising for landslide susceptibility mapping in the Lushui area and is capable of targeting areas prone to landslides.
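
As a reminder of what the AUC values above measure, here is a minimal sketch that computes the area under the ROC curve as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, counting ties as one half. The labels and scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Hypothetical validation labels (1 = landslide) and model susceptibility scores.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be preferable for large validation sets.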


2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Xiaoyong Liu ◽  
Hui Fu

Disease diagnosis is conducted with a machine learning method. We propose a novel machine learning method that hybridizes support vector machine (SVM), particle swarm optimization (PSO), and cuckoo search (CS). The new method consists of two stages: first, a CS-based approach for SVM parameter optimization is used to find better initial parameters for the kernel function; then PSO is applied to continue SVM training and find the best SVM parameters. Experimental results indicate that the proposed CS-PSO-SVM model achieves better classification accuracy and F-measure than PSO-SVM and GA-SVM. Therefore, we conclude that the proposed method is very efficient compared to the previously reported algorithms.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Rulan Wang ◽  
Zhuo Wang ◽  
Hongfei Wang ◽  
Yuxuan Pang ◽  
Tzong-Yi Lee

Lysine crotonylation (Kcr) is a type of protein post-translational modification (PTM) that plays important roles in a variety of cellular regulation and processes. Several methods have been proposed for the identification of crotonylation. However, most of these methods predict well only on either histone or non-histone proteins. Therefore, this work aims to give a more balanced performance across species; here plant (non-histone) and mammalian (histone) data are considered. SVM (support vector machine) and RF (random forest) were employed in this study. According to the cross-validation results, the RF classifier based on the EGAAC attribute achieved the best predictive performance, performing competitively with existing methods while being more robust on imbalanced datasets. Moreover, an independent test was carried out to compare the performance of this study with existing methods based on the same features or the same classifier. The SVM and RF classifiers achieved their best performances with 92% sensitivity, 88% specificity, 90% accuracy, and an MCC of 0.80 in the mammalian dataset, and 77% sensitivity, 83% specificity, 70% accuracy and 0.54 MCC in a relatively small mammalian dataset and a large-scale plant dataset, respectively. Moreover, cross-species independent testing was also carried out, which demonstrated the species diversity between plant and mammalian data.

