A decision-theoretic approach to the evaluation of machine learning algorithms in computational drug discovery

2019 ◽  
Vol 35 (22) ◽  
pp. 4656-4663 ◽  
Author(s):  
Oliver P Watson ◽  
Isidro Cortes-Ciriano ◽  
Aimee R Taylor ◽  
James A Watson

Motivation: Artificial intelligence, trained via machine learning (e.g. neural nets, random forests) or computational statistical algorithms (e.g. support vector machines, ridge regression), holds much promise for the improvement of small-molecule drug discovery. However, small-molecule structure-activity data are high dimensional with low signal-to-noise ratios and proper validation of predictive methods is difficult. It is poorly understood which, if any, of the currently available machine learning algorithms will best predict new candidate drugs.
Results: The quantile-activity bootstrap is proposed as a new model validation framework using quantile splits on the activity distribution function to construct training and testing sets. In addition, we propose two novel rank-based loss functions which penalize only the out-of-sample predicted ranks of high-activity molecules. The combination of these methods was used to assess the performance of neural nets, random forests, support vector machines (regression) and ridge regression applied to 25 diverse high-quality structure-activity datasets publicly available on ChEMBL. Model validation based on random partitioning of available data favours models that overfit and ‘memorize’ the training set, namely random forests and deep neural nets. Partitioning based on quantiles of the activity distribution correctly penalizes extrapolation of models onto structurally different molecules outside of the training data. Simpler, traditional statistical methods such as ridge regression can outperform state-of-the-art machine learning methods in this setting. In addition, our new rank-based loss functions give considerably different results from mean squared error, highlighting the necessity to define model optimality with respect to the decision task at hand.
Availability and implementation: All software and data are available as Jupyter notebooks found at https://github.com/owatson/QuantileBootstrap.
Supplementary information: Supplementary data are available at Bioinformatics online.
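The quantile split on the activity distribution can be illustrated with a minimal sketch. This is not the authors' implementation (their notebooks are at the URL above). The function name `quantile_split`, the default `q=0.75`, and the choice to train on the lower-activity samples and test on the higher-activity ones are assumptions made for illustration, and the bootstrap resampling step is omitted.

```python
def quantile_split(activities, q=0.75):
    """Split sample indices at the q-th quantile of the activity values.

    Assumption: training uses the lower-activity samples and testing uses
    the higher-activity ones, so a model must extrapolate upward.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    n = len(activities)
    if n < 2:
        raise ValueError("need at least two samples to split")
    ranked = sorted(activities)
    # Number of samples at or below the cut, clamped to the range 1..n-1.
    cut = min(n - 1, max(1, int(q * n)))
    threshold = ranked[cut - 1]
    # Thresholding on the value keeps tied activities on the same side.
    train_idx = [i for i, a in enumerate(activities) if a <= threshold]
    test_idx = [i for i, a in enumerate(activities) if a > threshold]
    return train_idx, test_idx


# Hypothetical activity values, for illustration only.
activities = [5.0, 6.1, 7.2, 4.3, 8.9, 6.6, 9.4, 5.5]
train_idx, test_idx = quantile_split(activities, q=0.75)
```

Under this split every test activity exceeds every training activity, which is what makes the evaluation a test of extrapolation, not of interpolation within the training range.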

Geophysics ◽  
2013 ◽  
Vol 78 (3) ◽  
pp. WB113-WB126 ◽  
Author(s):  
Matthew J. Cracknell ◽  
Anya M. Reading

Inductive machine learning algorithms attempt to recognize patterns in, and generalize from, empirical data. They provide a practical means of predicting lithology, or other spatially varying physical features, from multidimensional geophysical data sets. For this reason, machine learning approaches are increasing in popularity for geophysical data inference. A key motivation for their use is the ease with which uncertainty measures can be estimated for nonprobabilistic algorithms. We have compared and evaluated the abilities of two nonprobabilistic machine learning algorithms, random forests (RF) and support vector machines (SVM), to recognize ambiguous supervised classification predictions using uncertainty calculated from estimates of class membership probabilities. We formulated a method to establish optimal uncertainty threshold values to identify and isolate the maximum number of incorrect predictions while preserving most of the correct classifications. This is illustrated using a case example of the supervised classification of surface lithologies in a folded, structurally complex, metamorphic terrain. We found that (1) the use of optimal uncertainty thresholds significantly improves overall classification accuracy of RF predictions, but not those of SVM, by eliminating the maximum number of incorrectly classified samples while preserving the maximum number of correctly classified samples; (2) RF, unlike SVM, was able to exploit dependencies and structures contained within spatially varying input data; and (3) high RF prediction uncertainty is spatially coincident with transitions in lithology and associated contact zones, and regions of intense deformation. Uncertainty has its upside in the identification of areas of key geologic interest and has wide application across the geosciences, where transition zones are important classes in their own right. 
The techniques used in this study are of practical value in prioritizing subsequent geologic field activities, which, with the aid of this analysis, may be focused on key lithology contacts and problematic localities.
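The idea of thresholding prediction uncertainty can be sketched as follows. The abstract does not give the uncertainty measure or the optimality criterion, so both are assumptions here: uncertainty is taken as one minus the largest class membership probability, and the threshold is chosen to maximize the fraction of incorrect predictions flagged plus the fraction of correct predictions kept. The function names are hypothetical.

```python
def uncertainty(probs):
    """Assumed measure: 1 minus the largest class membership probability."""
    return 1.0 - max(probs)


def best_threshold(uncertainties, is_correct):
    """Pick the threshold t that best separates wrong from right predictions.

    Predictions with uncertainty >= t are flagged as ambiguous. The score
    (an assumption, not the paper's stated criterion) is the fraction of
    incorrect predictions flagged plus the fraction of correct ones kept.
    """
    n_wrong = sum(1 for c in is_correct if not c)
    n_right = len(is_correct) - n_wrong
    if n_wrong == 0 or n_right == 0:
        raise ValueError("need both correct and incorrect predictions")
    best_t, best_score = None, -1.0
    for t in sorted(set(uncertainties)):
        flagged_wrong = sum(
            1 for u, c in zip(uncertainties, is_correct) if u >= t and not c
        )
        kept_right = sum(
            1 for u, c in zip(uncertainties, is_correct) if u < t and c
        )
        score = flagged_wrong / n_wrong + kept_right / n_right
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Hypothetical class probabilities and correctness flags on a validation set.
probs = [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45], [0.8, 0.2], [0.5, 0.5]]
correct = [True, True, False, True, False]
u = [uncertainty(p) for p in probs]
t = best_threshold(u, correct)
```

In this toy example the two incorrect predictions are also the two most uncertain ones, so the selected threshold flags both while keeping all three correct predictions.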


2020 ◽  
Vol 9 (1) ◽  
pp. 14-18
Author(s):  
Sapna Yadav ◽  
Pankaj Agarwal

Analyzing online or digital data to detect epidemics is an active area of research, and it has become more relevant during the present outbreak of COVID-19. There are several different types of influenza virus, and they evolve constantly, in the same manner as the COVID-19 virus has. As a result, they pose a greater challenge when it comes to analyzing them and predicting when, where and at what degree of severity outbreaks will occur during the flu season across the world. Greater surveillance of both seasonal and pandemic influenza is needed to ensure the health and safety of humankind. The objective of this work is to apply machine learning algorithms to build predictive models that can predict the occurrence, peak and severity of influenza in each season. For this work we considered a freely available dataset from Ireland recorded over the period 2005 to 2016. Specifically, we tested three ML algorithms, namely Linear Regression, Support Vector Regression and Random Forests. We found that Random Forests gave better predictive results. We also conducted experiments with the Weka tool and tested ZeroR, Linear Regression, Lazy KStar, Random Forest, REP Tree and Multilayer Perceptron models. We again found that Random Forest performed better than all the other models. We also evaluated other regression models, including Ridge Regression, modified Ridge Regression, Lasso Regression and K-Neighbor Regression, and compared their mean absolute errors. We found that modified Ridge Regression produced the minimum error. The proposed work is aimed at finding a suitable ML algorithm for solving this problem on flu.
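The comparison by mean absolute error mentioned above amounts to averaging the absolute differences between observed and predicted values for each model and picking the smallest. A minimal sketch follows; the case counts and model names are hypothetical and are not results from the study.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical weekly case counts and predictions from two models.
observed = [10, 20, 30]
predictions = {
    "model_a": [12, 18, 33],
    "model_b": [10, 25, 30],
}
errors = {
    name: mean_absolute_error(observed, pred)
    for name, pred in predictions.items()
}
# The model with the smallest mean absolute error is preferred.
best_model = min(errors, key=errors.get)
```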


2018 ◽  
Vol 7 (2.8) ◽  
pp. 684 ◽  
Author(s):  
V V. Ramalingam ◽  
Ayantan Dandapath ◽  
M Karthik Raja

Heart-related diseases, or cardiovascular diseases (CVDs), have been the main cause of a huge number of deaths over the last few decades and have emerged as the most life-threatening diseases, not only in India but in the whole world. There is therefore a need for a reliable, accurate and feasible system to diagnose such diseases in time for proper treatment. Machine learning algorithms and techniques have been applied to various medical datasets to automate the analysis of large and complex data. Many researchers, in recent times, have been using several machine learning techniques to help the health care industry and professionals in the diagnosis of heart-related diseases. This paper presents a survey of various models based on such algorithms and techniques and analyzes their performance. Models based on supervised learning algorithms such as Support Vector Machines (SVM), K-Nearest Neighbour (KNN), Naïve Bayes, Decision Trees (DT), Random Forest (RF) and ensemble models are found to be very popular among researchers.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Yao Huimin

With the development of cloud computing and distributed cluster technology, the concept of big data has been expanded and extended in terms of capacity and value, and machine learning technology has also received unprecedented attention in recent years. Traditional machine learning algorithms cannot solve the problem of effective parallelization, so a parallelized support vector machine based on the Spark big data platform is proposed. Firstly, the big data platform is designed with the Lambda architecture, which is divided into three layers: Batch Layer, Serving Layer, and Speed Layer. Secondly, in order to improve the training efficiency of support vector machines on large-scale data, when merging two support vector machines, the “special points” other than support vectors are considered, that is, the points where the nonsupport vectors in one subset violate the training results of the other subset, and a cross-validation merging algorithm is proposed. Then, a parallelized support vector machine based on cross-validation is proposed, and the parallelization process of the support vector machine is realized on the Spark platform. Finally, experiments on different datasets verify the effectiveness and stability of the proposed method. Experimental results show that the proposed parallelized support vector machine has outstanding performance in speed-up ratio, training time, and prediction accuracy.
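One way to read the “special points” is as points in one subset that fall inside or on the wrong side of the margin of the model trained on the other subset. The sketch below makes that reading concrete under loud assumptions: a linear SVM with decision function w·x + b, labels in {+1, -1}, and “violate” meaning a functional margin below 1. The abstract does not specify any of these, and the function name is hypothetical; this is not the paper's merging algorithm.

```python
def margin_violators(points, labels, w, b):
    """Indices of points whose functional margin y * (w.x + b) is below 1.

    Assumption: w and b come from a linear SVM trained on the *other*
    subset, and labels are +1 or -1.
    """
    violators = []
    for i, (x, y) in enumerate(zip(points, labels)):
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        if y * score < 1.0:
            violators.append(i)
    return violators


# Hypothetical model from subset A, checked against points of subset B.
w, b = [1.0, 0.0], 0.0
subset_b = [[2.0, 0.0], [0.5, 1.0], [-3.0, 2.0], [-0.2, 0.0], [1.5, 0.0]]
labels_b = [1, 1, -1, -1, -1]
special = margin_violators(subset_b, labels_b, w, b)
```

Points returned here would be candidates to carry forward, alongside the support vectors, when two partial models are merged.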


Author(s):  
Nor Azizah Hitam ◽  
Amelia Ritahani Ismail

Machine learning is a part of artificial intelligence that can make forecasts based on previous experience. Methods have been proposed to construct models, including machine learning algorithms such as Neural Networks (NN), Support Vector Machines (SVM) and Deep Learning. This paper presents a comparison of the performance of machine learning algorithms for cryptocurrency forecasting. Specifically, this paper concentrates on forecasting time series data. SVM has several advantages over the other models in forecasting, and previous research has revealed that SVM provides results close to the actual values while also improving the accuracy of the results. However, recent research has shown that, due to the small range of samples and data manipulation by inadequate evidence and professional analyzers, the overall status and accuracy rate of the forecasting need to be improved in further studies. Thus, advanced research on the accuracy rate of the forecasted price remains to be done.


2011 ◽  
Vol 230-232 ◽  
pp. 625-628
Author(s):  
Lei Shi ◽  
Xin Ming Ma ◽  
Xiao Hong Hu

E-business has grown rapidly in the last decade, and a massive amount of data on customer purchases, browsing patterns and preferences has been generated. Classification of electronic data plays a pivotal role in mining valuable information and has thus become one of the most important applications of E-business. Support Vector Machines are popular and powerful machine learning techniques, and they offer state-of-the-art performance. Rough set theory is a formal mathematical tool for dealing with incomplete or imprecise information, and one of its important applications is feature selection. In this paper, rough set theory and support vector machines are combined to construct a classification model that classifies E-business data effectively.


2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Nalindren Naicker ◽  
Timothy Adeliyi ◽  
Jeanette Wing

Educational Data Mining (EDM) is a rich research field in computer science. Tools and techniques in EDM are useful for predicting student performance, which gives practitioners useful insights for developing appropriate intervention strategies to improve pass rates and increase retention. The performance of state-of-the-art machine learning classifiers is very much dependent on the task at hand. Support vector machines have been used extensively in classification problems; however, the extant literature shows a gap in the application of linear support vector machines as a predictor of student performance. The aim of this study was to compare the performance of linear support vector machines with the performance of state-of-the-art classical machine learning algorithms in order to determine the algorithm that would improve prediction of student performance. In this quantitative study, an experimental research design was used. Experiments were set up using feature selection on a publicly available dataset of 1000 alpha-numeric student records. Linear support vector machines, benchmarked against ten categorical machine learning algorithms, showed superior performance in predicting student performance. The results of this research showed that features like race, gender, and lunch influence performance in mathematics, whilst access to lunch was the primary factor influencing reading and writing performance.


2010 ◽  
Vol 07 (01) ◽  
pp. 59-80
Author(s):  
D. CHENG ◽  
S. Q. XIE ◽  
E. HÄMMERLE

Local descriptor matching is the most overlooked of the three stages of the local descriptor process, and this paper proposes a new method for matching local descriptors based on support vector machines. Results from experiments show that the developed method is more robust for matching local descriptors for all image transformations considered. The method can be integrated with different local descriptor methods and with different machine learning algorithms, which shows that the approach is sufficiently robust and versatile.


2022 ◽  
Vol 2161 (1) ◽  
pp. 012019
Author(s):  
Rencita Maria Colaco ◽  
Shreya ◽  
N V Subba Reddy ◽  
U Dinesh Acharya

The COVID-19 virus, a global terror that has shaken the world, has taken a huge number of lives. According to research, there are also many recoveries. The most important factor in surviving this disease is good immunity. Not everyone has the same level of immunity. One main factor on which immunity depends is a healthy diet. If a routine of healthy eating is maintained, immunity against this virus increases. People therefore need to be informed about maintaining a healthy diet. Using a dataset on healthy diets and various machine learning algorithms, we can determine what type of diet a person needs. Using algorithms such as Random Forest, KNN, logistic regression and Support Vector Machines, we can determine the type of diet and the probability of recovery. The dataset required for analysis needs to contain all the information regarding the diet. Based on the dataset, prediction is performed using the Decision Tree algorithm. This method of finding the appropriate diet for a particular person based on sugar level, blood pressure and BMI can be highly useful research during this pandemic.

