Quantum Machine Learning for Drug Discovery

2020 ◽  
Author(s):  
Kushal Batra ◽  
Kimberley M. Zorn ◽  
Daniel H. Foil ◽  
Eni Minerali ◽  
Victor O. Gawriljuk ◽  
...  

<p>The growing number of public and private datasets focused on small molecules screened against biological targets or whole organisms <sup>1</sup> provides a wealth of data relevant to drug discovery. Increasingly, these data are used to create machine learning models that enable target-based design <sup>2-4</sup>, predict on- or off-target effects, and create scoring functions <sup>5,6</sup>. This is matched by the availability of machine learning algorithms such as Support Vector Machines (SVM) and Deep Neural Networks (DNN), which are computationally expensive to run on very large datasets with thousands of molecular descriptors. Quantum computer (QC) algorithms have been proposed as a way to accelerate machine learning relative to classical computer (CC) algorithms, although with significant limitations. In cheminformatics, one challenge is the need to compress large numbers of molecular descriptors for use on a QC. Here we show how to achieve compression with datasets ranging from hundreds of molecules (SARS-CoV-2) to hundreds of thousands (whole cell screening datasets for plague and <i>M. tuberculosis</i>), using SVM and a data re-uploading classifier (a DNN-equivalent algorithm) on a QC, benchmarked against CC and hybrid approaches. This illustrates a quantum advantage for drug discovery to build upon in the future.</p>
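The abstract does not say how the descriptors are compressed. As a purely illustrative stand-in, the sketch below OR-folds a binary fingerprint down to a fixed length, which is one common way to shrink bit-vector descriptors. The function name `fold_fingerprint` and the target length are assumptions for this example, not the authors' procedure.

```python
def fold_fingerprint(bits, n_out):
    """OR-fold a binary fingerprint down to n_out bits.

    Bit i of the input is mapped to position i % n_out of the output;
    an output bit is set if any input bit mapped to it is set.
    Hypothetical helper for illustration only.
    """
    if n_out <= 0:
        raise ValueError("n_out must be positive")
    folded = [0] * n_out
    for i, bit in enumerate(bits):
        if bit:
            folded[i % n_out] = 1
    return folded


# Toy 8-bit fingerprint compressed to 4 bits (for example, one feature per qubit).
toy_fingerprint = [1, 0, 0, 0, 0, 1, 0, 0]
compressed = fold_fingerprint(toy_fingerprint, 4)
print(compressed)
```

Folding loses information whenever two set bits collide on the same output position, so the output length trades descriptor fidelity against the number of features the downstream classifier has to encode.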


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Isabella A. Guedes ◽  
André M. S. Barreto ◽  
Diogo Marinho ◽  
Eduardo Krempser ◽  
Mélaine A. Kuenemann ◽  
...  

Abstract: Scoring functions are essential for modern in silico drug discovery. However, the accurate prediction of binding affinity by scoring functions remains a challenging task, and their performance is very heterogeneous across different target classes. Scoring functions based on precise physics-based descriptors that better represent the protein–ligand recognition process are strongly needed. We developed a set of new empirical scoring functions, named DockTScore, by explicitly accounting for physics-based terms combined with machine learning. Target-specific scoring functions were developed for two important drug targets, proteases and protein–protein interactions, representing an original class of molecules for drug discovery. Multiple linear regression (MLR), support vector machine, and random forest algorithms were employed to derive general and target-specific scoring functions involving optimized MMFF94S force-field terms, solvation and lipophilic interaction terms, and an improved term accounting for the contribution of ligand torsional entropy to ligand binding. The DockTScore scoring functions proved competitive with the current best-evaluated scoring functions in terms of binding energy prediction and ranking on four DUD-E datasets, and will be useful for in silico drug design for diverse proteins as well as for specific targets such as proteases and protein–protein interactions. Currently, the MLR DockTScore is available at www.dockthor.lncc.br.


2020 ◽  
Author(s):  
Thomas R. Lane ◽  
Daniel H. Foil ◽  
Eni Minerali ◽  
Fabio Urbina ◽  
Kimberley M. Zorn ◽  
...  

<p>Machine learning methods are attracting considerable attention from the pharmaceutical industry for use in drug discovery and beyond. In recent studies we have applied multiple machine learning algorithms and modeling metrics, and in some cases compared molecular descriptors, to build models for individual targets or properties on a relatively small scale. Several research groups have used large numbers of datasets from public databases such as ChEMBL to evaluate machine learning methods of interest to them. The largest of these studies used on the order of 1400 datasets. We have now extracted well over 5000 datasets from ChEMBL for use with the ECFP6 fingerprint and compared our proprietary software Assay Central<sup>TM</sup> with random forest, k-Nearest Neighbors, support vector classification, naïve Bayesian, AdaBoosted decision trees, and deep neural networks (3 levels). Model performance was assessed using an array of five-fold cross-validation metrics including area-under-the-curve, F1 score, Cohen’s kappa and Matthews correlation coefficient. Based on ranked normalized scores for the metrics or datasets, all methods appeared comparable, while the distance from the top indicated that Assay Central<sup>TM</sup> and support vector classification were comparable. Unlike prior studies, which have placed considerable emphasis on deep neural networks (deep learning), no advantage was seen in this case, where minimal tuning of any of the methods was performed. If anything, Assay Central<sup>TM</sup> may have been at a slight advantage, as the activity cutoff for each of the over 5000 datasets, representing over 570,000 unique compounds, was based on Assay Central<sup>TM</sup> performance, but support vector classification seems to be a strong competitor. We also apply Assay Central<sup>TM</sup> to prospective predictions for PXR and hERG to further validate these models. This work currently appears to be the largest comparison of machine learning algorithms to date. Future studies will likely evaluate additional databases, descriptors and algorithms, and further refine methods for evaluating and comparing models.</p>
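Three of the metrics named in this abstract (F1 score, Cohen's kappa, and Matthews correlation coefficient) can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard textbook formulas; the function names and the toy labels are assumptions for illustration and are not taken from the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(tp, tn, fp, fn):
    """Agreement between truth and prediction, corrected for chance."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """Correlation between truth and prediction, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, invented for this example.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(f1_score(*counts), cohen_kappa(*counts), matthews_corrcoef(*counts))
```

Unlike F1, both kappa and MCC use the true-negative count, which is one reason studies of imbalanced assay datasets tend to report several metrics side by side.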


AI ◽  
2020 ◽  
Vol 1 (2) ◽  
pp. 276-285
Author(s):  
Dragos Paul Mihai ◽  
Cosmin Trif ◽  
Gheorghe Stancov ◽  
Denise Radulescu ◽  
George Mihai Nitulescu

Transient receptor potential ankyrin 1 (TRPA1) is a ligand-gated calcium channel activated by cold temperatures, by a plethora of electrophilic environmental irritants (allicin, acrolein, mustard oil), and by endogenously oxidized lipids (15-deoxy-Δ12,14-prostaglandin J2 and 5,6-epoxyeicosatrienoic acid). These oxidized lipids act as agonists, making TRPA1 a key player in inflammatory and neuropathic pain. TRPA1 antagonists acting as non-central pain blockers are a promising choice for the future treatment of pain-related conditions, with advantages over current therapeutic choices. A large variety of in silico methods have been used in drug design to speed up the development of new active compounds, such as molecular docking, quantitative structure-activity relationship (QSAR) models, and machine learning classification algorithms. Artificial intelligence methods can significantly improve the drug discovery process and form an attractive field that can bring together computer scientists and experts in drug development. In our paper, we aimed to develop models using three machine learning algorithms frequently used in drug discovery research, feedforward neural networks (FFNN), random forests (RF), and support vector machines (SVM), for discovering novel TRPA1 antagonists. All three machine learning methods used the same class of independent variables (multilevel neighborhoods of atoms descriptors) as the Prediction of Activity Spectra for Substances (PASS) software. The model with the highest accuracy and best performance metrics was the random forest algorithm, showing 99% accuracy and 0.9936 ROC AUC. Thus, our study emphasized that simpler and robust machine learning algorithms such as random forests perform better in correctly classifying TRPA1 antagonists, since the dimension of the dependent variables dataset is relatively modest.
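The two figures quoted for the random forest model, accuracy and ROC AUC, measure different things: accuracy depends on a decision threshold, while ROC AUC depends only on how the scores rank actives against inactives. The sketch below computes both from scratch; the function names, the 0.5 threshold, and the toy scores are assumptions for illustration, not values from the paper.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at the given threshold."""
    correct = sum(
        1 for label, score in zip(labels, scores)
        if (score >= threshold) == (label == 1)
    )
    return correct / len(labels)


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a
    random negative (ties count as one half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data, invented for this example: 1 = antagonist, 0 = inactive.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.2, 0.8]
print(accuracy(labels, scores), roc_auc(labels, scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a toy example; a rank-based formulation would be preferable for large datasets.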


2020 ◽  
Vol 12 (2) ◽  
pp. 84-99
Author(s):  
Li-Pang Chen

In this paper, we investigate the analysis and prediction of time-dependent data. We focus on four stocks selected from the Yahoo Finance historical database. To build models and predict future stock prices, we consider three machine learning techniques: Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and Support Vector Regression (SVR). By treating close price, open price, daily low, daily high, adjusted close price, and volume of trades as predictors in these machine learning methods, we show that prediction accuracy is improved.


Author(s):  
Anantvir Singh Romana

Accurate diagnostic detection of disease in a patient is critical: it may alter the subsequent treatment and increase the chances of survival. Machine learning techniques have been instrumental in disease detection and are currently used in various classification problems because of their accurate prediction performance. Different techniques may provide different accuracies, so it is imperative to use the most suitable method, the one that provides the best results. This research seeks to provide a comparative analysis of Support Vector Machine, Naïve Bayes, J48 Decision Tree, and neural network classifiers on breast cancer and diabetes datasets.


2021 ◽  
Vol 186 (Supplement_1) ◽  
pp. 445-451
Author(s):  
Yifei Sun ◽  
Navid Rashedi ◽  
Vikrant Vaze ◽  
Parikshit Shah ◽  
Ryan Halter ◽  
...  

Abstract
Introduction: Early prediction of the acute hypotensive episode (AHE) in critically ill patients has the potential to improve outcomes. In this study, we apply different machine learning algorithms to the MIMIC III Physionet dataset, containing more than 60,000 real-world intensive care unit records, to test commonly used machine learning technologies and compare their performance.
Materials and Methods: Five classification methods, K-nearest neighbor, logistic regression, support vector machine, random forest, and a deep learning method called long short-term memory, are applied to predict an AHE 30 minutes in advance. An analysis comparing model performance when including versus excluding invasive features was conducted. To further study the pattern of the underlying mean arterial pressure (MAP), we apply linear regression to predict the continuous MAP values over the next 60 minutes.
Results: Support vector machine yields the best performance in terms of recall (84%). Including the invasive features in the classification improves performance significantly, with both recall and precision increasing by more than 20 percentage points. We were able to predict the MAP 60 minutes in the future with a root mean square error (a frequently used measure of the differences between the predicted values and the observed values) of 10 mmHg. After converting continuous MAP predictions into AHE binary predictions, we achieve a 91% recall and 68% precision. In addition to predicting AHE, the MAP predictions provide clinically useful information regarding the timing and severity of the AHE occurrence.
Conclusion: We were able to predict AHE with precision and recall above 80% 30 minutes in advance with the large real-world dataset. The predictions of the regression model can provide a more fine-grained, interpretable signal to practitioners. Model performance is improved by the inclusion of invasive features in predicting AHE, compared with predicting AHE based on only the available, restricted set of noninvasive technologies. This demonstrates the importance of exploring more noninvasive technologies for AHE prediction.
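The step of converting continuous MAP predictions into binary AHE predictions, and then scoring them, can be sketched as follows. The abstract does not give the AHE definition, so the 65 mmHg cutoff here is an assumed placeholder, and the function names and toy values are likewise hypothetical.

```python
import math

# Assumed cutoff for illustration only; the abstract does not state the
# threshold or duration criteria used to define an AHE.
AHE_MAP_THRESHOLD_MMHG = 65.0


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def to_ahe_flags(map_values, threshold=AHE_MAP_THRESHOLD_MMHG):
    """Flag each MAP value below the threshold as hypotensive."""
    return [value < threshold for value in map_values]


def precision_recall(true_flags, pred_flags):
    """Precision and recall of predicted flags against true flags."""
    tp = sum(1 for t, p in zip(true_flags, pred_flags) if t and p)
    fp = sum(1 for t, p in zip(true_flags, pred_flags) if not t and p)
    fn = sum(1 for t, p in zip(true_flags, pred_flags) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy MAP values in mmHg, invented for this example.
observed_map = [70.0, 60.0, 62.0, 80.0]
predicted_map = [68.0, 64.0, 66.0, 63.0]
error = rmse(observed_map, predicted_map)
precision, recall = precision_recall(
    to_ahe_flags(observed_map), to_ahe_flags(predicted_map)
)
print(error, precision, recall)
```

Because the binary label is obtained by thresholding, a regression error of a few mmHg matters most for values close to the cutoff, which is why RMSE alone does not determine the resulting precision and recall.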


2021 ◽  
pp. 1-17
Author(s):  
Ahmed Al-Tarawneh ◽  
Ja’afer Al-Saraireh

Twitter is one of the most popular platforms used to share and post ideas. Hackers and anonymous attackers use such platforms maliciously, and their behavior can be used to predict the risk of future attacks by gathering and classifying hackers’ tweets using machine learning techniques. Previous approaches for detecting infected tweets are based on human effort or text analysis, so they are limited in their ability to capture the hidden text between tweet lines. The main aim of this research paper is to enhance the efficiency of hacker detection on the Twitter platform using the complex networks technique with adapted machine learning algorithms. This work presents a methodology that collects a list of users, together with their followers who share their posts and have similar interests, from a hackers’ community on Twitter. The list is built from a set of suggested keywords that hackers commonly use in their tweets. A complex network is then generated for all users to find relations among them in terms of network centrality, closeness, and betweenness. After extracting these values, a dataset of the most influential users in the hacker community is assembled. Subsequently, tweets belonging to users in the extracted dataset are gathered and classified into positive and negative classes. The output of this process is used in a machine learning process that applies different algorithms. This research builds and investigates an accurate dataset containing real users who belong to a hackers’ community. Accuracy was measured as the proportion of correctly classified instances, averaged over the K-nearest neighbor, Naive Bayes, Random Tree, and support vector machine techniques, yielding about 90% and 88% accuracy for cross-validation and percentage split, respectively.
Consequently, the proposed network cyber Twitter model is able to detect hackers and determine whether tweets pose a future risk to institutions and individuals, providing early warning of possible attacks.
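The closeness and betweenness measures mentioned in this abstract can be illustrated on a small undirected graph. The sketch below uses breadth-first search for closeness and Brandes' algorithm for unnormalized betweenness; the graph, the node names, and the function names are assumptions for this example and do not come from the paper's dataset.

```python
from collections import deque


def closeness_centrality(graph):
    """Closeness of each node: reachable nodes divided by summed distances."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness_centrality(graph):
    """Unnormalized betweenness for an undirected, unweighted graph
    (Brandes' algorithm)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        order = []                                # nodes in BFS visit order
        preds = {node: [] for node in graph}      # shortest-path predecessors
        paths = {node: 0 for node in graph}       # number of shortest paths
        paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
                if dist[neighbor] == dist[node] + 1:
                    paths[neighbor] += paths[node]
                    preds[neighbor].append(node)
        dependency = {node: 0.0 for node in graph}
        for node in reversed(order):              # accumulate from the leaves
            for pred in preds[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Each unordered pair was counted from both endpoints.
    return {node: value / 2 for node, value in score.items()}


# Toy follower graph, invented for this example (adjacency lists).
toy_graph = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b"],
    "d": ["b", "e"],
    "e": ["d"],
}
closeness = closeness_centrality(toy_graph)
betweenness = betweenness_centrality(toy_graph)
print(closeness["b"], betweenness["b"], betweenness["d"])
```

In this toy graph, node `b` sits on the most shortest paths and is also closest to everyone else, which is the kind of signal the abstract describes using to pick out the most influential users.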

