A Bayesian machine learning approach for drug target identification using diverse data types

Abstract: Gene functional enrichment is a mainstay of genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of the biological context. Here we present an alternative machine learning approach, Deep Functional Synthesis (DeepSyn), which moves beyond gene function databases to dynamically infer the functions of a gene set from its associated network of literature and data, conditioned on the disease and drug context of the current experiment. Using a knowledge graph with 3,048,803 associations between genes, diseases, drugs, and functions, DeepSyn obtained accurate performance (AUC range 0.74 to 0.96) on a variety of biological applications including drug target identification, gene set functional enrichment, and disease gene prediction. Availability: The DeepSyn codebase is available on GitHub at http://github.com/wangshenguiuc/DeepSyn/ under an open source distribution license.

Drug Target Identification with Machine Learning: How to Choose Negative Examples

International Journal of Molecular Sciences ◽

10.3390/ijms22105118 ◽

2021 ◽

Vol 22 (10) ◽

pp. 5118

Author(s):

Matthieu Najm ◽

Chloé-Agathe Azencott ◽

Benoit Playe ◽

Véronique Stoven

Keyword(s):

Machine Learning ◽

Drug Target ◽

Target Identification ◽

Target Prediction ◽

False Positives ◽

Machine Learning Algorithms ◽

Statistical Bias ◽

Protein Targets ◽

Drug Target Identification ◽

Approved Drugs

Identification of the protein targets of hit molecules is essential in the drug discovery process. Target prediction with machine learning algorithms can help accelerate this search, limiting the number of required experiments. However, Drug-Target Interactions databases used for training present high statistical bias, leading to a high number of false positives, thus increasing time and cost of experimental validation campaigns. To minimize the number of false positives among predicted targets, we propose a new scheme for choosing negative examples, so that each protein and each drug appears an equal number of times in positive and negative examples. We artificially reproduce the process of target identification for three specific drugs, and more globally for 200 approved drugs. For the detailed three drug examples, and for the larger set of 200 drugs, training with the proposed scheme for the choice of negative examples improved target prediction results: the average number of false positives among the top ranked predicted targets decreased, and overall, the rank of the true targets was improved. Our method corrects databases’ statistical bias and reduces the number of false positive predictions, and therefore the number of useless experiments potentially undertaken.
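
The balancing idea in this abstract (each protein and each drug appearing equally often among positive and negative examples) can be illustrated with a small sketch. This is not the authors' algorithm: the function name `balanced_negatives`, the greedy strategy, and the toy drug and protein identifiers are all assumptions made for illustration, and a greedy pass can fall short of exact balance on some inputs.

```python
import random
from collections import Counter


def balanced_negatives(positives, seed=0):
    """Greedily pick negative (drug, protein) pairs so that each drug and each
    protein appears at most as often in negatives as in positives.

    Hypothetical helper: a simple greedy heuristic, not the published scheme.
    """
    rng = random.Random(seed)
    known = set(positives)
    # Remaining number of negative pairs each drug / protein still needs.
    drug_quota = Counter(d for d, _ in positives)
    protein_quota = Counter(p for _, p in positives)
    negatives = set()
    # Serve proteins with the largest quota first; they are the hardest to fill.
    for protein, _ in protein_quota.most_common():
        while protein_quota[protein] > 0:
            candidates = [
                d for d, q in drug_quota.items()
                if q > 0
                and (d, protein) not in known
                and (d, protein) not in negatives
            ]
            if not candidates:
                break  # greedy dead end: this protein stays under-represented
            # Prefer drugs with the most remaining quota; break ties randomly.
            best = max(drug_quota[d] for d in candidates)
            drug = rng.choice([d for d in candidates if drug_quota[d] == best])
            negatives.add((drug, protein))
            drug_quota[drug] -= 1
            protein_quota[protein] -= 1
    return sorted(negatives)


# Toy usage: two known interactions yield two balanced negatives.
print(balanced_negatives([("d1", "p1"), ("d2", "p2")]))
# prints [('d1', 'p2'), ('d2', 'p1')]
```

In the toy call, every drug and every protein occurs once among the positives and once among the negatives, which is the property the abstract describes. When the greedy pass hits a dead end, the result is only approximately balanced.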

A New Big-Data Paradigm for Target Identification and Drug Discovery

10.1101/134973 ◽

2017 ◽

Cited By ~ 9

Author(s):

Neel S. Madhukar ◽

Prashant K. Khade ◽

Linda Huang ◽

Kaitlyn Gayvert ◽

Giuseppe Galletti ◽

...

Keyword(s):

Drug Discovery ◽

Small Molecules ◽

Clinical Application ◽

Small Molecule ◽

Target Identification ◽

Clinical Development ◽

Data Types ◽

Public Data ◽

Drug Target Identification ◽

Approved Drugs

Abstract: Drug target identification is one of the most important aspects of pre-clinical development yet it is also among the most complex, labor-intensive, and costly. This represents a major issue, as lack of proper target identification can be detrimental in determining the clinical application of a bioactive small molecule. To improve target identification, we developed BANDIT, a novel paradigm that integrates multiple data types within a Bayesian machine-learning framework to predict the targets and mechanisms for small molecules with unprecedented accuracy and versatility. Using only public data, BANDIT achieved an accuracy of approximately 90% over 2000 different small molecules – substantially better than any other published target identification platform. We applied BANDIT to a library of small molecules with no known targets and generated ∼4,000 novel molecule-target predictions. From this set we identified and experimentally validated a set of novel microtubule inhibitors, including three with activity on cancer cells resistant to clinically used anti-microtubule therapies. We next applied BANDIT to ONC201 – an active anti-cancer small molecule in clinical development – whose target has remained elusive since its discovery in 2009. BANDIT identified dopamine receptor 2 as the unexpected target of ONC201, a prediction that we experimentally validated. Not only does this open the door for clinical trials focused on target-based selection of patient populations, but it also represents a novel way to target GPCRs in cancer. Additionally, BANDIT identified previously undocumented connections between approved drugs with disparate indications, shedding light onto previously unexplained clinical observations and suggesting new uses of marketed drugs.
Overall, BANDIT represents an efficient and highly accurate platform that can be used as a resource to accelerate drug discovery and direct the clinical application of small molecule therapeutics with improved precision.
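
As a rough illustration of what integrating multiple data types in a Bayesian framework can mean, the sketch below combines per-data-type likelihood ratios into a single posterior probability under a naive independence assumption. It is not BANDIT's implementation: the function name `posterior_probability`, the independence assumption, and the example numbers are all hypothetical.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Combine independent pieces of evidence with Bayes' rule.

    posterior odds = prior odds * product of likelihood ratios,
    computed in log space for numerical stability.
    Hypothetical helper; assumes the data types are conditionally independent.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Toy usage: a 10% prior that two molecules share a target, updated with
# made-up likelihood ratios from three data types (for example structure,
# bioassay, and side-effect similarity).
print(round(posterior_probability(0.10, [3.0, 2.0, 1.5]), 3))
```

Each likelihood ratio above 1 pushes the posterior up and each ratio below 1 pushes it down, so a weak signal in several data types can add up to a confident prediction.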

Machine Learning on Human Muscle Transcriptomic Data for Biomarker Discovery and Tissue-Specific Drug Target Identification

Frontiers in Genetics ◽

10.3389/fgene.2018.00242 ◽

2018 ◽

Vol 9 ◽

Cited By ~ 34

Author(s):

Polina Mamoshina ◽

Marina Volosnikova ◽

Ivan V. Ozerov ◽

Evgeny Putin ◽

Ekaterina Skibina ◽

...

Keyword(s):

Machine Learning ◽

Drug Target ◽

Biomarker Discovery ◽

Target Identification ◽

Human Muscle ◽

Specific Drug ◽

Tissue Specific ◽

Transcriptomic Data ◽

Drug Target Identification

Mobile Collaborative Spectrum Sensing for Heterogeneous Networks: A Bayesian Machine Learning Approach

IEEE Transactions on Signal Processing ◽

10.1109/tsp.2018.2870379 ◽

2018 ◽

Vol 66 (21) ◽

pp. 5634-5647 ◽

Cited By ~ 19

Author(s):

Yizhen Xu ◽

Peng Cheng ◽

Zhuo Chen ◽

Yonghui Li ◽

Branka Vucetic

Keyword(s):

Machine Learning ◽

Spectrum Sensing ◽

Heterogeneous Networks ◽

Learning Approach ◽

Collaborative Spectrum Sensing ◽

Machine Learning Approach ◽

Bayesian Machine Learning

Machine learning prediction of oncology drug targets based on protein and network properties

10.21203/rs.2.15798/v1 ◽

2019 ◽

Author(s):

Zoltan Dezso ◽

Michele Ceccarelli

Keyword(s):

Machine Learning ◽

Clinical Trial ◽

Drug Target ◽

Drug Targets ◽

Validation Dataset ◽

Learning Approach ◽

Biological Functions ◽

Machine Learning Approach ◽

Network Properties ◽

Trial Drug

Abstract. Background: The selection and prioritization of drug targets is a central problem in drug discovery. Computational approaches can leverage the growing number of large-scale human genomics and proteomics data to make in-silico target identification, reducing the cost and the time needed. Results: We developed a machine learning approach to score proteins to generate a druggability score of novel targets. In our model we incorporated 70 protein features, which included properties derived from the sequence, features characterizing protein functions, and network properties derived from the protein-protein interaction network. The advantage of this approach is that it is unbiased: even less studied proteins with limited information about their function can score well, as most of the features are independent of the accumulated literature. We built models on a training set which consists of targets with approved drugs and a negative set of non-drug targets. The machine learning techniques help to identify the most important combination of features differentiating validated targets from non-targets. We validated our predictions on an independent set of clinical trial drug targets, achieving a high accuracy characterized by an AUC of 0.89. Our most predictive features included biological function of proteins, network centrality measures, protein essentiality, tissue specificity, localization and solvent accessibility. Our predictions, based on a small set of 102 validated oncology targets, recovered the majority of known drug targets and identified a novel set of proteins as drug target candidates. Conclusions: We developed a machine learning approach to prioritize proteins according to their similarity to approved drug targets. We have shown that the method proposed is highly predictive on a validation dataset consisting of 277 targets of clinical trial drugs, confirming that our computational approach is an efficient and cost-effective tool for drug target discovery and prioritization. Our predictions were based on oncology targets and cancer relevant biological functions, resulting in significantly higher scores for targets of oncology clinical trial drugs compared to the scores of targets of trial drugs for other indications. Our approach can be used to make indication specific drug-target prediction by combining generic druggability features with indication specific biological functions.
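
The AUC figure quoted in this abstract is the probability that a randomly chosen true target receives a higher score than a randomly chosen non-target. A minimal sketch of that computation follows; the function name `roc_auc` and the toy scores are assumptions for illustration and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (target, non-target) pairs, how often the target scores
    higher; ties count as half a win. Hypothetical helper for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy usage: made-up druggability scores for two targets and two non-targets.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
# prints 0.75
```

The pairwise form is quadratic in the number of examples, which is fine for a validation set of a few hundred targets; a rank-based formula gives the same value faster on larger sets.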

A Review of Recent Advances and Research on Drug Target Identification Methods

Current Drug Metabolism ◽

10.2174/1389200219666180925091851 ◽

2019 ◽

Vol 20 (3) ◽

pp. 209-216 ◽

Cited By ~ 6

Author(s):

Yang Hu ◽

Tianyi Zhao ◽

Ningyi Zhang ◽

Ying Zhang ◽

Liang Cheng

Keyword(s):

Machine Learning ◽

Computational Methods ◽

Drug Target ◽

Drug Targets ◽

Target Identification ◽

Machine Learning Algorithms ◽

Topological Features ◽

Drug Target Identification ◽

Incomplete Datasets ◽

Optimal Set

Background: From a therapeutic viewpoint, understanding how drugs bind and regulate the functions of their target proteins to protect against disease is crucial. The identification of drug targets plays a significant role in drug discovery and in studying the mechanisms of diseases. Therefore, the development of methods to identify drug targets has become a popular issue. Methods: We systematically review the recent work on identifying drug targets from the view of data and method. We compiled several databases that collect data more comprehensively and introduced several commonly used databases. We then divided the methods into two categories, biological experiments and machine learning, each of which is subdivided into different subclasses and described in detail. Results: Machine learning algorithms are the majority of new methods. Generally, an optimal set of features is chosen to predict successful new drug targets with similar properties. The most widely used features include sequence properties, network topological features, structural properties, and subcellular locations. Since various machine learning methods exist, improving their performance requires combining a better subset of features and choosing the appropriate model for the various datasets involved. Conclusion: The application of experimental and computational methods in protein drug target identification has become increasingly popular in recent years. Current biological and computational methods still have many limitations due to unbalanced and incomplete datasets or imperfect feature selection methods.

Machine-Learning Approach Optimizes Well Spacing

Journal of Petroleum Technology ◽

10.2118/0921-0044-jpt ◽

2021 ◽

Vol 73 (09) ◽

pp. 44-45

Author(s):

Chris Carpenter

Keyword(s):

Machine Learning ◽

Uncertainty Quantification ◽

Feature Reduction ◽

Unconventional Reservoirs ◽

Learning Approach ◽

Permian Basin ◽

Well Spacing ◽

Public Data ◽

Machine Learning Approach ◽

Spacing Problem

This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 201698, “Finding a Trend Out of Chaos: A Machine-Learning Approach for Well-Spacing Optimization,” by Zheren Ma, Ehsan Davani, SPE, and Xiaodan Ma, SPE, Quantum Reservoir Impact, et al., prepared for the 2020 SPE Annual Technical Conference and Exhibition, originally scheduled to be held in Denver, Colorado, 5–7 October. The paper has not been peer reviewed.

Data-driven decisions powered by machine-learning (ML) methods are increasing in popularity when optimizing field development in unconventional reservoirs. However, because well performance is affected by many factors, the challenge is to uncover trends within all the noise. By leveraging basin-level knowledge captured by big data sculpting, integrating private and public data with the use of uncertainty quantification, a process the authors describe as augmented artificial intelligence (AI) can provide quick, science-based answers for well spacing and fracturing optimization and can assess the full potential of an asset in unconventional reservoirs. A case study in the Midland Basin is detailed in the complete paper.

Introduction: Augmented AI is a process wherein ML and human expertise are coupled to improve solutions. The augmented AI work flow (Fig. 1) starts with data sculpting, which includes information retrieval; data cleaning and standardization; and smart, deep, and systematic data quality control (QC). Feature engineering generates all relevant parameters entering the ML model. More than 50 features have been generated for this work and categorized. The final step is to perform model tuning and ensemble, evaluating model robustness and generating model explanation and uncertainty quantification.

Geology: The complete paper provides a detailed geological background of the Permian Basin and its Wolfcamp unconventional layer, an organic-rich shale formation with tight reservoir properties.
To find a solution for the multidimensional well-spacing problem in the Permian Basin, multiple sources and types of data were gathered using publicly available sources. The detailed geological attributes, including structure, petrophysics, geochemistry, basin-level features, and cultural information (such as counties or lease boundaries) have been combined in an integrated database to extract and generate features for the ML algorithm. Most attributes are available either in a limited number of wells, mostly vertical, or through the low number of available cored wells across the basin. Therefore, a significant amount of data imputation has been processed with mapping exercises using geostatistical modeling techniques. The mapping process augmented the ML attribute-generation step because these features were distributed in both vertical and lateral dimensions. All horizontal wells within the area of interest across the Permian Basin have been resampled with the logged and mapped information. The geological features also are reengineered into multiple indices to reduce the number of labeled features to include in the ML process. This feature-reduction process also has helped in ranking and selecting the most-important parameters relevant to the well-spacing problem. Here, a key attribute called the shale-oil index was introduced, which is generated for the ML-driven process and is used in understanding the level of contribution of geological sweet spots to well-spacing optimization. In addition, the initial well, reservoir, or laboratory data, including logs, have been normalized before mapping and modeling to eliminate potential bias. This study has focused on Wolfcamp layers; however, both geological and engineering attribute generation work flows used for this practical ML methodology to find optimization solutions for common problems are highly applicable to other unconventional layers, such as Bone Spring or Spraberry.

A Bayesian Machine Learning Approach for Efficient Integrity Management of Steel Lazy Wave Risers

10.1115/1.0000856v ◽

2021 ◽

Author(s):

Seyed Rasoul Hejazi ◽

Andrew Grime ◽

Mark Randolph ◽

Mike Efthymiou

Keyword(s):

Machine Learning ◽

Learning Approach ◽

Machine Learning Approach ◽

Integrity Management ◽

Bayesian Machine Learning

Machine learning prediction of oncology drug targets based on protein and network properties

10.21203/rs.2.15798/v2 ◽

2019 ◽

Author(s):

Zoltan Dezso ◽

Michele Ceccarelli

Keyword(s):

Machine Learning ◽

Clinical Trial ◽

Drug Target ◽

Drug Targets ◽

Validation Dataset ◽

Learning Approach ◽

Biological Functions ◽

Machine Learning Approach ◽

Network Properties ◽

Trial Drug

