Similarity-Based Methods and Machine Learning Approaches for Target Prediction in Early Drug Discovery: Performance and Scope

2020 ◽  
Vol 21 (10) ◽  
pp. 3585 ◽  
Author(s):  
Neann Mathai ◽  
Johannes Kirchmair

Computational methods for predicting the macromolecular targets of drugs and drug-like compounds have evolved into a key technology in drug discovery. However, the established validation protocols leave several key questions about the performance and scope of these methods unaddressed. For example, prediction success rates are commonly reported as averages over all compounds of a test set and do not consider the structural relationship between individual test compounds and the training instances. To obtain a better understanding of the value of ligand-based methods for target prediction, we benchmarked a similarity-based method and a random forest-based machine learning approach (both employing 2D molecular fingerprints) under three testing scenarios: a standard testing scenario with external data, a standard time-split scenario, and a scenario designed to most closely resemble real-world conditions. In addition, we deconvoluted the results based on the distances of the individual test molecules from the training data. Surprisingly, we found that the similarity-based approach generally outperformed the machine learning approach in all testing scenarios, even where queries were structurally clearly distinct from the instances in the training (or reference) data, and despite its much higher coverage of the known target space.
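
A minimal sketch of the two strategies being compared, assuming RDKit and scikit-learn and a toy reference set; it illustrates the general setup (nearest-neighbour Tanimoto assignment vs. a random forest over the same 2D fingerprints), not the authors' exact benchmark pipeline:

```python
# Similarity-based vs. random forest target prediction over Morgan fingerprints.
# Reference data and target labels below are placeholders for illustration.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def fingerprint(smiles, radius=2, n_bits=2048):
    """2D Morgan (circular) fingerprint, both as RDKit bit vector and numpy array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return fp, arr

# Hypothetical reference set: compounds annotated with a known target
reference = [("CCOC(=O)c1ccccc1", "target_A"), ("CCN(CC)CC", "target_B")]
ref_fps, ref_targets = zip(*[(fingerprint(s)[0], t) for s, t in reference])

def predict_by_similarity(query_smiles):
    """Assign the target of the most Tanimoto-similar reference compound."""
    qfp, _ = fingerprint(query_smiles)
    sims = [DataStructs.TanimotoSimilarity(qfp, r) for r in ref_fps]
    best = int(np.argmax(sims))
    return ref_targets[best], sims[best]  # similarity doubles as a confidence score

# Machine learning counterpart: a random forest trained on the same fingerprints
X = np.array([fingerprint(s)[1] for s, _ in reference])
y = [t for _, t in reference]
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

Because the nearest-neighbour similarity is returned alongside the prediction, performance can be deconvoluted by the distance of each query from the reference data, in the spirit of the study's analysis.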

2017 ◽  
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to Word2vec models, where vectors of closely related words lie in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can then be encoded as vectors by summing the vectors of their individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. Its prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained with Morgan fingerprints as the reference compound representation. Mol2vec can easily be combined with ProtVec, which applies the same Word2vec concept to protein sequences, resulting in a proteochemometric approach that is alignment-independent and can thus also easily be used for proteins with low sequence similarities.
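
A brief sketch of the idea, assuming RDKit for Morgan substructure enumeration and gensim for Word2vec; this illustrates the concept rather than the released Mol2vec implementation:

```python
# Mol2vec-style embedding: treat Morgan substructure identifiers as "words",
# a compound as a "sentence", and learn embeddings with Word2vec (gensim).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from gensim.models import Word2Vec

def mol_sentence(smiles, radius=1):
    """Morgan substructure identifiers (as strings), ordered by atom then radius."""
    mol = Chem.MolFromSmiles(smiles)
    info = {}
    AllChem.GetMorganFingerprint(mol, radius, bitInfo=info)
    # info maps identifier -> ((atom_idx, radius), ...); index words per atom/radius
    words = {}
    for identifier, occurrences in info.items():
        for atom_idx, rad in occurrences:
            words[(atom_idx, rad)] = str(identifier)
    return [words[k] for k in sorted(words)]

# "Corpus of compounds": in practice, all available chemical matter
corpus = [mol_sentence(s) for s in ["CCO", "CCN", "c1ccccc1O"]]
model = Word2Vec(corpus, vector_size=100, window=10, min_count=1, sg=1)

def mol2vec(smiles):
    """Compound vector = sum of its substructure vectors (unseen words skipped)."""
    return np.sum([model.wv[w] for w in mol_sentence(smiles) if w in model.wv], axis=0)
```

The summed compound vectors are dense and fixed-length, so they can be fed directly into any supervised learner, avoiding the sparseness and bit collisions of hashed fingerprints.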


2019 ◽  
Author(s):  
Oskar Flygare ◽  
Jesper Enander ◽  
Erik Andersson ◽  
Brjánn Ljótsson ◽  
Volen Z Ivanov ◽  
...  

**Background:** Previous attempts to identify predictors of treatment outcomes in body dysmorphic disorder (BDD) have yielded inconsistent findings. One way to increase precision and clinical utility could be to use machine learning methods, which can incorporate multiple non-linear associations in prediction models. **Methods:** This study used a random forests machine learning approach to test whether it is possible to reliably predict remission from BDD in a sample of 88 individuals who had received internet-delivered cognitive behavioral therapy for BDD. The random forest models were compared to traditional logistic regression analyses. **Results:** Random forests correctly identified 78% of participants as remitters or non-remitters at post-treatment. The accuracy of prediction was lower at subsequent follow-ups (68%, 66%, and 61% correctly classified at 3-, 12-, and 24-month follow-ups, respectively). Depressive symptoms, treatment credibility, working alliance, and initial severity of BDD were among the most important predictors at the beginning of treatment. By contrast, the logistic regression models did not identify consistent and strong predictors of remission from BDD. **Conclusions:** The results provide initial support for the clinical utility of machine learning approaches in the prediction of outcomes of patients with BDD. **Trial registration:** ClinicalTrials.gov ID: NCT02010619.
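
An illustrative comparison of the two model families described above, using scikit-learn; the predictor names come from the abstract, but the data here are random placeholders standing in for the clinical sample:

```python
# Random forest vs. logistic regression for predicting remission (0/1),
# with cross-validated accuracy and random forest variable importances.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
features = ["depressive_symptoms", "treatment_credibility",
            "working_alliance", "initial_bdd_severity"]
X = rng.normal(size=(88, len(features)))   # placeholder predictor matrix
y = rng.integers(0, 2, size=88)            # placeholder remission labels

rf = RandomForestClassifier(n_estimators=500, random_state=0)
lr = LogisticRegression(max_iter=1000)
print("RF accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("LR accuracy:", cross_val_score(lr, X, y, cv=5).mean())

# Variable importances, the quantity used to rank predictors in the study
rf.fit(X, y)
for name, imp in sorted(zip(features, rf.feature_importances_), key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")
```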


Author(s):  
Jeffrey G Klann ◽  
Griffin M Weber ◽  
Hossein Estiri ◽  
Bertrand Moal ◽  
Paul Avillach ◽  
...  

**Introduction:** The Consortium for Clinical Characterization of COVID-19 by EHR (4CE) is an international collaboration addressing COVID-19 with federated analyses of electronic health record (EHR) data. **Objective:** We sought to develop and validate a computable phenotype for COVID-19 severity. **Methods:** Twelve 4CE sites participated. First, we developed an EHR-based severity phenotype consisting of six code classes, and we validated it on patient hospitalization data from the 12 4CE clinical sites against the outcomes of ICU admission and/or death. We also piloted an alternative machine-learning approach and compared selected predictors of severity to the 4CE phenotype at one site. **Results:** The full 4CE severity phenotype had a pooled sensitivity of 0.73 and specificity of 0.83 for the combined outcome of ICU admission and/or death. The sensitivity of individual code categories for acuity varied widely across sites, by up to 0.65. At one pilot site, the expert-derived phenotype had a mean AUC of 0.903 (95% CI: 0.886, 0.921), compared to an AUC of 0.956 (95% CI: 0.952, 0.959) for the machine-learning approach. Billing codes were poor proxies of ICU admission, with precision and recall as low as 49% compared to chart review. **Discussion:** We developed a severity phenotype using six code classes that proved resilient to coding variability across international institutions. In contrast, machine-learning approaches may overfit hospital-specific orders. Manual chart review revealed discrepancies even in the gold-standard outcomes, possibly due to heterogeneous pandemic conditions. **Conclusion:** We developed an EHR-based severity phenotype for COVID-19 in hospitalized patients and validated it at 12 international sites.
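
A minimal sketch of how a code-class phenotype of this kind can be expressed and validated; the code-class contents below are placeholders, not the actual 4CE value sets:

```python
# Rule-based severity phenotype: a patient is flagged severe if any code from
# one of the six classes appears during hospitalization; validation computes
# sensitivity/specificity against ICU admission and/or death labels.
SEVERITY_CODE_CLASSES = {
    "icu_medications": {"RXNORM:1234"},       # placeholder identifiers
    "ventilation_procedures": {"ICD10PCS:5A1935Z"},
    # ... four further code classes in the real phenotype
}

def is_severe(patient_codes: set) -> bool:
    return any(patient_codes & codes for codes in SEVERITY_CODE_CLASSES.values())

def sensitivity_specificity(patients, outcomes):
    """patients: list of code sets; outcomes: list of bools (ICU and/or death)."""
    tp = sum(is_severe(p) and o for p, o in zip(patients, outcomes))
    fn = sum(not is_severe(p) and o for p, o in zip(patients, outcomes))
    tn = sum(not is_severe(p) and not o for p, o in zip(patients, outcomes))
    fp = sum(is_severe(p) and not o for p, o in zip(patients, outcomes))
    return tp / (tp + fn), tn / (tn + fp)
```

Because the rule is a simple union over code classes, it transfers across sites that code differently within a class, which is the resilience property the abstract highlights.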


2021 ◽  
Vol 7 (1) ◽  
pp. 16-19
Author(s):  
Owes Khan ◽  
Geri Shahini ◽  
Wolfram Hardt

Automotive technologies are increasingly becoming digital. Highly autonomous driving, together with digital E/E control mechanisms, involves thousands of software applications, known as software components. Given industry requirements and rigorous software development processes, mapping components into a software pool becomes very difficult. This article analyses and discusses the possibilities of integrating machine learning approaches into our previously introduced concept of mapping software components through a common software pool.


2021 ◽  
Author(s):  
George Chang ◽  
Nathaniel Woody ◽  
Christopher Keefer

Lipophilicity is a fundamental structural property that influences almost every aspect of drug discovery. Within Pfizer, we have two complementary high-throughput screens for measuring lipophilicity as a distribution coefficient (LogD): a miniaturized shake-flask method (SFLogD) and a chromatographic method (ELogD). The results from these two assays are not the same (see Figure 1), with each assay being applicable or more reliable in particular chemical spaces. In addition to LogD assays, the ability to predict the LogD value of virtual compounds is equally vital. Here we present an in-silico LogD model, applicable to all chemical spaces, based on the integration of the LogD data from both assays. We developed two approaches towards a single LogD model: a rule-based approach and a machine learning approach. Ultimately, the machine learning LogD model was found to be superior to both internally developed and commercial LogD models.
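
A hedged sketch of one way to fuse two assays into a single predictive model: train a regressor on fingerprints with the assay source as an extra input, so the model can learn assay-specific offsets. This is an illustration under those assumptions, not Pfizer's actual pipeline:

```python
# LogD regression over merged SFLogD/ELogD records, with an assay-flag feature.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def featurize(smiles, assay_is_elogd):
    """Morgan fingerprint plus a flag marking which assay produced the value."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    arr = np.zeros((1024,), dtype=float)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return np.append(arr, float(assay_is_elogd))

# Hypothetical merged training records: (SMILES, assay flag, measured LogD)
records = [("CCOC(=O)c1ccccc1", 0, 1.8), ("c1ccccc1O", 1, 1.5)]
X = np.array([featurize(s, a) for s, a, _ in records])
y = np.array([logd for *_, logd in records])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
```

At prediction time, setting the assay flag selects which assay's scale the virtual compound is scored on, which is one simple way to reconcile two non-identical measurements in a single model.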


Author(s):  
Nimisha Singh ◽  
Rana Gill

Retinal disease is an important issue in the medical field. Diagnosis requires detecting the true retinal area. Artefacts such as eyelids and eyelashes appear alongside the retinal region, so artefact removal is a key task for better diagnosis of disease in the retinal area. In this paper, we propose a segmentation method and use machine learning approaches to detect the true retinal area. Preprocessing is performed on the original image using gamma normalization, which enhances the image and reveals detailed information. Segmentation is then performed on the gamma-normalized image using a superpixel method, which groups pixels into regions based on compactness and region size. Superpixels reduce the complexity of image-processing tasks and provide suitable primitive image patterns. Features are then generated, and a machine learning approach extracts the true retinal area. The experimental evaluation yields good results, with an accuracy of 96%.
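
A sketch of the described pipeline, assuming scikit-image for the gamma and superpixel steps and scikit-learn for the classifier; the file name, parameters, and feature choice are illustrative, not the paper's exact configuration:

```python
# Gamma normalization -> SLIC superpixels -> per-superpixel features -> classifier
# that labels regions as retinal vs. artefact (eyelid/eyelash).
import numpy as np
from skimage import io, exposure, segmentation
from sklearn.ensemble import RandomForestClassifier

image = io.imread("eye.png")                         # hypothetical RGB input
enhanced = exposure.adjust_gamma(image, gamma=0.8)   # gamma normalization

# Group pixels into compact regions of similar size and colour
labels = segmentation.slic(enhanced, n_segments=300, compactness=10)

def superpixel_features(img, labels):
    """Mean intensity per channel for each superpixel, as a simple feature set."""
    feats = []
    for region in np.unique(labels):
        mask = labels == region
        feats.append(img[mask].mean(axis=0))
    return np.array(feats)

features = superpixel_features(enhanced, labels)
# clf = RandomForestClassifier().fit(features, region_labels)  # needs annotations
```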


Author(s):  
Jeffrey G Klann ◽  
Griffin M Weber ◽  
Hossein Estiri ◽  
Bertrand Moal ◽  
Paul Avillach ◽  
...  

**Introduction:** The Consortium for Clinical Characterization of COVID-19 by EHR (4CE) includes hundreds of hospitals internationally that use a federated computational approach to COVID-19 research based on the EHR. **Objective:** We sought to develop and validate a standard definition of COVID-19 severity from readily accessible EHR data across the Consortium. **Methods:** We developed an EHR-based severity algorithm and validated it on patient hospitalization data from 12 4CE clinical sites against the outcomes of ICU admission and/or death. We also used a machine learning approach to compare selected predictors of severity to the 4CE algorithm at one site. **Results:** The 4CE severity algorithm performed with a pooled sensitivity of 0.73 and specificity of 0.83 for the combined outcome of ICU admission and/or death. The sensitivity of single code categories for acuity was unacceptably inconsistent, varying by up to 0.65 across sites. A multivariate machine learning approach identified codes yielding a mean AUC of 0.956 (95% CI: 0.952, 0.959), compared to 0.903 (95% CI: 0.886, 0.921) using expert-derived codes. Billing codes were poor proxies of ICU admission, with precision and recall as low as 49% compared against chart review at one partner institution. **Discussion:** We developed a proxy measure of severity that proved resilient to coding variability internationally by using a set of six code classes. In contrast, machine-learning approaches may tend to overfit hospital-specific orders. Manual chart review revealed discrepancies even in the gold-standard outcomes, possibly due to pandemic conditions. **Conclusion:** We developed an EHR-based algorithm for COVID-19 severity and validated it at 12 international sites.
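
To complement the rule-based sketch above, here is a schematic of the machine-learning comparison: fit a classifier on the full set of observed code features and report AUC with a bootstrap confidence interval. Data shapes and the in-sample evaluation are schematic, not real 4CE data or methodology:

```python
# Multivariate classifier over code features, with bootstrap 95% CI on the AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_with_ci(y_true, y_score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    aucs = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(set(y_true[idx])) < 2:   # resample must contain both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.mean(aucs), np.percentile(aucs, [2.5, 97.5])

# X_all: one binary column per observed code; y: ICU admission and/or death
X_all = np.random.default_rng(1).integers(0, 2, size=(500, 40))
y = np.random.default_rng(2).integers(0, 2, size=500)
scores = LogisticRegression(max_iter=1000).fit(X_all, y).predict_proba(X_all)[:, 1]
print(auc_with_ci(y, scores))   # in practice, score held-out patients instead
```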


Electronics ◽  
2021 ◽  
Vol 10 (18) ◽  
pp. 2208
Author(s):  
Maria Anna Ferlin ◽  
Michał Grochowski ◽  
Arkadiusz Kwasigroch ◽  
Agnieszka Mikołajczyk ◽  
Edyta Szurowska ◽  
...  

Machine learning-based systems are gaining interest in the field of medicine, mostly in medical imaging and diagnosis. In this paper, we address the problem of automatic cerebral microbleed (CMB) detection in magnetic resonance images. The task is challenging because it is difficult to distinguish a true CMB from its mimics; if successfully solved, however, it would streamline radiologists' work. To deal with this complex three-dimensional problem, we propose a machine learning approach based on a 2D Faster R-CNN network. We aimed to achieve a reliable system, i.e., one with balanced sensitivity and precision. Therefore, we researched and analysed, among other factors, the impact of the way the training data are provided to the system, their pre-processing, the choice of model and its structure, and the methods of regularisation. Furthermore, we carefully analysed the network predictions and propose an algorithm for their post-processing. The proposed approach achieved high precision (89.74%), sensitivity (92.62%), and F1 score (90.84%). The paper presents the main challenges connected with automatic cerebral microbleed detection, a deep analysis of the problem, and the developed system. The conducted research may contribute significantly to automatic medical diagnosis.
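
A minimal sketch of a 2D Faster R-CNN detector of this kind using torchvision, with a simple confidence-threshold post-processing step; the pretrained weights, single CMB class, and threshold are illustrative assumptions, not the paper's trained model:

```python
# 2D Faster R-CNN re-headed for a background + CMB class, applied per MRI slice.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
# Replace the box head for 2 classes: background and CMB (requires fine-tuning
# on annotated slices before real use).
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
model.eval()

def detect_cmb(slice_2d: torch.Tensor, score_threshold=0.5):
    """Detect candidates on one slice (3, H, W, floats in [0, 1]) and keep
    only boxes above the confidence threshold (post-processing)."""
    with torch.no_grad():
        out = model([slice_2d])[0]
    keep = out["scores"] >= score_threshold
    return out["boxes"][keep], out["scores"][keep]
```

Running a 2D detector slice by slice keeps the model simple while the post-processing step (here, score thresholding; in the paper, a dedicated algorithm) balances sensitivity against precision.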

