Balancing Data on Proteochemometrics Activity Classification

10.26434/chemrxiv.13634849.v1 ◽

2021 ◽

Author(s):

Angela Lopez-del Rio ◽

Sergio Picart ◽

Alexandre Perera-Lluna

Keyword(s):

Data Augmentation ◽

Prediction Models ◽

In Silico Analysis ◽

Activity Classification ◽

Activity Prediction ◽

Activity Data ◽

Target Activity ◽

Performance Estimates ◽

Silico Analysis ◽

Compound Series

In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand-target activity prediction models. However, bioactivity datasets used in proteochemometrics modeling are usually imbalanced, which could potentially affect the performance of the models. In this work, we explored the effect of different balancing strategies in deep learning proteochemometric target-compound activity classification models while controlling for the compound series bias through clustering. These strategies were: (1) no_resampling, (2) resampling_after_clustering, (3) resampling_before_clustering and (4) semi_resampling. These schemas were evaluated in kinases and GPCRs from BindingDB. We observed that the predicted proportion of positives was driven by the actual data balance in the test set. Additionally, it was confirmed that data balance had an impact on the performance estimates of the proteochemometrics model. We recommend a combination of data augmentation and clustering in the training set (semi_resampling) in order to mitigate the data imbalance effect in a realistic scenario. The code of this analysis is publicly available at https://github.com/b2slab/imbalance_pcm_benchmark.
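The abstract names the four strategies but does not define them, so the following is only a minimal sketch of one plausible reading of semi_resampling: whole compound clusters are assigned to either the training or the test set, and only the training set is rebalanced, leaving the test set at its natural class ratio. The function names, the toy data, and the use of plain random oversampling in place of the authors' data augmentation are all assumptions made for illustration; the authors' actual implementation is in the repository linked above.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_ids, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so no cluster spans both.

    Hypothetical helper: the cluster ids are assumed to come from some
    prior compound clustering step that is not shown here.
    """
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)
    order = sorted(clusters)  # deterministic base order before shuffling
    random.Random(seed).shuffle(order)
    n_test_target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cid in order:
        # Fill the test set first, cluster by cluster, up to the target size.
        if len(test) < n_test_target:
            test.extend(clusters[cid])
        else:
            train.extend(clusters[cid])
    return train, test


def oversample_minority(indices, labels, seed=0):
    """Duplicate minority-class indices at random until classes are balanced.

    Assumption: random oversampling stands in for data augmentation.
    """
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(indices)  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(indices) + extra


# Toy data (invented): 12 target-compound pairs in 6 clusters, 3 actives.
labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
cluster_ids = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

train_idx, test_idx = split_by_cluster(cluster_ids, test_fraction=0.25, seed=1)
balanced_train_idx = oversample_minority(train_idx, labels, seed=1)
```

Rebalancing only `train_idx` keeps the test set's class proportions untouched, which matches the abstract's emphasis on evaluating in a realistic scenario.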


DL-CRISPR: A Deep Learning Method for Off-Target Activity Prediction in CRISPR/Cas9 With Data Augmentation

IEEE Access ◽

10.1109/access.2020.2989454 ◽

2020 ◽

Vol 8 ◽

pp. 76610-76617

Author(s):

Yu Zhang ◽

Yahui Long ◽

Rui Yin ◽

Chee Keong Kwoh

Keyword(s):

Deep Learning ◽

Data Augmentation ◽

Learning Method ◽

Activity Prediction ◽

Target Activity


The ‘SAR Matrix’ method and its extensions for applications in medicinal chemistry and chemogenomics

F1000Research ◽

10.12688/f1000research.4185.1 ◽

2014 ◽

Vol 3 ◽

pp. 113 ◽

Cited By ~ 12

Author(s):

Disha Gupta-Ostermann ◽

Jürgen Bajorath

Keyword(s):

Pharmaceutical Research ◽

Design Activity ◽

Data Sets ◽

Activity Prediction ◽

Activity Spaces ◽

Structure Activity ◽

Different Types ◽

Computational Methodology ◽

Target Activity ◽

Compound Series

We describe the ‘Structure-Activity Relationship (SAR) Matrix’ (SARM) methodology, which is based upon a special two-step application of the matched molecular pair (MMP) formalism. The SARM method was originally designed for the extraction, organization, and visualization of compound series and associated SAR information from compound data sets. It has been further developed and adapted for other applications including compound design, activity prediction, library extension, and the navigation of multi-target activity spaces. The SARM approach and its extensions are presented here in context to introduce different types of applications and provide an example for the evolution of a computational methodology in pharmaceutical research.


A systematic review with in silico analysis on transcriptomic profile of gallbladder carcinoma

Seminars in Oncology ◽

10.1053/j.seminoncol.2020.02.012 ◽

2020 ◽

Vol 47 (6) ◽

pp. 398-408

Author(s):

Sonam Tulsyan ◽

Showket Hussain ◽

Balraj Mittal ◽

Sundeep Singh Saluja ◽

Pranay Tanwar ◽

...

Keyword(s):

Systematic Review ◽

Gallbladder Carcinoma ◽

In Silico ◽

In Silico Analysis ◽

Transcriptomic Profile ◽

Silico Analysis


In Silico Analysis of Single Nucleotide Polymorphism in Human Prothrombin Gene

10.1055/s-0039-1680245 ◽

2019 ◽

Author(s):

I. Farah ◽

A. El-Mubark ◽

M. Osman ◽

A. Soliman ◽

F. Ali ◽

...

Keyword(s):

Single Nucleotide Polymorphism ◽

In Silico ◽

In Silico Analysis ◽

Nucleotide Polymorphism ◽

Single Nucleotide ◽

Prothrombin Gene ◽

Silico Analysis


Enalos Suite of Tools: Enhancing Cheminformatics and Nanoinformatics through KNIME

Current Medicinal Chemistry ◽

10.2174/0929867327666200727114410 ◽

2020 ◽

Vol 27 (38) ◽

pp. 6523-6535 ◽

Cited By ~ 3

Author(s):

Antreas Afantitis ◽

Andreas Tsoumanis ◽

Georgia Melagraki

Keyword(s):

Data Analysis ◽

Virtual Screening ◽

In Silico ◽

Model Development ◽

In Silico Analysis ◽

Material Design ◽

Efficient Solutions ◽

Biological Data ◽

Cost Efficient ◽

Silico Analysis

Drug discovery as well as (nano)material design projects demand the in silico analysis of large datasets of compounds with their corresponding properties/activities, as well as the retrieval and virtual screening of more structures in an effort to identify new potent hits. This is a demanding procedure for which various tools with different input and output formats must be combined. To automate the required data analysis, we have developed tools that facilitate a variety of important tasks and allow the construction of workflows that simplify the handling, processing and modeling of cheminformatics data and provide time- and cost-efficient solutions that are reproducible and easier to maintain. We therefore develop and present a toolbox of >25 processing modules, Enalos+ nodes, that provide very useful operations within the KNIME platform for users interested in the nanoinformatics and cheminformatics analysis of chemical and biological data. With a user-friendly interface, Enalos+ nodes provide a broad range of important functionalities including data mining and retrieval from large available databases and tools for robust and predictive model development and validation. Enalos+ nodes are available through KNIME as add-ins and offer valuable tools for extracting useful information and analyzing experimental and virtual screening results in a chem- or nanoinformatics framework. On top of that, in an effort to (i) allow big data analysis through Enalos+ KNIME nodes, (ii) accelerate time-demanding computations performed within Enalos+ KNIME nodes and (iii) propose new time- and cost-efficient nodes integrated within the Enalos+ toolbox, we have investigated and verified the advantage of GPU calculations within the Enalos+ nodes. Demonstration data sets, tutorials and educational videos allow the user to easily understand the functions of the nodes that can be applied for in silico analysis of data.
