Balancing Data on Proteochemometrics Activity Classification

Author(s):  
Angela Lopez-del Rio ◽  
Sergio Picart ◽  
Alexandre Perera-Lluna

In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand-target activity prediction models. However, bioactivity datasets used in proteochemometrics modeling are usually imbalanced, which could potentially affect the performance of the models. In this work, we explored the effect of different balancing strategies in deep learning proteochemometric target-compound activity classification models while controlling for the compound series bias through clustering. These strategies were: (1) no_resampling, (2) resampling_after_clustering, (3) resampling_before_clustering and (4) semi_resampling. These schemas were evaluated in kinases and GPCRs from BindingDB. We observed that the predicted proportion of positives was driven by the actual data balance in the test set. Additionally, it was confirmed that data balance had an impact on the performance estimates of the proteochemometrics model. We recommend a combination of data augmentation and clustering in the training set (semi_resampling) in order to mitigate the data imbalance effect in a realistic scenario. The code of this analysis is publicly available at https://github.com/b2slab/imbalance_pcm_benchmark.
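
The abstract does not spell out how each strategy is implemented, so the following is only a minimal illustrative sketch of the general idea behind cluster-aware splitting combined with training-set-only balancing. It is not the authors' code (that is available at the repository linked above). The function names `split_by_cluster` and `oversample_minority`, the toy labels and cluster assignments, and the use of plain random oversampling as a stand-in for data augmentation are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def split_by_cluster(cluster_ids, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so that no cluster spans both.

    Hypothetical helper: keeping compound clusters intact is one way to
    limit compound series bias between training and test data.
    """
    rng = random.Random(seed)
    clusters = sorted(set(cluster_ids))
    rng.shuffle(clusters)
    n_test = max(1, round(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train_idx = [i for i, c in enumerate(cluster_ids) if c not in test_clusters]
    test_idx = [i for i, c in enumerate(cluster_ids) if c in test_clusters]
    return train_idx, test_idx


def oversample_minority(indices, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes are equal in size.

    Assumption: simple random oversampling stands in for whatever
    augmentation scheme a real pipeline would use.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    if not by_class:
        return []
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Toy data (made up): 1 = active, 0 = inactive; five compound clusters.
labels = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1]
cluster_ids = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]

train_idx, test_idx = split_by_cluster(cluster_ids, test_fraction=0.2, seed=0)
balanced_train = oversample_minority(train_idx, labels, seed=0)

print("train class counts:", Counter(labels[i] for i in balanced_train))
print("test class counts: ", Counter(labels[i] for i in test_idx))
```

In this sketch only the training indices are rebalanced, while the test set keeps its original class proportions, which loosely mirrors the "realistic scenario" the abstract refers to.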

2021 ◽  


F1000Research ◽  
2014 ◽  
Vol 3 ◽  
pp. 113 ◽  
Author(s):  
Disha Gupta-Ostermann ◽  
Jürgen Bajorath

We describe the ‘Structure-Activity Relationship (SAR) Matrix’ (SARM) methodology, which is based upon a special two-step application of the matched molecular pair (MMP) formalism. The SARM method was originally designed for the extraction, organization, and visualization of compound series and associated SAR information from compound data sets. It has since been further developed and adapted for other applications, including compound design, activity prediction, library extension, and the navigation of multi-target activity spaces. The SARM approach and its extensions are presented here in context to introduce different types of applications and to provide an example of the evolution of a computational methodology in pharmaceutical research.


2020 ◽  
Vol 47 (6) ◽  
pp. 398-408
Author(s):  
Sonam Tulsyan ◽  
Showket Hussain ◽  
Balraj Mittal ◽  
Sundeep Singh Saluja ◽  
Pranay Tanwar ◽  
...  

2020 ◽  
Vol 27 (38) ◽  
pp. 6523-6535 ◽  
Author(s):  
Antreas Afantitis ◽  
Andreas Tsoumanis ◽  
Georgia Melagraki

Drug discovery and (nano)material design projects demand the in silico analysis of large datasets of compounds with their corresponding properties/activities, as well as the retrieval and virtual screening of further structures in an effort to identify new potent hits. This is a demanding procedure for which various tools with different input and output formats must be combined. To automate the required data analysis, we have developed tools for constructing workflows that simplify the handling, processing and modeling of cheminformatics data and provide time- and cost-efficient solutions that are reproducible and easier to maintain. We present a toolbox of >25 processing modules, Enalos+ nodes, that provide useful operations within the KNIME platform for users interested in the nanoinformatics and cheminformatics analysis of chemical and biological data. With a user-friendly interface, Enalos+ nodes provide a broad range of functionalities, including data mining and retrieval from large available databases and tools for robust and predictive model development and validation. Enalos+ nodes are available through KNIME as add-ins and offer tools for extracting useful information and analyzing experimental and virtual screening results in a chem- or nanoinformatics framework. In addition, in an effort to (i) allow big data analysis through Enalos+ KNIME nodes, (ii) accelerate time-demanding computations performed within Enalos+ KNIME nodes and (iii) propose new time- and cost-efficient nodes integrated within the Enalos+ toolbox, we have investigated and verified the advantage of GPU calculations within the Enalos+ nodes. Demonstration data sets, tutorials and educational videos allow the user to easily understand the functions of the nodes that can be applied for in silico analysis of data.


2013 ◽  
Vol 9 (4) ◽  
pp. 608-616 ◽  
Author(s):  
Zaheer Ul-Haq ◽  
Saman Usmani ◽  
Uzma Mahmood ◽  
Mariya al-Rashida ◽  
Ghulam Abbas
