LIBRUS: combined machine learning and homology information for sequence-based ligand-binding residue prediction

Abstract

Motivation: Knowledge of protein–ligand binding residues is important for understanding the functions of proteins and their interaction mechanisms. Accurately identifying the potential binding sites of a specific ligand on a protein from experimentally solved structures remains a challenging problem. Compared with structure-alignment-based methods, machine learning algorithms provide a flexible alternative that is less dependent on annotated homogeneous protein structures. Several factors are important for an efficient protein–ligand prediction model, e.g. a discriminative feature representation and a learning architecture that can handle data that are both large-scale and severely imbalanced.

Results: In this study, we propose a novel deep-learning-based method called DELIA for protein–ligand binding residue prediction. In DELIA, a hybrid deep neural network is designed to integrate 1D sequence-based features with 2D structure-based amino acid distance matrices. To overcome the severe data imbalance between binding and nonbinding residues, oversampling in mini-batches, random undersampling and a stacking ensemble are used to enhance the model. Experimental results on five benchmark datasets demonstrate the effectiveness of the proposed DELIA pipeline.

Availability and implementation: The web server of DELIA is available at www.csbio.sjtu.edu.cn/bioinf/delia/.

Supplementary information: Supplementary data are available at Bioinformatics online.
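The two sampling strategies named in the abstract can be illustrated with a minimal sketch. This is not DELIA's implementation: the function names (`undersample`, `balanced_batches`), the negatives-per-positive ratio, the batch size and the positive fraction are all assumptions chosen for illustration, and the stacking ensemble is not shown.

```python
import random

def undersample(labels, neg_per_pos, rng):
    """Random undersampling: keep every positive and a random subset of
    negatives, at most `neg_per_pos` negatives per positive (hypothetical ratio)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), neg_per_pos * len(pos))
    return pos + rng.sample(neg, n_keep)

def balanced_batches(labels, indices, batch_size, pos_fraction, rng):
    """Oversampling in mini-batches: each batch takes fresh negatives and
    draws positives with replacement so they fill a fixed share of the batch.
    Assumes batch_size >= 2 and at least one positive among `indices`."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    n_pos = min(batch_size - 1, max(1, int(batch_size * pos_fraction)))
    n_neg = batch_size - n_pos
    rng.shuffle(neg)
    for start in range(0, len(neg) - n_neg + 1, n_neg):
        batch = neg[start:start + n_neg] + rng.choices(pos, k=n_pos)
        rng.shuffle(batch)
        yield batch

# Toy data: 5 binding residues (label 1) among 100 residues.
rng = random.Random(0)
labels = [1] * 5 + [0] * 95
kept = undersample(labels, neg_per_pos=4, rng=rng)
batches = list(balanced_batches(labels, kept, batch_size=8, pos_fraction=0.5, rng=rng))
```

In this toy setting the undersampled set holds 25 residues and every batch of 8 contains 4 binding residues, so the minority class is seen in each gradient step even though it makes up 5% of the raw data.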

Ligand-Binding Residue Prediction

Introduction to Protein Structure Prediction ◽

10.1002/9780470882207.ch16 ◽

2010 ◽

pp. 343-368 ◽

Cited By ~ 2

Author(s):

Chris Kauffman ◽

George Karypis

Keyword(s):

Ligand Binding ◽

Binding Residue ◽

Binding Residue Prediction

A simple iterative method to optimize protein–ligand-binding residue prediction

Journal of Theoretical Biology ◽

10.1016/j.jtbi.2012.10.028 ◽

2013 ◽

Vol 317 ◽

pp. 219-223 ◽

Cited By ~ 3

Author(s):

Zhijun Qiu ◽

Cuili Qin ◽

Min Jiu ◽

Xicheng Wang

Keyword(s):

Iterative Method ◽

Ligand Binding ◽

Binding Residue ◽

Binding Residue Prediction

Feature-incorporated alignment based ligand-binding residue prediction for carbohydrate-binding modules

Bioinformatics ◽

10.1093/bioinformatics/btq084 ◽

2010 ◽

Vol 26 (8) ◽

pp. 1022-1028 ◽

Cited By ~ 6

Author(s):

Wei-Yao Chou ◽

Wei-I Chou ◽

Tun-Wen Pai ◽

Shu-Chuan Lin ◽

Ting-Ying Jiang ◽

...

Keyword(s):

Ligand Binding ◽

Carbohydrate Binding ◽

Binding Residue ◽

Carbohydrate Binding Modules ◽

Binding Residue Prediction

Ollivier Persistent Ricci Curvature-Based Machine Learning for the Protein–Ligand Binding Affinity Prediction

Journal of Chemical Information and Modeling ◽

10.1021/acs.jcim.0c01415 ◽

2021 ◽

Author(s):

JunJie Wee ◽

Kelin Xia

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Binding Affinity ◽

Ricci Curvature ◽

Binding Affinity Prediction ◽

Affinity Prediction

Protein-DNA Binding Residue Prediction via Bagging Strategy and Sequence-based Cube-Format Feature

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2021.3123828 ◽

2021 ◽

pp. 1-1

Author(s):

Jun Hu ◽

Yan-Song Bai ◽

Lin-Lin Zheng ◽

Ning-Xin Jia ◽

Dong-Jun Yu ◽

...

Keyword(s):

Dna Binding ◽

Binding Residue ◽

Binding Residue Prediction

Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction

International Journal for Numerical Methods in Biomedical Engineering ◽

10.1002/cnm.2914 ◽

2017 ◽

Vol 34 (2) ◽

pp. e2914 ◽

Cited By ~ 43

Author(s):

Zixuan Cang ◽

Guo-Wei Wei

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Binding Affinity ◽

Persistent Homology ◽

Binding Affinity Prediction ◽

Affinity Prediction

RASPD+: Fast Protein-Ligand Binding Free Energy Prediction Using Simplified Physicochemical Features

10.26434/chemrxiv.12636704.v1 ◽

2020 ◽

Cited By ~ 1

Author(s):

Stefan Holderbach ◽

Lukas Adam ◽

Bhyravabhotla Jayaram ◽

Rebecca Wade ◽

Goutam Mukherjee

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Protein Binding ◽

Binding Affinity ◽

Binding Free Energy ◽

Rapid Screening ◽

Scoring Functions ◽

Energy Prediction ◽

Molecular Features ◽

Physicochemical Descriptors

The virtual screening of large numbers of compounds against target protein binding sites has become an integral component of drug discovery workflows. This screening is often done by computationally docking ligands into a protein binding site of interest, but this has the drawback that a large number of poses must be evaluated to obtain accurate estimates of protein-ligand binding affinity. Here we introduce a fast prefiltering method for ligand prioritization that is based on a set of machine learning models and uses simple pose-invariant physicochemical descriptors of the ligands and the protein binding pocket. Our method, Rapid Screening with Physicochemical Descriptors + machine learning (RASPD+), is trained on PDBbind data and achieves a regression performance that is better than that of the original RASPD method and comparable to that of traditional scoring functions on a range of different test sets, without the need to generate ligand poses. Additionally, we use RASPD+ to identify molecular features important for binding affinity and assess the ability of RASPD+ to enrich active molecules from decoys.
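Enrichment of actives over decoys is commonly summarized with an enrichment factor. The abstract does not say which enrichment measure the authors report, so the sketch below is only one common definition, with a hypothetical function name and toy scores; it assumes that a higher score means stronger predicted binding.

```python
def enrichment_factor(scores, is_active, top_fraction):
    """Ratio of the active rate in the top-ranked fraction to the active
    rate in the whole library. Assumes higher score = better predicted binder."""
    n = len(scores)
    n_top = max(1, round(n * top_fraction))
    # Rank compound indices from best to worst predicted score.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(is_active[i] for i in ranked[:n_top])
    total_actives = sum(is_active)
    return (hits / n_top) / (total_actives / n)

# Toy library: 10 compounds, 2 actives, both ranked at the top.
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05]
is_active = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_top20 = enrichment_factor(scores, is_active, top_fraction=0.2)
```

Here the top 20% of the ranking contains only actives, while actives make up 20% of the library, so the enrichment factor is 5; a value of 1 would correspond to random ranking.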

3D Convolutional Neural Networks and a CrossDocked Dataset for Structure-Based Drug Design

10.26434/chemrxiv.11833323.v2 ◽

2020 ◽

Author(s):

Paul Francoeur ◽

Tomohide Masuda ◽

David R. Koes

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Binding Affinity ◽

Mean Squared Error ◽

Comprehensive Evaluation ◽

Training Data ◽

Learning Approaches ◽

Neural Network Models ◽

Structure Based Drug Design ◽

Affinity Prediction

One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and there is no standard dataset of sufficient size to compare performance between models. We present a new dataset for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank, and perform a comprehensive evaluation of grid-based convolutional neural network models on this dataset. We also demonstrate how the partitioning of the training data and test data can impact the results of models trained with the PDBbind dataset, how performance improves by adding more, lower-quality training data, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best performing model, an ensemble of 5 densely connected convolutional networks, achieves a root mean squared error of 1.42 and Pearson R of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and a 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized dataset for training machine learning models to recognize ligands in non-cognate target structures while also greatly expanding the number of poses available for training. In order to facilitate community adoption of this dataset for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.
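The idea behind clustered cross-validation is that all complexes from one cluster of similar targets land in the same fold, so a model is never tested on a near-duplicate of something it trained on. The sketch below is a generic greedy assignment, not the procedure used to build the CrossDocked2020 splits; the function name, the cluster labels and the complex identifiers are hypothetical.

```python
from collections import defaultdict

def clustered_folds(cluster_of, n_folds):
    """Assign whole clusters to folds: largest cluster first, each into the
    fold that is currently smallest. `cluster_of` maps item -> cluster label."""
    members = defaultdict(list)
    for item, cluster in cluster_of.items():
        members[cluster].append(item)
    folds = [[] for _ in range(n_folds)]
    # Sort by decreasing size, then by label so the result is deterministic.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), c)):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[cluster])
    return folds

# Toy example: six complexes grouped into three target clusters.
cluster_of = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B", "c6": "C"}
folds = clustered_folds(cluster_of, n_folds=2)
```

With these toy inputs cluster A fills one fold and clusters B and C share the other. A random per-complex split would instead scatter each cluster across folds, which is the kind of leakage that tends to inflate measured generalization.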

Learning from Docked Ligands: Ligand-Based Features Rescue Structure-Based Scoring Functions When Trained On Docked Poses

10.26434/chemrxiv.13637756 ◽

2021 ◽

Author(s):

Fergus Boyles ◽

Charlotte M Deane ◽

Garrett Morris

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Crystal Structures ◽

Binding Affinity ◽

Scoring Function ◽

Scoring Functions ◽

Data Set ◽

Core Sets ◽

Strong Performance

Machine learning scoring functions for protein-ligand binding affinity have been found to consistently outperform classical scoring functions when trained and tested on crystal structures of bound protein-ligand complexes. However, it is less clear how these methods perform when applied to docked poses of complexes.

We explore how the use of docked, rather than crystallographic, poses for both training and testing affects the performance of machine learning scoring functions. Using the PDBbind Core Sets as benchmarks, we show that the performance of a structure-based machine learning scoring function trained and tested on docked poses is lower than that of the same scoring function trained and tested on crystallographic poses. We construct a hybrid scoring function by combining both structure-based and ligand-based features, and show that its ability to predict binding affinity using docked poses is comparable to that of purely structure-based scoring functions trained and tested on crystal poses. Despite strong performance on docked poses of the PDBbind Core Sets, we find that our hybrid scoring function fails to generalise to a new data set, demonstrating the need for improved scoring functions and additional validation benchmarks.

Code and data to reproduce our results are available from https://github.com/oxpig/learning-from-docked-poses.
