Hidden Bias in the DUD-E Dataset Leads to Misleading Performance of Deep Learning in Structure-Based Virtual Screening

Author(s):  
Lieyang Chen ◽  
Anthony Cruz ◽  
Steven Ramsey ◽  
Callum J. Dickson ◽  
José S. Duca ◽  
...  

Recently, much effort has been invested in using convolutional neural network (CNN) models trained on 3D structural images of protein-ligand complexes to distinguish binding from non-binding ligands for virtual screening. However, the dearth of reliable protein-ligand X-ray structures and binding affinity data has required the use of constructed datasets for the training and evaluation of CNN molecular recognition models. Here, we outline various sources of bias in one such widely used dataset, the Directory of Useful Decoys: Enhanced (DUD-E). We have constructed and performed tests to investigate whether CNN models developed using DUD-E are properly learning the underlying physics of molecular recognition, as intended, or are instead learning biases inherent in the dataset itself. We find that the superior enrichment efficiency of CNN models can be attributed to the analogue and decoy bias hidden in the DUD-E dataset rather than to successful generalization of the pattern of protein-ligand interactions. Comparing additional deep learning models trained on PDBbind datasets, we found that their enrichment performance on DUD-E is not superior to that of the docking program AutoDock Vina. Together, these results suggest that biases that could be present in constructed datasets should be thoroughly evaluated before the datasets are applied to machine-learning-based methodology development.
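The enrichment comparisons described above can be illustrated with a minimal sketch of an enrichment factor calculation. The abstract does not state which enrichment measure was used, so the function name `enrichment_factor`, the 1% default cutoff, and the toy scores are assumptions chosen for illustration only.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor at the top `fraction` of a ranked screening list.

    scores: one score per compound, higher meaning "more likely to bind".
            (Assumption: for docking energies such as AutoDock Vina's, where
            more negative is better, pass the negated energies.)
    labels: 1 for an active (binder), 0 for a decoy, same order as scores.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    total_actives = sum(labels)
    if total_actives == 0:
        raise ValueError("at least one active is required")

    n = len(scores)
    n_top = max(1, int(round(n * fraction)))  # size of the top-ranked subset

    # Rank compounds from best to worst score and count actives in the top subset.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])

    # Ratio of the active rate in the top subset to the active rate overall.
    return (hits_top / n_top) / (total_actives / n)


# Toy example (hypothetical data): 2 actives among 10 compounds, both ranked first.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_20 = enrichment_factor(toy_scores, toy_labels, fraction=0.2)
```

A value of 1 corresponds to random ranking. A high value measures only how well a model separates the labeled actives from the decoys, so by itself it cannot tell whether that separation comes from protein-ligand interactions or from dataset bias.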


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11261
Author(s):  
Asim Kumar Bepari ◽  
Hasan Mahmud Reza

Background: The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has ravaged lives across the globe since December 2019, and new cases are still on the rise. The ongoing suffering motivates scientists to develop safe and effective remedies for this deadly viral disease. While repurposing existing FDA-approved drugs remains the front-line strategy, exploring drug candidates among synthetic and natural compounds is also a viable alternative. This study employed a comprehensive computational approach to screen inhibitors of SARS-CoV-2 3CL-PRO (also known as the main protease), a prime molecular target for treating coronavirus diseases.
Methods: We performed 100 ns GROMACS molecular dynamics simulations of three high-resolution X-ray crystallographic structures of 3CL-PRO. We extracted frames at 10 ns intervals to mimic the conformational diversity of the target protein in biological environments. We then used AutoDock Vina molecular docking to virtually screen the Sigma–Aldrich MyriaScreen Diversity Library II, a rich collection of 10,000 druglike small molecules with diverse chemotypes. Subsequently, we computed physicochemical properties, pharmacokinetic parameters, and toxicity profiles in silico. Finally, we analyzed hydrogen bonding and other protein-ligand interactions for the short-listed compounds.
Results: Over the 100 ns molecular dynamics simulations, the 3CL-PRO crystal structures 6LZE, 6M0K, and 6YB7 showed overall integrity, with mean Cα root-mean-square deviation (RMSD) values of 1.96 (±0.35) Å, 1.98 (±0.21) Å, and 1.94 (±0.25) Å, respectively. Average root-mean-square fluctuation (RMSF) values were 1.21 ± 0.79 (6LZE), 1.12 ± 0.72 (6M0K), and 1.11 ± 0.60 (6YB7). After two phases of AutoDock Vina virtual screening of the MyriaScreen Diversity Library II, we prepared a list of the top 20 ligands. We selected four promising leads considering predicted oral bioavailability, druglikeness, and toxicity profiles. These compounds also demonstrated favorable protein-ligand interactions. We then ran 50 ns molecular dynamics simulations for the four selected molecules and the reference ligand 11a in the crystallographic structure 6LZE. Analysis of RMSF, RMSD, and hydrogen bonding along the simulation trajectories indicated that S51765 would form a more stable protein-ligand complex with 3CL-PRO than the other molecules. Insights into short-range Coulombic and Lennard-Jones potentials also revealed favorable binding of S51765 with 3CL-PRO.
Conclusion: We identified a potential lead for antiviral drug discovery against the SARS-CoV-2 main protease. Our results will aid global efforts to find safe and effective remedies for COVID-19.
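The mean Cα RMSD values quoted in this abstract summarize a per-frame deviation from a reference structure. The following is a minimal sketch of that calculation, assuming the frames have already been superimposed on the reference. The abstract does not describe the authors' analysis tooling, so the helper names `rmsd` and `mean_and_sd` and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized coordinate sets.

    Each argument is a sequence of (x, y, z) tuples, e.g. Cα positions in Å.
    Assumption: the two structures are already aligned; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equal length")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def mean_and_sd(values):
    """Mean and population standard deviation of per-frame RMSD values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


# Toy example (hypothetical coordinates): every atom is shifted by 1 Å along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
frame_rmsd = rmsd(reference, frame)

# Summarising a series of per-frame values as mean and spread.
summary = mean_and_sd([1.0, 3.0])
```

Whether the reported spread is a population or sample standard deviation is not stated in the abstract; the population form is used here only to keep the sketch short.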


2019 ◽  
Vol 18 (05) ◽  
pp. 1950027 ◽  
Author(s):  
Qiangna Lu ◽  
Lian-Wen Qi ◽  
Jinfeng Liu

Water plays a significant role in determining protein–ligand binding modes, especially when water molecules mediate protein–ligand interactions, and these important water molecules have received increasing attention in recent years. Considering the effects of water molecules has gradually become a routine step in the accurate description of protein–ligand interactions. As a free docking program, AutoDock has been the most widely used for predicting protein–ligand binding modes. However, whether including water molecules in AutoDock improves its docking performance has not been systematically investigated. Here, we incorporate important bridging water molecules into the AutoDock program and systematically investigate the effectiveness of these water molecules in protein–ligand docking. This approach was evaluated using 18 structurally diverse protein–ligand complexes in which several water molecules bridge the protein–ligand interactions. Different treatments of water molecules were tested using fixed and rotatable water molecules, and a considerable improvement in successful docking simulations was found when these water molecules were included. This study illustrates the necessity of including water molecules in AutoDock docking and emphasizes the importance of a proper treatment of water molecules in protein–ligand binding predictions.


2012 ◽  
Vol 2012 ◽  
pp. 1-13 ◽  
Author(s):  
Santhosh K. Venkatesan ◽  
Vikash Kumar Dubey

Structure-based virtual screening of NCI Diversity set II compounds was performed to identify novel inhibitor scaffolds of trypanothione reductase (TR) from Leishmania infantum. The top 50 ranked hits were clustered using the AuPoSOM tool. The majority of the top-ranked compounds were tricyclic. Clustering of hits yielded four major clusters, each comprising a varying number of subclusters that differ in their mode of binding and orientation in the active site. Moreover, for the first time, we report selected alkaloids and dibenzothiazepines as inhibitors of Leishmania infantum TR. The mode of binding observed among the clusters also suggests the probable in vitro inhibition kinetics and aids in defining key interactions that might contribute to the inhibition of enzymatic reduction of T[S]2. The method provides scope for automation and integration into virtual screening processes that employ docking software, for clustering small-molecule inhibitors based on protein-ligand interactions.

