Journal of Cheminformatics

Abstract: Given an objective function that predicts key properties of a molecule, goal-directed de novo molecular design is a useful tool to identify molecules that maximize or minimize said objective function. Nonetheless, a common drawback of these methods is that they tend to design synthetically unfeasible molecules. In this paper we describe a Lamarckian evolutionary algorithm for de novo drug design (LEADD). LEADD attempts to strike a balance between optimization power, synthetic accessibility of designed molecules and computational efficiency. To increase the likelihood of designing synthetically accessible molecules, LEADD represents molecules as graphs of molecular fragments, and limits the bonds that can be formed between them through knowledge-based pairwise atom type compatibility rules. A reference library of drug-like molecules is used to extract fragments, fragment preferences and compatibility rules. A novel set of genetic operators that enforce these rules in a computationally efficient manner is presented. To sample chemical space more efficiently we also explore a Lamarckian evolutionary mechanism that adapts the reproductive behavior of molecules. LEADD has been compared to both standard virtual screening and a comparable evolutionary algorithm using a standardized benchmark suite and was shown to be able to identify fitter molecules more efficiently. Moreover, the designed molecules are predicted to be easier to synthesize than those designed by other evolutionary algorithms.

Uncertainty-aware prediction of chemical reaction yields with graph neural networks

Journal of Cheminformatics ◽

10.1186/s13321-021-00579-z ◽

2022 ◽

Vol 14 (1) ◽

Author(s):

Youngchun Kwon ◽

Dongseon Lee ◽

Youn-Suk Choi ◽

Seokho Kang

Keyword(s):

Neural Network ◽

Neural Networks ◽

Uncertainty Quantification ◽

Chemical Reaction ◽

Predictive Distribution ◽

Data Driven ◽

Permutation Invariance ◽

Molecular Graphs ◽

Benchmark Datasets ◽

Graph Neural Networks

Abstract: In this paper, we present a data-driven method for the uncertainty-aware prediction of chemical reaction yields. The reactants and products in a chemical reaction are represented as a set of molecular graphs. The predictive distribution of the yield is modeled as a graph neural network that directly processes a set of graphs with permutation invariance. Uncertainty-aware learning and inference are applied to the model to make accurate predictions and to evaluate their uncertainty. We demonstrate the effectiveness of the proposed method on benchmark datasets with various settings. Compared to the existing methods, the proposed method improves the prediction and uncertainty quantification performance in most settings.
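
The permutation invariance mentioned in the abstract can be illustrated with a minimal sketch. It assumes each molecular graph has already been reduced to a fixed-length embedding vector and that the set is aggregated by summation; both the function name `sum_readout` and the choice of sum pooling are illustrative assumptions, not the authors' architecture.

```python
from typing import List, Sequence


def sum_readout(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate a set of per-molecule embeddings into one reaction vector.

    Summation is used here as an assumed, order-independent readout:
    shuffling the molecules does not change the result.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    total = [0.0] * dim
    for vec in embeddings:
        if len(vec) != dim:
            raise ValueError("all embeddings must have the same length")
        for k, value in enumerate(vec):
            total[k] += value
    return total
```

Because the readout ignores the order of its inputs, listing the reactants and products in a different order yields the same reaction-level vector, which is the property the abstract refers to.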

HobPre: accurate prediction of human oral bioavailability for small molecules

Journal of Cheminformatics ◽

10.1186/s13321-021-00580-6 ◽

2022 ◽

Vol 14 (1) ◽

Author(s):

Min Wei ◽

Xudong Zhang ◽

Xiaolin Pan ◽

Bo Wang ◽

Changge Ji ◽

...

Keyword(s):

Drug Development ◽

Small Molecules ◽

Oral Bioavailability ◽

Computational Models ◽

Experimental Tests ◽

New Drugs ◽

Drug Candidates ◽

Drug Molecules ◽

Key Factor ◽

Input Variables

Abstract: Human oral bioavailability (HOB) is a key factor in determining the fate of new drugs in clinical trials. HOB is conventionally measured using expensive and time-consuming experimental tests. The use of computational models to evaluate HOB before the synthesis of new drugs will be beneficial to the drug development process. In this study, a total of 1588 drug molecules with HOB data were collected from the literature for the development of a classification model that uses the consensus predictions of five random forest models. The consensus model shows excellent prediction accuracies on two independent test sets, with cutoffs of 20% and 50% used to classify molecules. The analysis of the importance of the input variables allowed the identification of the main molecular descriptors that affect the HOB class value. The model is available as a web server at www.icdrug.com/ICDrug/ADMET for quick assessment of oral bioavailability for small molecules. The results from this study provide an accurate and easy-to-use tool for screening drug candidates based on HOB, which may be used to reduce the risk of failure in the late stages of drug development.
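
A consensus over five classifiers with a percentage cutoff can be sketched as follows. The abstract does not say how the five random forest predictions are combined, so the majority vote below is an assumption, as are the function names and the choice to treat values at or above the cutoff as the positive class.

```python
from typing import Sequence


def label_from_hob(hob_percent: float, cutoff: float) -> int:
    """Binarize a measured HOB value at a cutoff such as 20 or 50 (percent).

    Assumption: values at or above the cutoff form the positive class.
    """
    return 1 if hob_percent >= cutoff else 0


def consensus_class(votes: Sequence[int]) -> int:
    """Majority vote over binary predictions from several models.

    Assumption: simple majority voting; with an odd number of models
    (five in the abstract) a tie cannot occur.
    """
    if not votes:
        raise ValueError("need at least one vote")
    positives = sum(1 for v in votes if v == 1)
    return 1 if positives * 2 > len(votes) else 0
```

For example, a molecule with a measured HOB of 35% would be labeled positive at the 20% cutoff and negative at the 50% cutoff, and three positive votes out of five would give a positive consensus.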

TorsiFlex: an automatic generator of torsional conformers. Application to the twenty proteinogenic amino acids

Journal of Cheminformatics ◽

10.1186/s13321-021-00578-0 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

David Ferro-Costas ◽

Irea Mosquera-Lois ◽

Antonio Fernández-Ramos

Keyword(s):

Amino Acids ◽

Electronic Structure ◽

Energy Surface ◽

Random Search ◽

Torsional Potential ◽

Complete Work ◽

Chemical Knowledge ◽

Validation Tests ◽

User Friendly

Abstract: In this work, we introduce TorsiFlex, user-friendly software written in Python 3 and designed to find all the torsional conformers of flexible acyclic molecules in an automatic fashion. For the mapping of the torsional potential energy surface, the algorithm implemented in TorsiFlex combines two searching strategies: preconditioned and stochastic. The former is a type of systematic search based on chemical knowledge and should be carried out before the stochastic (random) search. The algorithm applies several validation tests to accelerate the exploration of the torsional space. For instance, the optimized structures are stored and this information is used to prevent revisiting these points and their surroundings in future iterations. TorsiFlex operates with a dual-level strategy by which the initial search is carried out at an inexpensive electronic structure level of theory and the located conformers are reoptimized at a higher level. Additionally, the program takes advantage of conformational enantiomerism, when possible. As a case study, and in order to exemplify the effectiveness and capabilities of this program, we have employed TorsiFlex to locate the conformers of the twenty proteinogenic amino acids in their neutral canonical form. TorsiFlex has produced a number of conformers that is roughly double that of the most complete work to date.
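
The idea of storing optimized structures so that they and their surroundings are not revisited can be sketched as a check in torsion-angle space. This is an illustration only: the function names, the per-angle comparison and the 15 degree radius are assumptions and are not taken from the TorsiFlex code.

```python
from typing import Iterable, Sequence


def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def is_near_visited(
    candidate: Sequence[float],
    visited: Iterable[Sequence[float]],
    radius: float = 15.0,  # hypothetical exclusion radius in degrees
) -> bool:
    """Return True if every torsion of the candidate lies within `radius`
    of the corresponding torsion of some stored point."""
    return any(
        all(angle_diff(c, v) <= radius for c, v in zip(candidate, point))
        for point in visited
    )
```

A stochastic search loop could call `is_near_visited` on each randomly generated set of torsions and skip the costly geometry optimization whenever it returns True. The periodic difference matters: 355 degrees and 5 degrees are only 10 degrees apart.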

Splitting chemical structure data sets for federated privacy-preserving machine learning

Journal of Cheminformatics ◽

10.1186/s13321-021-00576-2 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Jaak Simm ◽

Lina Humbeck ◽

Adam Zalewski ◽

Noe Sturm ◽

Wouter Heyndrickx ◽

...

Keyword(s):

Machine Learning ◽

Quality Criteria ◽

Privacy Preserving ◽

Locality Sensitive Hashing ◽

Data Sets ◽

Data Set ◽

Test Set ◽

Chemical Structures ◽

Multiple Partners ◽

Applications Of Machine Learning

Abstract: With the increase in applications of machine learning methods in drug design and related fields, the challenge of designing sound test sets becomes more and more prominent. The goal is a realistic split of chemical structures (compounds) between training, validation and test sets, such that the performance on the test set is a meaningful indicator of the performance in a prospective application. This challenge is interesting and relevant in itself, but becomes even more complex in a federated machine learning approach, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods which provide a splitting of a data set and are applicable in a federated privacy-preserving setting, namely: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). To evaluate these splitting methods we consider the following quality criteria (compared to random splitting): bias in prediction performance, classification label and data imbalance, and similarity distance between the test and training set compounds. The main findings of the paper are that (a) both sphere exclusion clustering and scaffold-based binning result in high-quality splitting of the data sets, and (b) in terms of compute costs, sphere exclusion clustering is very expensive in a federated privacy-preserving setting.
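
Of the three methods, sphere exclusion clustering is the easiest to sketch in a few lines. The version below is a simplified, single-party, leader-style variant that assumes fingerprints are given as sets of on-bit indices and uses Tanimoto distance; the threshold, the function names and the modulo rule for assigning clusters to folds are all illustrative assumptions, and none of the privacy-preserving machinery from the paper is shown.

```python
from typing import FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints given as on-bit sets."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def sphere_exclusion(fps: Sequence[FrozenSet[int]], threshold: float = 0.4) -> List[int]:
    """Greedy clustering: each unassigned compound becomes a centroid and
    claims every later unassigned compound within `threshold` distance."""
    cluster_of = [-1] * len(fps)
    next_id = 0
    for i, centroid in enumerate(fps):
        if cluster_of[i] != -1:
            continue
        cluster_of[i] = next_id
        for j in range(i + 1, len(fps)):
            if cluster_of[j] == -1 and 1.0 - tanimoto(centroid, fps[j]) <= threshold:
                cluster_of[j] = next_id
        next_id += 1
    return cluster_of


def assign_folds(cluster_of: Sequence[int], n_folds: int) -> List[int]:
    """Send whole clusters to folds so similar compounds stay together
    (hypothetical rule: cluster id modulo the number of folds)."""
    return [cid % n_folds for cid in cluster_of]
```

Keeping each cluster inside a single fold is what separates this kind of split from a random one: near-duplicates of a training compound cannot leak into the test set.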

Processing binding data using an open-source workflow

Journal of Cheminformatics ◽

10.1186/s13321-021-00577-1 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Errol L. G. Samuel ◽

Secondra L. Holmes ◽

Damian W. Young

Keyword(s):

Open Source ◽

Protein Ligand Interactions ◽

Open Source Data ◽

Binding Data ◽

Wide Range ◽

Ligand Discovery ◽

Ligand Interactions ◽

Analysis Workflow ◽

Set Up ◽

Programming Knowledge

Abstract: The thermal shift assay (TSA), also known as differential scanning fluorimetry (DSF), thermofluor, and Tm shift, is one of the most popular biophysical screening techniques used in fragment-based ligand discovery (FBLD) to detect protein–ligand interactions. By comparing the thermal stability of a target protein in the presence and absence of a ligand, potential binders can be identified. The technique is easy to set up, has low protein consumption, and can be run on most real-time polymerase chain reaction (PCR) instruments. While data analysis is straightforward in principle, it becomes cumbersome and time-consuming when the screens involve multiple 96- or 384-well plates. There are several approaches that aim to streamline this process, but most involve proprietary software, require programming knowledge, or are designed for specific instrument output files. We therefore developed an analysis workflow implemented in the Konstanz Information Miner (KNIME), a free and open-source data analytics platform, which greatly streamlined our data processing timeline for 384-well plates. The implementation is code-free and freely available to the community for improvement and customization to accommodate a wide range of instrument input files and workflows.
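
The comparison of thermal stability with and without a ligand can be sketched numerically. The example assumes the melting temperature (Tm) is estimated as the midpoint of the steepest rise in the fluorescence curve, which is a common convention but is not stated in the abstract; the function names are hypothetical and this is not the KNIME workflow itself.

```python
from typing import Sequence


def melting_temperature(temps: Sequence[float], fluorescence: Sequence[float]) -> float:
    """Estimate Tm as the midpoint of the interval with the steepest
    positive slope. `temps` must be strictly increasing."""
    if len(temps) != len(fluorescence) or len(temps) < 2:
        raise ValueError("need two or more paired readings")
    best_slope = float("-inf")
    best_tm = temps[0]
    for i in range(len(temps) - 1):
        slope = (fluorescence[i + 1] - fluorescence[i]) / (temps[i + 1] - temps[i])
        if slope > best_slope:
            best_slope = slope
            best_tm = (temps[i] + temps[i + 1]) / 2.0
    return best_tm


def tm_shift(temps: Sequence[float], reference: Sequence[float], with_ligand: Sequence[float]) -> float:
    """Tm of the protein with ligand minus Tm of the protein alone."""
    return melting_temperature(temps, with_ligand) - melting_temperature(temps, reference)
```

A positive shift suggests that the ligand stabilizes the protein, which is the signal used to flag potential binders. Real melt curves are much denser and noisier than a handful of points, so a practical analysis would smooth the data or fit a sigmoidal model first.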

Prediction of small-molecule compound solubility in organic solvents by machine learning algorithms

Journal of Cheminformatics ◽

10.1186/s13321-021-00575-3 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Zhuyifan Ye ◽

Defang Ouyang

Keyword(s):

Machine Learning ◽

Organic Solvents ◽

Structural Features ◽

Machine Learning Algorithms ◽

Solubility Data ◽

Solubility Prediction ◽

Molecular Fingerprints ◽

Different Temperatures ◽

Small Molecule Compound ◽

Experimental Solubility Data

Abstract: Rapid solvent selection is of great significance in chemistry. However, solubility prediction remains a crucial challenge. This study aimed to develop machine learning models that can accurately predict compound solubility in organic solvents. A dataset containing 5081 experimental temperature and solubility data points for compounds in organic solvents was extracted and standardized. Molecular fingerprints were selected to characterize structural features. lightGBM was compared with deep learning and traditional machine learning methods (PLS, Ridge regression, kNN, DT, ET, RF, SVM) to develop models for predicting solubility in organic solvents at different temperatures. Compared to other models, lightGBM exhibited significantly better overall generalization (logS ± 0.20). For unseen solutes, our model gave a prediction accuracy (logS ± 0.59) close to the expected noise level of experimental solubility data. lightGBM revealed the physicochemical relationship between solubility and structural features. Our method enables rapid solvent screening in chemistry and may be applied to solubility prediction in other solvents.

ChemTables: a dataset for semantic classification on tables in chemical patents

Journal of Cheminformatics ◽

10.1186/s13321-021-00568-2 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Zenan Zhai ◽

Christian Druckenbrodt ◽

Camilo Thorne ◽

Saber A. Akhondi ◽

Dat Quoc Nguyen ◽

...

Keyword(s):

Language Processing ◽

Pharmaceutical Research ◽

Network Models ◽

Relevant Information ◽

Classification Task ◽

Neural Network Models ◽

Data Set ◽

Content Type ◽

Semantic Classification ◽

Link Type

Abstract: Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called ChemTables, which consists of 788 chemical patent tables with labels of their content type. We introduce this data set in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT on ChemTables. The best performing model, Table-BERT, achieves a performance of 88.66 micro-averaged F1 score on the table classification task. The ChemTables dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. Code/models evaluated in this work are in a GitHub repository https://github.com/zenanz/ChemTables.
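
The micro-averaged F1 metric quoted in the abstract pools true positives, false positives and false negatives over all classes before computing a single score. A minimal sketch follows; the function name and the class labels in the usage note are made up for illustration and do not come from the ChemTables label set.

```python
from typing import Hashable, Sequence


def micro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Micro-averaged F1: counts are pooled over every class first."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1  # predicted this class, but the truth differs
            elif truth == label:
                fn += 1  # missed an instance of this class
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For instance, `micro_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"])` pools 3 true positives, 1 false positive and 1 false negative. When every table receives exactly one label, the micro-averaged F1 coincides with plain accuracy.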

Exploration and augmentation of pharmacological space via adversarial auto-encoder model for facilitating kinase-centric drug development

Journal of Cheminformatics ◽

10.1186/s13321-021-00574-4 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Xinyu Bai ◽

Yuxin Yin

Keyword(s):

Drug Development ◽

Protein Interactions ◽

Kinase Inhibitors ◽

Prediction Models ◽

Target Prediction ◽

Representation Learning ◽

Training Data ◽

Internal Validation ◽

Data Points ◽

Prediction Problems

Abstract: Predicting compound–protein interactions (CPIs) is of great importance for drug discovery and repositioning, yet it remains challenging, mainly because the sparse nature of CPI matrices results in poor generalization performance. Hence, unlike typical CPI prediction models focused on representation learning or model selection, we propose a deep neural network-based strategy, PCM-AAE, that re-explores and augments the pharmacological space of kinase inhibitors by introducing the adversarial auto-encoder model (AAE) to improve the generalization of the prediction model. To complete the data space, we constructed Ensemble of PCM-AAE (EPA), an ensemble model that quickly and accurately yields quantitative predictions of binding affinity between any human kinase and inhibitor. In rigorous internal validation, EPA showed excellent performance, consistently outperforming the model trained with the imbalanced set, especially for targets with relatively fewer training data points. Improved prediction accuracy of EPA for external datasets enhances its generalization ability, making it possible to gracefully handle previously unseen kinases and inhibitors. EPA showed promising potential when directly applied to virtual screening and off-target prediction, exhibiting its practicality in hit prediction. Our strategy is expected to facilitate kinase-centric drug development, as well as to solve more challenging prediction problems with insufficient data points.

Chemical toxicity prediction based on semi-supervised learning and graph convolutional neural network

Journal of Cheminformatics ◽

10.1186/s13321-021-00570-8 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Jiarui Chen ◽

Yain-Whar Si ◽

Chon-Wai Un ◽

Shirley W. I. Siu

Keyword(s):

Neural Network ◽

Supervised Learning ◽

Superior Performance ◽

Chemical Toxicity ◽

Animal Testing ◽

Toxicity Prediction ◽

Discovery Research ◽

Drug Discovery Research

Abstract: As safety is one of the most important properties of drugs, chemical toxicity prediction has received increasing attention in drug discovery research. Traditionally, researchers rely on in vitro and in vivo experiments to test the toxicity of chemical compounds. However, not only are these experiments time-consuming and costly, but experiments that involve animal testing are increasingly subject to ethical concerns. While traditional machine learning (ML) methods have been used in the field with some success, the limited availability of annotated toxicity data is the major hurdle for further improving model performance. Inspired by the success of semi-supervised learning (SSL) algorithms, we propose a Graph Convolution Neural Network (GCN) to predict chemical toxicity and train the network with the Mean Teacher (MT) SSL algorithm. Using the Tox21 data, our optimal SSL-GCN models for predicting the twelve toxicological endpoints achieve an average ROC-AUC score of 0.757 in the test set, which is a 6% improvement over GCN models trained by supervised learning and conventional ML methods. Our SSL-GCN models also exhibit superior performance when compared to models constructed using the built-in DeepChem ML methods. This study demonstrates that SSL can increase the prediction power of models by learning from unannotated data. The optimal unannotated to annotated data ratio ranges between 1:1 and 4:1. This study demonstrates the success of SSL in chemical toxicity prediction; the same technique is expected to be beneficial to other chemical property prediction tasks by utilizing existing large chemical databases. Our optimal model SSL-GCN is hosted on an online server accessible through: https://app.cbbio.online/ssl-gcn/home.
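
The Mean Teacher algorithm named in the abstract keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights, and penalizes disagreement between the two on unannotated data. The sketch below shows only those two ingredients on plain lists of floats; the function names, the smoothing factor and the mean squared error consistency term are illustrative assumptions and do not reproduce the SSL-GCN training code.

```python
from typing import List, Sequence


def ema_update(teacher: Sequence[float], student: Sequence[float], alpha: float = 0.99) -> List[float]:
    """Move each teacher weight toward the matching student weight:
    teacher = alpha * teacher + (1 - alpha) * student."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def consistency_loss(student_pred: Sequence[float], teacher_pred: Sequence[float]) -> float:
    """Mean squared difference between student and teacher predictions.
    It needs no labels, so it can be computed on unannotated compounds."""
    if len(student_pred) != len(teacher_pred) or not student_pred:
        raise ValueError("need equally sized, non-empty predictions")
    return sum((s - t) ** 2 for s, t in zip(student_pred, teacher_pred)) / len(student_pred)
```

In a training loop the student would be updated by gradient descent on the supervised loss plus a weighted consistency term, and `ema_update` would then be applied to the teacher after each step.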

Published By Springer (Biomed Central Ltd.)

LEADD: Lamarckian evolutionary algorithm for de novo drug design

Uncertainty-aware prediction of chemical reaction yields with graph neural networks

HobPre: accurate prediction of human oral bioavailability for small molecules

TorsiFlex: an automatic generator of torsional conformers. Application to the twenty proteinogenic amino acids

Splitting chemical structure data sets for federated privacy-preserving machine learning

Processing binding data using an open-source workflow

Prediction of small-molecule compound solubility in organic solvents by machine learning algorithms

ChemTables: a dataset for semantic classification on tables in chemical patents

Exploration and augmentation of pharmacological space via adversarial auto-encoder model for facilitating kinase-centric drug development

Chemical toxicity prediction based on semi-supervised learning and graph convolutional neural network
