FP2VEC: a new molecular featurizer for learning molecular properties

Woosung Jeon; Dongsup Kim

doi:10.1093/bioinformatics/btz307

Bioinformatics ◽

10.1093/bioinformatics/btz307 ◽

2019 ◽

Vol 35 (23) ◽

pp. 4979-4985 ◽

Cited By ~ 8

Author(s):

Woosung Jeon ◽

Dongsup Kim

Keyword(s):

Deep Learning ◽

Language Processing ◽

Chemical Compounds ◽

Qsar Model ◽

Quantitative Structure Activity Relationship ◽

Supplementary Information ◽

Learning Technology ◽

Natural Languages ◽

Qsar Models ◽

Benchmark Datasets

Motivation: One of the most successful approaches for predicting the properties of chemical compounds is quantitative structure–activity relationship (QSAR) modeling. The prediction accuracy of QSAR models has recently been greatly improved by deep learning. In particular, newly developed molecular featurizers based on graph convolution operations on molecular graphs significantly outperform the conventional extended connectivity fingerprint (ECFP) features in both classification and regression tasks, indicating that more effective featurizers are critical to fully realizing the power of deep learning techniques. Motivated by the clear analogy between chemical compounds and natural languages, this work develops a new molecular featurizer, FP2VEC, which represents a chemical compound as a set of trainable embedding vectors. Results: To implement and test our new featurizer, we build a QSAR model using a simple convolutional neural network (CNN) architecture that has been successfully used for natural language processing tasks such as sentence classification. By testing our new method on several benchmark datasets, we demonstrate that the combination of FP2VEC and the CNN model can achieve competitive results in many QSAR tasks, especially in classification tasks. We also demonstrate that the FP2VEC model is especially effective for multitask learning. Availability and implementation: FP2VEC is available from https://github.com/wsjeon92/FP2VEC. Supplementary information: Supplementary data are available at Bioinformatics online.
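
The idea of representing a compound as a set of embedding vectors can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that the compound is first encoded as a binary fingerprint and that each set bit is treated like a word index into a lookup table. The function names, the padding convention and the fixed random table are hypothetical. In a real model the table would be a trainable parameter, and the repository linked above is the authoritative implementation.

```python
import random


def fingerprint_to_indices(bits, max_len):
    """Convert a binary fingerprint into a fixed-length list of 1-based bit indices.

    Index 0 is reserved for padding (an assumption of this sketch) so that
    compounds with different numbers of set bits share one length.
    """
    on_bits = [i + 1 for i, b in enumerate(bits) if b][:max_len]
    return on_bits + [0] * (max_len - len(on_bits))


def make_embedding_table(n_bits, dim, seed=0):
    """Build a lookup table with one row per bit index plus a zero padding row.

    The rows are fixed random values here; a trainable model would update them.
    """
    rng = random.Random(seed)
    table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits + 1)]
    table[0] = [0.0] * dim
    return table


def embed(indices, table):
    """Look up one embedding vector per index."""
    return [table[i] for i in indices]


# Toy 10-bit fingerprint (not computed from a real molecule).
toy_fingerprint = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
indices = fingerprint_to_indices(toy_fingerprint, max_len=6)  # [2, 5, 10, 0, 0, 0]
vectors = embed(indices, make_embedding_table(n_bits=10, dim=4))
print(indices, len(vectors), len(vectors[0]))
```

The resulting list of vectors has the same shape as an embedded sentence, which is what allows a sentence-classification CNN to be applied to it.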

Interpretable SMILES-based QSAR model of inhibitory activity of sirtuins 1 and 2

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207323666200902141907 ◽

2020 ◽

Vol 23 ◽

Author(s):

Apilak Worachartcheewan ◽

Alla P. Toropova ◽

Andrey A. Toropov ◽

Reny Pratiwi ◽

Virapong Prachayasittikul ◽

...

Keyword(s):

Histone Deacetylases ◽

Rational Design ◽

Qsar Model ◽

Quantitative Structure Activity Relationship ◽

Sirtuin 1 ◽

Data Set ◽

Functional Roles ◽

Molecular Features ◽

Oecd Principles ◽

Qsar Models

Background: Sirtuin 1 (Sirt1) and sirtuin 2 (Sirt2) are NAD+-dependent histone deacetylases that play important functional roles in the removal of the acetyl group from acetyl-lysine substrates. Because dysregulation of Sirt1 and Sirt2 is considered an etiological cause of diseases, both are attractive target proteins for treatment, and there has been great interest in the development of Sirt1 and Sirt2 inhibitors. Objective: This study compiled the bioactivity data of Sirt1 and Sirt2 for the construction of quantitative structure-activity relationship (QSAR) models in accordance with the OECD principles. Method: Simplified molecular input line entry system (SMILES)-based molecular descriptors were used to characterize the molecular features of inhibitors, while the Monte Carlo method of the CORAL software was employed for multivariate analysis. The data set was subjected to three random splits, each of which separated the data into four subsets: training, invisible training, calibration and external sets. Results: Statistical indices for the evaluation of QSAR models suggested good statistical quality for the models of Sirt1 and Sirt2 inhibitors. Furthermore, a mechanistic interpretation of the molecular substructures responsible for modulating bioactivity (i.e. promoters of increase or decrease of bioactivity) was extracted via the analysis of correlation weights. The analysis revealed molecular features relevant to Sirt1 and Sirt2 inhibitors. Conclusion: It is anticipated that the QSAR models presented herein can be useful as guidelines in the rational design of potential Sirt1 and Sirt2 inhibitors for the treatment of sirtuin-related diseases.

A natural language processing approach based on embedding deep learning from heterogeneous compounds for quantitative structure–activity relationship modeling

Chemical Biology & Drug Design ◽

10.1111/cbdd.13742 ◽

2020 ◽

Vol 96 (3) ◽

pp. 961-972

Author(s):

Khalid Bouhedjar ◽

Abdelbasset Boukelia ◽

Abdelmalek Khorief Nacereddine ◽

Anouar Boucheham ◽

Amine Belaidi ◽

...

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Quantitative Structure Activity Relationship ◽

Structure Activity Relationship ◽

Activity Relationship ◽

Quantitative Structure ◽

Structure Activity ◽

Processing Approach

Astrid

Proceedings of the VLDB Endowment ◽

10.14778/3436905.3436907 ◽

2020 ◽

Vol 14 (4) ◽

pp. 471-484

Author(s):

Suraj Shetiya ◽

Saravanan Thirumuruganathan ◽

Nick Koudas ◽

Gautam Das

Keyword(s):

Deep Learning ◽

Objective Function ◽

Pattern Matching ◽

Language Processing ◽

Language Model ◽

Language Models ◽

Selectivity Estimation ◽

Statistical Correlations ◽

Benchmark Datasets ◽

Traditional Approaches

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries, followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates, resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing, such as embeddings, to encode the strings and use them to train a model. While this is an improvement over traditional approaches, there is still considerable scope for improvement. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to 'abc' than 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.
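
To make the notion of prefix frequency concrete, here is a minimal sketch of the exact counts that a selectivity estimator tries to approximate. It is an exhaustive counting baseline, not Astrid's embedding or language-model method. The toy column and the function names are hypothetical, and the counts are scaled down from the abstract's 1000/800/100 example.

```python
from collections import Counter


def prefix_counts(strings):
    """Count, for every prefix, how many strings in the column start with it."""
    counts = Counter()
    for s in strings:
        for k in range(1, len(s) + 1):
            counts[s[:k]] += 1
    return counts


def prefix_selectivity(counts, prefix, n_rows):
    """Fraction of rows matching the predicate `column LIKE 'prefix%'`."""
    return counts.get(prefix, 0) / n_rows


# Toy column: 8 rows 'abc', 1 row 'abd', 1 row 'ab'.
column = ["abc"] * 8 + ["abd"] + ["ab"]
counts = prefix_counts(column)
for p in ("ab", "abc", "abd"):
    print(p, counts[p], prefix_selectivity(counts, p, len(column)))
```

Exact counting like this stores every prefix and does not scale to large columns, which is why summary structures and learned estimators are used instead.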

DeepPurpose: a deep learning library for drug–target interaction prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa1005 ◽

2020 ◽

Author(s):

Kexin Huang ◽

Tianfan Fu ◽

Lucas M Glass ◽

Marinka Zitnik ◽

Cao Xiao ◽

...

Keyword(s):

Deep Learning ◽

Drug Target ◽

Prediction Models ◽

State Of The Art ◽

Supplementary Information ◽

Target Interaction ◽

Interaction Prediction ◽

Computer Scientists ◽

Benchmark Datasets ◽

Biomedical Field

Summary: Accurate prediction of drug–target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models have shown promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. Availability and implementation: https://github.com/kexinhuang12345/DeepPurpose. Supplementary information: Supplementary data are available at Bioinformatics online.

LBERT: Lexically aware Transformer-based Bidirectional Encoder Representation model for learning universal bio-entity relations

Bioinformatics ◽

10.1093/bioinformatics/btaa721 ◽

2020 ◽

Author(s):

Neha Warikoo ◽

Yung-Chun Chang ◽

Wen-Lian Hsu

Keyword(s):

Deep Learning ◽

Language Processing ◽

Predictive Analytics ◽

Relation Extraction ◽

Data Representation ◽

Supplementary Information ◽

Biomedical Domain ◽

Critical Function ◽

Representation Model ◽

Classification Tasks

Motivation: Natural Language Processing techniques are constantly being advanced to accommodate the influx of data as well as to provide exhaustive and structured knowledge dissemination. Within the biomedical domain, relation detection between bio-entities, known as the Bio-Entity Relation Extraction (BRE) task, has a critical function in knowledge structuring. Although recent advances in deep learning-based biomedical domain embedding have improved BRE predictive analytics, these works are often task selective or use external knowledge-based pre-/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, LBERT, a Lexically aware Transformer-based Bidirectional Encoder Representation model that explores both local and global context representations for sentence-level classification tasks. Results: This article presents one of the most exhaustive BRE studies ever conducted over five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models in protein–protein interaction (PPI), drug–drug interaction and protein–bio-entity relation classification tasks by 0.02%, 11.2% and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relations for large corpora like PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention in improving prediction performance by learning additional local semantic context along with bi-directionally learned global context. Availability and implementation: https://github.com/warikoone/LBERT. Supplementary information: Supplementary data are available at Bioinformatics online.

DeepCarc: Deep Learning-Powered Carcinogenicity Prediction Using Model-Level Representation

Frontiers in Artificial Intelligence ◽

10.3389/frai.2021.757780 ◽

2021 ◽

Vol 4 ◽

Author(s):

Ting Li ◽

Weida Tong ◽

Ruth Roberts ◽

Zhichao Liu ◽

Shraddha Thakkar

Keyword(s):

Deep Learning ◽

Animal Studies ◽

Environmental Chemistry ◽

Quantitative Structure Activity Relationship ◽

Data Set ◽

Test Set ◽

Improvement Rate ◽

Average Improvement ◽

Qsar Models ◽

Toxicological Research

Carcinogenicity testing plays an essential role in identifying carcinogens in environmental chemistry and drug development. However, evaluating carcinogenic potency with conventional 2-year rodent animal studies is a time-consuming and labor-intensive process. Thus, there is an urgent need for alternative approaches that provide reliable and robust assessments of carcinogenicity. In this study, we proposed a DeepCarc model to predict carcinogenicity for small molecules using deep learning-based model-level representations. The DeepCarc model was developed using a data set of 692 compounds and evaluated on a test set containing 171 compounds in the National Center for Toxicological Research liver cancer database (NCTRlcdb). As a result, the proposed DeepCarc model yielded a Matthews correlation coefficient (MCC) of 0.432 for the test set, outperforming four advanced deep learning (DL) powered quantitative structure-activity relationship (QSAR) models with an average improvement rate of 37%. Furthermore, the DeepCarc model was also employed to screen the carcinogenicity potential of the compounds from both DrugBank and Tox21. Altogether, the proposed DeepCarc model could serve as an early detection tool (https://github.com/TingLi2016/DeepCarc) for carcinogenicity assessment.
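
For readers unfamiliar with the reported metric, the sketch below computes the Matthews correlation coefficient from a binary confusion matrix using its standard definition. The function name is hypothetical and the counts in the example are illustrative only. They are not taken from the DeepCarc study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative counts only.
print(matthews_corrcoef(tp=50, tn=50, fp=0, fn=0))    # perfect agreement
print(matthews_corrcoef(tp=25, tn=25, fp=25, fn=25))  # no better than chance
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so it stays informative when the two classes are imbalanced.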

A QSAR Modeling on Aurone Derivatives as Antimalarial Agents

Asian Journal of Chemistry ◽

10.14233/ajchem.2020.22846 ◽

2020 ◽

Vol 32 (11) ◽

pp. 2839-2845

Author(s):

R. Hadanau

Keyword(s):

Qsar Model ◽

Quantitative Structure Activity Relationship ◽

Multilinear Regression ◽

Qsar Modeling ◽

Qsar Analysis ◽

Antimalarial Agents ◽

Inhibition Concentration ◽

Qsar Models ◽

Semi Empirical ◽

Statistical Criteria

A quantitative structure-activity relationship (QSAR) analysis was performed on aurone derivatives, with compounds 1-16 and 17-21 used as the internal and external test sets, respectively. Aurone derivatives have been investigated in earlier studies; however, no QSAR analysis of aurone compounds has been conducted. The semi-empirical PM3 method of HyperChem for Windows 8.0 was used to optimise the aurone derivative structures to acquire descriptors. For 15 influential descriptors, multilinear regression (MLR) analysis was conducted using the backward method, and four new QSAR models were obtained. According to statistical criteria, model 2 was the optimum QSAR model for predicting the theoretical inhibition concentration (IC50) of novel aurone derivatives. The modelling of 40 aurone compounds (22-61) was achieved. Six novel compounds (54, 55, 58, 59, 60, and 61) were synthesized in a laboratory because the IC50 of these compounds was lower than that of chloroquine (IC50 = 0.14 μM).

Importance of Applicability Domain of QSAR Models

Quantitative Structure-Activity Relationships in Drug Design, Predictive Toxicology, and Risk Assessment - Advances in Chemical and Materials Engineering ◽

10.4018/978-1-4666-8136-1.ch005 ◽

2015 ◽

pp. 180-211 ◽

Cited By ~ 8

Author(s):

Kunal Roy ◽

Supratik Kar

Keyword(s):

Environmental Fate ◽

Interpolation Space ◽

Model Development ◽

Qsar Model ◽

Quantitative Structure Activity Relationship ◽

Applicability Domain ◽

Property Prediction ◽

Qsar Models ◽

Environmental Fate Modeling

Quantitative Structure-Activity Relationship (QSAR) models have manifold applications in drug discovery, environmental fate modeling, risk assessment, and property prediction of chemicals and pharmaceuticals. One of the principles recommended by the Organisation for Economic Co-operation and Development (OECD) for model validation requires defining the Applicability Domain (AD) for QSAR models, which allows one to estimate the uncertainty in the prediction for a compound based on how similar it is to the training compounds used in model development. The AD is a significant tool for building a reliable QSAR model, whose use is generally limited to query chemicals structurally similar to the training compounds. Thus, characterization of the interpolation space is significant in defining the AD. This chapter addresses the important concepts and methodology of the AD as well as criteria for estimating the AD through training set interpolation in the descriptor space.
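
One simple way to characterize an interpolation space is a range-based bounding box over the training descriptors. The sketch below shows that approach as an illustration only. The abstract does not say which AD criteria the chapter covers, and the function names and descriptor values are hypothetical.

```python
def fit_bounding_box(train_descriptors):
    """Record the (min, max) range of each descriptor over the training set."""
    columns = zip(*train_descriptors)
    return [(min(col), max(col)) for col in columns]


def in_domain(box, query):
    """A query compound is inside the AD if every descriptor lies within range."""
    return all(lo <= x <= hi for (lo, hi), x in zip(box, query))


# Hypothetical two-descriptor training set (one row per compound).
train = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
box = fit_bounding_box(train)
print(in_domain(box, [2.5, 15.0]))  # inside every range
print(in_domain(box, [3.5, 15.0]))  # first descriptor exceeds the training max
```

A bounding box ignores correlations between descriptors and empty regions inside the ranges, so distance-based or leverage-based criteria are often preferred when a tighter domain is needed.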

A QSAR STUDY OF SUBSTITUTED PYRAZOLINE DERIVATIVES AS POTENTIAL ANTI-TUBERCULOSIS AGENTS

INDIAN DRUGS ◽

10.53879/id.54.04.10781 ◽

2017 ◽

Vol 54 (04) ◽

pp. 22-31

Author(s):

M. C. Sharma

Keyword(s):

Statistical Significance ◽

External Validation ◽

Model Development ◽

Qsar Model ◽

Quantitative Structure Activity Relationship ◽

Partial Least Square ◽

Least Square ◽

Qsar Study ◽

Partial Least Square Analysis ◽

Qsar Models

A quantitative structure–activity relationship (QSAR) study of a series of substituted pyrazoline derivatives, with regard to their anti-tuberculosis activity, was carried out using the partial least squares (PLS) analysis method. QSAR model development on 64 pyrazoline derivatives was carried out to predict anti-tubercular activity. PLS analysis was applied to derive QSAR models, which were further evaluated for statistical significance and predictive power by internal and external validation. The best QSAR model, with good internal and external predictivity for the training and test sets, showed cross-validation (q²) and external validation (pred_r2) values of 0.7426 and 0.7903, respectively. Two-dimensional QSAR analyses of such pyrazoline derivatives provide important structural insights for designing potent anti-tuberculosis drugs.
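
The two validation statistics can be sketched as follows. The abstract does not state the formulas, so the definitions are assumptions based on common usage. The cross-validated statistic compares leave-one-out predictions with the training mean, and the external statistic compares test-set predictions with the training mean. All function names and numbers are hypothetical.

```python
def q2(y_train, y_loo_pred):
    """Cross-validated q2 = 1 - PRESS / SS, using leave-one-out predictions."""
    mean_train = sum(y_train) / len(y_train)
    press = sum((y - p) ** 2 for y, p in zip(y_train, y_loo_pred))
    ss_total = sum((y - mean_train) ** 2 for y in y_train)
    return 1.0 - press / ss_total


def pred_r2(y_test, y_test_pred, y_train):
    """External pred_r2 = 1 - sum (y - yhat)^2 / sum (y - mean(y_train))^2."""
    mean_train = sum(y_train) / len(y_train)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_test, y_test_pred))
    ss_ref = sum((y - mean_train) ** 2 for y in y_test)
    return 1.0 - ss_res / ss_ref


# Hypothetical activities and predictions.
y_train = [1.0, 2.0, 3.0, 4.0]
print(q2(y_train, [1.1, 1.9, 3.2, 3.8]))
print(pred_r2([2.0, 4.0], [2.2, 3.7], y_train))
```

Both statistics equal 1 for perfect predictions and fall to 0 or below when the model does no better than predicting the training-set mean.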

An In Silico QSAR Model Study Using Electrophilicity as a Possible Descriptor Against T. Brucei

International Journal of Chemoinformatics and Chemical Engineering ◽

10.4018/ijcce.20190701.oa1 ◽

2019 ◽

Vol 8 (2) ◽

pp. 57-68 ◽

Cited By ~ 1

Author(s):

Ranita Pal ◽

Goutam Pal ◽

Gourhari Jana ◽

Pratim Kumar Chattaraj

Keyword(s):

Qsar Model ◽

Quantitative Structure Activity Relationship ◽

Disease Spread ◽

Tsetse Flies ◽

Electrophilicity Index ◽

Data Set ◽

Model Study ◽

Sum Of Ranking Differences ◽

Qsar Models ◽

Vector Borne

Human African trypanosomiasis (HAT), also known as sleeping sickness, is a vector-borne parasitic disease spread through the bite of infected tsetse flies (Glossina genus), which are abundant in rural Africa. The present study constructed quantitative structure-activity relationship (QSAR) models based on quantum chemical electronic descriptors to determine the extent to which the electronic factor of the selected compounds affects the HAT activity. Theoretical prediction of the toxicity (pIC50) towards HAT of a series of heterocyclic scaffolds consisting of 32 pyridyl benzamide derivatives is investigated by considering all possible combinations of the electrophilicity index (ω) and the square of the electrophilicity index (ω²) as descriptors in the studied models, along with other descriptors previously used by Masand et al. A multiple linear regression (MLR) analysis is conducted to develop the models. Further, to perform variable selection on the overall data set, which has diverse functional groups, an analysis using the sum of ranking differences methodology with ties is carried out.
