FP2VEC: a new molecular featurizer for learning molecular properties

2019 ◽  
Vol 35 (23) ◽  
pp. 4979-4985 ◽  
Author(s):  
Woosung Jeon ◽  
Dongsup Kim

Abstract. Motivation: One of the most successful approaches to predicting the properties of chemical compounds is quantitative structure–activity relationship (QSAR) modeling. The prediction accuracy of QSAR models has recently been greatly improved by deep learning. In particular, newly developed molecular featurizers based on graph convolution operations on molecular graphs significantly outperform the conventional extended connectivity fingerprint (ECFP) features in both classification and regression tasks, indicating that more effective featurizers are critical to fully realizing the power of deep learning techniques. Motivated by the clear analogy between chemical compounds and natural languages, this work develops a new molecular featurizer, FP2VEC, which represents a chemical compound as a set of trainable embedding vectors. Results: To implement and test our new featurizer, we build a QSAR model using a simple convolutional neural network (CNN) architecture that has been successfully used for natural language processing tasks such as sentence classification. By testing our new method on several benchmark datasets, we demonstrate that the combination of FP2VEC and a CNN model can achieve competitive results in many QSAR tasks, especially classification tasks. We also demonstrate that the FP2VEC model is especially effective for multitask learning. Availability and implementation: FP2VEC is available from https://github.com/wsjeon92/FP2VEC. Supplementary information: Supplementary data are available at Bioinformatics online.
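The idea of representing a compound as a set of embedding vectors can be illustrated with a small lookup sketch. It assumes, from the name FP2VEC and the language analogy only, that the "words" of a molecule are the set bits of a binary fingerprint; the abstract does not spell this out. The helper names `fingerprint_to_indices` and `embed`, the toy sizes, and the random table are all hypothetical, and this is not the code from the authors' repository.

```python
import numpy as np

def fingerprint_to_indices(bits, max_len):
    """Return the 1-based positions of set bits, zero-padded to max_len."""
    on = [i + 1 for i, b in enumerate(bits) if b]
    if len(on) > max_len:
        raise ValueError("max_len is smaller than the number of set bits")
    return on + [0] * (max_len - len(on))

def embed(indices, table):
    """Look up one embedding row per index; row 0 is the padding vector."""
    return table[np.asarray(indices)]

rng = np.random.default_rng(0)
n_bits, dim, max_len = 16, 4, 8           # toy sizes, chosen for illustration
table = rng.normal(size=(n_bits + 1, dim))
table[0] = 0.0                            # padding row stays at zero

# Hypothetical 16-bit fingerprint with bits 2, 5 and 11 set (1-based).
bits = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
indices = fingerprint_to_indices(bits, max_len)
matrix = embed(indices, table)            # shape (max_len, dim)
```

In a trained model the table would be a learnable parameter updated by backpropagation, and the resulting matrix would be the input to the convolutional layers, much as a sentence is turned into a matrix of word vectors.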

Author(s):  
Apilak Worachartcheewan ◽  
Alla P. Toropova ◽  
Andrey A. Toropov ◽  
Reny Pratiwi ◽  
Virapong Prachayasittikul ◽  
...  

Background: Sirtuin 1 (Sirt1) and sirtuin 2 (Sirt2) are NAD+-dependent histone deacetylases that play important functional roles in removing the acetyl group from acetyl-lysine substrates. Because dysregulation of Sirt1 and Sirt2 is an etiological cause of disease, both are attractive target proteins for treatment, and there has been great interest in the development of Sirt1 and Sirt2 inhibitors. Objective: This study compiled the bioactivity data of Sirt1 and Sirt2 for the construction of quantitative structure-activity relationship (QSAR) models in accordance with the OECD principles. Method: Simplified molecular input line entry system (SMILES)-based molecular descriptors were used to characterize the molecular features of inhibitors, while the Monte Carlo method of the CORAL software was employed for multivariate analysis. The data set was subjected to 3 random splits, each of which separated the data into 4 subsets: training, invisible training, calibration and external sets. Results: Statistical indices for the evaluation of QSAR models suggested good statistical quality for the models of Sirt1 and Sirt2 inhibitors. Furthermore, a mechanistic interpretation of the molecular substructures responsible for modulating the bioactivity (i.e. promoters of an increase or decrease in bioactivity) was extracted via the analysis of correlation weights. This analysis revealed the molecular features involved in Sirt1 and Sirt2 inhibition. Conclusion: It is anticipated that the QSAR models presented herein can be useful as guidelines in the rational design of potential Sirt1 and Sirt2 inhibitors for the treatment of sirtuin-related diseases.


2020 ◽  
Vol 14 (4) ◽  
pp. 471-484
Author(s):  
Suraj Shetiya ◽  
Saravanan Thirumuruganathan ◽  
Nick Koudas ◽  
Gautam Das

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries, followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates, resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing, such as embeddings, to encode the strings and use the encodings to train a model. While this is an improvement over traditional approaches, considerable room for improvement remains. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to 'abc' than to 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it can be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.
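The prefix-frequency example in this abstract can be made concrete with an exact-counting baseline. The sketch below is not Astrid and learns no embedding; it only computes the quantity that a selectivity estimator tries to approximate. The function names and the toy data set (a scaled-down version of the 1000/800/100 example) are assumptions made for illustration.

```python
from collections import Counter

def prefix_counts(strings):
    """Count, for every prefix, how many strings start with it."""
    counts = Counter()
    for s in strings:
        for end in range(1, len(s) + 1):
            counts[s[:end]] += 1
    return counts

def prefix_selectivity(counts, prefix, total):
    """Fraction of the data set matched by the predicate LIKE 'prefix%'."""
    return counts[prefix] / total

# Toy data: 8 strings 'abc', 1 'abd', 1 'abe' (10 strings sharing prefix 'ab').
data = ["abc"] * 8 + ["abd"] + ["abe"]
counts = prefix_counts(data)
sel_abc = prefix_selectivity(counts, "abc", len(data))
sel_abd = prefix_selectivity(counts, "abd", len(data))
```

Here 'abc' accounts for most of the strings matching 'ab', which is the sense in which a selectivity-aware embedding would place 'ab' closer to 'abc' than to 'abd'. Exact counting over all prefixes grows with the data, which is why summary structures or learned models are used in practice.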


Author(s):  
Kexin Huang ◽  
Tianfan Fu ◽  
Lucas M Glass ◽  
Marinka Zitnik ◽  
Cao Xiao ◽  
...  

Abstract. Summary: Accurate prediction of drug–target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models have shown promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. Availability and implementation: https://github.com/kexinhuang12345/DeepPurpose. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Neha Warikoo ◽  
Yung-Chun Chang ◽  
Wen-Lian Hsu

Abstract. Motivation: Natural language processing techniques are constantly being advanced to accommodate the influx of data and to provide exhaustive and structured knowledge dissemination. Within the biomedical domain, relation detection between bio-entities, known as the Bio-Entity Relation Extraction (BRE) task, has a critical function in knowledge structuring. Although recent advances in deep learning-based biomedical domain embedding have improved BRE predictive analytics, these works are often task selective or use external knowledge-based pre-/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, LBERT, a Lexically aware Transformer-based Bidirectional Encoder Representation model that explores both local and global context representations for sentence-level classification tasks. Results: This article presents one of the most exhaustive BRE studies ever conducted over five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models in protein–protein interaction (PPI), drug–drug interaction and protein–bio-entity relation classification tasks by 0.02%, 11.2% and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relations for large corpora like PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention in improving prediction performance by learning additional local semantic context along with bi-directionally learned global context. Availability and implementation: GitHub, https://github.com/warikoone/LBERT. Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 4 ◽  
Author(s):  
Ting Li ◽  
Weida Tong ◽  
Ruth Roberts ◽  
Zhichao Liu ◽  
Shraddha Thakkar

Carcinogenicity testing plays an essential role in identifying carcinogens in environmental chemistry and drug development. However, evaluating carcinogenic potency with conventional 2-year rodent animal studies is a time-consuming and labor-intensive process. Thus, there is an urgent need for alternative approaches that provide reliable and robust assessments of carcinogenicity. In this study, we proposed a DeepCarc model to predict carcinogenicity for small molecules using deep learning-based model-level representations. The DeepCarc model was developed using a data set of 692 compounds and evaluated on a test set containing 171 compounds in the National Center for Toxicological Research liver cancer database (NCTRlcdb). The proposed DeepCarc model yielded a Matthews correlation coefficient (MCC) of 0.432 for the test set, outperforming four advanced deep learning (DL) powered quantitative structure-activity relationship (QSAR) models with an average improvement rate of 37%. Furthermore, the DeepCarc model was also employed to screen the carcinogenicity potential of the compounds from both DrugBank and Tox21. Altogether, the proposed DeepCarc model could serve as an early detection tool (https://github.com/TingLi2016/DeepCarc) for carcinogenicity assessment.
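For readers unfamiliar with the metric reported above, the Matthews correlation coefficient can be computed from a binary confusion matrix as sketched here. The formula is the standard one; the function name and the convention of returning 0.0 when the denominator vanishes are choices made for this sketch, and no confusion-matrix counts from the DeepCarc study are used.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

perfect = mcc(tp=5, tn=5, fp=0, fn=0)    # every prediction correct
inverted = mcc(tp=0, tn=0, fp=5, fn=5)   # every prediction wrong
chance = mcc(tp=1, tn=1, fp=1, fn=1)     # no better than random
```

MCC ranges from -1 to 1 and stays informative on imbalanced classes, which is one reason it is often preferred to accuracy for toxicity data sets.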


2020 ◽  
Vol 32 (11) ◽  
pp. 2839-2845
Author(s):  
R. Hadanau

A quantitative structure-activity relationship (QSAR) analysis was performed on several aurone derivatives, with compounds 1-16 and 17-21 used as the internal and external test sets, respectively. Aurone derivatives have been investigated in earlier studies; however, no QSAR analysis of aurone compounds had been conducted. The semi-empirical PM3 method of HyperChem for Windows 8.0 was used to optimise the aurone derivative structures and obtain descriptors. For 15 influential descriptors, multilinear regression (MLR) analysis was conducted using the backward method, and four new QSAR models were obtained. According to statistical criteria, model 2 was the optimum QSAR model for predicting the theoretical inhibition concentration (IC50) of novel aurone derivatives. Forty aurone compounds (22-61) were modelled. Six novel compounds (54, 55, 58, 59, 60, and 61) were synthesized in a laboratory because their IC50 values were lower than that of chloroquine (IC50 = 0.14 μM).


Author(s):  
Kunal Roy ◽  
Supratik Kar

Quantitative Structure-Activity Relationship (QSAR) models have manifold applications in drug discovery, environmental fate modeling, risk assessment, and property prediction of chemicals and pharmaceuticals. One of the principles recommended by the Organisation for Economic Co-operation and Development (OECD) for model validation requires defining the Applicability Domain (AD) of a QSAR model. The AD allows one to estimate the uncertainty in the prediction for a compound based on how similar it is to the training compounds used in model development. The AD is an important tool for building a reliable QSAR model, whose use is generally limited to query chemicals structurally similar to the training compounds. Thus, characterization of the interpolation space is central to defining the AD. This chapter addresses the important concepts and methodology of the AD, as well as criteria for estimating the AD through training set interpolation in the descriptor space.
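One simple way to characterize the interpolation space in descriptor space is a range-based bounding box: a query compound is considered inside the domain only if each of its descriptors lies within the range seen in training. The sketch below illustrates that idea under stated assumptions; the chapter may discuss other criteria, and the function names and toy descriptor values are hypothetical.

```python
import numpy as np

def fit_bounding_box(X_train):
    """Return per-descriptor minimum and maximum over the training set."""
    X = np.asarray(X_train, dtype=float)
    return X.min(axis=0), X.max(axis=0)

def in_domain(X_query, lower, upper):
    """True for each query row whose descriptors all fall inside the box."""
    Xq = np.asarray(X_query, dtype=float)
    return np.all((Xq >= lower) & (Xq <= upper), axis=1)

# Toy training set with two descriptors per compound (made-up values).
X_train = [[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]]
lower, upper = fit_bounding_box(X_train)

# First query interpolates; second extrapolates on the first descriptor.
flags = in_domain([[1.0, 1.0], [3.0, 1.0]], lower, upper)
```

A bounding box ignores correlations between descriptors and empty regions inside the box, so distance-based or leverage-based criteria are often used when a tighter domain is needed.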


INDIAN DRUGS ◽  
2017 ◽  
Vol 54 (04) ◽  
pp. 22-31
Author(s):  
M. C. Sharma

A quantitative structure–activity relationship (QSAR) study of a series of substituted pyrazoline derivatives, with regard to their anti-tuberculosis activity, was carried out using the partial least squares (PLS) analysis method. QSAR models were developed for 64 pyrazoline derivatives to predict anti-tubercular activity. The PLS-derived models were further evaluated for statistical significance and predictive power by internal and external validation. The best QSAR model, with good internal and external predictivity for the training and test sets, showed cross-validation (q2) and external validation (pred_r2) values of 0.7426 and 0.7903, respectively. Two-dimensional QSAR analyses of such pyrazoline derivatives provide important structural insights for designing potent anti-tuberculosis drugs.


Author(s):  
Ranita Pal ◽  
Goutam Pal ◽  
Gourhari Jana ◽  
Pratim Kumar Chattaraj

Human African trypanosomiasis (HAT), or sleeping sickness, is a vector-borne parasitic disease spread through the bite of infected tsetse flies (Glossina genus), which are abundant in rural Africa. The present study constructed quantitative structure-activity relationship (QSAR) models based on quantum chemical electronic descriptors to bring out the extent to which the electronic factor of the selected compounds affects the HAT activity. The theoretical prediction of the toxicity (pIC50) towards HAT of a series of heterocyclic scaffolds consisting of 32 pyridyl benzamide derivatives is investigated by considering all possible combinations of the electrophilicity index (ω) and its square (ω²) as descriptors in the studied models, along with other descriptors previously used by Masand et al. A multiple linear regression (MLR) analysis is conducted to develop the models. Further, to perform variable selection on the overall data set, which has diverse functional groups, an analysis using the sum of ranking differences methodology with ties is carried out.
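A multiple linear regression that uses ω and ω² as descriptors amounts to an ordinary least-squares fit with an intercept and two columns. The sketch below shows only that mechanical step on synthetic, noise-free values; it uses none of the study's data, omits the additional descriptors mentioned in the abstract, and the function name is hypothetical.

```python
import numpy as np

def fit_quadratic_mlr(omega, y):
    """Least-squares fit of y = b0 + b1*omega + b2*omega**2."""
    omega = np.asarray(omega, dtype=float)
    # Design matrix: intercept, omega, omega squared.
    A = np.column_stack([np.ones_like(omega), omega, omega**2])
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef

# Synthetic values generated from known coefficients (1, 2, 3).
omega = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = 1.0 + 2.0 * omega + 3.0 * omega**2
coef = fit_quadratic_mlr(omega, y)
```

Because the synthetic response is an exact quadratic in ω, the fit recovers the generating coefficients up to floating-point error; with real activity data the residuals and validation statistics would decide which descriptor combination to keep.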

