DeepCarc: Deep Learning-Powered Carcinogenicity Prediction Using Model-Level Representation

2021 ◽  
Vol 4 ◽  
Author(s):  
Ting Li ◽  
Weida Tong ◽  
Ruth Roberts ◽  
Zhichao Liu ◽  
Shraddha Thakkar

Carcinogenicity testing plays an essential role in identifying carcinogens in environmental chemistry and drug development. However, evaluating carcinogenic potency with conventional 2-year rodent studies is a time-consuming and labor-intensive process. Thus, there is an urgent need for alternative approaches that provide reliable and robust assessments of carcinogenicity. In this study, we proposed a DeepCarc model to predict carcinogenicity for small molecules using deep learning-based model-level representations. The DeepCarc model was developed using a data set of 692 compounds and evaluated on a test set containing 171 compounds in the National Center for Toxicological Research liver cancer database (NCTRlcdb). The proposed DeepCarc model yielded a Matthews correlation coefficient (MCC) of 0.432 for the test set, outperforming four advanced deep learning (DL) powered quantitative structure-activity relationship (QSAR) models with an average improvement rate of 37%. Furthermore, the DeepCarc model was also employed to screen the carcinogenicity potential of the compounds from both DrugBank and Tox21. Altogether, the proposed DeepCarc model could serve as an early detection tool (https://github.com/TingLi2016/DeepCarc) for carcinogenicity assessment.
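
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Matthews correlation coefficient is computed from a binary confusion matrix. The function name and the example counts are illustrative assumptions; they are not taken from the DeepCarc repository or its results.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the MCC given the four cells of a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts that sum to 171 compounds (NOT the paper's results).
print(round(matthews_corrcoef(tp=60, tn=62, fp=25, fn=24), 3))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it a common choice when the two classes are imbalanced.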

Author(s):  
Apilak Worachartcheewan ◽  
Alla P. Toropova ◽  
Andrey A. Toropov ◽  
Reny Pratiwi ◽  
Virapong Prachayasittikul ◽  
...  

Background: Sirtuin 1 (Sirt1) and sirtuin 2 (Sirt2) are NAD+-dependent histone deacetylases that play important functional roles in the removal of the acetyl group from acetyl-lysine substrates. Because dysregulation of Sirt1 and Sirt2 is an etiological cause of disease, both are lucrative target proteins for treatment, and there has been great interest in the development of Sirt1 and Sirt2 inhibitors. Objective: This study compiled the bioactivity data of Sirt1 and Sirt2 for the construction of quantitative structure-activity relationship (QSAR) models in accordance with the OECD principles. Method: Simplified molecular input line entry system (SMILES)-based molecular descriptors were used to characterize the molecular features of inhibitors, while the Monte Carlo method of the CORAL software was employed for multivariate analysis. The data set was subjected to 3 random splits, each of which separated the data into 4 subsets: training, invisible training, calibration and external sets. Results: Statistical indices for the evaluation of QSAR models suggested good statistical quality for models of Sirt1 and Sirt2 inhibitors. Furthermore, a mechanistic interpretation of the molecular substructures responsible for modulating the bioactivity (i.e. promoters of increase or decrease of bioactivity) was extracted via the analysis of correlation weights, which revealed the molecular features involved in Sirt1 and Sirt2 inhibition. Conclusion: It is anticipated that the QSAR models presented herein can be useful as guidelines in the rational design of potential Sirt1 and Sirt2 inhibitors for the treatment of sirtuin-related diseases.


2018 ◽  
Vol 19 (11) ◽  
pp. 3423 ◽  
Author(s):  
Ting Wang ◽  
Lili Tang ◽  
Feng Luan ◽  
M. Natália D. S. Cordeiro

Organic compounds are often released into the environment, where they adversely affect the environment and human health as mixtures rather than as single chemicals. In this paper, we establish reliable classical quantitative structure–activity relationship (QSAR) models to evaluate the toxicity of 99 binary mixtures. The QSAR models were built by forward stepwise multiple linear regression (MLR) and by nonlinear radial basis function neural networks (RBFNNs), both using the hypothetical descriptors. The statistical parameters of the MLR model were N (number of compounds in training set) = 79, R² (the squared correlation coefficient between the predicted and observed activities) = 0.869, LOO q² (leave-one-out correlation coefficient) = 0.864, F (Fisher’s test) = 165.494, and RMS (root mean square) = 0.599 for the training set, and N_ext (number of compounds in external test set) = 20, R² = 0.853, q²ext (leave-one-out correlation coefficient for test set) = 0.825, F = 30.861, and RMS = 0.691 for the external test set. The RBFNN model gave N = 79, R² = 0.925, LOO q² = 0.924, F = 950.686, and RMS = 0.447 for the training set, and N_ext = 20, R² = 0.896, q²ext = 0.890, F = 155.424, and RMS = 0.547 for the external test set. Both the MLR and RBFNN models were evaluated with several statistical parameters and methods. The results confirm that the built models are acceptable and can be used to predict the toxicity of the binary mixtures.
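
The statistics quoted above (R², leave-one-out q², RMS) can be illustrated with a short sketch. It assumes a single descriptor and ordinary least squares, which is far simpler than the stepwise MLR and RBFNN models of the study; the function names and data are hypothetical, and the RMS here divides by N, which may differ from the convention the authors used.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x for one descriptor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def total_sum_of_squares(ys):
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def r_squared(xs, ys):
    """R2 = 1 - SS_res / SS_tot on the data used for fitting."""
    a, b = fit_line(xs, ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / total_sum_of_squares(ys)


def loo_q_squared(xs, ys):
    """q2 = 1 - PRESS / SS_tot, refitting with each compound held out."""
    press = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    return 1 - press / total_sum_of_squares(ys)


def rms_error(xs, ys):
    """Root mean square of the fitted residuals (divides by N)."""
    a, b = fit_line(xs, ys)
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Hypothetical descriptor values and toxicities, for illustration only.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toxicity = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
print(r_squared(descriptor, toxicity), loo_q_squared(descriptor, toxicity))
```

Because each held-out compound is predicted by a model that never saw it, q² cannot exceed R² for a least-squares fit; a large gap between the two is a common warning sign of overfitting.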


Author(s):  
Ranita Pal ◽  
Goutam Pal ◽  
Gourhari Jana ◽  
Pratim Kumar Chattaraj

Human African trypanosomiasis (HAT), or sleeping sickness, is a vector-borne parasitic disease spread through the bite of infected tsetse flies (Glossina genus), which are widespread in rural Africa. The present study constructed quantitative structure-activity relationship (QSAR) models based on quantum chemical electronic descriptors to bring out the extent to which the electronic factor of the selected compounds affects the HAT activity. Theoretical prediction of the toxicity (pIC50) towards HAT of a series of heterocyclic scaffolds consisting of 32 pyridyl benzamide derivatives is investigated by considering all possible combinations of the electrophilicity index (ω) and the square of the electrophilicity index (ω²) as descriptors in the studied models, along with other descriptors previously used by Masand et al. A multiple linear regression (MLR) analysis is conducted to develop the models. Further, to perform variable selection on the overall data set, which has diverse functional groups, an analysis using the sum of ranking differences methodology with ties is carried out.


2017 ◽  
Author(s):  
Ariel Rokem ◽  
Yue Wu ◽  
Aaron Lee

Deep learning algorithms have tremendous potential utility in the classification of biomedical images. For example, images acquired with retinal optical coherence tomography (OCT) can be used to accurately classify patients with adult macular degeneration (AMD), and distinguish them from healthy control patients. However, previous research has suggested that large amounts of data are required in order to train deep learning algorithms, because of the large number of parameters that need to be fit. Here, we show that a moderate amount of data (data from approximately 1,800 patients) may be enough to reach close-to-maximal performance in the classification of AMD patients from OCT images. These results suggest that deep learning algorithms can be trained on moderate amounts of data, provided that images are relatively homogenous, and the effective number of parameters is sufficiently small. Furthermore, we demonstrate that in this application, cross-validation with a separate test set that is not used in any part of the training does not differ substantially from cross-validation with a validation data-set used to determine the optimal stopping point for training.


2019 ◽  
Vol 35 (23) ◽  
pp. 4979-4985 ◽  
Author(s):  
Woosung Jeon ◽  
Dongsup Kim

Motivation: Quantitative structure–activity relationship (QSAR) modeling is one of the most successful methods for predicting the properties of chemical compounds. The prediction accuracy of QSAR models has recently been greatly improved by employing deep learning technology. In particular, newly developed molecular featurizers based on graph convolution operations on molecular graphs significantly outperform the conventional extended connectivity fingerprints (ECFP) feature in both classification and regression tasks, indicating that it is critical to develop more effective new featurizers to fully realize the power of deep learning techniques. Motivated by the fact that there is a clear analogy between chemical compounds and natural languages, this work develops a new molecular featurizer, FP2VEC, which represents a chemical compound as a set of trainable embedding vectors. Results: To implement and test our new featurizer, we build a QSAR model using a simple convolutional neural network (CNN) architecture that has been successfully used for natural language processing tasks such as sentence classification. By testing our new method on several benchmark datasets, we demonstrate that the combination of FP2VEC and the CNN model can achieve competitive results in many QSAR tasks, especially in classification tasks. We also demonstrate that the FP2VEC model is especially effective for multitask learning. Availability and implementation: FP2VEC is available from https://github.com/wsjeon92/FP2VEC. Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
pp. 1-8 ◽  
Author(s):  
Okyaz Eminaga ◽  
Nurettin Eminaga ◽  
Axel Semjonow ◽  
Bernhard Breil

Purpose: The recognition of cystoscopic findings remains challenging for young colleagues and depends on the examiner’s skills. Computer-aided diagnosis tools using feature extraction and deep learning show promise as instruments to perform diagnostic classification. Materials and Methods: Our study considered 479 patient cases that represented 44 urologic findings. Image color was linearly normalized and was equalized by applying contrast-limited adaptive histogram equalization. Because these findings can be viewed via cystoscopy from every possible angle and side, we ultimately generated images rotated in 10-degree grades and flipped them vertically or horizontally, which resulted in 18,681 images. After image preprocessing, we developed deep convolutional neural network (CNN) models (ResNet50, VGG-19, VGG-16, InceptionV3, and Xception) and evaluated these models using F1 scores. Furthermore, we proposed two CNN concepts: 90%-previous-layer filter size and harmonic-series filter size. A training set (60%), a validation set (10%), and a test set (30%) were randomly generated from the study data set. All models were trained on the training set, validated on the validation set, and evaluated on the test set. Results: The Xception-based model achieved the highest F1 score (99.52%), followed by models that were based on ResNet50 (99.48%) and the harmonic-series concept (99.45%). All images with cancer lesions were correctly determined by these models. When the focus was on the images misclassified by the model with the best performance, 7.86% of images that showed bladder stones with indwelling catheter and 1.43% of images that showed bladder diverticulum were falsely classified. Conclusion: The results of this study show the potential of deep learning for the diagnostic classification of cystoscopic images. Future work will focus on integration of artificial intelligence–aided cystoscopy into clinical routines and possibly expansion to other clinical endoscopy applications.
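
The random 60/10/30 partition described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name, the seed, and the use of integer indices as stand-ins for images are hypothetical, and the abstract does not state whether the split was made at the image or the patient level.

```python
import random


def split_dataset(items, train=0.6, valid=0.1, seed=0):
    """Shuffle items reproducibly and cut them into train/validation/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # the remainder (about 30%) is the test set
    )


# Integer indices stand in for the 18,681 augmented images.
train_set, valid_set, test_set = split_dataset(range(18681))
print(len(train_set), len(valid_set), len(test_set))
```

When augmented copies of the same source image are split at random like this, copies of one image can land in both the training and test sets, so a split by patient case is often preferred when estimating how a model generalizes.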


2021 ◽  
Author(s):  
Baohong Guo

Genomic prediction has been recognized as a promising new technique in animal and plant breeding. The linear mixed model is a widely used statistical technique, but it may not be desirable for large training sets and large numbers of molecular markers because it is computationally intensive. Deep learning is a subfield of machine learning that can be used for complex predictions on a large scale. Multi-task deep learning (MT-DL) incorporates related tasks (labels or traits) into one learning process to enable the learning model to perform better than single-task deep learning (ST-DL). I applied MT-DL to genotype-by-environment genomic predictions to predict the performance of breeding lines in multiple environments. I compared MT-DL with the linear mixed model-based Bayesian genotype × environment method (BGGE) and with separate genomic predictions on single environments using the widely used rrBLUP, ridge regression and ST-DL, all under cross-validation. Compared with rrBLUP, MT-DL and non-linear BGGE showed moderate increases in prediction accuracy of 9.4% and 7.6%, respectively; ST-DL showed a small increase of 5.4%; ridge regression had similar prediction accuracy; and linear BGGE showed a small decrease of 2.0%. I also found that all methods, including rrBLUP, overfit, likely because yield genomic predictions are complex and the data set used in this study is small. rrBLUP, ridge regression, ST-DL and MT-DL showed similar overfitting: the difference between training and test set prediction accuracies was between 0.344 and 0.387. Linear and nonlinear BGGE seemed to overfit much more than the other methods: the differences between training and test set prediction accuracies were 0.429 and 0.472, respectively. I also discuss the potential applications of ST-DL and MT-DL in genomic predictions of hybrid crops such as maize.


Author(s):  
Rosmahaida Jamaludin ◽  
Mohamed Noor Hasan

The increase in resistance to older drugs and the emergence of new types of infection have created an urgent need for the discovery and development of new compounds with antimalarial activity. Quantitative Structure-Activity Relationship (QSAR) methodology has been applied to develop models that correlate the antimalarial activity of artemisinin analogs with their molecular structures. In this study, the data set consisted of 197 compounds with their activities expressed as log RA (relative activity). These compounds were randomly divided into a training set (n=157) and a test set (n=40). The initial stage of the study was the generation of a series of descriptors from three-dimensional representations of the compounds in the data set. Several types of descriptors, including topological, connectivity indices, geometrical, physical properties and charge descriptors, were generated. The number of descriptors was then reduced to a set of relevant descriptors by performing a systematic variable selection procedure that includes a zero test, pairwise correlation analysis and a genetic algorithm (GA). Several models were developed using different combinations of modelling techniques such as multiple linear regression (MLR) and partial least squares (PLS) regression. The statistical significance of the final model was characterized by the correlation coefficient, r², and the root-mean-square error of calibration, RMSEC. The results obtained were comparable to those from a previous study on the same data set, with r² values greater than 0.8. Both internal and external validations were carried out to verify that the models have good stability, robustness and predictive ability. The cross-validated regression coefficient (r²cv) and the prediction regression coefficient (r²test) for the external test set were consistently greater than 0.7. The QSAR models developed in this study should facilitate the search for new compounds with antimalarial activity.
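
The first two stages of the descriptor reduction described above (zero test and pairwise correlation analysis) might look like the sketch below. The thresholds, function names and toy descriptor table are assumptions chosen for illustration; the abstract does not give the cut-offs actually used, and the genetic algorithm stage is not shown.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def reduce_descriptors(table, max_zero_fraction=0.9, max_abs_r=0.95):
    """Filter a {descriptor name: values per compound} table.

    Both thresholds are hypothetical defaults, not values from the study.
    """
    kept = {}
    for name, values in table.items():
        zero_fraction = sum(v == 0 for v in values) / len(values)
        # Zero test: drop descriptors that are mostly zero or constant.
        if zero_fraction > max_zero_fraction or len(set(values)) == 1:
            continue
        # Pairwise correlation: drop descriptors nearly collinear with a kept one.
        if any(abs(pearson(values, other)) > max_abs_r for other in kept.values()):
            continue
        kept[name] = values
    return kept


# Toy table with four descriptors over four compounds (illustrative only).
toy = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.1],   # almost a multiple of "a"
    "c": [0, 0, 0, 0],     # fails the zero test
    "d": [4, 1, 3, 2],
}
print(list(reduce_descriptors(toy)))
```

The order in which descriptors are visited decides which member of a correlated pair survives, so in practice the table is often sorted first, for example by correlation with the activity.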


2019 ◽  
Vol 22 (6) ◽  
pp. 387-399 ◽  
Author(s):  
Neda Ahmadinejad ◽  
Fatemeh Shafiei

Aim and Objective: Quantitative Structure-Activity Relationship (QSAR) models are widely developed to derive a correlation between the chemical structures of molecules and their known activities. In the present investigation, QSAR modeling was carried out on 76 Camptothecin (CPT) derivatives as anticancer drugs to develop a robust model for the prediction of physicochemical properties. Materials and Methods: A training set of 60 structurally diverse CPT derivatives was used to construct QSAR models for the prediction of physicochemical parameters such as Van der Waals surface area (SvdW), Van der Waals volume (VvdW), molar refractivity (MR) and polarizability (α). The QSAR models were optimized using Multiple Linear Regression (MLR) analysis. A test set of 16 compounds was evaluated using the defined models. Genetic Algorithm and Multiple Linear Regression analysis (GA-MLR) was used to select the descriptors derived from the Dragon software to generate the correlation models that relate the structural features to the studied properties. Results: QSAR models were used to delineate the important descriptors responsible for the properties of the CPT derivatives. The statistically significant QSAR models derived by GA-MLR analysis were validated by Leave-One-Out Cross-Validation (LOOCV) and test set validation methods. The multicollinearity and autocorrelation properties of the descriptors contributing to the models were tested by calculating the Variance Inflation Factor (VIF) and the Durbin–Watson (DW) statistics. Conclusion: The predictive ability of the models was found to be satisfactory. Thus, QSAR models derived from this study may be helpful for modeling and designing new CPT derivatives and for predicting their activity.
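
The two diagnostics named above can be sketched briefly. The example assumes a model with only two descriptors, where the VIF of each reduces to 1 / (1 - r²) with r the Pearson correlation between them; models with more descriptors need a regression of each descriptor on all the others. Function names and data are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_descriptors(x1, x2):
    """VIF = 1 / (1 - R^2); with two descriptors, R^2 is the squared correlation."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    return numerator / sum(e ** 2 for e in residuals)


# Hypothetical descriptor columns and model residuals, for illustration only.
print(vif_two_descriptors([1, 2, 3, 4], [1, -1, -1, 1]))
print(durbin_watson([0.3, -0.2, 0.1, -0.4, 0.2]))
```

A VIF of 1 means a descriptor is uncorrelated with the others, and larger values indicate growing multicollinearity. The Durbin–Watson statistic lies between 0 and 4, with values near 2 indicating little first-order autocorrelation in the residuals.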


Author(s):  
Rahman Abdizadeh ◽  
Esfandiar Heidarian ◽  
Farzin Hadizadeh ◽  
Tooba Abdizadeh

Background: Histone lysine demethylase 1 (LSD1), which plays a crucial role in the epigenetic modulation of gene expression, is a promising target for cancer treatment. Inhibition of LSD1 with small molecules has emerged as a vital mechanism to treat cancer. Objective: In the present research, molecular modeling investigations, such as CoMFA, CoMFA-RF, CoMSIA and HQSAR, molecular docking and molecular dynamics (MD) simulations were carried out on some tranylcypromine derivatives as LSD1 inhibitors. Methods: The QSAR models were built on a series of tranylcypromine derivatives as the data set using the SYBYLX2.1.1 program. Molecular docking and MD simulations were carried out with the MOE software and the SYBYL program, respectively. The internal and external predictive performance of the generated models for these LSD1 inhibitors was assessed by evaluating the cross-validated correlation coefficient (q²), the non-cross-validated correlation coefficient and the predicted correlation coefficient of the training and test set molecules, respectively. Results: The CoMFA (q², 0.670; non-cross-validated, 0.930; predicted, 0.968), CoMFA-RF (q², 0.694; non-cross-validated, 0.926; predicted, 0.927), CoMSIA (q², 0.834; non-cross-validated, 0.956; predicted, 0.958) and HQSAR (q², 0.854; non-cross-validated, 0.900; predicted, 0.728) models for the training and test sets of LSD1 inhibition yielded significant findings. Conclusion: These QSAR models were statistically strong, with good predictability. Contour maps of all models were generated and confirmed by molecular docking studies and molecular dynamics simulation, showing that the hydrophobic, electrostatic and hydrogen bonding fields are crucial in these models for improving the binding affinity and determining the structure-activity relationship. These theoretical results may be useful for designing new, potent LSD1 inhibitors with enhanced activity to treat cancer.

