scholarly journals Evaluation of Cross-Validation Strategies in Sequence-Based Binding Prediction Using Deep Learning

2019 ◽  
Vol 59 (4) ◽  
pp. 1645-1657 ◽  
Author(s):  
Angela Lopez-del Rio ◽  
Alfons Nonell-Canals ◽  
David Vidal ◽  
Alexandre Perera-Lluna
2018 ◽  
Author(s):  
Angela Lopez-del Rio ◽  
Alfons Nonell-Canals ◽  
David Vidal ◽  
Alexandre Perera-Lluna

Binding prediction between targets and drug-like compounds through Deep Neural Networks have generated promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models is still an issue to be addressed. In this work, we explored how different cross-validation strategies applied to data from different molecular databases affect to the performance of binding prediction proteochemometrics models. These strategies are: (1) random splitting, (2) splitting based on K-means clustering (both of actives and inactives), (3) splitting based on source database and (4) splitting based both in the clustering and in the source database. These schemas are applied to a Deep Learning proteochemometrics model and to a simple logistic regression model to be used as baseline. Additionally, two different ways of describing molecules in the model are tested: (1) by their SMILES and (2) by three fingerprints. The classification performance of our Deep Learning-based proteochemometrics model is comparable to the state of the art. Our results show that the lack of generalization of these models is due to a bias in public molecular databases and that a restrictive cross-validation schema based on compounds clustering leads to worse but more robust and credible results. Our results also show better performance when representing molecules by their fingerprints.


2018 ◽  
Author(s):  
Angela Lopez-del Rio ◽  
Alfons Nonell-Canals ◽  
David Vidal ◽  
Alexandre Perera-Lluna

Binding prediction between targets and drug-like compounds through Deep Neural Networks have generated promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models is still an issue to be addressed. In this work, we explored how different cross-validation strategies applied to data from different molecular databases affect to the performance of binding prediction proteochemometrics models. These strategies are: (1) random splitting, (2) splitting based on K-means clustering (both of actives and inactives), (3) splitting based on source database and (4) splitting based both in the clustering and in the source database. These schemas are applied to a Deep Learning proteochemometrics model and to a simple logistic regression model to be used as baseline. Additionally, two different ways of describing molecules in the model are tested: (1) by their SMILES and (2) by three fingerprints. The classification performance of our Deep Learning-based proteochemometrics model is comparable to the state of the art. Our results show that the lack of generalization of these models is due to a bias in public molecular databases and that a restrictive cross-validation schema based on compounds clustering leads to worse but more robust and credible results. Our results also show better performance when representing molecules by their fingerprints.


2020 ◽  
Vol 22 (Supplement_2) ◽  
pp. ii148-ii148
Author(s):  
Yoshihiro Muragaki ◽  
Yutaka Matsui ◽  
Takashi Maruyama ◽  
Masayuki Nitta ◽  
Taiichi Saito ◽  
...  

Abstract INTRODUCTION It is useful to know the molecular subtype of lower-grade gliomas (LGG) when deciding on a treatment strategy. This study aims to diagnose this preoperatively. METHODS A deep learning model was developed to predict the 3-group molecular subtype using multimodal data including magnetic resonance imaging (MRI), positron emission tomography (PET), and computed tomography (CT). The performance was evaluated using leave-one-out cross validation with a dataset containing information from 217 LGG patients. RESULTS The model performed best when the dataset contained MRI, PET, and CT data. The model could predict the molecular subtype with an accuracy of 96.6% for the training dataset and 68.7% for the test dataset. The model achieved test accuracies of 58.5%, 60.4%, and 59.4% when the dataset contained only MRI, MRI and PET, and MRI and CT data, respectively. The conventional method used to predict mutations in the isocitrate dehydrogenase (IDH) gene and the codeletion of chromosome arms 1p and 19q (1p/19q) sequentially had an overall accuracy of 65.9%. This is 2.8 percent point lower than the proposed method, which predicts the 3-group molecular subtype directly. CONCLUSIONS AND FUTURE PERSPECTIVE A deep learning model was developed to diagnose the molecular subtype preoperatively based on multi-modality data in order to predict the 3-group classification directly. Cross-validation showed that the proposed model had an overall accuracy of 68.7% for the test dataset. This is the first model to double the expected value for a 3-group classification problem, when predicting the LGG molecular subtype. We plan to apply the techniques of heat map and/or segmentation for an increase in prediction accuracy.


2021 ◽  
Vol 15 ◽  
pp. 117793222110303
Author(s):  
Asad Ahmed ◽  
Bhavika Mam ◽  
Ramanathan Sowdhamini

Protein-ligand binding prediction has extensive biological significance. Binding affinity helps in understanding the degree of protein-ligand interactions and is a useful measure in drug design. Protein-ligand docking using virtual screening and molecular dynamic simulations are required to predict the binding affinity of a ligand to its cognate receptor. Performing such analyses to cover the entire chemical space of small molecules requires intense computational power. Recent developments using deep learning have enabled us to make sense of massive amounts of complex data sets where the ability of the model to “learn” intrinsic patterns in a complex plane of data is the strength of the approach. Here, we have incorporated convolutional neural networks to find spatial relationships among data to help us predict affinity of binding of proteins in whole superfamilies toward a diverse set of ligands without the need of a docked pose or complex as user input. The models were trained and validated using a stringent methodology for feature extraction. Our model performs better in comparison to some existing methods used widely and is suitable for predictions on high-resolution protein crystal (⩽2.5 Å) and nonpeptide ligand as individual inputs. Our approach to network construction and training on protein-ligand data set prepared in-house has yielded significant insights. We have also tested DEELIG on few COVID-19 main protease-inhibitor complexes relevant to the current public health scenario. DEELIG-based predictions can be incorporated in existing databases including RSCB PDB, PDBMoad, and PDBbind in filling missing binding affinity data for protein-ligand complexes.


2021 ◽  
pp. 2103807
Author(s):  
Jon Lundstrøm ◽  
Emma Korhonen ◽  
Frédérique Lisacek ◽  
Daniel Bojar

2018 ◽  
Author(s):  
John-William Sidhom ◽  
Drew Pardoll ◽  
Alexander Baras

AbstractMotivationThe immune system has potential to present a wide variety of peptides to itself as a means of surveillance for pathogenic invaders. This means of surveillances allows the immune system to detect peptides derives from bacterial, viral, and even oncologic sources. However, given the breadth of the epitope repertoire, in order to study immune responses to these epitopes, investigators have relied on in-silico prediction algorithms to help narrow down the list of candidate epitopes, and current methods still have much in the way of improvement.ResultsWe present Allele-Integrated MHC (AI-MHC), a deep learning architecture with improved performance over the current state-of-the-art algorithms in human Class I and Class II MHC binding prediction. Our architecture utilizes a convolutional neural network that improves prediction accuracy by 1) allowing one neural network to be trained on all peptides for all alleles of a given class of MHC molecules by making the allele an input to the net and 2) introducing a global max pooling operation with an optimized kernel size that allows the architecture to achieve translational invariance in MHC-peptide binding analysis, making it suitable for sequence analytics where a frame of interest needs to be learned in a longer, variable length sequence. We assess AI-MHC against internal independent test sets and compare against all algorithms in the IEDB automated server benchmarks, demonstrating our algorithm achieves state-of-the-art for both Class I and Class II prediction.Availability and ImplementationAI-MHC can be used via web interface at baras.pathology.jhu.edu/[email protected]


2021 ◽  
Vol 17 (2) ◽  
pp. e1008767
Author(s):  
Zutan Li ◽  
Hangjin Jiang ◽  
Lingpeng Kong ◽  
Yuanyuan Chen ◽  
Kun Lang ◽  
...  

N6-methyladenine (6mA) is an important DNA modification form associated with a wide range of biological processes. Identifying accurately 6mA sites on a genomic scale is crucial for under-standing of 6mA’s biological functions. However, the existing experimental techniques for detecting 6mA sites are cost-ineffective, which implies the great need of developing new computational methods for this problem. In this paper, we developed, without requiring any prior knowledge of 6mA and manually crafted sequence features, a deep learning framework named Deep6mA to identify DNA 6mA sites, and its performance is superior to other DNA 6mA prediction tools. Specifically, the 5-fold cross-validation on a benchmark dataset of rice gives the sensitivity and specificity of Deep6mA as 92.96% and 95.06%, respectively, and the overall prediction accuracy is 94%. Importantly, we find that the sequences with 6mA sites share similar patterns across different species. The model trained with rice data predicts well the 6mA sites of other three species: Arabidopsis thaliana, Fragaria vesca and Rosa chinensis with a prediction accuracy over 90%. In addition, we find that (1) 6mA tends to occur at GAGG motifs, which means the sequence near the 6mA site may be conservative; (2) 6mA is enriched in the TATA box of the promoter, which may be the main source of its regulating downstream gene expression.


2020 ◽  
Vol 10 (14) ◽  
pp. 4870 ◽  
Author(s):  
Luca Coviello ◽  
Marco Cristoforetti ◽  
Giuseppe Jurman ◽  
Cesare Furlanello

We introduce here the Grape Berries Counting Net (GBCNet), a tool for accurate fruit yield estimation from smartphone cameras, by adapting Deep Learning algorithms originally developed for crowd counting. We test GBCNet using cross-validation procedure on two original datasets CR1 and CR2 of grape pictures taken in-field before veraison. A total of 35,668 berries have been manually annotated for the task. GBCNet achieves good performances on both the seven grape varieties dataset CR1, although with a different accuracy level depending on the variety, and on the single variety dataset CR2: in particular Mean Average Error (MAE) ranges from 0.85% for Pinot Gris to 11.73% for Marzemino on CR1 and reaches 7.24% on the Teroldego CR2 dataset.


2021 ◽  
Author(s):  
Karansher S Sandhu ◽  
Meriem Aoun ◽  
Craig Morris ◽  
Arron H Carter

Breeding for grain yield, biotic and abiotic stress resistance, and end-use quality are important goals of wheat breeding programs. Screening for end-use quality traits is usually secondary to grain yield due to high labor needs, cost of testing, and large seed requirements for phenotyping. Hence, testing is delayed until later stages in the breeding program. Delayed phenotyping results in advancement of inferior end-use quality lines into the program. Genomic selection provides an alternative to predict performance using genome-wide markers. Due to large datasets in breeding programs, we explored the potential of the machine and deep learning models to predict fourteen end-use quality traits in a winter wheat breeding program. The population used consisted of 666 wheat genotypes screened for five years (2015-19) at two locations (Pullman and Lind, WA, USA). Nine different models, including two machine learning (random forest and support vector machine) and two deep learning models (convolutional neural network and multilayer perceptron), were explored for cross-validation, forward, and across locations predictions. The prediction accuracies for different traits varied from 0.45-0.81, 0.29-0.55, and 0.27-0.50 under cross-validation, forward, and across location predictions. In general, forward prediction accuracies kept increasing over time due to increments in training data size and was more evident for machine and deep learning models. Deep learning models performed superior over the traditional ridge regression best linear unbiased prediction (RRBLUP) and Bayesian models under all prediction scenarios. The high accuracy observed for end-use quality traits in this study support predicting them in early generations, leading to the advancement of superior genotypes to more extensive grain yield trailing. Furthermore, the superior performance of machine and deep learning models strengthen the idea to include them in large scale breeding programs for predicting complex traits.


Sign in / Sign up

Export Citation Format

Share Document