scholarly journals Benchmarks for interpretation of QSAR models

Author(s):  
Mariia Matveieva ◽  
Pavel Polishchuk

Abstract Interpretation of QSAR models is useful to understand the complex nature of biological or physicochemical processes, guide structural optimization or perform knowledge-based validation of QSAR models. Highly predictive models are usually complex and their interpretation is non-trivial. This is particularly true for modern neural networks. Various approaches to interpretation of these models exist. However, it is difficult to evaluate and compare performance and applicability of these ever-emerging methods. Herein, we developed several benchmark data sets with end-points determined by pre-defined patterns. These data sets are purposed for evaluation of the ability of interpretation approaches to retrieve these patterns. They represent tasks with different complexity levels: from simple atom-based additive properties to pharmacophore hypotheses. We proposed several quantitative metrics of interpretation performance. Applicability of benchmarks and metrics was demonstrated on a set of conventional models and end-to-end graph convolutional neural networks interpreted by the previously suggested universal ML-agnostic approach for structural interpretation. We anticipate these benchmarks to be useful in evaluation of new interpretation approaches and investigation of decision making of complex “black box” models.

2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Mariia Matveieva ◽  
Pavel Polishchuk

AbstractInterpretation of QSAR models is useful to understand the complex nature of biological or physicochemical processes, guide structural optimization or perform knowledge-based validation of QSAR models. Highly predictive models are usually complex and their interpretation is non-trivial. This is particularly true for modern neural networks. Various approaches to interpretation of these models exist. However, it is difficult to evaluate and compare performance and applicability of these ever-emerging methods. Herein, we developed several benchmark data sets with end-points determined by pre-defined patterns. These data sets are purposed for evaluation of the ability of interpretation approaches to retrieve these patterns. They represent tasks with different complexity levels: from simple atom-based additive properties to pharmacophore hypothesis. We proposed several quantitative metrics of interpretation performance. Applicability of benchmarks and metrics was demonstrated on a set of conventional models and end-to-end graph convolutional neural networks, interpreted by the previously suggested universal ML-agnostic approach for structural interpretation. We anticipate these benchmarks to be useful in evaluation of new interpretation approaches and investigation of decision making of complex “black box” models.


Author(s):  
Mahmoud A. Al-Sha'er ◽  
Mutasem O. Taha

Introduction: Tyrosine threonine kinase (TTK1) is a key regulator of chromosome segregation. TTK targeting received recent concern for the enhancement of possible anticancer therapies. Objective: In this regard we employed our well-known method of QSAR-guided selection of best crystallographic pharmacophore(s) to discover considerable binding interactions that anchore inhibitors into TTK1 binding site. Method:Sixtyone TTK1 crystallographic complexes were used to extract 315 pharmacophore hypotheses. QSAR modeling was subsequently used to choose a single crystallographic pharmacophore that when combined with other physicochemical descriptors elucidates bioactivity discrepancy within a list of 55 miscellaneous inhibitors. Results: The best QSAR model was robust and predictive (r2(55) = 0.75, r2LOO = 0.72 , r2press against external testing list of 12 compounds = 0.67), Standard error of estimate (training set) (S)= 0.63 , Standard error of estimate (testing set)(Stest) = 0.62. The resulting pharmacophore and QSAR models were used to scan the National Cancer Institute (NCI) database for new TTK1 inhibitors. Conclusion: Five hits confirmed significant TTK1 inhibitory profiles with IC50 values ranging between 11.7 and 76.6 micM.


2020 ◽  
Vol 6 ◽  
Author(s):  
Jaime de Miguel Rodríguez ◽  
Maria Eugenia Villafañe ◽  
Luka Piškorec ◽  
Fernando Sancho Caparrini

Abstract This work presents a methodology for the generation of novel 3D objects resembling wireframes of building types. These result from the reconstruction of interpolated locations within the learnt distribution of variational autoencoders (VAEs), a deep generative machine learning model based on neural networks. The data set used features a scheme for geometry representation based on a ‘connectivity map’ that is especially suited to express the wireframe objects that compose it. Additionally, the input samples are generated through ‘parametric augmentation’, a strategy proposed in this study that creates coherent variations among data by enabling a set of parameters to alter representative features on a given building type. In the experiments that are described in this paper, more than 150 k input samples belonging to two building types have been processed during the training of a VAE model. The main contribution of this paper has been to explore parametric augmentation for the generation of large data sets of 3D geometries, showcasing its problems and limitations in the context of neural networks and VAEs. Results show that the generation of interpolated hybrid geometries is a challenging task. Despite the difficulty of the endeavour, promising advances are presented.


Author(s):  
Kamyab Keshtkar

As a relatively high percentage of adenoma polyps are missed, a computer-aided diagnosis (CAD) tool based on deep learning can aid the endoscopist in diagnosing colorectal polyps or colorectal cancer in order to decrease polyps missing rate and prevent colorectal cancer mortality. Convolutional Neural Network (CNN) is a deep learning method and has achieved better results in detecting and segmenting specific objects in images in the last decade than conventional models such as regression, support vector machines or artificial neural networks. In recent years, based on the studies in medical imaging criteria, CNN models have acquired promising results in detecting masses and lesions in various body organs, including colorectal polyps. In this review, the structure and architecture of CNN models and how colonoscopy images are processed as input and converted to the output are explained in detail. In most primary studies conducted in the colorectal polyp detection and classification field, the CNN model has been regarded as a black box since the calculations performed at different layers in the model training process have not been clarified precisely. Furthermore, I discuss the differences between the CNN and conventional models, inspect how to train the CNN model for diagnosing colorectal polyps or cancer, and evaluate model performance after the training process.


2020 ◽  
Author(s):  
Thibaut Sellinger ◽  
Diala Abu Awad ◽  
Aurélien Tellier

AbstractMany methods based on the Sequentially Markovian Coalescent (SMC) have been and are being developed. These methods make use of genome sequence data to uncover population demographic history. More recently, new methods have extended the original theoretical framework, allowing the simultaneous estimation of the demographic history and other biological variables. These methods can be applied to many different species, under different model assumptions, in hopes of unlocking the population/species evolutionary history. Although convergence proofs in particular cases have been given using simulated data, a clear outline of the performance limits of these methods is lacking. We here explore the limits of this methodology, as well as present a tool that can be used to help users quantify what information can be confidently retrieved from given datasets. In addition, we study the consequences for inference accuracy violating the hypotheses and the assumptions of SMC approaches, such as the presence of transposable elements, variable recombination and mutation rates along the sequence and SNP call errors. We also provide a new interpretation of the SMC through the use of the estimated transition matrix and offer recommendations for the most efficient use of these methods under budget constraints, notably through the building of data sets that would be better adapted for the biological question at hand.


Sign in / Sign up

Export Citation Format

Share Document