Deep Imputation on Large-Scale Drug Discovery Data

Author(s):  
Benedict Irwin ◽  
Thomas Whitehead ◽  
Scott Rowland ◽  
Samar Mahmoud ◽  
Gareth Conduit ◽  
...  

More accurate predictions of the biological properties of chemical compounds would guide the selection and design of new compounds in drug discovery and help to address the enormous cost and low success rate of pharmaceutical R&D. However, this domain presents a significant challenge for AI methods due to the sparsity of compound data and the noise inherent in results from biological experiments. In this paper, we demonstrate how data imputation using deep learning provides substantial improvements over quantitative structure-activity relationship (QSAR) machine learning models that are widely applied in drug discovery. We present the largest-to-date successful application of deep-learning imputation to datasets comparable in size to the corporate data repository of a pharmaceutical company (678,994 compounds by 1166 endpoints). We demonstrate this improvement for three areas of practical application linked to distinct use cases: (i) target activity data compiled from a range of drug discovery projects; (ii) a high-value, heterogeneous dataset covering complex absorption, distribution, metabolism, and elimination properties; and (iii) high-throughput screening data, testing the algorithm’s limits on early-stage noisy and very sparse data. Achieving median coefficients of determination, R², of 0.69, 0.36, and 0.43 respectively across these applications, the deep learning imputation method offers an unambiguous improvement over random forest QSAR methods, which achieve median R² values of 0.28, 0.19, and 0.23 respectively. We also demonstrate that robust estimates of the uncertainties in the predicted values correlate strongly with the accuracies in prediction, enabling greater confidence in decision-making based on the imputed values.



2021 ◽  
Vol 28 ◽  
Author(s):  
Jannis Born ◽  
Matteo Manica

It is more pressing than ever to reduce the time and costs of developing lead compounds in the pharmaceutical industry. The co-occurrence of advances in high-throughput screening and the rise of deep learning (DL) have enabled the development of large-scale multimodal predictive models for virtual drug screening. Recently, deep generative models have emerged as a powerful tool for exploring chemical space, raising hopes of expediting the drug discovery process. Following this progress in chemocentric approaches to generative chemistry, the next challenge is to build multimodal conditional generative models that leverage disparate knowledge sources when mapping biochemical properties to target structures. Here, we call the community to bridge drug discovery more closely with systems biology when designing deep generative models. Complementing the plethora of reviews on the role of DL in chemoinformatics, we herein specifically focus on the interface of predictive and generative modeling for drug discovery. Through a systematic publication keyword search on PubMed and a selection of preprint servers (arXiv, bioRxiv, ChemRxiv, and medRxiv), we quantify trends in the field and find that molecular graphs and VAEs have become the most widely adopted molecular representations and architectures in generative models, respectively. We discuss progress on DL for toxicity, drug-target affinity, and drug sensitivity prediction, and specifically focus on conditional molecular generative models that encompass multimodal prediction models. Moreover, we outline prospects in the field and identify challenges such as the integration of deep learning systems into experimental workflows in a closed-loop manner and the adoption of federated machine learning techniques to overcome data sharing barriers.
Other challenges include, but are not limited to, interpretability in generative models; more sophisticated metrics for the evaluation of molecular generative models; and, following on from that, community-accepted benchmarks for both multimodal drug property prediction and property-driven molecular design.


2019 ◽  
Vol 25 (1) ◽  
pp. 9-20 ◽  
Author(s):  
Olivia W. Lee ◽  
Shelley Austin ◽  
Madison Gamma ◽  
Dorian M. Cheff ◽  
Tobie D. Lee ◽  
...  

Cell-based phenotypic screening is a commonly used approach to discover biological pathways, novel drug targets, chemical probes, and high-quality hit-to-lead molecules. Many hits identified from high-throughput screening campaigns are ruled out through a series of follow-up potency, selectivity/specificity, and cytotoxicity assays. Prioritization of molecules with little or no cytotoxicity for downstream evaluation can influence the future direction of projects, so cytotoxicity profiling of screening libraries at an early stage is essential for increasing the likelihood of candidate success. In this study, we assessed the cell-based cytotoxicity of nearly 10,000 compounds in the National Institutes of Health, National Center for Advancing Translational Sciences annotated libraries and more than 100,000 compounds in a diversity library against four normal cell lines (HEK 293, NIH 3T3, CRL-7250, and HaCat) and one cancer cell line (KB 3-1, a HeLa subline). This large-scale library profiling was analyzed for overall screening outcomes, hit rates, pan-activity, and selectivity. For the annotated library, we also examined the primary targets and mechanistic pathways regularly associated with cell death. To our knowledge, this is the first study to use high-throughput screening to profile a large screening collection (>100,000 compounds) for cytotoxicity in both normal and cancer cell lines. The results generated here constitute a valuable resource for the scientific community and provide insight into the extent of cytotoxic compounds in screening libraries, allowing for the identification and avoidance of compounds with cytotoxicity during high-throughput screening campaigns.
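The profiling metrics named above (hit rates, pan-activity, and selectivity) can be sketched for a toy activity matrix. The compound readouts, the % inhibition scale, and the 50% activity cutoff below are all hypothetical illustrations, not data or thresholds from the study:

```python
import numpy as np

# Hypothetical example: rows are compounds, columns are cell lines
# (four normal lines plus one cancer line); values are % inhibition.
cell_lines = ["HEK 293", "NIH 3T3", "CRL-7250", "HaCat", "KB 3-1"]
inhibition = np.array([
    [85, 80, 78, 90, 88],   # pan-active (cytotoxic in every line)
    [ 5, 10,  2,  8, 75],   # selective for the cancer line
    [ 3,  1,  4,  2,  6],   # inactive everywhere
    [60,  2,  5,  1,  3],   # active in a single normal line
])

THRESHOLD = 50.0  # assumed activity cutoff in % inhibition
hits = inhibition >= THRESHOLD

# Per-cell-line hit rate across the library.
hit_rate = hits.mean(axis=0)

# Pan-activity: compounds flagged as hits in every cell line.
pan_active = hits.all(axis=1)

# Cancer selectivity: hit in KB 3-1 but in none of the normal lines.
cancer_selective = hits[:, 4] & ~hits[:, :4].any(axis=1)

print(dict(zip(cell_lines, hit_rate)))
print("pan-active:", pan_active.sum(), "cancer-selective:", cancer_selective.sum())
```

On this toy matrix, one compound is pan-active and one is selectively cytotoxic to the cancer line; at library scale the same boolean reductions yield the screening-outcome summaries the study reports.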


Author(s):  
Jingyan Qiu ◽  
Linjian Li ◽  
Yida Liu ◽  
Yingjun Ou ◽  
Yubei Lin

Alzheimer’s disease (AD) is one of the most common forms of dementia. The early stage of the disease is defined as Mild Cognitive Impairment (MCI). Recent research has shown the promise of combining Magnetic Resonance Imaging (MRI) scans of the brain with deep learning to diagnose AD. However, CNN-based deep learning models require large numbers of samples for training, and transfer learning is the key to achieving high accuracy when only limited training data are available. In this paper, DenseNet and Inception V4, pre-trained on the ImageNet dataset to obtain initial weight values, are each applied to the image classification task. An ensemble method is employed to enhance the effectiveness and efficiency of the classification models, and the outputs of the different models are combined through probability-based fusion. Our experiments were conducted entirely on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) public dataset. Only the ternary classification into AD, MCI, and Normal Control (NC) is performed, reflecting the higher demands of medical detection and diagnosis. The accuracies of the different models on AD/MCI/NC classification are evaluated in this paper. The experimental results show that the method achieved a maximum accuracy of 92.65%, a remarkable outcome compared with state-of-the-art methods.
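The probability-based fusion step can be illustrated with a minimal sketch. The softmax outputs below are hypothetical, and averaging the class probabilities before taking the argmax is one common realization of probability-based fusion, not necessarily the exact rule used in the paper:

```python
import numpy as np

# Hypothetical softmax outputs over the three classes (AD, MCI, NC)
# from two independently trained models, for a batch of two scans.
probs_densenet = np.array([[0.70, 0.20, 0.10],
                           [0.10, 0.30, 0.60]])
probs_inception = np.array([[0.60, 0.30, 0.10],
                            [0.20, 0.50, 0.30]])

# Probability-based fusion: average the class probabilities, then
# take the most probable class for each scan.
fused = (probs_densenet + probs_inception) / 2.0
labels = np.array(["AD", "MCI", "NC"])
predictions = labels[fused.argmax(axis=1)]
print(predictions)  # prints ['AD' 'NC']
```

Averaging probabilities (rather than hard votes) lets a confident model outvote an uncertain one, which is why this style of fusion typically improves on any single classifier.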


Molecules ◽  
2020 ◽  
Vol 25 (22) ◽  
pp. 5277
Author(s):  
Lauv Patel ◽  
Tripti Shukla ◽  
Xiuzhen Huang ◽  
David W. Ussery ◽  
Shanzhi Wang

The advancements of information technology and related processing techniques have created a fertile base for progress in many scientific fields and industries. In the fields of drug discovery and development, machine learning techniques have been used for the development of novel drug candidates. The methods for designing drug targets and discovering novel drugs now routinely combine machine learning and deep learning algorithms to enhance the efficiency, efficacy, and quality of developed outputs. The generation and incorporation of big data, through technologies such as high-throughput screening and high-throughput computational analysis of databases used for both lead and target discovery, have increased the reliability of techniques that incorporate machine learning and deep learning. The use of virtual screening and the wealth of available online information has also been highlighted in the development of lead synthesis pathways. In this review, machine learning and deep learning algorithms utilized in drug discovery, together with associated techniques, are discussed, and applications and methods that produce promising results are reviewed.


Author(s):  
Zhenxing Wu ◽  
Dejun Jiang ◽  
Chang-Yu Hsieh ◽  
Guangyong Chen ◽  
Ben Liao ◽  
...  

Accurate predictions of druggability and bioactivities of compounds are desirable to reduce the high cost and time of drug discovery. After more than five decades of continuing development, quantitative structure–activity relationship (QSAR) methods have been established as indispensable tools that facilitate fast, reliable and affordable assessments of the physicochemical and biological properties of compounds in drug-discovery programs. Currently, there are mainly two types of QSAR methods: descriptor-based methods and graph-based methods. The former are developed based on predefined molecular descriptors, whereas the latter are developed based on simple atomic and bond information. In this study, we present a simple but highly efficient modeling method that combines molecular graphs and molecular descriptors as the input of a modified graph neural network, called hyperbolic relational graph convolution network plus (HRGCN+). The evaluation results show that HRGCN+ achieves state-of-the-art performance on 11 drug-discovery-related datasets. We also explored the impact of adding traditional molecular descriptors to the inputs of graph-based methods, and found that their addition can indeed boost the predictive power of graph-based methods. The results also highlight the strong anti-noise capability of our method. In addition, our method provides a way to interpret models at both the atom and descriptor levels, which can help medicinal chemists extract hidden information from complex datasets. We also offer an online prediction service for HRGCN+ at https://quantum.tencent.com/hrgcn/.


2012 ◽  
Vol 17 (4) ◽  
pp. 519-529 ◽  
Author(s):  
Michael Prummer

Following the success of small-molecule high-throughput screening (HTS) in drug discovery, other large-scale screening techniques are currently revolutionizing the biological sciences. Powerful new statistical tools have been developed to analyze the vast amounts of data in DNA chip studies, but have not yet found their way into compound screening. In HTS, characterization of single-point hit lists is often done only in retrospect, after the results of confirmation experiments are available. However, for prioritization, for optimal use of resources, for quality control, and for comparison of screens, it would be extremely valuable to predict the rates of false positives and false negatives directly from the primary screening results. Making full use of the available information about compounds and controls contained in HTS results and replicated pilot runs, the Z score, and from it the p value, can be estimated for each measurement. Based on this consideration, we have applied the concept of p-value distribution analysis (PVDA), originally developed for gene expression studies, to HTS data. PVDA allowed prediction of all relevant error rates as well as the rate of true inactives, and excellent agreement with confirmation experiments was found.
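The per-measurement Z score and p value mentioned above can be sketched as follows, assuming a normal null distribution estimated from neutral controls and a convention in which activity lowers the readout. The readouts are hypothetical, and this is only the first step of PVDA, not the full distribution analysis:

```python
import math
import statistics

# Hypothetical raw readouts: neutral controls define the inactive
# (null) distribution; samples are the screened compounds.
controls = [100.2, 98.7, 101.5, 99.9, 100.8, 99.1, 100.4, 98.9]
samples = [99.5, 92.0, 100.7, 85.3]

mu = statistics.mean(controls)
sigma = statistics.stdev(controls)

def z_and_p(x):
    """Z score of a readout relative to the control distribution, and
    the one-sided p value under a normal null (lower tail = more
    active, assuming activity decreases the readout)."""
    z = (x - mu) / sigma
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Phi(z)
    return z, p

for x in samples:
    z, p = z_and_p(x)
    print(f"readout={x:6.1f}  z={z:+7.2f}  p={p:.2e}")
```

A readout near the control mean yields p ≈ 0.5, while strongly reduced readouts yield very small p values; PVDA then analyzes the distribution of these p values across the whole screen to estimate false-positive and false-negative rates.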


2018 ◽  
Author(s):  
Olivia W. Lee ◽  
Shelley Austin ◽  
Madison Gamma ◽  
Dorian M. Cheff ◽  
Tobie D. Lee ◽  
...  

Cell-based phenotypic screening is a commonly used approach to discover biological pathways, novel drug targets, chemical probes and high-quality hit-to-lead molecules. Many hits identified from high-throughput screening campaigns are ruled out through a series of follow-up potency, selectivity/specificity, and cytotoxicity assays. Prioritization of molecules with little or no cytotoxicity for downstream evaluation can influence the future direction of projects, so cytotoxicity profiling of screening libraries at an early stage is essential for increasing the likelihood of candidate success. In this study, we assessed the cell-based cytotoxicity of nearly 10,000 compounds in NCATS annotated libraries, and over 100,000 compounds in a diversity library, against four ‘normal’ cell lines (HEK 293, NIH 3T3, CRL-7250 and HaCat) and one cancer cell line (KB 3-1, a HeLa subline). This large-scale library profiling was analyzed for overall screening outcomes, hit rates, pan-activity and selectivity. For the annotated library, we also examined the primary targets and mechanistic pathways regularly associated with cell death. To our knowledge, this is the first study to use high-throughput screening to profile a large screening collection (>100,000 compounds) for cytotoxicity in both normal and cancer cell lines. The results generated here constitute a valuable resource for the scientific community and provide insight into the extent of cytotoxic compounds in screening libraries, enabling the identification and avoidance of compounds with cytotoxicity during high-throughput screening campaigns.


2001 ◽  
Vol 73 (9) ◽  
pp. 1487-1498 ◽  
Author(s):  
Ferenc Darvas ◽  
Gyorgy Dorman ◽  
Laszlo Urge ◽  
Istvan Szabo ◽  
Zsolt Ronai ◽  
...  

In the age of high-throughput screening and combinatorial chemistry, the focus of drug discovery is to replace the sequential approach with the most effective parallel approach. With the completion of the human gene map, understanding and healing a disease require the integration of genomics, proteomics, and, very recently, metabolomics, with early utilization of diverse small-molecule libraries to create a more powerful "total" drug discovery approach. In this post-genomic era, there is an enhanced demand for information-enriched combinatorial libraries that are high-quality, chemically and physiologically stable, diverse, and supported by measured and predicted data. Furthermore, specific marker libraries could be used for early functional profiling of the genome, proteome, and metabolome. In this new operating model, called "combinatorial chemical genomics", an optimal combination of the marker and high-quality libraries provides a novel synergy for the drug discovery process at a very early stage.


2008 ◽  
Vol 14 (1) ◽  
pp. 66-76 ◽  
Author(s):  
Isabel Coma ◽  
Liz Clark ◽  
Emilio Diez ◽  
Gavin Harper ◽  
Jesus Herranz ◽  
...  

The use of large-scale compound screening has become a key component of drug discovery projects in both the pharmaceutical and biotechnological industries. More recently, these activities have also been embraced by the academic community as a major tool for chemical genomics activities. High-throughput screening (HTS) constitutes a major step in initial drug discovery efforts and involves the use of large quantities of biological reagents, hundreds of thousands to millions of compounds, and expensive equipment. All these factors make it very important to evaluate, in advance of an HTS campaign, any potential issues related to the reproducibility of the experimentation and the quality of the results obtained at the end of these very costly activities. In this article, the authors describe how GlaxoSmithKline (GSK) has addressed the need for true validation of the HTS process before embarking on full HTS campaigns. They present two different aspects of the so-called validation process: (1) optimization of the HTS workflow and its validation as a quality process, and (2) statistical evaluation of the HTS, focusing on the reproducibility of results and the ability to distinguish active from nonactive compounds in a vast collection of samples. The authors describe a variety of reproducibility indexes that are either innovative or adapted from generic medical diagnostic screening strategies. In addition, they exemplify how these validation tools have been implemented in a number of case studies at GSK. (Journal of Biomolecular Screening 2009:66-76)

