Inductive Transfer Learning for Molecular Activity Prediction: Next-Gen QSAR Models with MolPMoFiT

Author(s):  
Xinhao Li ◽  
Denis Fourches

Deep neural networks can directly learn from chemical structures without extensive, user-driven selection of descriptors in order to predict molecular properties/activities with high reliability. But these approaches typically require large training sets to learn the endpoint-specific structural features and ensure reasonable prediction accuracy. Even though large datasets are becoming the new normal in drug discovery, especially when it comes to high-throughput screening or metabolomics datasets, one should also consider smaller datasets with challenging endpoints to model and forecast. Thus, it would be highly relevant to better utilize the tremendous compendium of unlabeled compounds from publicly-available datasets for improving the model performances for the user’s particular series of compounds. In this study, we propose the <b>Mol</b>ecular <b>P</b>rediction <b>Mo</b>del <b>Fi</b>ne-<b>T</b>uning (<b>MolPMoFiT</b>) approach, an effective transfer learning method based on self-supervised pre-training + task-specific fine-tuning for QSPR/QSAR modeling. A large-scale molecular structure prediction model is pre-trained using one million unlabeled molecules from ChEMBL in a self-supervised learning manner, and can then be fine-tuned on various QSPR/QSAR tasks for smaller chemical datasets with specific endpoints. Herein, the method is evaluated on four benchmark datasets (lipophilicity, FreeSolv, HIV, and blood-brain barrier penetration). The results showed the method can achieve strong performances for all four datasets compared to other state-of-the-art machine learning modeling techniques reported in the literature so far.
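The self-supervised pre-training step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how molecules are encoded or which objective is used, so this example assumes (hypothetically) that molecules are written as SMILES strings and that the model is trained to predict the next token from the preceding ones. The helper names `tokenize` and `next_token_pairs`, the `<bos>`/`<eos>` markers, and the simplified tokenization rule are illustrative choices, not the authors' implementation.

```python
def tokenize(smiles):
    """Split a SMILES string into tokens (simplified, assumed scheme).

    Two-letter halogens (Cl, Br) are kept whole; every other
    character becomes its own token.
    """
    tokens = []
    i = 0
    while i < len(smiles):
        if smiles[i:i + 2] in ("Cl", "Br"):
            tokens.append(smiles[i:i + 2])
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens


def next_token_pairs(smiles):
    """Build (context, target) pairs for next-token prediction.

    No labels are needed: each target is simply the token that
    follows the context in the unlabeled molecule itself.
    """
    tokens = ["<bos>"] + tokenize(smiles) + ["<eos>"]
    return [(tokens[:k], tokens[k]) for k in range(1, len(tokens))]


# Ethanol as a toy unlabeled molecule
pairs = next_token_pairs("CCO")
for context, target in pairs:
    print(context, "->", target)
```

Because the targets come from the molecules themselves, any collection of unlabeled structures can supply training pairs; a model pre-trained this way would then be fine-tuned on the small labeled dataset of interest.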


2019 ◽  
Author(s):  
Xinhao Li ◽  
Denis Fourches

Deep neural networks can directly learn from chemical structures without extensive, user-driven selection of descriptors in order to predict molecular properties/activities with high reliability. But these approaches typically require very large training sets to truly learn the best endpoint-specific structural features and ensure reasonable prediction accuracy. Even though large datasets are becoming the new normal in drug discovery, especially when it comes to high-throughput screening or metabolomics datasets, one should also consider smaller datasets with very challenging endpoints to model and forecast. Thus, it would be highly relevant to better utilize the tremendous compendium of unlabeled compounds from publicly-available datasets for improving the model performances for the user’s particular series of compounds. In this study, we propose the <b>Mol</b>ecular <b>P</b>rediction <b>Mo</b>del <b>Fi</b>ne-<b>T</b>uning (<b>MolPMoFiT</b>) approach, an effective transfer learning method that can be applied to any QSPR/QSAR problem. A large-scale molecular structure prediction model is pre-trained using one million unlabeled molecules from ChEMBL in a self-supervised learning manner, and can then be fine-tuned on various QSPR/QSAR tasks for smaller chemical datasets with specific endpoints. Herein, the method is evaluated on three benchmark datasets (lipophilicity, HIV, and blood-brain barrier penetration). The results showed the method can achieve comparable or better prediction performances on all three datasets compared to <i>state-of-the-art</i> prediction techniques reported in the literature so far.


2019 ◽  
Author(s):  
Mohammad Atif Faiz Afzal ◽  
Mojtaba Haghighatlari ◽  
Sai Prasad Ganesh ◽  
Chong Cheng ◽  
Johannes Hachmann

We present a high-throughput computational study to identify novel polyimides (PIs) with exceptional refractive index (RI) values for use as optic or optoelectronic materials. Our study utilizes an RI prediction protocol based on a combination of first-principles and data modeling developed in previous work, which we employ on a large-scale PI candidate library generated with the ChemLG code. We deploy the virtual screening software ChemHTPS to automate the assessment of this extensive pool of PI structures in order to determine the performance potential of each candidate. This rapid and efficient approach yields a number of highly promising lead compounds. Using the data mining and machine learning program package ChemML, we analyze the top candidates with respect to prevalent structural features and feature combinations that distinguish them from less promising ones. In particular, we explore the utility of various strategies that introduce highly polarizable moieties into the PI backbone to increase its RI yield. The derived insights provide a foundation for rational and targeted design that goes beyond traditional trial-and-error searches.


2021 ◽  
Author(s):  
Jeremy Feinstein ◽  
Ganesh Sivaraman ◽  
Kurt Picel ◽  
Brian Peters ◽  
Alvaro Vazquez-Mayagoitia ◽  
...  

In this article, we present our recent study on a computational methodology for predicting the toxicity of PFAS, known as “forever chemicals,” from their chemical structures through the evaluation of multiple machine learning methods. To address the scarcity of PFAS toxicity data, we investigated a deep “transfer learning” method that leverages toxicity information from across the entire organic chemical domain. We also developed an uncertainty-informed workflow incorporating the SelectiveNet architecture, which can help guide future high-throughput screening using knowledge of chemical structures.



2005 ◽  
Vol 11 (1) ◽  
pp. 48-56 ◽  
Author(s):  
Kurumi Y. Horiuchi ◽  
Yuan Wang ◽  
Scott L. Diamond ◽  
Haiching Ma

A central challenge in chemical biology is profiling the activity of a large number of chemical structures against hundreds of biological targets, such as kinases. Conventional 32P-incorporation or immunoassay of phosphorylated residues produces high-quality signals for monitoring kinase reactions but is difficult to use in high-throughput screening (HTS) because of cost and the need for well-plate washing. The authors report a method for densely archiving compounds in nanodroplets on peptide or protein substrate-coated microarrays for subsequent profiling by aerosol deposition of kinases. Each microarray contains over 6000 reaction centers (1.0 nL each) whose phosphorylation progress can be detected by immunofluorescence. For p60c-src, the microarray produced a signal-to-background ratio of 36.3 and Z' factor of 0.63 for HTS and accurate enzyme kinetic parameters (Km for ATP = 3.3 µM) and IC50 values for staurosporine (210 nM) and PP2 (326 nM) at 10 µM adenosine triphosphate (ATP). Similarly, B-Raf phosphorylation of MEK-coated microarrays was inhibited in the nanoliter reactions by GW5074 at the expected IC50 of 9 nM. Common kinase inhibitors were printed on microarrays, and their inhibitory activities were systematically profiled against B-Raf (V599E), KDR, Met, Flt-3 (D835Y), Lyn, EGFR, PDGFRβ, and Tie2. All results indicate that this platform is well suited for kinetic analysis, HTS, large-scale IC50 determinations, and selectivity profiling.
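For readers unfamiliar with the Z' factor quoted above, the following minimal sketch shows how it is commonly computed from positive and negative control wells, using the standard definition Z' = 1 - 3(σ_pos + σ_neg) / |μ_pos - μ_neg|. The function name `z_prime` and the control readings are hypothetical illustrations; they are not data from the study, and the abstract does not say how the authors computed their value.

```python
from statistics import mean, stdev


def z_prime(positive, negative):
    """Z' factor from positive- and negative-control signals.

    Uses sample standard deviations; values closer to 1 indicate
    better separation between the two control populations.
    """
    separation = abs(mean(positive) - mean(negative))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    return 1.0 - 3.0 * (stdev(positive) + stdev(negative)) / separation


# Hypothetical control-well signals (arbitrary fluorescence units)
positive_controls = [100.0, 104.0, 96.0, 102.0, 98.0]
negative_controls = [10.0, 12.0, 8.0, 11.0, 9.0]

z = z_prime(positive_controls, negative_controls)
print(f"Z' = {z:.2f}")
```

A Z' between 0.5 and 1 is conventionally regarded as indicating an assay suitable for HTS, which is the range the reported value of 0.63 falls in.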


2020 ◽  
Vol 34 (01) ◽  
pp. 115-122 ◽  
Author(s):  
Baijun Ji ◽  
Zhirui Zhang ◽  
Xiangyu Duan ◽  
Min Zhang ◽  
Boxing Chen ◽  
...  

Transfer learning between different language pairs has shown its effectiveness for Neural Machine Translation (NMT) in low-resource scenarios. However, existing transfer methods involving a common target language are far from successful in the extreme scenario of zero-shot translation, due to the language space mismatch problem between transferor (the parent model) and transferee (the child model) on the source side. To address this challenge, we propose an effective transfer learning approach based on cross-lingual pre-training. Our key idea is to make all source languages share the same feature space and thus enable a smooth transition for zero-shot translation. To this end, we introduce one monolingual pre-training method and two bilingual pre-training methods to obtain a universal encoder for different languages. Once the universal encoder is constructed, the parent model built on this encoder is trained with large-scale annotated data and then directly applied in the zero-shot translation scenario. Experiments on two public datasets show that our approach significantly outperforms a strong pivot-based baseline and various multilingual NMT approaches.


2021 ◽  
Author(s):  
Geoffrey F. Schau ◽  
Hassan Ghani ◽  
Erik A. Burlingame ◽  
Guillaume Thibault ◽  
Joe W. Gray ◽  
...  

Accurate diagnosis of metastatic cancer is essential for prescribing optimal control strategies to halt further spread of metastasizing disease. While pathological inspection aided by immunohistochemistry staining provides a valuable gold standard for clinical diagnostics, deep learning methods have emerged as powerful tools for identifying clinically relevant features of whole slide histology relevant to a tumor’s metastatic origin. Although deep learning models require significant training data to learn effectively, transfer learning paradigms provide mechanisms to circumvent limited training data by first training a model on related data prior to fine-tuning on smaller data sets of interest. In this work we propose a transfer learning approach that trains a convolutional neural network to infer the metastatic origin of tumor tissue from whole slide images of hematoxylin and eosin (H&E) stained tissue sections and illustrate the advantages of pre-training the network on whole slide images of primary tumor morphology. We further characterize statistical dissimilarity between primary and metastatic tumors of various indications on patch-level images to highlight limitations of our indication-specific transfer learning approach. Using a primary-to-metastatic transfer learning approach, we achieved a mean class-specific area under the receiver operating characteristic curve (AUROC) of 0.779, which outperformed comparable models trained on only images of primary tumor (mean AUROC of 0.691) or trained on only images of metastatic tumor (mean AUROC of 0.675), supporting the use of large-scale primary tumor imaging data in developing computer vision models to characterize metastatic origin of tumor lesions.



2017 ◽  
Author(s):  
Qing Liu ◽  
Chun Wang ◽  
Xiaozhen Jiao ◽  
Huawei Zhang ◽  
Lili Song ◽  
...  

The CRISPR/Cas system has been extensively applied to make precise genetic modifications in various organisms. Despite its importance and widespread use, large-scale mutation screening remains time-consuming, labour-intensive and costly. Here, we describe a cheap, practicable and high-throughput screening strategy that allows parallel screening of 96 × N (N denotes the number of targets) genome-modified sites. The strategy simplified and streamlined the process of next-generation sequencing (NGS) library construction by fixing the bridge sequences and barcoding primers. We also developed Hi-TOM (available at http://www.hi-tom.net/hi-tom/), an online tool to track the mutations with precise percentage. Analysis of the samples from rice, hexaploid wheat and human cells reveals that the Hi-TOM tool has high reliability and sensitivity in tracking various mutations, especially complex chimeric mutations that are frequently induced by genome editing. Hi-TOM does not require special design of barcode primers, cumbersome parameter configuration or additional data analysis. Thus, the streamlined NGS library construction and comprehensive result output make Hi-TOM particularly suitable for high-throughput identification of all types of mutations induced by CRISPR/Cas systems.

