SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning

10.26434/chemrxiv.12339368.v1 ◽

2020 ◽

Author(s):

Xinhao Li ◽

Denis Fourches

Keyword(s):

Deep Learning ◽

Prediction Models ◽

Data Driven ◽

Learning Models ◽

Generation Task ◽

Property Prediction ◽

Important Research Topic ◽

Benchmark Datasets ◽

Atom Level ◽

Python Package

SMILES-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES Pair Encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for deep learning models. As a result, SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance for both molecular generation and property prediction tasks. In the molecular generation task, SPE can boost the validity and novelty of generated SMILES. The molecular property prediction models were evaluated using 24 benchmark datasets, where SPE consistently either matched or outperformed atom-level tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is now available at https://github.com/XinhaoLi74/SmilesPE.
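The two steps described in this abstract (learning a vocabulary of frequent substrings, then tokenizing with it) can be illustrated with a minimal pair-encoding sketch. This is not the SmilesPE package's API: the simplified atom-level regular expression, the function names `atom_tokenize`, `learn_merges`, `apply_merge`, and `spe_tokenize`, and the `min_count` threshold are all assumptions made for illustration, following the generic byte-pair-encoding recipe of repeatedly merging the most frequent adjacent token pair.

```python
import re
from collections import Counter

# Simplified atom-level pattern (an assumption, not the package's own):
# bracket atoms, two-letter halogens, two-digit ring closures, else one character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one joined token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_count=2):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < min_count:
            break  # remaining pairs are too rare to become vocabulary entries
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Tokenize at atom level, then apply the learned merges in learned order."""
    seq = atom_tokenize(smiles)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Toy corpus standing in for a large dataset such as ChEMBL.
corpus = ["CC(=O)O", "CC(=O)N", "CC(=O)Cl", "c1ccccc1C(=O)O"]
merges = learn_merges(corpus, num_merges=5)
print(spe_tokenize("CC(=O)O", merges))
```

Because merges only join adjacent tokens, concatenating the output tokens always reproduces the input SMILES, so the atom-level tokens remain available as a fallback for substrings that never became frequent enough to merge.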

Comparison of deep learning with traditional models to predict preventable acute care use and spending among heart failure patients

European Heart Journal ◽

10.1093/eurheartj/ehab724.3048 ◽

2021 ◽

Vol 42 (Supplement_1) ◽

Author(s):

M Lewis ◽

J Figueroa

Keyword(s):

Heart Failure ◽

Deep Learning ◽

Prediction Models ◽

Evaluation Metrics ◽

Gradient Boosting ◽

Learning Models ◽

Private Company ◽

Preventable Hospitalizations ◽

Targeted Interventions ◽

Ed Visits

Recent health reforms have created incentives for cardiologists and accountable care organizations to participate in value-based care models for heart failure (HF). Accurate risk stratification of HF patients is critical to efficiently deploy interventions aimed at reducing preventable utilization. The goal of this paper was to compare deep learning approaches with traditional logistic regression (LR) to predict preventable utilization among HF patients. We conducted a prognostic study using data on 93,260 HF patients continuously enrolled for 2 years in a large U.S. commercial insurer to develop and validate prediction models for three outcomes of interest: preventable hospitalizations, preventable emergency department (ED) visits, and preventable costs. Patients were split into training, validation, and testing samples. Outcomes were modeled using traditional and enhanced LR and compared to a gradient boosting model and deep learning models using sequential and non-sequential inputs. Evaluation metrics included precision (positive predictive value) at k, cost capture, and area under the receiver operating characteristic curve (AUROC). Deep learning models consistently outperformed LR for all three outcomes with respect to the chosen evaluation metrics. Precision at 1% for preventable hospitalizations was 43% for deep learning compared to 30% for enhanced LR. Precision at 1% for preventable ED visits was 39% for deep learning compared to 33% for enhanced LR. For preventable cost, cost capture at 1% was 30% for sequential deep learning, compared to 18% for enhanced LR. The highest AUROCs for deep learning were 0.778, 0.681 and 0.727, respectively. These results offer a promising approach to identify patients for targeted interventions. Funding Acknowledgement: Type of funding sources: Private company. Main funding source(s): internally funded by Diagnostic Robotics Inc.
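The ranking metrics named in this abstract can be sketched as follows. Precision at k is computed as the positive predictive value among the top-scoring fraction of patients, as the abstract states. The abstract does not define cost capture, so the version here (the share of total cost incurred by the top-scoring fraction) is an assumption, as are the function names and the toy numbers.

```python
import math


def top_fraction_indices(scores, fraction):
    """Indices of the highest-scoring `fraction` of patients (at least one)."""
    k = max(1, math.ceil(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def precision_at_k(scores, labels, fraction):
    """Positive predictive value among the top-scoring fraction."""
    top = top_fraction_indices(scores, fraction)
    return sum(labels[i] for i in top) / len(top)


def cost_capture_at_k(scores, costs, fraction):
    """Assumed definition: share of total cost incurred by the top-scoring fraction."""
    top = top_fraction_indices(scores, fraction)
    total = sum(costs)
    return sum(costs[i] for i in top) / total if total else 0.0


# Toy data: four patients with risk scores, binary outcomes, and costs.
scores = [0.9, 0.8, 0.1, 0.4]
labels = [1, 0, 0, 1]
costs = [100.0, 50.0, 30.0, 20.0]
print(precision_at_k(scores, labels, 0.5))
print(cost_capture_at_k(scores, costs, 0.5))
```

Metrics of this kind evaluate only the patients a program could realistically act on, which is why a threshold such as the top 1% appears in the reported results rather than a single population-wide accuracy figure.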

DeepPurpose: a deep learning library for drug–target interaction prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa1005 ◽

2020 ◽

Author(s):

Kexin Huang ◽

Tianfan Fu ◽

Lucas M Glass ◽

Marinka Zitnik ◽

Cao Xiao ◽

...

Keyword(s):

Deep Learning ◽

Drug Target ◽

Prediction Models ◽

State Of The Art ◽

Supplementary Information ◽

Target Interaction ◽

Interaction Prediction ◽

Computer Scientists ◽

Benchmark Datasets ◽

Biomedical Field

Summary: Accurate prediction of drug–target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models have shown promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. Availability and implementation: https://github.com/kexinhuang12345/DeepPurpose. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Empirical study of shallow and deep learning models for sarcasm detection using context in benchmark datasets

Journal of Ambient Intelligence and Humanized Computing ◽

10.1007/s12652-019-01419-7 ◽

2019 ◽

Cited By ~ 3

Author(s):

Akshi Kumar ◽

Geetanjali Garg

Keyword(s):

Deep Learning ◽

Empirical Study ◽

Learning Models ◽

Benchmark Datasets

Visualizing Deep Networks by Optimizing with Integrated Gradients

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6863 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11890-11898

Author(s):

Zhongang Qi ◽

Saeed Khorram ◽

Li Fuxin

Keyword(s):

Deep Learning ◽

State Of The Art ◽

User Needs ◽

Learning Models ◽

Local Optima ◽

Popular Approach ◽

Deep Network ◽

Deep Networks ◽

Benchmark Datasets ◽

Descent Directions

Understanding and interpreting the decisions made by deep learning models is valuable in many domains. In computer vision, computing heatmaps from a deep network is a popular approach for visualizing and understanding deep networks. However, heatmaps that do not correlate with the network may mislead humans; hence, how faithfully a heatmap explains the underlying deep network is crucial. In this paper, we propose I-GOS, which optimizes for a heatmap so that the classification scores on the masked image would maximally decrease. The main novelty of the approach is to compute descent directions based on the integrated gradients instead of the normal gradient, which avoids local optima and speeds up convergence. Compared with previous approaches, our method can flexibly compute heatmaps at any resolution for different user needs. Extensive experiments on several benchmark datasets show that the heatmaps produced by our approach are more correlated with the decision of the underlying deep network, in comparison with other state-of-the-art approaches.
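For readers unfamiliar with integrated gradients, the quantity this abstract uses in place of the normal gradient, the sketch below approximates it for a toy scalar function. It is not the I-GOS mask optimization itself: the function names, the finite-difference gradient, the midpoint-rule integration, and the toy function are assumptions chosen to keep the example dependency-free.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Average the gradient along the straight path from baseline to x,
    then scale each component by (x_i - baseline_i)."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint rule on [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_score(v):
    """Stand-in for a classification score; a real use would call a network."""
    return v[0] ** 2 + 3.0 * v[1]


attributions = integrated_gradients(toy_score, [2.0, 1.0], [0.0, 0.0])
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the score at the input and the score at the baseline.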

Quantitative Toxicity Prediction via Ensembling of Heterogeneous Predictors

10.21203/rs.2.19338/v1 ◽

2019 ◽

Author(s):

Abdul Karim ◽

Vahid Riahi ◽

Avinash Mishra ◽

Abdollah Dehzangi ◽

M. A. Hakim Newton ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Prediction Models ◽

Individual Performance ◽

Learning Model ◽

Data Representation ◽

Toxicity Prediction ◽

Machine Learning Model ◽

Machine Learning Approach ◽

Benchmark Datasets

Representing molecules in the form of only one type of features and using those features to predict their activities is one of the most important approaches for machine-learning-based chemical activity prediction. For molecular activities like quantitative toxicity prediction, the performance depends on the type of features extracted and the machine learning approach used. For such cases, using one type of features and one machine learning model restricts the prediction performance to the specific representation and model used. In this paper, we study quantitative toxicity prediction and propose a machine learning model for the same. Our model uses an ensemble of heterogeneous predictors instead of the typical homogeneous predictors. The predictors that we use vary either in the type of features used or in the deep learning architecture employed. Each of these predictors presumably has its own strengths and weaknesses in terms of toxicity prediction. Our motivation is to make a combined model that utilizes different types of features and architectures to obtain better collective performance that could go beyond the performance of each individual predictor. We use six predictors in our model and test the model on four standard quantitative toxicity benchmark datasets. Experimental results show that our model outperforms the state-of-the-art toxicity prediction models in 8 out of 12 accuracy measures. Our experiments show that ensembling heterogeneous predictors improves the performance over single predictors and homogeneous ensembling of single predictors. The results show that each data representation or deep-learning-based predictor has its own strengths and weaknesses; thus, a model ensembling multiple heterogeneous predictors could go beyond the individual performance of each data representation or each predictor type.

A merged molecular representation learning for molecular properties prediction with a web-based service

Scientific Reports ◽

10.1038/s41598-021-90259-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Hyunseob Kim ◽

Jeongcheol Lee ◽

Sunil Ahn ◽

Jongsuk Ruth Lee

Keyword(s):

Deep Learning ◽

Quantitative Estimation ◽

Chemical Properties ◽

Representation Learning ◽

Fine Tuning ◽

Learning Models ◽

Web Based ◽

Property Prediction ◽

Matrix Embedding ◽

Molecular Properties Prediction

Deep learning has brought dramatic progress in molecular property prediction, which is crucial in the field of drug discovery, using various representations such as fingerprints, SMILES, and graphs. In particular, SMILES is used in various deep learning models via character-based approaches. However, SMILES has a limitation in that it is hard to reflect chemical properties. In this paper, we propose a new self-supervised method to learn SMILES and chemical contexts of molecules simultaneously in pre-training the Transformer. The key to our model is learning structures with adjacency matrix embedding and learning logics that can infer descriptors via Quantitative Estimation of Drug-likeness prediction in pre-training. As a result, our method improves the generalization of the data and achieves the best average performance by benchmarking downstream tasks. Moreover, we develop a web-based fine-tuning service to utilize our model on various tasks.

The Future of PHM Could be Tiny under Cloud: Exploring Potential Application Patterns of TinyML in PHM Scenarios

Annual Conference of the PHM Society ◽

10.36001/phmconf.2021.v13i1.3054 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Xingyu Zhou ◽

Zhuangwei Kang ◽

Robert Canady ◽

Shunxing Bao ◽

Daniel Allen Balasubramanian ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Remaining Useful Life ◽

Data Driven ◽

Learning Models ◽

Level Data ◽

Data Source ◽

Single Data ◽

The Impact ◽

Machine Learning Models

Deep learning has shown impressive performance across health management and prognostics applications. Nowadays, an emerging trend of machine learning deployment on resource-constrained hardware devices like micro-controllers (MCU) has attracted much attention. Given the distributed and resource-constrained nature of many PHM applications, using tiny machine learning models close to data source sensors for on-device inference would be beneficial to save both time and additional hardware resources. Even though there have been past works that bring TinyML to MCUs for some PHM applications, they mainly target single data source usage without higher-level data incorporation with cloud computing. We study the impact of potential cooperation patterns between TinyML on the edge and more powerful computation resources on the cloud, and how this would affect the application patterns in data-driven prognostics. We introduce potential applications where sensor readings are utilized for system health status prediction, including status classification and remaining useful life regression. We find that MCUs and cloud computing can be adapted to different kinds of machine learning models and combined in flexible ways for diverse requirements. Our work also shows limitations of current MCU-based deep learning in data-driven prognostics. And we hope our work can

A Deep Learning Method to Detect Opioid Prescription and Opioid Use Disorder from Electronic Health Records

10.1101/2021.09.13.21263524 ◽

2021 ◽

Author(s):

Aditya Kashyap ◽

Chris Callison-Burch ◽

Mary Regina Boland

Keyword(s):

Deep Learning ◽

Prediction Models ◽

The United States ◽

Opioid Use Disorder ◽

Unstructured Data ◽

Learning Approaches ◽

Learning Models ◽

Opioid Use ◽

Opioid Prescription ◽

Patients At Risk

Objective: As the opioid epidemic continues across the United States, methods are needed to accurately and quickly identify patients at risk for opioid use disorder (OUD). The purpose of this study is to develop two predictive algorithms: one to predict opioid prescription and one to predict OUD. Materials and Methods: We developed an informatics algorithm that trains two deep learning models over patient EHRs using the MIMIC-III database. We utilize both the structured and unstructured parts of the EHR and show that it is possible to predict both of these challenging outcomes. Results: Our deep learning models incorporate both structured and unstructured data elements from the EHRs to predict opioid prescription with an F1-score of 0.88 +/- 0.003 and an AUC-ROC of 0.93 +/- 0.002. We also constructed a model to predict OUD diagnosis, achieving an F1-score of 0.82 +/- 0.05 and an AUC-ROC of 0.94 +/- 0.008. Discussion: Our model for OUD prediction outperformed prior algorithms for specificity, F1-score and AUC-ROC while achieving equivalent sensitivity. This demonstrates the importance of (a) deep learning approaches in predicting OUD and (b) incorporating both structured and unstructured data for this prediction task. No prediction models for opioid prescription as an outcome were found in the literature; this therefore represents an important contribution of our work, as opioid prescriptions are more common than OUDs. Conclusion: Algorithms such as those described in this paper will become increasingly important for understanding the drivers underlying this national epidemic.

Improving Cross-Domain Performance for Relation Extraction via Dependency Prediction and Information Flow Control

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/716 ◽

2019 ◽

Cited By ~ 1

Author(s):

Amir Pouran Ben Veyseh ◽

Thien Nguyen ◽

Dejing Dou

Keyword(s):

Deep Learning ◽

Information Flow ◽

Language Processing ◽

Relation Extraction ◽

Learning Models ◽

Information Flow Control ◽

Cross Domain ◽

Benchmark Datasets ◽

Dependency Trees ◽

Use Dependency

Relation Extraction (RE) is one of the fundamental tasks in Information Extraction and Natural Language Processing. Dependency trees have been shown to be a very useful source of information for this task. Current deep learning models for relation extraction have mainly exploited this dependency information by guiding their computation along the structures of the dependency trees. One potential problem with this approach is that it might prevent the models from capturing important context information beyond syntactic structures and cause poor cross-domain generalization. This paper introduces a novel method to use dependency trees in RE for deep learning models that jointly predicts dependency and semantic relations. We also propose a new mechanism to control the information flow in the model based on the input entity mentions. Our extensive experiments on benchmark datasets show that the proposed model significantly outperforms the existing methods for RE.