Data augmentation strategies to improve reaction yield predictions and estimate uncertainty

2020 ◽  
Author(s):  
Philippe Schwaller ◽  
Alain C. Vaucher ◽  
Teodoro Laino ◽  
Jean-Louis Reymond

Chemical reactions describe how precursor molecules react together and transform into products. The reaction yield is the percentage of the precursors successfully transformed into products relative to the theoretical maximum. Predicting reaction yields can help chemists navigate reaction space and accelerate the design of more effective routes. Here, we investigate the best-studied high-throughput experiment data set and show how data augmentation on chemical reactions can improve the accuracy of yield predictions, even when only small data sets are available. Previous work used molecular fingerprints, physics-based descriptors, or categorical descriptors of the precursors. In this manuscript, we fine-tune natural language processing-inspired reaction transformer models on different augmented data sets to predict yields solely from a text-based representation of chemical reactions. When the random training sets contain 2.5% or more of the data, our models outperform previous models, including those using physics-based descriptors as inputs. Moreover, we demonstrate the use of test-time augmentation to generate uncertainty estimates, which correlate with the prediction errors.

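The test-time augmentation idea above can be sketched in a few lines: generate several chemically equivalent views of the same reaction, predict a yield for each, and use the spread of the predictions as an uncertainty estimate. The `augment` and `predict_yield` functions below are illustrative stand-ins; the actual work randomizes SMILES strings and uses a fine-tuned reaction transformer.

```python
import random
import statistics

def augment(reaction: str, n: int) -> list[str]:
    """Stand-in for SMILES randomization: shuffle the order of the
    precursors, which leaves the chemistry unchanged but gives the
    model n different textual views of the same reaction."""
    precursors, product = reaction.split(">>")
    variants = []
    for _ in range(n):
        parts = precursors.split(".")
        random.shuffle(parts)
        variants.append(".".join(parts) + ">>" + product)
    return variants

def predict_yield(reaction: str) -> float:
    """Hypothetical yield model returning a value in [50, 60);
    a real system would call the fine-tuned transformer here."""
    return 50.0 + 10.0 * (hash(reaction) % 100) / 100.0

def predict_with_uncertainty(reaction: str, n_aug: int = 16):
    """Mean prediction over augmented views, with the standard
    deviation across views as the uncertainty estimate."""
    preds = [predict_yield(v) for v in augment(reaction, n_aug)]
    return statistics.mean(preds), statistics.stdev(preds)

mean_yield, sigma = predict_with_uncertainty("CCO.CC(=O)O>>CC(=O)OCC")
```

In the study, this spread correlated with the actual prediction error, which is what makes it usable as a confidence signal.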

2021 ◽  
Author(s):  
Alessandra Toniato ◽  
Philippe Schwaller ◽  
Antonio Cardinale ◽  
Joppe Geluykens ◽  
Teodoro Laino

Existing deep learning models applied to reaction prediction in organic chemistry can reach high levels of accuracy (>90% for natural language processing-based ones). With no chemical knowledge embedded other than the information learnt from reaction data, the quality of the data sets plays a crucial role in the performance of the prediction models. Because human curation is prohibitively expensive, unaided approaches that remove chemically incorrect entries from existing data sets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We applied this method to the Pistachio collection of chemical reactions and to an open data set, both extracted from USPTO (United States Patent and Trademark Office) patents. Our results show an improved prediction quality for models trained on the cleaned and balanced data sets. For the retrosynthetic models, the round-trip accuracy metric grows by 13 percentage points and the value of the cumulative Jensen-Shannon divergence decreases by 30% compared to its original record. The coverage remains high at 97%, and the class diversity is not affected by the cleaning. The proposed strategy is the first unassisted, rule-free technique to address automatic noise reduction in chemical data sets.
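The round-trip accuracy metric mentioned above can be computed as follows: a retrosynthesis model proposes precursors for each product, a forward model predicts the product from those precursors, and a prediction counts as a hit when the original product is recovered. The toy lookup-table models below are placeholders for the trained networks.

```python
def round_trip_accuracy(products, retro_model, forward_model):
    """Fraction of products recovered after the retro -> forward
    round trip.  retro_model and forward_model stand in for the
    NLP-based prediction models trained on the (cleaned) data."""
    hits = 0
    for product in products:
        precursors = retro_model(product)
        if forward_model(precursors) == product:
            hits += 1
    return hits / len(products)

# Toy models: dictionaries standing in for trained networks.
retro = {"CC(=O)OCC": "CCO.CC(=O)O", "c1ccccc1Br": "c1ccccc1.Br"}.get
forward = {"CCO.CC(=O)O": "CC(=O)OCC"}.get

acc = round_trip_accuracy(["CC(=O)OCC", "c1ccccc1Br"], retro, forward)
# acc == 0.5: only the first product survives the round trip
```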


Author(s):  
Jianping Ju ◽  
Hong Zheng ◽  
Xiaohang Xu ◽  
Zhongyuan Guo ◽  
Zhaohui Zheng ◽  
...  

Abstract. Although convolutional neural networks have achieved success in image classification, challenges remain in agricultural product quality sorting, such as machine vision-based jujube defect detection. The performance of jujube defect detection depends mainly on the feature extraction and the classifier used. Because of the diversity of the jujube materials and the variability of the testing environment, traditional manually extracted features often fail to meet the requirements of practical application. In this paper, a jujube sorting model for small data sets based on a convolutional neural network and transfer learning is proposed to meet the actual demands of jujube defect detection. First, the original images collected from an actual jujube sorting production line were pre-processed, and the data were augmented to establish a data set of five categories of jujube defects. The original CNN model was then improved by embedding the SE module and by replacing the softmax loss function with the triplet loss function and the center loss function. Finally, a model pre-trained on the ImageNet image data set was fine-tuned on the jujube defects data set, so that the parameters of the pre-trained model could fit the parameter distribution of the jujube defect images, completing the transfer of the model and realizing the detection and classification of jujube defects. The classification results were analyzed through accuracy and confusion matrices against comparison models and visualized with heatmaps. The experimental results show that the SE-ResNet50-CL model improves the fine-grained classification of jujube defect recognition, reaching a test accuracy of 94.15%. The model has good stability and high recognition accuracy in complex environments.
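The center loss used above penalizes the distance between each feature vector and the running center of its class, tightening intra-class clusters for fine-grained recognition. A minimal pure-Python sketch of the loss term (the feature values and centers below are illustrative, not from the paper):

```python
def center_loss(features, labels, centers):
    """Center loss: half the mean squared distance between each
    feature vector and the center of its class.  In models such as
    SE-ResNet50-CL it is combined with a triplet loss instead of
    the plain softmax loss."""
    total = 0.0
    for f, y in zip(features, labels):
        c = centers[y]
        total += sum((fi - ci) ** 2 for fi, ci in zip(f, c))
    return total / (2 * len(features))

feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
labels = [0, 0, 1]
centers = {0: [0.9, 0.1], 1: [0.0, 1.0]}
loss = center_loss(feats, labels, centers)
```

In training, the class centers are themselves updated as running averages of the features assigned to each class.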


Author(s):  
Jungeui Hong ◽  
Elizabeth A. Cudney ◽  
Genichi Taguchi ◽  
Rajesh Jugulum ◽  
Kioumars Paryani ◽  
...  

The Mahalanobis-Taguchi System is a diagnostic and predictive method for analyzing patterns in multivariate cases. The goal of this study is to compare the ability of the Mahalanobis-Taguchi System and a neural network to discriminate using small data sets. We examine the discriminant ability as a function of data set size using an application area where reliable data are publicly available. The study uses the Wisconsin Breast Cancer data set with nine attributes and one class.
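The quantity at the heart of the Mahalanobis-Taguchi System is the Mahalanobis distance: observations are measured against the mean and covariance of a "normal" reference group, and large distances flag abnormal cases. A minimal two-dimensional sketch (the inputs are illustrative, not the Wisconsin data):

```python
def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance for a 2-D point: sqrt(dx^T Cov^-1 dx),
    with the 2x2 covariance inverted by the closed-form formula."""
    dx = [x[0] - mean[0], x[1] - mean[1]]
    a, b, c, d = cov[0][0], cov[0][1], cov[1][0], cov[1][1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    d2 = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
          + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return d2 ** 0.5

# With an identity covariance this reduces to the Euclidean distance:
dist = mahalanobis_2d([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# dist == 5.0
```

With a non-identity covariance the distance is scaled by the variability of each attribute, which is what makes it suitable for multivariate diagnosis.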


2014 ◽  
Vol 7 (3) ◽  
pp. 781-797 ◽  
Author(s):  
P. Paatero ◽  
S. Eberly ◽  
S. G. Brown ◽  
G. A. Norris

Abstract. The EPA PMF (Environmental Protection Agency positive matrix factorization) version 5.0 and the underlying multilinear engine executable ME-2 contain three methods for estimating uncertainty in factor analytic models: classical bootstrap (BS), displacement of factor elements (DISP), and bootstrap enhanced by displacement of factor elements (BS-DISP). The goal of these methods is to capture the uncertainty of PMF analyses due to random errors and rotational ambiguity. It is shown that the three methods complement each other: depending on characteristics of the data set, one method may provide better results than the other two. Results are presented using synthetic data sets, including interpretation of diagnostics, and recommendations are given for parameters to report when documenting uncertainty estimates from EPA PMF or ME-2 applications.
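The classical bootstrap (BS) method above can be sketched generically: resample the data with replacement, re-fit the estimator on each resample, and report the spread of the re-fitted values as the uncertainty. The toy estimator here is a simple mean; EPA PMF refits the full factor model on each resample.

```python
import random
import statistics

def bootstrap_se(samples, estimator, n_boot=1000, seed=0):
    """Classical bootstrap: the standard deviation of the estimator
    across resamples approximates its standard error."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        resample = [rng.choice(samples) for _ in samples]
        stats.append(estimator(resample))
    return statistics.stdev(stats)

# Illustrative data, not from the paper.
data = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.4]
se = bootstrap_se(data, statistics.mean)
```

Note that the bootstrap captures uncertainty due to random errors; DISP is needed in addition to probe rotational ambiguity, which resampling alone cannot see.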


2020 ◽  
pp. 1-11
Author(s):  
Erjia Yan ◽  
Zheng Chen ◽  
Kai Li

Citation sentiment plays an important role in citation analysis and scholarly communication research, but prior citation sentiment studies have used small data sets and relied largely on manual annotation. This paper uses a large data set of PubMed Central (PMC) full-text publications and analyzes citation sentiment in more than 32 million citances within PMC, revealing citation sentiment patterns at the journal and discipline levels. This paper finds a weak relationship between a journal’s citation impact (as measured by CiteScore) and the average sentiment score of citances to its publications. When journals are aggregated into quartiles based on citation impact, we find that journals in higher quartiles are cited more favorably than those in the lower quartiles. Further, social science journals are found to be cited with higher sentiment, followed by engineering and natural science and biomedical journals, respectively. This result may be attributed to disciplinary discourse patterns in which social science researchers tend to use more subjective terms to describe others’ work than do natural science or biomedical researchers.
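The quartile comparison described above amounts to ranking journals by CiteScore, splitting them into four equal buckets, and averaging the mean citance sentiment within each bucket. A small sketch with made-up journal records (none of these numbers are from the study):

```python
import statistics

def sentiment_by_quartile(journals):
    """Bucket journals into CiteScore quartiles (Q1 = highest
    impact) and average mean citance sentiment per bucket."""
    ranked = sorted(journals, key=lambda j: j["citescore"], reverse=True)
    n = len(ranked)
    quartiles = {}
    for q in range(4):
        bucket = ranked[q * n // 4:(q + 1) * n // 4]
        quartiles[f"Q{q + 1}"] = statistics.mean(
            j["sentiment"] for j in bucket)
    return quartiles

journals = [
    {"citescore": 12.0, "sentiment": 0.30},
    {"citescore": 9.0, "sentiment": 0.28},
    {"citescore": 5.0, "sentiment": 0.22},
    {"citescore": 4.0, "sentiment": 0.20},
    {"citescore": 2.0, "sentiment": 0.15},
    {"citescore": 1.5, "sentiment": 0.14},
    {"citescore": 0.8, "sentiment": 0.10},
    {"citescore": 0.5, "sentiment": 0.09},
]
by_q = sentiment_by_quartile(journals)
# In this toy data, higher-impact quartiles carry higher sentiment,
# matching the pattern the paper reports at scale.
```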


Sensors ◽  
2019 ◽  
Vol 19 (20) ◽  
pp. 4408 ◽  
Author(s):  
Hyun-Myung Cho ◽  
Heesu Park ◽  
Suh-Yeon Dong ◽  
Inchan Youn

The goals of this study are the suggestion of a better classification method for detecting stressed states based on raw electrocardiogram (ECG) data and a method for training a deep neural network (DNN) with a smaller data set. We suggest an end-to-end architecture to detect stress using raw ECGs. The architecture consists of successive stages that contain convolutional layers. In this study, two kinds of data sets are used to train and validate the model: a driving data set and a mental arithmetic data set, which is smaller than the driving data set. We apply a transfer learning method to train a model with a small data set. The proposed model shows better performance, based on receiver operating characteristic curves, than conventional methods. Compared with other DNN methods using raw ECGs, the proposed model improves the accuracy from 87.39% to 90.19%. The transfer learning method improves accuracy by 12.01% and 10.06% when 10 s and 60 s of ECG signals, respectively, are used in the model. In conclusion, our model outperforms previous models using raw ECGs from a small data set and, so, we believe that our model can significantly contribute to mobile healthcare for stress management in daily life.


2013 ◽  
Vol 6 (4) ◽  
pp. 7593-7631 ◽  
Author(s):  
P. Paatero ◽  
S. Eberly ◽  
S. G. Brown ◽  
G. A. Norris

Abstract. EPA PMF version 5.0 and the underlying multilinear engine executable ME-2 contain three methods for estimating uncertainty in factor analytic models: classical bootstrap (BS), displacement of factor elements (DISP), and bootstrap enhanced by displacement of factor elements (BS-DISP). The goal of these methods is to capture the uncertainty of PMF analyses due to random errors and rotational ambiguity. It is shown that the three methods complement each other: depending on characteristics of the data set, one method may provide better results than the other two. Results are presented using synthetic data sets, including interpretation of diagnostics, and recommendations are given for parameters to report when documenting uncertainty estimates from EPA PMF or ME-2 applications.


2021 ◽  
Vol 7 (2) ◽  
pp. 755-758
Author(s):  
Daniel Wulff ◽  
Mohamad Mehdi ◽  
Floris Ernst ◽  
Jannis Hagenah

Abstract. Data augmentation is a common method to make deep learning accessible on limited data sets. However, classical image augmentation methods result in highly unrealistic images on ultrasound data. Another approach is to utilize learning-based augmentation methods, e.g. based on variational autoencoders or generative adversarial networks. However, a large amount of data is necessary to train these models, which is typically not available in scenarios where data augmentation is needed. One solution to this problem could be a transfer of augmentation models between different medical imaging data sets. In this work, we present a qualitative study of the cross-data-set generalization performance of different learning-based augmentation methods for ultrasound image data. We show that knowledge transfer is possible in ultrasound image augmentation and that the augmentation partially results in semantically meaningful transfers of structures, e.g. vessels, across domains.


2020 ◽  
Vol 13 (2) ◽  
pp. 373-404 ◽  
Author(s):  
Andrew M. Sayer ◽  
Yves Govaerts ◽  
Pekka Kolmonen ◽  
Antti Lipponen ◽  
Marta Luffarelli ◽  
...  

Abstract. Recent years have seen the increasing inclusion of per-retrieval prognostic (predictive) uncertainty estimates within satellite aerosol optical depth (AOD) data sets, providing users with quantitative tools to assist in the optimal use of these data. Prognostic estimates contrast with diagnostic (i.e. relative to some external truth) ones, which are typically obtained using sensitivity and/or validation analyses. Up to now, however, the quality of these uncertainty estimates has not been routinely assessed. This study presents a review of existing prognostic and diagnostic approaches for quantifying uncertainty in satellite AOD retrievals, and it presents a general framework to evaluate them based on the expected statistical properties of ensembles of estimated uncertainties and actual retrieval errors. It is hoped that this framework will be adopted as a complement to existing AOD validation exercises; it is not restricted to AOD and can in principle be applied to other quantities for which a reference validation data set is available. This framework is then applied to assess the uncertainties provided by several satellite data sets (seven over land, five over water), which draw on methods ranging from empirical estimates and sensitivity analyses to formal error propagation, at 12 Aerosol Robotic Network (AERONET) sites. The AERONET sites are divided into those for which it is expected that the techniques will perform well and those for which some complexity about the site may provide a more severe test. Overall, all techniques show some skill in that larger estimated uncertainties are generally associated with larger observed errors, although they are sometimes poorly calibrated (i.e. too small or too large in magnitude). No technique uniformly performs best. For powerful formal uncertainty propagation approaches such as optimal estimation, the results illustrate some of the difficulties in appropriate population of the covariance matrices required by the technique.
When the data sets are confronted by a situation strongly counter to the retrieval forward model (e.g. potentially mixed land–water surfaces or aerosol optical properties outside the family of assumptions), some algorithms fail to provide a retrieval, while others do but with a quantitatively unreliable uncertainty estimate. The discussion suggests paths forward for the refinement of these techniques.
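One concrete calibration check implied by the framework above is the 1-sigma coverage: for well-calibrated Gaussian errors, about 68.3% of retrieval errors should fall within the reported 1-sigma uncertainty. Much lower coverage means the uncertainties are too small, much higher means too large. A sketch with synthetic errors (the numbers are illustrative, not from any of the evaluated data sets):

```python
import random

def coverage_at_1sigma(errors, uncertainties):
    """Fraction of errors whose magnitude is within the reported
    1-sigma uncertainty estimate."""
    inside = sum(1 for e, u in zip(errors, uncertainties) if abs(e) <= u)
    return inside / len(errors)

# Synthetic check: errors drawn with exactly the stated spread.
rng = random.Random(42)
sigma = 0.05
errors = [rng.gauss(0.0, sigma) for _ in range(10_000)]
cov = coverage_at_1sigma(errors, [sigma] * len(errors))
# cov should be close to 0.683 for these well-calibrated errors
```

Repeating the check at other multiples of sigma (2-sigma should cover ~95.4%) probes whether the whole error distribution, not just its width, matches the stated uncertainties.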

