Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Machine learning (ML) methods have great potential to transform chemical discovery by accelerating the exploration of chemical space and drawing scientific insights from data. However, modern chemical reaction ML models, such as those based on graph neural networks (GNNs), must be trained on a large amount of labelled data in order to avoid overfitting the data and thus possessing low accuracy and transferability. In this work, we propose a strategy to leverage unlabelled data to learn accurate ML models for small labelled chemical reaction data. We focus on an old and prominent problem—classifying reactions into distinct families—and build a GNN model for this task. We first pretrain the model on unlabelled reaction data using unsupervised contrastive learning and then fine-tune it on a small number of labelled reactions. The contrastive pretraining learns by making the representations of two augmented versions of a reaction similar to each other but distinct from other reactions. We propose chemically consistent reaction augmentation methods that protect the reaction center and find they are the key for the model to extract relevant information from unlabelled data to aid the reaction classification task. The transfer learned model outperforms a supervised model trained from scratch by a large margin. Further, it consistently performs better than models based on traditional rule-driven reaction fingerprints, which have long been the default choice for small datasets. In addition to reaction classification, the effectiveness of the strategy is tested on regression datasets; the learned GNN-based reaction fingerprints can also be used to navigate the chemical reaction space, which we demonstrate by querying for similar reactions. The strategy can be readily applied to other predictive reaction problems to uncover the power of unlabelled data for learning better models with a limited supply of labels.
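The contrastive objective described in the abstract above (two augmented views of one reaction pulled together, other reactions pushed apart) can be sketched as follows. The abstract does not name the loss, so this sketch assumes an NT-Xent (SimCLR-style) loss over cosine similarities. The function name `nt_xent_loss`, the `temperature` value, and the plain-list embeddings are illustrative assumptions, not the authors' implementation; in the described method the embeddings would come from the GNN applied to augmented reactions.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nt_xent_loss(view_a, view_b, temperature=0.1):
    """Assumed NT-Xent loss: view_a[i] and view_b[i] embed two augmentations of reaction i.

    Every other embedding in the batch acts as a negative for a given anchor.
    """
    n = len(view_a)
    z = [_normalize(v) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same reaction
        # Scaled similarities to every embedding except the anchor itself.
        logits = [
            sum(p * q for p, q in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
            if j != i
        ]
        pos_logit = sum(p * q for p, q in zip(z[i], z[pos])) / temperature
        # Numerically stable log-sum-exp over the anchor's candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit  # -log softmax probability of the positive
    return total / (2 * n)


# Toy batch of three "reactions" with orthogonal embeddings (hypothetical data).
batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = nt_xent_loss(batch, batch)  # views of the same reaction agree
shuffled = nt_xent_loss(batch, [batch[1], batch[2], batch[0]])  # views mismatched
```

The loss is small when paired views agree and large when they are mismatched, which is the signal that pretraining would minimize before fine-tuning on the labelled reactions.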


Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

10.26434/chemrxiv-2021-xr8tf ◽

2021 ◽

Author(s):

Mingjian Wen ◽

Samuel M. Blau ◽

Xiaowei Xie ◽

Shyam Dwaraknath ◽

Kristin A. Persson

Keyword(s):

Machine Learning ◽

Chemical Reaction ◽

Chemical Space ◽

Relevant Information ◽

Learning Performance ◽

Reaction Data ◽

Unlabelled Data ◽

Graph Neural Networks ◽

Small Chemical ◽

Fine Tune

Machine learning (ML) methods have great potential to transform chemical discovery by accelerating the exploration of chemical space and drawing scientific insights from data. However, modern chemical reaction ML models, such as those based on graph neural networks (GNNs), must be trained on a large amount of labelled data in order to avoid overfitting the data and thus possessing low accuracy and transferability. In this work, we propose a strategy to leverage unlabelled data to learn accurate ML models for small labelled chemical reaction data. We focus on an old and prominent problem—classifying reactions into distinct families—and build a GNN model for this task. We first pretrain the model on unlabelled reaction data using unsupervised contrastive learning and then fine-tune it on a small number of labelled reactions. The contrastive pretraining learns by making the representations of two augmented versions of a reaction similar to each other but distinct from other reactions. We propose chemically consistent reaction augmentation methods that protect the reaction center and find they are the key for the model to extract relevant information from unlabelled data to aid the reaction classification task. The transfer learned model outperforms a supervised model trained from scratch by a large margin. Further, it consistently performs better than models based on traditional rule-driven reaction fingerprints, which have long been the default choice for small datasets. In addition to reaction classification, the learned GNN-based reaction fingerprints can also be used to navigate the chemical reaction space, which we demonstrate by querying for similar reactions. The strategy can be readily applied to other predictive reaction problems to uncover the power of unlabelled data for learning better models with a limited supply of labels.


Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Chemical Science ◽

10.1039/d1sc06515g ◽

2022 ◽

Author(s):

Mingjian Wen ◽

Samuel M. Blau ◽

Xiaowei Xie ◽

Shyam Dwaraknath ◽

Kristin Persson

Keyword(s):

Machine Learning ◽

Chemical Reaction ◽

Chemical Space ◽

Learning Performance ◽

Reaction Data ◽

Modern Chemical ◽

Small Chemical


Towards a Design of Active Oxygen Evolution Catalysts: Insights from Automated Density Functional Theory Calculations and Machine Learning

10.26434/chemrxiv.7926869 ◽

2019 ◽

Author(s):

Seoin Back ◽

Kevin Tran ◽

Zachary Ulissi

Keyword(s):

Machine Learning ◽

Oxygen Evolution ◽

Active Sites ◽

Density Functional ◽

Chemical Space ◽

Density Functional Theory Calculations ◽

Catalyst Design ◽

Transition Metal Catalysts ◽

Oxide Materials ◽

Design Strategies

Developing active and stable oxygen evolution catalysts is a key to enabling various future energy technologies and the state-of-the-art catalyst is Ir-containing oxide materials. Understanding oxygen chemistry on oxide materials is significantly more complicated than studying transition metal catalysts for two reasons: the most stable surface coverage under reaction conditions is extremely important but difficult to understand without many detailed calculations, and there are many possible active sites and configurations on O* or OH* covered surfaces. We have developed an automated and high-throughput approach to solve this problem and predict OER overpotentials for arbitrary oxide surfaces. We demonstrate this for a number of previously-unstudied IrO2 and IrO3 polymorphs and their facets. We discovered that low index surfaces of IrO2 other than rutile (110) are more active than the most stable rutile (110), and we identified promising active sites of IrO2 and IrO3 that outperform rutile (110) by 0.2 V in theoretical overpotential. Based on findings from DFT calculations, we provide catalyst design strategies to improve catalytic activity of Ir based catalysts and demonstrate a machine learning model capable of predicting surface coverages and site activity. This work highlights the importance of investigating unexplored chemical space to design promising catalysts.


Comparison of Machine Learning Performance for Earnings Forecasting

Journal of Taxation and Accounting ◽

10.35850/kjta.20.6.01 ◽

2019 ◽

Vol 20 (6) ◽

pp. 9-34

Author(s):

Woo June Jung

Keyword(s):

Machine Learning ◽

Learning Performance ◽

Earnings Forecasting


Applications of Quantitative Structure-Activity Relationships (QSAR) based Virtual Screening in Drug Design: A Review

Mini-Reviews in Medicinal Chemistry ◽

10.2174/1389557520666200429102334 ◽

2020 ◽

Vol 20 (14) ◽

pp. 1375-1388 ◽

Cited By ~ 2

Author(s):

Patnala Ganga Raju Achary

Keyword(s):

Machine Learning ◽

Drug Discovery ◽

Virtual Screening ◽

Model Building ◽

Chemical Space ◽

QSAR Model ◽

Quantitative Structure ◽

Efficient Manner ◽

QSAR Analysis ◽

Structure Activity

Scientists and researchers around the globe generate a tremendous amount of information every day; for instance, more than 74 million molecules have so far been registered in Chemical Abstracts Service. According to a recent study, there are at present around 10^60 molecules that are classified as new drug-like molecules. The library of such molecules is now considered 'dark chemical space' or 'dark chemistry'. To explore such hidden molecules scientifically, a good number of live and updated databases (protein, cell, tissues, structure, drugs, etc.) are available today. The synchronization of three different sciences, 'genomics', 'proteomics' and 'in-silico simulation', will revolutionize the process of drug discovery. Screening a sizable number of drug-like molecules is a challenge, and it must be handled in an efficient manner. Virtual screening (VS) is an important computational tool in the drug discovery process; however, experimental verification of the drugs is equally important for the drug development process. Quantitative structure-activity relationship (QSAR) analysis is one of the machine learning techniques extensively used in VS. QSAR is well known for its fast, high-throughput screening with a satisfactory hit rate. QSAR model building involves (i) chemo-genomics data collection from a database or the literature, (ii) calculation of the right descriptors from the molecular representation, (iii) establishing a relationship (model) between biological activity and the selected descriptors, and (iv) application of the QSAR model to predict the biological property of the molecules. All hits obtained by the VS technique need to be experimentally verified. The present mini-review highlights the web-based machine learning tools, the role of QSAR in VS techniques, successful applications of QSAR-based VS leading to drug discovery, and the advantages and challenges of QSAR.


The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality

Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems ◽

10.1145/3411764.3445423 ◽

2021 ◽

Author(s):

Mitchell L. Gordon ◽

Kaitlyn Zhou ◽

Kayur Patel ◽

Tatsunori Hashimoto ◽

Michael S. Bernstein

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Learning Performance


Evaluating Machine Learning Performance for Safe, Intelligent Robots

2021 IEEE International Conference on Intelligence and Safety for Robotics (ISR) ◽

10.1109/isr50024.2021.9419381 ◽

2021 ◽

Author(s):

Raymond Sheh

Keyword(s):

Machine Learning ◽

Learning Performance ◽

Intelligent Robots


Metabolomics-Guided Elucidation of Plant Abiotic Stress Responses in the 4IR Era: An Overview

Metabolites ◽

10.3390/metabo11070445 ◽

2021 ◽

Vol 11 (7) ◽

pp. 445

Author(s):

Morena M. Tinte ◽

Kekeletso H. Chele ◽

Justin J. J. van der Hooft ◽

Fidele Tugizimana

Keyword(s):

Machine Learning ◽

Abiotic Stress ◽

Stress Responses ◽

Abiotic Stresses ◽

Industrial Revolution ◽

Chemical Space ◽

Big Data Analytics ◽

Plant Responses ◽

Next Generation ◽

Computational Tools

Plants are constantly challenged by changing environmental conditions that include abiotic stresses. These are limiting their development and productivity and are subsequently threatening our food security, especially when considering the pressure of the increasing global population. Thus, there is an urgent need for the next generation of crops with high productivity and resilience to climate change. The dawn of a new era characterized by the emergence of fourth industrial revolution (4IR) technologies has redefined the ideological boundaries of research and applications in plant sciences. Recent technological advances and machine learning (ML)-based computational tools and omics data analysis approaches are allowing scientists to derive comprehensive metabolic descriptions and models for the target plant species under specific conditions. Such accurate metabolic descriptions are imperatively essential for devising a roadmap for the next generation of crops that are resilient to environmental deterioration. By synthesizing the recent literature and collating data on metabolomics studies on plant responses to abiotic stresses, in the context of the 4IR era, we point out the opportunities and challenges offered by omics science, analytical intelligence, computational tools and big data analytics. Specifically, we highlight technological advancements in (plant) metabolomics workflows and the use of machine learning and computational tools to decipher the dynamics in the chemical space that define plant responses to abiotic stress conditions.


Learning chemistry: exploring the suitability of machine learning for the task of structure-based chemical ontology classification

Journal of Cheminformatics ◽

10.1186/s13321-021-00500-8 ◽

2021 ◽

Vol 13 (1) ◽

Cited By ~ 1

Author(s):

Janna Hastings ◽

Martin Glauer ◽

Adel Memariani ◽

Fabian Neuhaus ◽

Till Mossakowski

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Short Term Memory ◽

Chemical Space ◽

Chemical Data ◽

Learning Approaches ◽

Class Prediction ◽

Chemical Structures ◽

Chemical Ontology ◽

Chemical Ontologies

Chemical data is increasingly openly available in databases such as PubChem, which contains approximately 110 million compound entries as of February 2021. With the availability of data at such scale, the burden has shifted to organisation, analysis and interpretation. Chemical ontologies provide structured classifications of chemical entities that can be used for navigation and filtering of the large chemical space. ChEBI is a prominent example of a chemical ontology, widely used in life science contexts. However, ChEBI is manually maintained and as such cannot easily scale to the full scope of public chemical data. There is a need for tools that are able to automatically classify chemical data into chemical ontologies, which can be framed as a hierarchical multi-class classification problem. In this paper we evaluate machine learning approaches for this task, comparing different learning frameworks including logistic regression, decision trees and long short-term memory artificial neural networks, and different encoding approaches for the chemical structures, including cheminformatics fingerprints and character-based encoding from chemical line notation representations. We find that classical learning approaches such as logistic regression perform well with sets of relatively specific, disjoint chemical classes, while the neural network is able to handle larger sets of overlapping classes but needs more examples per class to learn from, and is not able to make a class prediction for every molecule. Future work will explore hybrid and ensemble approaches, as well as alternative network architectures including neuro-symbolic approaches.


Machine Learning Performance Validation and Training Using a ‘Perfect’ Expert System

MethodsX ◽

10.1016/j.mex.2021.101477 ◽

2021 ◽

pp. 101477

Author(s):

Jeremy Straub

Keyword(s):

Machine Learning ◽

Expert System ◽

Learning Performance ◽

Performance Validation ◽

And Training
