Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models

Generative models are becoming a tool of choice for exploring the molecular space. These models learn on a large training dataset and produce novel molecular structures with similar properties. Generated structures can be utilized for virtual screening or training semi-supervised predictive models in the downstream tasks. While there are plenty of generative models, it is unclear how to compare and rank them. In this work, we introduce a benchmarking platform called Molecular Sets (MOSES) to standardize training and comparison of molecular generative models. MOSES provides training and testing datasets, and a set of metrics to evaluate the quality and diversity of generated structures. We have implemented and compared several molecular generation models and suggest using our results as reference points for further advancements in generative chemistry research. The platform and source code are available at https://github.com/molecularsets/moses.
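
As a rough illustration of the kind of set-based measure such a benchmark can report, the sketch below computes the fraction of distinct generated structures and the fraction of those that do not occur in the training set. It is not the MOSES API: the function names `uniqueness` and `novelty` and the toy SMILES lists are assumptions, and strings are compared verbatim, which presumes they have already been canonicalized.

```python
def uniqueness(generated):
    """Fraction of generated strings that are distinct."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings that are absent from the training set."""
    distinct = set(generated)
    if not distinct:
        return 0.0
    return len(distinct - set(training)) / len(distinct)


# Toy data (assumed): strings are treated as already-canonical SMILES.
train = ["CCO", "CCN", "c1ccccc1"]
gen = ["CCO", "CCC", "CCC", "CCCl"]

print("uniqueness:", uniqueness(gen))
print("novelty:", novelty(gen, train))
```

Both measures only look at string identity, so they say nothing about chemical validity or property distributions; those would need a cheminformatics toolkit.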

Automatic Acquisition of Annotated Training Corpora for Test-Code Generation

Information ◽

10.3390/info10020066 ◽

2019 ◽

Vol 10 (2) ◽

pp. 66

Author(s):

Magdalena Kacmajor ◽

John Kelleher

Keyword(s):

Natural Language ◽

Code Generation ◽

Source Code ◽

Generative Models ◽

Training Data ◽

Training Dataset ◽

Unit Testing ◽

Test Automation ◽

Parallel Corpora ◽

Parallel Text

Open software repositories make large amounts of source code publicly available. Potentially, this source code could be used as training data to develop new, machine learning-based programming tools. For many applications, however, raw code scraped from online repositories does not constitute an adequate training dataset. Building on the recent and rapid improvements in machine translation (MT), one possibly very interesting application is code generation from natural language descriptions. One of the bottlenecks in developing these MT-inspired systems is the acquisition of parallel text-code corpora required for training code-generative models. This paper addresses the problem of automatically synthesizing parallel text-code corpora in the software testing domain. Our approach is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular for unit testing. Therefore, we propose synthesizing parallel corpora composed of parsed test function names serving as code descriptions, aligned with the corresponding function bodies. We present the results of applying one of the state-of-the-art MT methods to such a generated dataset. Our experiments show that a neural MT model trained on our dataset can generate syntactically correct and semantically relevant short Java functions from quasi-natural language descriptions of functionality.
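
The corpus construction idea can be sketched as follows. This is a minimal illustration and not the authors' pipeline: `name_to_description` and `build_pairs` are hypothetical helpers, the splitting rules (snake_case and camelCase boundaries) are assumptions, and the two Java snippets are invented examples.

```python
import re


def name_to_description(method_name: str) -> str:
    """Split a camelCase or snake_case test method name into lowercase words."""
    words = []
    # Break on underscores first, then on camelCase boundaries inside each piece.
    for piece in method_name.split("_"):
        # Alternatives: an acronym run, a (possibly capitalized) word, or digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", piece))
    return " ".join(word.lower() for word in words)


def build_pairs(methods):
    """Align each parsed method name (the description) with its function body."""
    return [(name_to_description(name), body) for name, body in methods]


# Invented examples of (test method name, Java body) pairs.
pairs = build_pairs([
    ("testAddReturnsSumOfTwoNumbers", "assertEquals(5, calc.add(2, 3));"),
    ("should_throw_whenInputIsNull", "assertThrows(NPE.class, () -> parse(null));"),
])

for description, body in pairs:
    print(description, "=>", body)
```

Each resulting pair has a quasi-natural language description on the source side and code on the target side, which is the shape an MT-style model expects.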

Retrosynthetic Accessibility Score (RAscore) - Rapid Machine Learned Synthesizability Classification from AI Driven Retrosynthetic Planning

10.26434/chemrxiv.13019993.v1 ◽

2020 ◽

Author(s):

Amol Thakkar ◽

Veronika Chadimova ◽

Esben Jannik Bjerrum ◽

Ola Engkvist ◽

Jean-Louis Reymond

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Virtual Screening ◽

Generative Models ◽

Synthetic Route ◽

Synthetic Accessibility ◽

Wide Range ◽

Computer Aided ◽

Synthesis Planning ◽

Retrosynthetic Analysis

Computer aided synthesis planning (CASP) is part of a suite of artificial intelligence (AI) based tools that are able to propose synthesis routes for a wide range of compounds. However, at present they are too slow to be used to screen the synthetic feasibility of millions of generated or enumerated compounds before identification of potential bioactivity by virtual screening (VS) workflows. Herein we report a machine learning (ML) based method capable of classifying whether a synthetic route can be identified for a particular compound or not by the CASP tool AiZynthFinder. The resulting ML models return a retrosynthetic accessibility score (RAscore) for any molecule of interest, and compute it 4,500 times faster than retrosynthetic analysis performed by the underlying CASP tool. The RAscore should be useful for pre-screening millions of virtual molecules from enumerated databases or generative models for synthetic accessibility, producing higher quality databases for virtual screening of biological activity.

Beyond Generative Models: Superfast Traversal, Optimization, Novelty, Exploration and Discovery (STONED) Algorithm for Molecules using SELFIES

10.26434/chemrxiv.13383266.v2 ◽

2021 ◽

Author(s):

AkshatKumar Nigam ◽

Robert Pollice ◽

Mario Krenn ◽

Gabriel dos Passos Gomes ◽

Alan Aspuru-Guzik

Keyword(s):

Deep Learning ◽

Virtual Screening ◽

Chemical Space ◽

Generative Models ◽

Inverse Design ◽

Learning Models ◽

Structure Modification ◽

Design Models ◽

Comparable Performance ◽

And Training

Inverse design allows the design of molecules with desirable properties using property optimization. Deep generative models have recently been applied to tackle inverse design, as they possess the ability to optimize molecular properties directly through structure modification using gradients. While the ability to carry out direct property optimizations is promising, the use of generative deep learning models to solve practical problems requires large amounts of data and is very time-consuming. In this work, we propose STONED – a simple and efficient algorithm to perform interpolation and exploration in the chemical space, comparable to deep generative models. STONED bypasses the need for large amounts of data and training times by using string modifications in the SELFIES molecular representation. We achieve comparable performance on typical benchmarks without any training. We demonstrate applications in high-throughput virtual screening for the design of drugs, photovoltaics, and the construction of chemical paths, allowing for both property and structure-based interpolation in the chemical space. We anticipate our results to be a stepping stone for developing more sophisticated inverse design models and benchmarking tools, ultimately helping generative models achieve wide adoption.
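
A toy sketch of what a string modification on a SELFIES-like token sequence could look like is given below. It is an assumption-laden illustration, not the STONED implementation: the three edit operations, the small `ALPHABET`, and the helper names `tokenize` and `mutate` are all hypothetical, and the code does not decode the result back into a molecule.

```python
import random
import re

# Toy token subset (assumed); a real SELFIES alphabet is much larger.
ALPHABET = ["[C]", "[N]", "[O]", "[F]", "[=C]", "[Branch1]", "[Ring1]"]


def tokenize(selfies_string):
    """Split a bracketed string such as '[C][C][O]' into its tokens."""
    return re.findall(r"\[[^\]]+\]", selfies_string)


def mutate(selfies_string, rng):
    """Apply one random token-level edit: replace, insert, or delete."""
    tokens = tokenize(selfies_string)
    op = rng.choice(["replace", "insert", "delete"])
    if op == "delete" and len(tokens) > 1:
        del tokens[rng.randrange(len(tokens))]
    elif op == "insert" or not tokens:
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(ALPHABET))
    else:
        # Replace one token; also the fallback when deleting would empty the string.
        tokens[rng.randrange(len(tokens))] = rng.choice(ALPHABET)
    return "".join(tokens)


rng = random.Random(0)  # seeded for repeatable runs
neighbors = [mutate("[C][C][O]", rng) for _ in range(5)]
print(neighbors)
```

Because no model is trained, generating a neighborhood of candidate strings costs only a handful of list operations per candidate; turning those strings into molecules would require an actual SELFIES decoder.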

Encoding Health Records into Pathway Representations for Deep Learning

10.3233/shti210800 ◽

2021 ◽

Author(s):

Marco Luca Sbodio ◽

Natasha Mulligan ◽

Stefanie Speichert ◽

Vanessa Lopez ◽

Joao Bettencourt-Silva

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Learning ◽

Source Code ◽

Training Dataset ◽

Health Records ◽

Learning Tasks ◽

Patient Pathways ◽

Computational Resources ◽

The Impact

There is a growing trend in building deep learning patient representations from health records to obtain a comprehensive view of a patient’s data for machine learning tasks. This paper proposes a reproducible approach to generate patient pathways from health records and to transform them into a machine-processable image-like structure useful for deep learning tasks. Based on this approach, we generated over a million pathways from FAIR synthetic health records and used them to train a convolutional neural network (CNN). Our initial experiments show that the accuracy of the CNN on a prediction task is comparable to or better than that of autoencoders trained on the same data, while requiring significantly fewer computational resources for training. We also assess the impact of the size of the training dataset on autoencoder performance. The source code for generating pathways from health records is provided as open source.

Deep learning framework for material design space exploration using active transfer learning and data augmentation

npj Computational Materials ◽

10.1038/s41524-021-00609-2 ◽

2021 ◽

Vol 7 (1) ◽

Author(s):

Yongtae Kim ◽

Youngsoo Kim ◽

Charles Yang ◽

Kundo Park ◽

Grace X. Gu ◽

...

Keyword(s):

Neural Network ◽

Transfer Learning ◽

Design Space Exploration ◽

Predictive Power ◽

Design Space ◽

Data Augmentation ◽

Generative Models ◽

Training Dataset ◽

Initial Training ◽

Active Transfer

Neural network-based generative models have been actively investigated as an inverse design method for finding novel materials in a vast design space. However, the applicability of conventional generative models is limited because they cannot access data outside the range of training sets. Advanced generative models that were devised to overcome the limitation also suffer from the weak predictive power on the unseen domain. In this study, we propose a deep neural network-based forward design approach that enables an efficient search for superior materials far beyond the domain of the initial training set. This approach compensates for the weak predictive power of neural networks on an unseen domain through gradual updates of the neural network with active transfer learning and data augmentation methods. We demonstrate the potential of our framework with a grid composite optimization problem that has an astronomical number of possible design configurations. Results show that our proposed framework can provide excellent designs close to the global optima, even with the addition of a very small dataset corresponding to less than 0.5% of the initial training dataset size.

Insights into the influence of the molecular structures of fluorinated ionic liquids on their thermophysical properties. A soft-SAFT based approach

Physical Chemistry Chemical Physics ◽

10.1039/c8cp07522k ◽

2019 ◽

Vol 21 (12) ◽

pp. 6362-6380 ◽

Cited By ~ 10

Author(s):

Margarida L. Ferreira ◽

João M. M. Araújo ◽

Ana B. Pereiro ◽

Lourdes F. Vega

Keyword(s):

Ionic Liquids ◽

Thermophysical Properties ◽

Predictive Models ◽

Molecular Structures

Development of predictive models for FILs.

Automatic crater detection over the Jezero crater area from HiRISE imagery

10.5194/egusphere-egu2020-6269 ◽

2020 ◽

Author(s):

Konstantinos Servis ◽

Anthony Lagain ◽

Gretchen Benedix ◽

David Flannery ◽

Chris Norman ◽

...

Keyword(s):

Detection Rate ◽

False Negative ◽

Detection Algorithm ◽

Impact Craters ◽

Reference Points ◽

Training Dataset ◽

False Detection ◽

Planetary Scale ◽

Area Of Interest ◽

Crater Detection

Impact craters are used to determine the ages of planetary surfaces. Absolute dating of meteorites or in situ geochronology provides a few essential reference points, but these techniques are rare and not yet applicable at the planetary scale. Therefore, impact crater counting techniques will remain the major tool to decipher planetary surface history. This approach requires tedious mapping and morphological inspection of a large number of circular features to distinguish true and primary impact craters. The most complete database of Martian craters includes a catalog of more than 384,000 impact structures larger than 1 km in diameter. This database is considered to be complete for this diameter range. Determining young surface ages on Mars requires including smaller impact craters, typically a hundred meters in diameter, found in the area of interest.

To access the crater population of this size range at a planetary scale, we built a Crater Detection Algorithm (CDA) trained on THEMIS images where impact craters larger than 1 km from the Robbins & Hynek database have been identified. Our model offers a true detection rate of 0.9. We then applied our CDA to the global CTX mosaic within the ±45° latitudinal band, leading to ~17 million detections >100 m in diameter.

The ultimate goal of our work is now to automatically compile smaller impact craters (5 m < D < 100 m) visible in the HiRISE imagery dataset, which offers a resolution of 25 cm/px. We trained our algorithm on a part of the HiRISE mosaic (NASA/JPL/MSSS/The Murray Lab) covering a part of the Jezero crater (E77-5_N18_0) where 1650 craters have been manually identified. A portion of this population of craters was then selected so that the training dataset includes only the most confident impact features, finally resulting in 1624 craters over this entire image.

Our model has been applied over the entire HiRISE mosaic covering the Jezero crater, where more than 27,298 craters >3 m have been detected. To validate our results, we compared the detections obtained on 30 tiles of 960 px × 960 px, randomly chosen from a part of the mosaic (E77-25_N18-25) that was not included in the training dataset, with a manual identification constituting the ground truth. For this purpose, we categorized each tile according to the type of terrain mostly represented on it: rocky terrain, smooth terrain, and dune fields. We also flagged images exhibiting vertical stripes, which form a fourth category.

On rocky and smooth terrains, the CDA produces very good results: on average, only 5% of detections are false detections and 16% of craters are not detected by the CDA. However, the CDA is less efficient on dune fields, where 35% of detections are false detections and 15% of craters are not identified. Finally, vertical stripes in the images significantly decrease the detection rate of the CDA: 56% of detections are false negative and 20% of craters are not detected.
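
The per-terrain figures can be restated as precision and recall. The sketch below assumes that "false detection" means the share of detections that are not craters and "not detected" means the share of ground-truth craters that were missed. The F1 score is added here as a convenient summary and is not reported in the abstract; `detection_scores` is a hypothetical helper. The striped-image category is left out because its wording in the abstract is ambiguous.

```python
def detection_scores(false_detection_frac, missed_frac):
    """Convert error fractions into (precision, recall, F1).

    false_detection_frac: share of detections that are not real craters.
    missed_frac: share of ground-truth craters the detector did not find.
    """
    precision = 1.0 - false_detection_frac
    recall = 1.0 - missed_frac
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Fractions quoted in the abstract for two of the terrain categories.
terrains = {
    "rocky/smooth": (0.05, 0.16),
    "dune fields": (0.35, 0.15),
}

for name, (false_frac, missed_frac) in terrains.items():
    p, r, f1 = detection_scores(false_frac, missed_frac)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.3f}")
```

Under these assumptions the drop on dune fields comes almost entirely from precision, since the missed-crater fraction is nearly the same as on rocky and smooth terrain.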

MERMAID: An Open Source Automated Hit-to-Lead Method Based on Deep Reinforcement Learning

10.26434/chemrxiv.14450313.v1 ◽

2021 ◽

Author(s):

Daiki Erikawa ◽

Nobuaki Yasuo ◽

Masakazu Sekijima

Keyword(s):

Quantitative Estimate ◽

Chemical Structure ◽

Source Code ◽

Generative Models ◽

Tree Search ◽

Screening Assay ◽

Input Line ◽

Monte Carlo Tree Search ◽

Zinc Database ◽

Water Partition Coefficient

The hit-to-lead process makes the physicochemical properties of hit compounds, which show the desired type of activity in the screening assay, more drug-like. Deep learning-based molecular generative models are expected to contribute to the hit-to-lead process.

The simplified molecular input line entry system (SMILES), which is a string of alphanumeric characters representing the chemical structure of a molecule, is one of the most commonly used representations of molecules, and molecular generative models based on SMILES have achieved significant success. However, in contrast to molecular graphs, the partial strings produced during generation are not valid SMILES. Further, it is quite difficult to generate molecules starting from a certain molecule, which makes it difficult to apply SMILES to the hit-to-lead process. In this study, we have developed a SMILES-based generative model that can generate molecules starting from a certain compound. This method generates partial SMILES and inserts them into the original SMILES using Monte Carlo Tree Search and a Recurrent Neural Network. We validated our method using a molecule dataset obtained from the ZINC database and successfully generated molecules that were well optimized for both objectives: the quantitative estimate of drug-likeness (QED) and the penalized octanol-water partition coefficient (PLogP).

The source code is available at https://github.com/sekijima-lab/mermaid.

Evaluation of predictive models based on random forest, decision tree and support vector machine classifiers and virtual screening of anti-mycobacterial compounds

International Journal of Computational Biology and Drug Design ◽

10.1504/ijcbdd.2017.085410 ◽

2017 ◽

Vol 10 (3) ◽

pp. 248 ◽

Cited By ~ 1

Author(s):

Madhulata Kumari ◽

Neeraj Tiwari ◽

Naidu Subbarao ◽

Subhash Chandra

Keyword(s):

Support Vector Machine ◽

Random Forest ◽

Virtual Screening ◽

Decision Tree ◽

Predictive Models ◽

Support Vector

Virtual Screening of C. Sativa Constituents for the Identification of Selective Ligands for Cannabinoid Receptor 2

International Journal of Molecular Sciences ◽

10.3390/ijms21155308 ◽

2020 ◽

Vol 21 (15) ◽

pp. 5308

Author(s):

Mikołaj Mizera ◽

Dorota Latek ◽

Judyta Cielecka-Piontek

Keyword(s):

Virtual Screening ◽

Cannabis Sativa ◽

Cannabinoid Receptor ◽

Data Bank ◽

Qsar Model ◽

Molecular Structures ◽

Computer Assisted ◽

Cannabinoid Receptor 2 ◽

Selective Ligands ◽

Qsar Models

The selective targeting of the cannabinoid receptor 2 (CB2) is crucial for the development of peripheral system-acting cannabinoid analgesics. This work aimed at computer-assisted identification of prospective CB2-selective compounds among the constituents of Cannabis Sativa. The molecular structures and corresponding binding affinities to CB1 and CB2 receptors were collected from ChEMBL. The molecular structures of Cannabis Sativa constituents were collected from a phytochemical database. The collected records were curated and used to develop quantitative structure-activity relationship (QSAR) models with a machine learning approach. The validated models predicted the affinities of Cannabis Sativa constituents. Four structures of CB2 were acquired from the Protein Data Bank (PDB), and their ability to discriminate CB2-selective ligands from two sets of decoys was tested. We succeeded in developing the QSAR model by achieving Q2 5-CV > 0.62. The QSAR models helped to identify three prospective CB2-selective molecules that are dissimilar to already tested compounds. In a complementary structure-based virtual screening study that used available PDB structures of CB2, the agonist-bound, Cryogenic Electron Microscopy structure of CB2 showed the best statistical performance in discriminating between CB2-active and non-active ligands. The same structure also performed best in discriminating CB2-selective ligands from non-selective ligands.
