Masked Graph Modeling for Molecule Generation

10.26434/chemrxiv.13143167.v1 ◽

2020 ◽

Author(s):

Omar Mahmood ◽

Elman Mansimov ◽

Richard Bonneau ◽

Kyunghyun Cho

Keyword(s):

De Novo ◽

Molecular Graph ◽

Graph Model ◽

Material Design ◽

Kl Divergence ◽

Molecular Graphs ◽

Graph Modeling ◽

The Cost ◽

Training Objective

De novo, in-silico design of molecules is a challenging problem with applications in drug discovery and material design. Here, we introduce a masked graph model which learns a distribution over graphs by capturing all possible conditional distributions over unobserved nodes and edges given observed ones. We train our masked graph model on existing molecular graphs and then sample novel molecular graphs from it by iteratively masking and replacing different parts of initialized graphs. We evaluate our approach on the QM9 and ChEMBL datasets using the distribution-learning benchmark from the GuacaMol framework. The benchmark contains five metrics: the validity, uniqueness, novelty, KL-divergence and Fréchet ChemNet Distance scores, the last two of which are measures of the similarity of the generated samples to the training, validation and test distributions. We find that KL-divergence and Fréchet ChemNet Distance scores are anti-correlated with novelty scores. By varying generation initialization and the fraction of the graph masked and replaced at each generation step, we can increase the Fréchet score at the cost of novelty. In this way, we show that our model offers transparent and tunable control of the trade-off between these metrics, a key point of control in design applications currently lacking in other approaches to molecular graph generation. Our model outperforms previously proposed graph-based approaches and is competitive with SMILES-based approaches. Finally, we observe that minimizing validation loss on the training task is a suitable proxy for improving generation quality.
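The mask-and-replace sampling loop described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the graph encoding, the `ATOM_TYPES` and `BOND_TYPES` vocabularies, and the function names are hypothetical, and `toy_conditional` is a uniform placeholder standing in for the trained model's conditional distribution.

```python
import random

ATOM_TYPES = ["C", "N", "O", "F"]          # illustrative vocabulary (assumption)
BOND_TYPES = ["none", "single", "double"]  # illustrative vocabulary (assumption)
MASK = "<mask>"


def toy_conditional(graph, kind, key, rng):
    """Placeholder for the learned conditional p(component | observed graph).

    A trained model would condition on the unmasked nodes and edges in
    `graph`; this stand-in ignores them and samples uniformly.
    """
    vocab = ATOM_TYPES if kind == "node" else BOND_TYPES
    return rng.choice(vocab)


def mask_and_replace(graph, mask_fraction, steps, rng, conditional=toy_conditional):
    """Iteratively mask a fraction of nodes/edges and resample them."""
    nodes = list(graph["nodes"])   # copy so the initial graph is not mutated
    edges = dict(graph["edges"])
    components = [("node", i) for i in range(len(nodes))]
    components += [("edge", e) for e in edges]
    n_mask = max(1, round(mask_fraction * len(components)))

    for _ in range(steps):
        chosen = rng.sample(components, n_mask)
        # Mask step: hide the chosen components.
        for kind, key in chosen:
            if kind == "node":
                nodes[key] = MASK
            else:
                edges[key] = MASK
        # Replace step: resample each masked component given the rest.
        for kind, key in chosen:
            value = conditional({"nodes": nodes, "edges": edges}, kind, key, rng)
            if kind == "node":
                nodes[key] = value
            else:
                edges[key] = value
    return {"nodes": nodes, "edges": edges}


# Initialize from an existing graph, then run a few generation steps.
init = {
    "nodes": ["C", "C", "O"],
    "edges": {(0, 1): "single", (0, 2): "none", (1, 2): "double"},
}
sample = mask_and_replace(init, mask_fraction=0.2, steps=10, rng=random.Random(0))
```

In this sketch, `mask_fraction` and the choice of `init` play the role of the two knobs the abstract mentions: the fraction of the graph masked at each step and the generation initialization.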

Masked graph modeling for molecule generation

Nature Communications ◽

10.1038/s41467-021-23415-2 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Omar Mahmood ◽

Elman Mansimov ◽

Richard Bonneau ◽

Kyunghyun Cho

Keyword(s):

Drug Discovery ◽

In Silico ◽

De Novo ◽

Graph Model ◽

Material Design ◽

Challenging Problem ◽

Trade Off ◽

Graph Modeling ◽

Different Parts

De novo, in-silico design of molecules is a challenging problem with applications in drug discovery and material design. We introduce a masked graph model, which learns a distribution over graphs by capturing conditional distributions over unobserved nodes (atoms) and edges (bonds) given observed ones. We train and then sample from our model by iteratively masking and replacing different parts of initialized graphs. We evaluate our approach on the QM9 and ChEMBL datasets using the GuacaMol distribution-learning benchmark. We find that validity, KL-divergence and Fréchet ChemNet Distance scores are anti-correlated with novelty, and that we can trade off between these metrics more effectively than existing models. On distributional metrics, our model outperforms previously proposed graph-based approaches and is competitive with SMILES-based approaches. Finally, we show our model generates molecules with desired values of specified properties while maintaining physiochemical similarity to the training distribution.
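As a rough illustration of the set-based metrics named above, the sketch below computes validity, uniqueness and novelty for a list of generated molecule strings. It is a simplification and not the GuacaMol code: it assumes the strings are already canonical, takes a caller-supplied `is_valid` predicate in place of a chemistry toolkit, and the function name `distribution_metrics` is hypothetical. The KL-divergence and Fréchet ChemNet Distance scores are not covered.

```python
def distribution_metrics(generated, training, is_valid):
    """Return (validity, uniqueness, novelty) for generated molecule strings.

    validity   = valid molecules / all generated molecules
    uniqueness = distinct valid molecules / valid molecules
    novelty    = distinct valid molecules absent from training / distinct valid
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)

    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty


# Toy example with a stand-in validity check (assumption).
generated = ["CCO", "CCO", "CCN", "bad"]
training = ["CCO"]
validity, uniqueness, novelty = distribution_metrics(
    generated, training, is_valid=lambda s: s != "bad"
)
```

Under these definitions, a sampler that reproduces training molecules more often will tend to score lower on novelty, which is the direction of the trade-off the abstract reports.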

Masked Graph Modeling for Molecule Generation

10.26434/chemrxiv.13143167.v3 ◽

2021 ◽

Author(s):

Omar Mahmood ◽

Elman Mansimov ◽

Richard Bonneau ◽

Kyunghyun Cho

Keyword(s):

Drug Discovery ◽

In Silico ◽

De Novo ◽

Graph Model ◽

Material Design ◽

Challenging Problem ◽

Trade Off ◽

Graph Modeling ◽

Different Parts

De novo, in-silico design of molecules is a challenging problem with applications in drug discovery and material design. We introduce a masked graph model, which learns a distribution over graphs by capturing conditional distributions over unobserved nodes (atoms) and edges (bonds) given observed ones. We train and then sample from our model by iteratively masking and replacing different parts of initialized graphs. We evaluate our approach on the QM9 and ChEMBL datasets using the GuacaMol distribution-learning benchmark. We find that validity, KL-divergence and Fréchet ChemNet Distance scores are anti-correlated with novelty, and that we can trade off between these metrics more effectively than existing models. On distributional metrics, our model outperforms previously proposed graph-based approaches and is competitive with SMILES-based approaches. Finally, we show our model generates molecules with desired values of specified properties while maintaining physiochemical similarity to the training distribution.

L-MolGAN: An improved implicit generative model for large molecular graphs

10.26434/chemrxiv.14569545.v2 ◽

2021 ◽

Author(s):

Yutaka Tsujimoto ◽

Satoru Hiwa ◽

Yushi Nakamura ◽

Yohei Oe ◽

Tomoyuki Hiroyasu

Keyword(s):

Molecular Graph ◽

Chemical Properties ◽

Molecular Size ◽

Material Design ◽

Generative Models ◽

Molecular Structures ◽

Baseline Model ◽

Molecular Graphs ◽

Deep generative models are used to generate arbitrary molecular structures with desired chemical properties. MolGAN is a well-known molecular generation model that uses generative adversarial networks (GANs) and reinforcement learning to generate molecular graphs in one shot. MolGAN can effectively generate small molecular graphs with nine or fewer heavy atoms. However, the graphs tend to become disconnected as the molecular size increases. This poses a challenge for drug discovery and material design, where the candidates may include large molecules. This study develops an improved MolGAN for large molecule generation (L-MolGAN). In this model, the connectivity of molecular graphs is evaluated by a depth-first search during model training. When a disconnected molecular graph is generated, L-MolGAN assigns the graph a reward of zero. This procedure decreases the number of disconnected graphs and consequently increases the number of connected molecular graphs. The effectiveness of L-MolGAN is evaluated experimentally. The size and connectivity of the molecular graphs generated with data from the ZINC-250k molecular dataset are assessed using MolGAN as the baseline model. The model is then optimized for the quantitative estimate of drug-likeness (QED) to generate drug-like molecules. The experimental results indicate that the connectivity measure of generated molecular graphs improved by 1.96 compared with the baseline model at a larger maximum molecular size of 20 atoms. The molecules generated by L-MolGAN are evaluated in terms of multiple chemical properties that are important in drug design: QED, synthetic accessibility, and the log octanol–water partition coefficient. This result confirms that L-MolGAN can generate various drug-like molecules despite being optimized for a single property, i.e., QED. This method will contribute to the efficient discovery of new molecules of larger sizes than those generated with the existing method.
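The connectivity check and zero reward described in this abstract could look roughly like the sketch below. It is a minimal illustration, not the L-MolGAN code: the graph encoding (an atom count plus a list of bonded index pairs) and the names `is_connected` and `connectivity_gated_reward` are assumptions.

```python
def is_connected(num_atoms, bonds):
    """Depth-first search from atom 0; True if every atom is reached."""
    if num_atoms == 0:
        return False
    adjacency = {i: [] for i in range(num_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    visited = set()
    stack = [0]
    while stack:
        atom = stack.pop()
        if atom in visited:
            continue
        visited.add(atom)
        stack.extend(n for n in adjacency[atom] if n not in visited)
    return len(visited) == num_atoms


def connectivity_gated_reward(num_atoms, bonds, property_reward):
    """Pass the property reward through only for connected graphs."""
    return property_reward if is_connected(num_atoms, bonds) else 0.0


# A 4-atom chain is connected; dropping one bond splits it in two.
chain = [(0, 1), (1, 2), (2, 3)]
broken = [(0, 1), (2, 3)]
r_chain = connectivity_gated_reward(4, chain, property_reward=0.8)
r_broken = connectivity_gated_reward(4, broken, property_reward=0.8)
```

Here `property_reward` stands in for whatever score the reinforcement-learning objective would otherwise assign, for example a QED value.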

L-MolGAN: An improved implicit generative model for large molecular graphs

10.26434/chemrxiv.14569545.v1 ◽

2021 ◽

Author(s):

Yutaka Tsujimoto ◽

Satoru Hiwa ◽

Yushi Nakamura ◽

Yohei Oe ◽

Tomoyuki Hiroyasu

Keyword(s):

Molecular Graph ◽

Chemical Properties ◽

Molecular Size ◽

Material Design ◽

Generative Models ◽

Molecular Structures ◽

Baseline Model ◽

Molecular Graphs ◽

Deep generative models are used to generate arbitrary molecular structures with desired chemical properties. MolGAN is a well-known molecular generation model that uses generative adversarial networks (GANs) and reinforcement learning to generate molecular graphs in one shot. MolGAN can effectively generate small molecular graphs with nine or fewer heavy atoms. However, the graphs tend to become disconnected as the molecular size increases. This poses a challenge for drug discovery and material design, where the candidates may include large molecules. This study develops an improved MolGAN for large molecule generation (L-MolGAN). In this model, the connectivity of molecular graphs is evaluated by a depth-first search during model training. When a disconnected molecular graph is generated, L-MolGAN assigns the graph a reward of zero. This procedure decreases the number of disconnected graphs and consequently increases the number of connected molecular graphs. The effectiveness of L-MolGAN is evaluated experimentally. The size and connectivity of the molecular graphs generated with data from the ZINC-250k molecular dataset are assessed using MolGAN as the baseline model. The model is then optimized for the quantitative estimate of drug-likeness (QED) to generate drug-like molecules. The experimental results indicate that the connectivity measure of generated molecular graphs improved by 1.96 compared with the baseline model at a larger maximum molecular size of 20 atoms. The molecules generated by L-MolGAN are evaluated in terms of multiple chemical properties that are important in drug design: QED, synthetic accessibility, and the log octanol–water partition coefficient. This result confirms that L-MolGAN can generate various drug-like molecules despite being optimized for a single property, i.e., QED. This method will contribute to the efficient discovery of new molecules of larger sizes than those generated with the existing method.

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

Communicative Representation Learning on Attributed Molecular Graphs

10.24963/ijcai.2020/392 ◽

2020 ◽

Cited By ~ 3

Author(s):

Ying Song ◽

Shuangjia Zheng ◽

Zhangming Niu ◽

Zhang-hua Fu ◽

Yutong Lu ◽

...

Keyword(s):

Neural Network ◽

Message Passing ◽

Molecular Graph ◽

Representation Learning ◽

Generation Process ◽

Molecular Graphs ◽

Graph Modeling ◽

Proposed Model ◽

The One ◽

Graph Neural Networks

Constructing proper representations of molecules lies at the core of numerous tasks such as molecular property prediction and drug design. Graph neural networks, especially the message passing neural network (MPNN) and its variants, have recently made remarkable achievements in molecular graph modeling. Although powerful, existing MPNN methods focus one-sidedly on either atom (node) or bond (edge) information, which leads to insufficient representations of attributed molecular graphs. Herein, we propose a Communicative Message Passing Neural Network (CMPNN) to improve the molecular embedding by strengthening the message interactions between nodes and edges through a communicative kernel. In addition, the message generation process is enriched by introducing a new message booster module. Extensive experiments demonstrated that the proposed model outperformed state-of-the-art baselines on six chemical property datasets. Further visualization also showed the better representation capacity of our model.

L-MolGAN: An improved implicit generative model for large molecular graphs

10.26434/chemrxiv.14569545.v3 ◽

2021 ◽

Author(s):

Yutaka Tsujimoto ◽

Satoru Hiwa ◽

Yushi Nakamura ◽

Yohei Oe ◽

Tomoyuki Hiroyasu

Keyword(s):

Molecular Graph ◽

Chemical Properties ◽

Molecular Size ◽

Material Design ◽

Generative Models ◽

Molecular Structures ◽

Baseline Model ◽

Molecular Graphs ◽

Deep generative models are used to generate arbitrary molecular structures with desired chemical properties. MolGAN is a well-known molecular generation model that uses generative adversarial networks (GANs) and reinforcement learning to generate molecular graphs in one shot. MolGAN can effectively generate small molecular graphs with nine or fewer heavy atoms. However, the graphs tend to become disconnected as the molecular size increases. This poses a challenge for drug discovery and material design, where the candidates may include large molecules. This study develops an improved MolGAN for large molecule generation (L-MolGAN). In this model, the connectivity of molecular graphs is evaluated by a depth-first search during model training. When a disconnected molecular graph is generated, L-MolGAN assigns the graph a reward of zero. This procedure decreases the number of disconnected graphs and consequently increases the number of connected molecular graphs. The effectiveness of L-MolGAN is evaluated experimentally. The size and connectivity of the molecular graphs generated with data from the ZINC-250k molecular dataset are assessed using MolGAN as the baseline model. The model is then optimized for the quantitative estimate of drug-likeness (QED) to generate drug-like molecules. The experimental results indicate that the connectivity measure of generated molecular graphs improved by 1.96 compared with the baseline model at a larger maximum molecular size of 20 atoms. The molecules generated by L-MolGAN are evaluated in terms of multiple chemical properties that are important in drug design: QED, synthetic accessibility, and the log octanol–water partition coefficient. This result confirms that L-MolGAN can generate various drug-like molecules despite being optimized for a single property, i.e., QED. This method will contribute to the efficient discovery of new molecules of larger sizes than those generated with the existing method.

L-MolGAN: An improved implicit generative model for large molecular graphs

10.26434/chemrxiv.14569545 ◽

2021 ◽

Author(s):

Yutaka Tsujimoto ◽

Satoru Hiwa ◽

Yushi Nakamura ◽

Yohei Oe ◽

Tomoyuki Hiroyasu

Keyword(s):

Molecular Graph ◽

Chemical Properties ◽

Molecular Size ◽

Material Design ◽

Generative Models ◽

Molecular Structures ◽

Baseline Model ◽

Molecular Graphs ◽

Deep generative models are used to generate arbitrary molecular structures with desired chemical properties. MolGAN is a well-known molecular generation model that uses generative adversarial networks (GANs) and reinforcement learning to generate molecular graphs in one shot. MolGAN can effectively generate small molecular graphs with nine or fewer heavy atoms. However, the graphs tend to become disconnected as the molecular size increases. This poses a challenge for drug discovery and material design, where the candidates may include large molecules. This study develops an improved MolGAN for large molecule generation (L-MolGAN). In this model, the connectivity of molecular graphs is evaluated by a depth-first search during model training. When a disconnected molecular graph is generated, L-MolGAN assigns the graph a reward of zero. This procedure decreases the number of disconnected graphs and consequently increases the number of connected molecular graphs. The effectiveness of L-MolGAN is evaluated experimentally. The size and connectivity of the molecular graphs generated with data from the ZINC-250k molecular dataset are assessed using MolGAN as the baseline model. The model is then optimized for the quantitative estimate of drug-likeness (QED) to generate drug-like molecules. The experimental results indicate that the connectivity measure of generated molecular graphs improved by 1.96 compared with the baseline model at a larger maximum molecular size of 20 atoms. The molecules generated by L-MolGAN are evaluated in terms of multiple chemical properties that are important in drug design: QED, synthetic accessibility, and the log octanol–water partition coefficient. This result confirms that L-MolGAN can generate various drug-like molecules despite being optimized for a single property, i.e., QED. This method will contribute to the efficient discovery of new molecules of larger sizes than those generated with the existing method.

LeafGo: Leaf to Genome, a quick workflow to produce high-quality De novo genomes with Third Generation Sequencing technology

10.1101/2021.01.25.428044 ◽

2021 ◽

Author(s):

Patrick Driguez ◽

Salim Bougouffa ◽

Karen Carty ◽

Alexander Putra ◽

Kamel Jabbari ◽

...

Keyword(s):

De Novo ◽

Rapid Development ◽

Plant Genome ◽

Plant Genomics ◽

High Quality ◽

High Molecular Weight Dna ◽

Tissue Samples ◽

Sequencing Technologies ◽

The Cost ◽

New Generation

Recent years have witnessed rapid development of sequencing technologies. Fundamental differences and limitations among the various platforms affect the time, cost and accuracy of whole-genome sequencing. Here we designed a complete de novo plant genome generation workflow that starts from plant tissue samples and produces high-quality draft genomes with relatively modest laboratory and bioinformatic resources within seven days. To optimize our workflow, we selected different plant species, from which we extracted high molecular weight DNA to make PacBio and ONT libraries for sequencing on the Sequel I, Sequel II and GridION platforms. We assembled high-quality draft genomes of two Eucalyptus species, E. rudis and E. camaldulensis, to chromosome level without using additional scaffolding technologies. For the rapid production of de novo genome assemblies of plant species, we showed that our DNA extraction protocol, followed by PacBio high-fidelity sequencing and assembly with new-generation assemblers such as hifiasm, produces excellent results. Our findings will be a valuable benchmark for groups planning wet- and dry-lab plant genomics research and for high-throughput plant genomics initiatives.