MassGenie: A Transformer-Based Deep Learning Method for Identifying Small Molecules From Their Mass Spectra

The ‘inverse problem’ of mass spectrometric molecular identification (‘given a mass spectrum, calculate the molecule whence it came’) is largely unsolved, and is especially acute in metabolomics where many small molecules remain unidentified. This is largely because the number of experimentally available electrospray mass spectra of small molecules is quite limited. However, the forward problem (‘calculate a small molecule’s likely fragmentation and hence at least some of its mass spectrum from its structure alone’) is much more tractable, because the strengths of different chemical bonds are roughly known. This kind of molecular identification problem may be cast as a language translation problem in which the source language is a list of high-resolution mass spectral peaks and the ‘translation’ a representation (for instance in SMILES) of the molecule. It is thus suitable for attack using the deep neural networks known as transformers. We here present MassGenie, a method that uses a transformer-based deep neural network, trained on ~6 million chemical structures with augmented SMILES encoding and their paired molecular fragments as generated in silico, explicitly including the protonated molecular ion. This architecture (containing some 400 million elements) is used to predict the structure of a molecule from the various fragments that may be expected to be observed when some of its bonds are broken. Despite being given essentially no detailed nor explicit rules about molecular fragmentation methods, isotope patterns, rearrangements, neutral losses, and the like, MassGenie learns the effective properties of the mass spectral fragment and valency space, and can generate candidate molecular structures that are very close or identical to those of the ‘true’ molecules. We also use VAE-Sim, a previously published variational autoencoder, to generate candidate molecules that are ‘similar’ to the top hit.
In addition to using the ‘top hits’ directly, we can produce a rank order of these by ‘round-tripping’ candidate molecules and comparing them with the true molecules, where known. As a proof of principle, we confine ourselves to positive electrospray mass spectra from molecules with a molecular mass of 500 Da or lower. The transformer method, applied here for the first time to mass spectral interpretation, works extremely effectively both for mass spectra generated in silico and on experimentally obtained mass spectra from pure compounds. The ability to create and to ‘learn’ millions of fragmentation patterns in silico, and therefrom generate candidate structures (that do not have to be in existing libraries) directly, thus opens up entirely the field of de novo small molecule structure prediction from experimental mass spectra.

MassGenie: A Transformer-Based Deep Learning Method for Identifying Small Molecules from Their Mass Spectra

Biomolecules ◽ 10.3390/biom11121793 ◽ 2021 ◽ Vol 11 (12) ◽ pp. 1793

Author(s): Aditya Divyakant Shrivastava ◽ Neil Swainston ◽ Soumitra Samanta ◽ Ivayla Roberts ◽ Marina Wright Muelas ◽ ...

Keyword(s): Mass Spectrum ◽ Small Molecules ◽ Molecular Identification ◽ In Silico ◽ Mass Spectra ◽ De Novo ◽ Effective Properties ◽ Molecular Structures ◽ Mass Spectral ◽ Chemical Structures

The ‘inverse problem’ of mass spectrometric molecular identification (‘given a mass spectrum, calculate/predict the 2D structure of the molecule whence it came’) is largely unsolved, and is especially acute in metabolomics where many small molecules remain unidentified. This is largely because the number of experimentally available electrospray mass spectra of small molecules is quite limited. However, the forward problem (‘calculate a small molecule’s likely fragmentation and hence at least some of its mass spectrum from its structure alone’) is much more tractable, because the strengths of different chemical bonds are roughly known. This kind of molecular identification problem may be cast as a language translation problem in which the source language is a list of high-resolution mass spectral peaks and the ‘translation’ a representation (for instance in SMILES) of the molecule. It is thus suitable for attack using the deep neural networks known as transformers. We here present MassGenie, a method that uses a transformer-based deep neural network, trained on ~6 million chemical structures with augmented SMILES encoding and their paired molecular fragments as generated in silico, explicitly including the protonated molecular ion. This architecture (containing some 400 million elements) is used to predict the structure of a molecule from the various fragments that may be expected to be observed when some of its bonds are broken. Despite being given essentially no detailed nor explicit rules about molecular fragmentation methods, isotope patterns, rearrangements, neutral losses, and the like, MassGenie learns the effective properties of the mass spectral fragment and valency space, and can generate candidate molecular structures that are very close or identical to those of the ‘true’ molecules. We also use VAE-Sim, a previously published variational autoencoder, to generate candidate molecules that are ‘similar’ to the top hit. 
In addition to using the ‘top hits’ directly, we can produce a rank order of these by ‘round-tripping’ candidate molecules and comparing them with the true molecules, where known. As a proof of principle, we confine ourselves to positive electrospray mass spectra from molecules with a molecular mass of 500 Da or lower, including those in the last CASMI challenge (for which the results are known), getting 49/93 (53%) precisely correct. The transformer method, applied here for the first time to mass spectral interpretation, works extremely effectively both for mass spectra generated in silico and on experimentally obtained mass spectra from pure compounds. It seems to act as a Las Vegas algorithm, in that it either gives the correct answer or simply states that it cannot find one. The ability to create and to ‘learn’ millions of fragmentation patterns in silico, and therefrom generate candidate structures (that do not have to be in existing libraries) directly, thus opens up entirely the field of de novo small molecule structure prediction from experimental mass spectra.
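The translation framing and the 'round-tripping' step described in this abstract can be illustrated with a minimal Python sketch. It is not MassGenie's implementation: the tokenisation scheme, the 0.01 m/z tolerance, the scoring rule (fraction of observed peaks explained), and the candidate labels are all assumptions made for illustration, and the predicted fragment masses are assumed to come from some in silico fragmenter that is not shown.

```python
from typing import Dict, List, Tuple


def peaks_to_tokens(peaks: List[float], decimals: int = 2) -> List[str]:
    """Encode m/z values as sorted string tokens (a hypothetical 'source language')."""
    return [f"{mz:.{decimals}f}" for mz in sorted(peaks)]


def matched_fraction(observed: List[float], predicted: List[float],
                     tol: float = 0.01) -> float:
    """Fraction of observed peaks that lie within `tol` of any predicted fragment."""
    if not observed:
        return 0.0
    hits = sum(1 for o in observed if any(abs(o - p) <= tol for p in predicted))
    return hits / len(observed)


def rank_candidates(observed: List[float],
                    candidates: Dict[str, List[float]],
                    tol: float = 0.01) -> List[Tuple[str, float]]:
    """Round-trip ranking: score each candidate by how well its predicted
    fragments reproduce the observed peak list, best first."""
    scored = [(name, matched_fraction(observed, frags, tol))
              for name, frags in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder peak list and candidates; the values are illustrative only.
    observed = [100.0, 150.05, 200.1]
    candidates = {
        "CAND_A": [100.0, 150.05, 200.1],
        "CAND_B": [100.0, 175.0],
    }
    print(peaks_to_tokens(observed))
    print(rank_candidates(observed, candidates))
```

In a real pipeline the candidate keys would be SMILES strings proposed by the model, and the score would more plausibly weight peak intensities as well as positions.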

MolDiscovery: Learning Mass Spectrometry Fragmentation of Small Molecules

10.1101/2020.11.28.401943 ◽ 2020

Author(s): Liu Cao ◽ Mustafa Guler ◽ Azat Tagirdzhanov ◽ Yiyuan Lee ◽ Alexey Gurevich ◽ ...

Keyword(s): Mass Spectrometry ◽ Small Molecules ◽ Small Molecule ◽ Probabilistic Model ◽ Mass Spectra ◽ Database Search ◽ Molecular Structures ◽ Mass Spectral Database ◽ Mass Spectral ◽ Spectral Database

Abstract Identification of small molecules is a critical task in various areas of life science. Recent advances in mass spectrometry have enabled the collection of tandem mass spectra of small molecules from hundreds of thousands of environments. To identify which molecules are present in a sample, one can search mass spectra collected from the sample against millions of molecular structures in small molecule databases. This is a challenging task as currently it is not clear how small molecules are fragmented in mass spectrometry. The existing approaches use the domain knowledge from chemistry to predict fragmentation of molecules. However, these rule-based methods fail to explain many of the peaks in mass spectra of small molecules. Recently, spectral libraries with tens of thousands of labelled mass spectra of small molecules have emerged, paving the path for learning more accurate fragmentation models for mass spectral database search. We present molDiscovery, a mass spectral database search method that improves both efficiency and accuracy of small molecule identification by (i) utilizing an efficient algorithm to generate mass spectrometry fragmentations, and (ii) learning a probabilistic model to match small molecules with their mass spectra. We show our database search is an order of magnitude more efficient than the state-of-the-art methods, which enables searching against databases with millions of molecules. A search of over 8 million spectra from the Global Natural Product Social molecular networking infrastructure shows that our probabilistic model can correctly identify nearly six times more unique small molecules than previous methods. Moreover, by applying molDiscovery on microbial datasets with both mass spectral and genomics data we successfully discovered the novel biosynthetic gene clusters of three families of small molecules.
Availability: The command-line version of molDiscovery and its online web service through the GNPS infrastructure are available at https://github.com/mohimanilab/molDiscovery.

MolDiscovery: Learning Mass Spectrometry Fragmentation of Small Molecules

10.21203/rs.3.rs-71854/v1 ◽ 2020

Author(s): Hosein Mohimani ◽ Liu Cao ◽ Mustafa Guler ◽ Azat Tagirdzhanov ◽ Alexey Gurevich

Keyword(s): Mass Spectrometry ◽ Small Molecules ◽ Small Molecule ◽ Probabilistic Model ◽ Mass Spectra ◽ Database Search ◽ Molecular Structures ◽ Mass Spectral Database ◽ Mass Spectral ◽ Spectral Database

Abstract Identification of small molecules is a critical task in various areas of life science. Recent advances in mass spectrometry have enabled the collection of tandem mass spectra of small molecules from hundreds of thousands of environments. To identify which molecules are present in a sample, one can search mass spectra collected from the sample against millions of molecular structures in small molecule databases. This is a challenging task as currently it is not clear how small molecules are fragmented in mass spectrometry. The existing approaches use the domain knowledge from chemistry to predict fragmentation of molecules. However, these rule-based methods fail to explain many of the peaks in mass spectra of small molecules. Recently, spectral libraries with tens of thousands of labelled mass spectra of small molecules have emerged, paving the path for learning more accurate fragmentation models for mass spectral database search. We present molDiscovery, a mass spectral database search method that improves both efficiency and accuracy of small molecule identification by (i) utilizing an efficient algorithm to generate mass spectrometry fragmentations, and (ii) learning a probabilistic model to match small molecules with their mass spectra. We show our database search is an order of magnitude more efficient than the state-of-the-art methods, which enables searching against databases with millions of molecules. A search of over 8 million spectra from the Global Natural Product Social molecular networking infrastructure shows that our probabilistic model can correctly identify nearly six times more unique small molecules than previous methods. Moreover, by applying molDiscovery on microbial datasets with both mass spectral and genomics data we successfully discovered the novel biosynthetic gene clusters of three families of small molecules. 
Availability: The command-line version of molDiscovery and its online web service through the GNPS infrastructure are available at https://github.com/mohimanilab/molDiscovery.

MolDiscovery: learning mass spectrometry fragmentation of small molecules

Nature Communications ◽ 10.1038/s41467-021-23986-0 ◽ 2021 ◽ Vol 12 (1)

Author(s): Liu Cao ◽ Mustafa Guler ◽ Azat Tagirdzhanov ◽ Yi-Yuan Lee ◽ Alexey Gurevich ◽ ...

Keyword(s): Mass Spectrometry ◽ Small Molecules ◽ Small Molecule ◽ Domain Knowledge ◽ Mass Spectra ◽ Molecular Structures ◽ Mass Spectral Database ◽ Mass Spectral ◽ Tandem Mass Spectra ◽ Small Molecule Databases

Abstract Identification of small molecules is a critical task in various areas of life science. Recent advances in mass spectrometry have enabled the collection of tandem mass spectra of small molecules from hundreds of thousands of environments. To identify which molecules are present in a sample, one can search mass spectra collected from the sample against millions of molecular structures in small molecule databases. The existing approaches are based on chemistry domain knowledge, and they fail to explain many of the peaks in mass spectra of small molecules. Here, we present molDiscovery, a mass spectral database search method that improves both efficiency and accuracy of small molecule identification by learning a probabilistic model to match small molecules with their mass spectra. A search of over 8 million spectra from the Global Natural Product Social molecular networking infrastructure shows that molDiscovery correctly identifies six times more unique small molecules than previous methods.
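To make the idea of scoring a molecule-spectrum match with a probabilistic model more concrete, here is a minimal Python sketch of a log-likelihood-ratio score. It is not molDiscovery's actual model: the two probabilities (`p_match`, `p_random`), the 0.01 m/z tolerance, and the assumption that every predicted fragment contributes independently are illustrative choices, whereas a learned model would estimate such parameters from labelled spectral libraries.

```python
import math
from typing import List


def log_likelihood_ratio(predicted: List[float], observed: List[float],
                         p_match: float = 0.6, p_random: float = 0.05,
                         tol: float = 0.01) -> float:
    """Score a candidate molecule against a spectrum.

    p_match  : assumed probability that a true fragment yields an observed peak
    p_random : assumed probability that a peak appears at that mass by chance
    Both values are hypothetical placeholders, not learned parameters.
    """
    score = 0.0
    for frag in predicted:
        seen = any(abs(frag - mz) <= tol for mz in observed)
        if seen:
            # A matched fragment is evidence for the candidate.
            score += math.log(p_match / p_random)
        else:
            # A missing fragment is (weaker) evidence against it.
            score += math.log((1.0 - p_match) / (1.0 - p_random))
    return score


if __name__ == "__main__":
    spectrum = [100.0, 200.0]
    print(log_likelihood_ratio([100.0, 200.0], spectrum))  # both fragments explained
    print(log_likelihood_ratio([300.0], spectrum))         # fragment not observed
```

Database search then amounts to computing this score for every candidate structure and keeping the highest-scoring ones, which is why fast fragment generation matters when the database holds millions of molecules.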

Spectrometric Study of the Nitrile-Ketenimine Tautomerism

International Journal of Spectroscopy ◽ 10.1155/2009/408345 ◽ 2009 ◽ Vol 2009 ◽ pp. 1-18 ◽ Cited By ~ 2

Author(s): Hebe Saraví Cisneros ◽ Sergio Laurella ◽ Danila L. Ruiz ◽ Agustín Ponzinibbio ◽ Patricia E. Allegretti ◽ ...

Keyword(s): Mass Spectrometry ◽ Nuclear Magnetic Resonance ◽ Mass Spectrum ◽ Magnetic Resonance ◽ Mass Spectra ◽ Spectrometric Study ◽ Mass Spectral ◽ Ion Fragmentation ◽ Nuclear Magnetic Resonance Spectra ◽ Amination Product

Mass spectrometry is used to evaluate the occurrence of the nitrile-ketenimine tautomerism. Mass spectra of two differently substituted nitriles, ethyl-4,4-dicyano-3-methyl-3-butenoate and diethyl-2-cyano-3-methyl-2-pentenodiate, are examined for common mass spectral behaviors. Ion fragmentation assignments for specific tautomers allow the presence of the corresponding structures to be predicted. Additionally, the mass spectrum and nuclear magnetic resonance spectra of ethyl-4,4-dicyano-2,2-diethyl-3-methyl-3-butenoate and those of the corresponding amination product support the occurrence of the ketenimine tautomer in the equilibrium.

Predicting in silico electron ionization mass spectra using quantum chemistry

Journal of Cheminformatics ◽ 10.1186/s13321-020-00470-3 ◽ 2020 ◽ Vol 12 (1)

Author(s): Shunyang Wang ◽ Tobias Kind ◽ Dean J. Tantillo ◽ Oliver Fiehn

Keyword(s): Quantum Chemistry ◽ In Silico ◽ Mass Spectra ◽ Large Scale ◽ Computation Time ◽ Molecular Size ◽ Electron Ionization ◽ Ionization Mass ◽ Compound Identification ◽ Mass Spectral

Abstract Compound identification by mass spectrometry needs reference mass spectra. While there are over 102 million compounds in PubChem, fewer than 300,000 curated electron ionization (EI) mass spectra are available from NIST or MoNA mass spectral databases. Here, we test quantum chemistry methods (QCEIMS) to generate in silico EI mass spectra (MS) by combining molecular dynamics (MD) with statistical methods. To test the accuracy of predictions, in silico mass spectra of 451 small molecules were generated and compared to experimental spectra from the NIST 17 mass spectral library. The compounds covered 43 chemical classes, ranging up to 358 Da. Organic oxygen compounds had a lower matching accuracy, while computation time increased exponentially with molecular size. The parameter space was probed to increase prediction accuracy, including initial temperatures, the number of MD trajectories and impact excess energy (IEE). Conformational flexibility was not correlated to the accuracy of predictions. Overall, QCEIMS can predict 70 eV electron ionization spectra of chemicals from first principles. Improved methods to calculate potential energy surfaces (PES) are still needed before QCEIMS mass spectra of novel molecules can be generated at large scale.
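Comparing an in silico spectrum with an experimental reference spectrum is commonly done with a similarity score. The Python sketch below uses a plain cosine similarity on unit-mass bins; this is an assumption chosen for illustration, since the abstract does not state which matching metric was used, and the peak lists in the usage example are placeholders rather than real spectra.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def bin_spectrum(peaks: List[Peak]) -> Dict[int, float]:
    """Sum intensities into nominal (integer) m/z bins."""
    binned: Dict[int, float] = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz)] += intensity
    return dict(binned)


def cosine_similarity(spec_a: List[Peak], spec_b: List[Peak]) -> float:
    """Cosine of the angle between two binned intensity vectors (0 to 1)."""
    a, b = bin_spectrum(spec_a), bin_spectrum(spec_b)
    dot = sum(value * b.get(mz, 0.0) for mz, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder spectra, for illustration only.
    predicted = [(43.0, 100.0), (58.0, 30.0)]
    experimental = [(43.0, 90.0), (58.0, 40.0), (71.0, 5.0)]
    print(round(cosine_similarity(predicted, experimental), 3))
```

Library-search tools often weight intensities by m/z before taking the dot product; that refinement is omitted here to keep the sketch short.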

Predicting in-silico electron ionization mass spectra using quantum chemistry

10.21203/rs.3.rs-51664/v2 ◽ 2020

Author(s): Shunyang Wang ◽ Tobias Kind ◽ Dean J. Tantillo ◽ Oliver Fiehn

Keyword(s): Quantum Chemistry ◽ In Silico ◽ Mass Spectra ◽ Large Scale ◽ Computation Time ◽ Molecular Size ◽ Electron Ionization ◽ Ionization Mass ◽ Compound Identification ◽ Mass Spectral

Abstract Compound identification by mass spectrometry needs reference mass spectra. While there are over 102 million compounds in PubChem, less than 300,000 curated electron ionization (EI) mass spectra are available from NIST or MoNA mass spectral databases. Here, we test quantum chemistry methods (QCEIMS) to generate in-silico EI mass spectra (MS) by combining molecular dynamics (MD) with statistical methods. To test the accuracy of predictions, in-silico mass spectra of 451 small molecules were generated and compared to experimental spectra from the NIST 17 mass spectral library. The compounds covered 43 chemical classes, ranging up to 358 Da. Organic oxygen compounds had a lower matching accuracy, while computation time exponentially increased with molecular size. The parameter space was probed to increase prediction accuracy including initial temperatures, the number of MD trajectories and impact excess energy (IEE). Conformational flexibility was not correlated to the accuracy of predictions. Overall, QCEIMS can predict 70 eV electron ionization spectra of chemicals from first principles. Improved methods to calculate potential energy surfaces (PES) are still needed before QCEIMS mass spectra of novel molecules can be generated at large scale.

Physicochemical and Structural Parameters Contributing to the Antibacterial Activity and Efflux Susceptibility of Small Molecule Inhibitors of Escherichia coli

Antimicrobial Agents and Chemotherapy ◽ 10.1128/aac.01925-20 ◽ 2021

Author(s): Sara S. El Zahed ◽ Shawn French ◽ Maya A. Farha ◽ Garima Kumar ◽ Eric D. Brown

Keyword(s): Machine Learning ◽ Escherichia Coli ◽ Antibacterial Activity ◽ Small Molecules ◽ Small Molecule ◽ In Silico ◽ Molecular Descriptors ◽ Structural Parameters ◽ Side Chain ◽ Gram Negative

Discovering new Gram-negative antibiotics has been a challenge for decades. This has been largely attributed to a limited understanding of the molecular descriptors governing Gram-negative permeation and efflux evasion. Herein, we address the contribution of efflux using a novel approach that applies multivariate analysis, machine learning, and structure-based clustering to some 4,500 actives from a small molecule screen in efflux-compromised Escherichia coli. We employed principal-component analysis and trained two decision tree-based machine learning models to investigate descriptors contributing to the antibacterial activity and efflux susceptibility of these actives. This approach revealed that the Gram-negative activity of hydrophobic and planar small molecules with low molecular stability is limited to efflux-compromised E. coli. Further, molecules with reduced branching and compactness showed increased susceptibility to efflux. Given these distinct properties that govern efflux, we developed the first machine learning model, called Susceptibility to Efflux Random Forest (SERF), as a tool to analyze the molecular descriptors of small molecules and predict those that could be susceptible to efflux pumps in silico. Here, SERF demonstrated high accuracy in identifying such molecules. Further, we clustered all 4,500 actives based on their core structures and identified distinct clusters highlighting side chain moieties that cause marked changes in efflux susceptibility. In all, our work reveals a role for physicochemical and structural parameters in governing efflux, presents a machine learning tool for rapid in silico analysis of efflux susceptibility, and provides a proof of principle for the potential of exploiting side chain modification to design novel antimicrobials evading efflux pumps.

Too sweet: cheminformatics for deglycosylation in natural products

Journal of Cheminformatics ◽ 10.1186/s13321-020-00467-y ◽ 2020 ◽ Vol 12 (1) ◽ Cited By ~ 1

Author(s): Jonas Schaub ◽ Achim Zielesny ◽ Christoph Steinbeck ◽ Maria Sorokina

Keyword(s): Natural Products ◽ Small Molecules ◽ Open Source Software ◽ In Silico ◽ Web Application ◽ Systematic Approach ◽ Computational Procedure ◽ Molecular Structures ◽ Biological Origin ◽ And Function

Abstract Sugar units in natural products are pharmacokinetically important but often redundant, and they therefore obstruct the study of the structure and function of the aglycon. It is thus recommended to remove the sugars before a theoretical or experimental study of a molecule. Deglycogenases, enzymes specialized in sugar removal from small molecules, are often used in laboratories to perform this task. However, there is no standardized computational procedure to perform this task in silico. In this work, we present a systematic approach for in silico removal of ring and linear sugars from molecular structures. Particular attention is given to molecules of biological origin and to their structural specificities. This approach is made available in two forms, through a free and open web application and as standalone open-source software.
