Translating the Molecules: Adapting Neural Machine Translation to Predict IUPAC Names from a Chemical Identifier
We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the 2 IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online PubChem service. Training took five days on a Tesla K80 GPU, and the model achieved test-set accuracies of 95% (character-level) and 91% (whole name). The model performed particularly well on organics, with the exception of macrocycles. The predictions were less accurate for inorganic compounds, with a character-level accuracy of 71%. This can be explained by inherent limitations in InChI for representing inorganics, as well as low coverage (1.4 %) of the training data.