An Open Source Chemical Structure Curation Pipeline using RDKit

2020 ◽  
Author(s):  
A Patrícia Bento ◽  
Anne Hersey ◽  
Eloy Felix ◽  
Greg Landrum ◽  
Anna Gaulton ◽  
...  

Abstract Background The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final database and to easily compare and integrate data on the same compound from different sources it is necessary for the chemical structures in the database to be appropriately standardised. Results A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker to test the validity of chemical structures and flag any serious errors; a Standardizer which formats compounds according to defined rules and conventions and a GetParent component that removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database as well as uncurated datasets from other sources to test the robustness of the process and to identify common issues in database molecular structures. Conclusion All the components of the structure pipeline have been made freely available for other researchers to use and adapt for their own use. The code is available in a GitHub repository and it can also be accessed via the ChEMBL Beaker webservices. It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database and the compound validity checker has been used to identify compounds with the most serious issues so that they can be prioritised for manual curation.
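
The idea behind the GetParent step can be illustrated with a toy sketch. This is not the authors' code and does not use RDKit: it assumes, as a simplification, that the parent is the dot-separated SMILES fragment with the most heavy atoms, whereas the abstract describes removal of salts and solvents. The names `heavy_atom_count` and `toy_parent` are hypothetical.

```python
import re

# Matches bracket atoms such as [Na+], two-letter halogens, then the
# remaining organic-subset and aromatic atom symbols of SMILES.
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Explicit hydrogens or protons ([H], [H+], [2H]) are not heavy atoms.
_HYDROGEN = re.compile(r"\[\d*H[+-]?\d*\]")


def heavy_atom_count(fragment: str) -> int:
    """Count non-hydrogen atom tokens in one SMILES fragment."""
    return sum(
        1 for token in _ATOM.findall(fragment)
        if not _HYDROGEN.fullmatch(token)
    )


def toy_parent(smiles: str) -> str:
    """Return the largest fragment of a dot-disconnected SMILES string.

    Assumption: "largest" stands in for "parent". Ties return the first
    fragment, and charges are left untouched.
    """
    fragments = [f for f in smiles.split(".") if f]
    if not fragments:
        raise ValueError("empty SMILES string")
    return max(fragments, key=heavy_atom_count)


if __name__ == "__main__":
    # Sodium salt of aspirin: the counter-ion is dropped.
    print(toy_parent("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))
```

A size heuristic like this would fail for, say, a small drug paired with a large counter-ion, which is one reason a rule-based approach of the kind the abstract describes is preferable in practice.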


2020 ◽  
Author(s):  
Faustine Lorquin ◽  
Fabio Ziarelli ◽  
Agnès Amouric ◽  
Carole Di Giorgio ◽  
Maxime Robin ◽  
...  

Abstract Pyomelanin is a polymer of homogentisic acid (HGA) synthesized by microorganisms. This work aimed to develop a production process and evaluate the quality of the pigment. Three procedures were developed and optimized: (1) an HGA-Mn2+ chemical autoxidation (PyoCHEM, yield 0.317 g/g substrate); (2) an induced bacterial culture of Halomonas titanicae through the 4-hydroxyphenylacetic acid-1-hydroxylase route (PyoBACT, 0.55 g/L); and (3) a process using a recombinant laccase, which gave the highest yield (PyoENZ, 1.25 g/g substrate) and met all the criteria for a large-scale prototype. The chemical structures were investigated by 13C solid-state NMR (CP-MAS) and FTIR. Car-Car bindings predominated in the three polymers and Car-O-Car (ether) linkages were absent, suggesting mainly C3-C6 (β-bindings) and C4-C6 (α-bindings) configurations. This work highlighted a biological decarboxylation by the laccase or bacterial oxidase(s), leading to the partial formation of gentisyl alcohol and gentisaldehyde, which are integral parts of the polymer. By comparison, PyoENZ exhibited an Mw of 5,700 Da, was hyperthermostable, non-cytotoxic even after irradiation, scavenged ROS induced by keratinocytes, and had high DPPH-antioxidant and Fe3+-reducing activity. As a representative pigment of living cells and an available standard, PyoENZ might also be useful for applications in extreme conditions and skin protection.


Author(s):  
Prosanta Sarkar ◽  
Anita Pal ◽  
Nilanjan De

A graph is a mathematical model used to predict the topology of a given system. In chemical graph theory, a graph is constructed by treating the atoms of a particular molecule as vertices and the bonds between them as edges. A topological index, or molecular structure descriptor, is a numeric quantity associated with the chemical constitution that correlates with various physicochemical properties of the chemical structure. In this paper, we study the [Formula: see text]-Zagreb index of line graphs of the subdivision graphs of some chemical structures.
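
As a rough illustration of the constructions named above, the sketch below builds the subdivision graph and the line graph of a simple graph and evaluates the classical first Zagreb index (the sum of squared vertex degrees). The specific index studied in the paper is elided in this listing, so the first Zagreb index is used here purely as an assumed stand-in; all function names are hypothetical.

```python
from itertools import combinations


def subdivision(edges):
    """Subdivision graph: insert one new degree-2 vertex on every edge."""
    new_edges = []
    for i, (u, v) in enumerate(edges):
        mid = ("s", i)  # label of the inserted vertex
        new_edges.append((u, mid))
        new_edges.append((mid, v))
    return new_edges


def line_graph(edges):
    """Line graph: each input edge becomes a vertex; two such vertices
    are adjacent when the original edges share an endpoint."""
    return [(e, f) for e, f in combinations(edges, 2) if set(e) & set(f)]


def first_zagreb(edges):
    """First Zagreb index: sum of squared degrees over all vertices.
    Isolated vertices contribute zero, so they can be ignored."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(d * d for d in degree.values())


if __name__ == "__main__":
    # Carbon skeleton of propane as a path on three vertices.
    propane = [(1, 2), (2, 3)]
    print(first_zagreb(propane))
    print(first_zagreb(line_graph(subdivision(propane))))
```

The input is assumed to be a simple graph (no loops or repeated edges) given as a list of vertex pairs.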


Author(s):  
Aditya Divyakant Shrivastava ◽  
Neil Swainston ◽  
Soumitra Samanta ◽  
Ivayla Roberts ◽  
Marina Wright Muelas ◽  
...  

The ‘inverse problem’ of mass spectrometric molecular identification (‘given a mass spectrum, calculate/predict the 2D structure of the molecule whence it came’) is largely unsolved, and is especially acute in metabolomics where many small molecules remain unidentified. This is largely because the number of experimentally available electrospray mass spectra of small molecules is quite limited. However, the forward problem (‘calculate a small molecule’s likely fragmentation and hence at least some of its mass spectrum from its structure alone’) is much more tractable, because the strengths of different chemical bonds are roughly known. This kind of molecular identification problem may be cast as a language translation problem in which the source language is a list of high-resolution mass spectral peaks and the ‘translation’ a representation (for instance in SMILES) of the molecule. It is thus suitable for attack using the deep neural networks known as transformers. We here present MassGenie, a method that uses a transformer-based deep neural network, trained on ~6 million chemical structures with augmented SMILES encoding and their paired molecular fragments as generated in silico, explicitly including the protonated molecular ion. This architecture (containing some 400 million elements) is used to predict the structure of a molecule from the various fragments that may be expected to be observed when some of its bonds are broken. Despite being given essentially no detailed nor explicit rules about molecular fragmentation methods, isotope patterns, rearrangements, neutral losses, and the like, MassGenie learns the effective properties of the mass spectral fragment and valency space, and can generate candidate molecular structures that are very close or identical to those of the ‘true’ molecules. We also use VAE-Sim, a previously published variational autoencoder, to generate candidate molecules that are ‘similar’ to the top hit. 
In addition to using the ‘top hits’ directly, we can produce a rank order of these by ‘round-tripping’ candidate molecules and comparing them with the true molecules, where known. As a proof of principle, we confine ourselves to positive electrospray mass spectra from molecules with a molecular mass of 500Da or lower, including those in the last CASMI challenge (for which the results are known), getting 49/93 (53%) precisely correct. The transformer method, applied here for the first time to mass spectral interpretation, works extremely effectively both for mass spectra generated in silico and on experimentally obtained mass spectra from pure compounds. It seems to act as a Las Vegas algorithm, in that it either gives the correct answer or simply states that it cannot find one. The ability to create and to ‘learn’ millions of fragmentation patterns in silico, and therefrom generate candidate structures (that do not have to be in existing libraries) directly, thus opens up entirely the field of de novo small molecule structure prediction from experimental mass spectra.
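
The protonated molecular ion mentioned above can be made concrete with a short calculation. The sketch below is not part of MassGenie; it only assumes standard monoisotopic masses for C, H, N and O and the proton mass, and computes the m/z of a singly protonated ion from a molecular formula. The function names and the caffeine example are illustrative choices.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes; only a few
# elements are included in this sketch.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
}
PROTON = 1.00727647  # mass of a proton in Da


def monoisotopic_mass(formula: str) -> float:
    """Sum element masses for a simple formula such as 'C8H10N4O2'.

    Raises KeyError for elements missing from MONOISOTOPIC.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[symbol] * (int(count) if count else 1)
    return total


def protonated_mz(formula: str) -> float:
    """m/z of the singly charged [M+H]+ ion."""
    return monoisotopic_mass(formula) + PROTON


if __name__ == "__main__":
    # Caffeine, C8H10N4O2
    print(round(protonated_mz("C8H10N4O2"), 4))
```

In positive electrospray mode, a peak at this m/z is the one that would correspond to the intact protonated molecule.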


2012 ◽  
Vol 2012 ◽  
pp. 1-15 ◽  
Author(s):  
Beatrice Ch. D. Salert ◽  
Armin Wedel ◽  
Lutz Grubert ◽  
Thomas Eberle ◽  
Rémi Anémian ◽  
...  

This paper describes the synthesis of new electron-transporting styrene monomers and their corresponding polystyrenes, all with a 2,4,6-triphenyl-1,3,5-triazine basic structure in the side group. The monomers differ in the alkyl substitution and in the meta-/para-linkage of the triazine to the polymer backbone. The thermal and spectroscopic properties of the new electron-transporting polymers are discussed with regard to their chemical structures. Phosphorescent OLEDs were prepared using the obtained electron-transporting polymers as the emissive layer material in blend systems together with a green iridium-based emitter 13 and a small molecule as an additional cohost with wide band gap characteristics (CoH-001). The performance of the OLEDs was characterized and discussed with regard to the chemical structure of the new electron-transporting polymers.


2021 ◽  
Author(s):  
Aditya Divyakant Shrivastava ◽  
Neil Swainston ◽  
Soumitra Samanta ◽  
Ivayla Roberts ◽  
Marina Wright Muelas ◽  
...  

The ‘inverse problem’ of mass spectrometric molecular identification (‘given a mass spectrum, calculate the molecule whence it came’) is largely unsolved, and is especially acute in metabolomics where many small molecules remain unidentified. This is largely because the number of experimentally available electrospray mass spectra of small molecules is quite limited. However, the forward problem (‘calculate a small molecule’s likely fragmentation and hence at least some of its mass spectrum from its structure alone’) is much more tractable, because the strengths of different chemical bonds are roughly known. This kind of molecular identification problem may be cast as a language translation problem in which the source language is a list of high-resolution mass spectral peaks and the ‘translation’ a representation (for instance in SMILES) of the molecule. It is thus suitable for attack using the deep neural networks known as transformers. We here present MassGenie, a method that uses a transformer-based deep neural network, trained on ~6 million chemical structures with augmented SMILES encoding and their paired molecular fragments as generated in silico, explicitly including the protonated molecular ion. This architecture (containing some 400 million elements) is used to predict the structure of a molecule from the various fragments that may be expected to be observed when some of its bonds are broken. Despite being given essentially no detailed nor explicit rules about molecular fragmentation methods, isotope patterns, rearrangements, neutral losses, and the like, MassGenie learns the effective properties of the mass spectral fragment and valency space, and can generate candidate molecular structures that are very close or identical to those of the ‘true’ molecules. We also use VAE-Sim, a previously published variational autoencoder, to generate candidate molecules that are ‘similar’ to the top hit.
In addition to using the ‘top hits’ directly, we can produce a rank order of these by ‘round-tripping’ candidate molecules and comparing them with the true molecules, where known. As a proof of principle, we confine ourselves to positive electrospray mass spectra from molecules with a molecular mass of 500 Da or lower. The transformer method, applied here for the first time to mass spectral interpretation, works extremely effectively both for mass spectra generated in silico and on experimentally obtained mass spectra from pure compounds. The ability to create and to ‘learn’ millions of fragmentation patterns in silico, and therefrom generate candidate structures (that do not have to be in existing libraries) directly, thus opens up entirely the field of de novo small molecule structure prediction from experimental mass spectra.


Author(s):  
N.-H. Cho ◽  
K.M. Krishnan ◽  
D.B. Bogy

Diamond-like carbon (DLC) films have attracted much attention due to their useful properties and applications. These properties vary considerably with film preparation techniques and conditions. DLC is a metastable state formed from highly non-equilibrium phases during the condensation of ionized particles. The nature of the films is therefore strongly dependent on their particular chemical structures. In this study, electron energy loss spectroscopy (EELS) was used to investigate how the chemical bonding configurations of DLC films vary as a function of sputtering power density. The electrical resistivity of the films was determined and related to their chemical structure. DLC films with a thickness of about 300 Å were prepared at 0.1, 1.1, 2.1, and 10.0 W/cm², respectively, on NaCl substrates by d.c. magnetron sputtering. EEL spectra were obtained from diamond, graphite, and the films using a JEOL 200 CX electron microscope operating at 200 kV. A Gatan parallel EEL spectrometer and a Kevex data acquisition system were used to analyze the energy distribution of transmitted electrons. The electrical resistivity of the films was measured by the four-point probe method.


2018 ◽  
Author(s):  
William A. Shirley ◽  
Brian P. Kelley ◽  
Yohann Potier ◽  
John H. Koschwanez ◽  
Robert Bruccoleri ◽  
...  

This pre-print explores ensemble modeling of natural product targets to match chemical structures to precursors found in the large open-source gene cluster repository antiSMASH. Commentary on the method, its effectiveness, and its limitations is included. All structures are public domain molecules and have been reviewed for release.


Author(s):  
Oksana Bitlian ◽  
Oksana Kravchenko ◽  
Tetiana Kodak ◽  
Andrii Onyshchenko ◽  
Tetiana Konks

An analysis of the literature shows that the type of packaging and the material from which it is made are important among the factors that influence the storage of feed products and help prevent loss of quality in raw materials and finished products. The purpose of our research was therefore to provide a technological rationale for the changes in the quality indices of premix samples containing trace-element salts of different chemical nature during storage. Standard zootechnical and statistical methods were used. Premixes for pig feeding should be formulated to take into account the biogeochemical properties of the region for which they are intended. Feeds have a biochemical composition that depends on regional properties, and any excess or deficiency of individual substances should be offset by the composition of the premix. Ignoring this principle inevitably leads to inappropriate use of BAR, an imbalance of the diet relative to physiological needs, and inefficiency in the industry. This in turn requires that products be purchased and stored for the period of use. BAR of different chemical structures react differently during storage and their quality indices change, which reduces the productive activity of the active substances. The moisture content of the premixes varied within 12.0-13.0 %, which exceeded the normative value but was not critical. The premix with sulfuric acid salts had the highest acidity value (6.9 units) and the premix with lysates the lowest (5.7 units). Qualitative tests were positive for vitamins A, D and B2 and for the macro- and micronutrients potassium, magnesium, copper, zinc, cobalt and iodine.
These changes in the properties of premixes during storage must be taken into account when developing the technological basis for feeding pigs in order to obtain high gains and high-quality products. Key words: premix, micro- and macroelements, combined fodders, fodder mixes, chelating compounds, feeding, using, pigs' livestock.


2018 ◽  
Vol 15 (8) ◽  
pp. 1109-1123
Author(s):  
Jonas da Silva Santos ◽  
Joel Jones Junior ◽  
Flavia M. da Silva

Background: We present here the synthesis of 1,3-thiazolidin-4-one (1) and its functionalised analogues, such as the classical isosteres glitazone (1,3-thiazolidine-2,4-dione) (2), rhodanine (2-thioxo-1,3-thiazolidin-4-one) (3) and pseudothiohydantoin (2-imino-1,3-thiazolidin-4-one) (4), from the mid-nineteenth century to the present day (1865-2018). Objective: The review focuses on how the representation of these molecular structures has changed over time since the first discussions of structural theory by Kekulé, Couper and Butlerov. Moreover, advanced synthetic methodologies, including green chemistry, have been developed for obtaining these nuclei. We discuss their structure and stability and show their great biological potential. Conclusion: The 1,3-thiazolidin-4-one nucleus and functionalised analogues such as glitazones (1,3-thiazolidine-2,4-diones), rhodanines (2-thioxo-1,3-thiazolidin-4-ones) and pseudothiohydantoins (2-imino-1,3-thiazolidin-4-ones) have great pharmacological importance, and they are already found in commercial pharmaceuticals. Studies indicate a promising future in the area of medicinal chemistry, with potential activities against different diseases. The synthesis of these nuclei started in the mid-nineteenth century (1865), with the first discussions of structural theory by Kekulé, Couper and Butlerov. The present study has shown how the representations of these molecular structures have differed over time. Since then, various synthetic methodologies have been developed for obtaining these nuclei, and several studies on their structural and biological properties have been performed. Studies on the green synthesis of these compounds were also presented here; these are the result of growing environmental awareness. Additionally, the planet Earth is already showing clear signs of depletion, which is currently decreasing the quality of life.

