scholarly journals DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature

2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Kohulan Rajan ◽  
Henning Otto Brinkhaus ◽  
Maria Sorokina ◽  
Achim Zielesny ◽  
Christoph Steinbeck

AbstractChemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature. The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs. By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at https://decimer.ai, lets the user upload a pdf file and retrieve the segmented structure depictions.

2021 ◽  
Author(s):  
Kohulan Rajan ◽  
Henning Otto brinkhaus ◽  
Maria Sorokina ◽  
Achim Zielesny ◽  
Christoph Steinbeck

<p>Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature.</p><br><p>The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs. </p><br><p>By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at <a href="https://decimer.ai">https://decimer.ai</a>, lets the user upload a pdf file and retrieve the segmented structure depictions.</p><div><br></div>


2021 ◽  
Author(s):  
Kohulan Rajan ◽  
Henning Otto brinkhaus ◽  
Maria Sorokina ◽  
Achim Zielesny ◽  
Christoph Steinbeck

<p>Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature.</p><br><p>The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs. </p><br><p>By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at <a href="https://decimer.ai">https://decimer.ai</a>, lets the user upload a pdf file and retrieve the segmented structure depictions.</p><div><br></div>


2021 ◽  
Author(s):  
Kohulan Rajan ◽  
Henning Otto brinkhaus ◽  
Maria Sorokina ◽  
Achim Zielesny ◽  
Christoph Steinbeck

<p>Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature.</p><br><p>The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs. </p><br><p>By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at <a href="https://decimer.ai">https://decimer.ai</a>, lets the user upload a pdf file and retrieve the segmented structure depictions.</p><div><br></div>


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Kohulan Rajan ◽  
Achim Zielesny ◽  
Christoph Steinbeck

AbstractThe amount of data available on chemical structures and their properties has increased steadily over the past decades. In particular, articles published before the mid-1990 are available only in printed or scanned form. The extraction and storage of data from those articles in a publicly accessible database are desirable, but doing this manually is a slow and error-prone process. In order to extract chemical structure depictions and convert them into a computer-readable format, Optical Chemical Structure Recognition (OCSR) tools were developed where the best performing OCSR tools are mostly rule-based. The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods to provide an automated open-source software solution. Various current deep learning approaches were explored to seek a best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of being able to predict SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50–100 million molecules. In this article, the new DECIMER model is presented, a transformer-based network, which can predict SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and above 89% accuracy for depictions with stereochemical information.


2021 ◽  
Author(s):  
Kohulan Rajan ◽  
Christoph Steinbeck ◽  
Achim Zielesny

The use of molecular string representations for deep learning in chemistry has been steadily increasing in recent years. The complexity of existing string representations, and the difficulty in creating meaningful tokens from them, lead to the development of new string representations for chemical structures. In this study, the translation of chemical structure depictions in the form of bitmap images to corresponding molecular string representations was examined. An analysis of the recently developed DeepSMILES and SELFIES representations in comparison with the most commonly used SMILES representation is presented where the ability to translate image features into string representations with transformer models was specifically tested. The SMILES representation exhibits the best overall performance whereas SELFIES guarantee valid chemical structures. DeepSMILES performs in between SMILES and SELFIES, InChIs are not appropriate for the learning task. All investigations were carried out with publicly available datasets and the code used to train and evaluate the models has been made available to the public.


2021 ◽  
Author(s):  
Kohulan Rajan ◽  
Achim Zielesny ◽  
Christoph Steinbeck

<p>The amount of data available on chemical structures and their properties has increased exponentially over the past decades. In particular, articles published before the mid-1990 are available only in printed or scanned form. The extraction and storage of data from those articles in a publicly accessible database are desirable, but doing this manually is a slow and error-prone process. In order to extract chemical structure depictions and convert them into a computer-readable format, optical chemical structure recognition (OCSR) tools were developed where the best performing OCSR tools are mostly rule-based.</p><p> </p><p>The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods to provide an automated open-source software solution. Various current deep learning approaches were explored to seek a best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of being able to predict SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50-100 million molecules. In this article, the new DECIMER model is presented, a transformer-based network, which can predict SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and above 89% accuracy for depictions with stereochemical information.</p><p><br></p>


2021 ◽  
Author(s):  
Kohulan Rajan ◽  
Achim Zielesny ◽  
Christoph Steinbeck

<p>The amount of data available on chemical structures and their properties has increased exponentially over the past decades. In particular, articles published before the mid-1990 are available only in printed or scanned form. The extraction and storage of data from those articles in a publicly accessible database are desirable, but doing this manually is a slow and error-prone process. In order to extract chemical structure depictions and convert them into a computer-readable format, optical chemical structure recognition (OCSR) tools were developed where the best performing OCSR tools are mostly rule-based.</p><p> </p><p>The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods to provide an automated open-source software solution. Various current deep learning approaches were explored to seek a best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of being able to predict SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50-100 million molecules. In this article, the new DECIMER model is presented, a transformer-based network, which can predict SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and above 89% accuracy for depictions with stereochemical information.</p><p><br></p>


Author(s):  
Hema R. ◽  
Ajantha Devi

Chemical entities can be represented in different forms like chemical names, chemical formulae, and chemical structures. Because of the different classification frameworks for chemical names, the task of distinguishing proof or extraction of chemical elements with less ambiguous is considered a major test. Compound named entity recognition (NER) is the initial phase in any chemical-related data extraction strategy. The majority of the chemical NER is done utilizing dictionary-based, rule-based, and machine learning procedures. Recently, deep learning methods have evolved, and, in this chapter, the authors sketch out the various deep learning techniques applied for chemical NER. First, the authors introduced the fundamental concepts of chemical named entity recognition, the textual contents of chemical documents, and how these chemicals are represented in chemical literature. The chapter concludes with the strengths and weaknesses of the above methods and also the types of the chemical entities extracted.


Author(s):  
Tianyi Zhao ◽  
Yang Hu ◽  
Liang Cheng

Abstract Motivation: The functional changes of the genes, RNAs and proteins will eventually be reflected in the metabolic level. Increasing number of researchers have researched mechanism, biomarkers and targeted drugs by metabolites. However, compared with our knowledge about genes, RNAs, and proteins, we still know few about diseases-related metabolites. All the few existed methods for identifying diseases-related metabolites ignore the chemical structure of metabolites, fail to recognize the association pattern between metabolites and diseases, and fail to apply to isolated diseases and metabolites. Results: In this study, we present a graph deep learning based method, named Deep-DRM, for identifying diseases-related metabolites. First, chemical structures of metabolites were used to calculate similarities of metabolites. The similarities of diseases were obtained based on their functional gene network and semantic associations. Therefore, both metabolites and diseases network could be built. Next, Graph Convolutional Network (GCN) was applied to encode the features of metabolites and diseases, respectively. Then, the dimension of these features was reduced by Principal components analysis (PCA) with retainment 99% information. Finally, Deep neural network was built for identifying true metabolite-disease pairs (MDPs) based on these features. The 10-cross validations on three testing setups showed outstanding AUC (0.952) and AUPR (0.939) of Deep-DRM compared with previous methods and similar approaches. Ten of top 15 predicted associations between diseases and metabolites got support by other studies, which suggests that Deep-DRM is an efficient method to identify MDPs. Contact: [email protected]. Availability and implementation: https://github.com/zty2009/GPDNN-for-Identify-ing-Disease-related-Metabolites.


2022 ◽  
Author(s):  
Kohulan Rajan ◽  
Christoph Steinbeck ◽  
Achim Zielesny

The use of molecular string representations for deep learning in chemistry has been steadily increasing in recent years. The complexity of existing string representations, and the difficulty in creating meaningful...


Sign in / Sign up

Export Citation Format

Share Document