Classifying natural products from plants, fungi or bacteria using the COCONUT database and machine learning

2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Alice Capecchi ◽  
Jean-Louis Reymond

Abstract: Natural products (NPs) represent one of the most important resources for discovering new drugs. Here we asked whether NP origin can be assigned from their molecular structure in a subset of 60,171 NPs in the recently reported Collection of Open Natural Products (COCONUT) database assigned to plants, fungi, or bacteria. Visualizing this subset in an interactive tree-map (TMAP) calculated using MAP4 (MinHashed atom pair fingerprint) clustered NPs according to their assigned origin (https://tm.gdb.tools/map4/coconut_tmap/), and a support vector machine (SVM) trained with MAP4 correctly assigned the origin for 94% of plant, 89% of fungal, and 89% of bacterial NPs in this subset. An online tool based on an SVM trained with the entire subset correctly assigned the origin of further NPs with similar performance (https://np-svm-map4.gdb.tools/). Origin information might be useful when searching for biosynthetic genes of NPs isolated from plants but produced by endophytic microorganisms.
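The abstract above relies on a MinHashed fingerprint. As a rough illustration of the MinHash idea only (this is not the MAP4 implementation), the sketch below estimates the Jaccard similarity of two sets of strings from short signatures. The shingle strings, the helper names, and the use of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


def shingle_hash(shingle, seed):
    """Deterministic 64-bit hash of a string; the seed selects the hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles, num_hashes=256):
    """For each seeded hash function, keep the minimum hash over the set."""
    if not shingles:
        raise ValueError("cannot build a signature from an empty set")
    return [
        min(shingle_hash(s, seed) for s in shingles)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up strings standing in for atom-pair shingles of two molecules.
molecule_a = {"frag%d" % i for i in range(0, 40)}
molecule_b = {"frag%d" % i for i in range(20, 60)}

sig_a = minhash_signature(molecule_a)
sig_b = minhash_signature(molecule_b)
print("exact:", exact_jaccard(molecule_a, molecule_b))
print("estimated:", estimated_jaccard(sig_a, sig_b))
```

The estimate approaches the exact value as `num_hashes` grows; fixed-length signatures of this kind can then serve as input features or as the basis for a similarity kernel in a classifier.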


2020 ◽  
Author(s):  
Alice Capecchi ◽  
Jean-Louis Reymond

Microbial natural products (NPs) are an important source of drugs. However, their structural diversity remains poorly understood. Here we used our recently reported MinHashed Atom Pair fingerprint with diameter of four bonds (MAP4), a fingerprint suitable for molecules across very different sizes, to analyze the Natural Products Atlas (NPAtlas), a database of 25,523 NPs of bacterial or fungal origin downloaded from https://www.npatlas.org/joomla/. To visualize NPAtlas by MAP4 similarity, we used the dimensionality reduction method tree map (TMAP) (http://tmap.gdb.tools). The resulting interactive map (https://tm.gdb.tools/map4/npatlas_map_tmap/) organizes molecules by physico-chemical properties and compound families such as peptides, glycosides, polyphenols or terpenoids. Remarkably, the map separates bacterial and fungal NPs from one another, revealing that these two compound families are intrinsically different despite their related biosynthetic pathways. We used these differences to train a machine learning model capable of distinguishing between NPs of bacterial or fungal origin.


Biomolecules ◽  
2020 ◽  
Vol 10 (10) ◽  
pp. 1385
Author(s):  
Alice Capecchi ◽  
Jean-Louis Reymond

Microbial natural products (NPs) are an important source of drugs; however, their structural diversity remains poorly understood. Here we used our recently reported MinHashed Atom Pair fingerprint with diameter of four bonds (MAP4), a fingerprint suitable for molecules across very different sizes, to analyze the Natural Products Atlas (NPAtlas), a database of 25,523 NPs of bacterial or fungal origin. To visualize NPAtlas by MAP4 similarity, we used the dimensionality reduction method tree map (TMAP). The resulting interactive map organizes molecules by physico-chemical properties and compound families such as peptides and glycosides. Remarkably, the map separates bacterial and fungal NPs from one another, revealing that these two compound families are intrinsically different despite their related biosynthetic pathways. We used these differences to train a machine learning model capable of distinguishing between NPs of bacterial or fungal origin.
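TMAP is generally described as laying out a minimum spanning tree built from a nearest-neighbor graph of the fingerprints. The sketch below illustrates only the spanning-tree step, and does so naively: it runs Kruskal's algorithm over all pairwise Jaccard distances of a few toy feature sets. The item names and feature sets are invented, and the nearest-neighbor search and the layout step are omitted entirely.

```python
from itertools import combinations


def jaccard_distance(set_a, set_b):
    """1 minus the Jaccard similarity of two feature sets."""
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def minimum_spanning_tree(items):
    """Kruskal's algorithm over all pairwise distances.

    items maps a name to a set of features.
    Returns the tree as a list of (distance, name_u, name_v) edges.
    """
    parent = {name: name for name in items}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    edges = sorted(
        (jaccard_distance(items[u], items[v]), u, v)
        for u, v in combinations(sorted(items), 2)
    )
    tree = []
    for distance, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((distance, u, v))
    return tree


# Invented feature sets: two "bacterial-like" and two "fungal-like" items.
toy_items = {
    "bact_1": {"a", "b", "c", "d"},
    "bact_2": {"a", "b", "c", "e"},
    "fung_1": {"x", "y", "z", "d"},
    "fung_2": {"x", "y", "z", "w"},
}

tree = minimum_spanning_tree(toy_items)
for distance, u, v in tree:
    print(u, v, round(distance, 3))
```

In this toy case the two similar pairs are joined first, and a single longer edge connects the two groups, which is the kind of branch structure a tree-based map makes visible.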


2021 ◽  
Vol 41 ◽  
pp. 02004
Author(s):  
Wisnu Ananta Kusuma

Introduction: Bioinformatics is a multidisciplinary field that applies computer science approaches, such as algorithms and machine learning, to problems in biology, biochemistry, and other domains involving molecular biology data. These approaches can also be used to screen natural products for specific properties. Jamu, or Indonesian herbal medicine, works on a multi-component, multi-target principle: multiple components (compounds) act together on multiple targets (proteins) through complex interactions among the components of the system. This mechanism is also known as network pharmacology. In this study, we introduce a workflow for screening herbal compounds based on network pharmacology and machine learning. Methods: The workflow starts by screening for proteins that play an important role in a given disease. The screening was conducted by applying clustering and using topological features of the network, which was represented as a graph [1]. We then performed enrichment analysis by integrating the protein-protein interaction network with the Gene Ontology (GO) network, covering biological processes, molecular functions, and cellular components, into a k-partite graph and analyzing it with a soft clustering method [2]. From the results of this enrichment analysis, we determined which proteins are relevant to the disease and play an important role in it [3]. Next, for these screened proteins, we built predictive models of compound-protein interactions from drug data collected from the DrugBank and SuperTarget databases and trained the models using machine learning or deep learning methods [4]. The models were then applied to Indonesian herbal compounds from the HerbalDB database (http://herbaldb.farmasi.ui.ac.id/v3/) and IJAH Analytics.
Results: To demonstrate the effectiveness of the workflow, we applied it to several conditions, such as hyperinflammation in Covid-19 and obesity. We found several potential plants, such as Andrographis paniculata (Sambiloto) for reducing the inflammatory effect in Covid-19 and Murraya paniculata (Kemuning) for activating brown adipose tissue (BAT), which has the potential to treat obesity. All of these findings require confirmation through in vitro, in vivo, and clinical trials. We have also implemented several steps of the workflow in the IJAH Analytics application. IJAH can find herbal compounds or plant formulas for a specific disease or protein target and, conversely, look up the efficacy of combinations of plants or herbal compounds. In addition, IJAH Analytics can visualize plant-compound-protein-disease pharmacological networks. IJAH is freely available to the public at https://ijah.apps.cs.ipb.ac.id. Conclusion: This study shows the potential of bioinformatics approaches based on network pharmacology and machine learning for discovering useful natural products from Indonesia’s biodiversity. In addition, IJAH Analytics, although still being refined, can serve as an alternative application that supports researchers in screening potential Indonesian natural products.
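The Methods text mentions screening proteins by network topological features without naming them. As one possible illustration, and not necessarily the feature used in the study, the sketch below ranks proteins in a small interaction network by degree centrality. The protein identifiers, the edge list, and the function names are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree of each node divided by (number of nodes - 1).

    edges is an iterable of undirected (protein_a, protein_b) pairs.
    Self-loops are ignored and duplicate edges are counted once.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    node_count = len(neighbors)
    if node_count < 2:
        return {node: 0.0 for node in neighbors}
    return {
        node: len(adjacent) / (node_count - 1)
        for node, adjacent in neighbors.items()
    }


def top_proteins(edges, k):
    """The k most central proteins; ties are broken alphabetically."""
    scores = degree_centrality(edges)
    return sorted(scores, key=lambda protein: (-scores[protein], protein))[:k]


# Hypothetical protein-protein interaction edges.
toy_edges = [
    ("P1", "P2"),
    ("P1", "P3"),
    ("P1", "P4"),
    ("P2", "P3"),
    ("P4", "P5"),
]

scores = degree_centrality(toy_edges)
print(top_proteins(toy_edges, 2))
```

Other topological features, such as betweenness or closeness centrality, could be substituted for the scoring function without changing the ranking step.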


2018 ◽  
Vol 18 (12) ◽  
pp. 987-997 ◽  
Author(s):  
Li Zhang ◽  
Hui Zhang ◽  
Haixin Ai ◽  
Huan Hu ◽  
Shimeng Li ◽  
...  

Toxicity evaluation is an important part of the preclinical safety assessment of new drugs and is directly related to human health and the fate of drugs. It is therefore important to study how to evaluate drug toxicity accurately and economically. Traditional in vitro and in vivo toxicity tests are laborious, time-consuming, and expensive, and they raise animal welfare issues. Computational methods developed for drug toxicity prediction can compensate for the shortcomings of traditional methods and are considered useful in the early stages of drug development. Numerous drug toxicity prediction models have been developed using a variety of computational methods. With advances in machine learning theory and molecular representation, more and more of these models are built with machine learning methods such as support vector machine, random forest, naive Bayesian, and back propagation neural network. Significant advances have been made for many toxicity endpoints, such as carcinogenicity, mutagenicity, and hepatotoxicity. In this review, we aim to provide a comprehensive overview of the machine learning based drug toxicity prediction studies conducted in recent years. In addition, we compare the performance of the models proposed in these studies in terms of accuracy, sensitivity, and specificity, providing a view of the current state of the art in this field and highlighting the issues in current studies.
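The three metrics named in the review follow directly from the confusion matrix of a binary classifier. The sketch below computes them from lists of true and predicted labels; the labels are invented, and 1 is assumed to mean "toxic" and 0 "non-toxic".

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_ratio(numerator, denominator):
    """Division that yields NaN instead of raising when a class is absent."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
    }


# Invented labels: 4 toxic and 6 non-toxic compounds.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, because accuracy alone can look high for a model that rarely predicts the minority class.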


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ebraheem Alzahrani ◽  
Wajdi Alghamdi ◽  
Malik Zaka Ullah ◽  
Yaser Daanial Khan

Abstract: Proteins are a vital component of cells and perform physiological functions that keep the body operating smoothly. Identifying a protein's function requires a detailed understanding of its structure. Stress proteins are essential mediators of several responses to cellular stress and are categorized based on their structural characteristics. These proteins are found to be conserved across many eukaryotic and prokaryotic linkages and demonstrate varied crucial functional activities inside a cell. The in vivo, ex vivo, and in vitro identification of stress proteins is a time-consuming and costly task. This study aims to identify stress protein sequences with the aid of mathematical modelling and machine learning methods to supplement the aforementioned wet lab methods. The model developed using Random Forest achieved 91.1% accuracy, while models based on neural network and support vector machine achieved 87.7% and 47.0% accuracy, respectively. Based on the evaluation results, it was concluded that the random-forest based classifier surpassed all other predictors and is suitable for practical use in the identification of stress proteins. A live web server is available at http://biopred.org/stressprotiens, and the webserver code is available at https://github.com/abdullah5naveed/SRP_WebServer.git


Author(s):  
Alexander M. Kloosterman ◽  
Peter Cimermancic ◽  
Somayah S. Elsayed ◽  
Chao Du ◽  
Michalis Hadjithomas ◽  
...  

Abstract: Most clinical drugs are based on microbial natural products, with compound classes including polyketides (PKS), non-ribosomal peptides (NRPS), fluoroquinones and ribosomally synthesized and post-translationally modified peptides (RiPPs). While variants of biosynthetic gene clusters (BGCs) for known classes of natural products are easy to identify in genome sequences, BGCs for new compound classes escape attention. In particular, evidence is accumulating that for RiPPs, subclasses known thus far may only represent the tip of an iceberg. Here, we present decRiPPter (Data-driven Exploratory Class-independent RiPP TrackER), a RiPP genome mining algorithm aimed at the discovery of novel RiPP classes. DecRiPPter combines a Support Vector Machine (SVM) that identifies candidate RiPP precursors with pan-genomic analyses to identify which of these are encoded within operon-like structures that are part of the accessory genome of a genus. Subsequently, it prioritizes such regions based on the presence of new enzymology and based on patterns of gene cluster and precursor peptide conservation across species. We then applied decRiPPter to mine 1,295 Streptomyces genomes, which led to the identification of 42 new candidate RiPP families that could not be found by existing programs. One of these was studied further and elucidated as a novel subfamily of lanthipeptides, designated Class V. Two previously unidentified modifying enzymes are proposed to create the hallmark lanthionine bridges. Taken together, our work highlights how novel natural product families can be discovered by methods going beyond sequence similarity searches to integrate multiple pathway discovery criteria.
Code and data availability: The source code of DecRiPPter is freely available online at https://github.com/Alexamk/decRiPPter. Results of the data analysis are available online at http://www.bioinformatics.nl/~medem005/decRiPPter_strict/index.html and http://www.bioinformatics.nl/~medem005/decRiPPter_mild/index.html (for the strict and mild filters, respectively). All training data and code used to generate these, as well as outputs of the data analyses, are available on Zenodo at doi:10.5281/zenodo.3834818.


2021 ◽  
Author(s):  
Shuyun He ◽  
Duancheng Zhao ◽  
Yanle Ling ◽  
Hanxuan Cai ◽  
Yike Cai ◽  
...  

Summary: Breast cancer (BC) has surpassed lung cancer as the most frequently occurring cancer, and it is the leading cause of cancer-related death in women. Therefore, there is an urgent need to discover or design new drug candidates for BC treatment. In this study, we first collected a series of structurally diverse datasets consisting of 33,757 active and 21,152 inactive compounds for 13 breast cancer cell lines and one normal breast cell line commonly used in in vitro antiproliferative assays. Predictive models were then developed using five conventional machine learning algorithms, including naïve Bayesian, support vector machine, k-Nearest Neighbors, random forest, and extreme gradient boosting, as well as five deep learning algorithms, including deep neural networks, graph convolutional networks, graph attention network, message passing neural networks, and Attentive FP. A total of 476 single models and 112 fusion models were constructed based on three types of molecular representations including molecular descriptors, fingerprints, and graphs. The evaluation results demonstrate that the best model for each BC cell subtype can achieve high predictive accuracy for the test sets with AUC values of 0.689–0.993. Moreover, important structural fragments related to BC cell inhibition were identified and interpreted. To facilitate the use of the model, an online webserver called ChemBC and its local version software were developed to predict potential anti-BC agents. Availability: ChemBC webserver is available at http://chembc.idruglab.cn/ and its local version Python software is maintained at a GitHub repository (https://github.com/idruglab/ChemBC). Supplementary information: Supplementary data are available at Bioinformatics online.
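The AUC values quoted above summarize how well a model ranks active compounds above inactive ones. A minimal sketch of that definition follows: the AUC is computed as the fraction of (active, inactive) pairs in which the active compound receives the higher score, with ties counted as half. The labels and scores are invented, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for active, 0 for inactive.
    scores: predicted score per compound, higher meaning more likely active.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one active and one inactive example")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Invented predictions for three active and three inactive compounds.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported range of 0.689–0.993 in context.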


2021 ◽  
Vol 12 ◽  
Author(s):  
Qingqing Li ◽  
Wenhui Xie ◽  
Liping Li ◽  
Lijing Wang ◽  
Qinyi You ◽  
...  

Background: Arterial stiffness assessed by pulse wave velocity is a major risk factor for cardiovascular diseases. The incidence of cardiovascular events remains high in diabetics. However, a machine learning based clinical prediction model for elevated arterial stiffness, which would identify subjects at higher risk, remains to be developed. Methods: Least absolute shrinkage and selection operator and support vector machine-recursive feature elimination were used for feature selection. Four machine learning algorithms were used to construct a prediction model, and their performance was compared based on the area under the receiver operating characteristic curve in a discovery dataset (n = 760). The model with the best performance was selected and validated in an independent dataset (n = 912) from the Dryad Digital Repository (https://doi.org/10.5061/dryad.m484p). To apply our model to clinical practice, we built a free and user-friendly online web tool. Results: The predictive model includes the predictors age, systolic blood pressure, diastolic blood pressure, and body mass index. In the discovery cohort, the gradient boosting-based model outperformed the other methods in predicting elevated arterial stiffness. In the validation cohort, the gradient boosting model showed good discrimination capacity. A cutoff value of 0.46 for the elevated arterial stiffness risk score in the gradient boosting model gave a good trade-off between specificity (0.813 in the discovery data and 0.761 in the validation data) and sensitivity (0.875 and 0.738, respectively). Conclusion: The gradient boosting-based prediction system classifies elevated arterial stiffness well. The online web tool makes our gradient boosting-based model easily accessible for further clinical studies and use.

