Integrating structure-based machine learning and co-evolution to investigate specificity in plant sesquiterpene synthases

2020 ◽  
Author(s):  
Janani Durairaj ◽  
Elena Melillo ◽  
Harro J Bouwmeester ◽  
Jules Beekwilder ◽  
Dick de Ridder ◽  
...  

Sesquiterpene synthases (STSs) catalyze the formation of a large class of plant volatiles called sesquiterpenes. While thousands of putative STS sequences from diverse plant species are available, only a small number of them have been functionally characterized. Sequence identity-based screening for desired enzymes, often used in biotechnological applications, is difficult to apply here as STS sequence similarity is strongly affected by species. This calls for more sophisticated computational methods for functionality prediction. We investigate the specificity of precursor cation formation in these elusive enzymes. By inspecting multi-product STSs, we demonstrate that STSs have a strong selectivity towards one precursor cation. We use a machine learning approach combining sequence and structure information to accurately predict precursor cation specificity for STSs across all plant species. We combine this with a co-evolutionary analysis on the wealth of uncharacterized putative STS sequences, to pinpoint residues and distant functional contacts influencing cation formation and reaction pathway selection. These structural factors can be used to predict and engineer enzymes with specific functions, as we demonstrate by predicting and characterizing two novel STSs from Citrus bergamia.

Author summary: Predicting enzyme function is a popular problem in the bioinformatics field that grows more pressing with the increase in protein sequences, and more attainable with the increase in experimentally characterized enzymes. Terpenes and terpenoids form the largest classes of natural products and find use in many drugs, flavouring agents, and perfumes. Terpene synthases catalyze the biosynthesis of terpenes via multiple cyclizations and carbocation rearrangements, generating a vast array of product skeletons. 
In this work, we present a three-pronged computational approach to predict carbocation specificity in sesquiterpene synthases, a subset of terpene synthases with one of the highest diversities of products. Using homology modelling, machine learning and co-evolutionary analysis, our approach combines sparse structural data, large amounts of uncharacterized sequence data, and the current set of experimentally characterized enzymes to provide insight into residues and structural regions that likely play a role in determining product specificity. Similar techniques can be repurposed for function prediction and enzyme engineering in many other classes of enzymes.

2021 ◽  
Vol 17 (3) ◽  
pp. e1008197
Author(s):  
Janani Durairaj ◽  
Elena Melillo ◽  
Harro J. Bouwmeester ◽  
Jules Beekwilder ◽  
Dick de Ridder ◽  
...  

Sesquiterpene synthases (STSs) catalyze the formation of a large class of plant volatiles called sesquiterpenes. While thousands of putative STS sequences from diverse plant species are available, only a small number of them have been functionally characterized. Sequence identity-based screening for desired enzymes, often used in biotechnological applications, is difficult to apply here as STS sequence similarity is strongly affected by species. This calls for more sophisticated computational methods for functionality prediction. We investigate the specificity of precursor cation formation in these elusive enzymes. By inspecting multi-product STSs, we demonstrate that STSs have a strong selectivity towards one precursor cation. We use a machine learning approach combining sequence and structure information to accurately predict precursor cation specificity for STSs across all plant species. We combine this with a co-evolutionary analysis on the wealth of uncharacterized putative STS sequences, to pinpoint residues and distant functional contacts influencing cation formation and reaction pathway selection. These structural factors can be used to predict and engineer enzymes with specific functions, as we demonstrate by predicting and characterizing two novel STSs from Citrus bergamia.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Dimitri Boeckaerts ◽  
Michiel Stock ◽  
Bjorn Criel ◽  
Hans Gerstmans ◽  
Bernard De Baets ◽  
...  

Nowadays, bacteriophages are increasingly considered as an alternative treatment for a variety of bacterial infections in cases where classical antibiotics have become ineffective. However, characterizing the host specificity of phages remains a labor- and time-intensive process. In order to alleviate this burden, we have developed a new machine-learning-based pipeline to predict bacteriophage hosts based on annotated receptor-binding protein (RBP) sequence data. We focus on predicting bacterial hosts from the ESKAPE group, Escherichia coli, Salmonella enterica and Clostridium difficile. We compare the performance of our predictive model with that of the widely used Basic Local Alignment Search Tool (BLAST). Our best-performing predictive model reaches Precision-Recall Area Under the Curve (PR-AUC) scores between 73.6 and 93.8% for different levels of sequence similarity in the collected data. Our model reaches a performance comparable to that of BLASTp when sequence similarity in the data is high and starts outperforming BLASTp when sequence similarity drops below 75%. Therefore, our machine learning methods can be especially useful in settings in which sequence similarity to other known sequences is low. Predicting the hosts of novel metagenomic RBP sequences could extend our toolbox to tune the host spectrum of phages or phage tail-like bacteriocins by swapping RBPs.


Author(s):  
Tapan Kumar Mohanta ◽  
Yugal Kishore Mohanta ◽  
Ahmed Al-Harrasi

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) epidemic is a global pandemic. The first genome sequence data of SARS-CoV-2 (COVID-19) led to the conclusion that it has a bat reservoir and that a bat was the immediate donor. Andersen et al. (2020) reported that laboratory manipulation of SARS CoV is improbable [1]. However, Lau et al. (2010) had already reported the generation of recombinant bat SARS CoV, describing three recombinant genotypes; hence, laboratory-based manipulation had already been accomplished long before [2]. A deep comparative study of bat SARS CoV with other SARS CoVs (including the human SARS CoV German isolate) revealed that human SARS CoV-2 genomes (isolates from China, India, Italy, Nepal, and the United States of America) had a sequence similarity of only 79-80% with bat SARS CoV and approximately 60% with human SARS CoV (German isolate). Given this large genomic dissimilarity, bat SARS CoV cannot be considered an immediate donor to human SARS CoV-2. However, the SARS CoV-2 isolates from China, India, Italy, Nepal, and the USA shared 99-100% genomic sequence similarity with one another. This suggests that human SARS CoV-2 did not undergo heavy mutation to generate an immediate new genotype. If SARS CoV-2 had infected humans through a bat SARS CoV from the Wuhan meat market, the sequence similarity should have been more than 99%, which was not found in the study. Phylogenetic analysis revealed that bat SARS CoV did not group with the SARS CoV-2 isolates from China, India, Italy, Nepal, and the USA. This suggests that bat SARS CoV is genomically and evolutionarily dissimilar and cannot be considered the immediate and direct donor of human SARS CoV-2. Natural selection of the bat genome before transfer to the zoonotic organism is a time-consuming process, and natural selection in humans after zoonotic transfer is also time-consuming. Therefore, the concept put forward by Andersen et al. (2020) [1] regarding its transfer from a bat at the Wuhan meat market is irrefutably incorrect. Sequence alignment revealed the presence of inserted codons in human SARS CoV-2, and synteny analysis corroborated the presence of extra nucleotides/codons in human SARS CoV-2. Relative time tree analysis revealed its origin before 0.00 million years ago, suggesting a recent synthetic/modified origin.


2014 ◽  
Vol 112 (3) ◽  
pp. 857-862 ◽  
Author(s):  
Yuuki Yamada ◽  
Tomohisa Kuzuyama ◽  
Mamoru Komatsu ◽  
Kazuo Shin-ya ◽  
Satoshi Omura ◽  
...  

Odoriferous terpene metabolites of bacterial origin have been known for many years. In genome-sequenced Streptomycetaceae microorganisms, the vast majority produces the degraded sesquiterpene alcohol geosmin. Two minor groups of bacteria do not produce geosmin, with one of these groups instead producing other sesquiterpene alcohols, whereas members of the remaining group do not produce any detectable terpenoid metabolites. Because bacterial terpene synthases typically show no significant overall sequence similarity to any other known fungal or plant terpene synthases and usually exhibit relatively low levels of mutual sequence similarity with other bacterial synthases, simple correlation of protein sequence data with the structure of the cyclized terpene product has been precluded. We have previously described a powerful search method based on the use of hidden Markov models (HMMs) and protein families database (Pfam) search that has allowed the discovery of monoterpene synthases of bacterial origin. Using an enhanced set of HMM parameters generated using a training set of 140 previously identified bacterial terpene synthase sequences, a Pfam search of 8,759,463 predicted bacterial proteins from public databases and in-house draft genome data has now revealed 262 presumptive terpene synthases. The biochemical function of a considerable number of these presumptive terpene synthase genes could be determined by expression in a specially engineered heterologous Streptomyces host and spectroscopic identification of the resulting terpene products. In addition to a wide variety of terpenes that had been previously reported from fungal or plant sources, we have isolated and determined the complete structures of 13 previously unidentified cyclic sesquiterpenes and diterpenes.


2021 ◽  
Author(s):  
Eric W. Bell ◽  
Jacob H. Schwartz ◽  
Peter L. Freddolino ◽  
Yang Zhang

Proteome-wide identification of protein-protein interactions is a formidable task which has yet to be sufficiently addressed by experimental methodologies. Many computational methods have been developed to predict proteome-wide interaction networks, but few leverage both the sensitivity of structural information and the wide availability of sequence data. We present PEPPI, a pipeline which integrates structural similarity, sequence similarity, functional association data, and machine learning-based classification through a naïve Bayesian classifier model to accurately predict protein-protein interactions at a proteomic scale. Through benchmarking against a set of 798 ground truth interactions and an equal number of noninteractions, we have found that PEPPI attains 4.5% higher AUROC than the best of other state-of-the-art methods. As a proteomic-scale application, PEPPI was applied to model the interactions which occur between SARS-CoV-2 and human host cells during coronavirus infection, where 403 high-confidence interactions were identified with predictions covering 73% of a gold standard dataset from PSICQUIC and demonstrating significant complementarity with the most recent high-throughput experiments. PEPPI is available both as a webserver and in a standalone version and should be a powerful and generally applicable tool for computational screening of protein-protein interactions.


2020 ◽  
Vol 24 (6) ◽  
pp. 1311-1328
Author(s):  
Jozsef Suto

Hundreds of thousands of plant species are known on Earth, and many remain undiscovered. Plant classification can be performed in different ways, but the most popular approach is based on leaf characteristics. Most types of plants have unique leaf characteristics such as shape, color, and texture. Since machine learning and vision have developed considerably in the past decade, automatic plant species (or leaf) recognition has become possible. Automated leaf classification is now a standalone research area within machine learning, and several shallow and deep methods have been proposed to recognize leaf types. From 2007 to the present, several research papers have been published on this topic. In older studies the classifier was a shallow method, while in current works many researchers apply deep networks for classification. While reviewing the plant leaf classification literature, we found an interesting deficiency (lack of hyper-parameter search) and a key difference between studies (different test sets). This work gives an overall review of the efficiency of shallow and deep methods under different test conditions. It can serve as a basis for further research.


2019 ◽  
Vol 476 (5) ◽  
pp. 809-826
Author(s):  
Karthik V. Rajasekar ◽  
Shuangxi Ji ◽  
Rachel J. Coulthard ◽  
Jon P. Ride ◽  
Gillian L. Reynolds ◽  
...  

SPH (self-incompatibility protein homologue) proteins are a large family of small, disulfide-bonded, secreted proteins, initially found in the self-incompatibility response in the field poppy (Papaver rhoeas), but now known to be widely distributed in plants, many containing multiple members of this protein family. Using the Origami strain of Escherichia coli, we expressed one member of this family, SPH15 from Arabidopsis thaliana, as a folded thioredoxin fusion protein and purified it from the cytosol. The fusion protein was cleaved and characterised by analytical ultracentrifugation, circular dichroism and nuclear magnetic resonance (NMR) spectroscopy. This showed that SPH15 is monomeric and temperature stable, with a β-sandwich structure. The four strands in each sheet have the same topology as the unrelated proteins: human transthyretin, bacterial TssJ and pneumolysin, with no discernible sequence similarity. The NMR-derived structure was compared with a de novo model, made using a new deep learning algorithm based on co-evolution/correlated mutations, DeepCDPred, validating the method. The DeepCDPred de novo method and homology modelling to SPH15 were then both used to derive models of the 3D structure of the three known PrsS proteins from P. rhoeas, which have only 15–18% sequence homology to SPH15. The DeepCDPred method gave models with lower discrete optimised protein energy scores than the homology models. Three loops at one end of the poppy structures are postulated to interact with their respective pollen receptors to instigate programmed cell death in pollen tubes.


2021 ◽  
Author(s):  
Luc Blassel ◽  
Anna Tostevin ◽  
Christian Julian Villabona-Arenas ◽  
Martine Peeters ◽  
Stephane Hue ◽  
...  

Drug resistance mutations (DRMs) appear in HIV under treatment pressure. DRMs are commonly transmitted to naive patients. The standard approach to reveal new DRMs is to test for significant frequency differences of mutations between treated and naive patients. However, we then consider each mutation individually and cannot hope to study interactions between several mutations. Here, we aim to leverage the ever-growing quantity of high-quality sequence data and machine learning methods to study such interactions (i.e. epistasis), as well as try to find new DRMs. We trained classifiers to discriminate between Reverse Transcriptase Inhibitor (RTI)-experienced and RTI-naive samples on a large HIV-1 reverse transcriptase (RT) sequence dataset from the UK (n ≈ 55,000), using all observed mutations as binary representation features. To assess the robustness of our findings, our classifiers were evaluated on independent data sets, both from the UK and Africa. Important representation features for each classifier were then extracted as potential DRMs. To find novel DRMs, we repeated this process by removing either features or samples associated with known DRMs. When keeping all known resistance signal, we detected sufficiently prevalent known DRMs, thus validating the approach. When removing features corresponding to known DRMs, our classifiers retained some prediction accuracy, and six new mutations significantly associated with resistance were identified. These six mutations have a low genetic barrier, are correlated to known DRMs, and are spatially close to either the RT active site or the regulatory binding pocket. When removing both known DRM features and sequences containing at least one known DRM, our classifiers lose all prediction accuracy. These results likely indicate that all mutations directly conferring resistance have been found, and that our newly discovered DRMs are accessory or compensatory mutations. 
Moreover, we did not find any significant signal of epistasis, beyond the standard resistance scheme associating major DRMs to auxiliary mutations.
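
The "standard approach" mentioned in this abstract, testing one mutation at a time for a frequency difference between treated and naive patients, can be illustrated with a small sketch. This is not the authors' code: the mutation label `mut1`, the toy counts, and both function names are hypothetical, and a two-sided Fisher exact test is assumed here as one common choice of significance test.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def contingency(samples, mutation):
    """Count (treated with, treated without, naive with, naive without)."""
    a = sum(1 for muts, treated in samples if treated and mutation in muts)
    b = sum(1 for muts, treated in samples if treated and mutation not in muts)
    c = sum(1 for muts, treated in samples if not treated and mutation in muts)
    d = sum(1 for muts, treated in samples if not treated and mutation not in muts)
    return a, b, c, d


# Toy data (hypothetical): each sample is (set of observed mutations, treated?).
samples = (
    [({"mut1"}, True)] * 8 + [(set(), True)] * 2
    + [({"mut1"}, False)] * 1 + [(set(), False)] * 9
)
table = contingency(samples, "mut1")
p_value = fisher_exact_two_sided(*table)
```

Because each mutation gets its own table, a test of this kind cannot capture interactions between mutations, which is the limitation the abstract raises.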


2021 ◽  
Author(s):  
Tuomo Hartonen ◽  
Teemu Kivioja ◽  
Jussi Taipale

Deep learning models have in recent years gained success in various tasks related to understanding information coded in the DNA sequence. Rapidly developing genome-wide measurement technologies provide large quantities of data ideally suited for modeling using deep learning or other powerful machine learning approaches. Although offering state-of-the-art predictive performance, the predictions made by deep learning models can be difficult to understand. In virtually all biological research, the understanding of how a predictive model works is as important as the raw predictive performance. Thus interpretation of deep learning models is an emerging hot topic, especially in the context of biological research. Here we describe plotMI, a mutual information based model interpretation strategy that can intuitively visualize positional preferences and pairwise interactions learned by any machine learning model trained on sequence data with a defined alphabet as input. PlotMI is freely available at https://github.com/hartonen/plotMI.
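
As a rough illustration of the quantity behind a mutual information based interpretation, the sketch below computes the mutual information between the letters at two positions of a set of equal-length sequences. It is not taken from plotMI; the function name `pairwise_mi` and the toy sequences are assumptions made for this example only.

```python
import math
from collections import Counter


def pairwise_mi(seqs, i, j):
    """Mutual information (in bits) between the letters at positions i and j."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
    return mi


# Toy sequences (hypothetical): positions 0 and 1 always change together,
# while position 3 varies independently of position 0.
seqs = ["ACGT", "ACGA", "TGGT", "TGGA"]
mi_coupled = pairwise_mi(seqs, 0, 1)
mi_independent = pairwise_mi(seqs, 0, 3)
```

In this toy set the coupled pair carries one bit of mutual information and the independent pair carries none, which is the kind of contrast a pairwise interaction plot makes visible.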


2021 ◽  
Author(s):  
Richard G Dorrell ◽  
Alan Kuo ◽  
Zoltan Fussy ◽  
Elisabeth H Richardson ◽  
Asaf Salamov ◽  
...  

The Arctic Ocean is being impacted by warming temperatures, increasing freshwater and highly variable ice conditions. The microalgal communities underpinning Arctic marine food webs, once thought to be dominated by diatoms, include a phylogenetically diverse range of small algal species, whose biology remains poorly understood. Here, we present genome sequences of a cryptomonad, a haptophyte, a chrysophyte, and a pelagophyte, isolated from the Arctic water column and ice. Comparing protein family distributions and sequence similarity across a densely-sampled set of algal genomes and transcriptomes, we note striking convergences in the biology of distantly related small Arctic algae, compared to non-Arctic relatives; although this convergence is largely exclusive of Arctic diatoms. Using high-throughput phylogenetic approaches, incorporating environmental sequence data from Tara Oceans, we demonstrate that this convergence was partly explained by horizontal gene transfers (HGT) between Arctic species, in at least 30 other discrete gene families and, most notably, in ice-binding domains (IBD). These Arctic-specific genes have been repeatedly transferred between Arctic algae, and are independent of equivalent HGTs in the Antarctic Southern Ocean. Our data provide insights into the specialised Arctic marine microbiome, and underline the role of geographically-limited HGT as a driver of environmental adaptation in eukaryotic algae.

