Identification of Enzymatic Active Sites with Unsupervised Language Modeling

2021 ◽  
Author(s):  
Loïc Kwate Dassi ◽  
Matteo Manica ◽  
Daniel Probst ◽  
Philippe Schwaller ◽  
Yves Gaetan Nana Teukam ◽  
...  

The first decade of genome sequencing saw a surge in the characterization of proteins with unknown functionality. Even so, more than 20% of proteins in well-studied model animals have yet to be identified, making the discovery of their active sites one of biology's greatest puzzles. Herein, we apply a Transformer architecture to a language representation of bio-catalyzed chemical reactions to learn the signal underlying the atomic interactions between substrate and active site. The language representation comprises a reaction simplified molecular-input line-entry system (SMILES) string for substrate and products, complemented with amino acid (AA) sequence information for the enzyme. We demonstrate that by creating a custom tokenizer and a score based on attention values, we can capture the substrate-active site interaction signal and use it to determine the active site position in unknown protein sequences, unraveling complicated 3D interactions using just 1D representations. This approach can recover, with no supervision, 31.51% of the active site when considering co-crystallized substrate-enzyme structures as ground truth, vastly outperforming approaches based on sequence similarities only. Our findings are further corroborated by docking simulations on the 3D structures of a few enzymes. This work confirms the impact of natural language processing, and more specifically of the Transformer architecture, on domain-specific languages, paving the way to effective solutions for protein functional characterization and bio-catalysis engineering.
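The idea of scoring residues from attention values can be pictured with a small sketch. This is not the authors' implementation: the function name `rank_residues_by_attention`, the token layout (SMILES tokens first, residue tokens after), the hand-made attention matrix, and the choice to sum attention over substrate tokens are all illustrative assumptions.

```python
import numpy as np

def rank_residues_by_attention(attention, n_smiles_tokens, top_k):
    """Rank amino-acid positions by attention received from substrate tokens.

    attention: square (tokens x tokens) matrix; row i holds the attention
    that token i pays to every other token.
    Assumed layout (hypothetical): the first n_smiles_tokens tokens are
    reaction SMILES tokens, the remaining tokens are enzyme residues.
    """
    attention = np.asarray(attention, dtype=float)
    # Attention flowing from SMILES tokens (rows) to residue tokens (columns).
    smiles_to_aa = attention[:n_smiles_tokens, n_smiles_tokens:]
    # Assumed score: total attention each residue receives from the substrate.
    scores = smiles_to_aa.sum(axis=0)
    # Stable sort on negated scores, so ties keep their sequence order.
    order = np.argsort(-scores, kind="stable")[:top_k]
    return order.tolist(), scores

# Toy example: 2 SMILES tokens followed by 4 residues (6 tokens in total).
toy_attention = np.full((6, 6), 1.0 / 6.0)
toy_attention[0] = [0.10, 0.10, 0.05, 0.50, 0.05, 0.20]
toy_attention[1] = [0.10, 0.10, 0.10, 0.40, 0.10, 0.20]

top, scores = rank_residues_by_attention(toy_attention, n_smiles_tokens=2, top_k=2)
print(top)  # residue indices (0-based within the enzyme sequence)
```

In a real setting the matrix would come from a trained model and would typically be aggregated over heads and layers; how that aggregation is done is left open here.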


2018 ◽  
Author(s):  
Jayanthy Jyothikumar ◽  
Sushil Chandani ◽  
Tangirala Ramakrishna

Abstract: Alanine racemase, a popular drug target from Mycobacterium tuberculosis, catalyzes the biosynthesis of D-alanine, an essential component in bacterial cell walls. With the help of elastic network models of alanine racemase from Mycobacterium tuberculosis, we show that the mycobacterial enzyme fluctuates between two undiscovered states: a closed and an open state. A previous experimental screen identified several drug-like lead compounds against the mycobacterial alanine racemase, whose inhibitory mechanisms are not known. Docking simulations of the inhibitor leads onto the mycobacterial enzyme conformations obtained from the dynamics of the enzyme provide first clues to a putative regulatory role for two new pockets targeted by the leads. Further, our results implicate the movements of a short helix in the communication between the new pockets and the active site, indicating allosteric mechanisms for the inhibition. Based on our findings, we theorize that catalysis is feasible only in the open state. The putative regulatory pockets and the enzyme fluctuations are conserved across several alanine racemase homologs from diverse bacterial species, mostly pathogenic, pointing to a common regulatory mechanism important in drug discovery.

Author summary: In spite of the discovery of many inhibitors against the TB-causing pathogen Mycobacterium tuberculosis, only a very few have reached the market as effective TB drugs. Most of the marketed TB drugs induce toxic side effects in patients, as they non-specifically target human cells in addition to pathogens. One such TB drug, D-cycloserine, targets the pyridoxal phosphate moiety non-specifically, regardless of whether it is present in the pathogen or the human host enzymes. D-cycloserine was developed to inactivate alanine racemase in the TB-causing pathogen. Alanine racemase is a bacterial enzyme essential in cell wall synthesis. Serious side effects caused by TB drugs like D-cycloserine lead to patients' non-compliance with the treatment regimen, often causing fatal outcomes. Current drug discovery efforts focus on finding specific, non-toxic TB drugs. Through computational studies, we have identified new pockets on the mycobacterial alanine racemase and show that they can bind drug-like compounds. The location of these pockets away from the pyridoxal phosphate-containing active site makes them attractive target sites for novel, specific TB drugs. We demonstrate the presence of these pockets in alanine racemases from several pathogens and expect our findings to accelerate the discovery of non-toxic drugs against TB and other bacterial infections.


F1000Research ◽  
2014 ◽  
Vol 3 ◽  
pp. 217 ◽  
Author(s):  
Sandeep Chakraborty ◽  
Basuthkar J. Rao ◽  
Bjarni Asgeirsson ◽  
Ravindra Venkatramani ◽  
Abhaya M. Dandekar

The remarkable diversity in biological systems is rooted in the ability of the twenty naturally occurring amino acids to perform multifarious catalytic functions by creating unique structural scaffolds known as the active site. Finding such structural motifs within the protein structure is a key aspect of many computational methods. The algorithm for obtaining combinations of motifs of a certain length, although polynomial in complexity, runs in non-trivial computer time. Also, the search space expands considerably if stereochemically equivalent residues are allowed to replace an amino acid in the motif. In the present work, we propose a method (PREMONITION) to precompile all possible motifs comprising a set (n=4 in this case) of predefined amino acid residues from a protein structure that occur within a specified distance (R) of each other. PREMONITION rolls a sphere of radius R along the protein fold centered at the C atom of each residue, and all possible motifs are extracted within this sphere. The number of residues that can occur within a sphere centered around a residue is bounded by physical constraints, thus setting an upper limit on the processing times. After such a pre-compilation step, the computational time required for querying a protein structure with multiple motifs is considerably reduced. Previously, we had proposed a computational method to estimate the promiscuity of proteins with known active site residues and 3D structure using a database of known active sites in proteins (CSA) by querying each protein with the active site motif of every other residue. The runtime for such a comparison is reduced from days to hours using the PREMONITION methodology.
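The pre-compilation step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the PREMONITION code: the function name `precompile_motifs`, the input format (one representative coordinate per residue), and the toy coordinates are hypothetical, and stereochemically equivalent residue substitutions are not handled.

```python
import math
from itertools import combinations

def precompile_motifs(residues, radius, motif_size=4):
    """Enumerate residue combinations that fit inside a sphere of given radius.

    residues: list of (name, (x, y, z)) tuples, one representative atom per
    residue (hypothetical input format).
    Returns a set of index tuples; each tuple is one candidate motif.
    """
    motifs = set()
    for _, center in residues:
        # Residues whose atom lies inside the sphere centered on this residue
        # (the center residue itself is included, at distance 0).
        nearby = [
            i for i, (_, xyz) in enumerate(residues)
            if math.dist(center, xyz) <= radius
        ]
        # Every combination of motif_size residues found inside the sphere.
        for combo in combinations(nearby, motif_size):
            motifs.add(combo)
    return motifs

# Toy structure: four residues clustered together and one far away.
toy_residues = [
    ("SER", (0.0, 0.0, 0.0)),
    ("HIS", (1.0, 0.0, 0.0)),
    ("ASP", (0.0, 1.0, 0.0)),
    ("GLY", (1.0, 1.0, 0.0)),
    ("ALA", (10.0, 10.0, 10.0)),
]
motifs = precompile_motifs(toy_residues, radius=1.5, motif_size=4)
print(motifs)
```

Because the number of residues inside any one sphere is physically bounded, the inner `combinations` call stays small, which is the property the abstract relies on for its upper limit on processing time. Once the set is built, each motif query becomes a lookup instead of a fresh geometric search.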


Author(s):  
Zhuang Liu ◽  
Degen Huang ◽  
Kaiyu Huang ◽  
Zhuang Li ◽  
Jun Zhao

There is growing interest in the tasks of financial text mining. Over the past few years, Natural Language Processing (NLP) based on deep learning has advanced rapidly, and deep learning has shown promising results on financial text mining models. However, as NLP models require large amounts of labeled training data, applying deep learning to financial text mining is often unsuccessful due to the lack of labeled training data in financial fields. To address this issue, we present FinBERT (BERT for Financial Text Mining), a domain-specific language model pre-trained on large-scale financial corpora. In FinBERT, unlike in BERT, we construct six pre-training tasks covering more knowledge, trained simultaneously on general corpora and financial domain corpora, which enables the FinBERT model to better capture language knowledge and semantic information. The results show that our FinBERT outperforms all current state-of-the-art models. Extensive experimental results demonstrate the effectiveness and robustness of FinBERT. The source code and pre-trained models of FinBERT are available online.


2019 ◽  
Author(s):  
M. Alexander Ardagh ◽  
Manish Shetty ◽  
Anatoliy Kuznetsov ◽  
Qi Zhang ◽  
Phillip Christopher ◽  
...  

Catalytic enhancement of chemical reactions via heterogeneous materials occurs through stabilization of transition states at designed active sites, but dramatically greater rate acceleration on that same active site is achieved when the surface intermediates oscillate in binding energy. The applied oscillation amplitude and frequency can accelerate reactions orders of magnitude above the catalytic rates of static systems, provided the active site dynamics are tuned to the natural frequencies of the surface chemistry. In this work, differences in the characteristics of parallel reactions are exploited via selective application of active site dynamics (0 < ΔU < 1.0 eV amplitude, 10⁻⁶ < f < 10⁴ Hz frequency) to control the extent of competing reactions occurring on the shared catalytic surface. Simulation of multiple parallel reaction systems with a broad range of variation in chemical parameters revealed that parallel chemistries are highly tunable in selectivity toward either pure product, even when specific products are not selectively produced under static conditions. Two mechanisms leading to dynamic selectivity control were identified: (i) surface thermodynamic control of one product species under strong binding conditions, or (ii) catalytic resonance of the kinetics of one reaction over the other. These dynamic parallel pathway control strategies applied to a host of chemical conditions indicate significant potential for improving the catalytic performance of many important industrial chemical reactions beyond their existing static performance.


2019 ◽  
Vol 16 (6) ◽  
pp. 637-644
Author(s):  
Hongyu Cao ◽  
Yanhua Wu ◽  
Xingzhi Zhou ◽  
Xuefang Zheng ◽  
Ge Jiang

Background: N-myc downstream regulated gene 3 (NDRG3) is a newly discovered oxygen-regulated protein that binds L-lactate in hypoxia and further activates the Raf (rapidly accelerated fibrosarcoma)-ERK (extracellular regulated protein kinases) pathway, promoting cell growth and angiogenesis. Methods: Competitive inhibition of the binding of NDRG3 and L-lactate may be a useful strategy for the repression of the hypoxic response mediated by NDRG3. The three-dimensional (3D) structure of NDRG3 was built using homology modeling because its crystal structure was not available. Then, L-lactate was docked into NDRG3, which showed that it bound to amino acid residues Gln69, His183, Asn189, Ala72 and Pro66 of NDRG3 in the most likely active site. Approximately 3000 compounds were virtually screened, and the 6 top-ranked compounds were selected as reference molecules to analyze their interactions, which illustrated that some of them might form electrostatic interactions with Glu70 and Asp187, π-π stacking with Phe75 and Tyr180, hydrogen bonds with Gly71 and Asn189, and hydrophobic interactions with Ala72 and Ile184. Results: Novel molecules were designed through structural optimization of the 6 top-ranked compounds, and their ADMET properties were subsequently predicted. Conclusion: These molecules may be potential drug candidates for the suppression of the hypoxic response mediated by NDRG3 and for targeted therapy of hypoxia-induced diseases.


Author(s):  
Mario Jojoa Acosta ◽  
Gema Castillo-Sánchez ◽  
Begonya Garcia-Zapirain ◽  
Isabel de la Torre Díez ◽  
Manuel Franco-Martín

The use of artificial intelligence in health care has grown quickly. In this sense, we present our work related to the application of Natural Language Processing techniques as a tool to analyze the sentiment perception of users who answered two questions from the CSQ-8 questionnaires with raw Spanish free-text. Their responses are related to mindfulness, which is a novel technique used to control stress and anxiety caused by different factors in daily life. As such, we proposed an online course where this method was applied in order to improve the quality of life of health care professionals during the COVID-19 pandemic. We also carried out an evaluation of the satisfaction level of the participants involved, with a view to establishing strategies to improve future experiences. To automatically perform this task, we used Natural Language Processing (NLP) models such as swivel embedding, neural networks, and transfer learning, so as to classify the inputs into the following three categories: negative, neutral, and positive. Due to the limited amount of data available (86 records for the first question and 68 for the second), transfer learning techniques were required. The length of the text had no limit from the user's standpoint, and our approach attained a maximum accuracy of 93.02% and 90.53%, respectively, based on ground truth labeled by three experts. Finally, we proposed a complementary analysis, using computer graphic text representation based on word frequency, to help researchers identify relevant information about the opinions with an objective approach to sentiment. The main conclusion drawn from this work is that applying NLP techniques to small amounts of data using transfer learning can achieve sufficient accuracy in the sentiment analysis and text classification stages.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Pilar López-Úbeda ◽  
Alexandra Pomares-Quimbaya ◽  
Manuel Carlos Díaz-Galiano ◽  
Stefan Schulz

Background: Controlled vocabularies are fundamental resources for information extraction from clinical texts using natural language processing (NLP). Standard language resources available in the healthcare domain such as the UMLS metathesaurus or SNOMED CT are widely used for this purpose, but with limitations such as lexical ambiguity of clinical terms. However, most of these terms are unambiguous within text limited to a given clinical specialty. This is one rationale, among others, for classifying clinical texts by the clinical specialty to which they belong. Results: This paper addresses this limitation by proposing and applying a method that automatically extracts Spanish medical terms classified and weighted per sub-domain, using Spanish MEDLINE titles and abstracts as input. The hypothesis is that biomedical NLP tasks benefit from collections of domain terms that are specific to clinical sub-domains. We use PubMed queries that generate sub-domain specific corpora from Spanish titles and abstracts, from which token n-grams are collected and metrics of relevance, discriminatory power, and broadness per sub-domain are computed. The generated term set, called Spanish core vocabulary about clinical specialties (SCOVACLIS), was made available to the scientific community and used in a text classification problem, obtaining improvements of 6 percentage points in the F-measure compared to the baseline using a Multilayer Perceptron, thus supporting the hypothesis that a specialized term set improves NLP tasks. Conclusion: The creation and validation of SCOVACLIS support the hypothesis that specific term sets reduce the level of ambiguity when compared to a specialty-independent and broad-scope vocabulary.
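The n-gram collection step can be illustrated with a short sketch. The abstract does not define its relevance, discriminatory power, or broadness metrics, so the two quantities computed below (a raw count and the share of an n-gram's occurrences that fall in one sub-domain) are stand-in assumptions, as are the function names and the toy Spanish snippets.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the token n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def subdomain_ngram_stats(corpora, n=2):
    """Count n-grams per sub-domain and compute a simple per-domain share.

    corpora: dict mapping a sub-domain name to a list of texts
    (hypothetical input format; whitespace tokenization is an assumption).
    """
    counts = {domain: Counter() for domain in corpora}
    for domain, texts in corpora.items():
        for text in texts:
            counts[domain].update(ngrams(text.lower().split(), n))

    # Occurrences of each n-gram across all sub-domains.
    total = Counter()
    for domain_counts in counts.values():
        total.update(domain_counts)

    # Stand-in metrics: raw count, and fraction of occurrences in this domain.
    return {
        domain: {
            gram: {"count": k, "share": k / total[gram]}
            for gram, k in domain_counts.items()
        }
        for domain, domain_counts in counts.items()
    }

toy_corpora = {
    "cardiologia": ["infarto agudo de miocardio", "infarto agudo"],
    "neurologia": ["accidente cerebrovascular", "infarto agudo cerebral"],
}
stats = subdomain_ngram_stats(toy_corpora, n=2)
print(stats["cardiologia"]["infarto agudo"])
```

An n-gram whose share is close to 1 for a single sub-domain is a candidate specialty-specific term, while one spread evenly across sub-domains is a broad term; the thresholds for either decision are not given in the source.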


2021 ◽  
Vol 11 (7) ◽  
pp. 3095
Author(s):  
Suhyune Son ◽  
Seonjeong Hwang ◽  
Sohyeun Bae ◽  
Soo Jun Park ◽  
Jang-Hwan Choi

Multi-task learning (MTL) approaches are actively used for various natural language processing (NLP) tasks. The Multi-Task Deep Neural Network (MT-DNN) has contributed significantly to improving the performance of natural language understanding (NLU) tasks. However, one drawback is that confusion about the language representation of various tasks arises during the training of the MT-DNN model. Inspired by the internal-transfer weighting of MTL in medical imaging, we introduce a Sequential and Intensive Weighted Language Modeling (SIWLM) scheme. The SIWLM consists of two stages: (1) Sequential weighted learning (SWL), which trains a model to learn entire tasks sequentially and concentrically, and (2) Intensive weighted learning (IWL), which enables the model to focus on the central task. We apply this scheme to the MT-DNN model and call this model the MTDNN-SIWLM. Our model achieves higher performance than the existing reference algorithms on six out of the eight GLUE benchmark tasks. Moreover, our model outperforms MT-DNN by 0.77 on average on the overall task. Finally, we conducted a thorough empirical investigation to determine the optimal weight for each GLUE task.


Author(s):  
E.G. Shidlovskaya ◽  
L. Schimansky-Geier ◽  
Yu.M. Romanovsky

A two-dimensional model for the substrate inside a pocket of an active site of an enzyme is presented and investigated as a vibrational system. The parameters of the system are evaluated for α-chymotrypsin. In the case of internal resonance, it is shown analytically and numerically that the energy concentrated on a certain degree of freedom might be several times larger than in the non-resonant case. Additionally, when the system is driven by harmonic excitations, energy is again redistributed inhomogeneously due to nonlinear phenomena. These results may be of importance for the determination of the rates of catalytic events of substrates bound in pockets of active sites.

