Predicting phosphorylation sites using machine learning by integrating the sequence, structure, and functional information of proteins

2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Salma Jamal ◽  
Waseem Ali ◽  
Priya Nagpal ◽  
Abhinav Grover ◽  
Sonam Grover

Abstract Background Post-translational modification (PTM) is a biological process that alters proteins and is therefore involved in the regulation of various cellular activities and in pathogenesis. Protein phosphorylation is an essential process and one of the most-studied PTMs: it occurs when a phosphate group is added to a serine (Ser, S), threonine (Thr, T), or tyrosine (Tyr, Y) residue. Dysregulation of protein phosphorylation can lead to various diseases (most commonly neurological disorders, Alzheimer’s disease, and Parkinson’s disease), thus necessitating the prediction of S/T/Y residues that can be phosphorylated in an uncharacterized amino acid sequence. Despite a surplus of sequencing data, current experimental methods of PTM identification are time-consuming, costly, and error-prone, so a number of computational methods have been proposed to replace them. However, phosphorylation prediction remains limited, owing to substrate specificity, performance, and the diversity of its features. Methods In the present study we propose machine-learning-based predictors that use the physicochemical, sequence, structural, and functional information of proteins to classify S/T/Y phosphorylation sites. Rigorous feature selection, the minimum redundancy/maximum relevance approach, and the symmetrical uncertainty method were employed to extract the most informative features for training the models. Results The random forest (RF) and support vector machine (SVM) models generated using diverse feature types in the present study were highly accurate, as is evident from good values for different statistical measures. Moreover, independent test sets and benchmark validations indicated that the proposed method clearly outperformed the existing methods, demonstrating its ability to accurately predict protein phosphorylation.
Conclusions The results obtained in the present work indicate that the proposed computational methodology can be effectively used for predicting putative phosphorylation sites, further facilitating the discovery of the mechanisms of various biological processes.
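
The symmetrical uncertainty method named in the Methods can be illustrated with a minimal sketch. It scores a feature X against the class label Y as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), which ranges from 0 (independent) to 1 (fully informative). The function names and the toy feature columns below are hypothetical, and the sketch assumes features have already been discretized; it is not the authors' pipeline and does not cover the minimum redundancy/maximum relevance step.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) for discrete x and y."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant; no information to share
    joint = entropy(list(zip(x, y)))
    mutual_information = hx + hy - joint
    return 2.0 * mutual_information / (hx + hy)


def rank_features(columns, labels):
    """Return (feature name, SU score) pairs, most informative first."""
    scores = {name: symmetrical_uncertainty(col, labels)
              for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 1 = phosphorylated site, 0 = non-phosphorylated site.
labels = [1, 1, 0, 0]
columns = {
    "informative": [1, 1, 0, 0],    # mirrors the label exactly
    "uninformative": [0, 1, 0, 1],  # independent of the label
}
ranking = rank_features(columns, labels)
```

In practice a threshold on the score, or a fixed number of top-ranked features, would decide which features reach the classifier; that choice is left open here.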

2018 ◽  
Vol 21 (2) ◽  
pp. 595-608 ◽  
Author(s):  
Man Cao ◽  
Guodong Chen ◽  
Jialin Yu ◽  
Shaoping Shi

Abstract Protein phosphorylation is a reversible and ubiquitous post-translational modification that primarily occurs at serine, threonine and tyrosine residues and regulates a variety of biological processes. In this paper, we first briefly summarize current progress in the computational prediction of eukaryotic protein phosphorylation sites, which has mainly focused on animals and plants, especially human, and to a lesser extent on fungi. Since the number of identified fungal phosphorylation sites has greatly increased in a wide variety of organisms and their roles in pathological physiology remain largely unknown, more attention has been paid to the identification of fungi-specific phosphorylation. Here, experimental fungal phosphorylation site data were collected, and most of the sites were classified into different types to be encoded with various features and trained via a two-step feature optimization method. A novel method for the prediction of species-specific fungal phosphorylation, PreSSFP, was developed, which can identify fungal phosphorylation in seven species for specific serine, threonine and tyrosine residues (http://computbiol.ncu.edu.cn/PreSSFP). We also critically evaluated the performance of PreSSFP and compared it with other existing tools. The results showed that PreSSFP is a robust predictor. Feature analyses showed some significant differences among the seven species. Species-specific prediction, using the two-step feature optimization method to mine important features for training, could considerably improve prediction performance. We anticipate that our study provides a new lead for future computational analysis of fungal phosphorylation.
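
Residue-specific predictors of this kind usually start from fixed-length peptide windows centered on each candidate S, T or Y. The sketch below shows one common way to collect such windows grouped by residue type. The function name, the flank length and the `X` padding character are assumptions made for illustration; the abstract does not state how PreSSFP builds its inputs.

```python
def extract_windows(sequence, flank=7, pad="X"):
    """Group candidate sites by residue type.

    Returns {"S": [...], "T": [...], "Y": [...]}, where each entry is a
    (1-based position, window) pair and the window holds `flank` residues
    on each side of the candidate site. Windows that run past either end
    of the protein are filled with the `pad` character.
    """
    padded = pad * flank + sequence + pad * flank
    sites = {"S": [], "T": [], "Y": []}
    for i, residue in enumerate(sequence):
        if residue in sites:
            # In `padded`, the residue sits at index i + flank, so the
            # window starting at i is centered on it.
            window = padded[i:i + 2 * flank + 1]
            sites[residue].append((i + 1, window))
    return sites


# Hypothetical five-residue sequence, short flanks for readability.
example = extract_windows("MSTAY", flank=2)
```

Each group can then be encoded and trained separately, which is what makes the prediction residue-specific.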


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Rulan Wang ◽  
Zhuo Wang ◽  
Hongfei Wang ◽  
Yuxuan Pang ◽  
Tzong-Yi Lee

Abstract Lysine crotonylation (Kcr) is a type of protein post-translational modification (PTM) that plays important roles in a variety of cellular regulation and processes. Several methods have been proposed for the identification of crotonylation. However, most of these methods predict efficiently on only histone or only non-histone proteins. Therefore, this work aims to give a more balanced performance across different species; here, plant (non-histone) and mammalian (histone) proteins are involved. SVM (support vector machine) and RF (random forest) were employed in this study. According to the results of cross-validation, the RF classifier based on the EGAAC attribute achieved the best predictive performance, performing competitively with existing methods while being more robust when dealing with imbalanced datasets. Moreover, an independent test was carried out, which compared the performance of this study with that of existing methods based on the same features or the same classifier. The SVM and RF classifiers achieved their best performances with 92% sensitivity, 88% specificity, 90% accuracy, and an MCC of 0.80 in the mammalian dataset, and 77% sensitivity, 83% specificity, 70% accuracy and an MCC of 0.54 in a relatively small mammalian dataset and a large-scale plant dataset, respectively. Moreover, cross-species independent testing was also carried out in this study, which demonstrated the species diversity between plant and mammalian proteins.
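
The four measures quoted above all derive from a binary confusion matrix. The sketch below computes them from counts of true positives, false positives, true negatives and false negatives. The counts in the usage lines are hypothetical: they assume a balanced test set of 100 positives and 100 negatives, which the abstract does not report, and were chosen only so that the first set of quoted figures can be traced through the formulas.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)  # fraction of true sites recovered
    specificity = tn / (tn + fp)  # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts on an assumed balanced set (100 positives, 100 negatives).
metrics = classification_metrics(tp=92, fp=12, tn=88, fn=8)
```

With these counts the function returns 0.92 sensitivity, 0.88 specificity, 0.90 accuracy and an MCC of about 0.80. Unlike accuracy, MCC uses all four cells of the matrix, which is why it is often reported alongside accuracy on imbalanced datasets.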


2019 ◽  
Vol 35 (16) ◽  
pp. 2766-2773 ◽  
Author(s):  
Fenglin Luo ◽  
Minghui Wang ◽  
Yu Liu ◽  
Xing-Ming Zhao ◽  
Ao Li

Abstract Motivation Phosphorylation is the most studied post-translational modification and is crucial for multiple biological processes. Recently, many efforts have been made to develop computational predictors for phosphorylation sites, but most of them are based on feature selection and discriminative classification. Thus, it is useful to develop a novel and highly accurate predictor that can unveil intricate patterns automatically for protein phosphorylation sites. Results In this study we present DeepPhos, a novel deep learning architecture for the prediction of protein phosphorylation. Unlike multi-layer convolutional neural networks, DeepPhos consists of densely connected convolutional neural network blocks, which capture multiple representations of sequences and make the final phosphorylation prediction through intra-block and inter-block concatenation layers. DeepPhos can also be used for kinase-specific prediction at the group, family, subfamily and individual kinase levels. The experimental results demonstrated that DeepPhos outperforms competitive predictors in general and kinase-specific phosphorylation site prediction. Availability and implementation The source code of DeepPhos is publicly deposited at https://github.com/USTCHIlab/DeepPhos. Supplementary information Supplementary data are available at Bioinformatics online.
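
The idea of densely connected convolutional blocks with intra-block and inter-block concatenation can be sketched in plain NumPy. This is an illustration of the general technique only, not the DeepPhos implementation: the one-hot encoding, kernel sizes, growth rate, layer count, random weights and the example peptide are all assumptions, and no training or final classification layer is shown.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(peptide):
    """Encode a peptide as a (length, 20) one-hot matrix."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    x = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for pos, aa in enumerate(peptide):
        x[pos, index[aa]] = 1.0
    return x


def conv1d_relu(x, w, b):
    """1D convolution with zero 'same' padding, stride 1, then ReLU.

    x: (length, c_in), w: (kernel, c_in, c_out), b: (c_out,).
    """
    kernel = w.shape[0]
    left = (kernel - 1) // 2
    right = kernel - 1 - left
    padded = np.pad(x, ((left, right), (0, 0)))
    out = np.stack([
        np.tensordot(padded[i:i + kernel], w, axes=([0, 1], [0, 1]))
        for i in range(x.shape[0])
    ])
    return np.maximum(out + b, 0.0)


def init_block(rng, c_in, growth, n_layers, kernel):
    """Random weights; layer i sees c_in + i * growth input channels."""
    return [
        (rng.normal(scale=0.1, size=(kernel, c_in + i * growth, growth)),
         np.zeros(growth))
        for i in range(n_layers)
    ]


def dense_block(x, params):
    """Intra-block concatenation: each layer reads all earlier feature maps."""
    features = [x]
    for w, b in params:
        features.append(conv1d_relu(np.concatenate(features, axis=1), w, b))
    return np.concatenate(features, axis=1)


rng = np.random.default_rng(0)
x = one_hot("MKRSPSLTEYA")  # hypothetical 11-residue window
blocks = [init_block(rng, 20, growth=4, n_layers=3, kernel=k) for k in (3, 5)]
# Inter-block concatenation: join the outputs of blocks with different kernels.
features = np.concatenate([dense_block(x, p) for p in blocks], axis=1)
```

Each block here outputs 20 + 3 × 4 = 32 channels per position, so the joined representation has 64 channels. Concatenating rather than summing feature maps lets later layers reuse earlier representations directly, which is the property the abstract attributes to densely connected blocks.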


Author(s):  
Shaofeng Lin ◽  
Chenwei Wang ◽  
Jiaqi Zhou ◽  
Ying Shi ◽  
Chen Ruan ◽  
...  

Abstract As an important post-translational modification (PTM), protein phosphorylation is involved in the regulation of almost all biological processes in eukaryotes. Due to the rapid progress in mass spectrometry-based phosphoproteomics, a large number of phosphorylation sites (p-sites) have been characterized but remain to be curated. Here, we briefly summarize current progress in the development of data resources for the collection, curation, integration and annotation of p-sites in eukaryotic proteins. Also, we designed the eukaryotic phosphorylation site database (EPSD), which contains 1 616 804 experimentally identified p-sites in 209 326 phosphoproteins from 68 eukaryotic species. In EPSD, we not only collected 1 451 629 newly identified p-sites from high-throughput (HTP) phosphoproteomic studies, but also integrated known p-sites from 13 additional databases. Moreover, we carefully annotated the phosphoproteins and p-sites of eight model organisms by integrating the knowledge from 100 additional resources that covered 15 aspects, including phosphorylation regulator, genetic variation and mutation, functional annotation, structural annotation, physicochemical property, functional domain, disease-associated information, protein-protein interaction, drug-target relation, orthologous information, biological pathway, transcriptional regulator, mRNA expression, protein expression/proteomics and subcellular localization. We anticipate that the EPSD can serve as a useful resource for further analysis of eukaryotic phosphorylation. With a data volume of 14.1 GB, EPSD is free for all users at http://epsd.biocuckoo.cn/.


Biomolecules ◽  
2019 ◽  
Vol 9 (2) ◽  
pp. 39 ◽  
Author(s):  
Zi-Shu Lu ◽  
Qian-Si Chen ◽  
Qing-Xia Zheng ◽  
Juan-Juan Shen ◽  
Zhao-Peng Luo ◽  
...  

Tobacco mosaic virus (TMV) is a common source of biological stress that significantly affects plant growth and development. It is also useful as a model in studies designed to clarify the mechanisms involved in plant viral disease. Plant responses to abiotic stress were recently reported to be regulated by complex mechanisms at the post-translational modification (PTM) level. Protein phosphorylation is one of the most widespread and major PTMs in organisms. Using immobilized metal ion affinity chromatography (IMAC) enrichment, high-pH C18 chromatography fractionation, and high-accuracy mass spectrometry (MS), a set of proteins and phosphopeptides in both TMV-infected tobacco and control tobacco were identified. A total of 4905 proteins and 3998 phosphopeptides with 3063 phosphorylation sites were identified. These 3998 phosphopeptides were assigned to 1311 phosphoproteins, as some proteins carried multiple phosphorylation sites. Among them, 530 proteins and 337 phosphopeptides corresponding to 277 phosphoproteins differed between the two groups. There were 43 upregulated phosphoproteins, including phosphoglycerate kinase, pyruvate phosphate dikinase, protein phosphatase 2C, and serine/threonine protein kinase. To the best of our knowledge, this is the first phosphoproteomic analysis of leaves from a tobacco cultivar, K326. The results of this study advance our understanding of tobacco development and TMV action at the protein phosphorylation level.


2014 ◽  
Author(s):  
Qiuming Yao

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT AUTHOR'S REQUEST.] Protein post-translational modification (PTM) occurs broadly during or after protein biosynthesis, to assist folding or activate function during the protein lifetime. Among all types of PTMs, protein phosphorylation is widely recognized as the most pervasive enzyme-catalyzed post-translational modification in eukaryotes. In particular, plants rely on this signaling mechanism to a greater degree than other eukaryotes, as measured by the frequency of protein kinases within the genome. Phosphorylation site mapping using high-resolution mass spectrometry has grown exponentially. In Arabidopsis alone there are thousands of experimentally determined phosphorylation sites. Likewise, data on other types of post-translational modification are increasing rapidly. The acetylation proteome is another large data set in the PTM field. To provide easy access to these modification events in a user-intuitive format, we have developed P3DB, the Plant Protein Phosphorylation Database (p3db.org). This database is a repository for plant protein phosphorylation site data. These data can be queried for a protein of interest using an integrated BLAST function to search for similar sequences with known phosphorylation sites among the multiple plants currently investigated. Thus, this resource can help identify functionally conserved phosphorylation sites in plants using a multi-system approach. Centered on these phosphorylation data, multiple related data and annotations are provided, including protein-protein interaction (PPI), gene ontology, protein tertiary structures, orthologous sequences, kinase/phosphatase classification and Kinase Client Assay (KiC Assay) data. P3DB is thus not only a repository, but also a context provider for studying phosphorylation events.
In addition, P3DB incorporates multiple network viewers for the above features, such as the PPI network, kinase-substrate network, phosphatase-substrate network, and domain co-occurrence network, to help study phosphorylation from a systems point of view. Furthermore, P3DB reflects a community-based design through which users can share data sets and automate data deposition processes for publication purposes. Since P3DB is a comprehensive, systematic, and interactive platform for phosphoproteomics research, many data analyses can be built on it. For example, disorder analysis and sequence conservation analysis can be performed on the P3DB datasets. Many researchers have downloaded the data and carried out meaningful analyses based on the P3DB infrastructure. Although protein phosphorylation sites can be reliably identified thanks to the development of high-resolution mass spectrometry, the experimental approach is time-consuming and resource-dependent. Furthermore, it is unlikely that an experimental approach could catalog an entire phosphoproteome. Computational prediction of phosphorylation sites provides an efficient and flexible way to reveal potential phosphorylation sites, facilitate experimental phosphorylation site identification and provide hypotheses for experimental design. Musite is a powerful tool that we developed to predict phosphorylation sites based solely on protein sequence. Musite integrates data preprocessing, feature extraction, machine-learning methods, and prediction models into one comprehensive tool. Musite (http://musite.net) can be extended to the study of all types of post-translational modification, as long as the dataset contains sufficient modification sites. To further improve the performance of Musite, a generalized motif tree applying fuzzy logic is introduced to complement the machine-learning-based prediction.
On the one hand, using a tree-based approach and fuzzy variables helps to interpret the final rules, so that biologists can obtain the significant patterns. On the other hand, its extracted rule sets essentially generalize the motifs and reveal more information. It can be paired with a traditional classification method to provide better interpretation, pre-filtering and analytical power. Compared with traditional motif extraction, the fuzzy motif decision tree is able to borrow more information from the observations, and thus it may extract more novel motifs or more comprehensive patterns. It can be applied to kinase-specific phosphorylated peptides to gain more insight into phosphorylation events. Combined, a comprehensive database (P3DB), a well-developed prediction tool (Musite), and a generalized motif constructor (Fuzzy Motif Tree) enable researchers to investigate phosphorylation and other post-translational modification events more thoroughly, and thus to reveal more of their underlying biological significance.


2021 ◽  
Vol 22 (22) ◽  
pp. 12110
Author(s):  
Xueting Ma ◽  
Baohong Liu ◽  
Zhenxing Gong ◽  
Zigang Qu ◽  
Jianping Cai

Protein phosphorylation is an important post-translational modification (PTM) involved in diverse cellular functions. It is the most prevalent PTM in both Toxoplasma gondii and Plasmodium falciparum, but its status in Eimeria tenella has not been reported. Herein, we performed a comprehensive, quantitative phosphoproteomic profile analysis of four stages of the E. tenella life cycle: unsporulated oocysts (USO), partially sporulated (7 h) oocysts (SO7h), sporulated oocysts (SO), and sporozoites (S). A total of 15,247 phosphorylation sites on 9514 phosphopeptides corresponding to 2897 phosphoproteins were identified across the four stages. In addition, 456, 479, and 198 differentially expressed phosphoproteins (DEPPs) were identified in the comparisons SO7h vs. USO, SO vs. SO7h, and S vs. SO, respectively. Gene Ontology (GO) term and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of DEPPs suggested that they were involved in diverse functions. For SO7h vs. USO, DEPPs were mainly involved in cell division, actin cytoskeleton organization, positive regulation of transport, and pyruvate metabolism. For SO vs. SO7h, they were related to the peptide metabolic process, translation, and RNA transport. DEPPs in the S vs. SO comparison were associated with the tricarboxylic acid metabolic process, positive regulation of ATPase activity, and calcium ion binding. Time course sequencing data analysis (TCseq) identified six clusters with similar expression change characteristics related to carbohydrate metabolism, cytoskeleton organization, and calcium ion transport, demonstrating different regulatory profiles across the life cycle of E. tenella. The results revealed significant changes in the abundance of phosphoproteins during E. tenella development. The findings shed light on the key roles of protein phosphorylation and dephosphorylation in the E. tenella life cycle.


Viruses ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1393
Author(s):  
Thanyaporn Dechtawewat ◽  
Sittiruk Roytrakul ◽  
Yodying Yingchutrakul ◽  
Sawanya Charoenlappanit ◽  
Bunpote Siridechadilok ◽  
...  

Dengue virus (DENV) infection causes a spectrum of dengue diseases that have unclear underlying mechanisms. Nonstructural protein 1 (NS1) is a multifunctional protein of DENV that is involved in DENV infection and dengue pathogenesis. This study investigated the potential post-translational modification of DENV NS1 by phosphorylation following DENV infection. Using liquid chromatography-tandem mass spectrometry (LC-MS/MS), 24 potential phosphorylation sites were identified in both cell-associated and extracellular NS1 proteins from three different cell lines infected with DENV. Cell-free kinase assays also demonstrated kinase activity in purified preparations of DENV NS1 proteins. Further studies were conducted to determine the roles of specific phosphorylation sites on NS1 proteins by site-directed mutagenesis with alanine substitution. The T27A and Y32A mutations had a deleterious effect on DENV infectivity. The T29A, T230A, and S233A mutations significantly decreased the production of infectious DENV but did not affect relative levels of intracellular DENV NS1 expression or NS1 secretion. Only the T230A mutation led to a significant reduction of detectable DENV NS1 dimers in virus-infected cells; however, none of the mutations interfered with DENV NS1 oligomeric formation. These findings highlight the importance of DENV NS1 phosphorylation that may pave the way for future target-specific antiviral drug design.


2020 ◽  
Vol 24 (6) ◽  
pp. 1311-1328
Author(s):  
Jozsef Suto

There are hundreds of thousands of known plant species on Earth, and many more remain unknown. Plant classification can be performed in different ways, but the most popular approach is based on plant leaf characteristics. Most types of plants have unique leaf characteristics such as shape, color, and texture. Since machine learning and computer vision have developed considerably in the past decade, automatic plant species (or leaf) recognition has become possible. Automated leaf classification is now a standalone research area within machine learning, and several shallow and deep methods have been proposed to recognize leaf types. From 2007 to the present, several research papers have been published on this topic. In older studies the classifier was a shallow method, while in current works many researchers apply deep networks for classification. While reviewing the plant leaf classification literature, we found an interesting deficiency (lack of hyper-parameter search) and a key difference between studies (different test sets). This work gives an overall review of the efficiency of shallow and deep methods under different test conditions. It can serve as a basis for further research.

