The pLoc_bal-mPlant is a powerful artificial intelligence tool for predicting the subcellular localization of plant proteins purely based on their sequence information

Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desirable to develop computational tools that identify their subcellular localization in a timely and effective manner from sequence information alone. Recently, a predictor called “pLoc-mPlant” was developed for identifying the subcellular localization of plant proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly on multi-label systems in which some proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is a very powerful predictor, further improvement is needed, because pLoc-mPlant was trained on an extremely skewed dataset in which some subsets (i.e., the numbers of proteins for some subcellular locations) were more than 10 times larger than others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset. To overcome this bias, we have developed a new and bias-free predictor called pLoc_bal-mPlant by balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset indicate that the new predictor is remarkably superior to pLoc-mPlant, the existing state-of-the-art predictor for identifying the subcellular localization of plant proteins. For the convenience of the majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mPlant/, through which users can easily obtain their desired results without needing to work through the detailed mathematics.
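The skew described in this abstract can be made concrete with a small sketch. It counts how many proteins carry each location label in a multi-label dataset and reports the ratio between the largest and smallest subsets. The function names and the toy annotations are hypothetical and are not taken from the pLoc_bal-mPlant benchmark; the abstract does not say how the authors measured imbalance.

```python
from collections import Counter


def location_counts(labels):
    """Count how many proteins are annotated with each location.

    A multiplex protein (one with several locations) is counted
    once for every location it belongs to.
    """
    counts = Counter()
    for locations in labels:
        counts.update(set(locations))
    return counts


def imbalance_ratio(counts):
    """Size of the largest subset divided by the size of the smallest."""
    return max(counts.values()) / min(counts.values())


# Hypothetical toy annotations (illustration only).
toy_labels = [
    ["chloroplast"],
    ["chloroplast"],
    ["chloroplast"],
    ["chloroplast"],
    ["chloroplast", "mitochondrion"],  # a multiplex protein
    ["nucleus"],
    ["nucleus", "cytoplasm"],          # a multiplex protein
]

counts = location_counts(toy_labels)
ratio = imbalance_ratio(counts)
print(dict(counts), ratio)
```

A ratio far above 1 signals the kind of uneven training set that the abstract identifies as the source of bias.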

The pLoc_bal-mHum Is a Powerful Web-Serve for Predicting the Subcellular Localization of Human Proteins Purely Based on Their Sequence Information

Clinical Research Notes ◽

10.31579/2690-8816/009 ◽

2020 ◽

Vol 1 (2) ◽

pp. 01-05

Author(s):

Kuo Chou

Keyword(s):

Artificial Intelligence ◽

Subcellular Localization ◽

Web Server ◽

Sequence Information ◽

Human Proteins

In 2019, a very powerful web-server, or AI (Artificial Intelligence) tool, was developed for predicting the subcellular localization of human proteins purely from their sequence information. It addresses multi-label systems, in which the same protein may appear in, or travel between, two or more locations, so that its identification requires multiple labels.

The pLoc_bal-mHum is a powerful web-serve for predicting the subcellular localization of human proteins purely based on their sequence information

10.54646/bijbnt.001 ◽

2020 ◽

pp. 1-4

Author(s):

Kuo-Chen Chou ◽

Keyword(s):

Artificial Intelligence ◽

Subcellular Localization ◽

Web Server ◽

Sequence Information ◽

Human Proteins

In 2019, a very powerful web-server, or AI (Artificial Intelligence) tool, was developed for predicting the subcellular localization of human proteins purely from their sequence information. It addresses multi-label systems [1], in which the same protein may appear in, or travel between, two or more locations, so that its identification requires multiple labels [2].

The pLoc_bal-mHum is a Powerful Web-Serve for Predicting the Subcellular Localization of Human Proteins Purely Based on Their Sequence Information

Advances in Bioengineering and Biomedical Science Research ◽

10.33140/abbsr.03.01.06 ◽

2020 ◽

Vol 3 (1) ◽

Keyword(s):

Artificial Intelligence ◽

Subcellular Localization ◽

Web Server ◽

Sequence Information ◽

Human Proteins

In 2019, a very powerful web-server, or AI (Artificial Intelligence) tool, was developed for predicting the subcellular localization of human proteins purely from their sequence information. It addresses multi-label systems, in which the same protein may appear in, or travel between, two or more locations, so that its identification requires multiple labels [1, 2].

pLoc_bal-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by General PseAAC and Quasi-balancing Training Dataset

Medicinal Chemistry ◽

10.2174/1573406415666181218102517 ◽

2019 ◽

Vol 15 (5) ◽

pp. 472-485 ◽

Cited By ~ 21

Author(s):

Kuo-Chen Chou ◽

Xiang Cheng ◽

Xuan Xiao

Keyword(s):

Drug Development ◽

Subcellular Localization ◽

Basic Research ◽

The Other ◽

Training Dataset ◽

Sequence Information ◽

Eukaryotic Proteins ◽

Validation Tests ◽

User Friendly ◽

Better Than

Background/Objective: Information on protein subcellular localization is crucially important for both basic research and drug development. With the explosive growth of protein sequences discovered in the post-genomic age, there is high demand for powerful bioinformatics tools that identify their subcellular localization in a timely and effective manner from sequence information alone. Recently, a predictor called “pLoc-mEuk” was developed for identifying the subcellular localization of eukaryotic proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly on multi-label systems where many proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is a very powerful predictor, further improvement is needed, because pLoc-mEuk was trained on an extremely skewed dataset in which one subset was about 200 times the size of others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset. Methods: To alleviate this bias, we have developed a new predictor called pLoc_bal-mEuk by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset indicate that the new predictor is remarkably superior to pLoc-mEuk, the existing state-of-the-art predictor for identifying the subcellular localization of eukaryotic proteins. It has not escaped our notice that the quasi-balancing treatment can also be applied to many other biological systems. Results: For the convenience of most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/.
Conclusion: It is anticipated that the pLoc_bal-mEuk predictor holds very high potential to become a useful high-throughput tool for identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs, which is currently a very active trend in drug development.
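The abstract does not spell out how the quasi-balancing is carried out, so the following is only a minimal sketch of one plausible reading: each small subset is enlarged by random oversampling until it reaches a chosen fraction of the largest subset, rather than full parity. The function name `quasi_balance`, the `min_fraction` parameter, and the toy subsets are assumptions made for illustration, not the authors' procedure.

```python
import math
import random


def quasi_balance(subsets, min_fraction=0.5, seed=0):
    """Oversample small subsets up to min_fraction of the largest one.

    subsets maps a location name to a non-empty list of samples.
    This is an illustrative stand-in, not the published treatment.
    """
    rng = random.Random(seed)
    largest = max(len(samples) for samples in subsets.values())
    target = math.ceil(min_fraction * largest)
    balanced = {}
    for name, samples in subsets.items():
        # Number of duplicates needed to reach the target size.
        extra = max(0, target - len(samples))
        balanced[name] = list(samples) + rng.choices(samples, k=extra)
    return balanced


# Hypothetical subsets with a 200:1 size ratio, as in the abstract.
toy_subsets = {
    "cytoplasm": [f"cyt_{i}" for i in range(200)],
    "acrosome": ["acr_0"],
}

balanced = quasi_balance(toy_subsets, min_fraction=0.5)
print({name: len(samples) for name, samples in balanced.items()})
```

Stopping short of full parity limits how many times each rare sample is duplicated, which is one reason a quasi-balanced target may be preferred over an exactly balanced one.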

MpsLDA-ProSVM: predicting multi-label protein subcellular localization by wMLDAe dimensionality reduction and ProSVM classifier

10.1101/2020.04.19.049478 ◽

2020 ◽

Author(s):

Qi Zhang ◽

Shan Li ◽

Bin Yu ◽

Yang Li ◽

Yandan Zhang ◽

...

Keyword(s):

Subcellular Localization ◽

Nearest Neighbor ◽

Chemical Information ◽

Sequence Information ◽

Feature Subset ◽

Protein Subcellular Localization ◽

K Nearest Neighbor ◽

Entropy Weight ◽

Linear Discriminant ◽

Optimal Feature Subset

Abstract: Proteins play a significant part in life processes such as cell growth, development, and reproduction. Exploring protein subcellular localization (SCL) is a direct way to better understand the function of proteins in cells. Studies have found that more and more proteins belong to multiple subcellular locations; these are called multi-label proteins. They not only play a key role in cell life activities, but also play an indispensable role in medicine and drug development. This article presents a new prediction model, MpsLDA-ProSVM, to predict the SCL of multi-label proteins. First, the physicochemical information, evolutionary information, sequence information and annotation information of protein sequences are fused. Then, for the first time, a weighted multi-label linear discriminant analysis framework based on the entropy weight form (wMLDAe) is used to refine the features and reduce the difficulty of learning. Finally, the optimal feature subset is input into the multi-label learning with label-specific features (LIFT) and multi-label k-nearest neighbor (ML-KNN) algorithms to obtain a synthetic ranking of relevant labels, and the Prediction and Relevance Ordering based SVM (ProSVM) classifier is then used to predict the SCLs. This method can rank and classify relevant labels at the same time, which greatly improves the efficiency of the model. Tested by the jackknife method, the overall actual accuracy (OAA) on the virus, plant, Gram-positive bacteria and Gram-negative bacteria datasets is 98.06%, 98.97%, 99.81% and 98.49%, which is 0.56%-9.16%, 5.37%-30.87%, 3.51%-6.91% and 3.99%-8.59% higher than other advanced methods, respectively. The source code and datasets are available at https://github.com/QUST-AIBBDRC/MpsLDA-ProSVM/.
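The abstract reports overall actual accuracy (OAA) without defining it. The sketch below assumes the definition commonly used in the multi-label localization literature: a protein counts as correct only when its predicted set of locations is identical to its true set. The function name and the toy label sets are hypothetical and do not come from the MpsLDA-ProSVM code base.

```python
def overall_actual_accuracy(true_sets, predicted_sets):
    """Fraction of proteins whose predicted label set matches exactly.

    Assumes OAA is the exact-match rate; a partially correct
    prediction for a multi-label protein scores zero.
    """
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true_sets and predicted_sets differ in length")
    exact = sum(
        1 for truth, pred in zip(true_sets, predicted_sets)
        if set(truth) == set(pred)
    )
    return exact / len(true_sets)


# Hypothetical toy example (illustration only).
true_sets = [{"nucleus"}, {"cytoplasm", "nucleus"}, {"membrane"}, {"cytoplasm"}]
predicted_sets = [{"nucleus"}, {"nucleus"}, {"membrane"}, {"cytoplasm"}]

oaa = overall_actual_accuracy(true_sets, predicted_sets)
print(oaa)
```

Under this definition the second protein, for which only one of its two locations was predicted, counts as a miss, which makes OAA a stricter measure than per-label accuracy.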

pLoc_Deep-mPlant: Predict Subcellular Localization of Plant Proteins by Deep Learning

Natural Science ◽

10.4236/ns.2020.125021 ◽

2020 ◽

Vol 12 (05) ◽

pp. 237-247

Author(s):

Yu-Tao Shao ◽

Xin-Xin Liu ◽

Zhe Lu ◽

Kuo-Chen Chou

Keyword(s):

Deep Learning ◽

Subcellular Localization ◽

Plant Proteins

SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning

Briefings in Bioinformatics ◽

10.1093/bib/bbaa401 ◽

2021 ◽

Author(s):

Jing Li ◽

Lichao Zhang ◽

Shida He ◽

Fei Guo ◽

Quan Zou

Keyword(s):

Subcellular Localization ◽

Prediction Model ◽

Protein Function ◽

Single Layer ◽

Protein Translation ◽

Sequence Information ◽

Layer Model ◽

Integration Model ◽

Generalization Ability ◽

Single Feature

Motivation: mRNA location corresponds to the location of protein translation and contributes to precise spatial and temporal management of protein function. However, current methods for assigning the subcellular localization of eukaryotic mRNA have important limitations: (1) turning multi-class classification into multiple binary classifications makes the training process tedious; (2) the majority of models trained with classical algorithms rely on a single type of sequence information; (3) the existing state-of-the-art models have not reached an ideal level of prediction and generalization ability. To achieve better assignment of the subcellular localization of eukaryotic mRNA, a better and more comprehensive model must be developed. Results: In this paper, SubLocEP is proposed as a two-layer integrated prediction model for accurate prediction of the location of sequence samples. Unlike existing models based on limited features, SubLocEP comprehensively considers additional feature attributes and is combined with LightGBM to generate single-feature classifiers. The initial integration model (single-layer model) is generated according to the categories of a feature. Subsequently, two single-layer integration models are weighted (sequence-based : physicochemical properties = 3:2) to produce the final two-layer model. The performance of SubLocEP on independent datasets indicates that it is an accurate and stable prediction model with strong generalization ability. Additionally, an online tool that contains experimental data has been developed to make estimation of the subcellular localization of eukaryotic mRNA convenient for users.
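The 3:2 weighting of the two single-layer models can be illustrated with a short sketch. It assumes that each single-layer model outputs one probability per location class and that the weighting is a weighted average of those probabilities; the abstract does not confirm either detail. The function names, class names, and probability values are hypothetical.

```python
def fuse_two_layer(seq_probs, phys_probs, weights=(3, 2)):
    """Weighted average of two per-class probability vectors.

    weights follows the 3:2 (sequence-based : physicochemical)
    ratio stated in the abstract; averaging probabilities is an
    assumption made for this sketch.
    """
    w_seq, w_phys = weights
    total = w_seq + w_phys
    return [
        (w_seq * s + w_phys * p) / total
        for s, p in zip(seq_probs, phys_probs)
    ]


def predict_class(fused_probs, class_names):
    """Return the class name with the highest fused probability."""
    best = max(range(len(fused_probs)), key=lambda i: fused_probs[i])
    return class_names[best]


# Hypothetical outputs of the two single-layer models for one mRNA.
class_names = ["nucleus", "cytoplasm", "exosome"]
seq_probs = [0.6, 0.3, 0.1]
phys_probs = [0.2, 0.7, 0.1]

fused = fuse_two_layer(seq_probs, phys_probs)
print(fused, predict_class(fused, class_names))
```

In this toy case the sequence-based model alone would choose the first class, but the fused score favors the second, which shows how the lower-weighted model can still change the final call.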

DETECTING AND SORTING TARGETING PEPTIDES WITH NEURAL NETWORKS AND SUPPORT VECTOR MACHINES

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720006001771 ◽

2006 ◽

Vol 04 (01) ◽

pp. 1-18 ◽

Cited By ~ 36

Author(s):

JOHN HAWKINS ◽

MIKAEL BODÉN

Keyword(s):

Support Vector Machines ◽

Subcellular Localization ◽

Support Vector ◽

Plant Proteins ◽

Final Model ◽

Classifier Design ◽

Vector Machines ◽

Localization Predictor ◽

Plant Data ◽

Targeting Peptide

This paper presents a composite multi-layer classifier system for predicting the subcellular localization of proteins based on their amino acid sequence. The work is an extension of our previous predictor PProwler v1.1, which is itself built upon the series of predictors SignalP and TargetP. In this study we outline experiments conducted to improve the classifier design. The major improvement came from using support vector machines as a "smart gate" sorting the outputs of several different targeting peptide detection networks. Our final model (PProwler v1.2) gives MCC values of 0.873 for non-plant and 0.849 for plant proteins. The model improves upon the accuracy of our previous subcellular localization predictor (PProwler v1.1) by 2% for plant data (which represents a 7.5% improvement upon TargetP).
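For readers unfamiliar with the MCC values quoted above, the Matthews correlation coefficient for a binary decision can be computed from the four confusion-matrix counts as sketched below. The counts are hypothetical and are unrelated to the PProwler results; how the paper aggregates MCC across several localization classes is not stated in the abstract.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts.

    Returns 0.0 when any marginal total is zero, a common
    convention for the otherwise undefined case.
    """
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for one class versus the rest.
mcc = matthews_corrcoef(tp=90, tn=85, fp=15, fn=10)
print(round(mcc, 3))
```

MCC ranges from -1 to 1, with 1 meaning perfect agreement and 0 meaning performance no better than chance, so it remains informative when the classes differ greatly in size.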

pLoc_bal-mVirus: Predict Subcellular Localization of Multi-Label Virus Proteins by Chou's General PseAAC and IHTS Treatment to Balance Training Dataset

Medicinal Chemistry ◽

10.2174/1573406415666181217114710 ◽

2019 ◽

Vol 15 (5) ◽

pp. 496-509 ◽

Cited By ~ 25

Author(s):

Xuan Xiao ◽

Xiang Cheng ◽

Genqiang Chen ◽

Qi Mao ◽

Kuo-Chen Chou

Keyword(s):

Subcellular Localization ◽

Balance Training ◽

Basic Research ◽

Subcellular Location ◽

The Other ◽

Training Dataset ◽

Sequence Information ◽

Training Samples ◽

A Cell ◽

Virus Proteins

Background/Objective: Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Facing the avalanche of protein sequences emerging in the post-genomic age, it is urgent to develop computational tools that identify their subcellular localization in a timely and effective manner from sequence information alone. Recently, a predictor called “pLoc-mVirus” was developed for identifying the subcellular localization of virus proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly on multi-label systems in which some proteins, known as “multiplex proteins”, may simultaneously occur in, or move between, two or more subcellular location sites. Although it is a very powerful predictor, further improvement is needed, because pLoc-mVirus was trained on an extremely skewed dataset in which one subset was over 10 times the size of others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset. Methods: Using Chou's general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance the training dataset, we have developed a new predictor called “pLoc_bal-mVirus” for predicting the subcellular localization of multi-label virus proteins. Results: Cross-validation tests on exactly the same experiment-confirmed dataset indicate that the new predictor is remarkably superior to pLoc-mVirus, the existing state-of-the-art predictor for the same purpose. Conclusion: Its user-friendly web-server is available at http://www.jci-bioinfo.cn/pLoc_balmVirus/, through which the majority of experimental scientists can easily obtain their desired results without needing to work through the detailed mathematics.
Accordingly, pLoc_bal-mVirus is expected to become a very useful tool for designing multi-target drugs and for in-depth understanding of biological processes in a cell.
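The abstract names the IHTS (Inserting Hypothetical Training Samples) treatment but gives no details of it. The sketch below is therefore a generic stand-in and not the published procedure: it enlarges a small subset by inserting synthetic feature vectors that lie on the line segment between two randomly chosen real vectors, similar in spirit to interpolation-based oversampling. The function name, parameters, and toy vectors are assumptions for illustration.

```python
import random


def insert_hypothetical_samples(vectors, target_size, seed=0):
    """Grow a subset to target_size by interpolating between members.

    Each synthetic vector is a + t * (b - a) for two randomly
    chosen real vectors a, b and a random t in [0, 1).
    Illustrative only; not the IHTS algorithm from the paper.
    """
    rng = random.Random(seed)
    augmented = [list(v) for v in vectors]
    while len(augmented) < target_size:
        a = rng.choice(vectors)
        b = rng.choice(vectors)
        t = rng.random()
        augmented.append([x + t * (y - x) for x, y in zip(a, b)])
    return augmented


# Hypothetical two-dimensional feature vectors of a rare location.
rare_subset = [[0.0, 1.0], [1.0, 0.0]]

augmented = insert_hypothetical_samples(rare_subset, target_size=6)
print(len(augmented))
```

Because every synthetic point stays between existing members of the same subset, the inserted samples remain inside the region already occupied by that class rather than drifting toward other classes.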
