PROTEIN SUBCELLULAR LOCALIZATION BASED ON PSI-BLAST AND MACHINE LEARNING

2006 ◽  
Vol 04 (06) ◽  
pp. 1181-1195 ◽  
Author(s):  
JIAN GUO ◽  
XIAN PU ◽  
YUANLIE LIN ◽  
HOWARD LEUNG

Subcellular location is an important functional annotation of proteins. An automatic, reliable and efficient prediction system for protein subcellular localization is necessary for large-scale genome analysis. This paper describes a protein subcellular localization method which extracts features from protein profiles rather than from amino acid sequences. The protein profile represents a protein family, discards part of the sequence information that is not conserved throughout the family and therefore is more sensitive than the amino acid sequence. The amino acid compositions of the whole profile and of the N-terminus of the profile are extracted, respectively, to train and test the probabilistic neural network classifiers. On two benchmark datasets, the overall accuracies of the proposed method reach 89.1% and 68.9%, respectively. The prediction results show that the proposed method performs better than methods based on amino acid sequences. The prediction results of the proposed method are also compared with Subloc on two redundancy-reduced datasets.
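The abstract does not give the exact formula for a profile-based amino acid composition. A minimal sketch, assuming the profile is a PSSM-like matrix with one row per sequence position and one column per amino acid, and assuming the composition is simply the column-wise mean over the selected rows, could look as follows. The function name `profile_composition`, the parameter `n_terminal`, and the toy matrix (3 columns instead of 20) are illustrative and not taken from the paper.

```python
def profile_composition(profile, n_terminal=None):
    """Column-wise mean of a profile matrix.

    profile    : list of rows, one row per sequence position
    n_terminal : if given, only the first n_terminal rows are averaged
                 (assumed here to stand for the N-terminus of the profile)
    """
    rows = profile if n_terminal is None else profile[:n_terminal]
    if not rows:
        raise ValueError("profile has no rows to average")
    width = len(rows[0])
    return [sum(row[j] for row in rows) / len(rows) for j in range(width)]


# Toy profile: 4 positions, 3 columns (a real PSSM would have 20 columns).
toy_profile = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 0.0],
    [0.0, 4.0, 2.0],
    [0.0, 0.0, 0.0],
]

whole_composition = profile_composition(toy_profile)
n_terminal_composition = profile_composition(toy_profile, n_terminal=2)
```

The two vectors are kept separate here because the abstract says the whole-profile and N-terminal compositions are each used to train and test classifiers; how the two predictions are combined is not stated.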

2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Danyu Jin ◽  
Ping Zhu

The prediction of protein subcellular localization is not only important for the study of protein structure and function but can also facilitate the design and development of new drugs. In recent years, feature extraction methods based on protein evolution information have attracted much attention and made good progress. Based on the protein position-specific score matrix (PSSM) obtained by PSI-BLAST, the PSSM-GSD method is proposed according to the data distribution characteristics. In order to reflect the protein sequence information as much as possible, the AAO, PSSM-AAO, and PSSM-GSD methods are fused together. Then, a conditional entropy-based classifier chain algorithm and a support vector machine are used to locate multilabel proteins. Finally, we test on the Gpos-mPLoc and Gneg-mPLoc datasets and, considering the severe imbalance of the data, select the SMOTE algorithm to expand the minority-class samples; the experiments show that the AAO + PSSM ∗ method in the paper achieves 83.1% and 86.8% overall accuracy, respectively. After experimental comparison of different methods, AAO + PSSM ∗ has good performance and can effectively predict protein subcellular location.
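The core idea of SMOTE is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a simplified sketch of that idea in plain Python, not the authors' implementation and not a full SMOTE library routine. The function name `smote_like`, its parameters, and the toy feature vectors are assumptions made for illustration.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic samples from a list of minority-class vectors.

    Each synthetic sample lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for i, p in enumerate(minority) if i != base_index]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class with three 2-dimensional feature vectors.
minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote_like(minority_samples, n_new=5)
```

Because every synthetic point is an interpolation between two existing minority points, the new samples stay inside the region already occupied by the minority class.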


2019 ◽  
Vol 24 (34) ◽  
pp. 4013-4022 ◽  
Author(s):  
Xiang Cheng ◽  
Xuan Xiao ◽  
Kuo-Chen Chou

Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desirable to develop computational tools that identify their subcellular localization in a timely and effective manner based on the sequence information alone. Recently, a predictor called “pLoc-mPlant” was developed for identifying the subcellular localization of plant proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more effort is needed to further improve it. This is because pLoc-mPlant was trained on an extremely skewed dataset in which some subsets (i.e., the protein numbers for some subcellular locations) were more than 10 times larger than the others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset. To overcome this bias, we have developed a new and bias-free predictor called pLoc_bal-mPlant by balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mPlant, the existing state-of-the-art predictor in identifying the subcellular localization of plant proteins. To maximize the convenience for the majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mPlant/, by which users can easily get their desired results without the need to go through the detailed mathematics.


2020 ◽  
Vol 15 (6) ◽  
pp. 517-527
Author(s):  
Yunyun Liang ◽  
Shengli Zhang

Background: Apoptosis proteins have a key role in the development and the homeostasis of the organism, and are very important for understanding the mechanism of cell proliferation and death. The function of an apoptosis protein is closely related to its subcellular location. Objective: Prediction of apoptosis protein subcellular localization is a meaningful task. Methods: In this study, we predict the apoptosis protein subcellular location by using the PSSM-based second-order moving average descriptor, nonnegative matrix factorization based on Kullback-Leibler divergence and over-sampling algorithms. This model is named SOMAPKLNMF-OS and is constructed on the ZD98, ZW225 and CL317 benchmark datasets. Then, the support vector machine is adopted as the classifier, and the bias-free jackknife test method is used to evaluate the accuracy. Results: Our prediction system achieves favorable and promising overall accuracy on the three datasets and also outperforms the other listed models. Conclusion: The results show that our model offers a high-throughput tool for the identification of apoptosis protein subcellular localization.
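The jackknife test mentioned in the abstract is a leave-one-out procedure: each protein is held out in turn, the classifier is trained on all the others, and the held-out protein is then predicted. A minimal sketch of that evaluation loop is shown below. To keep it self-contained, a 1-nearest-neighbour rule stands in for the support vector machine used in the paper; the function names, feature vectors, and location labels are hypothetical.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Stand-in classifier (NOT the paper's SVM): label of the closest sample."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]


def jackknife_accuracy(features, labels, predict=nearest_neighbour_predict):
    """Leave-one-out accuracy: hold out each sample once and predict it."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Toy data: five proteins, two features each, two hypothetical locations.
toy_features = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9], [0.5, 0.45]]
toy_labels = ["cytoplasm", "cytoplasm", "nucleus", "nucleus", "nucleus"]
accuracy = jackknife_accuracy(toy_features, toy_labels)
```

Any classifier with the same `predict(train_x, train_y, query)` shape can be passed in place of the stand-in, which is how an SVM would be plugged into the same loop.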


2020 ◽  
Vol 11 ◽  
Author(s):  
Kenichiro Imai ◽  
Kenta Nakai

At the time of translation, nascent proteins are thought to be sorted into their final subcellular localization sites, based on parts of their amino acid sequences (i.e., sorting or targeting signals). Thus, it is interesting to computationally recognize these signals from the amino acid sequence of any given protein and to predict its final subcellular localization with such information, supplemented with additional information (e.g., k-mer frequency). This field has a long history and many prediction tools have been released. Even in this era of proteomic atlases at the single-cell level, researchers continue to develop new algorithms, aiming, for example, at assessing the impact of disease-causing mutations or cell type-specific alternative splicing. In this article, we overview the entire field and discuss its future direction.


2020 ◽  
Author(s):  
Qi Zhang ◽  
Shan Li ◽  
Bin Yu ◽  
Yang Li ◽  
Yandan Zhang ◽  
...  

ABSTRACT Proteins play a significant part in life processes such as cell growth, development, and reproduction. Exploring protein subcellular localization (SCL) is a direct way to better understand the function of proteins in cells. Studies have found that more and more proteins belong to multiple subcellular locations, and these proteins are called multi-label proteins. They not only play a key role in cell life activities, but also play an indispensable role in medicine and drug development. This article presents a new prediction model, MpsLDA-ProSVM, to predict the SCL of multi-label proteins. First, the physicochemical information, evolutionary information, sequence information and annotation information of protein sequences are fused. Then, for the first time, a weighted multi-label linear discriminant analysis framework based on entropy weight form (wMLDAe) is used to refine and purify the features and reduce the difficulty of learning. Finally, the optimal feature subset is input into the multi-label learning with label-specific features (LIFT) and multi-label k-nearest neighbor (ML-KNN) algorithms to obtain a synthetic ranking of relevant labels, and the Prediction and Relevance Ordering based SVM (ProSVM) classifier is then used to predict the SCLs. This method can rank and classify related labels at the same time, which greatly improves the efficiency of the model. Tested by the jackknife method, the overall actual accuracy (OAA) values on the virus, plant, Gram-positive bacteria and Gram-negative bacteria datasets are 98.06%, 98.97%, 99.81% and 98.49%, which are 0.56%-9.16%, 5.37%-30.87%, 3.51%-6.91% and 3.99%-8.59% higher than other advanced methods, respectively. The source codes and datasets are available at https://github.com/QUST-AIBBDRC/MpsLDA-ProSVM/.
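The abstract reports overall actual accuracy (OAA) without defining it. A minimal sketch, assuming the exact-match definition commonly used for multi-label localization (a protein counts as correct only when its predicted set of locations equals its true set), is given below. The function name and the example label sets are hypothetical.

```python
def overall_actual_accuracy(true_sets, predicted_sets):
    """Fraction of proteins whose predicted location set matches exactly.

    Assumes the exact-match definition of OAA; a partially correct
    prediction (missing or extra locations) counts as a miss.
    """
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true and predicted lists must have the same length")
    if not true_sets:
        raise ValueError("no proteins to evaluate")
    hits = sum(
        1 for true, pred in zip(true_sets, predicted_sets)
        if set(true) == set(pred)
    )
    return hits / len(true_sets)


# Four hypothetical proteins; the second and fourth are only partly right.
true_locations = [
    {"nucleus"},
    {"cytoplasm", "nucleus"},
    {"membrane"},
    {"cytoplasm"},
]
predicted_locations = [
    {"nucleus"},
    {"cytoplasm"},
    {"membrane"},
    {"cytoplasm", "nucleus"},
]
oaa = overall_actual_accuracy(true_locations, predicted_locations)
```

Under this definition both under-prediction (a missed location) and over-prediction (an extra location) are penalized, which makes it a strict measure for multi-label proteins.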


2002 ◽  
Vol 68 (6) ◽  
pp. 2731-2736 ◽  
Author(s):  
Hirokazu Nankai ◽  
Wataru Hashimoto ◽  
Kousaku Murata

ABSTRACT When cells of Bacillus sp. strain GL1 were grown in a medium containing xanthan as a carbon source, α-mannosidase exhibiting activity toward p-nitrophenyl-α-d-mannopyranoside (pNP-α-d-Man) was produced intracellularly. The 350-kDa α-mannosidase purified from a cell extract of the bacterium was a trimer comprising three identical subunits, each with a molecular mass of 110 kDa. The enzyme hydrolyzed pNP-α-d-Man (Km = 0.49 mM) and d-mannosyl-(α-1,3)-d-glucose most efficiently at pH 7.5 to 9.0, indicating that the enzyme catalyzes the last step of the xanthan depolymerization pathway of Bacillus sp. strain GL1. The gene for α-mannosidase, cloned by using N-terminal amino acid sequence information, contained an open reading frame (3,144 bp) capable of coding for a polypeptide with a molecular weight of 119,239. The deduced amino acid sequence showed homology with the amino acid sequences of α-mannosidases belonging to glycoside hydrolase family 38.


1990 ◽  
Vol 3 (1) ◽  
pp. 159
Author(s):  
A Gibbs ◽  
A Ding ◽  
J Howe ◽  
P Keese ◽  
A MacKenzie ◽  
...  

Molecular sequence information about viruses has mostly confirmed the groupings devised by traditional taxonomic methods, but has shown in addition that the genes of related species may differ in number, arrangement, orientation and in sequence homology. It has also revealed that true genetic recombination between viruses has been common, even among those with RNA genomes; indeed, most virus groups seem to have arisen by recombination. Thus, there is an unexpected wealth of genetic chaos hidden behind the façade of the phenotype, and it is possible that the difficulties that plant taxonomists have had in identifying the relationships of the major groupings of plants could have similar causes. Nonetheless, molecular taxonomy does give sensible results and this is illustrated by a classification of the large subunit Rubisco proteins of 21 plant species based on their amino acid sequences.

