Protein Subcellular Localization Based on Evolutionary Information and Segmented Distribution

The prediction of protein subcellular localization is important not only for the study of protein structure and function but also for the design and development of new drugs. In recent years, feature extraction methods based on protein evolutionary information have attracted much attention and made good progress. Based on the protein position-specific score matrix (PSSM) obtained by PSI-BLAST, the PSSM-GSD method is proposed according to the distribution characteristics of the data. To capture as much protein sequence information as possible, the AAO, PSSM-AAO, and PSSM-GSD methods are fused. A conditional entropy-based classifier chain algorithm and a support vector machine are then used to locate multilabel proteins. Finally, the method is tested on the Gpos-mPLoc and Gneg-mPLoc datasets; because the data are severely imbalanced, the SMOTE algorithm is used to expand the minority-class samples. The experiments show that the AAO + PSSM ∗ method achieves overall accuracies of 83.1% and 86.8%, respectively. In an experimental comparison with other methods, AAO + PSSM ∗ performs well and can effectively predict protein subcellular location.
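
The SMOTE step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote`, the neighbor count `k`, and the toy feature vectors are assumptions made for illustration. The sketch only shows the core idea of SMOTE, which is to create a synthetic minority sample at a random point on the segment between a minority sample and one of its nearest minority neighbors.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        # Pick a random point on the segment between base and its neighbor.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

# Hypothetical 2-D feature vectors for an under-represented location class.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=4)
```

The synthetic points would normally be added to the training set only, so that the test samples remain real proteins.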

Integrating Second-order Moving Average and Over-sampling Algorithm to Predict Apoptosis Protein Subcellular Localization

Current Bioinformatics ◽

10.2174/1574893614666190902155811 ◽

2020 ◽

Vol 15 (6) ◽

pp. 517-527

Author(s):

Yunyun Liang ◽

Shengli Zhang

Keyword(s):

Subcellular Localization ◽

Moving Average ◽

Subcellular Location ◽

Second Order ◽

Test Method ◽

Support Vector ◽

Protein Subcellular Localization ◽

Protein Subcellular Location ◽

Apoptosis Protein ◽

Leibler Divergence

Background: Apoptosis proteins play a key role in the development and homeostasis of the organism and are important for understanding the mechanisms of cell proliferation and death. The function of an apoptosis protein is closely related to its subcellular location. Objective: Predicting the subcellular localization of apoptosis proteins is therefore a meaningful task. Methods: In this study, we predict apoptosis protein subcellular location using a PSSM-based second-order moving average descriptor, nonnegative matrix factorization based on Kullback-Leibler divergence, and over-sampling algorithms. The model, named SOMAPKLNMF-OS, is constructed on the ZD98, ZW225 and CL317 benchmark datasets. A support vector machine is adopted as the classifier, and the bias-free jackknife test is used to evaluate accuracy. Results: Our prediction system achieves promising overall accuracy on the three datasets and outperforms the other listed models. Conclusion: The results show that our model offers a high-throughput tool for identifying apoptosis protein subcellular localization.
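
The jackknife test used for evaluation can be sketched as follows: each sample is held out once, the classifier is trained on the remaining samples, and the held-out sample is scored. This is a minimal illustration, not the authors' pipeline. A nearest-neighbor rule stands in for the support vector machine, and the feature vectors and location labels are hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier (not an SVM): label of the closest training sample."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    return train_y[best]

def jackknife_accuracy(features, labels, predict=nearest_neighbor_predict):
    """Hold out each sample once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)

# Hypothetical, well-separated toy data for two locations.
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels = ["nuclear", "nuclear", "nuclear", "cytoplasmic", "cytoplasmic", "cytoplasmic"]
accuracy = jackknife_accuracy(features, labels)
```

Because every sample is tested exactly once and the procedure involves no random split, repeated runs on the same dataset give the same result.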

MIC_Locator: a novel image-based protein subcellular location multi-label prediction model based on multi-scale monogenic signal representation and intensity encoding strategy

BMC Bioinformatics ◽

10.1186/s12859-019-3136-3 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Fan Yang ◽

Yang Liu ◽

Yanbin Wang ◽

Zhijian Yin ◽

Zhen Yang

Keyword(s):

Subcellular Localization ◽

Prediction Model ◽

Subcellular Location ◽

Protein Subcellular Localization ◽

Monogenic Signal ◽

Protein Subcellular Location ◽

Intensity Coding ◽

Coding Strategy ◽

Frequency Feature

Background: Protein subcellular localization plays a crucial role in understanding cell function. Proteins need to be in the right place at the right time and to combine with the corresponding molecules to fulfill their functions. Furthermore, prediction of protein subcellular location not only plays a guiding role in drug design and development, owing to potential molecular targets, but is also essential for genome annotation. Image-based protein subcellular localization currently has three common drawbacks: obsolete datasets whose label information has not been updated, stereotypical feature descriptors based on the spatial domain or grey level, and single-function prediction algorithms that are limited to handling single-label databases. Results: In this paper, a novel human protein subcellular localization prediction model, MIC_Locator, is proposed. Firstly, the latest datasets are collected and collated as our benchmark dataset, instead of obsolete data, for training the prediction model. Secondly, Fourier transformation, Riesz transformation, a Log-Gabor filter and an intensity coding strategy are employed to obtain frequency features based on the three components of the monogenic signal at different frequency scales. Thirdly, a chained prediction model is proposed to handle multi-label instead of single-label datasets. The experimental results show that MIC_Locator achieves 60.56% subset accuracy and outperforms the majority of existing prediction models, and that the frequency features and intensity coding strategy help improve classification accuracy. Conclusions: Our results demonstrate that frequency features improve model performance more than features extracted from the spatial domain, and that the MIC_Locator proposed in this paper can speed up validation of protein annotation, knowledge of protein function and proteomics research.
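
Subset accuracy, the metric reported above, is the strictest common multi-label score: a prediction counts as correct only when the predicted label set matches the true label set exactly. The sketch below shows the computation on hypothetical labels; the location names are placeholders and are not taken from the paper's dataset.

```python
def subset_accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label set equals the true label set."""
    matches = sum(set(t) == set(p) for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Hypothetical annotations for four proteins.
true_labels = [{"nucleus"}, {"cytosol", "nucleus"}, {"mitochondria"}, {"golgi"}]
predicted = [{"nucleus"}, {"nucleus"}, {"mitochondria"}, {"golgi", "cytosol"}]
score = subset_accuracy(true_labels, predicted)  # 2 of 4 sets match exactly
```

A partially correct prediction, such as the second protein here, earns no credit, which is one reason subset accuracy values tend to be lower than per-label accuracy values.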

Protein subcellular localization based on deep image features and criterion learning strategy

Briefings in Bioinformatics ◽

10.1093/bib/bbaa313 ◽

2020 ◽

Author(s):

Ran Su ◽

Linlin He ◽

Tianling Liu ◽

Xiaofeng Liu ◽

Leyi Wei

Keyword(s):

Neural Networks ◽

Subcellular Localization ◽

Learning Strategy ◽

Subcellular Location ◽

Image Features ◽

Protein Subcellular Localization ◽

Protein Subcellular Location ◽

Protein Functions ◽

Deep Image ◽

Criterion Learning

The spatial distribution of the proteome at subcellular levels provides clues to protein functions and is thus important to human biology and medicine. Imaging-based methods are among the most important approaches for predicting protein subcellular location. Although deep neural networks have shown impressive performance in a number of imaging tasks, their application to protein subcellular localization has not been sufficiently explored. In this study, we developed a deep imaging-based approach to localize proteins at subcellular levels. Based on deep image features extracted from convolutional neural networks (CNNs), both single-label and multi-label locations can be accurately predicted. Multi-label prediction is a particularly challenging task. Here we developed a criterion learning strategy to exploit the label–attribute relevancy and label–label relevancy. The criterion used to determine the final label set was obtained automatically during the learning procedure. We identified an optimal CNN architecture that gave the best results. In addition, experiments show that, compared with hand-crafted features, the deep features give more accurate predictions with fewer features. The implementation of the proposed method is available at https://github.com/RanSuLab/ProteinSubcellularLocation.

Prediction of Drug–Target Interactions by Combining Dual-Tree Complex Wavelet Transform with Ensemble Learning Method

Molecules ◽

10.3390/molecules26175359 ◽

2021 ◽

Vol 26 (17) ◽

pp. 5359

Author(s):

Jie Pan ◽

Li-Ping Li ◽

Zhu-Hong You ◽

Chang-Qing Yu ◽

Zhong-Hao Ren ◽

...

Keyword(s):

Wavelet Transform ◽

Drug Discovery ◽

Protein Sequence ◽

Drug Target ◽

New Drugs ◽

Evolutionary Information ◽

Support Vector ◽

Sequence Information ◽

Complex Wavelet Transform ◽

Complex Wavelet

Identification of drug–target interactions (DTIs) is vital for drug discovery. However, traditional biological approaches have unavoidable shortcomings, such as being time-consuming and expensive. Therefore, there is an urgent need to develop novel and effective computational methods to predict DTIs in order to shorten the development cycles of new drugs. In this study, we present a novel computational approach to identify DTIs, which uses protein sequence information and the dual-tree complex wavelet transform (DTCWT). More specifically, a position-specific scoring matrix (PSSM) was computed for the target protein sequence to obtain its evolutionary information. Then, DTCWT was used to extract representative features from the PSSM, which were combined with the drug fingerprint features to form the feature descriptors. Finally, these descriptors were fed to the Rotation Forest (RoF) model for classification. A 5-fold cross validation (CV) was adopted on four datasets (Enzyme, Ion Channel, GPCRs (G-protein-coupled receptors), and NRs (Nuclear Receptors)) to validate the proposed model; our method yielded high average accuracies of 89.21%, 85.49%, 81.02%, and 74.44%, respectively. To further verify the performance of our model, we compared the RoF classifier with two state-of-the-art algorithms: the support vector machine (SVM) and the k-nearest neighbor (KNN) classifier. We also compared it with other published methods. Moreover, the prediction results for the independent dataset further indicated that our method is effective for predicting potential DTIs. Thus, we believe that our method is suitable for facilitating drug discovery and development.
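
The 5-fold cross validation protocol described above can be sketched as an index split: the samples are shuffled, dealt into five folds, and each fold serves as the test set exactly once. This is a generic illustration under assumed names (`k_fold_indices`, `train_test_splits`) and an assumed sample count; it says nothing about how the authors implemented their split.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 23 is an arbitrary sample count chosen to show uneven fold sizes.
splits = list(train_test_splits(23, k=5))
```

The reported average accuracy for a dataset would then be the mean of the five per-fold test accuracies.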

PROTEIN SUBCELLULAR LOCALIZATION BASED ON PSI-BLAST AND MACHINE LEARNING

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720006002405 ◽

2006 ◽

Vol 04 (06) ◽

pp. 1181-1195 ◽

Cited By ~ 2

Author(s):

JIAN GUO ◽

XIAN PU ◽

YUANLIE LIN ◽

HOWARD LEUNG

Keyword(s):

Amino Acid ◽

Subcellular Localization ◽

Large Scale ◽

Probabilistic Neural Network ◽

Protein Profile ◽

Subcellular Location ◽

Amino Acid Sequences ◽

Sequence Information ◽

Protein Subcellular Localization ◽

Benchmark Datasets

Subcellular location is an important functional annotation of proteins. An automatic, reliable and efficient prediction system for protein subcellular localization is necessary for large-scale genome analysis. This paper describes a protein subcellular localization method that extracts features from protein profiles rather than from amino acid sequences. The protein profile represents a protein family and discards the part of the sequence information that is not conserved throughout the family; it is therefore more sensitive than the amino acid sequence. The amino acid compositions of the whole profile and of the N-terminus of the profile are extracted to train and test the probabilistic neural network classifiers. On two benchmark datasets, the overall accuracies of the proposed method reach 89.1% and 68.9%, respectively. The prediction results show that the proposed method performs better than methods based on amino acid sequences. The prediction results of the proposed method are also compared with Subloc on two redundancy-reduced datasets.
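
One plausible reading of "amino acid composition of a profile" is the average of each amino-acid column over the profile positions, computed once for the whole profile and once for its N-terminal rows. The sketch below follows that reading; the averaging, the N-terminal length, and the toy one-hot profile are all assumptions, since the abstract does not give the exact definition.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def profile_composition(profile):
    """Average each amino-acid column of a profile over all positions."""
    length = len(profile)
    return [sum(row[j] for row in profile) / length for j in range(len(AMINO_ACIDS))]

def profile_features(profile, n_terminal=2):
    """Concatenate whole-profile and N-terminal compositions into one vector."""
    return profile_composition(profile) + profile_composition(profile[:n_terminal])

def one_hot(residue):
    """Degenerate profile row that puts all weight on a single residue."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]

# Toy profile built from the sequence "MAAK"; a real PSI-BLAST profile
# would spread weight over several residues at each position.
profile = [one_hot(r) for r in "MAAK"]
features = profile_features(profile)  # 20 whole-profile + 20 N-terminal values
```

The resulting 40-dimensional vector has a fixed length regardless of protein length, which is what allows it to be passed to a classifier.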

PSL-Recommender: Protein Subcellular Localization Prediction using Recommender System

10.1101/462812 ◽

2018 ◽

Cited By ~ 1

Author(s):

Ruhollah Jamali ◽

Changiz Eslahchi ◽

Soheil Jahangiri-Tazehkand

Keyword(s):

Subcellular Localization ◽

Recommender System ◽

State Of The Art ◽

Subcellular Location ◽

Protein Subcellular Localization ◽

Protein Subcellular Location ◽

Wet Lab ◽

Protein Subcellular Localization Prediction ◽

Localization Prediction

Identifying a protein’s subcellular location is of great interest for understanding its function and behavior within the cell. In the last decade, many computational approaches have been proposed as a surrogate for the expensive and inefficient wet-lab methods used for protein subcellular localization. Yet, there is still much room for improving the prediction accuracy of these methods. PSL-Recommender (Protein subcellular location recommender) is a method that employs neighborhood regularized logistic matrix factorization to build a recommender system for protein subcellular localization. The effectiveness of the PSL-Recommender method is benchmarked on one human and three animal datasets. The results indicate that PSL-Recommender significantly outperforms state-of-the-art methods, improving on the previous best method by up to 31% in F1-mean, up to 28% in ACC, and up to 47% in AVG. The datasets and source code are available at: https://github.com/RJamali/PSL-Recommender
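
In logistic matrix factorization, each protein and each location is represented by a latent vector, and the probability that a protein resides in a location is the logistic function of their dot product. The sketch below shows only this scoring step with hypothetical, hand-picked vectors; learning the vectors and the neighborhood regularization used by PSL-Recommender are omitted.

```python
import math

def interaction_probability(protein_vec, location_vec):
    """Logistic matrix factorization score: sigmoid of the latent dot product."""
    score = sum(p * l for p, l in zip(protein_vec, location_vec))
    return 1.0 / (1.0 + math.exp(-score))

def rank_locations(protein_vec, location_vecs):
    """Sort location names from most to least probable for one protein."""
    probs = {name: interaction_probability(protein_vec, vec)
             for name, vec in location_vecs.items()}
    return sorted(probs, key=probs.get, reverse=True)

# Hypothetical latent vectors; in practice these are learned from data.
protein = [1.0, -0.5]
locations = {"nucleus": [2.0, 0.0], "cytoplasm": [0.0, 0.0], "membrane": [-1.0, 1.0]}
ranking = rank_locations(protein, locations)
```

Ranking locations by probability is what makes the approach resemble a recommender system: the top-ranked locations are "recommended" for the protein.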

PSL-Recommender: Protein Subcellular Localization Prediction using Recommender System

10.21203/rs.3.rs-878139/v1 ◽

2021 ◽

Author(s):

Ruhollah Jamali ◽

Soheil Jahangiri-Tazehkand ◽

Changiz Eslahchi

Keyword(s):

Subcellular Localization ◽

Subcellular Location ◽

Computational Method ◽

Protein Subcellular Localization ◽

Protein Subcellular Location ◽

Wet Lab ◽

Protein Subcellular Location Prediction ◽

Protein Subcellular Localization Prediction ◽

Localization Prediction

Identifying a protein’s subcellular location is of great interest for understanding its function and behavior within the cell. In the last decade, many computational approaches have been proposed as a surrogate for the expensive and labor-intensive wet-lab methods used for protein subcellular localization. Yet, there is still much room for improving the prediction accuracy of these methods. In this article, we aimed to develop a customized computational method rather than use the common machine learning predictors employed in the majority of computational research on this topic. The neighbourhood regularized logistic matrix factorization technique was used to create PSL-Recommender (Protein subcellular location recommender), a GO-based predictor. We present statistical inference as the driving force behind PSL-Recommender. The method was then benchmarked against twelve well-known methods on five different datasets and demonstrated outstanding performance. Finally, we discuss potential research avenues for developing a comprehensive tool for protein subcellular location prediction. The datasets and code are available at: https://github.com/RJamali/PSL-Recommender

pLoc_bal-mPlant: Predict Subcellular Localization of Plant Proteins by General PseAAC and Balancing Training Dataset

Current Pharmaceutical Design ◽

10.2174/1381612824666181119145030 ◽

2019 ◽

Vol 24 (34) ◽

pp. 4013-4022 ◽

Cited By ~ 28

Author(s):

Xiang Cheng ◽

Xuan Xiao ◽

Kuo-Chen Chou

Keyword(s):

Subcellular Localization ◽

Basic Research ◽

Training Dataset ◽

Sequence Information ◽

Plant Proteins ◽

Protein Subcellular Localization ◽

Computational Tools ◽

Validation Tests ◽

User Friendly ◽

Better Than

Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desirable to develop computational tools that identify their subcellular localization in a timely and effective manner based on the sequence information alone. Recently, a predictor called “pLoc-mPlant” was developed for identifying the subcellular localization of plant proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more effort is needed to improve it further. This is because pLoc-mPlant was trained on an extremely skewed dataset in which some subsets (i.e., the protein numbers for some subcellular locations) were more than 10 times larger than the others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset. To overcome this bias, we have developed a new and bias-free predictor called pLoc_bal-mPlant by balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mPlant, the existing state-of-the-art predictor for identifying the subcellular localization of plant proteins. To maximize convenience for the majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mPlant/, by which users can easily get their desired results without the need to go through the detailed mathematics.
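
The skew described above ("more than 10 times larger") can be quantified by counting proteins per location and dividing the largest count by the smallest. The sketch below does only that measurement on hypothetical annotations; it does not reproduce the balancing procedure used for pLoc_bal-mPlant, which the abstract does not describe.

```python
from collections import Counter

def imbalance_ratio(label_sets):
    """Ratio of the largest to the smallest per-location protein count."""
    counts = Counter(label for labels in label_sets for label in labels)
    return max(counts.values()) / min(counts.values()), counts

# Hypothetical multi-label annotations; a multiplex protein contributes
# to the count of each of its locations.
annotations = [{"chloroplast"}] * 22 + [{"chloroplast", "nucleus"}] * 2 + [{"vacuole"}] * 2
ratio, counts = imbalance_ratio(annotations)
```

In this toy set the chloroplast subset is twelve times larger than the nucleus and vacuole subsets, which is the kind of skew that motivates balancing the training data before fitting a predictor.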

Support vector machine approach for protein subcellular localization prediction

Bioinformatics ◽

10.1093/bioinformatics/17.8.721 ◽

2001 ◽

Vol 17 (8) ◽

pp. 721-728 ◽

Cited By ~ 553

Author(s):

S. Hua ◽

Z. Sun

Keyword(s):

Support Vector Machine ◽

Subcellular Localization ◽

Support Vector ◽

Protein Subcellular Localization ◽

Subcellular Localization Prediction ◽

Protein Subcellular Localization Prediction ◽

Localization Prediction

MpsLDA-ProSVM: predicting multi-label protein subcellular localization by wMLDAe dimensionality reduction and ProSVM classifier

10.1101/2020.04.19.049478 ◽

2020 ◽

Author(s):

Qi Zhang ◽

Shan Li ◽

Bin Yu ◽

Yang Li ◽

Yandan Zhang ◽

...

Keyword(s):

Subcellular Localization ◽

Nearest Neighbor ◽

Chemical Information ◽

Sequence Information ◽

Feature Subset ◽

Protein Subcellular Localization ◽

K Nearest Neighbor ◽

Entropy Weight ◽

Linear Discriminant ◽

Optimal Feature Subset

Proteins play a significant part in life processes such as cell growth, development, and reproduction. Exploring protein subcellular localization (SCL) is a direct way to better understand the function of proteins in cells. Studies have found that more and more proteins belong to multiple subcellular locations; these are called multi-label proteins. They not only play a key role in cell life activities but are also indispensable in medicine and drug development. This article presents a new prediction model, MpsLDA-ProSVM, to predict the SCL of multi-label proteins. Firstly, the physical and chemical information, evolutionary information, sequence information and annotation information of protein sequences are fused. Then, for the first time, a weighted multi-label linear discriminant analysis framework based on entropy weight form (wMLDAe) is used to refine the features and reduce the difficulty of learning. Finally, the optimal feature subset is input into the multi-label learning with label-specific features (LIFT) and multi-label k-nearest neighbor (ML-KNN) algorithms to obtain a synthetic ranking of relevant labels, and a Prediction and Relevance Ordering based SVM (ProSVM) classifier is then used to predict the SCLs. This method can rank and classify relevant labels at the same time, which greatly improves the efficiency of the model. In jackknife tests, the overall actual accuracy (OAA) on the virus, plant, Gram-positive bacteria and Gram-negative bacteria datasets is 98.06%, 98.97%, 99.81% and 98.49%, which is 0.56%-9.16%, 5.37%-30.87%, 3.51%-6.91% and 3.99%-8.59% higher than other advanced methods, respectively. The source code and datasets are available at https://github.com/QUST-AIBBDRC/MpsLDA-ProSVM/.
