HLPpred-Fuse: improved and robust prediction of hemolytic peptide and its activity by fusing multiple feature representation

Abstract Motivation Therapeutic peptides failing at clinical trials could be attributed to their toxicity profiles like hemolytic activity, which hamper further progress of peptides as drug candidates. The accurate prediction of hemolytic peptides (HLPs) and its activity from the given peptides is one of the challenging tasks in immunoinformatics, which is essential for drug development and basic research. Although there are a few computational methods that have been proposed for this aspect, none of them are able to identify HLPs and their activities simultaneously. Results In this study, we proposed a two-layer prediction framework, called HLPpred-Fuse, that can accurately and automatically predict both hemolytic peptides (HLPs or non-HLPs) as well as HLPs activity (high and low). More specifically, feature representation learning scheme was utilized to generate 54 probabilistic features by integrating six different machine learning classifiers and nine different sequence-based encodings. Consequently, the 54 probabilistic features were fused to provide sufficiently converged sequence information which was used as an input to extremely randomized tree for the development of two final prediction models which independently identify HLP and its activity. Performance comparisons over empirical cross-validation analysis, independent test and case study against state-of-the-art methods demonstrate that HLPpred-Fuse consistently outperformed these methods in the identification of hemolytic activity. Availability and implementation For the convenience of experimental scientists, a web-based tool has been established at http://thegleelab.org/HLPpred-Fuse. Contact [email protected] or [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

PEPred-Suite: improved and robust prediction of therapeutic peptides using adaptive feature representation learning

Bioinformatics ◽

10.1093/bioinformatics/btz246 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4272-4280 ◽

Cited By ~ 21

Author(s):

Leyi Wei ◽

Chen Zhou ◽

Ran Su ◽

Quan Zou

Keyword(s):

Characteristic Curve ◽

Representation Learning ◽

Feature Representation ◽

Supplementary Information ◽

Optimization Strategy ◽

Feature Representations ◽

Random Forest Models ◽

Therapeutic Peptides ◽

Functional Peptides ◽

Robust Prediction

Abstract Motivation Prediction of therapeutic peptides is critical for the discovery of novel and efficient peptide-based therapeutics. Computational methods, especially machine learning based methods, have been developed for addressing this need. However, most of existing methods are peptide-specific; currently, there is no generic predictor for multiple peptide types. Moreover, it is still challenging to extract informative feature representations from the perspective of primary sequences. Results In this study, we have developed PEPred-Suite, a bioinformatics tool for the generic prediction of therapeutic peptides. In PEPred-Suite, we introduce an adaptive feature representation strategy that can learn the most representative features for different peptide types. To be specific, we train diverse sequence-based feature descriptors, integrate the learnt class information into our features, and utilize a two-step feature optimization strategy based on the area under receiver operating characteristic curve to extract the most discriminative features. Using the learnt representative features, we trained eight random forest models for eight different types of functional peptides, respectively. Benchmarking results showed that as compared with existing predictors, PEPred-Suite achieves better and robust performance for different peptides. As far as we know, PEPred-Suite is currently the first tool that is capable of predicting so many peptide types simultaneously. In addition, our work demonstrates that the learnt features can reliably predict different peptides. Availability and implementation The user-friendly webserver implementing the proposed PEPred-Suite is freely accessible at http://server.malab.cn/PEPred-Suite. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

Bioinformatics ◽

10.1093/bioinformatics/btaa701 ◽

2020 ◽

Cited By ~ 1

Author(s):

Amelia Villegas-Morcillo ◽

Stavros Makrodimitris ◽

Roeland C H J van Ham ◽

Angel M Gomez ◽

Victoria Sanchez ◽

...

Keyword(s):

Protein Function ◽

Prediction Models ◽

Protein Function Prediction ◽

3D Structure ◽

Function Prediction ◽

Feature Representation ◽

Training Data ◽

Supplementary Information ◽

Molecular Function ◽

Structure Information

Abstract Motivation Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. Results We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. Availability and implementation Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

TargetMM: accurate missense mutation prediction by utilizing local and global sequence information with classifier ensemble

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207323666201204140438 ◽

2020 ◽

Vol 23 ◽

Author(s):

Fang Ge ◽

Jun Hu ◽

Yi-Heng Zhu ◽

Muhammad Arif ◽

Dong-Jun Yu

Keyword(s):

Missense Mutation ◽

Prediction Models ◽

Feature Representation ◽

Classifier Ensemble ◽

Sequence Information ◽

Mutation Site ◽

Feature Extracting ◽

Independent Test ◽

Benchmark Datasets ◽

The Impact

Aim and Objective: Missense mutation (MM) may lead to various human diseases by disabling proteins. Accurate prediction of MM is important and challenging for both protein function annotation and drug design. Although several computational methods yielded acceptable success rates, there is still room for further enhancing the prediction performance of MM. Materials and Methods: In the present study, we designed a new feature extracting method, which considers the impact degree of residues in the microenvironment range to the mutation site. Stringent cross-validation and independent test on benchmark datasets were performed to evaluate the efficacy of the proposed feature extracting method. Furthermore, three heterogeneous prediction models were trained and then ensembled for the final prediction. Results: By combining the feature representation method and classifier ensemble technique, we reported a novel MM predictor called TargetMM for identifying the pathogenic mutations from the neutral ones. Conclusion: Comparison outcomes based on statistical evaluation demonstrate that TargetMM outperforms the prior advanced methods on the independent test data. The source code and benchmark datasets of TargetMM are freely available at https://github.com/sera616/TargetMM.git for academic use.

Download Full-text

Identification of Sub-Golgi protein localization by use of deep representation learning features

Bioinformatics ◽

10.1093/bioinformatics/btaa1074 ◽

2020 ◽

Author(s):

Zhibin Lv ◽

Pingping Wang ◽

Quan Zou ◽

Qinghua Jiang

Keyword(s):

Golgi Apparatus ◽

Protein Identification ◽

Protein Localization ◽

Protein Biosynthesis ◽

Representation Learning ◽

Eukaryotic Cell ◽

Feature Representation ◽

Supplementary Information ◽

Golgi Localization ◽

Golgi Protein

Abstract Motivation The Golgi apparatus has a key functional role in protein biosynthesis within the eukaryotic cell with malfunction resulting in various neurodegenerative diseases. For a better understanding of the Golgi apparatus, it is essential to identification of sub-Golgi protein localization. Although some machine learning methods have been used to identify sub-Golgi localization proteins by sequence representation fusion, more accurate sub-Golgi protein identification is still challenging by existing methodology. Results we developed a protein sub-Golgi localization identification protocol using deep representation learning features with 107 dimensions. By this protocol, we demonstrated that instead of multi-type protein sequence feature representation fusion as in previous state-of-the-art sub-Golgi-protein localization classifiers, it is sufficient to exploit only one type of feature representation for more accurately identification of sub-Golgi proteins. Compared with independent testing results for benchmark datasets, our protocol is able to perform generally, reliably, and robustly for sub-Golgi protein localization prediction. Availability A use-friendly webserver is freely accessible at http://isGP-DRLF.aibiochem.net and the prediction code is accessible at https://github.com/zhibinlv/isGP-DRLF. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

PPTPP: a novel therapeutic peptide prediction method using physicochemical property encoding and adaptive feature representation learning

Bioinformatics ◽

10.1093/bioinformatics/btaa275 ◽

2020 ◽

Vol 36 (13) ◽

pp. 3982-3987 ◽

Cited By ~ 2

Author(s):

Yu P Zhang ◽

Quan Zou

Keyword(s):

Physicochemical Property ◽

Prediction Method ◽

Peptide Identification ◽

Representation Learning ◽

Feature Representation ◽

Supplementary Information ◽

Therapeutic Peptide ◽

Therapeutic Peptides ◽

Peptide Prediction ◽

Massive Data Processing

Abstract Motivation Peptide is a promising candidate for therapeutic and diagnostic development due to its great physiological versatility and structural simplicity. Thus, identifying therapeutic peptides and investigating their properties are fundamentally important. As an inexpensive and fast approach, machine learning-based predictors have shown their strength in therapeutic peptide identification due to excellences in massive data processing. To date, no reported therapeutic peptide predictor can perform high-quality generic prediction and informative physicochemical properties (IPPs) identification simultaneously. Results In this work, Physicochemical Property-based Therapeutic Peptide Predictor (PPTPP), a Random Forest-based prediction method was presented to address this issue. A novel feature encoding and learning scheme were initiated to produce and rank physicochemical property-related features. Besides being capable of predicting multiple therapeutics peptides with high comparability to established predictors, the presented method is also able to identify peptides’ informative IPP. Results presented in this work not only illustrated the soundness of its working capacity but also demonstrated its potential for investigating other therapeutic peptides. Availability and implementation https://github.com/YPZ858/PPTPP. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

ELECTRA-DTA: A New Compound-protein Binding Affinity Prediction Model based on the Contextualized Sequence Encoding

10.21203/rs.3.rs-934203/v1 ◽

2021 ◽

Author(s):

junjie wang ◽

NaiFeng Wen ◽

Chunyu Wang ◽

Lingling Zhao ◽

Liang Cheng

Keyword(s):

Drug Discovery ◽

Binding Affinity ◽

Drug Target ◽

Large Scale ◽

Prediction Models ◽

Target Selection ◽

Drug Repurposing ◽

Representation Learning ◽

Search Space ◽

Feature Representation

Abstract Motivation: Drug-target binding affinity (DTA) reflects the strength of the drug-target interaction; therefore, predicting the DTA can considerably benefit drug discovery by narrowing the search space and pruning drug-target (DT) pairs with low binding affinity scores. Representation learning using deep neural networks has achieved promising performance compared with traditional machine learning methods; hence, extensive research efforts have been made in learning the feature representation of proteins and compounds. However, such feature representation learning relies on a large-scale labelled dataset, which is not always available.Results: We present an end-to-end deep learning framework, ELECTRA-DTA, to predict the binding affinity of drug-target pairs. This framework incorporates an unsupervised learning mechanism to train two ELECTRA-based contextual embedding models, one for protein amino acids and the other for compound SMILES string encoding. In addition, ELECTRA-DTA leverages a squeeze-and-excitation (SE) convolutional neural network block stacked over three fully connected layers to further capture the sequential and spatial features of the protein sequence and SMILES for the DTA regression task. Experimental evaluations show that ELECTRA-DTA outperforms various state-of-the-art DTA prediction models, especially with the challenging, interaction-sparse BindingDB dataset. In target selection and drug repurposing for COVID-19, ELECTRA-DTA also offers competitive performance, suggesting its potential in speeding drug discovery and generalizability for other compound- or protein-related computational tasks.

Download Full-text

pLoc_bal-mPlant: Predict Subcellular Localization of Plant Proteins by General PseAAC and Balancing Training Dataset

Current Pharmaceutical Design ◽

10.2174/1381612824666181119145030 ◽

2019 ◽

Vol 24 (34) ◽

pp. 4013-4022 ◽

Cited By ~ 28

Author(s):

Xiang Cheng ◽

Xuan Xiao ◽

Kuo-Chen Chou

Keyword(s):

Subcellular Localization ◽

Basic Research ◽

Training Dataset ◽

Sequence Information ◽

Plant Proteins ◽

Protein Subcellular Localization ◽

Computational Tools ◽

Validation Tests ◽

User Friendly ◽

Better Than

Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desired to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called “pLoc-mPlant” was developed for identifying the subcellular localization of plant proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mPlant was trained by an extremely skewed dataset in which some subsets (i.e., the protein numbers for some subcellular locations) were more than 10 times larger than the others. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. To overcome such biased consequence, we have developed a new and bias-free predictor called pLoc_bal-mPlant by balancing the training dataset. Cross-validation tests on exactly the same experimentconfirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mPlant, the existing state-of-the-art predictor in identifying the subcellular localization of plant proteins. To maximize the convenience for the majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mPlant/, by which users can easily get their desired results without the need to go through the detailed mathematics.

Download Full-text

pLoc_bal-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by General PseAAC and Quasi-balancing Training Dataset

Medicinal Chemistry ◽

10.2174/1573406415666181218102517 ◽

2019 ◽

Vol 15 (5) ◽

pp. 472-485 ◽

Cited By ~ 21

Author(s):

Kuo-Chen Chou ◽

Xiang Cheng ◽

Xuan Xiao

Keyword(s):

Drug Development ◽

Subcellular Localization ◽

Basic Research ◽

The Other ◽

Training Dataset ◽

Sequence Information ◽

Eukaryotic Proteins ◽

Validation Tests ◽

User Friendly ◽

Better Than

Background/Objective: Information of protein subcellular localization is crucially important for both basic research and drug development. With the explosive growth of protein sequences discovered in the post-genomic age, it is highly demanded to develop powerful bioinformatics tools for timely and effectively identifying their subcellular localization purely based on the sequence information alone. Recently, a predictor called “pLoc-mEuk” was developed for identifying the subcellular localization of eukaryotic proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems where many proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mEuk was trained by an extremely skewed dataset where some subset was about 200 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. Methods: To alleviate such bias, we have developed a new predictor called pLoc_bal-mEuk by quasi-balancing the training dataset. Cross-validation tests on exactly the same experimentconfirmed dataset have indicated that the proposed new predictor is remarkably superior to pLocmEuk, the existing state-of-the-art predictor in identifying the subcellular localization of eukaryotic proteins. It has not escaped our notice that the quasi-balancing treatment can also be used to deal with many other biological systems. Results: To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/. Conclusion: It is anticipated that the pLoc_bal-Euk predictor holds very high potential to become a useful high throughput tool in identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs that is currently a very hot trend trend in drug development.

Download Full-text

Joint feature representation learning and progressive distribution matching for cross-project defect prediction

Information and Software Technology ◽

10.1016/j.infsof.2021.106588 ◽

2021 ◽

Vol 137 ◽

pp. 106588

Author(s):

Quanyi Zou ◽

Lu Lu ◽

Zhanyu Yang ◽

Xiaowei Gu ◽

Shaojian Qiu

Keyword(s):

Representation Learning ◽

Feature Representation ◽

Defect Prediction ◽

Distribution Matching ◽

Cross Project

Download Full-text

mAHTPred: a sequence-based meta-predictor for improving the prediction of anti-hypertensive peptides using effective feature representation

Bioinformatics ◽

10.1093/bioinformatics/bty1047 ◽

2018 ◽

Vol 35 (16) ◽

pp. 2757-2765 ◽

Cited By ~ 63

Author(s):

Balachandran Manavalan ◽

Shaherin Basith ◽

Tae Hwan Shin ◽

Leyi Wei ◽

Gwang Lee

Keyword(s):

Nearest Neighbor ◽

Feature Representation ◽

Superior Performance ◽

Supplementary Information ◽

Gradient Boosting ◽

Support Vector ◽

Pharmaceutical Drugs ◽

K Nearest Neighbor ◽

Feature Descriptors ◽

Predicted Probability

AbstractMotivationCardiovascular disease is the primary cause of death globally accounting for approximately 17.7 million deaths per year. One of the stakes linked with cardiovascular diseases and other complications is hypertension. Naturally derived bioactive peptides with antihypertensive activities serve as promising alternatives to pharmaceutical drugs. So far, there is no comprehensive analysis, assessment of diverse features and implementation of various machine-learning (ML) algorithms applied for antihypertensive peptide (AHTP) model construction.ResultsIn this study, we utilized six different ML algorithms, namely, Adaboost, extremely randomized tree (ERT), gradient boosting (GB), k-nearest neighbor, random forest (RF) and support vector machine (SVM) using 51 feature descriptors derived from eight different feature encodings for the prediction of AHTPs. While ERT-based trained models performed consistently better than other algorithms regardless of various feature descriptors, we treated them as baseline predictors, whose predicted probability of AHTPs was further used as input features separately for four different ML-algorithms (ERT, GB, RF and SVM) and developed their corresponding meta-predictors using a two-step feature selection protocol. Subsequently, the integration of four meta-predictors through an ensemble learning approach improved the balanced prediction performance and model robustness on the independent dataset. Upon comparison with existing methods, mAHTPred showed superior performance with an overall improvement of approximately 6–7% in both benchmarking and independent datasets.Availability and implementationThe user-friendly online prediction tool, mAHTPred is freely accessible at http://thegleelab.org/mAHTPred.Supplementary informationSupplementary data are available at Bioinformatics online.

Download Full-text