USPNet: unbiased organism-agnostic signal peptide predictor with deep protein language model

2021 ◽  
Author(s):  
Shenyang Chen ◽  
QingXiong Tan ◽  
JingChen Li ◽  
Yu Li

A signal peptide is a short peptide located at the N-terminus of a protein. It plays an important role in targeting and transferring transmembrane and secreted proteins to their correct positions. Compared with traditional experimental methods for identifying and discovering signal peptides, computational methods are faster and more efficient, which makes them more practical for analysing thousands or even millions of protein sequences, especially in metagenomic data. Computational tools have therefore recently been proposed to classify signal peptides and predict cleavage-site positions, but most of them disregard the extreme data imbalance in these tasks. In addition, almost all of these methods rely on additional group information about the proteins to boost their performance, and that information may not always be available. To address these issues, we present the Unbiased Organism-agnostic Signal Peptide Network (USPNet), a signal peptide and cleavage-site prediction model based on a deep protein language model. We propose using label distribution-aware margin (LDAM) loss and evolutionary scale modeling (ESM) embeddings to handle the data imbalance and object-dependence problems. Extensive experimental results demonstrate that the proposed method significantly outperforms all previous methods in classification performance. An additional study on simulated metagenomic data further indicates that our model is a more universal and robust tool that does not depend on additional group information about the proteins, with the Matthews correlation coefficient improved by up to 17.5‰. The proposed method is potentially useful for discovering new signal peptides in abundant metagenomic data.
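
As a rough illustration of the LDAM idea named in the abstract above, the following minimal sketch subtracts a class-dependent margin from the true-class logit before computing cross-entropy, with larger margins for rarer classes. It is not USPNet's implementation: the function names, the `max_margin` value, the `scale` parameter and the class counts are all assumptions chosen for the example.

```python
import math


def ldam_margins(class_counts, max_margin=0.5):
    """Margins proportional to n_j ** -0.25, rescaled so the rarest class gets max_margin.

    max_margin is an assumed hyperparameter, not a value from the abstract.
    """
    raw = [1.0 / (n ** 0.25) for n in class_counts]
    factor = max_margin / max(raw)
    return [r * factor for r in raw]


def ldam_loss(logits, label, margins, scale=1.0):
    """Cross-entropy computed after lowering the true-class logit by its margin."""
    adjusted = [
        scale * (z - margins[j]) if j == label else scale * z
        for j, z in enumerate(logits)
    ]
    top = max(adjusted)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(z - top) for z in adjusted))
    return log_sum - adjusted[label]


# Hypothetical class counts: one frequent class and two rarer ones.
counts = [9000, 900, 100]
margins = ldam_margins(counts)

# The same raw confidence is penalised more when the true class is rare.
loss_major = ldam_loss([2.0, 0.0, 0.0], 0, margins)
loss_minor = ldam_loss([0.0, 0.0, 2.0], 2, margins)
```

Because the margin grows as the class count shrinks, a rare-class example needs a larger logit gap than a frequent-class example to reach the same loss.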

Microbiology ◽  
2009 ◽  
Vol 155 (7) ◽  
pp. 2375-2383 ◽  
Author(s):  
Nils Anders Leversen ◽  
Gustavo A. de Souza ◽  
Hiwa Målen ◽  
Swati Prasad ◽  
Inge Jonassen ◽  
...  

Secreted proteins play an important part in the pathogenicity of Mycobacterium tuberculosis, and are the primary source of vaccine and diagnostic candidates. A majority of these proteins are exported via the signal peptidase I-dependent pathway, and have a signal peptide that is cleaved off during the secretion process. Sequence similarities within signal peptides have spurred the development of several algorithms for predicting their presence as well as the respective cleavage sites. For proteins exported via this pathway, algorithms exist for eukaryotes, and for Gram-negative and Gram-positive bacteria. However, the unique structure of the mycobacterial membrane raises the question of whether the existing algorithms are suitable for predicting signal peptides within mycobacterial proteins. In this work, we have evaluated the performance of nine signal peptide prediction algorithms on a positive validation set, consisting of 57 proteins with a verified signal peptide and cleavage site, and a negative set, consisting of 61 proteins that have an N-terminal sequence that confirms the annotated translational start site. We found the hidden Markov model of SignalP v3.0 to be the best-performing algorithm for predicting the presence of a signal peptide in mycobacterial proteins. It predicted no false positives or false negatives, and predicted a correct cleavage site for 45 of the 57 proteins in the positive set. Based on these results, we used the hidden Markov model of SignalP v3.0 to analyse the 10 available annotated proteomes of mycobacterial species, including annotations of M. tuberculosis H37Rv from the Wellcome Trust Sanger Institute and the J. Craig Venter Institute (JCVI). When excluding proteins with transmembrane regions among the proteins predicted to harbour a signal peptide, we found between 7.8 and 10.5 % of the proteins in the proteomes to be putative secreted proteins. 
Interestingly, we observed a consistent difference in the percentage of predicted proteins between the Sanger Institute and JCVI. We have determined the most valuable algorithm for predicting signal peptidase I-processed proteins of M. tuberculosis, and used this algorithm to estimate the number of mycobacterial proteins with the potential to be exported via this pathway.
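
The counts reported above can be turned into summary metrics. The sketch below uses the Matthews correlation coefficient as one possible summary, which is an assumption (the abstract itself does not name a metric), and the function name `mcc` is hypothetical. The inputs are the figures stated in the abstract: 57 positives, 61 negatives, no false positives or false negatives, and 45 correct cleavage sites.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Presence/absence of a signal peptide: 57 positives, 61 negatives, no errors.
presence_mcc = mcc(tp=57, tn=61, fp=0, fn=0)

# Cleavage-site accuracy on the positive set: 45 of 57 sites predicted correctly.
cleavage_accuracy = 45 / 57
```

With no errors in either class the coefficient is 1, while the cleavage-site accuracy works out to roughly 79%.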


2019 ◽  
Vol 11 (5) ◽  
pp. 1327 ◽  
Author(s):  
Bei Zhou ◽  
Zongzhi Li ◽  
Shengrui Zhang ◽  
Xinfen Zhang ◽  
Xin Liu ◽  
...  

Hit-and-run (HR) crashes are crashes in which the driver of the offending vehicle flees the scene without aiding the possible victims or informing the authorities so that emergency medical services can be dispatched. This paper aims to identify significant predictors of HR and non-hit-and-run (NHR) outcomes in vehicle-bicycle crashes using the classification and regression tree (CART) method. An oversampling technique is applied to deal with the data imbalance problem, in which the number of minority instances (HR crashes) is much lower than the number of majority instances (NHR crashes). Police-reported data from the City of Chicago from September 2017 to August 2018 are used. The G-mean (geometric mean) is used to evaluate classification performance. Results indicate that, compared with the original CART model, the G-mean of the CART model incorporating data imbalance treatment rises from 23% to 61%, a relative increase of 171%. The decision tree reveals that the following five variables play the most important roles in classifying HR and NHR in vehicle-bicycle crashes: driver age, bicyclist safety equipment, driver action, trafficway type, and driver gender. Several countermeasures are recommended accordingly. The study demonstrates that, by incorporating data imbalance treatment, the CART method can provide much more robust classification results.
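
To make the G-mean concrete, here is a minimal sketch that computes it as the geometric mean of sensitivity and specificity, next to a simple random oversampler. The abstract does not say which oversampling technique was used, so random duplication is an assumption, and the labels below are toy values rather than the Chicago data.

```python
import math
import random


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Assumes a binary problem with at least one minority row.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    deficit = (len(labels) - len(minority_idx)) - len(minority_idx)
    picks = [rng.choice(minority_idx) for _ in range(max(0, deficit))]
    return rows + [rows[i] for i in picks], labels + [labels[i] for i in picks]


# Toy labels: 1 = HR (minority), 0 = NHR (majority).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_nhr = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, always_nhr)) / len(y_true)
score = g_mean(y_true, always_nhr)

rows = [[i] for i in range(len(y_true))]
balanced_rows, balanced_labels = random_oversample(rows, y_true)
```

A classifier that always predicts the majority class reaches 80% accuracy on this toy set but a G-mean of zero, which is why the G-mean is the more informative metric under imbalance.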


2016 ◽  
Vol 2016 ◽  
pp. 1-14 ◽  
Author(s):  
Chunlin Gong ◽  
Liangxian Gu

In many practical engineering applications, data are collected online. However, if the classes of these data are severely imbalanced, classification performance is restricted. In this paper, a novel classification approach is proposed to solve the online data imbalance problem by integrating a fast and efficient learning algorithm, the Extreme Learning Machine (ELM), with a typical sampling strategy, the synthetic minority oversampling technique (SMOTE). To reduce the severe imbalance, the major-class samples are divided into granules according to their distribution characteristics, and the original samples are replaced by the resulting granule cores to prepare a balanced sample set. In the online stage, we first perform granulation division on the minor class and then oversample with SMOTE in the regions around the granule cores and granule borders. The training sample set is thus gradually balanced and the online ELM model is dynamically updated. We also introduce fuzzy information entropy to prove theoretically that the proposed approach has a lower bound on model reliability after undersampling. Numerical experiments are conducted on two different kinds of datasets, and the results demonstrate that the proposed approach outperforms some state-of-the-art methods in terms of generalization performance and numerical stability.
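
The SMOTE step mentioned above can be sketched as interpolation between a minority sample and one of its nearest minority neighbours. This covers only the basic SMOTE idea. The granulation division, the ELM model and the online update described in the abstract are not reproduced, and the function name, parameters and sample points are assumptions made for illustration.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating towards nearby minority points.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance is enough for ranking neighbours.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the chosen base point.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist2(base, minority[j]),
        )[:k]
        target = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to target
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


# Hypothetical two-dimensional minority samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(minority, n_new=5, k=2)
```

Each synthetic point lies on a line segment between two existing minority samples, so the new data stay inside the region the minority class already occupies.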


Biologia ◽  
2009 ◽  
Vol 64 (4) ◽  
Author(s):  
Xiaohui Zhang ◽  
Yudang Li ◽  
Yudong Li

Gram-positive bacteria have been widely investigated for their great capacity to secrete proteins, such as those involved in gene expression, bacterial surface display and bacterial pathogenesis. The N-terminal signal peptide of a secretory protein is responsible for translocating the polypeptide through the cytoplasmic membrane. Recently, signal peptide prediction has become a major task in bioinformatics, and many programs based on different algorithms have been developed to predict signal peptides. In this paper, five prediction programs (SignalP 3.0, PrediSi, Phobius, SOSUIsignal and SIG-Pred) were selected and their prediction accuracy for signal peptides and cleavage sites was evaluated using 509 unbiased and experimentally verified Gram-positive protein sequences. The results showed that SignalP was the most accurate program for both signal peptide prediction (96% accuracy) and cleavage site prediction (83%). Prediction performance could be further improved by combining multiple methods into a consensus prediction, which increased the accuracy to 98% and reduced false positives to zero. When the consensus method was used to predict signal peptides in the extracellular proteins of Bacillus identified by proteomics, additional new signal peptides were successfully identified. It can be concluded that the consensus method would be useful for making signal peptide prediction more reliable.
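
One simple way to combine several predictors into a consensus is to require a minimum number of agreeing programs. The abstract does not state the exact consensus rule used, so the agreement threshold below is an assumption, and the per-program calls are invented values for a single hypothetical protein.

```python
def consensus_call(calls, min_agree=None):
    """Return True when at least min_agree programs predict a signal peptide.

    calls maps a program name to its yes/no prediction. When min_agree is
    omitted, a strict majority is required (an assumed default).
    """
    votes = sum(1 for predicted in calls.values() if predicted)
    if min_agree is None:
        min_agree = len(calls) // 2 + 1
    return votes >= min_agree


# Invented predictions for one protein; only the program names come from the abstract.
calls = {
    "SignalP 3.0": True,
    "PrediSi": True,
    "Phobius": True,
    "SOSUIsignal": False,
    "SIG-Pred": False,
}

majority = consensus_call(calls)
unanimous = consensus_call(calls, min_agree=len(calls))
```

Raising `min_agree` trades sensitivity for fewer false positives, which is the kind of trade-off a consensus approach makes adjustable.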


2021 ◽  
Vol 21 (S2) ◽  
Author(s):  
Kun Zeng ◽  
Yibin Xu ◽  
Ge Lin ◽  
Likeng Liang ◽  
Tianyong Hao

Background: Eligibility criteria are the primary strategy for screening the target participants of a clinical trial. Automated classification of clinical trial eligibility criteria text using machine learning methods improves recruitment efficiency and reduces the cost of clinical research. However, existing methods suffer from poor classification performance due to the complexity and imbalance of eligibility criteria text data.
Methods: An ensemble learning-based model with metric learning is proposed for eligibility criteria classification. The model integrates a set of pre-trained models including Bidirectional Encoder Representations from Transformers (BERT), A Robustly Optimized BERT Pretraining Approach (RoBERTa), XLNet, Pre-training Text Encoders as Discriminators Rather Than Generators (ELECTRA), and Enhanced Representation through Knowledge Integration (ERNIE). Focal loss is used as the loss function to address the data imbalance problem. Metric learning is employed to train the embedding of each base model so that features are easier to distinguish. Soft voting is applied to obtain the final classification of the ensemble model. The dataset is from standard evaluation task 3 of the 5th China Health Information Processing Conference and contains 38,341 eligibility criteria texts in 44 categories.
Results: Our ensemble method had an accuracy of 0.8497, a precision of 0.8229, and a recall of 0.8216 on the dataset. The macro F1-score was 0.8169, outperforming state-of-the-art baseline methods by 0.84% on average. In addition, the performance improvement had a p-value of 2.152e-07 with a standard t-test, indicating that our model achieved a significant improvement.
Conclusions: A model for classifying the eligibility criteria text of clinical trials based on multi-model ensemble learning and metric learning was proposed. The experiments demonstrated that the ensemble model significantly improved classification performance. In addition, metric learning improved the word embedding representation, and the focal loss reduced the impact of data imbalance on model performance.
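
Two of the components named above, focal loss and soft voting, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the focusing parameter `gamma=2.0` is an assumed value, the class-weighting term often used with focal loss is omitted, and the probabilities are invented.

```python
import math


def focal_loss(probs, label, gamma=2.0):
    """Focal loss for one example: cross-entropy scaled by (1 - p_true) ** gamma.

    probs[label] must be greater than zero.
    """
    p_true = probs[label]
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def soft_vote(prob_lists):
    """Average the class probabilities of several models and pick the best class."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    mean = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: mean[c])
    return best, mean


# A confidently correct example contributes far less than an uncertain one.
easy = focal_loss([0.9, 0.1], label=0)
hard = focal_loss([0.3, 0.7], label=0)

# Invented outputs of three base models for one two-class example.
winner, mean_probs = soft_vote([[0.6, 0.4], [0.2, 0.8], [0.55, 0.45]])
```

In this example two of the three models favour class 0, yet soft voting selects class 1 because the averaged probability is higher there. That is the practical difference between soft voting and a majority vote over predicted labels.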


2003 ◽  
Vol 31 (6) ◽  
pp. 1243-1247 ◽  
Author(s):  
B. Martoglio

Signal sequences are the addresses of proteins destined for secretion. In eukaryotic cells, they mediate targeting to the endoplasmic reticulum membrane and insertion into the translocon. Thereafter, signal sequences are cleaved from the pre-protein and liberated into the endoplasmic reticulum membrane. We have recently reported that some liberated signal peptides are further processed by the intramembrane-cleaving aspartic protease signal peptide peptidase. Cleavage in the membrane-spanning portion of the signal peptide promotes the release of signal peptide fragments from the lipid bilayer. Typical of processes that include intramembrane proteolysis is the regulatory or signalling function of the cleavage products. Likewise, signal peptide fragments liberated upon intramembrane cleavage may promote such post-targeting functions in the cell.


2017 ◽  
Vol 19 (1) ◽  
pp. 42-49
Author(s):  
Divya Agrawal ◽  
Padma Bonde

Prediction using classification techniques is a fundamental capability widely applied in various fields. Classification accuracy remains a great challenge because of the data imbalance problem. The increased volume of data also poses a challenge for data handling and prediction, particularly when technology is used as the interface between customers and the company. As data imbalance increases, it directly affects the classification accuracy of the entire system. AUC (area under the curve) and lift have proved to be good evaluation metrics. Classification techniques help to improve classification accuracy, but on imbalanced datasets classification does not perform well, and other techniques, such as oversampling, need to be used. This paper presents a voting-based ensemble technique to improve classification accuracy on imbalanced data. The ensemble takes votes on the best class obtained by three classification techniques, namely logistic regression, classification trees and discriminant analysis. The observed results show that the voting ensemble technique improves classification accuracy.

