SignalP 6.0 predicts all five types of signal peptides using protein language models

Author(s):  
Felix Teufel ◽  
José Juan Almagro Armenteros ◽  
Alexander Rosenberg Johansen ◽  
Magnús Halldór Gíslason ◽  
Silas Irby Pihl ◽  
...  

Abstract: Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.

2021 ◽  
Author(s):  
Felix Teufel ◽  
José Juan Almagro Armenteros ◽  
Alexander Rosenberg Johansen ◽  
Magnús Halldór Gíslason ◽  
Silas Irby Pihl ◽  
...  

Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. As experimental characterization of SPs is costly, prediction algorithms are applied to predict them from sequence data. However, existing methods are unable to detect all known types of SPs. We introduce SignalP 6.0, the first model capable of detecting all five SP types. Additionally, the model accurately identifies the positions of regions within SPs, revealing the defining biochemical properties that underlie the function of SPs in vivo. Results show that SignalP 6.0 has improved prediction performance, and is the first model to be applicable to metagenomic data. SignalP 6.0 is available at https://services.healthtech.dtu.dk/service.php?SignalP-6.0


1990 ◽  
Vol 269 (3) ◽  
pp. 691-696 ◽  
Author(s):  
M Prabhakaran

Signal peptides play a major role in an as-yet-undefined way in the translocation of proteins across membranes. The sequential arrangement of the chemical, physical and conformational properties of the signal and nascent amino acid sequences of the translocated proteins has been compiled and analysed in the present study. The sequence data of 126 signal peptides of length between 18 and 21 residues form the basis of this study. The statistical distribution of the following properties was studied: hydrophobicity, Mr, bulkiness, chromatographic index and preference for adopting α-helical, β-sheet and turn structures. The contribution of each property to the sequence arrangement was derived. A hydrophobic core sequence was found in all signal peptides investigated. The structural arrangement of the cleavage site was also clearly revealed by this study. Most of the physical properties of the individual sequences correlated (correlation coefficient approximately 0.4) very well with the average distribution. The preferred occupancy of amino acid residues in the signal and nascent sequences was also calculated and correlated with their property distribution. The periodic behaviour of the signal and nascent chains was revealed by calculating their hydrophobic moments for various repetitive conformations. A graphical analysis of average hydrophobic moments versus average hydrophobicity of peptides revealed the transmembrane characteristics of signal peptides and globular characteristics of the nascent peptides.
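
The hydrophobic moment mentioned above can be illustrated with a minimal Python sketch. It assumes the Kyte-Doolittle hydropathy scale and the commonly used definition of the moment as the length of the vector sum of per-residue hydrophobicities placed at a fixed angular step; the abstract names neither the scale nor the exact formula, so both are assumptions, and the function names are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (assumed scale; the abstract does not name one).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(seq):
    """Average hydropathy per residue of a non-empty sequence."""
    return sum(KD[aa] for aa in seq) / len(seq)


def hydrophobic_moment(seq, angle_deg=100.0):
    """Per-residue hydrophobic moment for a repetitive conformation.

    angle_deg is the angular step between successive side chains:
    roughly 100 degrees for an alpha-helix, 160-180 for a beta-strand.
    """
    delta = math.radians(angle_deg)
    sin_sum = sum(KD[aa] * math.sin(i * delta) for i, aa in enumerate(seq))
    cos_sum = sum(KD[aa] * math.cos(i * delta) for i, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)
```

Evaluating `hydrophobic_moment` over several values of `angle_deg` and plotting it against `mean_hydrophobicity` gives the kind of moment-versus-hydrophobicity comparison the abstract describes.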


2019 ◽  
Vol 15 (3) ◽  
pp. 206-211 ◽  
Author(s):  
Jihui Tang ◽  
Jie Ning ◽  
Xiaoyan Liu ◽  
Baoming Wu ◽  
Rongfeng Hu

Introduction: Machine learning is a useful tool for predicting cell-penetrating compounds as drug candidates. Materials and Methods: In this study, we developed a novel method for predicting the membrane-penetrating capability of cell-penetrating peptides (CPPs). We used orthogonal encoding to encode the amino acids, treating each amino acid position as one variable. IBM SPSS Modeler software and a dataset of 533 CPPs were then used for model screening. Results: The results indicated that a Support Vector Machine (SVM) model was suitable for predicting membrane-penetrating capability. For improvement, the three CPPs with the longest lengths were used to predict CPPs. The penetration capability can be predicted with an accuracy close to 95%. Conclusion: The results indicate that using amino acid position as a variable can be a promising method for predicting the membrane-penetrating capability of CPPs.
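
Orthogonal (one-hot) encoding with one block of variables per sequence position might look like the following Python sketch. The zero padding to a fixed length, the truncation of longer peptides, and the function name are assumptions made for illustration; the abstract does not say how peptides of different lengths were handled.

```python
# The 20 standard amino acids in a fixed order; each becomes one indicator column.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(peptide, max_len):
    """Encode a peptide as a flat 0/1 vector of length max_len * 20.

    Position p of the peptide owns the slice [p * 20, (p + 1) * 20).
    Peptides shorter than max_len are zero padded; longer ones are
    truncated (both are assumptions, not taken from the study).
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    width = len(AMINO_ACIDS)
    vector = [0] * (max_len * width)
    for pos, aa in enumerate(peptide[:max_len]):
        vector[pos * width + index[aa]] = 1
    return vector
```

Vectors built this way have the same length for every peptide, so they can be passed as feature rows to a classifier such as an SVM.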


1980 ◽  
Vol 187 (1) ◽  
pp. 65-74 ◽  
Author(s):  
D Penny ◽  
M D Hendy ◽  
L R Foulds

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.
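
To make the notion of tree length concrete, the sketch below scores a fixed binary tree with Fitch's small-parsimony algorithm. This is a swapped-in, standard technique for counting the minimum number of changes on a given tree. It is not the authors' taxon-by-taxon construction method, and the nested-tuple tree format and function names are assumptions for illustration.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one site on a rooted binary tree.

    tree   -- nested 2-tuples whose leaves are taxon names (strings)
    states -- dict mapping each taxon name to its character at this site
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # The children agree on at least one state: no extra change.
            return common, left_cost + right_cost
        # Disjoint state sets: one change is needed on this branch pair.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def tree_length(tree, alignment):
    """Sum of Fitch scores over all sites of an aligned set of sequences."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )
```

A minimal tree in this sense is a topology whose `tree_length` is the smallest over all candidate topologies for the same alignment.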


2020 ◽  
Author(s):  
Abel Debebe Mitiku ◽  
Dawit Tesfaye Degefu ◽  
Adane Abraham ◽  
Desta Mejan ◽  
Pauline Asami ◽  
...  

Abstract: Garlic is one of the most important Allium vegetables used for seasoning foods. It has many medicinal and nutritional benefits; however, its production is highly constrained by both biotic and abiotic challenges. Among these, viral infections are the most prevalent factors affecting crop productivity around the globe. This experiment was conducted on eleven selected garlic accessions and three improved varieties collected from different garlic-growing agro-climatic regions of Ethiopia. This study aimed to identify and characterize the isolated garlic viruses using the coat protein (CP) gene and to determine their phylogenetic relatedness. RNA was extracted from fresh young leaves of thirteen-day-old seedlings that showed yellowing, mosaic, and stunting symptoms. Pairwise molecular diversity for CP nucleotide and amino acid sequences was calculated using MEGA5. A Maximum Likelihood tree of the CP nucleotide sequence data of Allexivirus and Potyvirus was constructed using PhyML, while a neighbor-joining tree was constructed for the amino acid sequence data using MEGA5. Five garlic viruses were identified, viz. Garlic virus C (78.6%), Garlic virus D (64.3%), Garlic virus X (78.6%), Onion yellow dwarf virus (OYDV) (100%), and Leek yellow stripe virus (LYSV) (78.6%). The study revealed the presence of complex mixtures of viruses, with 42.9% of the samples co-infected with a species complex of Garlic virus C, Garlic virus D, Garlic virus X, OYDV, and LYSV. Pairwise comparisons of the isolated Potyvirus and Allexivirus species revealed high identity with known members of their respective species. As an exception, lower within-species identity was observed among Garlic virus C isolates compared with known members of the species. Finally, our results highlight the need to set up a working framework for virus-free garlic planting material exchange in the country, which could reduce viral gene flow across the country.
Author Summary: Garlic viruses cause the most devastating diseases of garlic, a crop that is highly vulnerable because of its vegetative mode of propagation. Currently, garlic viruses are the foremost production constraint in Ethiopia. However, so far very little is known about the identification, diversity, and dissemination of garlic-infecting viruses in the country. Here we explore the prevalence, genetic diversity, and presence of mixed infections of garlic viruses in Ethiopia using a next-generation sequencing platform. Analysis of nucleotide and amino acid sequences of coat protein genes from infected samples revealed the association of three species from Allexivirus and two species from Potyvirus in a complex mixture. The article concludes that it is high time to set up a working framework for a virus-free garlic planting material exchange platform, which could reduce viral gene flow across the country.


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e10381
Author(s):  
Rohit Nandakumar ◽  
Valentin Dinu

Throughout the history of drug discovery, an enzymatic-based approach for identifying new drug molecules has been primarily utilized. Recently, protein–protein interfaces that can be disrupted to identify small molecules that could be viable targets for certain diseases, such as cancer and the human immunodeficiency virus, have been identified. Existing studies computationally identify hotspots on these interfaces, with most models attaining accuracies of ~70%. Many studies do not effectively integrate information relating to amino acid chains and other structural information relating to the complex. Herein, (1) a machine learning model has been created and (2) its ability to integrate multiple features, such as those associated with amino-acid chains, has been evaluated to enhance the ability to predict protein–protein interface hotspots. Virtual drug screening analysis of a set of hotspots determined on the EphB2-ephrinB2 complex has also been performed. The predictive capabilities of this model offer an AUROC of 0.842, sensitivity/recall of 0.833, and specificity of 0.850. Virtual screening of a set of hotspots identified by the machine learning model developed in this study has identified potential medications to treat diseases caused by the overexpression of the EphB2-ephrinB2 complex, including prostate, gastric, colorectal and melanoma cancers which are linked to EphB2 mutations. The efficacy of this model has been demonstrated through its successful ability to predict drug-disease associations previously identified in literature, including cimetidine, idarubicin, pralatrexate for these conditions. In addition, nadolol, a beta blocker, has also been identified in this study to bind to the EphB2-ephrinB2 complex, and the possibility of this drug treating multiple cancers is still relatively unexplored.
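
The three reported metrics can be computed from labels and scores as in the Python sketch below. It assumes binary labels (1 for hotspot, 0 for non-hotspot) with both classes present, and it uses the pairwise-ranking definition of AUROC; the function names are hypothetical and the code is not taken from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the quadratic pairwise form,
    which is fine for small data sets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Sensitivity and specificity depend on the decision threshold used to produce `y_pred`, whereas AUROC summarizes ranking quality across all thresholds, which is why abstracts often report all three.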


1993 ◽  
Vol 4 (3) ◽  
pp. 287-292 ◽  
Author(s):  
D.L. Kauffman ◽  
P.J. Keller ◽  
A. Bennick ◽  
M. Blum

Human proline-rich proteins (PRPs) constitute a complex family of salivary proteins that are encoded by a small number of genes. The primary gene product is cleaved by proteases, thereby giving rise to about 20 secreted proteins. To determine the genes for the secreted PRPs, therefore, it is necessary to obtain sequences of both the secreted proteins and the DNA encoding these proteins. We have sequenced most PRPs from one donor (D.K.) and aligned the protein sequences with available DNA sequences from unrelated individuals. Partial sequence data have now been obtained for an additional PRP from D.K. named II-1. This protein was purified from parotid saliva by gel filtration and ion-exchange chromatography. Peptides were obtained by cleavage with trypsin, clostripain, and N-bromosuccinimide, followed by column chromatography. The peptides were sequenced on a gas-phase protein sequenator. Overlapping peptide sequences were obtained for most of II-1 and aligned with translated DNA sequences. The best fit was obtained with clones containing sequences for the allele PRB4" (Lyons et al., 1988). However, there was not complete identity of the protein amino acid sequence and the DNA-derived sequences, indicating that II-1 is not encoded by PRB4". Other PRPs isolated from D.K. also fail to conform to any DNA structure so far reported. This shows the need to obtain amino acid sequences and corresponding DNA sequences from the same person to assign genes for the PRPs and to determine the location of the postribosomal cleavage points in the primary translation product.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Abu Sayed Chowdhury ◽  
Sarah M. Reehl ◽  
Kylene Kehn-Hall ◽  
Barney Bishop ◽  
Bobbie-Jo M. Webb-Robertson

Abstract: The emergence of viral epidemics throughout the world is of concern due to the scarcity of available effective antiviral therapeutics. The discovery of new antiviral therapies is imperative to address this challenge, and antiviral peptides (AVPs) represent a valuable resource for the development of novel therapies to combat viral infection. We present a new machine learning model to distinguish AVPs from non-AVPs using the most informative features derived from the physicochemical and structural properties of their amino acid sequences. To focus on those features that are most likely to contribute to antiviral performance, we filter potential features based on their importance for classification. These feature selection analyses suggest that secondary structure is the most important peptide sequence feature for predicting AVPs. Our Feature-Informed Reduced Machine Learning for Antiviral Peptide Prediction (FIRM-AVP) approach achieves a higher accuracy than either the model with all features or current state-of-the-art single classifiers. Understanding the features that are associated with AVP activity is a core need to identify and design new AVPs in novel systems. The FIRM-AVP code and standalone software package are available at https://github.com/pmartR/FIRM-AVP with an accompanying web application at https://msc-viz.emsl.pnnl.gov/AVPR.


mSphere ◽  
2019 ◽  
Vol 4 (2) ◽  
Author(s):  
Marli Vlok ◽  
Andrew S. Lang ◽  
Curtis A. Suttle

Abstract: RNA viruses, particularly genetically diverse members of the Picornavirales, are widespread and abundant in the ocean. Gene surveys suggest that there are spatial and temporal patterns in the composition of RNA virus assemblages, but data on their diversity and genetic variability in different oceanographic settings are limited. Here, we show that specific RNA virus genomes have widespread geographic distributions and that the dominant genotypes are under purifying selection. Genomes from three previously unknown picorna-like viruses (BC-1, -2, and -3) assembled from a coastal site in British Columbia, Canada, as well as marine RNA viruses JP-A, JP-B, and Heterosigma akashiwo RNA virus exhibited different biogeographical patterns. Thus, biotic factors such as host specificity and viral life cycle, and not just abiotic processes such as dispersal, affect marine RNA virus distribution. Sequence differences relative to reference genomes imply that virus quasispecies are under purifying selection, with synonymous single-nucleotide variations dominating in genomes from geographically distinct regions resulting in conservation of amino acid sequences. Conversely, sequences from coastal South Africa that mapped to marine RNA virus JP-A exhibited more nonsynonymous mutations, probably representing amino acid changes that accumulated over a longer separation. This biogeographical analysis of marine RNA viruses demonstrates that purifying selection is occurring across oceanographic provinces. These data add to the spectrum of known marine RNA virus genomes, show the importance of dispersal and purifying selection for these viruses, and indicate that closely related RNA viruses are pathogens of eukaryotic microbes across oceans.
Importance: Very little is known about aquatic RNA virus populations and genome evolution. This is the first study that analyzes marine environmental RNA viral assemblages in an evolutionary and broad geographical context. This study contributes the largest marine RNA virus metagenomic data set to date, substantially increasing the sequencing space for RNA viruses and also providing a baseline for comparisons of marine RNA virus diversity. The new viruses discovered in this study are representative of the most abundant family of marine RNA viruses, the Marnaviridae, and expand our view of the diversity of this important group. Overall, our data and analyses provide a foundation for interpreting marine RNA virus diversity and evolution.
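
The distinction between synonymous and nonsynonymous variation drawn above can be sketched in Python as follows. The example assumes the standard genetic code, in-frame coding sequences of equal length without gaps, and a simple per-codon count; it is an illustration with hypothetical function names, not the authors' variant-calling pipeline, and it is not a dN/dS estimate.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(reference, variant):
    """Count codons that differ synonymously or nonsynonymously.

    Both inputs must be in-frame coding sequences of equal length.
    RNA input is accepted: U is converted to T before lookup.
    """
    ref = reference.upper().replace("U", "T")
    var = variant.upper().replace("U", "T")
    if len(ref) != len(var) or len(ref) % 3 != 0:
        raise ValueError("sequences must be equal-length and in frame")
    counts = {"synonymous": 0, "nonsynonymous": 0}
    for i in range(0, len(ref), 3):
        ref_codon, var_codon = ref[i:i + 3], var[i:i + 3]
        if ref_codon == var_codon:
            continue
        same_residue = CODON_TABLE[ref_codon] == CODON_TABLE[var_codon]
        counts["synonymous" if same_residue else "nonsynonymous"] += 1
    return counts
```

A genome region in which most differing codons fall in the synonymous bin keeps its amino acid sequence, which is the pattern the abstract interprets as purifying selection.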

