Moonlighting protein prediction using physico-chemical and evolutional properties via machine learning methods

Abstract Background: Moonlighting proteins (MPs) are a subclass of multifunctional proteins in which more than one independent or usually distinct function occurs in a single polypeptide chain. The main reasons to study MPs are to identify unknown cellular processes, understand novel protein mechanisms, improve the prediction of protein functions, and gain information about protein evolution. They also play an important role in disease pathways and drug-target discovery. Because detecting MPs experimentally is challenging, most of them have been discovered by chance. A computational approach to predicting MPs is therefore warranted. Results: In this study, we introduce a model for distinguishing MPs from non-MPs using features extracted from protein sequences, together with a scheme for detecting outlier proteins. In total, 37 distinct feature vectors were used to study each protein's impact on detecting MPs, and 8 different classification methods were assessed to find the best performance. To detect outliers, each classifier was run 100 times with tenfold cross-validation on each feature vector, and proteins that were misclassified 90 times or more were grouped. This process was applied to every feature vector, and the intersection of the resulting groups was taken as the set of outlier proteins. Tenfold cross-validation on a dataset of 351 samples (215 moonlighting and 136 non-moonlighting proteins) shows that the SVM method on all feature vectors has the highest performance among all methods in this study and other available methods. The outlier analysis showed that 57 of the 351 proteins in the dataset are candidate outliers. Among the outlier proteins, there were non-MPs (such as P69797) that were misclassified by 8 different classification methods with 16 different feature vectors. Because these proteins were obtained by computational methods, the results of this study cast doubt on the hypothesis that they are non-moonlighting at all. Conclusions: MPs are difficult to identify experimentally. Using distinct feature vectors, our method enabled the identification of novel moonlighting proteins. The study also indicates that a number of non-MPs are likely to be moonlighting.
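The outlier-detection procedure described in this abstract (repeated tenfold cross-validation, a misclassification threshold, then an intersection across feature vectors) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the nearest-centroid classifier is a toy stand-in for the eight classifiers the study assessed, and the abstract does not say how folds were drawn, so a plain random shuffle is assumed.

```python
import random
from collections import Counter


def nearest_centroid_predict(train_X, train_y, test_X):
    """Toy stand-in classifier (an assumption): pick the class with the closest mean."""
    sums, counts = {}, Counter()
    for x, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] += 1
    centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

    def closest(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return [closest(x) for x in test_X]


def misclassification_counts(X, y, predict, n_repeats=100, n_folds=10, seed=0):
    """For each sample, count the repeats in which it was misclassified while held out."""
    rng = random.Random(seed)
    n = len(X)
    errors = [0] * n
    for _ in range(n_repeats):
        order = list(range(n))
        rng.shuffle(order)
        # Each sample lands in exactly one fold per repeat.
        for k in range(n_folds):
            fold = order[k::n_folds]
            held_out = set(fold)
            train = [i for i in order if i not in held_out]
            preds = predict(
                [X[i] for i in train], [y[i] for i in train], [X[i] for i in fold]
            )
            for i, p in zip(fold, preds):
                if p != y[i]:
                    errors[i] += 1
    return errors


def outlier_candidates(feature_sets, y, predict, threshold=90, **cv_kwargs):
    """Intersect, across feature vectors, the samples misclassified >= threshold times."""
    groups = []
    for X in feature_sets:
        errors = misclassification_counts(X, y, predict, **cv_kwargs)
        groups.append({i for i, e in enumerate(errors) if e >= threshold})
    return set.intersection(*groups) if groups else set()
```

Here `feature_sets` is a list with one feature matrix per feature vector type, all describing the same proteins in the same order. The default `threshold=90` with `n_repeats=100` mirrors the numbers quoted in the abstract. To cover several classifiers, the same call would be repeated once per classifier and the resulting sets compared.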

Moonlighting protein prediction using physico-chemical and evolutional properties via machine learning methods

10.21203/rs.3.rs-126672/v1 ◽

2020 ◽

Author(s):

Farshid Shirafkan ◽

Sajjad Gharaghani ◽

Karim Rahimian ◽

Reza Sajedi ◽

Javad Zahiri

Keyword(s):

Cross Validation ◽

Classification Methods ◽

Drug Target Discovery ◽

Feature Vectors ◽

Single Polypeptide Chain ◽

Protein Functions ◽

Moonlighting Proteins ◽

Decision Tree Method ◽

Single Polypeptide ◽

Fold Cross Validation

Abstract Background: Moonlighting proteins (MPs) are a subclass of multifunctional proteins in which more than one independent or usually distinct function occurs in a single polypeptide chain. The main reasons to study MPs are to identify unknown cellular processes, understand novel protein mechanisms, improve the prediction of protein functions, and gain information about protein evolution. They also play an important role in disease pathways and drug-target discovery. Because detecting MPs experimentally is challenging, most of them have been discovered by chance. A computational approach to predicting MPs is therefore warranted. Results: In this study, we present a model for distinguishing moonlighting from non-moonlighting proteins using features extracted from protein sequences, followed by a scheme for detecting outlier proteins. To do so, 15 distinct feature vectors were used to study each one's effect on detecting MPs, and 8 different classification methods were assessed to find the best performance. To detect outliers, each classifier was run 100 times with 10-fold cross-validation on each feature vector, and proteins that were misclassified 80 times or more were grouped. This process was applied to every feature vector, and the intersection of the resulting groups was taken as the set of outlier proteins. The results of 10-fold cross-validation on a dataset of 351 samples (215 moonlighting and 136 non-moonlighting proteins) show that the decision tree method on all feature vectors has the highest performance among all methods in this research and among other available methods. The outlier analysis shows that 57 of the 351 proteins in the dataset are candidate outliers. Among the outlier proteins, there are non-moonlighting proteins (such as P69797) that have been misclassified by 8 different classification methods with 16 different feature types. Because these proteins were obtained by computational methods, the results of this study cast doubt on the hypothesis that they are non-moonlighting. Conclusions: Moonlighting proteins are difficult to identify experimentally. Our method enables the identification of novel moonlighting proteins using distinct feature vectors. It also indicates that a number of non-moonlighting proteins are likely to be moonlighting.

The Proportion for Splitting Data into Training and Test Set for the Bootstrap in Classification Problems

Business Systems Research Journal ◽

10.2478/bsrj-2021-0015 ◽

2021 ◽

Vol 12 (1) ◽

pp. 228-242

Author(s):

Borislava Vrigazova

Keyword(s):

Logistic Regression ◽

Decision Tree ◽

Cross Validation ◽

Computing Time ◽

Splitting Method ◽

Classification Methods ◽

Classification Problems ◽

Test Set ◽

Resampling Methods ◽

Tenfold Cross Validation

Abstract Background: The bootstrap can be an alternative to cross-validation as a training/test set splitting method, since it minimizes the computing time in classification problems compared with tenfold cross-validation. Objectives: This research investigates what proportion should be used to split the dataset into the training and the testing set so that the bootstrap is competitive in accuracy with other resampling methods. Methods/Approach: Different train/test split proportions are used with the following resampling methods: the bootstrap, leave-one-out cross-validation, tenfold cross-validation, and the random repeated train/test split, to test their performance on several classification methods. The classification methods used are logistic regression, the decision tree, and k-nearest neighbours. Results: The findings suggest that using a different structure of the test set (e.g. 30/70, 20/80) can further optimize the performance of the bootstrap when applied to logistic regression and the decision tree. For k-nearest neighbours, tenfold cross-validation with a 70/30 train/test splitting ratio is recommended. Conclusions: Depending on the characteristics and the preliminary transformations of the variables, the bootstrap can improve the accuracy of the classification problem.
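The difference between the two splitting schemes compared in this abstract can be sketched as below. This is only one plausible reading, not the paper's code: the function names are hypothetical, and it is an assumption that the bootstrap training set is drawn with replacement at a chosen fraction of the dataset size, with the samples never drawn (the out-of-bag samples) serving as the test set.

```python
import random


def bootstrap_split(n_samples, train_fraction=0.7, seed=0):
    """Draw training indices with replacement; test on the out-of-bag indices."""
    rng = random.Random(seed)
    n_train = max(1, round(train_fraction * n_samples))
    train = [rng.randrange(n_samples) for _ in range(n_train)]
    in_bag = set(train)
    # Samples that were never drawn form the test set.
    test = [i for i in range(n_samples) if i not in in_bag]
    return train, test


def tenfold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    for k in range(n_folds):
        test = order[k::n_folds]
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        yield train, test
```

A single bootstrap split needs one model fit, whereas tenfold cross-validation needs ten, which is the computing-time argument the abstract makes. Changing `train_fraction` corresponds to varying the train/test proportion under study.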

Multitalented actors inside and outside the cell: recent discoveries add to the number of moonlighting proteins

Biochemical Society Transactions ◽

10.1042/bst20190798 ◽

2019 ◽

Vol 47 (6) ◽

pp. 1941-1948 ◽

Cited By ~ 9

Author(s):

Constance J. Jeffery

Keyword(s):

Molecular Mechanisms ◽

Food Crops ◽

Bacterial Strains ◽

Cellular Processes ◽

Biochemical Pathways ◽

The Past ◽

Single Polypeptide Chain ◽

New Antibiotics ◽

Moonlighting Proteins ◽

Single Polypeptide

During the past few decades, it's become clear that many enzymes evolved not only to act as specific, finely tuned and carefully regulated catalysts, but also to perform a second, completely different function in the cell. In general, these moonlighting proteins have a single polypeptide chain that performs two or more distinct and physiologically relevant biochemical or biophysical functions. This mini-review describes examples of moonlighting proteins that have been found within the past few years, including some that play key roles in human and animal diseases and in the regulation of biochemical pathways in food crops. Several belong to two of the most common subclasses of moonlighting proteins: trigger enzymes and intracellular/surface moonlighting proteins, but a few represent less often observed combinations of functions. These examples also help illustrate some of the current methods used for identifying proteins with multiple functions. In general, a greater understanding about the functions and molecular mechanisms of moonlighting proteins, their roles in the regulation of cellular processes, and their involvement in health and disease could aid in many areas including developing new antibiotics, predicting the functions of the millions of proteins being identified through genome sequencing projects, designing novel proteins, using biological circuitry analysis to construct bacterial strains that are better producers of materials for industrial use, and developing methods to tweak biochemical pathways for increasing yields of food crops.

An introduction to protein moonlighting

Biochemical Society Transactions ◽

10.1042/bst20140226 ◽

2014 ◽

Vol 42 (6) ◽

pp. 1679-1683 ◽

Cited By ~ 79

Author(s):

Constance J. Jeffery

Keyword(s):

Transcription Factors ◽

Protein Sequence ◽

Polypeptide Chain ◽

Biochemical Pathways ◽

Single Polypeptide Chain ◽

Potential Benefits ◽

Moonlighting Proteins ◽

Structure Databases ◽

Single Polypeptide ◽

And Function

Moonlighting proteins comprise a class of multifunctional proteins in which a single polypeptide chain performs multiple physiologically relevant biochemical or biophysical functions. Almost 300 proteins have been found to moonlight. The known examples of moonlighting proteins include diverse types of proteins, including receptors, enzymes, transcription factors, adhesins and scaffolds, and different combinations of functions are observed. Moonlighting proteins are expressed throughout the evolutionary tree and function in many different biochemical pathways. Some moonlighting proteins can perform both functions simultaneously, but for others, the protein's function changes in response to changes in the environment. The diverse examples of moonlighting proteins already identified, and the potential benefits moonlighting proteins might provide to the organism, such as through coordinating cellular activities, suggest that many more moonlighting proteins are likely to be found. Continuing studies of the structures and functions of moonlighting proteins will aid in predicting the functions of proteins identified through genome sequencing projects, in interpreting results from proteomics experiments, in understanding how different biochemical pathways interact in systems biology, in annotating protein sequence and structure databases, in studies of protein evolution and in the design of proteins with novel functions.

MoonProt 3.0: an update of the moonlighting proteins database

Nucleic Acids Research ◽

10.1093/nar/gkaa1101 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D368-D372

Author(s):

Chang Chen ◽

Haipeng Liu ◽

Shadi Zabad ◽

Nina Rivera ◽

Emily Rowin ◽

...

Keyword(s):

Data Bank ◽

Transmembrane Helices ◽

Additional Information ◽

Single Polypeptide Chain ◽

Open Access Database ◽

Moonlighting Proteins ◽

Version 2.0 ◽

Single Polypeptide ◽

Scop Classification ◽

Relationship Of

Abstract MoonProt 3.0 (http://moonlightingproteins.org) is an updated open-access database storing expert-curated annotations for moonlighting proteins. Moonlighting proteins have two or more physiologically relevant distinct biochemical or biophysical functions performed by a single polypeptide chain. Here, we describe an expansion in the database since our previous report in the Database Issue of Nucleic Acids Research in 2018. For this release, the number of proteins annotated has been expanded to over 500, and dozens of protein annotations have been updated with additional information, including more structures in the Protein Data Bank, compared with version 2.0. The new entries include more examples from humans, plants and archaea, more proteins involved in disease, and proteins with different combinations of functions. More kinds of information about the proteins and the species in which they have multiple functions have been added, including CATH and SCOP classification of structure, known and predicted disorder, predicted transmembrane helices, type of organism, relationship of the protein to disease, and relationship of organism to cause of disease.

Pathogen Moonlighting Proteins: From Ancestral Key Metabolic Enzymes to Virulence Factors

Microorganisms ◽

10.3390/microorganisms9061300 ◽

2021 ◽

Vol 9 (6) ◽

pp. 1300

Author(s):

Luis Franco-Serrano ◽

David Sánchez-Redondo ◽

Araceli Nájar-García ◽

Sergio Hernández ◽

Isaac Amela ◽

...

Keyword(s):

Primary Metabolism ◽

Protein Targets ◽

Pathogen Virulence ◽

Single Polypeptide Chain ◽

Moonlighting Proteins ◽

Single Polypeptide ◽

Main Ideas ◽

Moonlighting Functions ◽

Work First ◽

Host Tissues

Moonlighting and multitasking proteins are proteins with two or more functions performed by a single polypeptide chain. A striking example of the Gain of Function (GoF) phenomenon in these proteins is that 25% of the moonlighting functions in our Multitasking Proteins Database (MultitaskProtDB-II) are related to pathogen virulence activity. Moreover, they usually have a canonical function belonging to highly conserved ancestral key functions, and their moonlighting functions are often involved in inducing extracellular matrix (ECM) protein remodeling. There are three main questions in the context of moonlighting proteins in pathogen virulence: (A) Why are a high percentage of pathogen moonlighting proteins involved in virulence? (B) Why do most of the canonical functions of these moonlighting proteins belong to primary metabolism? Moreover, why are they common in many pathogen species? (C) How are these different protein sequences and structures able to bind the same set of host ECM protein targets, mainly plasminogen (PLG), and colonize host tissues? By means of an extensive bioinformatics analysis, we suggest answers and approaches to these questions. Three main ideas are derived from the work: first, moonlighting proteins are not good candidates for vaccines. Second, several motifs that might be important in adhesion to the ECM were identified. Third, an overrepresentation of GO codes related to virulence was seen in moonlighting proteins.

Debranching enzyme from rabbit skeletal muscle; Evidence for the location of two active centres on a single polypeptide chain

FEBS Letters ◽

10.1016/0014-5793(75)80254-7 ◽

1975 ◽

Vol 58 (1-2) ◽

pp. 181-185 ◽

Cited By ~ 50

Author(s):

Edna J. Bates ◽

Gillian M. Heaton ◽

Carol Taylor ◽

John C. Kernohan ◽

Philip Cohen

Keyword(s):

Skeletal Muscle ◽

Polypeptide Chain ◽

Rabbit Skeletal Muscle ◽

Debranching Enzyme ◽

Single Polypeptide Chain ◽

Single Polypeptide ◽

Active Centres

An antibody VH domain with a lox-Cre site integrated into its coding region: bacterial recombination within a single polypeptide chain

FEBS Letters ◽

10.1016/0014-5793(95)01313-x ◽

1995 ◽

Vol 377 (1) ◽

pp. 92-96 ◽

Cited By ~ 73

Author(s):

Julian Davies ◽

Lutz Riechmann

Keyword(s):

Polypeptide Chain ◽

Coding Region ◽

Single Polypeptide Chain ◽

Vh Domain ◽

Single Polypeptide

N-Terminal amino acid sequence of rat tonin: homology with serine proteases

Canadian Journal of Biochemistry ◽

10.1139/o78-142 ◽

1978 ◽

Vol 56 (9) ◽

pp. 920-925 ◽

Cited By ~ 13

Author(s):

N. G. Seidah ◽

R. Routhier ◽

M. Caron ◽

M. Chrétien ◽

S. Demassieux ◽

...

Keyword(s):

Serine Proteases ◽

Terminal Sequence ◽

Renin Substrate ◽

Amino Terminal ◽

Single Polypeptide Chain ◽

Serine Protease Family ◽

Terminal Amino ◽

Single Polypeptide ◽

Carboxy Terminal ◽

Amino Terminal Sequence

In this paper, we present the amino-terminal sequence of rat tonin, an endopeptidase responsible for the conversion of angiotensinogen, the tetradecapeptide renin substrate, or angiotensin I to angiotensin II. It is shown that isoleucine and proline occupy the amino- and carboxy-terminal residues respectively. The N-terminal sequence analysis permitted the identification of 34 out of the first 40 residues of the single polypeptide chain composed of 272 amino acids. These results showed an extensive homology with the sequence of many serine proteases of the trypsin–chymotrypsin family. This information, coupled with the slow inhibition of tonin by diisopropylfluorophosphate, classified this enzyme as a selective endopeptidase of the active serine protease family.
