Mitigating Data Scarcity in Protein Binding Prediction Using Meta-Learning

Abstract: A plethora of biological functions are performed through various types of protein-peptide binding. Prime examples include protein kinase phosphorylation of peptide substrates and the binding of the major histocompatibility complex to neoantigens in the immune system. Understanding the specificity of protein-peptide interactions is critical for unraveling the architectures of functional pathways and the mechanisms of cellular processes in human cells. Although mass-spectrometric techniques have been developed for the identification of protein-peptide interactions, our understanding of the preferences of proteins for their binding peptides is still rudimentary. As a complementary direction, a line of computational methods has recently been proposed to predict protein-peptide binding, efficiently providing rich functional annotations on a large scale. To achieve high prediction accuracy, these computational methods require a sufficient amount of data to build the prediction model. However, the number of experimentally verified protein-peptide bindings is often limited in real cases. For example, a majority of protein kinases have very few experimentally verified phosphorylation sites (e.g., fewer than 30 sites) in existing databases. These methods are thus limited to building accurate prediction models only for well-characterized proteins with a large volume of known binding peptides and cannot be extended to predict new binding peptides for less-studied proteins. In this paper, we introduce a generic framework to address this issue of data scarcity in protein binding prediction. We demonstrate the applicability of our framework in predicting kinase-specific phosphorylation sites. Our method uses an effective training strategy to build a prediction model with robust transferability. The model is able to predict the phosphorylation sites of a less-studied kinase, even if only a small number of phosphorylation sites are known for this kinase. To achieve this, we train the model via a meta-learning phase followed by a few-shot learning phase. We demonstrate that our framework has better transferability than state-of-the-art methods and is effective in utilizing limited data to accurately predict phosphorylation sites for less-characterized kinases. The implementation of our framework is available at https://github.com/luoyunan/MetaKinase.

Refining Vancomycin Protein Binding Estimates: Identification of Clinical Factors That Influence Protein Binding

Antimicrobial Agents and Chemotherapy ◽

10.1128/aac.01674-10 ◽

2011 ◽

Vol 55 (9) ◽

pp. 4277-4282 ◽

Cited By ~ 43

Author(s):

Jill M. Butterfield ◽

Nimish Patel ◽

Manjunath P. Pai ◽

Thomas G. Rosano ◽

George L. Drusano ◽

...

Keyword(s):

Protein Binding ◽

Prediction Model ◽

Total Protein ◽

Prediction Models ◽

Current Data ◽

Algebraic Expression ◽

Clinical Factors ◽

Free Concentration ◽

Vancomycin Concentration ◽

Pharmacologically Active

Abstract: While current data indicate that only free (unbound) drug is pharmacologically active and is most predictive of response, pharmacodynamic studies of vancomycin have been limited to measurement of total concentrations. The protein binding of vancomycin is thought to be approximately 50%, but considerable variability surrounds this estimate. The present study sought to determine the extent of vancomycin protein binding, to identify factors that modulate its binding, and to create and validate a prediction tool to estimate the extent of protein binding based on individual clinical factors. This single-site prospective cohort study included hospitalized adult patients treated with vancomycin and with a vancomycin serum concentration determination available. Linear regression was used to predict the free vancomycin concentration (f[vanco]) and to determine the clinical factors modulating vancomycin protein binding. Among the 50 patients in the study, the mean protein binding was 41.5%. The strongest predictor of f[vanco] was the total vancomycin concentration (total [vanco]), and this was modified by dialysis and total protein of ≥6.7 g/dl as covariates. The algebraic expression from the final prediction model was f[vanco] = 0.643 + 0.560 × total [vanco] − {0.067 × total [vanco] × D} − {0.071 × total [vanco] × TP}, where D = 1 if dialysis dependent or 0 if not dialysis dependent, and TP = 1 if total protein is ≥6.7 g/dl or 0 if total protein is <6.7 g/dl. The R² of the final prediction model was 0.959 (P < 0.001). Validation of our model was performed in 13 patients, and the predictive performance was highly favorable (R² was 0.9, and bias and precision were 0.18 and 0.18, respectively). Prediction models such as ours can be utilized in future pharmacokinetics and pharmacodynamics studies evaluating the exposure-response profile and to determine the pharmacodynamic target of interest as it relates to the free concentration.
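
The algebraic expression above can be written as a short function. This is only an illustrative sketch: the coefficients and the 6.7 g/dl threshold come from the abstract, while the function name, parameter names, and the derived percent-bound helper are assumptions introduced here. The abstract does not state concentration units, so the output is assumed to be in the same units as the input.

```python
def free_vancomycin(total_vanco, dialysis_dependent, total_protein_g_dl):
    """Estimate free vancomycin concentration from the published regression.

    total_vanco: total vancomycin concentration (units as in the source study).
    dialysis_dependent: True if the patient is dialysis dependent (D = 1).
    total_protein_g_dl: serum total protein in g/dl (TP = 1 if >= 6.7).
    """
    d = 1 if dialysis_dependent else 0
    tp = 1 if total_protein_g_dl >= 6.7 else 0
    return (0.643
            + 0.560 * total_vanco
            - 0.067 * total_vanco * d
            - 0.071 * total_vanco * tp)


def percent_bound(total_vanco, free_vanco):
    """Hypothetical helper: percent of drug bound implied by total and free values."""
    return 100.0 * (total_vanco - free_vanco) / total_vanco


if __name__ == "__main__":
    free = free_vancomycin(20.0, dialysis_dependent=False, total_protein_g_dl=6.0)
    print(round(free, 3), round(percent_bound(20.0, free), 1))
```

Because the two covariates enter only as interactions with the total concentration, each one lowers the slope (from 0.560 to 0.493 for dialysis, and to 0.489 for high total protein) rather than shifting the intercept.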

SKIPHOS: non-kinase specific phosphorylation site prediction with random forests and amino acid skip-gram embeddings

10.1101/793794 ◽

2019 ◽

Cited By ~ 1

Author(s):

Thanh Hai Dang ◽

Quang Thinh Trac ◽

Huy Kinh Phan ◽

Manh Cuong Nguyen ◽

Quynh Trang Pham Thi

Keyword(s):

Prediction Model ◽

Random Forests ◽

Protein Modification ◽

Prediction Models ◽

Phosphorylation Site ◽

Supplementary Information ◽

Phosphorylation Sites ◽

Site Prediction ◽

Link Type ◽

Development And Aging

Abstract. Motivation: Phosphorylation, which is catalyzed by kinase proteins, is among the two most common and widely studied types of known essential post-translational protein modification (PTM). Phosphorylation is known to regulate most cellular processes such as protein synthesis, cell division, signal transduction, cell growth, development and aging. Various phosphorylation site prediction models have been developed, which can be broadly categorized as being kinase-specific or non-kinase specific (general). Unlike the latter, the former requires a large enough number of experimentally known phosphorylation sites annotated with a given kinase for training the model, which is not the case in reality: less than 3% of the phosphorylation sites known to date have been annotated with a responsible kinase. To date, a few non-kinase specific phosphorylation site prediction models have been proposed. Results: This paper proposes SKIPHOS, a non-kinase specific phosphorylation site prediction model based on random forests on top of a continuous distributed representation of amino acids. Experimental results on the benchmark dataset and the independent test set demonstrate that SKIPHOS compares favorably to recent state-of-the-art related methods for three phosphorylation residues. Although trained on phosphorylation sites in mammals, SKIPHOS can yield better predictions for Y residues than PHOSFER, a recently proposed plant-specific phosphorylation prediction model. Availability and Implementation: SKIPHOS Web Server is freely available for non-commercial use at http://fit.uet.vnu.edu.vn/SKIPHOS or http://112.137.130.46:[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Faculty Opinions recommendation of The use of mRNA display to select high-affinity protein-binding peptides.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1005033.59206 ◽

2002 ◽

Author(s):

Michael Yarus

Keyword(s):

Protein Binding ◽

mRNA Display ◽

High Affinity ◽

Binding Peptides

Bioactivity Prediction Based on Matched Molecular Pair and Matched Molecular Series Methods

Current Pharmaceutical Design ◽

10.2174/1381612826666200427111309 ◽

2020 ◽

Vol 26 (33) ◽

pp. 4195-4205

Author(s):

Xiaoyu Ding ◽

Chen Cui ◽

Dingyan Wang ◽

Jihui Zhao ◽

Mingyue Zheng ◽

...

Keyword(s):

Prediction Model ◽

Large Scale ◽

Prediction Models ◽

Predictive Accuracy ◽

Lead Optimization ◽

Consensus Method ◽

Molecular Pair ◽

Bioactivity Prediction ◽

Compound Synthesis ◽

Consensus Modeling

Background: Enhancing a compound’s biological activity is the central task for lead optimization in small-molecule drug discovery. However, it is laborious to perform many iterative rounds of compound synthesis and bioactivity tests. To address this issue, there is high demand for high-quality in silico bioactivity prediction approaches that prioritize more active compound derivatives and reduce the trial-and-error process. Methods: Two kinds of bioactivity prediction models based on a large-scale structure-activity relationship (SAR) database were constructed. The first is based on the similarity of substituents and realized by matched molecular pair analysis, including SA, SA_BR, SR, and SR_BR. The second is based on SAR transferability and realized by matched molecular series analysis, including Single MMS pair, Full MMS series, and Multi single MMS pairs. Moreover, we also defined the application domain of the models by using a distance-based threshold. Results: Among the seven individual models, the Multi single MMS pairs bioactivity prediction model showed the best performance (R² = 0.828, MAE = 0.406, RMSE = 0.591), and the baseline model (SA) produced the lowest prediction accuracy (R² = 0.798, MAE = 0.446, RMSE = 0.637). The predictive accuracy could be further improved by consensus modeling (R² = 0.842, MAE = 0.397 and RMSE = 0.563). Conclusion: An accurate prediction model for bioactivity was built with a consensus method, which was superior to all individual models. Our model should be a valuable tool for lead optimization.
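
The three reported metrics and the idea of a consensus prediction can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say how the consensus was formed or which definition of R² was used, so the unweighted mean of individual model predictions and the coefficient-of-determination form of R² are assumptions made here, and all function names are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination (assumed definition, not confirmed by the source)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def consensus(predictions):
    """Assumed consensus rule: unweighted mean across models for each compound."""
    return [sum(column) / len(column) for column in zip(*predictions)]


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]
    model_a = [2.0, 3.0, 4.0, 5.0]   # systematically too high
    model_b = [0.0, 1.0, 2.0, 3.0]   # systematically too low
    combined = consensus([model_a, model_b])
    print(rmse(observed, model_a), rmse(observed, combined))
```

In the toy data the two models err in opposite directions, so averaging cancels the errors entirely; real gains from consensus modeling depend on how correlated the individual models' errors are.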

Fire modelling in Tasmanian buttongrass moorlands. III. Dead fuel moisture

International Journal of Wildland Fire ◽

10.1071/wf01025 ◽

2001 ◽

Vol 10 (2) ◽

pp. 241 ◽

Cited By ~ 27

Author(s):

Jon B. Marsden-Smedley ◽

Wendy R. Catchpole

Keyword(s):

Prediction Model ◽

Regression Model ◽

Fire Management ◽

Prediction Models ◽

Dew Point ◽

Seasonal Effects ◽

Experimental Program ◽

Fuel Moisture ◽

Fire Behaviour ◽

Fire Modelling

An experimental program was carried out in Tasmanian buttongrass moorlands to develop fire behaviour prediction models for improving fire management. This paper describes the results of the fuel moisture modelling section of this project. A range of previously developed fuel moisture prediction models are examined and three empirical dead fuel moisture prediction models are developed. McArthur’s grassland fuel moisture model gave predictions as good as those of a linear regression model using humidity and dew-point temperature. The regression model was preferred as a prediction model as it is inherently more robust. A prediction model based on hazard sticks was found to have strong seasonal effects, which need further investigation before hazard sticks can be used operationally.

A Genetic Algorithm Optimized RNN-LSTM Model for Remaining Useful Life Prediction of Turbofan Engine

Electronics ◽

10.3390/electronics10030285 ◽

2021 ◽

Vol 10 (3) ◽

pp. 285

Author(s):

Kwok Tai Chui ◽

Brij B. Gupta ◽

Pandian Vasant

Keyword(s):

Genetic Algorithm ◽

Feature Extraction ◽

Prediction Model ◽

Prediction Models ◽

Remaining Useful Life ◽

Prediction Algorithm ◽

Short Term ◽

Turbofan Engine ◽

Term Prediction ◽

Useful Life

Understanding the remaining useful life (RUL) of equipment is crucial for optimal predictive maintenance (PdM). This addresses the issues of equipment downtime and unnecessary maintenance checks in run-to-failure maintenance and preventive maintenance. Both feature extraction and the prediction algorithm play crucial roles in the performance of RUL prediction models. A benchmark dataset, namely the Turbofan Engine Degradation Simulation Dataset, was selected for performance analysis and evaluation. The proposed combination of complete ensemble empirical mode decomposition and wavelet packet transform for feature extraction could reduce the average root-mean-square error (RMSE) by 5.14–27.15% compared with six approaches. When it comes to the prediction algorithm, the results of the RUL prediction model could be that the equipment needs to be repaired or replaced within a shorter or a longer period of time. Incorporating this characteristic could enhance the performance of the RUL prediction model. In this paper, we propose an RUL prediction algorithm that combines a recurrent neural network (RNN) and long short-term memory (LSTM). The former has the advantage in short-term prediction, whereas the latter performs better in long-term prediction. The weights to combine RNN and LSTM were designed by non-dominated sorting genetic algorithm II (NSGA-II). It achieved an average RMSE of 17.2. It improved the RMSE by 6.07–14.72% compared with baseline models, stand-alone RNN, and stand-alone LSTM. Compared with existing works, the RMSE improvement of the proposed work is 12.95–39.32%.
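
The idea of combining two predictors with weights, and of reporting a relative RMSE improvement, can be sketched as below. This is an assumption-laden illustration: the abstract does not give the combination rule, so a single convex weight is assumed here, the weight is supplied by hand rather than found by NSGA-II, and all names and numbers are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted RUL values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def combine(rnn_pred, lstm_pred, w_rnn):
    """Assumed rule: convex combination w_rnn * RNN + (1 - w_rnn) * LSTM."""
    if not 0.0 <= w_rnn <= 1.0:
        raise ValueError("w_rnn must lie in [0, 1]")
    return [w_rnn * r + (1.0 - w_rnn) * l for r, l in zip(rnn_pred, lstm_pred)]


def rmse_improvement_pct(baseline_rmse, new_rmse):
    """Relative RMSE reduction in percent, as in 'improved the RMSE by X%'."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


if __name__ == "__main__":
    true_rul = [100.0, 50.0]
    rnn_pred = [110.0, 60.0]    # toy values, not from the paper
    lstm_pred = [90.0, 40.0]
    blended = combine(rnn_pred, lstm_pred, w_rnn=0.5)
    print(rmse(true_rul, rnn_pred), rmse(true_rul, blended))
```

In the paper the weights are chosen by a multi-objective genetic algorithm; a sketch like this only shows what is being optimized, not how the search is carried out.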

The Role of Board Independence and Ownership Structure in Improving the Efficacy of Corporate Financial Distress Prediction Model: Evidence from India

Journal of Risk and Financial Management ◽

10.3390/jrfm14070333 ◽

2021 ◽

Vol 14 (7) ◽

pp. 333

Author(s):

Shilpa H. Shetty ◽

Theresa Nithila Vincent

Keyword(s):

Prediction Model ◽

Ownership Structure ◽

Financial Distress ◽

Prediction Models ◽

Receiver Operating Curve ◽

Financial Measures ◽

Financial Variables ◽

Financial Distress Prediction ◽

Distress Prediction

The study aimed to investigate the role of non-financial measures in predicting corporate financial distress in the Indian industrial sector. The proportion of independent directors on the board and the proportion of the promoters’ share in the ownership structure of the business were the non-financial measures that were analysed, along with ten financial measures. The sample data consisted of 82 companies that had filed for bankruptcy under the Insolvency and Bankruptcy Code (IBC). An equal number of matching financially sound companies also constituted the sample. Therefore, the total sample size was 164 companies. Data for the five years immediately preceding the bankruptcy filing were collected for the sample companies. The data of 120 companies evenly drawn from the two groups of companies were used for developing the model and the remaining data were used for validating the developed model. Two binary logistic regression models were developed, M1 and M2, where M1 was formulated with both financial and non-financial variables, and M2 only had financial variables as predictors. The diagnostic ability of the model was tested with the aid of the receiver operating characteristic (ROC) curve, area under the curve (AUC), sensitivity, specificity and annual accuracy. The results of the study show that inclusion of the two non-financial variables improved the efficacy of the financial distress prediction model. This study made a unique attempt to provide empirical evidence on the role played by non-financial variables in improving the efficiency of corporate distress prediction models.
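
The diagnostic measures named above can be computed from predicted labels and scores as in the following sketch. It is a generic illustration, not the authors' procedure: the label coding (1 = distressed, 0 = sound), the function names, and the toy data are assumptions, and AUC is computed with the standard pairwise (Mann-Whitney) formulation.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 1 = distressed, 0 = sound."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(y_true, y_pred):
    """Share of distressed companies correctly flagged."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    """Share of sound companies correctly cleared."""
    _, tn, fp, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]              # toy data, not from the study
    scores = [0.9, 0.4, 0.5, 0.1]      # predicted probability of distress
    predicted = [1 if s >= 0.6 else 0 for s in scores]
    print(sensitivity(labels, predicted), specificity(labels, predicted), auc(labels, scores))
```

Sensitivity and specificity depend on the chosen cut-off (0.6 in the toy example), whereas AUC summarizes ranking quality across all cut-offs, which is why studies of this kind usually report both.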

Predicting Change Prone Classes in Open Source Software

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2018100101 ◽

2018 ◽

Vol 8 (4) ◽

pp. 1-23 ◽

Cited By ~ 2

Author(s):

Deepa Godara ◽

Amit Choudhary ◽

Rakesh Kumar Singh

Keyword(s):

Prediction Model ◽

Open Source ◽

Open Source Software ◽

Prediction Models ◽

New Technology ◽

Modern Technology ◽

Time Frequency ◽

Rigorous Testing ◽

Technology Changes ◽

Sensitivity Specificity

In today's world, the heart of modern technology is software. In order to keep pace with new technology, changes in software are inevitable. This article examines the association between changes and object-oriented metrics using different versions of open source software. Change prediction models can detect the probability of change in a class earlier in the software life cycle, which would result in better effort allocation, more rigorous testing and easier maintenance of any software. Previously, researchers have used various techniques such as statistical methods for the prediction of change-prone classes. In this article, some new metrics such as execution time, frequency, run time information, popularity and class dependency are proposed which can help in the prediction of change-prone classes. For evaluating the performance of the prediction model, the authors used sensitivity, specificity, and the ROC curve. Higher values of AUC indicate that the prediction model gives significantly accurate results. The proposed metrics contribute to the accurate prediction of change-prone classes.

Building prediction models for coronary heart disease by synthesizing multiple longitudinal research findings

European Journal of Cardiovascular Prevention & Rehabilitation ◽

10.1097/01.hjr.0000173109.14228.71 ◽

2005 ◽

Vol 12 (5) ◽

pp. 459-464 ◽

Cited By ~ 4

Author(s):

Guizhou Hu ◽

Martin M. Root

Keyword(s):

Coronary Heart Disease ◽

Heart Disease ◽

Prediction Model ◽

Empirical Model ◽

Complex Disease ◽

Prediction Models ◽

Longitudinal Research ◽

Study Data ◽

Individual Risk ◽

Data Set

Background: No methodology is currently available to allow the combining of individual risk factor information derived from different longitudinal studies for a chronic disease in a multivariate fashion. This paper introduces such a methodology, named Synthesis Analysis, which is essentially a multivariate meta-analytic technique. Design: The construction and validation of statistical models using available data sets. Methods and results: Two analyses are presented. (1) With the same data, Synthesis Analysis produced a similar prediction model to the conventional regression approach when using the same risk variables. Synthesis Analysis produced better prediction models when additional risk variables were added. (2) A four-variable empirical logistic model for death from coronary heart disease was developed with data from the Framingham Heart Study. A synthesized prediction model with five new variables added to this empirical model was developed using Synthesis Analysis and literature information. This model was then compared with the four-variable empirical model using the first National Health and Nutrition Examination Survey (NHANES I) Epidemiologic Follow-up Study data set. The synthesized model had significantly improved predictive power (χ² = 43.8, P < 0.00001). Conclusions: Synthesis Analysis provides a new means of developing complex disease predictive models from the medical literature.

Efficient Identification of Novel HLA-A*0201–Presented Cytotoxic T Lymphocyte Epitopes in the Widely Expressed Tumor Antigen PRAME by Proteasome-Mediated Digestion Analysis

Journal of Experimental Medicine ◽

10.1084/jem.193.1.73 ◽

2001 ◽

Vol 193 (1) ◽

pp. 73-88 ◽

Cited By ~ 184

Author(s):

Jan H. Kessler ◽

Nico J. Beekman ◽

Sandra A. Bres-Vloemans ◽

Pauline Verdijk ◽

Peter A. van Veelen ◽

...

Keyword(s):

T Lymphocyte ◽

Cytotoxic T Lymphocyte ◽

Epitope Prediction ◽

Binding Prediction ◽

CTL Epitopes ◽

High Affinity ◽

Tumor Associated Antigen ◽

Immunotherapy Of Cancer ◽

Binding Peptides

We report the efficient identification of four human histocompatibility leukocyte antigen (HLA)-A*0201–presented cytotoxic T lymphocyte (CTL) epitopes in the tumor-associated antigen PRAME using an improved “reverse immunology” strategy. Next to motif-based HLA-A*0201 binding prediction and actual binding and stability assays, analysis of in vitro proteasome-mediated digestions of polypeptides encompassing candidate epitopes was incorporated in the epitope prediction procedure. Proteasome cleavage pattern analysis, in particular determination of correct COOH-terminal cleavage of the putative epitope, allows a far more accurate and selective prediction of CTL epitopes. Only 4 of 19 high affinity HLA-A*0201 binding peptides (21%) were found to be efficiently generated by the proteasome in vitro. This approach avoids laborious CTL response inductions against high affinity binding peptides that are not processed and limits the number of peptides to be assayed for binding. CTL clones induced against the four identified epitopes (VLDGLDVLL, PRA100–108; SLYSFPEPEA, PRA142–151; ALYVDSLFFL, PRA300–309; and SLLQHLIGL, PRA425–433) lysed melanoma, renal cell carcinoma, lung carcinoma, and mammary carcinoma cell lines expressing PRAME and HLA-A*0201. This indicates that these epitopes are expressed on cancer cells of diverse histologic origin, making them attractive targets for immunotherapy of cancer.
