Software-aided workflow for predicting protease-specific cleavage sites using physicochemical properties of the natural and unnatural amino acids in peptide-based drug discovery: Peptide cleavage sites prediction workflow

2018 ◽  
Author(s):  
Tatiana Radchenko ◽  
Fabien Fontaine ◽  
Luca Morettoni ◽  
Ismael Zamora

Peptide drugs have been used in the treatment of multiple pathologies. During peptide discovery, it is crucially important to be able to map the potential protease cleavage sites. This knowledge is later used to chemically modify the peptide drug to adapt it for therapeutic use, making the peptide stable against individual proteases or in complex media. In other cases, the peptide must be made specifically unstable toward certain proteases, since peptides can also serve as systems for targeted drug delivery to specific tissues or cells. Information about proteases, their cleavage sites, and their substrates is widely spread across publications and collected in databases such as MEROPS. It is therefore possible to develop models that improve the understanding of potential peptide drug proteolysis. We propose a new workflow to derive protease specificity rules and predict the potential scissile bonds in peptides for individual proteases. WebMetabase stores information from experimental or external sources in a chemically aware database, where each peptide and cleavage site is represented as a sequence of structural blocks connected by amide bonds and characterized by physicochemical properties described by Volsurf descriptors. This methodology can therefore also be applied to non-standard amino acids. A frequency analysis can be performed in WebMetabase to discover the most frequent cleavage sites. These results were used to train several models, using logistic regression, support vector machines, and ensemble tree classifiers, to map cleavage sites for several human proteases from four families (serine, cysteine, aspartic, and matrix metalloproteases). Finally, we compared the predictive performance of the developed models with two publicly available tools, PROSPERous and SitePrediction.
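
As a rough illustration of the modeling step described above, the sketch below trains a logistic regression on per-bond physicochemical descriptors. It is not the authors' WebMetabase pipeline, and the descriptor matrix is a synthetic stand-in for VolSurf-style features.

```python
# Minimal sketch: each candidate amide bond is described by physicochemical
# descriptors of its flanking residues, and a classifier predicts whether the
# bond is scissile. All values below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy data: 200 candidate bonds x 8 descriptors (think size/hydrophobicity
# of the P2, P1, P1', P2' positions), with binary cleavage labels.
X = rng.normal(size=(200, 8))
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
print("CV ROC-AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```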

Cancers ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 2133
Author(s):  
Francisco O. Cortés-Ibañez ◽  
Sunil Belur Nagaraj ◽  
Ludo Cornelissen ◽  
Gerjan J. Navis ◽  
Bert van der Vegt ◽  
...  

Cancer incidence is rising, and accurate prediction of incident cancers could be relevant to understanding and reducing cancer incidence. The aim of this study was to develop machine learning (ML) models to predict an incident diagnosis of cancer. Participants without any history of cancer within the Lifelines population-based cohort were followed for a median of 7 years. Data were available for 116,188 cancer-free participants and 4232 incident cancer cases. At baseline, socioeconomic, lifestyle, and clinical variables were assessed. The main outcome was an incident cancer during follow-up (excluding skin cancer), based on linkage with the national pathology registry. The performance of three ML algorithms was evaluated using supervised binary classification to identify incident cancers among participants. Elastic net regularization and the Gini index were used for variable selection. An overall area under the receiver operating characteristic curve (AUC) <0.75 was obtained; the highest AUC values were for prostate cancer (random forest AUC = 0.82 (95% CI 0.77–0.87), logistic regression AUC = 0.81 (95% CI 0.76–0.86), and support vector machine AUC = 0.83 (95% CI 0.78–0.88)). Age was the most important predictor in these models. Linear and non-linear ML algorithms including socioeconomic, lifestyle, and clinical variables produced a moderate predictive performance for incident cancers in the Lifelines cohort.
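
The two-stage scheme the abstract describes (elastic net screening followed by a classifier evaluated by AUC) can be sketched as follows. The synthetic data and the specific hyperparameters are illustrative assumptions, not the study's actual Lifelines setup.

```python
# Sketch: elastic net (L1+L2) logistic regression as a variable screen,
# then a random forest scored by AUC on a held-out set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for baseline cohort variables.
X, y = make_classification(n_samples=5000, n_features=50, n_informative=8,
                           weights=[0.96, 0.04], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.1, max_iter=5000)
selector = SelectFromModel(enet).fit(X_tr, y_tr)  # keep non-zero coefficients

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(selector.transform(X_tr), y_tr)
proba = rf.predict_proba(selector.transform(X_te))[:, 1]
print("AUC:", roc_auc_score(y_te, proba))
```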


2021 ◽  
Vol 13 (4) ◽  
pp. 581 ◽  
Author(s):  
Yuanyuan Fu ◽  
Guijun Yang ◽  
Xiaoyu Song ◽  
Zhenhong Li ◽  
Xingang Xu ◽  
...  

Rapid and accurate estimation of crop aboveground biomass is beneficial for high-throughput phenotyping and site-specific field management. This study explored the utility of high-definition digital images acquired by a low-flying unmanned aerial vehicle (UAV), together with ground-based hyperspectral data, for improved estimates of winter wheat biomass. To extract fine textures characterizing the variations in winter wheat canopy structure across growing seasons, we proposed a multiscale texture extraction method (Multiscale_Gabor_GLCM) that takes advantage of multiscale Gabor transformation and gray-level co-occurrence matrix (GLCM) analysis. Narrowband normalized difference vegetation indices (NDVIs) involving all possible two-band combinations, as well as continuum removal of red-edge spectra (SpeCR), were also extracted for biomass estimation. Subsequently, non-parametric linear (partial least squares regression, PLSR) and nonlinear (least squares support vector machine, LSSVM) regression analyses were conducted using the extracted spectral features, the multiscale textural features, and combinations thereof. For the first time, the visualization technique of LSSVM was utilized to select the multiscale textures that contributed most to biomass estimation. Compared with the best-performing NDVI (1193, 1222 nm), SpeCR yielded a higher coefficient of determination (R2), lower root mean square error (RMSE), and lower mean absolute error (MAE) for winter wheat biomass estimation, and significantly alleviated the saturation problem once biomass exceeded 800 g/m2. The predictive performance of the PLSR and LSSVM regression models based on SpeCR decreased with increasing bandwidth, especially at bandwidths larger than 11 nm. Both the PLSR and LSSVM regression models based on the multiscale textures produced higher accuracies than those based on single-scale GLCM textures. According to the evaluation of variable importance, the texture metric "Mean" at different scales was the most influential for winter wheat biomass. Using just 10 multiscale textures largely improved predictive performance over using all textures and achieved an accuracy comparable to using SpeCR. The LSSVM regression model based on the combination of the selected multiscale textures and SpeCR with a bandwidth of 9 nm produced the highest estimation accuracy, with R2val = 0.87, RMSEval = 119.76 g/m2, and MAEval = 91.61 g/m2. However, the combination did not significantly improve estimation accuracy compared to using SpeCR or the multiscale textures alone. The accuracy of the biomass predicted by the LSSVM regression models was higher than that of the PLSR models, demonstrating that LSSVM is a strong candidate for characterizing winter wheat biomass across multiple growth stages. The study suggests that multiscale textures derived from high-definition UAV-based digital images are competitive with hyperspectral features for predicting winter wheat biomass.
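
A hedged sketch of the Multiscale_Gabor_GLCM idea, assuming scikit-image (0.19 or later, where the functions are spelled graycomatrix/graycoprops): a bank of Gabor filters at several scales and orientations is applied, and GLCM statistics are computed on each filter response. A stock test image stands in for the UAV canopy imagery, and the frequencies, orientations, and GLCM properties chosen here are illustrative, not the paper's exact configuration.

```python
# Sketch: multiscale Gabor filtering followed by GLCM texture statistics.
import numpy as np
from skimage import data, img_as_ubyte
from skimage.filters import gabor
from skimage.feature import graycomatrix, graycoprops

image = data.camera()  # stand-in for one band of a UAV image

features = []
for frequency in (0.05, 0.1, 0.2):           # several Gabor scales
    for theta in (0, np.pi / 4, np.pi / 2):  # several orientations
        real, _ = gabor(image, frequency=frequency, theta=theta)
        # Rescale the response to uint8 so a 256-level GLCM can be built.
        resp = img_as_ubyte((real - real.min()) / (np.ptp(real) + 1e-9))
        glcm = graycomatrix(resp, distances=[1], angles=[0],
                            levels=256, symmetric=True, normed=True)
        features += [graycoprops(glcm, "contrast")[0, 0],
                     graycoprops(glcm, "homogeneity")[0, 0]]

print(len(features), "multiscale texture features extracted")
```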


2021 ◽  
Vol 22 (16) ◽  
pp. 8958
Author(s):  
Phasit Charoenkwan ◽  
Chanin Nantasenamat ◽  
Md. Mehedi Hasan ◽  
Mohammad Ali Moni ◽  
Pietro Lio’ ◽  
...  

Accurate identification of bitter peptides is of great importance for better understanding their biochemical and biophysical properties. To date, machine learning-based methods have become effective approaches for identifying potential bitter peptides from large-scale protein datasets. Although a few machine learning-based predictors have been developed for identifying the bitterness of peptides, their prediction performance could still be improved. In this study, we developed a new predictor (named iBitter-Fuse) for more accurate identification of bitter peptides. In the proposed iBitter-Fuse, we integrated a variety of feature encoding schemes to provide sufficient information from different aspects, namely compositional information and physicochemical properties. To enhance predictive performance, a customized genetic algorithm utilizing a self-assessment report (GA-SAR) was employed to identify informative features, and the optimal ones were then input into a support vector machine (SVM)-based classifier to develop the final model (iBitter-Fuse). Benchmarking experiments based on both 10-fold cross-validation and independent tests indicated that iBitter-Fuse achieved more accurate performance than state-of-the-art methods. To facilitate high-throughput identification of bitter peptides, the iBitter-Fuse web server was established and made freely available online. It is anticipated that iBitter-Fuse will be a useful tool for aiding the discovery and de novo design of bitter peptides.
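
A simplified sketch of genetic-algorithm feature selection with an SVM fitness function, in the spirit of (but not identical to) the GA-SAR procedure described above. The data, population size, generation count, and mutation rate are all illustrative assumptions.

```python
# Sketch: a binary-mask genetic algorithm that selects feature subsets,
# scored by cross-validated SVM accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=300, n_features=40, n_informative=6,
                           random_state=1)  # stand-in for fused peptide encodings

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(kernel="rbf"), X[:, mask.astype(bool)], y,
                           cv=5, scoring="accuracy").mean()

pop = rng.integers(0, 2, size=(20, X.shape[1]))        # random bit masks
for _ in range(15):                                    # generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]            # keep the fittest half
    cut = rng.integers(1, X.shape[1], size=10)
    children = np.array([np.r_[parents[i][:c], parents[(i + 1) % 10][c:]]
                         for i, c in enumerate(cut)])  # one-point crossover
    flip = rng.random(children.shape) < 0.02           # bit-flip mutation
    pop = np.vstack([parents, np.where(flip, 1 - children, children)])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best), "CV accuracy:", fitness(best))
```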


2021 ◽  
Author(s):  
Myeong Gyu Kim ◽  
Jae Hyun Kim ◽  
Kyungim Kim

BACKGROUND Garlic-related misinformation proliferates whenever a virus outbreak occurs. Once again, with the outbreak of coronavirus disease 2019 (COVID-19), garlic-related misinformation is spreading through social media sites, including Twitter. Machine learning-based approaches can be used to detect misinformation in vast numbers of tweets. OBJECTIVE This study aimed to develop machine learning algorithms for detecting misinformation about garlic and COVID-19 on Twitter. METHODS This study used 5,929 original tweets mentioning garlic and COVID-19. Tweets were manually labeled as misinformation, accurate information, or others. We tested the following algorithms: k-nearest neighbors; random forest; support vector machine (SVM) with linear, radial, and polynomial kernels; and neural network. Features for machine learning included user-based features (verified account, user type, number of followers, and follower rate) and text-based features (uniform resource locator, negation, sentiment score, latent Dirichlet allocation topic probability, number of retweets, and number of favorites). The model with the highest accuracy on the training dataset (70% of the overall dataset) was tested on a test dataset (30% of the overall dataset). Predictive performance was measured using overall accuracy, sensitivity, specificity, and balanced accuracy. RESULTS The SVM with polynomial kernel model showed the highest accuracy, 0.670. The model also showed a balanced accuracy of 0.757, sensitivity of 0.819, and specificity of 0.696 for misinformation. Important features in the misinformation and accurate information classes included topic 4 (common myths), topic 13 (garlic-specific myths), number of followers, topic 11 (misinformation on social media), and follower rate. Topic 3 (cooking recipes) was the most important feature in the others class. CONCLUSIONS Our SVM model showed good performance in detecting misinformation. The results of our study will help detect misinformation related to garlic and COVID-19. The approach could also be applied to prevent misinformation related to dietary supplements in the event of a future outbreak of a disease other than COVID-19.
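
A minimal sketch of the classification setup, assuming scikit-learn: a polynomial-kernel SVM trained on a 70/30 split and scored with overall and balanced accuracy. The synthetic three-class feature matrix stands in for the study's user- and text-based tweet features (follower counts, sentiment scores, LDA topic probabilities, and so on).

```python
# Sketch: polynomial-kernel SVM on a 70/30 split, three classes
# (misinformation / accurate information / others).
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=5929, n_features=14, n_classes=3,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, C=1.0))
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("balanced accuracy:", balanced_accuracy_score(y_te, pred))
```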


2021 ◽  
Author(s):  
Sebastian Johannes Fritsch ◽  
Konstantin Sharafutdinov ◽  
Moein Einollahzadeh Samadi ◽  
Gernot Marx ◽  
Andreas Schuppert ◽  
...  

BACKGROUND During the course of the COVID-19 pandemic, a variety of machine learning models were developed to predict different aspects of the disease, such as long-term course, organ dysfunction, or ICU mortality. The number of training datasets used has increased significantly over time. However, these data now come from different waves of the pandemic, which did not always involve the same therapeutic approaches and whose outcomes differed between waves. The impact of these changes on model development has not yet been studied. OBJECTIVE The aim of this investigation was to examine the predictive performance of several models trained on data from one wave when predicting the second wave's data, and to assess the impact of pooling these datasets. Finally, a method for comparing the heterogeneity of different datasets is introduced. METHODS We used two datasets, from wave one and wave two, to develop several models predicting patient mortality. Four classification algorithms were used: logistic regression (LR), support vector machine (SVM), random forest classifier (RF), and AdaBoost classifier (ADA). We also performed mutual prediction on the data of the wave that was not used for training. We then compared the performance of the models when a pooled dataset from both waves was used. The populations from the different waves were checked for heterogeneity using a convex hull analysis. RESULTS 63 patients from wave one (03–06/2020) and 54 from wave two (08/2020–01/2021) were evaluated. For each wave separately, we found models reaching sufficient accuracy, up to 0.79 AUROC (95% CI 0.76–0.81) for SVM on the first wave and up to 0.88 AUROC (95% CI 0.86–0.89) for RF on the second wave. After pooling the data, the AUROC decreased markedly. In the mutual prediction, models trained on second-wave data, when applied to first-wave data, showed good prediction for non-survivors but insufficient classification of survivors. The opposite setup (training: first wave, test: second wave) showed the inverse behavior, with models correctly classifying survivors and incorrectly predicting non-survivors. The convex hull analysis of the first- and second-wave populations showed a more inhomogeneous distribution of the underlying data compared to randomly selected sets of patients of the same size. CONCLUSIONS Our work demonstrates that a larger dataset is not a universal solution to all machine learning problems in clinical settings. Rather, it shows that inhomogeneous data used to develop models can cause serious problems. With the convex hull analysis, we offer a solution to this problem: the outcome of such an analysis can raise concerns that pooling different datasets would introduce inhomogeneous patterns that prevent better predictive performance.
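
One plausible reading of the convex hull heterogeneity check, sketched under stated assumptions: project each wave's patients into a low-dimensional feature space (for instance, the first two principal components) and compare each wave's hull volume with the hulls of equally sized random subsets of the pooled data. A wave whose hull deviates strongly from the random-subset hulls occupies its own region of feature space. The 2-D Gaussian stand-in data below is illustrative only, not the study's actual features.

```python
# Sketch: compare a wave's convex hull volume against hulls of random
# same-size subsets of the pooled population.
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
wave1 = rng.normal(loc=0.0, size=(63, 2))  # stand-in 2-D projections
wave2 = rng.normal(loc=1.5, size=(54, 2))  # (e.g., first two PCA components)
pooled = np.vstack([wave1, wave2])

def hull_volume(points):
    # In 2-D, ConvexHull.volume is the enclosed area.
    return ConvexHull(points).volume

random_vols = [hull_volume(pooled[rng.choice(len(pooled), 54, replace=False)])
               for _ in range(200)]
print("wave 2 hull volume:", hull_volume(wave2))
print("random 54-patient hulls:",
      np.mean(random_vols), "+/-", np.std(random_vols))
```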


2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Patricio Wolff ◽  
Manuel Graña ◽  
Sebastián A. Ríos ◽  
Maria Begoña Yarza

Background. Hospital readmission prediction in pediatric hospitals has received little attention. Studies have focused on readmission frequency analysis stratified by disease and demographic/geographic characteristics, but there are no predictive modeling approaches, which may be useful for identifying preventable readmissions that constitute a major portion of the cost attributed to readmissions. Objective. To assess the all-cause readmission predictive performance achieved by machine learning techniques in the emergency department of a pediatric hospital in Santiago, Chile. Materials. An all-cause admissions dataset was collected over six consecutive years in a pediatric hospital in Santiago, Chile. The variables collected are the same as those used to determine the administrative cost of the child's treatment. Methods. Retrospective prediction of 30-day readmission was formulated as a binary classification problem. We report classification results achieved with various model-building approaches after data curation and preprocessing for correction of class imbalance. We computed repeated cross-validation (RCV) with a decreasing number of folds to assess performance and sensitivity to the effect of imbalance in the test set and training set size. Results. The increase in recall due to SMOTE class imbalance correction is large and statistically significant. The Naive Bayes (NB) approach achieves the best AUC (0.65); however, the shallow multilayer perceptron has the best PPV and F-score (5.6 and 10.2, respectively). NB and support vector machines (SVM) give comparable results when considering AUC, PPV, and F-score rankings across all RCV experiments. The high recall of the deep multilayer perceptron is due to a high false positive ratio. There is no detectable effect of the number of folds in the RCV on the predictive performance of the algorithms. Conclusions. We recommend Naive Bayes (NB) with a Gaussian distribution model as the most robust modeling approach for pediatric readmission prediction, achieving the best results across all training dataset sizes. The results show that the approach could be applied to detect preventable readmissions.
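
A minimal sketch of the imbalance-corrected pipeline, assuming scikit-learn and imbalanced-learn: SMOTE applied inside repeated cross-validation (so only training folds are oversampled) with a Gaussian Naive Bayes classifier, on synthetic data mimicking a rare readmission class.

```python
# Sketch: SMOTE + Gaussian Naive Bayes inside repeated cross-validation.
# imblearn's Pipeline resamples training folds only, avoiding leakage.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=3000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([("smote", SMOTE(random_state=0)), ("nb", GaussianNB())])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
auc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
recall = cross_val_score(pipe, X, y, cv=cv, scoring="recall")
print(f"AUC {auc.mean():.3f}  recall {recall.mean():.3f}")
```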


Polymers ◽  
2020 ◽  
Vol 12 (1) ◽  
pp. 211 ◽  
Author(s):  
Valentina Sabatini ◽  
Tommaso Taroni ◽  
Riccardo Rampazzo ◽  
Marco Bompieri ◽  
Daniela Maggioni ◽  
...  

Polyamide 6 (PA6) suffers from fast degradation in humid conditions due to hydrolysis of its amide bonds, which limits its durability. The addition of nanotubular fillers represents a viable strategy for overcoming this issue, although at high filler content the additive/polymer interface can become a privileged site for moisture accumulation. As a cost-effective and versatile material, halloysite nanotubes (HNT) were investigated to prepare PA6 nanocomposites with very low loadings (1–45% w/w). The roles of the physicochemical properties of two differently sourced HNT, of filler functionalization with (3-aminopropyl)triethoxysilane, and of the dispersion technique (in situ polymerization vs. melt blending) were investigated. The aspect ratio (5 vs. 15) and surface charge (−31 vs. −59 mV) of the two HNT proved crucial in determining their distribution within the polymer matrix. In situ polymerization of functionalized HNT leads to filler that is enclosed within and well penetrated by the polymer matrix. The crystal growth and nucleation type of the PA6 nanocomposites were studied according to Avrami theory, as was the formation of different crystalline structures (α and γ forms). After 1680 h of ageing, functionalized HNT reduced the diffusion of water into the polymer, lowering water uptake after 600 h by up to 90% and increasing the material's durability, also in terms of molecular weight and rheological behavior.
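
For readers unfamiliar with the Avrami analysis mentioned above, relative crystallinity follows X(t) = 1 - exp(-k * t^n), where the exponent n reflects the nucleation and growth mode. The sketch below recovers (k, n) from synthetic isothermal data by nonlinear least squares; the parameter values are illustrative, not the paper's measurements.

```python
# Sketch: fit the Avrami equation X(t) = 1 - exp(-k * t**n) to synthetic
# crystallization data and recover the rate constant k and exponent n.
import numpy as np
from scipy.optimize import curve_fit

def avrami(t, k, n):
    return 1.0 - np.exp(-k * t**n)

t = np.linspace(0.1, 10, 50)
rng = np.random.default_rng(0)
X_obs = avrami(t, k=0.05, n=2.5) + rng.normal(scale=0.01, size=t.size)

(k_fit, n_fit), _ = curve_fit(avrami, t, X_obs, p0=(0.1, 2.0))
print(f"k = {k_fit:.3f}, Avrami exponent n = {n_fit:.2f}")
```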

