protein prediction
Recently Published Documents


TOTAL DOCUMENTS

179
(FIVE YEARS 62)

H-INDEX

17
(FIVE YEARS 4)

2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Shiming Wang ◽  
Jie Li ◽  
Yadong Wang

Abstract Background Detecting pathogenic proteins is a primary way to understand disease mechanisms and resist the invasion of diseases, which makes pathogenic protein prediction an urgent problem. Genome-wide protein prediction does not necessarily help cure diseases quickly, because developing new drugs specifically for a predicted pathogenic protein always requires major expenditures of time and money. To facilitate disease treatment, computational methods that predict pathogenic proteins targeted by existing drugs should be developed. Results In this study, we proposed a novel computational model, named M2PP, to predict drug-targeted pathogenic proteins. Three types of features were extracted from our constructed heterogeneous network (comprising target proteins, diseases and drugs), based on neighborhood similarity information, drug-inferred information and path information. Then, a random forest regression model was trained to score unconfirmed target-disease pairs. Five-fold cross-validation was used to evaluate the model's prediction performance, where M2PP achieved advantageous results compared with other state-of-the-art methods. In addition, the highly ranked pathogenic proteins that M2PP predicted for common diseases were supported by public biomedical literature, indicating its predictive ability. Conclusions M2PP is an effective and accurate model for predicting drug-targeted pathogenic proteins, which could facilitate future biological research.
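The five-fold cross-validation protocol mentioned in this abstract can be sketched in a few lines. This is a minimal illustration of the fold bookkeeping only, not the M2PP pipeline: the function names and the `evaluate` callback (which would train a model on the training indices and return a score on the held-out indices) are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i.
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Call evaluate(train_idx, test_idx) once per fold and average the scores.

    `evaluate` is a hypothetical user-supplied callback.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k
```

Each sample is held out exactly once, so the averaged score is not biased by any single train/test partition.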


2021 ◽  
Vol 17 ◽  
Author(s):  
Ke Yan ◽  
Hongwu Lv ◽  
Yichen Guo ◽  
Jie Wen ◽  
Bin Liu

Background: Therapeutic peptide prediction is critical for drug development and therapy. Researchers have been studying this essential task, developing several computational methods to identify different therapeutic peptide types. Objective: Most predictors are specific to certain peptide types. Developing methods that predict multiple peptide types remains a challenging problem. Moreover, it is still challenging to combine different features for therapeutic peptide prediction. Method: In this paper, we proposed a new ensemble method, TP-MV, for general therapeutic peptide recognition. TP-MV is developed using the stacking framework in conjunction with KNN, SVM, ET, RF, and XGB. TP-MV then constructs a multi-view learning model as the meta-classifier to extract discriminative features for different peptides. Results: In the experiments, the proposed method outperforms other existing methods on the benchmark datasets, indicating that it can predict multiple therapeutic peptides simultaneously. Conclusion: TP-MV is a useful tool for predicting therapeutic peptides.


2021 ◽  
Author(s):  
Min Zeng ◽  
Nian Wang ◽  
Yifan Wu ◽  
Yiming Li ◽  
Fang-Xiang Wu ◽  
...  

2021 ◽  
Author(s):  
Anthony M. Musolf ◽  
Emily R. Holzinger ◽  
James D. Malley ◽  
Joan E. Bailey-Wilson

Abstract Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
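One common model-agnostic way to estimate feature importance is permutation importance: shuffle one feature column and measure how much the prediction accuracy drops. The sketch below is an illustration chosen for this listing and is not necessarily one of the metrics the review covers; the toy genotype data and the rule-based `model` are assumptions made up for the example.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which model(row) equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: feature 0 is a genotype (0, 1 or 2 copies of an
# allele) that fully determines the phenotype; feature 1 is pure noise.
data_rng = random.Random(1)
rows = [(data_rng.randint(0, 2), data_rng.random()) for _ in range(200)]
labels = [int(genotype > 0) for genotype, _ in rows]
model = lambda row: int(row[0] > 0)  # stands in for a fitted classifier

importances = permutation_importance(model, rows, labels, n_features=2)
```

In this toy setting the genotype feature receives a clearly positive importance, while the noise feature receives exactly zero because the stand-in model never reads it.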


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Denise S. Arico ◽  
Paula Beati ◽  
Diego L. Wengier ◽  
Maria Agustina Mazzella

Abstract Background Proteins are the workforce of the cell and their phosphorylation status tailors specific responses efficiently. One of the main challenges of phosphoproteomic approaches is to deconvolute biological processes that specifically respond to an experimental query from a list of phosphoproteins. Comparison of the frequency distribution of GO (Gene Ontology) terms in a given phosphoproteome set with that observed in the genome reference set (GenRS) is the most widely used tool to infer biological significance. Yet, this comparison assumes that the GO term distributions of the phosphoproteome and the genome are identical. This hypothesis has not been tested due to the lack of a comprehensive phosphoproteome database. Results In this study, we test this hypothesis by constructing three phosphoproteome databases in Arabidopsis thaliana: one based on experimental data (ExpRS), another based on in silico phosphorylation protein prediction (PredRS) and a third that is the union of both (UnRS). Our results show that the three phosphoproteome reference sets show default enrichment of several GO terms compared to GenRS, indicating that GO term distribution in the phosphoproteomes does not match that of the genome. Moreover, these differences overshadow the identification of GO terms that are specifically enriched in a particular condition. To overcome this limitation, we present an additional comparison of the sample of interest with UnRS to uncover GO terms specifically enriched in a particular phosphoproteome experiment. Using this strategy, we found that mRNA splicing and cytoplasmic microtubule compounds are important processes specifically enriched in the phosphoproteome of dark-grown Arabidopsis seedlings. Conclusions This study provides a novel strategy to uncover GO-specific terms in phosphoproteome data of Arabidopsis that could be applied to any other organism. We also highlight the importance of specific phosphorylation pathways that take place during dark-grown Arabidopsis development.
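The effect of the reference set on GO term enrichment can be illustrated with a one-sided hypergeometric test, a standard choice for this kind of over-representation analysis. The abstract does not state which test the authors used, so the test itself and every count below are hypothetical assumptions for illustration.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated proteins in a sample
    of n drawn without replacement from a reference of N proteins, K of
    which carry the GO term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical counts: a sample of 100 phosphoproteins, 6 of which carry the term.
sample_size, sample_hits = 100, 6

# Genome-wide reference: the term annotates 200 of 20000 proteins (1%).
p_genome = hypergeom_enrichment_p(sample_hits, sample_size, 200, 20000)

# Phosphoproteome reference: the term annotates 300 of 5000 proteins (6%).
p_phospho = hypergeom_enrichment_p(sample_hits, sample_size, 300, 5000)
```

With these made-up numbers the term looks strongly enriched against the genome-wide reference but unremarkable against the phosphoproteome reference, which is the kind of default enrichment the abstract describes.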


2021 ◽  
Vol 12 ◽  
Author(s):  
Xiaomei Gu ◽  
Lina Guo ◽  
Bo Liao ◽  
Qinghua Jiang

Phages have seriously affected the biochemical systems of the world. They are related not only to our health but also to medical treatments for many cancers and skin infections; therefore, this paper sought to identify phage proteins. In this paper, a Pseudo-188D model was established. The digital features of the phage were extracted by PseudoKNC, an appropriate vector was selected by the AdaBoost tool, and features were extracted by 188D. Then, the extracted digital features were combined, and finally, the viral proteins of the phage were predicted by a stochastic gradient descent algorithm. Our model's performance reached 93.4853%. To verify the stability of our model, we randomly selected 80% of the downloaded data to train the model and used the remaining 20% of the data to verify the robustness of our model.
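The final two steps described here, a classifier trained by stochastic gradient descent and a random 80%/20% split, can be sketched as follows. This is a minimal illustration only: it assumes a logistic-regression model and synthetic two-dimensional features, and it does not reproduce the PseudoKNC or 188D feature extraction or the authors' data.

```python
import math
import random

def train_sgd_logistic(X, y, epochs=50, lr=0.1, seed=0):
    """Fit logistic regression with per-sample stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y[i]  # gradient of log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Return 1 when the linear score is positive, else 0."""
    return int(sum(wj * xj for wj, xj in zip(w, x)) + b > 0)

# Synthetic stand-in for extracted features: two Gaussian clusters.
rng = random.Random(42)
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    center = 2.0 if label else -2.0
    X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
    y.append(label)

# Random 80%/20% split into training and held-out data.
idx = list(range(len(X)))
rng.shuffle(idx)
cut = len(idx) * 4 // 5
train, test = idx[:cut], idx[cut:]

w, b = train_sgd_logistic([X[i] for i in train], [y[i] for i in train])
test_acc = sum(predict(w, b, X[i]) == y[i] for i in test) / len(test)
```

Scoring only on the held-out 20% gives an estimate of how the model behaves on data it was not trained on.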


2021 ◽  
Author(s):  
Denise S. Arico ◽  
Paula Beati ◽  
Diego L. Wengier ◽  
María Agustina Mazzella

Abstract Background. Proteins are the workforce of the cell and their phosphorylation status tailors specific responses efficiently. One of the main challenges of phosphoproteomic approaches is to deconvolute biological processes that specifically respond to an experimental query from a list of phosphoproteins. Comparison of the frequency distribution of GO (Gene Ontology) terms in a given phosphoproteome set with that observed in the genome reference set (GenRS) is the most widely used tool to infer biological significance. Yet, this comparison assumes that the GO term distributions of the phosphoproteome and the genome are identical. This hypothesis has not been tested due to the lack of a comprehensive phosphoproteome database. Results. In this study, we test this hypothesis by constructing three phosphoproteome databases in Arabidopsis thaliana: one based on experimental data (ExpRS), another based on in silico phosphorylation protein prediction (PredRS) and a third that is the union of both (UnRS). Our results show that the three phosphoproteome reference sets show default enrichment of several GO terms compared to GenRS, indicating that GO term distribution in the phosphoproteomes does not match that of the genome. Moreover, these differences overshadow the identification of GO terms that are specifically enriched in a particular condition. To overcome this limitation, we present an additional comparison of the sample of interest with UnRS to uncover GO terms specifically enriched in a particular phosphoproteome experiment. Using this strategy, we found that mRNA splicing and cytoplasmic microtubule compounds are important processes specifically enriched in the phosphoproteome of dark-grown Arabidopsis seedlings. Conclusions. This study provides a novel strategy to uncover GO-specific terms in phosphoproteome data of Arabidopsis that could be applied to any other organism. We also highlight the importance of specific phosphorylation pathways that take place during dark-grown Arabidopsis development.


2021 ◽  
Author(s):  
Carter J. Wilson ◽  
Wing-Yiu Choy ◽  
Mikko Karttunen

The development of AlphaFold2 was a paradigm shift in the structural biology community; herein we assess the ability of AlphaFold2 to predict disordered regions against traditional sequence-based disorder predictors. We find that a naive use of Dictionary of Secondary Structure of Proteins (DSSP) to separate ordered from disordered regions leads to a dramatic overestimation of disorder content, and that the Predicted Aligned Error (PAE) provides a much more rigorous metric. In addition, we show that even when the PAE is used for disorder prediction, conventional predictors can outperform it in disorder identification, and we note an interesting relationship between the PAE and secondary structure that may explain our observations and hints at a broader application of the PAE to IDP dynamics.


2021 ◽  
Author(s):  
Chengbin Hu ◽  
Yiru Qin ◽  
Chuan Ye ◽  
Jiao Jin ◽  
Ting Zhou ◽  
...  

Abstract Background: Many proteins or partial regions of proteins do not have stable and well-defined three-dimensional structures in vitro. Understanding Intrinsically Disordered Proteins (IDPs) is significant for interpreting biological function as well as studying many diseases. Although more than 70 disorder predictors have been developed, many existing predictors are limited to particular protein characteristics and do not have very high accuracy. Therefore, it is critical to formulate new strategies for disordered protein prediction. Results: Here, we propose a machine learning meta-strategy to improve the accuracy of disordered protein and disordered region prediction. We first use logistic forward parameter selection to select the eight most significant predictors from the currently available IDP predictors. Then we design a novel meta-strategy using several machine learning models, including a decision-tree-based algorithm, Naive Bayes, random forest, and a Convolutional Neural Network (CNN). By applying different strategies, the results suggest that random forest can improve the predicted single amino acid accuracy significantly, to 93.35%. Using the combined vector data of the eight most significant predictors as input, the CNN can improve the whole-protein prediction to 95.62%. Conclusion: According to the performance of our machine learning meta-strategy, the random forest and CNN models can improve the accuracy of IDP prediction.


Mathematics ◽  
2021 ◽  
Vol 9 (16) ◽  
pp. 1941
Author(s):  
Gordana Ispirova ◽  
Tome Eftimov ◽  
Barbara Koroušić Seljak

Being both a poison and a cure for many lifestyle and non-communicable diseases, food is becoming a prime focus of precision medicine. Monitoring a few groups of nutrients is crucial for some patients, and methods for easing their calculation are emerging. Our proposed machine learning pipeline deals with nutrient prediction based on learned vector representations of short texts, namely recipe names. In this study, we explored how the prediction results change when, instead of using the vector representations of the recipe description, we use the embeddings of the list of ingredients. The nutrient content of a food depends on its ingredients; therefore, the text of the ingredients contains more relevant information. We define a domain-specific heuristic for merging the embeddings of the ingredients, which incorporates the quantity of each ingredient, in order to use them as features in machine learning models for nutrient prediction. The results from the experiments indicate that the prediction results improve when using the domain-specific heuristic. The prediction models for protein prediction were highly effective, with accuracies up to 97.98%. Implementing a domain-specific heuristic for combining multi-word embeddings yields better results than using conventional merging heuristics, with up to 60% more accuracy in some cases.
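One plausible reading of a quantity-aware merging heuristic is a weighted average of the ingredient embeddings, with each ingredient weighted by its quantity. The abstract does not give the exact formula, so the sketch below is an assumption for illustration; the function name, the two-dimensional vectors and the gram amounts are all hypothetical.

```python
def merge_ingredient_embeddings(embeddings, quantities):
    """Quantity-weighted average of ingredient embedding vectors.

    `embeddings` is a list of equal-length vectors and `quantities` holds the
    matching amounts (for example grams). Assumed heuristic, not the paper's.
    """
    total = sum(quantities)
    if total <= 0:
        raise ValueError("quantities must sum to a positive value")
    dim = len(embeddings[0])
    return [
        sum(q * e[d] for e, q in zip(embeddings, quantities)) / total
        for d in range(dim)
    ]

# Hypothetical recipe: 300 g of one ingredient and 100 g of another.
recipe_vector = merge_ingredient_embeddings([[1.0, 0.0], [0.0, 1.0]], [300, 100])
# A conventional unweighted mean treats both ingredients equally.
plain_mean = merge_ingredient_embeddings([[1.0, 0.0], [0.0, 1.0]], [1, 1])
```

The weighted vector sits closer to the dominant ingredient, whereas the unweighted mean ignores how much of each ingredient the recipe contains.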

