NeuroCS: A Tool to Predict Cleavage Sites of Neuropeptide Precursors

2020 ◽  
Vol 27 (4) ◽  
pp. 337-345 ◽  
Author(s):  
Ying Wang ◽  
Juanjuan Kang ◽  
Ning Li ◽  
Yuwei Zhou ◽  
Zhongjie Tang ◽  
...  

Background: Neuropeptides are a class of bioactive peptides produced from neuropeptide precursors through a series of highly complex processing steps, and they mediate many aspects of neuronal regulation. Accurate identification of the cleavage sites of neuropeptide precursors is of great significance for neuroscience and brain science. Objective: With the explosive growth of neuropeptide precursor data, bioinformatics methods are needed to predict the cleavage sites of neuropeptide precursors quickly and efficiently. Method: We first processed the neuropeptide precursor data from SwissProt and NeuroPedia into a training dataset and a testing dataset. Six feature extraction schemes were then applied to generate different feature sets, and feature selection methods were used to find the optimal feature subset of each. A support vector machine was then used to build a model for each feature type. Finally, the performance of the models was evaluated on the independent testing dataset. Results: Six models were built with the support vector machine. Among them, the enhanced amino acid composition-based model reached the highest accuracy, 91.60%, in 5-fold cross-validation. On the independent testing dataset, it also performed well, with an accuracy of 90.37% and an area under the receiver operating characteristic curve of 0.9576. Conclusion: The developed model achieved good performance. For the convenience of users, an online web server called NeuroCS was built and is freely available at http://i.uestc.edu.cn/NeuroCS/dist/index.html#/. NeuroCS can be used to predict the cleavage sites of neuropeptide precursors effectively.
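The abstract does not define its "enhanced amino acid composition" features. As a rough illustration only, the sketch below assumes the commonly used sliding-window definition (amino acid composition computed in each fixed-length window and concatenated). The function name `eaac`, the window size of 5, and the example peptide are assumptions, not details taken from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def eaac(peptide, window=5):
    """Concatenate the amino acid composition of every sliding window.

    Assumed definition: for each window of `window` residues, the
    frequency of each of the 20 standard amino acids in that window.
    """
    features = []
    for start in range(len(peptide) - window + 1):
        segment = peptide[start:start + window]
        features.extend(segment.count(aa) / window for aa in AMINO_ACIDS)
    return features

# Hypothetical 10-residue fragment around a candidate cleavage site.
vector = eaac("AKRGSLLKRA")
print(len(vector))  # (10 - 5 + 1) windows x 20 residues
```

A vector like this could then be passed to any classifier; the support vector machine itself is omitted here to keep the sketch dependency-free.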

2019 ◽  
Vol 17 ◽  
Author(s):  
Yanqiu Yao ◽  
Xiaosa Zhao ◽  
Qiao Ning ◽  
Junping Zhou

Background: Glycation is a nonenzymatic post-translational modification in which a sugar molecule attaches to a protein or lipid molecule. It may impair the function and change the characteristics of proteins, which may lead to metabolic diseases. To understand the underlying molecular mechanisms of glycation, computational prediction methods have been developed because of their convenience and high speed. However, developing a more effective computational tool remains a challenging task in computational biology. Methods: In this study, we present an accurate identification tool named ABC-Gly for predicting lysine glycation sites. First, we used three informative features to encode the peptides: position-specific amino acid propensity, secondary structure, and the composition of k-spaced amino acid pairs. To exploit discriminative features and thereby improve the prediction and generalization ability of the model, we developed a two-step feature selection that combines the Fisher score with an improved binary artificial bee colony algorithm based on the support vector machine. Finally, based on the optimal feature subset, we constructed the model with a support vector machine on the training dataset. Results: In 10-fold cross-validation on the training dataset, the proposed predictor ABC-Gly achieved a sensitivity of 76.43%, a specificity of 91.10%, a balanced accuracy of 83.76%, an area under the receiver operating characteristic curve (AUC) of 0.9313, and a Matthews correlation coefficient (MCC) of 0.6861; on the independent dataset it achieved a balanced accuracy of 59.05%. Compared with state-of-the-art predictors on the training dataset, the proposed predictor improved the AUC by 0.156 and the MCC by 0.336. Conclusion: The detailed analysis indicated that our predictor may serve as a powerful complementary tool to existing methods for predicting protein lysine glycation. The source code and datasets of ABC-Gly are provided in Supplementary File 1.
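One of the three encodings named above, the composition of k-spaced amino acid pairs, can be sketched as follows. This is a minimal illustration under assumptions: the function name `cksaap`, the maximum gap `k_max=2`, and the example peptide are not taken from the paper, and pairs containing non-standard residues are simply ignored.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def cksaap(peptide, k_max=2):
    """Frequencies of residue pairs separated by k = 0..k_max other residues."""
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_pairs = len(peptide) - k - 1  # number of (i, i + k + 1) positions
        for i in range(max(n_pairs, 0)):
            pair = peptide[i] + peptide[i + k + 1]
            if pair in counts:  # skip pairs with non-standard residues
                counts[pair] += 1
        features.extend(counts[p] / n_pairs if n_pairs > 0 else 0.0 for p in PAIRS)
    return features

features = cksaap("ACDKACD")  # hypothetical lysine-centred fragment
print(len(features))  # 400 pair frequencies for each of the 3 gap values
```

The resulting vector grows by 400 entries per gap value, which is one reason a feature selection step is usually applied afterwards.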


2016 ◽  
Vol 36 (suppl_1) ◽  
Author(s):  
Hua Tang ◽  
Hao Lin

Objective: Apolipoproteins are of great physiological importance and are associated with diseases such as dyslipidemia, thrombogenesis and angiocardiopathy. Apolipoproteins have therefore emerged as key risk markers and important research targets, yet the types of apolipoproteins have not been fully elucidated. Accurate identification of apolipoproteins is crucial to the understanding of cardiovascular diseases and to drug design. The aim of this study is to develop a powerful model to precisely identify apolipoproteins. Approach and Results: We manually collected from UniProt a non-redundant dataset of 53 apolipoproteins and 136 non-apolipoproteins with a sequence identity of less than 40%. After formulating the protein sequence samples with g-gap dipeptide composition (here g = 1~10), analysis of variance (ANOVA) was adopted to find the feature subset that achieves the best accuracy. A Support Vector Machine (SVM) was then used to perform classification. The predictive model was evaluated using five-fold cross-validation, which yielded a sensitivity of 96.2%, a specificity of 99.3%, and an accuracy of 98.4%. The study indicated that the proposed method could be a feasible means of conducting preliminary analyses of apolipoproteins. Conclusion: We demonstrated that apolipoproteins can be predicted from their primary sequences. We also discovered a distinctive dipeptide distribution in apolipoproteins. These findings open new perspectives for improving apolipoprotein prediction by considering the specific dipeptides. We expect that these findings will help to improve drug development for angiocardiopathy. Key words: Apolipoproteins, Angiocardiopathy, Support Vector Machine
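The reported percentages can be checked against the dataset sizes with a short calculation. The confusion-matrix counts used below (51 of 53 positives and 135 of 136 negatives correctly classified) are not reported in the abstract; they are an inference that happens to reproduce the stated figures, and the function name is illustrative.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Assumed counts for 53 positive and 136 negative samples (not from the paper).
sn, sp, acc = binary_metrics(tp=51, fn=2, tn=135, fp=1)
print(f"{sn:.1%} {sp:.1%} {acc:.1%}")  # 96.2% 99.3% 98.4%
```

With only 53 positive samples, each misclassified positive shifts the sensitivity by almost two percentage points, which is worth keeping in mind when comparing such figures.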


2013 ◽  
pp. 786-797
Author(s):  
Ruofei Wang ◽  
Xieping Gao

Classification of protein folds plays a very important role in the protein structure discovery process, especially when traditional sequence alignment methods fail to yield convincing structural homologies. In this chapter, we develop a two-layer learning architecture, named TLLA, for multi-class protein fold classification. In the first layer, OET-KNN (Optimized Evidence-Theoretic K Nearest Neighbors) is used as the component classifier to find the K most probable folds of the query protein. In the second layer, a support vector machine (SVM) is used to build the multi-class classifier on only the K folds generated in the first layer, rather than on all 27 folds. For multi-feature combination, an ensemble strategy based on voting gives the final classification result. Our method achieves a standard percentage accuracy of ~63% on the independent testing dataset, in which most of the proteins have <25% sequence identity with those in the training dataset. The experimental evaluation on a widely used benchmark dataset shows that our approach outperforms the competing methods, suggesting that it might become a useful tool for protein fold classification.
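The two-layer idea (shortlist a few candidate folds, then decide among only those) can be sketched in a few lines. This is not TLLA itself: plain nearest-neighbour voting stands in for OET-KNN, a nearest-centroid rule stands in for the SVM, and the function names and toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def top_k_folds(query, train, n_neighbors=5, k_folds=2):
    """Layer 1: shortlist folds by votes among the nearest neighbours
    (plain KNN used here as a stand-in for OET-KNN)."""
    nearest = sorted(train, key=lambda item: math.dist(query, item[0]))[:n_neighbors]
    votes = Counter(label for _, label in nearest)
    return [label for label, _ in votes.most_common(k_folds)]

def classify_within(query, train, candidates):
    """Layer 2: choose among the shortlisted folds only
    (nearest centroid used here as a stand-in for the SVM)."""
    best, best_dist = None, math.inf
    for fold in candidates:
        points = [x for x, label in train if label == fold]
        centroid = [sum(col) / len(points) for col in zip(*points)]
        dist = math.dist(query, centroid)
        if dist < best_dist:
            best, best_dist = fold, dist
    return best

# Toy 2-D feature vectors with three hypothetical fold labels.
train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"),
         ((5, 6), "b"), ((10, 10), "c"), ((10, 11), "c")]
candidates = top_k_folds((1, 1), train)
print(classify_within((1, 1), train, candidates))
```

Restricting the second layer to the shortlisted folds turns a 27-class problem into a much smaller one for each query, which is the design choice the abstract describes.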


2012 ◽  
Vol 532-533 ◽  
pp. 1497-1502
Author(s):  
Hong Mei Li ◽  
Lin Gen Yang ◽  
Li Hua Zou

To obtain a feature subset that yields higher classification accuracy, a feature selection method based on genetic algorithms and support vector machines is proposed. First, the ReliefF algorithm provides a priori information to the genetic algorithm (GA), and the parameters of the support vector machine are incorporated into the genetic encoding; the GA then searches for the optimal combination of feature subset and support vector machine parameters. Experimental results show that the proposed algorithm achieves higher classification accuracy with a smaller feature subset.
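A joint encoding of this kind can be illustrated with a bit-string chromosome. Everything specific below is an assumption, since the abstract gives no details: the feature count, the 4 bits per parameter, the power-of-two ranges for the SVM parameters `C` and `gamma`, and the way ReliefF weights bias the initial feature bits are all hypothetical choices.

```python
import random

N_FEATURES = 8   # hypothetical number of candidate features
PARAM_BITS = 4   # assumed number of bits per SVM parameter

def decode(chromosome):
    """Split a bit list into a feature subset and two SVM parameters."""
    mask = chromosome[:N_FEATURES]
    c_bits = chromosome[N_FEATURES:N_FEATURES + PARAM_BITS]
    g_bits = chromosome[N_FEATURES + PARAM_BITS:]
    to_int = lambda bits: int("".join(map(str, bits)), 2)
    selected = [i for i, bit in enumerate(mask) if bit]
    C = 2.0 ** (to_int(c_bits) - 5)        # assumed range 2^-5 .. 2^10
    gamma = 2.0 ** (to_int(g_bits) - 15)   # assumed range 2^-15 .. 2^0
    return selected, C, gamma

def seeded_individual(relief_weights, rng):
    """Switch feature bits on with probability proportional to the
    (assumed positive) ReliefF weights; parameter bits are random."""
    top = max(relief_weights)
    mask = [1 if rng.random() < w / top else 0 for w in relief_weights]
    params = [rng.randint(0, 1) for _ in range(2 * PARAM_BITS)]
    return mask + params

individual = seeded_individual([0.9, 0.1, 0.5, 0.2, 0.8, 0.05, 0.3, 0.6],
                               random.Random(0))
print(decode(individual))
```

In a full GA, the fitness of each decoded individual would be the cross-validated accuracy of an SVM trained on the selected features with the decoded parameters; that step is omitted here.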


2020 ◽  
Author(s):  
Xiao Chen ◽  
Yi Xiong ◽  
Yinbo Liu ◽  
Yuqing Chen ◽  
Shoudong Bi ◽  
...  

Abstract Background: As one of the most common post-transcriptional modifications (PTCM) in RNA, 5-cytosine-methylation plays important roles in many biological functions such as RNA metabolism and cell fate decision. Through accurate identification of 5-methylcytosine (m5C) sites on RNA, researchers can better understand the exact role of 5-cytosine-methylation in these biological functions. In recent years, computational methods for predicting m5C sites have attracted considerable interest because of their efficiency and low cost. However, neither the accuracy nor the efficiency of these methods is yet satisfactory, and both need further improvement. Results: In this work, we have developed a new computational method, m5CPred-SVM, to identify m5C sites in three species, H. sapiens, M. musculus and A. thaliana. To build this model, we first collected benchmark datasets following three recently published methods. Then, six types of sequence-based features were generated from RNA segments, and the sequential forward feature selection strategy was used to obtain the optimal feature subset. After that, the performance of models based on different learning algorithms was compared, and the model based on the support vector machine provided the highest prediction accuracy. Finally, our proposed method, m5CPred-SVM, was compared with several existing methods, and the results showed that m5CPred-SVM offered substantially higher prediction accuracy than previously published methods. It is expected that our method, m5CPred-SVM, can become a useful tool for accurate identification of m5C sites. Conclusion: In this study, by introducing position-specific propensity related features, we built a new model, m5CPred-SVM, to predict RNA m5C sites in three different species. The results show that our model outperformed the existing state-of-the-art models. Our model is available to users through a web server at http://zhulab.ahu.edu.cn/m5CPred-SVM.
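Sequential forward feature selection, as named in the abstract, can be sketched generically: start from an empty set and repeatedly add the single feature that most improves a score, stopping when no addition helps. The scoring function here is a toy stand-in; in practice it would be something like cross-validated accuracy of the classifier, and the stopping rule is one common choice rather than the one the authors necessarily used.

```python
def sequential_forward_selection(features, score):
    """Greedily add the feature that raises `score` the most."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        candidate, candidate_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Toy score: each hypothetical feature contributes a fixed gain.
gains = {"a": 0.3, "b": 0.2, "c": -0.1}
subset, value = sequential_forward_selection(gains, lambda s: sum(gains[f] for f in s))
print(subset)  # ['a', 'b']
```

Because the search is greedy, it evaluates on the order of n² subsets rather than all 2ⁿ, at the cost of possibly missing features that are only useful in combination.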


2019 ◽  
Vol 16 (4) ◽  
pp. 332-339
Author(s):  
Liangwei Yang ◽  
Hui Gao ◽  
Zhen Liu ◽  
Lixia Tang

Phages are widely distributed in locations populated by bacterial hosts. Phage proteins can be divided into two main categories with different functions: virion and non-virion proteins. In practice, phage virion proteins are mainly used to clarify the lysis mechanism of bacterial cells and to develop new antibacterial drugs. Accurate identification of phage virion proteins is therefore essential to understanding the phage lysis mechanism. Although some computational methods have focused on identifying virion proteins, the results are not satisfactory, which leaves room for improvement. In this study, a new sequence-based method is proposed to identify phage virion proteins using g-gap tripeptide composition. In this approach, the protein features were first extracted from the g-gap tripeptide composition. Subsequently, we obtained an optimal feature subset by performing incremental feature selection (IFS) with information gain. Finally, the support vector machine (SVM) was used as the classifier to discriminate virion proteins from non-virion proteins. In a 10-fold cross-validation test, our proposed method achieved an accuracy of 97.40% with an AUC of 0.9958, which outperforms state-of-the-art methods. The results suggest that our proposed method could be a promising approach for identifying phage virion proteins.
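The ranking criterion mentioned above, information gain, measures how much knowing a feature reduces the entropy of the class labels. The sketch below assumes each continuous feature is binarized at a single threshold, which is a simplification chosen for illustration; the abstract does not say how the features were discretized, and the function names and toy values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

labels = [0, 0, 1, 1]  # hypothetical non-virion (0) and virion (1) samples
print(information_gain([0.1, 0.2, 0.8, 0.9], labels, 0.5))  # separates the classes
print(information_gain([0.1, 0.8, 0.2, 0.9], labels, 0.5))  # carries no information
```

In incremental feature selection, features would be sorted by such a score and added one at a time, keeping the prefix of the ranking that gives the best cross-validated accuracy.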


Author(s):  
Muhamad Addin Akmal Bin Mohd Raif ◽  
Nurlaila Ismail ◽  
Nor Azah Mohd Ali ◽  
Mohd Hezri Fazalul Rahiman ◽  
Saiful Nizam Tajuddin ◽  
...  

This paper presents an analysis of agarwood oil quality classification by tuning the quadratic kernel parameter of a Support Vector Machine (SVM). The experimental work involved agarwood oil samples of low and high quality. The input is the abundances (%) of the agarwood oil compounds and the output is the quality of the oil, either high or low. The input and output data were processed in two stages: i) data processing, which covers normalization, randomization and splitting of the data into training and testing sets (ratio of 80%:20%), and ii) data analysis, which covers SVM development by tuning the quadratic kernel parameter. The training dataset was used to train the SVM model and the testing dataset was used to test the developed model. All analytical work was performed in MATLAB version R2013a. The results showed that the SVM model with the tuned quadratic kernel parameter was successful, since it passed all the performance criteria: accuracy, precision, confusion matrix, sensitivity and specificity. The findings of this paper are important for agarwood oil research, especially for agarwood oil compound classification systems.
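The data processing stage (normalization, randomization, and an 80%:20% split) can be sketched as below. The paper used MATLAB; this Python version is only an illustration, and the choice of min-max scaling, the fixed random seed, and the toy abundance values are assumptions.

```python
import random

def min_max_normalize(rows):
    """Scale every column to the range [0, 1] (assumed normalization)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def shuffle_and_split(samples, train_fraction=0.8, seed=0):
    """Randomize sample order, then split into training and testing parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical compound abundances (%) for ten oil samples.
abundances = [[float(i), float(10 * i)] for i in range(1, 11)]
train, test = shuffle_and_split(min_max_normalize(abundances))
print(len(train), len(test))  # 8 2
```

A stricter protocol would compute the scaling bounds on the training part only and reuse them for the test part, so that no information from the test samples leaks into training.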


2021 ◽  
Vol 8 ◽  
Author(s):  
Caidong Liu ◽  
Ziyu Wang ◽  
Wei Wu ◽  
Changgang Xiang ◽  
Lingxiang Wu ◽  
...  

Objective: To distinguish COVID-19 patients from non-COVID-19 viral pneumonia patients and to classify COVID-19 patients into low-risk and high-risk groups at admission using laboratory indicators. Materials and methods: In this retrospective cohort, a total of 3,563 COVID-19 patients and 118 non-COVID-19 pneumonia patients were included. There are two cohorts of COVID-19 patients: 548 patients in the training dataset and 3,015 patients in the testing dataset. Laboratory indicators were measured during hospitalization for all patients. Based on laboratory indicators, we used the support vector machine and joint random sampling to stratify the risk of COVID-19 patients at admission. Based on laboratory indicators detected within the 1st week after admission, we used logistic regression and joint random sampling to develop the survival model. The laboratory indicators of COVID-19 and non-COVID-19 patients were also compared. Results: We first identified the significant laboratory indicators related to the severity of COVID-19 in the training dataset. Neutrophil percentage, lymphocyte percentage, creatinine, and blood urea nitrogen, each with AUC > 0.7, were included in the model. These indicators were further used to build a support vector machine model to classify patients in the testing dataset into low-risk and high-risk groups at admission. Results showed that this model could stratify the patients in the testing dataset effectively (AUC = 0.89). The model still performed well at different times (mean AUC: 0.71, 0.72, 0.72, respectively for 3, 5, and 7 days after admission). Moreover, laboratory indicators detected within the 1st week after admission were able to estimate the probability of death (AUC = 0.95). We identified six indicators with permutation p < 0.05: eosinophil percentage (p = 0.007), white blood cell count (p = 0.045), albumin (p = 0.041), aspartate transaminase (p = 0.043), lactate dehydrogenase (p = 0.002), and hemoglobin (p = 0.031). We could diagnose COVID-19 and differentiate it from other kinds of viral pneumonia based on these laboratory indicators. Conclusions: Our risk-stratification model based on laboratory indicators could help to diagnose, monitor, and predict severity at an early stage of COVID-19. In addition, laboratory findings could be used to distinguish COVID-19 from non-COVID-19 pneumonia.
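The AUC values quoted above have a simple interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the function name and the toy scores are illustrative, and no patient data from the study are used.

```python
def auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly
    (ties count as half a correct pair)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy risk scores: label 1 = high-risk, label 0 = low-risk.
print(auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))  # 0.75
```

This pairwise form is quadratic in the number of samples, so for cohorts of thousands of patients a rank-based implementation would normally be used instead.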


2020 ◽  
Vol 15 ◽  
Author(s):  
Chun Qiu ◽  
Sai Li ◽  
Shenghui Yang ◽  
Lin Wang ◽  
Aihui Zeng ◽  
...  

Aim: To search for genes related to the mechanisms underlying the occurrence of glioma and to build a prediction model for glioblastomas. Background: The morbidity and mortality of glioblastomas are very high, which seriously endangers human health. At present, the goals of many investigations on gliomas are mainly to understand the cause and mechanism of these tumors at the molecular level and to explore clinical diagnosis and treatment methods. However, there is no effective early diagnosis method for this disease, and there are no effective prevention, diagnosis or treatment measures. Methods: First, gene expression profiles were downloaded from GEO. Then, differentially expressed genes (DEGs) between the disease samples and the control samples were identified. After that, GO and KEGG enrichment analyses of the DEGs were performed with DAVID. Furthermore, the correlation-based feature subset (CFS) method was applied to select key DEGs. In addition, a classification model separating the glioblastoma samples from the controls was built with a Support Vector Machine (SVM) based on the selected key genes. Results and Discussion: Thirty-six DEGs, including 17 upregulated and 19 downregulated genes, were selected by the CFS method as the feature genes to build the classification model between the glioma samples and the control samples. The accuracy of the classification model was 76.25% in a 10-fold cross-validation test and 70.3% in an independent set test. In addition, PPP2R2B and CYBB were also among the top 5 hub genes screened by the protein-protein interaction (PPI) network. Conclusions: This study indicated that the CFS method is a useful tool to identify key genes in glioblastomas. In addition, we predicted that genes such as PPP2R2B and CYBB might be potential biomarkers for the diagnosis of glioblastomas.
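The 10-fold cross-validation used to estimate accuracy partitions the samples into ten folds and holds each one out in turn. The sketch below shows only the index bookkeeping; the interleaved fold assignment and the sample count of 25 are assumptions for illustration, and the samples are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    # Interleaved assignment: sample i goes to fold i % k.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

for train, test in k_fold_indices(25, k=10):
    # A model would be fitted on `train` and scored on `test` here.
    print(len(train), len(test))
```

The cross-validated accuracy is then the fraction of all held-out samples that were classified correctly, which is a different estimate from the accuracy on a separate independent set, as the two figures in the abstract illustrate.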

