Identification of Lysine Carboxylation Sites in Proteins by Integrating Statistical Moments and Position Relative Features via General PseAAC

2020 ◽  
Vol 15 (5) ◽  
pp. 396-407 ◽  
Author(s):  
Saba Amanat ◽  
Adeel Ashraf ◽  
Waqar Hussain ◽  
Nouman Rasool ◽  
Yaser D. Khan

Background: Carboxylation is one of the most biologically important post-translational modifications and occurs on lysine, arginine, and glutamine residues of a protein. Among these three, the covalent attachment of a carboxyl group to the lysine side chain is the most frequent and biologically important type of carboxylation. To study the associated biological functions, it is essential to correctly determine the lysine sites sensitive to carboxylation. Objective: Herein, we present a machine-learning-based computational model for the prediction of carboxylysine sites. Methods: Various position- and composition-relative features were incorporated into PseAAC to construct feature vectors, and a neural network was employed as the classifier. The model was validated by jackknife, cross-validation, self-consistency, and independent testing. Results: The self-consistency test showed that the model has 99.76% Acc, 99.76% Sp, 99.76% Sn, and 0.99 MCC. Using the jackknife method, model validation gave 97.07% Acc, while 10-fold cross-validation gave 95.16% Acc. Conclusion: Independent dataset testing yielded 94.3%, which illustrates that the proposed model performs better than the existing model PreLysCar; however, the accuracy can be improved further in the future as the number of known carboxylysine sites in proteins increases.
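The metrics quoted in this abstract (Acc, Sn, Sp, MCC) are all derived from a binary confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' code; the function name `binary_metrics` and the example counts are hypothetical.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity (recall on negatives)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}

# Hypothetical counts, chosen only to illustrate the arithmetic.
print(binary_metrics(tp=45, tn=45, fp=5, fn=5))
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy for imbalanced site-prediction datasets.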

2021 ◽  
Vol 15 ◽  
Author(s):  
Muhammad Awais ◽  
Waqar Hussain ◽  
Nouman Rasool ◽  
Yaser Daanial Khan

Background: Cancer is the uncontrolled growth that results from the accumulation of genetic and epigenetic changes following loss or reduction of the normal function of Tumor Suppressor Genes (TSGs) and proto-oncogenes. TSGs control cell division and growth by repairing DNA mistakes during replication, and they restrict unwanted cell proliferation and other activities that are part of tumor production. Objectives: This study aims to propose a novel, accurate, user-friendly model to predict tumor suppressor proteins, which would be freely available to experimental molecular biologists to assist them in in vitro and in vivo studies. Methods: The predictor is based on an FCNN that takes an input feature vector (IFV) calculated from the physicochemical properties of proteins; accuracy, sensitivity, specificity, and MCC were computed to assess it. The proposed model was validated using different exhaustive validation techniques, i.e., self-consistency and cross-validation. Results: The accuracy is 99% for self-consistency, and 99.80% and 100% for cross-validation and independent testing, respectively. The overall accuracy of the proposed model is 99%, with a sensitivity of 98%, a specificity of 99%, and an F1-score of 0.99. Conclusion: The proposed model can predict tumor suppressor proteins efficiently, but it still has room for computational improvement, as the number of protein sequences increases rapidly, day by day.


2019 ◽  
Vol 16 (3) ◽  
pp. 226-234 ◽  
Author(s):  
Sher Afzal Khan ◽  
Yaser Daanial Khan ◽  
Shakeel Ahmad ◽  
Khalid H. Allehaibi

N-Myristoylation, an irreversible protein modification, occurs by the covalent attachment of myristate to the N-terminal glycine of eukaryotic and viral proteins, and is associated with a variety of pathogens and disease-related proteins. Identification of myristoylation sites through experimental mechanisms can be costly, labour-intensive, and time-consuming. Due to the association of N-myristoylation with various diseases, its timely prediction can help in diagnosing and controlling the associated fatal diseases. Herein, we present a method named N-MyristoylG-PseAAC in which we have incorporated PseAAC with statistical moments for the prediction of N-Myristoyl Glycine (NMG) sites. A benchmark dataset of 893 positive and 1093 negative samples was collected and used in this study. For the feature vectors, various position- and composition-relative features along with the statistical moments were calculated. A back-propagation neural network was then trained on the feature vectors, with scaled conjugate gradient descent with adaptive learning used as the optimizer. Self-consistency testing and 10-fold cross-validation were performed to evaluate the performance of N-MyristoylG-PseAAC using accuracy metrics. For self-consistency testing, 99.80% Acc, 99.78% Sp, 99.81% Sn, and 0.99 MCC were observed, whereas for 10-fold cross-validation, 97.18% Acc, 98.54% Sp, 96.07% Sn, and 0.94 MCC were observed. Thus, the proposed predictor can help in predicting myristoylation sites efficiently and accurately.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Muhammad Adeel Ashraf ◽  
Yaser Daanial Khan ◽  
Bilal Shoaib ◽  
Muhammad Adnan Khan ◽  
Faheem Khan ◽  
...  

Beta-lactamase (β-lactamase) produced by different bacteria confers resistance against β-lactam-containing drugs. The gene encoding β-lactamase is plasmid-borne and can easily be transferred from one bacterium to another during conjugation. Through such transfers, the recipient also acquires resistance against drugs of the β-lactam family. β-Lactam antibiotics are of vital significance in the clinical treatment of serious diseases such as soft tissue infections, gonorrhoea, skin infections, urinary tract infections, and bronchitis. Herein, we report a prediction classifier named βLact-Pred for the identification of β-lactamase proteins. The computational model uses the primary amino acid sequence as its input. Various metrics are derived from the primary structure to form a feature vector. Experimentally determined positive and negative β-lactamase samples are collected and transformed into feature vectors. An artificial neural network is trained on feature vectors that integrate position-relative features and sequence statistical moments into PseAAC. The results for the proposed computational model were validated using several approaches, i.e., self-consistency testing, jackknife testing, cross-validation, and independent testing. The overall accuracy of the predictor for self-consistency, jackknife testing, cross-validation, and independent testing is 99.76%, 96.07%, 94.20%, and 91.65%, respectively. The experimental results demonstrate that the proposed predictor “βLact-Pred” surpasses the existing methods.
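The validation schemes named in this abstract differ mainly in how samples are split. The sketch below shows index splitting for k-fold cross-validation, with jackknife testing treated as the leave-one-out special case (k equal to the number of samples). It is a generic illustration with hypothetical function names, and it omits the shuffling or stratification a real study would normally apply first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def jackknife_indices(n_samples):
    """Leave-one-out splits: each sample is the test set exactly once."""
    return k_fold_indices(n_samples, n_samples)

for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

Self-consistency testing, by contrast, trains and tests on the same samples, which is why its reported accuracy is usually the highest of the four.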


2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Khalid Allehaibi ◽  
Yaser Daanial Khan ◽  
Sher Afzal Khan

Angiogenesis is a crucial biological process that plays a vital role in the migration, growth, and wound healing of endothelial cells and in other processes that are controlled by chemical signals. Angiogenesis controls the growth of blood vessels within tissues, and angiogenesis proteins play a significant role in the proper working of this process. Balancing these signals is necessary for angiogenesis to work properly. An imbalance of these signals increases blood vessel formation, which causes abnormal growth or several diseases, including cancer. The proposed work focuses on developing a two-layered prediction model using different classifiers: random forest (RF), neural network, and support vector machine. The first level performs in silico identification of angiogenesis proteins based on the primary structure. If the protein is an angiogenesis protein, the second level predicts whether it is linked with tumor angiogenesis or not. The model was evaluated using k-fold cross-validation, independent, self-consistency, and jackknife testing. The overall accuracy using an RF classifier was 97.8% for angiogenesis at the first level and 99.5% for tumor angiogenesis at the second level; ANN showed 94.1% accuracy for angiogenesis and 79.9% for tumor angiogenesis; and the accuracy of SVM was 78.8% for angiogenesis and 65.19% for tumor angiogenesis.
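The two-level design described here amounts to a simple cascade: the second classifier is consulted only when the first one answers positively. The sketch below shows that control flow with stand-in classifiers; the function names, the returned labels, and the toy rules inside the stubs are all hypothetical and carry no biological meaning.

```python
def two_level_predict(sequence, is_angiogenic, is_tumor_linked):
    """Run a two-level cascade over one protein sequence.

    is_angiogenic and is_tumor_linked are callables returning True or False;
    in practice they would wrap trained models (e.g. RF, ANN or SVM).
    """
    if not is_angiogenic(sequence):
        return "non-angiogenesis"
    if is_tumor_linked(sequence):
        return "tumor angiogenesis"
    return "angiogenesis (not tumor-linked)"

# Toy stand-ins for trained classifiers (illustrative only).
level_one = lambda seq: len(seq) > 5
level_two = lambda seq: seq.count("K") >= 2

print(two_level_predict("MKTAYIAKQR", level_one, level_two))
```

One consequence of the cascade is that errors at the first level propagate: a protein wrongly rejected at level one is never seen by level two.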


2020 ◽  
Author(s):  
Muhammad Khalid Mahmood ◽  
Asma Ehsan ◽  
Yaser Daanial Khan

Post-translational modifications (PTMs) of proteins play a vital role in various cellular functions. A PTM is induced by the addition of a functional group to the protein through a covalent bond. A number of PTMs have been identified that are closely linked with diseases such as cancer and neurological disorders. Hydroxylation is one such PTM, in which a proline residue within a polypeptide sequence is modified. Defective hydroxylation of proline, caused by a lack of ascorbic acid in humans, produces scurvy and many other serious health issues. The prediction of hydroxylation sites at proline residues is therefore a challenging frontier. Experimental identification of hydroxyproline sites is difficult, expensive, and time-consuming. The diversity of protein sequences motivates the development of a computational tool that identifies hydroxylated sites quickly and with excellent prediction accuracy. In this work, a novel in silico predictor is developed through rigorous mathematical modeling to identify which proline sites are hydroxylated and which are not. The performance of the predictor was verified using three validation tests over the benchmark dataset, namely the self-consistency test, the cross-validation test, and the jackknife test. The jackknife results were compared with those of previous methods, and the proposed tool is more accurate than the existing predictors. Hence, this scheme is highly useful and promising compared with all previous predictors.


2021 ◽  
Vol 18 ◽  
Author(s):  
Min Liu ◽  
Lu Zhang ◽  
Xinyi Qin ◽  
Tao Huang ◽  
Ziwei Xu ◽  
...  

Background: Nitration is one of the important Post-Translational Modifications (PTMs) occurring on the tyrosine residues of proteins. The occurrence of protein tyrosine nitration under disease conditions is inevitable and represents a shift from the signal-transducing physiological actions of •NO to oxidative and potentially pathogenic pathways. Abnormal protein nitration can lead to serious human diseases, including neurodegenerative diseases, acute respiratory distress, organ transplant rejection, and lung cancer. Objective: It is necessary and important to identify the nitration sites in protein sequences. Predicting which tyrosine residues in a protein sequence are nitrated and which are not is of great significance for the study of the nitration mechanism and related diseases. Methods: In this study, a prediction model of nitration sites based on the over- and under-sampling strategy and the FCBF method was proposed, using stacking ensemble learning and the fusion of multiple features. Firstly, each protein sequence sample was encoded by 2701-dimensional fusion features (PseAAC, PSSM, AAIndex, CKSAAP, Disorder). Secondly, the ranked feature set was generated by the FCBF method according to the symmetric uncertainty metric. Thirdly, during model training, the over- and under-sampling technique was used to tackle the imbalanced dataset. Finally, the Incremental Feature Selection (IFS) method was adopted to extract an optimal classifier based on 10-fold cross-validation. Results and Conclusion: The results show that the model has significant performance advantages in indicators such as MCC, Recall, and F1-score, whether it is compared with other classifiers on the independent test set or evaluated by cross-validation with single-type features or fusion features on the training set.
By integrating the FCBF feature ranking method, the over- and under-sampling technique, and a stacking model composed of multiple base classifiers, an effective prediction model for nitration PTM sites was built, which can achieve a better recall rate when the ratio of positive to negative samples is highly imbalanced.
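The symmetric uncertainty metric mentioned in the Methods is defined for discrete variables as SU(X, Y) = 2 · I(X; Y) / (H(X) + H(Y)), where H is entropy and I is mutual information. The sketch below implements that definition and a plain ranking of features by SU against the class label. It covers only the ranking step; the redundancy-removal stage of full FCBF is not shown, continuous features would need discretising first, and the function and feature names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in the range [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))  # I = H(X) + H(Y) - H(X, Y)
    return 2 * mutual_info / (hx + hy)

def rank_features(columns, labels):
    """Sort feature names by SU with the labels, highest first."""
    scores = {name: symmetric_uncertainty(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

labels = [0, 0, 1, 1]
columns = {"informative": [0, 0, 1, 1], "noise": [0, 1, 0, 1]}
print(rank_features(columns, labels))
```

Because SU is normalised by the two entropies, it avoids the bias of raw information gain toward features with many distinct values.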


Genes ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 296
Author(s):  
Zeeshan Abbas ◽  
Hilal Tayara ◽  
Kil To Chong

Among DNA modifications, N4-methylcytosine (4mC) is one of the most significant, and it is linked to the development of cell proliferation and gene expression. To understand its different biological functions, accurate detection of 4mC sites is required. Although several techniques exist for the prediction of 4mC sites in different genomes, based on both machine learning (ML) and convolutional neural networks (CNNs), there is no CNN-based tool for the identification of 4mC sites in the mouse genome. In this article, a CNN-based model named 4mCPred-CNN was developed to classify 4mC locations in the mouse genome. Until now, only two ML-based models existed for this purpose; they utilized several feature encoding schemes and still left considerable room to improve the prediction accuracy. Utilizing only a single feature encoding scheme, one-hot encoding, we outperformed both of the previous ML-based techniques. In a ten-fold validation test, the proposed model, 4mCPred-CNN, achieved an accuracy of 85.71% and a Matthews correlation coefficient (MCC) of 0.717. On an independent dataset, the achieved accuracy was 87.50% with an MCC value of 0.750. The results show that the proposed model can be of great use for researchers in the fields of biology and bioinformatics.
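One-hot encoding turns a DNA sequence into a matrix with one row per nucleotide and one column per base, which is the usual input shape for a 1D CNN. The sketch below is a generic illustration; the channel order A, C, G, T and the all-zero row for unknown bases such as N are assumptions, not details taken from the abstract.

```python
# Assumed channel order: A, C, G, T.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows (length x 4 matrix)."""
    # Unknown symbols (e.g. N) map to an all-zero row by assumption.
    return [list(ONE_HOT.get(base, [0, 0, 0, 0])) for base in sequence.upper()]

for row in one_hot_encode("ACGTN"):
    print(row)
```

The encoding carries no hand-crafted biological knowledge, so any sequence patterns have to be learned by the convolutional filters themselves.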


2020 ◽  
Vol 0 (0) ◽  
Author(s):  
Xuyang Pan ◽  
Laijun Sun ◽  
Guobing Sun ◽  
Panxiang Rong ◽  
Yuncai Lu ◽  
...  

Neutral detergent fiber (NDF) content is a critical indicator of fiber in corn stover. This study aimed to develop a prediction model to precisely measure NDF content in corn stover using the near-infrared spectroscopy (NIRS) technique. Spectral data ranging from 400 to 2500 nm were obtained by scanning 530 samples, and Monte Carlo cross-validation and pretreatment were used to preprocess the original spectra. Moreover, interval partial least squares (iPLS) was employed to extract feature wavebands and reduce the amount of computation. The PLSR model was built using two spectral regions and was evaluated with the coefficient of determination (R²) and the root mean square error of cross-validation (RMSECV), obtaining 0.97 and 0.65%, respectively. The overall results showed that the developed prediction model, coupled with spectral data analysis, provides a theoretical foundation for applying NIRS techniques to measuring fiber content in corn stover.
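The two evaluation statistics named here have simple closed forms: R² = 1 − SS_res / SS_tot, and the root mean square error is the square root of the mean squared residual (RMSECV is this quantity computed on cross-validated predictions). The sketch below uses made-up numbers purely to show the arithmetic; the function names are hypothetical.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as y."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up reference values and held-out predictions (illustrative only).
measured = [60.0, 62.5, 65.0, 67.5, 70.0]
predicted = [60.4, 62.0, 65.3, 67.1, 70.2]
print(r_squared(measured, predicted), rmse(measured, predicted))
```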


Animals ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 104
Author(s):  
Shulin Liang ◽  
Chaoqun Wu ◽  
Wenchao Peng ◽  
Jian-Xin Liu ◽  
Hui-Zeng Sun

The objective of this study was to evaluate the feasibility of using the dry matter intake of the first 2 h after feeding (DMI-2h), body weight (BW), and milk yield to estimate daily DMI in mid- and late-lactation dairy cows fed rations three times per day. Our dataset included 2840 individual observations from 76 cows enrolled in two studies, of which 2259 observations from 54 cows served as the development dataset (DDS) and 581 observations from 22 cows served as the validation dataset (VDS). The descriptive statistics of these variables were 26.0 ± 2.77 kg/day (mean ± standard deviation) of DMI, 14.9 ± 3.68 kg/day of DMI-2h, 35.0 ± 5.48 kg/day of milk yield, and 636 ± 82.6 kg of BW in DDS, and 23.2 ± 4.72 kg/day of DMI, 12.6 ± 4.08 kg/day of DMI-2h, 30.4 ± 5.85 kg/day of milk yield, and 597 ± 63.7 kg of BW in VDS. A multiple regression analysis was conducted using the REG procedure of SAS to develop the forecasting models for DMI. The proposed prediction equation was: DMI (kg/day) = 8.499 + 0.2725 × DMI-2h (kg/day) + 0.2132 × Milk yield (kg/day) + 0.0095 × BW (kg) (R² = 0.46, mean bias = 0 kg/day, RMSPE = 1.26 kg/day). Moreover, when compared with the prediction equation for DMI in Nutrient Requirements of Dairy Cattle (2001) using the independent dataset (VDS), our proposed model shows a higher R² (0.22 vs. 0.07) and a smaller mean bias (−0.10 vs. 1.52 kg/day) and RMSPE (1.77 vs. 2.34 kg/day). Overall, we constructed a feasible forecasting model with better precision and accuracy in predicting the daily DMI of dairy cows in mid and late lactation when fed rations three times per day.
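The prediction equation in this abstract can be evaluated directly. The sketch below wraps the published coefficients in a function (the name `predict_dmi` is hypothetical) and, as a sanity check, plugs in the DDS mean values quoted above; BW is taken to be in kg.

```python
def predict_dmi(dmi_2h, milk_yield, body_weight):
    """Estimate daily dry matter intake (kg/day) from the abstract's equation.

    dmi_2h      -- dry matter intake in the first 2 h after feeding, kg/day
    milk_yield  -- kg/day
    body_weight -- kg
    """
    return 8.499 + 0.2725 * dmi_2h + 0.2132 * milk_yield + 0.0095 * body_weight

# DDS means from the abstract: DMI-2h 14.9, milk yield 35.0, BW 636.
print(round(predict_dmi(14.9, 35.0, 636), 2))  # about 26.06
```

Evaluated at the DDS means, the equation returns roughly 26.06 kg/day, close to the reported mean DMI of 26.0 kg/day, which is what one would expect from a least-squares fit with an intercept.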

