FermatS: a novel numerical representation for protein sequence comparison and DNA-binding protein identification

Author(s):  
Yanping Zhang ◽  
Ya Gao ◽  
Jianwei Ni ◽  
Pengcheng Chen ◽  
Xiaosheng Wang

Aim and Objective: Given the rapidly increasing amount of molecular biology data available, computational methods of low complexity are necessary to infer protein structure, function, and evolution. Method: In this work, we propose a novel method, FermatS, which extracts feature information from protein sequences based on global position information from the curve and local position representation from normalized moments of inertia. Furthermore, we use the features generated by FermatS to analyze the similarity/dissimilarity of nine ND5 proteins and to build a prediction model for DNA-binding proteins based on logistic regression with 5-fold cross-validation. Results: In the similarity/dissimilarity analysis of nine ND5 proteins, the results are consistent with evolutionary theory. Moreover, the method can effectively predict DNA-binding proteins in realistic situations. Conclusion: The findings demonstrate that the proposed method is effective for comparing, recognizing and predicting protein sequences. The main code and datasets can be downloaded from https://github.com/GaoYa1122/FermatS.
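The 5-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch of how samples might be partitioned into folds. The function name `k_fold_indices`, the shuffling seed, and the fold-assignment scheme are assumptions made for illustration only; they are not taken from the FermatS code, and the feature extraction and logistic regression steps are not shown.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: every sample lands in exactly one test fold.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once so folds are not tied to the original sample order.
    random.Random(seed).shuffle(indices)
    # Striding by k keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train_idx, test_idx


if __name__ == "__main__":
    for fold, (train_idx, test_idx) in enumerate(k_fold_indices(23, k=5)):
        print(fold, len(train_idx), len(test_idx))
```

In such a setup, a classifier would be fitted on each `train_idx` subset and evaluated on the matching `test_idx` subset, with the five scores averaged.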


2018 ◽  
Vol 21 (2) ◽  
pp. 100-110 ◽  
Author(s):  
Chun Li ◽  
Jialing Zhao ◽  
Changzhong Wang ◽  
Yuhua Yao

Aim and Objective: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for encoding protein sequences in a timely manner and extracting the hidden information. Methods: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. Results: By using the proposed mathematical descriptor of a protein sequence, similarity comparisons were made among β-globin proteins of 17 species and among 72 spike proteins of coronaviruses. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC-based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. Conclusion: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
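The comparison above is reported in terms of ACC, MCC, and F1M. As a rough sketch, these metrics can be computed from binary confusion-matrix counts using their standard definitions. The function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions for illustration, and F1M is taken here to mean the usual F1 measure; the counts in the usage example are made up and are not results from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, MCC, F1) from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    acc = (tp + tn) / total
    # Matthews correlation coefficient; defined as 0.0 here when any
    # marginal count is zero (a common convention, assumed for this sketch).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return acc, mcc, f1


if __name__ == "__main__":
    # Illustrative counts only.
    acc, mcc, f1 = binary_metrics(tp=40, tn=45, fp=5, fn=10)
    print(f"ACC={acc:.3f} MCC={mcc:.3f} F1={f1:.3f}")
```

MCC is often preferred over ACC on imbalanced benchmarks, such as a dataset expanded with negative samples, because it uses all four cells of the confusion matrix.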


2014 ◽  
Vol 2014 ◽  
pp. 1-10 ◽  
Author(s):  
Ruifeng Xu ◽  
Jiyun Zhou ◽  
Bin Liu ◽  
Lin Yao ◽  
Yulan He ◽  
...  

DNA-binding proteins are crucial for various cellular processes, such as recognition of specific nucleotides, regulation of transcription, and regulation of gene expression. Developing an effective model for identifying DNA-binding proteins is an urgent research problem. Up to now, many methods have been proposed, but most of them focus on only one classifier and cannot make full use of the large number of negative samples to improve prediction performance. This study proposed a predictor called enDNA-Prot for DNA-binding protein identification by employing the ensemble learning technique. Experimental results showed that enDNA-Prot was comparable with DNA-Prot and outperformed DNAbinder and iDNA-Prot with performance improvement in the range of 3.97–9.52% in ACC and 0.08–0.19 in MCC. Furthermore, when the benchmark dataset was expanded with negative samples, enDNA-Prot outperformed the three existing methods by 2.83–16.63% in terms of ACC and 0.02–0.16 in terms of MCC. This indicates that enDNA-Prot is an effective method for DNA-binding protein identification and that expanding the training dataset with negative samples can improve its performance. For the convenience of the vast majority of experimental scientists, we developed a user-friendly web server for enDNA-Prot, which is freely accessible to the public.


2015 ◽  
Vol 11 (4) ◽  
pp. 1110-1118 ◽  
Author(s):  
Sony Malhotra ◽  
Ramanathan Sowdhamini

The distribution of GO molecular functions across different SCOP DNA-binding folds was studied. The majority of the folds were observed to perform more than one molecular function. This supports the notion that the majority of DNA-binding proteins might follow divergent evolution.


RSC Advances ◽  
2018 ◽  
Vol 8 (50) ◽  
pp. 28367-28375 ◽  
Author(s):  
Kuan-Lin Chen ◽  
Jen-Hao Cheng ◽  
Chih-Yang Lin ◽  
Yen-Hua Huang ◽  
Cheng-Yang Huang

Single-stranded DNA-binding proteins (SSBs) are essential to cells as they participate in DNA metabolic processes, such as DNA replication, repair, and recombination.


2017 ◽  
Vol 18 (1) ◽  
Author(s):  
Wei Wang ◽  
Lin Sun ◽  
Shiguang Zhang ◽  
Hongjun Zhang ◽  
Jinling Shi ◽  
...  

2020 ◽  
Vol 17 (4) ◽  
pp. 258-270
Author(s):  
Gaofeng Pan ◽  
Jiandong Wang ◽  
Liang Zhao ◽  
William Hoskins ◽  
Jijun Tang

Background: DNA-binding proteins are very important to many biomolecular functions. Traditional experimental methods are expensive and time-consuming, so computational methods that can predict whether a protein is a DNA-binding protein are very helpful to researchers. Machine learning has been widely used in many research areas. Many researchers have proposed machine learning methods for DNA-binding protein prediction, and this paper highlights their advantages and disadvantages. Objective: There are many computational methods that can predict DNA-binding proteins. Each method uses different features and different classifier algorithms. In this paper, a review of these methods is provided to identify common procedures that can help researchers develop more accurate methods. Methods: Firstly, the information stored in the protein sequence and gene sequence is presented. That information is the basis for finding the patterns that lead to binding. Then, feature extraction methods and classifier algorithms are discussed. Finally, some commonly used benchmark datasets are analysed and the methods are evaluated on them. Conclusion: In this review, we analyzed some popular computational methods for predicting DNA-binding proteins. From those methods, we highlighted many features necessary to build an accurate DNA-binding protein classifier. This can also help researchers build more useful computational tools. Currently, there are some machine learning methods with good performance in predicting DNA-binding proteins. The performance can be improved by using different kinds of features and classifiers.


Author(s):  
Tural Aksel ◽  
Zanlin Yu ◽  
Yifan Cheng ◽  
Shawn M. Douglas

Correct reconstruction of macromolecular structure by cryo-electron microscopy relies on accurate determination of the orientation of single-particle images. For small (<100 kDa) DNA-binding proteins, obtaining particle images with sufficiently asymmetric features to correctly guide alignment is challenging. DNA nanotechnology was conceived as a potential tool for building host nanostructures to prescribe the locations and orientations of docked proteins. We used DNA origami to construct molecular goniometers (instruments to precisely orient objects) to dock a DNA-binding protein on a double-helix stage that has user-programmable tilt and rotation angles. Each protein orientation maps to a distinct barcode pattern specifying particle classification and angle assignment. We used goniometers to obtain a 6.5 Å structure of BurrH, an 82-kDa DNA-binding protein whose helical pseudosymmetry prevents accurate image orientation using classical cryo-EM. Our approach should be adaptable for other DNA-binding proteins, and a wide variety of other small proteins, by fusing DNA binding domains to them.


2014 ◽  
Vol 602-605 ◽  
pp. 1614-1617
Author(s):  
Ming Hai Yao ◽  
Na Wang

Identifying the structure of DNA-binding proteins is of great significance for studying the mechanisms of gene expression regulation. This paper proposes a new recognition method to identify the super-secondary structure and structural domain of DNA-binding proteins. Nucleotide transition probabilities are calculated from the binding-site sequences of known DNA-binding proteins. Mouse data downloaded from TRANSFAC are used to establish the super-secondary structure recognition models. A probability score is calculated from the transition probabilities of the binding site and the background. Unlike conventional methods, this method uses neither the amino acid sequence of the protein nor homologous proteins. To verify the validity of the algorithm, 10 DNA-binding proteins from Drosophila and yeast are used in the experiment. The experimental results show that the method achieves very good recognition results.
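One way to read the scoring step described above is as a log-odds comparison between a transition model estimated from binding-site sequences and one estimated from background sequences. The sketch below follows that reading. The first-order Markov model, the add-one pseudocount, the log-ratio form of the score, and the names `transition_probs` and `log_odds_score` are all assumptions for illustration, not the authors' published procedure, and the toy sequences are made up rather than drawn from TRANSFAC.

```python
import math

ALPHABET = "ACGT"


def transition_probs(sequences, pseudocount=1.0):
    """Estimate first-order nucleotide transition probabilities P(next | prev)."""
    # Pseudocounts keep every transition probability strictly positive.
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        seq = seq.upper()
        for prev, nxt in zip(seq, seq[1:]):
            # Skip pairs containing ambiguous symbols such as N.
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    probs = {}
    for a in ALPHABET:
        row_total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / row_total for b in ALPHABET}
    return probs


def log_odds_score(seq, site_probs, background_probs):
    """Sum log(P_site / P_background) over adjacent nucleotide pairs."""
    seq = seq.upper()
    score = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        if prev in site_probs and nxt in site_probs[prev]:
            score += math.log(site_probs[prev][nxt] / background_probs[prev][nxt])
    return score


if __name__ == "__main__":
    # Toy sequences for illustration only.
    site = transition_probs(["TATATA", "TATATATA"])
    background = transition_probs(["ACGTACGT", "GGCCAATT"])
    print(log_odds_score("TATATA", site, background))
```

Under these assumptions, a positive score means the sequence's transitions look more like the binding-site model than the background model.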

