A new 2D graphical representation of protein sequence and its application

Graphical representation is an efficient tool for the visual analysis of protein sequences. In this paper, a novel 2D graphical representation scheme is proposed on the basis of a newly introduced concept, the characteristic model of protein sequences. From the 2D graphs of the protein sequences, two numerical characterizations are designed as descriptors and used to analyze the nine ND5 protein sequences. Simulation and analysis results show that, compared with existing methods, our method is not only visual, intuitive, and simple, but also free of circuits and degeneracy. More importantly, because the storage space required by our method is constant and independent of the length of the protein sequence, it remains suitable for visual inspection of long protein sequences.

Measuring Similarity among Protein Sequences Using a New Descriptor

BioMed Research International ◽

10.1155/2019/2796971 ◽

2019 ◽

Vol 2019 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Mervat M. Abo-Elkhier ◽

Marwa A. Abd Elwahaab ◽

Moheb I. Abo El Maaty

Keyword(s):

Protein Sequence ◽

Nadh Dehydrogenase ◽

Graphical Representation ◽

Protein Sequences ◽

Computation Time ◽

Fundamental Aspect ◽

Beta Globin ◽

Nadh Dehydrogenase Subunit ◽

The Public ◽

Sequencing Technologies

The comparison of protein sequences according to similarity is a fundamental aspect of today’s biomedical research. With the development of sequencing technologies, the number of protein sequences in the public databases is increasing exponentially. The best-known sequence comparison methods are alignment based. They generally give excellent results when the sequences under study are closely related, but they are time consuming. Herein, a new alignment-free method is introduced. Our technique depends on a new graphical representation and descriptor. The graphical representation is a simple way to visualize protein sequences. The descriptor compresses the primary sequence into a single vector composed of only two values. Our approach gives good results with both short and long sequences in little computation time. It is applied to nine beta globin, nine ND5 (NADH dehydrogenase subunit 5), and 24 spike protein sequences. Correlation and significance analyses are also introduced to compare our similarity/dissimilarity results with the results of other approaches and with sequence homology.

Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207321666180130100838 ◽

2018 ◽

Vol 21 (2) ◽

pp. 100-110 ◽

Cited By ~ 3

Author(s):

Chun Li ◽

Jialing Zhao ◽

Changzhong Wang ◽

Yuhua Yao

Keyword(s):

Dna Binding ◽

Protein Sequence ◽

Protein Identification ◽

Binding Proteins ◽

Graphical Representation ◽

Sequence Data ◽

Protein Sequences ◽

Dna Binding Proteins ◽

Support Vector ◽

Letter Sequence

Aim and Objective: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for encoding protein sequences in a timely manner and extracting the hidden information. Methods: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. Results: By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvements in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. Conclusion: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.

FEGS: a novel feature extraction model for protein sequences and its applications

BMC Bioinformatics ◽

10.1186/s12859-021-04223-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Zengchao Mu ◽

Ting Yu ◽

Xiaoping Liu ◽

Hongyu Zheng ◽

Leyi Wei ◽

...

Keyword(s):

Feature Extraction ◽

Protein Sequence ◽

Graphical Representation ◽

Sequence Data ◽

Protein Sequences ◽

Statistical Features ◽

Research Areas ◽

Protein Functions ◽

Protein Sequence Data ◽

Extraction Model

Background: Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions. Results: In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model of protein sequences, by developing a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids and effectively employing the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than all of the other compared methods. Conclusion: The FEGS method is carefully designed and is practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in related studies of protein sequence analysis.

3D Graphical Representation of Protein Sequences Based on Conformational Parameters of Amino Acids

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.989-994.3599 ◽

2014 ◽

Vol 989-994 ◽

pp. 3599-3604

Author(s):

Qian Jun Xiao ◽

Zong Gang Deng

Keyword(s):

Amino Acids ◽

Protein Sequence ◽

Graphical Representation ◽

Protein Sequences ◽

3D Graphical Representation ◽

Protein Graphs

Based on the helix, β-sheet, and β-turn conformational parameters of the 20 amino acids, we propose a new 3D graphical representation of protein sequences without circuit or degeneracy, which may reflect the innate structure of the protein sequence. Then the numerical characterizations of the protein graphs, namely the leading eigenvalues of the L/L matrices associated with the graphical curves for protein sequences, were utilized as descriptors to analyze the similarity/dissimilarity of the nine ND5 protein sequences.

A 2D Non-degeneracy Graphical Representation of Protein Sequence and Its Applications

Current Bioinformatics ◽

10.2174/1574893615666200106114337 ◽

2020 ◽

Vol 15 (7) ◽

pp. 758-766

Author(s):

Xiaoli Xie ◽

Yunxiu Zhao

Keyword(s):

Amino Acids ◽

Amino Acid ◽

Protein Sequence ◽

Graphical Representation ◽

Protein Sequences ◽

Important Research ◽

Physiochemical Properties ◽

New Methods ◽

Alignment Free ◽

Highly Correlated

Background: The comparison of protein sequences is an important research field in bioinformatics. Many alignment-free methods have been proposed. Objective: To mine more information from protein sequences, this study focuses on a new alignment-free method based on the physiochemical properties of amino acids. Methods: An average physiochemical value (Apv) has been defined. For a given protein sequence, a 2D curve was outlined based on the Apv and the position of each amino acid, and the curve has no loops or self-intersections. According to the curve, the similarity/dissimilarity of the protein sequences can be analyzed. Results and Conclusion: Two groups of protein sequences are taken as examples to illustrate the new method. The protein sequences can be classified correctly, and the results are highly correlated with those of ClustalW. The new method is simple and effective.
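
The abstract above does not define Apv, so the following is only a rough sketch of the general idea of a position-indexed 2D curve, not the authors' method. It assumes the Kyte-Doolittle hydropathy index as a stand-in property, and the function name `curve_points` is hypothetical. Because the x coordinate is the residue position and strictly increases, a curve drawn through these points cannot loop back on itself or cross itself.

```python
# Kyte-Doolittle hydropathy index, used here ONLY as a stand-in property.
# The paper's Apv values are not given in the abstract.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def curve_points(sequence):
    """Map a protein sequence to 2D points (position, property value).

    Positions are 1-based. A KeyError is raised for any residue that is
    not one of the 20 standard amino acids.
    """
    return [
        (position, HYDROPATHY[residue])
        for position, residue in enumerate(sequence, start=1)
    ]


if __name__ == "__main__":
    for point in curve_points("MKV"):
        print(point)
```

Two sequences could then be compared by any distance between their point lists; which distance the paper uses is not stated in the abstract.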

DV-Curve Representation of Protein Sequences and Its Application

Computational and Mathematical Methods in Medicine ◽

10.1155/2014/203871 ◽

2014 ◽

Vol 2014 ◽

pp. 1-8 ◽

Cited By ~ 1

Author(s):

Wei Deng ◽

Yihui Luan

Keyword(s):

Amino Acids ◽

Protein Sequence ◽

Graphical Representation ◽

Protein Sequences ◽

Quantitative Comparison ◽

The Other ◽

Hp Model ◽

Dual Vector ◽

Numerical Characterization ◽

Curve Representation

Based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose a dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent each letter of a protein sequence. This graphical representation not only avoids degeneracy, but also gives good visualization no matter how long the sequences are, and can reflect the length of the protein sequence. Then we transform the 2D graphical representation into a numerical characterization that can facilitate quantitative comparison of protein sequences. The utility of this approach is illustrated by two examples: one is a similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures; the other is a phylogenetic analysis among coronaviruses based on their spike proteins.

An Efficient Tool for Searching Maximal and Super Maximal Repeats in Large DNA/Protein Sequences via Induced-Enhanced Suffix Array

Recent Patents on Computer Science ◽

10.2174/2213275911666181107095645 ◽

2019 ◽

Vol 12 (2) ◽

pp. 128-134

Author(s):

Sanjeev Kumar ◽

Suneeta Agarwal ◽

Ranvijay

Keyword(s):

Protein Sequences ◽

Input Sequence ◽

Suffix Array ◽

Secondary Memory ◽

Time And Space ◽

Efficient Tool ◽

Frequency Distributions ◽

Alignment Algorithms ◽

Common Prefix ◽

Art Works

Background: DNA and protein sequences of an organism contain a variety of repeated structures of various types. These repeated structures play an important role in molecular biology as they are related to genetic backgrounds of inherited diseases. They also serve as markers for DNA mapping and DNA fingerprinting. Efficient searching of maximal and super maximal repeats in DNA/Protein sequences can lead to many other applications in the area of genomics. Moreover, these repeats can also be used for identification of critical diseases by finding the similarity between frequency distributions of repeats in viruses and genomes (without using alignment algorithms). Objective: The study aims to develop an efficient tool for searching maximal and super maximal repeats in large DNA/Protein sequences. Methods: The proposed tool uses a newly introduced data structure, the Induced Enhanced Suffix Array (IESA). IESA is an extension of the enhanced suffix array. It uses an induced suffix array instead of the classical suffix array. IESA consists of an Induced Suffix Array (ISA) and an additional array, the Longest Common Prefix (LCP) array. ISA is an array of all sorted suffixes of the input sequence, while the LCP array stores the lengths of the longest common prefixes between all pairs of consecutive suffixes in the induced suffix array. IESA is known to be efficient w.r.t. both time and space. It facilitates the use of secondary memory for constructing a large suffix array. Results: An open source standalone tool named MSR-IESA for searching maximal and super maximal repeats in DNA/Protein sequences is provided at https://github.com/sanjeevalg/MSRIESA. Experimental results show that the proposed algorithm outperforms other state-of-the-art works w.r.t. both time and space. Conclusion: The proposed tool MSR-IESA is remarkably efficient for the analysis of DNA/Protein sequences having maximal and super maximal repeats of any length. It can be used for identification of well-known diseases.
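
To make the two arrays in the abstract above concrete, here is a minimal sketch. It is not the MSR-IESA implementation. The suffix array is built by naive sorting rather than induced sorting, the LCP array uses Kasai's algorithm, and the helper `longest_repeat` finds only the longest repeated substring, which is a simpler task than enumerating maximal or super maximal repeats. All function names are illustrative.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction by sorting; fine for short inputs, far slower than
    the induced-sorting construction the abstract refers to.
    """
    return sorted(range(len(text)), key=lambda start: text[start:])


def build_lcp_array(text, suffix_array):
    """Kasai's algorithm: lcp[k] is the length of the longest common prefix
    of the suffixes at suffix_array[k - 1] and suffix_array[k]; lcp[0] is 0.
    """
    n = len(text)
    rank = [0] * n
    for position, start in enumerate(suffix_array):
        rank[start] = position
    lcp = [0] * n
    matched = 0
    for i in range(n):
        if rank[i] == 0:
            matched = 0
            continue
        j = suffix_array[rank[i] - 1]  # predecessor suffix in sorted order
        while i + matched < n and j + matched < n and text[i + matched] == text[j + matched]:
            matched += 1
        lcp[rank[i]] = matched
        if matched > 0:
            matched -= 1  # the next suffix shares at least matched - 1 characters
    return lcp


def longest_repeat(text):
    """Return the longest substring that occurs at least twice ('' if none)."""
    if not text:
        return ""
    suffix_array = build_suffix_array(text)
    lcp = build_lcp_array(text, suffix_array)
    best = max(range(len(text)), key=lambda k: lcp[k])
    return text[suffix_array[best]:suffix_array[best] + lcp[best]]


if __name__ == "__main__":
    sa = build_suffix_array("banana")
    print(sa, build_lcp_array("banana", sa), longest_repeat("banana"))
```

A large LCP entry marks two adjacent sorted suffixes that share a long prefix, which is why repeat finders scan this array instead of comparing all pairs of positions.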

A Graphical Representation of Protein Sequences and Its Applications

Proceedings of the Fourth International Conference on Biological Information and Biomedical Engineering ◽

10.1145/3403782.3403812 ◽

2020 ◽

Author(s):

Ping-An He ◽

Linlin Yan ◽

Tianyu Zhu

Keyword(s):

Graphical Representation ◽

Protein Sequences

Protein Sequence Classification with Improved Extreme Learning Machine Algorithms

BioMed Research International ◽

10.1155/2014/103054 ◽

2014 ◽

Vol 2014 ◽

pp. 1-12 ◽

Cited By ~ 51

Author(s):

Jiuwen Cao ◽

Lianglin Xiong

Keyword(s):

Extreme Learning Machine ◽

Protein Sequence ◽

Protein Sequences ◽

Activation Function ◽

Majority Voting ◽

Training Algorithms ◽

Sequence Classification ◽

Protein Sequence Classification ◽

Learning Machine ◽

Majority Voting Method

Precisely classifying a protein sequence from a large biological protein sequence database plays an important role in developing competitive pharmacological products. Conventional methods, which compare the unseen sequence with all the identified protein sequences and return the category index of the protein with the highest similarity score, are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single hidden layer feedforward neural networks (SLFNs). The recent efficient extreme learning machine (ELM) and its variants are utilized as the training algorithms. The optimal pruned ELM (OP-ELM) is first employed for protein sequence classification in this paper. To further enhance the performance, an ensemble based SLFNs structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensembles. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms.
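
The majority voting step mentioned in the abstract above can be sketched as follows. This covers only the vote, not the ELM or OP-ELM training. The classifiers here are toy stand-in functions, and the names `majority_vote` and `ensemble_predict` are illustrative rather than taken from the paper.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def ensemble_predict(classifiers, sample):
    """Ask every classifier for a label and return the majority label."""
    votes = [classify(sample) for classify in classifiers]
    return majority_vote(votes)


if __name__ == "__main__":
    # Toy stand-ins for trained networks: each labels a sequence by length
    # with a different threshold. Real ensemble members would be SLFNs.
    classifiers = [
        lambda seq: "long" if len(seq) > 5 else "short",
        lambda seq: "long" if len(seq) > 8 else "short",
        lambda seq: "long" if len(seq) > 11 else "short",
    ]
    print(ensemble_predict(classifiers, "MKTAYIAKQR"))
```

Using an odd number of ensemble members avoids ties in two-class problems; with more classes a tie-breaking rule such as the one above is still needed.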

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.
