Sequence-Order Frequency Matrix - Sampling and Machine learning with Smith-Waterman (SOFM-SMSW) for Protein Remote Homology Detection

Abstract
Background: Protein Remote Homology Detection (PRHD) is used to find homologous proteins that are similar in function and structure but share low sequence identity. The Sequence-Order Frequency Matrix (SOFM) has generally been used for protein remote homology detection. In the SOFM Top-n-gram (SOFM-Top) algorithm, the probability of substrings is calculated based on the highest probability value of substrings. The SOFM-Smith Waterman (SOFM-SW) algorithm, in turn, combines the SOFM with local alignment for protein remote homology detection. However, the computational complexity of SOFM-based PRHD is high, since it processes all protein sequences in the SOFM.
Objective: The Sequence-Order Frequency Matrix - Sampling and Machine learning with Smith-Waterman (SOFM-SMSW) algorithm is proposed for predicting protein remote homology. The SOFM-SMSW algorithm uses the PVS method to select the optimum target sequences based on the uniform distribution measure.
Method: This research work considers the most important sequences for PRHD by introducing Proportional Volume Sampling (PVS). After sampling the protein sequences, a feature vector is constructed and labeling is performed based on the concatenation between two protein sequences. Then, a substitution score that represents the structural alignment is learned using k-Nearest Neighbor (k-NN). Based on the learned substitution score and the alignment score, protein homology is detected using the Smith-Waterman algorithm and a Support Vector Machine (SVM). Selecting the most important sequences improves the accuracy of PRHD, and using structural alignment along with local alignment reduces its computational complexity.
Results: The performance of the proposed SOFM-SMSW algorithm is tested on the SCOP database and compared with various existing algorithms such as SVM Top-N-gram, SVM pairwise, GPkernal, Long Short-Term Memory (LSTM), SOFM Top-N-gram and SOFM-SW.
Conclusion: The experimental results illustrate that the proposed SOFM-SMSW algorithm has better accuracy, precision, recall, ROC and ROC 50 for PRHD than the other existing algorithms.
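
For readers unfamiliar with the local alignment step named in the abstract, the following is a minimal sketch of the textbook Smith-Waterman recurrence with a linear gap penalty. It is not the SOFM-SMSW implementation: the function name `smith_waterman` and the fixed `match`, `mismatch` and `gap` scores are assumptions made for illustration, whereas the abstract describes a substitution score learned with k-NN.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b).

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

In a pipeline like the one the abstract outlines, the fixed `match`/`mismatch` values would presumably be replaced by a lookup into a substitution table, and the resulting alignment score would be passed on as a feature to a classifier.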


A discriminative method for protein remote homology detection based on N-Gram

Genetics and Molecular Research ◽

10.4238/2015.january.15.9 ◽

2015 ◽

Vol 14 (1) ◽

pp. 69-78 ◽

Cited By ~ 2

Author(s):

S. Xie ◽

P. Li ◽

Y. Jiang ◽

Y. Zhao

Keyword(s):

Homology Detection ◽

Remote Homology ◽

Discriminative Method ◽

N Gram ◽

Remote Homology Detection


PROFILE-BASED STRING KERNELS FOR REMOTE HOMOLOGY DETECTION AND MOTIF EXTRACTION

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972000500120x ◽

2005 ◽

Vol 03 (03) ◽

pp. 527-550 ◽

Cited By ~ 77

Author(s):

RUI KUANG ◽

EUGENE IE ◽

KE WANG ◽

KAI WANG ◽

MAHIRA SIDDIQI ◽

...

Keyword(s):

Structural Features ◽

Support Vector ◽

Protein Classification ◽

Svm Classifier ◽

Sequence Motifs ◽

Homology Detection ◽

String Kernels ◽

Remote Homology ◽

Structure Information ◽

Remote Homology Detection

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" — short regions of the original profile that contribute almost all the weight of the SVM classification score — and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets. Supplementary website:.
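
The k-mer matching idea behind string kernels can be illustrated with a much simpler relative of the profile kernel. The sketch below computes the plain k-spectrum kernel, which counts exact shared k-mers between two sequences. It does not use PSI-BLAST profiles or mutation neighborhoods, so it is not the kernel proposed in the paper; the names `kmer_counts` and `spectrum_kernel` and the default `k=3` are assumptions chosen for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-length substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors of s and t (exact matches only)."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(count * ct[kmer] for kmer, count in cs.items() if kmer in ct)
```

A kernel value computed this way for every pair of sequences yields a Gram matrix that an SVM with a precomputed kernel can consume. The profile kernel described in the abstract relaxes the exact-match requirement by letting a k-mer also match the entries of a position-dependent mutation neighborhood.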


Improving model construction of profile HMMs for remote homology detection through structural alignment

BMC Bioinformatics ◽

10.1186/1471-2105-8-435 ◽

2007 ◽

Vol 8 (1) ◽

Cited By ~ 10

Author(s):

Juliana S Bernardes ◽

Alberto MR Dávila ◽

Vítor S Costa ◽

Gerson Zaverucha

Keyword(s):

Structural Alignment ◽

Model Construction ◽

Homology Detection ◽

Profile Hmms ◽

Remote Homology ◽

Remote Homology Detection


SVM-HUSTLE--an iterative semi-supervised machine learning approach for pairwise protein remote homology detection

Bioinformatics ◽

10.1093/bioinformatics/btn028 ◽

2008 ◽

Vol 24 (6) ◽

pp. 783-790 ◽

Cited By ~ 25

Author(s):

A. R. Shah ◽

C. S. Oehmen ◽

B.-J. Webb-Robertson

Keyword(s):

Machine Learning ◽

Supervised Machine Learning ◽

Learning Approach ◽

Homology Detection ◽

Remote Homology ◽

Machine Learning Approach ◽

Remote Homology Detection


Reducing Dimensionality in Remote Homology Detection

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39417 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1052-1054

Author(s):

S. Dinesh

Keyword(s):

Protein Sequence ◽

Protein Sequences ◽

High Dimensional ◽

Homology Detection ◽

Protein Families ◽

Classification Techniques ◽

Remote Homology ◽

High Dimensional Datasets ◽

Remote Homology Detection

Abstract: Homology detection plays a major role in bioinformatics. Different types of methods are used for homology detection. Here we extract information from protein sequences and then use various algorithms to predict the similarity between protein families. SVM is the most commonly used algorithm in homology detection. Classification techniques are not suitable for homology detection because they are not suitable for high-dimensional datasets. Reducing the dimensionality is therefore very important, so that the similarity of protein families can be predicted more easily. Keywords: Homology detection, Protein, Sequence, Reducing dimensionality, BLAST, SCOP.
