New Method for Sequence Similarity Analysis Based on the Position and Frequency of Statistically Significant Repeats
Background: The analysis of DNA nucleotide sequence similarity among different species is crucial in identifying their functional, structural, or evolutionary relationships. The number of bioinformatics tools designed to perform similarity analysis of nucleotide sequences has been growing rapidly. According to the current literature, alignment-free methods have not been applied to nucleotide sequence repeats of different lengths.
Objective: To develop a new algorithm for determining sequence characteristics and similarity based on statistically significant repetitive elements of different lengths that are located in the analyzed sequences.
Method: This paper presents the Repeats-Position/Frequency method (R-P/F method) for determining nucleotide sequence similarity, which takes into consideration the statistically significant repetitive parts of the analyzed sequences. It is based on information theory and on the fact that the positions and frequencies of repeated sequences are not expected to be identical in a random sequence of the same length. Nucleotide sequences are represented in an n-dimensional vector space, and their hierarchy is constructed by applying a hierarchical clustering algorithm.
Results: The R-P/F method has been validated on multiple data sets of nucleotide sequences and compared with results obtained from the alignment-based algorithms BLAST and Clustal Omega and from multiple well-established alignment-free dissimilarity measures. The presented method provides results comparable to those of other commonly used methods that address the same problem, while offering a new perspective through its use of the repetitive parts of sequences in these calculations.
Conclusion: The presented novel algorithm for calculating a sequence similarity measure is effective in discovering relationships among sequences and is a powerful, complementary addition to existing sequence similarity methods.
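The general idea described under Method (repeats of several lengths, their positions and frequencies, a vector representation, then hierarchical clustering) can be illustrated with a small sketch. This is not the R-P/F method itself: the abstract does not specify the significance test, the vector layout, or the distance and linkage used, so everything below is an assumption made for illustration. In particular, "statistically significant" is replaced by a simple "occurs at least `min_count` times" filter, each repeat contributes its relative frequency and mean relative position, and clustering is naive average linkage on Euclidean distance. The function names and toy sequences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import math


def repeat_features(seq, lengths=(2, 3), min_count=2):
    """Map each repeated k-mer to (relative frequency, mean relative position)."""
    feats = {}
    for k in lengths:
        n_windows = len(seq) - k + 1
        if n_windows <= 0:
            continue
        positions = defaultdict(list)
        for i in range(n_windows):
            positions[seq[i:i + k]].append(i)
        for kmer, pos in positions.items():
            # Assumption: a count threshold stands in for a real significance test.
            if len(pos) >= min_count:
                freq = len(pos) / n_windows
                mean_pos = sum(pos) / len(pos) / len(seq)
                feats[kmer] = (freq, mean_pos)
    return feats


def to_vectors(seqs, **kw):
    """Place all sequences in one shared vector space (two coordinates per repeat)."""
    all_feats = {name: repeat_features(s, **kw) for name, s in seqs.items()}
    keys = sorted({k for f in all_feats.values() for k in f})
    vectors = {}
    for name, f in all_feats.items():
        vec = []
        for k in keys:
            vec.extend(f.get(k, (0.0, 0.0)))  # absent repeat -> zeros
        vectors[name] = vec
    return vectors


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(vectors):
    """Naive agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    def dist(c1, c2):
        total = sum(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: dist(*p))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy input: s1 and s2 share most repeats, s3 shares none with them.
seqs = {"s1": "ACGTACGTACGT", "s2": "ACGTACGTACGA", "s3": "TTTTGGGGTTTT"}
for a, b, d in average_linkage(to_vectors(seqs)):
    print(a, b, round(d, 3))
```

On this toy input the first merge joins `s1` and `s2`, and `s3` is attached last. A real analysis would replace the count threshold with a test against the repeat content expected in a random sequence of the same length, and would likely use an established clustering library rather than this quadratic loop.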