Self-Analysis of Repeat Proteins Reveals Evolutionarily Conserved Patterns
Abstract

Background
Protein repeats can confound sequence analyses because the repetitiveness of their amino acid sequences makes it difficult to determine whether similar repeats arose by convergent or divergent evolution. We noted that the patterns derived from traditional dot plot protein self-analysis tended to be conserved in sets of related repeat proteins, and that this conservation could be quantified using a standard Jaccard metric.

Results
Use of these plots obviated the issues that sequence similarity causes in the analysis of these proteins. The dot plot patterns decay quickly in the absence of selective pressure, with an expected 50% loss of Jaccard similarity for an 8.2% loss of sequence identity. Comparison of repeat and non-repeat proteins in the PDB suggested that the information content in dot plots could be used to identify repeat proteins from sequence alone. Analysis of the UniProt90 database suggested that 16.9% of all known proteins could be classified as repeat proteins. These 13.3 million putative repeat protein chains were clustered, and a large majority (82.9%) of the clusters containing between 5 and 200 members were of a single functional type.

Conclusions
Dot plot analysis of repeat proteins obviates the issues that arise from sequence degeneracy. These results show that this kind of analysis can be applied efficiently to repeat proteins on a large scale.
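
To make the idea of comparing self dot plots concrete, the following minimal Python sketch builds a self dot plot for a sequence and scores the overlap of two such plots with the Jaccard index. It is an illustration only and not the authors' implementation: the function names, the exact-match scoring, the `window` parameter, the toy sequences, and the restriction to plots indexed by the same positions are all assumptions made for this example.

```python
def self_dot_plot(seq, window=1):
    """Return the off-diagonal dots of a sequence compared against itself.

    A dot (i, j) with i < j is recorded when the length-`window` segments
    starting at positions i and j are identical. Exact matching and the
    `window` parameter are assumptions of this sketch.
    """
    n = len(seq) - window + 1
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if seq[i:i + window] == seq[j:j + window]
    }


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets of dots."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty plots are identical
    return len(a & b) / len(a | b)


# Toy sequences (hypothetical, not taken from the paper).
plot_a = self_dot_plot("ABCABC")
plot_b = self_dot_plot("ABCABD")
similarity = jaccard(plot_a, plot_b)
```

In this toy case a single substitution removes one of the three off-diagonal dots, so the two plots share two dots out of three. Comparing proteins of different lengths would additionally require some way of placing the two plots on common coordinates, which this sketch does not attempt.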