Abstract
Homopeptides (consecutive runs of a single amino-acid type) are thought to play important roles in proteome evolution, since they are prone to expansion and contraction during DNA replication, recombination and repair. It is currently unclear how homopeptide frequencies vary as organisms evolve, and which genomic and proteomic traits drive this variation. We therefore analyzed how homopeptides and homocodons (pure repeats of a single codon) vary across 405 Dikarya, and probed how this variation is linked to GC/AT bias and other factors. We observe that homopeptide frequencies for individual amino acids vary widely between clades, even between close relatives, with the AT-rich Saccharomycotina showing a distinct trend. As organisms evolve, homocodon and homopeptide numbers are strongly coupled to GC/AT bias, with genomes of intermediate GC/AT content having markedly fewer of both. Despite this, homopeptides tend to be more GC-rich than other regions of the proteome, even in AT-rich organisms, indicating that they either absorb AT bias less readily or are inherently more GC-rich. Furthermore, the purity of homopeptides (i.e., the degree to which one codon type predominates in them) varies least for amino acids with GC/AT-balanced codon repertoires and most for arginine, which has only one AT-rich codon out of six. The most frequent and most variable homopeptide amino acids have greater intrinsic disorder propensity, and annotated intrinsic disorder fractions are strongly correlated with homopeptide levels (unlike structured domain fractions, which are anti-correlated). Poly-glutamine is unique in being an evolutionarily highly variable homopeptide whose codon repertoire is unbiased for GC/AT. In summary, homopeptide and homocodon levels are coupled to, or influenced by, several factors, including GC/AT bias and amino-acid intrinsic disorder propensity.
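
To make the definitions above concrete, the following minimal Python sketch shows one way homopeptides, their codon purity and their GC content could be computed for a single protein and its coding sequence. It is an illustration only, not the pipeline used in this study: the minimum run length of 5, the reading of purity as the fraction of the most common codon in the run, and the function names are all assumptions introduced here.

```python
import re
from collections import Counter

MIN_RUN = 5  # assumed minimum run length; the abstract states no threshold


def homopeptides(protein, min_run=MIN_RUN):
    """Yield (start, end, residue) for runs of one residue with length >= min_run."""
    # (.)\1{n,} matches a residue followed by at least n more copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    for m in pattern.finditer(protein):
        yield m.start(), m.end(), m.group(1)


def codon_purity(cds, start, end):
    """Return (most common codon, its fraction) for residues start..end-1.

    Purity is taken here as the share of the predominant codon (an assumption);
    a value of 1.0 means the run is also a homocodon.
    """
    codons = [cds[3 * i:3 * i + 3] for i in range(start, end)]
    top, count = Counter(codons).most_common(1)[0]
    return top, count / len(codons)


def gc_fraction(seq):
    """Fraction of G or C bases in a nucleotide sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


# Toy example: a six-residue poly-glutamine run encoded by CAG x4 and CAA x2.
protein = "MQQQQQQK"
cds = "ATG" + "CAG" * 4 + "CAA" * 2 + "AAA"
for start, end, residue in homopeptides(protein):
    top, purity = codon_purity(cds, start, end)
    gc = gc_fraction(cds[3 * start:3 * end])
    print(residue, end - start, top, round(purity, 2), round(gc, 2))
```

In this toy case the poly-glutamine run is impure (two codon types), so it counts as a homopeptide but not as a homocodon under the assumed definitions.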