Absent from DNA and protein: genomic characterization of nullomers and nullpeptides across functional categories and evolution
Abstract
Nullomers and nullpeptides are short DNA or amino acid sequences that are absent from a genome or proteome, respectively. One potential cause of their absence is that they have a detrimental impact on an organism. Here, we identified all possible nullomers and nullpeptides in the genomes and proteomes of over thirty species and show that a significant proportion of these sequences are under negative selection. We assign nullomers to different functional categories (coding sequences, exons, introns, 5’UTR, 3’UTR and promoters) and show that nullomers from coding sequences and promoters are most likely to be selected against. Utilizing variants in the human population, we annotate variant-associated nullomers, highlighting their potential use as DNA ‘fingerprints’. Phylogenetic analyses of nullomers and nullpeptides across evolution show that they could be used to build phylogenetic trees. Our work provides a catalog of genome- and proteome-derived absent k-mers, together with a novel scoring function to determine their potential functional importance. In addition, it shows how these unique sequences could be used as DNA ‘fingerprints’ or for phylogenetic analyses.
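To make the definition of an absent k-mer concrete, the following is a minimal sketch in Python, not the pipeline used in this work. It enumerates every possible k-mer over an alphabet and subtracts those observed in a sequence. The function name `find_nullomers`, the toy sequence, and the choice to scan a single strand only are assumptions made for illustration.

```python
from itertools import product


def find_nullomers(sequence, k, alphabet="ACGT"):
    """Return the sorted k-mers over `alphabet` that never occur in `sequence`.

    Hypothetical helper for illustration: it scans one strand only and
    holds all len(alphabet)**k candidates in memory, so it suits small k.
    """
    seq = sequence.upper()
    # Every k-mer that actually occurs, collected with a sliding window.
    observed = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # Every k-mer that could occur over the given alphabet.
    possible = {"".join(p) for p in product(alphabet, repeat=k)}
    # Windows containing other symbols (e.g. N) are not in `possible`,
    # so they do not affect the difference.
    return sorted(possible - observed)


# Toy example; the sequence is made up, not real genomic data.
toy_genome = "ACGTACGGTCA"
nullomers = find_nullomers(toy_genome, 2)
print(nullomers)
```

Passing the twenty-letter amino acid alphabet instead of `"ACGT"` gives the analogous computation for nullpeptides. For whole genomes and larger k, a memory-efficient k-mer counter would be needed in place of Python sets.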