VariantKey: A Reversible Numerical Representation of Human Genetic Variants
Abstract
Human genetic variants are usually represented by four values of variable length: chromosome, position, reference allele and alternate allele. There is no guarantee that these components are represented consistently across data sources, and processing variant-based data can be inefficient because four comparison operations are needed for each variant, three of which are string comparisons. Existing variant identifiers typically do not represent every possible variant of interest, nor are they directly reversible. Similarly, genomic regions are typically represented inconsistently by three or four values. Working with strings rather than numbers poses additional challenges for memory allocation and data representation. To overcome these limitations, a novel reversible numerical encoding schema for human genetic variants (VariantKey) and genomic regions (RegionKey) is presented here, alongside a multi-language open-source software implementation (https://github.com/Genomicsplc/variantkey). VariantKey and RegionKey represent variants and regions as single 64-bit numeric entities while preserving the ability to be searched and sorted by chromosome and position. The individual components of short variants can be read back directly from the VariantKey, while long variants are supported with a fast lookup table.
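
The idea of a single 64-bit key that sorts by chromosome and then position can be illustrated with a minimal Python sketch. This is not the VariantKey implementation: the bit widths (5, 28 and 31), the function names `pack_key` and `unpack_key`, and the use of pre-computed integer codes for the chromosome and the alleles are all assumptions made for illustration. The sketch only shows why placing the chromosome in the most significant bits, followed by the position, keeps numeric order aligned with genomic order and lets the fields be read back.

```python
# Illustrative bit widths (assumed, not taken from the text); they sum to 64.
CHROM_BITS = 5
POS_BITS = 28
ALLELE_BITS = 31


def pack_key(chrom_code: int, pos: int, allele_code: int) -> int:
    """Pack three non-negative integer fields into one 64-bit integer.

    The chromosome occupies the most significant bits and the position
    the next ones, so comparing keys compares (chromosome, position) first.
    """
    if not 0 <= chrom_code < (1 << CHROM_BITS):
        raise ValueError("chrom_code out of range")
    if not 0 <= pos < (1 << POS_BITS):
        raise ValueError("pos out of range")
    if not 0 <= allele_code < (1 << ALLELE_BITS):
        raise ValueError("allele_code out of range")
    return (chrom_code << (POS_BITS + ALLELE_BITS)) | (pos << ALLELE_BITS) | allele_code


def unpack_key(key: int) -> tuple[int, int, int]:
    """Recover (chrom_code, pos, allele_code) from a packed key."""
    allele_code = key & ((1 << ALLELE_BITS) - 1)
    pos = (key >> ALLELE_BITS) & ((1 << POS_BITS) - 1)
    chrom_code = key >> (POS_BITS + ALLELE_BITS)
    return chrom_code, pos, allele_code
```

With such a layout, sorting a list of keys as plain integers orders the records by chromosome code and then by position, and a single integer comparison replaces the four field comparisons described above.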