KAGE: Fast alignment-free graph-based genotyping of SNPs and short indels
Abstract

One of the core applications of high-throughput sequencing is the characterization of individual genetic variation. Traditionally, variants have been inferred by comparing sequenced reads to a reference genome. Genotyping methods have recently emerged that instead infer an individual's variants from variation already present in population-scale repositories such as the 1000 Genomes Project. However, commonly used genotyping methods are slow because they still require mapping reads to a reference genome. Moreover, because traditional reference genomes do not include genetic variation, traditional genotypers suffer from reference bias and from poor accuracy in variation-rich regions where reads cannot be mapped accurately.

Here we present KAGE, a genotyper for SNPs and short indels that is inspired by recent developments in graph-based genome representations and alignment-free genotyping. We propose two novel ideas to improve both speed and accuracy: we (1) use known genotypes from thousands of individuals in a Bayesian model to predict genotypes, and (2) propose a computationally efficient method for leveraging correlation between variants.

We show on experimental data that KAGE is both faster and more accurate than other alignment-free genotypers. KAGE can genotype a new sample (15x coverage) in less than half an hour on a consumer laptop, more than 10 times faster than the fastest existing methods, making it ideal for clinical settings or for genotyping large numbers of individuals at low computational cost.