VirStrain: a strain identification tool for RNA viruses
Abstract

Genome epidemiology, which uses genomic data to analyze the source and spread of infectious diseases, provides important information beyond interview-based methods. Given the fast accumulation of sequenced viral genomes, a basic need in genome epidemiology is to identify which reference genomes are identical or closest to those in a sequenced sample. The associated metadata, such as geographical location, can then be used to infer the transmission network. In this work, we present VirStrain, a fast and accurate tool for strain-level analysis from short reads. Using a greedy covering algorithm, we derive unique k-mer combinations for highly similar reference genomes. VirStrain can detect the most likely strain as well as multiple strains that may simultaneously infect the same host. We tested VirStrain on three types of RNA viruses whose reference genomes have different similarity distributions. For each type of virus, we assessed VirStrain across multiple benchmark datasets of different properties and complexity. The experimental results on both simulated and real sequencing data show that VirStrain outperforms other strain identification tools.
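
To make the idea of a greedy covering over k-mers concrete, the sketch below picks k-mers one at a time until every pair of reference genomes differs on at least one chosen k-mer, so that each genome ends up with a unique presence/absence combination. It is only an illustration under simplifying assumptions, not VirStrain's actual procedure: the function names, the toy sequences, the presence/absence encoding, and the tie-breaking rule are all hypothetical.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_distinguishing_kmers(genomes, k):
    """Greedily choose k-mers until every pair of genomes is told apart.

    genomes: dict mapping a (hypothetical) strain name to its sequence.
    Returns the chosen k-mers in selection order and, for each strain,
    a tuple of booleans saying which chosen k-mers it contains.
    """
    profiles = {name: kmers(seq, k) for name, seq in genomes.items()}
    candidates = sorted(set().union(*profiles.values()))
    # Every unordered pair of strains still needs a k-mer that separates it.
    uncovered = set(combinations(sorted(genomes), 2))
    chosen = []

    def gain(kmer):
        # A k-mer separates a pair if exactly one member of the pair has it.
        return sum((kmer in profiles[a]) != (kmer in profiles[b])
                   for a, b in uncovered)

    while uncovered and candidates:
        best = max(candidates, key=gain)  # ties: first in sorted order
        if gain(best) == 0:
            break  # remaining pairs share identical k-mer sets at this k
        chosen.append(best)
        uncovered = {(a, b) for a, b in uncovered
                     if (best in profiles[a]) == (best in profiles[b])}

    signatures = {name: tuple(km in profiles[name] for km in chosen)
                  for name in genomes}
    return chosen, signatures


# Toy, made-up sequences standing in for highly similar reference genomes.
toy = {"A": "ACGTACGTAA", "B": "ACGTACGTAC", "C": "ACGAACGTAA"}
chosen, signatures = greedy_distinguishing_kmers(toy, k=4)
for name, sig in signatures.items():
    print(name, sig)
```

Covering pairs of genomes rather than individual genomes is one simple way to guarantee that the resulting k-mer combinations are unique. A real tool would also need to account for sequencing errors and read coverage, which this sketch ignores.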