VirusTaxo: Taxonomic classification of virus genome using multi-class hierarchical classification by k-mer enrichment
AbstractTaxonomic classification of viruses is a multi-class hierarchical classification problem, as taxonomic ranks (e.g., order, family and genus) of viruses are hierarchically structured and have multiple classes in each rank. Classification of biological sequences which are hierarchically structured with multiple classes is challenging. Here we developed a machine learning architecture, VirusTaxo, using a multi-class hierarchical classification by k-mer enrichment. VirusTaxo classifies DNA and RNA viruses to their taxonomic ranks using genome sequence. To assign taxonomic ranks, VirusTaxo extracts k-mers from genome sequence and creates bag-of-k-mers for each class in a rank. VirusTaxo uses a top-down hierarchical classification approach and accurately assigns the order, family and genus of a virus from the genome sequence. The average accuracies of VirusTaxo for DNA viruses are 99% (order), 98% (family) and 95% (genus) and for RNA viruses 97% (order), 96% (family) and 82% (genus). VirusTaxo can be used to detect taxonomy of novel viruses using full length genome or contig sequences.AvailabilityOnline version of VirusTaxo is available at https://omics-lab.com/virustaxo/.