Abstract
Background: Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step for comparative metagenomics. For that purpose, sequencing reads are usually classified by mapping them against a database of known viral reference genomes. This fails, however, to classify reads from novel viruses and quasispecies whose reference sequences are not yet available in public databases.
Methods: To circumvent this limitation of the mapping approach for unknown viruses, the feasibility and performance of neural networks for assigning sequencing reads to taxonomic classes are studied. For that purpose, taxonomy and genome data from the NCBI database are used to sample artificial reads from known viruses with known taxonomic attribution. Based on these training data, artificial neural networks are fitted and applied to classify single viral read sequences into different taxa. Model building includes different input features derived from the artificial read sequences as possible predictors, which are chosen by a feature selection method. Training, validation and test data are computed from these input features. To summarise classification results, a generalised confusion matrix is proposed which lists the frequencies of all possible misclassification combinations. Two new formulas to statistically estimate taxa frequencies are introduced for studying the overall viral composition.
Results: We found that the best taxonomic level supported by the NCBI database is that of viral orders. Prediction accuracy of the fitted models is evaluated on test data and classification results are summarised in a confusion matrix, from which diagnostic measures such as sensitivity and specificity as well as positive and negative predictive values are calculated. The prediction accuracy of the artificial neural net is considerably higher than that of random classification, and the posterior estimation of taxa frequencies is closer to the true distribution in the training data than simple classification or mapping results.
Conclusions: Neural networks are helpful for classifying sequencing reads into viral orders and can be used to complement the results of mapping approaches. The machine learning approach is not limited to already known viruses. In addition, statistical estimates of taxa frequencies can be used for subsequent comparative metagenomics.
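As an illustration of how the diagnostic measures named above can be read off a confusion matrix, the following minimal sketch reduces an ordinary multiclass confusion matrix to one-vs-rest counts per class. It is not the generalised confusion matrix proposed in this work, whose exact layout is not given here. The function name `one_vs_rest_measures`, the three viral orders and all read counts are hypothetical and chosen for illustration only.

```python
def one_vs_rest_measures(confusion, labels):
    """Per-class diagnostic measures from a square confusion matrix.

    Assumption: confusion[i][j] is the number of reads whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    nan = float("nan")
    measures = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]                          # class k, predicted k
        fn = sum(confusion[k]) - tp                   # class k, predicted other
        fp = sum(row[k] for row in confusion) - tp    # other, predicted k
        tn = total - tp - fn - fp                     # other, predicted other
        measures[label] = {
            "sensitivity": tp / (tp + fn) if tp + fn else nan,
            "specificity": tn / (tn + fp) if tn + fp else nan,
            "ppv": tp / (tp + fp) if tp + fp else nan,
            "npv": tn / (tn + fn) if tn + fn else nan,
        }
    return measures


# Hypothetical counts for three viral orders (rows: true, columns: predicted).
labels = ["Caudovirales", "Herpesvirales", "Mononegavirales"]
confusion = [
    [80, 15, 5],
    [10, 70, 20],
    [5, 10, 85],
]

measures = one_vs_rest_measures(confusion, labels)
for label, values in measures.items():
    print(label, {name: round(value, 3) for name, value in values.items()})
```

Treating each order in turn as the "positive" class keeps the four measures comparable across orders, at the cost of hiding which specific orders are confused with each other; that information remains in the off-diagonal entries of the matrix.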