Expanding the utility of sequence comparisons using data from whole genomes
Abstract
Whole-genome comparisons based on Average Nucleotide Identity (ANI) and the Genome-to-Genome Distance Calculator have risen to prominence for rapidly classifying taxa from whole-genome sequences. Some implementations have even been proposed as a new standard for species classification, and they have become a common technique in papers describing newly sequenced genomes. However, attempts to apply whole-genome divergence data to the delineation of higher taxonomic units and to phylogenetic inference have had difficulty matching the results produced by more complex phylogenetic methods. We present a novel method for generating reliable and statistically supported phylogenies using established ANI techniques. For the test cases to which we applied the approach, we obtained accurate results up to at least the family level. The method uses non-parametric bootstrapping to gauge the reliability of inferred groups. It offers the opportunity to make use of whole-genome comparison data that are already being generated to quickly produce accurate phylogenies. Additionally, the ANI methodology can assist in the classification of higher-order taxonomic groups.

Significance Statement
The average nucleotide identity (ANI) measure and its iterations have come to dominate in silico species delimitation in the past decade. Yet the problem of gene content has not been fully resolved, and existing attempts to address it rely on two separate metrics, which can make interpretation difficult. We provide a new single ANI-based metric created by combining measures of genomic content and genomic identity. Our results show that this method can handle comparisons of genomes that are divergent in content or identity. Additionally, the metric can be used to build distance-based phylogenetic trees that are comparable to those from other tree-building methods, while also providing a tentative metric for categorizing organisms into higher-level taxonomic classifications.
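
To make the idea of a single metric combining genomic content and identity more concrete, the following is a minimal illustrative sketch, not the formula used in this work, which the text above does not specify. It assumes that a query genome is cut into fragments, that each fragment either aligns to the reference with some identity or does not align at all, and that a combined distance can be taken as one minus the product of mean identity and aligned fraction. The function names `combined_distance` and `bootstrap_distances` are hypothetical. Resampling the fragments with replacement then yields replicate distances of the kind a non-parametric bootstrap would feed into a distance-based tree builder.

```python
import random


def combined_distance(identities, total_fragments):
    """Return a single distance in [0, 1] from identity and content.

    identities: per-fragment identities (0..1) for fragments that aligned.
    total_fragments: number of fragments the query genome was cut into.
    ASSUMPTION: distance = 1 - (mean identity * aligned fraction); this is
    an illustrative choice, not the definition used by the authors.
    """
    if total_fragments <= 0:
        raise ValueError("total_fragments must be positive")
    if not identities:
        return 1.0  # nothing aligned: treat as maximally distant
    mean_identity = sum(identities) / len(identities)
    aligned_fraction = len(identities) / total_fragments
    return 1.0 - mean_identity * aligned_fraction


def bootstrap_distances(fragment_hits, n_replicates=100, seed=0):
    """Non-parametric bootstrap over query fragments.

    fragment_hits: one entry per fragment; a float identity if the fragment
    aligned, or None if it had no hit. Fragments are resampled with
    replacement, so both identity and content vary between replicates.
    """
    rng = random.Random(seed)
    n = len(fragment_hits)
    replicates = []
    for _ in range(n_replicates):
        sample = [fragment_hits[rng.randrange(n)] for _ in range(n)]
        aligned = [hit for hit in sample if hit is not None]
        replicates.append(combined_distance(aligned, n))
    return replicates


if __name__ == "__main__":
    # Hypothetical data: six fragments, two of which found no match.
    hits = [0.98, 0.95, None, 0.97, None, 0.96]
    aligned = [h for h in hits if h is not None]
    print(combined_distance(aligned, len(hits)))
    print(bootstrap_distances(hits, n_replicates=5))
```

In such a scheme, each bootstrap replicate would produce one full pairwise distance matrix, and the fraction of replicate trees recovering a given group would serve as its support value.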