Using taxon resampling to identify species with contrasting phylogenetic signals: an empirical example in Terrabacteria
Abstract

Motivation: The promise of higher phylogenetic stability through increasing dataset size within Tree of Life (TOL) reconstructions has not been fulfilled, especially for deep nodes. Among the many causes proposed are changes in species composition (taxon sampling), which could influence the accuracy of phylogenetic methods by altering the relative weight of the evolutionary histories of individual species. This effect would be stronger in clades that are represented by few lineages, which is common in many prokaryotic phyla. Indeed, phyla with fewer taxa showed the most discordance among recent TOL studies. Thus, we implemented an approach to systematically test how the number of taxa, and the identity of those taxa within a larger dataset, affect the accuracy of phylogenetic reconstruction.

Results: We used an empirical dataset of 766 fully sequenced proteomes from phyla within Terrabacteria as a reference for subsampled datasets that differed in both the number and the composition of species. After evaluating the backbone of the resulting trees as well as their internal nodes, we found that trees with fewer species were more dissimilar to the tree produced from the full dataset. Further, we found that even among scenarios with the same number of taxa, the choice of species strongly affected phylogenetic stability. These results hold even when the tree is composed of many phyla and only one of them is altered. Thus, the effect of taxon sampling in one group does not seem to be buffered by the presence of many other clades, making this issue relevant even to very large datasets. Our results suggest that a systematic evaluation of phylogenetic stability through taxon resampling is advisable even for very large datasets.

Contact: [email protected]

Supplementary information: Supplementary text and figures are available on the journal's website.
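The resampling design summarized in the Results can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`resample_one_phylum`, `restrict_splits`, `rf_distance`), the representation of a tree as a set of splits, and the use of a Robinson-Foulds-style split count as the dissimilarity measure are all assumptions made for illustration. The sketch draws replicate taxon sets in which a single phylum is subsampled while all other phyla are kept whole, then compares a subsampled tree with the reference tree after restricting the reference to the retained taxa.

```python
import random


def resample_one_phylum(taxa_by_phylum, target, n_keep, n_replicates, seed=0):
    """Return replicate taxon lists in which only `target` is subsampled.

    Hypothetical helper: every other phylum is kept in full, so replicates
    differ only in which `n_keep` species represent the target phylum.
    """
    rng = random.Random(seed)
    pool = sorted(taxa_by_phylum[target])
    if not 0 < n_keep <= len(pool):
        raise ValueError("n_keep must be between 1 and the size of the phylum")
    fixed = [t for p, ts in taxa_by_phylum.items() if p != target for t in ts]
    return [sorted(fixed + rng.sample(pool, n_keep)) for _ in range(n_replicates)]


def restrict_splits(splits, kept):
    """Prune splits to the taxa in `kept` and drop uninformative ones.

    Each split is the set of taxa on one side of a branch. The side that
    does not contain the smallest kept taxon name is used as canonical form.
    """
    kept = frozenset(kept)
    anchor = min(kept)
    restricted = set()
    for side in splits:
        side = frozenset(side) & kept
        if anchor in side:
            side = kept - side
        # A split is informative only if both sides hold at least two taxa.
        if 2 <= len(side) <= len(kept) - 2:
            restricted.add(side)
    return restricted


def rf_distance(reference_splits, subsample_splits, kept):
    """Count splits found in exactly one of the two trees (assumed metric)."""
    ref = restrict_splits(reference_splits, kept)
    sub = restrict_splits(subsample_splits, kept)
    return len(ref ^ sub)


# Toy data, invented for illustration only.
phyla = {"A": ["a1", "a2", "a3", "a4"], "B": ["b1", "b2"]}
replicates = resample_one_phylum(phyla, target="A", n_keep=2, n_replicates=5, seed=1)

# Reference tree ((a1,a2),(a3,a4),(b1,b2)) written as its non-trivial splits.
reference = [{"a1", "a2"}, {"a3", "a4"}, {"b1", "b2"}]
kept = ["a1", "a3", "b1", "b2"]
conflicting = [{"a1", "b1"}]   # subsampled tree ((a1,b1),(a3,b2))
congruent = [{"b1", "b2"}]     # subsampled tree ((a1,a3),(b1,b2))
```

In practice the splits would be extracted from inferred trees by a phylogenetics library, and the distance would be summarized across replicates for each subsample size, so that the effect of taxon number can be separated from the effect of taxon identity.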