Phyloreferences: Tree-Native, Reproducible, and Machine-Interpretable Taxon Concepts
Evolutionary and organismal biology, like other fields of biology, have become inundated with data. At the same time, we are seeing a surge in broad evolutionary and ecological syntheses, in which tree-thinking is central to a variety of post-tree analyses. To take full advantage of this wealth of data in discovering and understanding large-scale evolutionary and ecological patterns, computational data integration, i.e. the use of machines to link data at large scale through shared entities, is crucial. The most common shared entity by which evolutionary and ecological data need to be linked is the taxon to which they belong. In this paper, we propose a set of requirements that a system for defining such taxa should meet for computational data science: taxon definitions should maintain conceptual consistency, be reproducible via a known algorithm, be computationally automatable, and be applicable across the tree of life. We argue that Linnaean names, grounded in Linnaean taxonomy and by far the most prevalent means of linking data to taxa, fail to meet these requirements because of fundamental theoretical and practical shortfalls. We argue that, for the purposes of data integration, we should instead use phylogenetic clade definitions transformed into formal logic expressions. We call such expressions phyloreferences and argue that, unlike Linnaean names, they meet all requirements for effective data integration.
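To make the idea of a clade definition that is "reproducible via a known algorithm" concrete, the following minimal sketch resolves one common kind of definition, the least inclusive clade containing a set of specifier taxa, on a toy tree. It is an illustration under stated assumptions, not the formal-logic encoding argued for here: it substitutes plain tree traversal for logical reasoning, and the tree, the taxon labels, and the function names (`ancestors`, `resolve_minimum_clade`, `clade_members`) are all hypothetical.

```python
# Toy rooted tree as a child -> parent map.
# Taxon labels (A-D) and internal node names are hypothetical.
PARENT = {
    "A": "n1", "B": "n1",
    "n1": "n2", "C": "n2",
    "n2": "root", "D": "root",
}


def ancestors(node):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def resolve_minimum_clade(specifiers):
    """Return the most recent common ancestor node of all specifiers."""
    paths = [ancestors(s) for s in specifiers]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking upward from the first specifier, the first node shared by
    # every path is the most recent common ancestor.
    return next(n for n in paths[0] if n in common)


def clade_members(node):
    """Return the sorted leaf taxa descended from node."""
    children = [c for c, p in PARENT.items() if p == node]
    if not children:
        return [node]
    return sorted(leaf for c in children for leaf in clade_members(c))


if __name__ == "__main__":
    node = resolve_minimum_clade(["A", "C"])
    print(node, clade_members(node))
```

Because the definition is resolved by a fixed procedure rather than by expert judgment, the same specifiers applied to the same tree always yield the same clade, and applying them to a different tree yields whatever clade that tree implies.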