rPinecone: Define sub-lineages of a clonal expansion via a phylogenetic tree
ABSTRACT
The ability to distinguish between pathogens is a fundamental requirement for understanding the epidemiology of infectious diseases. Phylogenetic analysis of genomic data can provide a powerful platform to identify lineages within bacterial populations, and thus inform outbreak investigation and the study of transmission dynamics. However, resolving differences between pathogens from low-variant (LV) populations, which have low median pairwise single nucleotide variant (SNV) distances, remains a major challenge. Here we present rPinecone, an R package designed to define sub-lineages within closely related LV populations. rPinecone uses a root-to-tip directional approach to define sub-lineages within a phylogenetic tree according to SNV distance from the ancestral node. The utility of this program was demonstrated using genomic data from two LV populations: a hospital outbreak of methicillin-resistant Staphylococcus aureus and endemic Salmonella Typhi from rural Cambodia. rPinecone identified the transmission branches of the hospital outbreak and geographically confined lineages in Cambodia. Sub-lineages identified by rPinecone in both analyses were phylogenetically robust. It is anticipated that rPinecone can be used to discriminate between lineages of bacteria from LV populations where other methods fail, enabling a deeper understanding of infectious disease epidemiology for public health purposes.

DATA SUMMARY
Source code for rPinecone is available on GitHub under the open source licence GNU GPL 3 (https://github.com/alexwailan/rpinecone).
Newick format files for both phylogenetic trees have been deposited in Figshare (https://doi.org/10.6084/m9.figshare.7022558).
Geographical analysis of the S. Typhi dataset using Microreact is available at https://microreact.org/project/r1IqkrN1X.
Accession numbers, metadata and sample lineage results for both datasets used in this paper are listed in the supplementary tables.
The authors confirm that all supporting data, code and protocols have been provided within the article or through supplementary data files.

IMPACT STATEMENT
Whole genome sequence data from bacterial pathogens are increasingly used in the epidemiological investigation of infectious disease, in both outbreak and endemic situations. However, distinguishing between bacteria that are very similar and likely to come from a narrow geographical and temporal range presents a major technical challenge for epidemiologists. rPinecone was designed to address this challenge and utilises phylogenetic data to define lineages within bacterial populations that have limited variation. Existing approaches were not designed to consider bacterial isolates whose variation exists only transiently but is nonetheless epidemiologically informative. rPinecone therefore offers epidemiologists a level of resolution beyond what those approaches provide. It has the flexibility to be applied to multiple pathogens and has direct application in investigations of clinical outbreaks and endemic disease, where it can help to clarify transmission dynamics or identify geographical hotspots of disease.