phylostratr: A framework for phylostratigraphy
Abstract

Motivation
The goal of phylostratigraphy is to infer the evolutionary origin of each gene in an organism. Currently, there are no general pipelines for this task. We present an R package, phylostratr, to fill this gap, making high-quality phylostratigraphic analysis accessible to non-specialists.

Results
Phylostratigraphic analysis entails searching for homologs within increasingly broad clades. The highest clade that contains all homologs of a gene is that gene’s phylostratum. We have created a general R-based framework, phylostratr, for estimating the phylostratum of every gene in a species. The program can fully automate an analysis: select species for a balanced representation of each stratum, retrieve the sequences from UniProt, build BLAST databases, run BLAST, infer homologs for each gene against each subject species, determine phylostrata, and return summaries and diagnostics. phylostratr allows extensive customization. A user may: modify the automatically generated clade tree or use their own tree; provide custom sequences in place of those automatically retrieved from UniProt; replace BLAST with an alternative algorithm; or tailor the method and sensitivity of the homology inference classifier. phylostratr also offers proteome quality assessments, false-positive diagnostics, and checks for missing organelle genomes. We show the utility of phylostratr through case studies in Arabidopsis thaliana and Saccharomyces cerevisiae.

Availability
phylostratr source code and vignettes are available on GitHub at https://github.com/arendsee/[email protected]
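
The phylostratum assignment described under Results can be illustrated with a minimal sketch. It is written in Python for brevity and is not phylostratr's R interface or implementation. The lineage, the subject species, the `SHARED_CLADE` mapping, and the function name `assign_phylostratum` are all hypothetical, and homology calls are assumed to have been made already (for example, from BLAST hits).

```python
from typing import Dict, Iterable, List

# Hypothetical, simplified lineage of the focal species,
# ordered from the youngest stratum to the oldest.
FOCAL_LINEAGE: List[str] = [
    "Arabidopsis thaliana",
    "Brassicaceae",
    "Viridiplantae",
    "Eukaryota",
    "cellular organisms",
]

# Hypothetical mapping: for each subject species, the narrowest clade in
# FOCAL_LINEAGE that contains both that species and the focal species.
SHARED_CLADE: Dict[str, str] = {
    "Brassica rapa": "Brassicaceae",
    "Oryza sativa": "Viridiplantae",
    "Saccharomyces cerevisiae": "Eukaryota",
    "Escherichia coli": "cellular organisms",
}


def assign_phylostratum(
    homolog_species: Iterable[str],
    lineage: List[str] = FOCAL_LINEAGE,
    shared_clade: Dict[str, str] = SHARED_CLADE,
) -> str:
    """Return the oldest lineage clade that contains every inferred homolog.

    A gene with no homologs outside the focal species is assigned to the
    youngest stratum (the focal species itself).
    """
    level = 0  # index into lineage; 0 means species-specific
    for species in homolog_species:
        # Each homolog pushes the gene's origin back at least as far as
        # the clade it shares with the focal species.
        level = max(level, lineage.index(shared_clade[species]))
    return lineage[level]


if __name__ == "__main__":
    # Toy homology calls for three hypothetical genes.
    calls = {
        "geneA": ["Brassica rapa", "Saccharomyces cerevisiae"],
        "geneB": ["Brassica rapa"],
        "geneC": [],
    }
    for gene, species in calls.items():
        print(gene, assign_phylostratum(species))
```

In this toy setting a single homolog in a distantly related species is enough to place a gene in an old stratum, which is why the sensitivity of the homology classifier and the false-positive diagnostics mentioned above matter.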