PhyloHerb: A phylogenomic pipeline for processing genome skimming data for plants
Premise of the study: The application of high-throughput sequencing, especially to herbarium specimens, is greatly accelerating biodiversity research. Among various techniques, low-coverage Illumina sequencing of total genomic DNA (genome skimming) can simultaneously recover the plastid, mitochondrial, and nuclear ribosomal regions across hundreds of species. Here, we introduce PhyloHerb, a bioinformatic pipeline for efficiently and effectively assembling phylogenomic datasets derived from genome skimming. Methods and Results: PhyloHerb uses either a built-in database or user-specified references to extract orthologous sequences via BLAST searches. It outputs FASTA files and offers a suite of utility functions to assist with alignment, data partitioning, concatenation, and phylogeny inference. The program is freely available at https://github.com/lmcai/PhyloHerb/. Conclusions: Using published data from Clusiaceae, we demonstrated that PhyloHerb can accurately identify genes in highly fragmented assemblies derived from older herbarium specimens. Our approach is effective at all taxonomic depths and is scalable to thousands of species.
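To illustrate what the concatenation and data-partitioning steps involve, the following is a minimal sketch in Python. It is not PhyloHerb's own code or interface: the function name `concatenate`, the in-memory dictionary layout, and the toy gene and species names are all assumptions made for illustration. It joins per-gene alignments into a supermatrix, pads species missing from a gene with gaps, and records the 1-based column range of each gene as a partition.

```python
from typing import Dict, List, Tuple

# Hypothetical helper for illustration only; not part of PhyloHerb.
def concatenate(
    alignments: Dict[str, Dict[str, str]]
) -> Tuple[Dict[str, str], List[Tuple[str, int, int]]]:
    """Concatenate per-gene alignments into a supermatrix.

    `alignments` maps gene name -> {species name -> aligned sequence}.
    Returns the supermatrix and a list of (gene, start, end) partitions
    with 1-based, inclusive coordinates.
    """
    # Collect every species seen in any gene, in a stable order.
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    pieces: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    partitions: List[Tuple[str, int, int]] = []
    start = 1

    for gene, aln in alignments.items():
        # All sequences within one gene alignment must share a length.
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are empty or not aligned")
        length = lengths.pop()

        for taxon in taxa:
            # Species lacking this gene are filled with gap characters.
            pieces[taxon].append(aln.get(taxon, "-" * length))

        partitions.append((gene, start, start + length - 1))
        start += length

    matrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return matrix, partitions


if __name__ == "__main__":
    # Toy input; gene and species names are placeholders.
    toy = {
        "rbcL": {"sp1": "ATGC", "sp2": "ATGA"},
        "matK": {"sp1": "GG-", "sp3": "GGT"},
    }
    matrix, partitions = concatenate(toy)
    for taxon, seq in matrix.items():
        print(f">{taxon}\n{seq}")
    for gene, begin, end in partitions:
        print(f"{gene} = {begin}-{end}")
```

Padding absent genes with gaps keeps every row of the supermatrix the same length, which most phylogeny inference programs require, and the recorded column ranges can be written out in whatever partition-file format the chosen program expects.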