Reconstructing phylogeny from reduced-representation genome sequencing data without assembly or alignment
Abstract
Although genome sequencing is becoming cheaper and faster, sequencing only part of the genome reduces the quantity of data and lowers both sequencing costs and computational burdens. One popular genome-reduction approach is restriction site-associated DNA sequencing (RADseq). RADseq was initially designed for studying genetic variation across genomes, usually at the population level, and it has also proved suitable for interspecific phylogeny reconstruction. RADseq data pose challenges for standard phylogenomic methods, however, because of incomplete coverage of the genome and large amounts of missing data. Alignment-free methods are both efficient and accurate for phylogenetic reconstruction with whole genomes and are especially practical for non-model organisms; nonetheless, they have so far been applied only to whole-genome sequences. Here, we test AAF, an assembly- and alignment-free method designed for whole genomes, on RADseq data and propose two read-selection procedures to remove missing data. We validate these methods using both simulations and a real dataset. Read selection improved the accuracy of phylogenetic reconstruction in every simulated scenario and in the real dataset, making AAF comparable to or better than alignment-based methods at a much lower computational burden. We also investigated the sources of missing data in RADseq and their effects on phylogeny reconstruction using AAF. The AAF pipeline modified for RADseq data, phyloRAD, is available on GitHub (https://github.com/fanhuan/phyloRAD).