GBSmode: a pipeline for haplotype-aware analysis of genotyping-by-sequencing data
BACKGROUND: Genotyping-by-sequencing (GBS) has revolutionised molecular genetic analysis. It enables simultaneous genotyping of thousands of DNA markers in the genome of any species. In contrast to whole-genome shotgun sequencing, GBS uses a restriction enzyme to reduce genome complexity, so that sequencing reads begin at fixed digestion sites. However, tools currently used to analyse GBS data, such as SAMtools, often neglect these fundamental technical differences between GBS and shotgun sequencing. RESULTS: Here we present GBSmode, a dedicated pipeline that calls DNA sequence variants using whole-read information from GBS data. It removes false positives by incorporating biological features such as the ploidy level and the number of possible alleles in the population under investigation. A comparison of GBSmode with SAMtools in an F2 population of rice (Oryza sativa L.) showed that both tools identified a similar number of polymorphisms (13,449 and 14,445, respectively), of which 8,143 were shared. However, GBSmode produced fewer read misalignments (8.0% versus 14.3% for SAMtools) and fewer genotyping errors (5.0% versus 8.3% for SAMtools). In further tests on a bi-parental F1 population of cassava (Manihot esculenta Crantz), GBSmode found 31,489 polymorphic loci, whereas SAMtools found 43,860. This difference was mainly attributable to GBSmode rejecting 11,695 loci that were biologically impossible. CONCLUSIONS: This study shows that GBSmode is a versatile tool for the analysis of GBS data. By combining haplotype data with biological information, GBSmode reduced the genotyping errors that arise from read misalignments. Whilst other tools may find more markers, GBSmode is designed for accuracy.
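As an illustration only, a biological plausibility check of the kind described above (rejecting loci whose observed haplotypes exceed what the ploidy level and population structure allow) could be sketched as follows. This is not GBSmode's implementation. The function names, the input layout (a mapping from sample name to the set of whole-read haplotypes observed at one locus), and the founder assumptions in the example (two inbred parents for the rice F2, two heterozygous diploid parents for the cassava F1) are all hypothetical.

```python
from typing import Dict, Set


def max_population_haplotypes(ploidy: int, n_founders: int,
                              founders_inbred: bool) -> int:
    """Upper bound on distinct haplotypes at one locus in the population.

    Assumption: an inbred founder contributes one haplotype, and a
    heterozygous founder contributes at most `ploidy` haplotypes.
    """
    per_founder = 1 if founders_inbred else ploidy
    return per_founder * n_founders


def locus_is_plausible(sample_haplotypes: Dict[str, Set[str]],
                       ploidy: int, max_pop_haplotypes: int) -> bool:
    """Return True if the locus is consistent with ploidy and population size.

    `sample_haplotypes` maps each sample to the set of distinct read
    haplotypes observed at this locus (hypothetical input layout).
    """
    # No individual can carry more distinct haplotypes than its ploidy.
    if any(len(haps) > ploidy for haps in sample_haplotypes.values()):
        return False
    # The population as a whole cannot exceed the founder-derived bound.
    population = set().union(*sample_haplotypes.values())
    return len(population) <= max_pop_haplotypes


# Hypothetical bounds: diploid F2 from two inbred parents, and a
# diploid bi-parental F1 from two heterozygous parents.
f2_bound = max_population_haplotypes(ploidy=2, n_founders=2, founders_inbred=True)
f1_bound = max_population_haplotypes(ploidy=2, n_founders=2, founders_inbred=False)

# Toy loci (invented haplotype strings, not real data).
consistent = {"s1": {"ACGT", "ACGA"}, "s2": {"ACGT"}}
too_many_in_sample = {"s1": {"ACGT", "ACGA", "ACTA"}}
too_many_in_population = {"s1": {"ACGT", "ACGA"}, "s2": {"ACTA"}}

for name, locus in [("consistent", consistent),
                    ("too_many_in_sample", too_many_in_sample),
                    ("too_many_in_population", too_many_in_population)]:
    print(name, locus_is_plausible(locus, ploidy=2, max_pop_haplotypes=f2_bound))
```

Under these assumptions, a locus showing three haplotypes across an F2 population would be rejected as a likely misalignment of paralogous reads, while the same locus could pass in an F1 from heterozygous parents, where up to four haplotypes may segregate.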