Background
Next generation sequencing technologies are opening new doors to researchers. One application is the direct discovery of sequence variants that are causative for a phenotypic trait or a disease. The detection of an organisms alterations from a reference genome is know as variant calling, a computational task involving a complex chain of software applications. One key player in the field is the Genome Analysis Toolkit (GATK). The GATK Best Practices are commonly referred recipe for variant calling on human sequencing data. Still the fact the Best Practices are highly specialized on human sequencing data and are permanently evolving is often ignored. Reproducibility is thereby aggravated, leading to continuous reinvention of pretended GATK Best Practice workflows.
Results
Here we present an automatized variant calling workflow, for the detection of SNPs and indels, that is broadly applicable for model as well as non-model diploid organisms. It is derived from the GATK Best Practice workflow for "Germline short variant discovery", without being focused on human sequencing data. The workflow has been highly optimized to achieve parallelized data evaluation and also maximize performance of individual applications to shorten overall analysis time. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller and GatherVcfs were determined by thorough benchmarking. In doing so, runtimes of an example data evaluation could be reduced from 67 h to less than 35 h.
Conclusions
The demand for standardized variant calling workflows is proportionally growing with the dropping costs of next generation sequencing methods. Our workflow perfectly fits into this niche, offering automatization, reproducibility and documentation of the variant calling process. Moreover resource usage is lowered to a minimum. Thereby variant calling projects should become more standardized, reducing the barrier further for smaller institutions or groups.