TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies
Abstract
Variant benchmarking is a critical component of method development and of evaluating the accuracy of studies of genetic variation. Currently, the best approach to evaluating the accuracy of a callset is comparison against a well-curated gold standard. In repetitive regions of the genome, it may be difficult to establish the truth for a call, for example when different alignment scoring metrics provide equally supported but different variant calls on the same data. Here we provide an alternative approach, TT-Mars, that takes advantage of the recent production of high-quality haplotype-resolved genome assemblies by evaluating variant calls based on how well each call reflects the content of the assembly, rather than comparing calls themselves. We used TT-Mars to assess callsets from different structural variant (SV) discovery methods on multiple human genome samples and demonstrated that it is capable of accurately classifying true positive and false positive SVs. On the HG002 personal genome, TT-Mars recapitulates 96.0%-99.6% of the validations made using the Genome in a Bottle gold standard callset evaluated by truvari, and evaluates an additional 121-10,966 variants across different callsets. Furthermore, with a group of high-quality assemblies, TT-Mars can evaluate the performance of SV calling algorithms as a distribution rather than a point estimate. We also compare TT-Mars against the long-read-based validation tool VaPoR and against validation that uses assembly-based variant calls produced by dipcall as a gold standard. Compared with VaPoR, TT-Mars analyzes more calls on a long-read callset by assessing more short variant calls (< 100 bases), while requiring smaller input. Compared with validation using dipcall variants, TT-Mars analyzes 1,497-2,229 more calls on long-read callsets and has favorable results when candidate calls are fragmented into multiple calls in alignments.
TT-Mars is available at https://github.com/ChaissonLab/TT-Mars.git with accompanying assembly data and corresponding liftover files.