Discordance between different bioinformatic methods for identifying resistance genes from short-read genomic data, with a focus on Escherichia coli
Several bioinformatics genotyping algorithms are now commonly used to characterise antimicrobial resistance (AMR) gene profiles in whole genome sequencing (WGS) data, with a view to understanding AMR epidemiology and developing resistance prediction workflows using WGS in clinical settings. Accurately evaluating AMR in Enterobacterales, particularly Escherichia coli, is of major importance, because this is a common pathogen. However, robust comparisons of different genotyping approaches on relevant simulated and large real-life WGS datasets are lacking. Here, we used both simulated datasets and a large set of real E. coli WGS data (n=1,818 isolates) to systematically investigate genotyping methods in greater detail. Simulated constructs and real sequences were processed using four different bioinformatic programs (ABRicate, ARIBA, KmerResistance, and SRST2, run with the ResFinder database) and their outputs compared. For simulation tests where 3,092 AMR gene variants were inserted into random sequence constructs, KmerResistance was correct for all 3,092 simulations, ABRicate for 3,082 (99.7%), ARIBA for 2,927 (94.7%) and SRST2 for 2,120 (68.6%). For simulation tests where two closely related gene variants were inserted into random sequence constructs, ABRicate identified the correct alleles in 11,382/46,279 (25%) of simulations, ARIBA in 2,494/46,279 (5%), SRST2 in 2,539/46,279 (5%) and KmerResistance in 38,826/46,279 (84%). In real data, across all methods, 1,392/1,818 (76%) isolates had discrepant allele calls for at least one gene. Our evaluations revealed poor performance in scenarios that would be expected to be challenging (e.g. identification of AMR genes at <10x coverage, discriminating between closely related AMR gene sequences), but also identified systematic sequence classification (i.e. naming) errors even in straightforward circumstances, which contributed to 1,081/3,092 (35%) errors in our simplest simulations and at least 2,530/4,321 (59%) discrepancies in real data. Further, many of the remaining discrepancies were likely artefactual, with differences in reporting cut-offs accounting for at least 1,430/4,321 (33%) discrepancies. Comparing outputs generated by running multiple algorithms on the same dataset can help identify and resolve these artefacts, but ideally new and more robust genotyping algorithms are needed.
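As a minimal illustration of what comparing outputs across algorithms can look like in practice, the Python sketch below flags isolates for which the methods do not all report the same set of alleles. It is not the pipeline used in this study: the function name `discordant_isolates`, the input layout (method name mapped to per-isolate allele sets) and the toy allele calls are all assumptions made for the example, and real tool outputs would first need to be parsed and their allele names harmonised.

```python
def discordant_isolates(calls):
    """Return isolates whose allele sets differ between methods.

    `calls` is a hypothetical structure: method name -> {isolate ID -> set of allele names}.
    An isolate missing from a method's output is treated as having no calls.
    """
    isolates = set()
    for per_isolate in calls.values():
        isolates.update(per_isolate)

    discordant = {}
    for isolate in sorted(isolates):
        # One allele profile per method for this isolate.
        profiles = {
            method: frozenset(per_isolate.get(isolate, ()))
            for method, per_isolate in calls.items()
        }
        # More than one distinct profile means the methods disagree.
        if len(set(profiles.values())) > 1:
            discordant[isolate] = profiles
    return discordant


# Toy calls for two made-up isolates; these are not results from the study.
example_calls = {
    "ABRicate": {"iso1": {"blaTEM-1B", "sul2"}, "iso2": {"blaCTX-M-15"}},
    "ARIBA": {"iso1": {"blaTEM-1B", "sul2"}, "iso2": {"blaCTX-M-15"}},
    "KmerResistance": {"iso1": {"blaTEM-1B", "sul2"}, "iso2": {"blaCTX-M-15"}},
    "SRST2": {"iso1": {"blaTEM-1A", "sul2"}, "iso2": {"blaCTX-M-15"}},
}

result = discordant_isolates(example_calls)
fraction_discordant = len(result) / 2  # two isolates in the toy data
print(sorted(result), fraction_discordant)
```

Comparing whole allele sets is the strictest possible definition of agreement. A per-gene comparison, or one that tolerates differences attributable to reporting cut-offs, would separate naming and threshold artefacts from genuine disagreement.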