Cluster-specific gene markers enhance Shigella and Enteroinvasive Escherichia coli in silico serotyping
Abstract
Shigella and enteroinvasive Escherichia coli (EIEC) cause human bacillary dysentery through similar invasion mechanisms and share similar physiological, biochemical and genetic characteristics. Differentiating Shigella from EIEC is important for clinical diagnostic and epidemiologic investigations, but existing genetic signatures may not discriminate between them. Phylogenetically, however, Shigella and EIEC strains fall into multiple clusters and are different forms of E. coli. In this study, we identified 10 Shigella clusters, 7 EIEC clusters and 53 sporadic types of EIEC by examining over 17,000 publicly available Shigella/EIEC genomes. We compared Shigella and EIEC accessory genomes to identify cluster-specific gene markers or marker sets for the 17 clusters and 53 sporadic types. The gene markers showed 99.63% accuracy and more than 97.02% specificity. In addition, we developed a freely available in silico serotyping pipeline named Shigella EIEC Cluster Enhanced Serotype Finder (ShigEiFinder) by incorporating the cluster-specific gene markers and established Shigella/EIEC serotype-specific O antigen genes and modification genes into typing. ShigEiFinder can process either paired-end Illumina sequencing reads or assembled genomes, and it almost perfectly differentiated Shigella from EIEC, with cluster assignment accuracy of 99.70% for assembled genomes and 99.81% for mapped reads. ShigEiFinder was able to serotype over 59 Shigella serotypes and 22 EIEC serotypes, with serotyping specificity of 99.40% for assembled genomes and 99.38% for mapped reads. The cluster markers and our new serotyping tool, ShigEiFinder (https://github.com/LanLab/ShigEiFinder), will be useful for epidemiologic and diagnostic investigations.

Impact statement
The differentiation of Shigella strains from enteroinvasive E. coli (EIEC) is important for clinical diagnosis and public health epidemiologic investigations. The similarities between Shigella and EIEC strains make this differentiation very difficult, as both share common ancestries within E. coli. However, Shigella and EIEC are phylogenetically separated into multiple clusters, making high-resolution separation using cluster-specific genomic markers possible. In this study, through examination of over 17,000 publicly available Shigella and EIEC genomes, we identified 17 Shigella or EIEC clusters, including five that were newly identified. We further identified an individual cluster-specific gene marker or a set of such markers for each cluster using comparative genomic analysis. These markers can be used to classify isolates into clusters and were used to develop an in silico pipeline, ShigEiFinder (https://github.com/LanLab/ShigEiFinder), for accurate differentiation, cluster typing and serotyping of Shigella and EIEC from Illumina sequencing reads or assembled genomes. This study will have broad application, from understanding the evolution of Shigella/EIEC to diagnosis and epidemiology.

Data summary
Sequencing data have been deposited at the National Center for Biotechnology Information (NCBI) under BioProject number PRJNA692536.

Repositories
Raw sequence data are available from NCBI under BioProject number PRJNA692536.