Global genomic similarity and core genome sequence diversity of the Streptococcus genus as a toolkit to identify closely related bacterial strains in complex environments
Background. Comparative genomics of closely related bacterial strains helps to distinguish important features such as pathogenesis, antibiotic resistance, and phylogenetic structure. Streptococcus is relevant to public health and food safety, and it is well represented (>100 genomes) in publicly available databases. Streptococci are cosmopolitan and have been isolated from multiple sources, from humans to dairy products. Streptococcus strains have been classified by morphology, serotype, 16S rRNA gene sequence, and Multilocus Sequence Typing (MLST). The Genomic Similarity Score (GSS) is proposed as a tool to quantify genome-level relatedness among Streptococcus strains, and their core genome is proposed as a simplified tool to assess strain-specific abundances in metagenomic sequences.
Methods. A 16S rRNA gene phylogeny was calculated for 108 strains belonging to 16 Streptococcus species, and the results were compared to a dendrogram built from the GSS using all the shared homologous information available in the genomes. Additionally, the genus core genome and pan-genome were calculated. The sequence identity of the core genome was analyzed, and the core genome was used as a seed to discriminate the abundances of closely related strains in metagenomic samples.
Results. A total of 404 proteins are shared by all 108 Streptococcus genomes and constitute the core genome. The ranges of core identity values across all compared strains and outgroups are reported. The lowest sequence identity variation (90-100%) within the core belongs to ribosomal and translation-related proteins. In total, 48 proteins (11.8%) of the core genome are annotated as hypothetical proteins, and these host the largest sequence identity variation within the core. The sequence identity of the core genome diminishes as the GSS between species increases. The GSS dendrogram recovers most of the clades in the 16S rRNA gene phylogeny, with the advantage of resolving 16S polytomies (unresolved nodes). Finally, the proposed core genome was used to distinguish the abundances of closely related strains within human oral metagenomes, yielding strain relative abundances for healthy individuals and individuals with caries (infected with S. mutans).
Discussion. The clinical and food safety importance of the Streptococcus genus, together with its excellent genomic coverage, makes it a playground for testing multiple comparative genomic scenarios. Understanding genomic variability and strain relatedness is the goal of tools like the GSS, which uses both pairwise shared core and pan-genomic homologous sequences for its calculation. Combining the core genome with rapid alignment tools makes it possible to estimate abundances and to discriminate between strains in metagenomic samples. Both the GSS genomic dendrogram and the core genome are shared with the community to explore further possibilities within streptococci.
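The core genome, pan-genome, and strain relative abundance concepts used above can be illustrated with a minimal Python sketch. It assumes that each genome has already been reduced to a set of protein family identifiers (the homology clustering step is not shown) and that reads have already been assigned to strains. All strain names, family identifiers, and read counts are hypothetical toy data, not results from this study, and the sketch is not the GSS calculation or the pipeline used here.

```python
def core_genome(families_by_genome):
    """Return the protein families present in every genome (set intersection)."""
    sets = list(families_by_genome.values())
    if not sets:
        return set()
    return set.intersection(*sets)


def pan_genome(families_by_genome):
    """Return the protein families present in at least one genome (set union)."""
    return set().union(*families_by_genome.values())


def relative_abundance(read_counts):
    """Normalize per-strain read counts so that they sum to 1."""
    total = sum(read_counts.values())
    if total == 0:
        return {strain: 0.0 for strain in read_counts}
    return {strain: count / total for strain, count in read_counts.items()}


# Hypothetical toy input: genome name -> set of protein family identifiers.
genomes = {
    "strain_A": {"rpsB", "tuf", "hyp1", "fam9"},
    "strain_B": {"rpsB", "tuf", "hyp1"},
    "strain_C": {"rpsB", "tuf", "fam7"},
}

core = core_genome(genomes)  # families shared by all three toy genomes
pan = pan_genome(genomes)    # every family seen in any toy genome

# Hypothetical counts of metagenomic reads assigned to each strain.
counts = {"strain_A": 60, "strain_B": 30, "strain_C": 10}
abund = relative_abundance(counts)

print(sorted(core), len(pan), abund)
```

With real data, the sets would come from a homology search across the 108 genomes, and the read counts from aligning metagenomic reads against the core genome sequences of each strain.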