Abstract

Background
Among patients colonized with carbapenem-resistant Klebsiella pneumoniae (CRKP), only a subset develop clinical infection. While patient characteristics may influence risk for infection, it remains unclear whether the genetic background of CRKP strains contributes to this risk. We applied machine learning to quantify the capacity of patient characteristics and microbial genotypes to discriminate infection and colonization, and identified patient and microbial features associated with infection across multiple healthcare facilities.

Methods
Machine learning models were built using whole-genome sequences and clinical metadata from 331 patients colonized or infected with CRKP across 21 long-term acute care hospitals. To quantify variation in performance, we built models using 100 different train/test splits of the entire dataset and of urinary and respiratory site-specific subsets, and evaluated predictive performance on each test split using the area under the receiver operating characteristic curve (AUROC). Patient and microbial features predictive of infection were identified as those consistently important for predicting infection based on average change in AUROC when included in the model.

Findings
We found that patient and genomic features were only weakly predictive of clinical CRKP infection vs. colonization (AUROC IQRs: patient=0·59-0·68, genomic=0·55-0·61, combined=0·62-0·68), and that one feature set did not consistently outperform the other (genomic vs. patient p=0·4). Comparable model performances were observed for anatomic site-specific models (combined AUROC IQRs: respiratory=0·61-0·71, urinary=0·54-0·64). Strong genomic predictors of infection included the presence of the ICEKp10 mobile genetic element carrying an iron acquisition system (yersiniabactin) and a toxin (colibactin), along with disruption of an O-antigen biosynthetic gene in a sub-lineage of the epidemic ST258 clone.
Teasing apart sequential evolutionary steps in the context of clinical metadata indicated that altered O-antigen biosynthesis increased association with the respiratory tract, and subsequent acquisition of ICEKp10 was associated with increased virulence.

Interpretation
Our results support the need for rigorous machine learning frameworks to gain realistic estimates of the performance of clinical models of infection. Moreover, integrating microbial genomic and clinical data using such a framework can help tease apart the contribution of microbial genetic variation to clinical outcomes.

Funding
Centers for Disease Control and Prevention, National Institutes of Health, National Science Foundation

Research in context

Evidence before this study
We searched PubMed for “crkp” OR “carbapenem resistant klebsiella pneumoniae” AND “infection” AND “machine learning” for papers published up to April 14, 2020 and found no results. Substituting “machine learning” with “bacterial genome-wide association studies” produced one relevant paper investigating pathogenicity-associated loci in K. pneumoniae clinical isolates. When we searched for “infection” AND “machine learning” AND “genom*” AND “clinical”, there was one relevant result: a study that used clinical and bacterial genomic features in a machine learning model to identify clonal differences related to Staphylococcus aureus infection outcome.

Added value of this study
To our knowledge, this is the first study to integrate clinical and genomic data to study anatomic site-specific colonization and infection across multiple healthcare facilities. Using this method, we identified clinical features associated with CRKP infection, as well as a sub-lineage of CRKP with potentially altered niche-specific adaptation and virulence.
This method could be used for other organisms and other clinical outcomes to evaluate performance of predictive models and identify features that are consistently associated with clinical outcomes of interest across facilities or geographic regions.

Implications of all the available evidence
Few studies have combined patient and microbial genomic data to study important clinical outcomes. However, those that have done this, including ours, have identified clinical and/or genomic features associated with the outcome of interest that provide a foundation for future epidemiological, clinical, and biological studies to better understand bacterial infections and clinical outcomes.
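
The evaluation scheme described in the Methods (many train/test splits, each scored by AUROC, with performance summarized as an interquartile range) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the stratified 80/20 split is a guess at the design, and the mean-difference scorer is a toy stand-in for the study's models, which this text does not specify. The function names (`auroc`, `fit_mean_difference_scorer`, `repeated_split_aurocs`, `make_synthetic`) are hypothetical.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference_scorer(X, y):
    """Toy classifier (assumption): project onto the difference of class means."""
    n_feat = len(X[0])

    def col_means(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(n_feat)]

    mu1 = col_means([x for x, lab in zip(X, y) if lab == 1])
    mu0 = col_means([x for x, lab in zip(X, y) if lab == 0])
    w = [a - b for a, b in zip(mu1, mu0)]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def stratified_split(y, test_frac, rng):
    """Shuffle each class separately so both classes appear in train and test."""
    train, test = [], []
    for label in (0, 1):
        members = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_frac))
        test += members[:n_test]
        train += members[n_test:]
    return train, test


def repeated_split_aurocs(X, y, n_splits=100, test_frac=0.2, seed=0):
    """Fit on each training split and score the held-out split by AUROC."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_splits):
        train, test = stratified_split(y, test_frac, rng)
        scorer = fit_mean_difference_scorer([X[i] for i in train], [y[i] for i in train])
        results.append(auroc([y[i] for i in test], [scorer(X[i]) for i in test]))
    return results


def make_synthetic(n=331, n_feat=5, seed=1):
    """Synthetic data only: one weakly informative feature, the rest pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_feat)]
        row[0] += 0.5 * label
        X.append(row)
        y.append(label)
    return X, y


X, y = make_synthetic()
aurocs = repeated_split_aurocs(X, y)
q1, median, q3 = statistics.quantiles(aurocs, n=4)
print(f"AUROC IQR over {len(aurocs)} splits: {q1:.2f}-{q3:.2f} (median {median:.2f})")
```

Reporting the spread across splits, rather than a single held-out score, is what allows statements such as "one feature set did not consistently outperform the other": a difference seen on one split may not hold on others.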