Abstract
Background: Notable discrepancies in vulnerability to COVID-19 infection have been identified between specific population groups and regions in the United States. The purpose of this study was to estimate the likelihood of COVID-19 infection using a machine learning algorithm that can be updated continuously based on health care data. Methods: Patient records were extracted for all COVID-19 nasal swab PCR tests performed within the Providence St. Joseph Health system from February to October of 2020. Several machine learning models were tested to evaluate the effects of sociodemographic, environmental, and medical history factors on the risk of initial COVID-19 infection. Results: A total of 316,599 participants were included in this study, and approximately 7.7% (n = 24,358) tested positive for COVID-19. A gradient boosting model, LightGBM (LGBM), predicted risk of initial infection with an area under the receiver operating characteristic curve of 0.819. Factors that predicted infection were cough, fever, being a member of the Hispanic or Latino community, being Spanish-speaking, having a history of diabetes or dementia, and living in a neighborhood with housing insecurity. Conclusion: A model trained on sociodemographic, environmental, and medical history data performed well in predicting risk of a positive COVID-19 test. This model could be used to tailor education, public health policy, and resources for communities that are at the greatest risk of infection.
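For readers unfamiliar with the headline metric, the area under the receiver operating characteristic curve can be read as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. The following is a minimal sketch of that rank-based calculation, not the study's evaluation code. The function name `roc_auc` and the four-patient toy data are hypothetical and chosen only for illustration; the counts used for the positivity rate are the ones reported above.

```python
def roc_auc(labels, scores):
    """Rank-based AUROC: P(score of random positive > score of random negative).

    Tied scores contribute 0.5, via average ranks (Mann-Whitney U statistic).
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # 1-based ranks i+1 .. j share the same average rank.
        avg_rank = (i + 1 + j) / 2.0
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_run
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


# Hypothetical toy data (NOT from the study): 1 = positive test, 0 = negative.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = roc_auc(toy_labels, toy_scores)

# Positivity rate recomputed from the counts reported in the abstract.
positivity_rate = 24358 / 316599

print(f"toy AUROC: {toy_auc:.2f}")
print(f"positivity rate: {positivity_rate:.1%}")
```

On this scale 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the context for the reported value of 0.819.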