BACKGROUND
In Wisconsin, COVID-19 case interview forms contain free text fields that need to be mined to identify potential outbreaks for targeted policy making. We developed an automated pipeline to ingest the free text into a pre-trained neural language model to identify businesses and facilities as outbreaks.
OBJECTIVE
We aim to examine the performance of our pipeline.
METHODS
Data on cases of COVID-19 were extracted from the Wisconsin Electronic Disease Surveillance System (WEDSS) for Dane County between July 1, 2020, and June 30, 2021. Features from the case interview forms were fed into a Bidirectional Encoder Representations from Transformers (BERT) model that was fine-tuned for named entity recognition (NER). We also developed a novel location mapping tool to provide addresses for relevant NERs. The pipeline was validated against known outbreaks that were already investigated and confirmed.
RESULTS
There were 46,898 cases of COVID-19 with 4,183,273 total BERT tokens and 15,051 unique tokens. The recall and precision of the NER tool were 0.67 (95 % CI 0.66-0.68) and 0.55 (95 % CI: 0.54-0.57), respectively. For the location mapping tool, the recall and precision were 0.93 (95% CI: 0.92-0.95) and 0.93 (95% CI: 0.92-0.95), respectively. Across monthly intervals, the NER tool identified more potential clusters than were confirmed in the WEDSS system.
CONCLUSIONS
We developed a novel pipeline of tools that identified existing outbreaks and novel clusters with associated addresses. Our pipeline ingests data from a statewide database and may be deployed to assist local health departments for targeted interventions.
CLINICALTRIAL
Not applicable