Missing data matters in participatory syndromic surveillance systems: comparative evaluation of missing data methods when estimating disease burden
Introduction
Traditional surveillance methods have been enhanced by the emergence of online participatory syndromic surveillance systems that collect health-related digital data. These systems have many applications, including tracking the weekly prevalence of influenza-like illness (ILI), predicting probable infection with coronavirus disease 2019 (COVID-19), and determining risk factors for ILI and COVID-19. However, not every volunteer consistently completes surveys, so the resulting data contain missing responses. In this study, we assess how different missing data methods affect estimates of ILI burden using data from FluTracking, a participatory surveillance system in Australia.

Methods
We estimate the incidence rate, the incidence proportion, and the weekly prevalence using five missing data methods: available case, complete case, assume missing is non-ILI, multiple imputation (MI), and delta (δ) MI, which is a flexible and transparent method to impute missing data under Missing Not at Random (MNAR) assumptions. We evaluate these methods using simulated and FluTracking data.

Results
Our simulations show that the optimal missing data method depends on the measure of ILI burden and the underlying missingness model. Of note, the δ-MI method provides estimates of ILI burden that are similar to the true parameter under MNAR models. When we apply these methods to FluTracking, we find that the δ-MI method accurately predicts complete, end-of-season weekly prevalence estimates from real-time data.

Conclusion
Missing data is an important problem in participatory surveillance systems. Here, we show that accounting for missingness using statistical approaches leads to different inferences from the data.
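
To make the distinction between the missing data methods named in the Methods concrete, the following is a minimal Python sketch of how weekly prevalence could be computed under each one. It is an illustration under stated assumptions, not the implementation used in this study: the toy `reports` matrix, the function names, and the intercept-only imputation model (which uses only the observed weekly prevalence and no covariates) are all hypothetical. In the δ-MI function, δ is taken to be a shift on the log-odds scale applied to the imputation probability for missing responses, so that δ = 0 corresponds to ordinary MI under this simplified model; a full MI procedure would also draw the model parameters from their posterior and pool variances, which is omitted here.

```python
import math
import random

# Toy data (hypothetical): rows are participants, columns are weeks.
# 1 = reported ILI, 0 = reported no ILI, None = survey not completed.
reports = [
    [1, 0, 0],
    [0, None, 0],
    [0, 0, None],
    [None, 1, 0],
    [0, 0, 0],
    [1, None, None],
]

def available_case(reports, week):
    """Prevalence among participants who responded in the given week."""
    obs = [r[week] for r in reports if r[week] is not None]
    return sum(obs) / len(obs)

def complete_case(reports, week):
    """Prevalence among participants with no missing week at all."""
    full = [r for r in reports if None not in r]
    return sum(r[week] for r in full) / len(full)

def missing_as_non_ili(reports, week):
    """Prevalence when every missing response is counted as non-ILI."""
    return sum(r[week] or 0 for r in reports) / len(reports)

def delta_mi(reports, week, delta, n_imputations=200, seed=0):
    """Delta-adjusted MI with an intercept-only model (an assumption).

    Missing responses are imputed as Bernoulli draws whose log-odds equal
    the observed log-odds of ILI plus delta; estimates are averaged over
    the imputed data sets.
    """
    rng = random.Random(seed)
    obs = [r[week] for r in reports if r[week] is not None]
    n_missing = len(reports) - len(obs)
    # Clamp away from 0 and 1 so the logit is finite.
    p_obs = min(max(sum(obs) / len(obs), 1e-6), 1 - 1e-6)
    shifted_logit = math.log(p_obs / (1 - p_obs)) + delta
    p_missing = 1 / (1 + math.exp(-shifted_logit))
    estimates = []
    for _ in range(n_imputations):
        imputed = sum(rng.random() < p_missing for _ in range(n_missing))
        estimates.append((sum(obs) + imputed) / len(reports))
    return sum(estimates) / len(estimates)

if __name__ == "__main__":
    for d in (-1.0, 0.0, 1.0):
        print(d, round(delta_mi(reports, 0, d), 3))
```

In this sketch, a strongly negative δ makes the δ-MI estimate approach the "assume missing is non-ILI" estimate, while a positive δ encodes the assumption that non-responders are more likely to have ILI than responders; varying δ over a plausible range is one way to examine how sensitive an estimate is to the MNAR assumption.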