BACKGROUND
A lifelogs-based wellness index (LWI) is a function that calculates wellness scores from health behavior lifelogs, such as daily walking steps and sleep time, collected through smartphones. A wellness score gives the user of a smart wellness service an intuitive view of the overall condition of their health behaviors. LWI development includes LWI estimation (i.e., estimating the coefficients of the LWI from data). A panel data set of health behavior lifelogs allows LWI estimation to control for variables that are not observed in the LWI and hence to be less biased. Such panel data sets are likely to contain missing data because of various random events in daily life (e.g., smart devices stop collecting data when they run out of battery). Missing data can bias the LWI coefficients. Thus, choosing an appropriate missing data handling method is important for reducing bias in LWI estimation with a panel data set of health behavior lifelogs. However, relevant studies are scarce in the literature.
OBJECTIVE
This research aims to identify a suitable missing data handling method for LWI estimation with panel data. Six representative missing data handling methods are comparatively evaluated through a simulation of an existing LWI development case: listwise deletion (LD), mean imputation, Expectation-Maximization (EM) based multiple imputation, Predictive-Mean Matching (PMM) based multiple imputation, k-Nearest Neighbors (k-NN) based imputation, and Low-rank Approximation (LA) based imputation.
METHODS
A panel data set of health behavior lifelogs collected in the existing LWI development case was transformed into a reference data set. 200 simulated data sets were generated by randomly introducing missing data into the reference data set at each missingness proportion from 1% to 80%. The six methods were applied to the simulated data sets to handle the missing data and produce complete data sets. The coefficients of a linear LWI were estimated from each complete data set, following the procedure of the existing case. The coefficient biases of the six methods were calculated by comparing the estimated coefficient values with reference values estimated from the reference data set.
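The simulation loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the reference data here are synthetic, missing values are introduced only in the predictor columns, mean imputation stands in for any of the six methods, the linear LWI is fitted by ordinary least squares, and bias is taken as the mean difference between estimated and reference coefficients. All function names are hypothetical.

```python
import numpy as np

def introduce_missing(X, proportion, rng):
    """Return a copy of X with roughly `proportion` of entries set to NaN at random."""
    X_miss = X.copy()
    mask = rng.random(X.shape) < proportion
    X_miss[mask] = np.nan
    return X_miss

def mean_impute(X_miss):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X_miss, axis=0)
    return np.where(np.isnan(X_miss), col_means, X_miss)

def estimate_coefficients(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def coefficient_bias(X_ref, y, proportion, n_sims, impute, seed=0):
    """Mean difference between coefficients from imputed data and reference coefficients."""
    rng = np.random.default_rng(seed)
    ref = estimate_coefficients(X_ref, y)
    diffs = []
    for _ in range(n_sims):
        X_complete = impute(introduce_missing(X_ref, proportion, rng))
        diffs.append(estimate_coefficients(X_complete, y) - ref)
    return np.mean(diffs, axis=0)

# Synthetic stand-in for the reference data set (assumption, not the study's data).
rng = np.random.default_rng(42)
X_ref = rng.normal(size=(500, 3))
y = 1.0 + X_ref @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=0.1, size=500)

bias_30 = coefficient_bias(X_ref, y, proportion=0.3, n_sims=20, impute=mean_impute)
bias_0 = coefficient_bias(X_ref, y, proportion=0.0, n_sims=5, impute=mean_impute)
```

Passing a different function as `impute` would compare another handling method under the same missingness pattern generator.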
RESULTS
Based on the coefficient biases, the best-performing methods varied with the missingness proportion: LA based imputation, PMM based multiple imputation, and EM based multiple imputation for missingness proportions of 1% to 30%; LA based imputation and PMM based multiple imputation for 31% to 60%; and only LA based imputation for over 60%.
CONCLUSIONS
LA based imputation was superior among the six methods regardless of the missingness proportion. This superiority is generalizable to other panel data sets of health behavior lifelogs, because existing work has verified their low-rank nature, which is the condition under which LA based imputation works well. This result will guide missing data handling to reduce coefficient biases in new development cases of linear LWIs with panel data.
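To illustrate why a low-rank structure helps, the sketch below shows one common form of low-rank approximation imputation: missing entries are filled iteratively from a truncated SVD of the current estimate. The abstract does not specify which LA algorithm was used, so this variant, the function name `low_rank_impute`, the chosen rank, and the synthetic rank-2 matrix are all assumptions made for illustration.

```python
import numpy as np

def low_rank_impute(X_miss, rank=2, n_iter=100, tol=1e-6):
    """Fill NaNs by iterating a rank-`rank` SVD approximation (assumed variant)."""
    missing = np.isnan(X_miss)
    # Start from column-mean imputation.
    col_means = np.nanmean(X_miss, axis=0)
    X_filled = np.where(missing, col_means, X_miss)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_filled, full_matrices=False)
        X_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed entries; update only the missing ones.
        X_new = np.where(missing, X_low, X_miss)
        change = np.linalg.norm(X_new - X_filled) / max(np.linalg.norm(X_filled), 1e-12)
        X_filled = X_new
        if change < tol:
            break
    return X_filled

# Synthetic rank-2 matrix standing in for a low-rank lifelog panel (assumption).
rng = np.random.default_rng(0)
X_true = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 8))
X_miss = X_true.copy()
X_miss[rng.random(X_true.shape) < 0.2] = np.nan

X_hat = low_rank_impute(X_miss, rank=2)

# Column-mean imputation on the same data, for comparison.
X_mean = np.where(np.isnan(X_miss), np.nanmean(X_miss, axis=0), X_miss)
err_low_rank = np.linalg.norm(X_hat - X_true)
err_mean = np.linalg.norm(X_mean - X_true)
```

On data that really are close to low rank, the reconstruction error of this scheme would be expected to fall below that of mean imputation; on data without such structure there is no such guarantee.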