Missing-Data Handling Methods for Lifelogs-Based Wellness Index Estimation: Comparative Analysis With Panel Data

10.2196/20597 ◽  
2020 ◽  
Vol 8 (12) ◽  
pp. e20597
Author(s):  
Ki-Hun Kim ◽  
Kwang-Jae Kim

Background A lifelogs-based wellness index (LWI) is a function for calculating wellness scores based on health behavior lifelogs (eg, daily walking steps and sleep times collected via a smartwatch). A wellness score intuitively shows the users of smart wellness services the overall condition of their health behaviors. LWI development includes estimation (ie, estimating coefficients in LWI with data). A panel data set comprising health behavior lifelogs allows LWI estimation to control for unobserved variables, thereby resulting in less bias. However, these data sets typically have missing data due to events that occur in daily life (eg, smart devices stop collecting data when batteries are depleted), which can introduce biases into LWI coefficients. Thus, the appropriate choice of method to handle missing data is important for reducing biases in LWI estimations with panel data. However, there is a lack of research in this area.
Objective This study aims to identify a suitable missing-data handling method for LWI estimation with panel data.
Methods Listwise deletion, mean imputation, expectation maximization–based multiple imputation, predictive-mean matching–based multiple imputation, k-nearest neighbors–based imputation, and low-rank approximation–based imputation were comparatively evaluated by simulating an existing case of LWI development. A panel data set comprising health behavior lifelogs of 41 college students over 4 weeks was transformed into a reference data set without any missing data. Then, 200 simulated data sets were generated by randomly introducing missing data at proportions from 1% to 80%. The missing-data handling methods were each applied to transform the simulated data sets into complete data sets, and coefficients in a linear LWI were estimated for each complete data set. For each proportion for each method, a bias measure was calculated by comparing the estimated coefficient values with values estimated from the reference data set.
Results Methods performed differently depending on the proportion of missing data. For 1% to 30% proportions, low-rank approximation–based imputation, predictive-mean matching–based multiple imputation, and expectation maximization–based multiple imputation were superior. For 31% to 60% proportions, low-rank approximation–based imputation and predictive-mean matching–based multiple imputation performed best. For over 60% proportions, only low-rank approximation–based imputation performed acceptably.
Conclusions Low-rank approximation–based imputation was the best of the 6 data-handling methods regardless of the proportion of missing data. This superiority is generalizable to other panel data sets comprising health behavior lifelogs given their verified low-rank nature, for which low-rank approximation–based imputation is known to perform effectively. This result will guide missing-data handling in reducing coefficient biases in new development cases of linear LWIs with panel data.
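The evaluation loop described in the Methods above (introduce missing data, handle it, re-estimate, compare with the reference coefficients) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' code: the synthetic data, the three predictors, the coefficient values, the pooled least-squares fit, and the mean-absolute-difference bias measure are all assumptions made for the example, and only two of the six methods are shown.

```python
import numpy as np

def make_panel(n_subjects=41, n_days=28, seed=0):
    """Synthetic stand-in for a complete reference data set (not the study's data)."""
    rng = np.random.default_rng(seed)
    n = n_subjects * n_days
    X = rng.normal(size=(n, 3))          # three hypothetical lifelog variables
    beta = np.array([0.5, -0.3, 0.8])    # hypothetical coefficients
    y = X @ beta + rng.normal(scale=0.1, size=n)
    return X, y

def fit_ols(X, y):
    """Pooled least-squares slopes (an assumed, simplified linear index)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

def introduce_missing(X, proportion, rng):
    """Set each predictor entry to NaN independently with the given probability."""
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < proportion] = np.nan
    return X_miss

def listwise_deletion(X, y):
    """Drop every row that has at least one missing predictor."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]

def mean_imputation(X, y):
    """Replace each missing entry with the mean of its column's observed values."""
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X), y

def mean_abs_bias(method, X_ref, y, proportion, n_sims=200, seed=1):
    """Average absolute gap between simulated and reference coefficients."""
    rng = np.random.default_rng(seed)
    ref = fit_ols(X_ref, y)
    gaps = []
    for _ in range(n_sims):
        X_c, y_c = method(introduce_missing(X_ref, proportion, rng), y)
        gaps.append(np.abs(fit_ols(X_c, y_c) - ref).mean())
    return float(np.mean(gaps))
```

Calling `mean_abs_bias(mean_imputation, X_ref, y, 0.3)` and `mean_abs_bias(listwise_deletion, X_ref, y, 0.3)` on the output of `make_panel()` gives one comparison point; sweeping `proportion` would trace a curve per method.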

2020 ◽  
Author(s):  
Ki-Hun Kim ◽  
Kwang-Jae Kim

BACKGROUND A lifelogs-based wellness index (LWI) is a function that calculates wellness scores from health behavior lifelogs such as daily walking steps and sleep time collected through smartphones. A wellness score intuitively shows a user of a smart wellness service the overall condition of their health behaviors. LWI development includes LWI estimation (i.e., estimating coefficients in LWI with data). A panel data set of health behavior lifelogs allows LWI estimation to control for variables unobserved in LWI and hence to be less biased. Such panel data sets are likely to have missing data due to various random events of daily life (e.g., smart devices stop collecting data when their batteries run out). Missing data can introduce biases into LWI coefficients. Thus, the choice of an appropriate missing-data handling method is important for reducing biases in LWI estimation with a panel data set of health behavior lifelogs. However, relevant studies are scarce in the literature.
OBJECTIVE This research aims to identify a suitable missing-data handling method for LWI estimation with panel data. Six representative missing-data handling methods are comparatively evaluated through the simulation of an existing LWI development case: listwise deletion (LD), mean imputation, Expectation-Maximization (EM)-based multiple imputation, Predictive-Mean Matching (PMM)-based multiple imputation, k-Nearest Neighbors (k-NN)-based imputation, and Low-rank Approximation (LA)-based imputation.
METHODS A panel data set of health behavior lifelogs collected in the existing LWI development case was transformed into a reference data set. Then, 200 simulated data sets were generated by randomly introducing missing data into the reference data set at each missingness proportion from 1% to 80%. The six methods were applied to transform the simulated data sets into complete data sets by handling missing data. Coefficients in a linear LWI were estimated from each complete data set, following the procedure of the existing case. Coefficient biases of the six methods were calculated by comparing the estimated coefficient values with reference values estimated from the reference data set.
RESULTS Based on the coefficient biases, the superior methods changed according to the missingness proportion: LA-based imputation, PMM-based multiple imputation, and EM-based multiple imputation for 1% to 30% missingness proportions; LA-based imputation and PMM-based multiple imputation for 31% to 60%; and only LA-based imputation for over 60%.
CONCLUSIONS LA-based imputation was superior among the six methods regardless of the missingness proportion. This superiority is generalizable to other panel data sets of health behavior lifelogs because existing works have verified their low-rank nature, for which LA-based imputation works well. This result will guide missing-data handling to reduce coefficient biases in new development cases of linear LWIs with panel data.
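To make the idea of low-rank approximation-based imputation concrete, here is a minimal sketch using iterative truncated SVD. The abstract does not state which LA algorithm the authors used, so this particular scheme, the function name `low_rank_impute`, and its `rank`, `n_iter`, and `tol` parameters are assumptions chosen for illustration.

```python
import numpy as np

def low_rank_impute(X, rank=2, n_iter=200, tol=1e-8):
    """Fill NaN entries of X with values from a rank-`rank` approximation.

    Assumed scheme: start from column means, then alternate between a
    truncated SVD of the filled matrix and re-inserting the observed entries.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: observed column means in the missing positions.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Observed entries stay fixed; only missing ones take the low-rank values.
        updated = np.where(missing, approx, X)
        change = np.linalg.norm(updated - filled)
        filled = updated
        if change <= tol * max(1.0, np.linalg.norm(filled)):
            break
    return filled
```

The approach relies on the data matrix being close to low rank, which is the property the conclusions above cite for health behavior lifelogs; the rank would normally need to be chosen, for example by holding out some observed entries.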


2008 ◽  
Vol 20 (11) ◽  
pp. 2839-2861 ◽  
Author(s):  
Dit-Yan Yeung ◽  
Hong Chang ◽  
Guang Dai

In recent years, metric learning in the semisupervised setting has aroused a lot of research interest. One type of semisupervised metric learning utilizes supervisory information in the form of pairwise similarity or dissimilarity constraints. However, most methods proposed so far are either limited to linear metric learning or unable to scale well with the data set size. In this letter, we propose a nonlinear metric learning method based on the kernel approach. By applying low-rank approximation to the kernel matrix, our method can handle significantly larger data sets. Moreover, our low-rank approximation scheme can naturally lead to out-of-sample generalization. Experiments performed on both artificial and real-world data show very promising results.


2021 ◽  
Vol 47 (3) ◽  
pp. 1-37
Author(s):  
Srinivas Eswar ◽  
Koby Hayashi ◽  
Grey Ballard ◽  
Ramakrishnan Kannan ◽  
Michael A. Matheson ◽  
...  

We consider the problem of low-rank approximation of massive dense nonnegative tensor data, for example, to discover latent patterns in video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computation time and available memory. We propose a distributed-memory parallel computing solution to handle massive data sets, loading the input data across the memories of multiple nodes, and performing efficient and scalable parallel algorithms to compute the low-rank approximation. We present a software package called Parallel Low-rank Approximation with Nonnegativity Constraints, which implements our solution and allows for extension in terms of data (dense or sparse, matrices or tensors of any order), algorithm (e.g., from multiplicative updating techniques to alternating direction method of multipliers), and architecture (we exploit GPUs to accelerate the computation in this work). We describe our parallel distributions and algorithms, which are careful to avoid unnecessary communication and computation, show how to extend the software to include new algorithms and/or constraints, and report efficiency and scalability results for both synthetic and real-world data sets.
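For readers unfamiliar with the multiplicative updating techniques mentioned above, the following single-machine sketch shows the classic multiplicative update rule for a nonnegative matrix factorization A ≈ WH. It is not the package's code and involves no distribution or GPU use; the function name `nmf_multiplicative` and its parameters are hypothetical.

```python
import numpy as np

def nmf_multiplicative(A, rank, n_iter=200, seed=0, eps=1e-12):
    """Approximate a nonnegative matrix A by W @ H with W, H >= 0.

    Uses multiplicative updates for the Frobenius-norm objective;
    `eps` guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Each factor is rescaled elementwise, so nonnegativity is preserved.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
    return W, H
```

The products `W.T @ A` and `A @ H.T` dominate the cost of each iteration, which gives some intuition for why large dense inputs motivate the distributed-memory approach the abstract describes.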


2012 ◽  
Vol 21 (06) ◽  
pp. 1250033
Author(s):  
MANOLIS G. VOZALIS ◽  
ANGELOS I. MARKOS ◽  
KONSTANTINOS G. MARGARITIS

Collaborative Filtering (CF) is a popular technique employed by Recommender Systems, a term used to describe intelligent methods that generate personalized recommendations. Some of the most efficient approaches to CF are based on latent factor models and nearest neighbor methods, and have received considerable attention in recent literature. Latent factor models can tackle some fundamental challenges of CF, such as data sparsity and scalability. In this work, we present an optimal scaling framework to address these problems using Categorical Principal Component Analysis (CatPCA) for the low-rank approximation of the user-item ratings matrix, followed by a neighborhood formation step. CatPCA is a versatile technique that utilizes an optimal scaling process where original data are transformed so that their overall variance is maximized. We considered both smooth and non-smooth transformations for the observed variables (items), such as numeric, (spline) ordinal, (spline) nominal and multiple nominal. The method was extended to handle missing data and incorporate differential weighting for items. Experiments were executed on three data sets of different sparsity and size, MovieLens 100k, 1M and Jester, aiming to evaluate the aforementioned options in terms of accuracy. A combined approach with a multiple nominal transformation and a "passive" missing data strategy clearly outperformed the other tested options for all three data sets. The results are comparable with those reported for single methods in the CF literature.


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yong Zeng ◽  
Yixin Li ◽  
Zhongyuan Jiang ◽  
Jianfeng Ma

Generating random graphs that preserve specific structural properties of real graphs is crucial for anonymizing graphs and for producing targeted graph data sets. The state-of-the-art method, spectral graph forge (SGF), was proposed at INFOCOM 2018. SGF uses a low-rank approximation of the matrix, obtained by discarding part of the spectrum, which provides privacy protection for distributed graphs while preserving data availability to a certain extent. As shown for SGF, at least 20% of the spectrum must be discarded to defend against deanonymization attacks. However, data availability decreases significantly as more of the spectrum is discarded. Is there a way to generate a graph that retains as much of the spectrum as possible while still guaranteeing anonymity? To answer this question, this paper proposes graph nonlinear scaling (GNS). We prove that GNS preserves all eigenvectors while providing high anonymity for the forged graph. Specifically, GNS scales the eigenvalues of the original spectrum and constructs the forged graph from the scaled eigenvalues and the original eigenvectors. This approach maximizes the preservation of spectral information to guarantee data availability while remaining highly robust to deanonymization attacks. The experimental results show that when SGF discards only 10% of the spectrum, the forged graph has high data availability, but almost 100% of its nodes can be identified by the distance vector deanonymization algorithm; at the same availability, only about 20% of the nodes in the graph forged by GNS can be identified. Moreover, our method captures the real graph’s structure better than SGF in terms of modularity, the number of partitions, and average clustering.


2016 ◽  
Vol 27 (6) ◽  
pp. 846-887 ◽  
Author(s):  
MIHAI CUCURINGU ◽  
PUCK ROMBACH ◽  
SANG HOON LEE ◽  
MASON A. PORTER

We introduce several novel and computationally efficient methods for detecting “core–periphery structure” in networks. Core–periphery structure is a type of mesoscale structure that consists of densely connected core vertices and sparsely connected peripheral vertices. Core vertices tend to be well-connected both among themselves and to peripheral vertices, which tend not to be well-connected to other vertices. Our first method, which is based on transportation in networks, aggregates information from many geodesic paths in a network and yields a score for each vertex that reflects the likelihood that that vertex is a core vertex. Our second method is based on a low-rank approximation of a network's adjacency matrix, which we express as a perturbation of a tensor-product matrix. Our third approach uses the bottom eigenvector of the random-walk Laplacian to infer a coreness score and a classification into core and peripheral vertices. We also design an objective function to (1) help classify vertices into core or peripheral vertices and (2) provide a goodness-of-fit criterion for classifications into core versus peripheral vertices. To examine the performance of our methods, we apply our algorithms to both synthetically generated networks and a variety of networks constructed from real-world data sets.


2020 ◽  
Vol 14 (12) ◽  
pp. 2791-2798
Author(s):  
Xiaoqun Qiu ◽  
Zhen Chen ◽  
Saifullah Adnan ◽  
Hongwei He
