A multi-dimensional quality comparison of synthetic data generators (Preprint)
BACKGROUND Synthetic datasets are gradually emerging as solutions for fast and inclusive health data sharing. Multiple synthetic data generators have been introduced in the last decade, fueled by advances in machine learning, yet their utility is not well understood. A few recent papers have compared the utility of synthetic data generators, but each focused on different evaluation metrics and drew conclusions targeted at specific analyses.
OBJECTIVE This work aims to understand the overall utility (referred to as quality) of four recent synthetic data generators by identifying multiple criteria for high-utility synthetic data.
METHODS We investigated commonly used utility metrics for masked data evaluation and classified them into criteria/categories depending on the function they attempt to preserve: attribute fidelity, bivariate fidelity, population fidelity, and application fidelity. We then chose a representative metric from each of the identified categories based on popularity and consistency. Together, this set of metrics, referred to as quality criteria, was used to evaluate the overall utility of four recent synthetic data generators across 19 datasets of different sizes and feature counts. Moreover, correlations between the identified metrics were investigated in an attempt to streamline the evaluation of synthetic data utility.
RESULTS Our results indicate that a non-parametric machine learning synthetic data generator (Synthpop) provides the best utility values across all quality criteria, along with the highest stability. It displays the best overall accuracy in supervised machine learning and often agrees with the real dataset on which learning model achieves the highest accuracy. On another front, our results suggest no strong correlation between the different metrics, which implies that all categories/dimensions are required when evaluating the overall utility of synthetic data.
CONCLUSIONS This paper used four quality criteria to identify the synthesizer with the best overall utility. The results are promising: models trained on data from the winning synthesizer showed only small decreases in accuracy when tested on real datasets, in comparison with models trained on real data. Further research into a single overall quality measure would greatly help data holders optimize the utility of the released dataset.
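
To make the fidelity categories named in the methods more concrete, the following is a minimal Python sketch of one possible metric for attribute fidelity and one for bivariate fidelity. The abstract does not name the representative metrics that were chosen, so the Hellinger distance and the pairwise correlation difference used here are assumptions for illustration only, and all function names are hypothetical.

```python
import math
from collections import Counter


def hellinger_distance(real, synthetic):
    """Attribute fidelity (assumed metric): Hellinger distance between the
    empirical distributions of one categorical column in the real and the
    synthetic data. 0 means identical marginals, 1 means disjoint support."""
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    total = 0.0
    for category in set(p) | set(q):
        # Counter returns 0 for categories missing from one of the datasets.
        total += (math.sqrt(p[category] / n_p) - math.sqrt(q[category] / n_q)) ** 2
    return math.sqrt(total / 2.0)


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.
    Raises ZeroDivisionError if either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlation_difference(real_cols, synth_cols):
    """Bivariate fidelity (assumed metric): mean absolute difference between
    the pairwise correlations of the real and the synthetic numeric columns.
    Both arguments map column name -> list of values and share the same keys."""
    names = sorted(real_cols)
    diffs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r_real = pearson(real_cols[a], real_cols[b])
            r_synth = pearson(synth_cols[a], synth_cols[b])
            diffs.append(abs(r_real - r_synth))
    return sum(diffs) / len(diffs)
```

Under the same assumptions, the `pearson` helper could also be applied to two lists of per-dataset metric scores to check whether a pair of metrics moves together, which is the kind of correlation analysis the methods describe.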