Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation

Khaled El Emam; Lucy Mosquera; Jason Bass

doi:10.2196/23139

Journal of Medical Internet Research ◽

10.2196/23139 ◽

2020 ◽

Vol 22 (11) ◽

pp. e23139

Author(s):

Khaled El Emam ◽

Lucy Mosquera ◽

Jason Bass

Keyword(s):

Risk Model ◽

Model Development ◽

Synthesis Method ◽

Secondary Analysis ◽

Synthetic Data ◽

Generative Models ◽

Sequential Decision ◽

Science Data ◽

Disclosure Risk ◽

Identity Disclosure

Background: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. Objective: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. Methods: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied to samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. Results: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. Conclusions: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.
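As a quick arithmetic check on the reported results, the minimal Python sketch below compares the two risk values quoted in the abstract against the 0.09 threshold. The dictionary labels and function names are illustrative assumptions, not identifiers from the paper, and the sketch does not reproduce the risk model itself.

```python
# Values quoted in the abstract above. The labels are shorthand chosen
# for this sketch (assumption), not names used by the authors.
RISK_THRESHOLD = 0.09

SYNTHETIC_RISK = {
    "washington_discharge_sample": 0.0198,
    "canadian_covid19_sample": 0.0086,
}


def is_below_threshold(risk: float, threshold: float = RISK_THRESHOLD) -> bool:
    """Return True when a risk estimate is strictly below the threshold."""
    return risk < threshold


def threshold_margin(risk: float, threshold: float = RISK_THRESHOLD) -> float:
    """Return how many times smaller the risk is than the threshold."""
    return threshold / risk


if __name__ == "__main__":
    for name, risk in SYNTHETIC_RISK.items():
        print(name, is_below_threshold(risk), round(threshold_margin(risk), 2))
```

The margin is plain division of the threshold by the reported value; it says nothing about the risk of the original datasets, which the abstract gives only as rounded ratios.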

Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation (Preprint)

10.2196/preprints.23139 ◽

2020 ◽

Author(s):

Khaled El Emam ◽

Lucy Mosquera ◽

Jason Bass

Keyword(s):

Risk Model ◽

Model Development ◽

Synthesis Method ◽

Secondary Analysis ◽

Synthetic Data ◽

Generative Models ◽

Sequential Decision ◽

Science Data ◽

Disclosure Risk ◽

Identity Disclosure


Deep Neural Networks for Analysis of Microscopy Images—Synthetic Data Generation and Adaptive Sampling

Crystals ◽

10.3390/cryst11030258 ◽

2021 ◽

Vol 11 (3) ◽

pp. 258

Author(s):

Patrick Trampert ◽

Dmitri Rubinstein ◽

Faysal Boughorbel ◽

Christian Schlinkmann ◽

Maria Luschkova ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Adaptive Sampling ◽

Synthetic Data ◽

Generative Models ◽

Data Generation ◽

Neural Network Models ◽

Science Data ◽

Parametric Data ◽

Microscopy Images

The analysis of microscopy images has always been an important yet time-consuming process in materials science. Convolutional Neural Networks (CNNs) have been used very successfully for a number of tasks, such as image segmentation. However, training a CNN requires a large amount of hand-annotated data, which can be a problem for materials science data. We present a procedure to generate synthetic data based on ad hoc parametric data modelling for enhancing generalization of trained neural network models. Such an approach is especially beneficial in situations where it is not possible to gather a lot of data, and may make it possible to train a neural network to a reasonable standard. Furthermore, we show that targeted data generation by adaptively sampling the parameter space of the generative models gives superior results compared to generating random data points.

A Baseline for Attribute Disclosure Risk in Synthetic Data

Proceedings of the Tenth ACM Conference on Data and Application Security and Privacy ◽

10.1145/3374664.3375722 ◽

2020 ◽

Cited By ~ 2

Author(s):

Markus Hittmeir ◽

Rudolf Mayer ◽

Andreas Ekelhart

Keyword(s):

Synthetic Data ◽

Disclosure Risk

OC08.02: A preoperative risk model with ultrasound variables to assess lymph node metastasis in endometrial cancer patients: a model development and validation study by the IETA group

Ultrasound in Obstetrics and Gynecology ◽

10.1002/uog.19255 ◽

2018 ◽

Vol 52 ◽

pp. 17-17

Author(s):

L. Eriksson ◽

E. Epstein ◽

A.C. Testa ◽

D. Fischerová ◽

L. Valentin ◽

...

Keyword(s):

Endometrial Cancer ◽

Lymph Node ◽

Lymph Node Metastasis ◽

Cancer Patients ◽

Validation Study ◽

Risk Model ◽

Model Development ◽

Preoperative Risk ◽

Node Metastasis ◽

Development And Validation

Collective Value-At-Risk (Colvar) In Life Insurance Collection

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.7.16199 ◽

2018 ◽

Vol 7 (3.7) ◽

pp. 25

Author(s):

Abdul Talib Bon ◽

Muhammad Iqbal Al-Banna Ismail ◽

Sukono ◽

Adhitya Ronnie Effendie

Keyword(s):

At Risk ◽

Standard Deviation ◽

Life Insurance ◽

Value At Risk ◽

Risk Model ◽

Risk Measures ◽

Model Development ◽

Insurance Company ◽

Insurance Claims ◽

Level Of Significance

Analysis of risk in life insurance claims is an important task for an insurance company's actuary. Risk in life insurance claims is generally measured using the standard deviation or variance. The problem is that the standard deviation or variance, when used as a measure of claim risk, cannot accommodate every claim risk event. Therefore, this study develops a risk measure model called Collective Modified Value-at-Risk. The model is developed for several distributions of the number of claims and of the claim value. The resulting Collective Modified Value-at-Risk model is expected to accommodate every claim risk event at a given level of significance.
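To illustrate the general idea of a quantile-based risk measure over collective claims, here is a minimal Python sketch. It computes a plain empirical Value-at-Risk by Monte Carlo simulation, not the authors' Collective Modified Value-at-Risk, whose formula the abstract does not give. The Poisson claim count, the exponential claim size, and all function and parameter names are assumptions made for this example.

```python
import math
import random


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw a Poisson variate with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_aggregate_claims(rng, claim_rate, mean_claim, n_sims):
    """Simulate total claims S = X_1 + ... + X_N per period.

    Assumed model: N ~ Poisson(claim_rate), X_i ~ Exponential(mean_claim).
    """
    totals = []
    for _ in range(n_sims):
        n_claims = sample_poisson(rng, claim_rate)
        totals.append(
            sum(rng.expovariate(1.0 / mean_claim) for _ in range(n_claims))
        )
    return totals


def empirical_var(losses, level):
    """Smallest observed loss s with empirical P(S <= s) >= level."""
    ordered = sorted(losses)
    index = max(0, math.ceil(level * len(ordered)) - 1)
    return ordered[index]


if __name__ == "__main__":
    rng = random.Random(0)
    totals = simulate_aggregate_claims(rng, claim_rate=5.0,
                                       mean_claim=1000.0, n_sims=5000)
    print("VaR at the 0.99 level:", round(empirical_var(totals, 0.99), 2))
```

Unlike the standard deviation, the quantile responds directly to the upper tail of the aggregate claim distribution, which is the motivation the abstract gives for moving to a Value-at-Risk style measure.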

Sense of responsibility in ICU end-of-life decision-making: Relatives’ experiences

Nursing Ethics ◽

10.1177/0969733017703697 ◽

2017 ◽

Vol 26 (1) ◽

pp. 270-279 ◽

Cited By ~ 1

Author(s):

Ranveig Lind

Keyword(s):

Intensive Care Unit ◽

Decision Making ◽

Intensive Care ◽

End Of Life ◽

Secondary Analysis ◽

Active Role ◽

Decision Making Process ◽

Science Data ◽

End Of Life Decision ◽

Life Decision

Background: Relatives of intensive care unit patients who lack or have reduced capacity to consent are entitled to information and participation in decision-making together with the patient. Practice varies with legislation in different countries. In Norway, crucial decisions such as withdrawing treatment are made by clinicians, usually morally justified to relatives with reference to the principle of non-maleficence. The relatives should, however, be consulted about whether they know what the patient would have wished in the situation. Research objectives: To examine and describe relatives’ experiences of responsibility in the intensive care unit decision-making process. Research design: A secondary analysis of interviews with bereaved relatives of intensive care unit patients was performed, using a narrative analytical approach. Participants and research context: In all, 27 relatives of 21 deceased intensive care unit patients were interviewed about their experiences from the end-of-life decision-making process. Most interviews took place in the participants’ homes, 3–12 months after the patient’s death. Ethical considerations: Based on informed consent, the study was approved by the Data Protection Official of the Norwegian Social Science Data Services and by the Regional Committee for Medical and Health Research Ethics. Findings: The results show that intensive care unit relatives experienced a sense of responsibility in the decision-making process, independently of clinicians’ intention of sparing them. Some found this troublesome. Three different variants of participation were revealed, ranging from paternalism to a more active role for relatives. Discussion: For the study participants, the sense of responsibility reflects the fact that ethics and responsibility are grounded in the individual’s relationship to other people. Relatives need to be included in a continuous dialogue over time to understand decisions and responsibility. Conclusion: Nurses and physicians should acknowledge and address relatives’ sense of responsibility, include them in regular dialogue and help them separate their responsibility from that of the clinicians.

6. Secondary Analysis and Audit of Social Science Data

Assuring the Confidentiality of Social Research Data ◽

10.9783/9781512800814-007 ◽

1979 ◽

Keyword(s):

Social Science ◽

Secondary Analysis ◽

Science Data ◽

Social Science Data

Development of the Dundee Caries Risk Assessment Model (DCRAM) - risk model development using a novel application of CHAID analysis

Community Dentistry And Oral Epidemiology ◽

10.1111/j.1600-0528.2011.00630.x ◽

2011 ◽

Vol 40 (1) ◽

pp. 37-45 ◽

Cited By ~ 24

Author(s):

Heather M. B. MacRitchie ◽

Christopher Longbottom ◽

Margaret Robertson ◽

Zoann Nugent ◽

Karen Chan ◽

...

Keyword(s):

Risk Assessment ◽

Risk Model ◽

Model Development ◽

Assessment Model ◽

Risk Assessment Model ◽

Caries Risk ◽

Caries Risk Assessment ◽

Chaid Analysis

Deep generative models in DataSHIELD

BMC Medical Research Methodology ◽

10.1186/s12874-021-01237-6 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Stefan Lenz ◽

Moritz Hess ◽

Harald Binder

Keyword(s):

Genetic Variant ◽

Small Sample Size ◽

Synthetic Data ◽

Routine Data ◽

Original Data ◽

Generative Models ◽

Small Sample ◽

Generative Adversarial Networks ◽

Artificial Data ◽

Data Set

Background: The best way to calculate statistics from medical data is to use the data of individual patients. In some settings, this data is difficult to obtain due to privacy restrictions. In Germany, for example, it is not possible to pool routine data from different hospitals for research purposes without the consent of the patients. Methods: The DataSHIELD software provides an infrastructure and a set of statistical methods for joint, privacy-preserving analyses of distributed data. The contained algorithms are reformulated to work with aggregated data from the participating sites instead of the individual data. If a desired algorithm is not implemented in DataSHIELD or cannot be reformulated in such a way, using artificial data is an alternative. Generating artificial data is possible using so-called generative models, which are able to capture the distribution of given data. Here, we employ deep Boltzmann machines (DBMs) as generative models. For the implementation, we use the package “BoltzmannMachines” from the Julia programming language and wrap it for use with DataSHIELD, which is based on R. Results: We present a methodology together with a software implementation that builds on DataSHIELD to create artificial data that preserve complex patterns from distributed individual patient data. Such data sets of artificial patients, which are not linked to real patients, can then be used for joint analyses. As an exemplary application, we conduct a distributed analysis with DBMs on a synthetic data set, which simulates genetic variant data. Patterns from the original data can be recovered in the artificial data using hierarchical clustering of the virtual patients, demonstrating the feasibility of the approach. Additionally, we compare DBMs, variational autoencoders, generative adversarial networks, and multivariate imputation as generative approaches by assessing the utility and disclosure of synthetic data generated from real genetic variant data in a distributed setting with data of a small sample size. Conclusions: Our implementation adds to DataSHIELD the ability to generate artificial data that can be used for various analyses, e.g., for pattern recognition with deep learning. This also demonstrates more generally how DataSHIELD can be flexibly extended with advanced algorithms from languages other than R.
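The idea of reformulating an algorithm to work on per-site aggregates rather than individual records can be sketched in a few lines of Python. This is only an illustration of the principle for a pooled mean and variance; it is not the DataSHIELD API, and the names `SiteSummary`, `summarize_site`, and `pooled_mean_and_variance` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class SiteSummary:
    """Aggregates a site shares instead of its individual records."""
    n: int
    total: float
    total_sq: float


def summarize_site(values: Sequence[float]) -> SiteSummary:
    """Runs locally at one site; only the three aggregates leave the site."""
    return SiteSummary(
        n=len(values),
        total=sum(values),
        total_sq=sum(v * v for v in values),
    )


def pooled_mean_and_variance(
    summaries: Iterable[SiteSummary],
) -> Tuple[float, float]:
    """Combine site aggregates into the pooled mean and sample variance."""
    summaries = list(summaries)
    n = sum(s.n for s in summaries)
    if n < 2:
        raise ValueError("need at least two observations in total")
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    # Sum of squared deviations rewritten in terms of the shared aggregates.
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


if __name__ == "__main__":
    sites = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    print(pooled_mean_and_variance(summarize_site(s) for s in sites))
```

The result matches what pooling the raw values would give, which is why statistics of this kind can be computed without moving patient-level data. Models that cannot be decomposed this way are where the abstract proposes artificial data as the alternative. Note that very small sites can still leak information through their aggregates, a concern this sketch does not address.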

Predictors and risk model development for menopausal age in fragile X premutation carriers

Genetics in Medicine ◽

10.1097/gim.0b013e31821705e5 ◽

2011 ◽

Vol 13 (7) ◽

pp. 643-650 ◽

Cited By ~ 25

Author(s):

Marian A Spath ◽

Ton B Feuth ◽

Arie P T Smits ◽

Helger G Yntema ◽

Didi D M Braat ◽

...

Keyword(s):

Risk Model ◽

Fragile X ◽

Model Development ◽

Fragile X Premutation