Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation

10.2196/23139 ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. e23139
Author(s):  
Khaled El Emam ◽  
Lucy Mosquera ◽  
Jason Bass

Background: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. Objective: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. Methods: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied to samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. Results: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. Conclusions: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.





Crystals ◽  
2021 ◽  
Vol 11 (3) ◽  
pp. 258
Author(s):  
Patrick Trampert ◽  
Dmitri Rubinstein ◽  
Faysal Boughorbel ◽  
Christian Schlinkmann ◽  
Maria Luschkova ◽  
...  

The analysis of microscopy images has always been an important yet time-consuming process in materials science. Convolutional Neural Networks (CNNs) have been used very successfully for a number of tasks, such as image segmentation. However, training a CNN requires a large amount of hand-annotated data, which can be a problem for materials science data. We present a procedure to generate synthetic data based on ad hoc parametric data modelling to enhance the generalization of trained neural network models. Such an approach is especially beneficial in situations where it is not possible to gather a lot of data, and it may make it possible to train a neural network adequately. Furthermore, we show that targeted data generation by adaptively sampling the parameter space of the generative models gives superior results compared to generating random data points.





2018 ◽  
Vol 7 (3.7) ◽  
pp. 25
Author(s):  
Abdul Talib Bon ◽  
Muhammad Iqbal Al-Banna Ismail ◽  
Sukono . ◽  
Adhitya Ronnie Effendie

Analyzing the risk in life insurance claims is an important task for an insurance company's actuary. Risk in life insurance claims is generally measured using the standard deviation or variance. The problem is that the standard deviation or variance, when used as a measure of claim risk, cannot accommodate every claim risk event. Therefore, this study develops a risk measure model called Collective Modified Value-at-Risk. The model is developed for several distributions of the number of claims and of the claim value. The resulting Collective Modified Value-at-Risk model is expected to accommodate any claim risk event at a given level of significance.
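The abstract does not state the formula behind Collective Modified Value-at-Risk. As a rough illustration only, the Python sketch below assumes a collective risk model in which the aggregate claim is the sum of a Poisson number of exponentially distributed claims, and it applies the standard Cornish-Fisher adjustment, which corrects the normal quantile for skewness and excess kurtosis. The distribution choices, the parameter values, and the function names `simulate_aggregate_claims` and `modified_var` are hypothetical and are not taken from the paper.

```python
import math
import random
from statistics import NormalDist


def simulate_aggregate_claims(n_periods, claim_rate, mean_claim, seed=0):
    """Simulate aggregate claims S = X_1 + ... + X_N for each period.

    Assumed for illustration: N ~ Poisson(claim_rate) and each claim
    X_i ~ Exponential with mean `mean_claim`.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_periods):
        # Draw the Poisson count by counting exponential inter-arrival
        # times that fall inside one unit of time.
        count = 0
        t = rng.expovariate(claim_rate)
        while t < 1.0:
            count += 1
            t += rng.expovariate(claim_rate)
        totals.append(sum(rng.expovariate(1.0 / mean_claim) for _ in range(count)))
    return totals


def modified_var(samples, alpha=0.95):
    """Cornish-Fisher (modified) Value-at-Risk of a loss sample at level alpha."""
    n = len(samples)
    mean = sum(samples) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
    if sd == 0.0:
        raise ValueError("samples have zero variance")
    skew = sum((x - mean) ** 3 for x in samples) / (n * sd ** 3)
    excess_kurt = sum((x - mean) ** 4 for x in samples) / (n * sd ** 4) - 3.0
    z = NormalDist().inv_cdf(alpha)
    # Cornish-Fisher expansion of the alpha-quantile.
    z_cf = (
        z
        + (z ** 2 - 1.0) * skew / 6.0
        + (z ** 3 - 3.0 * z) * excess_kurt / 24.0
        - (2.0 * z ** 3 - 5.0 * z) * skew ** 2 / 36.0
    )
    return mean + z_cf * sd


if __name__ == "__main__":
    totals = simulate_aggregate_claims(5000, claim_rate=10.0, mean_claim=1000.0, seed=1)
    print(round(modified_var(totals, alpha=0.95), 2))
```

For a right-skewed aggregate claim distribution such as this one, the adjusted quantile lies above the plain normal quantile `mean + z * sd`, which is the sense in which a modified measure reacts to tail events that the standard deviation alone does not capture.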



2017 ◽  
Vol 26 (1) ◽  
pp. 270-279 ◽  
Author(s):  
Ranveig Lind

Background: Relatives of intensive care unit patients who lack or have reduced capacity to consent are entitled to information and participation in decision-making together with the patient. Practice varies with legislation in different countries. In Norway, crucial decisions such as withdrawing treatment are made by clinicians, usually morally justified to relatives with reference to the principle of non-maleficence. The relatives should, however, be consulted about whether they know what the patient would have wished in the situation. Research objectives: To examine and describe relatives’ experiences of responsibility in the intensive care unit decision-making process. Research design: A secondary analysis of interviews with bereaved relatives of intensive care unit patients was performed, using a narrative analytical approach. Participants and research context: In all, 27 relatives of 21 deceased intensive care unit patients were interviewed about their experiences from the end-of-life decision-making process. Most interviews took place in the participants’ homes, 3–12 months after the patient’s death. Ethical considerations: Based on informed consent, the study was approved by the Data Protection Official of the Norwegian Social Science Data Services and by the Regional Committee for Medical and Health Research Ethics. Findings: The results show that intensive care unit relatives experienced a sense of responsibility in the decision-making process, independently of clinicians’ intention of sparing them. Some found this troublesome. Three different variants of participation were revealed, ranging from paternalism to a more active role for relatives. Discussion: For the study participants, the sense of responsibility reflects the fact that ethics and responsibility are grounded in the individual’s relationship to other people. Relatives need to be included in a continuous dialogue over time to understand decisions and responsibility. 
Conclusion: Nurses and physicians should acknowledge and address relatives’ sense of responsibility, include them in regular dialogue and help them separate their responsibility from that of the clinicians.



2011 ◽  
Vol 40 (1) ◽  
pp. 37-45 ◽  
Author(s):  
Heather M. B. MacRitchie ◽  
Christopher Longbottom ◽  
Margaret Robertson ◽  
Zoann Nugent ◽  
Karen Chan ◽  
...  


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Stefan Lenz ◽  
Moritz Hess ◽  
Harald Binder

Background: The best way to calculate statistics from medical data is to use the data of individual patients. In some settings, this data is difficult to obtain due to privacy restrictions. In Germany, for example, it is not possible to pool routine data from different hospitals for research purposes without the consent of the patients. Methods: The DataSHIELD software provides an infrastructure and a set of statistical methods for joint, privacy-preserving analyses of distributed data. The contained algorithms are reformulated to work with aggregated data from the participating sites instead of the individual data. If a desired algorithm is not implemented in DataSHIELD or cannot be reformulated in such a way, using artificial data is an alternative. Generating artificial data is possible using so-called generative models, which are able to capture the distribution of given data. Here, we employ deep Boltzmann machines (DBMs) as generative models. For the implementation, we use the package “BoltzmannMachines” for the Julia programming language and wrap it for use with DataSHIELD, which is based on R. Results: We present a methodology together with a software implementation that builds on DataSHIELD to create artificial data that preserve complex patterns from distributed individual patient data. Such data sets of artificial patients, which are not linked to real patients, can then be used for joint analyses. As an exemplary application, we conduct a distributed analysis with DBMs on a synthetic data set, which simulates genetic variant data. Patterns from the original data can be recovered in the artificial data using hierarchical clustering of the virtual patients, demonstrating the feasibility of the approach. Additionally, we compare DBMs, variational autoencoders, generative adversarial networks, and multivariate imputation as generative approaches by assessing the utility and disclosure of synthetic data generated from real genetic variant data in a distributed setting with data of a small sample size. Conclusions: Our implementation adds to DataSHIELD the ability to generate artificial data that can be used for various analyses, e.g., for pattern recognition with deep learning. This also demonstrates more generally how DataSHIELD can be flexibly extended with advanced algorithms from languages other than R.
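The idea of reformulating an algorithm to work with aggregated data from the participating sites can be illustrated with a small Python sketch. It is not DataSHIELD code (DataSHIELD is based on R) and does not use its API; the `SiteSummary` class and the functions `summarize` and `pooled_mean_and_variance` are hypothetical names chosen for this example. Each site shares only a count, a sum, and a sum of squares, and the coordinator combines them into the pooled mean and sample variance without seeing any individual value.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Aggregates a site is assumed to share instead of patient-level values."""
    n: int
    total: float
    total_sq: float


def summarize(values):
    """Run locally at a site: reduce individual values to three aggregates."""
    return SiteSummary(
        n=len(values),
        total=sum(values),
        total_sq=sum(v * v for v in values),
    )


def pooled_mean_and_variance(summaries):
    """Run at the coordinator: combine site aggregates into pooled statistics."""
    n = sum(s.n for s in summaries)
    if n < 2:
        raise ValueError("need at least two observations in total")
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    # Sample variance from the sum of squares: (sum x^2 - n * mean^2) / (n - 1).
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


if __name__ == "__main__":
    site_a = summarize([1.0, 2.0, 3.0])
    site_b = summarize([4.0, 5.0])
    print(pooled_mean_and_variance([site_a, site_b]))
```

The sum-of-squares formula is used here for brevity; it can lose precision when the mean is large relative to the spread, so a production implementation would likely prefer a numerically stable merge of per-site means and variances.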



2011 ◽  
Vol 13 (7) ◽  
pp. 643-650 ◽  
Author(s):  
Marian A Spath ◽  
Ton B Feuth ◽  
Arie P T Smits ◽  
Helger G Yntema ◽  
Didi D M Braat ◽  
...  

