disclosure risk
Recently Published Documents


TOTAL DOCUMENTS: 136 (FIVE YEARS: 21)

H-INDEX: 16 (FIVE YEARS: 2)

2021 ◽  
Author(s):  
Matthias Templ ◽  
Chifundo Kanjala ◽  
Inken Siems

BACKGROUND Sharing and anonymising data have become hot topics for individuals, organisations, and countries around the world. Open-access sharing of anonymised data containing sensitive information about individuals makes the most sense whenever the utility of the data can be preserved and the risk of disclosure can be kept below acceptable levels. In this case, researchers can use the data without access restrictions or limitations. OBJECTIVE The goal of this paper is to highlight solutions and requirements for sharing longitudinal health and surveillance event history data in the form of open-access data. The challenges lie in the anonymisation of multiple event dates and of time-varying variables. A sequential approach that adds noise to the event dates is proposed; it maintains the event order and preserves the average time between events. Additionally, a distance-based matching approach is proposed to estimate the disclosure risk under a nosy-neighbour scenario. To deal with key variables that change over time, such as educational level or occupation, we make two proposals: one limits the intermediate statuses of a person (e.g., on education), and the other achieves k-anonymity in subsets of the data. The proposed approaches were applied to the Karonga Health and Demographic Surveillance System (HDSS) core data set, which contains longitudinal data from 1995 to the end of 2016 and includes 280,381 event records with time-varying socio-economic variables and demographic information on individuals. The proposed anonymisation strategy lowers the risk of disclosure to acceptable levels, thus allowing sharing of the data. METHODS Statistical disclosure control, k-anonymity, noise addition, disclosure risk measurement, event history data anonymisation, longitudinal data anonymisation, and assessment of data utility by visual comparisons. RESULTS An anonymised version of the event history data, including longitudinal information on individuals over time, with high data utility.
CONCLUSIONS The proposed anonymisation of study participants in event history data, including static and time-varying status variables, was applied to longitudinal health and demographic surveillance system data and produced an anonymised data set with very low disclosure risk and high data utility, ready to be shared with the public as an open-access data set. Different levels of noise for event history dates were evaluated for disclosure risk and data utility; high utility was achieved even with the highest level of noise. Details matter for ensuring consistency and credibility. Most importantly, the sequential noise approach presented in this paper maintains the event order; not only is the event order preserved, but the time between events is also well maintained in comparison to the original data. We also proposed an anonymisation strategy to handle time-varying information on a person's educational and occupational levels, year of death, year of birth, and number of events. This approach preserves data utility well while limiting the number of educational and occupational levels recorded per person. Using distance-based neighbourhood matching, we simulated an attack under a nosy-neighbour situation, assuming a worst-case scenario in which the attacker has full information on the original data. Even under the assumption that the attacker's database and information are optimal, the disclosure risk was shown to be very low. The HDSS and medical science research communities in LMIC settings will be the primary beneficiaries of the results and methods presented in this article, but the results will be useful to anyone anonymising longitudinal data sets, possibly including time-varying information and event history data, for sharing purposes.
In other words, the proposed approaches can be applied to almost any event history data and, additionally, to event history data that includes static and/or status variables whose entries change over time.
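The sequential noise approach described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's exact mechanism: random noise is added to each event date in order, and each noisy date is clipped so it never falls before its predecessor, which preserves the original event order.

```python
import random

def sequential_noise(dates, max_noise=90, min_gap=1):
    """Add uniform noise (in days) to ordered event dates while
    preserving the event order. Illustrative sketch only; the exact
    noise mechanism in the paper may differ."""
    dates = sorted(dates)
    noisy = []
    prev = None
    for d in dates:
        cand = d + random.randint(-max_noise, max_noise)
        if prev is not None and cand < prev + min_gap:
            cand = prev + min_gap  # clip to keep the original order
        noisy.append(cand)
        prev = cand
    return noisy
```

Because the noise is symmetric around zero and clipping is rare for well-separated events, the average time between events tends to stay close to that of the original data.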


2021 ◽  
Author(s):  
Daniel Beunza ◽  
Kira Henshaw ◽  
Matthew Agarwala ◽  
Sarah Perkins-Kirkpatrick ◽  
Stefano Battiston ◽  
...  

Data ◽  
2021 ◽  
Vol 6 (5) ◽  
pp. 53
Author(s):  
Ebaa Fayyoumi ◽  
Omar Alhuniti

This research investigates the micro-aggregation problem in secure statistical databases by integrating the divide-and-conquer concept with a genetic algorithm. This is achieved by recursively dividing a micro-data set into two subsets based on proximity distance similarity. On each subset, the genetic operation "crossover" is performed until the convergence condition is satisfied. The recursion terminates once the generated subsets reach the required size. Eventually, the genetic operation "mutation" is performed over all generated subsets that satisfy the variable group size constraint, in order to maximise the objective function. Experimentally, the proposed micro-aggregation technique was applied to recommended real-life data sets. Results demonstrated a remarkable reduction in computational time, sometimes exceeding 70% compared to the state of the art. Furthermore, a good equilibrium value of the Scoring Index (SI) was achieved by a linear combination of the General Information Loss (GIL) and the General Disclosure Risk (GDR).
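The recursive divide step described above can be sketched as follows. This is a hypothetical illustration of the proximity-based splitting only; the genetic crossover and mutation operations applied to each resulting subset are omitted. Records are split around the two mutually most distant records until each subset is small enough for the GA to optimise.

```python
import numpy as np

def divide(data, max_size=6):
    """Recursively split records into proximity-based subsets until each
    subset reaches the required size. Sketch of the divide-and-conquer
    step only; not the authors' full GA-based method."""
    if len(data) <= max_size:
        return [data]
    # pairwise distances; pick the two mutually most distant records
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    left = data[d[i] <= d[j]]    # records closer to record i
    right = data[d[i] > d[j]]    # records closer to record j
    if len(left) == 0 or len(right) == 0:
        # degenerate case (e.g. duplicate records): split in half
        left, right = data[:len(data) // 2], data[len(data) // 2:]
    return divide(left, max_size) + divide(right, max_size)
```

In the paper's scheme, a GA would then optimise the grouping within each subset, which is far cheaper than running the GA over the whole data set at once, explaining the reported reduction in computational time.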


2021 ◽  
Vol 49 (2) ◽  
Author(s):  
Federico Camerlenghi ◽  
Stefano Favaro ◽  
Zacharie Naulet ◽  
Francesca Panero

2021 ◽  
pp. 1-6
Author(s):  
Siu-Ming Tam

With increasing demand from the research community for more frequent and unrestricted access to data, national statistical offices (NSOs) are adopting the Five Safes framework to manage the disclosure risk of releasing such data. In this paper, under some mild conditions, we show that the probability of disclosure, given the controls in the Five Safes, is not greater than the product of the smallest conditional disclosure probability amongst the five controls and the risk ratios of the remaining four safe controls. By computing the disclosure probabilities of all possible configurations of the controls in each of the five dimensions of the framework, one can select the configuration that imposes the least control on the data but still meets the confidentiality and privacy requirements of the NSO. Where the required assumption of unconditional independence of the safes cannot be met, the paper proposes merging some of the controls to overcome the violation.
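The bound stated above is straightforward to compute. The sketch below uses illustrative (made-up) conditional disclosure probabilities and risk ratios for the five controls; the paper's actual values and estimation procedure are not reproduced here.

```python
def disclosure_bound(cond_probs, risk_ratios):
    """Upper bound on the disclosure probability under the Five Safes:
    the smallest conditional disclosure probability among the five
    controls, multiplied by the risk ratios of the remaining four.
    Input values here are illustrative assumptions, not from the paper."""
    assert len(cond_probs) == len(risk_ratios) == 5
    k = min(range(5), key=lambda i: cond_probs[i])  # tightest control
    bound = cond_probs[k]
    for i, r in enumerate(risk_ratios):
        if i != k:
            bound *= r
    return bound
```

An NSO could evaluate this bound for every feasible configuration of the five controls and pick the least restrictive configuration whose bound still falls under its confidentiality threshold.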


2021 ◽  
Vol 15 (2) ◽  
Author(s):  
Stefano Favaro ◽  
Francesca Panero ◽  
Tommaso Rigon

2020 ◽  
Vol 39 (5) ◽  
pp. 5999-6008
Author(s):  
Vicenç Torra

Microaggregation is an effective data-driven protection method that permits a good trade-off between disclosure risk and information loss. In this work we propose a microaggregation method based on fuzzy c-means that is appropriate when there are (linear) constraints on the variables that describe the data. Our method produces results that satisfy these constraints even when the data to be masked do not satisfy them.
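A minimal sketch of the idea follows, under the assumption that constrained fuzzy c-means can be approximated by projecting each centroid onto the linear constraint after every update; the authors' actual algorithm may handle the constraints differently. Each record is then replaced by its closest centroid, so the masked data satisfies the constraint by construction.

```python
import numpy as np

def fcm_microaggregate(X, c=3, m=2.0, iters=50, a=None, b=0.0, seed=0):
    """Fuzzy c-means microaggregation sketch with a linear constraint
    a.x = b enforced by projecting centroids onto the constraint
    hyperplane. Hypothetical simplification of the constrained method."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # fuzzy memberships
    for _ in range(iters):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted centroids
        if a is not None:
            # project each centroid onto the hyperplane a.x = b
            V = V - np.outer((V @ a - b) / (a @ a), a)
        D = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U = 1.0 / (D ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return V[np.argmax(U, axis=1)]              # mask: closest centroid
```

Because every centroid lies exactly on the constraint hyperplane, the masked records satisfy the constraint even when the original records do not.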


10.2196/23139 ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. e23139
Author(s):  
Khaled El Emam ◽  
Lucy Mosquera ◽  
Jason Bass

Background There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. Objective The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. Methods A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. Results The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. Conclusions We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.
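A toy version of the matching attack underlying such a risk model can be sketched as follows. This is a deliberately simplified illustration, not the published risk model: a synthetic record that matches exactly one real record on the quasi-identifiers, and agrees on the sensitive value, counts as a meaningful disclosure. The field names are hypothetical.

```python
def match_risk(real, synth, quasi, sensitive):
    """Toy identity-disclosure matching attack on synthetic data.
    A synthetic record 'matches' when exactly one real record shares its
    quasi-identifier values; the match is 'meaningful' when the sensitive
    value also agrees, i.e. the adversary learns something new and true.
    The published risk model is considerably more detailed."""
    meaningful = 0
    for s in synth:
        key = tuple(s[q] for q in quasi)
        hits = [r for r in real if tuple(r[q] for q in quasi) == key]
        if len(hits) == 1 and hits[0][sensitive] == s[sensitive]:
            meaningful += 1
    return meaningful / len(synth)
```

A well-fit (not overfit) generative model keeps this fraction low; comparing it against a threshold such as the 0.09 value cited above is the kind of decision the full model supports.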

