Statistical Disclosure Risk: An Overview

Minimum distance controlled tabular adjustment is a recent perturbative approach for statistical disclosure control in tabular data. Given a table to be protected, it looks for the closest safe table, using some particular distance. Controlled adjustment is known to provide high data utility. However, the disclosure risk has only been partially analyzed using theoretical results from optimization. This work extends these previous results, providing both a more detailed theoretical analysis, and an extensive empirical assessment of the disclosure risk of the method. A set of 25 instances from the literature and four different attacker scenarios are considered, with several random replications for each scenario, both for L1 and L2 distances. This amounts to the solution of more than 2000 optimization problems. The analysis of the results shows that the approach has low disclosure risk when the attacker has no good information on the bounds of the optimization problem. On the other hand, when the attacker has good estimates of the bounds, and the only uncertainty is in the objective function (which is a very strong assumption), the disclosure risk of controlled adjustment is high and it should be avoided.

Download Full-text

Quality Indicators for Statistical Disclosure Methods: A Case Study on the Structure of Earnings Survey

Journal of Official Statistics ◽

10.1515/jos-2015-0043 ◽

2015 ◽

Vol 31 (4) ◽

pp. 737-761 ◽

Cited By ~ 2

Author(s):

Matthias Templ

Keyword(s):

Original Data ◽

Data Sets ◽

Data Utility ◽

High Data ◽

Disclosure Risk ◽

Statistical Disclosure ◽

Context Data ◽

Statistical Agencies ◽

Utility Measures

Abstract Scientific- or public-use files are typically produced by applying anonymisation methods to the original data. Anonymised data should have both low disclosure risk and high data utility. Data utility is often measured by comparing well-known estimates from original data and anonymised data, such as comparing their means, covariances or eigenvalues. However, it is a fact that not every estimate can be preserved. Therefore the aim is to preserve the most important estimates, that is, instead of calculating generally defined utility measures, evaluation on context/data dependent indicators is proposed. In this article we define such indicators and utility measures for the Structure of Earnings Survey (SES) microdata and proper guidelines for selecting indicators and models, and for evaluating the resulting estimates are given. For this purpose, hundreds of publications in journals and from national statistical agencies were reviewed to gain insight into how the SES data are used for research and which indicators are relevant for policy making. Besides the mathematical description of the indicators and a brief description of the most common models applied to SES, four different anonymisation procedures are applied and the resulting indicators and models are compared to those obtained from the unmodified data. The disclosure risk is reported and the data utility is evaluated for each of the anonymised data sets based on the most important indicators and a model which is often used in practice.

Download Full-text

Providing Data With High Utility And No Disclosure Risk For The Public and Researchers: An Evaluation By Advanced Statistical Disclosure Risk Methods

Austrian Journal of Statistics ◽

10.17713/ajs.v43i4.43 ◽

2014 ◽

Vol 43 (4) ◽

pp. 247-254

Author(s):

Matthias Templ

Keyword(s):

Data Privacy ◽

Data Sets ◽

Real World Data ◽

Data Set ◽

The Public ◽

Disclosure Control ◽

High Data ◽

Disclosure Risk ◽

Statistical Disclosure ◽

High Utility

The demand of data from surveys, registers or other data sets containing sensibleinformation on people or enterprises have been increased significantly over the last years.However, before providing data to the public or to researchers, confidentiality has to berespected for any data set containing sensible individual information. Confidentiality canbe achieved by applying statistical disclosure control (SDC) methods to the data. Theresearch on SDC methods becomes more and more important in the last years because ofan increase of the awareness on data privacy and because of the fact that more and moredata are provided to the public or to researchers. However, for legal reasons this is onlyvisible when the released data has (very) low disclosure risk.In this contribution existing disclosure risk methods are review and summarized. Thesemethods are finally applied on a popular real-world data set - the Structural EarningsSurvey (SES) of Austria. It is shown that the application of few selected anonymisationmethods leads to well-protected anonymised data with high data utility and low informationloss.

Download Full-text

Statistical Disclosure Limitation: New Directions and Challenges

Journal of Privacy and Confidentiality ◽

10.29012/jpc.684 ◽

2018 ◽

Vol 8 (1) ◽

Author(s):

Natalie Shlomo

Keyword(s):

Data Dissemination ◽

Differential Privacy ◽

Synthetic Data ◽

Remote Access ◽

Disclosure Limitation ◽

Disclosure Risk ◽

Statistical Disclosure Limitation ◽

Statistical Disclosure ◽

Statistical Agencies ◽

Definition Of

An overview of traditional types of data dissemination at statistical agencies is provided including definitions of disclosure risks, the quantification of disclosure risk and data utility and common statistical disclosure limitation (SDL) methods. However, with technological advancements and the increasing push by governments for openand accessible data, new forms of data dissemination are currently being explored. We focus on web-based applications such as flexible table builders and remote analysis servers, synthetic data and remote access. Many of these applications introduce new challenges for statistical agencies as they are gradually relinquishing some of their control on what data is released. There is now more recognition of the need for perturbative methods to protect the confidentiality of data subjects. These new forms of data dissemination are changing the landscape of how disclosure risks are conceptualized and the types of SDL methods that need to be applied to protect thedata. In particular, inferential disclosure is the main disclosure risk of concern and encompasses the traditional types of disclosure risks based on identity and attribute disclosures. These challenges have led to statisticians exploring the computer science definition of differential privacy and privacy- by-design applications. We explore how differential privacy can be a useful addition to the current SDL framework within statistical agencies.

Download Full-text

A CRITIQUE OF THE SENSITIVITY RULES USUALLY EMPLOYED FOR STATISTICAL TABLE PROTECTION

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s0218488502001636 ◽

2002 ◽

Vol 10 (05) ◽

pp. 545-556 ◽

Cited By ~ 11

Author(s):

JOSEP DOMINGO-FERRER ◽

VICENÇ TORRA

Keyword(s):

Risk Assessment ◽

A Priori ◽

Tabular Data ◽

Disclosure Control ◽

Disclosure Risk ◽

Dominance Rule ◽

Statistical Table ◽

Statistical Disclosure ◽

A Cell ◽

Statistical Agencies

In statistical disclosure control of tabular data, sensitivity rules are commonly used to decide whether a table cell is sensitive and should therefore not be published. The most popular sensitivity rules are the dominance rule, the p%-rule and the pq-rule. The dominance rule has received critiques based on specific numerical examples and is being gradually abandoned by leading statistical agencies. In this paper, we construct general counterexamples which show that none of the above rules does adequately reflect disclosure risk if cell contributors or coalitions of them behave as intruders: in that case, releasing a cell declared non-sensitive can imply higher disclosure risk than releasing a cell declared sensitive. As possible solutions, we propose an alternative sensitivity rule based on the concentration of relative contributions. More generally, we suggest to complement a priori risk assessment based on sensitivity rules with a posteriori risk assessment which takes into account tables after they have been protected.

Download Full-text

Measuring Disclosure Risk and Data Utility for Flexible Table Generators

Journal of Official Statistics ◽

10.1515/jos-2015-0019 ◽

2015 ◽

Vol 31 (2) ◽

pp. 305-324 ◽

Cited By ~ 5

Author(s):

Natalie Shlomo ◽

Laszlo Antal ◽

Mark Elliot

Keyword(s):

Original Data ◽

Web Based ◽

Data Utility ◽

Output Table ◽

Disclosure Control ◽

Disclosure Risk ◽

Final Output ◽

Statistical Disclosure ◽

Original Table ◽

Use Of The Internet

Abstract Statistical agencies are making increased use of the internet to disseminate census tabular outputs through web-based flexible table-generating servers that allow users to define and generate their own tables. The key questions in the development of these servers are: (1) what data should be used to generate the tables, and (2) what statistical disclosure control (SDC) method should be applied. To generate flexible tables, the server has to be able to measure the disclosure risk in the final output table, apply the SDC method and then iteratively reassess the disclosure risk. SDC methods may be applied either to the underlying data used to generate the tables and/or to the final output table that is generated from original data. Besides assessing disclosure risk, the server should provide a measure of data utility by comparing the perturbed table to the original table. In this article, we examine aspects of the design and development of a flexible table-generating server for census tables and demonstrate a disclosure risk-data utility analysis for comparing SDC methods. We propose measures for disclosure risk and data utility that are based on information theory.

Download Full-text

Zone design for statistical disclosure control in administrative and linked microdata

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.163 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

James Robards ◽

David Martin ◽

Chris Gale

Keyword(s):

Occupation Number ◽

Medical Practitioner ◽

Synthetic Dataset ◽

Design Solution ◽

Successful Implementation ◽

Statistical Disclosure Control ◽

Health Service Provider ◽

Disclosure Control ◽

Disclosure Risk ◽

Statistical Disclosure

ABSTRACT ObjectivesTo explore the application of automated zone design tools to protect record-level datasets with attribute detail and a large data volume in a way that might be implemented by a data provider (e.g. National Statistical Organisation/Health Service Provider), initially using a synthetic microdataset. Successful implementation could facilitate the release of rich linked record datasets to researchers so as to preserve small area geographical associations, while not revealing actual locations which are currently lost due to the high level of geographical coding required by data providers prior to release to researchers. Data perturbation is undesirable because of the need for detailed information on certain spatial attributes (e.g. distance to a medical practitioner, exposure to local environment) which has driven demand for new linked administrative datasets, along with provision of suitable research environments. The outcome is a bespoke aggregation of the microdata that meets a set of design constraints but the exact configuration of which is never revealed. Researchers are provided with detailed data and suitable geographies, yet with appropriately reduced disclosure risk. ApproachUsing a synthetic flat file microdataset of individual records with locality-level (MSOA) geography codes for England and Wales (variables: age, gender, economic activity, marital status, occupation, number of hours worked and general health), we synthesize address-level locations within MSOAs using 2011 Census headcount data. These synthetic locations are then associated with a range of spatial measures and indicators such as distance to a medical practitioner. Implementation of the AZTool zone design software enables a bespoke, non-disclosive zone design solution, providing area codes that can be added to the research data without revealing their true locations to the researcher. ResultsTwo sets of results will be presented. Firstly, we will explain the spatial characteristics of the new synthetic dataset which we propose may have broader utility. Secondly, we will present results showing changing risk of disclosure and utility when coding to spatial units from different scales and aggregations. Using the synthetic dataset will therefore demonstrate the utility of the approach for a variety of linked and administrative data without any actual disclosure risk. ConclusionsThis approach is applicable to a variety of datasets. The ability to quantify the zone design solution and security in relation to statistical disclosure control will be discussed. Provision of parameters from the zone design process to the data user and the implications of this for security and data users will be considered.

Download Full-text

Information-theoretic disclosure risk measures in statistical disclosure control of tabular data

Proceedings 14th International Conference on Scientific and Statistical Database Management ◽

10.1109/ssdm.2002.1029724 ◽

2003 ◽

Cited By ~ 6

Author(s):

J. Domingo-Feffer ◽

A. Oganian ◽

V. Torra

Keyword(s):

Risk Measures ◽

Statistical Disclosure Control ◽

Tabular Data ◽

Information Theoretic ◽

Disclosure Control ◽

Disclosure Risk ◽

Statistical Disclosure

Download Full-text

DIS: A New Approach to the Measurement of Statistical Disclosure Risk

Risk Management ◽

10.1057/palgrave.rm.8240067 ◽

2000 ◽

Vol 2 (4) ◽

pp. 39-48 ◽

Cited By ~ 13

Author(s):

Mark Elliot

Keyword(s):

New Approach ◽

Disclosure Risk ◽

Statistical Disclosure

Download Full-text