On some consequences of the permutation paradigm for data anonymization: Centrality of permutation matrices, universal measures of disclosure risk and information loss, evaluation by dominance

2018 ◽  
Vol 430-431 ◽  
pp. 620-633 ◽  
Author(s):  
Nicolas Ruiz

Over the years, the literature on individual data anonymization has burgeoned in many directions. While such diversity should be praised, it does not come without difficulties. Currently, the task of selecting the optimal analytical environment is complicated by the multitude of available choices and by the fact that the performance of any method generally depends on the properties of the data. In light of these issues, the contribution of this paper is twofold. First, based on recent insights from the literature and inspired by cryptography, it proposes a new anonymization method showing that the task of anonymization can ultimately rely only on rank permutations. As a result, the method offers a new way to practice data anonymization by performing it ex ante and independently of the distributional features of the data, instead of engaging, as is currently the case in the literature, in several ex post evaluations and iterations to reach the protection and information properties sought. Second, the method establishes a conceptual connection across the field, as it can mimic all currently existing tools. To make the method operational, this paper also proposes the introduction of permutation menus in data anonymization, where recently developed universal measures of disclosure risk and information loss are used ex ante to calibrate permutation keys. To justify the relevance of their use, a theoretical characterization of these measures is also proposed.
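
The paper's own algorithm is not reproduced here; as a rough illustration of the rank-permutation idea it builds on, the sketch below permutes, attribute by attribute, the ranks of the values by a bounded random displacement d, which plays the role of a permutation key chosen ex ante. The function name, the uniform-noise displacement scheme, and the bound d are illustrative assumptions, not the author's method.

import numpy as np

def permute_ranks(column, d, rng):
    """Reassign values so each record's rank moves by roughly at most d positions."""
    n = len(column)
    order = np.argsort(column, kind="stable")            # record indices, lowest value first
    noisy_ranks = np.arange(n) + rng.uniform(-d, d, size=n)
    rank_perm = np.argsort(noisy_ranks, kind="stable")   # bounded permutation of the ranks
    anonymized = np.empty_like(column)
    anonymized[order] = column[order][rank_perm]         # record at rank r gets the value at rank rank_perm[r]
    return anonymized

rng = np.random.default_rng(0)
ages = np.array([23, 35, 41, 29, 52, 44])
masked_ages = permute_ranks(ages, d=2, rng=rng)          # same values, reassigned across records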


2012 ◽  
Vol 39 (10) ◽  
pp. 9764-9777 ◽  
Author(s):  
Grigorios Loukides ◽  
Aris Gkoulalas-Divanis

2020 ◽  
Vol 30 (1) ◽  
pp. 327-345
Author(s):  
Sarah Zouinina ◽  
Younès Bennani ◽  
Nicoleta Rogovschi ◽  
Abdelouahid Lyhyaoui

The interest in data anonymization is growing rapidly, motivated by the will of governments to open their data. The main challenge of data anonymization is to find a balance between data utility and disclosure risk. One of the best-known frameworks for data anonymization is k-anonymity, which considers a dataset anonymous if and only if, for each element of the dataset, there exist at least k − 1 other elements identical to it. In this paper, we propose two techniques to achieve k-anonymity through microaggregation: k-CMVM and Constrained-CMVM. Both use topological collaborative clustering to obtain k-anonymous data. The first determines the k levels automatically, while the second defines them by exploration. We also improve the results of these two approaches by using pLVQ2 as a weighted vector quantization method. The four proposed methods were shown to be efficient using two data utility measures, the separability utility and the structural utility. The experimental results show very promising performance.
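
The collaborative topological clustering used by k-CMVM and Constrained-CMVM is not reproduced here; the sketch below only illustrates the underlying microaggregation principle: records are grouped into clusters of at least k and each group is released as its centroid, so every published record coincides with at least k − 1 others. The principal-component ordering and the fixed group size are illustrative assumptions.

import numpy as np

def microaggregate(X, k):
    """Replace each group of >= k records (ordered along the first principal axis) by its centroid."""
    n = X.shape[0]
    centered = X - X.mean(axis=0)
    pc1 = np.linalg.svd(centered, full_matrices=False)[2][0]  # first principal axis
    order = np.argsort(centered @ pc1)
    X_anon = X.astype(float)                                  # copy of the data, as floats
    for start in range(0, n, k):
        group = order[start:start + k]
        if len(group) < k:                                    # fold a short tail into the previous group
            group = order[start - k:]
        X_anon[group] = X[group].mean(axis=0)                 # release the group centroid for every member
    return X_anon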


Data ◽  
2021 ◽  
Vol 6 (5) ◽  
pp. 53
Author(s):  
Ebaa Fayyoumi ◽  
Omar Alhuniti

This research investigates the micro-aggregation problem in secure statistical databases by integrating the divide-and-conquer concept with a genetic algorithm. This is achieved by recursively dividing a micro-data set into two subsets based on proximity distance similarity. On each subset, the genetic operation "crossover" is performed until the convergence condition is satisfied. The recursion terminates once the generated subsets reach the required size. Finally, the genetic operation "mutation" is performed over all generated subsets that satisfy the variable group size constraint, in order to maximize the objective function. Experimentally, the proposed micro-aggregation technique was applied to recommended real-life data sets. The results demonstrate a remarkable reduction in computational time, sometimes exceeding 70% compared to the state of the art. Furthermore, a good equilibrium value of the Scoring Index (SI) was achieved by using a linear combination of the General Information Loss (GIL) and the General Disclosure Risk (GDR).
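
As a rough sketch of the divide-and-conquer skeleton described above (with the genetic crossover and mutation operators abstracted into a refine callback), the data set is recursively bisected by proximity until the subsets are small enough, and the refinement is applied to each leaf. The median-split rule and the size threshold are illustrative assumptions, not the paper's exact procedure.

import numpy as np

def recursive_partition(X, max_size, refine):
    """Recursively bisect X by proximity; apply `refine` (e.g. a GA step) to each leaf."""
    if len(X) <= max_size:
        return [refine(X)]
    j = np.argmax(X.var(axis=0))                  # attribute with the largest spread
    mask = X[:, j] <= np.median(X[:, j])
    left, right = X[mask], X[~mask]
    if len(left) == 0 or len(right) == 0:         # degenerate split: stop recursing
        return [refine(X)]
    return (recursive_partition(left, max_size, refine)
            + recursive_partition(right, max_size, refine))

# usage (illustrative): groups = recursive_partition(data, max_size=3 * k, refine=lambda subset: subset)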


Author(s):  
Avinash C. Singh

Consider a complete rectangular database at the micro (or unit) level from a survey (sample or census) or nonsurvey (administrative) source in which potential identifying variables (IVs) are suitably categorized (so that analytic utility is essentially maintained) to reduce the pretreatment disclosure risk as far as possible. The pretreatment risk is due to the presence of unique records (with respect to IVs) or nonuniques (i.e., more than one record sharing a common IV profile) with similar values of at least one sensitive variable (SV). This setup covers macro (or aggregate) level data, including tabular data, because a common mean value (of 1 in the case of count data) can be assigned to all units in the aggregation or cell. Our goal is to create a public use file with simultaneous control of disclosure risk and information loss after disclosure treatment by perturbation (i.e., substitution of IVs, not SVs) and suppression (i.e., subsampling-out of records). In this paper, an alternative framework for measuring information loss and disclosure risk under a nonsynthetic approach, as proposed by Singh (2002, 2006), is considered; in contrast to the commonly used deterministic treatment, it is based on a stochastic selection of records for disclosure treatment, in the sense that all records are subject to treatment (with possibly different probabilities) but only a small proportion of them are actually treated. We also propose an extension of this framework with the goal of generalizing the risk measures to allow partial risk scores for unique and nonunique records. Survey sampling techniques of sample allocation are used to assign substitution and subsampling rates to risk strata defined by unique and nonunique records such that the bias due to substitution and the variance due to subsampling for the main study variables (functions of SVs and IVs) are minimized. This is followed by calibration to controls based on original estimates of the main study variables, so that these estimates are preserved and the bias and variance of other study variables may also be reduced. This alternative framework leads to the disclosure treatment method known as MASSC (signifying micro-agglomeration, substitution, subsampling, and calibration) and to an enhanced method (denoted GenMASSC) which uses generalized risk measures. The GenMASSC method is illustrated through a simple example, followed by a discussion of the relative merits and demerits of nonsynthetic and synthetic methods of disclosure treatment.
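
A hedged sketch of the stochastic-selection idea described above (not MASSC itself, which also involves micro-agglomeration and calibration): every record receives a treatment probability according to its risk stratum, unique versus nonunique on the IVs, and only a randomly selected small share is actually substituted or subsampled out. The column handling, rates, and names below are illustrative assumptions.

import numpy as np
import pandas as pd

def stochastic_treatment(df, iv_cols, rates, rng):
    """rates maps 'unique'/'nonunique' to (substitution_prob, subsampling_prob)."""
    sizes = df.groupby(iv_cols)[iv_cols[0]].transform("size").to_numpy()
    stratum = np.where(sizes == 1, "unique", "nonunique")     # risk strata on the IV profile
    out = df.copy()
    keep = np.ones(len(df), dtype=bool)
    iv_pos = [df.columns.get_loc(c) for c in iv_cols]
    for s, (p_sub, p_drop) in rates.items():
        idx = np.flatnonzero(stratum == s)
        treated = idx[rng.random(len(idx)) < p_sub]           # stochastic selection for substitution
        if len(treated):
            donors = rng.choice(len(df), size=len(treated))   # swap in IVs of random donor records
            out.iloc[treated, iv_pos] = df.iloc[donors, iv_pos].to_numpy()
        keep[idx[rng.random(len(idx)) < p_drop]] = False      # subsample a small share of records out
    return out[keep]

# illustrative rates: unique records are treated far more often than nonuniques
rng = np.random.default_rng(1)
rates = {"unique": (0.30, 0.10), "nonunique": (0.05, 0.02)}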


2020 ◽  
Vol 39 (5) ◽  
pp. 5999-6008
Author(s):  
Vicenç Torra

Microaggregation is an effective data-driven protection method that permits a good trade-off between disclosure risk and information loss. In this work we propose a microaggregation method based on fuzzy c-means that is appropriate when there are linear constraints on the variables that describe the data. Our method leads to results that satisfy these constraints even when the data to be masked do not satisfy them.
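
The constraint-handling machinery is the paper's contribution and is not reproduced here; the sketch below only shows the unconstrained backbone: a standard fuzzy c-means iteration followed by releasing each record as its highest-membership centroid. The fuzzifier m = 2 and the iteration count are illustrative assumptions.

import numpy as np

def fcm_mask(X, c, m=2.0, iters=20, rng=None):
    """Mask X by replacing each record with its highest-membership fuzzy c-means centroid."""
    rng = rng or np.random.default_rng(0)
    n = len(X)
    U = rng.dirichlet(np.ones(c), size=n)                  # n x c membership matrix, rows sum to 1
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]       # c x p membership-weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))                   # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return centers[U.argmax(axis=1)]                       # masked records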


In this era of Big Data, many organizations work with personal data that must be protected for privacy reasons. There is a risk of identifying individuals through quasi-identifiers (QIs). To preserve privacy, anonymization converts personal data into data that cannot be linked back to identified individuals. Many organizations produce large volumes of data in real time, and with the help of Hadoop components such as HDFS and MapReduce, together with their ecosystems, these volumes can be processed in real time. Basic data anonymization techniques include cryptographic methods, substitution, character masking, shuffling, nulling out, date variance, and number variance. Here, privacy preservation for streaming data is achieved by combining one of these anonymization techniques, 'shuffling', with Big Data concepts. K-anonymity, t-closeness, and l-diversity are commonly used privacy techniques, but in all of them information loss is high and data utility is not well preserved. The Dynamically Anonymizing Data Shuffling (DADS) technique is used to overcome this information loss and to improve data utility for streaming data.
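
DADS itself is not reproduced here; as a minimal illustration of the basic 'shuffling' technique the abstract builds on, the sketch below randomly reorders the values of selected sensitive columns within a micro-batch, preserving marginal distributions (and hence aggregate utility) while breaking the link to individual records. The column names and batch layout are illustrative.

import numpy as np
import pandas as pd

def shuffle_columns(batch, cols, rng):
    """Randomly reorder the values of `cols` within the batch."""
    out = batch.copy()
    for col in cols:
        out[col] = rng.permutation(out[col].to_numpy())
    return out

rng = np.random.default_rng(42)
batch = pd.DataFrame({"age": [23, 35, 41], "zip": ["10115", "20095", "80331"],
                      "salary": [50_000, 42_000, 61_000]})
masked = shuffle_columns(batch, cols=["salary"], rng=rng)    # salaries detached from their records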


2019 ◽  
Vol 2 (1) ◽  
pp. 1-17
Author(s):  
Muhammad Ibnu Pamungkas ◽  
Izzuddin Musthafa ◽  
Muhammad Nurhasan

Ta’lim Muta’alim is Syaikh al-Zarnūjī’s work on the norms, ethics, and rules for gaining knowledge according to Islamic teachings, so that seekers of knowledge may reach their goal of obtaining it. The book was translated into Indonesian by Achmad Sunarto and published by Husaini Publisher in Bandung. After reading the whole translation, the researcher found mistakes, especially in word selection (diction). The mistakes fall into four categories: (1) translations that are direct transliterations from the source language without considering their compatibility with the target language; (2) information loss and gain that affect the translation and make it unsuitable; (3) word choices that do not match the meaning referred to in the source text; and (4) translations that are unacceptable in the target language because they are rendered literally.

