Differential privacy preserving clustering using Daubechies-2 wavelet transform

Author(s):  
Mohammad Reza Ebrahimi Dishabi ◽  
Mohammad Abdollahi Azgomi

Most existing privacy-preserving clustering (PPC) algorithms do not provide worst-case privacy guarantees and are based on heuristic notions of privacy. In addition, these algorithms do not run efficiently on high-dimensional data. In this paper, to alleviate these challenges, we propose a new PPC algorithm that is based on the Daubechies-2 wavelet transform (D2WT) and satisfies the differential privacy notion. Differential privacy is a strong notion of privacy that provides worst-case privacy guarantees. On the other hand, most existing differentially private PPC algorithms generate data with poor utility: if differential privacy is applied directly to the original raw data, the resulting data offers a lower quality of clustering (QOC) during cluster analysis. Therefore, we use D2WT to preprocess the original data before adding noise. By applying D2WT, the resulting data not only has lower dimensionality than the original data but also provides a differential privacy guarantee with high QOC, because less noise needs to be added. The proposed algorithm has been implemented and evaluated on several well-known datasets, and we compare it with some recently introduced algorithms in terms of utility and privacy.
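A minimal sketch of the pipeline this abstract describes, not the authors' implementation: reduce each record's dimensionality with a Daubechies-2 DWT (via the PyWavelets library), perturb the approximation coefficients with Laplace noise, and cluster the result. The sensitivity and epsilon values are illustrative assumptions, not calibrated bounds.

```python
import numpy as np
import pywt
from sklearn.cluster import KMeans

def d2wt_dp_kmeans(X, epsilon=1.0, sensitivity=1.0, n_clusters=3):
    # Daubechies-2 DWT along each record; keeping only the approximation
    # coefficients roughly halves the dimensionality.
    approx, _detail = pywt.dwt(X, 'db2', axis=1)
    # Laplace mechanism: noise scale = sensitivity / epsilon. Less noise is
    # needed than on the raw data because fewer coordinates are released.
    noisy = approx + np.random.laplace(0.0, sensitivity / epsilon, approx.shape)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(noisy)

X = np.random.rand(100, 16)            # 100 records, 16 attributes (toy data)
labels = d2wt_dp_kmeans(X, epsilon=0.5)
```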

Author(s):  
Quanming Yao ◽  
Xiawei Guo ◽  
James Kwok ◽  
Weiwei Tu ◽  
Yuqiang Chen ◽  
...  

To meet the standard of differential privacy, noise is usually added to the original data, which inevitably degrades the predictive performance of subsequent learning algorithms. In this paper, motivated by the success of ensemble learning in improving predictive performance, we propose to enhance privacy-preserving logistic regression by stacking. We show that this can be done with either sample-based or feature-based partitioning. However, we prove that, for the same privacy budget, feature-based partitioning requires fewer samples than sample-based partitioning and is thus likely to perform better empirically. As transfer learning is difficult to integrate with a differential privacy guarantee, we further combine the proposed method with hypothesis transfer learning to address the problem of learning across different organizations. Finally, we not only demonstrate the effectiveness of our method on two benchmark data sets, i.e., MNIST and NEWS20, but also apply it to a real application of cross-organizational diabetes prediction on the RUIJIN data set, where privacy is a significant concern.
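A rough sketch of the feature-based partitioning idea under stated assumptions: split the features into disjoint blocks, fit one logistic regression per block, and stack their predicted probabilities with a meta-learner. The Laplace output perturbation below is a placeholder, since the abstract does not give the paper's exact noise calibration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stacked_logreg(X, y, n_parts=4, epsilon=1.0):
    n, d = X.shape
    parts = np.array_split(np.arange(d), n_parts)  # feature-based partitioning
    meta_features = []
    for cols in parts:
        base = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
        # Illustrative output perturbation; a real DP guarantee would need a
        # noise scale derived from the sensitivity of the regularized fit.
        base.coef_ += np.random.laplace(0.0, 1.0 / (epsilon * n),
                                        base.coef_.shape)
        meta_features.append(base.predict_proba(X[:, cols])[:, 1])
    Z = np.column_stack(meta_features)
    return LogisticRegression(max_iter=1000).fit(Z, y)  # stacking meta-learner
```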


2020 ◽  
Author(s):  
Fatima Zahra Errounda ◽  
Yan Liu

Abstract Location and trajectory data are routinely collected to generate valuable knowledge about users' behavior patterns. However, releasing location data may jeopardize the privacy of the individuals involved. Differential privacy is a powerful technique that prevents an adversary from inferring the presence or absence of an individual in the original data solely from the observed data. The first challenge in applying differential privacy to location data is that such data usually involve a single user; this shifts the adversary's target to the user's locations rather than the user's presence or absence in the original data. The second challenge is that the inherent correlation within location data, due to the regularity and predictability of people's movements, gives the adversary an advantage in inferring information about individuals. In this paper, we review the differentially private approaches that tackle these challenges. Our goal is to help newcomers to the field better understand the state of the art by providing a research map that highlights the different challenges in designing differentially private frameworks suited to the characteristics of location data. We find that, in protecting an individual's location privacy, the attention of differential privacy mechanisms shifts to preventing the adversary from inferring the original location from the observed one. Moreover, we find that privacy-preserving mechanisms exploit the predictability and regularity of users' movements to design and protect users' privacy in trajectory data. Finally, we explore how well the presented frameworks protect users' locations and trajectories against well-known privacy attacks.
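One concrete mechanism from this literature is the planar Laplace mechanism of geo-indistinguishability, which perturbs the observed location rather than hiding presence or absence. A minimal sketch, assuming coordinates in a planar (projected) system so that Euclidean distance is meaningful; in polar coordinates the noise angle is uniform and the radius is Gamma(2, 1/epsilon)-distributed.

```python
import numpy as np

def planar_laplace(location, epsilon):
    # Geo-indistinguishability: 2-D noise whose density decays as
    # exp(-epsilon * distance); sampled via uniform angle + Gamma radius.
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    r = np.random.gamma(shape=2.0, scale=1.0 / epsilon)
    return location + r * np.array([np.cos(theta), np.sin(theta)])

obs = planar_laplace(np.array([1200.0, 340.0]), epsilon=0.01)  # toy coordinates
```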


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Farhad Farokhi

Abstract Local differential privacy has become the gold standard of the privacy literature for gathering or releasing sensitive individual data points in a privacy-preserving manner. However, locally differentially private data can distort the probability density of the data because of the additive noise used to ensure privacy. In fact, the density of privacy-preserving data (no matter how many samples we gather) is always flatter than the density of the original data points, due to convolution with the density of the privacy-preserving noise. The effect is especially pronounced for slow-decaying privacy-preserving noise, such as Laplace noise, and can result in under- or over-estimation of heavy hitters. This is an important challenge facing social scientists due to the use of differential privacy in the 2020 Census in the United States. In this paper, we develop density estimation methods using smoothing kernels. We use the framework of deconvoluting kernel density estimators to remove the effect of privacy-preserving noise. This approach also allows us to adapt results from non-parametric regression with errors-in-variables to develop regression models based on locally differentially private data. We demonstrate the performance of the developed methods on financial and demographic datasets.
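A minimal sketch of a deconvoluting kernel density estimator for Laplace noise, under stated assumptions: a sinc kernel (Fourier transform equal to 1 on [-1, 1]) and a known noise scale b. Since the Laplace characteristic function is 1 / (1 + b^2 t^2), dividing the kernel's Fourier transform by it multiplies the integrand by (1 + b^2 t^2 / h^2), which is what undoes the flattening described above.

```python
import numpy as np

def deconvoluting_kde(obs, b, h, grid):
    # Density estimate from observations contaminated with Laplace(0, b)
    # noise. Deconvoluting kernel:
    # K*(u) = (1 / 2 pi) * integral over [-1, 1] of cos(t u) * ft(t) dt.
    t = np.linspace(-1.0, 1.0, 2001)
    dt = t[1] - t[0]
    ft = 1.0 + (b * t / h) ** 2          # phi_K(t) / phi_noise(t / h)
    density = np.empty_like(grid, dtype=float)
    for i, x in enumerate(grid):
        u = (x - obs) / h
        kstar = (np.cos(np.outer(u, t)) * ft).sum(axis=1) * dt / (2.0 * np.pi)
        density[i] = kstar.sum() / (len(obs) * h)
    return density

true = np.random.normal(0.0, 1.0, 2000)
noisy = true + np.random.laplace(0.0, 0.5, 2000)   # privacy-preserving noise
est = deconvoluting_kde(noisy, b=0.5, h=0.4, grid=np.linspace(-4, 4, 81))
```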


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Yang Bai ◽  
Yu Li ◽  
Mingchuang Xie ◽  
Mingyu Fan

In recent years, machine learning approaches have been widely adopted for many applications, including classification. Machine learning models that deal with collected sensitive data are usually trained on a remote public cloud server, for instance in a machine-learning-as-a-service (MLaaS) system. In this setting, users upload their local data and use the cloud's computational capability to train models, or directly query models trained by the MLaaS provider. Unfortunately, recent works reveal that both the curious server (which trains the model on users' sensitive local data and is curious about information on individuals) and the malicious MLaaS user (who abuses queries to the MLaaS system) pose privacy risks. Adversarial methods, as one typical mitigation, have been studied in several recent works. However, most of them focus on privacy preservation against the malicious user; in other words, they treat the data owner and the model provider as a single role. Under this assumption, the privacy leakage risks from the curious server are neglected. Differential privacy methods can defend against privacy threats from both the curious server and the malicious MLaaS user by adding noise directly to the training data. Nonetheless, differential privacy heavily decreases the classification accuracy of the target model. In this work, we propose a generic privacy-preserving framework based on adversarial methods to defend against both the curious server and the malicious MLaaS user. The framework can be combined with several adversarial algorithms to generate adversarial examples directly from data owners' original data; by doing so, sensitive information about the original data is hidden. We then explore the constraint conditions of this framework, which help us find the balance between privacy protection and model utility. Our experimental results show that the defense framework with the AdvGAN method is effective against membership inference attacks (MIA), and that the defense framework with the FGSM method can protect sensitive data from direct content-exposure attacks. In addition, our method achieves a better privacy-utility balance than the existing method.
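For concreteness, a minimal FGSM sketch in PyTorch (one of the adversarial algorithms named above; a generic textbook version, not the authors' framework): the input is moved in the direction that increases the model's loss, which hides the original content while the budget eps keeps the perturbation small.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.03):
    # Fast Gradient Sign Method: x_adv = x + eps * sign(grad_x loss).
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()                              # gradient w.r.t. the input
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()        # keep pixels in valid range
```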


Author(s):  
Hye-Chung Kum ◽  
Prannay Jain

ABSTRACT
Objective: Information privacy theory demonstrates mathematically that privacy is a budget-constrained problem and that privacy-preserving algorithms (e.g., differential privacy) must rely on a budgeting system. Thus, we design a privacy measure as a function of the information disclosed, to support the incremental information disclosure required for safe interactive record linkage. The privacy measure determines the increase in privacy risk for any given piece of information disclosed during record linkage.
Approach: Mathematically, the identity disclosure risk is inversely proportional to the number of entities in the population that share the disclosed information. If the information refers to one and only one person in the population, then the identity of that person has been fully disclosed by the information revealed. On the other hand, if the disclosed information is identical for multiple people (say n), then the information is less revealing, as it could refer to any one of the n people; the larger the n, the lower the privacy risk. Thus, the anonymity-set size is defined as the number of people in the population who share the same identifying information. The privacy risk measure has one prespecified parameter k, which represents the minimum anonymity-set size that guarantees no privacy risk. That is, for any disclosed information, if the anonymity-set size is less than k, then a privacy risk is present and a risk score is calculated. A commonly accepted threshold for k is 5 or 10. When all entities have anonymity-set size less than k, the privacy risk is 100%; when all entities have anonymity-set size greater than or equal to k, the privacy risk is 0%.
Results: The budgeting system contributes to the much-needed methods for protecting privacy while still supporting high-quality interactive record linkage, by allowing safer manual resolution of uncertain linkages. It supports refining effective visual encoding techniques that incrementally reveal only the required information on an as-needed basis during manual resolution of uncertain linkages, as well as refining the design of a visual interface that facilitates privacy-preserving data standardization, cleaning, and conflict resolution for interactive record linkage. We evaluate the budgeting system on the NC voter registry data.
Conclusion: The k-anonymity-based privacy risk budgeting system provides a mechanism for concretely reasoning about the trade-off between the privacy risk due to the information disclosed, the accuracy gained, and the biases reduced during interactive record linkage.
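A small sketch of the anonymity-set computation under stated assumptions: each record is reduced to the tuple of attribute values disclosed so far, and the risk score interpolates between the two endpoints the abstract fixes (100% when every anonymity set is smaller than k, 0% when every set has size at least k). The fraction-at-risk formula below is one simple interpolation, not necessarily the authors' exact score.

```python
from collections import Counter

def privacy_risk(disclosed, k=5):
    # disclosed: one tuple of revealed attribute values per record.
    # A record's anonymity-set size is the number of records sharing its
    # tuple; records whose set size is below k carry privacy risk.
    sizes = Counter(disclosed)
    at_risk = sum(1 for rec in disclosed if sizes[rec] < k)
    return 100.0 * at_risk / len(disclosed)

records = [("1980", "27514"), ("1980", "27514"), ("1975", "27701")]
print(privacy_risk(records, k=5))   # every set is smaller than 5 -> 100.0
```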


2015 ◽  
Vol 19 (6) ◽  
pp. 1323-1353 ◽  
Author(s):  
Mohammad Reza Ebrahimi Dishabi ◽  
Mohammad Abdollahi Azgomi

2019 ◽  
Vol 20 (2) ◽  
pp. 70-77
Author(s):  
Hana H Kareem ◽  
Esraa G. Daway ◽  
Hazim G. Daway

The aim of this research is to measure the quality of hazy images using a no-reference scale based on the Transmission Component and Wavelet Transform (TCWT), by calculating the histogram of the high-low (HL) subband. The system captures several images at different levels of distortion, from low to medium to high, and quality is studied in the transmission component. This measure is compared with other no-reference measures, namely Haze Distribution Map based Haze Assessment (HDMHA) and entropy, by calculating the correlation coefficient between the no-reference measures and the reference Universal Quality Index (UQI). The results show that the proposed TCWT algorithm is a good measure of the quality of hazy images.
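A minimal sketch of the subband-histogram step using PyWavelets, under stated assumptions: `pywt.dwt2` returns `(cA, (cH, cV, cD))`, and both the choice of wavelet and the reading of `cH` as the paper's HL band are assumptions, since the abstract pins down neither.

```python
import numpy as np
import pywt

def hl_histogram(gray_image, bins=64):
    # Single-level 2-D DWT; cH holds horizontal-detail coefficients,
    # taken here as a stand-in for the HL subband.
    _cA, (cH, _cV, _cD) = pywt.dwt2(gray_image, 'db2')  # wavelet choice assumed
    hist, edges = np.histogram(cH.ravel(), bins=bins)
    return hist / hist.sum(), edges
```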


2014 ◽  
Vol 18 (4) ◽  
pp. 583-608 ◽  
Author(s):  
Mohammad Reza Ebrahimi Dishabi ◽  
Mohammad Abdollahi Azgomi

2020 ◽  
Vol 9 (1) ◽  
pp. 1279-1282

With the advancement of technology and the proliferation of computers in the country, the number of Afaan Oromo news documents being produced is increasing, which makes it difficult for news agencies to organize such a huge collection of documents manually. To address this problem, this research applies unsupervised machine learning tools in Python to cluster Afaan Oromo news documents at low cost and with a high-quality clustering solution. The work focuses on k-means cluster analysis, which produced better results than the other clustering methods, both in terms of time requirements and the quality of the clusters produced.
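As a rough illustration of the kind of Python pipeline described (the documents and cluster count below are placeholders): TF-IDF features from scikit-learn feeding k-means.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["news document text one ...",   # placeholder documents
        "news document text two ...",
        "news document text three ..."]
X = TfidfVectorizer().fit_transform(docs)        # sparse TF-IDF matrix
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
```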

