Differentially Private High-Dimensional Binary Data Publication via Adaptive Bayesian Network

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Sun Lan ◽  
Jinxin Hong ◽  
Junya Chen ◽  
Jianping Cai ◽  
Yilei Wang

When differential privacy is used to publish high-dimensional data, the large number of dimensions forces a correspondingly large amount of noise, and high-dimensional binary data in particular is easily drowned out by it. Most existing methods cannot handle real high-dimensional data appropriately because of their high time complexity. In response to these problems, we propose PrivABN, a differentially private adaptive Bayesian network algorithm for publishing high-dimensional binary data. The algorithm uses a new greedy procedure to accelerate the construction of Bayesian networks, reducing the time complexity of the GreedyBayes algorithm from $O\left(nk\,C_{m+1}^{k+2}\right)$ to $O\left(nm^{4}\right)$. In addition, it adjusts the network structure with an adaptive algorithm and applies the differentially private Exponential mechanism to preserve privacy, so as to generate a high-quality protected Bayesian network. The Bayesian network is then used to compute noisy conditional distributions and to generate a synthetic dataset for publication; this synthetic dataset satisfies ε-differential privacy. Lastly, we carry out experiments on three real-life high-dimensional binary datasets to evaluate the approach.
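The privacy-critical step described above is the Exponential mechanism used while selecting structure during Bayesian network construction. Below is a minimal sketch of that selection step, assuming a generic quality (score) function over candidate parent sets; the candidate representation and the `mutual_info` call in the usage comment are illustrative, not taken from PrivABN.

```python
import numpy as np

def exponential_mechanism(candidates, quality, epsilon, sensitivity, rng=None):
    """Pick one candidate with probability proportional to
    exp(epsilon * quality(c) / (2 * sensitivity)) -- the Exponential mechanism."""
    rng = rng or np.random.default_rng()
    scores = np.array([quality(c) for c in candidates], dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                 # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Hypothetical usage when picking a parent set for one attribute:
#   chosen = exponential_mechanism(candidate_parent_sets,
#                                  quality=lambda ps: mutual_info(attr, ps),
#                                  epsilon=0.1, sensitivity=1.0)
```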

2012 ◽  
Vol 8 (2) ◽  
pp. 44-63 ◽  
Author(s):  
Baoxun Xu ◽  
Joshua Zhexue Huang ◽  
Graham Williams ◽  
Qiang Wang ◽  
Yunming Ye

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for each subspace is not suitable for high-dimensional data with thousands of features: such data often contains many features that are uninformative for classification, so random sampling frequently fails to include informative features in the selected subspaces, and the classification performance of the random forest suffers significantly. In this paper, the authors propose an improved random forest method that uses a novel feature-weighting method for subspace selection and thereby improves classification performance on high-dimensional data. A series of experiments on nine real-life high-dimensional datasets demonstrates that, using a subspace size of features where M is the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.
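The key idea is to bias subspace sampling toward informative features instead of sampling uniformly. A minimal sketch under that assumption is shown below: it weights features by their mutual information with the class label and samples a subspace for one tree. The weighting statistic and subspace size are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def weighted_subspace(X, y, subspace_size, seed=0):
    """Sample a feature subspace with probability proportional to an
    informativeness score (here, mutual information with the class label)."""
    rng = np.random.default_rng(seed)
    scores = mutual_info_classif(X, y, random_state=seed)
    probs = scores + 1e-12            # avoid an all-zero weight vector
    probs = probs / probs.sum()
    return rng.choice(X.shape[1], size=subspace_size, replace=False, p=probs)

# Each tree in the forest would then be grown on X[:, weighted_subspace(...)]
# (or, as in standard random forests, a subspace is re-sampled at every node).
```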


2015 ◽  
Vol 2015 ◽  
pp. 1-12 ◽  
Author(s):  
Sai Kiranmayee Samudrala ◽  
Jaroslaw Zola ◽  
Srinivas Aluru ◽  
Baskar Ganapathysubramanian

Dimensionality reduction refers to a set of mathematical techniques used to reduce the complexity of the original high-dimensional data while preserving selected properties of it. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous, high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to the datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify the key components underlying spectral dimensionality reduction techniques and propose their efficient parallel implementation. We show that the resulting framework can process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate the applicability of our framework, we perform dimensionality reduction of 75,000 images representing morphology evolution during the manufacturing of organic solar cells, in order to identify how processing parameters affect that evolution.
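The spectral techniques referred to above share a common computational pattern: build a pairwise (dis)similarity matrix, transform it, and take the top eigenvectors as the low-dimensional embedding. The serial sketch below illustrates that pattern for classical multidimensional scaling; the parallel framework in the paper distributes exactly these steps (distance computation and the eigensolve), which this sketch does not attempt.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh

def classical_mds(X, k=2):
    """Embed n points into k dimensions from their pairwise Euclidean distances."""
    D2 = squareform(pdist(X)) ** 2          # squared distance matrix, n x n
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ D2 @ J                   # double-centered Gram matrix
    vals, vecs = eigh(B, subset_by_index=[n - k, n - 1])  # top-k eigenpairs
    order = np.argsort(vals)[::-1]          # sort descending
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.maximum(vals, 0.0))

Y = classical_mds(np.random.default_rng(0).normal(size=(200, 50)), k=2)
```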


2020 ◽  
Vol 32 (8) ◽  
pp. 1557-1571 ◽  
Author(s):  
Xiang Cheng ◽  
Peng Tang ◽  
Sen Su ◽  
Rui Chen ◽  
Zequn Wu ◽  
...  

Author(s):  
Alfredo Cuzzocrea ◽  
Svetlana Mansmann

The problem of efficiently visualizing multidimensional data sets produced by scientific and statistical tasks/processes is becoming increasingly challenging and is attracting the attention of a wide, multidisciplinary community of researchers and practitioners. Essentially, the problem consists in visualizing multidimensional data sets while capturing the dimensionality of the data, which is the most difficult aspect to handle. Human analysts interacting with high-dimensional data often experience disorientation and cognitive overload. Analysis of high-dimensional data is a challenge encountered in a wide range of real-life applications, such as (i) biological databases storing massive gene and protein data sets, (ii) real-time monitoring systems accumulating data produced by multiple, multi-rate streaming sources, and (iii) advanced Business Intelligence (BI) systems collecting business data for decision-making purposes. Traditional DBMS front-end tools, which are usually tuple-bag-oriented, are inadequate for interactive exploration of high-dimensional data sets for two major reasons: (i) DBMSs implement the OLTP paradigm, which is optimized for transaction processing and deliberately neglects the dimensionality of data; and (ii) DBMS operators offer little beyond conventional SQL statements, which makes such tools very inefficient for visualizing and, above all, interacting with multidimensional data sets embedding a large number of dimensions. Despite the practical relevance of the problem of visualizing multidimensional data sets, the literature in this field is rather scarce, because for many years the problem was of relevance to life-science research communities only, and their interaction with the computer science research community was insufficient. With the enormous growth of scientific disciplines such as Bioinformatics, the problem has since become a fundamental field of computer science research, in academia as well as industry. At the same time, a number of proposals dealing with the multidimensional data visualization problem have appeared in the literature, with the added benefit of stimulating novel and exciting application fields such as the visualization of data mining results generated by challenging techniques like clustering and association rule discovery. These issues illustrate the high relevance and attractiveness of the problem of visualizing multidimensional data sets at present and in the future, with challenging research findings accompanied by significant spin-offs in the Information Technology (IT) industry. A possible way to tackle this problem is offered by well-known OLAP techniques (Codd et al., 1993; Chaudhuri & Dayal, 1997; Gray et al., 1997), which focus on obtaining very efficient representations of multidimensional data sets, called data cubes; this has led to the research field known in the literature as OLAP Visualization or Visual OLAP, terms that are used interchangeably in the remainder of the article.
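The data cube mentioned at the end of the passage is simply the collection of aggregates over every subset of the dimensions. The sketch below computes that lattice naively with pandas over a toy sales table; the column names and measure are illustrative, not drawn from the article.

```python
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US"],
    "product": ["A",  "B",  "A",  "B"],
    "year":    [2020, 2020, 2021, 2021],
    "amount":  [10,   7,    12,   5],
})

dims = ["region", "product", "year"]
cube = {}
# One aggregate per subset of dimensions (the group-by lattice), including the
# empty subset, which is the grand total.
for r in range(len(dims) + 1):
    for group in combinations(dims, r):
        key = group or ("ALL",)
        cube[key] = (sales.groupby(list(group))["amount"].sum()
                     if group else sales["amount"].sum())
```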


Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2516
Author(s):  
Chunhua Ju ◽  
Qiuyang Gu ◽  
Gongxing Wu ◽  
Shuangzhu Zhang

Although crowd-sensing systems deliver great data value through the release and analysis of high-dimensional perception data, they also pose serious privacy risks to participants. Various privacy protection methods based on differential privacy have been proposed, but most of them cannot simultaneously handle the complex attribute associations within high-dimensional perception data and the privacy threats posed by untrustworthy servers. To address this, we put forward a local privacy protection mechanism based on a Bayes network for high-dimensional perception data. The mechanism protects the users' data locally from the outset, eliminating the possibility of other parties directly accessing the original data and thereby protecting data privacy at its source. After receiving the locally protected data, the perception server identifies the dimensional correlations in the high-dimensional data using the Bayes network, divides the high-dimensional attribute set into multiple relatively independent low-dimensional attribute sets, and then synthesizes a new dataset from them in sequence. This effectively preserves the attribute correlations of the original perception data and ensures that the synthetic dataset and the original dataset have statistical characteristics that are as similar as possible. To verify its effectiveness, we conduct extensive simulation experiments; the results show that the synthetic data produced under this local privacy protection retains relatively high utility.
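Local protection of this kind is built on per-user perturbation before any data leaves the device. The sketch below shows the classic randomized-response perturbation for a single binary attribute, together with the unbiased frequency estimate the server can still recover; it is a generic local-differential-privacy building block, not the paper's exact mechanism.

```python
import numpy as np

def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability p = e^eps / (e^eps + 1), else flip it.
    This satisfies epsilon-local differential privacy for one binary value."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return bit if rng.random() < p else 1 - bit

def estimate_frequency(reports, epsilon):
    """Unbiased server-side estimate of the true proportion of 1s."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    observed = np.mean(reports)
    return (observed + p - 1.0) / (2.0 * p - 1.0)

rng = np.random.default_rng(0)
true_bits = rng.random(10_000) < 0.3           # 30% of users hold a 1
reports = [randomized_response(int(b), 1.0, rng) for b in true_bits]
print(estimate_frequency(reports, 1.0))        # close to 0.3
```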


2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Hongchao Song ◽  
Zhuqing Jiang ◽  
Aidong Men ◽  
Bo Yang

Anomaly detection, which aims to identify observations that deviate from a nominal sample, is a challenging task for high-dimensional data. Traditional distance-based anomaly detection methods compute neighborhood distances between observations and suffer from the curse of dimensionality in high-dimensional space: the distances between any pair of samples become similar, and every sample may appear to be an outlier. In this paper, we propose a hybrid semi-supervised anomaly detection model for high-dimensional data that consists of two parts: a deep autoencoder (DAE) and an ensemble of k-nearest-neighbor-graph- (K-NNG-) based anomaly detectors. Benefiting from its nonlinear mapping ability, the DAE is first trained to learn the intrinsic features of the high-dimensional dataset and represent it in a more compact subspace. Several nonparametric KNN-based anomaly detectors are then built from different subsets randomly sampled from the whole dataset, and the final prediction combines the outputs of all the detectors. The performance of the proposed method is evaluated on several real-life datasets, and the results confirm that the hybrid model improves detection accuracy and reduces computational complexity.
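To make the two-part pipeline concrete, here is a minimal sketch: a small autoencoder compresses the data, and several k-NN detectors built on random subsets of the nominal training codes are averaged. Average k-NN distance stands in for the paper's K-NNG-based detector, and the network sizes, subset fraction, and k are illustrative assumptions.

```python
import numpy as np
import torch
from torch import nn
from sklearn.neighbors import NearestNeighbors

class DAE(nn.Module):
    """A small autoencoder: the encoder output is the compact representation."""
    def __init__(self, d_in, d_latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                     nn.Linear(64, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(),
                                     nn.Linear(64, d_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(model, X, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), X)    # reconstruct the (nominal) training data
        loss.backward()
        opt.step()

def ensemble_knn_scores(Z_train, Z_test, n_detectors=5, subset_frac=0.5, k=5, seed=0):
    """Average k-NN distance to several random subsets of the nominal codes;
    a larger average distance means the point is more anomalous."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(Z_test))
    for _ in range(n_detectors):
        idx = rng.choice(len(Z_train), size=int(subset_frac * len(Z_train)),
                         replace=False)
        knn = NearestNeighbors(n_neighbors=k).fit(Z_train[idx])
        dists, _ = knn.kneighbors(Z_test)
        scores += dists.mean(axis=1)
    return scores / n_detectors

# Hypothetical usage on nominal data X_train and unlabeled X_test (numpy arrays):
#   dae = DAE(X_train.shape[1])
#   train_dae(dae, torch.tensor(X_train, dtype=torch.float32))
#   with torch.no_grad():
#       Z_tr = dae.encoder(torch.tensor(X_train, dtype=torch.float32)).numpy()
#       Z_te = dae.encoder(torch.tensor(X_test, dtype=torch.float32)).numpy()
#   anomaly_scores = ensemble_knn_scores(Z_tr, Z_te)
```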


2021 ◽  
Author(s):  
Syed Usama Khalid Bukhari ◽  
Anum Qureshi ◽  
Adeel Anjum ◽  
Munam Ali Shah

Privacy preservation of high-dimensional healthcare data is an emerging problem. Privacy breaches are becoming more common and affect thousands of people. Every individual has sensitive, personal information that needs protection and security, and uploading and storing data directly to the cloud without any precautions can lead to serious privacy breaches. Publishing a large amount of sensitive data while minimizing privacy risk is a serious struggle, which forces crucial decisions about the privacy of outsourced high-dimensional healthcare data. Many privacy preservation techniques have been proposed to secure high-dimensional data while retaining both its utility and its privacy, but every technique has its pros and cons. In this paper, a novel privacy preservation model for high-dimensional data, NRPP, is proposed. The model uses a privacy-preserving generative technique for releasing sensitive data that is differentially private. The contribution of this paper is twofold: first, a state-of-the-art anonymization model for high-dimensional healthcare data is proposed using a generative technique; second, the achieved privacy is evaluated using the concept of differential privacy. Experiments show that the proposed model performs better in terms of utility.
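As a point of reference for the differential-privacy evaluation mentioned above, the sketch below shows the basic Laplace mechanism applied to histogram counts, which is the standard ε-differential-privacy building block; it is a generic illustration, not the NRPP model itself.

```python
import numpy as np

def dp_histogram(values, bins, epsilon, rng=None):
    """Release a histogram of `values` with Laplace noise of scale 1/epsilon.
    A single record changes one count by at most 1, so the sensitivity is 1."""
    rng = rng or np.random.default_rng()
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return np.clip(noisy, 0, None), edges    # clip negative counts for readability

ages = np.random.default_rng(0).integers(18, 90, size=1_000)
noisy_counts, _ = dp_histogram(ages, bins=10, epsilon=0.5)
```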


IEEE Access ◽  
2019 ◽  
Vol 7 ◽  
pp. 176429-176437 ◽  
Author(s):  
Wanjie Li ◽  
Xing Zhang ◽  
Xiaohui Li ◽  
Guanghui Cao ◽  
Qingyun Zhang
