Differentially Private Release of Datasets using Gaussian Copula

2020 ◽  
Vol 10 (2) ◽  
Author(s):  
Hassan Jameel Asghar ◽  
Ming Ding ◽  
Thierry Rakotoarivelo ◽  
Sirine Mrabet ◽  
Dali Kaafar

We propose a generic mechanism to efficiently release differentially private synthetic versions of high-dimensional datasets with high utility. The core technique in our mechanism is the use of copulas, which are functions representing dependencies among random variables with a multivariate distribution. Specifically, we use the Gaussian copula to define dependencies of attributes in the input dataset, whose rows are modelled as samples from an unknown multivariate distribution, and then sample synthetic records through this copula. Despite the inherently numerical nature of Gaussian correlations, we construct a method that is applicable to both numerical and categorical attributes alike. Our mechanism is efficient in that it only takes time proportional to the square of the number of attributes in the dataset. We propose a differentially private way of constructing the Gaussian copula without compromising computational efficiency. Through experiments on three real-world datasets, we show that we can obtain highly accurate answers to the set of all one-way marginal queries and all two-way and three-way positive conjunction queries, with 99% of the query answers having absolute (fractional) error rates between 0.01 and 3%. Furthermore, for a majority of two-way and three-way queries, we outperform independent noise addition through the well-known Laplace mechanism. In terms of computational time, we demonstrate that our mechanism can output synthetic datasets in around 6 minutes 47 seconds on average for an input dataset of about 200 binary attributes and more than 32,000 rows, and in about 2 hours 30 minutes for a much larger dataset of about 700 binary attributes and more than 5 million rows. To further demonstrate scalability, we ran the mechanism on larger (artificial) datasets with 1,000 and 2,000 binary attributes (and 5 million rows), obtaining synthetic outputs in approximately 6 and 19 hours, respectively. These are highly feasible times for synthetic datasets, which are one-off releases.
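A minimal, non-private sketch of the Gaussian-copula sampling idea is shown below (Python with NumPy/SciPy). It assumes purely numerical attributes and omits the paper's differentially private perturbation of the correlation matrix and marginals, so it only illustrates the general copula mechanics rather than the proposed mechanism itself.

```python
# Minimal, non-private sketch of Gaussian-copula synthesis (illustrative only;
# the paper's mechanism additionally perturbs the correlation matrix and
# marginals to satisfy differential privacy).
import numpy as np
from scipy import stats

def copula_synthesize(data: np.ndarray, n_samples: int, rng=None) -> np.ndarray:
    """data: (n_rows, n_attrs) numeric array; returns synthetic rows."""
    rng = np.random.default_rng(rng)
    n, d = data.shape

    # 1. Map each attribute to standard-normal scores via its empirical CDF.
    ranks = np.argsort(np.argsort(data, axis=0), axis=0) + 1
    z = stats.norm.ppf(ranks / (n + 1))

    # 2. Estimate the copula correlation matrix (O(d^2) work).
    corr = np.corrcoef(z, rowvar=False)

    # 3. Sample latent Gaussian vectors with that correlation structure.
    latent = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)

    # 4. Push samples back through each attribute's empirical quantile function.
    u = stats.norm.cdf(latent)
    synth = np.empty_like(latent)
    for j in range(d):
        synth[:, j] = np.quantile(data[:, j], u[:, j])
    return synth
```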

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Background: Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions: Triclustering evaluation using G-Tric makes it possible to combine both intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
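As a toy illustration of what planting a tricluster means, the sketch below generates a numeric observations × features × contexts array with a Gaussian background and implants one constant-pattern tricluster, returning the ground truth alongside the data. It is not the G-Tric tool or its API; all parameter names are illustrative.

```python
# Toy sketch of planting a constant tricluster in a numeric 3-way dataset
# (observations x features x contexts); G-Tric itself offers many more
# patterns, structures, and quality controls (missings, noise, errors).
import numpy as np

def plant_tricluster(shape=(100, 50, 10), tric_shape=(10, 5, 3),
                     value=5.0, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, size=shape)          # background distribution

    # Choose the subspace (rows, columns, contexts) that forms the tricluster.
    rows = rng.choice(shape[0], tric_shape[0], replace=False)
    cols = rng.choice(shape[1], tric_shape[1], replace=False)
    ctxs = rng.choice(shape[2], tric_shape[2], replace=False)

    # Plant a constant pattern plus a little Gaussian noise.
    data[np.ix_(rows, cols, ctxs)] = value + rng.normal(0, noise, tric_shape)

    # Return both the data and the ground-truth triclustering solution.
    return data, {"rows": rows, "cols": cols, "ctxs": ctxs}
```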


2021 ◽  
Vol 16 (2) ◽  
pp. 1-31
Author(s):  
Chunkai Zhang ◽  
Zilin Du ◽  
Yuting Yang ◽  
Wensheng Gan ◽  
Philip S. Yu

Utility mining has emerged as an important and interesting topic owing to its wide application and considerable popularity. However, conventional utility mining methods have a bias toward items that have longer on-shelf time, as they have a greater chance to generate a high utility. To eliminate this bias, the problem of on-shelf utility mining (OSUM) was introduced. In this article, we focus on the task of OSUM of sequence data, where the sequential database is divided into several partitions according to time periods and items are associated with utilities and several on-shelf time periods. To address the problem, we propose two methods, OSUM of sequence data (OSUMS) and OSUMS+, to extract on-shelf high-utility sequential patterns. For further efficiency, we also design several strategies to reduce the search space and avoid redundant calculation, with two upper bounds, time prefix extension utility (TPEU) and time reduced sequence utility (TRSU). In addition, two novel data structures are developed to facilitate the calculation of upper bounds and utilities. Substantial experimental results on both real and synthetic datasets show that the two methods outperform the state-of-the-art algorithm. In conclusion, OSUMS may consume a large amount of memory and is unsuitable for cases with limited memory, while OSUMS+ has wider real-life applications owing to its high efficiency.
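The toy sketch below illustrates the on-shelf (relative) utility idea only: the utility a pattern earns is compared against the total utility of the time periods in which all of its items are on shelf. It ignores sequence ordering and is not the OSUMS/OSUMS+ mining algorithm; the data layout and helper names are assumptions.

```python
# Toy illustration of on-shelf (relative) utility, not the OSUMS/OSUMS+ miners.
# q-sequences per time period: {period: [ [(item, utility), ...], ... ]}
db = {
    1: [[("a", 5), ("b", 3)], [("a", 2)]],
    2: [[("b", 4), ("c", 6)]],
    3: [[("a", 1), ("c", 2), ("b", 1)]],
}
on_shelf = {"a": {1, 3}, "b": {1, 2, 3}, "c": {2, 3}}   # periods each item is sold

def on_shelf_utility(pattern, db, on_shelf):
    # Only periods in which every item of the pattern is on shelf count.
    periods = set.intersection(*(on_shelf[i] for i in pattern))
    pat_u, total_u = 0, 0
    for t in periods:
        for seq in db[t]:
            items = dict(seq)
            total_u += sum(items.values())
            if all(i in items for i in pattern):
                pat_u += sum(items[i] for i in pattern)
    return pat_u / total_u if total_u else 0.0

print(on_shelf_utility(("a", "b"), db, on_shelf))   # relative utility of {a, b}
```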


Information ◽  
2018 ◽  
Vol 9 (9) ◽  
pp. 234 ◽  
Author(s):  
Sumet Mehta ◽  
Xiangjun Shen ◽  
Jiangping Gou ◽  
Dejiao Niu

The K-nearest neighbor (KNN) classifier is a very effective and simple non-parametric technique in pattern classification; however, it only considers the distance closeness, and not the geometrical placement, of the k neighbors. Also, its classification performance is highly influenced by the neighborhood size k and by existing outliers. In this paper, we propose a new local mean based k-harmonic nearest centroid neighbor (LMKHNCN) classifier in order to consider both the distance-based proximity and the spatial distribution of the k neighbors. In our method, the k nearest centroid neighbors in each class are first found and used to compute k different local mean vectors, which are then employed to compute their harmonic mean distance to the query sample. Lastly, the query sample is assigned to the class with minimum harmonic mean distance. The experimental results based on twenty-six real-world datasets show that the proposed LMKHNCN classifier achieves lower error rates, particularly in small-sample-size situations, and that it is less sensitive to the parameter k when compared to the four related KNN-based classifiers.
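A compact sketch of the local-mean, harmonic-mean-distance classification idea follows. For brevity it selects plain class-wise nearest neighbors rather than the nearest *centroid* neighbors described in the paper, so it should be read as an approximation of the decision rule, not the LMKHNCN algorithm itself.

```python
# Sketch of local-mean, harmonic-mean-distance classification. Plain class-wise
# nearest neighbours stand in for the paper's nearest centroid neighbours.
import numpy as np

def lmk_harmonic_classify(X, y, query, k=5):
    best_class, best_score = None, np.inf
    for c in np.unique(y):
        Xc = X[y == c]
        # k nearest neighbours of the query within class c
        order = np.argsort(np.linalg.norm(Xc - query, axis=1))[:k]
        neighbours = Xc[order]
        # k local mean vectors: mean of the 1st, first 2, ..., first k neighbours
        local_means = np.cumsum(neighbours, axis=0) / \
            np.arange(1, len(neighbours) + 1)[:, None]
        dists = np.linalg.norm(local_means - query, axis=1)
        # harmonic mean of the distances from the query to the local means
        harmonic = len(dists) / np.sum(1.0 / np.maximum(dists, 1e-12))
        if harmonic < best_score:
            best_class, best_score = c, harmonic
    return best_class
```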


2021 ◽  
pp. 23-41
Author(s):  
Subhagata Chattopadhyay

The study proposes a novel approach to automate the classification of chest X-ray (CXR) images of COVID-19 positive patients. All acquired images have been pre-processed with a Simple Median Filter (SMF) and a Gaussian Filter (GF) with kernel size (5, 5). The better filter is then identified by comparing the Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR) of the denoised images. Canny edge detection has been applied to find the Region of Interest (ROI) in the denoised images. Eigenvalues in [-2, 2] of the 5 × 5 Hessian matrix of the ROIs are then extracted, which constitute the 'input' dataset to the Feed Forward Neural Network (FFNN) classifier developed in this study. Eighty percent of the data is used for training the said network after 10-fold cross-validation, and the performance of the network is tested with the remaining 20% of the data. Finally, validation has been made on another set of 'raw' normal and abnormal CXRs. The precision, recall, accuracy, and computational time complexity (Big-O) of the classifier are then estimated to examine its performance.
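A sketch of the denoising-and-comparison step is shown below using OpenCV; the (5, 5) kernels mirror the abstract, while the function names and the decision to compare filters by PSNR against the input image are assumptions. The later stages (Canny ROI extraction, Hessian eigenvalues, FFNN) are omitted.

```python
# Sketch of the pre-processing stage: denoise a CXR with a median and a
# Gaussian filter (5x5 kernels) and keep the filter with the higher PSNR
# (equivalently, lower MSE). Later pipeline stages are omitted here.
import cv2
import numpy as np

def mse(a, b):
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def psnr(a, b):
    m = mse(a, b)
    return float("inf") if m == 0 else 10 * np.log10(255.0 ** 2 / m)

def best_denoised(cxr_gray):
    median = cv2.medianBlur(cxr_gray, 5)             # 5x5 median filter
    gauss = cv2.GaussianBlur(cxr_gray, (5, 5), 0)    # 5x5 Gaussian filter
    return median if psnr(cxr_gray, median) >= psnr(cxr_gray, gauss) else gauss

# Example next step: edges = cv2.Canny(best_denoised(img), 100, 200) for the ROI.
```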


Symmetry ◽  
2020 ◽  
Vol 12 (8) ◽  
pp. 1211
Author(s):  
Mengjiao Zhang ◽  
Tiantian Xu ◽  
Zhao Li ◽  
Xiqing Han ◽  
Xiangjun Dong

As an important technology in computer science, data mining aims to mine hidden, previously unknown, and potentially valuable patterns from databases. High utility negative sequential rule (HUNSR) mining can provide more comprehensive decision-making information than high utility sequential rule (HUSR) mining by taking non-occurring events into account. HUNSR mining is much more difficult than HUSR mining because of two key intrinsic complexities: how to define the HUNSR mining problem, and how to calculate the antecedent's local utility value in a HUNSR, a key issue in calculating the utility-confidence of the HUNSR. To address these intrinsic complexities, we propose a comprehensive algorithm called e-HUNSR, with the following contributions. (1) We formalize the problem of HUNSR mining by proposing a series of concepts. (2) We propose a novel data structure to store the related information of HUNSR candidates (HUNSRCs) and a method to efficiently calculate the local utility value and utility of a HUNSRC's antecedent. (3) We propose an efficient method to generate HUNSRCs based on high utility negative sequential patterns (HUNSPs) and a pruning strategy to prune meaningless HUNSRCs. To the best of our knowledge, e-HUNSR is the first algorithm to efficiently mine HUNSRs. The experimental results on two real-life and 12 synthetic datasets show that e-HUNSR is very efficient.
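The toy sketch below illustrates only the ratio-style utility-confidence idea (rule utility divided by the antecedent's local utility, accumulated over supporting sequences). It ignores item ordering and negative (non-occurring) items, so it is a conceptual sketch under those simplifying assumptions, not the e-HUNSR definition or algorithm.

```python
# Conceptual sketch: utility-confidence(X => Y) ~ u(X ∪ Y) / local-utility(X),
# with ordering and negative items ignored. Not the e-HUNSR measure itself.

# each sequence is a dict: item -> utility earned in that sequence
sequences = [
    {"a": 4, "b": 2, "c": 5},
    {"a": 3, "b": 1},
    {"b": 2, "c": 7},
]

def utility(itemset, seq):
    return sum(seq[i] for i in itemset) if all(i in seq for i in itemset) else 0

def utility_confidence(antecedent, consequent, sequences):
    rule = set(antecedent) | set(consequent)
    rule_u = sum(utility(rule, s) for s in sequences)
    local_u = sum(utility(antecedent, s) for s in sequences)  # antecedent's local utility
    return rule_u / local_u if local_u else 0.0

print(utility_confidence({"a"}, {"b"}, sequences))
```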


2020 ◽  
Vol 34 (04) ◽  
pp. 6837-6844
Author(s):  
Xiaojin Zhang ◽  
Honglei Zhuang ◽  
Shengyu Zhang ◽  
Yuan Zhou

We study a variant of the thresholding bandit problem (TBP) in the context of outlier detection, where the objective is to identify the outliers whose rewards are above a threshold. Distinct from the traditional TBP, the threshold is defined as a function of the rewards of all the arms, which is motivated by the criterion for identifying outliers. The learner needs to explore the rewards of the arms as well as the threshold. We refer to this problem as "double exploration for outlier detection". We construct an adaptively updated confidence interval for the threshold, based on the estimated value of the threshold in the previous rounds. Furthermore, by automatically trading off exploring the individual arms and exploring the outlier threshold, we provide an efficient algorithm in terms of the sample complexity. Experimental results on both synthetic datasets and real-world datasets demonstrate the efficiency of our algorithm.
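An illustrative elimination-style loop is sketched below: arm means and an outlier threshold defined as a function of those means (here mean plus a multiple of their standard deviation) are estimated together, and sampling stops once every arm's confidence interval is clear of the threshold. The sampling policy and stopping rule are simplifications, not the authors' algorithm.

```python
# Illustrative "double exploration" loop: estimate arm means and, jointly, an
# outlier threshold that depends on all of them. Simplified policy and rule.
import numpy as np

def double_exploration(pull, n_arms, horizon=20000, k_sigma=1.5):
    means = np.zeros(n_arms)
    counts = np.zeros(n_arms, dtype=int)
    threshold = np.inf
    for t in range(1, horizon + 1):
        arm = int(np.argmin(counts))              # keep pull counts balanced
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        # Outlier threshold defined as a function of all estimated arm means.
        threshold = means.mean() + k_sigma * means.std()
        radius = np.sqrt(np.log(max(t, 2)) / np.maximum(counts, 1))
        # Stop once every arm's confidence interval is clear of the threshold.
        if t > n_arms and np.all((means - radius > threshold) |
                                 (means + radius < threshold)):
            break
    return means > threshold, means, threshold

# Example: Bernoulli arms where arm 3 is the outlier.
p = np.array([0.10, 0.12, 0.11, 0.90, 0.13])
rng = np.random.default_rng(0)
outliers, est, thr = double_exploration(lambda a: rng.binomial(1, p[a]), len(p))
print(outliers)
```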


2019 ◽  
Vol 35 (24) ◽  
pp. 5146-5154 ◽  
Author(s):  
Joanna Zyla ◽  
Michal Marczyk ◽  
Teresa Domaszewska ◽  
Stefan H E Kaufmann ◽  
Joanna Polanska ◽  
...  

Motivation: Analysis of gene set (GS) enrichment is an essential part of functional omics studies. Here, we complement the established evaluation metrics of GS enrichment algorithms with a novel approach to assess the practical reproducibility of scientific results obtained from GS enrichment tests when applied to related data from different studies. Results: We evaluated eight established and one novel algorithm for reproducibility, sensitivity, prioritization, false positive rate and computational time. In addition to the eight established algorithms, we included Coincident Extreme Ranks in Numerical Observations (CERNO), a flexible and fast algorithm based on modified Fisher P-value integration. Using real-world datasets, we demonstrate that CERNO is robust to ranking metrics, as well as to sample and GS size. CERNO had the highest reproducibility while remaining sensitive, specific and fast. In the overall ranking, Pathway Analysis with Down-weighting of Overlapping Genes, CERNO and over-representation analysis performed best, while CERNO and GeneSetTest scored high in terms of reproducibility. Availability and implementation: The tmod package implementing the CERNO algorithm is available from CRAN (cran.r-project.org/web/packages/tmod/index.html) and an online implementation can be found at http://tmod.online/. The datasets analyzed in this study are widely available in the KEGGdzPathwaysGEO and KEGGandMetacoreDzPathwaysGEO R packages and the GEO repository. Supplementary information: Supplementary data are available at Bioinformatics online.
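To make the "modified Fisher P-value integration" idea concrete, the sketch below computes a rank-based Fisher-style statistic in the spirit of CERNO: gene-set members' ranks in the ordered gene list are transformed, summed, and compared with a chi-squared distribution. Consult the tmod package for the reference implementation; this is only an interpretation.

```python
# Rank-based Fisher-style test in the spirit of CERNO: sum -2*ln(R_i/N) over
# gene-set members and compare with chi-squared(2*|GS|). See tmod for the
# reference implementation.
import numpy as np
from scipy import stats

def cerno_like_test(ranks_in_gs, n_genes):
    """ranks_in_gs: 1-based ranks of gene-set members in the ordered gene list."""
    ranks = np.asarray(ranks_in_gs, dtype=float)
    statistic = -2.0 * np.sum(np.log(ranks / n_genes))
    p_value = stats.chi2.sf(statistic, df=2 * len(ranks))
    return statistic, p_value

# Example: a 5-gene set concentrated near the top of a 10,000-gene ranking.
print(cerno_like_test([3, 10, 25, 60, 120], n_genes=10_000))
```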


2017 ◽  
Author(s):  
Wei Zhou ◽  
Jonas B. Nielsen ◽  
Lars G. Fritsche ◽  
Rounak Dey ◽  
Maiken E. Gabrielsen ◽  
...  

In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly in the analysis of phenotypes with unbalanced case-control ratios, producing large type I error rates. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation (SPA) to calibrate the distribution of score test statistics. This method, SAIGE, provides accurate p-values even when case-control ratios are extremely unbalanced. It utilizes state-of-the-art optimization strategies to reduce the computational time and memory cost of the generalized mixed model. The computational cost depends linearly on the sample size, and hence the method is applicable to GWAS for thousands of phenotypes in large biobanks. Through the analysis of UK Biobank data of 408,961 white British samples of European ancestry for more than 1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
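The sketch below shows a generic saddlepoint approximation (Barndorff-Nielsen tail formula) for a score statistic with binary outcomes, which is the kind of calibration SAIGE applies; the real method works on residuals from a fitted generalized mixed model and handles tails and edge cases more carefully, so treat this as an illustrative sketch only.

```python
# Generic saddlepoint approximation for the tail of S = sum_i g_i * (y_i - mu_i)
# with y_i ~ Bernoulli(mu_i). Illustrates the calibration idea behind SAIGE.
import numpy as np
from scipy import optimize, stats

def spa_pvalue(s, g, mu):
    g, mu = np.asarray(g, float), np.asarray(mu, float)

    def K(t):   # cumulant generating function of S
        return np.sum(np.log(1 - mu + mu * np.exp(g * t)) - g * mu * t)

    def K1(t):  # first derivative
        e = mu * np.exp(g * t)
        return np.sum(g * e / (1 - mu + e) - g * mu)

    def K2(t):  # second derivative
        e = mu * np.exp(g * t)
        return np.sum(g ** 2 * e * (1 - mu) / (1 - mu + e) ** 2)

    # Solve the saddlepoint equation K'(zeta) = s by Newton's method.
    zeta = optimize.root_scalar(lambda t: K1(t) - s, x0=0.0, fprime=K2,
                                method="newton").root
    if abs(zeta) < 1e-8:                       # near the mean: normal approximation
        return 2 * stats.norm.sf(abs(s) / np.sqrt(K2(0.0)))
    w = np.sign(zeta) * np.sqrt(2 * (zeta * s - K(zeta)))
    v = zeta * np.sqrt(K2(zeta))
    return 2 * stats.norm.sf(abs(w + np.log(v / w) / w))   # two-sided p-value

# Toy example with an unbalanced trait (one case, nine controls).
g = np.array([0., 1., 2., 0., 1., 0., 2., 1., 0., 1.])
mu = np.full(10, 0.1)
y = np.array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])
print(spa_pvalue(np.sum(g * (y - mu)), g, mu))
```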


Biometrics is the automated identification of an individual on the basis of biological and behavioural features. A uni-modal biometric system relies on a single type of biometric data for each individual, and such data are exposed to issues such as distortion and spoofing threats. Some of these issues are overcome by multimodal biometric schemes, in which several biometric signatures are combined for better data security. Multimodal biometrics is used in a variety of application areas, including human-computer interfaces and sensor-based identification, where physical and behavioural characteristics are used to identify an individual. Typical applications are security systems in banking, business, and industry. In existing work, an ESVM method is used to recognize biometric traits; its problems are distortion, degraded image quality, reduced recognition rates, and high error rates. In the proposed research, fingerprint, face, and iris features are obtained from the CASIA dataset. Distortion, in the form of salt-and-pepper noise, is then detected and the interference removed using a filtering technique. After that, the discrete wavelet transform is used to extract graphical features from the face, fingerprint, and iris images. A feed-forward neural network algorithm is then developed for the classification and recognition of multimodal biometric characteristics. The Encrypted NN method is evaluated on metrics such as recognition rate, true positive rate, and computation time. The experimental results, obtained in MATLAB 2016a, demonstrate that the Encrypted NN method enhances image quality, recognition rate, and TPR, and reduces the computational time of the multimodal biometric system when compared with existing work.
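An illustrative fragment of such a pipeline is sketched below in Python rather than MATLAB: salt-and-pepper noise suppression with a median filter, wavelet (DWT) feature extraction with PyWavelets, and a small feed-forward network via scikit-learn. The feature statistics, layer sizes, and library choices are assumptions for the sketch, not the paper's implementation.

```python
# Illustrative fragment: median-filter denoising, DWT feature extraction, and a
# small feed-forward classifier. Library choices and sizes are assumptions.
import cv2
import numpy as np
import pywt
from sklearn.neural_network import MLPClassifier

def dwt_features(gray_img):
    denoised = cv2.medianBlur(gray_img, 3)           # suppress salt-and-pepper noise
    cA, (cH, cV, cD) = pywt.dwt2(denoised.astype(float), "haar")
    # Simple statistics of each sub-band as the feature vector.
    return np.array([b.mean() for b in (cA, cH, cV, cD)] +
                    [b.std() for b in (cA, cH, cV, cD)])

def train_classifier(images, labels):
    X = np.vstack([dwt_features(img) for img in images])
    clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
    return clf.fit(X, labels)
```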


2021 ◽  
Author(s):  
Edward Rosales

Many approaches have been taken towards the development of a compliant stereo correspondence algorithm that is capable of producing accurate disparity maps within a short period of time. There has been great progress over the past decade due to the vast increase in optimization techniques. Currently, the most successful algorithms contain explicit assumptions about the real world, such as definitive differences in disparity among objects and constant textures within objects. This thesis starts by giving a brief description of disparity, along with descriptions of some common applications. Next, it explores various methods used in common stereo correspondence algorithms, and gives an in-depth description and analysis of top-performing algorithms, which are later used for comparison with the proposed algorithm. In the proposed algorithm, frequency stereo correspondence is used in parallel with traditional color-intensity stereo correspondence to develop an initial disparity map. Frequency stereo correspondence is achieved using a winner-take-all, block-based Discrete Cosine Transform (DCT) to find the largest frequency components, as well as their positions, for use in disparity estimation. The proposed algorithm uses methods that are computationally inexpensive in order to reduce the computational time that plagues many common stereo correspondence algorithms. The proposed algorithm achieves an average correct disparity rate of 95.3%. This results in a disparity error rate of 4.07%, compared with the top-performing algorithms on the Middlebury website [1]: the DoubleBP, CoopRegion, AdaptingBP, and ADCensus algorithms have error rates of 4.19%, 4.41%, 4.23%, and 3.97%, respectively. Additionally, experimental results demonstrate that the proposed algorithm is computationally efficient and significantly reduces the processing time required by many common stereo correspondence algorithms.
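A minimal sketch of block-DCT matching with winner-take-all disparity selection is shown below using OpenCV. For brevity it compares only the largest DCT magnitudes per block, whereas the thesis also exploits their positions and combines the result with color-intensity matching and refinement, so this is an illustration of the general idea rather than the proposed algorithm.

```python
# Minimal sketch: compare 8x8 DCT blocks of the left image against candidate
# blocks along the same scanline in the right image and pick the disparity
# with the lowest cost (winner-take-all).
import cv2
import numpy as np

def dct_block_disparity(left_gray, right_gray, max_disp=64, block=8, k=8):
    h, w = left_gray.shape
    disp = np.zeros((h, w), dtype=np.float32)
    L = left_gray.astype(np.float32)
    R = right_gray.astype(np.float32)
    for y in range(0, h - block, block):
        for x in range(max_disp, w - block, block):
            ref = cv2.dct(L[y:y + block, x:x + block])
            ref_top = np.sort(np.abs(ref).ravel())[-k:]   # k largest magnitudes
            best_d, best_cost = 0, np.inf
            for d in range(max_disp):
                cand = cv2.dct(R[y:y + block, x - d:x - d + block])
                cand_top = np.sort(np.abs(cand).ravel())[-k:]
                cost = np.sum(np.abs(cand_top - ref_top))
                if cost < best_cost:
                    best_d, best_cost = d, cost
            disp[y:y + block, x:x + block] = best_d
    return disp
```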

