Resampling methods for generating continuous multivariate synthetic data for disclosure control

Author(s):  
Atikur R Khan ◽  
Enamul Kabir
2005 ◽  
Vol 225 (5) ◽  
Author(s):  
Sandra Gottschalk

Summary: Nonparametric resampling is a method for generating synthetic microdata and is introduced here as a procedure for microdata disclosure limitation. Theoretically, re-identification of individuals or firms is not possible with synthetic data. The resampling procedure creates datasets (the resamples) whose empirical cumulative distribution functions are nearly the same as those of the original survey data, so econometricians can still obtain meaningful regression results. The idea of nonparametric resampling, in particular, is to draw from univariate or multivariate empirical distribution functions without having to estimate them explicitly. So far, the resampling procedure shown here is applicable only to variables with continuous distribution functions. Monte Carlo simulations and applications with data from the Mannheim Innovation Panel show that the results of linear and nonlinear regression analyses can be reproduced quite precisely from nonparametric resamples. A univariate and a multivariate resampling version are examined. Both the univariate version and the multivariate version, which uses the correlation structure of the original data as a scaling instrument, turn out to retain the coefficients of model estimations. Furthermore, multivariate resampling reproduces regression results best when all variables are anonymised.
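As a rough illustration of the idea, the sketch below draws a univariate resample by inverting a linearly interpolated empirical CDF and then rescales independently resampled columns with the Cholesky factor of the original correlation matrix. This is only a minimal reading of the abstract; the function names and the exact scaling step are assumptions, not Gottschalk's procedure.

```python
import numpy as np

def univariate_resample(x, n=None, rng=None):
    """Draw a synthetic sample by inverting a linearly interpolated
    empirical CDF, i.e. without fitting a parametric distribution."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x) if n is None else n
    xs = np.sort(np.asarray(x, dtype=float))
    p = (np.arange(1, len(xs) + 1) - 0.5) / len(xs)   # empirical CDF levels
    u = rng.uniform(p[0], p[-1], size=n)
    return np.interp(u, p, xs)                        # inverse-CDF by interpolation

def multivariate_resample(X, rng=None):
    """Resample each column independently, then impose the original
    correlation structure via its Cholesky factor (a stand-in for the
    'scaling instrument' mentioned in the abstract)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    Z = np.column_stack([univariate_resample(col, rng=rng) for col in X.T])
    Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    L = np.linalg.cholesky(np.corrcoef(X, rowvar=False))
    return (Zs @ L.T) * X.std(axis=0) + X.mean(axis=0)  # restore scale and location
```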


Author(s):  
Martin Klein ◽  
Thomas Mathew ◽  
Bimal Sinha

In this article, multiplication of original data values by random noise is suggested as a disclosure control strategy when only the top part of the data is sensitive, as is often the case with income data. The proposed method can serve as an alternative to top coding, which is a standard method in this context. Because the log-normal distribution usually fits income data well, the present investigation focuses exclusively on the log-normal case. It is assumed that the log-scale mean of the sensitive variable is described by a linear regression on a set of non-sensitive covariates, and we show how a data user can draw valid inferences on the parameters of the regression. An appealing feature of noise multiplication is the presence of an explicit tuning mechanism, namely, the noise-generating distribution. By appropriately choosing this distribution, one can control the accuracy of inferences and the level of disclosure protection desired in the released data. Usually, more information is retained on the top part of the data under noise multiplication than under top coding. Likelihood-based analysis is developed for the case where only the large values in the data set are noise multiplied, under the assumption that the original data form a sample from a log-normal distribution. In this scenario, data analysis methods are developed for two types of data releases: (I) each released value includes an indicator of whether or not it has been noise multiplied, and (II) no such indicator is provided. A simulation study is carried out to assess the accuracy of inference for some parameters of interest. Since top coding and synthetic data methods are already available as disclosure control strategies for extreme values, some comparisons with the proposed method are made through a simulation study. The results are illustrated with a data analysis example based on 2000 U.S. Current Population Survey data. Furthermore, a disclosure risk evaluation of the proposed methodology is presented in the context of the Current Population Survey data example, and the disclosure risk of the proposed noise multiplication method is compared with that of synthetic data.
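A hedged sketch of the release mechanism described above: values beyond a cutoff are multiplied by independent noise, with or without a flag indicating which values were perturbed. The cutoff, the log-normal noise with unit median, and the parameter names are illustrative choices, not the authors' specification.

```python
import numpy as np

def noise_multiply_top(y, cutoff, noise_sd=0.1, release_indicator=True, seed=None):
    """Multiply only the large (sensitive) values by random noise.

    Illustrative only: values above `cutoff` are multiplied by log-normal
    noise with unit median; the actual noise-generating distribution is a
    tuning choice left to the data releaser.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    top = y > cutoff
    noise = np.exp(rng.normal(0.0, noise_sd, size=top.sum()))
    z = y.copy()
    z[top] = y[top] * noise
    if release_indicator:           # release type (I): flag the perturbed values
        return z, top.astype(int)
    return z                        # release type (II): no flag provided
```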


2013 ◽  
Vol 32 (24) ◽  
pp. 4139-4161 ◽  
Author(s):  
Bronwyn Loong ◽  
Alan M. Zaslavsky ◽  
Yulei He ◽  
David P. Harrington

Author(s):  
P.L. Nikolaev

This article deals with a method for binary classification of images containing small text. Classification is based on the fact that the text can have one of two orientations: it can be positioned horizontally and read from left to right, or it can be turned 180 degrees, in which case the image must be rotated before the text can be read. This type of text is found on the covers of a variety of books, so when recognizing covers it is necessary to first determine the orientation of the text before recognizing the text itself. The article describes the development of a deep neural network for determining text orientation in the context of book cover recognition. The results of training and testing a convolutional neural network on synthetic data, as well as examples of the network operating on real data, are presented.
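The abstract does not specify the network, so the following is only a minimal sketch of a convolutional classifier for the two orientations (upright vs. rotated 180 degrees); layer sizes are assumptions. As the last comment indicates, synthetic training pairs can be produced simply by rotating upright covers.

```python
import torch
import torch.nn as nn

class OrientationNet(nn.Module):
    """Tiny CNN labelling a cover image as upright (class 0) or rotated
    180 degrees (class 1). Purely illustrative layer sizes."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, 2)

    def forward(self, x):                      # x: (batch, 3, H, W) RGB covers
        return self.classifier(self.features(x).flatten(1))

# Synthetic training pairs can be made by rotating upright covers:
# rotated = torch.flip(img, dims=[-2, -1])     # 180-degree rotation
```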


1989 ◽  
Vol 21 (6-7) ◽  
pp. 593-602 ◽  
Author(s):  
Andrew T. Watkin ◽  
W. Wesley Eckenfelder

A technique for rapidly determining Monod and inhibition kinetic parameters in activated sludge is evaluated. The method studied is known as the fed-batch reactor technique and requires approximately three hours to complete. The technique allows a gradual build-up of substrate in the test reactor by introducing the substrate at a feed rate greater than the maximum substrate utilization rate. Both inhibitory and non-inhibitory substrate responses are modeled using a nonlinear numerical curve-fitting technique. The responses of both glucose and 2,4-dichlorophenol (DCP) are studied using activated sludges with various acclimation histories. Statistically different inhibition constants, KI, for DCP inhibition of glucose utilization were found for the various sludges studied. The curve-fitting algorithm was verified in its ability to accurately retrieve the two kinetic parameters from synthetic data generated by superimposing normally distributed random error onto the two-parameter numerical solution produced by the algorithm.
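A possible sketch of the verification step described at the end of the abstract: a two-parameter Monod rate curve is evaluated at assumed parameter values, normally distributed error is superimposed, and the parameters are recovered by nonlinear least squares. Parameter values, units, and the use of scipy's curve_fit are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import curve_fit

def monod(S, qmax, Ks):
    """Two-parameter Monod specific substrate utilisation rate; the
    inhibitory case would add a third parameter (e.g. a Haldane Ki term)."""
    return qmax * S / (Ks + S)

rng = np.random.default_rng(0)
S = np.linspace(1, 200, 40)                    # substrate concentration (illustrative units)
true_qmax, true_Ks = 0.9, 15.0                 # assumed "true" parameters
q_obs = monod(S, true_qmax, true_Ks) + rng.normal(0, 0.02, S.size)  # superimpose Gaussian error

popt, pcov = curve_fit(monod, S, q_obs, p0=[1.0, 10.0])
print("recovered qmax, Ks:", popt)             # should be close to 0.9 and 15.0
```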


2020 ◽  
Vol 38 (2) ◽  
Author(s):  
Razec Cezar Sampaio Pinto da Silva Torres ◽  
Leandro Di Bartolo

Reverse time migration in tilted transversely isotropic media. ABSTRACT: Reverse time migration (RTM) is one of the most powerful methods used to generate images of the subsurface. RTM was proposed in the early 1980s, but only recently has it been routinely used in exploratory projects involving complex geology, such as the Brazilian pre-salt. Because the method uses the two-way wave equation, RTM is able to correctly image any kind of geological environment (simple or complex), including those with anisotropy. On the other hand, RTM is computationally expensive and requires the use of computer clusters. This paper investigates the influence of anisotropy on seismic imaging through the application of RTM for tilted transversely isotropic (TTI) media to pre-stack synthetic data. The work presents in detail how to implement RTM for TTI media, addressing the main issues and specific details, e.g., the computational resources required. Results for a couple of simple models are presented, including an application to the BP TTI 2007 benchmark model. Keywords: finite differences, wave numerical modeling, seismic anisotropy.
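Independent of the TTI specifics, the core of any RTM implementation is the zero-lag cross-correlation imaging condition between the forward-propagated source wavefield and the back-propagated receiver wavefield. The sketch below assumes both sets of snapshots have already been computed by a two-way finite-difference propagator (not shown) and only illustrates that imaging step.

```python
import numpy as np

def rtm_image(src_wavefields, rec_wavefields):
    """Zero-lag cross-correlation imaging condition.

    src_wavefields : array (nt, nz, nx) of forward-propagated source snapshots
    rec_wavefields : array (nt, nz, nx) of receiver data back-propagated in time
    Both are assumed to come from a two-way (an)isotropic finite-difference
    propagator, which is not shown here.
    """
    image = np.zeros(src_wavefields.shape[1:])
    for s, r in zip(src_wavefields, rec_wavefields):
        image += s * r            # correlate the two wavefields at every time step
    return image
```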


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract. Background: Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with the state of the art is paramount. These comparisons are usually performed on real data without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility of planting triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social domains, with the additional advantage of providing the ground truth (the triclustering solution) as output. Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlap). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions: Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics to compare solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the evaluation of new triclustering approaches.
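To make the idea of planting triclusters concrete, here is a minimal sketch that generates one numeric three-way dataset with a single planted tricluster and returns its ground truth. This is not G-Tric's API; the function name, pattern type, and parameters are assumptions, and G-Tric itself additionally supports symbolic data, multiple overlapping triclusters, noise, and missing values.

```python
import numpy as np

def plant_tricluster(shape=(50, 30, 10), tric_shape=(8, 5, 3),
                     background="uniform", seed=None):
    """Generate a numeric 3-way dataset (observations x features x contexts)
    with one constant-like tricluster planted in a random subspace."""
    rng = np.random.default_rng(seed)
    data = (rng.uniform(0, 1, size=shape) if background == "uniform"
            else rng.normal(0, 1, size=shape))
    # choose the subspace (rows, columns, contexts) that will hold the pattern
    rows = rng.choice(shape[0], tric_shape[0], replace=False)
    cols = rng.choice(shape[1], tric_shape[1], replace=False)
    ctxs = rng.choice(shape[2], tric_shape[2], replace=False)
    data[np.ix_(rows, cols, ctxs)] = rng.uniform(0.9, 1.0)  # plant the pattern
    ground_truth = {"rows": rows, "cols": cols, "contexts": ctxs}
    return data, ground_truth
```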


2021 ◽  
Vol 40 (3) ◽  
pp. 1-12
Author(s):  
Hao Zhang ◽  
Yuxiao Zhou ◽  
Yifei Tian ◽  
Jun-Hai Yong ◽  
Feng Xu

Reconstructing hand-object interactions is a challenging task due to strong occlusions and complex motions. This article proposes a real-time system that uses a single depth stream to simultaneously reconstruct hand poses, object shape, and rigid/non-rigid motions. To achieve this, we first train a joint learning network to segment the hand and object in a depth image and to predict the 3D keypoints of the hand. With most layers shared by the two tasks, computation cost is saved, which supports real-time performance. A hybrid dataset is constructed to train the network with real data (to learn real-world distributions) and synthetic data (to cover variations of objects, motions, and viewpoints). Next, the depths of the two targets and the keypoints are used in a unified optimization to reconstruct the interacting motions. Benefiting from a novel tangential contact constraint, the system not only resolves the remaining ambiguities but also maintains real-time performance. Experiments show that our system handles different hand and object shapes, various interactive motions, and moving cameras.
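A hedged sketch of the shared-backbone, two-head design implied by the abstract: one encoder over the depth image feeds a segmentation head and a 3D keypoint head, so most computation is shared between the tasks. Layer sizes and the 21-keypoint count are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class HandObjectNet(nn.Module):
    """Shared-backbone multi-task sketch: a single encoder over the depth
    image feeds a hand/object/background segmentation head and a 3D hand
    keypoint head. Sizes and the keypoint count are illustrative."""
    def __init__(self, n_keypoints=21):
        super().__init__()
        self.n_keypoints = n_keypoints
        self.backbone = nn.Sequential(                  # layers shared by both tasks
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(64, 3, 1)             # hand / object / background logits
        self.kp_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_keypoints * 3),             # (x, y, z) per keypoint
        )

    def forward(self, depth):                           # depth: (batch, 1, H, W)
        f = self.backbone(depth)
        seg = self.seg_head(f)                          # low-resolution segmentation
        kps = self.kp_head(f).view(-1, self.n_keypoints, 3)
        return seg, kps
```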

