Aggregation Methods to Evaluate Multiple Protected Versions of the Same Confidential Data Set

Author(s):  
Aïda Valls
Vicenç Torra
Josep Domingo-Ferrer


2004 · Vol 36 (3) · pp. 627-638
Author(s):
Lynn Hunnicutt
Dee Von Bailey
Michelle Crook

Concentration in beef packing has risen dramatically in the past 25 years. We develop two measures to describe feedlot-packer relations: (1) a statistic based on the proportion of its sales a feedlot makes to a given packer, and (2) a measure of the switching behavior of feedlots. The measures are calculated using a confidential data set from the USDA Grain Inspection, Packers, and Stockyards Administration. Relationships are found to be both exclusive and stable. Causes for this rigidity are then examined using regression analysis. Transaction costs are shown to help explain why this market differs from a perfectly competitive one.
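As a rough illustration of the two relationship measures, the sketch below computes an exclusivity share and a switching rate from hypothetical transaction records; the column names, the toy data, and the exact functional forms are assumptions, since the paper defines its statistics formally on the confidential GIPSA data.

```python
import pandas as pd

# Hypothetical transaction records: one row per feedlot-to-packer sale.
sales = pd.DataFrame({
    "feedlot": ["F1", "F1", "F1", "F2", "F2", "F2"],
    "packer":  ["P1", "P1", "P2", "P1", "P2", "P1"],
    "head":    [500, 300, 200, 100, 400, 150],
    "month":   [1, 2, 3, 1, 2, 3],
})

# (1) Exclusivity: share of each feedlot's sales going to its largest packer.
shares = sales.groupby(["feedlot", "packer"])["head"].sum()
top_share = shares.groupby("feedlot").apply(lambda s: s.max() / s.sum())

# (2) Switching: fraction of month-to-month transitions in which a
# feedlot's primary packer changes.
primary = (sales.groupby(["feedlot", "month", "packer"])["head"].sum()
                .groupby(["feedlot", "month"]).idxmax()
                .map(lambda t: t[-1]))  # keep only the packer label
switch_rate = primary.groupby("feedlot").apply(
    lambda p: (p != p.shift()).iloc[1:].mean())

print(top_share)    # near 1.0 => exclusive relationship
print(switch_rate)  # near 0.0 => stable relationship
```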


2021 · Vol ahead-of-print (ahead-of-print)
Author(s):
Zachary Hornberger
Bruce Cox
Raymond R. Hill

Purpose: Large/stochastic spatiotemporal demand data sets can prove intractable for location optimization problems, motivating the need for aggregation. However, demand aggregation induces errors. Significant theoretical research has been performed on the modifiable areal unit problem and the zone definition problem, but minimal research has addressed the issues specific to spatiotemporal demand data, such as search and rescue (SAR) data. This study provides a quantitative comparison of various aggregation methodologies and their relation to distance- and volume-based aggregation errors.

Design/methodology/approach: This paper introduces and applies a framework for comparing both deterministic and stochastic aggregation methods using distance- and volume-based aggregation error metrics. It additionally applies weighted versions of these metrics to account for the reality that demand events are nonhomogeneous. These metrics are applied to a large, highly variable, spatiotemporal demand data set of SAR events in the Pacific Ocean. Comparisons using these metrics are conducted between six quadrat aggregations of varying scales and two zonal distribution models using hierarchical clustering.

Findings: As quadrat fidelity increases, the distance-based aggregation error decreases, while the two deliberate zonal approaches reduce this error further while using fewer zones. However, the higher-fidelity aggregations detrimentally affect volume error. Additionally, by splitting the SAR data set into training and test sets, this paper shows the stochastic zonal distribution aggregation method is effective at simulating actual future demands.

Originality/value: This study indicates that no single best aggregation method exists; by quantifying the trade-offs in aggregation-induced errors, practitioners can use the method that minimizes the errors most relevant to their study. The study also quantifies the ability of a stochastic zonal distribution method to effectively simulate future demand data.
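A minimal sketch of the weighted, distance-based error metric under quadrat aggregation, assuming a simple definition (weighted mean distance from each event to its cell's weighted centroid); the study region, the toy demand data, and the grid scales are illustrative assumptions, not the paper's metrics.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical SAR event locations (x, y) with per-event weights,
# standing in for nonhomogeneous demand.
events = rng.uniform(0.0, 10.0, size=(1000, 2))
weights = rng.exponential(1.0, size=1000)

def quadrat_distance_error(points, w, n_cells):
    """Weighted mean distance from each event to its cell's weighted centroid."""
    edges = np.linspace(0.0, 10.0, n_cells + 1)
    ix = np.clip(np.digitize(points[:, 0], edges) - 1, 0, n_cells - 1)
    iy = np.clip(np.digitize(points[:, 1], edges) - 1, 0, n_cells - 1)
    err = 0.0
    for cx in range(n_cells):
        for cy in range(n_cells):
            mask = (ix == cx) & (iy == cy)
            if not mask.any():
                continue
            centroid = np.average(points[mask], axis=0, weights=w[mask])
            d = np.linalg.norm(points[mask] - centroid, axis=1)
            err += np.sum(w[mask] * d)
    return err / w.sum()

# Higher quadrat fidelity (more, smaller cells) lowers the distance error,
# mirroring the paper's finding; volume error would be tracked separately.
for n in (2, 4, 8, 16):
    print(n, quadrat_distance_error(events, weights, n))
```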


2013 · Vol 5 (1)
Author(s):
Joo Ho Lee
In Yong Kim
Christine M. O'Keefe

This paper concerns the use of synthetic data for protecting the confidentiality of business data during statistical analysis. Synthetic data sets are traditionally constructed by replacing sensitive values in a confidential data set with draws from statistical models estimated on the confidential data set. Unfortunately, the process of generating effective statistical models can be a difficult and labour-intensive task. Recently, it has been proposed to use easily-implemented methods from machine learning instead of statistical model estimation in the data synthesis task. J. Drechsler and J.P. Reiter (2011) conducted an evaluation of four such methods and found that regression trees could give rise to synthetic data sets which provide reliable analysis results as well as low disclosure risks. Their conclusion was based on simulations using a subset of the 2002 Uganda census public use file. It is an interesting question whether the same conclusion applies to other types of data with different characteristics, for example business data, which differ markedly from population census and survey data. In particular, business data generally have few variables that are mostly categorical, and often have highly skewed distributions with outliers. In this paper we investigate the applicability of regression-tree-based methods for constructing synthetic business data. We give a detailed example comparing exploratory data analysis and linear regression results under two variants of a regression-tree-based synthetic data approach. We also include an evaluation of the analysis results with respect to the results of analysis of the original data. We further investigate the impact of different stopping criteria on performance. While it is certainly true that any method designed to protect confidentiality introduces error, and may indeed give misleading conclusions, our analysis of the results for synthesisers based on CART models has provided some evidence that this error is not random but is due to the particular characteristics of business data. We conclude that more careful analysis needs to be done when applying these methods, and end users certainly need to be aware of possible discrepancies.
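The sketch below illustrates the general CART-synthesis idea the paper evaluates: fit a regression tree for a sensitive variable, then replace each record's value with a draw from the observed values in its leaf. The variables, the skew model, and the min_samples_leaf stopping criterion are illustrative assumptions, not the paper's actual specification.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Hypothetical skewed business data: turnover predicted from staff count,
# with the heavy right tail typical of business variables.
n = 500
df = pd.DataFrame({"staff": rng.integers(1, 200, n)})
df["turnover"] = df["staff"] * rng.lognormal(3.0, 1.0, n)

# Fit a CART model for the sensitive variable; min_samples_leaf plays the
# role of the stopping criterion whose impact the paper investigates.
tree = DecisionTreeRegressor(min_samples_leaf=25, random_state=0)
tree.fit(df[["staff"]], df["turnover"])

# Synthesis: each record's sensitive value is replaced by a draw from the
# observed values in the leaf that record falls into.
leaves = tree.apply(df[["staff"]])
synthetic = df.copy()
for leaf in np.unique(leaves):
    idx = np.where(leaves == leaf)[0]
    synthetic.loc[synthetic.index[idx], "turnover"] = rng.choice(
        df["turnover"].to_numpy()[idx], size=len(idx), replace=True)

# Compare a simple analysis on the original versus the synthetic data.
print(df["turnover"].describe())
print(synthetic["turnover"].describe())
```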


2014 · Vol 66 (1) · pp. 31-35
Author(s):
Berthold Koletzko
Mary Fewtrell
Robert Gibson
Johannes B. van Goudoever
Olle Hernell
...

This paper presents an updated and revised summary of the ‘core data set’ that has been proposed to be recorded and reported in all clinical trials on infant nutrition by the recently formed Consensus Group on Outcome Measures Made in Paediatric Enteral Nutrition Clinical Trials (COMMENT). This core data set was developed based on a previous proposal by the European Society for Paediatric Gastroenterology, Hepatology and Nutrition (ESPGHAN) Committee on Nutrition in 2003. It comprises confidential data to identify subjects and facilitate contact for further follow-up, data to characterize the cohort studied, data on withdrawals from the study, and some additional core data for all nutrition studies on preterm infants. We recommend that all studies on nutrition in infancy collect and report this core data set to facilitate interpretation and comparison of results from clinical studies, as well as systematic data evaluation and meta-analyses. Editors of journals publishing such reports are encouraged to require the reporting of the minimum data set described here, either in the main body of the publication or as supplementary online material. © 2014 S. Karger AG, Basel
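As a purely illustrative sketch of how such a core data set might be structured in code, the record below groups fields into the categories the summary names (confidential identifiers, cohort characterisation, withdrawals); the specific field names are assumptions, not the COMMENT consensus items.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CoreTrialRecord:
    # Confidential data: identify the subject and enable follow-up contact.
    subject_id: str
    contact_reference: str
    # Data characterising the cohort studied (illustrative fields only).
    gestational_age_weeks: float
    birth_weight_g: int
    sex: str
    feeding_group: str
    # Data on withdrawals from the study.
    withdrawn: bool = False
    withdrawal_reason: Optional[str] = None

# Example record, as a trial database might store it.
record = CoreTrialRecord("S001", "site-03/contact-17", 39.5, 3400,
                         "female", "intervention formula")
print(record)
```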


1994 · Vol 144 · pp. 139-141
Author(s):
J. Rybák
V. Rušin
M. Rybanský

Fe XIV 530.3 nm coronal emission line observations have been used to estimate the rotation of the green solar corona. A homogeneous data set, created from measurements of the world-wide coronagraphic network, has been examined with the help of correlation analysis to reveal the averaged synodic rotation period as a function of latitude and time over the epoch from 1947 to 1991.

The values of the synodic rotation period obtained for this epoch are 27.52±0.12 days for the whole range of latitudes and 26.95±0.21 days for the latitude band ±30°. A differential rotation of the green solar corona, with local period maxima around ±60° and a minimum of the rotation period at the equator, was confirmed. No clear cyclic variation of the rotation was found over the examined epoch, but monotonic trends are present in some time intervals.

A detailed investigation of the original data and their correlation functions has shown that the existence of sufficiently reliable tracers is not evident for the whole set of examined data. This should be taken into account in future, more precise estimations of the green corona rotation period.
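A minimal sketch of the correlation approach: for a daily intensity series at one latitude, the lag of the autocorrelation peak within a plausible window is read off as the synodic period. The synthetic series, the window bounds, and the peak criterion are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical daily Fe XIV 530.3 nm intensity series at one latitude:
# a ~27.5-day rotational modulation plus noise.
days = np.arange(2000)
series = np.sin(2 * np.pi * days / 27.5) + 0.8 * rng.normal(size=days.size)

def synodic_period(x, min_lag=20, max_lag=35):
    """Lag of the autocorrelation peak, taken as the rotation period in days."""
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[x.size - 1:]
    acf /= acf[0]
    lag = min_lag + np.argmax(acf[min_lag:max_lag + 1])
    return lag, acf[lag]

lag, strength = synodic_period(series)
# A weak correlation peak would signal unreliable tracers, the caveat the
# authors raise about parts of the data set.
print(lag, strength)
```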


Author(s):  
Jules S. Jaffe
Robert M. Glaeser

Although difference Fourier techniques are standard in X-ray crystallography, it is only very recently that electron crystallographers have been able to take advantage of this method. We have combined a high-resolution data set for frozen, glucose-embedded Purple Membrane (PM) with a data set collected from PM prepared in the frozen hydrated state in order to visualize any differences in structure due to the different methods of preparation. The increased contrast between protein and ice, versus protein and glucose, may prove to be an advantage of the frozen hydrated technique for visualizing those parts of bacteriorhodopsin that are embedded in glucose. In addition, surface groups of the protein may be disordered in glucose and ordered in the frozen state. The sensitivity of the difference Fourier technique to small changes in structure provides an ideal method for testing this hypothesis.
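A schematic of the difference Fourier computation: subtract the two amplitude sets and invert with a single set of phases, so peaks in the resulting map mark where the two preparations differ. The grid size and the random stand-in data are assumptions; real work would use the measured amplitudes and refined phases.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical structure factors on a small 2D grid: amplitudes from the
# glucose-embedded and frozen hydrated data sets, with one common set of
# phases, as in a standard difference Fourier calculation.
shape = (64, 64)
F_glucose = rng.rayleigh(1.0, shape)
F_hydrated = F_glucose + 0.1 * rng.normal(size=shape)  # small differences
phases = rng.uniform(-np.pi, np.pi, shape)

# Difference Fourier: invert (|F_hydrated| - |F_glucose|) * exp(i*phi).
delta_F = (F_hydrated - F_glucose) * np.exp(1j * phases)
diff_map = np.fft.ifft2(delta_F).real

# Peaks in the map flag where the two preparations disagree in structure.
print(diff_map.std())
```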


Author(s):  
D. E. Becker

An efficient, robust, and widely-applicable technique is presented for computational synthesis of high-resolution, wide-area images of a specimen from a series of overlapping partial views. This technique can also be used to combine the results of various forms of image analysis, such as segmentation, automated cell counting, deblurring, and neuron tracing, to generate representations that are equivalent to processing the large wide-area image rather than the individual partial views. This can be a first step towards quantitation of the higher-level tissue architecture. The computational approach overcomes mechanical limitations of microscope stages, such as hysteresis and backlash, and automates a procedure that is currently done manually. One application is the high-resolution visualization and/or quantitation of large batches of specimens that are much wider than the field of view of the microscope.

The automated montage synthesis begins by computing a concise set of landmark points for each partial view. The type of landmarks used can vary greatly depending on the images of interest. In many cases, image analysis performed on each data set can provide useful landmarks. Even when no such “natural” landmarks are available, image processing can often provide useful landmarks.
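The sketch below shows one way the landmark step could drive registration: estimate the offset between two overlapping views as the median displacement of matched landmarks, then paste the views into a shared canvas. The coordinates, the known correspondences, and the translation-only model are simplifying assumptions.

```python
import numpy as np

# Hypothetical landmark coordinates (x, y) for the same features seen in
# two overlapping partial views; correspondences are assumed known here,
# whereas in practice they come from image analysis of each view.
landmarks_a = np.array([[120.0, 80.0], [200.0, 150.0], [90.0, 210.0]])
landmarks_b = landmarks_a - np.array([100.0, 60.0])  # stage-shifted view

# Robust translation estimate: median displacement of matched landmarks.
offset = np.median(landmarks_a - landmarks_b, axis=0)

def place(canvas, tile, origin):
    """Paste a partial view into the wide-area canvas at an integer origin."""
    y, x = int(round(origin[1])), int(round(origin[0]))
    canvas[y:y + tile.shape[0], x:x + tile.shape[1]] = tile

canvas = np.zeros((600, 600))
view_a = np.ones((256, 256))
view_b = np.full((256, 256), 2.0)
place(canvas, view_a, (0, 0))
place(canvas, view_b, offset)  # registered by the landmark offset
print(offset)  # (100.0, 60.0): the mechanical stage error, recovered
```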


Author(s):  
Jaap Brink
Wah Chiu

Crotoxin complex is the principal neurotoxin of the South American rattlesnake, Crotalus durissus terrificus, and has a molecular weight of 24 kDa. The protein is a heterodimer, with subunit A assigned a chaperone function. Subunit B carries the lethal activity, which is exerted on both sides of the neuro-muscular junction and which is thought to involve binding to the acetylcholine receptor. Insight into the crotoxin complex's mode of action can be gained from a 3 Å resolution structure obtained by electron crystallography. This abstract communicates our progress in merging the electron diffraction amplitudes into a 3-dimensional (3D) intensity data set, now close to completion. Since the thickness of crotoxin complex crystals varies from one crystal to the other, we chose to collect tilt series of electron diffraction patterns after determining their thickness. Furthermore, by making use of the symmetry present in these tilt data, only intensities collected from similar crystals will be merged.

Suitable crystals of glucose-embedded crotoxin complex were searched for in the defocussed diffraction mode, with the goniometer tilted to 55° or higher, in a JEOL4000 electron cryo-microscope operated at 400 kV with the crystals kept at -120°C in a Gatan 626 cryo-holder. The crystal thickness was measured using the local contrast of the crystal relative to the supporting film from search-mode images acquired using a 1024 x 1024 slow-scan CCD camera (model 679, Gatan Inc.).
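A toy sketch of the merging step: bring each crystal's intensities onto a common scale against a reference crystal, then average symmetry-equivalent measurements. The reflections, the linear scaling model, and the R-merge-style residual are illustrative assumptions; the symmetry reduction to a unique set of indices is omitted.

```python
import numpy as np
import pandas as pd

# Hypothetical measured intensities from two crystals; (h, k) are assumed
# already reduced to symmetry-unique indices before merging.
obs = pd.DataFrame({
    "crystal": [0, 0, 0, 1, 1, 1],
    "h": [1, 2, 3, 1, 2, 3],
    "k": [0, 1, 1, 0, 1, 1],
    "I": [100.0, 250.0, 60.0, 55.0, 130.0, 33.0],
})

# Per-crystal least-squares scale factor against crystal 0 as reference.
ref = obs[obs.crystal == 0].set_index(["h", "k"])["I"]
def scale_factor(group):
    common = group.set_index(["h", "k"])["I"].reindex(ref.index).dropna()
    return (ref.loc[common.index] * common).sum() / (common ** 2).sum()

scales = obs.groupby("crystal").apply(scale_factor)
obs["I_scaled"] = obs["I"] * obs["crystal"].map(scales)

# Merge equivalent measurements and report an R-merge-style residual.
merged = obs.groupby(["h", "k"])["I_scaled"].mean()
obs = obs.merge(merged.rename("I_merged").reset_index(), on=["h", "k"])
r_merge = (obs["I_scaled"] - obs["I_merged"]).abs().sum() / obs["I_scaled"].sum()
print(merged)
print(r_merge)  # low value => the crystals merge consistently
```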


Author(s):  
J. K. Samarabandu
R. Acharya
D. R. Pareddy
P. C. Cheng

In the study of cell organization in a maize meristem, direct viewing of confocal optical sections in 3D (by means of 3D projection of the volumetric data set, Figure 1) becomes very difficult and confusing because of the large number of nuclei involved. Numerical description of the cellular organization (e.g. position, size and orientation of each structure) and computer graphic presentation are some of the solutions to effectively study the structure of such a complex system. An attempt at data reduction by means of manually contouring cell nuclei in 3D was reported (Summers et al., 1990). Apart from being labour intensive, this 3D digitization technique suffers from the inaccuracies of manual 3D tracing related to the depth perception of the operator. However, it does demonstrate that reducing a stack of confocal images to a 3D graphic representation helps to visualize and analyze complex tissues (Figure 2). This procedure also significantly reduces the computational burden in an interactive operation.
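As an illustration of automating that data reduction, the sketch below labels bright blobs in a synthetic 3D stack and reduces each to a centroid and voxel count; the threshold, the blob model, and the volume size are assumptions standing in for real confocal data.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(3)
# Hypothetical confocal volume: a few bright blobs ("nuclei") on a noisy
# background, in place of a real image stack.
vol = rng.normal(0.0, 0.1, (40, 128, 128))
zz, yy, xx = np.ogrid[:40, :128, :128]
for z, y, x in [(10, 30, 40), (20, 80, 60), (30, 50, 100)]:
    vol += np.exp(-((zz - z)**2 + (yy - y)**2 + (xx - x)**2) / 18.0)

# Segment and label nuclei, then reduce the stack to numerical descriptors
# (position and size) instead of hand-contouring each nucleus.
labels, n = ndimage.label(vol > 0.6)
centroids = ndimage.center_of_mass(vol, labels, index=range(1, n + 1))
sizes = ndimage.sum_labels(vol > 0.6, labels, index=range(1, n + 1))
for c, s in zip(centroids, sizes):
    print(np.round(c, 1), int(s))  # (z, y, x) centroid and voxel count
```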


Author(s):  
M. Shlepr
C. M. Vicroy

The microelectronics industry is heavily tasked with minimizing contaminants at all steps of the manufacturing process. Particles are generated by physical and/or chemical fragmentation from a mother source. The tools and macrovolumes of chemicals used for processing, the environment surrounding the process, and the circuits themselves are all potential particle sources. A first step in eliminating these contaminants is to identify their source. Elemental analysis of the particles often proves useful toward this goal, and energy dispersive spectroscopy (EDS) is a commonly used technique. However, the large variety of source materials and process-induced changes in the particles often make it difficult to discern whether the particles are from a common source.

Ordination is commonly used in ecology to understand community relationships. This technique uses pair-wise measures of similarity. Separation of the data set is based on discrimination functions. The end product is a spatial representation of the data, with the distance between points equaling the degree of dissimilarity.
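A compact sketch of the ordination idea applied to EDS data: build a pair-wise dissimilarity matrix from elemental compositions and project it to 2D with classical multidimensional scaling, so particles from a common source cluster together. The compositions and the Euclidean dissimilarity measure are illustrative assumptions.

```python
import numpy as np

# Hypothetical EDS compositions (rows: particles, columns: element fractions).
X = np.array([
    [0.70, 0.20, 0.10],   # particle 1
    [0.68, 0.22, 0.10],   # particle 2: likely same source as particle 1
    [0.10, 0.30, 0.60],   # particle 3: different source
    [0.12, 0.28, 0.60],   # particle 4
])

# Pair-wise dissimilarity matrix (Euclidean distance between compositions).
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Classical MDS / principal coordinates: double-centre the squared
# distances and take the leading eigenvectors as ordination axes.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]
coords = vecs[:, order[:2]] * np.sqrt(np.maximum(vals[order[:2]], 0.0))

# Distance between points now approximates compositional dissimilarity;
# tight clusters suggest particles from a common source.
print(np.round(coords, 3))
```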

