A Probabilistic Procedure for Anonymisation, for Assessing the Risk of Re-identification and for the Analysis of Perturbed Data Sets

2020 ◽  
Vol 36 (1) ◽  
pp. 89-115 ◽  
Author(s):  
Harvey Goldstein ◽  
Natalie Shlomo

Abstract: The requirement to anonymise data sets that are to be released for secondary analysis should be balanced by the need to allow their analysis to provide efficient and consistent parameter estimates. The proposal in this article is to integrate the process of anonymisation and data analysis. The first stage uses the addition of random noise with known distributional properties to some or all variables in a released (already pseudonymised) data set, in which the values of some identifying and sensitive variables for data subjects of interest are also available to an external ‘attacker’ who wishes to identify those data subjects in order to interrogate their records in the data set. The second stage of the analysis consists of specifying the model of interest so that parameter estimation accounts for the added noise. Where the characteristics of the noise are made available to the analyst by the data provider, we propose a new method that allows a valid analysis. This is formally a measurement error model and we describe a Bayesian MCMC algorithm that recovers consistent estimates of the true model parameters. A new method for handling categorical data is presented. The article shows how an appropriate noise distribution can be determined.
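As a rough illustration of the two-stage idea (noise addition with known variance, then noise-aware estimation), the sketch below uses the classical method-of-moments attenuation correction for a simple linear regression rather than the authors' Bayesian MCMC algorithm; the simulation setup and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate "true" confidential data: y depends linearly on x
n = 10_000
x = rng.normal(0.0, 2.0, n)             # true covariate
y = 1.5 * x + rng.normal(0.0, 1.0, n)   # outcome

# Stage 1 (data provider): release x with added noise of KNOWN variance
sigma_noise = 1.0
x_released = x + rng.normal(0.0, sigma_noise, n)

# Naive analysis on the released data: the slope is attenuated toward zero
beta_naive = np.cov(x_released, y)[0, 1] / np.var(x_released, ddof=1)

# Stage 2 (analyst): correct using the known noise variance
# (method-of-moments stand-in for the paper's Bayesian MCMC)
var_x_hat = np.var(x_released, ddof=1) - sigma_noise**2
beta_corrected = np.cov(x_released, y)[0, 1] / var_x_hat

print(f"naive slope:     {beta_naive:.3f}")      # ~1.2 (attenuated)
print(f"corrected slope: {beta_corrected:.3f}")  # ~1.5 (consistent)
```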

Author(s):  
Fred L. Bookstein

Abstract: A matrix manipulation new to the quantitative study of developmental stability reveals unexpected morphometric patterns in a classic data set of landmark-based calvarial growth. There are implications for evolutionary studies. Among organismal biology’s fundamental postulates is the assumption that most aspects of any higher animal’s growth trajectories are dynamically stable, resilient against the types of small but functionally pertinent transient perturbations that may have originated in genotype, morphogenesis, or ecophenotypy. We need an operationalization of this axiom for landmark data sets arising from longitudinal data designs. The present paper introduces a multivariate approach toward that goal: a method for identification and interpretation of patterns of dynamical stability in longitudinally collected landmark data. The new method is based on an application of eigenanalysis unfamiliar to most organismal biologists: analysis of a covariance matrix of Boas coordinates (Procrustes coordinates without the size standardization) against their changes over time. These eigenanalyses may yield complex eigenvalues and eigenvectors (terms involving $$i=\sqrt{-1}$$); the paper carefully explains how these are to be scattered, gridded, and interpreted by their real and imaginary canonical vectors. For the Vilmann neurocranial octagons, the classic morphometric data set used as the running example here, there result new empirical findings that offer a pattern analysis of the ways perturbations of growth are attenuated or otherwise modified over the course of developmental time. The main finding, dominance of a generalized version of dynamical stability (negative autoregressions, as announced by the negative real parts of their eigenvalues, often combined with shearing and rotation in a helpful canonical plane), is surprising in its strength and consistency. A closing discussion explores some implications of this novel pattern analysis of growth regulation. It differs in many respects from the usual way covariance matrices are wielded in geometric morphometrics, differences relevant to a variety of study designs for comparisons of development across species.
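A minimal numerical sketch of the kind of eigenanalysis described: a non-symmetric matrix relating coordinates to their changes over time can have complex eigenvalues, and negative real parts signal damped, stability-like dynamics. The 2×2 matrix below is a generic illustration, not derived from the Vilmann data.

```python
import numpy as np

# A non-symmetric "change vs. state" matrix, as arises when regressing
# coordinate changes on the coordinates themselves (illustrative values).
A = np.array([[-0.5,  0.8],
              [-0.8, -0.5]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)  # -0.5 ± 0.8i: a complex-conjugate pair

# Negative real parts -> perturbations decay (dynamical stability);
# nonzero imaginary parts -> the decay is combined with rotation/shear.
for lam in eigvals:
    damped = lam.real < 0
    rotational = abs(lam.imag) > 1e-12
    print(f"eigenvalue {lam:+.2f}: damped={damped}, rotational={rotational}")
```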


2020 ◽  
Vol 70 (1) ◽  
pp. 145-161 ◽  
Author(s):  
Marnus Stoltz ◽  
Boris Baeumer ◽  
Remco Bouckaert ◽  
Colin Fox ◽  
Gordon Hiscott ◽  
...  

Abstract We describe a new and computationally efficient Bayesian methodology for inferring species trees and demographics from unlinked binary markers. Likelihood calculations are carried out using diffusion models of allele frequency dynamics combined with novel numerical algorithms. The diffusion approach allows for analysis of data sets containing hundreds or thousands of individuals. The method, which we call Snapper, has been implemented as part of the BEAST2 package. We conducted simulation experiments to assess numerical error, computational requirements, and accuracy in recovering known model parameters. A reanalysis of soybean SNP data demonstrates that the models implemented in Snapp and Snapper can be difficult to distinguish in practice, a characteristic which we tested with further simulations. We demonstrate the scale of analysis possible using a SNP data set sampled from 399 freshwater turtles in 41 populations. [Bayesian inference; diffusion models; multi-species coalescent; SNP data; species trees; spectral methods.]
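Snapper itself lives in BEAST2, but the allele-frequency diffusion at the heart of the likelihood can be sketched directly; below is an Euler-Maruyama simulation of the neutral Wright-Fisher diffusion dx = sqrt(x(1-x)/(2N)) dW. Population size, step size, and horizon are arbitrary assumptions, and the paper's spectral likelihood machinery is not reproduced here.

```python
import numpy as np

def wright_fisher_diffusion(x0, N, t_max, dt=1e-3, rng=None):
    """Euler-Maruyama path of the neutral Wright-Fisher diffusion.

    dx = sqrt(x(1-x)/(2N)) dW, clipped to the absorbing boundaries 0 and 1.
    """
    rng = rng or np.random.default_rng()
    steps = int(t_max / dt)
    x = np.empty(steps + 1)
    x[0] = x0
    for k in range(steps):
        drift = 0.0  # neutral model: no selection or mutation terms
        diff = np.sqrt(max(x[k] * (1 - x[k]), 0.0) / (2 * N))
        x[k + 1] = np.clip(x[k] + drift * dt + diff * np.sqrt(dt) * rng.normal(),
                           0.0, 1.0)
    return x

path = wright_fisher_diffusion(x0=0.3, N=1000, t_max=50.0)
print(f"allele frequency: start {path[0]:.2f} -> end {path[-1]:.3f}")
```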


2018 ◽  
Vol 2018 ◽  
pp. 1-10
Author(s):  
Siyu Ji ◽  
Chenglin Wen

A neural network is a data-driven algorithm: building a network model requires a large amount of training data, so a significant amount of time is spent training the model's parameters. However, the system's modes change from time to time, and predicting with the original model parameters then causes the model output to deviate greatly from the true values. Traditional methods such as gradient descent and least squares are centralized, making it difficult to adaptively update model parameters as the system changes. First, to adaptively update the network parameters, this paper introduces an evaluation function and gives a new method for evaluating its parameters. The new method updates some parameters of the model in real time, without changing the others, to maintain the accuracy of the model. Then, based on the evaluation function, the Mean Impact Value (MIV) algorithm is used to calculate feature weights, and the weighted data are fed into the established fault diagnosis model for fault diagnosis. Finally, the validity of the algorithm is verified on the UCI Combined Cycle Power Plant (UCI-CCPP) standard data set.
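The Mean Impact Value step can be illustrated on its own: MIV perturbs each input feature by ±10% in turn and records the mean shift in model output, ranking features by influence. The sketch below assumes a generic fitted regressor and synthetic data; the ±10% factor follows the common MIV convention and may differ from the paper's choice.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def mean_impact_values(model, X, delta=0.10):
    """MIV feature weights: mean output shift between X*(1+delta) and
    X*(1-delta), perturbing one feature at a time."""
    miv = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_up, X_dn = X.copy(), X.copy()
        X_up[:, j] *= 1 + delta
        X_dn[:, j] *= 1 - delta
        miv[j] = np.mean(model.predict(X_up) - model.predict(X_dn))
    return miv

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=(500, 4))             # synthetic positive features
y = 3 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=500)

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                     random_state=0).fit(X, y)
print(mean_impact_values(model, X))  # expect largest |MIV| for features 0 and 2
```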


2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Mohammed Alguraibawi ◽  
Habshah Midi ◽  
A. H. M. Rahmatullah Imon

Identification of high leverage points is crucial because they are responsible for inaccurate predictions and invalid inferential statements, having a large impact on the computed values of various estimates. It is essential to classify high leverage points into good and bad leverage points because only the bad leverage points have an undue effect on the parameter estimates. It is now evident that when a group of high leverage points is present in a data set, the existing robust diagnostic plot fails to classify them correctly. This problem is due to masking and swamping effects. In this paper, we propose a new robust diagnostic plot to correctly classify the good and bad leverage points by reducing both masking and swamping effects. The formulation of the proposed plot is based on the Modified Generalized Studentized Residuals. We investigate the performance of our proposed method through a Monte Carlo simulation study and some well-known data sets. The results indicate that the proposed method is able to improve the rate of detection of bad leverage points and also to reduce swamping and masking effects.
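For orientation, the sketch below implements the classical good/bad leverage split from hat values and (internally) studentized residuals, i.e. the non-robust baseline that the paper's Modified Generalized Studentized Residuals are designed to harden against masking and swamping; the cut-offs are common rules of thumb and assumptions here.

```python
import numpy as np

def classify_leverage(X, y, h_cut=None, r_cut=2.5):
    """Naive good/bad leverage split from hat values and studentized
    residuals (the classical diagnostic, not the paper's robust version)."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
    h = np.diag(H)                                  # leverage (hat) values
    resid = y - H @ y
    s2 = resid @ resid / (n - p - 1)
    r = resid / np.sqrt(s2 * (1 - h))               # internally studentized
    h_cut = h_cut if h_cut is not None else 2 * (p + 1) / n
    return np.where(h <= h_cut, "regular",
                    np.where(np.abs(r) <= r_cut, "good leverage",
                             "bad leverage"))

rng = np.random.default_rng(1)
X = rng.normal(scale=2.0, size=(50, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=50)
X[0], y[0] = 8.0, 16.0     # good leverage: extreme x, on the regression line
X[1], y[1] = -6.0, 10.0    # bad leverage: extreme x, far off the line
print(classify_leverage(X, y)[:2])
```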


Geophysics ◽  
2020 ◽  
pp. 1-41 ◽  
Author(s):  
Jens Tronicke ◽  
Niklas Allroggen ◽  
Felix Biermann ◽  
Florian Fanselow ◽  
Julien Guillemoteau ◽  
...  

In near-surface geophysics, ground-based mapping surveys are routinely employed in a variety of applications including those from archaeology, civil engineering, hydrology, and soil science. The resulting geophysical anomaly maps of, for example, magnetic or electrical parameters are usually interpreted to laterally delineate subsurface structures such as those related to the remains of past human activities, subsurface utilities and other installations, hydrological properties, or different soil types. To ease the interpretation of such data sets, we propose a multi-scale processing, analysis, and visualization strategy. Our approach relies on a discrete redundant wavelet transform (RWT) implemented using cubic-spline filters and the à trous algorithm, which allows us to efficiently compute a multi-scale decomposition of 2D data using a series of 1D convolutions. The basic idea of the approach is presented using a synthetic test image, while our archaeo-geophysical case study from North-East Germany demonstrates its potential to analyze and process rather typical geophysical anomaly maps including magnetic and topographic data. Our vertical-gradient magnetic data show amplitude variations over several orders of magnitude, complex anomaly patterns at various spatial scales, and typical noise patterns, while our topographic data show a distinct hill structure superimposed by a microtopographic stripe pattern and random noise. Our results demonstrate that the RWT approach is capable of successfully separating these components and that selected wavelet planes can be scaled and combined so that the reconstructed images allow for a detailed, multi-scale structural interpretation, including integrated visualizations of magnetic and topographic data. Because our analysis approach is straightforward to implement without laborious parameter testing and tuning, computationally efficient, and easily adaptable to other geophysical data sets, we believe that it can help to rapidly analyze and interpret different geophysical mapping data collected to address a variety of near-surface applications from engineering practice and research.
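A compact sketch of the à trous redundant wavelet decomposition with the cubic B-spline filter, the scheme the paper builds on (the boundary mode and normalization are assumptions). Each scale smooths the image via two separable 1D convolutions, wavelet planes are differences of successive smoothings, and summing all planes plus the final residual reconstructs the input exactly.

```python
import numpy as np
from scipy.ndimage import convolve1d

B3 = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # cubic B-spline filter

def atrous_rwt(image, n_scales):
    """Redundant (undecimated) à trous wavelet transform.

    Returns the wavelet planes and the final smooth residual;
    their sum reproduces the input image exactly."""
    planes, current = [], image.astype(float)
    for j in range(n_scales):
        # Dilate the filter by inserting 2**j - 1 zeros between taps
        kernel = np.zeros((len(B3) - 1) * 2**j + 1)
        kernel[:: 2**j] = B3
        # Separable 2D smoothing via two 1D convolutions
        smooth = convolve1d(current, kernel, axis=0, mode="reflect")
        smooth = convolve1d(smooth, kernel, axis=1, mode="reflect")
        planes.append(current - smooth)  # detail (wavelet plane) at scale j
        current = smooth
    return planes, current

img = np.random.default_rng(0).normal(size=(64, 64))
planes, residual = atrous_rwt(img, n_scales=4)
print(np.allclose(sum(planes) + residual, img))  # True: exact reconstruction
```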


2017 ◽  
Vol 5 (4) ◽  
pp. 1
Author(s):  
I. E. Okorie ◽  
A. C. Akpanta ◽  
J. Ohakwe ◽  
D. C. Chikezie ◽  
C. U. Onyemachi ◽  
...  

This paper introduces a new generator of probability distributions, the adjusted log-logistic generalized (ALLoG) distribution, and a new extension of the standard one-parameter exponential distribution called the adjusted log-logistic generalized exponential (ALLoGExp) distribution. The ALLoGExp distribution is a special case of the ALLoG distribution, and we provide some of its statistical and reliability properties. Notably, the failure rate can be monotonically decreasing, increasing, or upside-down bathtub shaped, depending on the values of the parameters $\delta$ and $\theta$. The method of maximum likelihood estimation is proposed to estimate the model parameters. The importance and flexibility of the ALLoGExp distribution are demonstrated with a real and uncensored lifetime data set, and its fit is compared with five other exponential-related distributions. The results obtained from the model fittings show that the ALLoGExp distribution provides a reasonably better fit than the other fitted distributions. The ALLoGExp distribution is therefore recommended for effective modelling of lifetime data sets.
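The ALLoGExp density is not reproduced in this abstract, so the sketch below shows only the generic maximum-likelihood machinery the authors apply, with the plain one-parameter exponential standing in for the target density; substituting the ALLoGExp pdf with parameters $\delta$ and $\theta$ would follow the same pattern.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, data, logpdf):
    return -np.sum(logpdf(data, *params))

# Stand-in density: one-parameter exponential with rate lam
def exp_logpdf(x, lam):
    return np.log(lam) - lam * x

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)   # true rate = 0.5

res = minimize(neg_log_lik, x0=[1.0], args=(data, exp_logpdf),
               bounds=[(1e-9, None)], method="L-BFGS-B")
print(f"MLE rate: {res.x[0]:.3f}")  # ~0.5; the analytic MLE is 1/mean(data)

# Comparing fitted distributions, as in the paper, could then use e.g. AIC:
aic = 2 * len(res.x) + 2 * res.fun
print(f"AIC: {aic:.1f}")
```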


2013 ◽  
Vol 17 (10) ◽  
pp. 4043-4060 ◽  
Author(s):  
D. Herckenrath ◽  
G. Fiandaca ◽  
E. Auken ◽  
P. Bauer-Gottwein

Abstract. Increasingly, ground-based and airborne geophysical data sets are used to inform groundwater models. Recent research focuses on establishing coupling relationships between geophysical and groundwater parameters. To fully exploit such information, this paper presents and compares different hydrogeophysical inversion approaches to inform a field-scale groundwater model with time domain electromagnetic (TDEM) and electrical resistivity tomography (ERT) data. In a sequential hydrogeophysical inversion (SHI), a groundwater model is calibrated with geophysical data by coupling groundwater model parameters with the inverted geophysical models. We subsequently compare the SHI with a joint hydrogeophysical inversion (JHI). In the JHI, a geophysical model is simultaneously inverted with a groundwater model by coupling the groundwater and geophysical parameters to explicitly account for an established petrophysical relationship and its accuracy. Simulations for a synthetic groundwater model and TDEM data showed improved estimates for groundwater model parameters that were coupled to relatively well-resolved geophysical parameters when employing a high-quality petrophysical relationship. Compared to a SHI, these improvements were insignificant, and geophysical parameter estimates became slightly worse. When employing a low-quality petrophysical relationship, groundwater model parameters improved less for both the SHI and JHI, with the SHI performing relatively better. When comparing a SHI and JHI for a real-world groundwater model and ERT data, differences in parameter estimates were small. For both cases investigated in this paper, the SHI seems favorable, taking into account parameter error, data fit, and the complexity of implementing a JHI in combination with its larger computational burden.
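As a toy illustration of the coupling idea in a sequential hydrogeophysical inversion (not the authors' field-scale implementation), the sketch below maps an already-inverted resistivity model to hydraulic conductivity through an assumed log-linear petrophysical relationship and calibrates that relationship against hydrological observations; all values and the functional form are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

# Inverted geophysical model (e.g., layer resistivities from TDEM), assumed given
rho = np.array([20.0, 55.0, 140.0])   # ohm-m, illustrative values

# Assumed log-linear petrophysical relationship: log10(K) = a + b * log10(rho)
def petro_K(rho, a, b):
    return 10 ** (a + b * np.log10(rho))

# Toy "groundwater model": the hydrological observable scales with 1/K per layer
def predict(params, rho):
    a, b = params
    return 1.0 / petro_K(rho, a, b)

rng = np.random.default_rng(0)
obs = predict([-4.0, 0.8], rho) * (1 + 0.02 * rng.normal(size=3))  # synthetic data

# SHI step: calibrate the petrophysical parameters so the groundwater
# model matches the observations (fit in log space for stability)
fit = least_squares(lambda p: np.log(predict(p, rho)) - np.log(obs),
                    x0=[-3.0, 0.5])
print(fit.x)  # recovers roughly (a, b) = (-4.0, 0.8)
```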


2021 ◽  
Vol 37 (3) ◽  
pp. 481-490
Author(s):  
Chenyong Song ◽  
Dongwei Wang ◽  
Haoran Bai ◽  
Weihao Sun

Highlights:
- The proposed data enhancement method can be used for small-scale data sets with rich sample image features.
- The accuracy of the new model reaches 98.5%, which is better than the traditional CNN method.

Abstract: GoogLeNet offers far better performance in identifying apple disease compared to traditional methods. However, the complexity of GoogLeNet is relatively high, and for small volumes of data, GoogLeNet does not achieve the same performance as it does with large-scale data. We propose a new apple disease identification model using GoogLeNet’s inception module. The model adopts a variety of methods to optimize its generalization ability. First, geometric transformation and image modification data enhancement methods (including rotation, scaling, noise interference, random elimination, and color space enhancement), applied with random probabilities and in appropriate combinations, are used to amplify the data set. Second, we employ a deep convolution generative adversarial network (DCGAN) to enhance the richness of generated images by increasing the diversity of the generator’s noise distribution. Finally, we optimize the GoogLeNet model structure to reduce model complexity and the number of model parameters, making it more suitable for identifying apple tree diseases. The experimental results show that our approach quickly detects and classifies apple diseases including rust, spotted leaf disease, and anthrax. It outperforms the original GoogLeNet in recognition accuracy and model size, with identification accuracy reaching 98.5%, making it a feasible method for apple disease classification.

Keywords: Apple disease identification, Data enhancement, DCGAN, GoogLeNet.
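The first optimization step, the augmentation recipe, maps naturally onto standard image-transform tooling; a possible torchvision sketch follows (the transform choices and probabilities are assumptions mirroring the listed operations: rotation, scaling, noise interference, random elimination, and color-space enhancement).

```python
import torch
from torchvision import transforms

# Each transform mirrors one listed enhancement; parameters are assumptions.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                        # rotation
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),          # scaling
    transforms.ColorJitter(brightness=0.3, saturation=0.3),       # color space
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.02 * torch.randn_like(t)),  # noise
    transforms.RandomErasing(p=0.5),                              # random elimination
])

# Applied to a PIL image of an apple leaf, e.g.:
# from PIL import Image
# img = Image.open("leaf.jpg")
# augmented = augment(img)
```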


Author(s):  
Rajendra Prasad ◽  
Lalit Kumar Gupta ◽  
A. Beesham ◽  
G. K. Goswami ◽  
Anil Kumar Yadav

In this paper, we investigate an exact Bianchi type I Universe by taking into account the cosmological constant as the source of energy at the present epoch. We have performed a [Formula: see text] test to obtain the best-fit values of the model parameters of the Universe in the derived model. We have used two types of data sets, viz., (i) 31 values of the Hubble parameter and (ii) the Pantheon data set of 1048 supernovae distance moduli and apparent magnitudes. From both data sets, we have estimated the current values of the Hubble constant and the density parameters [Formula: see text] and [Formula: see text]. The dynamics of the deceleration parameter show that the Universe was in a decelerating phase for redshift [Formula: see text]. At a transition redshift [Formula: see text], the present Universe entered an accelerating phase of expansion. The current age of the Universe is obtained as [Formula: see text] Gyr. This is in good agreement with the value of [Formula: see text] calculated from the Planck collaboration results and WMAP observations.
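The chi-square fitting step against Hubble-parameter data can be sketched generically; the code below fits the familiar flat LCDM form H(z) = H0*sqrt(Om*(1+z)^3 + 1 - Om) rather than the paper's Bianchi type I expression, and the data points are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative (z, H(z), sigma) triples, NOT the 31-point compilation in the paper
z = np.array([0.1, 0.4, 0.9, 1.5, 2.3])
H_obs = np.array([72.0, 83.0, 110.0, 150.0, 220.0])
sigma = np.array([5.0, 6.0, 9.0, 12.0, 20.0])

def H_model(z, H0, Om):
    # Flat LCDM stand-in for the paper's Bianchi type I expression
    return H0 * np.sqrt(Om * (1 + z) ** 3 + (1 - Om))

def chi2(params):
    H0, Om = params
    return np.sum(((H_obs - H_model(z, H0, Om)) / sigma) ** 2)

best = minimize(chi2, x0=[70.0, 0.3], bounds=[(50, 100), (0.01, 0.99)])
print(f"best-fit H0 = {best.x[0]:.1f}, Omega_m = {best.x[1]:.2f}, "
      f"chi2 = {best.fun:.2f}")
```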


2021 ◽  
Author(s):  
Gah-Yi Ban ◽  
N. Bora Keskin

We consider a seller who can dynamically adjust the price of a product at the individual customer level, by utilizing information about customers’ characteristics encoded as a d-dimensional feature vector. We assume a personalized demand model, whose parameters depend on s out of the d features. The seller initially does not know the relationship between the customer features and the product demand but learns this through sales observations over a selling horizon of T periods. We prove that the seller’s expected regret, that is, the revenue loss against a clairvoyant who knows the underlying demand relationship, is at least of order [Formula: see text] under any admissible policy. We then design a near-optimal pricing policy for a semiclairvoyant seller (who knows which s of the d features are in the demand model) who achieves an expected regret of order [Formula: see text]. We extend this policy to a more realistic setting, where the seller does not know the true demand predictors, and show that this policy has an expected regret of order [Formula: see text], which is also near-optimal. Finally, we test our theory on simulated data and on a data set from an online auto loan company in the United States. On both data sets, our experimentation-based pricing policy is superior to intuitive and/or widely practiced customized pricing methods, such as myopic pricing and segment-then-optimize policies. Furthermore, our policy improves upon the loan company’s historical pricing decisions by 47% in expected revenue over a six-month period. This paper was accepted by Noah Gans, stochastic models and simulation.
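A heavily simplified sketch of the experiment-then-exploit logic (linear demand in price, least-squares learning, occasional forced price experimentation) follows; it is not the paper's feature-based policy and carries none of its regret guarantees, and every parameter below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 10.0, 2.0        # demand = a - b*price + noise (unknown to seller)
prices, demands, revenue = [], [], 0.0

for t in range(1, 501):
    if t <= 10 or t % 50 == 0:    # forced experimentation periods
        p = rng.uniform(1.0, 4.0)
    else:                          # exploit: price myopically at a_hat / (2*b_hat)
        X = np.column_stack([np.ones(len(prices)), prices])
        a_hat, slope = np.linalg.lstsq(X, np.array(demands), rcond=None)[0]
        p = np.clip(a_hat / (2 * -slope), 1.0, 4.0)   # slope estimates -b
    d = max(a_true - b_true * p + rng.normal(scale=0.5), 0.0)
    prices.append(p); demands.append(d); revenue += p * d

print(f"avg revenue per period: {revenue / 500:.2f}")
print(f"clairvoyant optimal price: {a_true / (2 * b_true):.2f}")
```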

