Towards Improving Privacy of Synthetic DataSets

Author(s):  
Aditya Kuppa ◽  
Lamine Aouad ◽  
Nhien-An Le-Khac
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract. Background: Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions: Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric’s potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.


2021 ◽  
Vol 16 (2) ◽  
pp. 1-31
Author(s):  
Chunkai Zhang ◽  
Zilin Du ◽  
Yuting Yang ◽  
Wensheng Gan ◽  
Philip S. Yu

Utility mining has emerged as an important and interesting topic owing to its wide application and considerable popularity. However, conventional utility mining methods have a bias toward items that have longer on-shelf time as they have a greater chance to generate a high utility. To eliminate the bias, the problem of on-shelf utility mining (OSUM) is introduced. In this article, we focus on the task of OSUM of sequence data, where the sequential database is divided into several partitions according to time periods and items are associated with utilities and several on-shelf time periods. To address the problem, we propose two methods, OSUM of sequence data (OSUMS) and OSUMS+, to extract on-shelf high-utility sequential patterns. For further efficiency, we also design several strategies to reduce the search space and avoid redundant calculation with two upper bounds: time prefix extension utility (TPEU) and time reduced sequence utility (TRSU). In addition, two novel data structures are developed for facilitating the calculation of upper bounds and utilities. Substantial experimental results on certain real and synthetic datasets show that the two methods outperform the state-of-the-art algorithm. In conclusion, OSUMS may consume a large amount of memory and is unsuitable for cases with limited memory, while OSUMS+ has wider real-life applications owing to its high efficiency.


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3726
Author(s):  
Ivan Vaccari ◽  
Vanessa Orani ◽  
Alessia Paglialonga ◽  
Enrico Cambiaso ◽  
Maurizio Mongelli

The application of machine learning and artificial intelligence techniques in the medical world is growing, with a range of purposes: from the identification and prediction of possible diseases to patient monitoring and clinical decision support systems. Furthermore, the widespread use of remote monitoring medical devices, under the umbrella of the “Internet of Medical Things” (IoMT), has simplified the retrieval of patient information as they allow continuous monitoring and direct access to data by healthcare providers. However, due to possible issues in real-world settings, such as loss of connectivity, irregular use, misuse, or poor adherence to a monitoring program, the data collected might not be sufficient to implement accurate algorithms. For this reason, data augmentation techniques can be used to create synthetic datasets sufficiently large to train machine learning models. In this work, we apply the concept of generative adversarial networks (GANs) to perform a data augmentation from patient data obtained through IoMT sensors for Chronic Obstructive Pulmonary Disease (COPD) monitoring. We also apply an explainable AI algorithm to demonstrate the accuracy of the synthetic data by comparing it to the real data recorded by the sensors. The results obtained demonstrate how synthetic datasets created through a well-structured GAN are comparable with a real dataset, as validated by a novel approach based on machine learning.


Energies ◽  
2021 ◽  
Vol 14 (9) ◽  
pp. 2371
Author(s):  
Matthieu Dubarry ◽  
David Beck

The development of data driven methods for Li-ion battery diagnosis and prognosis is a growing field of research for the battery community. A big limitation is usually the size of the training datasets which are typically not fully representative of the real usage of the cells. Synthetic datasets were proposed to circumvent this issue. This publication provides improved datasets for three major battery chemistries, LiFePO4, Nickel Aluminum Cobalt Oxide, and Nickel Manganese Cobalt Oxide 811. These datasets can be used for statistical or deep learning methods. This work also provides a detailed statistical analysis of the datasets. Accurate diagnosis as well as early prognosis comparable with state of the art, while providing physical interpretability, were demonstrated by using the combined information of three learnable parameters.


Atmosphere ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 577
Author(s):  
Gabriele Graffieti ◽  
Davide Maltoni

In this paper, we present a novel defogging technique, named CurL-Defog, with the aim of minimizing the insertion of artifacts while maintaining good contrast restoration and visibility enhancement. Many learning-based defogging approaches rely on paired data, where fog is artificially added to clear images; this usually provides good results on mildly fogged images but is not effective for difficult cases. On the other hand, the models trained with real data can produce visually impressive results, but unwanted artifacts are often present. We propose a curriculum learning strategy and an enhanced CycleGAN model to reduce the number of produced artifacts, where both synthetic and real data are used in the training procedure. We also introduce a new metric, called HArD (Hazy Artifact Detector), to numerically quantify the number of artifacts in the defogged images, thus avoiding the tedious and subjective manual inspection of the results. HArD is then combined with other defogging indicators to produce a solid metric that is not deceived by the presence of artifacts. The proposed approach compares favorably with state-of-the-art techniques on both real and synthetic datasets.


Energies ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 353
Author(s):  
Yu Hou ◽  
Rebekka Volk ◽  
Lucio Soibelman

Multi-sensor imagery data has been used by researchers for the image semantic segmentation of buildings and outdoor scenes. Due to multi-sensor data hunger, researchers have implemented many simulation approaches to create synthetic datasets, and they have also synthesized thermal images because such thermal information can potentially improve segmentation accuracy. However, current approaches are mostly based on the laws of physics and are limited to geometric models’ level of detail (LOD), which describes the overall planning or modeling state. Another issue in current physics-based approaches is that thermal images cannot be aligned to RGB images because the configurations of a virtual camera used for rendering thermal images are difficult to synchronize with the configurations of a real camera used for capturing RGB images, which is important for segmentation. In this study, we propose an image translation approach to directly convert RGB images to simulated thermal images for expanding segmentation datasets. We aim to investigate the benefits of using an image translation approach for generating synthetic aerial thermal images and compare those approaches with physics-based approaches. Our datasets for generating thermal images are from a city center and a university campus in Karlsruhe, Germany. We found that using the generating model established by the city center to generate thermal images for campus datasets performed better than using the latter to generate thermal images for the former. We also found that using a generating model established by one building style to generate thermal images for datasets with the same building styles performed well. Therefore, we suggest using training datasets with richer and more diverse building architectural information, more complex envelope structures, and similar building styles to testing datasets for an image translation approach.


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3896
Author(s):  
Dat Ngo ◽  
Gi-Dong Lee ◽  
Bongsoon Kang

Haze is a term that is widely used in image processing to refer to natural and human-activity-emitted aerosols. It causes light scattering and absorption, which reduce the visibility of captured images. This reduction hinders the proper operation of many photographic and computer-vision applications, such as object recognition/localization. Accordingly, haze removal, which is also known as image dehazing or defogging, is an apposite solution. However, existing dehazing algorithms unconditionally remove haze, even when haze occurs occasionally. Therefore, an approach for haze density estimation is highly demanded. This paper then proposes a model that is known as the haziness degree evaluator to predict haze density from a single image without reference to a corresponding haze-free image, an existing georeferenced digital terrain model, or training on a significant amount of data. The proposed model quantifies haze density by optimizing an objective function comprising three haze-relevant features that result from correlation and computation analysis. This objective function is formulated to maximize the image’s saturation, brightness, and sharpness while minimizing the dark channel. Additionally, this study describes three applications of the proposed model in hazy/haze-free image classification, dehazing performance assessment, and single image dehazing. Extensive experiments on both real and synthetic datasets demonstrate its efficacy in these applications.


2010 ◽  
Vol 08 (02) ◽  
pp. 337-356 ◽  
Author(s):  
SAAD I. SHEIKH ◽  
TANYA Y. BERGER-WOLF ◽  
ASHFAQ A. KHOKHAR ◽  
ISABEL C. CABALLERO ◽  
MARY V. ASHLEY ◽  
...  

While full-sibling group reconstruction from microsatellite data is a well-studied problem, reconstruction of half-sibling groups is much less studied, theoretically challenging, and computationally demanding. In this paper, we present a formulation of the half-sibling reconstruction problem and prove its APX-hardness. We also present exact solutions for this formulation and develop heuristics. Using biological and synthetic datasets we present experimental results and compare them with the leading alternative software COLONY. We show that our results are competitive and allow half-sibling group reconstruction in the presence of polygamy, which is prevalent in nature.


2021 ◽  
Author(s):  
Nusrat Parveen M Rafique ◽  
Satish. R. Devane ◽  
Shamim Akhtar

2020 ◽  
Vol 13 (10) ◽  
pp. 1669-1681
Author(s):  
Zijing Tan ◽  
Ai Ran ◽  
Shuai Ma ◽  
Sheng Qin

Pointwise order dependencies (PODs) are dependencies that specify ordering semantics on attributes of tuples. POD discovery refers to the process of identifying the set Σ of valid and minimal PODs on a given data set D. In practice D is typically large and keeps changing, and it is prohibitively expensive to compute Σ from scratch every time. In this paper, we make a first effort to study the incremental POD discovery problem, aiming at computing changes ΔΣ to Σ such that Σ ⊕ ΔΣ is the set of valid and minimal PODs on D with a set ΔD of tuple insertion updates. (1) We first propose a novel indexing technique for inputs Σ and D. We give algorithms to build and choose indexes for Σ and D, and to update indexes in response to ΔD. We show that POD violations w.r.t. Σ incurred by ΔD can be efficiently identified by leveraging the proposed indexes, with a cost dependent on log(|D|). (2) We then present an effective algorithm for computing ΔΣ, based on Σ and identified violations caused by ΔD. The PODs in Σ that become invalid on D + ΔD are efficiently detected with the proposed indexes, and further new valid PODs on D + ΔD are identified by refining those invalid PODs in Σ on D + ΔD. (3) Finally, using both real-life and synthetic datasets, we experimentally show that our approach outperforms the batch approach that computes from scratch, up to orders of magnitude.
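
To make the notion of a violation "incurred by ΔD" concrete, here is a minimal Python sketch. It is not the indexing technique of the paper: it is a naive quadratic scan, and it assumes a simplified reading of a POD as "if tuple s relates to tuple t by the marked comparison on every left-hand attribute, then the same must hold for every marked right-hand attribute". The function names, the dictionary encoding of a POD, and the salary/tax data are all hypothetical. The one idea it illustrates is that, when a POD already holds on D, any new violation must involve at least one inserted tuple, so pairs drawn only from D can be skipped.

```python
import operator

# Comparison operators that may mark an attribute in this toy POD encoding.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
}


def matches(s, t, marked_attrs):
    """True if s[a] op t[a] holds for every (attribute, op) pair."""
    return all(OPS[op](s[a], t[a]) for a, op in marked_attrs.items())


def violations_from_inserts(old, new, lhs, rhs):
    """Return ordered pairs (s, t) that violate lhs -> rhs on old + new.

    Assumes the POD is already valid on `old`, so pairs taken only from
    `old` are skipped. This is a naive O(n^2) scan, not an indexed search.
    """
    all_tuples = old + new
    n_old = len(old)
    found = []
    for i, s in enumerate(all_tuples):
        for j, t in enumerate(all_tuples):
            if i == j or (i < n_old and j < n_old):
                continue  # same tuple, or both from the unchanged data
            if matches(s, t, lhs) and not matches(s, t, rhs):
                found.append((s, t))
    return found


# Hypothetical data: "a salary that is no larger implies a tax that is no larger".
old = [{"salary": 50, "tax": 5}, {"salary": 80, "tax": 8}]
new = [{"salary": 60, "tax": 9}]
lhs = {"salary": "<="}
rhs = {"tax": "<="}

violations = violations_from_inserts(old, new, lhs, rhs)
for s, t in violations:
    print("violation:", s, t)
```

In this toy input the inserted tuple has a lower salary but a higher tax than the second existing tuple, so that pair is reported. An index-based approach such as the one the abstract describes would aim to find such pairs without scanning all of D.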

