Second-order control of complex systems with correlated synthetic data

AbstractThe generation of synthetic data is an essential tool to study complex systems, allowing for example to test models of these in precisely controlled settings, or to parametrize simulation models when data is missing. This paper focuses on the generation of synthetic data with an emphasis on correlation structure. We introduce a new methodology to generate such correlated synthetic data. It is implemented in the field of socio-spatial systems, more precisely by coupling an urban growth model with a transportation network generation model. We also show the genericity of the method with an application on financial time-series. The simulation results show that the generation of correlated synthetic data for such systems is indeed feasible within a broad range of correlations, and suggest applications of such synthetic datasets.

Download Full-text

G-Tric: generating three-way synthetic datasets with triclustering solutions

BMC Bioinformatics ◽

10.1186/s12859-020-03925-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

João Lobo ◽

Rui Henriques ◽

Sara C. Madeira

Keyword(s):

State Of The Art ◽

Synthetic Data ◽

Ground Truth ◽

Real Data ◽

Three Dimensions ◽

Additional Advantage ◽

Urban Dynamics ◽

Data Generator ◽

Real World Datasets ◽

Synthetic Datasets

Abstract Background Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations $$\times$$ × features $$\times$$ × contexts). With increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground-truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real 3-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. Results G-Tric can replicate real-world datasets and create new ones that match researchers needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled, by defining the amount of missing, noise or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties was generated and made available, highlighting G-Tric’s potential to advance triclustering state-of-the-art by easing the process of evaluating the quality of new triclustering approaches.

Download Full-text

Intrusion detection of railway clearance from infrared images using generative adversarial networks

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-192141 ◽

2020 ◽

pp. 1-13

Author(s):

Yundong Li ◽

Yi Liu ◽

Han Dong ◽

Wei Hu ◽

Chen Lin

Keyword(s):

Intrusion Detection ◽

Synthetic Data ◽

Generative Adversarial Networks ◽

Generation Model ◽

Single Shot ◽

Data Generation ◽

Infrared Images ◽

Adversarial Networks ◽

Training Samples ◽

Rgb Images

The intrusion detection of railway clearance is crucial for avoiding railway accidents caused by the invasion of abnormal objects, such as pedestrians, falling rocks, and animals. However, detecting intrusions using deep learning methods from infrared images captured at night remains a challenging task because of the lack of sufficient training samples. To address this issue, a transfer strategy that migrates daytime RGB images to the nighttime style of infrared images is proposed in this study. The proposed method consists of two stages. In the first stage, a data generation model is trained on the basis of generative adversarial networks using RGB images and a small number of infrared images, and then, synthetic samples are generated using a well-trained model. In the second stage, a single shot multibox detector (SSD) model is trained using synthetic data and utilized to detect abnormal objects from infrared images at nighttime. To validate the effectiveness of the proposed method, two groups of experiments, namely, railway and non-railway scenes, are conducted. Experimental results demonstrate the effectiveness of the proposed method, and an improvement of 17.8% is achieved for object detection at nighttime.

Download Full-text

Unknown Parameter Excitation and Estimation for Complex Systems With Dynamic Performances

Journal of Mechanical Design ◽

10.1115/1.4050107 ◽

2021 ◽

Vol 143 (9) ◽

Author(s):

Yi-Ping Chen ◽

Kuei-Yuan Chan

Keyword(s):

Complex Systems ◽

Simulation Models ◽

Credible Interval ◽

Specific Model ◽

Model Parameters ◽

Real Model ◽

Dynamic Case ◽

Dynamic Performances ◽

Validation Stage ◽

Setting Parameters

Abstract Simulation models play crucial roles in efficient product development cycles, therefore many studies aim to improve the confidence of a model during the validation stage. In this research, we proposed a dynamic model validation to provide accurate parameter settings for minimal output errors between simulation models and real model experiments. The optimal operations for setting parameters are developed to maximize the effects by specific model parameters while minimizing interactions. To manage the excessive costs associated with simulations of complex systems, we propose a procedure with three main features: the optimal excitation based on global sensitivity analysis (GSA) is done via metamodel techniques, for estimating parameters with the polynomial chaos-based Kalman filter, and validating the updated model based on hypothesis testing. An illustrative mathematical model was used to demonstrate the detail processes in our proposed method. We also apply our method on a vehicle dynamic case with a composite maneuver for exciting unknown model parameters such as inertial and coefficients of the tire model; the unknown model parameters were successfully estimated within a 95% credible interval. The contributions of this research are also underscored through multiple cases.

Download Full-text

EVALUATION OF THE INFLUENCE OF SOCIAL AND ECONOMIC DETERMINANTS ON THE STATE OF PUBLIC HEALTH ON THE BASIS OF MICROIMITATION MODELING

Economic Analysis ◽

10.35774/econa2017.02.079 ◽

2017 ◽

pp. 79-90

Author(s):

Dmytro Shushpanov ◽

Volodymyr Sarioglo

Keyword(s):

Research Policy ◽

Synthetic Data ◽

Simulation Models ◽

The State ◽

Data Sets ◽

Economic Determinants ◽

Micro Simulation ◽

State Employment ◽

Health Research Policy ◽

Main Determinants

In the article the essence and peculiarities of microimitational modeling are considered. The advantages of microimitational models over the statistics models are substantiated. Micro-simulation models, that prognosticate somehow dynamic changes in health and which are most appropriate to use in development and health research policy, such as POHEM, CORSIM and Sife Paths, are outlined. It is proposed to use elements of statistical and dynamic microimitation modeling, agent modeling and the concept of a life course for the estimation of the influence social and economic determinants. The synthetic model of population which has been formed on the basis of representative data sets of sample surveys of living conditions of households and economic activity of the population of the State Employment Service of Ukraine, as well as microdata of the Multicultural Survey of the Population of Ukraine (2012) and the Medical and Demographic Survey (2013). The generalized scheme of the method of microimulation modeling of the influence of social and economic determinants on the health status of the population of Ukraine has been developed. The influence of the main determinants on the health of certain age, gender and social and economic groups of the population is estimated on the basis of the methodology of synthetic data.

Download Full-text

Inferring viral occurrence patterns through a synthetic data simulation

10.1101/2021.07.13.452220 ◽

2021 ◽

Author(s):

Ville N Pimenoff ◽

Ramon Cleries

Keyword(s):

Linear Models ◽

Population Sample ◽

Synthetic Data ◽

Interaction Patterns ◽

Viral Strain ◽

Data Simulation ◽

Synthetic Datasets ◽

Pathogen Occurrence ◽

Log Linear ◽

Occurrence Patterns

Viruses infecting humans are manifold and several of them provoke significant morbidity and mortality. Simulations creating large synthetic datasets from observed multiple viral strain infections in a limited population sample can be a powerful tool to infer significant pathogen occurrence and interaction patterns, particularly if limited number of observed data units is available. Here, to demonstrate diverse human papillomavirus (HPV) strain occurrence patterns, we used log-linear models combined with Bayesian framework for graphical independence network (GIN) analysis. That is, to simulate datasets based on modeling the probabilistic associations between observed viral data points, i.e different viral strain infections in a set of population samples. Our GIN analysis outperformed in precision all oversampling methods tested for simulating large synthetic viral strain-level prevalence dataset from observed set of HPVs data. Altogether, we demonstrate that network modeling is a potent tool for creating synthetic viral datasets for comprehensive pathogen occurrence and interaction pattern estimations.

Download Full-text

Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data

Frontiers in Big Data ◽

10.3389/fdata.2021.679939 ◽

2021 ◽

Vol 4 ◽

Author(s):

Michael Platzer ◽

Thomas Reutterer

Keyword(s):

Mixed Type ◽

Synthetic Data ◽

Training Data ◽

Privacy Risk ◽

Individual Level ◽

Empirical Assessment ◽

Model Free ◽

Private Data ◽

Synthetic Datasets ◽

The Individual

AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-sourced software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Measuring fidelity is based on statistical distances of lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating the individual-level distances to closest record with respect to the training data. By showing that the synthetic samples are just as close to the training as to the holdout data, we yield strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and compare these then to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup is made available open-source. The results highlight the need to systematically assess the fidelity just as well as the privacy of these emerging class of synthetic data generators.

Download Full-text

A Synthetic Data Generation Model for Diabetic Foot Treatment

Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications - Communications in Computer and Information Science ◽

10.1007/978-981-33-4370-2_18 ◽

2020 ◽

pp. 249-264

Author(s):

Jayun Hyun ◽

Seo Hu Lee ◽

Ha Min Son ◽

Ji-Ung Park ◽

Tai-Myoung Chung

Keyword(s):

Diabetic Foot ◽

Synthetic Data ◽

Generation Model ◽

Data Generation ◽

Synthetic Data Generation

Download Full-text

Revisiting particle sizing using greyscale optical array probes: evaluation using laboratory experiments and synthetic data

Atmospheric Measurement Techniques ◽

10.5194/amt-12-3067-2019 ◽

2019 ◽

Vol 12 (6) ◽

pp. 3067-3079

Author(s):

Sebastian J. O'Shea ◽

Jonathan Crosier ◽

James Dorsey ◽

Waldemar Schledewitz ◽

Ian Crawford ◽

...

Keyword(s):

Climate Models ◽

Synthetic Data ◽

Mie Scattering ◽

Sample Volume ◽

Research Aircraft ◽

Order Of Magnitude ◽

Ambient Data ◽

In Situ Observations ◽

Synthetic Datasets

Abstract. In situ observations from research aircraft and instrumented ground sites are important contributions to developing our collective understanding of clouds and are used to inform and validate numerical weather and climate models. Unfortunately, biases in these datasets may be present, which can limit their value. In this paper, we discuss artefacts which may bias data from a widely used family of instrumentation in the field of cloud physics, optical array probes (OAPs). Using laboratory and synthetic datasets, we demonstrate how greyscale analysis can be used to filter data, constraining the sample volume of the OAP and improving data quality, particularly at small sizes where OAP data are considered unreliable. We apply the new methodology to ambient data from two contrasting case studies: one warm cloud and one cirrus cloud. In both cases the new methodology reduces the concentration of small particles (<60 µm) by approximately an order of magnitude. This significantly improves agreement with a Mie-scattering spectrometer for the liquid case and with a holographic imaging probe for the cirrus case. Based on these results, we make specific recommendations to instrument manufacturers, instrument operators and data processors about the optimal use of greyscale OAPs. The data from monoscale OAPs are unreliable and should not be used for particle diameters below approximately 100 µm.

Download Full-text

Validation of XRD phase quantification using semi-synthetic data

Powder Diffraction ◽

10.1017/s0885715620000573 ◽

2020 ◽

Vol 35 (4) ◽

pp. 262-275

Author(s):

Nicola Döbelin

Keyword(s):

Reference Materials ◽

Certified Reference Materials ◽

Synthetic Data ◽

Statistical Validation ◽

Validation Parameters ◽

Phase Quantification ◽

Characteristic Features ◽

Diffraction Patterns ◽

Synthetic Datasets ◽

Multi Phase

Validating phase quantification procedures of powder X-ray diffraction (XRD) data for an implementation in an ISO/IEC 17025 accredited environment has been challenging due to a general lack of suitable certified reference materials. The preparation of highly pure and crystalline reference materials and mixtures thereof may exceed the costs for a profitable and justifiable implementation. This study presents a method for the validation of XRD phase quantifications based on semi-synthetic datasets that reduces the effort for a full method validation drastically. Datasets of nearly pure reference substances are stripped of impurity signals and rescaled to 100% crystallinity, thus eliminating the need for the preparation of ultra-pure and -crystalline materials. The processed datasets are then combined numerically while preserving all sample- and instrument-characteristic features of the peak profile, thereby creating multi-phase diffraction patterns of precisely known composition. The number of compositions and repetitions is only limited by computational power and storage capacity. These datasets can be used as input files for the phase quantification procedure, in which statistical validation parameters such as precision, accuracy, linearity, and limits of detection and quantification can be determined from a statistically sound number of datasets and compositions.

Download Full-text

A methodology for developing simulation models of complex systems

Ecological Modelling ◽

10.1016/j.ecolmodel.2006.11.005 ◽

2007 ◽

Vol 202 (3-4) ◽

pp. 385-396 ◽

Cited By ~ 43

Author(s):

Craig A. Aumann

Keyword(s):

Complex Systems ◽

Simulation Models

Download Full-text