scholarly journals Overcoming barriers to data sharing with medical image generation: a comprehensive evaluation

2021 ◽  
Vol 4 (1) ◽  
Author(s):  
August DuMont Schütte ◽  
Jürgen Hetzel ◽  
Sergios Gatidis ◽  
Tobias Hepp ◽  
Benedikt Dietz ◽  
...  

AbstractPrivacy concerns around sharing personally identifiable information are a major barrier to data sharing in medical research. In many cases, researchers have no interest in a particular individual’s information but rather aim to derive insights at the level of cohorts. Here, we utilise generative adversarial networks (GANs) to create medical imaging datasets consisting entirely of synthetic patient data. The synthetic images ideally have, in aggregate, similar statistical properties to those of a source dataset but do not contain sensitive personal information. We assess the quality of synthetic data generated by two GAN models for chest radiographs with 14 radiology findings and brain computed tomography (CT) scans with six types of intracranial haemorrhages. We measure the synthetic image quality by the performance difference of predictive models trained on either the synthetic or the real dataset. We find that synthetic data performance disproportionately benefits from a reduced number of classes. Our benchmark also indicates that at low numbers of samples per class, label overfitting effects start to dominate GAN training. We conducted a reader study in which trained radiologists discriminate between synthetic and real images. In accordance with our benchmark results, the classification accuracy of radiologists improves with an increasing resolution. Our study offers valuable guidelines and outlines practical conditions under which insights derived from synthetic images are similar to those that would have been derived from real data. Our results indicate that synthetic data sharing may be an attractive alternative to sharing real patient-level data in the right setting.

2021 ◽  
Author(s):  
Muhammad Haris Naveed ◽  
Umair Hashmi ◽  
Nayab Tajved ◽  
Neha Sultan ◽  
Ali Imran

This paper explores whether Generative Adversarial Networks (GANs) can produce realistic network load data that can be utilized to train machine learning models in lieu of real data. In this regard, we evaluate the performance of three recent GAN architectures on the Telecom Italia data set across a set of qualitative and quantitative metrics. Our results show that GAN generated synthetic data is indeed similar to real data and forecasting models trained on this data achieve similar performance to those trained on real data.


PLoS ONE ◽  
2021 ◽  
Vol 16 (11) ◽  
pp. e0260308
Author(s):  
Mauro Castelli ◽  
Luca Manzoni ◽  
Tatiane Espindola ◽  
Aleš Popovič ◽  
Andrea De Lorenzo

Wireless networks are among the fundamental technologies used to connect people. Considering the constant advancements in the field, telecommunication operators must guarantee a high-quality service to keep their customer portfolio. To ensure this high-quality service, it is common to establish partnerships with specialized technology companies that deliver software services in order to monitor the networks and identify faults and respective solutions. A common barrier faced by these specialized companies is the lack of data to develop and test their products. This paper investigates the use of generative adversarial networks (GANs), which are state-of-the-art generative models, for generating synthetic telecommunication data related to Wi-Fi signal quality. We developed, trained, and compared two of the most used GAN architectures: the Vanilla GAN and the Wasserstein GAN (WGAN). Both models presented satisfactory results and were able to generate synthetic data similar to the real ones. In particular, the distribution of the synthetic data overlaps the distribution of the real data for all of the considered features. Moreover, the considered generative models can reproduce the same associations observed for the synthetic features. We chose the WGAN as the final model, but both models are suitable for addressing the problem at hand.


Sensors ◽  
2019 ◽  
Vol 19 (16) ◽  
pp. 3578 ◽  
Author(s):  
Dirvanauskas ◽  
Maskeliūnas ◽  
Raudonis ◽  
Damaševičius ◽  
Scherer

We propose a method for generating the synthetic images of human embryo cells that could later be used for classification, analysis, and training, thus resulting in the creation of new synthetic image datasets for research areas lacking real-world data. Our focus was not only to generate the generic image of a cell such, but to make sure that it has all necessary attributes of a real cell image to provide a fully realistic synthetic version. We use human embryo images obtained during cell development processes for training a deep neural network (DNN). The proposed algorithm used generative adversarial network (GAN) to generate one-, two-, and four-cell stage images. We achieved a misclassification rate of 12.3% for the generated images, while the expert evaluation showed the true recognition rate (TRR) of 80.00% (for four-cell images), 86.8% (for two-cell images), and 96.2% (for one-cell images). Texture-based comparison using the Haralick features showed that there is no statistically (using the Student’s t-test) significant (p < 0.01) differences between the real and synthetic embryo images except for the sum of variance (for one-cell and four-cell images), and variance and sum of average (for two-cell images) features. The obtained synthetic images can be later adapted to facilitate the development, training, and evaluation of new algorithms for embryo image processing tasks.


Sensors ◽  
2021 ◽  
Vol 21 (23) ◽  
pp. 7785
Author(s):  
Jun Mao ◽  
Change Zheng ◽  
Jiyan Yin ◽  
Ye Tian ◽  
Wenbin Cui

Training a deep learning-based classification model for early wildfire smoke images requires a large amount of rich data. However, due to the episodic nature of fire events, it is difficult to obtain wildfire smoke image data, and most of the samples in public datasets suffer from a lack of diversity. To address these issues, a method using synthetic images to train a deep learning classification model for real wildfire smoke was proposed in this paper. Firstly, we constructed a synthetic dataset by simulating a large amount of morphologically rich smoke in 3D modeling software and rendering the virtual smoke against many virtual wildland background images with rich environmental diversity. Secondly, to better use the synthetic data to train a wildfire smoke image classifier, we applied both pixel-level domain adaptation and feature-level domain adaptation. The CycleGAN-based pixel-level domain adaptation method for image translation was employed. On top of this, the feature-level domain adaptation method incorporated ADDA with DeepCORAL was adopted to further reduce the domain shift between the synthetic and real data. The proposed method was evaluated and compared on a test set of real wildfire smoke and achieved an accuracy of 97.39%. The method is applicable to wildfire smoke classification tasks based on RGB single-frame images and would also contribute to training image classification models without sufficient data.


Author(s):  
Zhanpeng Wang ◽  
Jiaping Wang ◽  
Michael Kourakos ◽  
Nhung Hoang ◽  
Hyong Hark Lee ◽  
...  

AbstractPopulation genetics relies heavily on simulated data for validation, inference, and intuition. In particular, since real data is always limited, simulated data is crucial for training machine learning methods. Simulation software can accurately model evolutionary processes, but requires many hand-selected input parameters. As a result, simulated data often fails to mirror the properties of real genetic data, which limits the scope of methods that rely on it. In this work, we develop a novel approach to estimating parameters in population genetic models that automatically adapts to data from any population. Our method is based on a generative adversarial network that gradually learns to generate realistic synthetic data. We demonstrate that our method is able to recover input parameters in a simulated isolation-with-migration model. We then apply our method to human data from the 1000 Genomes Project, and show that we can accurately recapitulate the features of real data.


2021 ◽  
Vol 2021 (3) ◽  
pp. 142-163
Author(s):  
Sumit Mukherjee ◽  
Yixi Xu ◽  
Anusua Trivedi ◽  
Nabajyoti Patowary ◽  
Juan L. Ferres

Abstract Generative Adversarial Networks (GANs) have made releasing of synthetic images a viable approach to share data without releasing the original dataset. It has been shown that such synthetic data can be used for a variety of downstream tasks such as training classifiers that would otherwise require the original dataset to be shared. However, recent work has shown that the GAN models and their synthetically generated data can be used to infer the training set membership by an adversary who has access to the entire dataset and some auxiliary information. Current approaches to mitigate this problem (such as DPGAN [1]) lead to dramatically poorer generated sample quality than the original non–private GANs. Here we develop a new GAN architecture (privGAN), where the generator is trained not only to cheat the discriminator but also to defend membership inference attacks. The new mechanism is shown to empirically provide protection against this mode of attack while leading to negligible loss in downstream performances. In addition, our algorithm has been shown to explicitly prevent memorization of the training set, which explains why our protection is so effective. The main contributions of this paper are: i) we propose a novel GAN architecture that can generate synthetic data in a privacy preserving manner with minimal hyperparameter tuning and architecture selection, ii) we provide a theoretical understanding of the optimal solution of the privGAN loss function, iii) we empirically demonstrate the effectiveness of our model against several white and black–box attacks on several benchmark datasets, iv) we empirically demonstrate on three common benchmark datasets that synthetic images generated by privGAN lead to negligible loss in downstream performance when compared against non– private GANs. While we have focused on benchmarking privGAN exclusively on image datasets, the architecture of privGAN is not exclusive to image datasets and can be easily extended to other types of datasets. Repository link: https://github.com/microsoft/privGAN.


Mathematics ◽  
2020 ◽  
Vol 9 (1) ◽  
pp. 71
Author(s):  
Pablo Bonilla-Escribano ◽  
David Ramírez ◽  
Alejandro Porras-Segovia ◽  
Antonio Artés-Rodríguez

Variability is defined as the propensity at which a given signal is likely to change. There are many choices for measuring variability, and it is not generally known which ones offer better properties. This paper compares different variability metrics applied to irregularly (nonuniformly) sampled time series, which have important clinical applications, particularly in mental healthcare. Using both synthetic and real patient data, we identify the most robust and interpretable variability measures out of a set 21 candidates. Some of these candidates are also proposed in this work based on the absolute slopes of the time series. An additional synthetic data experiment shows that when the complete time series is unknown, as it happens with real data, a non-negligible bias that favors normalized and/or metrics based on the raw observations of the series appears. Therefore, only the results of the synthetic experiments, which have access to the full series, should be used to draw conclusions. Accordingly, the median absolute deviation of the absolute value of the successive slopes of the data is the best way of measuring variability for this kind of time series.


Electronics ◽  
2021 ◽  
Vol 10 (18) ◽  
pp. 2220
Author(s):  
Luis Gonzalez-Abril ◽  
Cecilio Angulo ◽  
Juan-Antonio Ortega ◽  
José-Luis Lopez-Guerra

The digital twin in health care is the dynamic digital representation of the patient’s anatomy and physiology through computational models which are continuously updated from clinical data. Furthermore, used in combination with machine learning technologies, it should help doctors in therapeutic path and in minimally invasive intervention procedures. Confidentiality of medical records is a very delicate issue, therefore some anonymization process is mandatory in order to maintain patients privacy. Moreover, data availability is very limited in some health domains like lung cancer treatment. Hence, generation of synthetic data conformed to real data would solve this issue. In this paper, the use of generative adversarial networks (GAN) for the generation of synthetic data of lung cancer patients is introduced as a tool to solve this problem in the form of anonymized synthetic patients. Generated synthetic patients are validated using both statistical methods, as well as by oncologists using the indirect mortality rate obtained for patients in different stages.


2021 ◽  
Author(s):  
Muhammad Haris Naveed ◽  
Umair Hashmi ◽  
Nayab Tajved ◽  
Neha Sultan ◽  
Ali Imran

This paper explores whether Generative Adversarial Networks (GANs) can produce realistic network load data that can be utilized to train machine learning models in lieu of real data. In this regard, we evaluate the performance of three recent GAN architectures on the Telecom Italia data set across a set of qualitative and quantitative metrics. Our results show that GAN generated synthetic data is indeed similar to real data and forecasting models trained on this data achieve similar performance to those trained on real data.


JAMIA Open ◽  
2020 ◽  
Author(s):  
Randi E Foraker ◽  
Sean C Yu ◽  
Aditi Gupta ◽  
Andrew P Michelson ◽  
Jose A Pineda Soto ◽  
...  

Abstract Background Synthetic data may provide a solution to researchers who wish to generate and share data in support of precision healthcare. Recent advances in data synthesis enable the creation and analysis of synthetic derivatives as if they were the original data; this process has significant advantages over data deidentification. Objectives To assess a big-data platform with data-synthesizing capabilities (MDClone Ltd., Beer Sheva, Israel) for its ability to produce data that can be used for research purposes while obviating privacy and confidentiality concerns. Methods We explored three use cases and tested the robustness of synthetic data by comparing the results of analyses using synthetic derivatives to analyses using the original data using traditional statistics, machine learning approaches, and spatial representations of the data. We designed these use cases with the purpose of conducting analyses at the observation level (Use Case 1), patient cohorts (Use Case 2), and population-level data (Use Case 3). Results For each use case, the results of the analyses were sufficiently statistically similar (P &gt; 0.05) between the synthetic derivative and the real data to draw the same conclusions. Discussion and conclusion This article presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in clinical research for faster insights and improved data sharing in support of precision healthcare.


Sign in / Sign up

Export Citation Format

Share Document