Creating Artificial Human Genomes Using Generative Models

Abstract Generative models have shown breakthroughs in a wide spectrum of domains due to recent advancements in machine learning algorithms and increased computational power. Despite these impressive achievements, the ability of generative models to create realistic synthetic data is still under-exploited in genetics and absent from population genetics. Yet a known limitation of this field is the reduced access to many genetic databases due to concerns about violations of individual privacy, although they would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrated that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the high dimensional distributions of real genomic datasets and create high quality artificial genomes (AGs) with none to little privacy loss. To illustrate the promising outcomes of our method, we showed that (i) imputation quality for low frequency alleles can be improved by augmenting reference panels with AGs, (ii) scores obtained from selection tests on AGs and real genomes are highly correlated and (iii) AGs can inherit genotype-phenotype associations. AGs have the potential to become valuable assets in genetic studies by providing high quality anonymous substitutes for private databases.

Creating artificial human genomes using generative neural networks

PLoS Genetics ◽

10.1371/journal.pgen.1009303 ◽

2021 ◽

Vol 17 (2) ◽

pp. e1009303

Author(s):

Burak Yelmen ◽

Aurélien Decelle ◽

Linda Ongaro ◽

Davide Marnetto ◽

Corentin Tallec ◽

...

Keyword(s):

Data Augmentation ◽

Wide Spectrum ◽

Generative Models ◽

Machine Learning Algorithms ◽

Easy Access ◽

Generative Adversarial Networks ◽

Compact Representation ◽

Restricted Boltzmann Machines ◽

High Quality ◽

Genetic Studies

Generative models have shown breakthroughs in a wide spectrum of domains due to recent advancements in machine learning algorithms and increased computational power. Despite these impressive achievements, the ability of generative models to create realistic synthetic data is still under-exploited in genetics and absent from population genetics. Yet a known limitation in the field is the reduced access to many genetic databases due to concerns about violations of individual privacy, although they would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrated that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the complex distributions of real genomic datasets and generate novel high-quality artificial genomes (AGs) with none to little privacy loss. We show that our generated AGs replicate characteristics of the source dataset such as allele frequencies, linkage disequilibrium, pairwise haplotype distances and population structure. Moreover, they can also inherit complex features such as signals of selection. To illustrate the promising outcomes of our method, we showed that imputation quality for low frequency alleles can be improved by data augmentation to reference panels with AGs and that the RBM latent space provides a relevant encoding of the data, hence allowing further exploration of the reference dataset and features for solving supervised tasks. Generative models and AGs have the potential to become valuable assets in genetic studies by providing a rich yet compact representation of existing genomes and high-quality, easy-access and anonymous alternatives for private databases.
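
The abstract above reports that the artificial genomes replicate the allele frequencies of the source dataset. As a rough illustration of what such a check could look like, the sketch below computes per-site allele frequencies for two small binary haplotype matrices and correlates them. The toy matrices, the 0/1 coding, and the function names are assumptions made for this example; they are not the authors' data or code.

```python
from statistics import mean


def allele_frequencies(haplotypes):
    """Return the frequency of the allele coded 1 at each site.

    `haplotypes` is a list of equal-length lists of 0/1 values,
    one inner list per haplotype (an assumed encoding).
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    return [sum(h[j] for h in haplotypes) / n for j in range(n_sites)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: 4 haplotypes x 4 sites each.
real = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 0]]
artificial = [[0, 1, 1, 0], [0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]]

real_freq = allele_frequencies(real)
artificial_freq = allele_frequencies(artificial)
print("real:", real_freq)
print("artificial:", artificial_freq)
print("correlation:", pearson(real_freq, artificial_freq))
```

A correlation close to 1 would suggest that the generated haplotypes preserve per-site frequencies. Linkage disequilibrium and population structure, which the abstract also mentions, would need separate checks.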

Generative adversarial networks for generating synthetic features for Wi-Fi signal quality

PLoS ONE ◽

10.1371/journal.pone.0260308 ◽

2021 ◽

Vol 16 (11) ◽

pp. e0260308

Author(s):

Mauro Castelli ◽

Luca Manzoni ◽

Tatiane Espindola ◽

Aleš Popovič ◽

Andrea De Lorenzo

Keyword(s):

Synthetic Data ◽

Real Data ◽

Generative Models ◽

Generative Adversarial Networks ◽

Signal Quality ◽

Quality Service ◽

High Quality ◽

The Real ◽

Adversarial Networks ◽

High Quality Service

Wireless networks are among the fundamental technologies used to connect people. Considering the constant advancements in the field, telecommunication operators must guarantee a high-quality service to keep their customer portfolio. To ensure this high-quality service, it is common to establish partnerships with specialized technology companies that deliver software services in order to monitor the networks and identify faults and respective solutions. A common barrier faced by these specialized companies is the lack of data to develop and test their products. This paper investigates the use of generative adversarial networks (GANs), which are state-of-the-art generative models, for generating synthetic telecommunication data related to Wi-Fi signal quality. We developed, trained, and compared two of the most used GAN architectures: the Vanilla GAN and the Wasserstein GAN (WGAN). Both models presented satisfactory results and were able to generate synthetic data similar to the real ones. In particular, the distribution of the synthetic data overlaps the distribution of the real data for all of the considered features. Moreover, the considered generative models can reproduce the same associations observed for the synthetic features. We chose the WGAN as the final model, but both models are suitable for addressing the problem at hand.
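
The abstract above states that the synthetic and real distributions overlap for every feature. One simple way to quantify that kind of overlap for a single numeric feature is the empirical one-dimensional Wasserstein distance, sketched below for two samples of equal size. This metric is chosen here purely for illustration; the abstract does not say which measure the authors used, and the sample values are made up.

```python
def wasserstein_1d(real, synthetic):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples.

    For equal-size samples this is the mean absolute difference
    between the sorted values of the two samples.
    """
    if len(real) != len(synthetic):
        raise ValueError("this sketch assumes samples of equal size")
    a, b = sorted(real), sorted(synthetic)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical signal-strength readings (arbitrary units).
real_feature = [-61.0, -58.5, -70.2, -66.1]
synthetic_feature = [-60.4, -59.0, -69.8, -67.0]
print(wasserstein_1d(real_feature, synthetic_feature))
```

A value near zero means the two samples are distributed almost identically for that feature. Larger values indicate a shift or a difference in spread.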

GANs and VAEs As Methods of Synthetic Data Generation and Augmentation to Enhance Heart Disease Prediction

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.b3263.1211221 ◽

2021 ◽

Vol 11 (2) ◽

pp. 17-23

Author(s):

Rohit Sahoo ◽

Vedang Naik ◽

Saurabh Singh ◽

Shaveta Malik ◽

...

Keyword(s):

Machine Learning ◽

Heart Disease ◽

Critical Role ◽

Synthetic Data ◽

Generative Models ◽

Machine Learning Algorithms ◽

Training Data ◽

Generative Adversarial Networks ◽

Data Generation ◽

Original Dataset

Heart disease instances are rising at an alarming rate, and it is critical and essential to predict any such ailments in advance. This is a challenging diagnostic that must be done accurately and swiftly. Lack of relevant data is often the impeding factor when it comes to various areas of research. Data augmentation is a strategy for improving the training of discriminative models that may be accomplished in a variety of ways. Deep generative models, which have recently advanced, now provide new approaches to enrich current data sets. Generative Models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) are frequently used to generate high quality, realistic, synthetic data essential for machine learning algorithms as they play a critical role in various classification problems. In our case, we were provided with 304 rows of heart disease data to create a robust model for predicting the presence of an ailment in the patient. However, the identification of heart disease would not be efficient given the small amount of available training data. We used GAN, CGAN, and VAE to generate data to tackle this problem, thus augmenting the original data. This additional data will help in increasing the accuracy of the models created using the new dataset. We applied classification-based Machine Learning models such as Logistic Regression, Decision Trees, KNN, and Random Forest. We compared the accuracy of the said models, each of which was supplied with the original dataset and the augmented datasets that used the data generation techniques mentioned above. Our research suggests that using data generation techniques significantly boosts the accuracy of the machine learning techniques applied to them.

Deep generative models in DataSHIELD

BMC Medical Research Methodology ◽

10.1186/s12874-021-01237-6 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Stefan Lenz ◽

Moritz Hess ◽

Harald Binder

Keyword(s):

Genetic Variant ◽

Small Sample Size ◽

Synthetic Data ◽

Routine Data ◽

Original Data ◽

Generative Models ◽

Small Sample ◽

Generative Adversarial Networks ◽

Artificial Data ◽

Data Set

Abstract Background The best way to calculate statistics from medical data is to use the data of individual patients. In some settings, this data is difficult to obtain due to privacy restrictions. In Germany, for example, it is not possible to pool routine data from different hospitals for research purposes without the consent of the patients. Methods The DataSHIELD software provides an infrastructure and a set of statistical methods for joint, privacy-preserving analyses of distributed data. The contained algorithms are reformulated to work with aggregated data from the participating sites instead of the individual data. If a desired algorithm is not implemented in DataSHIELD or cannot be reformulated in such a way, using artificial data is an alternative. Generating artificial data is possible using so-called generative models, which are able to capture the distribution of given data. Here, we employ deep Boltzmann machines (DBMs) as generative models. For the implementation, we use the package “BoltzmannMachines” from the Julia programming language and wrap it for use with DataSHIELD, which is based on R. Results We present a methodology together with a software implementation that builds on DataSHIELD to create artificial data that preserve complex patterns from distributed individual patient data. Such data sets of artificial patients, which are not linked to real patients, can then be used for joint analyses. As an exemplary application, we conduct a distributed analysis with DBMs on a synthetic data set, which simulates genetic variant data. Patterns from the original data can be recovered in the artificial data using hierarchical clustering of the virtual patients, demonstrating the feasibility of the approach. 
Additionally, we compare DBMs, variational autoencoders, generative adversarial networks, and multivariate imputation as generative approaches by assessing the utility and disclosure of synthetic data generated from real genetic variant data in a distributed setting with data of a small sample size. Conclusions Our implementation adds to DataSHIELD the ability to generate artificial data that can be used for various analyses, e.g., for pattern recognition with deep learning. This also demonstrates more generally how DataSHIELD can be flexibly extended with advanced algorithms from languages other than R.
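
The Methods part of the abstract above describes algorithms that are reformulated to work with aggregated data from each site instead of individual records. The sketch below shows that idea for the simplest possible case, a pooled mean and sample variance computed from per-site counts and sums. It is plain Python written for illustration under that assumption; it does not use or reproduce the DataSHIELD API, and the function names and site values are hypothetical.

```python
def site_summary(values):
    """Aggregate computed locally at one site; only these totals leave the site."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def pooled_mean_variance(summaries):
    """Combine per-site aggregates into an overall mean and sample variance."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean ** 2) / (n - 1)
    return mean, variance


# Hypothetical measurements held at two separate hospitals.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]
summaries = [site_summary(site_a), site_summary(site_b)]
print(pooled_mean_variance(summaries))
```

The result matches what a direct computation on the pooled records would give, although no individual value is shared. When an analysis cannot be decomposed this way, the abstract proposes generating artificial data instead.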

Synthetic Observational Health Data with GANs: from slow adoption to a boom in medical research and ultimately digital twins?

10.21203/rs.3.rs-116297/v1 ◽

2020 ◽

Author(s):

Jeremy Georges-Filteau ◽

Elisa Cirillo

Keyword(s):

Medical Imaging ◽

Medical Research ◽

Synthetic Data ◽

Well Being ◽

Health Data ◽

Generative Models ◽

Generative Adversarial Networks ◽

Digital Twin ◽

Industrial Sectors ◽

The Subject

Abstract After being collected for patient care, Observational Health Data (OHD) can further benefit patient well-being by sustaining the development of health informatics and medical research. Vast potential is unexploited because of the fiercely private nature of patient-related data and regulations to protect it. Generative Adversarial Networks (GANs) have recently emerged as a groundbreaking way to learn generative models that produce realistic synthetic data. They have revolutionized practices in multiple domains such as self-driving cars, fraud detection, digital twin simulations in industrial sectors, and medical imaging. The digital twin concept could readily apply to modelling and quantifying disease progression. In addition, GANs possess many capabilities relevant to common problems in healthcare: lack of data, class imbalance, rare diseases, and preserving privacy. Unlocking open access to privacy-preserving OHD could be transformative for scientific research. In the midst of COVID-19, the healthcare system is facing unprecedented challenges, many of which are data related for the reasons stated above. Considering these facts, publications concerning GANs applied to OHD seemed to be severely lacking. To uncover the reasons for this slow adoption, we broadly reviewed the published literature on the subject. Our findings show that the properties of OHD were initially challenging for the existing GAN algorithms (unlike medical imaging, for which state-of-the-art models were directly transferable) and that the evaluation of synthetic data lacked clear metrics. We find more publications on the subject than expected, starting slowly in 2017, and since then at an increasing rate. The difficulties of OHD remain, and we discuss issues relating to evaluation, consistency, benchmarking, data modelling, and reproducibility.

Image Augmentation based on GAN deep learning approach with Textual Content Descriptors

Journal of Information Technology and Digital World - September 2019 ◽

10.36548/jitdw.2021.3.005 ◽

2021 ◽

Vol 3 (3) ◽

pp. 210-225

Author(s):

Judy Simon

Keyword(s):

Computer Vision ◽

Synthetic Data ◽

Original Data ◽

Generative Models ◽

Training Data ◽

Generative Adversarial Networks ◽

Data Sets ◽

Biological Vision ◽

Adversarial Networks ◽

Textual Content

Computer vision, also known as computational visual perception, is a branch of artificial intelligence that allows computers to interpret digital pictures and videos in a manner comparable to biological vision. It entails the development of techniques for simulating biological vision. The aim of computer vision is to extract more meaningful information from visual input than biological vision does. Computer vision is exploding due to the avalanche of data being produced today. Powerful generative models, such as Generative Adversarial Networks (GANs), are responsible for significant advances in the field of picture creation. This research focuses on textual content descriptors in the images used by GANs to generate synthetic data from the MNIST dataset, which either supplements or replaces the original data while training classifiers. This can provide better performance than other traditional image enlarging procedures due to the good handling of synthetic data. It shows that training classifiers on synthetic data is as effective as training them on pure data alone, and it also reveals that, for small training data sets, supplementing the dataset by first training GANs on the data may lead to a significant increase in classifier performance.

Generative models for fast cluster simulations in the TPC for the ALICE experiment

EPJ Web of Conferences ◽

10.1051/epjconf/201921406003 ◽

2019 ◽

Vol 214 ◽

pp. 06003 ◽

Cited By ~ 3

Author(s):

Kamil Deja ◽

Tomasz Trzciński ◽

Łukasz Graczykowski

Keyword(s):

Computational Cost ◽

Real Life ◽

Synthetic Data ◽

Real Data ◽

Generative Models ◽

Generative Adversarial Networks ◽

Detector Response ◽

Alice Experiment ◽

The Real ◽

Speed Up

Simulating the detector response is a key component of every high-energy physics experiment. The methods used currently for this purpose provide high-fidelity results. However, this precision comes at a price of a high computational cost. In this work, we introduce our research aiming at fast generation of the possible responses of detector clusters to particle collisions. We present the results for the real-life example of the Time Projection Chamber in the ALICE experiment at CERN. The essential component of our solution is a generative model that allows us to simulate synthetic data points that bear high similarity to the real data. Leveraging recent advancements in machine learning, we propose to use conditional Generative Adversarial Networks. In this work we present a method to simulate data samples that could be recorded in the detector, based on the initial information about particles. We propose and evaluate several models based on convolutional or recursive networks. The main advantage offered by the proposed method is a significant speed-up in the execution time, reaching up to a factor of 10² with respect to the currently used simulation tool. Nevertheless, this speed-up comes at a price of a lower simulation quality. In this work we adapt available methods and show their quantitative and qualitative limitations.

Dynamic GAN for High-Quality Sign Language Video Generation from Skeletal poses using Generative Adversarial Networks

10.21203/rs.3.rs-766083/v1 ◽

2021 ◽

Author(s):

B Natarajan ◽

Elakkiya R

Keyword(s):

Sign Language ◽

Performance Metrics ◽

Random Noise ◽

Video Quality ◽

Generative Models ◽

Generative Adversarial Networks ◽

Generation Process ◽

High Quality ◽

Adversarial Networks ◽

Proposed Model

Abstract The emergence of unsupervised generative models has resulted in greater performance in image and video generation tasks. However, existing generative models pose huge challenges in the high-quality video generation process due to blurry and inconsistent results. In this paper, we introduce a novel generative framework named Dynamic Generative Adversarial Networks (Dynamic GAN) for regulating the adversarial training and generating photorealistic high-quality sign language videos from skeletal poses. The proposed model comprises three stages: a generator network, classification and image quality enhancement, and a discriminator network. In the generator fold, the model generates samples similar to real images using random noise vectors. In the second fold, the classification of generated samples is carried out using the VGG-19 model and novel techniques are employed for improving the quality of generated samples. Finally, the discriminator network fold identifies the real or fake samples. Unlike existing approaches, the proposed novel framework produces photo-realistic video quality results without using any animation or avatar approaches. To evaluate the model performance qualitatively and quantitatively, the proposed model has been evaluated using three benchmark datasets that yield plausible results. The datasets are the RWTH-PHOENIX-Weather 2014T dataset, our self-created dataset for Indian Sign Language (ISL-CSLTR), and the UCF-101 Action Recognition dataset. The output samples and performance metrics show the outstanding performance of our model.

Seismic Data Augmentation Based on Conditional Generative Adversarial Networks

Sensors ◽

10.3390/s20236850 ◽

2020 ◽

Vol 20 (23) ◽

pp. 6850

Author(s):

Yuanming Li ◽

Bonhwa Ku ◽

Shou Zhang ◽

Jae-Kwang Ahn ◽

Hanseok Ko

Keyword(s):

Deep Learning ◽

Data Augmentation ◽

Synthetic Data ◽

Generative Adversarial Networks ◽

High Quality ◽

Seismic Waveforms ◽

Adversarial Networks ◽

Seismic Waveform ◽

Proposed Model

Realistic synthetic data can be useful for data augmentation when training deep learning models to improve seismological detection and classification performance. In recent years, various deep learning techniques have been successfully applied in modern seismology. Because the performance of deep learning depends on a sufficient volume of data, data augmentation is widely used as a data-space solution. In this paper, we propose a Generative Adversarial Networks (GANs) based model that uses conditional knowledge to generate high-quality seismic waveforms. Unlike the existing method of generating samples directly from noise, the proposed method generates synthetic samples based on the statistical characteristics of real seismic waveforms in embedding space. Moreover, a content loss is added to relate high-level features extracted by a pre-trained model to the objective function to enhance the quality of the synthetic data. The classification accuracy is increased from 96.84% to 97.92% after mixing a certain amount of synthetic seismic waveforms, and results of the quality of seismic characteristics derived from the representative experiment show that the proposed model provides an effective structure for generating high-quality synthetic seismic waveforms. Thus, the proposed model is experimentally validated as a promising approach to realistic high-quality seismic waveform data augmentation.
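
The accuracy gain quoted above, from 96.84% to 97.92%, looks small in absolute terms, so it can help to express it as a relative reduction in classification error. The short sketch below does that arithmetic. The function name is an assumption made for this example and the calculation is not part of the paper.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction by which the error rate shrinks, given accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


# Accuracies reported in the abstract above.
reduction = relative_error_reduction(96.84, 97.92)
print(f"{reduction:.1%} fewer misclassifications")
```

The error rate falls from 3.16% to 2.08%, which is a reduction of roughly one third.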

Automatic Generation of Photorealistic Image Fillers for Privacy Enabled Urban Basemaps using Generative Adversarial Networks

Advances in Cartography and GIScience of the ICA ◽

10.5194/ica-adv-1-1-2019 ◽

2019 ◽

Vol 1 ◽

pp. 1-8

Author(s):

Amgad Agoub ◽

Yevgeniya Filippovska ◽

Valentina Schmidt ◽

Martin Kada

Keyword(s):

Image Data ◽

Automatic Generation ◽

Generative Models ◽

Aerial Images ◽

Generative Adversarial Networks ◽

Sensitive Information ◽

Generation Process ◽

High Quality ◽

Privacy And Security ◽

Adversarial Networks

Abstract. The abundance of high-quality satellite images is salutary for many activities but also raises privacy and security concerns. Manually obfuscating areas subject to privacy issues by locally applying pixelization techniques leads to undesirable discontinuities in the visual appearance of the depicted scenes. Alternatively, automatically generated photorealistic fillers can be used to obfuscate sensitive information while preserving the original visual aspect of high-resolution aerial images. Recent advances in the field of Deep Learning (DL) enable the synthesis of high-quality image data. Particularly, generative models such as Generative Adversarial Networks (GANs) can be used to produce images that can be perceived as photorealistic even by human examiners. Additionally, Conditional Generative Adversarial Networks (cGANs) allow control over the image generation process and results. These developments give the opportunity to generate photorealistic fillers for the purpose of privacy and security in image data used within city models while preserving the quality of the original data. However, to our knowledge, little research has been done to explore this potential. In order to close this gap, we propose a novel framework that is designed for this end goal and produces promising results.
