Inferring viral occurrence patterns through a synthetic data simulation

2021
Author(s): Ville N Pimenoff, Ramon Cleries

Viruses infecting humans are manifold, and several of them cause significant morbidity and mortality. Simulations that create large synthetic datasets from multiple viral strain infections observed in a limited population sample can be a powerful tool for inferring significant pathogen occurrence and interaction patterns, particularly when only a limited number of observed data units is available. Here, to demonstrate diverse human papillomavirus (HPV) strain occurrence patterns, we used log-linear models combined with a Bayesian framework for graphical independence network (GIN) analysis; that is, we simulated datasets by modeling the probabilistic associations between observed viral data points, i.e., different viral strain infections in a set of population samples. Our GIN analysis outperformed in precision all oversampling methods tested for simulating a large synthetic strain-level viral prevalence dataset from an observed set of HPV data. Altogether, we demonstrate that network modeling is a potent tool for creating synthetic viral datasets for comprehensive estimation of pathogen occurrence and interaction patterns.
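The "probabilistic associations" between strain infections can be illustrated with a small sketch. For two binary variables, the two-way interaction term of a saturated log-linear model is determined by the log odds ratio of the 2×2 co-occurrence table. The code below is not the authors' GIN method; the function names, the toy samples, and the 0.5 pseudocount are assumptions made for illustration only.

```python
import math
from collections import Counter

def cooccurrence_table(samples, strain_a, strain_b):
    """Count the four presence/absence combinations of two strains.

    `samples` is a list of sets, each holding the strains detected in one
    person. Returns counts keyed by (a_present, b_present).
    """
    counts = Counter((strain_a in s, strain_b in s) for s in samples)
    return {(a, b): counts.get((a, b), 0)
            for a in (True, False) for b in (True, False)}

def log_odds_ratio(table, pseudocount=0.5):
    """Log odds ratio of a 2x2 table.

    A pseudocount is added to every cell (an assumption of this sketch)
    so that empty cells do not cause a division by zero.
    """
    n11 = table[(True, True)] + pseudocount
    n10 = table[(True, False)] + pseudocount
    n01 = table[(False, True)] + pseudocount
    n00 = table[(False, False)] + pseudocount
    return math.log((n11 * n00) / (n10 * n01))

# Hypothetical toy data: strains detected in eight people.
samples = [
    {"HPV16", "HPV18"}, {"HPV16", "HPV18"}, {"HPV16", "HPV18"},
    {"HPV16"}, {"HPV18"}, set(), set(), set(),
]
table = cooccurrence_table(samples, "HPV16", "HPV18")
lor = log_odds_ratio(table)  # positive: the strains co-occur more than expected
```

A positive value suggests the two strains appear together more often than independence would predict, a negative value less often, and a value near zero no association.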

2020
Vol 36 (20)
pp. 5045-5053
Author(s): Moritz Hess, Maren Hackenberg, Harald Binder

Abstract. Motivation: Following many successful applications to image data, deep learning is now also increasingly considered for omics data. In particular, generative deep learning not only provides competitive prediction performance, but also allows for uncovering structure by generating synthetic samples. However, exploration and visualization are not as straightforward as with image applications. Results: We demonstrate how log-linear models, fitted to the generated synthetic data, can be used to extract patterns from omics data learned by deep generative techniques. Specifically, interactions between latent representations learned by the approaches and generated synthetic data are used to determine sets of joint patterns. Distances of patterns with respect to the distribution of latent representations are then visualized in low-dimensional coordinate systems, e.g. for monitoring training progress. This is illustrated with simulated data and subsequently with cortical single-cell gene expression data. Using different kinds of deep generative techniques, specifically variational autoencoders and deep Boltzmann machines, the proposed approach highlights how the techniques uncover underlying structure. It facilitates the real-world use of such generative deep learning techniques to gain biological insights from omics data. Availability and implementation: The code for the approach, as well as an accompanying Jupyter notebook that illustrates its application, is available via the GitHub repository: https://github.com/ssehztirom/Exploring-generative-deep-learning-for-omics-data-by-using-log-linear-models. Supplementary information: Supplementary data are available at Bioinformatics online.


2015
Author(s): Jacob Andreas, Dan Klein

2021
Vol 22 (1)
Author(s): João Lobo, Rui Henriques, Sara C. Madeira

Abstract. Background: Three-way data have started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions: Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
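The idea of planting a tricluster in background data and returning the ground truth can be sketched in a few lines. This is not G-Tric's implementation or API; the function name, the uniform background, and the constant-valued pattern are assumptions chosen to keep the illustration small.

```python
import random

def generate_with_planted_tricluster(shape, rows, cols, ctxs, value=5.0, seed=0):
    """Build an observations x features x contexts array with one planted tricluster.

    Background cells are drawn uniformly from [0, 1]; every cell whose
    indices fall in rows x cols x ctxs is overwritten with `value`
    (a constant pattern, the simplest possible choice for this sketch).
    Returns the data and the planted ground truth.
    """
    rng = random.Random(seed)
    n_obs, n_feat, n_ctx = shape
    data = [[[rng.uniform(0.0, 1.0) for _ in range(n_ctx)]
             for _ in range(n_feat)]
            for _ in range(n_obs)]
    for i in rows:
        for j in cols:
            for k in ctxs:
                data[i][j][k] = value
    truth = {
        "observations": sorted(rows),
        "features": sorted(cols),
        "contexts": sorted(ctxs),
    }
    return data, truth

# Hypothetical example: a 6 x 5 x 4 dataset with a 2 x 2 x 2 planted tricluster.
data, truth = generate_with_planted_tricluster(
    shape=(6, 5, 4), rows=[1, 3], cols=[0, 2], ctxs=[1, 2]
)
```

Because the planted indices are returned alongside the data, the output of a triclustering algorithm can be scored against `truth` with extrinsic metrics, which is not possible on real data without a known solution.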


1983
Vol 15 (6)
pp. 801-813
Author(s): B Fingleton

Log-linear models are an appropriate means of determining the magnitude and direction of interactions between categorical variables, but, in common with other statistical models, they assume independent observations. Spatial data are often dependent rather than independent, and thus the analysis of spatial data by log-linear models may erroneously detect interactions between variables that are spurious and are the consequence of pairwise correlations between observations. A procedure is described in this paper to accommodate these effects that requires only very minimal assumptions about the nature of the autocorrelation process, given systematic sampling at intersection points on a square lattice.
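For reference, the standard independence log-linear model for a two-way table can be sketched as follows: expected counts are the products of the margins divided by the total, and the likelihood-ratio statistic G² measures departure from independence. This sketch assumes independent observations, so it shows the baseline analysis that the abstract warns about, not the paper's correction for spatial autocorrelation. The function name and the toy counts are hypothetical.

```python
import math

def independence_fit(table):
    """Fit the independence log-linear model to a two-way contingency table.

    Returns the expected counts (row total * column total / grand total)
    and the likelihood-ratio statistic G^2 = 2 * sum(O * ln(O / E)),
    skipping empty observed cells, which contribute zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    g2 = 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )
    return expected, g2

# Hypothetical counts for two binary variables observed at 80 sites.
observed = [[30, 10], [10, 30]]
expected, g2 = independence_fit(observed)
```

With dependent observations, as in spatially autocorrelated data, G² tends to overstate the evidence for an interaction, which is the problem the procedure described in the abstract is meant to address.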


2008
Vol 30 (1)
pp. 28-52
Author(s): Dana Hamplova

In this article, educational homogamy among married and cohabiting couples in selected European countries is examined. Using data from two waves (2002 and 2004) of the European Social Survey, this article compares three cultural and institutional contexts that differ in terms of institutionalization of cohabitation. Evidence from log-linear models yields two main conclusions. First, as cohabitation becomes more common in society, marriage and cohabitation become more similar with respect to partner selection. Second, where married and unmarried unions differ in terms of educational homogamy, married couples have higher odds of overcoming educational barriers (i.e., intermarrying with other educational groups).

