Techniques to Produce and Evaluate Realistic Multivariate Synthetic Data

2021 ◽  
Author(s):  
John Heine ◽  
Erin E.E. Fowler ◽  
Anders Berglund ◽  
Michael J. Schell ◽  
Steven A. Eschrich

Background: Proper data modeling in biomedical research requires sufficient data for exploration and reproducibility purposes. A limited sample size can inhibit objective performance evaluation. Objective: We are developing a synthetic population (SP) generation technique to address the limited-sample-size condition. We show how to estimate a multivariate empirical probability density function (pdf) by converting the task to multiple one-dimensional (1D) pdf estimations. Methods: Kernel density estimation (KDE) in 1D was used to construct univariate maps that converted the input variables (X) to normally distributed variables (Y). Principal component analysis (PCA) was used to transform the variables in Y to the uncoupled representation (T), where the univariate pdfs were assumed normal with specified variances. A standard random number generator was used to create synthetic variables with specified variances in T. Applying the inverse PCA transform to the synthetic variables in T produced the SP in Y, and applying the inverse maps produced the respective SP in X. Multiple tests were developed to compare univariate and multivariate pdfs and covariance matrices between the input (sample) and synthetic samples. Three datasets were investigated (n = 667), each with 10 input variables. Results: For all three datasets, both the univariate and multivariate tests (in X, Y, and T) showed that the pdfs from the synthetic samples were statistically similar to those from the respective input samples. Several tests for multivariate normality indicated that the SPs in Y were approximately normal. Covariance matrix comparisons (in X and Y) indicated the same similarity. Conclusions: This work demonstrates how to generate multivariate synthetic data that matches the real input data by converting the input into multiple 1D problems. It also shows that a multivariate input pdf can be converted to a form that approximates a multivariate normal, although the technique does not depend on this finding. Further studies are required to evaluate the generalizability of the approach.
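The pipeline described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the univariate maps here use the empirical CDF with normal scores as a simple stand-in for the paper's KDE-based maps, and all variable names are invented for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def to_normal(x):
    """Map a 1D sample to approximately standard-normal scores via its
    empirical CDF (a simple stand-in for the paper's KDE-based maps)."""
    ranks = np.argsort(np.argsort(x))
    u = (ranks + 0.5) / len(x)          # empirical CDF values in (0, 1)
    return norm.ppf(u)

def synthesize(X, n_synth):
    """Generate a synthetic population matching X's multivariate structure."""
    # 1) univariate maps X -> Y (approximately normal marginals)
    Y = np.column_stack([to_normal(X[:, j]) for j in range(X.shape[1])])
    # 2) PCA on Y: eigendecomposition of the covariance gives the
    #    uncoupled representation T
    C = np.cov(Y, rowvar=False)
    eigvals, V = np.linalg.eigh(C)
    # 3) draw independent normals in T with the component variances
    T_synth = rng.normal(scale=np.sqrt(np.clip(eigvals, 0, None)),
                         size=(n_synth, len(eigvals)))
    # 4) inverse PCA transform: back to Y
    Y_synth = T_synth @ V.T
    # 5) inverse univariate maps: normal scores -> empirical quantiles of X
    X_synth = np.empty_like(Y_synth)
    for j in range(X.shape[1]):
        u = norm.cdf(Y_synth[:, j])
        X_synth[:, j] = np.quantile(X[:, j], u)
    return X_synth

# toy input sample with correlated variables
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
X_s = synthesize(X, 2000)
```

On this toy input, the synthetic sample reproduces the marginal distributions (by construction of the quantile maps) and approximately reproduces the covariance structure through the PCA step.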

2014 ◽  
Vol 11 (3) ◽  
pp. 2555-2582 ◽  
Author(s):  
S. Pande ◽  
L. Arkesteijn ◽  
H. H. G. Savenije ◽  
L. A. Bastidas

Abstract. This paper presents evidence that model prediction uncertainty does not necessarily rise with parameter dimensionality (the number of parameters). Here by prediction we mean future simulation of a variable of interest conditioned on certain future values of input variables. We utilize a relationship between prediction uncertainty, sample size and model complexity based on Vapnik–Chervonenkis (VC) generalization theory. It suggests that models with higher complexity tend to have higher prediction uncertainty for limited sample size. However, model complexity is not necessarily related to the number of parameters. Here by limited sample size we mean a sample size that is limited in representing the dynamics of the underlying processes. Based on VC theory, we demonstrate that model complexity crucially depends on the magnitude of model parameters. We do this by using two model structures, SAC-SMA and its simplification, SIXPAR, and 5 MOPEX basin data sets across the United States. We conclude that parsimonious model selection based on parameter dimensionality may lead to a less informed model choice.
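The sample-size dependence invoked here can be made concrete with the classical VC generalization bound (stated in its standard textbook form; the paper works within the VC framework but not necessarily with this exact expression): with probability at least 1 − η,

```latex
R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
```

where R is the expected risk, R_emp the empirical risk on n samples, and h the VC dimension. For fixed, limited n, a larger h (higher complexity) widens the bound, while h itself need not grow with the number of parameters, which is consistent with the authors' point that complexity depends on parameter magnitudes rather than parameter count.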


2019 ◽  
Author(s):  
Pengchao Ye ◽  
Wenbin Ye ◽  
Congting Ye ◽  
Shuchao Li ◽  
Lishan Ye ◽  
...  

Abstract Motivation Single-cell RNA-sequencing (scRNA-seq) is fast becoming a powerful technique for studying dynamic gene regulation at unprecedented resolution. However, scRNA-seq data suffer from extremely high dropout rates and cell-to-cell variability, demanding new methods to recover lost gene expression. Despite the availability of various dropout-imputation approaches for scRNA-seq, most studies focus on data with a medium or large number of cells, while few have explicitly investigated differential performance across sample sizes or applicability to small or imbalanced data. It is imperative to develop new imputation approaches with higher generalizability for data of various sample sizes. Results We propose scHinter, a method for imputing dropout events in scRNA-seq data with special emphasis on limited sample sizes. scHinter incorporates a voting-based ensemble distance and leverages the synthetic minority oversampling technique for random interpolation. A hierarchical framework is also embedded in scHinter to increase the reliability of imputation for small samples. We demonstrate the ability of scHinter to recover gene expression measurements across a wide spectrum of scRNA-seq datasets with varied sample sizes, and comprehensively examine the impact of sample size and cluster number on imputation. Comprehensive evaluation of scHinter across diverse scRNA-seq datasets with imbalanced or limited sample sizes showed that scHinter achieved higher and more robust performance than competing approaches, including MAGIC, scImpute, SAVER and netSmooth. Availability and implementation Freely available for download at https://github.com/BMILAB/scHinter. Supplementary information Supplementary data are available at Bioinformatics online.
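The random-interpolation idea that scHinter borrows from the synthetic minority oversampling technique (SMOTE) can be sketched generically: each synthetic profile lies on the segment between a cell and one of its k nearest neighbours. This is an illustration of the underlying oversampling principle only, not scHinter's actual implementation (which adds a voting-based ensemble distance and a hierarchical framework); all names here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_interpolate(cells, k=3, n_new=10):
    """SMOTE-style random interpolation over a cells-by-genes matrix:
    each synthetic profile is a random convex combination of a cell
    and one of its k nearest neighbours (Euclidean distance)."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(cells))
        d = np.linalg.norm(cells - cells[i], axis=1)
        nn = np.argsort(d)[1:k + 1]        # k nearest neighbours, skipping self
        j = rng.choice(nn)
        lam = rng.random()                 # random point on the segment
        out.append(cells[i] + lam * (cells[j] - cells[i]))
    return np.array(out)

# toy expression matrix: 6 cells x 4 genes
cells = rng.random((6, 4))
synth = smote_interpolate(cells, k=2, n_new=5)
```

Because each synthetic profile is a convex combination of two observed cells, every gene value stays within the observed range, which is what makes this style of interpolation attractive for augmenting very small samples.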


2020 ◽  
Vol 227 ◽  
pp. 105534 ◽  
Author(s):  
Jing Luan ◽  
Chongliang Zhang ◽  
Binduo Xu ◽  
Ying Xue ◽  
Yiping Ren

2013 ◽  
Vol 8 (3) ◽  
pp. 647-690 ◽  
Author(s):  
Jennifer Lynn Clarke ◽  
Bertrand Clarke ◽  
Chi-Wai Yu

Author(s):  
Jens Nußberger ◽  
Frederic Boesel ◽  
Stefan Lenz ◽  
Harald Binder ◽  
Moritz Hess

AbstractDeep generative models can be trained to represent the joint distribution of data, such as measurements of single nucleotide polymorphisms (SNPs) from several individuals. Subsequently, synthetic observations are obtained by drawing from this distribution. This has been shown to be useful for several tasks, such as removal of noise, imputation, for better understanding underlying patterns, or even exchanging data under privacy constraints. Yet, it is still unclear how well these approaches work with limited sample size. We investigate such settings specifically for binary data, e.g., as relevant when considering SNP measurements, and evaluate three frequently employed generative modeling approaches, variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs). This includes conditional approaches, such as when considering gene expression conditional on SNPs. Recovery of pair-wise odds ratios is considered as a primary performance criterion. For simulated as well as real SNP data, we observe that DBMs generally can recover structure for up to 100 variables with as little as 500 observations, with a tendency of over-estimating odds ratios when not carefully tuned. VAEs generally get the direction and relative strength of pairwise relations right, yet with considerable under-estimation of odds ratios. GANs provide stable results only with larger sample sizes and strong pair-wise relations in the data. Taken together, DBMs and VAEs (in contrast to GANs) appear to be well suited for binary omics data, even at rather small sample sizes. This opens the way for many potential applications where synthetic observations from omics data might be useful.
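The primary performance criterion here, recovery of pairwise odds ratios, is straightforward to compute from binary data. The sketch below shows one common way to do it from the 2x2 contingency table, with a continuity correction; it is a generic illustration of the evaluation quantity, not the authors' evaluation code, and the toy data are invented.

```python
import numpy as np

def pairwise_odds_ratio(x, y, eps=0.5):
    """Odds ratio from the 2x2 contingency table of two binary variables,
    with a Haldane-Anscombe correction (eps) to avoid zero cells."""
    a = np.sum((x == 1) & (y == 1)) + eps
    b = np.sum((x == 1) & (y == 0)) + eps
    c = np.sum((x == 0) & (y == 1)) + eps
    d = np.sum((x == 0) & (y == 0)) + eps
    return (a * d) / (b * c)

# toy example: two positively associated binary "SNP" vectors
rng = np.random.default_rng(0)
z = rng.random(1000)
x = (z > 0.5).astype(int)
y = ((z + 0.2 * rng.standard_normal(1000)) > 0.5).astype(int)
or_xy = pairwise_odds_ratio(x, y)
```

Comparing the matrix of such odds ratios between the real data and synthetic draws from a VAE, DBM or GAN gives exactly the kind of criterion the abstract describes: an over-estimating generator inflates these ratios, an under-estimating one shrinks them toward 1.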


PLoS ONE ◽  
2019 ◽  
Vol 14 (11) ◽  
pp. e0224365 ◽  
Author(s):  
Andrius Vabalas ◽  
Emma Gowen ◽  
Ellen Poliakoff ◽  
Alexander J. Casson

2018 ◽  
Vol 7 (3.25) ◽  
pp. 44
Author(s):  
N Harudin ◽  
Jamaludin K R ◽  
M Nabil Muhtazaruddin ◽  
Ramlie F ◽  
S H Ismail ◽  
...  

The Mahalanobis Taguchi System is an analytical tool involving classification, clustering, and prediction techniques. The T-Method, which is part of it, is a multivariate analysis technique designed mainly for prediction and optimization. An advantage of the T-Method is that prediction remains possible even with a limited sample size. In applying the T-Method, the analyst is advised to clearly understand the trend and state of the data population: the method handles limited-sample-size data well, but larger or very large samples raise additional concerns. The T-Method is not claimed to be robust to outliers, so applying it to high-sample data puts prediction accuracy at risk. Retaining outliers in the overall analysis may produce non-normality and cause the classical methods to break down. Given this risk, it is important to reduce the risk of lower accuracy in the individual estimates so that overall prediction accuracy improves. To that end, several robust parameter estimators exist, such as the M-estimator, which can give good results whether or not the data contain outliers. A generalized inverse regression estimator (GIR) was also used in this research, along with ordinary least squares (OLS), as part of a comparison study. Embedding these methods into the T-Method's individual estimates can help enhance the accuracy of the T-Method while allowing the robustness of the T-Method itself to be analyzed. However, across the three main case studies used in this analysis, the T-Method delivered better and acceptable performance, with error percentages ranging from 2.5% to 22.8%, compared with the other methods. The M-estimator proved sensitive to data containing leverage points on the x-axis and to data with limited sample size.
Based on these three case studies alone, it can be concluded that the robust M-estimator is not currently feasible to apply within the T-Method. Further analysis is needed to address issues such as the airfoil-noise case study, for which the T-Method produced the highest prediction error, before a better-supported conclusion can be drawn.
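For readers unfamiliar with the robust alternative discussed above, a standard Huber M-estimator for linear regression can be computed by iteratively reweighted least squares (IRLS). The sketch below is a generic textbook version, not the estimator configuration used in this study, and the data are invented to show the down-weighting of outliers.

```python
import numpy as np

def huber_irls(x, y, delta=1.345, n_iter=50):
    """Huber M-estimator for simple linear regression via iteratively
    reweighted least squares: residuals beyond delta robust-scale units
    get weight delta/|u| instead of 1, limiting outlier influence."""
    X1 = np.column_stack([np.ones(len(x)), x])          # add intercept
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]        # OLS starting values
    for _ in range(n_iter):
        r = y - X1 @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12  # robust scale (MAD)
        u = r / s
        w = np.where(np.abs(u) <= delta, 1.0, delta / np.abs(u))  # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)[0]
    return beta

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)
y[:3] += 30                                             # inject gross outliers
beta = huber_irls(x, y)                                 # [intercept, slope]
```

With the three contaminated points, OLS would be pulled toward the outliers, while the Huber weights shrink their influence toward zero, recovering a slope near the true value of 2. The sensitivity to x-axis leverage points noted in the abstract is a known limitation of this estimator: Huber weighting bounds the influence of large residuals but not of extreme predictor values.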

