Synthetic observations from deep generative models and binary omics data with limited sample size

Author(s):  
Jens Nußberger ◽  
Frederic Boesel ◽  
Stefan Lenz ◽  
Harald Binder ◽  
Moritz Hess

Abstract Deep generative models can be trained to represent the joint distribution of data, such as measurements of single nucleotide polymorphisms (SNPs) from several individuals. Subsequently, synthetic observations are obtained by drawing from this distribution. This has been shown to be useful for several tasks, such as noise removal, imputation, better understanding of underlying patterns, or even exchanging data under privacy constraints. Yet, it is still unclear how well these approaches work with limited sample size. We investigate such settings specifically for binary data, e.g., as relevant when considering SNP measurements, and evaluate three frequently employed generative modeling approaches: variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs). This includes conditional approaches, such as when considering gene expression conditional on SNPs. Recovery of pairwise odds ratios is used as the primary performance criterion. For simulated as well as real SNP data, we observe that DBMs generally can recover structure for up to 100 variables with as few as 500 observations, with a tendency of over-estimating odds ratios when not carefully tuned. VAEs generally get the direction and relative strength of pairwise relations right, yet with considerable under-estimation of odds ratios. GANs provide stable results only with larger sample sizes and strong pairwise relations in the data. Taken together, DBMs and VAEs (in contrast to GANs) appear to be well suited for binary omics data, even at rather small sample sizes. This opens the way for many potential applications where synthetic observations from omics data might be useful.
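As a rough illustration of the evaluation criterion used above, the following Python sketch computes all pairwise odds ratios of a binary data matrix with a Haldane-Anscombe correction and compares real against synthetic data on the log scale; the arrays `real` and `synthetic` and the correction constant are placeholders for illustration, not part of the published pipeline.

```python
import numpy as np

def pairwise_odds_ratios(x, eps=0.5):
    """Pairwise odds ratios for a binary matrix x (observations x variables).

    A Haldane-Anscombe correction `eps` is added to every cell of each
    2x2 contingency table to avoid division by zero.
    """
    n_vars = x.shape[1]
    ors = np.full((n_vars, n_vars), np.nan)
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            a = np.sum((x[:, i] == 1) & (x[:, j] == 1)) + eps  # both 1
            b = np.sum((x[:, i] == 1) & (x[:, j] == 0)) + eps
            c = np.sum((x[:, i] == 0) & (x[:, j] == 1)) + eps
            d = np.sum((x[:, i] == 0) & (x[:, j] == 0)) + eps  # both 0
            ors[i, j] = ors[j, i] = (a * d) / (b * c)
    return ors

# Compare log odds ratios of real vs. synthetic data (placeholder arrays).
rng = np.random.default_rng(0)
real = rng.integers(0, 2, size=(500, 20))
synthetic = rng.integers(0, 2, size=(500, 20))
iu = np.triu_indices(20, k=1)
deviation = np.log(pairwise_odds_ratios(synthetic))[iu] - np.log(pairwise_odds_ratios(real))[iu]
print("mean absolute log-OR deviation:", np.abs(deviation).mean())
```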


2019 ◽  
Author(s):  
Pengchao Ye ◽  
Wenbin Ye ◽  
Congting Ye ◽  
Shuchao Li ◽  
Lishan Ye ◽  
...  

Abstract Motivation: Single-cell RNA-sequencing (scRNA-seq) is fast becoming a powerful technique for studying dynamic gene regulation at unprecedented resolution. However, scRNA-seq data suffer from an extremely high dropout rate and cell-to-cell variability, demanding new methods to recover lost gene expression. Despite the availability of various dropout imputation approaches for scRNA-seq, most studies focus on data with a medium or large number of cells, while few studies have explicitly investigated the differential performance across sample sizes or the applicability of the approaches to small or imbalanced data. It is imperative to develop new imputation approaches with higher generalizability for data with various sample sizes. Results: We proposed a method called scHinter for imputing dropout events in scRNA-seq data, with special emphasis on data with limited sample size. scHinter incorporates a voting-based ensemble distance and leverages the synthetic minority oversampling technique (SMOTE) for random interpolation. A hierarchical framework is also embedded in scHinter to increase the reliability of the imputation for small samples. We demonstrated the ability of scHinter to recover gene expression measurements across a wide spectrum of scRNA-seq datasets with varied sample sizes, and comprehensively examined the impact of sample size and cluster number on imputation. Comprehensive evaluation of scHinter across diverse scRNA-seq datasets with imbalanced or limited sample size showed that scHinter achieved higher and more robust performance than competing approaches, including MAGIC, scImpute, SAVER and netSmooth. Availability and implementation: Freely available for download at https://github.com/BMILAB/scHinter. Supplementary information: Supplementary data are available at Bioinformatics online.
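The SMOTE-style random interpolation that scHinter builds on can be sketched in a few lines of Python; this toy version, with plain Euclidean distances and a zero-as-dropout heuristic, only illustrates the interpolation idea and is not the scHinter implementation (which uses a voting-based ensemble distance and a hierarchical framework).

```python
import numpy as np

def smote_style_impute(expr, k=5, rng=None):
    """Illustrative SMOTE-style imputation for a cells x genes count matrix.

    For each zero entry of a cell (treated as a candidate dropout), a value
    is interpolated between the cell and one of its k nearest neighbours:
    x + u * (neighbour - x), with u ~ Uniform(0, 1).
    """
    rng = np.random.default_rng(rng)
    imputed = expr.copy().astype(float)
    # Plain Euclidean distances between cells (an assumption for this sketch).
    dists = np.linalg.norm(expr[:, None, :] - expr[None, :, :], axis=2)
    for i in range(expr.shape[0]):
        neighbours = np.argsort(dists[i])[1:k + 1]  # skip the cell itself
        donor = rng.choice(neighbours)              # random neighbour as donor
        u = rng.uniform()
        dropouts = expr[i] == 0
        imputed[i, dropouts] = expr[i, dropouts] + u * (expr[donor, dropouts] - expr[i, dropouts])
    return imputed

# Toy usage with a random counts matrix (20 cells x 50 genes).
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(20, 50))
print(smote_style_impute(counts, k=3, rng=1).shape)
```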


2013 ◽  
Vol 113 (1) ◽  
pp. 221-224 ◽  
Author(s):  
David R. Johnson ◽  
Lauren K. Bachan

In a recent article, Regan, Lakhanpal, and Anguiano (2012) highlighted the lack of evidence for different relationship outcomes between arranged and love-based marriages. Yet the sample size (n = 58) used in the study is insufficient for making such inferences. This reply discusses and demonstrates how small sample sizes reduce the utility of this research.
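To make the concern concrete, a normal-approximation sketch of the minimum detectable standardized difference at n = 58 (here assumed, purely for illustration, to be split into two groups of 29) shows that only fairly large effects could be detected with 80% power.

```python
from scipy.stats import norm

def min_detectable_d(n1, n2, alpha=0.05, power=0.80):
    """Smallest Cohen's d detectable by a two-sided two-sample test
    (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * ((1 / n1 + 1 / n2) ** 0.5)

# With 58 subjects split into two groups of 29:
print(round(min_detectable_d(29, 29), 2))  # roughly d = 0.74
```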


PLoS ONE ◽  
2018 ◽  
Vol 13 (6) ◽  
pp. e0197910 ◽  
Author(s):  
Alexander Kirpich ◽  
Elizabeth A. Ainsworth ◽  
Jessica M. Wedow ◽  
Jeremy R. B. Newman ◽  
George Michailidis ◽  
...  

Author(s):  
Emilie Laurin ◽  
Julia Bradshaw ◽  
Laura Hawley ◽  
Ian A. Gardner ◽  
Kyle A Garver ◽  
...  

Proper sample size must be considered when designing infectious-agent prevalence studies for mixed-stock fisheries, because bias and uncertainty complicate interpretation of apparent (test)-prevalence estimates. Sample sizes vary between stocks and are often smaller than expected during wild-salmonid surveys. Our case example of 2010-2016 survey data of Sockeye salmon (Oncorhynchus nerka) from different stocks of origin in British Columbia, Canada, illustrated the effect of sample size on apparent-prevalence interpretation. Molecular testing (viral RNA RT-qPCR) for infectious hematopoietic necrosis virus (IHNv) revealed large differences in apparent prevalence across wild salmon stocks (much higher for Chilko Lake) and sampling locations (freshwater or marine), indicating both stock and host life-stage effects. Ten of the 13 marine non-Chilko stock-years with IHNv-positive results had small sample sizes (< 30 samples per stock-year), which, with imperfect diagnostic tests (particularly lower diagnostic sensitivity), could lead to inaccurate apparent-prevalence estimation. When calculating sample size for an expected apparent prevalence using different approaches, smaller sample sizes often led to decreased confidence in apparent-prevalence results and decreased power to detect a true difference from a reference value.
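The link between apparent and true prevalence under an imperfect diagnostic test can be sketched with the standard Rogan-Gladen correction; the sensitivity, specificity and counts below are placeholder values, not estimates from this survey.

```python
def rogan_gladen(apparent_prev, sensitivity, specificity):
    """Rogan-Gladen corrected (true) prevalence from apparent prevalence."""
    tp = (apparent_prev + specificity - 1) / (sensitivity + specificity - 1)
    return min(max(tp, 0.0), 1.0)  # clamp to the [0, 1] range

# Example with placeholder test characteristics:
# 2 test-positives out of 25 sampled fish, Se = 0.85, Sp = 0.99.
print(round(rogan_gladen(2 / 25, 0.85, 0.99), 3))  # about 0.083
```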


2019 ◽  
Author(s):  
J.M. Gorriz ◽  

Abstract In the 1970s, a novel branch of statistics emerged that focuses on selecting, in the pattern recognition problem, a function which fulfils a definite relationship between the quality of the approximation and its complexity. These data-driven approaches are mainly devoted to problems of estimating dependencies with limited sample sizes and comprise all the empirical out-of-sample generalization approaches, e.g., cross-validation (CV). Although the latter are not designed for testing competing hypotheses or comparing different models in neuroimaging, there are a number of theoretical developments within this theory which could be employed to derive a Statistical Agnostic (non-parametric) Mapping (SAM) at the voxel or multi-voxel level. Moreover, SAMs could (i) alleviate the problem of instability with limited sample sizes when estimating the actual risk via CV approaches, e.g., large error bars, and (ii) provide an alternative to family-wise error (FWE)-corrected p-value maps in inferential statistics for hypothesis testing. In this sense, we propose a novel framework in neuroimaging based on concentration inequalities, which results in (i) a rigorous development for model validation with a small sample/dimension ratio, and (ii) a less conservative procedure than FWE p-value correction to determine brain significance maps from inferences made using small upper bounds of the actual risk.
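One concentration inequality of the kind the proposed framework draws on is Hoeffding's bound, which for a fixed classifier with 0-1 loss turns the empirical error on n held-out samples into an upper bound on the actual risk; the error rate, n and delta in this sketch are placeholder values.

```python
import math

def hoeffding_risk_bound(empirical_risk, n, delta=0.05):
    """Upper bound on the actual risk of a fixed classifier with 0-1 loss:
    R(f) <= R_emp(f) + sqrt(ln(1/delta) / (2 n)),
    which holds with probability at least 1 - delta (for a classifier that
    was not selected on the same n samples)."""
    return empirical_risk + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# A held-out error of 0.20 on n = 100 samples:
print(round(hoeffding_risk_bound(0.20, 100), 3))  # about 0.322
```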


2021 ◽  
Author(s):  
Metin Bulus

A recent systematic review of experimental studies conducted in Turkey between 2010 and 2020 reported that small sample sizes had been a significant drawback (Bulus and Koyuncu, 2021). Only a small fraction of the studies were small-scale true experiments (subjects randomized into treatment and control groups). The remaining studies consisted of quasi-experiments (subjects in treatment and control groups were matched on a pretest or other covariates) and weak experiments (neither randomized nor matched, but with a control group). They had an average sample size below 70 for different domains and outcomes. These small sample sizes imply a strong (and perhaps erroneous) assumption about the minimum relevant effect size (MRES) of an intervention before an experiment is conducted; that is, that a standardized intervention effect of Cohen’s d < 0.50 is not relevant to education policy or practice. Thus, an introduction to sample size determination for pretest-posttest simple experimental designs is warranted. This study describes the nuts and bolts of sample size determination, derives expressions for optimal design under differential costs per treatment and control unit, provides convenient tables to guide sample size decisions for MRES values of 0.20 ≤ Cohen’s d ≤ 0.50, and describes the relevant software along with illustrations.
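A minimal sketch of the two ingredients described above, the per-group sample size for a given MRES and the optimal treatment-to-control allocation under differential unit costs; it uses a normal approximation and ignores the pretest covariate for simplicity, and the cost values are hypothetical rather than taken from the study.

```python
from math import ceil, sqrt
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample test
    of a standardized mean difference d (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z / d) ** 2)

def optimal_allocation_ratio(cost_treatment, cost_control):
    """Optimal treatment:control allocation under differential unit costs,
    n_t / n_c = sqrt(c_c / c_t), assuming equal outcome variances."""
    return sqrt(cost_control / cost_treatment)

print(n_per_group(0.50))                          # about 63 per group for d = 0.50
print(round(optimal_allocation_ratio(4, 1), 2))   # 0.5: half as many (costlier) treatment units
```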


2018 ◽  
Vol 7 (6) ◽  
pp. 68
Author(s):  
Karl Schweizer ◽  
Siegbert Reiß ◽  
Stefan Troche

An investigation of the suitability of threshold-based and threshold-free approaches for structural investigations of binary data is reported. Both approaches implicitly establish a relationship between binary data following the binomial distribution on the one hand and continuous random variables assumed to follow a normal distribution on the other hand. In two simulation studies, we investigated whether the fit results confirm the establishment of such a relationship, whether the differences between correct and incorrect models are retained, and to what degree the sample size influences the results. Both approaches proved to establish the relationship. With the threshold-free approach this was achieved by customary ML estimation, whereas robust ML estimation was necessary for the threshold-based approach. Discrimination between correct and incorrect models was observed for both approaches, with larger CFI differences found for the threshold-free approach than for the threshold-based approach. Dependency on sample size characterized the threshold-based approach but not the threshold-free approach: the threshold-based approach tended to perform better with large sample sizes, while the threshold-free approach performed better with smaller sample sizes.
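The relationship both approaches rely on, binary responses arising from thresholding correlated latent normal variables, can be illustrated with a short simulation; the latent correlation, threshold and sample size below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, rho, tau = 1000, 0.6, 0.0   # sample size, latent correlation, threshold

# Draw correlated latent normal variables and dichotomize at the threshold.
cov = np.array([[1.0, rho], [rho, 1.0]])
latent = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
binary = (latent > tau).astype(int)

# The Pearson (phi) correlation of the binary items is attenuated relative to
# the latent correlation; threshold-based (tetrachoric) modelling aims to
# recover the latent value.
print("latent rho:", rho)
print("phi of binary items:", round(np.corrcoef(binary.T)[0, 1], 3))
```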


2020 ◽  
Author(s):  
Chia-Lung Shih ◽  
Te-Yu Hung

Abstract Background: A small sample size (n < 30 per treatment group) is usually enrolled to investigate differences in efficacy between treatments for knee osteoarthritis (OA). The objective of this study was to use simulation to compare the power of four statistical methods for the analysis of small samples when detecting differences in efficacy between two treatments for knee OA. Methods: A total of 10,000 replicates of five sample sizes (n = 10, 15, 20, 25, and 30 per group) were generated based on previously reported measures of treatment efficacy. Four statistical methods were used to compare the differences in efficacy between treatments: the two-sample t-test (t-test), the Mann-Whitney U-test (M-W test), the Kolmogorov-Smirnov test (K-S test), and the permutation test (perm-test). Results: The bias of the simulated parameter means showed a decreasing trend with sample size, but the CV% of the simulated parameter means varied with sample size for all parameters. For the largest sample size (n = 30), the CV% reached a small level (< 20%) for almost all parameters, but the bias did not. Among the non-parametric tests for the analysis of small samples, the perm-test had the highest statistical power, and its false positive rate was not affected by sample size. However, the power of the perm-test did not reach a high value (80%) even with the largest sample size (n = 30). Conclusion: The perm-test is suggested for the analysis of small samples when comparing differences in efficacy between two treatments for knee OA.
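A minimal sketch of the permutation test favored in the conclusion, using the difference in group means as the test statistic; the two groups below are simulated placeholder data, not the efficacy measures used in the study.

```python
import numpy as np

def permutation_test(x, y, n_perm=10000, rng=None):
    """Two-sided permutation test for a difference in means between two groups."""
    rng = np.random.default_rng(rng)
    observed = np.mean(x) - np.mean(y)
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = np.mean(perm[:len(x)]) - np.mean(perm[len(x):])
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)   # add-one correction avoids p = 0

# Placeholder efficacy scores for two small treatment groups (n = 10 each).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=1.0, scale=1.0, size=10)
group_b = rng.normal(loc=0.2, scale=1.0, size=10)
print("permutation p-value:", permutation_test(group_a, group_b, rng=2))
```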


2020 ◽  
Vol 227 ◽  
pp. 105534 ◽  
Author(s):  
Jing Luan ◽  
Chongliang Zhang ◽  
Binduo Xu ◽  
Ying Xue ◽  
Yiping Ren
