genomicSimulation: fast R functions for stochastic simulation of breeding programs

2021 ◽  
Author(s):  
Kira Villiers ◽  
Eric Dinglasan ◽  
Ben J. Hayes ◽  
Kai P. Voss-Fels

Simulation tools are key to designing and optimising breeding programs, which are many-year, high-effort endeavours. Tools that operate on real genotypes and integrate easily with other analysis software are needed so that users can incorporate simulated data into their analysis and decision-making processes. This paper presents genomicSimulation, a fast and flexible tool for the stochastic simulation of crossing and selection on real genotypes. It is fully written in C for high execution speed, has minimal dependencies, and is available as an R package for integration with R's broad range of analysis and visualisation tools. Comparison of a simulated recreation of a breeding program to the real data shows that the tool's simulated offspring correctly reproduce key population features. Both versions of genomicSimulation are freely available on GitHub: the R package version at https://github.com/vllrs/genomicSimulation/ and the C library version at https://github.com/vllrs/genomicSimulationC
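
As an illustration of the core operation such a tool performs when generating offspring, the toy sketch below (ours, not genomicSimulation's C implementation; all names are hypothetical) produces one gamete from a phased parent by walking along the markers and switching between the two parental haplotypes with the given inter-marker recombination probability:

```python
# Toy sketch of stochastic gamete generation with recombination.
import random

def make_gamete(hap0, hap1, recomb, rng):
    """hap0/hap1: the parent's two haplotypes (lists of alleles).
    recomb[i]: recombination probability between marker i-1 and marker i."""
    haps = (hap0, hap1)
    current = rng.randrange(2)            # which haplotype the gamete starts on
    gamete = [haps[current][0]]
    for i in range(1, len(hap0)):
        if rng.random() < recomb[i]:      # crossover between markers i-1 and i
            current = 1 - current
        gamete.append(haps[current][i])
    return gamete

rng = random.Random(42)
gamete = make_gamete([0, 0, 0, 0], [1, 1, 1, 1], [0.0, 0.1, 0.1, 0.1], rng)
```

With all recombination probabilities at zero the gamete is a full copy of one parental haplotype; crossing two parents and repeating this per offspring gives the basic stochastic crossing step.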

2021 ◽  
Vol 20 (2) ◽  
pp. 69-77
Author(s):  
Adam Sagan

The paper presents a graphical approach to the decomposition of APC effects in cohort studies (mainly applied to demographic phenomena) using multilevel or accelerated longitudinal designs. The aim of the paper is to present and visualize the pure age, period and cohort effects based on simulated data with an increment of five for each successive age, period and cohort level. In cohort analysis on real data, all of the effects are usually interrelated. The analysis shows the basic patterns of bivariate APC decomposition (age within period, age within cohort, cohort within period, period within age, cohort within age, period within cohort) and reveals the trajectory of curves for each of the pure effects. The APC plots are developed using the apc R package.
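
A minimal sketch of the simulation scheme described above (our reconstruction, not the paper's code): each successive level of the chosen dimension adds an increment of five, and cohort is never a free index because of the identity cohort = period − age, which is what makes the three effects inseparable without extra constraints.

```python
# Generate one cell of an age-period table under a chosen "pure" effect.
def apc_response(age, period, increment=5.0,
                 age_effect=True, period_effect=False, cohort_effect=False):
    cohort = period - age          # the APC identity: cohort is determined
    y = 0.0
    if age_effect:
        y += increment * age
    if period_effect:
        y += increment * period
    if cohort_effect:
        y += increment * cohort
    return y

# A pure age effect is constant across periods for a fixed age:
row = [apc_response(age=3, period=p) for p in range(4)]
```

Plotting such rows against each of the six two-way arrangements (age within period, cohort within age, etc.) reproduces the characteristic trajectories of each pure effect.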


2020 ◽  
Author(s):  
Edlin J. Guerra-Castro ◽  
Juan Carlos Cajas ◽  
Nuno Simões ◽  
Juan J Cruz-Motta ◽  
Maite Mascaró

Abstract
SSP (simulation-based sampling protocol) is an R package that uses simulation of ecological data and the dissimilarity-based multivariate standard error (MultSE) as an estimator of precision to evaluate the adequacy of different sampling efforts for studies that will test hypotheses using permutational multivariate analysis of variance. The procedure consists of simulating several extensive data matrices that mimic some of the relevant ecological features of the community of interest using a pilot data set. For each simulated data set, several sampling efforts are repeatedly executed and MultSE is calculated. The mean value and the 0.025 and 0.975 quantiles of MultSE for each sampling effort across all simulated data sets are then estimated and standardized with respect to the lowest sampling effort. The optimal sampling effort is identified as the one at which an increase in sampling effort does not improve precision beyond a threshold value (e.g. 2.5%). The performance of SSP was validated using real data; in all examples the simulated data mimicked the real data well, allowing the relationship between MultSE and n to be evaluated beyond the sample size of the pilot studies. SSP can be used to estimate sample size in a wide range of situations, from simple (e.g. a single site) to more complex (e.g. several sites across different habitats) experimental designs. The latter constitutes an important advantage, since it offers new possibilities for complex sampling designs, as has been advised for multi-scale studies in ecology.
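
For readers unfamiliar with the precision estimator, here is a hedged sketch of the dissimilarity-based MultSE in the commonly cited form MultSE = sqrt(V / n), where the pseudo multivariate variance V is the sum of squared pairwise dissimilarities divided by n(n − 1). This is an illustration of the estimator, not the SSP implementation, and the function names are ours.

```python
import math

def mult_se(dissim):
    """dissim: full symmetric n x n matrix of pairwise dissimilarities."""
    n = len(dissim)
    ss = sum(dissim[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
    v = ss / (n * (n - 1))          # pseudo multivariate variance
    return math.sqrt(v / n)         # its standard-error analogue

# Sanity check: with Euclidean distances on univariate data, MultSE
# reduces to the ordinary standard error sd / sqrt(n).
data = [1.0, 3.0, 5.0]
d = [[abs(a - b) for b in data] for a in data]
```

Repeating this calculation over resampled subsets of increasing n, as SSP does, traces the MultSE-versus-n curve whose flattening identifies the optimal sampling effort.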


Author(s):  
Giacomo Baruzzo ◽  
Ilaria Patuzzi ◽  
Barbara Di Camillo

Abstract
Motivation: Single-cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, often showing poor or undemonstrated similarity to real data.
Results: In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim generates count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most widely used scRNA-seq simulators, Splat. In particular, SPARSim-simulated count matrices closely resemble the distribution of zeros across different expression intensities observed in real count data.
Availability and implementation: The SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim.
Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 12 (5) ◽  
pp. 771 ◽  
Author(s):  
Miguel Angel Ortíz-Barrios ◽  
Ian Cleland ◽  
Chris Nugent ◽  
Pablo Pancardo ◽  
Eric Järpe ◽  
...  

Automatic detection and recognition of Activities of Daily Living (ADLs) are crucial for providing effective care to frail older adults living alone. A step forward in addressing this challenge is the deployment of smart home sensors that capture the intrinsic nature of the ADLs performed by these people. As real-life scenarios are characterized by a comprehensive range of ADLs and smart home layouts, deviations are expected in the number of sensor events per activity (SEPA), a variable often used for training activity recognition models. Such models, however, rely on the availability of suitable and representative data, whose collection is habitually expensive and resource-intensive. Simulation tools are an alternative for tackling these barriers; nonetheless, an ongoing challenge is their ability to generate synthetic data representative of the real SEPA. Hence, this paper proposes the use of Poisson regression modelling to transform simulated data into a better approximation of the real SEPA. First, synthetic and real data were compared to verify the equivalence hypothesis. Then, several Poisson regression models were formulated for estimating the real SEPA from simulated data. The outcomes revealed that the real SEPA can be better approximated (predictive R² = 92.72%) if synthetic data are post-processed through Poisson regression incorporating dummy variables.
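
As a minimal sketch of the technique named above (not the authors' models), the code below fits a two-parameter Poisson regression with log link, E[y] = exp(b0 + b1·x), by Newton's method, where x can be a dummy covariate such as activity type. Names and data are illustrative.

```python
import math

def fit_poisson(x, y, iters=50):
    """Newton-Raphson for Poisson regression with one regressor and intercept."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector X'(y - mu) and observed information X' diag(mu) X
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det    # solve the 2x2 Newton step
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# With a binary dummy x, the fit recovers the log of each group's mean count:
b0, b1 = fit_poisson([0, 0, 1, 1], [2, 2, 6, 6])
```

Predicting real SEPA from simulated SEPA then amounts to evaluating exp(b0 + b1·x) with the simulated covariates, the post-processing step the abstract describes.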


Author(s):  
Krzysztof J Szkop ◽  
David S Moss ◽  
Irene Nobeli

Abstract
Motivation: We present flexible Modeling of Alternative PolyAdenylation (flexiMAP), a new beta-regression-based method implemented in R for discovering differential alternative polyadenylation events in standard RNA-seq data.
Results: We show, using both simulated and real data, that flexiMAP exhibits a good balance between specificity and sensitivity and compares favourably to existing methods, especially at low fold changes. In addition, the tests on simulated data reveal some hitherto unrecognized caveats of existing methods. Importantly, flexiMAP allows modeling of multiple known covariates that often confound the results of RNA-seq data analysis.
Availability and implementation: The flexiMAP R package is available at https://github.com/kszkop/flexiMAP. Scripts and data to reproduce the analysis in this paper are available at https://doi.org/10.5281/zenodo.3689788.
Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Zhao Li ◽  
Jin Li ◽  
Peng Yu

Abstract
Background: LINCS L1000 is a high-throughput technology that allows gene expression measurement in a large number of assays. However, to fit the measurements of ~1000 genes into the ~500 color channels of LINCS L1000, every two landmark genes are designed to share a single channel. Thus, a deconvolution step is required to infer the expression values of each gene. Any errors in this step can propagate adversely to the downstream analyses.
Results: We present l1kdeconv, a LINCS L1000 data peak calling R package based on a new outlier detection method and an aggregate Gaussian mixture model (AGMM). Upon the removal of outliers and the borrowing of information among similar samples, l1kdeconv showed more stable and better performance than methods commonly used in LINCS L1000 data deconvolution.
Conclusions: Based on benchmarks using both simulated data and real data, the l1kdeconv package achieved more stable results than the commonly used LINCS L1000 data deconvolution methods.
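
To make the deconvolution step concrete, here is a toy sketch (ours, not l1kdeconv's AGMM): with two genes sharing one channel, peak calling can be viewed as fitting a two-component Gaussian mixture to the channel's bead intensities and reporting each component's mean as one gene's expression estimate.

```python
import math

def _variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def em_two_gaussians(xs, m0, m1, iters=100):
    """EM for a 1-D mixture of two Gaussians with weights fixed at 0.5 each."""
    v0 = v1 = max(1e-3, _variance(xs))
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in xs:
            d0 = math.exp(-(x - m0) ** 2 / (2 * v0)) / math.sqrt(v0)
            d1 = math.exp(-(x - m1) ** 2 / (2 * v1)) / math.sqrt(v1)
            r.append(d1 / (d0 + d1))
        # M-step: responsibility-weighted means and variances
        w1 = sum(r)
        w0 = len(xs) - w1
        m0 = sum((1 - ri) * x for ri, x in zip(r, xs)) / w0
        m1 = sum(ri * x for ri, x in zip(r, xs)) / w1
        v0 = max(1e-3, sum((1 - ri) * (x - m0) ** 2 for ri, x in zip(r, xs)) / w0)
        v1 = max(1e-3, sum(ri * (x - m1) ** 2 for ri, x in zip(r, xs)) / w1)
    return m0, m1

# Two well-separated peaks: the component means estimate the two genes' levels.
beads = [1.0, 1.2, 0.8, 1.1, 9.0, 9.2, 8.8, 9.1]
lo, hi = em_two_gaussians(beads, m0=2.0, m1=7.0)
```

The package's actual contribution, per the abstract, is in handling outliers and borrowing information across similar samples before this kind of mixture fit, which the sketch omits.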


2019 ◽  
Author(s):  
David Gerard

Abstract
With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations of theoretical models, this can result in unsubstantiated claims about a method's performance. Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq method developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.
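
A hedged sketch of the binomial-thinning idea that seqgendiff is built around (our toy version with our own names, not the package's API): to plant a known log2 effect b on a gene, each real count y is replaced by a Binomial(y, 2^(b·x)) draw with b ≤ 0, which scales the expected count by 2^(b·x) while keeping the real data's noise structure.

```python
import random

def thin_counts(counts, log2_effects, design, rng):
    """counts[g][i]: count for gene g in sample i; design[i]: covariate of sample i.
    log2_effects[g] must be <= 0 so the thinning probability stays in [0, 1]."""
    out = []
    for g, row in enumerate(counts):
        new_row = []
        for i, y in enumerate(row):
            p = 2.0 ** (log2_effects[g] * design[i])
            # Binomial(y, p) draw by y Bernoulli trials (fine for a sketch)
            new_row.append(sum(1 for _ in range(y) if rng.random() < p))
        out.append(new_row)
    return out

rng = random.Random(1)
thinned = thin_counts([[1000, 1000]], [-1.0], [0, 1], rng)
# design = 0 leaves the count untouched (p = 1); design = 1 halves it on average
```

Because the signal is injected into real counts rather than drawn from a parametric model, the thinned matrix inherits the real data's library sizes, overdispersion and other "annoying" attributes.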


2020 ◽  
Author(s):  
Silvia Grieder ◽  
Markus D. Steiner

A statistical procedure is assumed to produce comparable results across programs. Using the case of an exploratory factor analysis procedure—principal axis factoring (PAF) and promax rotation—we show that this assumption is not always justified. Procedures with equal names are sometimes implemented differently across programs: a jingle fallacy. Focusing on two popular statistical analysis programs, we indeed discovered a jingle jungle for the above procedure: Both PAF and promax rotation are implemented differently in the psych R package and in SPSS. Based on analyses with 230 real and 216,000 simulated data sets implementing 108 different data structures, we show that these differences in implementation can result in fairly different factor solutions for a variety of data structures. Differences in the solutions for real data sets ranged from negligible to very large, with 38% displaying at least one different indicator-to-factor correspondence. A simulation study revealed systematic differences in accuracy between implementations, and large variation between data structures, with small numbers of indicators per factor, high factor intercorrelations, and weak factors resulting in the lowest accuracies. Moreover, although there was no single combination of settings that was superior for all data structures, we identified implementations of PAF and promax that maximize performance on average. We recommend that researchers use these implementations as the best way through the jungle, discuss model averaging as a potential alternative, and highlight the importance of adhering to best practices of scale construction.


2015 ◽  
Vol 19 - 2015 - Special... ◽  
Author(s):  
W.E. Wansouwé ◽  
C.C. Kokonendji ◽  
D.T. Kolyang

Kernel smoothing is one of the most widely used nonparametric data smoothing techniques. We introduce a new R package, Disake, for computing discrete associated kernel estimators for probability mass functions. When working with a kernel estimator, two choices must be made: the kernel function and the smoothing parameter. The Disake package focuses on discrete associated kernels and on cross-validation and local Bayesian techniques to select the appropriate bandwidth. Applications to simulated and real data show that the binomial kernel is appropriate for small or moderate count data, while the empirical estimator or the discrete triangular kernel is indicated for large samples.


2021 ◽  
pp. 089443932110408
Author(s):  
Jose M. Pavía

Ecological inference models aim to infer individual-level relationships from aggregate data. They are routinely used to estimate voter transitions between elections, disclose split-ticket voting behaviors, or infer racial voting patterns in U.S. elections. A large number of procedures have been proposed in the literature to solve these problems; an assessment and comparison of them are therefore overdue. The secret ballot, however, makes this a difficult endeavor, since real individual-level data are usually not accessible. The most recent work on ecological inference has assessed methods using a very small number of data sets with ground truth, combined with artificial, simulated data. This article dramatically increases the number of real instances by presenting a unique database (available in the R package ei.Datasets) composed of data from more than 550 elections where the true inner-cell values of the global cross-classification tables are known. The article describes how the data sets are organized, details the data curation and data wrangling processes performed, and analyses the main features characterizing the different data sets.

