Estimating Migration of Gonioctena quinquepunctata (Coleoptera: Chrysomelidae) Inside a Mountain Range in a Spatially Explicit Context

Abstract The cold-tolerant leaf beetle Gonioctena quinquepunctata displays a large but fragmented European distribution and is restricted to mountain regions in the southern part of its range. Using a RAD-seq-generated large single nucleotide polymorphism (SNP) data set (> 10,000 loci), we investigated the geographic distribution of genetic variation within the Vosges mountains (eastern France), where the species is common. To translate this pattern of variation into an estimate of its capacity to disperse, we simulated SNP data under a spatially explicit model of population evolution (essentially a grid overlapping a map, in which each cell is considered a different population) and compared the simulated and real data with an approximate Bayesian computation (ABC) approach. For this purpose, we assessed a new SNP statistic, the DSVSF (distribution of spatial variation in SNP frequencies) that summarizes genetic variation in a spatially explicit context, and compared its usefulness to standard statistics often used in population genetic analyses. A test of our overall strategy was conducted with simulated data and showed that it can provide a good estimate of the level of dispersal of an organism over its geographic range. The results of our analyses suggested that this insect disperses well within the Vosges mountains, much more than was initially expected given the current and probably past fragmentation of its habitat and given the results of previous studies on genetic variation in other mountain leaf beetles.

Download Full-text

A Growth Model for Multilevel Ordinal Data

Journal of Educational and Behavioral Statistics ◽

10.3102/10769986030004369 ◽

2005 ◽

Vol 30 (4) ◽

pp. 369-396 ◽

Cited By ~ 8

Author(s):

Eisuke Segawa

Keyword(s):

Latent Variable ◽

Ordinal Data ◽

Linear Models ◽

Growth Models ◽

Simulated Data ◽

Real Data ◽

Analytic Structure ◽

Data Sets ◽

Data Set ◽

Time Points

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze growth of a trait latent variable measured by ordinal items. Items are nested within a time-point, and time-points are nested within subject. These models are special because they include factor analytic structure. This model can analyze not only data with item- and time-level missing observations, but also data with time points freely specified over subjects. Furthermore, features useful for longitudinal analyses, “autoregressive error degree one” structure for the trait residuals and estimated time-scores, were included. The approach is Bayesian with Markov Chain and Monte Carlo, and the model is implemented in WinBUGS. They are illustrated with two simulated data sets and one real data set with planned missing items within a scale.

Download Full-text

SSP: An R package to estimate sampling effort in studies of ecological communities

10.1101/2020.03.19.996991 ◽

2020 ◽

Author(s):

Edlin J. Guerra-Castro ◽

Juan Carlos Cajas ◽

Nuno Simões ◽

Juan J Cruz-Motta ◽

Maite Mascaró

Keyword(s):

Simulated Data ◽

Real Data ◽

R Package ◽

Sampling Effort ◽

Ecological Communities ◽

Ecological Data ◽

Data Set ◽

Pilot Studies ◽

Ecological Features ◽

Wide Range

ABSTRACTSSP (simulation-based sampling protocol) is an R package that uses simulation of ecological data and dissimilarity-based multivariate standard error (MultSE) as an estimator of precision to evaluate the adequacy of different sampling efforts for studies that will test hypothesis using permutational multivariate analysis of variance. The procedure consists in simulating several extensive data matrixes that mimic some of the relevant ecological features of the community of interest using a pilot data set. For each simulated data, several sampling efforts are repeatedly executed and MultSE calculated. The mean value, 0.025 and 0.975 quantiles of MultSE for each sampling effort across all simulated data are then estimated and standardized regarding the lowest sampling effort. The optimal sampling effort is identified as that in which the increase in sampling effort do not improve the precision beyond a threshold value (e.g. 2.5 %). The performance of SSP was validated using real data, and in all examples the simulated data mimicked well the real data, allowing to evaluate the relationship MultSE – n beyond the sampling size of the pilot studies. SSP can be used to estimate sample size in a wide range of situations, ranging from simple (e.g. single site) to more complex (e.g. several sites for different habitats) experimental designs. The latter constitutes an important advantage, since it offers new possibilities for complex sampling designs, as it has been advised for multi-scale studies in ecology.

Download Full-text

Smoothing Gene Expression Data with Network Information Improves Consistency of Regulated Genes

Statistical Applications in Genetics and Molecular Biology ◽

10.2202/1544-6115.1618 ◽

2011 ◽

Vol 10 (1) ◽

Cited By ~ 6

Author(s):

Guro Dørum ◽

Lars Snipen ◽

Margrete Solheim ◽

Solve Saebo

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Networks ◽

Simulated Data ◽

Real Data ◽

Biological Knowledge ◽

Expression Data ◽

Data Set ◽

Gene Set ◽

Network Information

Gene set analysis methods have become a widely used tool for including prior biological knowledge in the statistical analysis of gene expression data. Advantages of these methods include increased sensitivity, easier interpretation and more conformity in the results. However, gene set methods do not employ all the available information about gene relations. Genes are arranged in complex networks where the network distances contain detailed information about inter-gene dependencies. We propose a method that uses gene networks to smooth gene expression data with the aim of reducing the number of false positives and identify important subnetworks. Gene dependencies are extracted from the network topology and are used to smooth genewise test statistics. To find the optimal degree of smoothing, we propose using a criterion that considers the correlation between the network and the data. The network smoothing is shown to improve the ability to identify important genes in simulated data. Applied to a real data set, the smoothing accentuates parts of the network with a high density of differentially expressed genes.

Download Full-text

ESTIMATION OF EXTREME QUANTILES: EMPIRICAL TOOLS FOR METHODS ASSESSMENT AND COMPARISON

International Journal of Reliability Quality and Safety Engineering ◽

10.1142/s0218539300000079 ◽

2000 ◽

Vol 07 (01) ◽

pp. 75-94 ◽

Cited By ~ 3

Author(s):

J. DIEBOLT ◽

M.-A. EL-AROUI ◽

V. DURBEC ◽

B. VILLAIN

Keyword(s):

Goodness Of Fit ◽

Simulated Data ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Extreme Quantiles ◽

Maintenance Policies ◽

Simulated Data Sets ◽

Industrial Context

When extreme quantiles have to be estimated from a given data set, the classical parametric approach can lead to very poor estimations. This has led to the introduction of specific methods for estimating extreme quantiles (MEEQ's) in a nonparametric spirit, e.g., Pickands excess method, methods based on Hill's estimate of the Pareto index, exponential tail (ET) and quadratic tail (QT) methods. However, no practical technique for assessing and comparing these MEEQ's when they are to be used on a given data set is available. This paper is a first attempt to provide such techniques. We first compare the estimations given by the main MEEQ's on several simulated data sets. Then we suggest goodness-of-fit (Gof) tests to assess the MEEQ's by measuring the quality of their underlying approximations. It is shown that Gof techniques bring very relevant tools to assess and compare ET and excess methods. Other empirical criterions for comparing MEEQ's are also proposed and studied through Monte-Carlo analyses. Finally, these assessment and comparison techniques are experimented on real-data sets issued from an industrial context where extreme quantiles are needed to define maintenance policies.

Download Full-text

Identifying and Classifying Aberrant Response Patterns Through Functional Data Analysis

Journal of Educational and Behavioral Statistics ◽

10.3102/1076998620911941 ◽

2020 ◽

Vol 45 (6) ◽

pp. 719-749

Author(s):

Eduardo Doval ◽

Pedro Delicado

Keyword(s):

Data Analysis ◽

Functional Data Analysis ◽

Functional Data ◽

Simulated Data ◽

Real Data ◽

Response Patterns ◽

Fit Indices ◽

Person Fit ◽

Data Set ◽

Aberrant Response Patterns

We propose new methods for identifying and classifying aberrant response patterns (ARPs) by means of functional data analysis. These methods take the person response function (PRF) of an individual and compare it with the pattern that would correspond to a generic individual of the same ability according to the item-person response surface. ARPs correspond to atypical difference functions. The ARP classification is done with functional data clustering applied to the PRFs identified as ARP. We apply these methods to two sets of simulated data (the first is used to illustrate the ARP identification methods and the second demonstrates classification of the response patterns flagged as ARP) and a real data set (a Grade 12 science assessment test, SAT, with 32 items answered by 600 examinees). For comparative purposes, ARPs are also identified with three nonparametric person-fit indices (Ht, Modified Caution Index, and ZU3). Our results indicate that the ARP detection ability of one of our proposed methods is comparable to that of person-fit indices. Moreover, the proposed classification methods enable ARP associated with either spuriously low or spuriously high scores to be distinguished.

Download Full-text

Sources of Artifacts in SLODR Detection

Psychology in Russia State of Art ◽

10.11621/pir.2021.0107 ◽

2021 ◽

Vol 14 (1) ◽

pp. 86-100

Author(s):

Aleksei A. Korneev ◽

Anatoly N. Krichevets ◽

Konstantin V. Sugonyaev ◽

Dmitriy V. Ushakov ◽

Alexander G. Vinogradov ◽

...

Keyword(s):

Correlation Matrix ◽

Simulated Data ◽

Real Data ◽

False Identification ◽

Data Set ◽

The Third ◽

Intellectual Abilities ◽

The Matrix ◽

Simulation Parameters ◽

Selection Of

Background. Spearman’s law of diminishing returns (SLODR) states that intercorrelations between scores on tests of intellectual abilities were higher when the data set was comprised of subjects with lower intellectual abilities and vice versa. After almost a hundred years of research, this trend has only been detected on average. Objective. To determine whether the very different results were obtained due to variations in scaling and the selection of subjects. Design. We used three methods for SLODR detection based on moderated factor analysis (MFCA) to test real data and three sets of simulated data. Of the latter group, the first one simulated a real SLODR effect. The second one simulated the case of a different density of tasks of varying difficulty; it did not have a real SLODR effect. The third one simulated a skewed selection of respondents with different abilities and also did not have a real SLODR effect. We selected the simulation parameters so that the correlation matrix of the simulated data was similar to the matrix created from the real data, and all distributions had similar skewness parameters (about -0.3). Results. The results of MFCA are contradictory and we cannot clearly distinguish by this method the dataset with real SLODR from datasets with similar correlation structure and skewness, but without a real SLODR effect. Theresults allow us to conclude that when effects like SLODR are very subtle and can be identified only with a large sample, then features of the psychometric scale become very important, because small variations of scale metrics may lead either to masking of real SLODR or to false identification of SLODR.

Download Full-text

Fully Moderated T-statistic for Small Sample Size Gene Expression Arrays

Statistical Applications in Genetics and Molecular Biology ◽

10.2202/1544-6115.1701 ◽

2011 ◽

Vol 10 (1) ◽

Cited By ~ 15

Author(s):

Lianbo Yu ◽

Parul Gulati ◽

Soledad Fernandez ◽

Michael Pennell ◽

Lawrence Kirschner ◽

...

Keyword(s):

Gene Expression ◽

Small Sample Size ◽

Simulated Data ◽

Real Data ◽

Small Sample ◽

Data Set ◽

Expression Arrays ◽

Gene Expression Arrays ◽

Higher Power ◽

Constant Coefficient Of Variation

Gene expression microarray experiments with few replications lead to great variability in estimates of gene variances. Several Bayesian methods have been developed to reduce this variability and to increase power. Thus far, moderated t methods assumed a constant coefficient of variation (CV) for the gene variances. We provide evidence against this assumption, and extend the method by allowing the CV to vary with gene expression. Our CV varying method, which we refer to as the fully moderated t-statistic, was compared to three other methods (ordinary t, and two moderated t predecessors). A simulation study and a familiar spike-in data set were used to assess the performance of the testing methods. The results showed that our CV varying method had higher power than the other three methods, identified a greater number of true positives in spike-in data, fit simulated data under varying assumptions very well, and in a real data set better identified higher expressing genes that were consistent with functional pathways associated with the experiments.

Download Full-text

VARIABLE SELECTION FOR BAYESIAN SURVIVAL MODELS USING BREGMAN DIVERGENCE MEASURE

Probability in the Engineering and Informational Sciences ◽

10.1017/s0269964818000190 ◽

2018 ◽

Vol 34 (3) ◽

pp. 364-380

Author(s):

Daoyuan Shi ◽

Lynn Kuo

Keyword(s):

Survival Analysis ◽

Variable Selection ◽

Rapid Development ◽

Simulated Data ◽

Semiparametric Regression ◽

Real Data ◽

Survival Models ◽

Cancer Center ◽

Bregman Divergence ◽

Data Set

The variable selection has been an important topic in regression and Bayesian survival analysis. In the era of rapid development of genomics and precision medicine, the topic is becoming more important and challenging. In addition to the challenges of handling censored data in survival analysis, we are facing increasing demand of handling big data with too many predictors where most of them may not be relevant to the prediction of the survival outcome. With the desire of improving upon the accuracy of prediction, we explore the Bregman divergence criterion in selecting predictive models. We develop sparse Bayesian formulation for parametric regression and semiparametric regression models and demonstrate how variable selection is done using the predictive approach. Model selections for a simulated data set, and two real-data sets (one for a kidney transplant study, and the other for a breast cancer microarray study at the Memorial Sloan-Kettering Cancer Center) are carried out to illustrate our methods.

Download Full-text

Deconvoluting the diversity of within-host pathogen strains in a multi-locus sequence typing framework

BMC Bioinformatics ◽

10.1186/s12859-019-3204-8 ◽

2019 ◽

Vol 20 (S20) ◽

Cited By ~ 1

Author(s):

Guo Liang Gan ◽

Elijah Willie ◽

Cedric Chauve ◽

Leonid Chindelevitch

Keyword(s):

Borrelia Burgdorferi ◽

Disease Transmission ◽

Bacterial Pathogen ◽

Simulated Data ◽

Real Data ◽

Genomic Diversity ◽

Mixed Integer ◽

Data Set ◽

Mlst Scheme ◽

Host Pathogen

Abstract Background Bacterial pathogens exhibit an impressive amount of genomic diversity. This diversity can be informative of evolutionary adaptations, host-pathogen interactions, and disease transmission patterns. However, capturing this diversity directly from biological samples is challenging. Results We introduce a framework for understanding the within-host diversity of a pathogen using multi-locus sequence types (MLST) from whole-genome sequencing (WGS) data. Our approach consists of two stages. First we process each sample individually by assigning it, for each locus in the MLST scheme, a set of alleles and a proportion for each allele. Next, we associate to each sample a set of strain types using the alleles and the strain proportions obtained in the first step. We achieve this by using the smallest possible number of previously unobserved strains across all samples, while using those unobserved strains which are as close to the observed ones as possible, at the same time respecting the allele proportions as closely as possible. We solve both problems using mixed integer linear programming (MILP). Our method performs accurately on simulated data and generates results on a real data set of Borrelia burgdorferi genomes suggesting a high level of diversity for this pathogen. Conclusions Our approach can apply to any bacterial pathogen with an MLST scheme, even though we developed it with Borrelia burgdorferi, the etiological agent of Lyme disease, in mind. Our work paves the way for robust strain typing in the presence of within-host heterogeneity, overcoming an essential challenge currently not addressed by any existing methodology for pathogen genomics.

Download Full-text

Classifying exoplanet candidates with convolutional neural networks: application to the Next Generation Transit Survey

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stz2058 ◽

2019 ◽

Vol 488 (4) ◽

pp. 5232-5250 ◽

Cited By ~ 2

Author(s):

Alexander Chaushev ◽

Liam Raynard ◽

Michael R Goad ◽

Philipp Eigmüller ◽

David J Armstrong ◽

...

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Network Performance ◽

Area Under The Curve ◽

Simulated Data ◽

Real Data ◽

Training Data ◽

Next Generation ◽

Data Set ◽

Time Required

ABSTRACT Vetting of exoplanet candidates in transit surveys is a manual process, which suffers from a large number of false positives and a lack of consistency. Previous work has shown that convolutional neural networks (CNN) provide an efficient solution to these problems. Here, we apply a CNN to classify planet candidates from the Next Generation Transit Survey (NGTS). For training data sets we compare both real data with injected planetary transits and fully simulated data, as well as how their different compositions affect network performance. We show that fewer hand labelled light curves can be utilized, while still achieving competitive results. With our best model, we achieve an area under the curve (AUC) score of $(95.6\pm {0.2}){{\ \rm per\ cent}}$ and an accuracy of $(88.5\pm {0.3}){{\ \rm per\ cent}}$ on our unseen test data, as well as $(76.5\pm {0.4}){{\ \rm per\ cent}}$ and $(74.6\pm {1.1}){{\ \rm per\ cent}}$ in comparison to our existing manual classifications. The neural network recovers 13 out of 14 confirmed planets observed by NGTS, with high probability. We use simulated data to show that the overall network performance is resilient to mislabelling of the training data set, a problem that might arise due to unidentified, low signal-to-noise transits. Using a CNN, the time required for vetting can be reduced by half, while still recovering the vast majority of manually flagged candidates. In addition, we identify many new candidates with high probabilities which were not flagged by human vetters.

Download Full-text