SSP: An R package to estimate sampling effort in studies of ecological communities

2020 ◽  
Author(s):  
Edlin J. Guerra-Castro ◽  
Juan Carlos Cajas ◽  
Nuno Simões ◽  
Juan J Cruz-Motta ◽  
Maite Mascaró

SSP (simulation-based sampling protocol) is an R package that uses simulation of ecological data and the dissimilarity-based multivariate standard error (MultSE) as an estimator of precision to evaluate the adequacy of different sampling efforts for studies that will test hypotheses using permutational multivariate analysis of variance. The procedure consists of simulating several extensive data matrices that mimic some of the relevant ecological features of the community of interest using a pilot data set. For each simulated data set, several sampling efforts are repeatedly executed and MultSE is calculated. The mean value and the 0.025 and 0.975 quantiles of MultSE for each sampling effort across all simulated data sets are then estimated and standardized relative to the lowest sampling effort. The optimal sampling effort is identified as the one beyond which further increases in sampling effort do not improve precision by more than a threshold value (e.g. 2.5%). The performance of SSP was validated using real data; in all examples the simulated data mimicked the real data well, making it possible to evaluate the relationship between MultSE and n beyond the sample size of the pilot studies. SSP can be used to estimate sample size in a wide range of situations, ranging from simple (e.g. single site) to more complex (e.g. several sites for different habitats) experimental designs. The latter constitutes an important advantage, since it offers new possibilities for complex sampling designs, as has been advised for multi-scale studies in ecology.
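The resampling step described in this abstract can be sketched in a few lines of Python. This is an illustration only, not the SSP package itself: it assumes the usual definition MultSE = sqrt(V / n), with V = SS / (n - 1) and SS equal to the sum of squared pairwise dissimilarities divided by n, and it takes an already simulated community matrix as input. The function names (`mult_se`, `multse_by_effort`) are hypothetical, and both the simulation of data from a pilot set and the cut-off rule for the optimal effort are left out.

```python
import math
import random


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den > 0 else 0.0


def euclidean(a, b):
    """Euclidean distance, useful as a sanity check against the univariate SE."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mult_se(samples, dissimilarity=bray_curtis):
    """Multivariate standard error of a set of sampling units (rows)."""
    n = len(samples)
    if n < 2:
        raise ValueError("MultSE needs at least two sampling units")
    # SS: sum of squared pairwise dissimilarities divided by n.
    ss = sum(
        dissimilarity(samples[i], samples[j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    ) / n
    pseudo_variance = ss / (n - 1)
    return math.sqrt(pseudo_variance / n)


def multse_by_effort(community, efforts, repeats=100, seed=0,
                     dissimilarity=bray_curtis):
    """For each effort n, draw n rows `repeats` times and summarise MultSE.

    Returns {n: (mean, lower, upper)}, where lower and upper are simple
    order-statistic approximations of the 0.025 and 0.975 quantiles.
    """
    rng = random.Random(seed)
    summary = {}
    for n in efforts:
        values = sorted(
            mult_se(rng.sample(community, n), dissimilarity)
            for _ in range(repeats)
        )
        lower = values[int(0.025 * (repeats - 1))]
        upper = values[int(0.975 * (repeats - 1))]
        summary[n] = (sum(values) / repeats, lower, upper)
    return summary
```

With Euclidean distance on a single variable, `mult_se` reduces to the familiar standard error of the mean, which is a convenient check that the formula is wired correctly.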

2021 ◽  
Vol 9 (1) ◽  
pp. 62-81
Author(s):  
Kjersti Aas ◽  
Thomas Nagler ◽  
Martin Jullum ◽  
Anders Løland

In this paper the goal is to explain predictions from complex machine learning models. One method that has become very popular during the last few years is Shapley values. The original development of Shapley values for prediction explanation relied on the assumption that the features being described were independent. If the features are in reality dependent, this may lead to incorrect explanations. Hence, there have recently been attempts to model or estimate the dependence between the features appropriately. Although the previously proposed methods clearly outperform the traditional approach assuming independence, they have their weaknesses. In this paper we propose two new approaches for modelling the dependence between the features. Both approaches are based on vine copulas, which are flexible tools for modelling multivariate non-Gaussian distributions and are able to characterise a wide range of complex dependencies. The performance of the proposed methods is evaluated on simulated data sets and a real data set. The experiments demonstrate that the vine copula approaches give more accurate approximations to the true Shapley values than their competitors.
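To make the role of feature dependence concrete, the sketch below computes exact Shapley values for a toy two-feature linear model. It is not the vine copula method of the paper: it assumes a standard bivariate normal with correlation `rho`, for which the conditional expectation has a closed form, and the helper names (`shapley_values`, `linear_gaussian_value`) are hypothetical. Setting `rho = 0` recovers the independence assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` defined on frozensets."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def linear_gaussian_value(beta, x, rho):
    """v(S) = E[f(X) | X_S = x_S] for f(x) = beta[0]*x[0] + beta[1]*x[1].

    Assumption: (X_0, X_1) is standard bivariate normal with correlation rho,
    so E[X_i | X_j = x_j] = rho * x_j and E[X_i] = 0.
    """
    def value(coalition):
        expected = list(x)
        for i in (0, 1):
            if i not in coalition:
                j = 1 - i
                expected[i] = rho * x[j] if j in coalition else 0.0
        return beta[0] * expected[0] + beta[1] * expected[1]
    return value


beta, x = (2.0, 1.0), (1.0, 3.0)
independent = shapley_values([0, 1], linear_gaussian_value(beta, x, rho=0.0))
dependent = shapley_values([0, 1], linear_gaussian_value(beta, x, rho=0.5))
```

In both cases the values sum to the prediction f(x), but the attribution shifts between the two features once the dependence is taken into account, which is the effect the abstract warns about.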


2021 ◽  
Vol 14 (12) ◽  
pp. 612
Author(s):  
Jianan Zhu ◽  
Yang Feng

We propose a new ensemble classification algorithm, named super random subspace ensemble (Super RaSE), to tackle the sparse classification problem. The proposed algorithm is motivated by the random subspace ensemble algorithm (RaSE). The RaSE method was shown to be a flexible framework that can be coupled with any existing base classification. However, the success of RaSE largely depends on the proper choice of the base classifier, which is unfortunately unknown to us. In this work, we show that Super RaSE avoids the need to choose a base classifier by randomly sampling a collection of classifiers together with the subspace. As a result, Super RaSE is more flexible and robust than RaSE. In addition to the vanilla Super RaSE, we also develop the iterative Super RaSE, which adaptively changes the base classifier distribution as well as the subspace distribution. We show that the Super RaSE algorithm and its iterative version perform competitively for a wide range of simulated data sets and two real data examples. The new Super RaSE algorithm and its iterative version are implemented in a new version of the R package RaSEn.


Mathematics ◽  
2021 ◽  
Vol 9 (23) ◽  
pp. 3074
Author(s):  
Cristian Preda ◽  
Quentin Grimonprez ◽  
Vincent Vandewalle

Categorical functional data represented by paths of a stochastic jump process with continuous time and a finite set of states are considered. As an extension of the multiple correspondence analysis to an infinite set of variables, optimal encodings of states over time are approximated using an arbitrary finite basis of functions. This allows dimension reduction, optimal representation, and visualisation of data in lower dimensional spaces. The methodology is implemented in the cfda R package and is illustrated using a real data set in the clustering framework.


2014 ◽  
Vol 33 (2) ◽  
pp. 27
Author(s):  
Maria Angeles Gallego ◽  
Maria Victoria Ibanez ◽  
Amelia Simó

Many medical and biological problems require extracting information from microscopic images. Boolean models have been extensively used to analyze binary images of random clumps in many scientific fields. In this paper, a particular type of Boolean model with an underlying non-stationary point process is considered. The intensity of the underlying point process is formulated as a fixed function of the distance to a region of interest. A method to estimate the parameters of this Boolean model is introduced, and its performance is checked in two different settings. First, a comparative study with other existing methods is carried out using simulated data. Second, the method is applied to the longleaf data set, a very popular data set in the context of point processes that is included in the R package spatstat. The results show that the new method provides estimates as accurate as those obtained with more complex methods developed for the general case. Finally, to illustrate the application of this model and this method, a particular type of phytopathological image is analyzed. These images show callose depositions in leaves of Arabidopsis plants. The analysis of callose depositions is very popular in the phytopathological literature as a way to quantify the activity of plant immunity.


2005 ◽  
Vol 30 (4) ◽  
pp. 369-396 ◽  
Author(s):  
Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze growth of a trait latent variable measured by ordinal items. Items are nested within time-points, and time-points are nested within subjects. These models are special because they include factor analytic structure. They can analyze not only data with item- and time-level missing observations, but also data with time points freely specified over subjects. Furthermore, features useful for longitudinal analyses, “autoregressive error degree one” structure for the trait residuals and estimated time-scores, were included. The approach is Bayesian with Markov chain Monte Carlo, and the models are implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.


Author(s):  
Jianan Zhu ◽  
Yang Feng

We propose a new ensemble classification algorithm, named Super Random Subspace Ensemble (Super RaSE), to tackle the sparse classification problem. The proposed algorithm is motivated by the Random Subspace Ensemble algorithm (RaSE). The RaSE method was shown to be a flexible framework that can be coupled with any existing base classification. However, the success of RaSE largely depends on the proper choice of the base classifier, which is unfortunately unknown to us. In this work, we show that Super RaSE avoids the need to choose a base classifier by randomly sampling a collection of classifiers together with the subspace. As a result, Super RaSE is more flexible and robust than RaSE. In addition to the vanilla Super RaSE, we also develop the iterative Super RaSE, which adaptively changes the base classifier distribution as well as the subspace distribution. We show the Super RaSE algorithm and its iterative version perform competitively for a wide range of simulated datasets and two real data examples. The new Super RaSE algorithm and its iterative version are implemented in a new version of the R package RaSEn.


2021 ◽  
Vol 20 (2) ◽  
pp. 69-77
Author(s):  
Adam Sagan

The paper presents a graphical approach to the decomposition of age, period and cohort (APC) effects in cohort studies (mainly applied to demographic phenomena) using a multilevel or accelerated longitudinal design. The aim of the paper is to present and visualize the pure age, period and cohort effects based on simulated data with an increment of five for each successive age, period and cohort variation. In cohort analysis on real data, all of the effects are usually interrelated. The analysis shows basic patterns of two-variate APC decomposition (age within period, age within cohort, cohort within period, period within age, cohort within age, period within cohort) and reveals the trajectory of curves for each of the pure effects. The APC plots are developed using the apc library in R.


2020 ◽  
Vol 8 (2) ◽  
pp. B35-B43
Author(s):  
Julio Cesar S. O. Lyrio ◽  
Paulo T. L. Menezes ◽  
Jorlivan L. Correa ◽  
Adriano R. Viana

When collecting and processing geophysical data for exploration, the same geologic feature can generate a different response for each rock property being targeted. Typically, the units of these responses may differ by several orders of magnitude; therefore, the combination of geophysical data in integrated interpretation is not a straightforward process and cannot be performed by visual inspection only. The multiphysics anomaly map (MAM) that we have developed is a data fusion solution that consists of a spatial representation of the correlation between anomalies detected with different geophysical methods. In the MAM, we mathematically process geophysical data such as seismic attributes, gravity, magnetic, and resistivity before combining them in a single map. In each data set, anomalous regions of interest, which are problem-dependent, are selected by the interpreter. Selected anomalies are highlighted through the use of a logistic function, which is specially designed to clip large magnitudes and rescale the range of values, increasing the discrimination of anomalies. The resulting anomalies, named logistic anomalies, represent regions of large probabilities of target occurrence. This new solution highlights areas where individual interpretations of different geophysical methods correlate, increasing the confidence in the interpretation. We determine the effectiveness of our MAM with application to real data from onshore and offshore Brazil. In the onshore Recôncavo Basin, the MAM allows the interpreter to identify a channel where a drilled well found the largest sandstone thickness in the area. In a second example, from offshore Sergipe-Alagoas Basin, the MAM helps differentiate between a dry and an oil-bearing channel previously outlined in seismic data. Therefore, these outcomes indicate that the MAM is a valid interpretation tool that we believe can be applied to a wide range of geologic problems.
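The logistic rescaling step can be illustrated with a small Python sketch. The abstract does not give the exact logistic parameters or the rule used to combine the individual maps, so everything here is an assumption made for illustration: the midpoint and steepness stand in for the interpreter's choice of anomalous range, the cell-by-cell product stands in for the combination step, and the function names are hypothetical.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_anomaly(values, midpoint, steepness):
    """Rescale raw geophysical values to (0, 1).

    Large magnitudes saturate near 1, so data sets whose units differ by
    orders of magnitude end up on a common, comparable scale.
    """
    return [_sigmoid(steepness * (v - midpoint)) for v in values]


def combine_maps(maps):
    """Assumed combination rule: the cell-by-cell product of logistic anomalies.

    The product is high only where every method flags an anomaly.
    """
    return [math.prod(cells) for cells in zip(*maps)]


# Toy one-dimensional "maps" in very different units (made-up numbers).
gravity = [0.1, 0.2, 5.0, 0.3]
resistivity = [10.0, 2000.0, 1800.0, 15.0]

gravity_anomaly = logistic_anomaly(gravity, midpoint=1.0, steepness=4.0)
resistivity_anomaly = logistic_anomaly(resistivity, midpoint=1000.0, steepness=0.01)
combined = combine_maps([gravity_anomaly, resistivity_anomaly])
```

In this toy case only the third cell is anomalous in both data sets, so it is the only one that keeps a high combined score.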


Author(s):  
Guro Dørum ◽  
Lars Snipen ◽  
Margrete Solheim ◽  
Solve Saebo

Gene set analysis methods have become a widely used tool for including prior biological knowledge in the statistical analysis of gene expression data. Advantages of these methods include increased sensitivity, easier interpretation and more conformity in the results. However, gene set methods do not employ all the available information about gene relations. Genes are arranged in complex networks where the network distances contain detailed information about inter-gene dependencies. We propose a method that uses gene networks to smooth gene expression data with the aim of reducing the number of false positives and identifying important subnetworks. Gene dependencies are extracted from the network topology and are used to smooth genewise test statistics. To find the optimal degree of smoothing, we propose using a criterion that considers the correlation between the network and the data. The network smoothing is shown to improve the ability to identify important genes in simulated data. Applied to a real data set, the smoothing accentuates parts of the network with a high density of differentially expressed genes.


Author(s):  
Giacomo Baruzzo ◽  
Ilaria Patuzzi ◽  
Barbara Di Camillo

Motivation: Single-cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing and analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, and they often show poor or undemonstrated similarity with real data. Results: In this article we present SPARSim, an scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most widely used scRNA-seq simulators, Splat. In particular, count matrices simulated with SPARSim closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation: The SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information: Supplementary data are available at Bioinformatics online.
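A two-step Gamma-Multivariate Hypergeometric generator of the kind named in this abstract can be sketched as follows. This is not the SPARSim implementation. The parameterisation is assumed: the gamma mean is set to a gene's intensity and its squared coefficient of variation to a variability parameter. The `scale` factor that turns expression levels into an integer pool of molecules and the function name `simulate_cell` are likewise hypothetical.

```python
import random


def simulate_cell(intensity, variability, library_size, scale=1000, rng=None):
    """Simulate one cell's count vector.

    Step 1 (biological variability): draw each gene's expression level from a
    gamma distribution with mean `intensity` and squared CV `variability`.
    Step 2 (technical variability): draw `library_size` molecules without
    replacement from the resulting pool, i.e. a multivariate hypergeometric
    sample.
    """
    rng = rng or random.Random()
    levels = [
        # gammavariate(shape, scale): mean = shape * scale = m.
        rng.gammavariate(1.0 / v, m * v) if m > 0 else 0.0
        for m, v in zip(intensity, variability)
    ]
    # Convert continuous levels into an integer number of molecules per gene.
    pool_counts = [int(round(level * scale)) for level in levels]
    pool = [gene for gene, c in enumerate(pool_counts) for _ in range(c)]
    if library_size > len(pool):
        raise ValueError("library_size exceeds the number of available molecules")
    counts = [0] * len(intensity)
    for gene in rng.sample(pool, library_size):  # sampling without replacement
        counts[gene] += 1
    return counts


# Hypothetical parameters for four genes; the last one is not expressed.
cell = simulate_cell(
    intensity=[5.0, 1.0, 0.2, 0.0],
    variability=[0.01, 0.01, 0.01, 0.01],
    library_size=50,
    scale=100,
    rng=random.Random(42),
)
```

Because the library size is small relative to the pool, weakly expressed genes are often missed entirely, which is one way sparsity can arise in simulated counts.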

