Simulated data sets
Recently Published Documents

TOTAL DOCUMENTS: 72 (five years: 19)
H-INDEX: 19 (five years: 2)

2021
Author(s): Oskar Hickl, Pedro Queirós, Paul Wilmes, Patrick May, Anna Heintz-Buschart

The reconstruction of genomes is a critical step in genome-resolved metagenomics as well as for multi-omic data integration from microbial communities. Here, we present binny, a binning tool that produces high-quality metagenome-assembled genomes from both contiguous and highly fragmented genomes. Based on established metrics, binny outperforms existing state-of-the-art binning methods and finds unique genomes that could not be detected by other methods. binny uses k-mer composition and coverage by metagenomic reads for iterative, non-linear dimension reduction of genomic signatures, followed by automated contig clustering with cluster assessment using lineage-specific marker gene sets. When compared to five widely used binning algorithms, binny recovers the most near-complete (>95% pure, >90% complete) and high-quality (>90% pure, >70% complete) genomes from simulated data sets from the Critical Assessment of Metagenome Interpretation (CAMI) initiative, as well as from a real-world benchmark comprising metagenomes from various environments. binny is implemented as a Snakemake workflow and is available from https://github.com/a-h-b/binny.
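
The abstract describes a general binning recipe: embed per-contig k-mer composition and read coverage with a non-linear dimension reduction, then cluster the embedding. The sketch below illustrates that general idea only; the library choices (umap-learn, hdbscan), the parameter values, and the omission of binny's iterative, marker-gene-based cluster assessment are assumptions for illustration, not binny's actual code.

```python
# Illustrative sketch (not binny's code): k-mer + coverage features -> non-linear
# embedding -> density-based contig clustering.
from itertools import product
import numpy as np
import umap      # pip install umap-learn
import hdbscan   # pip install hdbscan

def kmer_profile(seq, k=4):
    """Return a normalised k-mer frequency vector for one contig."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    return counts / max(counts.sum(), 1)

def cluster_contigs(contig_seqs, coverages, k=4):
    """contig_seqs: list of str; coverages: array of shape (n_contigs, n_samples)."""
    comp = np.array([kmer_profile(s, k) for s in contig_seqs])
    feats = np.hstack([comp, np.log1p(np.asarray(coverages))])
    emb = umap.UMAP(n_neighbors=15, min_dist=0.0, n_components=2).fit_transform(feats)
    labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(emb)
    return labels  # -1 marks unbinned contigs
```

In binny itself, the resulting clusters would additionally be assessed against lineage-specific single-copy marker gene sets and refined iteratively, which this sketch leaves out.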


2021, Vol 12
Author(s): Li Xu, Yin Xu, Tong Xue, Xinyu Zhang, Jin Li

Motivation: The emergence of single-cell RNA sequencing (scRNA-seq) technology has paved the way for measuring RNA levels at single-cell resolution to study precise biological functions. However, the large number of missing values in scRNA-seq data affects downstream analysis. This paper presents AdImpute, an imputation method based on semi-supervised autoencoders. The method uses the output of another imputation method (DrImpute is used as an example) as imputation weights for the autoencoder, and applies a cost function with these imputation weights to learn the latent information in the data and achieve more accurate imputation.
Results: As shown in clustering experiments on simulated and real data sets, AdImpute is more accurate than four other publicly available scRNA-seq imputation methods and only minimally modifies biologically silent genes. Overall, AdImpute is an accurate and robust imputation method.
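
A minimal sketch, assuming a PyTorch autoencoder, of the weighted-cost idea the abstract describes: entries pre-filled by a helper imputation method (e.g. DrImpute) are down-weighted relative to observed entries in the reconstruction loss. The architecture, the weight value, and the training snippet are illustrative assumptions, not AdImpute's published implementation.

```python
import torch
import torch.nn as nn

class ImputeAE(nn.Module):
    """Plain autoencoder over a cells x genes expression matrix."""
    def __init__(self, n_genes, hidden=256, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU(),
                                 nn.Linear(hidden, latent), nn.ReLU())
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_genes))

    def forward(self, x):
        return self.dec(self.enc(x))

def weighted_loss(recon, target, observed_mask, imputed_weight=0.3):
    # Observed entries get weight 1; entries filled by the helper method get less.
    w = torch.where(observed_mask, torch.ones_like(target),
                    torch.full_like(target, imputed_weight))
    return torch.mean(w * (recon - target) ** 2)

# Training step (prefilled = expression matrix with dropouts filled by DrImpute):
# model = ImputeAE(prefilled.shape[1]); opt = torch.optim.Adam(model.parameters(), 1e-3)
# loss = weighted_loss(model(prefilled), prefilled, observed_mask); loss.backward(); opt.step()
```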


Author(s): David E. Losada, David Elsweiler, Morgan Harvey, Christoph Trattner

Two major barriers to conducting user studies are the costs involved in recruiting participants and the researcher time required to perform studies. Typical solutions are to study convenience samples or to design studies that can be deployed on crowd-sourcing platforms. Both solutions have benefits but also drawbacks. Even in cases where these approaches make sense, it is still reasonable to ask whether we are using our resources, participants' time and our own, efficiently, and whether we can do better. Typically, user studies compare randomly assigned experimental conditions, such that a uniform number of opportunities is assigned to each condition. This sampling approach, as has been demonstrated in clinical trials, is sub-optimal. The goal of many Information Retrieval (IR) user studies is to determine which strategy (e.g., behaviour or system) performs best. In such a setup, it is not wise to waste participant and researcher time and money on conditions that are obviously inferior. In this work we explore whether Best Arm Identification (BAI) algorithms provide a natural solution to this problem. BAI methods are a class of Multi-armed Bandits (MABs) where the only goal is to output a recommended arm, and the algorithms are evaluated by the average payoff of the recommended arm. Using three datasets associated with previously published IR-related user studies and a series of simulations, we test the extent to which the cost required to run user studies can be reduced by employing BAI methods. Our results suggest that some BAI instances (racing algorithms) are promising devices for reducing the cost of user studies. One of the racing algorithms studied, Hoeffding racing, holds particular promise: it offered consistent savings across both the real and simulated data sets and only extremely rarely returned a result inconsistent with the result of the full trial. We believe these results can have an important impact on the way research is performed in this field. They show that the conditions assigned to participants could be changed dynamically and automatically to make efficient use of participant and experimenter time.
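
As a concrete illustration of the racing idea, here is a minimal sketch of a Hoeffding race: each still-active condition ("arm") receives one observation per round and is eliminated once its Hoeffding upper confidence bound drops below the best arm's lower bound. The confidence radius assumes rewards rescaled to [0, 1]; the budget, delta, and the toy usage are illustrative assumptions, not the exact configuration used in the paper.

```python
import math
import random

def hoeffding_race(arms, budget=1000, delta=0.05):
    """arms: list of callables, each returning a reward in [0, 1]."""
    n = len(arms)
    active = list(range(n))
    counts = [0] * n
    sums = [0.0] * n
    pulls = 0
    while len(active) > 1 and pulls < budget:
        for a in active:                     # one observation per active arm
            sums[a] += arms[a]()
            counts[a] += 1
            pulls += 1
        means = {a: sums[a] / counts[a] for a in active}
        radius = {a: math.sqrt(math.log(2 * n / delta) / (2 * counts[a]))
                  for a in active}
        best_lower = max(means[a] - radius[a] for a in active)
        # Drop arms whose upper bound cannot beat the current best lower bound.
        active = [a for a in active if means[a] + radius[a] >= best_lower]
    return max(active, key=lambda a: sums[a] / counts[a])

# Toy usage: three "conditions" with different mean success rates.
# best = hoeffding_race([lambda p=p: float(random.random() < p) for p in (0.55, 0.60, 0.75)])
```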


Author(s): Catherine R. Alimboyong

Infections in computer networks are complex, and their spread is analogous to that of a contagious disease, capable of causing destruction within a few seconds. Viruses in a computer or computer network can spread rapidly by various means, such as access to online social networking sites like Twitter and Facebook, or the opening of email attachments. Thus, infections can range from mildly dangerous to significantly harmful for a network. This paper proposes a simulation model that can predict the propagation of a virus, including the trend and the average infection rate, using NetLogo software. Observed and simulated data sets were validated using chi-square tests. Results of the experiment demonstrate accurate performance of the proposed model. The model could be very helpful for network administrators in mitigating virus propagation and obstructing the spread of computer viruses, complementing the usual prevention schemes, particularly the use of antivirus software and firewall security.
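
The validation step mentioned above, comparing observed and simulated infection counts with a chi-square test, can be illustrated with a short, hedged sketch. The paper itself works in NetLogo; the counts and the scipy-based test below are placeholders for illustration only.

```python
from scipy.stats import chisquare

observed = [12, 30, 55, 41, 22]    # infections observed per time interval (placeholder)
simulated = [10, 33, 52, 44, 21]   # infections predicted by the model (placeholder)

# scipy requires both vectors to have the same total; rescale the expected counts.
scale = sum(observed) / sum(simulated)
expected = [s * scale for s in simulated]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
# A large p-value (e.g. > 0.05) means the simulated distribution is consistent
# with the observed one.
```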


The Holocene, 2021, pp. 095968362110032
Author(s): Charles E Umbanhowar, James A Umbanhowar

Imaging of charcoal particles extracted from lake sediments provides an important way to understand past fire regimes, but imaging large numbers of particles can be time consuming. In this note we explore the effects of subsampling and extrapolation of area on estimates of sum charcoal area, using resampling of real and simulated data sets, and propose a protocol in which all particles are counted but only the first 100 encountered are imaged. Extrapolated estimates of sum total charcoal area for 40 real samples were nearly identical to actual values, and the error introduced by subsampling was low (coefficient of variation <0.2) for all but samples originally containing fewer than 50 particles. Similarly, error was low for simulated data (CV <0.02). Extrapolation provided better estimates of charcoal area than did a regression-based approach. Our results suggest that imaging a fixed number of pieces of charcoal (n = 100) and counting any additional pieces is a time-efficient way to estimate charcoal area while retaining useful information on particle size and shape.
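
A minimal sketch of the count-all, image-100 protocol proposed above: the total charcoal area is extrapolated as the mean area of the imaged subset times the total particle count. The function and variable names are illustrative, not code from the paper.

```python
import numpy as np

def extrapolated_area(imaged_areas, total_count, n_imaged=100):
    """imaged_areas: areas (e.g. mm^2) of the first `n_imaged` particles imaged;
    total_count: number of particles counted in the whole sample."""
    imaged_areas = np.asarray(imaged_areas[:n_imaged], dtype=float)
    if total_count <= len(imaged_areas):
        return imaged_areas[:total_count].sum()   # everything was imaged anyway
    return imaged_areas.mean() * total_count      # extrapolate to the full count

# e.g. 100 imaged particles averaging 0.02 mm^2 in a sample of 340 counted particles
# gives an estimated sum area of 0.02 * 340 = 6.8 mm^2.
```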


Author(s): R.V. Dutaut, D. Marcotte

In most exploration or mining grade data sets, the presence of outliers or extreme values represents a significant challenge to mineral resource estimators. The most common practice is to cap the extreme values at a predefined level. A new capping approach is presented that uses QA/QC coarse-duplicate data correlation to predict the real data coefficient of variation (i.e., the error-free CV). The cap grade is determined such that the capped data have a CV equal to the predicted CV. The robustness of the approach with regard to original core assay length decisions, departure from lognormality, and capping before or after compositing is assessed using simulated data sets. Real case studies of gold and nickel deposits are used to compare the proposed approach to the methods most widely used in industry. The new approach is simple and objective: it provides a cap grade that is determined automatically, based on the predicted CV, and takes into account the quality of the assay procedure as determined by the coarse-duplicate correlation.
Keywords: geostatistics, outliers, capping, duplicates, QA/QC, lognormal distribution.
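
The sketch below illustrates the capping rule described above: predict an error-free CV from the coarse-duplicate correlation, then lower the cap until the capped data reach that CV. The additive-error relation used to predict the CV (CV_true ≈ sqrt(rho) x CV_observed) and the simple grid search are illustrative assumptions; the authors' exact predictor may differ.

```python
import numpy as np

def predicted_cv(grades, duplicate_pairs):
    """duplicate_pairs: array of shape (n_pairs, 2) with original/duplicate assays."""
    rho = np.corrcoef(duplicate_pairs[:, 0], duplicate_pairs[:, 1])[0, 1]
    cv_obs = np.std(grades, ddof=1) / np.mean(grades)
    # Assumed additive, independent error model: true variance ~ rho * observed variance.
    return np.sqrt(max(rho, 0.0)) * cv_obs

def cap_grade_for_cv(grades, target_cv):
    """Highest candidate cap whose capped data reach a CV <= target_cv."""
    for cap in np.sort(np.unique(grades))[::-1]:   # lower the cap step by step
        capped = np.minimum(grades, cap)
        cv = np.std(capped, ddof=1) / np.mean(capped)
        if cv <= target_cv:
            return cap
    return np.min(grades)

# cap = cap_grade_for_cv(grades, predicted_cv(grades, duplicate_pairs))
```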


Author(s): Jeremiah P. Sjoberg, Richard A. Anthes, Therese Rieckh

The three-cornered hat (3CH) method, which was originally developed to assess the random errors of atomic clocks, is a means of estimating the error variances of three different data sets. Here we give an overview of the historical development of the 3CH and of selected other methods for estimating error variances that use either two or three data sets, and we discuss similarities and differences between these methods and the 3CH method. This study assesses the sensitivity of the 3CH method to the factors that limit its accuracy, including sample size, outliers, different magnitudes of errors between the data sets, biases, and unknown error correlations. Using simulated data sets for which the errors and their correlations among the data sets are known, the analysis shows the conditions under which the 3CH method provides the most and least accurate estimates. The effect of representativeness errors caused by differences in the vertical resolution of the data sets is also investigated; these errors are generally small relative to the random errors in the data sets, and their impact can be reduced by appropriate filtering.
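
For readers unfamiliar with the method, the classical 3CH estimate can be written down in a few lines: with three collocated data sets whose errors are assumed mutually uncorrelated, each error variance follows from the variances of the pairwise differences, var(e_X) = 0.5 * [var(X - Y) + var(X - Z) - var(Y - Z)]. The sketch below demonstrates this on synthetic data; it is a textbook illustration, not the paper's code.

```python
import numpy as np

def three_cornered_hat(x, y, z):
    """Return estimated error variances (var_x, var_y, var_z)."""
    vxy = np.var(x - y, ddof=1)
    vxz = np.var(x - z, ddof=1)
    vyz = np.var(y - z, ddof=1)
    return (0.5 * (vxy + vxz - vyz),
            0.5 * (vxy + vyz - vxz),
            0.5 * (vxz + vyz - vxy))

# Simulated check: a common truth plus independent noise of known standard deviation.
rng = np.random.default_rng(0)
truth = rng.normal(size=100_000)
x = truth + rng.normal(scale=0.5, size=truth.size)
y = truth + rng.normal(scale=1.0, size=truth.size)
z = truth + rng.normal(scale=1.5, size=truth.size)
print(three_cornered_hat(x, y, z))   # approximately (0.25, 1.0, 2.25)
```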


2020
Author(s): Yun Zhang, Chanhee Park, Christopher Bennett, Micah Thornton, Daehwan Kim

Nucleotide conversion sequencing technologies such as bisulfite-seq and SLAM-seq are powerful tools to explore the intricacies of cellular processes. In this paper, we describe HISAT-3N (hierarchical indexing for spliced alignment of transcripts - 3 nucleotides), which rapidly and accurately aligns reads containing nucleotide conversions by leveraging the powerful hierarchical index and repeat index algorithms originally developed for the HISAT software. Tests on real and simulated data sets demonstrate that HISAT-3N is over 7 times faster, has greater alignment accuracy, and has smaller memory requirements than other modern systems. Taken together, these results show that HISAT-3N is an ideal aligner for use with nucleotide-conversion sequencing technologies.
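
To make the "3 nucleotides" idea concrete, the hedged sketch below collapses both read and reference to a three-letter alphabet (C treated as T) so that converted bases no longer count as mismatches. This conveys the concept only; HISAT-3N's actual implementation uses hierarchical and repeat indexes rather than the naive scan shown here, and handles both strands and multiple conversion types.

```python
def collapse(seq, frm="C", to="T"):
    """Collapse a 4-letter sequence to a 3-letter alphabet (frm read as to)."""
    return seq.upper().replace(frm, to)

def conversion_tolerant_match(read, reference, frm="C", to="T"):
    """Return reference offsets where the read matches under frm->to conversion."""
    r = collapse(read, frm, to)
    ref = collapse(reference, frm, to)
    return [i for i in range(len(ref) - len(r) + 1) if ref[i:i + len(r)] == r]

# Example: the bisulfite-converted read "TTTGA" (derived from reference bases "CTCGA")
# still matches: conversion_tolerant_match("TTTGA", "ACTCGATTG") -> [1]
```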


2020, Vol 11 (1)
Author(s): Tijana Radivojević, Zak Costello, Kenneth Workman, Hector Garcia Martin

Synthetic biology allows us to bioengineer cells to synthesize novel valuable molecules such as renewable biofuels or anticancer drugs. However, traditional synthetic biology approaches involve ad hoc engineering practices, which lead to long development times. Here, we present the Automated Recommendation Tool (ART), a tool that leverages machine learning and probabilistic modeling techniques to guide synthetic biology in a systematic fashion, without the need for a full mechanistic understanding of the biological system. Using sampling-based optimization, ART provides a set of recommended strains to be built in the next engineering cycle, alongside probabilistic predictions of their production levels. We demonstrate the capabilities of ART on simulated data sets, as well as on experimental data from real metabolic engineering projects producing renewable biofuels, hoppy-flavored beer without hops, fatty acids, and tryptophan. Finally, we discuss the limitations of this approach and the practical consequences of the underlying assumptions failing.
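
A minimal sketch, assuming a Gaussian-process surrogate and uniform candidate sampling, of the general "recommend the next strains" loop the abstract describes: fit a probabilistic model to past (design, production) data, score sampled candidates by predicted production with an exploration bonus, and return the top few with their predicted means and uncertainties. All modelling choices and names here are assumptions for illustration, not ART's actual models.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def recommend(designs, production, n_candidates=10_000, n_recommend=8, seed=0):
    """designs: (n, d) array of past strain designs; production: (n,) measurements."""
    model = GaussianProcessRegressor(normalize_y=True).fit(designs, production)
    rng = np.random.default_rng(seed)
    lo, hi = designs.min(axis=0), designs.max(axis=0)
    candidates = rng.uniform(lo, hi, size=(n_candidates, designs.shape[1]))
    mean, std = model.predict(candidates, return_std=True)
    score = mean + 0.5 * std          # mild exploration bonus (assumed heuristic)
    best = np.argsort(score)[-n_recommend:][::-1]
    return candidates[best], mean[best], std[best]
```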


2020, Vol 499 (3), pp. 3992-4010
Author(s): Jean-Baptiste Jolly, Kirsten K Knudsen, Flora Stanley

linestacker is a new open-access, open-source tool for stacking spectral lines in interferometric data. linestacker is an ensemble of CASA tasks and can stack both 3D cubes and already extracted spectra. The algorithm is tested on increasingly complex simulated data sets mimicking Atacama Large Millimeter/submillimeter Array and Karl G. Jansky Very Large Array observations of [C ii] and CO(3–2) emission lines from z ∼ 7 and z ∼ 4 galaxies, respectively. We find that the algorithm is very robust, successfully retrieving the input parameters of the stacked lines in all cases with an accuracy ≳90 per cent. However, we identify some specific situations that showcase the intrinsic limitations of the method: mainly, high uncertainties on the redshifts (Δz > 0.01) can lead to poor signal-to-noise improvement because lines are stacked on shifted central frequencies. Additionally, we give an extensive description of the statistical tools embedded in linestacker, mainly bootstrapping, rebinning, and subsampling. Velocity rebinning is applied to the data before stacking and proves necessary when studying line profiles, in order to avoid artificial spectral features in the stack. Subsampling is useful for sorting the stacked sources, allowing one to find a subsample that maximizes the searched parameters, while bootstrapping allows the detection of inhomogeneities in the stacked sample. linestacker is a useful tool for extracting the most from spectral observations of various types.
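
A hedged sketch of the core shift-and-stack operation described above: each spectrum is moved to the rest frame using its catalogued redshift, interpolated onto a common grid, and averaged (optionally with weights). linestacker itself operates on CASA data and adds the rebinning, subsampling, and bootstrapping tools discussed in the abstract; the function below is a conceptual illustration only.

```python
import numpy as np

def stack_spectra(freqs_list, flux_list, redshifts, rest_grid, weights=None):
    """freqs_list / flux_list: per-source observed frequency and flux arrays;
    rest_grid: common rest-frame frequency grid for the stack."""
    weights = np.ones(len(flux_list)) if weights is None else np.asarray(weights)
    stacked = np.zeros_like(rest_grid, dtype=float)
    for freqs, flux, z, w in zip(freqs_list, flux_list, redshifts, weights):
        rest_freqs = np.asarray(freqs) * (1.0 + z)           # observed -> rest frame
        resampled = np.interp(rest_grid, rest_freqs, flux, left=0.0, right=0.0)
        stacked += w * resampled
    return stacked / weights.sum()
```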

