simulated data
Recently Published Documents

Laboratory diffraction contrast tomography (LabDCT) is a recently developed technique to map crystallographic orientations of polycrystalline samples in three dimensions non-destructively using a laboratory X-ray source. In this work, a new theoretical procedure, named LabXRS, expanding LabDCT to include mapping of the deviatoric strain tensors on the grain scale, is proposed and validated using simulated data. For the validation, the geometries investigated include a typical near-field LabDCT setup utilizing Laue focusing with equal source-to-sample and sample-to-detector distances of 14 mm, a magnified setup where the sample-to-detector distance is increased to 200 mm, a far-field Laue focusing setup where the source-to-sample distance is also increased to 200 mm, and a near-field setup with a source-to-sample distance of 200 mm. The strain resolution is found to be in the range of 1–5 × 10⁻⁴, depending on the geometry of the experiment. The effects of other experimental parameters, including pixel binning, number of projections and imaging noise, as well as microstructural parameters, including grain position, grain size and grain orientation, on the strain resolution are examined. The dependence of the strain resolution on these parameters, as well as the implications for practical experiments, are discussed.

Uncovering variability in children's concepts and conceptual change

10.31234/osf.io/d9zbw ◽ 2022

Author(s): Pablo Leon-Villagra ◽ Christopher G. Lucas ◽ Daphna Buchsbaum ◽ Isaac Ehrlich

Keyword(s): Conceptual Knowledge ◽ Developmentally Appropriate ◽ Simulated Data ◽ Human Perception ◽ Developmental Studies ◽ Large Numbers ◽ Fixed Set ◽ Similarity Judgments ◽ True Structure ◽ Insight Into

Capturing the structure and development of human conceptual knowledge is a challenging but fundamental task in cognitive science. The most prominent approach to uncovering these concepts is multidimensional scaling (MDS), which has provided insight into the structure of human perception and conceptual knowledge. However, MDS usually requires participants to produce large numbers of similarity judgments, leading to prohibitively long experiments for most developmental research. Furthermore, MDS provides a single psychological space, tailored to a fixed set of stimuli. In contrast, we present a method that learns psychological spaces flexibly and generalizes to novel stimuli. In addition, our approach uses a simple, developmentally appropriate task, which allows for short and engaging developmental studies. We evaluate the feasibility of our approach on simulated data and find that it can uncover the true structure even when the data consist of aggregations of diverse categorizers. We then apply the method to data from the World Color Survey and find that it can discover language-specific color organization. Finally, we use the method in a novel developmental experiment and find age-dependent differences in conceptual spaces for fruit categories. These results suggest that our method is robust and widely applicable in developmental tasks with children as young as four years old.

Benchmarking software to predict antibiotic resistance phenotypes in shotgun metagenomes using simulated data

10.1101/2022.01.13.476279 ◽ 2022

Author(s): Emily F Wissel ◽ Brooke M Talbot ◽ Bjorn A Johnson ◽ Robert A Petit ◽ Vicki Hertzberg ◽ ...

Keyword(s): Antibiotic Resistance ◽ Open Source ◽ Simulated Data ◽ Metagenomic Data ◽ Clinical Samples ◽ Bacterial Strains ◽ Minimal Processing ◽ Genotypic Resistance ◽ Shotgun Metagenomics ◽ Bioinformatic Tools

The use of shotgun metagenomics for detecting antimicrobial resistance (AMR) is appealing because data can be generated from clinical samples with minimal processing. Detecting AMR in clinical genomic data is an important epidemiological task, yet a complex bioinformatic process. Many software tools exist to detect AMR genes, but they have mostly been tested in their detection of genotypic resistance in individual bacterial strains. It is important to understand how well these bioinformatic tools detect AMR genes in shotgun metagenomic data. We developed a software pipeline, hAMRoaster (https://github.com/ewissel/hAMRoaster), for assessing the accuracy of predictions of antibiotic resistance phenotypes. For evaluation purposes, we simulated a short-read (Illumina) shotgun metagenomics community of eight bacterial pathogens with extensive antibiotic susceptibility testing profiles. We benchmarked nine open-source bioinformatics tools for detecting AMR genes that 1) were conda or Docker installable, 2) had been actively maintained, 3) had an open-source license, and 4) took FASTA or FASTQ files as input. Several metrics were calculated for each tool, including sensitivity, specificity, and F1, at three coverage levels. This study revealed that tools were highly variable in sensitivity (0.25–0.99) and specificity (0.2–1) in detection of resistance in our synthetic FASTQ files despite similar databases and methods implemented. Tools performed similarly at all coverage levels (5x, 50x, 100x). Cohen’s kappa revealed low agreement across tools.
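
As a rough illustration of the metrics named in this abstract, the sketch below computes sensitivity, specificity and F1 from confusion-matrix counts, and Cohen's kappa for two tools' binary resistance calls. The function names and inputs are assumptions made for illustration; this is not hAMRoaster's code, and the abstract does not say how its metrics were implemented.

```python
def binary_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity and F1 from confusion-matrix counts.

    Assumes each denominator is non-zero (at least one true positive case
    and one true negative case are present).
    """
    sensitivity = tp / (tp + fn)          # share of resistant cases detected
    specificity = tn / (tn + fp)          # share of susceptible cases cleared
    f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of precision and recall
    return sensitivity, specificity, f1


def cohens_kappa(calls_a, calls_b):
    """Cohen's kappa for two equal-length lists of 0/1 calls.

    Assumes the chance-agreement term is below 1, i.e. the two tools do not
    both make the same call on every item.
    """
    n = len(calls_a)
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    p_a = sum(calls_a) / n                # fraction of positive calls, tool A
    p_b = sum(calls_b) / n                # fraction of positive calls, tool B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why tools with similar databases can still show low kappa when their calls differ on the less common resistance classes.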

Indoor Emergency Path Planning Based on the Q-Learning Optimization Algorithm

ISPRS International Journal of Geo-Information ◽ 10.3390/ijgi11010066 ◽ 2022 ◽ Vol 11 (1) ◽ pp. 66

Author(s): Shenghua Xu ◽ Yang Gu ◽ Xiaoyan Li ◽ Cai Chen ◽ Yingyi Hu ◽ ...

Keyword(s): Path Planning ◽ Shortest Path ◽ Optimization Algorithm ◽ Indoor Environment ◽ Large Scale ◽ Learning Algorithm ◽ Simulated Data ◽ Convergence Speed ◽ Grid Environment ◽ Q Learning

The internal structure of buildings is becoming increasingly complex. Providing a scientific and reasonable evacuation route for trapped persons in a complex indoor environment is important for reducing casualties and property losses. In emergency and disaster relief environments, indoor path planning has great uncertainty and higher safety requirements. Q-learning is a value-based reinforcement learning algorithm that can complete path planning tasks through autonomous learning without establishing mathematical models and environmental maps. Therefore, we propose an indoor emergency path planning method based on the Q-learning optimization algorithm. First, a grid environment model is established. The discount rate of the exploration factor is used to optimize the Q-learning algorithm, and the exploration factor in the ε-greedy strategy is dynamically adjusted before selecting random actions to accelerate the convergence of the Q-learning algorithm in a large-scale grid environment. An indoor emergency path planning experiment based on the Q-learning optimization algorithm was carried out using simulated data and real indoor environment data. The proposed Q-learning optimization algorithm essentially converges after 500 learning rounds, nearly 2000 rounds earlier than the classic Q-learning algorithm. The SARSA algorithm shows no obvious convergence trend within 5000 learning iterations. The results show that the proposed Q-learning optimization algorithm is superior to the SARSA algorithm and the classic Q-learning algorithm in terms of solving time and convergence speed when planning the shortest path in a grid environment. The convergence speed of the proposed Q-learning optimization algorithm is approximately five times faster than that of the classic Q-learning algorithm. The proposed Q-learning optimization algorithm in the grid environment can successfully plan the shortest path to avoid obstacle areas in a short time.
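
The idea of shrinking the exploration factor of an ε-greedy policy during training can be sketched as follows. This is a minimal tabular Q-learning example on a small grid, not the authors' algorithm: the multiplicative decay schedule, the reward values, the hyperparameters and all function names are assumptions chosen for illustration.

```python
import random

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(grid, state, action, goal):
    """Apply one move; grid edges and obstacle cells (value 1) block it."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c] == 1:
        return state, -5.0           # blocked: stay in place, pay a penalty
    if (r, c) == goal:
        return (r, c), 100.0         # reached the exit
    return (r, c), -1.0              # ordinary move: small step cost


def train(grid, start, goal, episodes=500, alpha=0.5, gamma=0.9,
          eps=1.0, eps_decay=0.99, eps_min=0.05, seed=0):
    """Tabular Q-learning with an exploration factor that decays per episode."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * 4
         for r in range(len(grid)) for c in range(len(grid[0]))}
    for _ in range(episodes):
        state = start
        for _ in range(4 * len(q)):              # cap on steps per episode
            if rng.random() < eps:               # explore: random action
                a = rng.randrange(4)
            else:                                # exploit: best known action
                a = max(range(4), key=lambda i: q[state][i])
            nxt, reward = step(grid, state, ACTIONS[a], goal)
            target = reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
            if state == goal:
                break
        eps = max(eps_min, eps * eps_decay)      # hypothetical decay schedule
    return q


def greedy_path(grid, q, start, goal, max_steps=50):
    """Follow the highest-valued action from each state until the goal."""
    path, state = [start], start
    while state != goal and len(path) <= max_steps:
        a = max(range(4), key=lambda i: q[state][i])
        state, _ = step(grid, state, ACTIONS[a], goal)
        path.append(state)
    return path
```

A call such as `greedy_path(grid, train(grid, (0, 0), (3, 3)), (0, 0), (3, 3))` on a 4 × 4 grid of 0s and 1s returns the list of visited cells. Early episodes explore almost at random, while later ones mostly follow the learned values, which is the general mechanism the abstract credits for faster convergence.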

Long short-term memory networks enhance rainfall-runoff modelling at the national scale of Denmark

Geological Survey of Denmark and Greenland Bulletin ◽ 10.34194/geusb.v49.8292 ◽ 2022 ◽ Vol 49

Author(s): Julian Koch ◽ Raphael Schneider

Keyword(s): Short Term Memory ◽ Simulated Data ◽ National Scale ◽ Short Term ◽ Ungauged Catchments ◽ Term Memory ◽ Convincing Argument ◽ Physically Based ◽ Long Short Term Memory ◽ Using Data

This study explores the application of long short-term memory (LSTM) networks to simulate runoff at the national scale of Denmark using data from 301 catchments. This is the first LSTM application on Danish data. The results were benchmarked against the Danish national water resources model (DK-model), a physically based hydrological model. The median Kling-Gupta Efficiency (KGE), a common metric to assess the performance of runoff predictions (optimum of 1), increased from 0.7 (DK-model) to 0.8 (LSTM) when trained against all catchments. Overall, the LSTM outperformed the DK-model in 80% of catchments. Despite the compelling KGE evaluation, the water balance closure was modelled less accurately by the LSTM. The applicability of LSTM networks for modelling ungauged catchments was assessed via a spatial split-sample experiment. A 20% spatial hold-out showed poorer performance of the LSTM than of the DK-model. However, after pre-training, that is, weight initialisation obtained from training against simulated data from the DK-model, the performance of the LSTM improved. This formed a convincing argument supporting the knowledge-guided machine learning (ML) paradigm to integrate physically based models and ML to train robust models that generalise well.
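
For readers unfamiliar with the KGE, a minimal sketch of the metric follows. It assumes the original formulation (Gupta et al., 2009), which combines correlation, a variability ratio and a bias ratio; the abstract does not state which KGE variant the study used, and the function name is hypothetical.

```python
import math


def kge(sim, obs):
    """Kling-Gupta Efficiency of simulated vs. observed runoff; 1 is optimal.

    Assumes equal-length series, non-zero variance in both, and a non-zero
    observed mean.
    """
    n = len(obs)
    mean_s, mean_o = sum(sim) / n, sum(obs) / n
    std_s = math.sqrt(sum((s - mean_s) ** 2 for s in sim) / n)
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(sim, obs)) / n
    r = cov / (std_s * std_o)     # linear correlation
    alpha = std_s / std_o         # variability ratio
    beta = mean_s / mean_o        # bias ratio
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

Because the bias ratio is only one of three terms, a model can score a high KGE while still misjudging total volumes, which is compatible with the abstract's remark that the LSTM closed the water balance less accurately despite its better KGE.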

EagleImp: Fast and Accurate Genome-wide Phasing and Imputation in a Single Tool

10.1101/2022.01.11.475810 ◽ 2022

Author(s): Lars Wienbrandt ◽ David Ellinghaus

Keyword(s): Memory Management ◽ Imputation Accuracy ◽ Simulated Data ◽ Genotype Imputation ◽ Whole Genome Sequencing Data ◽ Common Variants ◽ Sequencing Data ◽ 1000 Genomes ◽ Genome Wide ◽ Reference Genomes

Background: Reference-based phasing and genotype imputation algorithms have been developed with sublinear theoretical runtime behaviour, but runtimes are still high in practice when large genome-wide reference datasets are used. Methods: We developed EagleImp, a software tool with algorithmic and technical improvements and new features for accurate and accelerated phasing and imputation in a single tool. Results: We compared the accuracy and runtime of EagleImp with Eagle2, PBWT and prominent imputation servers using whole-genome sequencing data from the 1000 Genomes Project, the Haplotype Reference Consortium and simulated data with more than 1 million reference genomes. EagleImp is 2 to 10 times faster (depending on the single or multiprocessor configuration selected) than Eagle2/PBWT, with the same or better phasing and imputation quality in all tested scenarios. For common variants investigated in typical GWAS studies, EagleImp provides the same or higher imputation accuracy than the Sanger Imputation Service, Michigan Imputation Server and the newly developed TOPMed Imputation Server, despite their larger (not publicly available) reference panels. It has many new features, including automated chromosome splitting and memory management at runtime to avoid job aborts, fast reading and writing of large files, and various user-configurable algorithm and output options. Conclusions: Due to the technical optimisations, EagleImp can perform fast and accurate reference-based phasing and imputation for future very large reference panels with more than 1 million genomes. EagleImp is freely available for download from https://github.com/ikmb/eagleimp.

Exploring dynamic metabolomics data with multiway data analysis: a simulation study

BMC Bioinformatics ◽ 10.1186/s12859-021-04550-5 ◽ 2022 ◽ Vol 23 (1)

Author(s): Lu Li ◽ Huub Hoefsloot ◽ Albert A. de Graaf ◽ Evrim Acar ◽ Age K. Smilde

Keyword(s): Data Analysis ◽ Individual Variation ◽ Simulated Data ◽ Ground Truth ◽ Metabolomics Data ◽ Analysis Methods ◽ Multiway Data Analysis ◽ Induced Variation ◽ Underlying Mechanisms ◽ Data Analysis Methods

Background: Analysis of dynamic metabolomics data holds the promise to improve our understanding of underlying mechanisms in metabolism. For example, it may detect changes in metabolism due to the onset of a disease. Dynamic or time-resolved metabolomics data can be arranged as a three-way array with entries organized according to a subjects mode, a metabolites mode and a time mode. While such time-evolving multiway data sets are increasingly collected, revealing the underlying mechanisms and their dynamics from such data remains challenging. For such data, one of the complexities is the presence of a superposition of several sources of variation: induced variation (due to experimental conditions or inborn errors), individual variation, and measurement error. Multiway data analysis (also known as tensor factorizations) has been successfully used in data mining to find the underlying patterns in multiway data. To explore the performance of multiway data analysis methods in terms of revealing the underlying mechanisms in dynamic metabolomics data, simulated data with known ground truth can be studied. Results: We focus on simulated data arising from different dynamic models of increasing complexity, i.e., a simple linear system, a yeast glycolysis model, and a human cholesterol model. We generate data with induced variation as well as individual variation. Systematic experiments are performed to demonstrate the advantages and limitations of multiway data analysis in analyzing such dynamic metabolomics data and their capacity to disentangle the different sources of variations. We choose to use simulations since we want to understand the capability of multiway data analysis methods, which is facilitated by knowing the ground truth. Conclusion: Our numerical experiments demonstrate that despite the increasing complexity of the studied dynamic metabolic models, tensor factorization methods CANDECOMP/PARAFAC (CP) and Parallel Profiles with Linear Dependences (Paralind) can disentangle the sources of variations and thereby reveal the underlying mechanisms and their dynamics.

Systematic comparison of ranking aggregation methods for gene lists in experimental results

10.1101/2022.01.09.475491 ◽ 2022

Author(s): Bo Wang ◽ Andy Law ◽ Tim Regan ◽ Nicholas Parkinson ◽ Joby Cole ◽ ...

Keyword(s): Meta Analysis ◽ Simulated Data ◽ Real Data ◽ Biomedical Science ◽ Aggregation Method ◽ Aggregation Methods ◽ The Common ◽ Ranking Aggregation ◽ Meta Analyses ◽ Answer Ranking

A common experimental output in biomedical science is a list of genes implicated in a given biological process or disease. The results of a group of studies answering the same, or similar, questions can be combined by meta-analysis to find a consensus or a more reliable answer. Ranking aggregation methods can be used to combine gene lists from various sources in meta-analyses. Evaluating a ranking aggregation method on a specific type of dataset before using it is required to support the reliability of the result, since the properties of a dataset can influence the performance of an algorithm. Evaluation of aggregation methods is usually based on a simulated database, especially for algorithms designed for gene lists, because of the lack of a known truth for real data. However, simulated datasets tend to be too small compared to experimental data and neglect key features, including heterogeneity of quality, relevance and the inclusion of unranked lists. In this study, a group of existing methods and their variations which are suitable for meta-analysis of gene lists are compared using simulated and real data. Simulated data were used to explore the performance of the aggregation methods under conditions emulating common scenarios of real genomics data, with varying heterogeneity of quality and noise level, and a mix of unranked and ranked data, using 20,000 possible entities. In addition to the evaluation with simulated data, a comparison using real genomic data on the SARS-CoV-2 virus, cancer (NSCLC), and bacteria (macrophage apoptosis) was performed. We summarise our evaluation results in terms of a simple flowchart to select a ranking aggregation method for genomics data.

Identifying SARS-CoV-2 regional introductions and transmission clusters in real time

10.1101/2022.01.07.22268918 ◽ 2022

Author(s): Jakob McBroome ◽ Jennifer Martin ◽ Adriano de Bernardi Schneider ◽ Yatish Turakhia ◽ Russell Corbett-Detig

Keyword(s): Public Health ◽ Computational Models ◽ Simulated Data ◽ The United States ◽ Data Exploration ◽ Public Health Action ◽ Web Based ◽ Geographic Origins ◽ Health Action ◽ Effective Public Health

The unprecedented SARS-CoV-2 global sequencing effort has suffered from an analytical bottleneck. Many existing methods for phylogenetic analysis are designed for sparse, static datasets and are too computationally expensive to apply to densely sampled, rapidly expanding datasets when results are needed immediately to inform public health action. For example, public health is often concerned with identifying clusters of closely related samples, but the sheer scale of the data prevents manual inspection and the current computational models are often too expensive in time and resources. Even when results are available, intuitive data exploration tools are of critical importance to effective public health interpretation and action. To help address this need, we present a phylogenetic summary statistic which quickly and efficiently identifies newly introduced strains in a region, the resulting clusters of infected individuals, and their putative geographic origins. We show that this approach performs well on simulated data and is congruent with a more sophisticated analysis performed during the pandemic. We also introduce Cluster Tracker (https://clustertracker.gi.ucsc.edu/), a novel interactive web-based tool to facilitate effective and intuitive SARS-CoV-2 geographic data exploration and visualization. Cluster Tracker is updated daily and automatically identifies and highlights groups of closely related SARS-CoV-2 infections resulting from inter-regional transmission across the United States, streamlining public health tracking of local viral diversity and emerging infection clusters. The combination of these open-source tools will empower detailed investigations of the geographic origins and spread of SARS-CoV-2 and other densely sampled pathogens.

The Water Yield Pattern for Annual and Monthly Scales Through a Unifying Catchment Water Balance Model

10.21203/rs.3.rs-1193877/v1 ◽ 2022

Author(s): Dedi Liu ◽ Dezhi Fu

Keyword(s): Water Resources ◽ Water Balance ◽ Water Resources Management ◽ Simulated Data ◽ Water Balance Model ◽ Balance Model ◽ Water Yield ◽ Short Term ◽ Resources Management

Long-term scheduling and short-term decision-making for water resources management often require understanding the relationship between the water yield patterns at the annual and monthly scales. As the water yield pattern mainly depends on land cover/use and climate, a unifying catchment water balance model with factors has been adopted to derive a theoretical water yield pattern at annual and monthly scales. Two critical values of the parameters, ε=1-√2/2 and ϕ=1.0, are identified. The parameter ε, which reflects changes in water storage (land use/cover) and evaporation (climate), can contribute more to water yield than ϕ when ϕ>1.0, especially when ε<1-√2/2, but it contributes less when ϕ<1.0. The derived theoretical water yield patterns have also been validated against observed data or data simulated by the hydrological model. Due to bias in the soil moisture data, many of the estimated values of ε fall outside its theoretical range, especially at the monthly scale in humid basins. The derived theoretical water yield pattern performs much better at the annual scale than at the monthly scale, while only a few data sets from the arid basin fall within their theoretical ranges in every month. Although the relative contribution of ε is found to be larger than that of ϕ where ε<1-√2/2 and ϕ>1.0, there are no significant linear relationships between the annual and monthly values of ε and ϕ. Our results not only validate the derived theoretical water yield pattern using parameters estimated directly from observed or simulated data rather than calibrated parameters, but can also guide further understanding of the physics of water balance across time scales, in support of combined long-term and short-term water resources management.

Three-dimensional grain resolved strain mapping using laboratory X-ray diffraction contrast tomography: theoretical analysis
