A day at the races

Applied Intelligence ◽

10.1007/s10489-021-02719-2 ◽

2021 ◽

Author(s):

David E. Losada ◽

David Elsweiler ◽

Morgan Harvey ◽

Christoph Trattner

Keyword(s):

Simulated Data ◽

User Studies ◽

Data Sets ◽

Crowd Sourcing ◽

Experimental Conditions ◽

Design Studies ◽

Natural Solution ◽

The Cost ◽

Simulated Data Sets ◽

Full Trial

Abstract: Two major barriers to conducting user studies are the costs involved in recruiting participants and researcher time in performing studies. Typical solutions are to study convenience samples or design studies that can be deployed on crowd-sourcing platforms. Both solutions have benefits but also drawbacks. Even in cases where these approaches make sense, it is still reasonable to ask whether we are using our resources – participants’ and our time – efficiently and whether we can do better. Typically, user studies compare randomly assigned experimental conditions, such that a uniform number of opportunities is assigned to each condition. This sampling approach, as has been demonstrated in clinical trials, is sub-optimal. The goal of many Information Retrieval (IR) user studies is to determine which strategy (e.g., behaviour or system) performs the best. In such a setup, it is not wise to waste participant and researcher time and money on conditions that are obviously inferior. In this work we explore whether Best Arm Identification (BAI) algorithms provide a natural solution to this problem. BAI methods are a class of Multi-armed Bandits (MABs) where the only goal is to output a recommended arm and the algorithms are evaluated by the average payoff of the recommended arm. Using three datasets associated with previously published IR-related user studies and a series of simulations, we test the extent to which the cost required to run user studies can be reduced by employing BAI methods. Our results suggest that some BAI instances (racing algorithms) are promising devices to reduce the cost of user studies. One of the racing algorithms studied, Hoeffding, holds particular promise. This algorithm offered consistent savings across both the real and simulated data sets and only extremely rarely returned a result inconsistent with the result of the full trial. We believe the results can have an important impact on the way research is performed in this field. The results show that the conditions assigned to participants could be dynamically changed, automatically, to make efficient use of participant and experimenter time.
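To make the racing idea concrete, the following is a minimal sketch of a Hoeffding-style race, not the authors' implementation. It assumes rewards bounded in [0, 1], pulls every surviving condition once per round, and drops a condition once its upper confidence bound falls below the best lower bound. The function name `hoeffding_race`, its parameters, and the use of a single uncorrected `delta` for all arms and rounds are illustrative assumptions; a careful implementation would adjust `delta` for multiple comparisons.

```python
import math
import random


def hoeffding_race(arms, delta=0.05, max_rounds=1000, reward_range=1.0):
    """Return (index of recommended arm, total number of pulls).

    arms: list of zero-argument callables, each returning a reward
          in [0, reward_range] (hypothetical interface for this sketch).
    """
    n_arms = len(arms)
    alive = set(range(n_arms))   # conditions still in the race
    sums = [0.0] * n_arms        # cumulative reward per arm
    pulls = 0

    for n in range(1, max_rounds + 1):
        # Every surviving arm gets exactly one more observation per round,
        # so each survivor has been pulled n times after round n.
        for i in alive:
            sums[i] += arms[i]()
            pulls += 1

        # Hoeffding half-width for a mean of n bounded observations.
        eps = reward_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
        means = {i: sums[i] / n for i in alive}
        best_lower = max(means.values()) - eps

        # Keep only arms whose upper bound still reaches the best lower bound.
        alive = {i for i in alive if means[i] + eps >= best_lower}
        if len(alive) == 1:
            break

    best = max(alive, key=lambda i: sums[i])
    return best, pulls


if __name__ == "__main__":
    rng = random.Random(0)
    # Three simulated conditions with Bernoulli success rates (made-up values).
    rates = [0.2, 0.5, 0.8]
    arms = [lambda p=p: 1.0 if rng.random() < p else 0.0 for p in rates]
    best, pulls = hoeffding_race(arms)
    print("recommended arm:", best, "after", pulls, "pulls")
```

The saving relative to a uniform design comes from the elimination step: once a condition is dropped, no further participants are assigned to it.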

AdImpute: An Imputation Method for Single-Cell RNA-Seq Data Based on Semi-Supervised Autoencoders

Frontiers in Genetics ◽

10.3389/fgene.2021.739677 ◽

2021 ◽

Vol 12 ◽

Author(s):

Li Xu ◽

Yin Xu ◽

Tong Xue ◽

Xinyu Zhang ◽

Jin Li

Keyword(s):

Single Cell ◽

Missing Values ◽

Simulated Data ◽

Real Data ◽

Imputation Method ◽

Data Sets ◽

Silent Genes ◽

Downstream Analysis ◽

The Cost ◽

Simulated Data Sets

Motivation: The emergence of single-cell RNA sequencing (scRNA-seq) technology has paved the way for measuring RNA levels at single-cell resolution to study precise biological functions. However, the large number of missing values in scRNA-seq data affects downstream analysis. This paper presents AdImpute, an imputation method based on semi-supervised autoencoders. The method uses the output of another imputation method (DrImpute is used as an example) as imputation weights for the autoencoder, and applies a cost function with these imputation weights to learn the latent information in the data and achieve more accurate imputation.

Results: As shown in clustering experiments with the simulated and real data sets, AdImpute is more accurate than four other publicly available scRNA-seq imputation methods, and minimally modifies the biologically silent genes. Overall, AdImpute is an accurate and robust imputation method.

Spectral Convolution Feature-Based SPD Matrix Representation for Signal Detection Using a Deep Neural Network

Entropy ◽

10.3390/e22090949 ◽

2020 ◽

Vol 22 (9) ◽

pp. 949

Author(s):

Jiangyi Wang ◽

Min Liu ◽

Xinwu Zeng ◽

Xiaoqiang Hua

Keyword(s):

Neural Network ◽

Signal Detection ◽

Convolutional Neural Network ◽

Deep Neural Network ◽

Detection Method ◽

Learning Algorithm ◽

Simulated Data ◽

Data Sets ◽

Feature Maps ◽

Simulated Data Sets

Convolutional neural networks perform strongly in many visual tasks because of their hierarchical structures and powerful feature extraction capabilities. The SPD (symmetric positive definite) matrix has attracted attention in visual classification because of its excellent ability to learn a proper statistical representation and to distinguish samples carrying different information. In this paper, a deep neural network signal detection method based on spectral convolution features is proposed. In this method, local features extracted from a convolutional neural network are used to construct the SPD matrix, and a deep learning algorithm for the SPD matrix is used to detect target signals. Feature maps extracted by two kinds of convolutional neural network models are applied in this study. With this method, signal detection becomes a binary classification problem over the signals in the samples. To demonstrate the effectiveness and superiority of this method, simulated and semi-physical simulated data sets are used. The results show that, under low SCR (signal-to-clutter ratio), compared with the spectral signal detection method based on the deep neural network, this method obtains a gain of 0.5–2 dB on both simulated and semi-physical simulated data sets.
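The abstract does not say how the SPD matrix is built from the local features. One common construction, shown here purely as an assumption, is a regularised sample covariance of the feature vectors: the covariance is symmetric positive semi-definite, and adding a small multiple of the identity makes it strictly positive definite. The names `covariance_spd` and `is_spd` and the value of `eps` are hypothetical, and the sketch uses plain Python lists rather than any particular deep learning library.

```python
import math


def covariance_spd(features, eps=1e-6):
    """Build a d x d SPD matrix from N >= 2 local feature vectors of length d.

    Assumed construction: sample covariance plus eps * identity.
    """
    n = len(features)
    d = len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]

    cov = [[0.0] * d for _ in range(d)]
    for f in features:
        centred = [f[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += centred[a] * centred[b] / (n - 1)

    # Regularise the diagonal so the matrix is strictly positive definite.
    for a in range(d):
        cov[a][a] += eps
    return cov


def is_spd(m, tol=1e-12):
    """Check symmetry, then positive definiteness via a Cholesky attempt."""
    d = len(m)
    for i in range(d):
        for j in range(i):
            if abs(m[i][j] - m[j][i]) > tol:
                return False

    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = m[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # non-positive pivot: not positive definite
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return True


if __name__ == "__main__":
    # Four made-up 2-dimensional local features, e.g. one per spatial location.
    feats = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
    spd = covariance_spd(feats)
    print(spd, is_spd(spd))
```

In a pipeline like the one described, each row of `features` would correspond to one spatial location of a feature map, and the resulting matrix would be passed to the SPD-matrix learning stage.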

Benchmarking Statistical Multiple Sequence Alignment

10.1101/304659 ◽

2018 ◽

Cited By ~ 1

Author(s):

Michael Nute ◽

Ehsan Saleh ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structural Alignment ◽

Estimation Method ◽

Simulated Data ◽

Protein Sequences ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Simulated Data Sets

Abstract: The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical co-estimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical co-estimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy is dramatically more accurate than the other alignment methods on the simulated data sets, but is among the least accurate on the biological benchmarks. There are several potential causes for this discordance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments; future research is needed to understand the most likely explanation for our observations.

Keywords: multiple sequence alignment, BAli-Phy, protein sequences, structural alignment, homology

Bayesian Planet Searches for the 10 cm/s Radial Velocity Era

Proceedings of the International Astronomical Union ◽

10.1017/s1743921316002817 ◽

2015 ◽

Vol 11 (A29A) ◽

pp. 205-207

Author(s):

Philip C. Gregory

Keyword(s):

Radial Velocity ◽

State Of The Art ◽

Simulated Data ◽

Model Parameters ◽

Data Sets ◽

Stellar Activity ◽

Bayesian Fusion ◽

Multiple State ◽

Simulated Data Sets ◽

Apodization Function

Abstract: A new apodized Keplerian model is proposed for the analysis of precision radial velocity (RV) data to model both planetary and stellar activity (SA) induced RV signals. A symmetrical Gaussian apodization function with unknown width and center can distinguish planetary signals from SA signals on the basis of the width of the apodization function. The general model for m apodized Keplerian signals also includes a linear regression term between RV and the stellar activity diagnostic ln(R'HK), as well as an extra Gaussian noise term with unknown standard deviation. The model parameters are explored using a Bayesian fusion MCMC code. A differential version of the Generalized Lomb-Scargle periodogram provides an additional way of distinguishing SA signals and helps guide the choice of new periods. Sample results are reported for a recent international RV blind challenge which included multiple state-of-the-art simulated data sets supported by a variety of stellar activity diagnostics.
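The structure of the model can be illustrated with a deliberately simplified sketch. It assumes circular orbits, so each Keplerian term reduces to a sinusoid, and it omits the extra noise term, which belongs to the likelihood rather than to the deterministic model. The function names, parameter names, and numerical values are hypothetical; only the Gaussian apodization window and the linear term in the activity diagnostic come from the abstract.

```python
import math


def apodized_sinusoid(t, amplitude, period, phase, center, width):
    """Sinusoidal RV signal multiplied by a symmetric Gaussian window.

    Circular-orbit simplification (an assumption of this sketch):
    a full Keplerian term would also depend on eccentricity.
    """
    window = math.exp(-((t - center) ** 2) / (2.0 * width ** 2))
    return window * amplitude * math.cos(2.0 * math.pi * t / period + phase)


def rv_model(t, signals, beta, ln_rhk, offset=0.0):
    """Sum of m apodized signals plus a linear term in the activity index.

    signals: list of dicts with the keyword arguments of apodized_sinusoid.
    beta:    regression coefficient between RV and ln(R'HK).
    ln_rhk:  value of the activity diagnostic at time t.
    """
    total = offset + beta * ln_rhk
    for s in signals:
        total += apodized_sinusoid(t, **s)
    return total


if __name__ == "__main__":
    # A wide window (long-lived signal) versus a narrow one (short-lived signal).
    wide = dict(amplitude=2.0, period=10.0, phase=0.0, center=0.0, width=500.0)
    narrow = dict(amplitude=2.0, period=25.0, phase=0.0, center=0.0, width=5.0)
    for t in (0.0, 50.0, 100.0):
        print(t, rv_model(t, [wide, narrow], beta=0.5, ln_rhk=-4.5))
```

Under this reading, a signal whose fitted window is much wider than the observing span behaves like an unapodized periodic term, while a narrow window marks a signal that is present for only part of the data; the abstract states that the width is what separates planetary from SA signals.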

A comparison of procedures for classifying remotely-sensed data using simulated data sets incorporating autocorrelations between spectral responses

International Journal of Remote Sensing ◽

10.1080/01431169208904073 ◽

1992 ◽

Vol 13 (14) ◽

pp. 2701-2725 ◽

Cited By ~ 3

Author(s):

J. D. WILSON

Keyword(s):

Simulated Data ◽

Remotely Sensed ◽

Data Sets ◽

Remotely Sensed Data ◽

Simulated Data Sets ◽

Spectral Responses

Erratum to: The Use of Geographically Weighted Regression for Spatial Prediction: An Evaluation of Models Using Simulated Data Sets

Mathematical Geosciences ◽

10.1007/s11004-011-9323-z ◽

2011 ◽

Vol 43 (3) ◽

pp. 399-399 ◽

Cited By ~ 1

Author(s):

P. Harris ◽

A. S. Fotheringham ◽

R. Crespo ◽

M. Charlton

Keyword(s):

Geographically Weighted Regression ◽

Simulated Data ◽

Spatial Prediction ◽

Weighted Regression ◽

Data Sets ◽

Simulated Data Sets

An evaluation of HapMap sample size and tagging SNP performance in large-scale empirical and simulated data sets

Nature Genetics ◽

10.1038/ng1670 ◽

2005 ◽

Vol 37 (12) ◽

pp. 1320-1322 ◽

Cited By ~ 76

Author(s):

Eleftheria Zeggini ◽

William Rayner ◽

Andrew P Morris ◽

Andrew T Hattersley ◽

Mark Walker ◽

...

Keyword(s):

Sample Size ◽

Large Scale ◽

Simulated Data ◽

Data Sets ◽

Hapmap Sample ◽

Tagging Snp ◽

Simulated Data Sets

The Use of Geographically Weighted Regression for Spatial Prediction: An Evaluation of Models Using Simulated Data Sets

Mathematical Geosciences ◽

10.1007/s11004-010-9284-7 ◽

2010 ◽

Vol 42 (6) ◽

pp. 657-680 ◽

Cited By ~ 89

Author(s):

P. Harris ◽

A. S. Fotheringham ◽

R. Crespo ◽

M. Charlton

Keyword(s):

Geographically Weighted Regression ◽

Simulated Data ◽

Spatial Prediction ◽

Weighted Regression ◽

Data Sets ◽

Simulated Data Sets

Comparison of single-nucleotide polymorphisms and microsatellite markers for linkage analysis in the COGA and simulated data sets for Genetic Analysis Workshop 14: Presentation Groups 1, 2, and 3

Genetic Epidemiology ◽

10.1002/gepi.20106 ◽

2005 ◽

Vol 29 (S1) ◽

pp. S7-S28 ◽

Cited By ~ 22

Author(s):

Marsha A. Wilcox ◽

Elizabeth W. Pugh ◽

Heping Zhang ◽

Xiaoyun Zhong ◽

Douglas F. Levinson ◽

...

Keyword(s):

Single Nucleotide Polymorphisms ◽

Linkage Analysis ◽

Genetic Analysis ◽

Microsatellite Markers ◽

Genetic Analysis Workshop ◽

Simulated Data ◽

Data Sets ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Simulated Data Sets

Evaluating Phylogenetic Informativeness as a Predictor of Phylogenetic Signal for Metazoan, Fungal, and Mammalian Phylogenomic Data Sets

BioMed Research International ◽

10.1155/2013/621604 ◽

2013 ◽

Vol 2013 ◽

pp. 1-14 ◽

Cited By ~ 14

Author(s):

Francesc López-Giráldez ◽

Andrew H. Moeller ◽

Jeffrey P. Townsend

Keyword(s):

Phylogenetic Signal ◽

Simulated Data ◽

Quantitative Measure ◽

Data Sets ◽

Phylogenetic Informativeness ◽

Phylogenetic Resolution ◽

Taxonomic Groups ◽

Diverse Groups ◽

Simulated Data Sets ◽

Selection Of

Phylogenetic research is often stymied by selection of a marker that leads to poor phylogenetic resolution despite considerable cost and effort. Profiles of phylogenetic informativeness provide a quantitative measure for prioritizing gene sampling to resolve branching order in a particular epoch. To evaluate the utility of these profiles, we analyzed phylogenomic data sets from metazoans, fungi, and mammals, thus encompassing diverse time scales and taxonomic groups. We also evaluated the utility of profiles created based on simulated data sets. We found that genes selected via their informativeness dramatically outperformed haphazard sampling of markers. Furthermore, our analyses demonstrate that the original phylogenetic informativeness method can be extended to trees with more than four taxa. Thus, although the method currently predicts phylogenetic signal without specifically accounting for the misleading effects of stochastic noise, it is robust to the effects of homoplasy. The phylogenetic informativeness rankings obtained will allow other researchers to select advantageous genes for future studies within these clades, maximizing return on effort and investment. Genes identified might also yield efficient experimental designs for phylogenetic inference for many sister clades and outgroup taxa that are closely related to the diverse groups of organisms analyzed.