false discoveries
Recently Published Documents


TOTAL DOCUMENTS: 123 (five years: 48)
H-INDEX: 23 (five years: 4)

2022 ◽ Author(s): Rodney T Richardson

Metagenetic methods are commonplace within ecological and environmental research. One concern with these methods is the phenomenon of critical mistagging, where sequences from one sample are erroneously inferred to have originated from another sample due to errors in the attachment, PCR replication or sequencing of sample-specific dual-index tags. For studies using PCR-based library preparation on large sample sizes, the most cost-effective approach to limiting mistag-associated false detections involves using an unsaturated Latin square dual-indexing design. This allows researchers to estimate mistagging rates during sequencing, but the statistical procedures for filtering out detections using this mistag rate have received little attention. We propose a straightforward method to limit mistag-associated false discoveries in metabarcoding applications. We analyzed two Illumina metabarcoding datasets produced using unsaturated Latin square designs to explore the distribution of mistagged sequences across dual-index combinations on a per-taxon basis. We tested these data for conformity to the assumptions that 1) mistagging follows a binomial distribution [i.e., X ~ B(n, p)], where p, the probability of a sequence being mistagged, varies minimally across taxa, and 2) mistags are distributed uniformly across dual-index combinations. We provide R functions that estimate the 95th percentile of expected mistags per dual-index combination for each taxon under these assumptions. We show that mistagging rates were consistent across taxa within the datasets analyzed and that modelling mistagging as a binomial process with a uniform distribution across dual-index combinations enabled robust control of mistag-associated false discoveries. We propose that this method of taxon-specific filtering of detections, based on the maximum number of mistags expected per dual-index combination, should be broadly adopted in metagenetic analysis, provided that experimental and control sequence abundances per taxon are strongly correlated. When this assumption is violated, the data may be better fit by assuming that the distribution of mistags across combinations follows Poisson characteristics [i.e., X ~ Pois(λ)], with λ empirically estimated from the abundance distribution of mistags among control samples. We provide a second R function for this case, though we have yet to observe such a dataset. Both functions and demonstrations associated with this work are freely available at https://github.com/RTRichar/ModellingCriticalMistags.
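
As a rough illustration of the filtering rule described above (a sketch only, not the authors' R functions; the function names and example numbers are hypothetical), the maximum expected mistag count per dual-index combination can be taken as the 95th percentile of a binomial distribution, with a Poisson alternative when the uniformity assumption fails:

```python
from scipy.stats import binom, poisson

def binomial_mistag_threshold(n_reads_taxon, p_mistag, n_combinations, quantile=0.95):
    """Upper bound on mistags per dual-index combination for one taxon,
    assuming mistags are binomial and spread uniformly across combinations."""
    # probability that a given read is a mistag AND falls in one particular combination
    p_per_combination = p_mistag / n_combinations
    return binom.ppf(quantile, n_reads_taxon, p_per_combination)

def poisson_mistag_threshold(mean_mistags_per_combination, quantile=0.95):
    """Alternative threshold when mistag counts per combination are better
    described by a Poisson rate estimated from the control combinations."""
    return poisson.ppf(quantile, mean_mistags_per_combination)

# Hypothetical example: 200,000 reads assigned to a taxon, a 1% mistag rate,
# and 96 dual-index combinations.
threshold = binomial_mistag_threshold(200_000, 0.01, 96)
print(f"treat detections of this taxon with <= {threshold:.0f} reads as possible mistags")
```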


Significance ◽ 2021 ◽ Vol 18 (6) ◽ pp. 22-25
Author(s): David H. Bailey ◽ Marcos López de Prado

2021 ◽ Vol 12 (1)
Author(s): M. Büttner ◽ J. Ostner ◽ C. L. Müller ◽ F. J. Theis ◽ B. Schubert

Abstract: Compositional changes of cell types are main drivers of biological processes. Their detection through single-cell experiments is difficult due to the compositionality of the data and low sample sizes. We introduce scCODA (https://github.com/theislab/scCODA), a Bayesian model that addresses these issues and enables the study of complex cell-type effects in disease and in response to other stimuli. scCODA demonstrated excellent detection performance while reliably controlling for false discoveries, and it identified experimentally verified cell-type changes that were missed in the original analyses.
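
A toy numeric example (my own illustration, not the scCODA model) of the compositional problem the abstract refers to: because single-cell data report relative abundances, an absolute expansion of one cell type makes every other type look depleted, which naive per-type tests will flag as spurious changes.

```python
import numpy as np

control   = np.array([500, 300, 200])   # absolute cell counts per type, condition A
treatment = np.array([1500, 300, 200])  # only the first cell type expands

for label, counts in [("control", control), ("treatment", treatment)]:
    proportions = counts / counts.sum()
    print(label, np.round(proportions, 3))
# control   [0.5  0.3  0.2 ]
# treatment [0.75 0.15 0.1 ]  -> the unchanged types appear depleted
```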


2021 ◽ Author(s): Armin Bunde ◽ Josef Ludescher ◽ Hans Joachim Schellnhuber

Abstract: We consider trends in the m seasonal subrecords of a record. To determine the statistical significance of the m trends, one usually determines the p-value of each season, either numerically or analytically, and compares it with a significance level $\tilde{\alpha}$. We show in great detail, for short- and long-term persistent records, that this procedure, which is standard in climate science, is inadequate since it produces too many false positives (false discoveries). We specify, on the basis of the family-wise error rate and by adapting ideas from multiple-testing correction approaches, how the procedure must be changed to obtain more suitable significance criteria for the m trends. Our analysis is valid for data with all kinds of persistence. Specifically for long-term persistent data, we derive simple analytical expressions for the quantities of interest, which make it easy to determine the statistical significance of a trend in a seasonal record. As an application, we focus on data from 17 Antarctic stations. We show that only four trends in the seasonal temperature data lie outside the bounds of natural variability, in marked contrast to earlier conclusions.
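
A minimal sketch of the family-wise correction idea the authors adapt, shown here for m independent seasonal tests with the standard Bonferroni and Šidák adjustments; the paper's criteria additionally handle short- and long-term persistence, which this sketch does not attempt. All numbers below are hypothetical.

```python
def corrected_levels(alpha=0.05, m=4):
    """Per-test significance levels controlling the family-wise error rate
    across m independent tests."""
    bonferroni = alpha / m
    sidak = 1.0 - (1.0 - alpha) ** (1.0 / m)
    return bonferroni, sidak

seasonal_p_values = [0.030, 0.200, 0.008, 0.600]   # hypothetical, one per season
bonf, sidak = corrected_levels(alpha=0.05, m=len(seasonal_p_values))
print(f"uncorrected: {[p <= 0.05 for p in seasonal_p_values]}")
print(f"Sidak ({sidak:.4f}): {[p <= sidak for p in seasonal_p_values]}")
# naive testing flags two seasons; the corrected level retains only p = 0.008
```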


2021 ◽ Vol 12 (1)
Author(s): Jordan W. Squair ◽ Matthieu Gautier ◽ Claudia Kathe ◽ Mark A. Anderson ◽ Nicholas D. James ◽ ...

Abstract: Differential expression analysis in single-cell transcriptomics enables the dissection of cell-type-specific responses to perturbations such as disease, trauma, or experimental manipulations. While many statistical methods are available to identify differentially expressed genes, the principles that distinguish these methods, and how they perform, remain unclear. Here, we show that the relative performance of these methods is contingent on their ability to account for variation between biological replicates. Methods that ignore this inevitable variation are biased and prone to false discoveries. Indeed, the most widely used methods can discover hundreds of differentially expressed genes in the absence of biological differences. To exemplify these principles, we exposed true and false discoveries of differentially expressed genes in the injured mouse spinal cord.
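
A minimal sketch, under assumptions, of the replicate-aware strategy implied above: aggregate cells into one "pseudobulk" value per biological replicate before testing, so that cell-to-cell pseudoreplication cannot masquerade as biological signal. The simulated setup and the use of a t-test are illustrative, not the authors' pipeline.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Three control and three treated mice, 200 cells each, one gene, NO true
# treatment effect; each mouse has its own baseline expression level.
control_cells = [rng.poisson(rng.gamma(20, 0.1), size=200) for _ in range(3)]
treated_cells = [rng.poisson(rng.gamma(20, 0.1), size=200) for _ in range(3)]

# Biased: pool all cells and ignore which mouse they came from (n is inflated
# to 600 vs 600, so mouse-to-mouse variation can look like a treatment effect).
cell_level_p = ttest_ind(np.concatenate(control_cells),
                         np.concatenate(treated_cells)).pvalue

# Replicate-aware: one pseudobulk summary per mouse, tested with n = 3 vs 3.
pseudobulk_p = ttest_ind([c.mean() for c in control_cells],
                         [t.mean() for t in treated_cells]).pvalue

print(f"cell-level p = {cell_level_p:.3g}, replicate-level p = {pseudobulk_p:.3g}")
```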


Author(s): Oliver Gutiérrez-Hernández ◽ Luis Ventura García

Multiplicity arises when data analysis involves multiple simultaneous inferences, increasing the chance of spurious findings. It is a widespread problem that is frequently ignored by researchers. In this paper, we perform an exploratory analysis of the Web of Science database for COVID-19 observational studies. We examined 100 top-cited COVID-19 peer-reviewed articles based on p-values; these included up to 7100 simultaneous tests, with 50% of papers reporting more than 34 tests and 20% reporting more than 100 tests. We found that the larger the number of tests performed, the larger the number of significant results (r = 0.87, p < 10⁻⁶). The number of p-values in the abstracts was not related to the number of p-values in the papers. However, the number of highly significant results (p < 0.001) in the abstracts was strongly correlated (r = 0.61, p < 10⁻⁶) with the number of p < 0.001 significances in the papers. Furthermore, the abstracts included a higher proportion of significant results (0.91 vs. 0.50), and 80% reported only significant results. Only one of the reviewed papers addressed multiplicity-induced type I error inflation, pointing to potentially spurious results bypassing the peer-review process. We conclude that special attention needs to be paid to the increased chance of false discoveries in observational studies, including non-replicated striking discoveries with a potentially large social impact. We propose some easy-to-implement measures to assess and limit the effects of multiplicity.
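
A back-of-the-envelope calculation of the effect the study quantifies: with m independent tests at level α and no true effects, the expected number of nominally significant results grows linearly in m, and the chance of at least one false discovery rises rapidly toward 1. The test counts below echo the 34, 100 and 7100 figures reported above.

```python
def multiplicity_effect(m, alpha=0.05):
    """Expected false positives, and the chance of at least one, for m
    independent tests at level alpha when no true effects exist."""
    expected_false_positives = m * alpha
    prob_at_least_one = 1.0 - (1.0 - alpha) ** m
    return expected_false_positives, prob_at_least_one

for m in (10, 34, 100, 7100):
    expected, p_any = multiplicity_effect(m)
    print(f"m = {m:5d}: expect {expected:6.1f} false positives, "
          f"P(at least one) = {p_any:.3f}")
```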


Author(s): Tristan Mary-Huard ◽ Sarmistha Das ◽ Indranil Mukhopadhyay ◽ Stephane Robin

Abstract
Motivation: Combining the results of different experiments to exhibit complex patterns or to improve statistical power is a typical aim of data integration. The starting point of the statistical analysis often comes as sets of p-values resulting from previous analyses, which need to be combined in a flexible way to explore complex hypotheses while guaranteeing a low proportion of false discoveries.
Results: We introduce the generic concept of a composed hypothesis, which corresponds to an arbitrary complex combination of simple hypotheses. We rephrase the problem of testing a composed hypothesis as a classification task and show that finding items for which the composed null hypothesis is rejected boils down to fitting a mixture model and classifying the items according to their posterior probabilities. We show that inference can be performed efficiently and provide a classification rule that controls the type I error. The performance and usefulness of the approach are illustrated on simulations and on two different applications. The method is scalable, does not require any parameter tuning, and provides valuable biological insight in the considered application cases.
Availability: The QCH methodology is implemented in the qch R package hosted on CRAN.
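
A minimal sketch (not the qch package API) of the posterior-based rejection step the Results section describes: once a mixture model supplies, for each item, the posterior probability that its composed null hypothesis holds, reject the largest set of items whose average posterior null probability stays below the target false discovery rate. The input posteriors below are hypothetical.

```python
import numpy as np

def reject_by_posterior(post_null_prob, fdr=0.05):
    """Reject the largest set of items whose mean posterior probability of the
    composed null stays below the target FDR. Returns a boolean mask."""
    post_null_prob = np.asarray(post_null_prob, dtype=float)
    order = np.argsort(post_null_prob)                          # most promising items first
    running_mean = np.cumsum(post_null_prob[order]) / np.arange(1, len(order) + 1)
    n_reject = int(np.sum(running_mean <= fdr))                 # longest admissible prefix
    rejected = np.zeros(post_null_prob.size, dtype=bool)
    rejected[order[:n_reject]] = True
    return rejected

# Hypothetical posterior null probabilities for eight items.
posteriors = np.array([0.001, 0.02, 0.04, 0.10, 0.30, 0.55, 0.80, 0.97])
print(reject_by_posterior(posteriors, fdr=0.05))   # the four smallest posteriors are rejected
```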


PLoS ONE ◽ 2021 ◽ Vol 16 (7) ◽ pp. e0255240
Author(s): Shoaib Bin Masud ◽ Conor Jenkins ◽ Erika Hussey ◽ Seth Elkin-Frankston ◽ Phillip Mach ◽ ...

Metabolomic data-processing pipelines have improved in recent years, allowing for greater feature extraction and identification. Lately, machine learning and robust statistical techniques to control false discoveries are being incorporated into metabolomic data analysis. In this paper, we introduce one such recently developed technique, aggregate knockoff filtering, to untargeted metabolomic analysis. When applied to a publicly available dataset, aggregate knockoff filtering combined with typical p-value filtering increases the number of significantly changing metabolites by 25% compared to conventional untargeted metabolomic data processing. By using this method, features that would normally not be extracted under standard processing are brought to researchers' attention for further analysis.
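
For context, a minimal sketch of the single-run knockoff selection rule on which aggregate knockoff filtering builds (this is not the aggregation step itself): given one importance statistic W per feature, positive when the real feature outperforms its knockoff copy, select the smallest threshold whose estimated false discovery proportion is below the target level q. The statistics below are hypothetical.

```python
import numpy as np

def knockoff_select(W, q=0.1):
    """Select features by the knockoff+ threshold: the smallest t for which
    (1 + #{W <= -t}) / #{W >= t} <= q. Returns a boolean mask."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):
        fdp_estimate = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_estimate <= q:
            return W >= t
    return np.zeros(W.shape, dtype=bool)   # no threshold meets the target level

# Hypothetical importance statistics for ten metabolite features.
W = np.array([5.2, 3.1, -0.4, 2.8, 0.9, -1.1, 4.0, 0.2, -0.3, 6.5])
print(knockoff_select(W, q=0.2))           # the five strongly positive statistics are selected
```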


2021 ◽ Author(s): Wenpin Hou ◽ Zhicheng Ji ◽ Zeyu Chen ◽ E John Wherry ◽ Stephanie C Hicks ◽ ...

Pseudotime analysis with single-cell RNA-sequencing (scRNA-seq) data has been widely used to study dynamic gene regulatory programs along continuous biological processes. While many computational methods have been developed to infer the pseudo-temporal trajectories of cells within a biological sample, methods that compare pseudo-temporal patterns across multiple samples (or replicates) and experimental conditions are lacking. Lamian is a comprehensive and statistically rigorous computational framework for differential multi-sample pseudotime analysis. It can be used to identify changes in a biological process associated with sample covariates, such as different biological conditions, and also to detect changes in gene expression, cell density, and topology along a pseudotemporal trajectory. Unlike existing methods that ignore sample variability, Lamian draws statistical inference after accounting for cross-sample variability and hence substantially reduces sample-specific false discoveries that do not generalize to new samples. Using both simulations and real scRNA-seq data, including an analysis of differential immune response programs between COVID-19 patients with different disease severity levels, we demonstrate the advantages of Lamian in decoding cellular gene expression programs in continuous biological processes.
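
A minimal sketch, under assumptions, of the sample-level principle highlighted above (not Lamian's actual model): summarise each sample's expression within pseudotime bins and compare conditions using samples, not cells, as the unit of replication, followed by a multiple-testing correction across bins. The function names and simulated data are hypothetical; scipy.stats.false_discovery_control requires SciPy ≥ 1.11.

```python
import numpy as np
from scipy.stats import ttest_ind, false_discovery_control

def binned_profile(pseudotime, expression, n_bins=10):
    """Mean expression of one sample within equal-width pseudotime bins (pseudotime in [0, 1])."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(pseudotime, edges) - 1, 0, n_bins - 1)
    return np.array([expression[idx == b].mean() for b in range(n_bins)])

rng = np.random.default_rng(1)
# Four samples per condition; condition B has a modest shift in expression.
cond_a = [binned_profile(rng.uniform(0, 1, 500), rng.normal(1.0, 1.0, 500)) for _ in range(4)]
cond_b = [binned_profile(rng.uniform(0, 1, 500), rng.normal(1.3, 1.0, 500)) for _ in range(4)]

# One test per pseudotime bin, with samples (not cells) as replicates,
# then Benjamini-Hochberg correction across the bins.
p_values = [ttest_ind([s[b] for s in cond_a], [s[b] for s in cond_b]).pvalue
            for b in range(10)]
print(false_discovery_control(p_values))   # BH-adjusted p-values, one per bin
```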

