Querying multiple sets of p-values through composed hypothesis testing

Author(s):  
Tristan Mary-Huard ◽  
Sarmistha Das ◽  
Indranil Mukhopadhyay ◽  
Stephane Robin

Abstract. Motivation: Combining the results of different experiments to exhibit complex patterns or to improve statistical power is a typical aim of data integration. The starting point of the statistical analysis often comes as sets of p-values resulting from previous analyses, which need to be combined in a flexible way to explore complex hypotheses while guaranteeing a low proportion of false discoveries. Results: We introduce the generic concept of composed hypothesis, which corresponds to an arbitrarily complex combination of simple hypotheses. We rephrase the problem of testing a composed hypothesis as a classification task, and show that finding items for which the composed null hypothesis is rejected boils down to fitting a mixture model and classifying the items according to their posterior probabilities. We show that inference can be efficiently performed and provide a thorough classification rule to control for type I error. The performance and the usefulness of the approach are illustrated on simulations and on two different applications. The method is scalable, does not require any parameter tuning, and provided valuable biological insight in the application cases considered. Availability: The QCH methodology is implemented in the qch R package hosted on CRAN.
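The classification step described in this abstract can be illustrated with a minimal Python sketch. It assumes the mixture model has already been fitted and that each item comes with posterior probabilities over the configurations (one 0/1 flag per p-value set). The function names, the example probabilities, and the running-mean rejection rule are illustrative assumptions, not the API or the exact rule of the qch package.

```python
def composed_null_posterior(config_posteriors, in_h1):
    """Posterior probability that an item falls under the composed null.

    config_posteriors maps a configuration tuple such as (0, 1) to its
    posterior probability; in_h1 says whether a configuration belongs to
    the composed alternative (hypothetical interface).
    """
    return sum(p for config, p in config_posteriors.items() if not in_h1(config))


def reject_with_fdr(null_posteriors, alpha=0.05):
    """Reject items with the smallest null posteriors while the running
    mean of those posteriors (an estimate of the FDR) stays <= alpha."""
    order = sorted(range(len(null_posteriors)), key=lambda i: null_posteriors[i])
    rejected, running_sum = [], 0.0
    for i in order:
        # Posteriors are visited in increasing order, so the running mean
        # can only grow: stop at the first item that pushes it above alpha.
        if (running_sum + null_posteriors[i]) / (len(rejected) + 1) > alpha:
            break
        running_sum += null_posteriors[i]
        rejected.append(i)
    return sorted(rejected)


# Hypothetical posteriors for three items and two p-value sets.
items = [
    {(0, 0): 0.01, (0, 1): 0.01, (1, 0): 0.00, (1, 1): 0.98},
    {(0, 0): 0.50, (0, 1): 0.20, (1, 0): 0.20, (1, 1): 0.10},
    {(0, 0): 0.02, (0, 1): 0.03, (1, 0): 0.01, (1, 1): 0.94},
]
# Composed alternative: the item is non-null in both sets.
both_sets = lambda config: sum(config) == 2
null_post = [composed_null_posterior(item, both_sets) for item in items]
selected = reject_with_fdr(null_post, alpha=0.05)
```

Changing `both_sets` to, for example, `lambda config: sum(config) >= 1` queries a different composed hypothesis from the same posteriors, which is the flexibility the abstract refers to.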

Author(s):  
Oliver Gutiérrez-Hernández ◽  
Luis Ventura García

Multiplicity arises when data analysis involves multiple simultaneous inferences, increasing the chance of spurious findings. It is a widespread problem frequently ignored by researchers. In this paper, we perform an exploratory analysis of the Web of Science database for COVID-19 observational studies. We examined 100 top-cited COVID-19 peer-reviewed articles based on p-values, including up to 7100 simultaneous tests, with 50% including more than 34 tests and 20% more than 100 tests. We found that the larger the number of tests performed, the larger the number of significant results (r = 0.87, p < 10⁻⁶). The number of p-values in the abstracts was not related to the number of p-values in the papers. However, the highly significant results (p < 0.001) in the abstracts were strongly correlated (r = 0.61, p < 10⁻⁶) with the number of p < 0.001 significances in the papers. Furthermore, the abstracts included a higher proportion of significant results (0.91 vs. 0.50), and 80% reported only significant results. Only one reviewed paper addressed multiplicity-induced type I error inflation, pointing to potentially spurious results bypassing the peer-review process. We conclude that special attention should be paid to the increased chance of false discoveries in observational studies, including non-replicated striking discoveries with a potentially large social impact. We propose some easy-to-implement measures to assess and limit the effects of multiplicity.
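To make the effect of multiplicity concrete, the following Python sketch computes the family-wise error rate for m independent tests and applies the standard Benjamini-Hochberg step-up procedure. These are textbook tools chosen for illustration; whether they coincide with the measures proposed in the paper is an assumption, and the function names are hypothetical.

```python
def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive among m independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


def benjamini_hochberg(p_values, q=0.05):
    """Indices of the hypotheses rejected by the Benjamini-Hochberg
    step-up rule at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    return sorted(order[:k_max])
```

With 34 independent tests at the 0.05 level, the median number of tests in the reviewed articles, `familywise_error_rate(34)` is above 0.8, so at least one spurious significant result is more likely than not.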


2020 ◽  
Author(s):  
Han Du ◽  
Ge Jiang ◽  
Zijun Ke

Meta-analysis combines pertinent information from existing studies to provide an overall estimate of population parameters/effect sizes, as well as to quantify and explain the differences between studies. However, testing the between-study heterogeneity is one of the most troublesome topics in meta-analysis research. Additionally, no methods have been proposed to test whether the size of the heterogeneity is larger than a specific level. The existing methods, such as the Q test and likelihood ratio (LR) tests, are criticized for their failure to control the Type I error rate and/or failure to attain enough statistical power. Although better reference distribution approximations have been proposed in the literature, the expression is complicated and the application is limited. In this article, we propose bootstrap-based heterogeneity tests combining the restricted maximum likelihood (REML) ratio test or Q test with bootstrap procedures, denoted as B-REML-LRT and B-Q respectively. Simulation studies were conducted to examine and compare the performance of the proposed methods with the regular LR tests, the regular Q test, and the improved Q test in both the random-effects meta-analysis and mixed-effects meta-analysis. Based on the results of Type I error rates and statistical power, B-Q is recommended. An R package boot.heterogeneity is provided to facilitate the implementation of the proposed method.
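The idea of pairing the Q test with a bootstrap can be sketched in Python as follows. This is a simplified parametric bootstrap under the null of zero between-study heterogeneity, with known within-study variances; it is not the boot.heterogeneity package's interface, and the function names and defaults are assumptions made for illustration.

```python
import random


def cochran_q(effects, variances):
    """Cochran's Q and the fixed-effect (inverse-variance) pooled estimate."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return q, pooled


def bootstrap_q_pvalue(effects, variances, n_boot=2000, seed=0):
    """Parametric-bootstrap p-value for H0: no between-study heterogeneity."""
    rng = random.Random(seed)
    q_obs, pooled = cochran_q(effects, variances)
    exceed = 0
    for _ in range(n_boot):
        # Simulate each study under H0: a common true effect plus
        # sampling error with the study's own variance.
        sim = [rng.gauss(pooled, v ** 0.5) for v in variances]
        q_sim, _pooled_sim = cochran_q(sim, variances)
        if q_sim >= q_obs:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_boot + 1)
```

The bootstrap replaces the chi-square reference distribution of Q with an empirical one, which is the motivation the abstract gives for B-Q. Testing whether heterogeneity exceeds a specific nonzero level would require simulating under that level instead, which this sketch does not do.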


2019 ◽  
Vol 227 (4) ◽  
pp. 261-279 ◽  
Author(s):  
Frank Renkewitz ◽  
Melanie Keiner

Abstract. Publication biases and questionable research practices are assumed to be two of the main causes of low replication rates. Both of these problems lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect such bias in meta-analytic results. We present an evaluation of the performance of six of these tools. To assess the Type I error rate and the statistical power of these methods, we simulated a large variety of literatures that differed with regard to true effect size, heterogeneity, number of available primary studies, and sample sizes of these primary studies; furthermore, simulated studies were subjected to different degrees of publication bias. Our results show that across all simulated conditions, no method consistently outperformed the others. Additionally, all methods performed poorly when true effect sizes were heterogeneous or primary studies had a small chance of being published, irrespective of their results. This suggests that in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.


2019 ◽  
Author(s):  
Alvin Vista

Cheating detection is an important issue in standardized testing, especially in large-scale settings. Statistical approaches are often computationally intensive and require specialised software to conduct. We present a two-stage approach that quickly filters suspected groups using statistical testing on an IRT-based answer-copying index. We also present an approach to mitigate data contamination and improve the performance of the index. The computation of the index was implemented through a modified version of an open source R package, thus enabling wider access to the method. Using data from PIRLS 2011 (N=64,232) we conduct a simulation to demonstrate our approach. Type I error was well-controlled and no control group was falsely flagged for cheating, while 16 (combined n=12,569) of the 18 (combined n=14,149) simulated groups were detected. Implications for system-level cheating detection and further improvements of the approach were discussed.


2019 ◽  
Author(s):  
Rob Cribbie ◽  
Nataly Beribisky ◽  
Udi Alter

Many bodies recommend that a sample planning procedure, such as traditional NHST a priori power analysis, be conducted during the planning stages of a study. Power analysis allows the researcher to estimate how many participants are required in order to detect a minimally meaningful effect size at a specific level of power and Type I error rate. However, there are several drawbacks to the procedure that render it “a mess.” Specifically, the identification of the minimally meaningful effect size is often difficult but unavoidable for conducting the procedure properly, the procedure is not precision oriented, and it does not guide the researcher to collect as many participants as feasibly possible. In this study, we explore how these three theoretical issues are reflected in applied psychological research in order to better understand whether these issues are concerns in practice. To investigate how power analysis is currently used, this study reviewed the reporting of 443 power analyses in high impact psychology journals in 2016 and 2017. It was found that researchers rarely use the minimally meaningful effect size as a rationale for the chosen effect in a power analysis. Further, precision-based approaches and collecting the maximum sample size feasible are almost never used in tandem with power analyses. In light of these findings, we suggest that researchers should focus on tools beyond traditional power analysis when sample planning, such as collecting the maximum sample size feasible.


2010 ◽  
Vol 23 (2) ◽  
pp. 200-229 ◽  
Author(s):  
Anna L. Macready ◽  
Laurie T. Butler ◽  
Orla B. Kennedy ◽  
Judi A. Ellis ◽  
Claire M. Williams ◽  
...  

In recent years there has been a rapid growth of interest in exploring the relationship between nutritional therapies and the maintenance of cognitive function in adulthood. Emerging evidence reveals an increasingly complex picture with respect to the benefits of various food constituents on learning, memory and psychomotor function in adults. However, to date, there has been little consensus in human studies on the range of cognitive domains to be tested or the particular tests to be employed. To illustrate the potential difficulties that this poses, we conducted a systematic review of existing human adult randomised controlled trial (RCT) studies that have investigated the effects of 24 d to 36 months of supplementation with flavonoids and micronutrients on cognitive performance. There were thirty-nine studies employing a total of 121 different cognitive tasks that met the criteria for inclusion. Results showed that less than half of these studies reported positive effects of treatment, with some important cognitive domains either under-represented or not explored at all. Although there was some evidence of sensitivity to nutritional supplementation in a number of domains (for example, executive function, spatial working memory), interpretation is currently difficult given the prevailing ‘scattergun approach’ for selecting cognitive tests. Specifically, the practice means that it is often difficult to distinguish between a boundary condition for a particular nutrient and a lack of task sensitivity. We argue that for significant future progress to be made, researchers need to pay much closer attention to existing human RCT and animal data, as well as to more basic issues surrounding task sensitivity, statistical power and type I error.


2020 ◽  
Vol 6 (2) ◽  
pp. 106-113
Author(s):  
A. M. Grjibovski ◽  
M. A. Gorbatova ◽  
A. N. Narkevich ◽  
K. A. Vinogradov

Sample size calculation in the planning phase is still uncommon in Russian research practice. This situation threatens the validity of the conclusions and may introduce Type II error, when a false null hypothesis is accepted due to a lack of statistical power to detect an existing difference between the means. Comparing two means using unpaired Student’s t-tests is the most common statistical procedure in the Russian biomedical literature. However, calculations of the minimal required sample size or retrospective calculations of the statistical power were observed in only very few publications. In this paper we demonstrate how to calculate the required sample size for comparing means in unpaired samples using WinPepi and Stata software. In addition, we produced tables of the minimal required sample size for studies in which two means have to be compared and body mass index and blood pressure are the variables of interest. The tables were constructed for unpaired samples for different levels of statistical power and standard deviations obtained from the literature.
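The kind of calculation discussed here can be approximated in a few lines of Python. The sketch uses the standard normal-approximation formula n = 2 (z_{1-α/2} + z_{power})² σ² / δ² per group for a two-sided comparison of two means with equal variances. It is not a reproduction of WinPepi or Stata output, which may apply a t-distribution correction and give slightly larger values; the function name and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect a difference `delta`
    between two means with common standard deviation `sd`
    (two-sided test, normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # critical value for the test
    z_beta = z.inv_cdf(power)               # quantile matching the target power
    return math.ceil(2.0 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Illustrative values only: a 5-unit difference with a standard deviation of 10.
n_80 = n_per_group(delta=5, sd=10, power=0.80)
n_90 = n_per_group(delta=5, sd=10, power=0.90)
```

Because the required n grows with (σ/δ)², halving the detectable difference roughly quadruples the sample size, which is why tables indexed by standard deviation and power are a practical planning aid.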


Author(s):  
Shengjie Liu ◽  
Jun Gao ◽  
Yuling Zheng ◽  
Lei Huang ◽  
Fangrong Yan

Abstract. Bioequivalence (BE) studies are an integral component of the new drug development process and play an important role in the approval and marketing of generic drug products. However, existing design and evaluation methods mostly fall under the frequentist framework, and few implement Bayesian ideas. Based on the bioequivalence predictive probability model and a sample re-estimation strategy, we propose a new Bayesian two-stage adaptive design and explore its application in bioequivalence testing. The new design differs from existing two-stage designs (such as Potvin’s methods B and C) in the following aspects. First, it not only incorporates historical information and expert information, but also combines experimental data flexibly to aid decision-making. Second, its sample re-estimation strategy is based on the ratio of the information in the interim analysis to the total information, which is simpler to calculate than Potvin’s method. Simulation results showed that the two-stage design can be combined with various stopping boundary functions, and that the results differ across them. Moreover, the proposed method saves sample size compared to Potvin’s method under the conditions that the type I error rate is below 0.05 and statistical power reaches 80%.


2019 ◽  
Vol 35 (24) ◽  
pp. 5155-5162 ◽  
Author(s):  
Chengzhong Ye ◽  
Terence P Speed ◽  
Agus Salim

Abstract. Motivation: Dropout is a common phenomenon in single-cell RNA-seq (scRNA-seq) data, and when left unaddressed it affects the validity of the statistical analyses. Despite this, few current methods for differential expression (DE) analysis of scRNA-seq data explicitly model the process that gives rise to the dropout events. We develop DECENT, a method for DE analysis of scRNA-seq data that explicitly and accurately models the molecule capture process in scRNA-seq experiments. Results: We show that DECENT demonstrates improved DE performance over existing DE methods that do not explicitly model dropout. This improvement is consistently observed across several public scRNA-seq datasets generated using different technological platforms. The improvement is especially large when the capture process is overdispersed. DECENT maintains type I error well while achieving better sensitivity. Its performance without spike-ins is almost as good as when spike-ins are used to calibrate the capture model. Availability and implementation: The method is implemented as a publicly available R package at https://github.com/cz-ye/DECENT. Supplementary information: Supplementary data are available at Bioinformatics online.

