Defuzzification of a Fuzzy p-value by the Signed Distance: Application on Real Data

Author(s): Rédina Berkachy, Laurent Donzé
Keyword(s): P Value

2019
Author(s): Leili Tapak, Omid Hamidi, Majid Sadeghifar, Hassan Doosti, Ghobad Moradi

Abstract
Objectives: Zero-inflated proportion or rate data nested in clusters due to the sampling structure can be found in many disciplines. Sometimes the rate response may not be observed for some study units because of limitations such as failures in recording data (false negatives), and zeros are observed instead of the actual rates/proportions (low incidence). In this study, we propose a multilevel zero-inflated censored Beta regression model that can address zero-inflated rate data with low incidence.
Methods: We assumed that the random effects are independent and normally distributed. The performance of the proposed approach was evaluated through application to a three-level real data set and a simulation study. We applied the proposed model to analyze brucellosis diagnosis rate data and to investigate the effects of climatic factors and geographical position. For comparison, we also applied the standard zero-inflated censored Beta regression model, which does not account for correlation.
Results: The proposed model performed better than the zero-inflated censored Beta model based on the AIC criterion. Height (p-value < 0.0001), temperature (p-value < 0.0001) and precipitation (p-value = 0.0006) significantly affected brucellosis rates, whereas precipitation was not statistically significant in the ZICBETA model (p-value = 0.385). The simulation study also showed that the estimates obtained by the maximum likelihood approach were reasonable in terms of mean squared error.
Conclusions: The results showed that the proposed method can capture the correlations in the real data set and yields accurate parameter estimates.
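A minimal sketch of the kind of data such a model targets, with hypothetical parameters (30% structural zeros, Beta(2, 8) rates otherwise); the actual model in the paper additionally handles censoring and the multilevel random-effects structure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Structural zeros with probability 0.3 (hypothetical); otherwise a
# Beta-distributed rate. Real brucellosis rates would also carry the
# clustering structure that the multilevel model captures.
is_zero = rng.random(n) < 0.3
rates = np.where(is_zero, 0.0, rng.beta(2, 8, size=n))
```

Such data mix a point mass at zero with a continuous (0, 1) component, which is exactly why a plain Beta regression cannot be fitted to it directly.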


2018, Vol 28 (9), pp. 2868-2875
Author(s): Zhongxue Chen, Qingzhong Liu, Kai Wang

Several gene- or set-based association tests have been proposed recently in the literature, yet powerful statistical approaches remain highly desirable in this area. In this paper we propose a novel statistical association test that uses information from the burden component and its complement of the genotypes. The new test statistic has a simple null distribution, a special and simplified variance-gamma distribution, so its p-value can be easily calculated. Through a comprehensive simulation study, we show that the new test controls the type I error rate and has superior detection power compared with some popular existing methods. We also apply the new approach to a real data set; the results demonstrate that this test is promising.
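The abstract does not spell out the test statistic, but the burden component it builds on can be illustrated with a minimal (hypothetical) collapsing test: sum rare-allele counts per individual and test the association of that burden with the phenotype.

```python
import numpy as np
from scipy import stats

def burden_test(genotypes, phenotype):
    # Collapse all rare variants into one burden score per individual,
    # then test its linear association with the phenotype.
    burden = genotypes.sum(axis=1)
    return stats.pearsonr(burden, phenotype)

# Simulated toy data: 500 individuals, 20 rare variants (MAF 5%),
# phenotype partly driven by the total rare-allele burden.
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.05, size=(500, 20))
y = 0.3 * G.sum(axis=1) + rng.normal(size=500)
r, p = burden_test(G, y)
```

Burden tests are powerful when most causal variants act in the same direction; the complement information the paper adds is meant to recover power when they do not.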


2018, Vol 28 (5), pp. 1508-1522
Author(s): Qianya Qi, Li Yan, Lili Tian

In testing for differentially expressed genes between tumor and healthy tissues, data are usually collected in paired form. However, incomplete paired data often occur. While extensive statistical research exists for paired data with incompleteness in both arms, hardly any recent work can be found on paired data with incompleteness in a single arm. This paper aims to fill this gap by proposing some new methods, namely p-value pooling methods and a nonparametric combination test. Simulation studies are conducted to investigate the performance of the proposed methods in terms of type I error and power at small to moderate sample sizes. A real data set from The Cancer Genome Atlas (TCGA) breast cancer study is analyzed using the proposed methods.
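One simple p-value pooling scheme of the kind the paper studies (the exact methods proposed there may differ) combines a paired test on the complete pairs with a two-sample test involving the unpaired arm, for example via Fisher's method:

```python
import numpy as np
from scipy import stats

def pooled_pvalue(tumor_paired, normal_paired, tumor_only):
    # Paired test on complete pairs, two-sample test on the extra
    # unpaired tumor samples, combined with Fisher's method. Fisher's
    # method assumes independent p-values; sharing the normal arm
    # makes this only an approximation.
    p_paired = stats.ttest_rel(tumor_paired, normal_paired).pvalue
    p_unpaired = stats.ttest_ind(tumor_only, normal_paired).pvalue
    return stats.combine_pvalues([p_paired, p_unpaired], method="fisher")[1]

rng = np.random.default_rng(2)
normal = rng.normal(0.0, 1.0, 30)
tumor_pair = normal + rng.normal(1.0, 1.0, 30)  # paired tumor arm, shifted up
tumor_only = rng.normal(1.0, 1.0, 15)           # tumor samples missing a pair
p = pooled_pvalue(tumor_pair, normal, tumor_only)
```

The appeal of pooling is that the unpaired samples contribute power instead of being discarded, which matters at the small sample sizes the simulations target.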


2019
Author(s): Florian Privé, Bjarni J. Vilhjálmsson, Hugues Aschard, Michael G.B. Blum

Abstract
Polygenic prediction has the potential to contribute to precision medicine. Clumping and Thresholding (C+T) is a widely used method to derive polygenic scores. When using C+T, it is common to test several p-value thresholds to maximize the predictive ability of the derived polygenic scores. Along with this p-value threshold, we propose to tune three other hyper-parameters for C+T. We implement an efficient way to derive thousands of different C+T polygenic scores corresponding to a grid over four hyper-parameters. For example, it takes a few hours to derive 123,200 different C+T scores for 300K individuals and 1M variants on a single node with 16 cores.
We find that optimizing over these four hyper-parameters improves the predictive performance of C+T in both simulations and real data applications, as compared to tuning only the p-value threshold. A particularly large increase can be noted when predicting depression status, from an AUC of 0.557 (95% CI: [0.544-0.569]) when tuning only the p-value threshold in C+T to an AUC of 0.592 (95% CI: [0.580-0.604]) when tuning all four hyper-parameters we propose for C+T.
We further propose Stacked Clumping and Thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores using an efficient penalized regression. We apply SCT to 8 different case-control diseases in the UK Biobank data and find that SCT substantially improves prediction accuracy, with an average AUC increase of 0.035 over standard C+T.
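The thresholding half of C+T is easy to sketch (clumping, which prunes correlated variants, is omitted here, and all numbers are illustrative): a polygenic score keeps only the variants whose GWAS p-value passes a threshold, and the grid search simply repeats this over many thresholds.

```python
import numpy as np

def thresholded_score(genotypes, betas, pvals, p_threshold):
    # Keep variants passing the p-value threshold and sum
    # beta-weighted allele counts per individual.
    keep = pvals < p_threshold
    return genotypes[:, keep] @ betas[keep]

rng = np.random.default_rng(3)
G = rng.binomial(2, 0.3, size=(200, 1000))   # 200 individuals, 1000 variants
betas = rng.normal(0.0, 0.1, 1000)           # GWAS effect-size estimates
pvals = rng.uniform(0.0, 1.0, 1000)          # GWAS p-values
grid = {t: thresholded_score(G, betas, pvals, t) for t in (1e-4, 1e-2, 0.5, 1.0)}
```

In standard C+T one would pick the threshold (and the other hyper-parameters) maximizing validation performance; SCT instead keeps all the scores in the grid and stacks them with penalized regression.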


2015
Author(s): David M Rocke, Luyao Ruan, Yilun Zhang, J. Jared Gossett, Blythe Durbin-Johnson, ...

Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10⁻⁴, and if all the null hypotheses are true, then there should be only about 1 gene declared significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations.
Results: Methods that rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and in the simulated negative binomial case. This also occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, as is also the case with a variance-stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in the estimation of the negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or the dispersion is high, rather than for low-count genes.
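The paper's premise can be checked in miniature: under the null, p-values from an exact test are uniform, so at a cutoff of 10⁻⁴ across 10,000 genes about one false positive is expected. A minimal all-null simulation (using normal data and t-tests rather than RNA-Seq counts, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_genes, n_per_group, cutoff = 10_000, 5, 1e-4

# Both "conditions" are drawn from the same distribution for every gene,
# so any gene passing the cutoff is a false positive by construction.
A = rng.normal(0.0, 1.0, (n_genes, n_per_group))
B = rng.normal(0.0, 1.0, (n_genes, n_per_group))
pvals = stats.ttest_ind(A, B, axis=1).pvalue

false_positives = int((pvals < cutoff).sum())  # expected: about 1
```

The paper's resampling approach applies the same logic to real RNA-Seq data, where the negative-binomial-based methods produced many more rejections than the cutoff warrants.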


2019, Vol 35 (22), pp. 4837-4839
Author(s): Hanna Julienne, Huwenbo Shi, Bogdan Pasaniuc, Hugues Aschard

Abstract
Motivation: Multi-trait analyses using public summary statistics from genome-wide association studies (GWASs) are becoming increasingly popular. A constraint of multi-trait methods is that they require complete summary data for all traits. Although methods for the imputation of summary statistics exist, they lack precision for genetic variants with small effect sizes. This is benign for univariate analyses, where only variants with large effect sizes are selected a posteriori. However, it can lead to strong p-value inflation in multi-trait testing. Here we present a new approach that improves the existing imputation methods and reaches a precision suitable for multi-trait analyses.
Results: We fine-tuned parameters to obtain very high imputation accuracy from summary statistics. We demonstrate this accuracy for variants of all effect sizes on real data from 28 GWASs. We implemented the resulting methodology in a Python package specially designed to efficiently impute multiple GWASs in parallel.
Availability and implementation: The Python package is available at https://gitlab.pasteur.fr/statistical-genetics/raiss; its accompanying documentation is accessible at http://statistical-genetics.pages.pasteur.fr/raiss/.
Supplementary information: Supplementary data are available at Bioinformatics online.
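Summary-statistic imputation of this kind typically rests on the multivariate-normal model of GWAS z-scores, where untyped variants are imputed from typed ones via the LD matrix: E[z_t | z_o] = R_to R_oo⁻¹ z_o. A minimal sketch (the ridge value and all inputs are illustrative, not this package's actual defaults):

```python
import numpy as np

def impute_z(z_obs, ld_oo, ld_to, lam=0.1):
    # Conditional-mean imputation of untyped z-scores from typed ones.
    # lam regularizes the observed-observed LD block, which is estimated
    # from a reference panel and may be ill-conditioned.
    ridge = ld_oo + lam * np.eye(ld_oo.shape[0])
    return ld_to @ np.linalg.solve(ridge, z_obs)

# Two typed variants (uncorrelated with each other); one untyped variant
# in LD (r = 0.9) with the first typed variant only.
z_obs = np.array([4.0, 1.0])
ld_oo = np.eye(2)
ld_to = np.array([[0.9, 0.0]])
z_imputed = impute_z(z_obs, ld_oo, ld_to)  # ≈ 0.9 * 4.0 / 1.1
```

The precision problem the abstract describes arises because small-effect variants have z-scores near zero, where errors in the LD-based conditional mean dominate the signal.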


Author(s): Salem Alawbathani, Mehreen Batool, Jan Fleckhaus, Sarkawt Hamad, Floyd Hassenrück, ...

Abstract
A poor understanding of statistical analysis has been proposed as a key reason for the lack of replicability of many studies in experimental biomedicine. While several authors have demonstrated the fickleness of calculated p-values based on simulations, we have experienced that such simulations are difficult to understand for many biomedical scientists and often do not lead to a sound understanding of the role of variability between random samples in statistical analysis. Therefore, as trainees and trainers in a statistics course for biomedical scientists, we used real data from a large published study to develop a tool that allows scientists to directly experience the fickleness of p-values. The tool, based on a commonly used software package, draws random samples from real data; it is described here and made available together with the underlying database. It has been tested successfully in multiple other groups of biomedical scientists. It can also let trainees experience the impact of randomness, sample size and the choice of a specific statistical test on measured p-values. We propose that live exercises based on real data will be more impactful in the training of biomedical scientists on statistical concepts.
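The core of such an exercise can be reproduced in a few lines: draw repeated small random samples from a fixed "population" with a modest true effect and watch the p-value jump around (the numbers here are hypothetical, not from the published study the authors used):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Stand-in for a large real data set: a modest true group difference.
pop_a = rng.normal(0.0, 1.0, 100_000)
pop_b = rng.normal(0.3, 1.0, 100_000)

# 1000 replications of the same small study, each with n = 20 per group.
pvals = np.array([
    stats.ttest_ind(rng.choice(pop_a, 20, replace=False),
                    rng.choice(pop_b, 20, replace=False)).pvalue
    for _ in range(1000)
])

share_significant = (pvals < 0.05).mean()  # far from 0 and far from 1
spread = pvals.max() / pvals.min()         # p-values span orders of magnitude
```

The instructive point is that every replication samples the same underlying truth, yet some "studies" reach significance and others miss it by a wide margin.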


2017, Vol 3 (3), pp. 31
Author(s): Isabel González Gayte, Rocío Bautista Moreno, Pedro Seoane Zonjic, M. Gonzalo Claros

Differential gene expression analysis based on RNA-seq is widely used. Bioinformatics skills are required, since no single algorithm is appropriate for all experimental designs. Moreover, when working with organisms without a reference genome, functional analysis is less than straightforward in most situations. DEgenes Hunter, an attempt to automate the process, is based on two independent scripts, one for differential expression and one for functional interpretation. Based on the replicates, the R script decides which of the edgeR, DESeq2, NOISeq and limma algorithms are appropriate. It performs quality-control calculations and provides the prevalent, most reliable set of differentially expressed genes, and lists all other possible candidates for further functional interpretation. It also provides a combined p-value that allows ranking of differentially expressed genes. It has been tested with synthetic and real-world datasets, showing in both cases ease of use and reliable results. With real data, DEgenes Hunter offers straightforward functional interpretation.


2020, Vol 3 (2), pp. 216-228
Author(s): Hannes Rosenbusch, Leon P. Hilbert, Anthony M. Evans, Marcel Zeelenberg

Sometimes interesting statistical findings are produced by a small number of “lucky” data points within the tested sample. To address this issue, researchers and reviewers are encouraged to investigate outliers and influential data points. Here, we present StatBreak, an easy-to-apply method, based on a genetic algorithm, that identifies the observations that most strongly contributed to a finding (e.g., effect size, model fit, p-value, Bayes factor). Within a given sample, StatBreak searches for the largest subsample in which a previously observed pattern is not present or is reduced below a specifiable threshold. Thus, it answers the following question: “Which (and how few) ‘lucky’ cases would need to be excluded from the sample for the data-based conclusion to change?” StatBreak consists of a simple R function and flags the luckiest data points for any form of statistical analysis. Here, we demonstrate the effectiveness of the method with simulated and real data across a range of study designs and analyses. Additionally, we describe StatBreak’s R function and explain how researchers and reviewers can apply the method to the data they are working with.
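The idea can be sketched with a greedy stand-in for StatBreak's genetic algorithm (the function name and toy data are hypothetical; the actual R implementation differs): repeatedly remove the single observation whose exclusion weakens the finding most, until the pattern is gone.

```python
import numpy as np
from scipy import stats

def luckiest_points(x, y, alpha=0.05):
    # Greedily drop the observation whose removal raises the correlation
    # p-value the most, until the result is no longer significant.
    # (StatBreak itself searches with a genetic algorithm instead.)
    idx = list(range(len(x)))
    removed = []

    def p_without(i):
        keep = [j for j in idx if j != i]
        return stats.pearsonr(x[keep], y[keep])[1]

    while len(idx) > 3 and stats.pearsonr(x[idx], y[idx])[1] < alpha:
        drop = max(idx, key=p_without)
        idx.remove(drop)
        removed.append(drop)
    return removed

rng = np.random.default_rng(6)
x = rng.normal(size=30)
y = rng.normal(size=30)        # no true relationship between x and y
x[:2] += 5
y[:2] += 5                     # two 'lucky' points manufacture a correlation
lucky = luckiest_points(x, y)  # expected to flag roughly those two points
```

If only a couple of removals suffice to dissolve a "significant" correlation, the finding rests on those few cases, which is exactly the diagnostic StatBreak reports.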

