Estimation of permutation-based metabolome-wide significance thresholds

2018
Author(s): Alina Peluso, Robert Glen, Timothy M D Ebbels

Abstract
Motivation: A key issue in the omics literature is the search for statistically significant relationships between molecular markers and phenotype. The aim is to detect disease-related discriminatory features while controlling false positive associations at adequate power. Metabolome-wide association studies have revealed significant relationships of metabolic phenotypes with disease risk by analysing hundreds to tens of thousands of molecular variables, leading to multivariate data which are highly noisy and collinear. In this context, conventional Bonferroni or Sidak multiple testing corrections are rather conservative, as these are valid only for independent tests, while permutation procedures allow the significance level to be estimated from the null distribution without assuming independence among features. Nevertheless, under the permutation approach the distribution of p-values may present systematic deviations from the theoretical null distribution, which leads to overly conservative adjusted threshold estimates, i.e. smaller than a Bonferroni or Sidak correction.
Methods: We make use of parametric approximation methods based on a multivariate Normal distribution to derive stable estimates of the metabolome-wide significance level. A univariate approach is applied based on a permutation procedure which effectively controls the overall type I error rate at the α level.
Results: We illustrate the approach for different model parametrizations and distributional features of the outcome measure, using both simulated and real data. We also investigate different levels of correlation within the features and between the features and the outcome.
Availability: MWSL is an open-source R package for the empirical estimation of the metabolome-wide significance level, available at https://github.com/AlinaPeluso/MWSL.
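The core of the permutation approach can be illustrated in a few lines. Below is a minimal sketch (not the MWSL implementation; data sizes and the Pearson-test choice are illustrative): the outcome is permuted many times, the minimum p-value across features is recorded each time, and the α-quantile of these minima gives the metabolome-wide significance level, from which an effective number of tests follows.

```python
# Minimal sketch of a permutation-based metabolome-wide significance level
# (illustrative only, not the MWSL package): permute the outcome, record the
# minimum p-value across all features each time, and take the alpha-quantile
# of those minima as the per-test threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m, n_perm, alpha = 200, 500, 1000, 0.05

# Toy, deliberately collinear "metabolome" and a null (unrelated) outcome.
base = rng.normal(size=(n, 50))
X = base @ rng.normal(size=(50, m)) + rng.normal(size=(n, m))
y = rng.normal(size=n)

Xc = X - X.mean(axis=0)
Xn = Xc / np.linalg.norm(Xc, axis=0)

def min_pvalue(y_vec):
    """Smallest univariate Pearson-test p-value across all features."""
    yc = y_vec - y_vec.mean()
    r = Xn.T @ (yc / np.linalg.norm(yc))        # per-feature correlations
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))   # t statistics, df = n - 2
    return (2.0 * stats.t.sf(np.abs(t), df=n - 2)).min()

min_p = np.array([min_pvalue(rng.permutation(y)) for _ in range(n_perm)])

mwsl = np.quantile(min_p, alpha)   # metabolome-wide significance level
print(f"permutation threshold:     {mwsl:.2e}")
print(f"Bonferroni threshold:      {alpha / m:.2e}")
print(f"effective number of tests: {alpha / mwsl:.0f}")
```

Because the simulated features are collinear, the permutation threshold is typically larger (less stringent) than the Bonferroni one, which is the motivation for estimating it empirically.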

2018 · Vol 28 (9) · pp. 2868-2875
Author(s): Zhongxue Chen, Qingzhong Liu, Kai Wang

Several gene- or set-based association tests have been proposed in the recent literature, yet powerful statistical approaches remain highly desirable in this area. In this paper we propose a novel association test that uses information from both the burden component of the genotypes and its complement. The new test statistic has a simple null distribution, a special and simplified variance-gamma distribution, so its p-value can be easily calculated. Through a comprehensive simulation study, we show that the new test controls the type I error rate and has superior detection power compared with some popular existing methods. We also apply the new approach to a real data set; the results demonstrate that this test is promising.
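The exact statistic and its closed-form variance-gamma null are specific to the paper; the sketch below only illustrates the general idea of combining a burden component with its orthogonal complement, using toy genotype data and a permutation null as a stand-in for the closed-form distribution.

```python
# Loose sketch of a burden-plus-complement rare-variant test. The paper
# derives a closed-form (variance-gamma) null for its statistic; here a
# permutation null stands in, and the decomposition is illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 20
G = rng.binomial(2, 0.05, size=(n, p)).astype(float)  # rare-variant genotypes
y = rng.binomial(1, 0.3, size=n).astype(float)        # null phenotype

def components(y_vec):
    """Burden component of the score vector and its orthogonal complement."""
    s = G.T @ (y_vec - y_vec.mean())      # per-variant score vector
    burden = s.sum() ** 2 / p             # common-direction (burden) part
    return np.array([burden, (s ** 2).sum() - burden])

obs = components(y)
null = np.array([components(rng.permutation(y)) for _ in range(2000)])

# Standardize each component against its permutation null, then combine.
mu, sd = null.mean(axis=0), null.std(axis=0)
t_obs = ((obs - mu) / sd).sum()
t_null = ((null - mu) / sd).sum(axis=1)
pval = (1 + (t_null >= t_obs).sum()) / (1 + t_null.size)
print(f"permutation p-value: {pval:.3f}")
```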


2018 · Vol 20 (6) · pp. 2055-2065
Author(s): Johannes Brägelmann, Justo Lorenzo Bermejo

Abstract
Technological advances and reduced costs of high-density methylation arrays have led to an increasing number of association studies on the possible relationship between human disease and epigenetic variability. DNA samples from peripheral blood or other tissue types are analyzed in epigenome-wide association studies (EWAS) to detect methylation differences related to a particular phenotype. Since information on the cell-type composition of the sample is generally not available and methylation profiles are cell-type specific, statistical methods have been developed for adjustment of cell-type heterogeneity in EWAS. In this study we systematically compared five popular adjustment methods: the factored spectrally transformed linear mixed model (FaST-LMM-EWASher), the sparse principal component analysis algorithm ReFACTor, surrogate variable analysis (SVA), independent SVA (ISVA) and an optimized version of SVA (SmartSVA). We used real data and applied a multilayered simulation framework to assess the type I error rate, the statistical power and the quality of estimated methylation differences according to major study characteristics. While all five adjustment methods improved false-positive rates compared with unadjusted analyses, FaST-LMM-EWASher resulted in the lowest type I error rate at the expense of low statistical power. SVA efficiently corrected for cell-type heterogeneity in EWAS up to 200 cases and 200 controls, but did not control type I error rates in larger studies. Results based on real data sets confirmed the simulation findings, with the strongest control of type I error rates by FaST-LMM-EWASher and SmartSVA. Overall, ReFACTor, ISVA and SmartSVA were comparable and showed the best statistical power, quality of estimated methylation differences and runtime.
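As a rough illustration of the surrogate-variable idea shared by these methods, the sketch below enters the top principal components of the methylation matrix as covariates in per-CpG regressions. This is much simpler than SVA or ReFACTor; in particular, proper SVA protects the phenotype effect when estimating surrogate variables, a step this toy skips.

```python
# Illustrative principal-component adjustment for cell-type heterogeneity,
# in the spirit of (but much simpler than) SVA/ReFACTor: surrogate variables
# are taken as the top PCs of the methylation matrix and entered as
# covariates in a per-CpG linear model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, m, k = 100, 2000, 5                      # samples, CpGs, surrogate PCs
M = rng.normal(size=(n, m))                 # toy methylation matrix (null)
pheno = rng.binomial(1, 0.5, size=n).astype(float)

Mc = M - M.mean(axis=0)
U, S, Vt = np.linalg.svd(Mc, full_matrices=False)
sv = U[:, :k]                               # surrogate variables (top PCs)

design = np.column_stack([np.ones(n), pheno, sv])
coef, *_ = np.linalg.lstsq(design, Mc, rcond=None)
resid = Mc - design @ coef
df = n - design.shape[1]
sigma2 = (resid ** 2).sum(axis=0) / df
# Standard error of the phenotype coefficient via (X'X)^-1.
xtx_inv = np.linalg.inv(design.T @ design)
se = np.sqrt(sigma2 * xtx_inv[1, 1])
tvals = coef[1] / se
pvals = 2 * stats.t.sf(np.abs(tvals), df=df)
print(f"fraction of CpGs with p < 0.05 (near 0.05 under the null): "
      f"{(pvals < 0.05).mean():.3f}")
```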


2004 · Vol 3 (1) · pp. 1-69
Author(s): Sandrine Dudoit, Mark J. van der Laan, Katherine S. Pollard

The present article proposes general single-step multiple testing procedures for controlling Type I error rates defined as arbitrary parameters of the distribution of the number of Type I errors, such as the generalized family-wise error rate. A key feature of our approach is the null distribution of the test statistics (rather than a data-generating null distribution) used to derive cut-offs (i.e., rejection regions) for these test statistics and the resulting adjusted p-values. For general null hypotheses, corresponding to submodels for the data-generating distribution, we identify an asymptotic domination condition for a null distribution under which single-step common-quantile and common-cut-off procedures asymptotically control the Type I error rate, for arbitrary data-generating distributions, without the need for conditions such as subset pivotality. Inspired by this general characterization of a null distribution, we then propose as an explicit null distribution the asymptotic distribution of the vector of null-value shifted and scaled test statistics. In the special case of family-wise error rate (FWER) control, our method yields the single-step minP and maxT procedures, based on minima of unadjusted p-values and maxima of test statistics, respectively, with the important distinction in the choice of null distribution. Single-step procedures based on consistent estimators of the null distribution are shown to also provide asymptotic control of the Type I error rate. A general bootstrap algorithm is supplied to conveniently obtain consistent estimators of the null distribution. The special cases of t- and F-statistics are discussed in detail. The companion articles focus on step-down multiple testing procedures for control of the FWER (van der Laan et al., 2004b) and on augmentations of FWER-controlling methods to control error rates such as tail probabilities for the number of false positives and for the proportion of false positives among the rejected hypotheses (van der Laan et al., 2004a). The proposed bootstrap multiple testing procedures are evaluated by a simulation study and applied to genomic data in the fourth article of the series (Pollard et al., 2004).
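In the FWER special case, the single-step maxT procedure admits a compact resampling sketch. The version below uses permuted group labels as the resampled null (the article itself derives a null-value shifted and scaled bootstrap distribution); data and sizes are illustrative.

```python
# Sketch of a single-step maxT procedure: the adjusted p-value of feature j
# is the tail probability of the maximum absolute test statistic under a
# resampled null distribution (here, permuted group labels).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_per_group, m, n_resample = 30, 100, 2000
x = rng.normal(size=(2 * n_per_group, m))
x[:n_per_group, :3] += 1.5                  # three truly non-null features
labels = np.array([0] * n_per_group + [1] * n_per_group)

def tstats(lab):
    """Two-sample t statistics for every feature at once."""
    return stats.ttest_ind(x[lab == 0], x[lab == 1], axis=0).statistic

t_obs = tstats(labels)
max_null = np.array([np.abs(tstats(rng.permutation(labels))).max()
                     for _ in range(n_resample)])
adj_p = np.array([(1 + (max_null >= abs(t)).sum()) / (1 + n_resample)
                  for t in t_obs])
print("features with adjusted p < 0.05:", np.where(adj_p < 0.05)[0])
```

Because the cut-off is the quantile of a single maximum, one threshold applies to all features, which is what makes this a single-step common-cut-off procedure.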


2021 · Vol 22 (1)
Author(s): Farhad Hormozdiari, Junghyun Jung, Eleazar Eskin, Jong Wha J. Joo

Abstract
In standard genome-wide association studies (GWAS), the association test is underpowered to detect associations at loci with multiple causal variants of small effect size. We propose a statistical method, Model-based Association test Reflecting causal Status (MARS), that finds associations between variants in risk loci and a phenotype, taking the causal status of variants into account and requiring only existing summary statistics to detect associated risk loci. Using extensive simulated data and real data, we show that MARS increases the power of detecting true associated risk loci compared with previous approaches that consider multiple variants, while controlling the type I error.
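MARS itself involves a carefully derived likelihood ratio over causal configurations; the sketch below is only a loose, CAVIAR-style stand-in showing how summary z-scores, an LD matrix and an average over causal-status vectors can be combined, with a sampled null for the p-value. All parameters (LD decay, effect variance, configuration limit) are invented for illustration.

```python
# Loose stand-in for a model-averaged association score over causal
# configurations (not the published MARS implementation). Model assumed:
#   z | c ~ N(0, S + S diag(c * sigma2) S),  S = LD matrix, c = causal status.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
p, sigma2 = 8, 25.0                     # locus size, effect variance (invented)
S = np.fromfunction(lambda i, j: 0.7 ** np.abs(i - j), (p, p))  # toy LD
z = rng.multivariate_normal(np.zeros(p), S)   # observed summary z-scores

def log_lik(z_vec, causal):
    cov = S + S @ np.diag(causal * sigma2) @ S
    return stats.multivariate_normal.logpdf(z_vec, mean=np.zeros(p), cov=cov)

def score(z_vec, max_causal=2):
    """Log of the likelihood ratio averaged over causal configurations."""
    base = log_lik(z_vec, np.zeros(p))
    log_lrs = []
    for k in range(1, max_causal + 1):
        for idx in itertools.combinations(range(p), k):
            c = np.zeros(p)
            c[list(idx)] = 1.0
            log_lrs.append(log_lik(z_vec, c) - base)
    return np.log(np.mean(np.exp(log_lrs)))

# Null distribution by redrawing z-scores from N(0, S), i.e. no association.
obs = score(z)
null = np.array([score(rng.multivariate_normal(np.zeros(p), S))
                 for _ in range(200)])
print(f"p-value: {(1 + (null >= obs).sum()) / (1 + null.size):.3f}")
```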


2021 · Vol 22 (1)
Author(s): Liangliang Zhang, Yushu Shi, Kim-Anh Do, Christine B. Peterson, Robert R. Jenq

Abstract
Background: Identification of features is a critical task in microbiome studies that is complicated by the fact that microbial data are high dimensional and heterogeneous. Masked by the complexity of the data, the problem of separating signals (features that differ between groups) from noise (features that do not) becomes challenging. For instance, when performing differential abundance tests, multiple testing adjustments tend to be overconservative, as the probability of a type I error (false positive) increases dramatically with the large number of hypotheses. Moreover, the grouping effect of interest can be obscured by heterogeneity. These factors can incorrectly lead to the conclusion that there are no differences in the microbiome compositions.
Results: We represent the problem of identifying differential features in two-group comparisons (e.g., treatment versus control) as a dynamic process of separating the signal from its random background. More specifically, we progressively permute the grouping labels of the microbiome samples and perform multiple differential abundance tests in each scenario. We then compare the signal strength of the most differential features from the original data with their performance in permutations, and observe a visually apparent decreasing trend if these features are true positives. Simulations and applications on real data show that the proposed method creates a U-curve when the number of significant features is plotted against the proportion of mixing. The shape of the U-curve conveys the strength of the overall association between the microbiome and the grouping factor. We also define a fragility index to measure the robustness of the discoveries. Finally, we recommend the identified features by comparing p-values in the observed data with p-values in the fully mixed data.
Conclusions: We have developed this into a user-friendly and efficient R-shiny tool with visualizations. By default, we use the Wilcoxon rank-sum test to compute the p-values, since it is a robust nonparametric test. The method can also utilize p-values obtained from other testing methods, such as DESeq. This demonstrates the potential of the progressive permutation method to be extended to new settings.
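A bare-bones version of the progressive permutation loop might look as follows (toy data; a per-feature Wilcoxon rank-sum test, matching the paper's default choice): as the proportion of mixed labels grows, the count of significant features decays toward the chance level, tracing one arm of the U-curve.

```python
# Sketch of progressive permutation: shuffle an increasing share of the
# group labels, re-run per-feature Wilcoxon rank-sum tests, and track how
# many features remain significant at each mixing proportion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, m = 60, 300
labels = np.array([0] * 30 + [1] * 30)
X = rng.lognormal(size=(n, m))             # toy "abundance" matrix
X[labels == 1, :20] *= 2.0                 # 20 truly differential features

def n_significant(lab, alpha=0.05):
    a, b = X[lab == 0], X[lab == 1]
    pvals = np.array([stats.ranksums(a[:, j], b[:, j]).pvalue
                      for j in range(m)])
    return int((pvals < alpha).sum())

for q in np.linspace(0, 1, 6):             # proportion of labels mixed
    mixed = labels.copy()
    idx = rng.choice(n, size=int(round(q * n)), replace=False)
    mixed[idx] = rng.permutation(mixed[idx])   # shuffle only within subset
    print(f"mixing {q:.1f}: {n_significant(mixed)} significant features")
```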


2020
Author(s): Diane Duroux, Héctor Climente-González, Chloé-Agathe Azencott, Kristel Van Steen

Abstract
Detecting epistatic interactions at the gene level is essential to understanding the biological mechanisms of complex diseases. Unfortunately, genome-wide interaction association studies (GWAIS) involve many statistical challenges that make such detection hard. We propose a multi-step protocol for epistasis detection along the edges of a gene-gene co-function network. Such an approach reduces the number of tests performed and provides interpretable interactions, while keeping the type I error controlled. Yet mapping gene interactions into testable SNP-interaction hypotheses, as well as computing gene-pair association scores from SNP-pair ones, is not trivial. Here we compare three SNP-gene mappings (positional overlap, eQTL and proximity in 3D structure) and use the adaptive truncated product method to compute gene-pair scores. This method is non-parametric, does not require a known null distribution, and is fast to compute. We apply multiple variants of this protocol to an inflammatory bowel disease (IBD) GWAS dataset. Different configurations produced different results, highlighting that various mechanisms are implicated in IBD, while at the same time the results overlapped with known disease biology. Importantly, the proposed pipeline also differs from a conventional approach where no network is used, showing the potential for additional discoveries when prior biological knowledge is incorporated into epistasis detection.
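The truncated product method at the heart of the gene-pair scoring admits a short sketch. The version below is the plain (non-adaptive) Zaykin-style statistic with a Monte Carlo null that assumes independent uniform p-values; in practice SNP-pair p-values within a gene pair are correlated through LD, which is why the paper's adaptive variant, which does not require a known null distribution, is preferable.

```python
# Sketch of a truncated product combination of SNP-pair p-values into one
# gene-pair score (plain TPM with a Monte Carlo null; the adaptive variant
# used in the paper also chooses the truncation point data-adaptively).
import numpy as np

rng = np.random.default_rng(6)
tau, n_mc = 0.05, 5000

def tpm_stat(pvals, tau):
    """Log-product of the p-values at or below the truncation point tau."""
    kept = pvals[pvals <= tau]
    return np.sum(np.log(kept)) if kept.size else 0.0

def tpm_pvalue(pvals, tau=tau, n_mc=n_mc):
    obs = tpm_stat(pvals, tau)
    # Monte Carlo null assuming independent uniform p-values (a toy stand-in
    # for the permutation scheme needed with correlated SNP pairs).
    null = np.array([tpm_stat(rng.uniform(size=pvals.size), tau)
                     for _ in range(n_mc)])
    return (1 + (null <= obs).sum()) / (1 + n_mc)   # smaller stat = stronger

# A hypothetical "gene pair" with 50 SNP-pair p-values, a few of them small.
p_snp_pairs = np.concatenate([rng.uniform(size=47), [1e-4, 5e-4, 2e-3]])
print(f"gene-pair TPM p-value: {tpm_pvalue(p_snp_pairs):.4f}")
```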


2021 · Vol 23 (3)
Author(s): Estelle Chasseloup, Adrien Tessier, Mats O. Karlsson

Abstract
Longitudinal pharmacometric models offer many advantages in the analysis of clinical trial data, but potentially inflated type I error and biased drug effect estimates, as a consequence of model misspecification and multiple testing, are their main drawbacks. In this work, we used real data to compare these aspects for a standard approach (STD) and a new one using mixture models, called individual model averaging (IMA). Placebo arm data sets were obtained from three clinical studies assessing ADAS-Cog scores, Likert pain scores, and seizure frequency. By randomly (1:1) assigning patients in the above data sets to "treatment" or "placebo," we created data sets where any significant drug effect was known to be a false positive. By repeating the process of random assignment and analysis for a significant drug effect many times (N = 1000) for each of the 40 to 66 placebo-drug model combinations, statistics of the type I error and drug effect bias were obtained. Across all models and the three data types, the type I error (as 5th, 25th, 50th, 75th and 95th percentiles, in %) was 4.1, 11.4, 40.6, 100.0 and 100.0 for STD, and 1.6, 3.5, 4.3, 5.0 and 6.0 for IMA. IMA showed no bias in the drug effect estimates, whereas in STD bias was frequently present. In conclusion, STD is associated with inflated type I error and a risk of biased drug effect estimates; IMA demonstrated controlled type I error and no bias.
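The randomization scheme that generates these type I error statistics is simple to emulate. In the sketch below, a plain linear model and likelihood-ratio test stand in for the longitudinal pharmacometric models: placebo subjects are repeatedly split 1:1 into pseudo-arms, so any significant "drug effect" is a false positive by construction.

```python
# Sketch of type I error estimation by random 1:1 assignment of placebo
# subjects to pseudo "treatment" and "placebo" arms. A mean-shift LRT in a
# Gaussian model stands in for the paper's pharmacometric models.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_subj, n_rep, alpha = 80, 1000, 0.05
y = rng.normal(size=n_subj)                 # toy placebo endpoint

rejections = 0
for _ in range(n_rep):
    arm = rng.permutation([0, 1] * (n_subj // 2))   # random 1:1 assignment
    # LRT for a mean shift between pseudo-arms (equivalent to a t-test here).
    rss0 = ((y - y.mean()) ** 2).sum()
    mu = np.array([y[arm == a].mean() for a in (0, 1)])
    rss1 = ((y - mu[arm]) ** 2).sum()
    lrt = n_subj * np.log(rss0 / rss1)      # asymptotically chi-square(1)
    rejections += stats.chi2.sf(lrt, df=1) < alpha

print(f"empirical type I error: {rejections / n_rep:.3f} (nominal {alpha})")
```

With a correctly specified model the empirical rate sits near the nominal 5%; the inflation reported for STD arises when the longitudinal model is misspecified, which this toy does not attempt to reproduce.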


Author(s): Natcha Mahapoonyanont, Suwichaya Putuptim

The power of a test is the probability that the test rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true. Medical research, such as the testing of new medicines, is typically designed to minimize the probability of a type I error: the statistical significance level is set as small as possible, and the probability of a type II error is considered only afterwards. In behavioural and social sciences research, researchers likewise try to avoid type I errors by fixing a stringent significance level. However, this choice can itself affect the validity of the findings: independent variables may have a real influence on the dependent variables that the researcher fails to detect because the significance level was set too low. In some situations, therefore, more attention should be paid to the occurrence of type II errors and less to type I errors; this may yield more realistic and valid results. The objective of this research was to compare the power of the t-test under different sample sizes (n = 30, 60, 90), significance levels (.001, .01, .05), and types of data (real data, transformed data, and data simulated with the Monte Carlo technique). The findings provide useful guidance for researchers applying the t-test and can improve the accuracy of research findings.
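A Monte Carlo power computation of the kind described is straightforward; the sketch below varies the same design factors (n in {30, 60, 90} and significance level in {.001, .01, .05}) with an invented effect size of 0.5 SD, since the abstract does not state one.

```python
# Minimal Monte Carlo estimate of two-sample t-test power across the sample
# sizes and significance levels studied (effect size of 0.5 SD is assumed
# here for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n_rep, effect = 2000, 0.5

for n in (30, 60, 90):
    for sig in (0.001, 0.01, 0.05):
        hits = sum(
            stats.ttest_ind(rng.normal(0.0, 1.0, n),
                            rng.normal(effect, 1.0, n)).pvalue < sig
            for _ in range(n_rep))
        print(f"n={n:3d}, alpha={sig:.3f}: power ~ {hits / n_rep:.2f}")
```

The output makes the trade-off in the passage concrete: at n = 30 and α = .001 the power is very low, i.e. a stringent type I error level sharply raises the type II error rate.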


2019 · Vol 101
Author(s): Lifeng Liu, Pengfei Wang, Jingbo Meng, Lili Chen, Wensheng Zhu, ...

Abstract
In recent years, there has been increasing interest in detecting disease-related rare variants in sequencing studies. Numerous studies have shown that common variants can only explain a small proportion of the phenotypic variance for complex diseases, and more and more evidence suggests that some of this missing heritability can be explained by rare variants. Considering the importance of rare variants, researchers have proposed a considerable number of methods for identifying the rare variants associated with complex diseases. Extensive research has been carried out on testing the association between rare variants and dichotomous, continuous or ordinal traits. So far, however, there has been little discussion of the case in which both genotypes and phenotypes are ordinal variables. This paper introduces a method based on the γ-statistic, called OV-RV, for examining disease-related rare variants when both genotypes and phenotypes are ordinal. At present, little is known about the asymptotic distribution of the γ-statistic when conducting association analyses for rare variants. One advantage of OV-RV is that it provides a robust estimation of the distribution of the γ-statistic by employing the permutation approach proposed by Fisher. We also perform extensive simulations to investigate the numerical performance of OV-RV under various model settings. The simulation results reveal that OV-RV is valid and efficient; namely, it controls the type I error approximately at the pre-specified significance level and achieves greater power at the same significance level. We also apply OV-RV to rare variant association studies of diastolic blood pressure.
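The γ-statistic in question is the Goodman-Kruskal gamma for ordinal pairs; the sketch below computes it from concordant and discordant pairs and attaches a Fisher-style permutation p-value. Data and the number of permutations are illustrative, and the paper's refinements are not reproduced.

```python
# Sketch of a Goodman-Kruskal gamma association test between an ordinal
# genotype and an ordinal phenotype, with a permutation null.
import numpy as np

rng = np.random.default_rng(9)
n = 200
g = rng.binomial(2, 0.1, size=n)     # ordinal genotype (0/1/2, rare variant)
y = rng.integers(0, 4, size=n)       # ordinal phenotype (4 levels)

def gamma_stat(g_vec, y_vec):
    """(concordant - discordant) / (concordant + discordant) over all pairs."""
    dg = np.sign(g_vec[:, None] - g_vec[None, :])
    dy = np.sign(y_vec[:, None] - y_vec[None, :])
    conc = ((dg * dy) > 0).sum()
    disc = ((dg * dy) < 0).sum()
    return (conc - disc) / max(conc + disc, 1)

obs = gamma_stat(g, y)
null = np.array([gamma_stat(g, rng.permutation(y)) for _ in range(1000)])
pval = (1 + (np.abs(null) >= abs(obs)).sum()) / (1 + null.size)
print(f"gamma = {obs:.3f}, permutation p-value = {pval:.3f}")
```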


2017 · Vol 41 (6) · pp. 403-421
Author(s): Sandip Sinharay

Benefiting from item preknowledge is a major type of fraudulent behavior during educational assessments. Belov suggested the posterior shift statistic for detecting item preknowledge and showed its performance to be better on average than that of seven other statistics when the set of compromised items is known. Sinharay suggested a statistic based on the likelihood ratio test for detecting item preknowledge; its advantage is that its null distribution is known. Results from simulated and real data, for both adaptive and nonadaptive tests, are used to demonstrate that the Type I error rate and power of the likelihood-ratio-based statistic are very similar to those of the posterior shift statistic. Thus, the statistic based on the likelihood ratio test appears promising for detecting item preknowledge when the set of compromised items is known.
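Under a simple Rasch model with known item difficulties and a known compromised subset, the likelihood-ratio idea can be sketched directly: compare the fit of a single ability with separate abilities on compromised and uncompromised items. The model, difficulties and subset below are invented for illustration; the operational statistics (2PL/3PL models, one-sided alternatives, and so on) are more involved.

```python
# Sketch of a likelihood-ratio check for item preknowledge under a Rasch
# model with known item difficulties and a known compromised subset.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(10)
n_items = 40
b = rng.normal(size=n_items)                 # known item difficulties
compromised = np.zeros(n_items, dtype=bool)
compromised[:10] = True                      # known compromised items
theta_true = 0.0
resp = rng.random(n_items) < 1 / (1 + np.exp(-(theta_true - b)))

def nll(theta, items):
    """Negative Rasch log-likelihood of the responses on an item subset."""
    prob = 1 / (1 + np.exp(-(theta - b[items])))
    r = resp[items]
    return -(r * np.log(prob) + (1 - r) * np.log(1 - prob)).sum()

def fitted_nll(items):
    return optimize.minimize_scalar(nll, bounds=(-4, 4), args=(items,),
                                    method="bounded").fun

joint = fitted_nll(np.ones(n_items, dtype=bool))            # one ability
split = fitted_nll(compromised) + fitted_nll(~compromised)  # two abilities
lrt = 2 * (joint - split)                                   # always >= 0
print(f"LRT = {lrt:.2f}, p = {stats.chi2.sf(lrt, df=1):.3f}")
```

A large statistic indicates that the examinee performs differently on the compromised subset than elsewhere, which is the signature of preknowledge.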

