An optimal kernel-based multivariate U-statistic to test for associations with multiple phenotypes

Biostatistics ◽  
2020 ◽  
Author(s):  
Y Wen ◽  
Qing Lu

Summary Set-based analyses that jointly consider multiple predictors in a group have been broadly conducted for association tests. However, their power can be sensitive to the distribution of phenotypes and the underlying relationships between predictors and outcomes. Moreover, most set-based methods are designed for single-trait analysis, making it hard to explore pleiotropic effects or borrow information when multiple phenotypes are available. Here, we propose a kernel-based multivariate U-statistic (KMU) that is robust and powerful in testing the association between a set of predictors and multiple outcomes. We employ a rank-based kernel function for the outcomes, which makes our method robust to various outcome distributions. Rather than selecting a single kernel, our test statistic is built on multiple kernels selected in a data-driven manner, and is thus capable of capturing various complex relationships between predictors and outcomes. We have developed the asymptotic properties of our test statistic. Through simulations, we demonstrate that KMU has controlled type I error and higher power than its counterparts. We further show its practical utility by analyzing whole genome sequencing data from the Alzheimer’s Disease Neuroimaging Initiative study, where novel genes were detected to be associated with imaging phenotypes.
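The core construction, a rank-based kernel on the outcomes crossed with a kernel on the predictors, can be illustrated with an HSIC-style sketch. This is not the authors' exact U-statistic (KMU combines multiple data-driven kernels and has its own asymptotics); the linear genotype kernel and the `sigma` bandwidth here are illustrative assumptions.

```python
import numpy as np

def rank_gaussian_kernel(Y, sigma=1.0):
    # Replace outcomes by their column-wise ranks, then apply a Gaussian
    # kernel -- ranking makes the kernel insensitive to the outcome distribution.
    R = np.argsort(np.argsort(Y, axis=0), axis=0).astype(float)
    R /= max(len(Y) - 1, 1)  # scale ranks to [0, 1]
    sq = ((R[:, None, :] - R[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def linear_kernel(G):
    # Simple linear similarity on the predictor (e.g. genotype) matrix.
    return G @ G.T

def hsic_stat(K, L):
    # Centered cross-kernel statistic (HSIC-style); large values suggest
    # that the two kernel similarity structures are associated.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2
```

A significance threshold would come from the statistic's (asymptotic or permutation) null distribution, which this sketch omits.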

2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Guogen Shan ◽  
Amei Amei ◽  
Daniel Young

Sensitivity and specificity are often used to assess the performance of a diagnostic test with binary outcomes. Wald-type test statistics have been proposed for testing sensitivity and specificity individually. In the presence of a gold standard, simultaneous comparison between two diagnostic tests for noninferiority of sensitivity and specificity based on an asymptotic approach was studied by Chen et al. (2003). However, the asymptotic approach may suffer from unsatisfactory type I error control, as observed in many studies, especially in small to medium sample settings. In this paper, we compare three unconditional approaches for simultaneously testing sensitivity and specificity: approaches based on estimation, on maximization, and on a combination of estimation and maximization. Although the estimation approach does not guarantee exact type I error control, it performs satisfactorily in practice. The other two unconditional approaches are exact. The approach based on estimation and maximization is generally more powerful than the approach based on maximization.
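The estimation ("E") approach can be sketched for a single binomial comparison: the nuisance proportion is replaced by its pooled estimate and the p-value is obtained by exact enumeration over both binomials. The paper's setting (paired sensitivity and specificity with noninferiority margins) is richer; this minimal version, assuming a pooled-variance Wald statistic, only shows the plug-in idea.

```python
from math import comb, sqrt

def z_stat(x1, n1, x2, n2):
    # Wald-type statistic for H0: p1 = p2 (pooled variance).
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (p1 - p2) / se

def e_approach_pvalue(x1, n1, x2, n2):
    # Estimation approach: plug the pooled MLE of the nuisance proportion
    # into an exact enumeration over all possible 2x2 tables.
    t_obs = abs(z_stat(x1, n1, x2, n2))
    p_hat = (x1 + x2) / (n1 + n2)
    total = 0.0
    for y1 in range(n1 + 1):
        for y2 in range(n2 + 1):
            if abs(z_stat(y1, n1, y2, n2)) >= t_obs - 1e-12:
                pr = (comb(n1, y1) * p_hat**y1 * (1 - p_hat)**(n1 - y1)
                      * comb(n2, y2) * p_hat**y2 * (1 - p_hat)**(n2 - y2))
                total += pr
    return total
```

The "M" approach would instead maximize the tail probability over all values of the nuisance parameter, which guarantees exactness at a computational cost.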


2021 ◽  
Author(s):  
Guhan Ram Venkataraman ◽  
Yosuke Tanigawa ◽  
Matti Pirinen ◽  
Manuel A Rivas

Rare-variant aggregate analysis of exome and whole genome sequencing data typically summarizes the signal for a gene, or whatever unit is being aggregated, with a single statistic. However, when doing so, the effect profile within the unit may not be easily characterized across one or multiple phenotypes. Here, we present an approach we call the Multiple Rare-Variants and Phenotypes Mixture Model (MRPMM), which clusters rare variants into groups based on their effects on the multivariate phenotype and makes statistical inferences about the properties of the underlying mixture of genetic effects. Using summary statistics from a meta-analysis of exome sequencing data of 184,698 individuals in the UK Biobank across 6 populations, we demonstrate that our mixture model can identify clusters of variants responsible for significantly disparate effects across a multivariate phenotype; we study three lipid and three renal traits separately. The method can estimate (1) the proportion of non-null variants, (2) whether variants with the same predicted consequence in one gene behave similarly, (3) whether variants across genes share effect profiles across the multivariate phenotype, and (4) whether different annotations differ in the magnitude of their effects. As rare-variant data and aggregation techniques become more common, this method can be used to ascribe further meaning to association results.
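The first quantity above, the proportion of non-null variants, can be illustrated with a deliberately simplified two-group EM on variant z-scores. MRPMM itself is a richer multivariate mixture; the fixed prior effect variance `tau2` here is an assumption made for the sketch.

```python
import math

def em_nonnull_fraction(z, tau2=4.0, iters=200):
    # Two-group model on variant z-scores: null ~ N(0, 1),
    # non-null ~ N(0, 1 + tau2). EM estimates pi1, the proportion of
    # non-null variants. tau2 is a hypothetical prior effect variance,
    # held fixed here for simplicity.
    def norm_pdf(x, v):
        return math.exp(-x * x / (2 * v)) / math.sqrt(2 * math.pi * v)
    pi1 = 0.5
    for _ in range(iters):
        # E-step: posterior probability that each variant is non-null
        post = [pi1 * norm_pdf(x, 1 + tau2)
                / (pi1 * norm_pdf(x, 1 + tau2) + (1 - pi1) * norm_pdf(x, 1))
                for x in z]
        # M-step: update the mixing proportion
        pi1 = sum(post) / len(post)
    return pi1
```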


2019 ◽  
Vol 2019 ◽  
pp. 1-8 ◽  
Author(s):  
Can Ateş ◽  
Özlem Kaymaz ◽  
H. Emre Kale ◽  
Mustafa Agah Tekindal

In this study, we investigate how the Wilks’ lambda, Pillai’s trace, Hotelling’s trace, and Roy’s largest root test statistics are affected when the normality and homogeneity-of-variance assumptions of the MANOVA method are violated; in other words, we examine the robustness of the tests in these cases. For this purpose, a simulation study is conducted under different scenarios. For different numbers of variables and different sample sizes, considering group variances that are homogeneous (σ1² = σ2² = ⋯ = σg²) and heterogeneous (increasing, σ1² < σ2² < ⋯ < σg²), random numbers are generated from Gamma(4-4-4; 0.5), Gamma(4-9-36; 0.5), Student’s t(2), and Normal(0; 1) distributions. Furthermore, balanced and unbalanced numbers of observations in the groups are also taken into account. After 10,000 repetitions, type I error rates are calculated for each test at α = 0.05. In the Gamma distribution, Pillai’s trace gives more robust results under both homogeneous and heterogeneous variances for 2 variables; for 3 variables, Roy’s largest root is more robust in balanced samples and Pillai’s trace in unbalanced samples. In Student’s t distribution, Pillai’s trace is more robust under homogeneous variance and Wilks’ lambda under heterogeneous variance. In the normal distribution, under homogeneous variance, Roy’s largest root gives relatively more robust results for 2 variables and Wilks’ lambda for 3 variables; under heterogeneous variance, Roy’s largest root gives robust results for both 2 and 3 variables. Overall, the test statistics used with MANOVA are affected by violations of the homogeneity of covariance matrices and normality assumptions, particularly when the numbers of observations are unbalanced.
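All four statistics compared above are functions of the eigenvalues of H E⁻¹, where H is the between-group and E the within-group sums-of-squares-and-cross-products matrix. A minimal sketch:

```python
import numpy as np

def manova_stats(H, E):
    # The four classical MANOVA statistics are all functions of the
    # eigenvalues of H E^{-1} (H: between-group SSCP, E: within-group SSCP).
    lam = np.linalg.eigvals(np.linalg.solve(E, H)).real
    lam = np.clip(lam, 0, None)          # guard against tiny negative noise
    wilks = np.prod(1.0 / (1.0 + lam))   # Wilks' lambda
    pillai = np.sum(lam / (1.0 + lam))   # Pillai's trace
    hotelling = np.sum(lam)              # Hotelling-Lawley trace
    roy = lam.max()                      # Roy's largest root
    return wilks, pillai, hotelling, roy
```

Their differing sensitivity to the eigenvalue spread (Roy depends only on the largest root, Pillai averages over all) is what drives the differing robustness observed in the simulations.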


Methodology ◽  
2016 ◽  
Vol 12 (2) ◽  
pp. 44-51 ◽  
Author(s):  
José Manuel Caperos ◽  
Ricardo Olmos ◽  
Antonio Pardo

Abstract. Correlation analysis is one of the most widely used methods to test hypotheses in the social and health sciences; however, its use is not completely error free. We explored the frequency of inconsistencies between reported p-values and the associated test statistics in 186 papers published in four Spanish journals of psychology (1,950 correlation tests); we also collected information about the use of one- versus two-tailed tests in the presence of directional hypotheses, and about the use of adjustments to control the Type I error rate under simultaneous inference. Most reported correlation tests (83.8%) are incomplete, and 92.5% include an inexact p-value. Gross inconsistencies, which are liable to alter the statistical conclusions, appear in 4% of the reviewed tests, and 26.9% of the inconsistencies found were large enough to bias the results of a meta-analysis. The choice of one-tailed tests and the use of adjustments to control the Type I error rate are negligible. We therefore urge authors, reviewers, and editorial boards to pay particular attention to this issue in order to prevent inconsistencies in statistical reports.
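The consistency check behind such audits can be sketched by recomputing the p-value from a reported correlation r and sample size n via t = r·√((n−2)/(1−r²)) with n−2 degrees of freedom. To stay dependency-free, this sketch substitutes a normal tail for the t tail, which is only adequate for moderate-to-large n; the "gross inconsistency" criterion mirrors the paper's definition of a discrepancy that flips the significance decision.

```python
from math import sqrt, erfc

def p_from_r(r, n):
    # Two-tailed p-value for H0: rho = 0. Normal approximation to the
    # t(n-2) tail (assumption made to avoid external dependencies).
    t = abs(r) * sqrt((n - 2) / (1 - r * r))
    return erfc(t / sqrt(2))

def grossly_inconsistent(r, n, reported_p, alpha=0.05):
    # A gross inconsistency: the recomputed and reported p-values fall on
    # opposite sides of the significance threshold.
    return (p_from_r(r, n) < alpha) != (reported_p < alpha)
```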


2016 ◽  
Vol 27 (3) ◽  
pp. 905-919
Author(s):  
Anne Buu ◽  
L Keoki Williams ◽  
James J Yang

We propose a new genome-wide association test for mixed binary and continuous phenotypes that uses an efficient numerical method to estimate the empirical distribution of Fisher’s combination statistic under the null hypothesis. Our simulation study shows that the proposed method controls the type I error rate and maintains power at the level of the permutation method. More importantly, the computational efficiency of the proposed method is much higher than that of the permutation method. The simulation results also indicate that the power of the test increases as the genetic effect increases, the minor allele frequency increases, and the correlation between responses decreases. The statistical analysis of the database of the Study of Addiction: Genetics and Environment demonstrates that the proposed method, by combining multiple phenotypes, can increase the power of identifying markers that may not otherwise be chosen using marginal tests.
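Fisher's combination statistic itself is simple: −2·Σ log pᵢ, which is chi-squared with 2k degrees of freedom when the k p-values are independent. The paper's contribution is estimating its null distribution when the phenotypes are correlated; the independence-case sketch below is only the baseline it improves on.

```python
from math import log, exp

def fisher_combination(pvals):
    # Fisher's statistic: -2 * sum(log p_i). Under independence it is
    # chi-squared with 2k degrees of freedom (k = number of p-values).
    return -2.0 * sum(log(p) for p in pvals)

def chi2_sf_even_df(x, df):
    # Survival function of chi-squared with even df = 2k, which has the
    # closed form P(X > x) = exp(-x/2) * sum_{j<k} (x/2)^j / j!
    k = df // 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= (x / 2) / j
        total += term
    return exp(-x / 2) * total
```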


2020 ◽  
Vol 36 (10) ◽  
pp. 3099-3106
Author(s):  
Burim Ramosaj ◽  
Lubna Amro ◽  
Markus Pauly

Abstract

Motivation: Imputation procedures in biomedical fields have become standard statistical practice, since further analyses can be conducted as if no values had ever been missing. In particular, non-parametric imputation schemes such as the random forest have shown favorable imputation performance compared with the more traditionally used MICE procedure. However, their effect on valid statistical inference has not been analyzed so far. This article closes this gap by investigating their validity for inferring mean differences in incompletely observed pairs, opposing them to a recent approach that works only with the observations at hand.

Results: Our findings indicate that machine-learning schemes for (multiply) imputing missing values may inflate the type I error or result in comparably low power in small-to-moderate matched pairs, even after modifying the test statistics using Rubin’s multiple imputation rule. In addition to an extensive simulation study, an illustrative data example from a breast cancer gene study is considered.

Availability and implementation: The corresponding R code can be obtained from the authors, and the gene expression data can be downloaded at www.gdac.broadinstitute.org.

Supplementary information: Supplementary data are available at Bioinformatics online.
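Rubin's multiple imputation rule, referenced above, pools the m completed-data analyses into a single estimate and a total variance that accounts for between-imputation uncertainty. A minimal sketch:

```python
def rubin_pool(estimates, variances):
    # Rubin's rules: pool m completed-data estimates into one inference.
    m = len(estimates)
    qbar = sum(estimates) / m                              # pooled estimate
    ubar = sum(variances) / m                              # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = ubar + (1 + 1 / m) * b                             # total variance
    return qbar, t
```

The article's point is that even with this correction applied, the test built on machine-learning imputations can be invalid, because the rule assumes a "proper" imputation model.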


2004 ◽  
Vol 3 (1) ◽  
pp. 1-69 ◽  
Author(s):  
Sandrine Dudoit ◽  
Mark J. van der Laan ◽  
Katherine S. Pollard

The present article proposes general single-step multiple testing procedures for controlling Type I error rates defined as arbitrary parameters of the distribution of the number of Type I errors, such as the generalized family-wise error rate. A key feature of our approach is the test statistics null distribution (rather than data generating null distribution) used to derive cut-offs (i.e., rejection regions) for these test statistics and the resulting adjusted p-values. For general null hypotheses, corresponding to submodels for the data generating distribution, we identify an asymptotic domination condition for a null distribution under which single-step common-quantile and common-cut-off procedures asymptotically control the Type I error rate, for arbitrary data generating distributions, without the need for conditions such as subset pivotality. Inspired by this general characterization of a null distribution, we then propose as an explicit null distribution the asymptotic distribution of the vector of null value shifted and scaled test statistics. In the special case of family-wise error rate (FWER) control, our method yields the single-step minP and maxT procedures, based on minima of unadjusted p-values and maxima of test statistics, respectively, with the important distinction in the choice of null distribution. Single-step procedures based on consistent estimators of the null distribution are shown to also provide asymptotic control of the Type I error rate. A general bootstrap algorithm is supplied to conveniently obtain consistent estimators of the null distribution. The special cases of t- and F-statistics are discussed in detail. 
The companion articles focus on step-down multiple testing procedures for control of the FWER (van der Laan et al., 2004b) and on augmentations of FWER-controlling methods to control error rates such as tail probabilities for the number of false positives and for the proportion of false positives among the rejected hypotheses (van der Laan et al., 2004a). The proposed bootstrap multiple testing procedures are evaluated by a simulation study and applied to genomic data in the fourth article of the series (Pollard et al., 2004).
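The single-step maxT procedure mentioned above can be sketched given observed statistics and draws from an estimated joint null distribution (e.g. bootstrap replicates); the specific null distribution construction, which is the article's main contribution, is assumed to have produced `t_null` here.

```python
def maxt_adjusted_pvalues(t_obs, t_null):
    # Single-step maxT: the adjusted p-value for each hypothesis is the
    # fraction of null replicates whose *maximum* |statistic| across all
    # hypotheses exceeds that hypothesis's observed |statistic|. Taking
    # the maximum is what delivers family-wise error rate control.
    maxima = [max(abs(t) for t in rep) for rep in t_null]
    B = len(t_null)
    return [sum(m >= abs(t) for m in maxima) / B for t in t_obs]
```

The minP analogue replaces statistics with unadjusted p-values and maxima with minima.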


2017 ◽  
Vol 13 (1) ◽  
Author(s):  
Asanao Shimokawa ◽  
Etsuo Miyaoka

Abstract

To estimate or test the treatment effect in randomized clinical trials, it is important to adjust for the potential influence of covariates that are likely to affect the association between the treatment or control group and the response. If these covariates are known at the start of the trial, random assignment of the treatment within each stratum would be considered. On the other hand, if these covariates are not clear at the start of the trial, or if it is difficult to allocate the treatment within each stratum, completely randomized assignment of the treatment would be performed. Under both sampling structures, a stratified adjusted test is a useful way to evaluate the significance of the overall treatment effect by reducing the variance and/or bias of the result. If the trial has a binary endpoint, the Cochran and Mantel-Haenszel tests are generally used. These tests are constructed under the assumption that the number of patients within each stratum is fixed. In practice, however, the stratum sizes are often not fixed at the start of the trial and are instead allowed to vary. There is therefore a risk that using these tests in such situations results in an error in the estimated variance of the test statistics. To handle this problem, we propose new test statistics under both sampling structures based on multinomial distributions. Our proposed approach is based on the Cochran test, and the two tests tend to give similar values when the number of patients is large. When the total number of patients is small, our approach yields a more conservative result. Through simulation studies, we show that the new approach maintains the type I error better than the traditional approach.
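The fixed-stratum-size baseline, the Cochran-Mantel-Haenszel statistic for K stratified 2×2 tables, can be sketched as follows; the paper's multinomial variant, which lets stratum sizes vary, is not reproduced here.

```python
def cmh_statistic(tables):
    # Cochran-Mantel-Haenszel statistic across K stratified 2x2 tables,
    # each given as (a, b, c, d) =
    # (trt-event, trt-no-event, ctl-event, ctl-no-event).
    num = 0.0
    var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a - (a + b) * (a + c) / n                 # observed - expected
        var += ((a + b) * (c + d) * (a + c) * (b + d)
                / (n ** 2 * (n - 1)))                    # hypergeometric variance
    return num ** 2 / var  # ~ chi-squared with 1 df under H0
```

Note that the variance term above conditions on all margins being fixed; it is exactly this conditioning that the proposed multinomial approach relaxes.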


2021 ◽  
pp. 096228022110616
Author(s):  
Bo Chen ◽  
Wei Xu

Functional regression has been widely used on longitudinal data, but it is not clear how to apply functional regression to microbiome sequencing data. We propose a novel functional response regression model for analyzing correlated longitudinal microbiome sequencing data, which extends the classic functional response regression model that works only for independent functional responses. We derive the theory of generalized least squares estimators for predictors’ effects when functional responses are correlated, and develop a data transformation technique that solves the computational challenge of analyzing correlated functional response data with existing functional regression methods. We show by extensive simulations that our proposed method provides unbiased estimates of predictors’ effects, and that our model has accurate type I error and power performance for correlated functional response data compared with the classic functional response regression model. Finally, we apply our method to a real infant gut microbiome study to evaluate the relationship of clinical factors to predominant taxa over time.
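The data-transformation idea can be illustrated in the finite-dimensional case: pre-multiplying responses and design by Σ^(−1/2) (here via a Cholesky factor) whitens the correlation, so a GLS problem becomes ordinary least squares on transformed data. The functional setting replaces Σ with a covariance operator, which this sketch does not attempt.

```python
import numpy as np

def gls_transform(Y, X, Sigma):
    # Cholesky whitening: with Sigma = L L', pre-multiplying by L^{-1}
    # turns correlated-response GLS into OLS on transformed data.
    L = np.linalg.cholesky(Sigma)
    Li = np.linalg.inv(L)
    return Li @ Y, Li @ X

def gls_beta(Y, X, Sigma):
    # GLS estimate obtained by running OLS on the whitened problem.
    Yt, Xt = gls_transform(Y, X, Sigma)
    return np.linalg.lstsq(Xt, Yt, rcond=None)[0]
```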


2015 ◽  
Vol 112 (4) ◽  
pp. 1019-1024 ◽  
Author(s):  
Yi-Juan Hu ◽  
Yun Li ◽  
Paul L. Auer ◽  
Dan-Yu Lin

In the large cohorts that have been used for genome-wide association studies (GWAS), it is prohibitively expensive to sequence all cohort members. A cost-effective strategy is to sequence subjects with extreme values of quantitative traits or those with specific diseases. By imputing the sequencing data from the GWAS data for the cohort members who are not selected for sequencing, one can dramatically increase the number of subjects with information on rare variants. However, ignoring the uncertainties of imputed rare variants in downstream association analysis will inflate the type I error when sequenced subjects are not a random subset of the GWAS subjects. In this article, we provide a valid and efficient approach to combining observed and imputed data on rare variants. We consider commonly used gene-level association tests, all of which are constructed from the score statistic for assessing the effects of individual variants on the trait of interest. We show that the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects is unbiased. We derive a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, such that the corresponding association tests always have correct type I error. We demonstrate through extensive simulation studies that the proposed tests are substantially more powerful than the use of accurately imputed variants only and the use of sequencing data alone. We provide an application to the Women’s Health Initiative. The relevant software is freely available.
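The unbiasedness claim rests on a score statistic that is linear in the genotypes, so replacing an unobserved genotype by its imputed expectation leaves the statistic's mean intact. A stripped-down burden-style sketch, with a naive variance in place of the paper's robust estimator (an acknowledged simplification):

```python
import numpy as np

def burden_score_stat(g, y):
    # Score statistic for a burden-style test: U = sum_i g_i (y_i - ybar),
    # where g mixes observed genotypes (sequenced subjects) and imputed
    # dosages (non-sequenced subjects). U stays unbiased because the
    # imputed dosage is the conditional expectation of the genotype.
    g, y = np.asarray(g, float), np.asarray(y, float)
    u = np.sum(g * (y - y.mean()))
    # Naive model-based variance; the paper's robust estimator, which
    # reflects the sampling scheme and imputation quality, replaces this.
    v = np.var(y) * np.sum((g - g.mean()) ** 2)
    return u / np.sqrt(v)
```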

