Finding Statistically Significant Interactions between Continuous Features

Author(s):  
Mahito Sugiyama ◽  
Karsten Borgwardt

The search for higher-order feature interactions that are statistically significantly associated with a class variable is of high relevance in fields such as Genetics or Healthcare, but the combinatorial explosion of the candidate space makes this problem extremely challenging in terms of computational efficiency and proper correction for multiple testing. While recent progress has been made regarding this challenge for binary features, we here present the first solution for continuous features. We propose an algorithm which overcomes the combinatorial explosion of the search space of higher-order interactions by deriving a lower bound on the p-value for each interaction, which enables us to massively prune interactions that can never reach significance and to thereby gain more statistical power. In our experiments, our approach efficiently detects all significant interactions in a variety of synthetic and real-world datasets.

2019 ◽  
Vol 116 (4) ◽  
pp. 1195-1200 ◽  
Author(s):  
Daniel J. Wilson

Analysis of “big data” frequently involves statistical comparison of millions of competing hypotheses to discover hidden processes underlying observed patterns of data, for example, in the search for genetic determinants of disease in genome-wide association studies (GWAS). Controlling the familywise error rate (FWER) is considered the strongest protection against false positives but makes it difficult to reach the multiple testing-corrected significance threshold. Here, I introduce the harmonic mean p-value (HMP), which controls the FWER while greatly improving statistical power by combining dependent tests using the generalized central limit theorem. I show that the HMP effortlessly combines information to detect statistically significant signals among groups of individually nonsignificant hypotheses in examples of a human GWAS for neuroticism and a joint human–pathogen GWAS for hepatitis C viral load. The HMP simultaneously tests all ways to group hypotheses, allowing the smallest groups of hypotheses that retain significance to be sought. The power of the HMP to detect significant hypothesis groups is greater than the power of the Benjamini–Hochberg procedure to detect significant hypotheses, although the latter only controls the weaker false discovery rate (FDR). The HMP has broad implications for the analysis of large datasets, because it enhances the potential for scientific discovery.
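As a rough illustration of the statistic named in this abstract, the sketch below computes a weighted harmonic mean of a group of p-values. The function name `harmonic_mean_p` and the equal default weights are assumptions made for the example. The sketch returns only the raw statistic; the calibration and significance thresholds that the paper uses to control the FWER are not reproduced here.

```python
def harmonic_mean_p(p_values, weights=None):
    """Weighted harmonic mean of p-values (hypothetical helper, raw statistic only).

    Weights default to equal and are normalised by their sum, so the result
    is sum(w) / sum(w_i / p_i).
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("weights and p_values must have the same length")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    total_weight = sum(weights)
    return total_weight / sum(w / p for w, p in zip(weights, p_values))


# Three individually modest p-values combined into one group-level statistic.
group_hmp = harmonic_mean_p([0.01, 0.02, 0.04])
```

Because small p-values dominate the sum of reciprocals, the harmonic mean sits closer to the smallest p-values in the group than the arithmetic mean does.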


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i840-i848
Author(s):  
Thomas Gumbsch ◽  
Christian Bock ◽  
Michael Moor ◽  
Bastian Rieck ◽  
Karsten Borgwardt

Motivation: Temporal biomarker discovery in longitudinal data is based on detecting reoccurring trajectories, the so-called shapelets. The search for shapelets requires considering all subsequences in the data. While the accompanying issue of multiple testing has been mitigated in previous work, the redundancy and overlap of the detected shapelets result in an a priori unbounded number of highly similar and structurally meaningless shapelets. As a consequence, current temporal biomarker discovery methods are impractical and underpowered. Results: We find that the pre- or post-processing of shapelets does not sufficiently increase the power and practical utility. Consequently, we present a novel method for temporal biomarker discovery: Statistically Significant Submodular Subset Shapelet Mining (S5M), which retrieves short subsequences that (i) occur in the data, (ii) are statistically significantly associated with the phenotype and (iii) are of manageable quantity while maximizing structural diversity. Structural diversity is achieved by pruning non-representative shapelets via submodular optimization. This increases the statistical power and utility of S5M compared to state-of-the-art approaches on simulated and real-world datasets. For patients admitted to the intensive care unit (ICU) showing signs of severe organ failure, we find temporal patterns in the sequential organ failure assessment score that are associated with in-ICU mortality. Availability and implementation: S5M is an option in the python package of S3M: github.com/BorgwardtLab/S3M.


2017 ◽  
Author(s):  
Daniel J. Wilson

Analysis of ‘big data’ frequently involves statistical comparison of millions of competing hypotheses to discover hidden processes underlying observed patterns of data, for example in the search for genetic determinants of disease in genome-wide association studies (GWAS). Controlling the family-wise error rate (FWER) is considered the strongest protection against false positives, but makes it difficult to reach the multiple testing-corrected significance threshold. Here I introduce the harmonic mean p-value (HMP), which controls the FWER while greatly improving statistical power by combining dependent tests using the generalized central limit theorem. I show that the HMP easily combines information to detect statistically significant signals among groups of individually nonsignificant hypotheses in examples of a human GWAS for neuroticism and a joint human-pathogen GWAS for hepatitis C viral load. The HMP simultaneously tests all combinations of hypotheses, allowing the smallest groups of hypotheses that retain significance to be sought. The power of the HMP to detect significant hypothesis groups is greater than the power of the Benjamini-Hochberg procedure to detect significant hypotheses, even though the latter only controls the weaker false discovery rate (FDR). The HMP has broad implications for the analysis of large datasets because it enhances the potential for scientific discovery.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Sangyoon Yi ◽  
Xianyang Zhang ◽  
Lu Yang ◽  
Jinyan Huang ◽  
Yuanhang Liu ◽  
...  

One challenge facing omics association studies is the loss of statistical power when adjusting for confounders and multiple testing. The traditional statistical procedure involves fitting a confounder-adjusted regression model for each omics feature, followed by multiple testing correction. Here we show that the traditional procedure is not optimal and present a new approach, 2dFDR, a two-dimensional false discovery rate control procedure, for powerful confounder adjustment in multiple testing. Through extensive evaluation, we demonstrate that 2dFDR is more powerful than the traditional procedure, and in the presence of strong confounding and weak signals, the power improvement could be more than 100%.


Biostatistics ◽  
2017 ◽  
Vol 18 (3) ◽  
pp. 477-494 ◽  
Author(s):  
Jakub Pecanka ◽  
Marianne A. Jonker ◽  
Zoltan Bochdanovits ◽  
Aad W. Van Der Vaart ◽  

Summary: For over a decade functional gene-to-gene interaction (epistasis) has been suspected to be a determinant in the “missing heritability” of complex traits. However, searching for epistasis on the genome-wide scale has been challenging due to the prohibitively large number of tests which result in a serious loss of statistical power as well as computational challenges. In this article, we propose a two-stage method applicable to existing case-control data sets, which aims to lessen both of these problems by pre-assessing whether a candidate pair of genetic loci is involved in epistasis before it is actually tested for interaction with respect to a complex phenotype. The pre-assessment is based on a two-locus genotype independence test performed in the sample of cases. Only the pairs of loci that exhibit non-equilibrium frequencies are analyzed via a logistic regression score test, thereby reducing the multiple testing burden. Since only the computationally simple independence tests are performed for all pairs of loci while the more demanding score tests are restricted to the most promising pairs, genome-wide association study (GWAS) for epistasis becomes feasible. By design our method provides strong control of the type I error. Its favourable power properties especially under the practically relevant misspecification of the interaction model are illustrated. Ready-to-use software is available. Using the method we analyzed Parkinson’s disease in four cohorts and identified possible interactions within several SNP pairs in multiple cohorts.


2013 ◽  
Vol 45 (2) ◽  
pp. 79-88 ◽  
Author(s):  
Virginia M. Miller ◽  
Tanya M. Petterson ◽  
Elysia N. Jeavons ◽  
Abhinita S. Lnu ◽  
David N. Rider ◽  
...  

Menopausal hormone treatment (MHT) may limit progression of cardiovascular disease (CVD) but poses a thrombosis risk. To test targeted candidate gene variation for association with subclinical CVD defined by carotid artery intima-media thickness (CIMT) and coronary artery calcification (CAC), 610 women participating in the Kronos Early Estrogen Prevention Study (KEEPS), a clinical trial of MHT to prevent progression of CVD, were genotyped for 13,229 single nucleotide polymorphisms (SNPs) within 764 genes from anticoagulant, procoagulant, fibrinolytic, or innate immunity pathways. According to linear regression, proportion of European ancestry correlated negatively, but age at enrollment and pulse pressure correlated positively with CIMT. Adjusting for these variables, two SNPs, one on chromosome 2 for MAP4K4 gene (rs2236935, β = 0.037, P value = 2.36 × 10^-6) and one on chromosome 5 for IL5 gene (rs739318, β = 0.051, P value = 5.02 × 10^-5), associated positively with CIMT; two SNPs on chromosome 17 for CCL5 (rs4796119, β = −0.043, P value = 3.59 × 10^-5; rs2291299, β = −0.032, P value = 5.59 × 10^-5) correlated negatively with CIMT; only rs2236935 remained significant after correcting for multiple testing. Using logistic regression, when we adjusted for waist circumference, two SNPs (rs11465886, IRAK2, chromosome 3, OR = 3.91, P value = 1.10 × 10^-4; and rs17751769, SERPINA1, chromosome 14, OR = 1.96, P value = 2.42 × 10^-4) associated positively with a CAC score of >0 Agatston unit; one SNP (rs630014, ABO, OR = 0.51, P value = 2.51 × 10^-4) associated negatively; none remained significant after correcting for multiple testing. Whether these SNPs associate with CIMT and CAC in women randomized to MHT remains to be determined.


Breast Care ◽  
2016 ◽  
Vol 11 (4) ◽  
pp. 240-246 ◽  
Author(s):  
Ute Berndt ◽  
Bernd Leplow ◽  
Robby Schoenfeld ◽  
Tilmann Lantzsch ◽  
Regina Grosse ◽  
...  

Introduction: It is generally accepted that estrogens play a protective role in cognitive function. Therefore, it can be expected that subtotal estrogen deprivation following aromatase inhibition will alter cognitive performance. Methods: In a cross-sectional study we investigated 80 postmenopausal women with breast cancer. Memory and spatial cognition were compared across 4 treatment groups: tamoxifen only (TAM, n = 22), aromatase inhibitor only (AI, n = 22), TAM followed by AI ('SWITCH group', n = 15), and patients with local therapy (LT) only (surgery and radiation, n = 21). Duration of the 2 endocrine monotherapy arms prior to the assessment ranged from 1 to 3 years. The 'SWITCH group' received 2-3 years TAM followed by at least 1 year and at most 3 years of AI. Memory and spatial cognition were investigated as planned comparisons. Investigations of processing speed, attention, executive function, visuoconstruction and self-perception of memory were exploratory. Results: With regard to general memory, AI patients performed significantly worse than the LT group (p = 0.013). Significant differences in verbal memory did not remain significant after p-value correction for multiple testing. We found no significant differences concerning spatial cognition between the groups. Conclusion: AI treatment alone significantly impairs general memory compared to the LT group.


Stroke ◽  
2021 ◽  
Vol 52 (Suppl_1) ◽  
Author(s):  
Sarah E Wetzel-Strong ◽  
Shantel M Weinsheimer ◽  
Jeffrey Nelson ◽  
Ludmila Pawlikowska ◽  
Dewi Clark ◽  
...  

Objective: Circulating plasma protein profiling may aid in the identification of cerebrovascular disease signatures. This study aimed to identify circulating angiogenic and inflammatory proteins that may serve as biomarkers to differentiate sporadic brain arteriovenous malformation (bAVM) patients from other conditions with brain AVMs, including hereditary hemorrhagic telangiectasia (HHT) patients. Methods: The Quantibody Human Angiogenesis Array 1000 (Raybiotech) is an ELISA multiplex panel that was used to assess the levels of 60 proteins related to angiogenesis and inflammation in heparin plasma samples from 13 sporadic unruptured bAVM patients (69% male, mean age 51 years) and 37 patients with HHT (40% male, mean age 47 years, n=19 (51%) with bAVM). The Quantibody Q-Analyzer tool was used to calculate biomarker concentrations based on the standard curve for each marker, and log-transformed marker levels were evaluated for associations between disease states using a multivariable interval regression model adjusted for age, sex, ethnicity and collection site. Statistical significance was based on Bonferroni correction for multiple testing of 60 biomarkers (P < 8.3 × 10^-4). Results: Circulating levels of two plasma proteins differed significantly between sporadic bAVM and HHT patients: PDGF-BB (P = 2.6 × 10^-4, PI = 3.37, 95% CI: 1.76-6.46) and CCL5 (P = 6.0 × 10^-6, PI = 3.50, 95% CI: 2.04-6.03). When considering markers with a nominal p-value of less than 0.01, MMP1 and angiostatin levels also differed between patients with sporadic bAVM and HHT. Markers with nominal p-values less than 0.05 when comparing sporadic brain AVM and HHT patients also included angiostatin, IL2, VEGF, GRO, CXCL16, ITAC, and TGFB3. Among HHT patients, the circulating levels of UPAR and IL6 were elevated in patients with documented bAVMs when considering markers with nominal p-values less than 0.05.
Conclusions: This study identified differential expression of two promising plasma biomarkers that differentiate sporadic bAVMs from patients with HHT. Furthermore, this study allowed us to evaluate markers that are associated with the presence of bAVMs in HHT patients, which may offer insight into mechanisms underlying bAVM pathophysiology.
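The Bonferroni threshold quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes a family-wise level of 0.05, which the abstract does not state explicitly but which is consistent with the reported cutoff for 60 tests; the variable names are illustrative only.

```python
ALPHA = 0.05   # assumed family-wise level (not stated in the abstract)
N_TESTS = 60   # number of biomarkers tested, as reported

# Bonferroni: each individual test is held to alpha / number of tests.
threshold = ALPHA / N_TESTS

# P-values reported in the abstract for the two significant proteins.
reported = {"PDGF-BB": 2.6e-4, "CCL5": 6.0e-6}
significant = {name: p < threshold for name, p in reported.items()}
```

Under that assumption the threshold is about 8.3 × 10^-4, and both reported p-values fall below it.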


Genes ◽  
2018 ◽  
Vol 9 (10) ◽  
pp. 496 ◽  
Author(s):  
Bethany Wolf ◽  
Paula Ramos ◽  
J. Hyer ◽  
Viswanathan Ramakrishnan ◽  
Gary Gilkeson ◽  
...  

Development and progression of many human diseases, such as systemic lupus erythematosus (SLE), are hypothesized to result from interactions between genetic and environmental factors. Current approaches to identify and evaluate interactions are limited, most often focusing on main effects and two-way interactions. While higher order interactions associated with disease are documented, they are difficult to detect since expanding the search space to all possible interactions of p predictors means evaluating 2^p − 1 terms. For example, data with 150 candidate predictors requires considering over 10^45 main effects and interactions. In this study, we present an analytical approach involving selection of candidate single nucleotide polymorphisms (SNPs) and environmental and/or clinical factors and use of Logic Forest to identify predictors of disease, including higher order interactions, followed by confirmation of the association between those predictors and interactions identified with disease outcome using logistic regression. We applied this approach to a study investigating whether smoking and/or secondhand smoke exposure interacts with candidate SNPs resulting in elevated risk of SLE. The approach identified both genetic and environmental risk factors, with evidence suggesting potential interactions between exposure to secondhand smoke as a child and genetic variation in the ITGAM gene associated with increased risk of SLE.
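The size of the search space described in this abstract is easy to verify. The short sketch below counts the non-empty subsets of p predictors and compares that with the number of main effects plus two-way interactions alone; the function name is hypothetical.

```python
import math


def n_candidate_terms(p):
    """Number of non-empty subsets of p predictors: main effects plus all interactions."""
    return 2 ** p - 1


all_terms = n_candidate_terms(150)

# For contrast: main effects plus two-way interactions only.
main_and_pairwise = 150 + math.comb(150, 2)
```

With 150 predictors the full count exceeds 10^45, whereas restricting the search to main effects and pairs leaves only 11,325 terms, which is why most approaches stop at two-way interactions.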


2019 ◽  
Author(s):  
Mathias Kuhring ◽  
Joerg Doellinger ◽  
Andreas Nitsche ◽  
Thilo Muth ◽  
Bernhard Y. Renard

Untargeted accurate strain-level classification of a priori unidentified organisms using tandem mass spectrometry is a challenging task. Reference databases often lack taxonomic depth, limiting peptide assignments to the species level. However, the extension with detailed strain information increases runtime and decreases statistical power. In addition, larger databases contain a higher number of similar proteomes. We present TaxIt, an iterative workflow to address the increasing search space required for MS/MS-based strain-level classification of samples with unknown taxonomic origin. TaxIt first applies reference sequence data for initial identification of species candidates, followed by automated acquisition of relevant strain sequences for low level classification. Furthermore, proteome similarities resulting in ambiguous taxonomic assignments are addressed with an abundance weighting strategy to improve candidate confidence. We apply our iterative workflow on several samples of bacterial and viral origin. In comparison to non-iterative approaches using unique peptides or advanced abundance correction, TaxIt identifies microbial strains correctly in all examples presented (with one tie), thereby demonstrating the potential for untargeted and deeper taxonomic classification. TaxIt makes extensive use of public, unrestricted and continuously growing sequence resources such as the NCBI databases and is available under open-source license at https://gitlab.com/rki_bioinformatics.

