Multiple Hypothesis Testing for Data Mining

Author(s):  
Sach Mukherjee

A number of important problems in data mining can be usefully addressed within the framework of statistical hypothesis testing. However, while the conventional treatment of statistical significance deals with error probabilities at the level of a single variable, practical data mining tasks tend to involve thousands, if not millions, of variables. This Chapter looks at some of the issues that arise in the application of hypothesis tests to multi-variable data mining problems, and describes two computationally efficient procedures by which these issues can be addressed.
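The abstract does not name the two procedures the chapter describes; purely as a hedged illustration of why per-variable significance breaks down when thousands of variables are tested, the Python sketch below applies two standard corrections, Bonferroni and Benjamini-Hochberg, to simulated p-values. All data and thresholds here are invented for illustration.

```python
# Toy sketch of multiple-testing control (not necessarily the chapter's two
# procedures): Bonferroni bounds the family-wise error rate, Benjamini-Hochberg
# controls the false discovery rate. All p-values below are simulated.
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                               # number of variables tested
p = rng.uniform(size=m)                  # simulated p-values under the null
p[:50] = rng.uniform(0, 1e-4, size=50)   # a few genuinely "interesting" variables

alpha = 0.05

# Bonferroni: reject when p_i <= alpha / m
bonferroni_hits = np.sum(p <= alpha / m)

# Benjamini-Hochberg: reject the k smallest p-values, where k is the largest
# rank with p_(k) <= k * alpha / m
ranked = np.sort(p)
k = np.arange(1, m + 1)
passed = ranked <= k * alpha / m
bh_hits = 0 if not passed.any() else int(np.max(np.where(passed)[0]) + 1)

print(f"Bonferroni rejections: {bonferroni_hits}")
print(f"Benjamini-Hochberg rejections: {bh_hits}")
```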


2019, Vol 81 (8), pp. 535-542
Author(s):
Robert A. Cooper

Statistical methods are indispensable to the practice of science. But statistical hypothesis testing can seem daunting, with P-values, null hypotheses, and the concept of statistical significance. This article explains the concepts associated with statistical hypothesis testing using the story of “the lady tasting tea,” then walks the reader through an application of the independent-samples t-test using data from Peter and Rosemary Grant's investigations of Darwin's finches. Understanding how scientists use statistics is an important component of scientific literacy, and students should have opportunities to use statistical methods like this in their science classes.
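As a minimal sketch of the independent-samples t-test the article walks through, the snippet below compares two small samples of beak-depth measurements. The numbers are invented for illustration and are not the Grants' actual finch data.

```python
# Minimal sketch of an independent-samples t-test.
# The beak-depth values below are hypothetical, not the Grants' measurements.
from scipy import stats

beak_depth_year_1 = [8.9, 9.2, 9.6, 8.8, 9.4, 9.1, 9.0, 9.3]    # hypothetical (mm)
beak_depth_year_2 = [9.7, 9.9, 10.1, 9.6, 10.0, 9.8, 9.5, 10.2]  # hypothetical (mm)

t_stat, p_value = stats.ttest_ind(beak_depth_year_1, beak_depth_year_2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```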


F1000Research, 2018, Vol 7, pp. 1176
Author(s):
Nicholas Graves, Adrian G. Barnett, Edward Burn, David Cook

Background: Clinical trials might be larger than needed because arbitrarily high levels of statistical confidence are sought in the results. Traditional sample size calculations ignore the marginal value of the information collected for decision making, so the statistical hypothesis testing objective is misaligned with the goal of generating the information necessary for decision-making. The aim of the present study was to show that a clinical trial designed to test a prior hypothesis against an arbitrary threshold of confidence may recruit too many participants, wasting scarce research dollars and exposing participants to research unnecessarily.
Methods: We used data from a recent randomised controlled trial (RCT) powered for traditional rules of statistical significance. The same data were used for an economic analysis, which showed that the intervention led to cost savings and improved health outcomes, so adoption represented a good investment for decision-makers. We examined the effect of reducing the trial's sample size on the results of the statistical hypothesis-testing analysis and on the conclusions that would be drawn by decision-makers reading the economic analysis.
Results: As the sample size was reduced, it became more likely that the null hypothesis of no difference in the primary outcome between groups would fail to be rejected. For decision-makers reading the economic analysis, reducing the sample size had little effect on the conclusion about whether to adopt the intervention: there was always a high probability that the intervention reduced costs and improved health.
Conclusions: Decisions by those managing health services are largely insensitive to the sample size of the primary trial and to the arbitrary p-value threshold of 0.05. If the goal is to make a good decision about whether the intervention should be adopted widely, that could have been achieved with a much smaller trial. It is plausible that hundreds of millions of research dollars are wasted each year recruiting more participants than required for RCTs.
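For context on what a "traditional" sample size calculation of the kind the authors critique looks like, the sketch below solves for the per-arm sample size of a two-sample t-test at the conventional significance level of 0.05 and 80% power. The effect size is hypothetical and unrelated to the trial analysed in the paper.

```python
# Illustration of a conventional power-based sample-size calculation: the
# required n per arm is driven entirely by the chosen alpha and power, not by
# the value of the information for decision-making. Effect size is hypothetical.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_arm = analysis.solve_power(effect_size=0.3,   # hypothetical standardised effect
                                 alpha=0.05,        # conventional significance level
                                 power=0.80)        # conventional power target
print(f"Participants needed per arm: {n_per_arm:.0f}")
```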


2020
Author(s):
Herty Liany, Anand Jeyasekharan, Vaibhav Rajan

A Synthetic Lethal (SL) interaction is a functional relationship between two genes or functional entities where the loss of either entity is viable but the loss of both is lethal. Such pairs can be used to develop targeted anticancer therapies with fewer side effects and reduced overtreatment. However, finding clinically actionable SL interactions remains challenging. Leveraging large-scale unified gene expression data from both disease-free and cancerous tissue, we design a new technique, based on statistical hypothesis testing, called ASTER (Analysis of Synthetic lethality by comparison with Tissue-specific disease-free gEnomic and tRanscriptomic data) to identify SL pairs. For large-scale multiple hypothesis testing, we develop an extension, ASTER++, that can utilize additional input gene features within the hypothesis testing framework. Our extensive experiments demonstrate the efficacy of ASTER in accurately identifying SL pairs that are therapeutically actionable in stomach and breast cancers.
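The abstract does not describe ASTER's actual test statistic, so the toy sketch below is only a generic illustration of a hypothesis-testing screen over many gene pairs with multiple-testing control: it asks, for each simulated pair, whether joint low expression is depleted in tumour relative to disease-free samples (Fisher's exact test), then applies Benjamini-Hochberg FDR across pairs. It is not the ASTER procedure, and all gene names and expression values are simulated.

```python
# Generic toy SL-style screen, NOT the ASTER statistic: per gene pair, test
# whether "both genes low" is depleted in tumours vs. disease-free samples,
# then control the FDR across all pairs. Everything below is simulated.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
genes = [f"G{i}" for i in range(20)]     # hypothetical gene labels
tumour = rng.normal(size=(100, 20))      # simulated tumour expression matrix
normal = rng.normal(size=(100, 20))      # simulated disease-free expression matrix

def both_low(expr, i, j, cutoff=-0.5):
    """Number of samples in which both genes fall below the expression cutoff."""
    return int(np.sum((expr[:, i] < cutoff) & (expr[:, j] < cutoff)))

pairs, pvals = [], []
for i in range(len(genes)):
    for j in range(i + 1, len(genes)):
        t_low = both_low(tumour, i, j)
        n_low = both_low(normal, i, j)
        table = [[t_low, tumour.shape[0] - t_low],
                 [n_low, normal.shape[0] - n_low]]
        _, p = stats.fisher_exact(table, alternative="less")  # depletion in tumours
        pairs.append((genes[i], genes[j]))
        pvals.append(p)

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"Candidate SL pairs at FDR 0.05: {int(reject.sum())} of {len(pairs)}")
```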

