Statistical hypothesis testing in wavelet analysis: theoretical developments and applications to Indian rainfall

2019 ◽  
Vol 26 (2) ◽  
pp. 91-108 ◽  
Author(s):  
Justin A. Schulte

Abstract. Statistical hypothesis tests in wavelet analysis are methods that assess the degree to which a wavelet quantity (e.g., power or coherence) exceeds background noise. Commonly, a point-wise approach is adopted in which the wavelet quantity at every point in a wavelet spectrum is individually compared to the critical level of the point-wise test. However, because adjacent wavelet coefficients are correlated and wavelet spectra often contain many wavelet quantities, the point-wise test can produce many false-positive results that occur in clusters or patches. To circumvent these drawbacks, it is necessary to implement the recently developed area-wise, geometric, cumulative area-wise, and topological significance tests, which are reviewed and developed in this paper. To improve the computational efficiency of the cumulative area-wise test, a simplified version of the testing procedure is created, based on the idea that its output is the mean of individual estimates of statistical significance calculated from the geometric test applied at a set of point-wise significance levels. Ideal examples are used to show that the geometric and cumulative area-wise tests are unable to differentiate wavelet spectral features arising from singularity-like structures from those associated with periodicities. A cumulative arc-wise test is therefore developed to test strictly for periodicities by using normalized arclength, which is defined as the number of points composing a cross section of a patch divided by the wavelet scale in question. A previously proposed topological significance test is formalized using persistent homology profiles (PHPs), which measure the number of patches and holes corresponding to the set of all point-wise significance values. Ideal examples show that the PHPs can be used to distinguish time series containing signal components from those that are purely noise.
To demonstrate the practical uses of the existing and newly developed statistical methodologies, a first comprehensive wavelet analysis of Indian rainfall is also provided. An R software package has been written by the author to implement the various testing procedures.
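The simplified cumulative area-wise procedure lends itself to a compact sketch. The toy below is illustrative only, not the author's R package: the function names, the white-patch labeling via `scipy.ndimage`, and the user-supplied null distribution of patch areas are all assumptions. It averages geometric-test significance fields computed at a set of point-wise significance levels, which is the interpretation of the cumulative area-wise test described above.

```python
import numpy as np
from scipy import ndimage

def geometric_pvalues(sig_map, null_areas):
    """Label contiguous patches of point-wise significance and estimate a
    geometric p-value for each patch as the fraction of null patch areas
    at least as large as the patch's own area."""
    labels, n = ndimage.label(sig_map)
    areas = ndimage.sum(sig_map.astype(float), labels, index=range(1, n + 1))
    null_sorted = np.sort(np.asarray(null_areas, dtype=float))
    pvals = 1.0 - np.searchsorted(null_sorted, areas, side="left") / len(null_sorted)
    return labels, pvals

def cumulative_areawise_estimate(pointwise_pvals, levels, null_area_sets):
    """Simplified cumulative area-wise estimate: the mean, over a set of
    point-wise significance levels, of the geometric significance field
    (1 - geometric p-value inside each patch, 0 elsewhere)."""
    fields = []
    for level, null_areas in zip(levels, null_area_sets):
        sig_map = pointwise_pvals <= level
        labels, pvals = geometric_pvalues(sig_map, null_areas)
        field = np.zeros_like(pointwise_pvals)
        for k, p in enumerate(pvals, start=1):
            field[labels == k] = 1.0 - p
        fields.append(field)
    return np.mean(fields, axis=0)

# Toy demonstration: one strongly significant 5 x 5 patch in a 20 x 20
# spectrum, tested against a made-up null distribution of patch areas
pv = np.ones((20, 20))
pv[5:10, 5:10] = 0.001
null_areas = np.array([1.0, 1.0, 2.0, 2.0, 3.0])
est = cumulative_areawise_estimate(pv, [0.05, 0.1], [null_areas, null_areas])
```

Inside the large patch the averaged significance is high at every point-wise level, while isolated noise-like points would form small patches whose areas are unremarkable under the null distribution.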

2018 ◽  
Author(s):  
Justin A. Schulte

Abstract. Statistical hypothesis tests in wavelet analysis are reviewed and developed. The output of a recently developed cumulative area-wise test is shown to be the ensemble mean of individual estimates of statistical significance calculated from a geometric test that assesses statistical significance based on the area of contiguous regions (i.e., patches) of point-wise significance. This new interpretation is then used to construct a simplified version of the cumulative area-wise test to improve computational efficiency. Ideal examples are used to show that the geometric and cumulative area-wise tests are unable to differentiate features arising from singularity-like structures from those associated with periodicities. A cumulative arc-wise test is therefore developed to test for periodicities in a strict sense. A previously proposed topological significance test is formalized using persistent homology profiles (PHPs) measuring the number of patches and holes corresponding to the set of all point-wise significance values. Ideal examples show that the PHPs can be used to distinguish time series containing signal components from those that are purely noise. To demonstrate the practical uses of the existing and newly developed statistical methodologies, a first comprehensive wavelet analysis of Indian rainfall is also provided. An R software package has been written by the author to implement the various testing procedures.


Author(s):  
CHEONG HEE PARK ◽  
HONGSUK SHIM

Most traditional classifiers implicitly assume that every data sample belongs to one of several predefined classes. However, not all data patterns may be known at the time of data collection, and new patterns can emerge over time. In this paper, a new method is presented for monitoring changes in the class distribution and detecting an emerging class. First, a statistical significance test is designed that can signal a change in the class distribution. When an alarm for new-class generation is raised, retrieval of new-class members is performed using density estimation and entropy-based thresholding. Our experimental results demonstrate the competitive performance of the proposed method.
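The abstract does not specify the authors' significance test, but the general idea of signaling a shift in class distribution can be illustrated generically. The sketch below uses a chi-square goodness-of-fit test on hypothetical class counts; the proportions, window, and alarm threshold are all invented for illustration and are not necessarily the paper's procedure.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical class proportions observed during training
train_props = np.array([0.5, 0.3, 0.2])
# Class counts assigned by the classifier over a recent window of samples
window_counts = np.array([35, 30, 35])

# Expected counts if the class distribution were unchanged
expected = train_props * window_counts.sum()
stat, p = chisquare(window_counts, f_exp=expected)

# A small p-value signals a distribution change, e.g. an emerging class
alarm = p < 0.01
```

Here the third class is heavily over-represented relative to training, so the test raises the alarm; a subsequent step (density estimation and entropy-based thresholding, in the paper's method) would then retrieve the new class members.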


Author(s):  
Sach Mukherjee

A number of important problems in data mining can be usefully addressed within the framework of statistical hypothesis testing. However, while the conventional treatment of statistical significance deals with error probabilities at the level of a single variable, practical data mining tasks tend to involve thousands, if not millions, of variables. This chapter looks at some of the issues that arise in the application of hypothesis tests to multi-variable data mining problems, and describes two computationally efficient procedures by which these issues can be addressed.
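The chapter's two procedures are not detailed in this summary. As background for the multiple-testing issue it raises, a standard correction such as the Benjamini-Hochberg step-up procedure can be sketched as follows; this is a generic illustration, not the chapter's specific methods.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: returns a boolean mask of
    rejected hypotheses, controlling the false discovery rate at alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Compare the i-th smallest p-value to alpha * i / m
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))  # largest rank passing its threshold
        reject[order[:k + 1]] = True
    return reject

# Four variables' p-values: three small, one clearly null
rejected = benjamini_hochberg([0.01, 0.02, 0.03, 0.50], alpha=0.05)
```

Unlike a per-variable test at a fixed level, the step-up thresholds adapt to the number of variables, which is exactly the regime (thousands to millions of variables) the chapter is concerned with.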


2019 ◽  
Vol 81 (8) ◽  
pp. 535-542
Author(s):  
Robert A. Cooper

Statistical methods are indispensable to the practice of science. But statistical hypothesis testing can seem daunting, with P-values, null hypotheses, and the concept of statistical significance. This article explains the concepts associated with statistical hypothesis testing using the story of “the lady tasting tea,” then walks the reader through an application of the independent-samples t-test using data from Peter and Rosemary Grant's investigations of Darwin's finches. Understanding how scientists use statistics is an important component of scientific literacy, and students should have opportunities to use statistical methods like this in their science classes.
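The independent-samples t-test walked through in the article can be reproduced in a few lines of Python. The beak-depth numbers below are made up for illustration; they are not the Grants' data.

```python
from scipy import stats

# Hypothetical beak-depth measurements (mm) for two finch samples;
# illustrative values only, not the Grants' actual measurements
sample_a = [8.9, 9.2, 9.0, 8.7, 9.1, 8.8, 9.3, 9.0]
sample_b = [9.6, 9.9, 9.7, 10.0, 9.8, 9.5, 10.1, 9.7]

# Independent-samples t-test of the null hypothesis of equal means
t, p = stats.ttest_ind(sample_a, sample_b)
# A small p-value means a difference this large between sample means
# would be surprising if both populations had the same mean beak depth.
```

With these values the second sample's mean is clearly larger, so the test rejects the null hypothesis at any conventional significance level.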


Author(s):  
VICTOR K. Y. CHAN ◽  
W. ERIC WONG ◽  
T. F. XIE

Software metric models predict target software metric(s), e.g., the development work effort or defect rates, for a future software project based on the project's predictor software metric(s), e.g., the project team size. The construction of such a model makes use of a data sample of these metrics from analogous past projects; however, incomplete data often appear in such samples. Moreover, the decision on whether a particular predictor metric should be included is most likely based on an intuitive or experience-based assumption that the predictor has a statistically significant impact on the target metric. This assumption is usually not verifiable "retrospectively" after the model is constructed, leading to redundant predictor metric(s) and/or unnecessary predictor complexity. To address these problems, we derived a methodology consisting of the k-nearest neighbors (k-NN) imputation method, statistical hypothesis testing, and a "goodness-of-fit" criterion. The methodology was tested on software effort metric models and software quality metric models, the latter of which usually suffer from far more serious data incompleteness. This paper documents the methodology and the tests on these two types of software metric models.
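The k-NN imputation step can be sketched with scikit-learn's `KNNImputer`; the project-metric values below are invented for illustration and are not the paper's data or code.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical past-project sample: columns are
# [team size, size (KLOC), effort (person-days)]; np.nan marks a gap
X = np.array([[ 5.0, 10.0, 120.0],
              [ 6.0, 12.0, 140.0],
              [ 5.0, 11.0, np.nan],
              [20.0, 80.0, 900.0],
              [22.0, 85.0, 950.0]])

# Fill each gap with the mean value from the 2 nearest neighbours,
# where distance is computed over the features present in both rows
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

The project with the missing effort resembles the two small projects, so its effort is imputed as the mean of their efforts (130.0) rather than being pulled toward the large projects.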


2019 ◽  
Vol 35 (19) ◽  
pp. 3592-3598 ◽  
Author(s):  
Justin G Chitpin ◽  
Aseel Awdeh ◽  
Theodore J Perkins

Abstract.
Motivation: Chromatin Immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, in which genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, so the true significance or reliability of peak calls remains unknown.
Results: Using simulated and real ChIP-seq data, we show that three well-known peak callers, MACS, SICER and diffReps, output biased P-values and false discovery rate estimates that can be many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate a monotone transform correcting for the biases built into peak-calling algorithms. When applied to null-hypothesis data, where there is no enrichment between ChIP-seq and control, P-values recalibrated by RECAP are approximately uniformly distributed. On data with genuine enrichment, RECAP P-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls.
Availability and implementation: The RECAP software is available through www.perkinslab.ca or on GitHub at https://github.com/theodorejperkins/RECAP.
Supplementary information: Supplementary data are available at Bioinformatics online.
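The "using the data twice" trap is easy to reproduce in miniature. The simulation below is a self-contained toy with Poisson read counts, not RECAP or any real peak caller: it selects the most enriched-looking region and then tests that same region as if it had been chosen in advance. Under the null, such p-values fall below 0.05 far more often than the nominal 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_regions, n_trials = 200, 500
naive, prespecified = [], []
for _ in range(n_trials):
    chip = rng.poisson(10, n_regions)   # null data: ChIP and control counts
    ctrl = rng.poisson(10, n_regions)   # are drawn from the same distribution
    # Double use of the data: pick the most enriched-looking region...
    i = int(np.argmax(chip - ctrl))
    # ...then test that same region with a one-sided Poisson test
    naive.append(stats.poisson.sf(chip[i] - 1, ctrl[i]))
    # For contrast: a region fixed before looking at the data
    prespecified.append(stats.poisson.sf(chip[0] - 1, ctrl[0]))

naive_fpr = float(np.mean(np.array(naive) < 0.05))
honest_fpr = float(np.mean(np.array(prespecified) < 0.05))
# naive_fpr is wildly inflated relative to the nominal 5% level
```

This is the bias RECAP corrects by resampling: the selection step makes the naive p-values grossly anti-conservative even though no region is genuinely enriched.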


Author(s):  
Zeyu Xue ◽  
Paul Ullrich

Abstract. Climate models are frequently used tools for adaptation planning in light of future uncertainty. However, not all climate models are equally trustworthy, so model biases must be assessed to select models suitable for producing credible projections. Drought is a well-known and high-impact form of extreme weather, and knowledge of its frequency, intensity, and duration is key for regional water management plans. Droughts are also difficult to assess in climate datasets because of the long duration of each event relative to the length of a typical simulation. There is therefore a growing need for a standardized suite of metrics addressing how well models capture this phenomenon. In this study, we present a widely applicable set of metrics for evaluating agreement between climate datasets and observations in the context of drought. Two notable advances are made in our evaluation system: first, statistical hypothesis testing is employed to normalize individual scores against the threshold for statistical significance; second, within each evaluation region and dataset, principal feature analysis is used to select the most descriptive metrics among 11 metrics that capture essential features of drought. Our metrics package is applied to three characteristically distinct regions in the conterminous US and across several commonly employed climate datasets (CMIP5/6, LOCA and CORDEX). As a result, insights emerge into the underlying drivers of model bias in global climate models, regional climate models, and statistically downscaled models.
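The first advance, normalizing a score against its significance threshold, can be illustrated for a correlation-based metric. This is a hedged sketch under assumed details: the function name, the choice of Pearson correlation, and the two-sided 5% level are illustrative, and the paper's 11 metrics and exact normalization are not reproduced here.

```python
import numpy as np
from scipy import stats

def normalized_correlation_score(r, n, alpha=0.05):
    """Normalize a Pearson correlation r computed from n samples by the
    critical correlation at level alpha (two-sided t test), so that a
    score of 1 marks the threshold of statistical significance."""
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2)
    # Invert t = r * sqrt(n - 2) / sqrt(1 - r^2) for the critical r
    r_crit = t_crit / np.sqrt(n - 2 + t_crit**2)
    return abs(r) / r_crit

# A correlation right at the 5% critical value for n = 30 scores about 1;
# weaker correlations score below 1 regardless of their raw magnitude
score_at_threshold = normalized_correlation_score(0.361, n=30)
weak_score = normalized_correlation_score(0.15, n=30)
```

Expressing every metric on this common "multiples of the significance threshold" scale is what makes scores comparable across metrics with different units and sampling distributions.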

