DETECTION OF AN EMERGING NEW CLASS USING STATISTICAL HYPOTHESIS TESTING AND DENSITY ESTIMATION

Author(s):  
CHEONG HEE PARK ◽  
HONGSUK SHIM

Most traditional classifiers implicitly assume that every data sample belongs to one of several predefined classes. However, not all data patterns may be known at the time of data collection, and new patterns can emerge over time. In this paper, a new method is presented for monitoring changes in the class distribution and detecting an emerging class. First, a statistical significance test is designed that can signal a change in the class distribution. When an alarm for new class generation is raised, the members of the new class are retrieved using density estimation and entropy-based thresholding. Our experimental results demonstrate the competitive performance of the proposed method.
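
As an illustration of the two stages described above, the following is a minimal sketch, not the authors' exact algorithm: a chi-square goodness-of-fit test stands in for their significance test on the class distribution, and a simple low-density quantile cutoff stands in for their entropy-based threshold. All function names and parameters are hypothetical.

```python
# Minimal sketch of the two stages: change detection on the class
# distribution, then density-based retrieval of candidate new-class members.
# The chi-square test and the quantile cutoff are assumptions, not the
# authors' exact significance test or entropy-based threshold.
import numpy as np
from scipy.stats import chisquare
from sklearn.neighbors import KernelDensity

def distribution_changed(train_labels, predicted_labels, alpha=0.01):
    """Signal a change when predicted class proportions deviate from training."""
    classes, train_counts = np.unique(train_labels, return_counts=True)
    new_counts = np.array([(predicted_labels == c).sum() for c in classes])
    expected = train_counts / train_counts.sum() * new_counts.sum()
    _, p_value = chisquare(new_counts, expected)
    return p_value < alpha

def candidate_new_class_members(train_X, new_X, density_quantile=0.05):
    """Return new samples that fall in low-density regions of the training data."""
    kde = KernelDensity(bandwidth=1.0).fit(train_X)
    threshold = np.quantile(kde.score_samples(train_X), density_quantile)
    return new_X[kde.score_samples(new_X) < threshold]
```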

2019 ◽  
Vol 26 (2) ◽  
pp. 91-108 ◽  
Author(s):  
Justin A. Schulte

Abstract. Statistical hypothesis tests in wavelet analysis are methods that assess the degree to which a wavelet quantity (e.g., power and coherence) exceeds background noise. Commonly, a point-wise approach is adopted in which a wavelet quantity at every point in a wavelet spectrum is individually compared to the critical level of the point-wise test. However, because adjacent wavelet coefficients are correlated and wavelet spectra often contain many wavelet quantities, the point-wise test can produce many false positive results that occur in clusters or patches. To circumvent the point-wise test drawbacks, it is necessary to implement the recently developed area-wise, geometric, cumulative area-wise, and topological significance tests, which are reviewed and developed in this paper. To improve the computational efficiency of the cumulative area-wise test, a simplified version of the testing procedure is created based on the idea that its output is the mean of individual estimates of statistical significance calculated from the geometric test applied at a set of point-wise significance levels. Ideal examples are used to show that the geometric and cumulative area-wise tests are unable to differentiate wavelet spectral features arising from singularity-like structures from those associated with periodicities. A cumulative arc-wise test is therefore developed to strictly test for periodicities by using normalized arclength, which is defined as the number of points composing a cross section of a patch divided by the wavelet scale in question. A previously proposed topological significance test is formalized using persistent homology profiles (PHPs) measuring the number of patches and holes corresponding to the set of all point-wise significance values. Ideal examples show that the PHPs can be used to distinguish time series containing signal components from those that are purely noise. To demonstrate the practical uses of the existing and newly developed statistical methodologies, a first comprehensive wavelet analysis of Indian rainfall is also provided. An R software package has been written by the author to implement the various testing procedures.
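
As a concrete reference point for the point-wise test that the area-wise, geometric, and topological tests build on, the following is a minimal sketch of the widely used formulation in which wavelet power is compared against a theoretical red-noise (AR(1)) background scaled by a chi-square critical value. The wavelet power spectrum is assumed to be precomputed (e.g., with a Morlet continuous wavelet transform), and the Fourier-period factor of 1.03 is an approximation for the standard Morlet wavelet.

```python
# Sketch of a point-wise significance test for wavelet power against a
# red-noise (AR(1)) background; the power spectrum itself is assumed given.
import numpy as np
from scipy.stats import chi2

def pointwise_significance(power, scales, series, dt=1.0, level=0.95, dof=2):
    """Boolean mask of (scale, time) points whose wavelet power exceeds the
    red-noise background at the chosen point-wise significance level."""
    variance = np.var(series, ddof=1)
    # Lag-1 autocorrelation defines the AR(1) null model.
    alpha = np.corrcoef(series[:-1], series[1:])[0, 1]
    freqs = 1.0 / (1.03 * scales)  # approximate Fourier frequency for a Morlet wavelet
    # Normalized theoretical red-noise spectrum at each scale.
    background = (1 - alpha**2) / (1 + alpha**2 - 2 * alpha * np.cos(2 * np.pi * freqs * dt))
    critical = variance * background * chi2.ppf(level, dof) / dof
    return power > critical[:, None]  # broadcast across the time axis
```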


2018 ◽  
Author(s):  
Justin A. Schulte

Abstract. Statistical hypothesis tests in wavelet analysis are reviewed and developed. The output of a recently developed cumulative area-wise test is shown to be the ensemble mean of individual estimates of statistical significance calculated from a geometric test that assesses statistical significance based on the area of contiguous regions (i.e., patches) of point-wise significance. This new interpretation is then used to construct a simplified version of the cumulative area-wise test to improve computational efficiency. Ideal examples are used to show that the geometric and cumulative area-wise tests are unable to differentiate features arising from singularity-like structures from those associated with periodicities. A cumulative arc-wise test is therefore developed to test for periodicities in a strict sense. A previously proposed topological significance test is formalized using persistent homology profiles (PHPs), which measure the number of patches and holes corresponding to the set of all point-wise significance values. Ideal examples show that the PHPs can be used to distinguish time series containing signal components from those that are purely noise. To demonstrate the practical uses of the existing and newly developed statistical methodologies, a first comprehensive wavelet analysis of Indian rainfall is also provided. An R software package has been written by the author to implement the various testing procedures.
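
To make the geometric (area-based) step concrete, the following is a hedged sketch that tests contiguous patches of point-wise significance by their scale-normalized area against a Monte Carlo null distribution of patch areas; the normalization and the construction of the null areas are simplifications, not the paper's exact procedure.

```python
# Sketch of a geometric (area-based) test: label contiguous patches of
# point-wise significance and keep those whose scale-normalized area exceeds
# a quantile of a null distribution of patch areas (assumed to be supplied
# from noise realizations).
import numpy as np
from scipy.ndimage import label

def patch_areas(pointwise_mask, scales):
    """Scale-normalized areas of contiguous patches of point-wise significance."""
    labelled, n_patches = label(pointwise_mask)
    areas = []
    for k in range(1, n_patches + 1):
        rows, _ = np.nonzero(labelled == k)
        # Dividing by scale keeps large-scale patches from being favoured.
        areas.append(np.sum(1.0 / scales[rows]))
    return np.array(areas)

def geometric_test(pointwise_mask, scales, null_areas, level=0.95):
    """Retain only patches whose normalized area exceeds the null quantile."""
    threshold = np.quantile(null_areas, level)
    labelled, _ = label(pointwise_mask)
    significant = np.zeros_like(pointwise_mask, dtype=bool)
    for k, area in enumerate(patch_areas(pointwise_mask, scales), start=1):
        if area > threshold:
            significant |= labelled == k
    return significant
```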


Author(s):  
Sach Mukherjee

A number of important problems in data mining can be usefully addressed within the framework of statistical hypothesis testing. However, while the conventional treatment of statistical significance deals with error probabilities at the level of a single variable, practical data mining tasks tend to involve thousands, if not millions, of variables. This chapter looks at some of the issues that arise in the application of hypothesis tests to multi-variable data mining problems, and describes two computationally efficient procedures by which these issues can be addressed.
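
The chapter's two procedures are not reproduced here; as an illustration of the underlying multiple-testing problem, the sketch below shows the standard Benjamini-Hochberg step-up procedure, one common way to control the false discovery rate when thousands of variables are tested at once.

```python
# Illustrative only: this is the standard Benjamini-Hochberg procedure,
# not the chapter's own two methods.
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k / m) * q, then reject hypotheses 1..k.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        rejected[order[: k + 1]] = True
    return rejected
```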


2019 ◽  
Vol 81 (8) ◽  
pp. 535-542
Author(s):  
Robert A. Cooper

Statistical methods are indispensable to the practice of science. But statistical hypothesis testing can seem daunting, with P-values, null hypotheses, and the concept of statistical significance. This article explains the concepts associated with statistical hypothesis testing using the story of “the lady tasting tea,” then walks the reader through an application of the independent-samples t-test using data from Peter and Rosemary Grant's investigations of Darwin's finches. Understanding how scientists use statistics is an important component of scientific literacy, and students should have opportunities to use statistical methods like these in their science classes.
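
A minimal example of the independent-samples t-test described in the article is sketched below; the beak-depth values are hypothetical and are not the Grants' actual measurements.

```python
# Hypothetical beak-depth values (mm) for two finch samples; the numbers are
# illustrative only, not data from the Grants' study.
from scipy.stats import ttest_ind

before_drought = [8.6, 9.1, 8.9, 9.4, 8.8, 9.0, 9.2, 8.7]
after_drought = [9.5, 9.8, 9.3, 9.9, 9.7, 9.6, 10.0, 9.4]

t_statistic, p_value = ttest_ind(before_drought, after_drought)
print(f"t = {t_statistic:.2f}, p = {p_value:.4f}")
# A p-value below the chosen significance level (commonly 0.05) would lead us
# to reject the null hypothesis that the two samples share the same mean.
```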


Author(s):  
VICTOR K. Y. CHAN ◽  
W. ERIC WONG ◽  
T. F. XIE

Software metric models predict the target software metric(s), e.g., the development work effort or defect rates, for any future software project based on the project's predictor software metric(s), e.g., the project team size. Obviously, the construction of such a software metric model makes use of a data sample of such metrics from analogous past projects. However, such data samples are often incomplete. Moreover, the decision on whether a particular predictor metric should be included is usually based on an intuitive or experience-based assumption that the predictor metric has a statistically significant impact on the target metric. This assumption, however, is usually not verified "retrospectively" after the model is constructed, leading to redundant predictor metric(s) and/or unnecessary predictor metric complexity. To address these problems, we derived a methodology consisting of the k-nearest neighbors (k-NN) imputation method, statistical hypothesis testing, and a "goodness-of-fit" criterion. This methodology was tested on software effort metric models and software quality metric models, the latter of which usually suffer from far more serious data incompleteness. This paper documents this methodology and the tests on these two types of software metric models.
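
A hedged sketch of such a pipeline is given below: scikit-learn's KNNImputer stands in for the k-NN imputation step, ordinary-least-squares coefficient p-values stand in for the retrospective check of predictor significance, and R-squared stands in for the goodness-of-fit criterion; none of these choices is claimed to match the paper's exact methodology.

```python
# Sketch, not the paper's exact pipeline: KNNImputer for the k-NN imputation,
# OLS coefficient p-values for the retrospective significance check, and
# R-squared as a simple goodness-of-fit stand-in.
import numpy as np
import statsmodels.api as sm
from sklearn.impute import KNNImputer

def fit_metric_model(X_with_missing, y, k=5, alpha=0.05):
    X = KNNImputer(n_neighbors=k).fit_transform(X_with_missing)
    model = sm.OLS(y, sm.add_constant(X)).fit()
    # Predictors whose coefficients are not statistically significant are
    # candidates for removal from the metric model.
    insignificant = np.nonzero(model.pvalues[1:] > alpha)[0]
    return model, model.rsquared, insignificant
```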


1984 ◽  
Vol 9 (1) ◽  
pp. 139-186 ◽  
Author(s):  
Paul Meier ◽  
Jerome Sacks ◽  
Sandy L. Zabell

Tests of statistical significance have increasingly been used in employment discrimination cases since the Supreme Court's decision in Hazelwood. In that case, the United States Supreme Court ruled that “in a proper case” statistical evidence can suffice for a prima facie showing of employment discrimination. The Court also discussed the use of a binomial significance test to assess whether the difference between the proportion of black teachers employed by the Hazelwood School District and the proportion of black teachers in the relevant labor market was substantial enough to indicate discrimination. The Equal Employment Opportunity Commission has proposed a somewhat stricter standard for evaluating how substantial a difference must be to constitute evidence of discrimination. Under the so-called 80% rule promulgated by the EEOC, the difference must not only be statistically significant, but the hire rate for the allegedly discriminated group must also be less than 80% of the rate for the favored group. This article argues that a binomial statistical significance test standing alone is unsatisfactory for evaluating allegations of discrimination because many of the assumptions on which such tests are based are inapplicable to employment settings; the 80% rule is a more appropriate standard for evaluating whether a difference in hire rates should be treated as a prima facie showing of discrimination.
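
The two standards can be illustrated side by side with hypothetical numbers (none of these figures are from the Hazelwood record): a one-sided binomial test of the number of minority hires against the labor-market proportion, and the EEOC 80% rule comparing the two groups' selection rates directly.

```python
# Hypothetical figures for illustration only.
from scipy.stats import binomtest

# Binomial test: 15 minority hires out of 405, against a labor market that is
# assumed to be 5.7% minority.
result = binomtest(k=15, n=405, p=0.057, alternative="less")
print(f"binomial p-value: {result.pvalue:.4f}")

# EEOC 80% rule: compare the two groups' selection rates directly.
protected_rate = 30 / 200   # e.g., 30 of 200 protected-group applicants hired
favored_rate = 90 / 400     # e.g., 90 of 400 favored-group applicants hired
ratio = protected_rate / favored_rate
print(f"selection-rate ratio: {ratio:.2f}, 80% rule violated: {ratio < 0.80}")
```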

