Mean Shift versus Variance Inflation Approach for Outlier Detection—A Comparative Study

Mathematics ◽  
2020 ◽  
Vol 8 (6) ◽  
pp. 991 ◽  
Author(s):  
Rüdiger Lehmann ◽  
Michael Lösler ◽  
Frank Neitzel

Outlier detection is one of the most important tasks in the analysis of measured quantities to ensure reliable results. In recent years, a variety of multi-sensor platforms has become available that allow the autonomous and continuous acquisition of large quantities of heterogeneous observations. Because the probability that such data sets contain outliers increases with the number of measured values, powerful methods are required to identify contaminated observations. In geodesy, the mean shift (MS) model is one of the most commonly used approaches for outlier detection. An alternative to the MS model is the variance inflation (VI) model. In this investigation, the VI approach is derived in detail by truly maximizing the likelihood functions, and it is examined for the detection of one or multiple outliers. In general, the VI approach is non-linear even if the null model is linear, so an analytical solution usually does not exist, except in the case of repeated measurements. The test statistic is derived from the likelihood ratio (LR) of the models. The VI approach is compared with the MS model in terms of statistical power, identifiability of actual outliers, and numerical effort. The main purpose of this paper is to examine the performance of both approaches in order to derive recommendations for their practical application in outlier detection.
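
As a minimal numeric sketch of the likelihood-ratio comparison described above, consider the analytically tractable case of repeated measurements with known unit variance. The function names and the numeric profiling of the inflation factor are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def loglik(y, mu, var):
    """Gaussian log-likelihood with per-observation variances."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

def lr_mean_shift(y, j):
    """LR statistic for a mean shift in observation j: the shift estimate
    absorbs y[j] exactly, so only the remaining residuals contribute."""
    n = len(y)
    ll0 = loglik(y, y.mean(), np.ones(n))
    rest = np.delete(y, j)
    ll1 = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((rest - rest.mean()) ** 2)
    return 2.0 * (ll1 - ll0)

def lr_variance_inflation(y, j):
    """LR statistic for variance inflation of observation j; the profile
    likelihood over the inflation factor s >= 1 is maximized numerically."""
    n = len(y)

    def neg_profile_ll(s):
        var = np.ones(n)
        var[j] = s
        mu = np.sum(y / var) / np.sum(1.0 / var)   # weighted mean = ML estimate
        return -loglik(y, mu, var)

    res = minimize_scalar(neg_profile_ll, bounds=(1.0, 1e4), method="bounded")
    ll0 = loglik(y, y.mean(), np.ones(n))
    return 2.0 * (-res.fun - ll0)

rng = np.random.default_rng(0)
y = rng.normal(10.0, 1.0, 20)
y[7] += 4.0                                        # plant a single outlier
print(lr_mean_shift(y, 7), lr_variance_inflation(y, 7))
```

Both statistics compare the same null model against their respective alternative, so their magnitudes can be contrasted directly for the suspected observation.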

Genetics ◽  
2003 ◽  
Vol 164 (1) ◽  
pp. 381-387
Author(s):  
B Law ◽  
J S Buckleton ◽  
C M Triggs ◽  
B S Weir

Abstract The probability of multilocus genotype counts conditional on allelic counts and on allelic independence provides a test statistic for independence within and between loci. As the number of loci increases and each sampled genotype becomes unique, the conditional probability becomes a function of total heterozygosity. In that case, it does not address between-locus dependence directly but only indirectly through detection of the Wahlund effect. Moreover, the test will reject the hypothesis of allelic independence only for small values of heterozygosity. Low heterozygosity is expected for population subdivision but not for population admixture. The test may therefore be inappropriate for admixed populations. If individuals with parents in two different populations are always considered to belong to one of the populations, then heterozygosity is increased in that population and the exact test should not be used for sparse data sets from that population. If such a case is suspected, then alternative testing strategies are suggested.
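
The exact conditional test itself is combinatorial, but its behaviour in the many-locus regime can be illustrated with a simple permutation sketch in which alleles are shuffled within each locus and unusually low total heterozygosity (the Wahlund effect) is flagged. All names and the resampling scheme here are illustrative, not the paper's test:

```python
import numpy as np

def total_heterozygosity(genos):
    """Mean fraction of heterozygous single-locus genotypes.
    genos: (n_individuals, n_loci, 2) array of allele labels."""
    return np.mean(genos[..., 0] != genos[..., 1])

def heterozygosity_permutation_test(genos, n_perm=2000, rng=None):
    """One-sided permutation test for unusually LOW heterozygosity
    (the Wahlund effect), shuffling alleles within each locus."""
    rng = np.random.default_rng(rng)
    obs = total_heterozygosity(genos)
    n, n_loci, _ = genos.shape
    hits = 0
    for _ in range(n_perm):
        perm = genos.copy()
        for l in range(n_loci):
            alleles = perm[:, l, :].ravel()     # this locus's 2n alleles
            rng.shuffle(alleles)
            perm[:, l, :] = alleles.reshape(n, 2)
        if total_heterozygosity(perm) <= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)            # permutation p value

rng = np.random.default_rng(0)
genos = rng.integers(0, 4, size=(50, 10, 2))    # 50 individuals, 10 loci
print(heterozygosity_permutation_test(genos, rng=1))
```

Because the rejection region sits in the low-heterozygosity tail, this sketch shares the weakness noted in the abstract: admixture, which inflates heterozygosity, goes undetected.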


Genetics ◽  
1997 ◽  
Vol 147 (4) ◽  
pp. 1855-1861 ◽  
Author(s):  
Montgomery Slatkin ◽  
Bruce Rannala

Abstract A theory is developed that provides the sampling distribution of low frequency alleles at a single locus under the assumption that each allele is the result of a unique mutation. The number of copies of each allele is assumed to follow a linear birth-death process with sampling. If the population is of constant size, standard results from the theory of birth-death processes show that the distribution of numbers of copies of each allele is logarithmic and that the joint distribution of numbers of copies of k alleles found in a sample of size n follows the Ewens sampling distribution. If the population from which the sample was obtained was increasing in size, if there are different selective classes of alleles, or if there are differences in penetrance among alleles, the Ewens distribution no longer applies. Likelihood functions for a given set of observations are obtained under different alternative hypotheses. These results are applied to published data from the BRCA1 locus (associated with early onset breast cancer) and the factor VIII locus (associated with hemophilia A) in humans. In both cases, the sampling distribution of alleles allows rejection of the null hypothesis, but relatively small deviations from the null model can account for the data. In particular, roughly the same population growth rate appears consistent with both data sets.
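
For the constant-size case, the null likelihood is given by the Ewens sampling formula. A short sketch of its log-probability (with illustrative names) is:

```python
import math

def ewens_log_prob(counts, theta):
    """Log-probability under the Ewens sampling formula of a sample whose
    alleles have the given copy numbers, e.g. counts=[5, 2, 1] for n=8."""
    n = sum(counts)
    a = {}                                      # a[j]: number of alleles with j copies
    for c in counts:
        a[c] = a.get(c, 0) + 1
    log_p = math.lgamma(n + 1)                  # log n!
    log_p -= sum(math.log(theta + i) for i in range(n))   # log rising factorial
    for j, a_j in a.items():
        log_p += a_j * (math.log(theta) - math.log(j)) - math.lgamma(a_j + 1)
    return log_p

print(ewens_log_prob([5, 2, 1], theta=1.0))
```

Evaluating this at the observed allele configuration gives the null term of a likelihood ratio against the growth, selection, or penetrance alternatives the paper considers.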


2021 ◽  
Vol 5 (1) ◽  
pp. 10
Author(s):  
Mark Levene

A bootstrap-based hypothesis test of the goodness-of-fit for the marginal distribution of a time series is presented. Two metrics, the empirical survival Jensen–Shannon divergence (ESJS) and the Kolmogorov–Smirnov two-sample test statistic (KS2), are compared on four data sets—three stablecoin time series and a Bitcoin time series. We demonstrate that, after applying first-order differencing, all the data sets fit heavy-tailed α-stable distributions with 1<α<2 at the 95% confidence level. Moreover, ESJS is more powerful than KS2 on these data sets, since the widths of the derived confidence intervals for KS2 are, proportionately, much larger than those of ESJS.
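
A discretized sketch of the two statistics and the percentile-bootstrap confidence interval follows. The exact definition of the ESJS and the resampling scheme in the paper may differ, so treat this as an assumption-laden illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

def esf(sample, grid):
    """Empirical survival function of `sample` evaluated on `grid`."""
    sample = np.sort(sample)
    return 1.0 - np.searchsorted(sample, grid, side="right") / len(sample)

def esjs(x, y):
    """Jensen-Shannon-style divergence between empirical survival
    functions on the pooled grid (a discretized sketch of the ESJS)."""
    grid = np.union1d(x, y)
    s1, s2 = esf(x, grid), esf(y, grid)
    m = 0.5 * (s1 + s2)

    def kl(s):
        mask = s > 0                            # m >= s/2 > 0 on the mask
        return np.sum(s[mask] * np.log(s[mask] / m[mask]))

    return 0.5 * kl(s1) + 0.5 * kl(s2)

def bootstrap_ci(x, y, stat, n_boot=999, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for stat(resample(x), y)."""
    rng = np.random.default_rng(rng)
    vals = [stat(rng.choice(x, size=len(x), replace=True), y)
            for _ in range(n_boot)]
    return np.quantile(vals, [alpha / 2, 1 - alpha / 2])

ks2 = lambda a, b: ks_2samp(a, b).statistic
rng = np.random.default_rng(0)
x, y = rng.standard_normal(500), rng.standard_normal(500)
print(bootstrap_ci(x, y, esjs), bootstrap_ci(x, y, ks2))
```

Comparing the proportional widths of the two intervals mirrors the power comparison made in the abstract.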


2021 ◽  
Vol 115 ◽  
pp. 107874
Author(s):  
Jiawei Yang ◽  
Susanto Rahardja ◽  
Pasi Fränti

2006 ◽  
Vol 12 (2-3) ◽  
pp. 203-228 ◽  
Author(s):  
Matthew Eric Otey ◽  
Amol Ghoting ◽  
Srinivasan Parthasarathy

2006 ◽  
Vol 45 (9) ◽  
pp. 1181-1189 ◽  
Author(s):  
D. S. Wilks

Abstract The conventional approach to evaluating the joint statistical significance of multiple hypothesis tests (i.e., “field,” or “global,” significance) in meteorology and climatology is to count the number of individual (or “local”) tests yielding nominally significant results and then to judge the unusualness of this integer value in the context of the distribution of such counts that would occur if all local null hypotheses were true. The sensitivity (i.e., statistical power) of this approach is potentially compromised both by the discrete nature of the test statistic and by the fact that the approach ignores the confidence with which locally significant tests reject their null hypotheses. An alternative global test statistic that has neither of these problems is the minimum p value among all of the local tests. Evaluation of field significance using the minimum local p value as the global test statistic, which is also known as the Walker test, has strong connections to the joint evaluation of multiple tests in a way that controls the “false discovery rate” (FDR, or the expected fraction of local null hypothesis rejections that are incorrect). In particular, using the minimum local p value to evaluate field significance at a level α_global is nearly equivalent to the slightly more powerful global test based on the FDR criterion. An additional advantage shared by Walker’s test and the FDR approach is that both are robust to spatial dependence within the field of tests. The FDR method not only provides a more broadly applicable and generally more powerful field significance test than the conventional counting procedure but also allows better identification of locations with significant differences, because fewer than α_global × 100% (on average) of apparently significant local tests will have resulted from local null hypotheses that are true.
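
A compact sketch of the two global tests compared above, in illustrative code rather than the author's own:

```python
import numpy as np

def walker_test(pvals, alpha_global=0.05):
    """Field significance via the minimum local p value (Walker's test):
    reject globally if min(p) <= 1 - (1 - alpha_global)**(1/K)."""
    K = len(pvals)
    return np.min(pvals) <= 1.0 - (1.0 - alpha_global) ** (1.0 / K)

def fdr_rejections(pvals, q=0.05):
    """Benjamini-Hochberg procedure: indices of local tests rejected
    while controlling the false discovery rate at level q."""
    pvals = np.asarray(pvals)
    K = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, K + 1) / K
    if not below.any():
        return np.array([], dtype=int)
    k_max = np.nonzero(below)[0].max()          # largest rank meeting the bound
    return order[: k_max + 1]

rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(0, 1, 95), rng.uniform(0, 0.001, 5)])
print(walker_test(p), fdr_rejections(p))
```

Note that the FDR procedure returns the set of locally rejected tests, which is exactly the "better identification of locations" advantage the abstract describes; the Walker test only answers the global question.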


2018 ◽  
Vol 64 ◽  
pp. 08006 ◽  
Author(s):  
Kummerow André ◽  
Nicolai Steffen ◽  
Bretschneider Peter

The scope of this survey is the uncovering of potential critical events from mixed PMU data sets. An unsupervised procedure is introduced that uses different outlier detection methods. To this end, different signal-analysis techniques are used to generate features in the time and frequency domains, together with linear and non-linear dimension-reduction techniques. This approach enables the exploration of critical grid dynamics in power systems without prior knowledge of existing failure patterns. Furthermore, new failure patterns can be extracted to create training data sets for online detection algorithms.
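
A minimal sketch of such a pipeline is given below; the signal, feature set, and parameters are invented for illustration, and PCA plus a local-density outlier detector stand in for the families of methods surveyed:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor

def window_features(signal, win=256, step=128):
    """Simple time- and frequency-domain features per sliding window."""
    feats = []
    for start in range(0, len(signal) - win + 1, step):
        w = signal[start:start + win]
        spec = np.abs(np.fft.rfft(w - w.mean()))
        feats.append([w.mean(), w.std(), np.ptp(w),
                      float(spec.argmax()),             # dominant frequency bin
                      spec.max() / (spec.sum() + 1e-12)])
    return np.asarray(feats)

# hypothetical PMU frequency record: nominal 50 Hz plus a planted transient
rng = np.random.default_rng(1)
f = 50.0 + 0.01 * rng.standard_normal(20_000)
f[12_000:12_300] += 0.2 * np.sin(np.linspace(0.0, 20.0, 300))

X = window_features(f)
Z = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
flags = LocalOutlierFactor(n_neighbors=20).fit_predict(Z)   # -1 = outlier
print("flagged windows:", np.where(flags == -1)[0])
```

Windows flagged as outliers can then be inspected and, as the abstract suggests, collected as labelled failure patterns for training online detectors.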


2019 ◽  
Vol 22 (03) ◽  
pp. 187-194 ◽  
Author(s):  
Johan Fellman

Abstract The seasonality of demographic data has been of great interest. It depends mainly on climatic conditions, and the findings may vary from study to study. Commonly, the studies are based on monthly data. The population at risk plays a central role. For births or deaths over short periods, the population at risk is proportional to the lengths of the months; hence, one must analyze the number of births (and deaths) per day. If one studies the seasonality of multiple maternities, the population at risk is the total monthly number of confinements, and the number of multiple maternities in a given month must be compared with the monthly number of all maternities. Consequently, when one considers the monthly rates of multiple maternities, the monthly number of births is eliminated and one obtains a seasonality measure of the rates that is unaffected by it. In general, comparisons between the seasonality of different data sets presuppose standardization of the data to indices with a common mean, usually 100. If one defines seasonality as ‘non-flatness’ throughout a year, a chi-squared test would be an option, but this test measures only heterogeneity: the same test statistic is obtained whether the extreme values occur in consecutive months or in separate months. Hence, chi-squared tests for seasonality are weak because of this arbitrariness and cannot be considered model tests. When seasonal models are applied, one must pay special attention to how well the model fits the data. If the goodness of fit is poor, nonsignificant models can erroneously lead to statements that the seasonality is slight, although the observed seasonal fluctuations are marked. In this study, we investigate how seasonal models can be applied to different demographic variables.
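
The chi-squared option mentioned above, with the population at risk handled through month lengths, reduces to a few lines. The counts are invented for illustration, and the final comment demonstrates exactly the weakness described:

```python
import numpy as np
from scipy.stats import chi2

# hypothetical monthly birth counts for one (non-leap) year
births = np.array([510, 468, 520, 495, 530, 512,
                   540, 548, 531, 522, 498, 505])
days = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])

expected = births.sum() * days / days.sum()    # population at risk ~ month length
stat = np.sum((births - expected) ** 2 / expected)
print(f"chi2 = {stat:.2f}, p = {chi2.sf(stat, df=11):.4f}")
# Reordering the months (with their day counts) leaves `stat` unchanged:
# the test sees only heterogeneity, not the seasonal pattern itself.
```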


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
V. Ezhil Swanly ◽  
L. Selvam ◽  
P. Mohan Kumar ◽  
J. Arokia Renjith ◽  
M. Arunachalam ◽  
...  

One third of the world’s population is thought to have been infected with Mycobacterium tuberculosis (TB), with new infections occurring at a rate of about one per second. TB typically attacks the lungs, and cavities in the upper lobes of the lungs indicate severe infection. Traditionally, such cavities have been detected manually by physicians. The automatic technique proposed in this paper instead focuses on accurate detection of the disease in computed tomography (CT) images using a computer-aided detection (CAD) system. The detection process comprises the following steps: (i) image preprocessing, performed by techniques such as resizing, masking, and Gaussian smoothing; (ii) image segmentation, implemented using a mean-shift model and a gradient vector flow (GVF) model; (iii) feature extraction, achieved with the gradient inverse coefficient of variation and a circularity measure; and (iv) classification using a Bayesian classifier. Experimental results show that the system detects cavities very accurately, with a low false positive rate (FPR).
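
A skeletal Python rendering of steps (i)-(iv) is sketched below. Thresholded connected components stand in for the paper's mean-shift/GVF segmentation, a crude boundary-pixel count approximates the perimeter, and Gaussian naive Bayes stands in for the Bayesian classifier:

```python
import numpy as np
from scipy import ndimage
from sklearn.naive_bayes import GaussianNB

def preprocess(ct_slice, sigma=2.0):
    """Step (i): Gaussian smoothing (resizing and lung masking omitted)."""
    return ndimage.gaussian_filter(ct_slice.astype(float), sigma=sigma)

def candidate_regions(img, thresh):
    """Stand-in for step (ii): dark (air-filled) regions via thresholding
    and connected components instead of mean-shift / GVF segmentation."""
    labels, n = ndimage.label(img < thresh)
    return labels, n

def region_features(img, labels, n):
    """Step (iii): circularity 4*pi*area/perimeter^2 (perimeter approximated
    by a boundary-pixel count) and mean intensity per candidate region."""
    feats = []
    for r in range(1, n + 1):
        mask = labels == r
        area = mask.sum()
        perim = max((mask ^ ndimage.binary_erosion(mask)).sum(), 1)
        feats.append([4.0 * np.pi * area / perim ** 2, img[mask].mean()])
    return np.asarray(feats)

# Step (iv): with labelled training candidates (X_train, y_train) available,
# a Gaussian naive Bayes classifier separates cavities from non-cavities:
#     clf = GaussianNB().fit(X_train, y_train)
#     predictions = clf.predict(region_features(img, labels, n))
```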

