Commentary: a decomposition of the outlier detection problem into a set of supervised learning problems

2016 ◽  
Vol 105 (2) ◽  
pp. 301-304
Author(s):  
Ye Zhu ◽  
Kai Ming Ting


Author(s):  
Никита Сергеевич Олейник ◽  
Владислав Юрьевич Щеколдин

The problem of detecting anomalous observations in high-dimensional data is considered on the basis of multidimensional scaling, with attention to constructing a high-quality visualization of the data. An algorithm for a modified Torgerson principal-projections method is proposed, based on building the projection subspace for the source data by changing the way the matrix of scalar products is factorized, using cumulative curves analysis. The empirical distribution of the F1-measure is constructed and analyzed for different projection options of the source data.

Purpose. The paper aims at developing methods of multidimensional data representation for solving classification problems based on cumulative curves analysis. It considers the outlier detection problem for high-dimensional data based on multidimensional scaling, in order to construct a high-quality data visualization. An anomalous observation (or outlier), according to D. Hawkins, is an observation that differs so much from the others that it may be assumed to have appeared in the sample in a fundamentally different way.

Methods. One of the conceptual approaches that allows classifying sample observations is multidimensional scaling, represented by the classical Orlochi method, the Torgerson principal projections, and others. The Torgerson method assumes that, when transforming the data to construct the most convenient classification, the origin must be placed at the gravity center of the analyzed data; the matrix of scalar products of vectors with the origin at the gravity center is then calculated, the two largest eigenvalues and the corresponding eigenvectors are chosen, and the projection matrix is computed. Moreover, the method assumes a linear separation of regular and anomalous observations, which rarely holds in practice.

Therefore, it is logical to choose, among the possible projection axes, those that yield more effective results for the outlier detection problem. A modified CC-ABOD (Cumulative Curves for Angle-Based Outlier Detection) procedure has been applied to estimate the visualization quality. It is based on estimating the variances of the angles formed by a particular observation with the remaining observations in multidimensional space. Cumulative curves analysis is then applied, which allows partitioning off groups of closely localized observations (according to the chosen metric) and forming classes of regular, intermediate, and anomalous observations.

Results. A modification of the Torgerson method is developed. The F1-measure distribution is constructed and analyzed for different projection options of the source data. Analysis of the empirical distribution showed that in a number of cases the best axes correspond to the second, third, or even fourth largest eigenvalues.

Findings. Multidimensional scaling methods for visualizing multidimensional data and solving outlier detection problems have been considered. It was found that the choice of projection axes is an ambiguous problem.
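The classical Torgerson step described above (place the origin at the gravity center, eigendecompose the scalar-product matrix, project onto selected eigenvectors) can be sketched as follows. This is a minimal illustration, not the paper's method: the `axes` argument merely stands in for the cumulative-curve axis selection, whose exact criterion is not reproduced in the abstract.

```python
import numpy as np

def torgerson_projection(D, axes=(0, 1)):
    """Classical Torgerson scaling from a distance matrix D.

    `axes` selects which eigen-directions to project onto: the classical
    method takes the two largest eigenvalues, i.e. axes=(0, 1); the
    modification discussed above may instead pick later axes (the exact
    cumulative-curve selection rule is not reproduced here).
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix: origin -> gravity center
    B = -0.5 * J @ (D ** 2) @ J              # matrix of scalar products
    w, V = np.linalg.eigh(B)                 # eigh returns eigenvalues in ascending order
    order = np.argsort(w)[::-1]              # re-sort descending
    w, V = w[order], V[:, order]
    idx = list(axes)
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

# Toy usage: four corners of a unit square are embedded back into the plane.
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
emb = torgerson_projection(D)
print(emb.shape)  # (4, 2)
```

Because the square is genuinely two-dimensional, the two largest eigenvalues capture all of B, and the embedding reproduces the original pairwise distances exactly (up to rotation and reflection).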


Author(s):  
Greg Ver Steeg

Learning by children and animals occurs effortlessly and largely without obvious supervision. Successes in automating supervised learning have not translated to the more ambiguous realm of unsupervised learning where goals and labels are not provided. Barlow (1961) suggested that the signal that brains leverage for unsupervised learning is dependence, or redundancy, in the sensory environment. Dependence can be characterized using the information-theoretic multivariate mutual information measure called total correlation. The principle of Total Cor-relation Ex-planation (CorEx) is to learn representations of data that "explain" as much dependence in the data as possible. We review some manifestations of this principle along with successes in unsupervised learning problems across diverse domains including human behavior, biology, and language.
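As a small illustration of the dependence measure involved, total correlation TC(X) = Σᵢ H(Xᵢ) − H(X₁, …, Xₙ) can be estimated directly from discrete samples. The sketch below shows the quantity whose explanation CorEx maximizes, not the CorEx algorithm itself:

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Shannon entropy (in nats) of an empirical sample of hashable outcomes."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def total_correlation(columns):
    """TC(X) = sum_i H(X_i) - H(X_1, ..., X_n), estimated from samples.

    `columns` is a list of equal-length sequences, one per variable.
    TC is zero iff the variables are (empirically) independent, and grows
    with the amount of redundancy they share.
    """
    joint = list(zip(*columns))
    return sum(entropy(c) for c in columns) - entropy(joint)

# Two perfectly redundant binary variables: TC = H(X) = ln 2.
x = [0, 1, 0, 1]
tc = total_correlation([x, x])
print(round(tc, 4))  # 0.6931
```

For independent columns (e.g. the four combinations of two bits, each appearing once) the same estimator returns zero, matching the interpretation of TC as redundancy in the data.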


2015 ◽  
Vol 27 (9) ◽  
pp. 1899-1914
Author(s):  
Marthinus Christoffel du Plessis ◽  
Hiroaki Shiino ◽  
Masashi Sugiyama

Many machine learning problems, such as nonstationarity adaptation, outlier detection, dimensionality reduction, and conditional density estimation, can be effectively solved by using the ratio of probability densities. Since the naive two-step procedure of first estimating the probability densities and then taking their ratio performs poorly, methods to directly estimate the density ratio from two sets of samples without density estimation have been extensively studied recently. However, these methods are batch algorithms that use the whole data set to estimate the density ratio, and they are inefficient in the online setup, where training samples are provided sequentially and solutions are updated incrementally without storing previous samples. In this letter, we propose two online density-ratio estimators based on the adaptive regularization of weight vectors. Through experiments on inlier-based outlier detection, we demonstrate the usefulness of the proposed methods.


2019 ◽  
Vol 63 (1) ◽  
pp. 55-70
Author(s):  
Bahattin Erdogan ◽  
Serif Hekimoglu ◽  
Utkan Mustafa Durdag ◽  
Taylan Ocalan

2012 ◽  
Vol 155-156 ◽  
pp. 342-347 ◽  
Author(s):  
Xun Biao Zhong ◽  
Xiao Xia Huang

To address the low accuracy and high computational cost of density-based outlier detection, a variance of distance and density (VDD) measure is proposed in this paper. The proposed k-means clustering and score-based VDD (KSVDD) approach can efficiently detect outliers with high performance. For illustration, two real-world datasets are used to show the feasibility of the approach. Empirical results show that KSVDD has good detection precision.
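The abstract does not give the VDD formula, so the following is only an illustrative sketch of the general k-means-plus-score pattern it builds on, using distance to the nearest centroid as a stand-in outlier score (not the paper's actual measure):

```python
import numpy as np

def kmeans_outlier_scores(X, k=2, iters=20):
    """Illustrative k-means-plus-score outlier detector (NOT the paper's VDD
    measure): fit k-means with naive Lloyd iterations, then score each point
    by its distance to the nearest centroid."""
    centroids = X[::len(X) // k][:k].copy()  # naive strided initialization
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    d = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
    return d.min(axis=1)

# Two tight clusters plus one far-away point at index 40.
X = np.vstack([np.zeros((20, 2)), np.ones((20, 2)) * 5, [[30.0, 30.0]]])
scores = kmeans_outlier_scores(X)
print(scores.argmax())  # 40
```

Any score that combines cluster geometry with per-point statistics slots into the same template; the paper's contribution is the specific VDD score and its efficiency.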


RSC Advances ◽  
2016 ◽  
Vol 6 (86) ◽  
pp. 82801-82809 ◽  
Author(s):  
P. Žuvela ◽  
J. Jay Liu

Feature selection for supervised learning problems involving analytical information.


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1330
Author(s):  
Maxime Haddouche ◽  
Benjamin Guedj ◽  
Omar Rivasplata ◽  
John Shawe-Taylor

We present new PAC-Bayesian generalisation bounds for learning problems with unbounded loss functions. This extends the relevance and applicability of the PAC-Bayes learning framework, where most of the existing literature focuses on supervised learning problems with a bounded loss function (typically assumed to take values in the interval [0,1]). In order to relax this classical assumption, we propose to allow the range of the loss to depend on each predictor. This relaxation is captured by our new notion of HYPothesis-dependent rangE (HYPE). Based on this, we derive a novel PAC-Bayesian generalisation bound for unbounded loss functions, and we instantiate it on a linear regression problem. To make our theory usable by the largest audience possible, we include discussions on actual computation, practicality and limitations of our assumptions.
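For context, the bounded-loss setting that the paper relaxes can be stated concretely. One classical PAC-Bayesian bound (a McAllester-type bound, in Maurer's refinement) for a loss taking values in [0,1] says: with probability at least 1 − δ over an i.i.d. sample S of size n, simultaneously for all posteriors ρ over hypotheses,

```latex
\mathbb{E}_{h\sim\rho}\big[L(h)\big]
\;\le\;
\mathbb{E}_{h\sim\rho}\big[\hat{L}_S(h)\big]
+ \sqrt{\frac{\mathrm{KL}(\rho\,\|\,\pi) + \ln\frac{2\sqrt{n}}{\delta}}{2n}},
```

where π is a data-free prior, L the population risk, and L̂_S the empirical risk. The complexity term depends on the loss being uniformly bounded; the HYPE approach replaces this uniform bound with a range that may vary from predictor to predictor.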

