Uncertain distance-based outlier detection with arbitrarily shaped data objects

Author(s):  
Fabrizio Angiulli ◽  
Fabio Fassetti

Abstract Enabling information systems to cope with anomalies in the presence of uncertainty is a compelling and challenging task. In this work the problem of unsupervised outlier detection in large collections of data objects modeled by means of arbitrary multidimensional probability density functions is considered. We present a novel definition of uncertain distance-based outlier under the attribute-level uncertainty model, according to which an uncertain object is an object that always exists but whose actual value is modeled by a multivariate pdf. According to this definition, an uncertain object is declared to be an outlier on the basis of the expected number of its neighbors in the dataset. To the best of our knowledge, this is the first work that considers the unsupervised outlier detection problem on data objects modeled by means of arbitrarily shaped multidimensional distribution functions. We present the UDBOD algorithm, which efficiently detects the outliers in an input uncertain dataset by taking advantage of three optimized phases, namely parameter estimation, candidate selection, and candidate filtering. An experimental campaign is presented, including a sensitivity analysis, a study of the effectiveness of the technique, a comparison with related algorithms, also in the presence of high-dimensional data, and a discussion of the behavior of our technique in real-case scenarios.
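The core idea of declaring an uncertain object an outlier based on the expected number of its neighbors can be sketched with a Monte Carlo estimate. This is not the UDBOD algorithm itself (which adds parameter estimation, candidate selection, and filtering for efficiency); it is a naive illustration, and the Gaussian pdfs, function names, and parameters below are assumptions for the sake of the example (the paper allows arbitrary pdfs):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_neighbors(means, covs, idx, R, n_samples=1000):
    """Monte Carlo estimate of the expected number of dataset objects
    falling within distance R of uncertain object `idx`.
    Each object is modeled here as a multivariate Gaussian, a
    simplifying assumption for this sketch."""
    total = 0.0
    for j in range(len(means)):
        if j == idx:
            continue
        # Sample realizations of both uncertain objects and count how
        # often they fall within distance R of each other.
        xs = rng.multivariate_normal(means[idx], covs[idx], n_samples)
        ys = rng.multivariate_normal(means[j], covs[j], n_samples)
        total += np.mean(np.linalg.norm(xs - ys, axis=1) <= R)
    return total

def uncertain_outliers(means, covs, R, k):
    """Flag objects whose expected number of R-neighbors is below k."""
    return [i for i in range(len(means))
            if expected_neighbors(means, covs, i, R) < k]
```

On a toy dataset of three tightly clustered uncertain objects plus one far away, only the far object falls below the expected-neighbor threshold.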

2021 ◽  
Vol 15 (4) ◽  
pp. 1-20
Author(s):  
Georg Steinbuss ◽  
Klemens Böhm

Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with varied and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus, in principle, allows for a more meaningful evaluation of detection methods. Nonetheless, there have been only a few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty of arriving at a good coverage of different domains with synthetic data. In this work, we propose a generic process for the generation of datasets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. We propose and describe a generic process for the benchmarking of unsupervised outlier detection, as sketched so far. We then describe three instantiations of this generic process that generate outliers with specific characteristics, such as local outliers. To validate our process, we perform a benchmark with state-of-the-art detection methods and carry out experiments to study the quality of the data reconstructed in this way. Besides showcasing the workflow, this confirms the usefulness of our proposed process. In particular, our process yields regular instances close to the ones from real data. Summing up, we propose and validate a new and practical process for the benchmarking of unsupervised outlier detection.
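A minimal sketch of the generation idea, assuming a single Gaussian fit to the real data (a much cruder reconstruction than the paper's instantiations): regular instances are sampled from the fitted model, while outliers are drawn from the same model with an inflated covariance so that they tend to land in low-density regions. All names and parameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_benchmark(real_data, n_regular, n_outliers, inflate=5.0):
    """Generate a labeled synthetic benchmark from real data.
    Regular instances are reconstructed with a single Gaussian fit
    (a simplification); outliers come from the same Gaussian with
    an inflated covariance."""
    mu = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)
    regular = rng.multivariate_normal(mu, cov, n_regular)
    outliers = rng.multivariate_normal(mu, cov * inflate, n_outliers)
    X = np.vstack([regular, outliers])
    y = np.r_[np.zeros(n_regular), np.ones(n_outliers)]  # 1 = outlier
    return X, y
```

The ground-truth labels `y` are what make synthetic benchmarks attractive: detection quality can be scored exactly, which is rarely possible with real-world outliers.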


2003 ◽  
Vol 11 (2) ◽  
pp. 169-206 ◽  
Author(s):  
Riccardo Poli ◽  
Nicholas Freitag McPhee

This paper is the second part of a two-part paper which introduces a general schema theory for genetic programming (GP) with subtree-swapping crossover (Part I (Poli and McPhee, 2003)). Like other recent GP schema theory results, the theory gives an exact formulation (rather than a lower bound) for the expected number of instances of a schema at the next generation. The theory is based on a Cartesian node reference system, introduced in Part I, and on the notion of a variable-arity hyperschema, introduced here, which generalises previous definitions of a schema. The theory includes two main theorems describing the propagation of GP schemata: a microscopic and a macroscopic schema theorem. The microscopic version is applicable to crossover operators which replace a subtree in one parent with a subtree from the other parent to produce the offspring. Therefore, this theorem is applicable to Koza's GP crossover with and without uniform selection of the crossover points, as well as one-point crossover, size-fair crossover, strongly-typed GP crossover, context-preserving crossover and many others. The macroscopic version is applicable to crossover operators in which the probability of selecting any two crossover points in the parents depends only on the parents' size and shape. In the paper we provide examples, we show how the theory can be specialised to specific crossover operators and we illustrate how it can be used to derive other general results. These include an exact definition of effective fitness and a size-evolution equation for GP with subtree-swapping crossover.
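In hedged form, the exact schema theorems mentioned above share a common structure (notation follows the usual GP schema-theory conventions; the operator-specific details differ):

$$\mathbb{E}\!\left[m(H, t+1)\right] = M\,\alpha(H, t),$$

where $m(H, t+1)$ is the number of instances of schema $H$ at generation $t+1$, $M$ is the population size, and $\alpha(H, t)$ is the probability that a newly created offspring samples $H$. The microscopic and macroscopic theorems differ in how $\alpha(H, t)$ is expanded over selection and crossover events: the microscopic version sums over individual parent pairs and crossover points, while the macroscopic version aggregates over size-and-shape classes.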


Author(s):  
MIGUEL G. ECHEVARRÍA ◽  
AHMAD IDILBI ◽  
IGNAZIO SCIMEMI

We consider the definition of unpolarized transverse-momentum-dependent parton distribution functions while staying on the light-cone. By imposing a requirement of identical treatment of the two collinear sectors, our approach, compatible with a generic factorization theorem with the soft function included, is valid for all non-ultraviolet regulators (as it should be), an issue which causes much confusion in the whole field. We explain how large logarithms can be resummed in a way which can be considered an alternative to the use of the Collins-Soper evolution equation. The evolution properties are also discussed, and gauge invariance, in both regular and singular classes of gauges, is emphasized.


Author(s):  
Fabrizio Angiulli

Data mining techniques can be grouped in four main categories: clustering, classification, dependency detection, and outlier detection. Clustering is the process of partitioning a set of objects into homogeneous groups, or clusters. Classification is the task of assigning objects to one of several predefined categories. Dependency detection searches for pairs of attribute sets which exhibit some degree of correlation in the data set at hand. The outlier detection task can be defined as follows: “Given a set of data points or objects, find the objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data”. These exceptional objects are also referred to as outliers. Most of the early methods for outlier identification were developed in the field of statistics (Hawkins, 1980; Barnett & Lewis, 1994). Hawkins’ definition of outlier clarifies the approach: “An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”. Indeed, statistical techniques assume that the given data set has a distribution model. Outliers are those points that satisfy a discordancy test, that is, that are significantly far from what would be their expected position given the hypothesized distribution. Many clustering, classification and dependency detection methods produce outliers as a by-product of their main task. For example, in classification, mislabeled objects are considered outliers and thus they are removed from the training set to improve the accuracy of the resulting classifier, while in clustering, objects that do not strongly belong to any cluster are considered outliers. Nevertheless, it must be said that searching for outliers through techniques specifically designed for tasks other than outlier detection may not be advantageous. 
As an example, clusters can be distorted by outliers and, thus, the quality of the outliers returned is affected by their presence. Moreover, besides returning a solution of higher quality, outlier detection algorithms can be vastly more efficient than non-ad-hoc algorithms. While in many contexts outliers are considered noise that must be eliminated, as pointed out elsewhere, “one person’s noise could be another person’s signal”, and thus outliers themselves can be of great interest. Outlier mining is used in telecom and credit card fraud detection to spot atypical usage of telecom services or credit cards, in intrusion detection for detecting unauthorized accesses, in medical analysis to test abnormal reactions to new medical therapies, in marketing and customer segmentation to identify customers spending much more or much less than the average customer, in surveillance systems, in data cleaning, and in many other fields.
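The distance-based flavor of the definition above can be sketched in a few lines. This is a naive O(n²) illustration, not an efficient detection algorithm; the parameter names `R` and `k` are illustrative (the literature also uses the equivalent DB(p, D) formulation):

```python
import numpy as np

def distance_based_outliers(X, R, k):
    """Flag a point as an outlier if fewer than k other points lie
    within distance R of it. Naive all-pairs implementation."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Count neighbors within R, excluding the point itself.
    neighbor_counts = (D <= R).sum(axis=1) - 1
    return np.flatnonzero(neighbor_counts < k)
```

Efficient algorithms avoid the all-pairs distance computation with indexing, pruning, and candidate-filtering strategies of the kind discussed above.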


Author(s):  
Sharanjit Kaur

Knowledge discovery in databases (KDD) is a nontrivial process of detecting valid, novel, potentially useful and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996). In general, KDD tasks can be classified into four categories: i) dependency detection, ii) class identification, iii) class description and iv) outlier detection. The first three categories of tasks correspond to patterns that apply to many objects, while task (iv) focuses on a small fraction of data objects, often called outliers (Han & Kamber, 2006). Typically, outliers are data points which deviate more than user expectation from the majority of points in a dataset. There are two types of outliers: i) data points/objects with abnormally large errors and ii) data points/objects with normal errors but at a far distance from their neighboring points (Maimon & Rokach, 2005). The former type may be the outcome of a malfunctioning data generator or of errors while recording data, whereas the latter is due to genuine data variation reflecting an unexpected trend in the data. Outliers may be present in real-life datasets for several reasons, including errors in the capture, storage and communication of data. Since outliers often interfere with and obstruct the data mining process, they are considered a nuisance. On the other hand, in several commercial and scientific applications, a small set of objects representing some rare or unexpected events is often more interesting than the majority of the data. Example applications in the commercial domain include credit-card fraud detection, criminal activities in e-commerce, pharmaceutical research, etc. In the scientific domain, unknown astronomical objects, unexpected values of vital parameters in patient analysis, etc. manifest as exceptions in observed data. 
Outliers are required to be reported immediately so that appropriate action can be taken in applications like network intrusion detection and weather prediction, whereas in other applications, like astronomy, further investigation of outliers may lead to the discovery of new celestial objects. Thus exception/outlier handling is an important task in KDD and often leads to a more meaningful discovery (Breunig, Kriegel, Raymond & Sander, 2000). In this article, different approaches for outlier detection in static datasets are presented.
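The first outlier type above (abnormally large errors) is the one classical statistical discordancy tests target. A minimal sketch, assuming the bulk of the data is roughly Gaussian (a strong assumption, and the threshold of three standard deviations is conventional rather than prescribed by the article):

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag values more than `threshold` standard deviations from
    the mean. Assumes the inliers are roughly Gaussian."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.flatnonzero(np.abs(z) > threshold)
```

The second outlier type (normal errors but far from neighboring points) is what distance- and density-based methods such as LOF (Breunig et al., 2000, cited above) are designed to catch, since a global test like this one misses locally deviating points.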


2019 ◽  
Vol 36 (7) ◽  
pp. 2165-2172 ◽  
Author(s):  
F Maggioli ◽  
T Mancini ◽  
E Tronci

Abstract Motivation SBML is the most widespread language for the definition of biochemical models. Although dozens of SBML simulators are available, there is a general lack of support for the integration of SBML models within open-standard general-purpose simulation ecosystems. This hinders co-simulation and integration of SBML models within larger model networks, in order to, e.g., enable in silico clinical trials of drugs, pharmacological protocols, or engineering artefacts such as biomedical devices against Virtual Physiological Human models. Modelica is one of the most popular existing open-standard general-purpose simulation languages, supported by many simulators. Modelica models are especially suited for the definition of complex networks of heterogeneous models from virtually all application domains. Models written in Modelica (and in 100+ other languages) can be readily exported into black-box Functional Mock-Up Units (FMUs), and seamlessly co-simulated and integrated into larger model networks within open-standard language-independent simulation ecosystems. Results In order to enable SBML model integration within heterogeneous model networks, we present SBML2Modelica, a software system translating SBML models into well-structured, user-intelligible, easily modifiable Modelica models. SBML2Modelica is SBML Level 3 Version 2 compliant and succeeds on 96.47% of the SBML Test Suite Core (with a few rare, intricate and easily avoidable combinations of constructs unsupported and cleanly signalled to the user). Our experimental campaign on 613 models from the BioModels database (with up to 5438 variables) shows that the major open-source (general-purpose) Modelica and FMU simulators achieve performance comparable to state-of-the-art specialized SBML simulators. Availability and implementation SBML2Modelica is written in Java and is freely available for non-commercial use at https://bitbucket.org/mclab/sbml2modelica.


Buildings ◽  
2018 ◽  
Vol 8 (10) ◽  
pp. 139 ◽  
Author(s):  
Benedetta Barozzi ◽  
Alice Bellazzi ◽  
Claudio Maffè ◽  
Italo Meroni

Green roofs are one of the most extensively investigated roofing technologies. Most bibliographical studies report results of research focused on the analysis of different green roof configurations, but only a few deal with the calculation of the growing media's thermal resistance using laboratory tests. From 2009 to 2013, ITC-CNR, the Construction Technologies Institute of the National Research Council of Italy, carried out a first laboratory experimental campaign focused on the definition of thermal performance curves of growing media for green roofs as a function of both density and percentage of internal moisture. During this campaign, the experimental results highlighted some existing gaps, such as the absence of specific standards concerning laboratory sample preparation, the absence of shared references concerning the compaction level reached by samples in real working conditions, and the evaluation of the internal moisture content of growing media exposed to atmospheric agents. For this reason, ITC-CNR set up a second experimental campaign focused on resolving the gaps highlighted by the first phase concerning the preparation of samples for the laboratory calculation of the thermal resistance of growing media for green roofs. This paper presents the methodological approaches, methods and new test devices implemented to resolve these gaps, and the results obtained.
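For a homogeneous layer, the thermal resistance being measured in such campaigns follows the standard relation R = d / λ, with d the thickness and λ the thermal conductivity. A trivial sketch; the conductivity value below is purely illustrative, since, as the campaign above shows, λ for growing media varies strongly with density and moisture content:

```python
def thermal_resistance(thickness_m, conductivity_w_mk):
    """Thermal resistance of a homogeneous layer in m^2*K/W:
    R = d / lambda, with d in metres and lambda in W/(m*K)."""
    return thickness_m / conductivity_w_mk

# Illustrative only: a 10 cm growing-media layer with an assumed
# conductivity of 0.25 W/(m*K) gives R = 0.4 m^2*K/W.
r_layer = thermal_resistance(0.10, 0.25)
```

The experimental difficulty is not this arithmetic but producing samples whose density and moisture content match real working conditions, which is exactly the gap the second campaign addresses.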


1981 ◽  
Vol 18 (03) ◽  
pp. 707-714 ◽  
Author(s):  
Shun-Chen Niu

Using a definition of partial ordering of distribution functions, it is proven that for a tandem queueing system with many stations in series, where each station can have either one server with an arbitrary service distribution or a number of constant servers in parallel, the expected total waiting time in system of every customer decreases as the interarrival and service distributions become smaller with respect to that ordering. Some stronger conclusions are also given under stronger order relations. Using these results, bounds for the expected total waiting time in system are then readily obtained for wide classes of tandem queues.
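The monotonicity result can be illustrated by simulation using the standard departure-time recursion for tandem single-server FIFO queues, D[i, j] = max(D[i-1, j], D[i, j-1]) + S[i, j]. The sketch below compares exponential service against deterministic service with the same mean (the deterministic distribution being smaller in the relevant ordering); the two-station setup and all parameter values are assumptions for illustration, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_time_in_system(arrivals, services):
    """Mean total time in system for a tandem of single-server FIFO
    queues. `services[i][j]` is the service time of customer j at
    station i; `arrivals` are the (sorted) arrival epochs.
    Uses the recursion D[i,j] = max(D[i-1,j], D[i,j-1]) + S[i,j]."""
    n_stations, n_cust = services.shape
    D = np.zeros((n_stations + 1, n_cust))
    D[0] = arrivals  # "departure" from a virtual station 0 = arrival
    for i in range(1, n_stations + 1):
        for j in range(n_cust):
            prev = D[i, j - 1] if j > 0 else 0.0
            D[i, j] = max(D[i - 1, j], prev) + services[i - 1, j]
    return float(np.mean(D[-1] - arrivals))

# Two stations, Poisson arrivals at rate 1, mean service time 0.5:
# exponential service versus deterministic service of the same mean.
n = 20000
arrivals = np.cumsum(rng.exponential(1.0, n))
expo = rng.exponential(0.5, (2, n))
det = np.full((2, n), 0.5)
```

With the smaller (deterministic) service distribution, the simulated mean time in system comes out lower, consistent with the theorem.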

