Outlier Detection

Author(s):  
Sharanjit Kaur

Knowledge discovery in databases (KDD) is the nontrivial process of detecting valid, novel, potentially useful and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996). In general, KDD tasks can be classified into four categories: i) dependency detection, ii) class identification, iii) class description and iv) outlier detection. The first three categories correspond to patterns that apply to many objects, while the fourth focuses on a small fraction of data objects, often called outliers (Han & Kamber, 2006). Typically, outliers are data points that deviate from the majority of points in a dataset by more than the user expects. There are two types of outliers: i) data points/objects with abnormally large errors and ii) data points/objects with normal errors that lie far from their neighboring points (Maimon & Rokach, 2005). The former type may result from a malfunctioning data generator or from errors made while recording data, whereas the latter reflects genuine data variation indicating an unexpected trend. Outliers may be present in real-life datasets for several reasons, including errors in the capture, storage and communication of data. Since outliers often interfere with and obstruct the data mining process, they are considered a nuisance. In several commercial and scientific applications, however, a small set of objects representing rare or unexpected events is often more interesting than the larger ones. Example applications in the commercial domain include credit-card fraud detection, criminal activities in e-commerce, pharmaceutical research, etc. In the scientific domain, unknown astronomical objects and unexpected values of vital parameters in patient analysis manifest as exceptions in observed data. In applications like network intrusion detection and weather prediction, outliers must be reported immediately so that appropriate action can be taken, whereas in other applications, like astronomy, further investigation of outliers may lead to the discovery of new celestial objects. Thus exception/outlier handling is an important task in KDD and often leads to a more meaningful discovery (Breunig, Kriegel, Ng & Sander, 2000). In this article, different approaches for outlier detection in static datasets are presented.
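
To make the notion of "deviation from the majority" concrete, here is a minimal sketch of a simple statistical check: flag points lying more than a chosen number of standard deviations from the mean. The threshold of 3 and the toy data are illustrative assumptions, not values from the article.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag points that deviate from the majority by more than
    `threshold` standard deviations (a common rule of thumb)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Fifty ordinary readings plus one grossly erroneous value.
rng = np.random.default_rng(0)
data = np.append(rng.normal(10.0, 0.5, size=50), 55.0)
print(np.flatnonzero(zscore_outliers(data)))  # flags the injected point (index 50)
```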

Author(s):  
Frank Klawonn ◽  
Frank Rehm

For many applications in knowledge discovery in databases, finding outliers, i.e. rare events, is important. Outliers are observations that deviate significantly from the rest of the data, so that they seem to be generated by another process (Hawkins, 1980). Such outlier objects often contain information about an untypical behavior of the system. However, outliers bias the results of many data mining methods, such as the mean value, the standard deviation or the positions of the prototypes in k-means clustering (Estivill-Castro, 2004; Keller, 2000). Therefore, identifying outliers is a crucial step before further analysis or processing of the data is carried out with more sophisticated data mining techniques. Usually, data objects are considered outliers when they occur in a region of extremely low data density. Many clustering techniques that deal with noisy data and can identify outliers, like possibilistic clustering (PCM) (Krishnapuram & Keller, 1993; Krishnapuram & Keller, 1996) or noise clustering (NC) (Dave, 1991; Dave & Krishnapuram, 1997), need good initializations or lack adaptability to different cluster sizes (Rehm, Klawonn & Kruse, 2007). Distance-based approaches (Knorr, 1998; Knorr, Ng & Tucakov, 2000) have a global view of the data set and can hardly handle data sets containing regions of different data density (Breunig, Kriegel, Ng & Sander, 2000). In this work we present an approach that combines a fuzzy clustering algorithm (Höppner, Klawonn, Kruse & Runkler, 1999) (or any other prototype-based clustering algorithm) with statistical distribution-based outlier detection.
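
The following is a minimal sketch of the general idea described above, not the authors' exact method: k-means stands in for the fuzzy clustering algorithm, and within-cluster distances to the prototypes are assumed approximately normal so that improbably distant points can be flagged. The 97.5% quantile is an illustrative choice.

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def prototype_outliers(X, n_clusters=3, quantile=0.975, random_state=0):
    """Cluster with k-means (a stand-in for the paper's fuzzy clustering),
    then flag points whose distance to their prototype exceeds a
    normal-quantile cutoff fitted to the within-cluster distances."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    flagged = np.zeros(len(X), dtype=bool)
    for c in range(n_clusters):
        in_c = km.labels_ == c
        cutoff = norm.ppf(quantile, loc=d[in_c].mean(), scale=d[in_c].std())
        flagged[in_c] = d[in_c] > cutoff
    return flagged
```

Because the cutoff is fitted per cluster, the sketch adapts to clusters of different spread, which is exactly the weakness of a single global distance threshold noted above.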


2011 ◽  
pp. 29-43
Author(s):  
Faxin Zhao ◽  
Yubin Bao ◽  
Huanliang Sun ◽  
Ge Yu

In data mining, outlier detection is an important research issue. The number of cells in the cell-based disk algorithm increases exponentially with the dimensionality of the data, and the performance of this algorithm decreases dramatically as the number of cells and data points grows. Further analysis shows that many cells are empty and thus useless for outlier detection. This chapter therefore proposes a novel index structure, called the CD-Tree, in which only non-empty cells are stored, and adopts a clustering technique that stores the data objects of the same cell in linked disk pages. Experiments were conducted to test the performance of the proposed algorithms. The results show that the disk algorithm based on the CD-Tree structure and the clustering technique outperforms the cell-based disk algorithm and can handle data of higher dimensionality.
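
As a rough illustration of the motivation behind the CD-Tree (materializing only non-empty cells instead of an exponentially large grid), the sketch below hashes each point to its integer grid cell. The in-memory dictionary is a stand-in assumption for the disk-based CD-Tree and its linked pages; it is not the chapter's data structure.

```python
from collections import defaultdict
import numpy as np

def build_cell_index(X, cell_width):
    """Map each point to its integer grid cell. Only cells that actually
    contain points are materialized, so storage scales with the data
    rather than with the full exponential grid."""
    cells = defaultdict(list)
    for i, x in enumerate(np.asarray(X, dtype=float)):
        key = tuple((x // cell_width).astype(int))
        cells[key].append(i)
    return cells

# Example: 4 points in 2-D; only 2 of the infinitely many cells are stored.
index = build_cell_index([[0.1, 0.2], [0.3, 0.1], [5.0, 5.1], [5.2, 5.3]], 1.0)
print(index)  # {(0, 0): [0, 1], (5, 5): [2, 3]}
```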


Author(s):  
A. Kalpana ◽  
P. Rambabu ◽  
Lakshmi Sreeniuvasareddy D

An outlier is a data point that is significantly different from the remaining data points. Outliers are also referred to as discordants, deviants or abnormalities. Outliers may be of particular interest, as in credit card fraud detection, where they indicate fraudulent activity. Outlier detection is thus an interesting data mining task, referred to as outlier analysis. Detecting outliers efficiently is important in many fields such as credit card fraud, medicine, law enforcement, earth sciences, etc. Many methods are available to identify outliers in numerical datasets, but only a limited number of methods are available for categorical and mixed-attribute datasets. In this work, a novel outlier detection method is proposed. The method finds anomalies based on each record's "multi-attribute outlier factor through correlation" score and has great intuitive appeal. The algorithm utilizes the frequency of each value in the categorical part of the dataset and the correlation factor of each record with the mean record of the entire dataset. For the categorical part, the method uses the Attribute Value Frequency (AVF) score. Results of the proposed method are compared with existing methods. The experiments in this paper use the (mixed) Bank dataset taken from the UCI machine learning repository. Keywords: outlier, mixed-attribute datasets, attribute value frequency score
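
The AVF component of the method can be sketched as follows, using the usual AVF definition (the average, over attributes, of the frequency of a record's values); records with rare values score low and are candidate outliers. This sketch covers only the frequency part and does not reproduce the paper's correlation factor.

```python
import pandas as pd

def avf_scores(df):
    """Attribute Value Frequency: for each record, average how often each
    of its categorical values occurs in its column. Low score = outlier."""
    freq = {col: df[col].map(df[col].value_counts()) for col in df.columns}
    return pd.DataFrame(freq).mean(axis=1)

# Toy categorical table: the record with two rare values scores lowest.
df = pd.DataFrame({"color": ["red", "red", "red", "blue"],
                   "size":  ["S", "S", "M", "XL"]})
print(avf_scores(df).idxmin())  # 3 -- the (blue, XL) record
```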


2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Recently, anomaly detection has received considerable attention from data mining scientists, as its reputation has grown steadily in practical domains like product marketing, fraud detection, medical diagnosis, fault detection and many other fields. Outlier detection in high dimensional data poses exceptional challenges for data mining experts because of the curse of dimensionality and the growing resemblance between distant and nearby points. Traditional algorithms and techniques operate on the full feature space for outlier detection. Such customary methodologies concentrate largely on low dimensional data and hence prove ineffective at discovering anomalies in data sets with a high number of dimensions. Digging out the anomalies present in a high dimensional data set becomes a very difficult and tiresome job when all subsets of projections need to be explored. All data points in high dimensional data behave like similar observations because of an intrinsic property of such data: the contrast between the distances to the nearest and the farthest observations vanishes as the number of dimensions tends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings in well-established density-based techniques. The technique opens a new breadth of research towards resolving the inherent problems of high dimensional data, where outliers reside within clusters having different densities. A high dimensional dataset from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
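
For context, here is a minimal example of the density-based family that the proposed technique builds on and is compared against: scikit-learn's Local Outlier Factor applied to two clusters of different densities. This is a baseline illustration under assumed synthetic data, not the paper's technique.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Two clusters of different density plus a few scattered points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),    # tight cluster
               rng.normal(5, 1.5, (100, 2)),    # loose cluster
               rng.uniform(-4, 9, (5, 2))])     # scattered anomalies

# LOF compares each point's local density with that of its neighbors,
# so it can flag outliers relative to clusters of different densities.
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
print(np.flatnonzero(labels == -1))  # indices labelled as outliers
```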


2009 ◽  
Vol 9 (2) ◽  
pp. 8101-8119 ◽  
Author(s):  
S. M. Illingworth ◽  
J. J. Remedios ◽  
R. J. Parker

Abstract. The mission objectives of the Infrared Atmospheric Sounding Interferometer (IASI) are driven by the needs of the Numerical Weather Prediction (NWP) and climate monitoring communities. These objectives rely upon the IASI instrument being able to measure top-of-atmosphere radiances accurately. This paper presents a technique and results for the validation of the radiometric calibration of radiances for IASI, using a cross-calibration with the Advanced Along Track Scanning Radiometer (AATSR). The AATSR is able to measure Brightness Temperature (BT) to an accuracy of 30 mK, and by applying the AATSR spectral filter function to the IASI measured radiances we are able to compare AATSR and IASI Brightness Temperatures. By choosing coincident data points that are over the sea and in clear-sky conditions, a threshold of homogeneity is derived. It is found that in these homogeneous conditions, the IASI BTs agree with those measured by the AATSR to within 0.5 K, with a precision of order 0.04 K. These results indicate that IASI is likely to be meeting its target objective of 0.5 K accuracy. It is expected that a refinement of the AATSR spectral filter function will permit a tighter error constraint on the quality of the IASI data and hence further assessment of the climate quality of the radiances.
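
A hedged sketch of the core cross-calibration step, under simplifying assumptions: weight the IASI spectrum by a filter response, integrate to a band radiance, and invert the Planck function at the band-centre wavenumber. The radiation constants are standard; the filter function, spectral grid and units are assumptions, and the paper's actual analysis involves the real AATSR filter function and more careful treatment.

```python
import numpy as np

C1 = 1.191042e-5   # mW m-2 sr-1 (cm-1)^-4, first radiation constant
C2 = 1.4387752     # K cm, second radiation constant

def band_brightness_temperature(wavenumber, radiance, filter_response):
    """Filter-weighted band radiance converted to brightness temperature
    via the inverse Planck function at the band-centre wavenumber.
    Assumes radiance in mW m-2 sr-1 (cm-1)^-1 and wavenumber in cm-1."""
    w = filter_response / np.trapz(filter_response, wavenumber)
    L = np.trapz(w * radiance, wavenumber)        # band-averaged radiance
    nu = np.trapz(w * wavenumber, wavenumber)     # band-centre wavenumber
    return C2 * nu / np.log(1.0 + C1 * nu**3 / L)
```

Applying the same filter to both instruments' measurements puts the IASI and AATSR brightness temperatures on a common spectral footing before differencing.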


2022 ◽  
Vol 13 (1) ◽  
pp. 1-17
Author(s):  
Ankit Kumar ◽  
Abhishek Kumar ◽  
Ali Kashif Bashir ◽  
Mamoon Rashid ◽  
V. D. Ambeth Kumar ◽  
...  

Detection of outliers or anomalies is one of the vital issues in pattern-driven data mining. Outlier detection identifies the inconsistent behavior of individual objects. It is an important area of the data mining field with several applications, such as detecting credit card fraud, discovering intrusions and uncovering criminal activities. It is necessary to develop tools that uncover the critical information hidden in extensive data. This paper investigates a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying both the clusters and the outliers in datasets containing noise. The proposed method can detect the groups and outliers left by the clustering process, such as irregular sets of clusters (C) and outliers (O), and thereby improve the results. Applying the algorithm to the dataset improved the results in terms of several parameters. For the comparative analysis, average accuracy and recall are computed. The average accuracy is 74.05% for the existing COID algorithm and 77.21% for the proposed algorithm; the average recall is 81.19% for the existing algorithm and 89.51% for the proposed one, which shows that the proposed method is more effective than the existing COID algorithm.
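
As a generic illustration of a clustering pass that yields both a cluster set (C) and an outlier set (O) at once, the sketch below uses DBSCAN, whose noise label plays the role of the outlier set. This is a stand-in, not the COID algorithm or the paper's method, and the eps/min_samples values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, (80, 2)),
               rng.normal(6, 0.4, (80, 2)),
               rng.uniform(-3, 9, (6, 2))])   # scattered noise points

labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)
clusters = set(labels) - {-1}                 # the cluster set C
outliers = np.flatnonzero(labels == -1)       # the outlier set O
print(len(clusters), outliers)  # most scattered points land in O
```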


Author(s):  
Fabrizio Angiulli

Data mining techniques can be grouped into four main categories: clustering, classification, dependency detection, and outlier detection. Clustering is the process of partitioning a set of objects into homogeneous groups, or clusters. Classification is the task of assigning objects to one of several predefined categories. Dependency detection searches for pairs of attribute sets which exhibit some degree of correlation in the data set at hand. The outlier detection task can be defined as follows: "Given a set of data points or objects, find the objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data". These exceptional objects are also referred to as outliers. Most of the early methods for outlier identification were developed in the field of statistics (Hawkins, 1980; Barnett & Lewis, 1994). Hawkins' definition of outlier clarifies the approach: "An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism". Indeed, statistical techniques assume that the given data set follows a distribution model. Outliers are those points that satisfy a discordancy test, that is, that are significantly far from their expected position under the hypothesized distribution. Many clustering, classification and dependency detection methods produce outliers as a by-product of their main task. For example, in classification, mislabeled objects are considered outliers and are removed from the training set to improve the accuracy of the resulting classifier, while in clustering, objects that do not strongly belong to any cluster are considered outliers. Nevertheless, searching for outliers with techniques designed for tasks other than outlier detection may not be advantageous. For example, clusters can be distorted by outliers, so the quality of the outliers returned is affected by their presence. Moreover, besides returning a solution of higher quality, outlier detection algorithms can be vastly more efficient than non-ad-hoc algorithms. While in many contexts outliers are considered noise that must be eliminated, as pointed out elsewhere, "one person's noise could be another person's signal", and thus outliers themselves can be of great interest. Outlier mining is used in telecom and credit card fraud detection to spot atypical usage of telecom services or credit cards, in intrusion detection to find unauthorized accesses, in medical analysis to test abnormal reactions to new medical therapies, in marketing and customer segmentation to identify customers spending much more or much less than the average customer, in surveillance systems, in data cleaning, and in many other fields.
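
The notion of "considerably dissimilar" has a well-known distance-based formalization due to Knorr and Ng (cited above): a point is a DB(p, D)-outlier if at least a fraction p of the data lies farther than distance D from it. The brute-force O(n²) sketch below follows that definition directly for illustration; the published algorithms avoid the quadratic cost with index and cell structures.

```python
import numpy as np
from scipy.spatial.distance import cdist

def db_outliers(X, D, p):
    """Knorr-Ng DB(p, D)-outliers: points for which at least a fraction p
    of the data set lies farther away than distance D. Each point's zero
    self-distance counts as 'near', a negligible bias for large n."""
    dist = cdist(X, X)
    far_fraction = (dist > D).mean(axis=1)
    return np.flatnonzero(far_fraction >= p)

# Example: one isolated point among a tight group.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(db_outliers(X, D=1.0, p=0.8))  # [4]
```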


Author(s):  
Aijun An

Generally speaking, classification is the action of assigning an object to a category according to the characteristics of the object. In data mining, classification refers to the task of analyzing a set of pre-classified data objects to learn a model (or a function) that can be used to classify an unseen data object into one of several predefined classes. A data object, referred to as an example, is described by a set of attributes or variables. One of the attributes describes the class that an example belongs to and is thus called the class attribute or class variable. Other attributes are often called independent or predictor attributes (or variables). The set of examples used to learn the classification model is called the training data set. Tasks related to classification include regression, which builds a model from training data to predict numerical values, and clustering, which groups examples to form categories. Classification belongs to the category of supervised learning, distinguished from unsupervised learning. In supervised learning, the training data consist of pairs of input data (typically vectors) and desired outputs, while in unsupervised learning there is no a priori output. Classification has various applications, such as learning from a patient database to diagnose a disease based on the symptoms of a patient, analyzing credit card transactions to identify fraudulent transactions, automatic recognition of letters or digits based on handwriting samples, and distinguishing highly active compounds from inactive ones based on the structures of compounds for drug discovery.
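
A minimal supervised-learning sketch with scikit-learn: learn a classification model from pre-classified training examples, then classify unseen data. The Iris dataset and the decision tree are illustrative choices, not tied to the applications named above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The class attribute is y; the predictor attributes are the columns of X.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # learn from training set
print(model.score(X_test, y_test))                      # accuracy on unseen examples
```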


2008 ◽  
pp. 26-49 ◽  
Author(s):  
Yong Shi ◽  
Yi Peng ◽  
Gang Kou ◽  
Zhengxin Chen

This chapter provides an overview of a series of multiple criteria optimization-based data mining methods, which utilize multiple criteria programming (MCP) to solve data mining problems, and outlines some research challenges and opportunities for the data mining community. To achieve these goals, this chapter first introduces the basic notions and mathematical formulations for multiple criteria optimization-based classification models, including the multiple criteria linear programming model, multiple criteria quadratic programming model, and multiple criteria fuzzy linear programming model. Then it presents the real-life applications of these models in credit card scoring management, HIV-1 associated dementia (HAD) neuronal damage and dropout, and network intrusion detection. Finally, the chapter discusses research challenges and opportunities.
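
As a hedged illustration of the two-criteria idea behind the multiple criteria linear programming (MCLP) classification model, a generic formulation in the style of the cited literature minimizes the overlap $\alpha_i$ of records with the boundary score $b$ while maximizing the distance $\beta_i$ of correctly classified records from it; the notation here is generic, not copied from the chapter:

```latex
\begin{aligned}
\min\ & \sum_i \alpha_i \qquad \max\ \sum_i \beta_i \\
\text{s.t.}\ & A_i x = b + \alpha_i - \beta_i, \quad A_i \in \mathcal{B}\ \text{(bad records)}, \\
             & A_i x = b - \alpha_i + \beta_i, \quad A_i \in \mathcal{G}\ \text{(good records)}, \\
             & \alpha_i, \beta_i \ge 0,
\end{aligned}
```

where $x$ is the vector of attribute weights and $A_i$ the attribute values of record $i$. In practice the two objectives are combined into a single weighted objective, yielding a standard linear program.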


Author(s):  
Susan Imberman ◽  
Abdullah Uz Tansel

With the advent of mass storage devices, databases have become larger and larger. Point-of-sale data, patient medical data, scientific data, and credit card transactions are just a few sources of the ever-increasing amounts of data. These large datasets provide a rich source of useful information. Knowledge Discovery in Databases (KDD) is a paradigm for the analysis of these large datasets. KDD uses various methods from such diverse fields as machine learning, artificial intelligence, pattern recognition, database management and design, statistics, expert systems, and data visualization.

