Improving the Efficiency of Inclusion Dependency Detection

Author(s): Nuhad Shaabani, Christoph Meinel
2015, Vol. 25 (2), pp. 223-241
Author(s): Philipp Langer, Felix Naumann

Author(s): Fabrizio Angiulli

Data mining techniques can be grouped into four main categories: clustering, classification, dependency detection, and outlier detection. Clustering is the process of partitioning a set of objects into homogeneous groups, or clusters. Classification is the task of assigning objects to one of several predefined categories. Dependency detection searches for pairs of attribute sets that exhibit some degree of correlation in the data set at hand. The outlier detection task can be defined as follows: “Given a set of data points or objects, find the objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data”. These exceptional objects are also referred to as outliers.

Most of the early methods for outlier identification were developed in the field of statistics (Hawkins, 1980; Barnett & Lewis, 1994). Hawkins’ definition of an outlier clarifies the approach: “An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”. Indeed, statistical techniques assume that the given data set follows a distribution model. Outliers are the points that satisfy a discordancy test, that is, the points that lie significantly far from their expected position under the hypothesized distribution.

Many clustering, classification and dependency detection methods produce outliers as a by-product of their main task. For example, in classification, mislabeled objects are considered outliers and are removed from the training set to improve the accuracy of the resulting classifier, while in clustering, objects that do not strongly belong to any cluster are considered outliers. Nevertheless, searching for outliers with techniques designed for tasks other than outlier detection may not be advantageous. For example, clusters can be distorted by outliers, so the quality of the outliers returned is itself affected by their presence. Moreover, besides returning solutions of higher quality, dedicated outlier detection algorithms can be vastly more efficient than non-ad-hoc algorithms.

While in many contexts outliers are treated as noise to be eliminated, as pointed out elsewhere, “one person’s noise could be another person’s signal”, and thus outliers themselves can be of great interest. Outlier mining is used in telecom and credit card fraud detection to spot atypical usage of telecom services or credit cards, in intrusion detection to detect unauthorized accesses, in medical analysis to test abnormal reactions to new medical therapies, in marketing and customer segmentation to identify customers spending much more or much less than the average customer, in surveillance systems, in data cleaning, and in many other fields.
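As a concrete illustration of the dissimilarity-based notion of outlier quoted above, the following sketch flags a point as an outlier when at least a fraction p of the remaining points lie farther than a radius d from it (a Knorr-and-Ng-style distance-based definition). The one-dimensional data, the parameter values, and the function name are illustrative assumptions, not details taken from any of the cited works.

```python
# A minimal sketch of distance-based outlier detection: a point is flagged
# when at least a fraction p of the other points lie farther than distance d
# from it. Parameters p and d are illustrative choices.
def distance_based_outliers(points, p=0.95, d=5.0):
    """Return the points that at least a fraction p of the others lie
    farther than d away from."""
    outliers = []
    for i, x in enumerate(points):
        others = [y for j, y in enumerate(points) if j != i]
        far = sum(1 for y in others if abs(x - y) > d)
        if others and far / len(others) >= p:
            outliers.append(x)
    return outliers

# Example: the isolated value 42 is far from every other observation.
data = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 42.0]
print(distance_based_outliers(data))  # -> [42.0]
```

The quadratic scan over all pairs keeps the sketch short; the dedicated outlier detection algorithms mentioned above gain their efficiency precisely by avoiding such exhaustive comparisons, for example through indexing or pruning.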


Entropy, 2020, Vol. 22 (7), pp. 741
Author(s): Jorge Augusto Karell-Albo, Carlos Miguel Legón-Pérez, Evaristo José Madarro-Capó, Omar Rojas, Guillermo Sosa-Gómez

The analysis of independence between statistical randomness tests has received considerable attention in the literature recently. Detecting dependencies between statistical randomness tests makes it possible to discriminate tests that measure similar characteristics and thus to minimize the number of tests that need to be applied. In this work, a method for detecting statistical dependencies using mutual information is proposed. The main advantage of mutual information is its ability to detect nonlinear correlations, which cannot be captured by the linear correlation coefficient used in previous work. The method analyzes the correlations between the tests of the National Institute of Standards and Technology battery, which is used as a standard in the evaluation of randomness. The experimental results show the existence of statistical dependencies between the tests that had not been detected previously.
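A minimal sketch of the core idea is given below, assuming each randomness test is reduced to a discrete (e.g., pass/fail) outcome per evaluated sequence; the reduction to binary outcomes, the example data, and the function name are illustrative assumptions rather than details of the proposed method.

```python
# A minimal sketch of estimating mutual information between the outcomes of
# two randomness tests applied to the same sequences. Each test is assumed
# (for illustration) to yield a discrete pass/fail outcome per sequence.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired discrete observations."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Example: outcomes of two hypothetical tests over the same ten sequences.
# The tests agree on most sequences, so the estimate is clearly above zero,
# hinting that they measure a similar characteristic.
test_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
test_b = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
print(round(mutual_information(test_a, test_b), 3))
```

Because mutual information vanishes only for independent variables, estimates clearly above zero suggest that two tests capture overlapping characteristics, including purely nonlinear relationships that a linear correlation coefficient would miss.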

