Outlier Detection Ensemble with Embedded Feature Selection

2020 · Vol 34 (04) · pp. 3503-3512
Author(s): Li Cheng, Yijie Wang, Xinwang Liu, Bin Li

Feature selection plays an important role in improving the performance of outlier detection, especially on noisy data. Existing methods usually perform feature selection and outlier scoring separately, which can select feature subsets that do not optimally serve outlier detection, leading to unsatisfactory performance. In this paper, we propose an outlier detection ensemble framework with embedded feature selection (ODEFS) to address this issue. Specifically, for each random sub-sampling based learning component, ODEFS unifies feature selection and outlier detection in a pairwise ranking formulation to learn feature subsets that are tailored to the outlier detection method. Moreover, we adopt thresholded self-paced learning to simultaneously optimize feature selection and example selection, which helps improve the reliability of the training set. We then design an alternating algorithm with proven convergence to solve the resulting optimization problem. In addition, we analyze the generalization error bound of the proposed framework, which provides a theoretical guarantee for the method and insightful practical guidance. Comprehensive experimental results on 12 real-world datasets from diverse domains validate the superiority of the proposed ODEFS.
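To make the pairwise-ranking-with-embedded-selection idea concrete, here is a minimal sketch: a linear scoring function whose weights are sparsified by L1 shrinkage (the embedded feature selection), trained on ranking pairs and filtered by a thresholded self-paced rule. The linear model, the pseudo-labeling scheme, and all hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal, illustrative sketch of the ODEFS idea: pairwise ranking with
# embedded feature selection and thresholded self-paced example selection.
import numpy as np

def odefs_sketch(X, n_rounds=50, lr=0.01, l1=0.05, sp_threshold=1.0):
    """X: (n_samples, n_features) data; returns sparse feature weights."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    w = rng.normal(scale=0.1, size=d)

    for _ in range(n_rounds):
        # Pseudo-labels: treat the top-scored points as candidate outliers.
        scores = X @ w
        order = np.argsort(scores)
        inliers, outliers = order[: n // 2], order[-max(n // 10, 1):]

        # Pairwise ranking hinge loss: outliers should out-score inliers.
        i = rng.choice(outliers, size=64)
        j = rng.choice(inliers, size=64)
        margins = 1.0 - (X[i] @ w - X[j] @ w)
        losses = np.maximum(margins, 0.0)

        # Thresholded self-paced selection: keep only easy (reliable) pairs.
        keep = losses < sp_threshold

        # Gradient step on selected active pairs, plus L1 shrinkage so that
        # uninformative features are driven to zero (embedded selection).
        active = keep & (margins > 0)
        grad = -(X[i[active]] - X[j[active]]).sum(axis=0) / max(active.sum(), 1)
        w -= lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
    return w
```

Features whose weights are driven to zero are effectively deselected; in the paper's ensemble, many such components would be trained on random sub-samples and their outlier scores combined.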

2020 · Vol 34 (04) · pp. 6853-6860
Author(s): Xuchao Zhang, Xian Wu, Fanglan Chen, Liang Zhao, Chang-Tien Lu

The success of training accurate models strongly depends on the availability of a sufficient collection of precisely labeled data. However, real-world datasets contain erroneously labeled samples that substantially hinder the performance of machine learning models. Meanwhile, well-labeled data is usually expensive to obtain, and only a limited amount is available for training. In this paper, we consider the problem of training a robust model using large-scale noisy data in conjunction with a small set of clean data. To leverage the information contained in the clean labels, we propose a novel self-paced robust learning algorithm (SPRL) that trains the model progressively, from more reliable (clean) data instances to less reliable (noisy) ones, under the supervision of the well-labeled data. The self-paced learning process hedges the risk of selecting corrupted data into the training set. Moreover, theoretical analyses on the convergence of the proposed algorithm are provided under mild assumptions. Extensive experiments on synthetic and real-world datasets demonstrate that our proposed approach achieves considerable improvements in effectiveness and robustness over existing methods.
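A minimal sketch of the clean-to-noisy training schedule described above, assuming a ridge-regression learner and a geometric threshold schedule (both our choices for illustration, not the paper's exact formulation):

```python
# A hedged sketch of the self-paced robust learning (SPRL) loop: start from
# the trusted clean set, then gradually admit noisy samples whose loss falls
# under a growing age parameter.
import numpy as np

def sprl_sketch(X_clean, y_clean, X_noisy, y_noisy,
                n_rounds=10, lam_init=0.5, lam_growth=1.5, ridge=1e-2):
    X = np.vstack([X_clean, X_noisy])
    y = np.concatenate([y_clean, y_noisy])
    clean = np.zeros(len(y), dtype=bool)
    clean[: len(y_clean)] = True

    selected = clean.copy()            # begin with the well-labeled data only
    lam = lam_init
    d = X.shape[1]
    for _ in range(n_rounds):
        # Fit a ridge regressor on the currently selected samples.
        A = X[selected].T @ X[selected] + ridge * np.eye(d)
        w = np.linalg.solve(A, X[selected].T @ y[selected])

        # Self-paced step: admit noisy samples that now look reliable.
        losses = (X @ w - y) ** 2
        selected = clean | (losses < lam)
        lam *= lam_growth              # relax the threshold each round
    return w, selected
```

The clean set is always retained, so the model is never trained on noisy data alone; noisy samples enter only once the current model finds them consistent with what it has learned.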


Author(s): Deepak P, Savitha Sam Abraham

An outlier detection method may be considered fair over specified sensitive attributes if the results of outlier detection are not skewed toward particular groups defined on such sensitive attributes. In this paper, we consider the task of fair outlier detection. Our focus is on fair outlier detection over multiple multi-valued sensitive attributes (e.g., gender, race, religion, nationality and marital status, among others), a setting with broad applications across modern data scenarios. We propose a fair outlier detection method, FairLOF, inspired by the popular LOF formulation for neighborhood-based outlier detection. We outline ways in which unfairness could be induced within LOF and develop three heuristic principles to enhance fairness, which form the basis of the FairLOF method. This being a novel task, we develop an evaluation framework for fair outlier detection and use it to benchmark FairLOF on the quality and fairness of results. Through an extensive empirical evaluation over real-world datasets, we illustrate that FairLOF achieves significant improvements in fairness, sometimes at only marginal degradation in result quality as measured against the fairness-agnostic LOF method. We also show that a generalization of our method, named FairLOF-Flex, opens possibilities for deepening fairness in outlier detection beyond what is offered by FairLOF.
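The fairness notion above can be illustrated with a small audit: run scikit-learn's fairness-agnostic LocalOutlierFactor and compare per-group flag rates over one sensitive attribute. This sketch only measures the skew; FairLOF's own corrective heuristics are not reproduced, and the parameters below are assumptions.

```python
# Audit sketch: does vanilla LOF flag outliers at skewed rates across the
# groups of a sensitive attribute?
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def group_flag_rates(X, sensitive, contamination=0.05, k=20):
    """Fraction of points flagged as outliers per sensitive group."""
    lof = LocalOutlierFactor(n_neighbors=k, contamination=contamination)
    flags = lof.fit_predict(X) == -1          # -1 marks detected outliers
    return {g: flags[sensitive == g].mean() for g in np.unique(sensitive)}
```

Under a fair detector, these per-group rates should be roughly equal; a large gap is exactly the skew that FairLOF's heuristics aim to reduce.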


Author(s): Mengdi Huai, Hongfei Xue, Chenglin Miao, Liuyi Yao, Lu Su, ...

As an effective way to learn a distance metric between pairs of samples, deep metric learning (DML) has drawn significant attention in recent years. The key idea of DML is to learn a set of hierarchical nonlinear mappings using deep neural networks and then project data samples into a new feature space for comparison or matching. Although DML has achieved practical success in many applications, no existing work theoretically analyzes the generalization error bound for DML, which measures how well a learned DML model can be expected to perform on unseen data. In this paper, we fill this research gap and derive a generalization error bound for DML. Additionally, based on the derived bound, we propose a novel DML method (called ADroDML) that can adaptively learn the retention rates of DML models with dropout in a theoretically justified way. Compared with existing DML works that require predefined retention rates, ADroDML learns the retention rates in an optimal way and achieves better performance. We also conduct experiments on real-world datasets to verify the findings derived from the generalization error bound and to demonstrate the effectiveness of the proposed adaptive DML method.
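As a rough illustration of learning a dropout retention rate by gradient descent, here is a PyTorch sketch using a relaxed-Bernoulli (concrete-dropout-style) mask as a stand-in for the paper's theoretically derived update; the parameterization, temperature, and initialization are all our assumptions.

```python
# A hedged sketch of the core idea in ADroDML: make the dropout retention
# rate a learnable parameter instead of a fixed constant.
import torch
import torch.nn as nn

class LearnableDropout(nn.Module):
    def __init__(self, init_retain=0.8, temperature=0.1):
        super().__init__()
        # Parameterize the retention rate p in (0, 1) through a logit.
        self.logit_p = nn.Parameter(torch.logit(torch.tensor(init_retain)))
        self.temperature = temperature

    def forward(self, x):
        p = torch.sigmoid(self.logit_p)         # current retention rate
        if not self.training:
            return x                            # inverted dropout: no-op at test time
        u = torch.rand_like(x)
        # Relaxed Bernoulli mask, differentiable w.r.t. logit_p.
        mask = torch.sigmoid(
            (torch.log(p) - torch.log1p(-p)
             + torch.log(u) - torch.log1p(-u)) / self.temperature)
        return x * mask / p                     # inverted-dropout scaling
```

Such a module could replace fixed-rate dropout inside an embedding network trained with, e.g., a triplet loss, letting the retention rate be optimized jointly with the metric.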


Data · 2020 · Vol 6 (1) · pp. 1
Author(s): Ahmed Elmogy, Hamada Rizk, Amany M. Sarhan

In data mining, outlier detection is a major challenge, as it plays an important role in many applications such as medical data analysis, image processing, fraud detection, and intrusion detection. An extensive variety of clustering-based approaches have been developed to detect outliers. However, they are by nature time consuming, which restricts their use in real-time applications. Furthermore, outlier detection requests are handled one at a time, meaning that each request is initiated individually with a particular set of parameters. In this paper, the first on-the-fly clustering-based outlier detection framework (OFCOD) is presented. OFCOD enables analysts to effectively detect outliers at request time, even within huge datasets. The proposed framework has been tested and evaluated using two real-world datasets with different features and applications: one with 699 records, and another with five million records. The experimental results show that the proposed framework outperforms existing approaches across several evaluation metrics.
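The "cluster once, answer requests on the fly" pattern can be sketched as follows, using KMeans and a z-score-style distance test as stand-ins for OFCOD's actual pipeline (both are assumptions for illustration):

```python
# Illustrative sketch: precompute a clustering, then score each incoming
# request against cached cluster statistics instead of re-clustering.
import numpy as np
from sklearn.cluster import KMeans

class ClusteredOutlierIndex:
    def __init__(self, X, n_clusters=8, seed=0):
        self.km = KMeans(n_clusters=n_clusters, random_state=seed).fit(X)
        dists = np.linalg.norm(
            X - self.km.cluster_centers_[self.km.labels_], axis=1)
        # Cache per-cluster mean/std of member-to-centroid distances.
        self.mu = np.array([dists[self.km.labels_ == c].mean()
                            for c in range(n_clusters)])
        self.sigma = np.array([dists[self.km.labels_ == c].std() + 1e-9
                               for c in range(n_clusters)])

    def is_outlier(self, x, z=3.0):
        """Answer a single request using the cached statistics only."""
        c = self.km.predict(x.reshape(1, -1))[0]
        dist = np.linalg.norm(x - self.km.cluster_centers_[c])
        return (dist - self.mu[c]) / self.sigma[c] > z
```

Because the per-cluster statistics are cached, each request costs only a nearest-centroid lookup rather than a fresh clustering pass, which is what makes on-time answers feasible on large datasets.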


2021 · Vol 68 (4) · pp. 1-25
Author(s): Thodoris Lykouris, Sergei Vassilvitskii

Traditional online algorithms encapsulate decision making under uncertainty and give ways to hedge against all possible future events while guaranteeing a nearly optimal solution, as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error. In this work, we develop a framework for augmenting online algorithms with a machine-learned predictor to achieve competitive ratios that provably improve upon unconditional worst-case lower bounds when the predictor has low error. Our approach treats the predictor as a complete black box and does not depend on its inner workings or the exact distribution of its errors. We apply this framework to the traditional caching problem: creating an eviction strategy for a cache of size k. We demonstrate that naively following the oracle's recommendations may lead to very poor performance, even when the average error is quite low. Instead, we show how to modify the Marker algorithm to take the predictions into account, and prove that this combined approach achieves a competitive ratio that both (i) decreases as the predictor's error decreases and (ii) is always capped at O(log k), which can be achieved without any assistance from the predictor. We complement our results with an empirical evaluation of our algorithm on real-world datasets and show that it performs well empirically even when using simple off-the-shelf predictions.
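A compact sketch of a prediction-augmented Marker-style eviction rule follows: among unmarked pages, evict the one whose predicted next request is furthest away, and start a new phase when every cached page is marked. The predict_next interface is our assumption, and the paper's combined algorithm includes safeguards beyond this sketch to guarantee the O(log k) cap.

```python
# Marker-style caching with an eviction hint from a predictor.
import random

def marker_with_predictions(requests, k, predict_next):
    """predict_next(page, t): predicted next request time of `page` after t."""
    cache, marked = set(), set()
    misses = 0
    for t, page in enumerate(requests):
        if page in cache:
            marked.add(page)                 # a hit marks the page
            continue
        misses += 1
        if len(cache) >= k:
            unmarked = cache - marked
            if not unmarked:                 # phase ends: unmark everything
                marked.clear()
                unmarked = set(cache)
            # Trust the oracle among unmarked pages; ties broken randomly.
            victim = max(unmarked,
                         key=lambda p: (predict_next(p, t), random.random()))
            cache.remove(victim)
        cache.add(page)
        marked.add(page)
    return misses
```

As the abstract notes, trusting the oracle alone can perform very poorly; restricting evictions to unmarked pages preserves Marker's phase structure, which is what keeps the worst-case competitive ratio bounded.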

