Unsupervised Outlier Detection: A Meta-Learning Algorithm Based on Feature Selection

Outlier detection refers to the problem of the identification and, where appropriate, the elimination of anomalous observations from data. Such anomalous observations can emerge due to a variety of reasons, including human or mechanical errors, fraudulent behaviour as well as environmental or systematic changes, occurring either naturally or purposefully. The accurate and timely detection of deviant observations allows for the early identification of potentially extensive problems, such as fraud or system failures, before they escalate. Several unsupervised outlier detection methods have been developed; however, there is no single best algorithm or family of algorithms, as typically each relies on a measure of `outlierness’ such as density or distance, ignoring other measures. To add to that, in an unsupervised setting, the absence of ground-truth labels makes finding a single best algorithm an impossible feat even for a single given dataset. In this study, a new meta-learning algorithm for unsupervised outlier detection is introduced in order to mitigate this problem. The proposed algorithm, in a fully unsupervised manner, attempts not only to combine the best of many worlds from the existing techniques through ensemble voting but also mitigate any undesired shortcomings by employing an unsupervised feature selection strategy in order to identify the most informative algorithms for a given dataset. The proposed methodology was evaluated extensively through experimentation, where it was benchmarked and compared against a wide range of commonly-used techniques for outlier detection. Results obtained using a variety of widely accepted datasets demonstrated its usefulness and its state-of-the-art results as it topped the Friedman ranking test for both the area under receiver operating characteristic (ROC) curve and precision metrics when averaged over five independent trials.

Download Full-text

Benchmarking Unsupervised Outlier Detection with Realistic Synthetic Data

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3441453 ◽

2021 ◽

Vol 15 (4) ◽

pp. 1-20

Author(s):

Georg Steinbuss ◽

Klemens Böhm

Keyword(s):

Outlier Detection ◽

Synthetic Data ◽

Real Data ◽

Detection Methods ◽

Quality Of Data ◽

Benchmark Data ◽

Core Idea ◽

Generic Process ◽

Unsupervised Outlier Detection

Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with various and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus allows for a more meaningful evaluation of detection methods in principle. Nonetheless, there have only been few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty to arrive at a good coverage of different domains with synthetic data. In this work, we propose a generic process for the generation of datasets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. We propose and describe a generic process for the benchmarking of unsupervised outlier detection, as sketched so far. We then describe three instantiations of this generic process that generate outliers with specific characteristics, like local outliers. To validate our process, we perform a benchmark with state-of-the-art detection methods and carry out experiments to study the quality of data reconstructed in this way. Next to showcasing the workflow, this confirms the usefulness of our proposed process. In particular, our process yields regular instances close to the ones from real data. Summing up, we propose and validate a new and practical process for the benchmarking of unsupervised outlier detection.

Download Full-text

Isolation-based feature Selection for Unsupervised Outlier Detection

Annual Conference of the PHM Society ◽

10.36001/phmconf.2019.v11i1.824 ◽

2019 ◽

Vol 11 (1) ◽

Author(s):

Qibo Yang ◽

Jaskaran Singh ◽

Jay Lee

Keyword(s):

Feature Selection ◽

Outlier Detection ◽

Simulated Data ◽

Support Vector ◽

Detection Algorithms ◽

Complex Interactions ◽

Laplacian Score ◽

High Dimensional Datasets ◽

Isolation Forest ◽

Unsupervised Outlier Detection

For high-dimensional datasets, bad features and complex interactions between features can cause high computational costs and make outlier detection algorithms inefficient. Most feature selection methods are designed for supervised classification and regression, and limited works are specifically for unsupervised outlier detection. This paper proposes a novel isolation-based feature selection (IBFS) method for unsupervised outlier detection. It is based on the training process of isolation forest. When a point of a feature is used to split the data, the imbalanced distribution of split data is measured and used to quantify how strong this feature can detect outliers. We also compare the proposed method with variance, Laplacian score and kurtosis. These methods are benchmarked on simulated data to show their characteristics. Then we evaluate the performance using one-class support vector machine, isolation forest and local outlier factor on several real-word datasets. The evaluation results show that the proposed method can improve the performance of isolation forest, and its results are similar to and sometimes better than another useful outlier indicator: kurtosis, which demonstrate the effectiveness of the proposed method. We also notice that sometimes variance and Laplacian score has similar performance on the datasets.

Download Full-text

Learning low-dimensional manifolds under the L0-norm constraint for unsupervised outlier detection

International Journal of Data Science and Analytics ◽

10.1007/s41060-021-00269-x ◽

2021 ◽

Author(s):

Yoshinao Ishii ◽

Satoshi Koide ◽

Keiichiro Hayakawa

Keyword(s):

Outlier Detection ◽

Data Matrix ◽

Detection Methods ◽

Detection Accuracy ◽

Error Matrix ◽

Nonlinear Features ◽

Norm Constraint ◽

Low Dimensional ◽

Artificial Datasets ◽

Unsupervised Outlier Detection

AbstractUnsupervised outlier detection without the need for clean data has attracted great attention because it is suitable for real-world problems as a result of its low data collection costs. Reconstruction-based methods are popular approaches for unsupervised outlier detection. These methods decompose a data matrix into low-dimensional manifolds and an error matrix. Then, samples with a large error are detected as outliers. To achieve high outlier detection accuracy, when data are corrupted by large noise, the detection method should have the following two properties: (1) it should be able to decompose the data under the L0-norm constraint on the error matrix and (2) it should be able to reflect the nonlinear features of the data in the manifolds. Despite significant efforts, no method with both of these properties exists. To address this issue, we propose a novel reconstruction-based method: “L0-norm constrained autoencoders (L0-AE).” L0-AE uses autoencoders to learn low-dimensional manifolds that capture the nonlinear features of the data and uses a novel optimization algorithm that can decompose the data under the L0-norm constraints on the error matrix. This novel L0-AE algorithm provably guarantees the convergence of the optimization if the autoencoder is trained appropriately. The experimental results show that L0-AE is more robust, accurate and stable than other unsupervised outlier detection methods not only for artificial datasets with corrupted samples but also artificial datasets with well-known outlier distributions and real datasets. Additionally, the results show that the accuracy of L0-AE is moderately stable to changes in the parameter of the constrained term, and for real datasets, L0-AE achieves higher accuracy than the baseline non-robustified method for most parameter values.

Download Full-text

APPLICATION OF ENSEMBLE ALGORITHM INTEGRATING MULTIPLE CRITERIA FEATURE SELECTION IN CORONARY HEART DISEASE DETECTION

Biomedical Engineering Applications Basis and Communications ◽

10.4015/s1016237217500430 ◽

2017 ◽

Vol 29 (06) ◽

pp. 1750043 ◽

Cited By ~ 7

Author(s):

Cai-Jie Qin ◽

Qiang Guan ◽

Xin-Pei Wang

Keyword(s):

Machine Learning ◽

Coronary Heart Disease ◽

Feature Selection ◽

Heart Disease ◽

Subjective Experience ◽

Learning Algorithm ◽

Low Cost ◽

Detection Methods ◽

Disease Detection ◽

Ensemble Algorithm

Conventional coronary heart disease (CHD) detection methods are expensive, rely much on doctors’ subjective experience, and some of them have side effects. In order to obtain rapid, high-precision, low-cost, non-invasive detection results, several methods in machine learning were attempted for CHD detection in this paper. The paper adopted multiple evaluation criteria to measure features, combined with heuristic search strategy and seven common classification algorithms to verify the validity and the importance of feature selection (FS) in the Z-Alizadeh Sani CHD dataset. On this basis, a novelty algorithm integrating multiple FS methods into the ensemble algorithm (ensemble algorithm based on multiple feature selection, EA-MFS) was further proposed. The algorithm adopted Bagging approach to increase data diversity, used the aforementioned MFS methods for functional perturbation, employed major voting method to carry out the decision results, and performed selective integration in terms of the difference of base classifiers in the ensemble process. Compared with the single FS method, the EA-MFS algorithm could comprehensively describe the relationship of features, enhance the classification effect, and displayed better robustness. That meant the EA-MFS algorithm could reduce the dependence on dataset and strengthen the stability of the algorithm, all of which were of great significance for the clinical application of machine learning algorithm in coronary heart disease detection.

Download Full-text

Unsupervised Feature Selection for Outlier Detection on Streaming Data to Enhance Network Security

Applied Sciences ◽

10.3390/app112412073 ◽

2021 ◽

Vol 11 (24) ◽

pp. 12073

Author(s):

Michael Heigl ◽

Enrico Weigelt ◽

Dalibor Fiala ◽

Martin Schramm

Keyword(s):

Feature Selection ◽

Outlier Detection ◽

Data Streams ◽

State Of The Art ◽

Streaming Data ◽

Detection Methods ◽

Unsupervised Feature Selection ◽

Detection Algorithms ◽

Efficient Detection ◽

Selection For

Over the past couple of years, machine learning methods—especially the outlier detection ones—have anchored in the cybersecurity field to detect network-based anomalies rooted in novel attack patterns. However, the ubiquity of massive continuously generated data streams poses an enormous challenge to efficient detection schemes and demands fast, memory-constrained online algorithms that are capable to deal with concept drifts. Feature selection plays an important role when it comes to improve outlier detection in terms of identifying noisy data that contain irrelevant or redundant features. State-of-the-art work either focuses on unsupervised feature selection for data streams or (offline) outlier detection. Substantial requirements to combine both fields are derived and compared with existing approaches. The comprehensive review reveals a research gap in unsupervised feature selection for the improvement of outlier detection methods in data streams. Thus, a novel algorithm for Unsupervised Feature Selection for Streaming Outlier Detection, denoted as UFSSOD, will be proposed, which is able to perform unsupervised feature selection for the purpose of outlier detection on streaming data. Furthermore, it is able to determine the amount of top-performing features by clustering their score values. A generic concept that shows two application scenarios of UFSSOD in conjunction with off-the-shell online outlier detection algorithms has been derived. Extensive experiments have shown that a promising feature selection mechanism for streaming data is not applicable in the field of outlier detection. Moreover, UFSSOD, as an online capable algorithm, yields comparable results to a state-of-the-art offline method trimmed for outlier detection.

Download Full-text

Identifying mixture copula components using outlier detection methods and goodness-of-fit tests

The Journal of Risk ◽

10.21314/jor.2014.288 ◽

2014 ◽

Vol 16 (4) ◽

pp. 61-101 ◽

Cited By ~ 1

Author(s):

Gregor Weiß

Keyword(s):

Outlier Detection ◽

Goodness Of Fit ◽

Detection Methods ◽

Goodness Of Fit Tests

Download Full-text

Detectivity optimization to detect of ultraweak light fluxes with an EM-CCD as binary photon counter array

Scientific Reports ◽

10.1038/s41598-021-82611-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Ibtissame Khaoua ◽

Guillaume Graciani ◽

Andrey Kim ◽

François Amblard

Keyword(s):

Dynamic Range ◽

Signal To Noise Ratio ◽

Photon Counting ◽

Detection Methods ◽

Strong Nonlinearity ◽

Wide Range ◽

Depth Analysis ◽

Spatially Extended ◽

Extended Sources ◽

Photon Counting Mode

AbstractFor a wide range of purposes, one faces the challenge to detect light from extremely faint and spatially extended sources. In such cases, detector noises dominate over the photon noise of the source, and quantum detectors in photon counting mode are generally the best option. Here, we combine a statistical model with an in-depth analysis of detector noises and calibration experiments, and we show that visible light can be detected with an electron-multiplying charge-coupled devices (EM-CCD) with a signal-to-noise ratio (SNR) of 3 for fluxes less than $$30\,{\text{photon}}\,{\text{s}}^{ - 1} \,{\text{cm}}^{ - 2}$$ 30 photon s - 1 cm - 2 . For green photons, this corresponds to 12 aW $${\text{cm}}^{ - 2}$$ cm - 2 ≈ $$9{ } \times 10^{ - 11}$$ 9 × 10 - 11 lux, i.e. 15 orders of magnitude less than typical daylight. The strong nonlinearity of the SNR with the sampling time leads to a dynamic range of detection of 4 orders of magnitude. To detect possibly varying light fluxes, we operate in conditions of maximal detectivity $${\mathcal{D}}$$ D rather than maximal SNR. Given the quantum efficiency $$QE\left( \lambda \right)$$ Q E λ of the detector, we find $${ \mathcal{D}} = 0.015\,{\text{photon}}^{ - 1} \,{\text{s}}^{1/2} \,{\text{cm}}$$ D = 0.015 photon - 1 s 1 / 2 cm , and a non-negligible sensitivity to blackbody radiation for T > 50 °C. This work should help design highly sensitive luminescence detection methods and develop experiments to explore dynamic phenomena involving ultra-weak luminescence in biology, chemistry, and material sciences.

Download Full-text

Machine learning algorithm to identifies fraud emails with feature selection

IOP Conference Series Materials Science and Engineering ◽

10.1088/1757-899x/1088/1/012011 ◽

2021 ◽

Vol 1088 (1) ◽

pp. 012011

Author(s):

Anita Sindar Sinaga ◽

Musthafa Haris Munandar ◽

Arjon Samuel Sitio

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithm ◽

Machine Learning Algorithm

Download Full-text

Defects and uncertainties of adhesively bonded composite joints

SN Applied Sciences ◽

10.1007/s42452-021-04753-8 ◽

2021 ◽

Vol 3 (9) ◽

Author(s):

Sadik Omairey ◽

Nithin Jayasree ◽

Mihalis Kazilas

Keyword(s):

Composite Materials ◽

Composite Structures ◽

Production Efficiency ◽

Adhesive Bonding ◽

Detection Methods ◽

Bonded Joints ◽

Adhesively Bonded ◽

Adhesively Bonded Joints ◽

Wide Range ◽

Bonded Composite

AbstractThe increasing use of fibre reinforced polymer composite materials in a wide range of applications increases the use of similar and dissimilar joints. Traditional joining methods such as welding, mechanical fastening and riveting are challenging in composites due to their material properties, heterogeneous nature, and layup configuration. Adhesive bonding allows flexibility in materials selection and offers improved production efficiency from product design and manufacture to final assembly, enabling cost reduction. However, the performance of adhesively bonded composite structures cannot be fully verified by inspection and testing due to the unforeseen nature of defects and manufacturing uncertainties presented in this joining method. These uncertainties can manifest as kissing bonds, porosity and voids in the adhesive. As a result, the use of adhesively bonded joints is often constrained by conservative certification requirements, limiting the potential of composite materials in weight reduction, cost-saving, and performance. There is a need to identify these uncertainties and understand their effect when designing these adhesively bonded joints. This article aims to report and categorise these uncertainties, offering the reader a reliable and inclusive source to conduct further research, such as the development of probabilistic reliability-based design optimisation, sensitivity analysis, defect detection methods and process development.

Download Full-text

Information-Theoretic Generalization Bounds for Meta-Learning and Applications

Entropy ◽

10.3390/e23010126 ◽

2021 ◽

Vol 23 (1) ◽

pp. 126

Author(s):

Sharu Theresa Jose ◽

Osvaldo Simeone

Keyword(s):

Learning Algorithm ◽

Broad Class ◽

Performance Measure ◽

Training Data ◽

Learning To Learn ◽

Data Set ◽

Information Theoretic ◽

Meta Learning ◽

Task Training ◽

Test Sets

Meta-learning, or “learning to learn”, refers to techniques that infer an inductive bias from data corresponding to multiple related tasks with the goal of improving the sample efficiency for new, previously unobserved, tasks. A key performance measure for meta-learning is the meta-generalization gap, that is, the difference between the average loss measured on the meta-training data and on a new, randomly selected task. This paper presents novel information-theoretic upper bounds on the meta-generalization gap. Two broad classes of meta-learning algorithms are considered that use either separate within-task training and test sets, like model agnostic meta-learning (MAML), or joint within-task training and test sets, like reptile. Extending the existing work for conventional learning, an upper bound on the meta-generalization gap is derived for the former class that depends on the mutual information (MI) between the output of the meta-learning algorithm and its input meta-training data. For the latter, the derived bound includes an additional MI between the output of the per-task learning procedure and corresponding data set to capture within-task uncertainty. Tighter bounds are then developed for the two classes via novel individual task MI (ITMI) bounds. Applications of the derived bounds are finally discussed, including a broad class of noisy iterative algorithms for meta-learning.

Download Full-text