Benchmarking Unsupervised Outlier Detection with Realistic Synthetic Data

2021 ◽  
Vol 15 (4) ◽  
pp. 1-20
Author(s):  
Georg Steinbuss ◽  
Klemens Böhm

Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with various and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus allows for a more meaningful evaluation of detection methods in principle. Nonetheless, there have been only a few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty of achieving good coverage of different domains with synthetic data. In this work, we propose and describe a generic process for generating datasets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. We then describe three instantiations of this generic process that generate outliers with specific characteristics, like local outliers. To validate our process, we perform a benchmark with state-of-the-art detection methods and carry out experiments to study the quality of data reconstructed in this way. Besides showcasing the workflow, this confirms the usefulness of our proposed process. In particular, our process yields regular instances close to the ones from real data. Summing up, we propose and validate a new and practical process for the benchmarking of unsupervised outlier detection.
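
For illustration, here is a minimal sketch of the core idea (not the authors' generator; the Gaussian reconstruction, displacement factor and toy data are assumptions): regular instances are reconstructed from a fitted model of the real data, and local outliers are planted by displacing reconstructed points a few local standard deviations away, so that ground-truth labels are available for benchmarking.

```python
# Minimal sketch: reconstruct "regular" instances from real benchmark data,
# then plant local outliers with known labels for benchmarking.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(500, 2))  # stand-in for a real dataset

# "Reconstruct" regulars from the fitted mean/covariance of the real data.
mu, cov = real.mean(axis=0), np.cov(real, rowvar=False)
regular = rng.multivariate_normal(mu, cov, size=500)

# Local outliers: push some reconstructed points ~4 local std devs away.
n_out = 25
base = regular[rng.choice(len(regular), n_out, replace=False)]
directions = rng.normal(size=base.shape)
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
outliers = base + 4.0 * np.sqrt(np.diag(cov)) * directions

X = np.vstack([regular, outliers])
y = np.r_[np.zeros(len(regular)), np.ones(n_out)]  # ground-truth labels for benchmarking
print(X.shape, int(y.sum()), "planted outliers")
```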

2020 ◽  
Vol 64 (11) ◽  
pp. 1825-1833
Author(s):  
Jennifer S. Li ◽  
Andreas Hamann ◽  
Elisabeth Beaubien

Testing is essential in data warehouse systems for decision making because the accuracy, validation and correctness of data depend on it. Considering the characteristics and complexity of a data warehouse, in this paper we have tried to show the scope of automated testing in assuring the best data warehouse solutions. Firstly, we developed a data set generator for creating synthetic but near-to-real data; then, in the synthesized data, anomalies were classified with the help of a hand-coded Extraction, Transformation and Loading (ETL) routine. For the quality assurance of data for a data warehouse, and to convey how important Extraction, Transformation and Loading is, some very important test cases were identified. After that, to ensure the quality of data, the automated testing procedures were embedded in the hand-coded ETL routine. Statistical analysis revealed a substantial enhancement in data quality with the automated testing procedures, reinforcing that automated testing gives promising results for data warehouse quality. For effective and easy maintenance of distributed data, a novel architecture was proposed. Although the desired result of this research was achieved and the objectives are promising, there is still a need to validate the results in a real-life environment, as this research was done in a simulated environment, which may not always reflect real-life behavior. Hence, the full potential of the proposed architecture cannot be seen until it is deployed to manage real data that is distributed globally.
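
A hedged illustration of what "embedding automated testing in a hand-coded ETL routine" can look like (the table, checks and data here are invented, not the paper's actual routine or test cases): row-count reconciliation, null checks and domain checks run as assertions inside the load step.

```python
# Hand-coded ETL step with embedded automated data-quality checks.
import sqlite3

def etl_with_checks(rows):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

    # Transform: normalise region names, skip records with missing amounts.
    cleaned = [(i, r.strip().upper(), a) for i, r, a in rows if a is not None]
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)

    # Automated test cases embedded in the routine.
    loaded = con.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    assert loaded == len(cleaned), "row-count reconciliation failed"
    nulls = con.execute("SELECT COUNT(*) FROM sales WHERE amount IS NULL").fetchone()[0]
    assert nulls == 0, "null amounts leaked into the warehouse"
    bad = con.execute("SELECT COUNT(*) FROM sales WHERE amount < 0").fetchone()[0]
    assert bad == 0, "negative amounts violate the domain rule"
    return loaded

print(etl_with_checks([(1, " north ", 10.5), (2, "south", None), (3, "east", 7.0)]))
```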


Author(s):  
Hoon Kim ◽  
Kangwook Lee ◽  
Gyeongjo Hwang ◽  
Changho Suh

Developing a computer vision-based algorithm for identifying dangerous vehicles requires a large amount of labeled accident data, which is difficult to collect in the real world. To tackle this challenge, we first develop a synthetic data generator built on top of a driving simulator. We then observe that the synthetic labels generated from simulation results are very noisy, resulting in poor classification performance. In order to improve the quality of synthetic labels, we propose a new label adaptation technique that first extracts internal states of vehicles from the underlying driving simulator, and then refines labels by predicting future paths of vehicles based on a well-studied motion model. Via real-data experiments, we show that our dangerous vehicle classifier can reduce the missed detection rate by at least 18.5% compared with classifiers trained on real data when time-to-collision is between 1.6s and 1.8s.
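
A rough sketch of the label-refinement idea (the paper's exact motion model and simulator state are not specified here; the constant-velocity model, collision radius and TTC threshold are assumptions): predict both vehicles' future positions and re-label a scene as dangerous only if the predicted gap closes within a time-to-collision threshold.

```python
# Refine a noisy "dangerous" label from predicted time-to-collision (TTC)
# under a constant-velocity motion model.
import numpy as np

def refine_label(p_ego, v_ego, p_other, v_other, ttc_threshold=1.8, horizon=3.0, dt=0.05):
    """Return 1 (dangerous) if the predicted gap closes within ttc_threshold seconds."""
    for t in np.arange(0.0, horizon, dt):
        gap = np.linalg.norm((p_other + t * v_other) - (p_ego + t * v_ego))
        if gap < 2.0:          # assumed collision radius in metres
            return 1 if t <= ttc_threshold else 0
    return 0

# Ego at the origin moving 20 m/s; lead vehicle 15 m ahead moving 10 m/s -> closing gap.
print(refine_label(np.zeros(2), np.array([20.0, 0.0]),
                   np.array([15.0, 0.0]), np.array([10.0, 0.0])))
```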


Geophysics ◽  
2008 ◽  
Vol 73 (3) ◽  
pp. F91-F95 ◽  
Author(s):  
Yutaka Sasaki ◽  
Jeong-Sul Son ◽  
Changryol Kim ◽  
Jung-Ho Kim

Handheld frequency-domain electromagnetic (EM) instruments are being used increasingly for shallow environmental and geotechnical surveys because of their portability and speed of use in field operations. However, in many cases, the quality of data is so poor that quantitative interpretation is not justified. This is because the small-loop EM method is required to detect very weak signals (the secondary magnetic fields) in the presence of the dominant primary field, so the data are inherently susceptible to calibration errors. Although these errors can be measured by raising the instrument high above the ground so that the effect of the conducting ground is negligible, it is impracticable to do so for every survey. We have developed an algorithm that simultaneously inverts small-loop EM data for a multidimensional resistivity distribution and offset errors. For this inversion method to work successfully, the data must be collected at two heights. The forward modeling used in the inversion is based on a staggered-grid 3D finite-difference method; its solution has been checked against a 2.5D finite-element solution. Synthetic and real data examples demonstrate that the inversion recovers reliable resistivity models from multifrequency data that are contaminated severely by offset errors.
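
A toy illustration of why two heights make the joint inversion possible (this is a linear stand-in, not the staggered-grid 3-D forward model; the sensitivity function, heights and values are invented): with measurements at two heights, a constant offset error and the ground response separate cleanly in a joint least-squares fit, whereas with a single height they trade off completely.

```python
# Jointly invert two-height data for a ground parameter and an offset error.
import numpy as np

def sensitivity(h):
    return 1.0 / (1.0 + h) ** 3     # assumed decay of ground coupling with height

sigma_true, offset_true = 50.0, 8.0          # apparent conductivity and instrument offset
heights = np.array([0.5, 1.5])               # data collected at two heights
data = sigma_true * sensitivity(heights) + offset_true

# Jointly solve for [sigma, offset].
G = np.column_stack([sensitivity(heights), np.ones_like(heights)])
model, *_ = np.linalg.lstsq(G, data, rcond=None)
print("recovered sigma, offset:", model)     # ~[50, 8]
```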


2017 ◽  
Author(s):  
Alain de Cheveigné ◽  
Dorothée Arzounian

Electroencephalography (EEG), magnetoencephalography (MEG) and related techniques are prone to glitches, slow drift, steps, etc., that contaminate the data and interfere with the analysis and interpretation. These artifacts are usually addressed in a preprocessing phase that attempts to remove them or minimize their impact. This paper offers a set of useful techniques for this purpose: robust detrending, robust rereferencing, outlier detection, data interpolation (inpainting), step removal, and filter ringing artifact removal. These techniques provide a less wasteful alternative to discarding corrupted trials or channels, and they are relatively immune to artifacts that disrupt alternative approaches such as filtering. Robust detrending allows slow drifts and common mode signals to be factored out while avoiding the deleterious effects of glitches. Robust rereferencing reduces the impact of artifacts on the reference. Inpainting allows corrupt data to be interpolated from intact parts based on the correlation structure estimated over the intact parts. Outlier detection allows the corrupt parts to be identified. Step removal fixes the high-amplitude flux jump artifacts that are common with some MEG systems. Ringing removal allows the ringing response of the antialiasing filter to glitches (steps, pulses) to be suppressed. The performance of the methods is illustrated and evaluated using synthetic data and data from real EEG and MEG systems. These methods, which are mainly automatic and require little tuning, can greatly improve the quality of the data.
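
A minimal sketch of robust detrending in the spirit described above (not the authors' implementation; the polynomial order, threshold and test signal are assumptions): a low-order trend is fitted iteratively with weights, and samples that deviate strongly (glitches) are down-weighted so they do not bias the fitted drift.

```python
# Iteratively reweighted polynomial detrending that masks glitch samples.
import numpy as np

def robust_detrend(x, order=3, n_iter=4, thresh=3.0):
    t = np.linspace(-1.0, 1.0, len(x))
    w = np.ones_like(x)
    for _ in range(n_iter):
        coefs = np.polynomial.polynomial.polyfit(t, x, order, w=w)
        trend = np.polynomial.polynomial.polyval(t, coefs)
        resid = x - trend
        sigma = 1.4826 * np.median(np.abs(resid - np.median(resid)))  # robust scale (MAD)
        w = (np.abs(resid) < thresh * sigma).astype(float)            # mask outlying samples
    return x - trend, w

rng = np.random.default_rng(1)
n = 1000
signal = rng.normal(scale=1.0, size=n) + np.linspace(0, 20, n)  # drifting channel
signal[400:410] += 200.0                                        # glitch
clean, weights = robust_detrend(signal)
print("samples flagged as glitch:", int((weights == 0).sum()))
```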


Author(s):  
Anh Duy Tran ◽  
Somjit Arch-int ◽  
Ngamnij Arch-int

Conditional functional dependencies (CFDs) have been used to improve the quality of data, including detecting and repairing data inconsistencies. Approximation measures have significant importance for data dependencies in data mining. To adapt to exceptions in real data, the measures are used to relax the strictness of CFDs into more generalized dependencies, called approximate conditional functional dependencies (ACFDs). This paper analyzes the weaknesses of the dependency degree, confidence and conviction measures for general CFDs (constant and variable CFDs). A new measure for general CFDs based on incomplete knowledge granularity is proposed to measure the approximation of these dependencies as well as the distribution of data tuples over the conditional equivalence classes. Finally, the effectiveness of stripped conditional partitions and this new measure are evaluated on synthetic and real data sets. These results are important to the study of the theory of approximate dependencies and to the improvement of discovery algorithms for CFDs and ACFDs.
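
For concreteness, here is the classical confidence measure for a constant CFD on a toy relation (this is one of the measures the paper analyzes, not the proposed granularity-based measure; the relation and pattern are invented): the CFD holds approximately when most tuples matching the condition satisfy the consequent.

```python
# Confidence of the constant CFD: if zip = "10001" then city must be "New York".
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},    # violating tuple (real-data exception)
    {"zip": "60601", "city": "Chicago"},
]

matching = [r for r in rows if r["zip"] == "10001"]
satisfying = [r for r in matching if r["city"] == "New York"]
confidence = len(satisfying) / len(matching) if matching else 1.0
print(f"confidence of the CFD: {confidence:.2f}")   # ~0.67: the CFD holds approximately
```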


Author(s):  
Yoshinao Ishii ◽  
Satoshi Koide ◽  
Keiichiro Hayakawa

Unsupervised outlier detection without the need for clean data has attracted great attention because it is suitable for real-world problems as a result of its low data collection costs. Reconstruction-based methods are popular approaches for unsupervised outlier detection. These methods decompose a data matrix into low-dimensional manifolds and an error matrix. Then, samples with a large error are detected as outliers. To achieve high outlier detection accuracy when data are corrupted by large noise, the detection method should have the following two properties: (1) it should be able to decompose the data under the L0-norm constraint on the error matrix and (2) it should be able to reflect the nonlinear features of the data in the manifolds. Despite significant efforts, no method with both of these properties exists. To address this issue, we propose a novel reconstruction-based method: "L0-norm constrained autoencoders (L0-AE)." L0-AE uses autoencoders to learn low-dimensional manifolds that capture the nonlinear features of the data and uses a novel optimization algorithm that can decompose the data under the L0-norm constraints on the error matrix. This novel L0-AE algorithm provably guarantees the convergence of the optimization if the autoencoder is trained appropriately. The experimental results show that L0-AE is more robust, accurate and stable than other unsupervised outlier detection methods, not only for artificial datasets with corrupted samples but also for artificial datasets with well-known outlier distributions and for real datasets. Additionally, the results show that the accuracy of L0-AE is moderately stable to changes in the parameter of the constrained term, and for real datasets, L0-AE achieves higher accuracy than the baseline non-robustified method for most parameter values.
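
A sketch of the alternating idea behind an L0-constrained decomposition (not the L0-AE algorithm itself; a truncated SVD stands in for the autoencoder purely to keep the example short, and the rank, budget and data are assumptions): fit a low-dimensional reconstruction of X - S, then let the error matrix S absorb only the k largest residual entries, i.e. the L0 budget.

```python
# Alternate between a low-dimensional reconstruction and a hard-thresholded error matrix.
import numpy as np

def l0_decompose(X, rank=2, k=20, n_iter=10):
    S = np.zeros_like(X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # low-dimensional reconstruction
        R = X - L
        S = np.zeros_like(X)
        idx = np.unravel_index(np.argsort(np.abs(R), axis=None)[-k:], X.shape)
        S[idx] = R[idx]                                    # keep only the k largest errors
    return L, S

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 10))   # clean, effectively rank-2 data
X[5, 3] += 25.0; X[42, 7] -= 30.0                          # corrupt two entries
L, S = l0_decompose(X, rank=2, k=2)
print("entries flagged by S:\n", np.argwhere(S != 0))
```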


Geophysics ◽  
1991 ◽  
Vol 56 (7) ◽  
pp. 1071-1080 ◽  
Author(s):  
Mark Sams

A long‐spaced sonic survey may be thought of as a special case of ray theoretical tomographic imaging. With such an approach, estimates of borehole properties at a resolution of 6 inches (0.15 m) have been obtained by inversion, compared with a resolution of 2 ft (0.6 m) from standard borehole‐compensated techniques (BHC). The inversion scheme employs the conjugate gradient technique, which is fast and efficient. Unlike BHC, the method compensates for variable refraction angles and provides estimates of errors in the measurements. Results from synthetic data show that these factors greatly improve the imaging of the properties of a finely layered medium, though amplitude decay and coupling are less well defined than velocity and mud traveltime. Results from real data confirm the superior quality of logs from inversion. Furthermore, they indicate that measured amplitudes can be dominated by errors that cause deterioration of BHC estimates of amplitude decay and coupling.
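
A toy sketch of the tomographic view described above (not the paper's parameterisation; the ray geometry, interval count and slowness values are invented): each receiver's traveltime is the sum of slownesses over the 6-inch intervals its ray crosses, and a conjugate-gradient solver inverts the resulting linear system for per-interval slowness.

```python
# Conjugate gradients on the normal equations of a small traveltime system.
import numpy as np

def conjugate_gradient(A, b, n_iter=100):
    """Plain CG applied to (A^T A) x = A^T b."""
    x = np.zeros(A.shape[1])
    r = A.T @ (b - A @ x)
    p = r.copy()
    for _ in range(n_iter):
        Ap = A.T @ (A @ p)
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < 1e-10:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x

n = 8                                             # eight 6-inch depth intervals
slowness_true = np.linspace(100.0, 140.0, n)      # per-interval slowness (arbitrary units)
# Each ray averages up to four consecutive intervals (roughly a 2-ft window).
A = np.array([[1.0 if start <= j < start + 4 else 0.0 for j in range(n)]
              for start in range(n)])
traveltimes = A @ slowness_true
print(np.round(conjugate_gradient(A, traveltimes), 1))
```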


2021 ◽  
Vol 33 (6) ◽  
pp. 265-274
Author(s):  
Hyeon-Jae Kim ◽  
Dong-Hoon Kim ◽  
Chaewook Lim ◽  
Youngtak Shin ◽  
Sang-Chul Lee ◽  
...  

Outlier detection research on ocean data has traditionally been performed using statistical and distance-based machine learning algorithms. Recently, AI-based methods have received a lot of attention, and so-called supervised learning methods that require classification information for the data are mainly used. Supervised learning requires substantial time and cost because classification information (labels) must be assigned manually to all data used for learning. In this study, an autoencoder based on unsupervised learning was applied for outlier detection to overcome this problem. Two experiments were designed: univariate learning, in which only SST (sea surface temperature) data from the Deokjeok Island observations was used, and multivariate learning, in which SST, air temperature, wind direction, wind speed, air pressure, and humidity were used. The data cover 25 years, from 1996 to 2020, and pre-processing that accounts for the characteristics of ocean data was applied. Outlier detection on real SST data was then attempted with the trained univariate and multivariate autoencoders. To compare model performance, various outlier detection methods were applied to synthetic data with artificially inserted errors. Quantitative evaluation gave accuracies of about 96% for the multivariate model and 91% for the univariate model, indicating that the multivariate autoencoder has better outlier detection performance. Outlier detection with an unsupervised-learning-based autoencoder is expected to be useful in various settings because it can reduce subjective classification errors as well as the cost and time required for data labeling.
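
A minimal sketch of the unsupervised idea (not the study's architecture or data; the tiny linear autoencoder, synthetic stand-in observations, spike size and percentile threshold are all assumptions): train an autoencoder on normal multivariate observations, then flag samples whose reconstruction error exceeds a threshold derived from the training errors.

```python
# Autoencoder-style outlier detection via reconstruction error.
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for z-scored SST, air temperature and humidity sharing a common signal.
n, d, h = 2000, 3, 2
common = rng.normal(size=(n, 1))
X = np.hstack([common + 0.1 * rng.normal(size=(n, 1)) for _ in range(d)])

# Tiny linear autoencoder (encoder W1, decoder W2) trained by gradient descent.
W1 = 0.1 * rng.normal(size=(d, h))
W2 = 0.1 * rng.normal(size=(h, d))
lr = 0.02
for _ in range(2000):
    Z = X @ W1                      # encode
    X_hat = Z @ W2                  # decode
    G = 2.0 * (X_hat - X) / n       # gradient of the mean squared error w.r.t. X_hat
    gW2 = Z.T @ G
    gW1 = X.T @ (G @ W2.T)
    W1 -= lr * gW1
    W2 -= lr * gW2

# Flag observations whose reconstruction error exceeds the 99th percentile of training errors.
train_err = ((X @ W1 @ W2 - X) ** 2).sum(axis=1)
threshold = np.percentile(train_err, 99)

X_test = X[:200].copy()
X_test[:5, 0] += 6.0                # five artificial spikes in the SST column
test_err = ((X_test @ W1 @ W2 - X_test) ** 2).sum(axis=1)
print("flagged indices:", np.where(test_err > threshold)[0])  # spiked rows dominate
```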

