A Data Stream Outlier Detection Algorithm Based on Reverse K Nearest Neighbors

2011 ◽  
Vol 225-226 ◽  
pp. 1032-1035 ◽  
Author(s):  
Zhong Ping Zhang ◽  
Yong Xin Liang

This paper proposes SODRNN, a new data stream outlier detection algorithm based on reverse nearest neighbors. We adopt the sliding window model, in which outlier queries are performed to detect anomalies in the current window. Each insertion or deletion update requires only one scan of the current window, which improves efficiency. A Query Manager procedure supports queries at arbitrary times over the whole current window, so concept drift in the data stream can be captured promptly. Experiments on both synthetic and real data sets show that SODRNN is both effective and efficient.
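The sliding-window query described in this abstract can be sketched in Python as follows. This is a minimal illustration, not the SODRNN algorithm itself: it recomputes every neighbor list from scratch on each query rather than using the paper's one-scan incremental update, and the names `SlidingWindowRkNN`, `rknn_counts`, `window_size`, and `k` are hypothetical. A point that appears in few other points' k-nearest-neighbor lists (a low reverse-neighbor count) is treated as a likely outlier.

```python
import math
from collections import deque


def rknn_counts(points, k):
    """Count, for each point, how many other points list it among their k nearest neighbors."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Indices of all other points, ordered by distance to point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            counts[j] += 1
    return counts


class SlidingWindowRkNN:
    """Hypothetical helper: keeps the most recent points and answers outlier queries."""

    def __init__(self, window_size, k):
        # A bounded deque drops the oldest point automatically on insertion.
        self.window = deque(maxlen=window_size)
        self.k = k

    def insert(self, point):
        self.window.append(tuple(point))

    def query(self, n_outliers):
        """Return the n_outliers points with the fewest reverse k nearest neighbors."""
        pts = list(self.window)
        counts = rknn_counts(pts, self.k)
        order = sorted(range(len(pts)), key=lambda i: counts[i])
        return [pts[i] for i in order[:n_outliers]]
```

Because the query runs over whatever the window currently holds, it can be issued at any time, which is the property the abstract attributes to the Query Manager procedure.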

2014 ◽  
Vol 6 ◽  
pp. 830402 ◽  
Author(s):  
Changhao Piao ◽  
Zhi Huang ◽  
Ling Su ◽  
Sheng Lu

The battery system is the key part of the electric vehicle. To detect outliers effectively while the battery system is running, a new high-dimensional data stream outlier detection algorithm (DSOD) based on angle distribution is proposed. First, to improve the stability of the algorithm in high-dimensional space, an angle distribution-based outlier detection method is employed. Second, to reduce computational complexity, a small-scale calculation set of the data stream is established, composed of a normal set and a border set. To address concept drift, an update mechanism for the normal set and border set is developed. In this way, hidden abnormal points can be detected rapidly. Experimental results on real data sets and battery system simulation data sets demonstrate that DSOD is more efficient than simple variance of angles (Simple VOA) and angle-based outlier detection (ABOD) and is well suited to evaluating battery system safety.
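The angle-based idea behind this abstract can be illustrated with a plain variance-of-angles score. The sketch below is an assumption-laden illustration, not DSOD: it scores each point against all other points rather than against the paper's normal and border sets, and the function names `variance_of_angles` and `voa_scores` are hypothetical. A point far from the data sees all other points within a narrow cone, so the angles it forms vary little and its score is low.

```python
import math
from itertools import combinations
from statistics import pvariance


def variance_of_angles(p, others):
    """Variance of the angles at p formed by every pair of the other points."""
    angles = []
    for a, b in combinations(others, 2):
        u = [ai - pi for ai, pi in zip(a, p)]
        v = [bi - pi for bi, pi in zip(b, p)]
        nu, nv = math.hypot(*u), math.hypot(*v)
        if nu == 0 or nv == 0:
            continue  # a duplicate of p defines no angle
        cos = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        # Clamp to [-1, 1] to guard against floating-point rounding.
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return pvariance(angles) if angles else 0.0


def voa_scores(points):
    """Score every point; lower scores suggest outliers."""
    return [
        variance_of_angles(p, points[:i] + points[i + 1:])
        for i, p in enumerate(points)
    ]
```

Scoring every point against every pair is cubic in the number of points, which is the cost the abstract's small-scale calculation set is meant to reduce.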


2014 ◽  
Vol 7 (5) ◽  
pp. 2303-2311 ◽  
Author(s):  
M. Martinez-Camara ◽  
B. Béjar Haro ◽  
A. Stohl ◽  
M. Vetterli

Abstract. Emissions of harmful substances into the atmosphere are a serious environmental concern. In order to understand and predict their effects, it is necessary to estimate the exact quantity and timing of the emissions from sensor measurements taken at different locations. There are a number of methods for solving this problem. However, these existing methods assume Gaussian additive errors, making them extremely sensitive to outlier measurements. We first show that the errors in real-world measurement data sets come from a heavy-tailed distribution, i.e., include outliers. Hence, we propose robustifying the existing inverse methods by adding a blind outlier-detection algorithm. The improved performance of our method is demonstrated on a real data set and compared to previously proposed methods. For the blind outlier detection, we first use an existing algorithm, RANSAC, and then propose a modification called TRANSAC, which provides a further performance improvement.
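The role RANSAC plays as a blind outlier detector can be shown on a toy problem. The sketch below fits a straight line rather than solving the atmospheric inverse problem of the paper, and it does not reproduce the TRANSAC modification; the function names `fit_line` and `ransac_line` and the parameters `n_iter`, `tol`, and `seed` are assumptions made for illustration. The idea is to fit a model to a small random subset, count the measurements that agree with it, keep the largest agreeing set, and run ordinary least squares on that set only.

```python
import random


def fit_line(pts):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    a = sxy / sxx
    return a, my - a * mx


def ransac_line(pts, n_iter=200, tol=0.5, seed=0):
    """Return ((a, b), inliers) using the largest consensus set found."""
    rng = random.Random(seed)
    best = []
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = rng.sample(pts, 2)
        if x1 == x2:
            continue  # a vertical pair cannot define y = a * x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in pts if abs(y - (a * x + b)) <= tol]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        raise ValueError("no consensus set found")
    return fit_line(best), best
```

A least-squares fit on all measurements would be pulled toward the outliers, which is the sensitivity to non-Gaussian errors that the abstract describes.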


2014 ◽  
Vol 23 (04) ◽  
pp. 1460021 ◽  
Author(s):  
Mohamed Bouguessa

In this paper, we propose an approach that combines different outlier detection algorithms in order to improve effectiveness. To this end, we first estimate an outlier score vector for each data object. Each element of the estimated vectors corresponds to an outlier score produced by a specific outlier detection algorithm. We then use the multivariate beta mixture model to cluster the outlier score vectors into several components so that the component that corresponds to the outliers can be identified. A notable feature of the proposed approach is the automatic identification of outliers, whereas most existing methods either return only a ranked list of points, expecting the outliers to come first, or require empirical threshold estimation to identify outliers. Experimental results on both synthetic and real data sets show that our approach substantially enhances the accuracy of the base outlier detectors considered in the combination and overcomes their drawbacks.


2020 ◽  
Vol 10 (4) ◽  
pp. 287-298
Author(s):  
Piotr Duda ◽  
Krzysztof Przybyszewski ◽  
Lipo Wang

The training set consists of many features that influence the classifier to different degrees. Choosing the most important features and rejecting those that do not carry relevant information is of great importance to the operation of the learned model. In the case of data streams, the importance of the features may additionally change over time. Such changes affect the performance of the classifier but can also be an important indicator of concept drift. In this work, we propose a new algorithm for data stream classification, called Random Forest with Features Importance (RFFI), which uses the measure of feature importance as a drift detector. The RFFI algorithm adapts solutions inspired by the Random Forest algorithm to data stream scenarios. The proposed algorithm combines the ability of ensemble methods to handle slow changes in a data stream with a new method for detecting the occurrence of concept drift. The work contains an experimental analysis of the proposed algorithm, carried out on synthetic and real data.


2018 ◽  
Vol 14 (9) ◽  
pp. 1213-1225 ◽  
Author(s):  
Vo Ngoc Phu ◽  
Vo Thi Ngoc Tran

Author(s):  
Chunhua Ren ◽  
Linfu Sun

The classic Fuzzy C-means (FCM) algorithm has limited clustering performance and is prone to misclassifying border points. This study offers a bi-directional FCM clustering ensemble approach that takes local information into account (LI_BIFCM) to overcome these challenges and increase clustering quality. First, various membership matrices are created by running FCM multiple times with randomized initial cluster centers, and a vertical ensemble is performed using the maximum membership principle. Second, after each execution of FCM, multiple local membership matrices of the sample points are created using multiple K-nearest neighbors, and a horizontal ensemble is performed. Multiple horizontal ensembles can be created from multiple FCM runs. Finally, the final clustering results are obtained by combining the vertical and horizontal clustering ensembles. Twelve data sets were chosen for testing from both synthetic and real data sources. In the experiments, LI_BIFCM outperformed four traditional clustering algorithms and three clustering ensemble algorithms. Furthermore, the final clustering results have a weak correlation with the bi-directional cluster ensemble parameters, indicating that the suggested technique is robust.
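The base step that this abstract builds on, a single FCM run followed by the maximum membership principle, can be sketched as below. This shows only standard FCM and does not implement the vertical or horizontal ensembles of LI_BIFCM; the function names `fcm` and `hard_labels` and the parameters `m`, `n_iter`, `seed`, and `init` are assumptions made for illustration.

```python
import math
import random


def fcm(points, n_clusters, m=2.0, n_iter=50, seed=0, init=None):
    """Standard fuzzy c-means; returns (centers, membership matrix)."""
    rng = random.Random(seed)
    # Random initial centers unless the caller supplies them explicitly.
    centers = [list(c) for c in (init or rng.sample(points, n_clusters))]
    dims = len(points[0])
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)).
        memberships = []
        for p in points:
            d = [max(math.dist(p, c), 1e-12) for c in centers]
            memberships.append(
                [1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d]
            )
        # Center update: mean of the points weighted by u_ij ** m.
        for j in range(n_clusters):
            w = [row[j] ** m for row in memberships]
            total = sum(w)
            centers[j] = [
                sum(wi * p[dim] for wi, p in zip(w, points)) / total
                for dim in range(dims)
            ]
    return centers, memberships


def hard_labels(memberships):
    """Maximum membership principle: assign each point to its strongest cluster."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]
```

Running `fcm` several times with different `seed` values yields the multiple membership matrices that an ensemble of the kind described in the abstract would start from.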

