A Novel Fuzzy Kernel Clustering Algorithm for Outlier Detection

Author(s):  
Hongyi Zhang ◽  
Qingtao Wu ◽  
Jiexin Pu
2013 ◽  
Vol 791-793 ◽  
pp. 1337-1340
Author(s):  
Xue Zhang Zhao ◽  
Ming Qi ◽  
Yong Yi Feng

The fuzzy kernel clustering algorithm combines unsupervised clustering with fuzzy set concepts for image segmentation. However, the algorithm is sensitive to initial values, depends to a large extent on the choice of initial cluster centers, and converges easily to local minima. Moreover, when it is used for image segmentation, the membership calculation considers only the current pixel's value and ignores its relationship with neighboring pixels, so segmentation of noisy images is unsatisfactory. This paper proposes an improved fuzzy kernel clustering image segmentation algorithm: the single-objective problem is converted into a multi-objective one by adding a secondary objective concerning the membership functions; spatial constraint information is then incorporated; finally, the membership degree of the current pixel is corrected using its spatial neighborhood pixels. Experimental results show that the algorithm effectively avoids convergence to local extrema and stagnation of the iterative process, significantly reduces the number of iterations, and has good robustness and adaptability.
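
The following is a minimal Python sketch of the general idea: kernel fuzzy c-means with a Gaussian kernel, where each pixel's membership is corrected using the average membership of its 3x3 neighborhood. The function name, kernel width sigma, fuzzifier m, and the specific neighborhood weighting are illustrative assumptions, not the authors' exact multi-objective formulation.

```python
# Sketch: kernel fuzzy c-means (Gaussian kernel) with a spatial membership
# correction. Parameter values and the neighborhood weighting are assumptions.
import numpy as np
from scipy.ndimage import uniform_filter

def kernel_fcm_spatial(img, n_clusters=3, m=2.0, sigma=30.0,
                       max_iter=100, tol=1e-5):
    """Segment a 2-D grayscale image with kernel FCM plus spatial correction."""
    x = img.astype(float).ravel()                    # pixel intensities
    rng = np.random.default_rng(0)
    centers = rng.choice(x, n_clusters)              # initial cluster centers

    for _ in range(max_iter):
        # Gaussian-kernel "distance" in feature space is proportional to 1 - K(x, v)
        k = np.exp(-(x[None, :] - centers[:, None]) ** 2 / (2 * sigma ** 2))
        d = np.clip(1.0 - k, 1e-12, None)

        # Standard fuzzy membership update
        u = d ** (-1.0 / (m - 1))
        u /= u.sum(axis=0, keepdims=True)

        # Spatial correction: weight each pixel's membership by the average
        # membership of its 3x3 neighborhood, then renormalize.
        h = np.stack([uniform_filter(ui.reshape(img.shape), size=3).ravel()
                      for ui in u])
        u = u * h
        u /= u.sum(axis=0, keepdims=True)

        # Kernel-weighted center update
        w = (u ** m) * k
        new_centers = (w * x[None, :]).sum(axis=1) / w.sum(axis=1)
        if np.max(np.abs(new_centers - centers)) < tol:
            centers = new_centers
            break
        centers = new_centers

    labels = u.argmax(axis=0).reshape(img.shape)
    return labels, centers
```

The neighborhood weighting is what damps isolated noisy pixels: a pixel surrounded by members of another cluster has its membership pulled toward that cluster.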


2013 ◽  
Vol 7 (3) ◽  
pp. 1005-1012 ◽  
Author(s):  
Zhang Chen ◽  
Xia Shixiong ◽  
Liu Bing

Author(s):  
WASEEM AHMAD ◽  
AJIT NARAYANAN

Outlier detection has important applications in various data mining domains such as fraud detection, intrusion detection, customer behavior analysis and employee performance analysis. Outliers are characterized by being significantly or "interestingly" different from the rest of the data. In this paper, a novel cluster-based outlier detection method is proposed using a humoral-mediated clustering algorithm (HAIS) based on concepts of antibody secretion in natural immune systems. The proposed method finds meaningful clusters as well as outliers simultaneously. It is an iterative approach in which only clusters above a size threshold (larger clusters) are carried forward to the next cycle of cluster formation, while smaller clusters are removed. The paper also demonstrates through experimental results that the mere presence of outliers severely affects the clustering outcome, and that removing those outliers can yield better clustering solutions. The feasibility of the method is demonstrated on simulated datasets, datasets from the literature, and a real-world doctors' performance evaluation dataset where the task is to identify potentially under-performing doctors. The results indicate that HAIS is capable of detecting single-point as well as cluster-based outliers.
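
The HAIS antibody-secretion mechanics are not reproduced here; the sketch below only illustrates the surrounding iterative cluster-then-prune loop, with a generic clustering routine (scikit-learn's KMeans) standing in for HAIS. The size threshold, number of clusters, and function name are illustrative assumptions.

```python
# Sketch of the iterative "keep large clusters, flag small ones" loop.
# KMeans is a stand-in for the HAIS immune-inspired clustering algorithm.
import numpy as np
from sklearn.cluster import KMeans

def iterative_outlier_pruning(X, n_clusters=5, min_size=10, max_rounds=10):
    """Repeatedly cluster the data, flagging members of undersized clusters
    as outlier candidates and re-clustering only the surviving points."""
    active = np.arange(len(X))           # indices still treated as inliers
    outliers = []

    for _ in range(max_rounds):
        if len(active) <= min_size:
            break
        k = min(n_clusters, len(active))
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(X[active])
        sizes = np.bincount(labels)
        small = np.isin(labels, np.where(sizes < min_size)[0])
        if not small.any():
            break                        # no undersized clusters remain
        outliers.extend(active[small])   # single-point or small-cluster outliers
        active = active[~small]          # carry larger clusters to the next round

    return np.array(outliers, dtype=int), active
```

Removing the flagged points before the final clustering pass reflects the paper's observation that outliers degrade the clustering outcome itself.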


Author(s):  
John Waller

Geographic outliers at GBIF (Global Biodiversity Information Facility) are a known problem. Outliers can be errors, coordinates with high uncertainty, or simply occurrences from an undersampled region. In data cleaning pipelines, outliers are often removed (even if they are legitimate points) because the researcher does not have time to verify each record one by one. Outlier points are usually occurrences that need attention. Currently, there is no outlier detection implemented at GBIF, and it is up to the user to flag outliers themselves. DBSCAN (a density-based algorithm for discovering clusters in large spatial databases with noise) is a simple and popular clustering algorithm. It uses two parameters, (1) a distance and (2) a minimum number of points per cluster, to decide whether something is an outlier. Since occurrence data can be very patchy, non-clustering distance-based methods will often fail (Fig. 1). DBSCAN does not need to know the expected number of clusters in advance, and it does well using distance alone, without additional environmental variables such as Bioclim.

Advantages of DBSCAN:
- Simple and easy to understand
- Only two parameters to set
- Scales well
- No additional data sources needed
- Users would understand how their data was changed

Drawbacks:
- Only uses distance
- Parameter settings must be chosen
- Sensitive to sparse global sampling
- Does not include any other relevant environmental information
- Can only flag outliers outside of a point blob

Outlier detection and error detection are different: a system expected to produce no false positives will fail. While more complex, environmentally informed outlier detection methods (like reverse jackknifing (Chapman 2005)) might perform better for certain examples or even in general, DBSCAN performs adequately on almost everything despite being very simple. Currently I am using DBSCAN to find errors and assess dataset quality. It is a Spark job written in Scala (github). It does not run on species with many (>30K) unique latitude-longitude points, since the current implementation relies on an in-memory distance matrix. However, around 99% of species (plants, animals, fungi) on GBIF have fewer than 30K unique lat-long points (only 2,283 of 222,993 species keys exceed this). There are other implementations (example) that might scale to many more points. There are no immediate plans to include DBSCAN outliers as a data quality flag on GBIF, but it could be done somewhat easily, since this type of method does not rely on any external environmental data sources and already runs on the GBIF cluster.
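
The production job described above is a Scala/Spark implementation; as a rough illustration of the same idea, the sketch below runs scikit-learn's DBSCAN on latitude-longitude points with a haversine metric and flags noise points (label -1) as outliers. The eps radius and min_samples values are illustrative assumptions, not the settings used at GBIF.

```python
# Sketch: flag geographic outliers with DBSCAN on great-circle distances.
# eps_km and min_samples are illustrative, not GBIF's production settings.
import numpy as np
from sklearn.cluster import DBSCAN

def flag_geographic_outliers(lat, lon, eps_km=1000.0, min_samples=3):
    """Return a boolean mask marking records labelled as noise (-1) by DBSCAN."""
    coords = np.radians(np.column_stack([lat, lon]))   # haversine expects radians
    earth_radius_km = 6371.0
    db = DBSCAN(eps=eps_km / earth_radius_km,          # radius expressed in radians
                min_samples=min_samples,
                metric="haversine").fit(coords)
    return db.labels_ == -1

# Example: two tight blobs of occurrences plus one far-away point.
lat = [10.0, 10.1, 10.2, 45.0, 45.1, 45.2, -60.0]
lon = [20.0, 20.1, 19.9, 5.0, 5.1, 4.9, 100.0]
print(flag_geographic_outliers(lat, lon))   # only the last point is flagged
```

As in the text, this only needs coordinates and two parameters; it flags points far from any dense blob of occurrences and cannot catch errors that fall inside one.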

