Outlier Detection Method on UCI Repository Dataset by Entropy Based Rough K-means

2016 ◽  
Vol 66 (2) ◽  
pp. 113 ◽  
Author(s):  
Ashok P. ◽  
G.M Kadhar Nawaz

Rough set theory handles uncertainty and incomplete information by means of two sets, the lower and upper approximations. In this paper, the clustering process is improved by adapting the preliminary centroid selection method of the rough K-means (RKM) algorithm. The entropy based rough K-means (ERKM) method is developed by applying entropy based preliminary centroid selection to RKM; it is then executed and validated using cluster validity indexes. An example shows that ERKM performs effectively when the preliminary centroids are selected by entropy. In addition, outlier detection is an important task in data mining; an outlier is an object that differs markedly from the rest of the objects in its cluster. The entropy based rough outlier factor (EROF) method is used to detect outliers in the yeast dataset. An example shows that EROF detects outliers effectively on protein localisation sites and that the ERKM clustering algorithm performs effectively. Further, experimental results show that the ERKM and EROF methods outperform the other methods.
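The lower and upper approximations mentioned in this abstract can be illustrated with a minimal sketch of a rough K-means assignment step. This is not the authors' ERKM: it assumes a Lingras-style rule in which a point whose two nearest centroids are almost equally close is placed only in the upper approximations of those clusters. The function name `rough_assign`, the `threshold` parameter and the sample points are hypothetical, and entropy based centroid selection is not shown.

```python
import math

def rough_assign(points, centroids, threshold):
    """Split points into lower and upper approximations per cluster.

    Assumed rule (Lingras-style, not taken from the abstract): a point joins
    the lower approximation of its nearest cluster unless another centroid is
    almost as close (distance gap <= threshold); such a boundary point joins
    only the upper approximations of every cluster it is close to.
    """
    k = len(centroids)
    lower = [[] for _ in range(k)]
    upper = [[] for _ in range(k)]
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        nearest = min(range(k), key=dists.__getitem__)
        close = [j for j in range(k)
                 if j != nearest and dists[j] - dists[nearest] <= threshold]
        if close:
            # Ambiguous point: boundary region of several clusters
            for j in [nearest] + close:
                upper[j].append(p)
        else:
            # Unambiguous point: the lower approximation is a subset of the upper
            lower[nearest].append(p)
            upper[nearest].append(p)
    return lower, upper

# Hypothetical data: two tight groups plus one point midway between them
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (2.5, 2.5)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
lower, upper = rough_assign(points, centroids, threshold=0.5)
```

In this toy run the midway point (2.5, 2.5) lands in both upper approximations and in neither lower approximation, which is the kind of uncertain membership that rough clustering is meant to represent.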

Author(s):  
WASEEM AHMAD ◽  
AJIT NARAYANAN

Outlier detection has important applications in various data mining domains such as fraud detection, intrusion detection, customers' behavior and employees' performance analysis. Outliers are characterized by being significantly or "interestingly" different from the rest of the data. In this paper, a novel cluster-based outlier detection method is proposed using a humoral-mediated clustering algorithm (HAIS) based on concepts of antibody secretion in natural immune systems. The proposed method finds meaningful clusters as well as outliers simultaneously. This is an iterative approach where only clusters above threshold (larger sized clusters) are carried forward to the next cycle of cluster formation while removing small sized clusters. This paper also demonstrates through experimental results that the mere existence of outliers severely affects the clustering outcome, and removing those outliers can result in better clustering solutions. The feasibility of the method is demonstrated through simulated datasets, current datasets from the literature as well as a real-world doctors' performance evaluation dataset where the task is to identify potentially under-performing doctors. The results indicate that HAIS has capabilities of detecting single point as well as cluster-based outliers.
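The size-threshold idea described in this abstract (large clusters are carried forward, small ones are removed) can be sketched as a single filtering pass. This is not HAIS itself: the clustering that produces the labels is assumed to come from elsewhere, and `split_small_clusters`, `min_size` and the label list are hypothetical.

```python
from collections import Counter

def split_small_clusters(labels, min_size):
    """Return (kept, outliers) as lists of point indices.

    Points in clusters with at least min_size members are kept for the next
    cycle; points in smaller clusters are flagged as outlier candidates.
    """
    sizes = Counter(labels)
    kept = [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]
    outliers = [i for i, lab in enumerate(labels) if sizes[lab] < min_size]
    return kept, outliers

# Hypothetical cluster labels: clusters 0 and 1 are large,
# cluster 2 is a single point and cluster 3 is a small group.
labels = [0, 0, 0, 1, 1, 1, 1, 2, 3, 3]
kept, outliers = split_small_clusters(labels, min_size=3)
```

Here index 7 would be a single-point outlier and indices 8 and 9 a small outlying cluster; an iterative scheme would re-cluster only the kept points in the next cycle.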


2020 ◽  
Vol 13 (6) ◽  
pp. 120-130
Author(s):  
Neelampalli Jayanthi ◽  
Burra Babu ◽  
Nandam Rao ◽  
...  

Outlier detection is widely used in data analysis for the clustering of data. Many techniques have been applied to outlier detection to increase the efficiency of data analysis. The Local Projection based Outlier Detection (LPOD) method effectively identifies neighbouring values of data, but it selects the cluster centre at random, which affects overall clustering performance. In this study, the Adaptive Clustering by Fast Search and Find of Density Peak (ACFSFDP) method is proposed to select the cluster centre and density peak. ACFSFDP is implemented with the min-max algorithm to find the number of categories, measuring local density and distance information. Density and distance are used to select the cluster centre, whereas existing distance based clustering techniques do not calculate density. ACFSFDP calculates the cluster centre from density and distance during the clustering process, whereas the existing techniques select the centre at random. The results indicate that ACFSFDP provides more effective outlier detection than the existing Clustering by Fast Search and Find of Density Peak (CFSFDP) methods. ACFSFDP is tested on two datasets, Pen-digits and Waveform. The experimental results show that the Area Under Curve (AUC) of ACFSFDP is 99.08% on the Pen-digits dataset, while the existing distance based classifier k-Nearest Neighbour achieves an AUC of 68.7%.
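The two quantities this abstract relies on, local density and distance, can be sketched as follows. The code assumes the basic density-peaks definitions (density as the number of neighbours within a cutoff, distance as the gap to the nearest point of higher density); it does not implement the adaptive or min-max parts of ACFSFDP. The names `density_peaks` and `cutoff` and the sample points are hypothetical.

```python
import math

def density_peaks(points, cutoff):
    """Return (rho, delta) for every point.

    rho[i]   : number of other points closer than cutoff (local density)
    delta[i] : distance to the nearest point with a higher density; for the
               densest point, the largest distance to any other point
    """
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical data: five points in a row plus one far-away point
points = [(-0.2, 0.0), (-0.1, 0.0), (0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 0.0)]
rho, delta = density_peaks(points, cutoff=0.25)
```

Under these definitions a cluster centre candidate has both a high `rho` and a high `delta` (the middle point here), while a point with a low `rho` and a high `delta` (the far-away point) is a natural outlier candidate.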


2018 ◽  
Vol 3 (1) ◽  
pp. 001
Author(s):  
Zulhendra Zulhendra ◽  
Gunadi Widi Nurcahyo ◽  
Julius Santony

This study applies a data mining technique, namely K-Means clustering. Data mining can be used to analyse fairly large volumes of data; here the aim is to enable Indocomputer to understand and classify its service data based on customer complaints, using the Weka software. The K-Means clustering algorithm is used to predict or classify complaints about hardware damage at Indocomputer Payakumbuh. The results also identify the laptop brands most often brought in for service at Indocomputer Payakumbuh, which can serve as a recommendation to consumers when selecting a laptop.
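For readers unfamiliar with K-Means, the following is a minimal sketch of the standard assignment and update loop. The study itself used Weka, so this is only an illustration of the algorithm: the function `kmeans`, the starting centroids and the numeric records (imagined here as two encoded attributes per service ticket) are all hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain K-Means from given starting centroids; returns (centroids, labels)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels

# Hypothetical encoded records, two attributes each
records = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, labels = kmeans(records, centroids=[(1, 1), (8, 8)])
```

With these starting centroids the loop converges after the second pass, separating the first three records from the last three.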


2021 ◽  
Vol 13 (5) ◽  
pp. 956
Author(s):  
Florian Mouret ◽  
Mohanad Albughdadi ◽  
Sylvie Duthoit ◽  
Denis Kouamé ◽  
Guillaume Rieu ◽  
...  

This paper studies the detection of anomalous crop development at the parcel-level based on an unsupervised outlier detection technique. The experimental validation is conducted on rapeseed and wheat parcels located in Beauce (France). The proposed methodology consists of four sequential steps: (1) preprocessing of synthetic aperture radar (SAR) and multispectral images acquired using Sentinel-1 and Sentinel-2 satellites, (2) extraction of SAR and multispectral pixel-level features, (3) computation of parcel-level features using zonal statistics and (4) outlier detection. The different types of anomalies that can affect the studied crops are analyzed and described. The different factors that can influence the outlier detection results are investigated, with particular attention devoted to the synergy between Sentinel-1 and Sentinel-2 data. Overall, the best performance is obtained when jointly using a selection of Sentinel-1 and Sentinel-2 features with the isolation forest algorithm. The selected features are co-polarized (VV) and cross-polarized (VH) backscattering coefficients for Sentinel-1 and five Vegetation Indexes for Sentinel-2 (among others, the Normalized Difference Vegetation Index and two variants of the Normalized Difference Water Index). When using these features with an outlier ratio of 10%, the percentage of detected true positives (i.e., crop anomalies) is equal to 94.1% for rapeseed parcels and 95.5% for wheat parcels.
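Steps (2) and (3) of this pipeline can be illustrated with a small sketch that computes a pixel-level NDVI and aggregates it per parcel. The NDVI formula is the standard one; the choice of the median as the zonal statistic, the function names and the reflectance values are assumptions made for illustration, and the outlier detection step is not shown.

```python
from statistics import median

def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    return (nir - red) / (nir + red)

def zonal_median(pixel_values, parcel_ids):
    """Aggregate pixel-level values into one median value per parcel."""
    by_parcel = {}
    for value, pid in zip(pixel_values, parcel_ids):
        by_parcel.setdefault(pid, []).append(value)
    return {pid: median(vals) for pid, vals in by_parcel.items()}

# Hypothetical reflectances for five pixels spread over two parcels
nir = [0.8, 0.6, 0.7, 0.3, 0.4]
red = [0.2, 0.2, 0.1, 0.3, 0.2]
parcels = ["A", "A", "A", "B", "B"]

pixel_ndvi = [ndvi(n, r) for n, r in zip(nir, red)]
features = zonal_median(pixel_ndvi, parcels)
```

The resulting per-parcel values would form one column of the feature table passed to an outlier detector; in this toy data parcel B has a much lower NDVI than parcel A.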

