Outlying Subspace Detection for High-Dimensional Data

Author(s):  
Ji Zhang ◽  
Qigang Gao ◽  
Hai Wang

Knowledge discovery in databases, commonly referred to as data mining, has attracted enormous research effort over the past decade from domains such as databases, statistics, artificial intelligence, and data visualization. Most data mining research, such as clustering, association rule mining, and classification, focuses on discovering large patterns in databases (Ramaswamy, Rastogi, & Shim, 2000). Yet it is also important to explore the small patterns in databases that carry valuable information about interesting abnormalities. Outlier detection is a research problem in small-pattern mining in databases. It aims at finding a specified number of objects that are considerably dissimilar, exceptional, and inconsistent with respect to the majority of records in an input database. Numerous outlier detection methods have been proposed, such as distribution-based methods (Barnett & Lewis, 1994; Hawkins, 1980), distance-based methods (Angiulli & Pizzuti, 2002; Knorr & Ng, 1998, 1999; Ramaswamy et al., 2000; Wang, Zhang, & Wang, 2005), density-based methods (Breunig, Kriegel, Ng, & Sander, 2000; Jin, Tung, & Han, 2001; Tang, Chen, Fu, & Cheung, 2002), and clustering-based methods (Agrawal, Gehrke, Gunopulos, & Raghavan, 1998; Ester, Kriegel, Sander, & Xu, 1996; Hinneburg & Keim, 1998; Ng & Han, 1994; Sheikholeslami, Chatterjee, & Zhang, 1999; J. Zhang, Hsu, & Lee, 2005; T. Zhang, Ramakrishnan, & Livny, 1996).
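As a concrete illustration of the distance-based family cited above, the following is a minimal sketch of the DB(p, D) notion of Knorr and Ng (1998): a point is flagged as an outlier when at least a fraction p of the remaining points lies farther than distance D from it. The synthetic data and the values of p and D are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def db_outliers(X, p=0.95, D=2.0):
    """Flag DB(p, D) outliers: points for which at least a fraction p
    of the other points lie farther away than distance D."""
    dists = pairwise_distances(X)                 # full pairwise distance matrix
    n = len(X)
    far_fraction = (dists > D).sum(axis=1) / (n - 1)   # self-distance 0 is never > D
    return far_fraction >= p

# Toy data: a dense cluster plus one obvious outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 3)), [[8.0, 8.0, 8.0]]])
print(np.where(db_outliers(X))[0])                # expected to include index 100
```

In practice both p and D are data dependent and are usually tuned per data set, which is one motivation for the parameter-light variants cited above.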

2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Recently, anomaly detection has attracted a strong response from data mining researchers, as its reputation has grown steadily in practical domains such as product marketing, fraud detection, medical diagnosis, and fault detection. Outlier detection in high-dimensional data poses exceptional challenges for data mining experts because of the curse of dimensionality and the resulting resemblance between distant and adjacent points. Traditional algorithms and techniques detect outliers over the full feature space. Such customary methodologies concentrate largely on low-dimensional data and are therefore ineffective at discovering anomalies in data sets with a high number of dimensions. Digging out the anomalies present in a high-dimensional data set becomes very difficult and tiresome when all subsets of projections need to be explored. All data points in high-dimensional data start to behave like similar observations because of an intrinsic property of such data: the contrast between distances to observations vanishes as the number of dimensions extends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It opens a new breadth of research towards resolving inherent problems of high-dimensional data in which outliers reside within clusters having different densities. A high-dimensional dataset from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
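The density-based comparison point can be run in outline with scikit-learn's LocalOutlierFactor; the snippet below is only an illustrative baseline on synthetic high-dimensional data with clusters of different densities, not the authors' proposed technique, and all data shapes and parameters are assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic high-dimensional data: two clusters of different density
# plus a handful of scattered points planted as anomalies.
rng = np.random.default_rng(1)
dense = rng.normal(0.0, 0.5, size=(200, 30))
sparse = rng.normal(5.0, 2.0, size=(200, 30))
anomalies = rng.uniform(-10, 15, size=(5, 30))
X = np.vstack([dense, sparse, anomalies])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)          # -1 marks points judged as outliers
print(np.where(labels == -1)[0])     # indices of the flagged observations
```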


2018 ◽  
Vol 7 (3.3) ◽  
pp. 532
Author(s):  
S Sathya ◽  
N Rajendran

Data mining (DM) is used to extract useful and non-trivial information from the large amounts of data collected in many diverse fields. Data mining derives explanations through clustering, visualization, association, and sequential analysis. Chemical compounds are well-defined structures that are compactly captured by a graph representation. Chemical bonding is the association of atoms into molecules, ions, crystals, and other stable species that make up the common substances described in chemical information. However, large-scale sequential data poses fundamental problems in data mining, such as high classification time and bonding time, across many applications. In this work, chemical structured index bonding is used for sequential pattern mining. Our research work helps to evaluate the structural patterns of chemical bonding in chemical information data sets.
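One way to read the statement that chemical compounds are captured by a graph representation is sketched below: atoms become labelled vertices and bonds become labelled edges, from which simple structural signatures can be derived for pattern mining. The water molecule and the signature function are hypothetical placeholders, not the paper's indexing scheme.

```python
# A minimal adjacency-style representation of a chemical compound as a
# labelled graph: vertices are atoms, edges carry the bond type.
water = {
    "atoms": {0: "O", 1: "H", 2: "H"},
    "bonds": [(0, 1, "single"), (0, 2, "single")],
}

def bond_signature(compound):
    """Return a sorted list of (atom, atom, bond) triples, a simple
    building block for structural pattern mining over many compounds."""
    atoms, bonds = compound["atoms"], compound["bonds"]
    return sorted(
        tuple(sorted((atoms[a], atoms[b]))) + (kind,) for a, b, kind in bonds
    )

print(bond_signature(water))   # [('H', 'O', 'single'), ('H', 'O', 'single')]
```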


2022 ◽  
Vol 13 (1) ◽  
pp. 1-17
Author(s):  
Ankit Kumar ◽  
Abhishek Kumar ◽  
Ali Kashif Bashir ◽  
Mamoon Rashid ◽  
V. D. Ambeth Kumar ◽  
...  

Detection of outliers or anomalies is one of the vital issues in pattern-driven data mining. Outlier detection identifies the inconsistent behavior of individual objects. It is an important sector of the data mining field with several applications, such as detecting credit card fraud, discovering hacking, and uncovering criminal activities. It is necessary to develop tools to uncover the critical information hidden in extensive data. This paper investigates a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying the clusters and outliers in datasets containing noise. The proposed method can detect the groups and outliers left by the clustering process, that is, the resulting sets of clusters (C) and outliers (O), and use them to boost the results. The results obtained after applying the algorithm to the dataset improve in terms of several parameters. For the comparative analysis, the average accuracy and the recall are computed. The average accuracy of the existing COID algorithm is 74.05%, while the proposed algorithm reaches 77.21%. The average recall values are 81.19% and 89.51% for the existing and the proposed algorithms, respectively, which shows that the proposed work is more efficient than the existing COID algorithm.
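Neither the paper's algorithm nor the COID baseline is reproduced here; the sketch below only illustrates the general cluster-plus-outlier setting with DBSCAN, which returns a set of clusters and labels the points left outside them (label -1) as outliers. The synthetic data and parameters are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic multidimensional data with noise: two clusters plus scattered points.
rng = np.random.default_rng(2)
clusters = np.vstack([rng.normal(0, 0.3, size=(150, 4)),
                      rng.normal(3, 0.3, size=(150, 4))])
noise = rng.uniform(-2, 5, size=(10, 4))
X = np.vstack([clusters, noise])

model = DBSCAN(eps=0.7, min_samples=5).fit(X)
clusters_found = set(model.labels_) - {-1}        # the cluster set (C)
outlier_idx = np.where(model.labels_ == -1)[0]    # points left as outliers (O)
print(len(clusters_found), "clusters,", len(outlier_idx), "outliers")
```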


Author(s):  
Bharat Gupta ◽  
Durga Toshniwal

In high-dimensional data, a large number of outliers are embedded in low-dimensional subspaces; these are known as projected outliers. Most existing outlier detection techniques are unable to find these projected outliers because they detect abnormal patterns in the full data space. Outlier detection in high-dimensional data therefore becomes an important research problem. In this paper we propose an approach for outlier detection in high-dimensional data. We modify the existing SPOT approach by adding three new concepts, namely adaptation of the Sparse Subspace Template (SST), different combinations of PCS parameters, and a set of non-outlying cells for the test data set.
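SPOT's Sparse Subspace Template and PCS parameters are not reconstructed here; the snippet below is only a schematic sketch of the projected-outlier idea the paragraph describes: project the data onto a candidate low-dimensional subspace, bucket it into an equi-width grid, and flag points that land in sparsely populated cells. The grid resolution, the count threshold, and the scanned subspaces are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def sparse_cell_outliers(X, dims, bins=5, min_count=3):
    """Project X onto the subspace `dims`, bucket the projection into an
    equi-width grid, and flag points that land in sparse cells."""
    P = X[:, dims]
    edges = [np.linspace(P[:, j].min(), P[:, j].max(), bins + 1)
             for j in range(P.shape[1])]
    # Interior edges give a bin index in [0, bins-1] for every coordinate.
    cells = np.stack([np.digitize(P[:, j], edges[j][1:-1])
                      for j in range(P.shape[1])], axis=1)
    counts = {}
    for key in map(tuple, cells):
        counts[key] = counts.get(key, 0) + 1
    return np.array([counts[tuple(c)] < min_count for c in cells])

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 20))
X[0, :2] = [6.0, 6.0]                      # a projected outlier in subspace (0, 1)
for dims in combinations(range(4), 2):     # scan a few candidate 2-d subspaces
    flagged = np.where(sparse_cell_outliers(X, list(dims)))[0]
    print(dims, flagged[:10])
```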


Author(s):  
Anusha L. ◽  
Nagaraja G S

Artificial intelligence (AI) is the science that allows computers to replicate human intelligence in areas such as decision-making, text processing, and visual perception. Artificial intelligence is the broader field that contains several subfields such as machine learning, robotics, and computer vision. Machine learning is a branch of artificial intelligence that allows a machine to learn and improve at a task over time. Deep learning is a subset of machine learning that makes use of deep artificial neural networks for training. This paper proposes outlier detection for multivariate high-dimensional data using an unsupervised autoencoder model.
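The abstract does not specify the network, so the following is only a minimal sketch of reconstruction-error scoring with an autoencoder-style model; scikit-learn's MLPRegressor trained to reproduce its own input stands in for a deep autoencoder, and the bottleneck width, training settings, and 99th-percentile threshold are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 50))
X[:5] += 8.0                                     # plant a few gross anomalies

Xs = StandardScaler().fit_transform(X)

# An MLP trained to reproduce its input acts as a shallow autoencoder;
# the narrow middle layer is the bottleneck.
ae = MLPRegressor(hidden_layer_sizes=(25, 8, 25), max_iter=2000, random_state=0)
ae.fit(Xs, Xs)

recon = ae.predict(Xs)
errors = np.mean((Xs - recon) ** 2, axis=1)      # per-point reconstruction error
threshold = np.percentile(errors, 99)            # flag the top 1% as outliers
print(np.where(errors > threshold)[0])
```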


Author(s):  
Frank Klawonn ◽  
Frank Rehm

For many applications in knowledge discovery in databases, finding outliers, i.e. rare events, is important. Outliers are observations that deviate significantly from the rest of the data, so that they appear to be generated by another process (Hawkins, 1980). Such outlier objects often contain information about an untypical behavior of the system. However, outliers bias the results of many data mining methods, such as the mean value, the standard deviation, or the positions of the prototypes of k-means clustering (Estivill-Castro, 2004; Keller, 2000). Therefore, identifying outliers is a crucial step before further analysis or processing of the data is carried out with more sophisticated data mining techniques. Usually, data objects are considered outliers when they occur in a region of extremely low data density. Many clustering techniques that deal with noisy data and can identify outliers, like possibilistic clustering (PCM) (Krishnapuram & Keller, 1993, 1996) or noise clustering (NC) (Dave, 1991; Dave & Krishnapuram, 1997), need good initializations or suffer from a lack of adaptability to different cluster sizes (Rehm, Klawonn & Kruse, 2007). Distance-based approaches (Knorr, 1998; Knorr, Ng & Tucakov, 2000) take a global view of the data set and can hardly treat data sets containing regions of different data density (Breunig, Kriegel, Ng & Sander, 2000). In this work we present an approach that combines a fuzzy clustering algorithm (Höppner, Klawonn, Kruse & Runkler, 1999), or any other prototype-based clustering algorithm, with statistical distribution-based outlier detection.
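A hedged sketch of the general combination described above: cluster with a prototype-based method (plain k-means stands in here for the fuzzy clustering the authors use), then apply a simple distributional rule to the distances between each point and its nearest prototype and flag the extreme tail. The three-standard-deviation cut-off and the toy data are illustrative assumptions, not the authors' exact criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(6, 1, size=(200, 2)),
               [[3.0, 12.0]]])                    # one planted outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from every point to its nearest prototype (cluster centre).
d = np.min(np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2),
           axis=1)

# Distribution-based flagging: points beyond mean + 3 standard deviations.
cutoff = d.mean() + 3 * d.std()
print(np.where(d > cutoff)[0])                    # expected to include index 400
```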


Author(s):  
Usman Ahmed ◽  
Jerry Chun-Wei Lin ◽  
Gautam Srivastava ◽  
Hsing-Chung Chen

Finding frequent patterns identifies the most important patterns in data sets. Due to the huge and high-dimensional nature of transactional data, classical pattern mining techniques suffer from the limitations of dimensionality and data annotation. Data mining that preserves privacy has been considered an important research area in recent decades. Information privacy is a tradeoff that must be considered when using data. Over many years, privacy-preserving data mining (PPDM) has made use of methods that are mostly based on heuristics, with the deletion operation used to hide sensitive information. In this study, we use deep active learning to hide sensitive operations and protect private information. This paper combines entropy-based active learning with an attention-based approach to effectively detect sensitive patterns. The constructed models are then validated on high-dimensional transactional data with attention-based and active learning methods in a reinforcement environment. The results show that the proposed model can support and improve the decision boundaries by increasing the number of training instances through the use of a pooling technique and an entropy uncertainty measure. The proposed paradigm achieves sanitization by hiding sensitive items while leaving non-sensitive items untouched. The model outperforms greedy, genetic, and particle swarm optimization approaches.
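The attention model and hiding strategy are beyond a short example, so the snippet below only sketches the entropy-based, pool-based active-learning loop the abstract mentions: the learner repeatedly queries the unlabelled examples whose predicted class distribution has the highest entropy. The classifier, pool sizes, and number of rounds are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
labelled = list(range(20))                       # small initial labelled set
pool = [i for i in range(len(X)) if i not in labelled]

clf = LogisticRegression(max_iter=1000)
for _ in range(5):                               # five active-learning rounds
    clf.fit(X[labelled], y[labelled])
    proba = clf.predict_proba(X[pool])
    # Entropy of the predicted class distribution is the uncertainty measure.
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    query = [pool[i] for i in np.argsort(entropy)[-10:]]   # 10 most uncertain
    labelled.extend(query)                       # oracle labels simulated by y
    pool = [i for i in pool if i not in query]

print("final training-set size:", len(labelled))
```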


Author(s):  
S Imavathy ◽  
M. Chinnadurai

Nowadays, pattern recognition is a major challenge in the field of data mining. Researchers focus on using data mining for a wide variety of applications such as market basket analysis, advertising, and the medical field. Conventional algorithms work on a transactional database, which is based on daily object usage and/or the performance of patients. The proposed research work uses a sequential pattern mining approach with a classification technique based on the Threshold-based Support Vector Machine (T-SVM) algorithm. The pattern mining supplies the variables of interest to the user through a statistical model. The proposed work is applied to analyse gene sequence datasets. Further, the T-SVM technique is used to classify the dataset based on the sequential pattern mining approach. In particular, the threshold-based model is used to predict the upcoming state of interest from sequential patterns, because this gives a deeper understanding of the sequential input data and classifies the result by providing threshold values. The proposed method is therefore more efficient than the conventional method in terms of the achievable classification accuracy, precision, false positive rate, and true positive rate, and it also reduces operating time. The proposed model is implemented in MATLAB R2018a.
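The exact T-SVM formulation is not given in the abstract, so the sketch below only shows the generic idea of thresholding an SVM decision function instead of using the default zero cut-off, and then reading off true and false positive rates. The synthetic features (standing in for sequence-derived features) and the threshold value are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in features; in the paper these would be derived from gene sequences.
X, y = make_classification(n_samples=600, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

svm = SVC(kernel="rbf").fit(X_tr, y_tr)
scores = svm.decision_function(X_te)

threshold = 0.25                        # illustrative threshold, not the paper's value
y_pred = (scores > threshold).astype(int)

tp = np.sum((y_pred == 1) & (y_te == 1))
fp = np.sum((y_pred == 1) & (y_te == 0))
tpr = tp / np.sum(y_te == 1)            # true positive rate
fpr = fp / np.sum(y_te == 0)            # false positive rate
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}")
```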


2015 ◽  
Vol 3 (2) ◽  
Author(s):  
S. Vijayarani Mohan

A data stream is a real-time, continuous, structured sequence of data items. Mining a data stream is the process of extracting knowledge from continuously arriving, rapid data records. Because data can arrive fast and continuously, performing the mining process is very difficult. Normally, stream mining algorithms are designed to scan the database only once, and extracting knowledge from the database in a single scan is a complicated task. Data streams pose a computational challenge to data mining problems because of the additional algorithmic constraints created by the large volume of data. Popular data mining techniques, namely clustering, classification, and frequent pattern mining, are applied to data streams to extract knowledge. This research work mainly concentrates on how to predict the valuable items found in the transactional data of a data stream. In the literature, most researchers have discussed how frequent items are mined from data streams; this research work helps to predict the valuable items in transactional data. Frequent item mining is defined as finding the items that occur frequently, i.e. items whose occurrence exceeds a given threshold. Valuable item mining is finding the costliest or most valuable items in a database. Predicting this information helps businesses learn the sales details of the valuable items, which guide crucial decisions such as catalogue design, cross-promotion, end-user shopping, and performance scrutiny. In this research work, a new algorithm named VIM (Valuable Item Mining) is proposed for finding the valuable items in data streams. The performance of this algorithm is analysed using the number of valuable items discovered and the execution time.
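The VIM algorithm itself is not reproduced; the toy single-pass sketch below only illustrates the distinction the abstract draws: frequent items are those whose count crosses a support threshold, while valuable items are ranked by the revenue (price times count) they contribute. The transactions, prices, and thresholds are all assumptions.

```python
from collections import Counter

# A toy transactional stream; each transaction is a set of item ids.
stream = [
    {"milk", "bread"}, {"laptop"}, {"milk", "butter"},
    {"bread", "milk"}, {"laptop", "mouse"}, {"milk"},
]
prices = {"milk": 1.5, "bread": 2.0, "butter": 3.0, "laptop": 900.0, "mouse": 20.0}

counts = Counter()
for transaction in stream:          # single scan, as stream mining requires
    counts.update(transaction)

min_support = 3
frequent = {item for item, c in counts.items() if c >= min_support}

# Valuable items: ranked by contributed revenue rather than raw frequency.
value = {item: prices[item] * c for item, c in counts.items()}
valuable = sorted(value, key=value.get, reverse=True)[:3]

print("frequent:", frequent)        # {'milk'}  (appears 4 times)
print("most valuable:", valuable)   # ['laptop', ...]
```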


2019 ◽  
pp. 1-3
Author(s):  
Hiren R. Kavathiya ◽  
G. C. Bhimani

Outlier detection is presented in detail in chapter 1. Finding outliers in high-dimensional datasets is a challenging data mining task, and different perspectives can be used to define the notion of an outlier. Hawkins et al. (2002) define an outlier as "an observation which deviates so much from other observations as to create suspicions that it was generated by a different mechanism", while Barnett and Lewis (1994) define it as "an observation (or subset of observations) which appears to be inconsistent with the remainder of that dataset".

