Restraint Method Research for Coupling Random Error Based on High Dimensional Data Set Multiscale Analysis

Anomaly detection has recently attracted growing attention from data mining researchers as its use has spread across practical domains such as product marketing, fraud detection, medical diagnosis, and fault detection. Outlier detection in high dimensional data poses particular challenges because of the curse of dimensionality, under which distant and neighbouring points come to resemble one another. Traditional outlier detection algorithms were tested on the full feature space. They concentrate largely on low dimensional data and are therefore ineffective at discovering anomalies in data sets with a large number of dimensions. Finding anomalies in a high dimensional data set becomes very difficult and tedious when all subsets of projections need to be explored. All data points in high dimensional data behave like similar observations because of an intrinsic property: the relative difference between the distances to the nearest and the farthest observation approaches zero as the number of dimensions tends to infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings in well established density-based techniques. It is presented as a state-of-the-art technique because it opens a new direction of research on the inherent problems of high dimensional data in which outliers reside within clusters of different densities. A high dimensional data set from the UCI Machine Learning Repository is used to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
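
The loss of contrast between near and far points in high dimensions can be illustrated with a small experiment. The sketch below is not the technique proposed in the paper; it only assumes uniformly random points in a unit hypercube and a hypothetical helper `relative_contrast` that measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for distances from one random
    query point to n_points uniform random points in [0, 1]^n_dims."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


# The printed contrast shrinks as the dimension grows.
for dims in (2, 20, 200, 1000):
    print(dims, round(relative_contrast(500, dims), 3))
```

A shrinking contrast means that a fixed distance threshold separates "near" from "far" points less and less well, which is one reason full-space distance and density methods struggle on such data.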

Data segmentation based on the local intrinsic dimension

Scientific Reports ◽

10.1038/s41598-020-72222-0 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Michele Allegra ◽

Elena Facco ◽

Francesco Denti ◽

Alessandro Laio ◽

Antonietta Mira

Keyword(s):

High Dimensional Data ◽

Large Data ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Imaging Data ◽

Unsupervised Segmentation ◽

Real World Data ◽

Data Set ◽

Intrinsic Dimension

One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
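
To make the notion of intrinsic dimension concrete, the sketch below estimates a single global ID from nearest-neighbour distances. The abstract does not say which estimator the authors use, so the choice here is an assumption: a two-nearest-neighbour (TwoNN) style maximum-likelihood estimate, in which the ratio of second to first neighbour distance is treated as Pareto distributed with shape equal to the ID. The function name `two_nn_id` and the test data are hypothetical, and the segmentation into regions with different local IDs is not reproduced.

```python
import math
import random


def two_nn_id(points):
    """Estimate the intrinsic dimension as N / sum(log(r2 / r1)), where r1
    and r2 are each point's first and second nearest-neighbour distances."""
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0:  # skip exact duplicates, whose ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


# Hypothetical data: 400 points on a flat 2-D sheet embedded in 3-D space.
rng = random.Random(0)
sheet = []
for _ in range(400):
    u, v = rng.random(), rng.random()
    sheet.append((u, v, u + v))

print(round(two_nn_id(sheet), 2))
```

Although each point has three coordinates, the estimate should land near 2, the dimension of the sheet. The brute-force neighbour search is quadratic in the number of points and is only meant for small illustrations.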

Outlier Detection Algorithm Basing on Similarity Measurement Relation

Advanced Engineering Forum ◽

10.4028/www.scientific.net/aef.6-7.621 ◽

2012 ◽

Vol 6-7 ◽

pp. 621-624

Author(s):

Hong Bin Fang

Keyword(s):

Outlier Detection ◽

Credit Card ◽

High Dimensional Data ◽

Detection Algorithm ◽

Experimental Result ◽

Similarity Measurement ◽

High Dimensional ◽

Data Set ◽

Network Intrusion ◽

Metric Function

Outlier detection is an important field of data mining and is widely used in credit card fraud detection, network intrusion detection, etc. The paper gives a similarity metric function for high dimensional data and the concept of class density, based on a combination of hierarchical clustering and similarity. After redefining density outliers in high dimensions, it presents an outlier detection algorithm based on similarity measurement. The experimental results suggest that the algorithm has some value for outlier detection in high dimensional data sets.

How to visualize high-dimensional data: a roadmap

Journal of Data Mining & Digital Humanities ◽

10.46298/jdmdh.5594 ◽

2020 ◽

Vol Special issue on... ◽

Author(s):

Hermann Moisl

Keyword(s):

Cluster Analysis ◽

High Dimensional Data ◽

Latent Structure ◽

High Dimensional ◽

Graphical Methods ◽

Data Set ◽

The Third ◽

Historical Text ◽

International Audience ◽

And Cluster Analysis

Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than on univariate data because multivariate data provide a more complete description. Where the data are high-dimensional, however, their complexity can defy analysis using traditional philological methods. The first step in dealing with such data is to visualize them using graphical methods in order to identify any latent structure. If found, such structure facilitates formulation of hypotheses which can be tested using a range of mathematical and statistical methods. Where, however, the dimensionality is greater than 3, direct graphical investigation is impossible. The present discussion offers a roadmap of how this obstacle can be overcome, and is in three main parts: the first part presents some fundamental data concepts, the second describes an example corpus and a high-dimensional data set derived from it, and the third outlines two approaches to visualization of that data set: dimensionality reduction and cluster analysis.

A System for Outlier Detection of High Dimensional Data

International Journal of Computer Science and Informatics ◽

10.47893/ijcsi.2012.1037 ◽

2012 ◽

pp. 197-201

Author(s):

Bharat Gupta ◽

Durga Toshniwal

Keyword(s):

Outlier Detection ◽

High Dimensional Data ◽

Research Problem ◽

High Dimensional ◽

Full Data ◽

Data Set ◽

Detection Techniques ◽

New Concepts ◽

Low Dimensional ◽

Important Research Problem

In high dimensional data, a large number of outliers are embedded in low dimensional subspaces; these are known as projected outliers. Most existing outlier detection techniques are unable to find them because they search for abnormal patterns in the full data space. Outlier detection in high dimensional data has therefore become an important research problem. In this paper we propose an approach for outlier detection in high dimensional data. We modify the existing SPOT approach by adding three new concepts: adaptation of the Sparse Sub-Space Template (SST), different combinations of PCS parameters, and a set of non-outlying cells for the testing data set.

Soft Subspace Clustering for High-Dimensional Data

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch276 ◽

2011 ◽

pp. 1810-1814

Author(s):

Liping Jing ◽

Michael K. Ng ◽

Joshua Zhexue Huang

Keyword(s):

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional ◽

Special Treatment ◽

Clustering Methods ◽

Real World Data ◽

Text Data ◽

Data Set ◽

Dna Microarray Data ◽

Text Document

High dimensional data arises in real-world data mining applications. Text data is a typical example. In text mining, a text document is viewed as a vector of terms whose dimension is equal to the total number of unique terms in a data set, which is usually in the thousands. High dimensional data occurs in business as well. In retail, for example, to manage supplier relationships effectively, suppliers are often categorized according to their business behaviors (Zhang, Huang, Qian, Xu, & Jing, 2006). The supplier’s behavior data is high dimensional, containing thousands of attributes that describe the supplier’s behaviors, including product items, ordered amounts, order frequencies, product quality and so forth. One more example is DNA microarray data. Clustering high-dimensional data requires special treatment (Swanson, 1990; Jain, Murty, & Flynn, 1999; Cai, He, & Han, 2005; Kontaki, Papadopoulos & Manolopoulos, 2007), although various methods for clustering are available (Jain & Dubes, 1988). One type of clustering method for high dimensional data is referred to as subspace clustering, which aims at finding clusters in subspaces instead of the entire data space. In subspace clustering, each cluster is a set of objects identified by a subset of dimensions, and different clusters are represented in different subsets of dimensions. Soft subspace clustering considers that different dimensions make different contributions to the identification of objects in a cluster. It represents the importance of a dimension as a weight that can be treated as the degree to which the dimension contributes to the cluster. Soft subspace clustering can find the cluster memberships of objects and identify the subspace of each cluster in the same clustering process.
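
The idea of per-cluster dimension weights can be sketched as follows. The abstract does not give a weighting formula, so the one used here is an assumption: an entropy-style rule in which a dimension's weight decays exponentially with the cluster's dispersion along it, normalized so the weights sum to one. The names `entropy_weights`, `weighted_sq_distance` and the parameter `gamma` are hypothetical.

```python
import math


def entropy_weights(cluster_points, gamma=1.0):
    """Return one weight per dimension for a single cluster.

    Dimensions along which the cluster is tight get larger weights;
    gamma controls how sharply the weights concentrate."""
    n = len(cluster_points)
    n_dims = len(cluster_points[0])
    center = [sum(p[j] for p in cluster_points) / n for j in range(n_dims)]
    dispersion = [
        sum((p[j] - center[j]) ** 2 for p in cluster_points)
        for j in range(n_dims)
    ]
    # Subtract the smallest dispersion before exponentiating to avoid
    # underflow; the shift cancels in the normalization.
    d_min = min(dispersion)
    raw = [math.exp(-(d - d_min) / gamma) for d in dispersion]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_sq_distance(x, center, weights):
    """Squared distance in which each dimension counts by its weight."""
    return sum(w * (a - c) ** 2 for w, a, c in zip(weights, x, center))


# Hypothetical cluster: tight along dimension 0, spread along dimension 1.
cluster = [(1.0, 0.0), (1.1, 5.0), (0.9, 10.0)]
weights = entropy_weights(cluster)
print(weights)
```

In a full soft subspace clustering loop, such weights would be recomputed for every cluster at each iteration and used in the assignment step through `weighted_sq_distance`, so memberships and subspaces are updated together.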

Robust estimation of the mean vector for high-dimensional data set using robust clustering

Journal of Applied Statistics ◽

10.1080/02664763.2014.999030 ◽

2015 ◽

Vol 42 (6) ◽

pp. 1183-1205 ◽

Cited By ~ 1

Author(s):

Hamid Shahriari ◽

Orod Ahmadi

Keyword(s):

Robust Estimation ◽

High Dimensional Data ◽

High Dimensional ◽

Data Set ◽

Robust Clustering ◽

The Mean ◽

Mean Vector

A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data

Computational and Mathematical Methods in Medicine ◽

10.1155/2017/7907163 ◽

2017 ◽

Vol 2017 ◽

pp. 1-18 ◽

Cited By ~ 5

Author(s):

Andrea Bommert ◽

Jörg Rahnenführer ◽

Michel Lang

Keyword(s):

Feature Selection ◽

Predictive Model ◽

Predictive Accuracy ◽

Pearson Correlation ◽

High Dimensional Data ◽

High Dimensional ◽

Sparse Models ◽

Data Set ◽

The Stability ◽

Selection Of

Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is not only important to find a model with high predictive accuracy, but it is also important that this model uses only a few features and that the selection of these features is stable. This is because, in bioinformatics, the models are used not only for prediction but also for drawing biological conclusions, which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best for evaluating the stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. Also, we find that for the stability assessment behaviour it is most important that a measure contains a correction for chance or large numbers of chosen features. Then, we analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.
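
A Pearson-correlation stability score can be sketched as below. The abstract does not spell out the exact definition, so this is an assumption: each feature selection is encoded as a 0/1 indicator vector over all features, and stability is the mean Pearson correlation over all pairs of selections, for example from different resampling runs. The function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def selection_stability(selections, n_features):
    """Mean pairwise Pearson correlation between 0/1 indicator vectors
    of the selected feature sets (indices in range(n_features))."""
    indicators = [
        [1 if f in chosen else 0 for f in range(n_features)]
        for chosen in selections
    ]
    pairs = list(combinations(indicators, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three resampling runs over 8 features.
runs = [{0, 1, 2}, {0, 1, 3}, {0, 2, 3}]
print(round(selection_stability(runs, 8), 3))
```

Identical selections give a score of 1, and the score falls as the runs agree less. The score is undefined when a run selects all features or none, which the sketch reports as an error rather than guessing a value.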

An adaptive classifier design for high-dimensional data analysis with a limited training data set

IEEE Transactions on Geoscience and Remote Sensing ◽

10.1109/36.975001 ◽

2001 ◽

Vol 39 (12) ◽

pp. 2664-2679 ◽

Cited By ~ 128

Author(s):

Q. Jackson ◽

D.A. Landgrebe

Keyword(s):

Data Analysis ◽

High Dimensional Data ◽

Training Data ◽

High Dimensional ◽

Data Set ◽

Classifier Design ◽

High Dimensional Data Analysis

Visualization of Very Large High-Dimensional Data Sets as Minimum Spanning Trees

10.26434/chemrxiv.9698861.v1 ◽

2019 ◽

Author(s):

Daniel Probst ◽

Jean-Louis Reymond

Keyword(s):

Data Visualization ◽

Particle Physics ◽

Cancer Biology ◽

Spanning Trees ◽

Minimum Spanning Tree ◽

High Dimensional Data ◽

Locality Sensitive Hashing ◽

High Dimensional ◽

Data Sets ◽

Data Set

Here, we introduce a new data visualization and exploration method, TMAP (tree-map), which exploits locality sensitive hashing, Kruskal’s minimum-spanning-tree algorithm, and a multilevel multipole-based graph layout algorithm to represent large and high dimensional data sets as a tree structure, which is readily understandable and explorable. Compared to other data visualization methods such as t-SNE or UMAP, TMAP increases the size of data sets that can be visualized due to its significantly lower memory requirements and running time and should find broad applicability in the age of big data. We exemplify TMAP in the area of cheminformatics with interactive maps for 1.16 million drug-like molecules from ChEMBL, 10.1 million small molecule fragments from FDB17, and 131 thousand 3D-structures of biomolecules from the PDB Databank, and to visualize data from literature (GUTENBERG data set), cancer biology (PANSCAN data set) and particle physics (MiniBooNE data set). TMAP is available as a Python package. Installation, usage instructions and application examples can be found at http://tmap.gdb.tools.
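
Of the three stages named in the abstract, the minimum-spanning-tree step is the easiest to show in isolation. The sketch below is a generic Kruskal implementation with a union-find structure, not code from the TMAP package; the edge list is hypothetical and stands in for the neighbour graph that TMAP would obtain through locality sensitive hashing.

```python
def kruskal(n_nodes, edges):
    """Return the edges of a minimum spanning forest.

    edges is an iterable of (weight, u, v) tuples with node indices
    in range(n_nodes); the result is a list of (u, v, weight)."""
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


# Hypothetical neighbour graph on four points.
edges = [(1.0, 0, 1), (2.0, 1, 2), (3.0, 0, 2), (1.5, 2, 3)]
for u, v, weight in kruskal(4, edges):
    print(u, v, weight)
```

Because the input is a sparse neighbour graph rather than all pairwise distances, sorting the edges dominates the cost, which fits the abstract's emphasis on low memory use and running time. If the graph is disconnected, the function returns a spanning forest rather than a single tree.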
