How to visualize high-dimensional data: a roadmap

2020 ◽  
Vol Special issue on... ◽  
Author(s):  
Hermann Moisl

Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than univariate data, because multivariate data provide a more complete description. Where the data are high-dimensional, however, their complexity can defy analysis using traditional philological methods. The first step in dealing with such data is to visualize them using graphical methods in order to identify any latent structure. If found, such structure facilitates the formulation of hypotheses which can be tested using a range of mathematical and statistical methods. Where the dimensionality is greater than 3, however, direct graphical investigation is impossible. The present discussion presents a roadmap for overcoming this obstacle, in three main parts: the first presents some fundamental data concepts, the second describes an example corpus and a high-dimensional data set derived from it, and the third outlines two approaches to visualizing that data set: dimensionality reduction and cluster analysis.
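The dimensionality-reduction step described in the abstract can be sketched with principal component analysis, one common choice (the article surveys the options; this toy example and its random feature matrix are illustrative assumptions, not the author's corpus data):

```python
import numpy as np

def pca_2d(X):
    """Project rows of X (n samples x d features) onto the two
    directions of greatest variance, for 2-D scatter plotting."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    cov = np.cov(Xc, rowvar=False)           # d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    top2 = vecs[:, -2:][:, ::-1]             # two largest components
    return Xc @ top2                         # n x 2 plot coordinates

# toy feature matrix standing in for, e.g., word frequencies per text
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
coords = pca_2d(X)
print(coords.shape)   # (6, 2)
```

The resulting two columns can be fed straight to a scatter plot; latent grouping in the corpus, if any, shows up as visible clusters.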

2016 ◽  
Vol 18 (5) ◽  
pp. 98-107 ◽  
Author(s):  
Renato R.O. da Silva ◽  
Paulo E. Rauber ◽  
Alexandru C. Telea

2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Anomaly detection has recently attracted considerable attention from data mining researchers, and its prominence has grown steadily across practical domains such as product marketing, fraud detection, medical diagnosis, and fault detection. High-dimensional data subjected to outlier detection poses exceptional challenges because of the curse of dimensionality and the growing resemblance between distant and adjacent points. Traditional algorithms and techniques perform outlier detection over the full feature space: they concentrate largely on low-dimensional data and therefore prove ineffective at discovering anomalies in data sets with a high number of dimensions. Digging out the anomalies present in a high-dimensional data set becomes a difficult and tiresome job when all subsets of projections must be explored. All data points in high-dimensional data behave like similar observations because of an intrinsic property of such data: the distance between observations approaches zero as the number of dimensions tends to infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. The technique opens a new breadth of research towards resolving the inherent problems of high-dimensional data, where outliers reside within clusters of differing density. A high-dimensional data set from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
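As a concrete reference point for the density-based techniques the abstract builds on, here is a minimal Local Outlier Factor computation (a standard density-based method; this simplified brute-force implementation and the toy data are assumptions for illustration, not the paper's algorithm):

```python
import numpy as np

def lof_scores(X, k=3):
    """Simplified Local Outlier Factor: scores well above 1 flag
    points lying in sparser regions than their own neighbours.
    Brute-force distance matrix; fine for small data sets."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn = np.argsort(D, axis=1)[:, :k]              # k nearest neighbours
    kdist = D[np.arange(n)[:, None], knn][:, -1]    # distance to k-th NN
    # reachability distance, then local reachability density
    reach = np.maximum(D[np.arange(n)[:, None], knn], kdist[knn])
    lrd = k / reach.sum(axis=1)
    return lrd[knn].mean(axis=1) / lrd              # LOF score per point

X = np.vstack([np.random.default_rng(1).normal(size=(20, 4)),
               [[8.0, 8.0, 8.0, 8.0]]])             # one injected far-off point
scores = lof_scores(X)
print(scores.argmax())  # 20: the injected outlier
```

As the abstract notes, a plain density score like this degrades as dimensionality grows, which is the gap the proposed deviation-based embedding aims to fill.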


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

Abstract: One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
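The intrinsic-dimension idea can be illustrated with a TWO-NN-style estimator (a method associated with part of the same group; this global-ID sketch is an assumption and omits the paper's local segmentation step):

```python
import numpy as np

def two_nn_id(X):
    """TWO-NN intrinsic-dimension estimate: the ratio mu = r2/r1 of
    second- to first-neighbour distances follows a Pareto law whose
    exponent is the ID, giving the maximum-likelihood estimate below."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    D.sort(axis=1)                           # each row ascending
    mu = D[:, 1] / D[:, 0]                   # r2 / r1 per point
    return len(X) / np.log(mu).sum()         # MLE of the ID

# points on a 2-D plane embedded in 10-D: the estimate should be near 2
rng = np.random.default_rng(0)
X = np.zeros((1000, 10))
X[:, :2] = rng.uniform(size=(1000, 2))
print(two_nn_id(X))   # close to 2
```

The paper's contribution is to make such an estimate local, so that regions of the same data set with different IDs can be told apart and segmented.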


2012 ◽  
Vol 6-7 ◽  
pp. 621-624
Author(s):  
Hong Bin Fang

Outlier detection is an important field of data mining, widely used in credit card fraud detection, network intrusion detection, etc. This paper presents a similarity metric function for high-dimensional data and the concept of class density, based on a combination of hierarchical clustering and similarity; after redefining density outliers for high dimensions, an outlier detection algorithm built on this similarity measure is presented. Experimental results indicate that the algorithm has value for outlier detection in high-dimensional data sets.
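The combination of hierarchical clustering with a density notion can be sketched as follows; this toy single-linkage grouping with a small-cluster rule is an illustrative stand-in, not the paper's algorithm:

```python
import numpy as np

def density_outliers(X, eps=1.5, min_size=3):
    """Single-linkage grouping at distance cutoff eps (a stand-in for
    a hierarchical-clustering cut): points whose connected component
    has fewer than min_size members are flagged as low-density outliers."""
    n = len(X)
    parent = list(range(n))
    def find(i):                              # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for i in range(n):
        for j in range(i + 1, n):
            if D[i, j] <= eps:                # merge components within eps
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    sizes = {r: roots.count(r) for r in set(roots)}
    return [i for i in range(n) if sizes[roots[i]] < min_size]

X = np.vstack([np.random.default_rng(2).uniform(size=(15, 2)),
               [[9.0, 9.0]]])                 # one isolated far-off point
print(density_outliers(X))  # [15]
```

Cutting a single-linkage dendrogram at height eps yields exactly these connected components, which is why the sketch mirrors a hierarchical-clustering step.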


Geophysics ◽  
1996 ◽  
Vol 61 (4) ◽  
pp. 1209-1227 ◽  
Author(s):  
Don W. Vasco ◽  
John E. Peterson ◽  
Ernest L. Majer

We examine the nonlinear aspects of seismic traveltime tomography. This is accomplished by completing an extensive set of conjugate gradient inversions on a parallel virtual machine, with each initiated by a different starting model. The goal is an exploratory analysis of a set of conjugate gradient solutions to the traveltime tomography problem. We find that distinct local minima are generated when prior constraints are imposed on traveltime tomographic inverse problems. Methods from cluster analysis determine the number and location of the isolated solutions to the traveltime tomography problem. We apply the cluster analysis techniques to a cross‐borehole traveltime data set gathered at the Gypsy Pilot Site in Pawnee County, Oklahoma. We find that the 1075 final models, satisfying the traveltime data and a model norm penalty, form up to 61 separate solutions. All solutions appear to contain a central low velocity zone bounded above and below by higher velocity layers. Such a structure agrees with well‐logs, hydrological well tests, and a previous seismic inversion.
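Grouping many inversion results to count distinct local minima can be sketched with plain k-means on the model vectors (an illustrative assumption; the paper's cluster-analysis machinery for the 1075 models is more elaborate):

```python
import numpy as np

def kmeans(models, k, iters=50):
    """Plain k-means over solution models (one row per inversion
    result): the number of well-separated clusters suggests how many
    distinct local minima the inversions converged to."""
    # farthest-point initialisation: deterministic and well spread
    centers = [models[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(models - c, axis=1) for c in centers],
                   axis=0)
        centers.append(models[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(
            models[:, None, :] - centers[None, :, :], axis=-1), axis=1)
        for c in range(k):
            if np.any(labels == c):           # recompute cluster means
                centers[c] = models[labels == c].mean(axis=0)
    return labels

# two synthetic "solution families" of 8-parameter velocity models
rng = np.random.default_rng(3)
A = rng.normal(0.0, 0.1, size=(30, 8))
B = rng.normal(5.0, 0.1, size=(30, 8))
labels = kmeans(np.vstack([A, B]), k=2)
print(labels[0] != labels[30])  # True: the two families separate
```

In practice the number of clusters k would itself be chosen by inspecting the data, e.g. with a dendrogram or a gap statistic, rather than fixed in advance.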


2007 ◽  
Vol 227 (3) ◽  
Author(s):  
Wolf Dieter Heinbach ◽  
Stefanie Schröpfer

Summary: The introduction of opening clauses in collective wage agreements, allowing firms to deviate from their collective bargaining agreements, has become widely accepted over the last fifteen years. With respect to the flexibility agreed through collective bargaining, the differences between single collective bargaining areas of the same industry have increased. Hence, the economic notion of uniform industry-wide central collective bargaining agreements is no longer tenable. The IAW data set used in this article provides differentiated information about opening clauses in collective wage agreements. By means of correspondence and cluster analysis, seven groups of collective bargaining areas are identified, which differ in the type of opening clauses introduced. Over the period from 1991 to 2004, the examination of dynamic aspects of these seven groups reveals typical paths of development towards greater flexibility agreed through collective bargaining. Furthermore, linking the data set with the German Structure of Earnings Survey of 1995 and 2001 makes it possible to show the relevance of the different types of single collective bargaining areas for employment and industries in the German manufacturing sector.


Author(s):  
Bharat Gupta ◽  
Durga Toshniwal

In high-dimensional data, many outliers are embedded in low-dimensional subspaces; these are known as projected outliers. Most existing outlier detection techniques are unable to find projected outliers because they detect abnormal patterns in the full data space, making outlier detection in high-dimensional data an important research problem. In this paper we propose an approach for outlier detection in high-dimensional data. We modify the existing SPOT approach by adding three new concepts: adaptation of the Sparse Subspace Template (SST), different combinations of PCS parameters, and a set of non-outlying cells for the test data set.
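The sparse-subspace intuition behind SPOT-style detection can be sketched by flagging points that land in near-empty cells of low-dimensional projections (a toy illustration under assumed parameters; the SST and PCS machinery is not reproduced here):

```python
import numpy as np
from itertools import combinations

def sparse_subspace_outliers(X, bins=4, threshold=1):
    """Toy sparse-subspace detector: discretise every 2-D attribute
    projection into a bins x bins grid and flag points falling in
    cells with <= threshold occupants in some projection. Such points
    are outliers visible only in a subspace, not in the full space."""
    n, d = X.shape
    flags = np.zeros(n, dtype=bool)
    for a, b in combinations(range(d), 2):
        # equal-width binning of the (a, b) projection
        ia = np.clip(((X[:, a] - X[:, a].min()) /
                      (np.ptp(X[:, a]) + 1e-12) * bins).astype(int), 0, bins - 1)
        ib = np.clip(((X[:, b] - X[:, b].min()) /
                      (np.ptp(X[:, b]) + 1e-12) * bins).astype(int), 0, bins - 1)
        counts = np.zeros((bins, bins), dtype=int)
        np.add.at(counts, (ia, ib), 1)        # occupancy of each grid cell
        flags |= counts[ia, ib] <= threshold  # flag sparse-cell occupants
    return np.where(flags)[0]

rng = np.random.default_rng(4)
X = np.vstack([rng.uniform(size=(40, 3)),
               [[2.0, 2.0, 0.5]]])            # outlier in the (0, 1) projection
idx = sparse_subspace_outliers(X)
print(40 in idx)  # True
```

Exhaustively enumerating all attribute subsets is exponential in d, which is exactly why SPOT-like methods restrict the search with templates of candidate sparse subspaces.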

