Cluster Appearance Glyphs: A Methodology for Illustrating High-Dimensional Data Patterns in 2-D Data Layouts

Information ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 3
Author(s):  
Jenny Hyunjung Lee ◽  
Darius Coelho ◽  
Klaus Mueller

Two-dimensional space embeddings such as Multi-Dimensional Scaling (MDS) are a popular means to gain insight into high-dimensional data relationships. However, in all but the simplest cases these embeddings suffer from significant distortions, which can lead to misinterpretations of the high-dimensional data. These distortions occur both at the global inter-cluster level and at the local intra-cluster level. The former leads to misinterpretation of the distances between the various N-D cluster populations, while the latter hampers the appreciation of their individual shapes and composition, which we call cluster appearance. The distortion of cluster appearance incurred in the 2-D embedding is unavoidable, since such low-dimensional embeddings always come at the cost of losing some of the intra-cluster variance. In this paper, we propose techniques to overcome these limitations by conveying the N-D cluster appearance via a framework inspired by illustrative design. Here we make use of Scagnostics, which offers a set of intuitive feature descriptors for describing the appearance of 2-D scatterplots. We extend the Scagnostics analysis to N-D and then devise, and test via crowd-sourced user studies, a set of parameterizable texture patterns that map to the various Scagnostics descriptors. Finally, we embed these N-D Scagnostics-informed texture patterns into shapes derived from N-D statistics to yield what we call Cluster Appearance Glyphs. We demonstrate our framework with a dataset acquired to analyze program execution times in file systems.
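To make the idea of "shapes derived from N-D statistics" concrete, here is a minimal sketch of per-cluster shape descriptors computed from the covariance spectrum of the full N-D points. These statistics (spread, elongation, effective dimensionality) are illustrative stand-ins chosen for this example, not the paper's actual N-D Scagnostics descriptors.

```python
import numpy as np

def cluster_shape_stats(points):
    """Summarize the N-D 'appearance' of one cluster with simple statistics
    that a glyph could encode. Hypothetical descriptors, not the paper's."""
    centered = points - points.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending variances
    total_var = eigvals.sum()
    return {
        "spread": float(np.sqrt(total_var)),          # overall cluster size
        "elongation": float(eigvals[0] / total_var),  # ~1/d (round) .. 1 (thin)
        # participation ratio: how many dimensions carry real variance
        "effective_dim": float(total_var**2 / np.sum(eigvals**2)),
    }
```

A glyph generator would then map such numbers to texture and outline parameters; an isotropic 4-D Gaussian cluster scores an elongation near 0.25, while stretching one axis pushes it toward 1.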

2015 ◽  
Vol 2015 ◽  
pp. 1-21
Author(s):  
Wanyi Li ◽  
Jifeng Sun

This paper proposes a novel algorithm called low dimensional space incremental learning (LDSIL) to estimate human motion in 3D from the silhouettes of multiview human motion images. The proposed algorithm takes advantage of stochastic extremum memory adaptive searching (SEMAS) and an incremental probabilistic dimension reduction model (IPDRM) to collect new high-dimensional data samples. These samples can be selected to update the mapping from the low-dimensional space to the high-dimensional space, so that incremental learning can estimate human motion from a small number of samples. Compared with three traditional algorithms, the proposed algorithm achieves good performance in disambiguating silhouettes, overcoming transient occlusion, and reducing estimation error.
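The core loop described above (keep a low-to-high mapping, collect the new high-dimensional samples it handles poorly, and refit) can be sketched generically. This is a linear least-squares stand-in invented for illustration; it does not reproduce SEMAS or IPDRM.

```python
import numpy as np

def select_informative(W, Z, X, threshold):
    """Pick high-dimensional samples X that the current linear mapping
    Z @ W (low-dim codes -> high-dim data) reconstructs poorly.
    A generic stand-in for the paper's sample-selection step."""
    err = np.linalg.norm(Z @ W - X, axis=1)
    return np.where(err > threshold)[0]

def refit_mapping(Z, X):
    """Least-squares update of the low-to-high mapping from kept samples."""
    W, *_ = np.linalg.lstsq(Z, X, rcond=None)
    return W
```

Incremental learning then alternates the two calls: refit on the current pool, select the poorly-explained newcomers, add them to the pool, and refit again.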


2020 ◽  
pp. 286-300
Author(s):  
Zeyu Sun ◽  
Xiaohui Ji

The processing of high-dimensional data is a hot research area in data mining. Due to the sparsity of high-dimensional data, there are significant differences between the high-dimensional and low-dimensional spaces, especially in how the data must be processed. Many sophisticated algorithms that work well in low-dimensional space fail to achieve the expected effect in high-dimensional space, or cannot be applied there at all. This paper therefore proposes a High-dimensional Data Aggregation Control Algorithm for Big Data (HDAC). The algorithm uses information to eliminate the dimensions that do not match the specified requirements and then applies the principal components method to analyze the remaining dimensions, using the simplest method available to minimize the computational cost of dimensionality reduction. During data aggregation, a self-adaptive data aggregation mechanism is used to reduce network delay. Finally, simulation shows that the algorithm improves node energy consumption, the data post-back rate, and data delay.
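The two-stage reduction described (eliminate dimensions that fail a requirement, then run principal components on the rest) can be sketched as follows. The variance threshold used as the elimination criterion is an assumption for this example; the abstract does not specify the exact requirement.

```python
import numpy as np

def hdac_reduce(X, min_variance, n_components):
    """Hypothetical two-stage sketch in the spirit of HDAC:
    1) drop dimensions whose variance falls below the requirement,
    2) PCA on the remaining dimensions via the covariance eigenproblem."""
    keep = X.var(axis=0) >= min_variance              # stage 1: eliminate dims
    Xk = X[:, keep] - X[:, keep].mean(axis=0)
    cov = np.cov(Xk, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]  # top principal axes
    return Xk @ eigvecs[:, order], keep
```

Filtering first keeps the eigendecomposition small, which is the point of "using the simplest method to minimize the cost of dimensionality reduction".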


Information ◽  
2021 ◽  
Vol 12 (6) ◽  
pp. 239
Author(s):  
Zonglin Tian ◽  
Xiaorui Zhai ◽  
Gijs van Steenpaal ◽  
Lingyun Yu ◽  
Evanthia Dimara ◽  
...  

Projections are well-known techniques that help the visual exploration of high-dimensional data by creating depictions thereof in a low-dimensional space. While projections that target the 2D space have been studied in detail both quantitatively and qualitatively, 3D projections are far less well understood, with authors arguing both for and against the added value of a third visual dimension. We fill this gap by first presenting a quantitative study that compares 2D and 3D projections across a rich selection of datasets, projection techniques, and quality metrics. To refine these insights, we conduct a qualitative study that compares the preferences of users exploring high-dimensional data with 2D vs. 3D projections, both with and without visual explanations. Our quantitative and qualitative findings indicate that, in general, 3D projections bring only limited added value on top of that provided by their 2D counterparts. However, certain 3D projection techniques can show more structure than their 2D counterparts and can stimulate users to further exploration. All our datasets, source code, and measurements are made public for ease of replication and extension.
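As one example of the kind of quality metric such a comparison relies on, here is a simple neighborhood-preservation score: the fraction of each point's k nearest neighbors in the original space that survive in the projection. This particular metric is chosen for illustration and is not necessarily one used in the study.

```python
import numpy as np

def knn_preservation(X_high, X_low, k=10):
    """Mean fraction of k-nearest-neighbor sets preserved by a projection.
    1.0 means every local neighborhood survives; lower means distortion."""
    def knn(A):
        d = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
        return np.argsort(d, axis=1)[:, :k]
    hi, lo = knn(X_high), knn(X_low)
    overlap = [len(set(h) & set(l)) for h, l in zip(hi, lo)]
    return float(np.mean(overlap)) / k
```

Scoring the same dataset under a 2D and a 3D projection with such a metric is the quantitative half of the comparison; the extra dimension can only ever raise this kind of score, which is why the qualitative user study is needed to judge whether the gain is worth it.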


2020 ◽  
Vol 245 ◽  
pp. 06018
Author(s):  
Ursula Laa

In physics we often encounter high-dimensional data, in the form of multivariate measurements or of models with multiple free parameters. The information encoded is increasingly explored using machine learning, but is not typically explored visually. The barrier tends to be visualising beyond 3D, but systematic approaches for this exist in the statistics literature. I use examples from particle physics and astrophysics to show how we can use the “grand tour” for such multidimensional visualisations, for example to explore grouping in high dimensions and to visually identify multivariate outliers. I then discuss the idea of projection pursuit, i.e. searching the high-dimensional space for “interesting” low-dimensional projections, and illustrate how we can detect complex associations between multiple parameters.
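The projection-pursuit idea can be sketched crudely: sample random unit directions and keep the one whose 1-D projection scores highest on an "interestingness" index. Excess kurtosis is used here as a toy non-Gaussianity index; real guided tours use richer indices and optimize over a smooth path rather than random sampling.

```python
import numpy as np

def projection_pursuit(X, n_tries=500, seed=0):
    """Toy projection pursuit: random unit directions scored by |excess
    kurtosis| of the 1-D projection. A stand-in for guided-tour indices."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    best_dir, best_score = None, -np.inf
    for _ in range(n_tries):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        p = Xc @ w
        z = (p - p.mean()) / p.std()
        score = abs(np.mean(z**4) - 3.0)     # 0 for Gaussian projections
        if score > best_score:
            best_dir, best_score = w, score
    return best_dir, best_score
```

On data with one bimodal direction buried among Gaussian noise dimensions, the best-scoring direction aligns with the bimodal axis, which is exactly the "interesting projection" the text refers to.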


2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Recently, anomaly detection has drawn a strong response from data mining researchers, as its reputation has grown steadily in practical domains such as product marketing, fraud detection, medical diagnosis, and fault detection. High-dimensional data subjected to outlier detection poses exceptional challenges for data mining experts because of the inherent curse of dimensionality and the resemblance between distant and adjoining points. Traditional algorithms and techniques perform outlier detection on the full feature space. Such customary methodologies concentrate largely on low-dimensional data and hence prove ineffective at discovering anomalies in data sets comprising a high number of dimensions. Digging out the anomalies present in a high-dimensional data set becomes very difficult and tiresome when all subsets of projections need to be explored. All data points in high-dimensional data behave like similar observations because of an intrinsic property of such spaces: the relative difference between pairwise distances approaches zero as the number of dimensions extends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It opens a new breadth of research towards resolving the inherent problems of high-dimensional data, where outliers reside within clusters having different densities. A high-dimensional dataset from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
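The distance-concentration effect the abstract appeals to is easy to demonstrate numerically: the relative contrast between the farthest and nearest neighbor of a query point collapses as dimensionality grows. This is a textbook illustration, independent of the paper's own method.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(d_max - d_min) / d_min for random uniform points: a standard measure
    of how strongly distances concentrate in high dimension."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, n_dims))
    d = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from one query point
    return (d.max() - d.min()) / d.min()
```

In 2-D the nearest point is vastly closer than the farthest; in 1000-D all points sit at nearly the same distance, which is why full-space density-based detectors lose their footing and subspace or deviation-based approaches become necessary.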


2020 ◽  
Vol 49 (3) ◽  
pp. 421-437
Author(s):  
Genggeng Liu ◽  
Lin Xie ◽  
Chi-Hua Chen

Dimensionality reduction plays an important role in the data processing of machine learning and data mining, making the processing of high-dimensional data more efficient. Dimensionality reduction extracts a low-dimensional feature representation of high-dimensional data; an effective dimensionality reduction method not only extracts most of the useful information in the original data but also removes useless noise. Dimensionality reduction methods can be applied to all types of data, especially image data. Although supervised learning methods have achieved good results in dimensionality reduction, their performance depends on the number of labeled training samples, and with the growth of information on the internet, labeling data requires more resources and becomes more difficult. Therefore, using unsupervised learning to learn data features has great research value. In this paper, an unsupervised multilayered variational auto-encoder model is studied on text data, so that mapping high-dimensional features to low-dimensional features becomes efficient while the low-dimensional features retain as much of the main information as possible. Low-dimensional features obtained by different dimensionality reduction methods are compared with the results of the variational auto-encoder (VAE), and the method improves significantly over the other comparison methods.
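The two quantities a VAE balances, reconstruction error and a KL regularizer on the latent code, can be shown in a single forward pass. This is a minimal one-layer linear sketch with made-up weight matrices, not the paper's multilayered model; it exists only to show the reparameterization trick and the two ELBO terms.

```python
import numpy as np

def vae_forward(x, W_mu, W_logvar, W_dec, rng):
    """One forward pass of a toy linear VAE: encode to a diagonal Gaussian,
    sample via the reparameterization trick, decode, return the ELBO terms."""
    mu = x @ W_mu                          # encoder mean
    logvar = x @ W_logvar                  # encoder log-variance
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps    # reparameterization trick
    x_hat = z @ W_dec                      # decoder
    recon = np.mean((x - x_hat) ** 2)      # reconstruction error
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims, mean over batch
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0, axis=1).mean()
    return recon, kl
```

Training minimizes `recon + beta * kl`; the KL term is what pushes the low-dimensional codes toward a well-behaved latent space, distinguishing a VAE from a plain auto-encoder.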


Author(s):  
Parul Agarwal ◽  
Shikha Mehta

Subspace clustering approaches cluster high-dimensional data in different subspaces, i.e., they group the data using different relevant subsets of dimensions. The technique has become very effective because distance measures turn ineffective in high-dimensional space. This chapter presents a novel evolutionary approach to bottom-up subspace clustering, SUBSPACE_DE, which is scalable to high-dimensional data. SUBSPACE_DE uses a self-adaptive DBSCAN algorithm to perform clustering on the data instances of each attribute and on maximal subspaces; the self-adaptive DBSCAN accepts its input parameters from a differential evolution algorithm. The proposed SUBSPACE_DE algorithm is tested on 14 datasets, both real and synthetic, and compared with 11 existing subspace clustering algorithms using evaluation metrics such as F1_Measure and accuracy. On a success-rate-ratio ranking, the proposed algorithm performs considerably better in both accuracy and F1_Measure, and SUBSPACE_DE also shows potential scalability on high-dimensional datasets.
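The per-attribute clustering step at the bottom of such a bottom-up pipeline can be sketched as a 1-D density clustering: sort the values of one attribute, cut where the gap exceeds eps, and keep runs that are dense enough. This is a simplified 1-D analogue written for illustration; in the chapter's method the eps/min_pts parameters would come from the differential evolution search rather than being fixed by hand.

```python
import numpy as np

def dbscan_1d(values, eps, min_pts):
    """Density clustering of one attribute: split the sorted values at gaps
    larger than eps, keep runs with at least min_pts members, and return
    clusters as lists of original sample indices."""
    values = np.asarray(values)
    order = np.argsort(values)
    v = values[order]
    cuts = np.where(np.diff(v) > eps)[0] + 1        # split at large gaps
    clusters = []
    for run in np.split(order, cuts):
        if len(run) >= min_pts:                     # dense enough to keep
            clusters.append(sorted(run.tolist()))
    return clusters
```

A bottom-up algorithm then merges attributes whose 1-D clusters overlap on the same samples, growing toward the maximal subspaces mentioned above.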


Author(s):  
Samuel Melton ◽  
Sharad Ramanathan

Abstract

Motivation: Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions.

Results: We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace.

Availability and implementation: https://github.com/smelton/SMD.

Supplementary information: Supplementary data are available at Bioinformatics online.
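The key step, averaging over discriminators trained on an ensemble of proposed cluster configurations, can be sketched with the simplest possible linear discriminator (the difference of group means). This is a lightweight stand-in written for illustration; the real implementation is at https://github.com/smelton/SMD.

```python
import numpy as np

def feature_scores(X, proposed_splits):
    """Score features by averaging the absolute weights of a mean-difference
    linear discriminator over an ensemble of proposed two-cluster labelings.
    Features that consistently separate the proposed clusters score high."""
    scores = np.zeros(X.shape[1])
    for labels in proposed_splits:
        labels = np.asarray(labels)
        w = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
        scores += np.abs(w) / np.linalg.norm(w)   # normalized discriminator
    return scores / len(proposed_splits)
```

Keeping only the top-scoring features yields the low-dimensional subspace in which clustering is then re-run, which is how the hidden linear separability becomes visible.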

