Quality-based guidance for exploratory dimensionality reduction

High-dimensional data sets containing hundreds of variables are difficult to explore, as traditional visualization methods often are unable to represent such data effectively. This is commonly addressed by employing dimensionality reduction prior to visualization. Numerous dimensionality reduction methods are available. However, few reduction approaches take the importance of several structures into account and few provide an overview of structures existing in the full high-dimensional data set. For exploratory analysis, as well as for many other tasks, several structures may be of interest. Exploration of the full high-dimensional data set without reduction may also be desirable. This paper presents flexible methods for exploratory analysis and interactive dimensionality reduction. Automated methods are employed to analyse the variables, using a range of quality metrics, providing one or more measures of ‘interestingness’ for individual variables. Through ranking, a single value of interestingness is obtained, based on several quality metrics, that is usable as a threshold for the most interesting variables. An interactive environment is presented in which the user is provided with many possibilities to explore and gain understanding of the high-dimensional data set. Guided by this, the analyst can explore the high-dimensional data set and interactively select a subset of the potentially most interesting variables, employing various methods for dimensionality reduction. The system is demonstrated through a use-case analysing data from a DNA sequence-based study of bacterial populations.
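The ranking step described in this abstract, where several quality metrics are combined into one interestingness value per variable, can be sketched as follows. This is a minimal illustration and not the authors' system: the two metrics (variance and absolute skewness), the mean-rank combination, and all function names are assumptions made for the example.

```python
from statistics import mean, pvariance


def abs_skewness(column):
    """Hypothetical quality metric: absolute skewness of one variable."""
    m = mean(column)
    var = pvariance(column, m)
    if var == 0:
        return 0.0
    return abs(mean((x - m) ** 3 for x in column) / var ** 1.5)


def rank_scores(scores):
    """Rank metric values so that 1 is the highest score (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def interestingness(columns, metrics):
    """Combine several metrics into one value per variable: the mean rank.

    columns: dict mapping variable name -> list of values.
    metrics: list of functions, each mapping a column to a number.
    A lower mean rank means a more interesting variable.
    """
    names = list(columns)
    per_metric = [rank_scores([metric(columns[n]) for n in names]) for metric in metrics]
    return {n: mean(r[i] for r in per_metric) for i, n in enumerate(names)}


def select_top(columns, metrics, k):
    """Keep the k variables with the best (lowest) combined rank."""
    combined = interestingness(columns, metrics)
    return sorted(combined, key=combined.get)[:k]
```

Calling `select_top(data, [pvariance, abs_skewness], k)` returns the names of the `k` top-ranked variables; the cut-off `k` plays the role of the threshold mentioned in the abstract.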


Unsupervised Text Feature Learning via Deep Variational Auto-encoder

Information Technology And Control ◽

10.5755/j01.itc.49.3.25918 ◽

2020 ◽

Vol 49 (3) ◽

pp. 421-437

Author(s):

Genggeng Liu ◽

Lin Xie ◽

Chi-Hua Chen

Keyword(s):

Dimensionality Reduction ◽

High Dimensional Data ◽

Image Data ◽

Original Data ◽

Feature Representation ◽

High Dimensional ◽

Learning To Learn ◽

Text Feature ◽

Reduction Methods ◽

Low Dimensional

Dimensionality reduction plays an important role in data processing for machine learning and data mining, making the processing of high-dimensional data more efficient. It extracts a low-dimensional feature representation of high-dimensional data, and an effective method not only retains most of the useful information in the original data but also removes useless noise. Dimensionality reduction methods can be applied to all types of data, especially image data. Although supervised learning has achieved good results in dimensionality reduction, its performance depends on the number of labeled training samples. As the volume of information on the internet grows, labeling data becomes more difficult and requires more resources. Therefore, using unsupervised learning to learn data features has important research value. In this paper, an unsupervised multilayered variational auto-encoder (VAE) model is studied on text data, so that mapping high-dimensional features to low-dimensional features is efficient and the low-dimensional features retain as much of the main information as possible. Low-dimensional features obtained by different dimensionality reduction methods are compared with the results of the VAE, and the VAE improves significantly over the comparison methods.


Data segmentation based on the local intrinsic dimension

Scientific Reports ◽

10.1038/s41598-020-72222-0 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Michele Allegra ◽

Elena Facco ◽

Francesco Denti ◽

Alessandro Laio ◽

Antonietta Mira

Keyword(s):

High Dimensional Data ◽

Large Data ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Imaging Data ◽

Unsupervised Segmentation ◽

Real World Data ◽

Data Set ◽

Intrinsic Dimension

One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
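To make the notion of intrinsic dimension concrete, the sketch below estimates a single, global ID from the ratio of each point's second to first nearest-neighbour distance (a nearest-neighbour ratio estimator, often called TWO-NN). This is a swapped-in, simpler technique chosen for illustration; it is not the segmentation approach of the abstract, which assigns different local IDs to different regions. The function name and the brute-force neighbour search are assumptions.

```python
import math


def two_nn_dimension(points):
    """Estimate one global intrinsic dimension from nearest-neighbour ratios.

    For each point, mu = r2 / r1 is the ratio of the distances to its second
    and first nearest neighbours. Assuming locally uniform density, mu follows
    a Pareto law with shape d, whose maximum-likelihood estimate is
    d = N / sum(log(mu)). Needs at least three distinct points.
    """
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            dist = math.dist(p, q)
            if dist < r1:
                r1, r2 = dist, r1
            elif dist < r2:
                r2 = dist
        if r1 > 0:  # skip exact duplicates, which give an undefined ratio
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)
```

For points sampled on a two-dimensional plane embedded in a five-dimensional space, the estimate should come out close to 2 rather than 5. The brute-force search is quadratic in the number of points, so a spatial index would be needed for large data sets.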


A SURVEY ON THE CURES FOR THE CURSE OF DIMENSIONALITY IN BIG DATA

Asian Journal of Pharmaceutical and Clinical Research ◽

10.22159/ajpcr.2017.v10s1.19755 ◽

2017 ◽

Vol 10 (13) ◽

pp. 355 ◽

Cited By ~ 1

Author(s):

Reshma Remesh ◽

Pattabiraman. V

Keyword(s):

Dimensionality Reduction ◽

Input Data ◽

Principal Component ◽

Kernel Principal Component Analysis ◽

High Dimensional ◽

Data Sets ◽

Learning Approaches ◽

Data Set ◽

Reduction Techniques ◽

Dimensionality Reduction Techniques

Dimensionality reduction techniques are used to reduce the complexity of analysing high-dimensional data sets. A raw input data set may have many dimensions, and analysis may be slow and lead to wrong predictions if unnecessary attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensions of the input data to obtain accurate predictions at lower cost. This paper studies different machine learning approaches used for dimensionality reduction, such as PCA, SVD, LDA, Kernel Principal Component Analysis and Artificial Neural Networks.
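As an illustration of the simplest technique in this list, the sketch below computes the first principal component of a small data set by power iteration on the sample covariance matrix. It is a minimal, assumption-laden example rather than anything taken from the survey: the function name, the fixed start vector, and the iteration count are all hypothetical choices, and a production implementation would normally use an SVD routine from a numerical library.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the leading principal component.

    rows: list of equal-length lists of numbers (one list per sample).
    The direction is found by power iteration on the sample covariance
    matrix. A start vector exactly orthogonal to the leading direction
    would not converge; a random start avoids this in practice.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Project each centered sample onto the direction: one number per sample.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores
```

The returned `scores` are the one-dimensional representation of the samples, so each `d`-dimensional row is reduced to a single coordinate along the direction of greatest variance.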


Dimensionality Reduction for Registration of High-Dimensional Data Sets

IEEE Transactions on Image Processing ◽

10.1109/tip.2013.2253480 ◽

2013 ◽

Vol 22 (8) ◽

pp. 3041-3049 ◽

Cited By ~ 8

Author(s):

Min Xu ◽

Hao Chen ◽

P. K. Varshney

Keyword(s):

Dimensionality Reduction ◽

High Dimensional Data ◽

High Dimensional ◽

Data Sets


Visualization of Very Large High-Dimensional Data Sets as Minimum Spanning Trees

10.26434/chemrxiv.9698861.v1 ◽

2019 ◽

Author(s):

Daniel Probst ◽

Jean-Louis Reymond

Keyword(s):

Data Visualization ◽

Particle Physics ◽

Cancer Biology ◽

Spanning Trees ◽

Minimum Spanning Tree ◽

High Dimensional Data ◽

Locality Sensitive Hashing ◽

High Dimensional ◽

Data Sets ◽

Data Set

Here, we introduce a new data visualization and exploration method, TMAP (tree-map), which exploits locality sensitive hashing, Kruskal’s minimum-spanning-tree algorithm, and a multilevel multipole-based graph layout algorithm to represent large and high dimensional data sets as a tree structure, which is readily understandable and explorable. Compared to other data visualization methods such as t-SNE or UMAP, TMAP increases the size of data sets that can be visualized due to its significantly lower memory requirements and running time and should find broad applicability in the age of big data. We exemplify TMAP in the area of cheminformatics with interactive maps for 1.16 million drug-like molecules from ChEMBL, 10.1 million small molecule fragments from FDB17, and 131 thousand 3D-structures of biomolecules from the PDB Databank, and to visualize data from literature (GUTENBERG data set), cancer biology (PANSCAN data set) and particle physics (MiniBooNE data set). TMAP is available as a Python package. Installation, usage instructions and application examples can be found at http://tmap.gdb.tools.
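Of the three components named in this abstract, Kruskal's minimum-spanning-tree algorithm is the easiest to show in isolation. The sketch below is a generic textbook version with a union-find structure, assuming the neighbour graph is already given as a weighted edge list; it does not use or reproduce the TMAP package's API, and the function name and edge format are assumptions.

```python
def kruskal_mst(num_nodes, edges):
    """Return the minimum-spanning-tree edges of an undirected graph.

    num_nodes: nodes are the integers 0 .. num_nodes - 1.
    edges: iterable of (weight, u, v) tuples.
    If the graph is disconnected, the result is a spanning forest.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):  # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components: keep it
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree
```

In a pipeline like the one described, the edge list would come from an approximate nearest-neighbour search, and the resulting tree would then be handed to a graph layout algorithm for drawing.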


A Similar Distribution Discriminant Analysis with Orthogonal and Nearly Statistically Uncorrelated Characteristics

Mathematical Problems in Engineering ◽

10.1155/2019/3145973 ◽

2019 ◽

Vol 2019 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Zhibo Guo ◽

Ying Zhang

Keyword(s):

Discriminant Analysis ◽

Dimensionality Reduction ◽

Linear Discriminant Analysis ◽

High Dimensional Data ◽

Sensor Data ◽

High Dimensional ◽

Similar Distribution ◽

Face Database ◽

Linear Discriminant ◽

Reduction Methods

It is very difficult to process and analyze high-dimensional data directly. Therefore, it is necessary to learn a potential subspace of high-dimensional data through effective dimensionality reduction algorithms that preserve the intrinsic structure of the data and discard the less useful information. Principal component analysis (PCA) and linear discriminant analysis (LDA) are two popular dimensionality reduction methods for high-dimensional sensor data preprocessing. LDA contains two basic methods, namely, classic linear discriminant analysis and FS linear discriminant analysis. In this paper, a new method, called similar distribution discriminant analysis (SDDA), is proposed based on the similarity of samples’ distribution. Furthermore, the method of solving the optimal discriminant vector is given. These discriminant vectors are orthogonal and nearly statistically uncorrelated. SDDA overcomes the disadvantages of PCA and LDA, and the features it extracts are more effective. The recognition performance of SDDA substantially exceeds that of PCA and LDA. Experiments on the Yale face database, FERET face database, and UCI multiple features dataset demonstrate that the proposed method is effective. The results reveal that SDDA obtains better performance than the compared dimensionality reduction methods.


A Robust Supervised Variable Selection for Noisy High-Dimensional Data

BioMed Research International ◽

10.1155/2015/320385 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 6

Author(s):

Jan Kalina ◽

Anna Schlenker

Keyword(s):

Variable Selection ◽

Dimensionality Reduction ◽

Robust Statistics ◽

High Dimensional Data ◽

Real Data ◽

High Dimensional ◽

Adaptive Weights ◽

Novel Approach ◽

Reduction Methods ◽

Data Adaptive

The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, which is suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, usual relevance and redundancy criteria have the disadvantages of being too sensitive to the presence of outlying measurements and/or being inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. Particularly, redundancy is measured by a new regularized version of the coefficient of multiple correlation and relevance is measured by a highly robust correlation coefficient based on the least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we perform the computations also for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers.
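The greedy relevance-versus-redundancy trade-off behind MRMR can be sketched as below. This is a plain, non-robust illustration: relevance and redundancy are both measured with the ordinary Pearson correlation, which is an assumption made for brevity and is exactly the kind of outlier-sensitive criterion the abstract argues against. The regularized and robust coefficients of MRRMRR are not implemented here, and all function names are hypothetical.

```python
import math


def pearson(x, y):
    """Ordinary Pearson correlation; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_select(features, target, k):
    """Greedily pick k variables with high relevance and low redundancy.

    features: dict mapping variable name -> list of values.
    target: list of group labels coded as numbers (e.g. 0 and 1).
    Score of a candidate = |corr(candidate, target)| minus the mean
    |corr(candidate, s)| over the already selected variables s.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected, remaining = [], list(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(abs(pearson(features[name], features[s]))
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With this scoring, an exact copy of an already selected variable gains nothing from its high relevance, because its redundancy with the selected set is 1.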


A new variant of radial visualization for supervised visualization of high dimensional data

Transport and Communication Science Journal ◽

10.25073/tcsj.70.3.24 ◽

2019 ◽

Vol 70 (3) ◽

pp. 162-172

Author(s):

Long Tran Van ◽

Nguyen Dinh Thi

Keyword(s):

High Dimensional Data ◽

Visual Space ◽

High Dimensional ◽

Data Sets ◽

Data Set ◽

New Variant ◽

Radial Visualization ◽

Linear Dimensionality Reduction ◽

High Dimensional Datasets ◽

Dimensionality Reduction Method

The Radial Visualization technique is a non-linear dimensionality reduction method. It projects multivariate data into the 2-dimensional visual space inside the unit circle. Radial Visualization displays both the samples and the attributes, which provides useful information about data structures. In this article, we introduce a new variant of Radial Visualization for visualizing high-dimensional data sets, named Arc Radial Visualization. The proposed modification of Radial Visualization provides more space to display high-dimensional data sets. Our method improves on Radial Visualization in visualizing the cluster structures of high-dimensional data sets. We evaluate the proposed method with two quality measurements and demonstrate the effectiveness of our approach on several real data sets.
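The projection described in the first sentences of this abstract can be sketched as follows, assuming the standard Radial Visualization formulation: attributes are placed as anchors evenly around the unit circle, each attribute is scaled to [0, 1], and a sample is drawn at the weighted average of the anchors. The Arc Radial Visualization variant proposed in the article is not implemented here, and the function name is hypothetical.

```python
import math


def radviz_project(rows):
    """Project samples into the unit disc with standard Radial Visualization.

    rows: list of equal-length lists of numbers (one list per sample).
    Each attribute is min-max scaled to [0, 1]; attribute j gets an anchor
    at angle 2*pi*j/m on the unit circle; a sample lands at the average of
    the anchors weighted by its scaled attribute values.
    """
    m = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(m)]
    highs = [max(r[j] for r in rows) for j in range(m)]
    anchors = [(math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m))
               for j in range(m)]
    points = []
    for r in rows:
        weights = [(r[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
                   for j in range(m)]
        total = sum(weights)
        if total == 0:
            points.append((0.0, 0.0))  # no attribute pulls the sample: centre
            continue
        x = sum(w * a[0] for w, a in zip(weights, anchors)) / total
        y = sum(w * a[1] for w, a in zip(weights, anchors)) / total
        points.append((x, y))
    return points
```

Because each point is a convex combination of anchors on the unit circle, every sample falls inside the circle. The division by the sum of the weights is what makes the mapping non-linear.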


Subspace Clustering of High-Dimensional Data: An Evolutionary Approach

Applied Computational Intelligence and Soft Computing ◽

10.1155/2013/863146 ◽

2013 ◽

Vol 2013 ◽

pp. 1-12 ◽

Cited By ~ 3

Author(s):

Singh Vijendra ◽

Sahoo Laxman

Keyword(s):

Clustering Algorithm ◽

Dimensional Space ◽

Clustering Algorithms ◽

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional ◽

Data Sets ◽

Real World Data ◽

Data Set ◽

Data Points

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL discovers subspaces in dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effect of the first phase, which detects dense regions, on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality, and it is more efficient than PROCLUS.


A SOM PROJECTION TECHNIQUE WITH THE GROWING STRUCTURE FOR VISUALIZING HIGH-DIMENSIONAL DATA

International Journal of Neural Systems ◽

10.1142/s0129065703001662 ◽

2003 ◽

Vol 13 (05) ◽

pp. 353-365 ◽

Cited By ~ 8

Author(s):

ZHENG WU ◽

GARY G. YEN

Keyword(s):

Projection Method ◽

Cell Structure ◽

High Dimensional Data ◽

Network Size ◽

High Dimensional ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Growing Cell ◽

Self Organizing

The Self-Organizing Map (SOM) is an efficient tool for visualizing high-dimensional data. In this paper, an intuitive and effective SOM projection method is proposed for mapping high-dimensional data onto the two-dimensional grid structure with a growing self-organizing mechanism. In the learning phase, a growing SOM is trained and the growing cell structure is used as the baseline framework. In the ordination phase, the new projection method is used to map the input vector so that the input data is mapped to the structure of the SOM without having to plot the weight values, resulting in easy visualization of the data. The projection method is demonstrated on four different data sets, including a 118 patent data set and a 399 chemical abstract data set related to polymer cements, with promising results and a significantly reduced network size.
