VAE-SNE: a deep generative model for simultaneous dimensionality reduction and clustering

2020 ◽  
Author(s):  
Jacob M. Graving ◽  
Iain D. Couzin

Abstract: Scientific datasets are growing rapidly in scale and complexity. Consequently, the task of understanding these data to answer scientific questions increasingly requires the use of compression algorithms that reduce dimensionality by combining correlated features and cluster similar observations to summarize large datasets. Here we introduce a method for both dimension reduction and clustering called VAE-SNE (variational autoencoder stochastic neighbor embedding). Our model combines elements from deep learning, probabilistic inference, and manifold learning to produce interpretable compressed representations while also readily scaling to tens-of-millions of observations. Unlike existing methods, VAE-SNE simultaneously compresses high-dimensional data and automatically learns a distribution of clusters within the data, without the need to manually select the number of clusters. This naturally creates a multi-scale representation, which makes it straightforward to generate coarse-grained descriptions for large subsets of related observations and select specific regions of interest for further analysis. VAE-SNE can also quickly and easily embed new samples, detect outliers, and can be optimized with small batches of data, which makes it possible to compress datasets that are otherwise too large to fit into memory. We evaluate VAE-SNE as a general-purpose method for dimensionality reduction by applying it to multiple real-world datasets and by comparing its performance with existing methods for dimensionality reduction. We find that VAE-SNE produces high-quality compressed representations with results that are on par with existing nonlinear dimensionality reduction algorithms. As a practical example, we demonstrate how the cluster distribution learned by VAE-SNE can be used for unsupervised action recognition to detect and classify repeated motifs of stereotyped behavior in high-dimensional time-series data. Finally, we also introduce variants of VAE-SNE for embedding data in polar (spherical) coordinates and for embedding image data from raw pixels. VAE-SNE is a robust, feature-rich, and scalable method with broad applicability to a range of datasets in the life sciences and beyond.
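
The combination described above (a variational autoencoder whose latent space also preserves stochastic-neighbor structure) can be illustrated with a short conceptual sketch in PyTorch. This is not the authors' released implementation: the class name VAESNESketch, the Gaussian-affinity bandwidth sigma, and the unweighted sum of the reconstruction, KL, and neighbor-preservation terms are illustrative assumptions, and the published method additionally learns a cluster distribution that is omitted here.

```python
# Conceptual sketch only: a VAE whose 2-D latent space is regularized with an
# SNE-style neighbor-preservation term. Hyperparameters and weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAESNESketch(nn.Module):
    def __init__(self, n_features, latent_dim=2, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar, z

def neighbor_affinities(x, sigma=1.0):
    # Row-normalized Gaussian affinities within a batch (SNE-style); self-affinity masked out.
    d2 = torch.cdist(x, x).pow(2)
    mask = torch.eye(x.size(0), device=x.device) * 1e9
    return torch.softmax(-d2 / (2 * sigma ** 2) - mask, dim=1)

def vae_sne_loss(x, recon, mu, logvar, z):
    recon_loss = F.mse_loss(recon, x, reduction="sum")            # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || p(z))
    p = neighbor_affinities(x)                                    # neighbor structure in data space
    q = neighbor_affinities(z)                                    # neighbor structure in latent space
    sne = torch.sum(p * (torch.log(p + 1e-9) - torch.log(q + 1e-9)))  # KL(P || Q)
    return recon_loss + kl + sne
```

Because every term is computed per batch, such a model can be optimized with small batches in a standard training loop, which is consistent with the scalability claim in the abstract.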

2020 ◽  
Vol 49 (3) ◽  
pp. 421-437
Author(s):  
Genggeng Liu ◽  
Lin Xie ◽  
Chi-Hua Chen

Dimensionality reduction plays an important role in data processing for machine learning and data mining, making the handling of high-dimensional data more efficient. It extracts a low-dimensional feature representation of high-dimensional data; an effective dimensionality reduction method not only preserves most of the useful information in the original data but also removes useless noise. Dimensionality reduction methods can be applied to all types of data, especially image data. Although supervised methods have achieved good results in dimensionality reduction, their performance depends on the number of labeled training samples. As the amount of information on the internet grows, labeling data requires more resources and becomes more difficult. Therefore, learning data features with unsupervised methods is of considerable research value. In this paper, an unsupervised multilayered variational auto-encoder model is studied on text data, so that the mapping from high-dimensional to low-dimensional features is efficient and the low-dimensional features retain as much of the essential information as possible. Low-dimensional features obtained by different dimensionality reduction methods are compared with those of the variational auto-encoder (VAE), and the proposed method improves significantly over the comparison methods.
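
A minimal sketch of the kind of comparison this abstract describes is given below, using scikit-learn: several baseline reducers are applied to TF-IDF text features and scored with a downstream classifier. The dataset (20 Newsgroups), the chosen categories, the 50-component target dimension, and the logistic-regression scorer are assumptions for illustration; the paper's multilayered VAE encoder would simply be evaluated as another reducer in the same loop.

```python
# Illustrative baseline comparison (not the paper's code): reduce TF-IDF text
# features with several methods and compare downstream classification accuracy.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

texts = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(max_features=2000).fit_transform(texts.data)

reducers = {
    "TruncatedSVD": TruncatedSVD(n_components=50),
    "NMF": NMF(n_components=50, init="nndsvd", max_iter=400),
    # a trained VAE encoder would slot in here as a third reducer
}
for name, reducer in reducers.items():
    Z = reducer.fit_transform(X)
    acc = cross_val_score(LogisticRegression(max_iter=1000), Z, texts.target, cv=5).mean()
    print(f"{name}: accuracy = {acc:.3f}")
```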


2019 ◽  
Vol 283 ◽  
pp. 07009
Author(s):  
Xinyao Zhang ◽  
Pengyu Wang ◽  
Ning Wang

Dimensionality reduction is one of the central problems in machine learning and pattern recognition, which aims to develop a compact representation for complex data from high-dimensional observations. Here, we apply a nonlinear manifold learning algorithm, the local tangent space alignment (LTSA) algorithm, to high-dimensional acoustic observations and achieve nonlinear dimensionality reduction for the acoustic field measured by a linear sensor array. Through dimensionality reduction, the underlying physical degrees of freedom of the acoustic field, such as variations in sound source location and sound speed profiles, can be discovered. Two simulations are presented to verify the validity of the approach.
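
For readers who want to reproduce this kind of embedding, LTSA is available in scikit-learn. The sketch below uses randomly generated stand-in data for a hypothetical 64-element linear sensor array rather than the simulated acoustic fields in the paper; the neighbor count and target dimension are assumptions.

```python
# Illustrative LTSA embedding of sensor-array snapshots (stand-in data, not the paper's simulations).
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.default_rng(0).normal(size=(500, 64))  # 500 snapshots from a hypothetical 64-element array

ltsa = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="ltsa")
Z = ltsa.fit_transform(X)  # low-dimensional coordinates, one row per snapshot
print(Z.shape)             # (500, 2)
```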


2003 ◽  
Vol 15 (6) ◽  
pp. 1373-1396 ◽  
Author(s):  
Mikhail Belkin ◽  
Partha Niyogi

One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a low-dimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the high-dimensional data. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.
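
The algorithm described here (Laplacian eigenmaps) is implemented in scikit-learn as SpectralEmbedding, which builds a k-nearest-neighbor affinity graph and embeds the data using eigenvectors of the graph Laplacian. The sketch below applies it to a synthetic swiss-roll manifold; the neighbor count and component number are illustrative choices, not values from the paper.

```python
# Laplacian eigenmaps via scikit-learn's SpectralEmbedding on a synthetic manifold.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import SpectralEmbedding

X, _ = make_swiss_roll(n_samples=1500, random_state=0)  # points on a 2-D manifold in 3-D space

emb = SpectralEmbedding(n_components=2, n_neighbors=10, affinity="nearest_neighbors")
Z = emb.fit_transform(X)  # locality-preserving 2-D coordinates
print(Z.shape)            # (1500, 2)
```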


2020 ◽  
Author(s):  
Kevin C. VanHorn ◽  
Murat Can Çobanoğlu

Abstract: Dimensionality reduction (DR) is often integral when analyzing high-dimensional data across scientific, economic, and social networking applications. For data with a high order of complexity, nonlinear approaches are often needed to identify and represent the most important components. We propose a novel DR approach that can incorporate a known underlying hierarchy. Specifically, we extend the widely used t-Distributed Stochastic Neighbor Embedding technique (t-SNE) to include hierarchical information and demonstrate its use with known or unknown class labels. We term this approach "H-tSNE." Such a strategy can aid in discovering and understanding underlying patterns of a dataset that is heavily influenced by parent-child relationships. Without integrating information that is known a priori, we suggest that DR cannot function as effectively. In this regard, we argue for a DR approach that enables the user to incorporate known, relevant relationships even if their representation is weakly expressed in the dataset.
Availability: github.com/Cobanoglu-Lab/h-tSNE
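
The H-tSNE repository's own interface is not shown here; the sketch below only illustrates the general idea of biasing t-SNE with hierarchical information by blending feature-space distances with a hierarchy-derived distance and running standard t-SNE on the precomputed matrix. The stand-in data, the parent labels, and the blending weight alpha are all hypothetical.

```python
# General idea only (not the H-tSNE API): mix feature distances with a
# parent-class distance, then embed with standard t-SNE on the precomputed matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                 # stand-in feature data
parent = rng.integers(0, 3, size=300)          # hypothetical parent-class labels

d_feat = squareform(pdist(X))                                 # feature-space distances
d_tree = (parent[:, None] != parent[None, :]).astype(float)   # 0 if same parent, 1 otherwise

alpha = 0.5                                    # weight on the hierarchical term (assumption)
d_combined = (1 - alpha) * d_feat / d_feat.max() + alpha * d_tree

Z = TSNE(n_components=2, metric="precomputed", init="random",
         perplexity=30).fit_transform(d_combined)
print(Z.shape)  # (300, 2)
```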


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Yongbin Liu ◽  
Jingjie Wang ◽  
Wei Bai

Dimensionality reduction of images with high-dimensional nonlinear structure is key to improving recognition rates. Although traditional algorithms have achieved some success at dimensionality reduction, each has its own shortcomings. To achieve good performance on high-dimensional nonlinear image recognition, this paper analyzes traditional dimensionality reduction algorithms, builds on their strengths, and proposes an image recognition technique based on nonlinear dimensionality reduction. As an effective nonlinear feature extraction approach, nonlinear dimensionality reduction can discover the nonlinear structure of a dataset while preserving its intrinsic structure. Applied to image recognition, the input image is divided into blocks that are treated as a set of points in a high-dimensional space; reducing their dimensionality yields a low-dimensional feature vector for each block, so that recognition can be carried out in a lower-dimensional space. This lowers computational complexity, improves recognition accuracy, and facilitates further processing such as image recognition and search. The shortcomings of traditional algorithms are addressed, and commodity price recognition and simulation experiments verify the feasibility of the proposed nonlinear dimensionality reduction-based image recognition technique for commodity price recognition.
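
A rough sketch of the block-based pipeline described above is given below, with Isomap standing in for the (unnamed) nonlinear dimensionality reduction method; the image, block size, neighbor count, and target dimension are illustrative assumptions rather than the paper's settings.

```python
# Block-based nonlinear dimensionality reduction sketch: split an image into blocks,
# treat each block as a high-dimensional point, and embed the blocks with Isomap.
import numpy as np
from sklearn.manifold import Isomap

image = np.random.default_rng(0).random((128, 128))    # stand-in grayscale image
block = 8
patches = (image.reshape(128 // block, block, 128 // block, block)
                .swapaxes(1, 2)
                .reshape(-1, block * block))           # (256, 64): one row per 8x8 block

Z = Isomap(n_neighbors=10, n_components=5).fit_transform(patches)
print(Z.shape)  # (256, 5): low-dimensional feature vector per block
```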


2014 ◽  
Vol 1014 ◽  
pp. 375-378 ◽  
Author(s):  
Ri Sheng Huang

To effectively improve the performance of speech emotion recognition, nonlinear dimensionality reduction is needed for speech feature data that lie on a nonlinear manifold embedded in a high-dimensional acoustic space. This paper proposes an improved SLLE algorithm, which enhances the discriminating power of the low-dimensional embedded data and possesses optimal generalization ability. The proposed algorithm is used to perform nonlinear dimensionality reduction on 48-dimensional speech emotion feature data, including prosody, in order to recognize three emotions: anger, joy, and neutral. Experimental results on a natural speech emotion database demonstrate that the proposed algorithm achieves the highest accuracy of 90.97% with no more than 9 embedded features, an 11.64% improvement over the SLLE algorithm.
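
The improved SLLE algorithm itself is not reproduced here; the sketch below only shows a plain (unsupervised) locally linear embedding baseline followed by k-NN classification on stand-in 48-dimensional features with hypothetical anger/joy/neutral labels. Supervised SLLE variants additionally use class labels when constructing neighborhoods.

```python
# Baseline only: unsupervised LLE to 9 dimensions, then k-NN emotion classification.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 48))      # stand-in 48-dimensional prosodic/acoustic features
y = rng.integers(0, 3, size=400)    # hypothetical labels: 0=anger, 1=joy, 2=neutral

Z = LocallyLinearEmbedding(n_neighbors=15, n_components=9).fit_transform(X)
acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), Z, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.3f}")
```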


2010 ◽  
Vol 7 (1) ◽  
pp. 127-138 ◽  
Author(s):  
Zhao Zhang ◽  
Ye Ning

Dimensionality reduction is an important preprocessing step in high-dimensional data analysis, aiming to reduce dimension without losing intrinsic information. The problem of semi-supervised nonlinear dimensionality reduction, called KNDR, is considered for wood defect recognition. In this setting, domain knowledge in the form of pairwise constraints specifies whether pairs of instances belong to the same class or to different classes. KNDR can project the data onto a set of 'useful' features and preserve the structure of labeled and unlabeled data, as well as the constraints defined in the embedding space, under which the projections of the original data can be effectively partitioned from each other. We demonstrate the practical usefulness of KNDR for data visualization and wood defect recognition through extensive experiments. Experimental results show that it achieves similar or even higher performance than some existing methods.
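
KNDR's exact formulation is not reproduced here; the sketch below only illustrates one generic way that pairwise constraints can enter a nonlinear embedding, by reweighting a k-nearest-neighbor affinity graph before spectral embedding. The stand-in features, the specific must-link and cannot-link pairs, and the edge weights are all hypothetical.

```python
# Generic illustration (not KNDR): encode must-link / cannot-link constraints
# by reweighting a k-NN affinity graph, then compute a spectral embedding.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.manifold import spectral_embedding

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                        # stand-in wood-surface features
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity").toarray()
W = np.maximum(W, W.T)                                # symmetrize the affinity graph

must_link = [(0, 1), (2, 3)]                          # hypothetical same-class pairs
cannot_link = [(0, 2), (1, 3)]                        # hypothetical different-class pairs
for i, j in must_link:
    W[i, j] = W[j, i] = 1.0                           # strengthen the edge
for i, j in cannot_link:
    W[i, j] = W[j, i] = 0.0                           # remove the edge

Z = spectral_embedding(W, n_components=2, random_state=0)
print(Z.shape)  # (200, 2)
```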


Author(s):  
Mohammad Sultan Mahmud ◽  
Joshua Zhexue Huang ◽  
Xianghua Fu

Classification problems in which the number of features (dimensions) greatly exceeds the number of samples (observations) are an essential research and application area in a variety of domains, especially computational biology. This is also known as the high-dimensional small-sample-size (HDSSS) problem. Various dimensionality reduction methods have been developed, but they are not effective on small-sample-size, high-dimensional datasets and suffer from overfitting and high-variance gradients. To overcome the pitfalls of sample size and dimensionality, this study employed the variational autoencoder (VAE), a framework for unsupervised learning that has developed rapidly in recent years. The objective of this study is to investigate a reliable classification model for high-dimensional, small-sample-size datasets with minimal error. Moreover, it evaluated the strength of different VAE architectures on the HDSSS datasets. In the experiment, six genomic microarray datasets from the Kent Ridge Biomedical Dataset Repository were selected, and several choices of dimensions (features) were applied for data preprocessing. Also, to evaluate classification accuracy and to find a stable and suitable classifier, nine state-of-the-art classifiers that have been successful for classification tasks in high-dimensional data settings were selected. The experimental results demonstrate that the VAE can provide superior performance compared to traditional methods such as PCA, fastICA, FA, NMF, and LDA in terms of accuracy and AUROC.
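
A hedged sketch of this kind of comparison is shown below: several linear reduction baselines are applied to a synthetic small-sample, high-dimensional dataset and compared by cross-validated AUROC with a single classifier. The synthetic data, the 20-component target dimension, and the SVM scorer are assumptions; the study's VAE encoder, its remaining baselines, and the other classifiers would be evaluated in the same way.

```python
# Illustrative HDSSS baseline comparison (synthetic data, not the study's microarray datasets).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, FastICA, FactorAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# stand-in for a genomic microarray dataset: 100 samples, 5000 features
X, y = make_classification(n_samples=100, n_features=5000, n_informative=50, random_state=0)

reducers = {
    "PCA": PCA(n_components=20),
    "FastICA": FastICA(n_components=20, max_iter=1000),
    "FA": FactorAnalysis(n_components=20),
    # a trained VAE encoder would be evaluated the same way
}
for name, reducer in reducers.items():
    Z = reducer.fit_transform(X)
    auroc = cross_val_score(SVC(), Z, y, scoring="roc_auc", cv=5).mean()
    print(f"{name}: AUROC = {auroc:.3f}")
```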

