Supervised dimensionality reduction for big data

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Joshua T. Vogelstein ◽  
Eric W. Bridgeford ◽  
Minh Tang ◽  
Da Zheng ◽  
Christopher Douville ◽  
...  

Abstract: To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-Rank Projection, incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-Rank Projection and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, Linear Optimal Low-Rank Projection outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.
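The core construction is straightforward to prototype. The NumPy sketch below builds a LOL-style projection by stacking the class-conditional mean differences with the leading principal directions of the class-centered data and orthonormalizing the result; it is a minimal illustration of the idea under those assumptions, not the authors' reference implementation, and the synthetic data, dimensions, and helper name are illustrative.

```python
import numpy as np

def lol_projection(X, y, d):
    """Sketch of a LOL-style supervised projection (assumption: stack
    class-mean differences with top principal components of the
    class-centered data, then orthonormalize)."""
    classes = np.unique(y)
    # Class-conditional means and the differences between them.
    means = np.vstack([X[y == c].mean(axis=0) for c in classes])
    mean_diffs = means[1:] - means[0]              # (n_classes - 1, p)
    # Principal directions of the data after removing class means.
    Xc = X - means[np.searchsorted(classes, y)]
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = d - mean_diffs.shape[0]                    # remaining PCA directions
    A = np.vstack([mean_diffs, Vt[:k]])            # (d, p) candidate directions
    Q, _ = np.linalg.qr(A.T)                       # orthonormalize columns
    return Q[:, :d]                                # p x d projection matrix

# Usage on synthetic two-class data: project to d = 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))
y = rng.integers(0, 2, size=200)
X[y == 1] += 0.5                                   # shift class 1 so the means differ
W = lol_projection(X, y, d=5)
X_low = X @ W                                      # low-dimensional representation
```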

2019 ◽  
Author(s):  
Cody N. Heiser ◽  
Ken S. Lau

Summary: High-dimensional data, such as those generated using single-cell RNA sequencing, present challenges in interpretation and visualization. Numerical and computational methods for dimensionality reduction allow for low-dimensional representation of genome-scale expression data for downstream clustering, trajectory reconstruction, and biological interpretation. However, a comprehensive and quantitative evaluation of the performance of these techniques has not been established. We present an unbiased framework that defines metrics of global and local structure preservation in dimensionality reduction transformations. Using discrete and continuous scRNA-seq datasets, we find that input cell distribution and method parameters are largely determinant of global, local, and organizational data structure preservation by eleven published dimensionality reduction methods. Code available at github.com/KenLauLab/DR-structure-preservation allows for rapid evaluation of further datasets and methods.
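As a rough illustration of what such structure-preservation metrics can look like, the sketch below scores a reduction by (i) the correlation between pairwise distances before and after reduction (global structure) and (ii) the average overlap of each cell's k nearest neighbours (local structure). These particular metric definitions are assumptions chosen for illustration and may differ from those implemented in the DR-structure-preservation repository.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr

def structure_preservation(X_high, X_low, k=30):
    """Toy global/local preservation metrics (assumed definitions, not the
    repository's exact ones)."""
    D_high, D_low = squareform(pdist(X_high)), squareform(pdist(X_low))
    # Global structure: how well the full pairwise-distance matrix is preserved.
    iu = np.triu_indices_from(D_high, k=1)
    global_corr, _ = pearsonr(D_high[iu], D_low[iu])
    # Local structure: average overlap of each cell's k nearest neighbours.
    nn_high = np.argsort(D_high, axis=1)[:, 1:k + 1]
    nn_low = np.argsort(D_low, axis=1)[:, 1:k + 1]
    local_overlap = np.mean([len(np.intersect1d(a, b)) / k
                             for a, b in zip(nn_high, nn_low)])
    return global_corr, local_overlap
```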


2022 ◽  
pp. 17-25
Author(s):  
Nancy Jan Sliper

Experimenters today routinely quantify millions or even billions of characteristics (measurements) per sample to address critical biological questions, in the hope that machine learning tools will be able to make correct data-driven judgments. Because sample sizes are typically orders of magnitude smaller than the dimensionality of the data, efficient analysis requires a low-dimensional representation that preserves the discriminating features (e.g., whether a certain ailment is present in a person's body). While several methods can handle millions of variables and still offer strong empirical and theoretical guarantees, few of them are clearly interpretable. This research presents an evaluation of supervised dimensionality reduction for large-scale data. We provide a methodology for extending Principal Component Analysis (PCA) by including class-conditional moment estimates in the low-dimensional projection. Linear Optimal Low-Rank (LOLR) projection, the simplest variant, incorporates the class-conditional means. Using both experimental and simulated benchmarks, we show that LOLR projection and its extensions improve data representations for subsequent classification while retaining computational efficiency and scalability. In terms of accuracy, LOLR outperforms other scalable linear dimensionality reduction methods that require much longer computation times on conventional computers, on brain imaging datasets with more than 150 million attributes and on genome sequencing datasets with more than half a million attributes.


2014 ◽  
Vol 26 (4) ◽  
pp. 761-780 ◽  
Author(s):  
Guoqiang Zhong ◽  
Mohamed Cheriet

We present a supervised model for tensor dimensionality reduction, called large margin low rank tensor analysis (LMLRTA). In contrast to traditional vector-representation-based dimensionality reduction methods, LMLRTA can take tensors of any order as input. Unlike previous tensor dimensionality reduction methods, which can learn only low-dimensional embeddings with an a priori specified dimensionality, LMLRTA can automatically and jointly learn the dimensionality and the low-dimensional representations from data. Moreover, LMLRTA delivers low-rank projection matrices, while encouraging data of the same class to be close and data of different classes to be separated by a large margin in the low-dimensional tensor space. LMLRTA can be optimized using an iterative fixed-point continuation algorithm, which is guaranteed to converge to a locally optimal solution of the optimization problem. We evaluate LMLRTA on an object recognition application, where the data are represented as 2D tensors, and a face recognition application, where the data are represented as 3D tensors. Experimental results show the superiority of LMLRTA over state-of-the-art approaches.
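For intuition, the sketch below shows the kind of multilinear projection such tensor methods produce for 2D tensors (images): one projection matrix per mode, applied as U1' X U2. Here the projection matrices are simply initialized from mode-wise SVDs (as in multilinear PCA); the large-margin, rank-minimizing learning and the automatic dimensionality selection of LMLRTA itself are not reproduced.

```python
import numpy as np

def modewise_projections(tensors, d1, d2):
    """Initialization-style sketch: mode-wise projection matrices from the SVD
    of each mode unfolding (multilinear-PCA style), not LMLRTA's learned ones."""
    X = np.stack(tensors)                                   # (n, h, w)
    mode1 = X.transpose(1, 0, 2).reshape(X.shape[1], -1)    # unfold along rows
    mode2 = X.transpose(2, 0, 1).reshape(X.shape[2], -1)    # unfold along columns
    U1 = np.linalg.svd(mode1, full_matrices=False)[0][:, :d1]
    U2 = np.linalg.svd(mode2, full_matrices=False)[0][:, :d2]
    return U1, U2

def project(tensor, U1, U2):
    # Multilinear projection of a single 2D tensor: U1' X U2 -> (d1, d2).
    return U1.T @ tensor @ U2

rng = np.random.default_rng(1)
images = [rng.normal(size=(32, 32)) for _ in range(100)]    # stand-in 2D tensors
U1, U2 = modewise_projections(images, d1=8, d2=8)
low = project(images[0], U1, U2)                            # 8 x 8 embedding
```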


2020 ◽  
Vol 49 (3) ◽  
pp. 421-437
Author(s):  
Genggeng Liu ◽  
Lin Xie ◽  
Chi-Hua Chen

Dimensionality reduction plays an important role in the data processing of machine learning and data mining, making the processing of high-dimensional data more efficient. Dimensionality reduction extracts a low-dimensional feature representation of high-dimensional data; an effective dimensionality reduction method not only extracts most of the useful information in the original data but also removes useless noise. Dimensionality reduction methods can be applied to all types of data, especially image data. Although supervised learning methods have achieved good results in dimensionality reduction, their performance depends on the number of labeled training samples. With the growth of information from the internet, labeling data requires more resources and becomes more difficult. Therefore, using unsupervised learning to learn the features of data has extremely important research value. In this paper, an unsupervised multilayered variational auto-encoder model is studied on text data, so that the mapping from high-dimensional features to low-dimensional features becomes efficient and the low-dimensional features retain as much of the main information as possible. Low-dimensional features obtained by different dimensionality reduction methods are compared with the dimensionality reduction results of the variational auto-encoder (VAE), and the VAE shows significant improvement over the other comparison methods.
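A minimal VAE used as a dimensionality reducer can be sketched as follows in PyTorch; the single-hidden-layer architecture, the random stand-in for TF-IDF text vectors, and the hyperparameters are assumptions for illustration, not the paper's multilayered model. The encoder mean serves as the low-dimensional feature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE sketch used as a dimensionality reducer (assumed architecture)."""
    def __init__(self, in_dim, hidden=512, latent=32):
        super().__init__()
        self.enc = nn.Linear(in_dim, hidden)
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec1 = nn.Linear(latent, hidden)
        self.dec2 = nn.Linear(hidden, in_dim)

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)       # reparameterization trick
        recon = self.dec2(F.relu(self.dec1(z)))
        return recon, mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    rec = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# Training-loop sketch: the encoder mean `mu` is the low-dimensional feature.
X = torch.randn(1000, 2000)                        # stand-in for text feature vectors
model = VAE(in_dim=2000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(10):
    recon, mu, logvar = model(X)
    loss = vae_loss(recon, X, mu, logvar)
    opt.zero_grad(); loss.backward(); opt.step()
low_dim_features = model.encode(X)[0].detach()     # n_samples x 32
```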


Author(s):  
Akira Imakura ◽  
Momo Matsuda ◽  
Xiucai Ye ◽  
Tetsuya Sakurai

Dimensionality reduction methods that project high-dimensional data to a low-dimensional space by matrix trace optimization are widely used for clustering and classification. The matrix trace optimization problem leads to an eigenvalue problem for constructing a low-dimensional subspace that preserves certain properties of the original data. However, most existing methods use only a few eigenvectors to construct the low-dimensional space, which may lead to a loss of information that is useful for successful classification. Herein, to overcome this information loss, we propose a novel complex moment-based supervised eigenmap that includes multiple eigenvectors for dimensionality reduction. Furthermore, the proposed method provides a general formulation that allows matrix trace optimization methods to incorporate ridge regression, which models the linear dependency between covariate variables and univariate labels. To reduce the computational complexity, we also propose an efficient and parallel implementation of the proposed method. Numerical experiments indicate that the proposed method is competitive with existing dimensionality reduction methods in terms of recognition performance. Additionally, the proposed method exhibits high parallel efficiency.
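The baseline this work generalizes is the familiar trace-optimization-to-eigenproblem step, e.g. maximizing tr(V' Sb V) subject to V' Sw V = I and keeping only a few leading eigenvectors. The sketch below shows that baseline with a small ridge term on the within-class scatter; the complex moment-based construction that gathers many more eigenvectors, and the parallel implementation, are not reproduced here.

```python
import numpy as np
from scipy.linalg import eigh

def lda_trace_optimization(X, y, d):
    """Baseline supervised trace optimization solved as a generalized
    eigenvalue problem, keeping only the top-d eigenvectors."""
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)              # within-class scatter
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * diff @ diff.T              # between-class scatter
    Sw += 1e-6 * np.eye(Sw.shape[0])               # ridge-style regularization
    w, V = eigh(Sb, Sw)                            # generalized eigenproblem
    return V[:, np.argsort(w)[::-1][:d]]           # leading eigenvectors only

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 50)); y = rng.integers(0, 3, size=300)
X[y == 1] += 1.0; X[y == 2] -= 1.0                 # separate the three classes
V = lda_trace_optimization(X, y, d=2)
X_low = X @ V                                      # 2-dimensional embedding
```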


2020 ◽  
Author(s):  
Elnaz Lashgari ◽  
Uri Maoz

Abstract: Electromyography (EMG) is a simple, non-invasive, and cost-effective technology for sensing muscle activity. However, EMG is also noisy, complex, and high-dimensional. It has nevertheless been widely used in a host of human-machine-interface applications (electrical wheelchairs, virtual computer mice, prostheses, robotic fingers, etc.) and in particular to measure reaching and grasping motions of the human hand. Here, we developed a more automated pipeline to predict object weight in a reach-and-grasp task from an open dataset, relying only on EMG data. In doing so, we shifted the focus from manual feature engineering to automated feature extraction by using raw (filtered) EMG signals and thus letting the algorithms select the features. We further compared intrinsic EMG features derived from several dimensionality-reduction methods, and then ran classification algorithms on these low-dimensional representations. We found that the Laplacian Eigenmap algorithm generally outperformed the other dimensionality-reduction methods. What is more, optimal classification accuracy was achieved using a combination of Laplacian Eigenmaps and k-Nearest Neighbors (88% for 3-way classification). Our results, using EMG alone, are comparable to others in the literature that used EMG and EEG together. They also demonstrate the usefulness of dimensionality reduction when classifying movement based on EMG signals, and more generally the usefulness of EMG for movement classification.
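A compact version of the winning combination can be sketched with scikit-learn, using SpectralEmbedding as the Laplacian Eigenmaps step and k-Nearest Neighbors on the embedding; the synthetic stand-in data and the parameter values below are assumptions rather than the study's actual pipeline.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 128))                    # stand-in for filtered EMG windows
y = rng.integers(0, 3, size=600)                   # three object-weight classes

# Laplacian Eigenmaps has no out-of-sample transform, so the whole dataset is
# embedded once and the classifier is evaluated by cross-validation on top.
embedding = SpectralEmbedding(n_components=10, n_neighbors=15)
X_low = embedding.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=7)
scores = cross_val_score(knn, X_low, y, cv=5)
print(f"3-way accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```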


Author(s):  
Xiaojie Guo ◽  
Zhouchen Lin

In practice, even very high-dimensional data are typically sampled from low-dimensional subspaces, but with intrusion of outliers and/or noise. Recovering the underlying structure and the pollution from the observations is key to understanding and processing such data. Besides properly modeling the low-rank structure of the subspace, how the pollution is handled is central to the performance of recovery. Often, the observed data are posed as a superposition of the clean data and a residual, where the residual can be roughly divided into two groups: small dense noise and gross sparse outliers. Compared with small noise, outliers are more likely to ruin the recovery, as they can be arbitrarily large. With the above in mind, this paper designs a method for recovering the low-rank matrix with robust outlier estimation, termed ROUTE, in a unified manner. Theoretical analysis of convergence and optimality, and experimental results on both synthetic and real data, are provided to demonstrate the efficacy of the proposed method and to show its superiority over other state-of-the-art approaches.
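To make the low-rank-plus-sparse-outlier model concrete, the sketch below runs the classic principal-component-pursuit style alternation (singular value thresholding for the low-rank part, soft thresholding for the sparse part). It illustrates the decomposition that ROUTE addresses; ROUTE's own unified formulation and solver differ, and the parameter defaults here are standard heuristics, not values from the paper.

```python
import numpy as np

def svd_shrink(M, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def soft_shrink(M, tau):
    # Elementwise soft thresholding: proximal operator of the l1 norm.
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0)

def low_rank_plus_sparse(X, lam=None, mu=None, n_iter=200):
    """Principal-component-pursuit style split X ~ L + S (illustration only)."""
    m, n = X.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or 0.25 / (np.abs(X).mean() + 1e-12)
    L = np.zeros_like(X); S = np.zeros_like(X); Y = np.zeros_like(X)
    for _ in range(n_iter):
        L = svd_shrink(X - S + Y / mu, 1.0 / mu)   # low-rank update
        S = soft_shrink(X - L + Y / mu, lam / mu)  # sparse-outlier update
        Y = Y + mu * (X - L - S)                   # multiplier update
    return L, S

rng = np.random.default_rng(4)
clean = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80))      # rank-5 matrix
outliers = (rng.random((100, 80)) < 0.05) * rng.normal(scale=10, size=(100, 80))
L, S = low_rank_plus_sparse(clean + outliers)
```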


2020 ◽  
Vol 7 (2) ◽  
pp. 190714 ◽  
Author(s):  
Omar Shetta ◽  
Mahesan Niranjan

The application of machine learning to inference problems in biology is dominated by supervised learning problems of regression and classification, and unsupervised learning problems of clustering and variants of low-dimensional projections for visualization. A class of problems that has not gained much attention is detecting outliers in datasets, arising from reasons such as gross experimental, reporting or labelling errors. These could also be small parts of a dataset that are functionally distinct from the majority of a population. Outlier data are often identified by considering the probability density of normal data and comparing data likelihoods against some threshold. This classical approach suffers from the curse of dimensionality, which is a serious problem for omics data, which are often found in very high dimensions. We develop an outlier detection method based on structured low-rank approximation methods. The objective function includes a regularizer based on neighbourhood information captured in the graph Laplacian. Results on publicly available genomic data show that our method robustly detects outliers whereas a density-based method fails even at moderate dimensions. Moreover, we show that our method has better clustering and visualization performance on the recovered low-dimensional projection when compared with popular dimensionality reduction techniques.
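A simplified version of the idea can be sketched as follows: smooth the data with a k-nearest-neighbour graph Laplacian regularizer, form a low-rank approximation of the smoothed data, and score each sample by its residual norm. The closed-form smoothing step and the parameter choices are simplifying assumptions; the paper's structured low-rank objective and solver differ.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_outlier_scores(X, k=10, alpha=1.0, rank=5):
    """Graph-Laplacian-smoothed low-rank approximation with residual-norm
    outlier scores (a simplified stand-in for the paper's method)."""
    # Symmetric k-nearest-neighbour graph over samples and its Laplacian.
    W = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
    W = np.maximum(W, W.T)
    Lap = np.diag(W.sum(axis=1)) - W
    # Graph-smoothed data: argmin ||X - Z||^2 + alpha * tr(Z' Lap Z).
    Z = np.linalg.solve(np.eye(len(X)) + alpha * Lap, X)
    # Rank-r approximation of the smoothed data.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    L = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank]
    # Outlier score: how poorly each sample is explained by the approximation.
    return np.linalg.norm(X - L, axis=1)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 500))
X += 0.1 * rng.normal(size=X.shape)                # low-rank data plus small noise
X[:5] += rng.normal(scale=6.0, size=(5, 500))      # five gross outliers
scores = laplacian_outlier_scores(X)
print(np.argsort(scores)[-5:])                     # indices of the most suspicious samples
```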


2012 ◽  
Vol 60 (3) ◽  
pp. 389-405 ◽  
Author(s):  
G. Zhou ◽  
A. Cichocki

Abstract: A multiway blind source separation (MBSS) method is developed to decompose large-scale tensor (multiway array) data. Benefiting from the wide range of well-established constrained low-rank matrix factorization methods, MBSS is quite flexible and able to extract unique and interpretable components with physical meaning. The multilinear structure of the Tucker model and the essential uniqueness of BSS methods allow MBSS to estimate each component matrix separately from an unfolding matrix in each mode. Consequently, alternating least squares (ALS) iterations, which are considered the workhorse for tensor decompositions, can be avoided, and various robust and efficient dimensionality reduction methods can easily be incorporated to pre-process the data, which makes MBSS extremely fast, especially for large-scale problems. Identification and uniqueness conditions are also discussed. Two practical issues, dimensionality reduction and estimation of the number of components, are also addressed based on sparse and random fiber sampling. Extensive simulations confirmed the validity, flexibility, and high efficiency of the proposed method. We also demonstrated by simulations that the MBSS approach can successfully extract desired components while most existing algorithms may fail for ill-conditioned and large-scale problems.
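The mode-by-mode strategy can be illustrated with an HOSVD-style sketch: each factor matrix is estimated independently from the corresponding unfolding, with no ALS sweeps. MBSS itself swaps the plain SVD below for constrained matrix factorizations (e.g., ICA, NMF, or sparse coding) chosen per mode, which is not reproduced here.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: move `mode` to the front and flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def modewise_factors(T, ranks):
    """HOSVD-style illustration of estimating each factor from its own
    unfolding, then projecting to obtain the core tensor."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    # Core tensor: project the data onto each factor subspace, mode by mode.
    G = T
    for mode, U in enumerate(factors):
        G = np.moveaxis(np.tensordot(U.T, np.moveaxis(G, mode, 0), axes=1), 0, mode)
    return factors, G

rng = np.random.default_rng(6)
T = rng.normal(size=(40, 50, 60))
factors, core = modewise_factors(T, ranks=(5, 6, 7))
print(core.shape)                                  # (5, 6, 7)
```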


2019 ◽  
Vol 8 (S3) ◽  
pp. 66-71
Author(s):  
T. Sudha ◽  
P. Nagendra Kumar

Data mining is one of the major areas of research, and clustering is one of its main functionalities. High dimensionality is one of the main issues in clustering, and dimensionality reduction can be used as a solution to this problem. The present work makes a comparative study of dimensionality reduction techniques, namely t-distributed stochastic neighbour embedding (t-SNE) and probabilistic principal component analysis (PPCA), in the context of clustering. High-dimensional data have been reduced to low-dimensional data using both techniques. Cluster analysis has been performed on the high-dimensional data as well as on the low-dimensional datasets obtained through t-SNE and PPCA, with a varying number of clusters. Mean squared error, time, and space have been considered as parameters for comparison. The results obtained show that the time taken to convert the high-dimensional data into low-dimensional data using PPCA is higher than the time taken using t-SNE, while the storage space required by the dataset reduced through PPCA is less than that required by the dataset reduced through t-SNE.
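A minimal version of such a comparison can be sketched with scikit-learn, timing each reduction, recording the memory footprint of the embedding, and clustering the result. PCA is used here as a stand-in for probabilistic PCA (the two share the same maximum-likelihood subspace), and the synthetic data are an assumption in place of the paper's datasets.

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = rng.normal(size=(2000, 100))                   # stand-in high-dimensional data

results = {}
for name, reducer in [("PPCA (via PCA)", PCA(n_components=2)),
                      ("t-SNE", TSNE(n_components=2, init="pca"))]:
    start = time.perf_counter()
    X_low = reducer.fit_transform(X)               # time the reduction step
    elapsed = time.perf_counter() - start
    labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_low)
    results[name] = (elapsed, X_low.nbytes, labels)
    print(f"{name}: {elapsed:.2f}s, {X_low.nbytes / 1024:.1f} KiB for the embedding")
```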

