Robust PCA via Regularized Reaper with a Matrix-Free Proximal Algorithm

Author(s):  
Robert Beinert ◽  
Gabriele Steidl

Principal component analysis (PCA) is known to be sensitive to outliers, so various robust PCA variants have been proposed in the literature. A recent model, called reaper, aims to find the principal components by solving a convex optimization problem. Usually the number of principal components must be determined in advance, and the minimization is performed over symmetric positive semi-definite matrices of the same size as the data, although the number of principal components is substantially smaller. This prohibits its use when the dimension of the data is large, which is often the case in image processing. In this paper, we propose a regularized version of reaper which enforces sparsity in the number of principal components by penalizing the nuclear norm of the corresponding orthogonal projector. If only an upper bound on the number of principal components is available, our approach can be combined with the L-curve method to reconstruct the appropriate subspace. Our second contribution is a matrix-free algorithm for finding a minimizer of the regularized reaper model that is also suited to high-dimensional data. The algorithm couples a primal-dual minimization approach with a thick-restarted Lanczos process. This appears to be the first efficient convex variational method for robust PCA that can handle high-dimensional data. As a side result, we discuss the bias in robust PCA. Numerical examples demonstrate the performance of our algorithm.
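
The abstract does not state the model explicitly. As a rough sketch only, assuming the standard reaper data-fidelity term and writing P for the relaxed orthogonal projector, a nuclear-norm regularized problem of the kind described here might read

```latex
\min_{\substack{P \in \mathbb{R}^{D \times D} \\ 0 \preceq P \preceq I}}
\;\sum_{i=1}^{N} \lVert x_i - P\, x_i \rVert_2 \;+\; \alpha\, \lVert P \rVert_{*},
```

where x_1, ..., x_N are the data points in R^D and the penalty weight α > 0 takes the place of the classical trace constraint tr(P) = d, so that the number of principal components (the rank of the recovered projector) no longer has to be fixed in advance.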

2013 ◽  
Vol 303-306 ◽  
pp. 1101-1104 ◽  
Author(s):  
Yong De Hu ◽  
Jing Chang Pan ◽  
Xin Tan

Kernel entropy component analysis (KECA) reveals the structure of the original data through the kernel matrix. This structure is related to the Renyi entropy of the data, and KECA preserves it by keeping the data's Renyi entropy unchanged. This paper describes the original data with a small number of components for the purpose of dimension reduction. KECA is then applied to celestial spectra reduction and compared experimentally with principal component analysis (PCA) and kernel principal component analysis (KPCA). Experimental results show that KECA is an effective method for high-dimensional data reduction.
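
As a rough illustration (not the authors' implementation), KECA ranks the eigenpairs of a kernel matrix by their contribution lambda_i * (1^T e_i)^2 to the quadratic Renyi entropy estimate and projects the data onto the top-ranked kernel axes. A minimal NumPy sketch, with the Gaussian kernel and all parameter names chosen here purely for illustration, could look like:

```python
import numpy as np

def keca(X, n_components=2, sigma=1.0):
    """Kernel entropy component analysis: a minimal sketch.

    Eigenpairs of the kernel matrix are ranked by their contribution
    lambda_i * (1^T e_i)^2 to the Renyi entropy estimate V = 1^T K 1 / N^2,
    and the data are projected onto the top-ranked kernel axes.
    """
    # Gaussian (RBF) kernel matrix
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / (2.0 * sigma**2))

    # Symmetric eigendecomposition (eigenvalues returned in ascending order)
    eigvals, eigvecs = np.linalg.eigh(K)

    # Entropy contribution of each eigenpair, then the top-ranked indices
    contrib = eigvals * (eigvecs.sum(axis=0) ** 2)
    top = np.argsort(contrib)[::-1][:n_components]

    # Kernel-space scores: sqrt(lambda_i) * e_i for the selected axes
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
```

Unlike KPCA, which keeps the components with the largest eigenvalues, this ranking can skip large-eigenvalue components that contribute little to the entropy estimate.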


2011 ◽  
Vol 54 (1) ◽  
pp. 94-107 ◽  
Author(s):  
Guo-Chun Ding ◽  
Kornelia Smalla ◽  
Holger Heuer ◽  
Siegfried Kropf

2010 ◽  
Vol 10 (03) ◽  
pp. 343-363
Author(s):  
ULRIK SÖDERSTRÖM ◽  
HAIBO LI

In this paper, we examine how much information is needed to represent the facial mimic, based on Paul Ekman's assumption that the facial mimic can be represented with a few basic emotions. Principal component analysis is used to compactly represent the important facial expressions. Theoretical bounds for facial mimic representation are presented both for a given number of principal components and for a given number of bits. When 10 principal components are used to reconstruct color video at a resolution of 240 × 176 pixels, the representation bound is on average 36.8 dB, measured in peak signal-to-noise ratio. Practical confirmation of the theoretical bounds is demonstrated. Quantization of the projection coefficients affects the representation, but quantization with approximately 7-8 bits is found to match the exact representation, measured in mean squared error.
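
As a hedged sketch of this kind of experiment (the frame size, component count, and bit depth are taken from the abstract; the uniform quantizer and all variable names are illustrative, not the authors' exact setup), the PSNR of a 10-component reconstruction with quantized projection coefficients could be estimated as follows:

```python
import numpy as np

def psnr(original, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((original - reconstruction) ** 2)
    return 10.0 * np.log10(peak**2 / mse)

def pca_reconstruct(frames, n_components=10, n_bits=8):
    """Project vectorized video frames onto the leading principal
    components, uniformly quantize the coefficients, and reconstruct."""
    mean = frames.mean(axis=0)
    centered = frames - mean

    # Principal directions from the SVD of the centered frame matrix
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:n_components]                 # (n_components, n_pixels)
    coeffs = centered @ basis.T               # projection coefficients

    # Uniform quantization of the coefficients to n_bits each
    lo, hi = coeffs.min(), coeffs.max()
    levels = 2**n_bits - 1
    coeffs_q = np.round((coeffs - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

    return coeffs_q @ basis + mean

# frames: (n_frames, 240 * 176 * 3) array of vectorized color frames
# print(psnr(frames, pca_reconstruct(frames, n_components=10, n_bits=8)))
```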


2019 ◽  
Author(s):  
Shamus M. Cooley ◽  
Timothy Hamilton ◽  
J. Christian J. Ray ◽  
Eric J. Deeds

High-dimensional data are becoming increasingly common in nearly all areas of science. Developing approaches to analyze these data and understand their meaning is a pressing issue. This is particularly true for the rapidly growing field of single-cell RNA-Seq (scRNA-Seq), a technique that simultaneously measures the expression of tens of thousands of genes in thousands to millions of single cells. The emerging consensus for analysis workflows reduces the dimensionality of the dataset before performing downstream analysis, such as assignment of cell types. One problem with this approach is that dimensionality reduction can introduce substantial distortion into the data; consider the familiar example of trying to represent the three-dimensional earth as a two-dimensional map. It is currently unclear whether such distortion affects analysis of scRNA-Seq data sets. Here, we introduce a straightforward approach to quantifying this distortion by comparing the local neighborhoods of points before and after dimensionality reduction. We found that popular techniques like t-SNE and UMAP introduce substantial distortion even for relatively simple geometries such as simulated hyperspheres. For scRNA-Seq data, we found the distortion in local neighborhoods was often greater than 95% in the representations typically used for downstream analysis. This high level of distortion can readily introduce important errors into cell type identification, pseudotime ordering, and other analyses that rely on local relationships. We found that principal component analysis can generate accurate embeddings of the data, but only when using dimensionalities that are much higher than those typically used in scRNA-Seq analysis. We suggest approaches to take these findings into account and call for a new generation of dimensionality reduction algorithms that can accurately embed high-dimensional data in their true latent dimension.
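
The abstract describes the distortion metric only in words. A minimal sketch of one way to compare local neighborhoods before and after reduction, using k-nearest-neighbor overlap with k and the scikit-learn helper chosen here purely for illustration, might look like:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_preservation(X_high, X_low, k=20):
    """Average fraction of each point's k nearest neighbors that is
    preserved after dimensionality reduction; one minus this value
    gives the local distortion in the sense described above."""
    idx_high = NearestNeighbors(n_neighbors=k).fit(X_high).kneighbors(return_distance=False)
    idx_low = NearestNeighbors(n_neighbors=k).fit(X_low).kneighbors(return_distance=False)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(idx_high, idx_low)]
    return float(np.mean(overlap))

# Example: points on a 10-sphere embedded into 2-D by some reducer
# X_high = np.random.randn(1000, 11)
# X_high /= np.linalg.norm(X_high, axis=1, keepdims=True)
# X_low = some_embedding(X_high)   # hypothetical: e.g. a t-SNE or UMAP output
# print("local distortion:", 1.0 - neighborhood_preservation(X_high, X_low))
```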


Author(s):  
Avani Ahuja

In the current era of ‘big data’, scientists are able to quickly amass enormous amounts of data in a limited number of experiments. The investigators then try to hypothesize about the root cause based on the observed trends for the predictors and the response variable. This involves identifying the discriminatory predictors that are most responsible for explaining variation in the response variable. In the current work, we investigated three related multivariate techniques: Principal Component Regression (PCR), Partial Least Squares or Projections to Latent Structures (PLS), and Orthogonal Partial Least Squares (OPLS). To perform a comparative analysis, we used a publicly available dataset for Parkinson's disease patients. We first performed the analysis using a cross-validated number of principal components for the aforementioned techniques. Our results demonstrated that PLS and OPLS were better suited than PCR for identifying the discriminatory predictors. Since the X data did not exhibit strong correlation, we also performed Multiple Linear Regression (MLR) on the dataset. A comparison of the top five discriminatory predictors identified by the four techniques showed a substantial overlap between the results obtained by PLS, OPLS, and MLR, while all three diverged significantly from the variables identified by PCR. A further investigation of the data revealed that PCR could successfully identify the discriminatory variables if the number of principal components in the regression model was increased. In summary, we recommend using PLS or OPLS for hypothesis generation and systematizing the selection process for principal components when using PCR.
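
A minimal sketch of such a comparison is shown below. The component count and the ranking rule (absolute regression coefficient mapped back to the original predictors) are illustrative assumptions, not the paper's exact protocol, and the data arrays X and y are assumed to be loaded elsewhere:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

def rank_predictors(X, y, n_components=5):
    """Rank predictors by |coefficient| under PCR and PLS with the same
    number of components (chosen by cross-validation in practice)."""
    # Principal component regression: PCA followed by least squares
    pcr = make_pipeline(StandardScaler(), PCA(n_components=n_components), LinearRegression())
    pcr.fit(X, y)
    # Map the coefficients on the components back to the original predictors
    beta_pcr = pcr[-1].coef_ @ pcr[1].components_

    # Partial least squares regression on the standardized predictors
    pls = PLSRegression(n_components=n_components)
    pls.fit(StandardScaler().fit_transform(X), y)
    beta_pls = np.asarray(pls.coef_).ravel()

    return np.argsort(np.abs(beta_pcr))[::-1], np.argsort(np.abs(beta_pls))[::-1]

# top_pcr, top_pls = rank_predictors(X, y)
# print("PCR top 5:", top_pcr[:5], "PLS top 5:", top_pls[:5])
```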


2011 ◽  
Vol 20 (4) ◽  
pp. 852-873 ◽  
Author(s):  
Vadim Zipunnikov ◽  
Brian Caffo ◽  
David M. Yousem ◽  
Christos Davatzikos ◽  
Brian S. Schwartz ◽  
...  
