Avoiding Optimal Mean ℓ2,1-Norm Maximization-Based Robust PCA for Reconstruction

2017 ◽  
Vol 29 (4) ◽  
pp. 1124-1150 ◽  
Author(s):  
Minnan Luo ◽  
Feiping Nie ◽  
Xiaojun Chang ◽  
Yi Yang ◽  
Alexander G. Hauptmann ◽  
...  

Robust principal component analysis (PCA) is one of the most important dimension-reduction techniques for handling high-dimensional data with outliers. However, most existing robust PCA methods presuppose that the mean of the data is zero and incorrectly use the average of the data as the optimal mean of robust PCA. In fact, this assumption holds only for the squared ℓ2-norm-based traditional PCA. In this letter, we equivalently reformulate the objective of conventional PCA and learn the optimal projection directions by maximizing the sum of the projected differences between each pair of instances based on the ℓ2,1-norm. The proposed method is robust to outliers and also invariant to rotation. More importantly, the reformulated objective not only automatically avoids the calculation of the optimal mean and makes the assumption of centered data unnecessary, but also theoretically connects to the minimization of reconstruction error. To solve the proposed nonsmooth problem, we exploit an efficient optimization algorithm that softens the contributions of outliers by reweighting each data point iteratively. We theoretically analyze the convergence and computational complexity of the proposed algorithm. Extensive experimental results on several benchmark data sets illustrate the effectiveness and superiority of the proposed method.
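A minimal sketch of the reweighting idea, assuming an IRLS-style surrogate; the function name, the random initialization, and the explicit O(n²) pair enumeration are ours, not the authors' exact algorithm:

```python
import numpy as np

def l21_pairwise_pca(X, k, n_iter=30, eps=1e-8):
    """Sketch: maximize sum_{i,j} ||W^T (x_i - x_j)||_2 over orthonormal W.

    Iteratively reweighted surrogate: the weight 1/||W^T (x_i - x_j)||
    tempers the quadratic influence of large, outlier-driven pairwise
    differences. No data centering or mean estimation is needed.
    """
    n, d = X.shape
    rng = np.random.default_rng(0)
    W = np.linalg.qr(rng.standard_normal((d, k)))[0]  # random orthonormal start
    diffs = X[:, None, :] - X[None, :, :]             # all pairwise differences
    for _ in range(n_iter):
        proj_norms = np.linalg.norm(diffs @ W, axis=2) + eps
        S = 1.0 / proj_norms                          # IRLS-style pair weights
        M = np.einsum('ij,ijd,ije->de', S, diffs, diffs)  # weighted scatter
        _, vecs = np.linalg.eigh(M)
        W = vecs[:, -k:]                              # top-k eigen-directions
    return W
```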

2018 ◽  
Vol 64 ◽  
pp. 08006 ◽  
Author(s):  
Kummerow André ◽  
Nicolai Steffen ◽  
Bretschneider Peter

The scope of this survey is the uncovering of potential critical events from mixed PMU data sets. An unsupervised procedure based on different outlier detection methods is introduced. Different signal-analysis techniques are used to generate features in the time and frequency domains, together with linear and non-linear dimension-reduction techniques. This approach enables the exploration of critical grid dynamics in power systems without prior knowledge of existing failure patterns. Furthermore, new failure patterns can be extracted to create training data sets for online detection algorithms.
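As a rough illustration of such an unsupervised pipeline (the paper does not fix specific features or detectors, so the feature choices, window shapes, and the use of IsolationForest below are all assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

def score_pmu_windows(pmu_windows, n_components=10):
    """Hypothetical pipeline: time/frequency features -> PCA -> outlier scores.

    pmu_windows: array of shape (n_windows, n_samples_per_window).
    """
    # Time-domain features: mean, standard deviation, peak-to-peak per window
    t_feats = np.column_stack([
        pmu_windows.mean(axis=1),
        pmu_windows.std(axis=1),
        np.ptp(pmu_windows, axis=1),
    ])
    # Frequency-domain features: magnitudes of the leading FFT bins
    f_feats = np.abs(np.fft.rfft(pmu_windows, axis=1))[:, :20]
    X = np.hstack([t_feats, f_feats])
    X_red = PCA(n_components=min(n_components, X.shape[1])).fit_transform(X)
    forest = IsolationForest(random_state=0).fit(X_red)
    return forest.decision_function(X_red)  # low scores flag candidate events
```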


2017 ◽  
Vol 10 (13) ◽  
pp. 355 ◽  
Author(s):  
Reshma Remesh ◽  
Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of analyzing high-dimensional data sets. The raw input data set may have many dimensions, and the analysis may be slow and lead to wrong predictions if unnecessary data attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensions of the input data towards accurate prediction at lower cost. In this paper, the different machine learning approaches used for dimensionality reduction, such as PCA, SVD, LDA, kernel principal component analysis (KPCA), and artificial neural networks (ANNs), have been studied.
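A quick sketch of how the surveyed techniques are invoked in practice, using scikit-learn (the paper itself does not prescribe a library); note that LDA is supervised while the others are not:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)  # 64-dimensional digit images

reducers = {
    "PCA": PCA(n_components=2),
    "SVD": TruncatedSVD(n_components=2),
    "Kernel PCA": KernelPCA(n_components=2, kernel="rbf"),
    "LDA": LinearDiscriminantAnalysis(n_components=2),
}
for name, reducer in reducers.items():
    # LDA is supervised, so it also needs the class labels
    Z = reducer.fit_transform(X, y) if name == "LDA" else reducer.fit_transform(X)
    print(f"{name}: {X.shape[1]} -> {Z.shape[1]} dimensions")
```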


2020 ◽  
Vol 13 (6) ◽  
pp. 2995-3022
Author(s):  
Sini Isokääntä ◽  
Eetu Kari ◽  
Angela Buchholz ◽  
Liqing Hao ◽  
Siegfried Schobesberger ◽  
...  

Abstract. Online analysis with mass spectrometers produces complex data sets, consisting of mass spectra with a large number of chemical compounds (ions). Statistical dimension reduction techniques (SDRTs) are able to condense complex data sets into a more compact form while preserving the information included in the original observations. The general principle of these techniques is to investigate the underlying dependencies of the measured variables by combining variables with similar characteristics into distinct groups, called factors or components. Currently, positive matrix factorization (PMF) is the most commonly exploited SDRT across a range of atmospheric studies, in particular for source apportionment. In this study, we used five different SDRTs in analysing mass spectral data from complex gas- and particle-phase measurements during a laboratory experiment investigating the interactions of gasoline car exhaust and α-pinene. Specifically, we used four factor analysis techniques, namely principal component analysis (PCA), PMF, exploratory factor analysis (EFA) and non-negative matrix factorization (NMF), as well as one clustering technique, partitioning around medoids (PAM). All SDRTs were able to resolve four to five factors from the gas-phase measurements, including an α-pinene precursor factor, two to three oxidation product factors, and a background or car exhaust precursor factor. NMF and PMF provided an additional oxidation product factor, which was not found by other SDRTs. The results from EFA and PCA were similar after applying oblique rotations. For the particle-phase measurements, four factors were discovered with NMF: one primary factor, a mixed-LVOOA factor and two α-pinene secondary-organic-aerosol-derived (SOA-derived) factors. PMF was able to separate two factors: semi-volatile oxygenated organic aerosol (SVOOA) and low-volatility oxygenated organic aerosol (LVOOA). PAM was not able to resolve interpretable clusters due to general limitations of clustering methods, as the high degree of fragmentation taking place in the aerosol mass spectrometer (AMS) causes different compounds formed at different stages in the experiment to be detected at the same variable. However, when preliminary analysis is needed, or isomers and mixed sources are not expected, cluster analysis may be a useful tool, as the results are simpler and thus easier to interpret. In the factor analysis techniques, any single ion generally contributes to multiple factors, although EFA and PCA try to minimize this spread. Our analysis shows that different SDRTs put emphasis on different parts of the data, and with only one technique, some interesting data properties may still stay undiscovered. Thus, validation of the acquired results, either by comparing between different SDRTs or applying one technique multiple times (e.g. by resampling the data or giving different starting values for iterative algorithms), is important, as it may protect the user from dismissing unexpected results as “unphysical”.
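For readers who want to try a PMF-like factorization, here is a minimal sketch on synthetic stand-in data using scikit-learn's NMF; plain NMF lacks PMF's per-point uncertainty weighting, so this is an approximation of the idea, not the tools used in the paper:

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

# Toy stand-in for a (time x ion) mass-spectra matrix; real signals are
# non-negative, which NMF (like PMF) requires
rng = np.random.default_rng(1)
spectra = rng.gamma(shape=2.0, scale=1.0, size=(200, 50))

n_factors = 4
nmf = NMF(n_components=n_factors, init="nndsvda", max_iter=500, random_state=0)
contributions = nmf.fit_transform(spectra)  # factor time series
profiles = nmf.components_                  # factor mass-spectral profiles

# PCA on the same matrix: components may have negative loadings, so they
# are not directly interpretable as mass spectra
scores = PCA(n_components=n_factors).fit_transform(spectra)
```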


2022 ◽  
pp. 612-628
Author(s):  
João Paulo Teixeira ◽  
Nuno Alves ◽  
Paula Odete Fernandes

Vocal acoustic analysis is becoming a useful tool for the classification and recognition of laryngological pathologies. This technique enables a non-invasive and low-cost assessment of voice disorders, allowing a more efficient, fast, and objective diagnosis. In this work, ANN and SVM were tested for classifying dysphonic versus control and vocal cord paralysis versus control subjects. A feature vector was made up of 4 jitter parameters, 4 shimmer parameters, and a harmonic-to-noise ratio (HNR), determined from 3 different vowels at 3 different tones, for a total of 81 features. Variable selection and dimension-reduction techniques such as hierarchical clustering, multilinear regression analysis, and principal component analysis (PCA) were applied. The classification between dysphonic and control subjects was made with an accuracy of 100% for the female and male groups with ANN and SVM. For the classification between vocal cord paralysis and control, an accuracy of 78.9% was achieved for the female group with SVM, and 81.8% for the male group with ANN.
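A minimal sketch of this kind of pipeline (scaling, PCA-based dimension reduction, then an SVM), on random stand-in data since the voice features are not reproduced here; all shapes and parameters are assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Random stand-in for the 81 acoustic features (4 jitter + 4 shimmer + HNR
# over 3 vowels x 3 tones) with binary pathology labels
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 81))
y = rng.integers(0, 2, size=60)

clf = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=5).mean())  # ~0.5 on random data
```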


2017 ◽  
Vol 15 (04) ◽  
pp. 1750017 ◽  
Author(s):  
Wentian Li ◽  
Jane E. Cerise ◽  
Yaning Yang ◽  
Henry Han

The t-distributed stochastic neighbor embedding (t-SNE) is a new dimension-reduction and visualization technique for high-dimensional data. t-SNE is rarely applied to human genetic data, even though it is commonly used in other data-intensive biological fields, such as single-cell genomics. We explore the applicability of t-SNE to human genetic data and make these observations: (i) similar to previously used dimension-reduction techniques such as principal component analysis (PCA), t-SNE is able to separate samples from different continents; (ii) unlike PCA, t-SNE is more robust to the presence of outliers; (iii) t-SNE is able to display both continental and sub-continental patterns in a single plot. We conclude that the ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.
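A hedged sketch of the usual workflow on a hypothetical genotype matrix: PCA first, then t-SNE on the leading components (a common practice for speed and noise reduction; the paper's exact preprocessing may differ):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical genotype matrix: samples x SNPs, coded as 0/1/2 allele counts
rng = np.random.default_rng(42)
G = rng.integers(0, 3, size=(300, 1000)).astype(float)

# Reduce with PCA first, then embed the leading components with t-SNE
pcs = PCA(n_components=30).fit_transform(G)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(pcs)
print(emb.shape)  # (300, 2) coordinates for plotting population structure
```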


2019 ◽  
Vol 11 (16) ◽  
pp. 1866 ◽  
Author(s):  
Nicholas Westing ◽  
Brett Borghetti ◽  
Kevin C. Gross

The increasing spatial and spectral resolution of hyperspectral imagers yields detailed spectroscopy measurements from both space-based and airborne platforms. These detailed measurements allow for material classification, with many recent advancements from the fields of machine learning and deep learning. In many scenarios, the hyperspectral image must first be corrected or compensated for atmospheric effects. Radiative transfer (RT) computations can provide look-up tables (LUTs) to support these corrections. This research investigates a dimension-reduction approach using machine learning methods to create an effective sensor-specific long-wave infrared (LWIR) RT model. The utility of this approach is investigated by emulating the Mako LWIR hyperspectral sensor (Δλ ≃ 0.044 μm, Δν̃ ≃ 3.9 cm⁻¹). This study employs physics-based metrics and loss functions to identify promising dimension-reduction techniques and reduce at-sensor radiance reconstruction error. The derived RT model shows an overall RMSE of less than 1 K across reflective to emissive grey-body emissivity profiles.
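One plausible reading of such an emulator, with all details assumed rather than taken from the paper: compress the LUT spectra with PCA and regress the atmospheric state onto the component scores, reconstructing full spectra from the predicted scores:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

# Stand-in LUT: atmospheric state vectors -> at-sensor radiance spectra
rng = np.random.default_rng(0)
states = rng.uniform(size=(500, 6))        # e.g. temperature/H2O/O3 knobs
spectra = rng.standard_normal((500, 128))  # placeholder radiance spectra

# Compress spectra to a few principal components, regress state -> scores,
# then reconstruct full spectra from the predicted scores
pca = PCA(n_components=8).fit(spectra)
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                     random_state=0).fit(states, pca.transform(spectra))
reconstructed = pca.inverse_transform(model.predict(states))
rmse = np.sqrt(np.mean((reconstructed - spectra) ** 2))
```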


Author(s):  
Cornelia Fuetterer ◽  
Thomas Augustin ◽  
Christiane Fuchs

Abstract. The analysis of single-cell RNA sequencing data is of great importance in health research. It challenges data scientists but has enormous potential in the context of personalized medicine. The clustering of single cells aims to detect different subgroups of cell populations within a patient in a data-driven manner. Some comparison studies denote single-cell consensus clustering (SC3), proposed by Kiselev et al. (Nat Methods 14(5):483–486, 2017), as the best method for classifying single-cell RNA sequencing data. SC3 includes Laplacian eigenmaps and a principal component analysis (PCA). Our proposal of unsupervised adapted single-cell consensus clustering (adaSC3) suggests replacing the linear PCA by diffusion maps, a non-linear method that takes the transitions of single cells into account. We investigate the performance of adaSC3 in terms of accuracy on the data sets of the original source of SC3 as well as in a simulation study. A comparison of adaSC3 with SC3, as well as with related algorithms based on further alternative dimension-reduction techniques, shows quite convincing behavior of adaSC3.
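The diffusion-map step that adaSC3 substitutes for PCA can be sketched compactly; the following is a minimal textbook version (Gaussian kernel, row-normalized Markov matrix, leading non-trivial eigenvectors), not the authors' implementation:

```python
import numpy as np

def diffusion_map(X, n_components=2, epsilon=None, t=1):
    """Textbook diffusion-map sketch: Gaussian kernel, row-normalized
    Markov matrix, leading non-trivial eigenvectors scaled by lambda^t."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    if epsilon is None:
        epsilon = np.median(sq)            # common bandwidth heuristic
    K = np.exp(-sq / epsilon)
    P = K / K.sum(axis=1, keepdims=True)   # transition (Markov) matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    lam = vals.real[order][1:n_components + 1] ** t
    # Skip the trivial constant eigenvector with eigenvalue 1
    return vecs.real[:, order][:, 1:n_components + 1] * lam
```

Running k-means or a consensus-clustering step on the returned coordinates would then play the role of SC3's clustering stage.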


2017 ◽  
Vol 33 (1) ◽  
pp. 15-41 ◽  
Author(s):  
Aida Calviño

Abstract. In this article we propose a simple and versatile method for limiting disclosure in continuous microdata based on Principal Component Analysis (PCA). Instead of perturbing the original variables, we propose to alter the principal components, as they contain the same information but are uncorrelated, which permits working on each component separately, reducing processing times. The number and weight of the perturbed components determine the level of protection and distortion of the masked data. The method provides preservation of the mean vector and the variance-covariance matrix. Furthermore, depending on the technique chosen to perturb the principal components, the proposed method can provide masked, hybrid or fully synthetic data sets. Some examples of application and comparison with other methods previously proposed in the literature (in terms of disclosure risk and data utility) are also included.
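A hedged sketch of the general recipe; the paper offers several perturbation choices, and permuting component scores is just one illustrative option here:

```python
import numpy as np

def pca_mask(X, n_perturb=2, rng=None):
    """Sketch: rotate to principal components, perturb the chosen trailing
    components, rotate back. Permuting a centered component keeps its mean
    and variance exactly; cross-covariances stay approximately zero."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu = X.mean(axis=0)
    Xc = X - mu
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = vecs[:, np.argsort(-vals)]          # axes ordered by variance
    Z = Xc @ V                              # uncorrelated component scores
    for j in range(Z.shape[1] - n_perturb, Z.shape[1]):
        Z[:, j] = rng.permutation(Z[:, j])  # one possible perturbation
    return Z @ V.T + mu                     # masked data set
```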


Author(s):  
Qianqian Wang ◽  
Quanxue Gao ◽  
Xinbo Gao ◽  
Feiping Nie

Recently, many ℓ1-norm-based PCA methods have been developed for dimensionality reduction, but they do not explicitly consider the reconstruction error, nor do they take into account the relationship between the reconstruction error and the variance of the projected data. This reduces the robustness of the algorithms. To handle this problem, a novel formulation for PCA, namely Angle PCA, is proposed. Angle PCA employs the ℓ2-norm to measure the reconstruction error and the variance of the projected data, and maximizes the sum over all data points of the ratio between variance and reconstruction error. Angle PCA is not only robust to outliers but also retains PCA's desirable properties, such as rotational invariance. To solve Angle PCA, we propose an iterative algorithm that has a closed-form solution in each iteration. Extensive experiments on several face image databases illustrate that our method is overall superior to other robust PCA algorithms, such as PCA, greedy PCA-L1, non-greedy PCA-L1, and HQ-PCA.
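A rough sketch of the Angle-PCA idea via reweighted eigen-decompositions; the per-sample weight update below is illustrative only, not the authors' exact closed-form step:

```python
import numpy as np

def angle_pca(X, k, n_iter=30, eps=1e-8):
    """Sketch of the Angle-PCA objective: maximize, over orthonormal W,
    sum_i ||W^T x_i|| / ||x_i - W W^T x_i|| via reweighted eigen-steps."""
    d = X.shape[1]
    W = np.linalg.qr(np.random.default_rng(0).standard_normal((d, k)))[0]
    for _ in range(n_iter):
        proj = X @ W
        recon_err = np.linalg.norm(X - proj @ W.T, axis=1) + eps
        var = np.linalg.norm(proj, axis=1) + eps
        w = 1.0 / (var * recon_err)        # illustrative per-sample weights:
        M = (X * w[:, None]).T @ X         # outliers (large error) count less
        _, vecs = np.linalg.eigh(M)
        W = vecs[:, -k:]
    return W
```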

