Modified “Rule N” Procedure for Principal Component (EOF) Truncation

Principal component analysis (PCA) is an efficient model for the optimization problem of finding d' axes of a subspace Rd' ⊆ Rd so that the mean squared distances from a given set R of points to the axes are minimal. Despite being steadily employed since 1901 in different scenarios, e.g., mechanics, PCA has become an important link in machine learning chained tasks, such as feature learning and AutoML designs. A frequent yet open issue that arises from supervised-based problems is how many PCA axes are required for the performance of machine learning constructs to be tuned. Accordingly, we investigate the behavior of six independent and uncoupled criteria for estimating the number of PCA axes, namely Scree-Plot %, Scree Plot Gap, Kaiser-Guttman, Broken-Stick, p-Score, and 2D. In total, we evaluate the performance of those approaches in 20 high dimensional datasets by using (i) four different classifiers, and (ii) a hypothesis test upon the reported F-Measures. Results indicate Broken-Stick and Scree-Plot % criteria consistently outperformed the competitors regarding supervised-based tasks, whereas estimators Kaiser-Guttman and Scree-Plot Gap delivered poor performances in the same scenarios.

Download Full-text

Generalized penalty for circular coordinate representation

Foundations of Data Science ◽

10.3934/fods.2021024 ◽

2021 ◽

Vol 0 (0) ◽

pp. 0

Author(s):

Hengrui Luo ◽

Alice Patania ◽

Jisu Kim ◽

Mikael Vejdemo-Johansson

Keyword(s):

Dimension Reduction ◽

Real Data ◽

Topological Data Analysis ◽

High Dimensional ◽

Topological Structures ◽

Sampling Schemes ◽

Coordinate Representation ◽

Geometrical Shapes ◽

Persistent Cohomology ◽

High Dimensional Datasets

<p style='text-indent:20px;'>Topological Data Analysis (TDA) provides novel approaches that allow us to analyze the geometrical shapes and topological structures of a dataset. As one important application, TDA can be used for data visualization and dimension reduction. We follow the framework of circular coordinate representation, which allows us to perform dimension reduction and visualization for high-dimensional datasets on a torus using persistent cohomology. In this paper, we propose a method to adapt the circular coordinate framework to take into account the roughness of circular coordinates in change-point and high-dimensional applications. To do that, we use a generalized penalty function instead of an <inline-formula><tex-math id="M1">\begin{document}$ L_{2} $\end{document}</tex-math></inline-formula> penalty in the traditional circular coordinate algorithm. We provide simulation experiments and real data analyses to support our claim that circular coordinates with generalized penalty will detect the change in high-dimensional datasets under different sampling schemes while preserving the topological structures.</p>

Download Full-text

Sparse common component analysis for multiple high-dimensional datasets via noncentered principal component analysis

Statistical Papers ◽

10.1007/s00362-018-1045-6 ◽

2018 ◽

Vol 61 (6) ◽

pp. 2283-2311

Author(s):

Heewon Park ◽

Sadanori Konishi

Keyword(s):

Principal Component Analysis ◽

Principal Component ◽

Component Analysis ◽

High Dimensional ◽

Common Component ◽

High Dimensional Datasets

Download Full-text

Efficient PCA-Exploration of High-Dimensional Datasets

10.26434/chemrxiv.13340711 ◽

2020 ◽

Author(s):

Oxana Ye. Rodionova ◽

Sergey Kucheryavskiy ◽

Alexey L. Pomerantsev

Keyword(s):

Degrees Of Freedom ◽

Principal Component ◽

Recent Decade ◽

Model Complexity ◽

High Dimensional ◽

Exploratory Data ◽

Classification Tasks ◽

First Time ◽

High Dimensional Datasets

<div><div><div><p>Basic tools for exploration and interpretation of Principal Component Analysis (PCA) results are well- known and thoroughly described in many comprehensive tutorials. However, in the recent decade, several new tools have been developed. Some of them were originally created for solving authentication and classification tasks. In this paper we demonstrate that they can also be useful for the exploratory data analysis.</p><p><br></p><p>We discuss several important aspects of the PCA exploration of high dimensional datasets, such as estimation of a proper complexity of PCA model, dependence on the data structure, presence of outliers, etc. We introduce new tools for the assessment of the PCA model complexity such as the plots of the degrees of freedom developed for the orthogonal and score distances, as well as the Extreme and Distance plots, which present a new look at the features of the training and test (new) data. These tools are simple and fast in computation. In some cases, they are more efficient than the conventional PCA tools. A simulated example provides an intuitive illustration of their application. Three real-world examples originated from various fields are employed to demonstrate capabilities of the new tools and ways they can be used. The first example considers the reproducibility of a handheld spectrometer using a dataset that is presented for the first time. The other two datasets, which describe the authentication of olives in brine and classification of wines by their geographical origin, are already known and are often used for the illustrative purposes.</p><p><br></p><p>The paper does not touch upon the well-known things, such as the algorithms for the PCA decomposition, or interpretation of scores and loadings. Instead, we pay attention primarily to more advanced topics, such as exploration of data homogeneity, understanding and evaluation of an optimal model complexity. The examples are accompanied by links to free software that implements the tools.</p></div></div></div>

Download Full-text

Efficient PCA-Exploration of High-Dimensional Datasets

10.26434/chemrxiv.13340711.v1 ◽

2020 ◽

Author(s):

Oxana Ye. Rodionova ◽

Sergey Kucheryavskiy ◽

Alexey L. Pomerantsev

Keyword(s):

Degrees Of Freedom ◽

Principal Component ◽

Recent Decade ◽

Model Complexity ◽

High Dimensional ◽

Exploratory Data ◽

Classification Tasks ◽

First Time ◽

High Dimensional Datasets

<div><div><div><p>Basic tools for exploration and interpretation of Principal Component Analysis (PCA) results are well- known and thoroughly described in many comprehensive tutorials. However, in the recent decade, several new tools have been developed. Some of them were originally created for solving authentication and classification tasks. In this paper we demonstrate that they can also be useful for the exploratory data analysis.</p><p><br></p><p>We discuss several important aspects of the PCA exploration of high dimensional datasets, such as estimation of a proper complexity of PCA model, dependence on the data structure, presence of outliers, etc. We introduce new tools for the assessment of the PCA model complexity such as the plots of the degrees of freedom developed for the orthogonal and score distances, as well as the Extreme and Distance plots, which present a new look at the features of the training and test (new) data. These tools are simple and fast in computation. In some cases, they are more efficient than the conventional PCA tools. A simulated example provides an intuitive illustration of their application. Three real-world examples originated from various fields are employed to demonstrate capabilities of the new tools and ways they can be used. The first example considers the reproducibility of a handheld spectrometer using a dataset that is presented for the first time. The other two datasets, which describe the authentication of olives in brine and classification of wines by their geographical origin, are already known and are often used for the illustrative purposes.</p><p><br></p><p>The paper does not touch upon the well-known things, such as the algorithms for the PCA decomposition, or interpretation of scores and loadings. Instead, we pay attention primarily to more advanced topics, such as exploration of data homogeneity, understanding and evaluation of an optimal model complexity. The examples are accompanied by links to free software that implements the tools.</p></div></div></div>

Download Full-text

Single-Pass PCA of Large High-Dimensional Data

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/468 ◽

2017 ◽

Cited By ~ 5

Author(s):

Wenjian Yu ◽

Yu Gu ◽

Jian Li ◽

Shenghua Liu ◽

Yaohang Li

Keyword(s):

Machine Learning ◽

High Dimensional Data ◽

Randomized Algorithm ◽

Principal Component ◽

Real Data ◽

Data Matrix ◽

High Dimensional ◽

Singular Vectors ◽

Single Pass ◽

Fundamental Dimension

Principal component analysis (PCA) is a fundamental dimension reduction tool in statistics and machine learning. For large and high-dimensional data, computing the PCA (i.e., the top singular vectors of the data matrix) becomes a challenging task. In this work, a single-pass randomized algorithm is proposed to compute PCA with only one pass over the data. It is suitable for processing extremely large and high-dimensional data stored in slow memory (hard disk) or the data generated in a streaming fashion. Experiments with synthetic and real data validate the algorithm's accuracy, which has orders of magnitude smaller error than an existing single-pass algorithm. For a set of high-dimensional data stored as a 150 GB file, the algorithm is able to compute the first 50 principal components in just 24 minutes on a typical 24-core computer, with less than 1 GB memory cost.

Download Full-text

Beta Hebbian Learning as a New Method for Exploratory Projection Pursuit

International Journal of Neural Systems ◽

10.1142/s0129065717500241 ◽

2017 ◽

Vol 27 (06) ◽

pp. 1750024 ◽

Cited By ~ 20

Author(s):

Héctor Quintián ◽

Emilio Corchado

Keyword(s):

Hebbian Learning ◽

Projection Pursuit ◽

Principal Component ◽

Component Analysis ◽

High Dimensional ◽

Locally Linear Embedding ◽

Learning Rules ◽

Low Dimensional ◽

Exploratory Projection Pursuit ◽

High Dimensional Datasets

In this research, a novel family of learning rules called Beta Hebbian Learning (BHL) is thoroughly investigated to extract information from high-dimensional datasets by projecting the data onto low-dimensional (typically two dimensional) subspaces, improving the existing exploratory methods by providing a clear representation of data’s internal structure. BHL applies a family of learning rules derived from the Probability Density Function (PDF) of the residual based on the beta distribution. This family of rules may be called Hebbian in that all use a simple multiplication of the output of the neural network with some function of the residuals after feedback. The derived learning rules can be linked to an adaptive form of Exploratory Projection Pursuit and with artificial distributions, the networks perform as the theory suggests they should: the use of different learning rules derived from different PDFs allows the identification of “interesting” dimensions (as far from the Gaussian distribution as possible) in high-dimensional datasets. This novel algorithm, BHL, has been tested over seven artificial datasets to study the behavior of BHL parameters, and was later applied successfully over four real datasets, comparing its results, in terms of performance, with other well-known Exploratory and projection models such as Maximum Likelihood Hebbian Learning (MLHL), Locally-Linear Embedding (LLE), Curvilinear Component Analysis (CCA), Isomap and Neural Principal Component Analysis (Neural PCA).

Download Full-text

A Blockchain-Integrated Divided-Block Sparse Matrix Transformation Differential Privacy Data Publishing Model

Security and Communication Networks ◽

10.1155/2021/2418539 ◽

2021 ◽

Vol 2021 ◽

pp. 1-15

Author(s):

Yiyang Hong ◽

Xingwen Zhao ◽

Hui Zhu ◽

Hui Li

Keyword(s):

Principal Component Analysis ◽

Privacy Protection ◽

Differential Privacy ◽

Sparse Matrix ◽

Principal Component ◽

Matrix Transformation ◽

Data Publishing ◽

High Dimensional ◽

High Dimensional Datasets ◽

Different Levels

With the rapid development of information technology, people benefit more and more from big data. At the same time, it becomes a great concern that how to obtain optimal outputs from big data publishing and sharing management while protecting privacy. Many researchers seek to realize differential privacy protection in massive high-dimensional datasets using the method of principal component analysis. However, these algorithms are inefficient in processing and do not take into account the different privacy protection needs of each attribute in high-dimensional datasets. To address the above problem, we design a Divided-block Sparse Matrix Transformation Differential Privacy Data Publishing Algorithm (DSMT-DP). In this algorithm, different levels of privacy budget parameters are assigned to different attributes according to the required privacy protection level of each attribute, taking into account the privacy protection needs of different levels of attributes. Meanwhile, the use of the divided-block scheme and the sparse matrix transformation scheme can improve the computational efficiency of the principal component analysis method for handling large amounts of high-dimensional sensitive data, and we demonstrate that the proposed algorithm satisfies differential privacy. Our experimental results show that the mean square error of the proposed algorithm is smaller than the traditional differential privacy algorithm with the same privacy parameters, and the computational efficiency can be improved. Further, we combine this algorithm with blockchain and propose an Efficient Privacy Data Publishing and Sharing Model based on the blockchain. Publishing and sharing private data on this model not only resist strong background knowledge attacks from adversaries outside the system but also prevent stealing and tampering of data by not-completely-honest participants inside the system.

Download Full-text

Dynamic Principal Component CAW Models for High-Dimensional Realized Covariance Matrices

SSRN Electronic Journal ◽

10.2139/ssrn.2888520 ◽

2016 ◽

Author(s):

Bastian Gribisch ◽

Michael Stollenwerk

Keyword(s):

Principal Component ◽

Covariance Matrices ◽

High Dimensional ◽

Realized Covariance

Download Full-text

Separation of Chromatographic Co-Eluted Compounds by Clustering and by Functional Data Analysis

Metabolites ◽

10.3390/metabo11040214 ◽

2021 ◽

Vol 11 (4) ◽

pp. 214

Author(s):

Aneta Sawikowska ◽

Anna Piasecka ◽

Piotr Kachlicki ◽

Paweł Krajewski

Keyword(s):

Simulated Data ◽

Principal Component ◽

Real Data ◽

Functional Principal Component Analysis ◽

Additional Advantage ◽

Time Alignment ◽

Peak Separation ◽

Biological Mixtures ◽

Overlapping Peaks ◽

Retention Time Alignment

Peak overlapping is a common problem in chromatography, mainly in the case of complex biological mixtures, i.e., metabolites. Due to the existence of the phenomenon of co-elution of different compounds with similar chromatographic properties, peak separation becomes challenging. In this paper, two computational methods of separating peaks, applied, for the first time, to large chromatographic datasets, are described, compared, and experimentally validated. The methods lead from raw observations to data that can form inputs for statistical analysis. First, in both methods, data are normalized by the mass of sample, the baseline is removed, retention time alignment is conducted, and detection of peaks is performed. Then, in the first method, clustering is used to separate overlapping peaks, whereas in the second method, functional principal component analysis (FPCA) is applied for the same purpose. Simulated data and experimental results are used as examples to present both methods and to compare them. Real data were obtained in a study of metabolomic changes in barley (Hordeum vulgare) leaves under drought stress. The results suggest that both methods are suitable for separation of overlapping peaks, but the additional advantage of the FPCA is the possibility to assess the variability of individual compounds present within the same peaks of different chromatograms.

Download Full-text