When and Why are Principal Component Scores a Good Tool for Visualizing High-dimensional Data?

Kernel entropy component analysis (KECA) reveals the original data’s structure by kernel matrix. This structure is related to the Renyi entropy of the data. KECA maintains the invariance of the original data’s structure by keeping the data’s Renyi entropy unchanged. This paper described the original data by several components on the purpose of dimension reduction. Then the KECA was applied in celestial spectra reduction and was compared with Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA) by experiments. Experimental results show that the KECA is a good method in high-dimensional data reduction.

Download Full-text

Principal Component Analysis (PCA) for high-dimensional data. PCA is dead. Long live PCA

Perspectives on Big Data Analysis - Contemporary Mathematics ◽

10.1090/conm/622/12430 ◽

2014 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Fan Yang ◽

Kjell Doksum ◽

Kam-Wah Tsui

Keyword(s):

Principal Component Analysis ◽

High Dimensional Data ◽

Principal Component ◽

Component Analysis ◽

High Dimensional

Download Full-text

Convergence and prediction of principal component scores in high-dimensional settings

The Annals of Statistics ◽

10.1214/10-aos821 ◽

2010 ◽

Vol 38 (6) ◽

pp. 3605-3629 ◽

Cited By ~ 43

Author(s):

Seunggeun Lee ◽

Fei Zou ◽

Fred A. Wright

Keyword(s):

Principal Component ◽

High Dimensional ◽

Component Scores

Download Full-text

A new proposal for a principal component-based test for high-dimensional data applied to the analysis of PhyloChip data

Biometrical Journal ◽

10.1002/bimj.201000164 ◽

2011 ◽

Vol 54 (1) ◽

pp. 94-107 ◽

Cited By ~ 5

Author(s):

Guo-Chun Ding ◽

Kornelia Smalla ◽

Holger Heuer ◽

Siegfried Kropf

Keyword(s):

High Dimensional Data ◽

Principal Component ◽

High Dimensional

Download Full-text

A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-Seq data

10.1101/689851 ◽

2019 ◽

Cited By ~ 9

Author(s):

Shamus M. Cooley ◽

Timothy Hamilton ◽

J. Christian J. Ray ◽

Eric J. Deeds

Keyword(s):

Dimensionality Reduction ◽

Single Cells ◽

High Dimensional Data ◽

Three Dimensional ◽

Principal Component ◽

Cell Types ◽

High Dimensional ◽

Before And After ◽

Downstream Analysis ◽

High Level

AbstractHigh-dimensional data are becoming increasingly common in nearly all areas of science. Developing approaches to analyze these data and understand their meaning is a pressing issue. This is particularly true for the rapidly growing field of single-cell RNA-Seq (scRNA-Seq), a technique that simultaneously measures the expression of tens of thousands of genes in thousands to millions of single cells. The emerging consensus for analysis workflows reduces the dimensionality of the dataset before performing downstream analysis, such as assignment of cell types. One problem with this approach is that dimensionality reduction can introduce substantial distortion into the data; consider the familiar example of trying to represent the three-dimensional earth as a two-dimensional map. It is currently unclear if such distortion affects analysis of scRNA-Seq data sets. Here, we introduce a straightforward approach to quantifying this distortion by comparing the local neighborhoods of points before and after dimensionality reduction. We found that popular techniques like t-SNE and UMAP introduce substantial distortion even for relatively simple geometries such as simulated hyperspheres. For scRNA-Seq data, we found the distortion in local neighborhoods was often greater than 95% in the representations typically used for downstream analysis. This high level of distortion can readily introduce important errors into cell type identification, pseudotime ordering, and other analyses that rely on local relationships. We found that principal component analysis can generate accurate embeddings of the data, but only when using dimensionalities that are much higher than typically used in scRNA-Seq analysis. We suggest approaches to take these findings into account and call for a new generation of dimensional reduction algorithms that can accurately embed high dimensional data in its true latent dimension.

Download Full-text

Multilevel Functional Principal Component Analysis for High-Dimensional Data

Journal of Computational and Graphical Statistics ◽

10.1198/jcgs.2011.10122 ◽

2011 ◽

Vol 20 (4) ◽

pp. 852-873 ◽

Cited By ~ 33

Author(s):

Vadim Zipunnikov ◽

Brian Caffo ◽

David M. Yousem ◽

Christos Davatzikos ◽

Brian S. Schwartz ◽

...

Keyword(s):

Principal Component Analysis ◽

High Dimensional Data ◽

Principal Component ◽

Component Analysis ◽

Functional Principal Component Analysis ◽

High Dimensional ◽

Functional Principal Component

Download Full-text

Impact of Bone Marrow Radiation Dose on Acute Hematologic Toxicity in Cervical Cancer: Principal Component Analysis on High Dimensional Data

International Journal of Radiation Oncology*Biology*Physics ◽

10.1016/j.ijrobp.2009.11.062 ◽

2010 ◽

Vol 78 (3) ◽

pp. 912-919 ◽

Cited By ~ 28

Author(s):

Yun Liang ◽

Karen Messer ◽

Brent S. Rose ◽

John H. Lewis ◽

Steve B. Jiang ◽

...

Keyword(s):

Cervical Cancer ◽

Principal Component Analysis ◽

Bone Marrow ◽

Radiation Dose ◽

High Dimensional Data ◽

Principal Component ◽

Component Analysis ◽

High Dimensional ◽

Hematologic Toxicity

Download Full-text

Geometric consistency of principal component scores for high‐dimensional mixture models and its application

Scandinavian Journal of Statistics ◽

10.1111/sjos.12432 ◽

2019 ◽

Vol 47 (3) ◽

pp. 899-921

Author(s):

Kazuyoshi Yata ◽

Makoto Aoshima

Keyword(s):

Mixture Models ◽

Principal Component ◽

High Dimensional ◽

Geometric Consistency ◽

Component Scores

Download Full-text

Visualization and Interpretation of Multivariate Associations with Disease Risk Markers and Disease Risk—The Triplot

Metabolites ◽

10.3390/metabo9070133 ◽

2019 ◽

Vol 9 (7) ◽

pp. 133 ◽

Cited By ~ 2

Author(s):

Tessa Schillemans ◽

Lin Shi ◽

Xin Liu ◽

Agneta Åkesson ◽

Rikard Landberg ◽

...

Keyword(s):

Latent Variable ◽

Disease Risk ◽

High Dimensional Data ◽

Principal Component ◽

High Dimensional ◽

Intermediate Risk ◽

Risk Modeling ◽

Least Squares Regression ◽

Metabolomics Data ◽

Incident Type

Metabolomics has emerged as a promising technique to understand relationships between environmental factors and health status. Through comprehensive profiling of small molecules in biological samples, metabolomics generates high-dimensional data objectively, reflecting exposures, endogenous responses, and health effects, thereby providing further insights into exposure-disease associations. However, the multivariate nature of metabolomics data contributes to high complexity in analysis and interpretation. Efficient visualization techniques of multivariate data that allow direct interpretation of combined exposures, metabolome, and disease risk, are currently lacking. We have therefore developed the ‘triplot’ tool, a novel algorithm that simultaneously integrates and displays metabolites through latent variable modeling (e.g., principal component analysis, partial least squares regression, or factor analysis), their correlations with exposures, and their associations with disease risk estimates or intermediate risk factors. This paper illustrates the framework of the ‘triplot’ using two synthetic datasets that explore associations between dietary intake, plasma metabolome, and incident type 2 diabetes or BMI, an intermediate risk factor for lifestyle-related diseases. Our results demonstrate advantages of triplot over conventional visualization methods in facilitating interpretation in multivariate risk modeling with high-dimensional data. Algorithms, synthetic data, and tutorials are open source and available in the R package ‘triplot’.

Download Full-text