Structure-Aware Principal Component Analysis for Single-Cell RNA-seq Data

Abstract: Single-cell RNA-Seq measurements are commonly affected by high levels of technical noise, posing challenges for data analysis and visualization. A diverse array of methods has been proposed to computationally remove noise by sharing information across similar cells or genes; however, their respective accuracies have been difficult to establish. Here, we propose a simple denoising strategy based on principal component analysis (PCA). We show that while PCA performed on raw data is biased towards highly expressed genes, this bias can be mitigated with a cell aggregation step, allowing the recovery of denoised expression values for both highly and lowly expressed genes. We benchmark our resulting ENHANCE algorithm and three previously described methods on simulated data that closely mimic real datasets, showing that ENHANCE provides the best overall denoising accuracy, recovering modules of co-expressed genes and cell subpopulations. Implementations of our algorithm are available at https://github.com/yanailab/enhance.
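
The aggregate-then-PCA idea in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the ENHANCE implementation: the function names, the neighbor count `k`, the fixed `n_components`, and the log-normalization are all hypothetical choices made for the example, and every cell is assumed to have a nonzero total count.

```python
import numpy as np

def aggregate_neighbors(counts, k):
    """Sum each cell's counts with those of its nearest neighbors.

    counts: (n_cells, n_genes) array of raw counts.
    k: number of neighbors to pool with each cell (hypothetical parameter).
    """
    totals = counts.sum(axis=1, keepdims=True)
    norm = np.log1p(counts / totals * np.median(totals))
    # Pairwise squared Euclidean distances between normalized cells.
    sq = (norm ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * norm @ norm.T
    # The k + 1 closest cells; this normally includes the cell itself.
    nearest = np.argsort(d2, axis=1)[:, :k + 1]
    return counts[nearest].sum(axis=1)

def pca_denoise(counts, k=5, n_components=2):
    """Reconstruct expression from the leading PCs of aggregated cells."""
    aggregated = aggregate_neighbors(counts, k)
    totals = aggregated.sum(axis=1, keepdims=True)
    target = np.median(counts.sum(axis=1))
    x = np.log1p(aggregated / totals * target)
    mean = x.mean(axis=0)
    # PCA via SVD of the centered matrix; rows of vt are the components.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    components = vt[:n_components]
    scores = (x - mean) @ components.T
    denoised_log = scores @ components + mean
    # Back-transform to the normalized count scale.
    return np.clip(np.expm1(denoised_log), 0.0, None)
```

Pooling neighbors before the decomposition raises the effective count depth of lowly expressed genes, which is the intuition the abstract gives for reducing the bias towards highly expressed genes.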

Benchmarking principal component analysis for large-scale single-cell RNA-sequencing

10.1101/642595 ◽ 2019 ◽ Cited By ~ 1

Author(s): Koki Tsuyuzaki ◽ Hiroyuki Sato ◽ Kenta Sato ◽ Itoshi Nikaido

Keyword(s): Principal Component Analysis ◽ Single Cell ◽ Large Scale ◽ Principal Component ◽ Component Analysis ◽ Rna Seq ◽ Large Memory ◽ Synthetic Datasets ◽ Selection Of ◽ Memory Efficient

Abstract: Principal component analysis (PCA) is an essential method for analyzing single-cell RNA-seq (scRNA-seq) datasets, but large-scale scRNA-seq datasets require long computational times and a large memory capacity. In this work, we review 21 fast and memory-efficient PCA implementations (10 algorithms) and evaluate their application using 4 real and 18 synthetic datasets. Our benchmarking showed that some PCA algorithms are faster, more memory efficient, and more accurate than others. In consideration of the differences in the computational environments of users and developers, we have also developed guidelines to assist with selection of appropriate PCA implementations.
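
As an illustration of what a fast, memory-light PCA can look like, the sketch below uses randomized SVD with power iterations. The abstract does not say which algorithms were benchmarked, so this is only one plausible example; the function name and the `n_oversamples`, `n_iter`, and `seed` parameters are assumptions made for the example.

```python
import numpy as np

def randomized_pca(x, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top principal components with randomized SVD.

    x: (n_cells, n_genes) dense array. Centering is done explicitly here,
    which would densify a sparse matrix; a large-scale implementation
    would avoid that step.
    """
    rng = np.random.default_rng(seed)
    xc = x - x.mean(axis=0)
    n_random = min(n_components + n_oversamples, min(xc.shape))
    q = rng.standard_normal((xc.shape[1], n_random))
    # Power iterations sharpen the estimate of the dominant subspace;
    # QR after each product keeps the basis numerically stable.
    for _ in range(n_iter):
        q, _ = np.linalg.qr(xc @ q)
        q, _ = np.linalg.qr(xc.T @ q)
    q, _ = np.linalg.qr(xc @ q)      # orthonormal basis, (n_cells, n_random)
    b = q.T @ xc                     # small projected matrix, (n_random, n_genes)
    u_b, s, vt = np.linalg.svd(b, full_matrices=False)
    scores = (q @ u_b[:, :n_components]) * s[:n_components]
    return scores, vt[:n_components], s[:n_components]
```

Only the small matrix `b` is decomposed exactly, so the cost is dominated by a handful of matrix products with the data rather than a full SVD.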

Visualizing Single-Cell RNA-seq Data with Semisupervised Principal Component Analysis

International Journal of Molecular Sciences ◽ 10.3390/ijms21165797 ◽ 2020 ◽ Vol 21 (16) ◽ pp. 5797

Author(s): Zhenqiu Liu

Keyword(s): Principal Component Analysis ◽ Dimension Reduction ◽ Single Cell ◽ Optimal Solution ◽ Principal Component ◽ Component Analysis ◽ Biological Information ◽ Rna Seq ◽ Computationally Efficient ◽ Leibler Divergence

Single-cell RNA-seq (scRNA-seq) is a powerful tool for analyzing heterogeneous and functionally diverse cell populations. Visualizing scRNA-seq data can help us effectively extract meaningful biological information and identify novel cell subtypes. Currently, the most popular methods for scRNA-seq visualization are principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). While PCA is an unsupervised dimension reduction technique, t-SNE incorporates cluster information into pairwise probability, and then minimizes the Kullback–Leibler divergence. Uniform Manifold Approximation and Projection (UMAP) is another recently developed visualization method similar to t-SNE. However, one limitation of UMAP and t-SNE is that they capture only the local structure of the data; the global structure is not faithfully preserved. In this manuscript, we propose a semisupervised principal component analysis (ssPCA) approach for scRNA-seq visualization. The proposed approach incorporates cluster labels into dimension reduction and discovers principal components that maximize both data variance and cluster dependence. ssPCA requires cluster labels as its input. Therefore, it is most useful for visualizing clusters produced by scRNA-seq clustering software. Our experiments with simulated and real scRNA-seq data demonstrate that ssPCA is able to preserve both local and global structures of the data, and to uncover transitions and progressions in the data, if they exist. In addition, ssPCA is convex and has a global optimal solution. It is also robust and computationally efficient, making it viable for scRNA-seq cluster visualization.

Q-Mer Analysis: A Generalized Method for Analyzing RNA-Seq Data

10.21203/rs.3.rs-914457/v1 ◽ 2021

Author(s): Tatsuma Shoji ◽ Yoshiharu Sato

Keyword(s): Principal Component Analysis ◽ Homo Sapiens ◽ Principal Component ◽ Component Analysis ◽ Rna Seq ◽ Disease Mechanisms ◽ Number Of Genes ◽ Q 14

Abstract

Background: RNA-Seq data are usually summarized by counting the number of transcript reads aligned to each gene. However, count-based methods do not take alignment information (where and how each read was mapped in the gene) into account. This information is essential to characterize samples accurately. In this study, we developed a method to summarize RNA-Seq data without losing alignment information.

Results: To include alignment information, we introduce "q-mer analysis," which summarizes RNA-Seq data with 4^q kinds of q-length oligomers. Using publicly available RNA-Seq datasets, we demonstrate that q ≥ 9 is required for capturing alignment information in Homo sapiens. It should be noted that 4^9 = 262,144 is approximately 10 times larger than the number of genes in H. sapiens (20,022 genes). Furthermore, principal component analysis showed that q-mer analysis with q = 14 linearly distinguished samples from controls, while a count-based method failed. These results indicate that alignment information is essential to characterize transcriptomics samples.

Conclusions: We introduce q-mer analysis to include alignment information in RNA-Seq analysis and demonstrate the superiority of q-mer analysis over count-based methods in that q-mer analysis can distinguish case samples from controls. Combining RNA-Seq research with q-mer analysis could be useful for identifying distinguishing transcriptomic features that could provide hypotheses for disease mechanisms.
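
To make the size of the q-mer feature space concrete, here is a minimal sketch that counts every length-q substring in a set of reads. It assumes a sample is summarized by plain substring counts over the alphabet A, C, G, T; the preprint's exact procedure may differ, and the function names are hypothetical.

```python
from collections import Counter

def count_qmers(reads, q):
    """Count every length-q substring (q-mer) across a list of reads.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - q + 1):
            qmer = seq[i:i + q]
            if set(qmer) <= set("ACGT"):
                counts[qmer] += 1
    return counts

def qmer_space_size(q):
    """Number of distinct q-mers over a four-letter alphabet: 4 ** q."""
    return 4 ** q
```

For example, `count_qmers(["ACGTAC"], 2)` counts `AC` twice and `CG`, `GT`, and `TA` once each, and `qmer_space_size(9)` gives the 262,144 features quoted in the abstract.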

Robust principal component analysis for accurate outlier sample detection in RNA-Seq data

BMC Bioinformatics ◽ 10.1186/s12859-020-03608-0 ◽ 2020 ◽ Vol 21 (1) ◽ Cited By ~ 1

Author(s): Xiaoying Chen ◽ Bo Zhang ◽ Ting Wang ◽ Azad Bonni ◽ Guoyan Zhao

Keyword(s): Principal Component Analysis ◽ Principal Component ◽ Component Analysis ◽ Rna Seq ◽ Robust Principal Component Analysis

Truncated Robust Principal Component Analysis and Noise Reduction for Single Cell RNA Sequencing Data

Journal of Computational Biology ◽ 10.1089/cmb.2018.0255 ◽ 2019 ◽ Vol 26 (8) ◽ pp. 782-793 ◽ Cited By ~ 1

Author(s): Krzysztof Gogolewski ◽ Maciej Sykulski ◽ Neo Christopher Chung ◽ Anna Gambin

Keyword(s): Principal Component Analysis ◽ Noise Reduction ◽ Single Cell ◽ Rna Sequencing ◽ Principal Component ◽ Component Analysis ◽ Sequencing Data ◽ Robust Principal Component Analysis ◽ Single Cell Rna Sequencing

Auto-classification for confocal back-scattering micro-spectrum at single-cell scale using principal component analysis

Optik ◽ 10.1016/j.ijleo.2015.10.066 ◽ 2016 ◽ Vol 127 (3) ◽ pp. 1007-1010 ◽ Cited By ~ 4

Author(s): Cheng Wang ◽ Miao Wen ◽ Lihong Bai ◽ Tong Zhang

Keyword(s): Principal Component Analysis ◽ Single Cell ◽ Principal Component ◽ Component Analysis ◽ Back Scattering

PCAGO: An interactive web service to analyze RNA-Seq data with principal component analysis

10.1101/433078 ◽ 2018 ◽ Cited By ~ 1

Author(s): Ruman Gerst ◽ Martin Hölzer

Keyword(s): Principal Component Analysis ◽ Web Service ◽ Principal Components ◽ Clustering Algorithm ◽ Gene Annotation ◽ Principal Component ◽ Component Analysis ◽ Rna Seq ◽ Gene Sets ◽ Relationship Of

Abstract: The initial characterization and clustering of biological samples is a critical step in the analysis of any transcriptomic study. In many studies, principal component analysis (PCA) is the clustering algorithm of choice to predict the relationship of samples or cells based solely on differential gene expression. In addition to the pure quality evaluation of the data, a PCA can also provide initial insights into the biological background of an experiment and help researchers to interpret the data and design the subsequent computational steps accordingly. However, to avoid misleading clusterings and interpretations, an appropriate selection of the underlying gene sets used to build the PCA and the choice of the most fitting principal components for the visualization are crucial. Here, we present PCAGO, an easy-to-use and interactive web service to analyze gene quantification data derived from RNA sequencing (RNA-Seq) experiments with PCA. The tool includes features such as read-count normalization, filtering of read counts by gene annotation, and various visualization options. Additionally, PCAGO helps to select appropriate parameters such as the number of genes and principal components to create meaningful visualizations. Availability and implementation: The web service is implemented in R and freely available at [email protected]
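
The two parameters highlighted in the abstract, the number of genes and the number of principal components, can be illustrated with a short sketch. PCAGO itself is implemented in R; the Python function below is not its code, and selecting genes by variance is an assumption made for the example, as are the function and parameter names.

```python
import numpy as np

def pca_on_top_genes(expr, n_genes, n_components):
    """Run PCA on the n_genes most variable genes.

    expr: (n_samples, n_genes_total) array of normalized expression values.
    Returns sample scores, the fraction of variance explained by each
    retained component, and the indices of the selected genes.
    """
    variances = expr.var(axis=0)
    top = np.argsort(variances)[::-1][:n_genes]
    x = expr[:, top] - expr[:, top].mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    scores = u[:, :n_components] * s[:n_components]
    return scores, explained[:n_components], top
```

Rerunning the function over a range of `n_genes` values and comparing the explained-variance fractions is one simple way to see how sensitive a PCA plot is to the gene set behind it.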
