Accurate denoising of single-cell RNA-Seq data using unbiased principal component analysis

2019 ◽  
Author(s):  
Florian Wagner ◽  
Dalia Barkley ◽  
Itai Yanai

Abstract Single-cell RNA-Seq measurements are commonly affected by high levels of technical noise, posing challenges for data analysis and visualization. A diverse array of methods has been proposed to computationally remove noise by sharing information across similar cells or genes; however, their respective accuracies have been difficult to establish. Here, we propose a simple denoising strategy based on principal component analysis (PCA). We show that while PCA performed on raw data is biased towards highly expressed genes, this bias can be mitigated with a cell aggregation step, allowing the recovery of denoised expression values for both highly and lowly expressed genes. We benchmark our resulting ENHANCE algorithm and three previously described methods on simulated data that closely mimic real datasets, showing that ENHANCE provides the best overall denoising accuracy, recovering modules of co-expressed genes and cell subpopulations. Implementations of our algorithm are available at https://github.com/yanailab/enhance.
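
The idea lends itself to a compact sketch. Below is a minimal, illustrative Python version of PCA denoising with a crude aggregation step; it is not the published ENHANCE algorithm (which aggregates nearest-neighbor cells, among other refinements), and the random grouping, counts-per-10k normalization, and component count are placeholder assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def denoise_pca_sketch(X, n_components=20, agg_size=10, seed=0):
    """Toy PCA denoising with a crude cell-aggregation step.

    X: cells x genes count matrix. Aggregating cells before fitting
    PCA reduces the weight of highly expressed genes in the fit,
    which is the bias the abstract refers to.
    """
    rng = np.random.default_rng(seed)
    # Crude aggregation: sum random groups of cells. (ENHANCE itself
    # aggregates similar cells; random grouping is a placeholder.)
    idx = rng.permutation(X.shape[0])
    groups = np.array_split(idx, max(1, X.shape[0] // agg_size))
    X_agg = np.vstack([X[g].sum(axis=0) for g in groups])
    # Normalize to counts-per-10k and log-transform (a common choice).
    X_agg = np.log1p(X_agg / X_agg.sum(axis=1, keepdims=True) * 1e4)
    X_log = np.log1p(X / X.sum(axis=1, keepdims=True) * 1e4)
    # Fit PCA on the aggregated profiles, then denoise each cell by
    # projecting onto those components and mapping back.
    pca = PCA(n_components=n_components).fit(X_agg)
    scores = (X_log - pca.mean_) @ pca.components_.T
    return scores @ pca.components_ + pca.mean_

# Hypothetical usage on a synthetic 200-cell x 1000-gene count matrix.
X = np.random.default_rng(0).poisson(0.5, size=(200, 1000)).astype(float)
X_denoised = denoise_pca_sketch(X)
```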

2018 ◽  
Vol 25 (12) ◽  
pp. 1365-1373 ◽  
Author(s):  
Snehalika Lall ◽  
Debajyoti Sinha ◽  
Sanghamitra Bandyopadhyay ◽  
Debarka Sengupta

2019 ◽  
Author(s):  
Koki Tsuyuzaki ◽  
Hiroyuki Sato ◽  
Kenta Sato ◽  
Itoshi Nikaido

Abstract Principal component analysis (PCA) is an essential method for analyzing single-cell RNA-seq (scRNA-seq) datasets, but large-scale scRNA-seq datasets require long computational times and a large memory capacity. In this work, we review 21 fast and memory-efficient PCA implementations (10 algorithms) and evaluate their application using 4 real and 18 synthetic datasets. Our benchmarking showed that some PCA algorithms are faster, more memory-efficient, and more accurate than others. In consideration of the differences in the computational environments of users and developers, we have also developed guidelines to assist with the selection of appropriate PCA implementations.
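
As a hedged illustration of the kind of speed trade-off such a benchmark measures, the sketch below times scikit-learn's exact full-SVD PCA against its randomized SVD on a synthetic count matrix. The matrix shape and component count are arbitrary, and the reviewed implementations cover many other algorithm families (e.g., Krylov-subspace and online methods) not shown here.

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.extmath import randomized_svd

# Synthetic stand-in for a cells x genes matrix (sizes are arbitrary).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(5000, 2000)).astype(float)
Xc = X - X.mean(axis=0)  # center the data, as PCA requires

t0 = time.perf_counter()
PCA(n_components=50, svd_solver="full").fit(Xc)   # exact full SVD
t_full = time.perf_counter() - t0

t0 = time.perf_counter()
randomized_svd(Xc, n_components=50, random_state=0)  # approximate SVD
t_rand = time.perf_counter() - t0

print(f"full SVD: {t_full:.2f}s   randomized SVD: {t_rand:.2f}s")
```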


2020 ◽  
Vol 21 (16) ◽  
pp. 5797 ◽
Author(s):  
Zhenqiu Liu

Single-cell RNA-seq (scRNA-seq) is a powerful tool for analyzing heterogeneous and functionally diverse cell populations. Visualizing scRNA-seq data can help us effectively extract meaningful biological information and identify novel cell subtypes. Currently, the most popular methods for scRNA-seq visualization are principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). While PCA is an unsupervised dimension-reduction technique, t-SNE converts pairwise distances into pairwise probabilities and then minimizes the Kullback–Leibler divergence between the high- and low-dimensional probability distributions. Uniform Manifold Approximation and Projection (UMAP) is another recently developed visualization method similar to t-SNE. However, one limitation of UMAP and t-SNE is that they capture only the local structure of the data; the global structure is not faithfully preserved. In this manuscript, we propose a semisupervised principal component analysis (ssPCA) approach for scRNA-seq visualization. The proposed approach incorporates cluster labels into dimension reduction and discovers principal components that maximize both data variance and cluster dependence. ssPCA requires cluster labels as input; it is therefore most useful for visualizing clusters produced by a scRNA-seq clustering method. Our experiments with simulated and real scRNA-seq data demonstrate that ssPCA preserves both the local and global structures of the data and uncovers transitions and progressions in the data, if they exist. In addition, ssPCA is convex and has a globally optimal solution. It is also robust and computationally efficient, making it viable for scRNA-seq cluster visualization.
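
The abstract does not give the ssPCA objective, but the general recipe of maximizing variance plus cluster dependence can be sketched with an HSIC-style supervised PCA, shown below under that assumption; the 0/1 label kernel, the optional variance weight `lam`, and the eigen-decomposition are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def supervised_pca_sketch(X, labels, n_components=2, lam=0.0):
    """HSIC-style supervised PCA: find directions that maximize the
    dependence between the projected data and a cluster-label kernel.
    A sketch of the general idea only, not the paper's exact ssPCA."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                      # centering matrix
    K = (labels[:, None] == labels[None, :]).astype(float)   # label kernel
    M = X.T @ H @ K @ H @ X          # cluster-dependence term
    M += lam * (X.T @ H @ X)         # optional variance term (weight assumed)
    M = (M + M.T) / 2                # symmetrize for numerical stability
    _, eigvecs = np.linalg.eigh(M)
    W = eigvecs[:, -n_components:][:, ::-1]                  # top directions
    return (X - X.mean(axis=0)) @ W

# Hypothetical usage with precomputed cluster labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
labels = rng.integers(0, 3, size=300)
embedding = supervised_pca_sketch(X, labels, n_components=2)
```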


Entropy ◽  
2019 ◽  
Vol 21 (6) ◽  
pp. 548 ◽  
Author(s):  
Yuqing Sun ◽  
Jun Niu

Hydrological regionalization is a useful step in hydrological modeling and prediction. Regionalization is not always straightforward, however, due to the lack of long-term hydrological data and the complex multi-scale variability features embedded in the data. This study examines multiscale soil moisture variability for data simulated on a grid-cell basis by a large-scale hydrological model, and clusters the grid-cell soil moisture data using wavelet-based multiscale entropy and principal component analysis, over the Xijiang River basin in South China, for the period 2002–2010. The regionalization, for 169 grid cells at a spatial resolution of 0.5° × 0.5°, produced homogeneous groups based on the pattern of wavelet-based entropy information. Four distinct modes explain 80.14% of the total embedded variability of the transformed wavelet power across different timescales. Moreover, the possible implications of the regionalization results for local hydrological applications, such as parameter estimation for an ungauged catchment and designing a uniform prediction strategy for a sub-area in a large-scale basin, are discussed.
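
A minimal sketch of this pipeline, under assumed choices (a Haar wavelet with five decomposition levels, one common definition of relative wavelet-energy entropy, k-means for the grouping, and random data standing in for the simulated soil moisture):

```python
import numpy as np
import pywt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def wavelet_entropy_features(series, wavelet="haar", level=5):
    """Per-scale relative wavelet-energy entropy terms; one common
    wavelet-entropy definition, not necessarily the paper's exact one."""
    coeffs = pywt.wavedec(series, wavelet, level=level)
    energy = np.array([np.sum(c ** 2) for c in coeffs])
    p = energy / energy.sum()               # relative energy per scale
    return -p * np.log(p + 1e-12)           # entropy contributions

# Placeholder data: 169 grid cells x 108 monthly values (2002-2010).
rng = np.random.default_rng(0)
soil_moisture = rng.random((169, 108))
feats = np.vstack([wavelet_entropy_features(s) for s in soil_moisture])
modes = PCA(n_components=4).fit_transform(feats)       # four leading modes
regions = KMeans(n_clusters=4, n_init=10).fit_predict(modes)
```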


2014 ◽  
Vol 556-562 ◽  
pp. 4317-4320 ◽
Author(s):  
Qiang Zhang ◽  
Li Ping Liu ◽  
Chao Liu

As a zero-emission mode of transportation, an increasing number of Electric Vehicles (EVs) have come into use in our daily lives. The EV charging station is an important component of the Smart Grid, which now faces the challenges of big data. This paper presents a data compression and reconstruction method based on Principal Component Analysis (PCA). The data reconstruction error, measured as the Normalized Absolute Percent Error (NAPE), is taken into account to balance the compression ratio against reconstruction quality. Using simulated data, the effectiveness of data compression and reconstruction for EV charging stations is verified.
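
The compress-then-reconstruct loop is easy to sketch. The following is a minimal illustration, not the paper's implementation: the NAPE formula is one plausible reading of "normalized absolute percent error" (the abstract does not define it), and the 96-samples-per-day profile shape is a hypothetical placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA

def compress_reconstruct(X, n_components):
    """PCA compression/reconstruction with a normalized error metric."""
    pca = PCA(n_components=n_components).fit(X)
    codes = pca.transform(X)                 # compressed representation
    X_hat = pca.inverse_transform(codes)     # reconstruction
    # Assumed NAPE definition: mean absolute error relative to the
    # mean absolute signal level.
    nape = np.mean(np.abs(X - X_hat)) / np.mean(np.abs(X))
    return X_hat, nape

# Hypothetical usage: 365 daily charging-load profiles, 96 samples/day.
X = np.abs(np.random.default_rng(1).normal(size=(365, 96)))
for k in (5, 10, 20):
    _, err = compress_reconstruct(X, k)
    print(f"k={k}: compression ratio ~{96 / k:.1f}x, NAPE={err:.4f}")
```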


2021 ◽  
Author(s):  
Tatsuma Shoji ◽  
Yoshiharu Sato

Abstract Background: RNA-Seq data are usually summarized by counting the number of transcript reads aligned to each gene. However, count-based methods do not take alignment information, where and how each read was mapped within the gene, into account. This information is essential for characterizing samples accurately. In this study, we developed a method to summarize RNA-Seq data without losing alignment information. Results: To include alignment information, we introduce "q-mer analysis," which summarizes RNA-Seq data with 4^q kinds of q-length oligomers. Using publicly available RNA-Seq datasets, we demonstrate that at least q ≥ 9 is required to capture alignment information in Homo sapiens. It should be noted that 4^9 = 262,144 is approximately 10 times larger than the number of genes in H. sapiens (20,022 genes). Furthermore, principal component analysis showed that q-mer analysis with q = 14 linearly distinguished case samples from controls, while a count-based method failed to. These results indicate that alignment information is essential for characterizing transcriptomic samples. Conclusions: We introduce q-mer analysis to include alignment information in RNA-Seq analysis and demonstrate its superiority over count-based methods in distinguishing case samples from controls. Combining RNA-Seq research with q-mer analysis could be useful for identifying distinguishing transcriptomic features that provide hypotheses for disease mechanisms.
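
The counting step can be sketched with a sparse counter, as below; the two-sample reads are placeholders, and at q = 14 a sparse representation is essential since the full feature space has 4^14 ≈ 2.7 × 10^8 possible q-mers. This is an illustration of the general idea, not the paper's implementation.

```python
from collections import Counter
import numpy as np
from sklearn.decomposition import PCA

def qmer_counts(reads, q=9):
    """Count the q-length substrings of each read: a sparse stand-in
    for the full 4**q-dimensional q-mer feature vector."""
    c = Counter()
    for read in reads:
        for i in range(len(read) - q + 1):
            c[read[i:i + q]] += 1
    return c

# Placeholder samples; real input would be aligned RNA-Seq reads.
samples = [["ACGTACGTACGTACG", "TTGACCAGTAGGCAT"],
           ["ACGTACGAACGTACG", "TTGACCAGTAGGCTT"]]
counters = [qmer_counts(reads, q=9) for reads in samples]
vocab = sorted(set().union(*counters))           # observed q-mers only
X = np.array([[c[k] for k in vocab] for c in counters], dtype=float)
scores = PCA(n_components=1).fit_transform(X)    # look for separation
```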


Mathematics ◽  
2018 ◽  
Vol 6 (11) ◽  
pp. 269 ◽  
Author(s):  
Sergio Camiz ◽  
Valério Pillar

The identification of a reduced-dimensional representation of the data is among the main issues of exploratory multidimensional data analysis, and several solutions have been proposed in the literature. Principal Component Analysis (PCA) is the method that has received the most attention thus far, and several identification methods, the so-called stopping rules, have been proposed, giving very different results in practice; some comparative studies have been carried out. Inconsistencies in those studies led us to pin down the distinction between signal and noise in PCA, and its limits, and to propose a new testing method. This consists in producing simulated data according to a predefined eigenvalue structure, including zero eigenvalues. From random populations built according to several such structures, reduced-size samples were extracted, and different levels of random normal noise were added to them. This controlled introduction of noise allows a clear distinction between expected signal and noise, the latter relegated to the non-zero sample eigenvalues that correspond to zero eigenvalues in the population. With this new method, we tested the performance of ten different stopping rules. For every method, structure, and noise level, both power (the ability to correctly identify the expected dimension) and type-I error (the detection of a dimension composed only of noise) were measured, by counting the relative frequency with which the smallest non-zero population eigenvalue was recognized as signal in the samples, and that with which the largest zero eigenvalue was recognized as noise, respectively. In this way, the behaviour of the examined methods is clear and their comparison and evaluation are possible. The reported results show that both the generalization of Bartlett's test by Rencher and the bootstrap method by Pillar perform much better than all the others: both show reasonable power, decreasing with noise, and very good type-I error. Thus, more than the others, these methods deserve to be adopted.
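
The simulation design can be sketched as follows; the eigenvalue structure, sample size, and noise levels are arbitrary illustrative choices, and an actual stopping rule would replace the printed inspection of the sample eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# Predefined eigenvalue structure with trailing zeros: the expected
# signal dimension is 4. The values are arbitrary, for illustration.
eigvals = np.array([10.0, 6.0, 3.0, 1.5, 0.0, 0.0, 0.0, 0.0])
p = len(eigvals)
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))    # random orthonormal basis
cov = Q @ np.diag(eigvals) @ Q.T                # population covariance

for noise_sd in (0.1, 0.5, 1.0):
    X = rng.multivariate_normal(np.zeros(p), cov, size=100)
    X += rng.normal(scale=noise_sd, size=X.shape)   # controlled noise
    sample_eigs = np.linalg.eigvalsh(np.cov(X.T))[::-1]
    # A good stopping rule should call the first 4 sample eigenvalues
    # signal and the remaining 4 (pure noise) not-signal.
    print(noise_sd, np.round(sample_eigs, 2))
```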


2010 ◽  
Vol 08 (06) ◽  
pp. 995-1011 ◽  
Author(s):  
HAO ZHENG ◽  
HONGWEI WU

Metagenomics is an emerging field in which the power of genomic analysis is applied to an entire microbial community, bypassing the need to isolate and culture individual microbial species. Assembling metagenomic DNA fragments is much like the overlap-layout-consensus procedure for assembling isolated genomes, but is augmented by an additional binning step that assigns scaffolds, contigs, and unassembled reads to taxonomic groups. In this paper, we employed n-mer oligonucleotide frequencies as features and developed a hierarchical classifier (PCAHIER) for binning short (≤1,000 bp) metagenomic fragments. Principal component analysis was used to reduce the high dimensionality of the feature space. The hierarchical classifier consists of four layers of local classifiers implemented with linear discriminant analysis. These local classifiers bin prokaryotic DNA fragments into superkingdoms, fragments of the same superkingdom into phyla, fragments of the same phylum into genera, and fragments of the same genus into species, respectively. We evaluated the performance of PCAHIER using our own simulated data sets as well as the widely used simHC synthetic metagenome data set from the IMG/M system. The effectiveness of PCAHIER was demonstrated through comparisons against a non-hierarchical classifier and two existing binning algorithms (TETRA and Phylopythia).
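
One local classifier of such a hierarchy can be sketched as an n-mer frequency extractor feeding PCA and then LDA. This is a sketch under stated assumptions: the tetramer (n = 4) choice, the component count, and the random fragments and binary labels are hypothetical placeholders, not the paper's data or configuration.

```python
import numpy as np
from itertools import product
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

KMERS = ["".join(k) for k in product("ACGT", repeat=4)]
INDEX = {k: i for i, k in enumerate(KMERS)}

def tetramer_freqs(seq):
    """Normalized tetranucleotide frequency vector (256 features)."""
    v = np.zeros(len(KMERS))
    for i in range(len(seq) - 3):
        v[INDEX[seq[i:i + 4]]] += 1
    return v / max(v.sum(), 1.0)

# Placeholder fragments and labels standing in for one level of the
# hierarchy (e.g., the superkingdom-level local classifier).
rng = np.random.default_rng(0)
frags = ["".join(rng.choice(list("ACGT"), size=1000)) for _ in range(60)]
labels = rng.integers(0, 2, size=60)    # two dummy taxonomic groups
X = np.vstack([tetramer_freqs(f) for f in frags])

# PCA reduces the 256-dimensional feature space before the LDA step.
clf = make_pipeline(PCA(n_components=20), LinearDiscriminantAnalysis())
clf.fit(X, labels)
```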

