findPC: An R package to automatically select number of principal components in single-cell analysis

2021 ◽  
Author(s):  
Haotian Zhuang ◽  
Zhicheng Ji

Principal component analysis (PCA) is widely used in analyzing single-cell genomic data. Selecting the optimal number of PCs is a crucial step for downstream analyses. The elbow method is most commonly used for this task, but it requires one to visually inspect the elbow plot and manually choose the elbow point. To address this limitation, we developed six methods to automatically select the optimal number of PCs based on the elbow method. We evaluated the performance of these methods on real single-cell RNA-seq data from multiple human and mouse tissues. The perpendicular line method with 20 PCs has the best overall performance, and its results are highly consistent with the numbers of PCs identified manually. We implemented the six methods in an R package, findPC, that objectively selects the number of PCs and can be easily incorporated into any automatic analysis pipeline.
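The idea behind a perpendicular line approach to the elbow plot can be sketched as follows. This is a minimal illustration, not the findPC implementation: it assumes the elbow is the point lying farthest from the straight line that joins the first and last points of the plot, and the function name `elbow_by_perpendicular_line` and the example values are hypothetical.

```python
import math


def elbow_by_perpendicular_line(sdev):
    """Return the 1-based index of the point farthest from the line that
    joins the first and last points of an elbow plot.

    `sdev` holds the standard deviation of each PC in decreasing order.
    Assumption: this geometric rule stands in for the perpendicular line
    method named in the abstract; the package itself may differ.
    """
    n = len(sdev)
    if n < 3:
        raise ValueError("need at least three points to locate an elbow")
    x0, y0 = 1.0, float(sdev[0])
    x1, y1 = float(n), float(sdev[-1])
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy)
    best_k, best_dist = 1, -1.0
    for k, y in enumerate(sdev, start=1):
        # Perpendicular distance from point (k, y) to the reference line.
        dist = abs(dy * (k - x0) - dx * (y - y0)) / norm
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


# Made-up standard deviations for the first 10 PCs.
sdev = [10.0, 5.0, 2.5, 2.0, 1.8, 1.6, 1.5, 1.4, 1.3, 1.2]
n_pcs = elbow_by_perpendicular_line(sdev)
```

Because the rule depends only on the ordered standard deviations, it needs no visual inspection and can run unattended inside a pipeline.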

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Walter Muskovic ◽  
Joseph E. Powell

Background: Advances in droplet-based single-cell RNA-sequencing (scRNA-seq) have dramatically increased throughput, allowing tens of thousands of cells to be routinely sequenced in a single experiment. In addition to cells, droplets capture cell-free “ambient” RNA predominantly caused by lysis of cells during sample preparation. Samples with high ambient RNA concentration can create challenges in accurately distinguishing cell-containing droplets from droplets containing ambient RNA. Current methods to separate these groups often retain a significant number of droplets that do not contain cells, so-called empty droplets. Additionally, there are currently no methods available to detect droplets containing damaged cells, which comprise partially lysed cells, the original source of the ambient RNA. Results: Here, we describe DropletQC, a new method that is able to detect empty droplets, damaged cells, and intact cells, and accurately distinguish them from one another. This approach is based on a novel quality control metric, the nuclear fraction, which quantifies for each droplet the fraction of RNA originating from unspliced, nuclear pre-mRNA. We demonstrate how DropletQC provides a powerful extension to existing computational methods for identifying empty droplets such as EmptyDrops. Conclusions: We implement DropletQC as an R package, which can be easily integrated into existing single-cell analysis workflows.
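The nuclear fraction described above is a simple per-droplet ratio, which the following sketch makes concrete. It assumes each droplet's reads have already been split into unspliced (pre-mRNA) and spliced counts; the function name, the barcodes, and the counts are all hypothetical, and the sketch does not reproduce how DropletQC derives those counts or how it classifies droplets.

```python
def nuclear_fraction(unspliced_reads, spliced_reads):
    """Fraction of a droplet's reads that come from unspliced pre-mRNA.

    Assumption: the two counts are supplied by an upstream step that is
    not shown here. Returns NaN for a droplet with no reads at all.
    """
    total = unspliced_reads + spliced_reads
    if total == 0:
        return float("nan")
    return unspliced_reads / total


# Hypothetical barcodes mapped to (unspliced, spliced) read counts.
droplets = {"AAAC": (30, 70), "AAAG": (5, 95), "AAAT": (80, 20)}
fractions = {bc: nuclear_fraction(u, s) for bc, (u, s) in droplets.items()}
```

The resulting value lies between 0 and 1 for any droplet with reads, so it can be examined alongside existing per-droplet metrics such as total read count.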


2017 ◽  
Author(s):  
Bo Wang ◽  
Daniele Ramazzotti ◽  
Luca De Sano ◽  
Junjie Zhu ◽  
Emma Pierson ◽  
...  

Motivation: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via a better visualization. Availability and Implementation: SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package. Supplementary Information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Zachary J. DeBruine ◽  
Karsten Melcher ◽  
Timothy J. Triche

Non-negative matrix factorization (NMF) is an intuitively appealing method to extract additive combinations of measurements from noisy or complex data. NMF is applied broadly to text and image processing, time-series analysis, and genomics, where recent technological advances permit sequencing experiments to measure the representation of tens of thousands of features in millions of single cells. In these experiments, a count of zero for a given feature in a given cell may indicate either the absence of that feature or an insufficient read coverage to detect that feature (“dropout”). In contrast to spectral decompositions such as the Singular Value Decomposition (SVD), the strictly positive imputation of signal by NMF is an ideal fit for single-cell data with ambiguous zeroes. Nevertheless, most single-cell analysis pipelines apply SVD or Principal Component Analysis (PCA) on transformed counts because these implementations are fast while current NMF implementations are slow. To address this need, we present an accessible NMF implementation that is much faster than PCA and rivals the runtimes of state-of-the-art SVD. NMF models learned with our implementation from raw count matrices yield intuitive summaries of complex biological processes, capturing coordinated gene activity and enrichment of sample metadata. Our NMF implementation, available in the RcppML (Rcpp Machine Learning library) R package, improves upon current NMF implementations by introducing a scaling diagonal to enable convex L1 regularization for feature engineering, reproducible factor scalings, and symmetric factorizations. RcppML NMF easily handles sparse datasets with millions of samples, making NMF an attractive replacement for PCA in the analysis of single-cell experiments.


2020 ◽  
Vol 36 (15) ◽  
pp. 4291-4295
Author(s):  
Philipp Angerer ◽  
David S Fischer ◽  
Fabian J Theis ◽  
Antonio Scialdone ◽  
Carsten Marr

Motivation: Dimensionality reduction is a key step in the analysis of single-cell RNA-sequencing data. It produces a low-dimensional embedding for visualization and as a calculation base for downstream analysis. Nonlinear techniques are most suitable to handle the intrinsic complexity of large, heterogeneous single-cell data. However, with no linear relation between gene and embedding coordinate, there is no way to extract the identity of genes driving any cell’s position in the low-dimensional embedding, making it difficult to characterize the underlying biological processes. Results: In this article, we introduce the concepts of local and global gene relevance to compute an equivalent of principal component analysis loadings for non-linear low-dimensional embeddings. Global gene relevance identifies drivers of the overall embedding, while local gene relevance identifies those of a defined sub-region. We apply our method to single-cell RNA-seq datasets from different experimental protocols and to different low-dimensional embedding techniques. This shows our method’s versatility to identify key genes for a variety of biological processes. Availability and implementation: To ensure reproducibility and ease of use, our method is released as part of destiny 3.0, a popular R package for building diffusion maps from single-cell transcriptomic data. It is readily available through Bioconductor. Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Lei Xiong ◽  
Kang Tian ◽  
Yuzhe Li ◽  
Qiangfeng Zhang

Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. We demonstrate that SCALEX accurately and efficiently integrates heterogenous single-cell data using multiple benchmarks. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular landscapes of human and mouse tissues and identified multiple peripheral immune subtypes associated with COVID-19 disease severity.


2021 ◽  
Author(s):  
Konrad Thorner ◽  
Aaron M. Zorn ◽  
Praneet Chaturvedi

Annotation of single cells has become an important step in the single-cell analysis framework. With advances in sequencing technology, thousands to millions of cells can be processed to understand the intricacies of the biological system in question. Annotation through manual curation of markers based on a priori knowledge is cumbersome given this exponential growth. There are currently ~200 computational tools available to help researchers automatically annotate single cells using supervised/unsupervised machine learning, cell type markers, or tissue-based markers from bulk RNA-seq. But with the expansion of publicly available data, there is also a need for a tool that can help integrate multiple references into a unified atlas and understand how annotations between datasets compare. Here we present ELeFHAnt: Ensemble learning for harmonization and annotation of single cells. ELeFHAnt is an easy-to-use R package that employs support vector machine and random forest algorithms together to perform three main functions: 1) CelltypeAnnotation, 2) LabelHarmonization, and 3) DeduceRelationship. CelltypeAnnotation is a function to annotate cells in a query Seurat object using a reference Seurat object with annotated cell types. LabelHarmonization can be utilized to integrate multiple cell atlases (references) into a unified cellular atlas with harmonized cell types. Finally, DeduceRelationship is a function that compares cell types between two scRNA-seq datasets. ELeFHAnt can be accessed from GitHub at https://github.com/praneet1988/ELeFHAnt.


2021 ◽  
Author(s):  
Lei Xiong ◽  
Kang Tian ◽  
Yuzhe Li ◽  
Qiangfeng Cliff Zhang

Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. We demonstrate that SCALEX accurately and efficiently integrates heterogenous single-cell data using multiple benchmarks. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular landscapes of human and mouse tissues and identified multiple peripheral immune subtypes associated with COVID-19 disease severity.


2021 ◽  
Author(s):  
Walter Muskovic ◽  
Joseph Powell

Advances in droplet-based single-cell RNA-sequencing (scRNA-seq) have dramatically increased throughput, allowing tens of thousands of cells to be routinely sequenced in a single experiment. In addition to cells, droplets capture cell-free 'ambient' RNA predominantly caused by lysis of cells during sample preparation. Samples with high ambient RNA concentration can create challenges in accurately distinguishing cell-containing droplets from droplets containing ambient RNA. Current methods to separate these groups often retain a significant number of droplets that do not contain cells, so-called empty droplets. In addition to the challenge of identifying empty droplets, there are currently no methods available to detect droplets containing damaged cells, which comprise partially lysed cells, the original source of the ambient RNA. Here we describe DropletQC, a new method that is able to detect empty droplets, damaged cells, and intact cells, and accurately distinguish them from one another. This approach is based on a novel quality control metric, the nuclear fraction, which quantifies for each droplet the fraction of RNA originating from unspliced, nuclear pre-mRNA. We demonstrate how DropletQC provides a powerful extension to existing computational methods for identifying empty droplets such as EmptyDrops. We have implemented DropletQC as an R package, which can be easily integrated into existing single-cell analysis workflows.


2019 ◽  
Author(s):  
Debajyoti Sinha ◽  
Pradyumn Sinha ◽  
Ritwik Saha ◽  
Sanghamitra Bandyopadhyay ◽  
Debarka Sengupta

DropClust leverages Locality Sensitive Hashing (LSH) to speed up clustering of large-scale single-cell expression data. It makes ingenious use of structure-preserving sampling and modality-based principal component selection to rescue minor cell types. The existing implementation of dropClust involves interfacing with multiple programming languages, viz. R, Python and C, hindering seamless installation and portability. Here we present dropClust2, a complete R package that is not only fast but also minimally resource intensive. DropClust2 features a novel batch effect removal algorithm that allows integrative analysis of single-cell RNA-seq (scRNA-seq) datasets. Availability and implementation: dropClust2 is freely available at https://debsinha.shinyapps.io/dropClust/ as an online web service and at https://github.com/debsin/dropClust as an R package.


Author(s):  
Irzam Sarfraz ◽  
Muhammad Asif ◽  
Joshua D Campbell

Motivation: R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying downstream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance. Results: To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, the ExperimentSubset is a flexible container for the efficient management of subsets. Availability and implementation: The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset. Supplementary information: Supplementary data are available at Bioinformatics online.
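The storage idea in this abstract, keeping each assay once and recording a subset as row and column indices that point back to its parent, can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the ExperimentSubset API: the class `AssayStore` and its methods are hypothetical, and matrices are plain nested lists.

```python
class AssayStore:
    """Keeps full assays once and records subsets as index pairs.

    Hypothetical container for illustration only; it does not mirror
    the classes of the ExperimentSubset package.
    """

    def __init__(self):
        self._assays = {}   # name -> matrix (list of rows)
        self._subsets = {}  # name -> (parent name, row indices, column indices)

    def add_assay(self, name, matrix):
        self._assays[name] = matrix

    def add_subset(self, name, parent, rows, cols):
        if parent not in self._assays and parent not in self._subsets:
            raise KeyError(f"unknown parent: {parent}")
        # Only indices are stored; the parent data is never copied here.
        self._subsets[name] = (parent, list(rows), list(cols))

    def get(self, name):
        """Materialize an assay or subset on demand."""
        if name in self._assays:
            return self._assays[name]
        parent, rows, cols = self._subsets[name]
        base = self.get(parent)
        return [[base[r][c] for c in cols] for r in rows]

    def provenance(self, name):
        """Return the chain from a subset back to its root assay."""
        chain = [name]
        while name in self._subsets:
            name = self._subsets[name][0]
            chain.append(name)
        return chain


store = AssayStore()
store.add_assay("counts", [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
store.add_subset("filtered", "counts", rows=[0, 2], cols=[0, 1, 2])
store.add_subset("variable", "filtered", rows=[1], cols=[0, 2])
```

Here `store.get("variable")` resolves its indices through `filtered` back to `counts`, and `store.provenance("variable")` lists that chain, which is the parent-child relationship the abstract refers to as data provenance.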

