DropletQC: improved identification of empty droplets and damaged cells in single-cell RNA-seq data

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Walter Muskovic ◽  
Joseph E. Powell

Background: Advances in droplet-based single-cell RNA-sequencing (scRNA-seq) have dramatically increased throughput, allowing tens of thousands of cells to be routinely sequenced in a single experiment. In addition to cells, droplets capture cell-free “ambient” RNA, predominantly caused by lysis of cells during sample preparation. Samples with a high ambient RNA concentration make it difficult to accurately distinguish cell-containing droplets from droplets containing only ambient RNA. Current methods to separate these groups often retain a significant number of droplets that do not contain cells, so-called empty droplets. Additionally, there are currently no methods available to detect droplets containing damaged cells, that is, partially lysed cells, which are the original source of the ambient RNA. Results: Here, we describe DropletQC, a new method that is able to detect empty droplets, damaged cells, and intact cells, and accurately distinguish them from one another. This approach is based on a novel quality control metric, the nuclear fraction, which quantifies for each droplet the fraction of RNA originating from unspliced, nuclear pre-mRNA. We demonstrate how DropletQC provides a powerful extension to existing computational methods for identifying empty droplets, such as EmptyDrops. Conclusions: We implement DropletQC as an R package, which can be easily integrated into existing single-cell analysis workflows.
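The nuclear fraction described above can be illustrated with a minimal Python sketch. It is not the DropletQC implementation: it assumes each read has already been assigned to a droplet barcode and labelled as intronic or exonic, and it uses the intronic share of reads as a stand-in for "RNA originating from unspliced, nuclear pre-mRNA". The function name and the input format are hypothetical.

```python
from collections import defaultdict


def nuclear_fraction(reads):
    """Return {barcode: intronic_reads / total_reads}.

    `reads` is an iterable of (barcode, region) pairs, where region is
    "intron" or "exon". This input format is an assumption made for the
    sketch, not the format used by the DropletQC package.
    """
    counts = defaultdict(lambda: [0, 0])  # barcode -> [intronic, total]
    for barcode, region in reads:
        if region == "intron":
            counts[barcode][0] += 1
        counts[barcode][1] += 1
    return {bc: intronic / total for bc, (intronic, total) in counts.items()}


example_reads = [
    ("AAAC", "intron"), ("AAAC", "exon"), ("AAAC", "exon"), ("AAAC", "exon"),
    ("TTTG", "intron"), ("TTTG", "intron"),
]
fractions = nuclear_fraction(example_reads)
```

In this toy input, barcode `AAAC` has one intronic read out of four and `TTTG` has only intronic reads, so their fractions sit at opposite ends of the scale. How such values map onto empty droplets, damaged cells, and intact cells is the subject of the paper and is not encoded here.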

2021 ◽  
Author(s):  
Walter Muskovic ◽  
Joseph Powell

Advances in droplet-based single-cell RNA-sequencing (scRNA-seq) have dramatically increased throughput, allowing tens of thousands of cells to be routinely sequenced in a single experiment. In addition to cells, droplets capture cell-free 'ambient' RNA, predominantly caused by lysis of cells during sample preparation. Samples with a high ambient RNA concentration make it difficult to accurately distinguish cell-containing droplets from droplets containing only ambient RNA. Current methods to separate these groups often retain a significant number of droplets that do not contain cells, so-called empty droplets. In addition to the challenge of identifying empty droplets, there are currently no methods available to detect droplets containing damaged cells, that is, partially lysed cells, which are the original source of the ambient RNA. Here we describe DropletQC, a new method that is able to detect empty droplets, damaged cells, and intact cells, and accurately distinguish them from one another. This approach is based on a novel quality control metric, the nuclear fraction, which quantifies for each droplet the fraction of RNA originating from unspliced, nuclear pre-mRNA. We demonstrate how DropletQC provides a powerful extension to existing computational methods for identifying empty droplets, such as EmptyDrops. We have implemented DropletQC as an R package, which can be easily integrated into existing single-cell analysis workflows.


2021 ◽  
Author(s):  
Daniel Osorio ◽  
Marieke Lydia Kuijjer ◽  
James J. Cai

Motivation: Characterizing cells with rare molecular phenotypes is one of the promises of high-throughput single-cell RNA sequencing (scRNA-seq) techniques. However, collecting enough cells with the desired molecular phenotype in a single experiment is challenging, requiring several sample preprocessing steps to filter and collect the desired cells experimentally before sequencing. Data integration of multiple public single-cell experiments is a solution to this problem, allowing the collection of enough cells exhibiting the desired molecular signatures. By increasing the sample size of the desired cell type, this approach enables a robust characterization of the cell type's transcriptome. Results: Here, we introduce rPanglaoDB, an R package to download and merge the uniformly processed and annotated scRNA-seq data provided by the PanglaoDB database. To show the potential of rPanglaoDB for collecting rare cell types by integrating multiple public datasets, we present a biological application collecting and characterizing a set of 157 fibrocytes. Fibrocytes are a rare monocyte-derived cell type that exhibits both the inflammatory features of macrophages and the tissue remodeling properties of fibroblasts. This constitutes the first report of an unbiased transcriptome profile of fibrocytes. We compared the transcriptomic profile of the fibrocytes against the fibroblasts collected from the same tissue samples and confirmed their association with healing processes in tissue damage and infection through the activation of the prostaglandin biosynthesis and regulation pathway. Availability and Implementation: rPanglaoDB is implemented as an R package available through the CRAN repository: https://CRAN.R-project.org/package=rPanglaoDB.


2017 ◽  
Author(s):  
Bo Wang ◽  
Daniele Ramazzotti ◽  
Luca De Sano ◽  
Junjie Zhu ◽  
Emma Pierson ◽  
...  

Motivation: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better visualization. Availability and Implementation: SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on [email protected] or [email protected] Supplementary Information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Haotian Zhuang ◽  
Zhicheng Ji

Principal component analysis (PCA) is widely used in analyzing single-cell genomic data. Selecting the optimal number of PCs is a crucial step for downstream analyses. The elbow method is most commonly used for this task, but it requires one to visually inspect the elbow plot and manually choose the elbow point. To address this limitation, we developed six methods to automatically select the optimal number of PCs based on the elbow method. We evaluated the performance of these methods on real single-cell RNA-seq data from multiple human and mouse tissues. The perpendicular line method with 20 PCs has the best overall performance, and its results are highly consistent with the numbers of PCs identified manually. We implemented the six methods in an R package, findPC, that objectively selects the number of PCs and can be easily incorporated into any automatic analysis pipeline.
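One common way to automate the elbow method is a perpendicular-distance rule: connect the first and last points of the elbow plot with a straight line and pick the PC that lies farthest from it. The Python sketch below illustrates that idea only. It is not the findPC code, the function name is hypothetical, and it assumes the input is the list of standard deviations of the top PCs (for example the first 20) on their raw scale.

```python
import math


def elbow_by_perpendicular_distance(sdev):
    """Return the 1-based index of the point farthest from the straight
    line joining the first and last points of the elbow plot.

    `sdev` holds the standard deviations of the leading PCs in order.
    Distances depend on the relative scale of the two axes; no rescaling
    is applied here, which is an assumption of this sketch.
    """
    n = len(sdev)
    if n < 3:
        raise ValueError("need at least three PCs to locate an elbow")
    x1, y1 = 1.0, float(sdev[0])
    x2, y2 = float(n), float(sdev[-1])
    norm = math.hypot(x2 - x1, y2 - y1)
    best_k, best_d = 1, -1.0
    for i, y in enumerate(sdev):
        x = i + 1
        # Point-to-line distance for the line through (x1, y1) and (x2, y2).
        d = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / norm
        if d > best_d:
            best_k, best_d = x, d
    return best_k


sdev = [10, 6, 3, 1.5, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7]
chosen = elbow_by_perpendicular_distance(sdev)
```

For this toy curve the standard deviations flatten after the fourth PC, and the function selects PC 4. The number of leading PCs passed in changes the end point of the line and can therefore change the result.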


2021 ◽  
Author(s):  
Konrad Thorner ◽  
Aaron M. Zorn ◽  
Praneet Chaturvedi

Annotation of single cells has become an important step in the single cell analysis framework. With advances in sequencing technology, thousands to millions of cells can be processed to understand the intricacies of the biological system in question. Annotation through manual curation of markers based on a priori knowledge is cumbersome given this exponential growth. There are currently ~200 computational tools available to help researchers automatically annotate single cells using supervised/unsupervised machine learning, cell type markers, or tissue-based markers from bulk RNA-seq. But with the expansion of publicly available data there is also a need for a tool which can help integrate multiple references into a unified atlas and understand how annotations between datasets compare. Here we present ELeFHAnt: Ensemble learning for harmonization and annotation of single cells. ELeFHAnt is an easy-to-use R package that employs support vector machine and random forest algorithms together to perform three main functions: 1) CelltypeAnnotation 2) LabelHarmonization 3) DeduceRelationship. CelltypeAnnotation is a function to annotate cells in a query Seurat object using a reference Seurat object with annotated cell types. LabelHarmonization can be utilized to integrate multiple cell atlases (references) into a unified cellular atlas with harmonized cell types. Finally, DeduceRelationship is a function that compares cell types between two scRNA-seq datasets. ELeFHAnt can be accessed from GitHub at https://github.com/praneet1988/ELeFHAnt.


2021 ◽  
Author(s):  
April R Kriebel ◽  
Joshua D Welch

Single-cell genomic technologies provide an unprecedented opportunity to define molecular cell types in a data-driven fashion, but present unique data integration challenges. Integration analyses often involve datasets with partially overlapping features, including both shared features that occur in all datasets and features exclusive to a single experiment. Previous computational integration approaches require that the input matrices share the same number of either genes or cells, and thus can use only shared features. To address this limitation, we derive a novel nonnegative matrix factorization algorithm for integrating single-cell datasets containing both shared and unshared features. The key advance is incorporating an additional metagene matrix that allows unshared features to inform the factorization. We demonstrate that incorporating unshared features significantly improves integration of single-cell RNA-seq, spatial transcriptomic, SHARE-seq, and cross-species datasets. We have incorporated the UINMF algorithm into the open-source LIGER R package (https://github.com/welch-lab/liger).


Author(s):  
Irzam Sarfraz ◽  
Muhammad Asif ◽  
Joshua D Campbell

Motivation: R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to subset the data matrix before applying downstream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of the assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance. Results: To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, ExperimentSubset is a flexible container for the efficient management of subsets. Availability and implementation: The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset. Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Zhun Miao ◽  
Ke Deng ◽  
Xiaowo Wang ◽  
Xuegong Zhang

Summary: The excessive zeros in single-cell RNA-seq data include “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package, DEsingle, which employs a Zero-Inflated Negative Binomial model to estimate the proportion of real and dropout zeros and to define and detect 3 types of DE genes in single-cell RNA-seq data with higher accuracy. Availability and Implementation: The R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration [email protected] Supplementary information: Supplementary data are available at bioRxiv online.
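To make the zero-inflated negative binomial (ZINB) model concrete, the following Python sketch evaluates its probability mass function. It uses a standard textbook parameterization (mixing weight `pi` for the extra point mass at zero, negative binomial mean `mu` and size `size`); these parameter names are assumptions, and the sketch does not reproduce the estimation procedure or the DE gene definitions used by DEsingle.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean `mu` > 0 and dispersion `size` > 0."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, pi, mu, size):
    """ZINB pmf: with probability `pi` the count is a structural zero,
    otherwise it is drawn from the negative binomial component."""
    point_mass = pi if k == 0 else 0.0
    return point_mass + (1.0 - pi) * nb_pmf(k, mu, size)


# Zeros come from two sources: the point mass and the NB component itself.
p_zero_total = zinb_pmf(0, pi=0.3, mu=2.0, size=1.5)
p_zero_from_nb = (1.0 - 0.3) * nb_pmf(0, mu=2.0, size=1.5)
```

The difference between `p_zero_total` and `p_zero_from_nb` is exactly the mixing weight `pi`, which is how a model of this kind separates two kinds of zeros. Which component corresponds to "real" and which to "dropout" zeros is defined in the paper, not in this sketch.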


2020 ◽  
Author(s):  
Naim Al Mahi ◽  
Erik Y. Zhang ◽  
Susan Sherman ◽  
Jane J. Yu ◽  
Mario Medvedovic

Lymphangioleiomyomatosis (LAM) is a rare pulmonary disease affecting women of childbearing age that is characterized by the aberrant proliferation of smooth-muscle (SM)-like cells and emphysema-like lung remodeling. In LAM, mutations in the TSC1 or TSC2 genes result in the activation of the mechanistic target of rapamycin complex 1 (mTORC1), and thus sirolimus, an mTORC1 inhibitor, has been approved by the FDA to treat LAM patients. Sirolimus stabilizes lung function and improves symptoms. However, the disease recurs with discontinuation of the drug, potentially because of the sirolimus-induced refractoriness of the LAM cells. Therefore, there is a critical need to identify remission-inducing cytocidal treatments for LAM. The recently released Library of Integrated Network-based Cellular Signatures (LINCS) L1000 transcriptional signatures of chemical perturbations have opened new avenues to study cellular responses to existing drugs and new bioactive compounds. Connecting the transcriptional signature of a disease to these chemical perturbation signatures to identify bioactive chemicals that can “revert” the disease signatures can lead to novel drug discovery. We developed methods for constructing disease transcriptional signatures and performing connectivity analysis using single cell RNA-seq data. The methods were applied in the analysis of scRNA-seq data of naïve and sirolimus-treated LAM cells. The single cell connectivity analyses implicated mTORC1 inhibitors as capable of reverting the LAM transcriptional signatures, while the corresponding standard bulk analysis did not. This indicates the importance of using single cell analysis in constructing disease signatures. The analysis also implicated other classes of drugs, CDK, MEK/MAPK and EGFR/JAK inhibitors, as potential therapeutic agents for LAM.


2021 ◽  
pp. ASN.2020121742 ◽  
Author(s):  
Michael S. Balzer ◽  
Ziyuan Ma ◽  
Jianfu Zhou ◽  
Amin Abedini ◽  
Katalin Susztak

Over the last 5 years, single cell methods have enabled the monitoring of gene and protein expression, genetic, and epigenetic changes in thousands of individual cells in a single experiment. With the improved measurement and the decreasing cost of the reactions and sequencing, the size of these datasets is increasing rapidly. The critical bottleneck remains the analysis of the wealth of information generated by single cell experiments. In this review, we give a simplified overview of the analysis pipelines, as they are typically used in the field today. We aim to enable researchers starting out in single cell analysis to gain an overview of challenges and the most commonly used analytical tools. In addition, we hope to empower others to gain an understanding of how typical readouts from single cell datasets are presented in the published literature.

