Sample demultiplexing, multiplet detection, experiment planning and novel cell type verification in single cell sequencing

Abstract: Identifying and removing multiplets from downstream analysis is essential to improve the scalability and reliability of single cell RNA sequencing (scRNA-seq). High multiplet rates create artificial cell types in the dataset. Sample barcoding, including the cell hashing technology and the MULTI-seq technology, enables analytical identification of a fraction of multiplets in a scRNA-seq dataset. We propose a Gaussian-mixture-model-based multiplet identification method, GMM-Demux. GMM-Demux accurately identifies and removes the sample-barcoding-detectable multiplets and estimates the percentage of sample-barcoding-undetectable multiplets in the remaining dataset. GMM-Demux describes the droplet formation process with an augmented binomial probabilistic model, and uses the model to authenticate cell types discovered from a scRNA-seq dataset. We conducted two cell-hashing experiments, collected a public cell-hashing dataset, and generated a simulated cell-hashing dataset. We compared the classification result of GMM-Demux against a state-of-the-art heuristic-based classifier. We show that GMM-Demux is more accurate, more stable, reduces the error rate by up to 69×, and is capable of reliably recognizing 9 multiplet-induced fake cell types and 8 real cell types in a PBMC scRNA-seq dataset.
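
To illustrate why sample barcoding can flag only a fraction of multiplets, the sketch below computes the chance that every cell in a multiplet comes from the same sample and therefore carries a single barcode. It assumes cells are captured independently, with probability proportional to each sample's share of the pool. This is a simplified illustration rather than GMM-Demux's augmented binomial model, and the function name `same_sample_fraction` is hypothetical.

```python
def same_sample_fraction(sample_shares, cells_per_droplet=2):
    """Probability that every cell in a multiplet comes from the same sample.

    Assumes each cell is drawn independently, with probability equal to its
    sample's share of the pooled cells (an illustrative simplification).
    """
    total = sum(sample_shares)
    shares = [s / total for s in sample_shares]
    return sum(p ** cells_per_droplet for p in shares)


# Four equally pooled samples: one doublet in four is invisible to barcoding.
equal_pool = same_sample_fraction([1, 1, 1, 1])
# Uneven pooling raises the undetectable fraction.
skewed_pool = same_sample_fraction([7, 1, 1, 1])
print(equal_pool, skewed_pool)
```

Under this assumption, pooling more samples in equal proportions lowers the share of multiplets that escape barcode-based detection.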

SMNN: Batch Effect Correction for Single-cell RNA-seq data via Supervised Mutual Nearest Neighbor Detection

DOI: 10.1101/672261 ◽ 2019 ◽ Cited By ~ 1

Author(s): Yuchen Yang ◽ Gang Li ◽ Huijun Qian ◽ Kirk C. Wilhelmsen ◽ Yin Shen ◽ ...

Keyword(s): Single Cell ◽ Nearest Neighbor ◽ State Of The Art ◽ Nearest Neighbors ◽ Cell Types ◽ Batch Effect ◽ Batch Effects ◽ Cell Type ◽ Label Information ◽ Cell Type Specific

Abstract: Batch effect correction has been recognized to be indispensable when integrating single-cell RNA sequencing (scRNA-seq) data from multiple batches. State-of-the-art methods ignore single-cell cluster label information, but such information can improve the effectiveness of batch effect correction, particularly under realistic scenarios where biological differences are not orthogonal to batch effects. To address this issue, we propose SMNN for batch effect correction of scRNA-seq data via supervised mutual nearest neighbor detection. Our extensive evaluations in simulated and real datasets show that SMNN provides improved merging within the corresponding cell types across batches, leading to reduced differentiation across batches over MNN, Seurat v3, and LIGER. Furthermore, SMNN retains more cell type-specific features, partially manifested by differentially expressed genes identified between cell types after SMNN correction being biologically more relevant, with precision improving by up to 841%.

Key Points

Batch effect correction has been recognized to be critical when integrating scRNA-seq data from multiple batches due to systematic differences in time points, generating laboratory and/or handling technician(s), experimental protocol, and/or sequencing platform.

Existing batch effect correction methods that leverage information from mutual nearest neighbors across batches (for example, implemented in SC3 or Seurat) ignore cell type information and suffer from potentially mismatching single cells from different cell types across batches, which would lead to undesired correction results, especially under the scenario where variation from batch effects is non-negligible compared with biological effects.

To address this critical issue, here we present SMNN, a supervised machine learning method that first takes cluster/cell-type label information from users or inferred from scRNA-seq clustering, and then searches mutual nearest neighbors within each cell type instead of global searching.

Our SMNN method shows clear advantages over three state-of-the-art batch effect correction methods and can better mix cells of the same cell type across batches and more effectively recover cell-type specific features, in both simulations and real datasets.
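
The within-cell-type neighbor search described in the key points can be sketched as follows. This is a minimal pure-Python illustration of restricting a mutual nearest neighbor search to cells that share a label. It is not the SMNN implementation: the function names, the Euclidean distance, and the toy coordinates are all assumptions made for the example.

```python
import math


def k_nearest(query, candidates, k):
    """Return indices of the k candidates closest to `query` (Euclidean)."""
    order = sorted(range(len(candidates)),
                   key=lambda j: math.dist(query, candidates[j]))
    return set(order[:k])


def supervised_mnn(batch1, labels1, batch2, labels2, k=1):
    """Find mutual nearest neighbor pairs, searching only within shared labels.

    Returns a sorted list of (index_in_batch1, index_in_batch2) pairs.
    """
    pairs = []
    for label in sorted(set(labels1) & set(labels2)):
        idx1 = [i for i, lab in enumerate(labels1) if lab == label]
        idx2 = [j for j, lab in enumerate(labels2) if lab == label]
        cells1 = [batch1[i] for i in idx1]
        cells2 = [batch2[j] for j in idx2]
        # Neighbors of each batch-1 cell among batch-2 cells, and vice versa.
        nn12 = [k_nearest(c, cells2, k) for c in cells1]
        nn21 = [k_nearest(c, cells1, k) for c in cells2]
        for a, neighbors in enumerate(nn12):
            for b in neighbors:
                if a in nn21[b]:  # mutual: each is in the other's k nearest
                    pairs.append((idx1[a], idx2[b]))
    return sorted(pairs)


# Toy data: the batch shift places each batch-2 cell nearer the wrong cell type.
batch1, labels1 = [(0.0, 0.0), (5.0, 5.0)], ["T", "B"]
batch2, labels2 = [(4.0, 4.0), (1.0, 1.0)], ["T", "B"]
print(supervised_mnn(batch1, labels1, batch2, labels2))
# Giving every cell the same label reproduces an unsupervised, global search.
print(supervised_mnn(batch1, ["X", "X"], batch2, ["X", "X"]))
```

In the toy data the global search pairs cells of different types, while the label-restricted search pairs T with T and B with B, which is the mismatch the key points describe.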

Multi-modal single-cell sequencing identifies cellular immunophenotypes associated with juvenile dermatomyositis disease activity

DOI: 10.1101/2021.09.18.21263581 ◽ 2021

Author(s): Jessica Neely ◽ George Hartoularos ◽ Daniel Bunis ◽ Yang Sun ◽ David Lee ◽ ...

Keyword(s): Disease Activity ◽ Single Cell ◽ Juvenile Dermatomyositis ◽ Cell Types ◽ Gene Signature ◽ Inactive Disease ◽ Cell Type ◽ Gene Score ◽ Single Cell Sequencing ◽ Inflammatory Monocytes

Juvenile dermatomyositis (JDM) is a rare autoimmune condition with insufficient biomarkers and treatments, in part due to incomplete knowledge of the cell types mediating disease. We investigated immunophenotypes and cell-specific genes associated with disease activity using multiplexed RNA and protein single-cell sequencing applied to PBMCs from 4 treatment-naive JDM (TN-JDM) subjects at baseline, 2, 4, and 6 months and 4 subjects with inactive disease. Analysis of 55,564 cells revealed separate clustering of TN-JDM cells within monocyte, NK, CD8+ effector T and naive B populations. The proportion of CD16+ monocytes was reduced in TN-JDM, and naive B cells were expanded. Cell-type differential gene expression analysis and hierarchical clustering identified a pan-cell-type IFN gene signature that was over-expressed in TN-JDM in all cell types and correlated with disease activity. TN-JDM monocytes displayed an inflammatory state: CD16+ monocytes expressed the highest IFN gene score and differential protein expression of adhesion molecules, CD49d and CD56, compared to CD14+ inflammatory monocytes. A transitional B cell population expressing higher CD24 and CD5 proteins and an IFN-hi naive B population were associated with TN-JDM and exhibited less CD39, an immunoregulatory protein. These data provide new insights into JDM immune dysregulation at cellular resolution and a novel resource for myositis investigators.

MarkerCount: A stable, count-based cell type identifier for single cell RNA-Seq experiments

DOI: 10.21203/rs.3.rs-418249/v1 ◽ 2021

Author(s): Hanbyeol Kim ◽ Joongho Lee ◽ Keunsoo Kang ◽ Seokhyun Yoon

Keyword(s): Gene Expression ◽ Single Cell ◽ Cell Types ◽ Batch Effect ◽ Expression Level ◽ Rna Seq ◽ Cell Type ◽ Stable Performance ◽ Downstream Analysis

Abstract: Cell type identification is a key step in the downstream analysis of single cell RNA-seq experiments. The indispensable information for this step is gene expression, which is used to cluster cells, train the model, and set rejection thresholds. The problem is that expression levels are subject to batch effects arising from different platforms and preprocessing. We present MarkerCount, which uses the number of markers expressed, regardless of their expression level, to initially identify cell types and then reassigns cell types on a cluster basis. MarkerCount works in both reference-based and marker-based modes: the latter uses only existing lists of markers, while the former requires a pre-annotated dataset to train the model. Its performance was evaluated and compared with existing identifiers, both marker-based and reference-based, that can be customized with publicly available datasets and marker databases. The results show that MarkerCount provides stable performance compared with other reference-based and marker-based cell type identifiers.
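
The idea of counting expressed markers while ignoring their expression level can be sketched as below. This is a minimal illustration, not MarkerCount itself: it omits the cluster-level reassignment and any rejection threshold, and the function names, the detection rule (value above zero), and the short marker lists are assumptions made for the example.

```python
def count_markers(cell_expression, marker_lists):
    """Count, per cell type, how many of its markers are detected in the cell.

    A marker counts as detected when its value is above zero, so the
    magnitude of expression is deliberately ignored.
    """
    return {
        cell_type: sum(1 for gene in markers if cell_expression.get(gene, 0) > 0)
        for cell_type, markers in marker_lists.items()
    }


def assign_cell_type(cell_expression, marker_lists):
    """Pick the cell type with the largest fraction of detected markers."""
    counts = count_markers(cell_expression, marker_lists)
    return max(counts, key=lambda t: counts[t] / len(marker_lists[t]))


# Illustrative marker lists (assumed for this example).
markers = {
    "T cell": ["CD3D", "CD3E", "CD2"],
    "B cell": ["CD79A", "MS4A1", "CD19"],
}
# The same cell measured at two depths: counts differ, detected markers do not.
shallow = {"CD3D": 1, "CD3E": 2, "MS4A1": 1}
deep = {"CD3D": 40, "CD3E": 85, "MS4A1": 30}
print(assign_cell_type(shallow, markers), assign_cell_type(deep, markers))
```

Because only presence is counted, both profiles receive the same call even though their raw counts differ by more than an order of magnitude.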

Superscan: Supervised Single-Cell Annotation

DOI: 10.1101/2021.05.20.445014 ◽ 2021

Author(s): Carolyn Shasha ◽ Yuan Tian ◽ Florian Mair ◽ Helen E Rodgers Miller ◽ Raphael Gottardo

Keyword(s): Single Cell ◽ State Of The Art ◽ Marker Gene ◽ Surface Protein ◽ Cell Types ◽ Training Data ◽ Supervised Machine Learning ◽ Cell Type ◽ Surface Protein Expression ◽ Meta Analyses

Automated cell type annotation of single-cell RNA-seq data has the potential to significantly improve and streamline single cell data analysis, facilitating comparisons and meta-analyses. However, many of the current state-of-the-art techniques suffer from limitations, such as reliance on a single reference dataset or marker gene set, or excessive run times for large datasets. Acquiring high-quality labeled data to use as a reference can be challenging. With CITE-seq, surface protein expression of cells can be directly measured in addition to the RNA expression, facilitating cell type annotation. Here, we compiled and annotated a collection of 16 publicly available CITE-seq datasets. This data was then used as training data to develop Superscan, a supervised machine learning-based prediction model. Using our 16 reference datasets, we benchmarked Superscan and showed that it performs better in terms of both accuracy and speed when compared to other state-of-the-art cell annotation methods. Superscan is pre-trained on a collection of primarily PBMC immune datasets; however, additional data and cell types can be easily added to the training data for further improvement. Finally, we used Superscan to reanalyze a previously published dataset, demonstrating its applicability even when the dataset includes cell types that are missing from the training set.

Normalization and De-noising of Single-cell Hi-C Data with BandNorm and 3DVI

DOI: 10.1101/2021.03.10.434870 ◽ 2021

Author(s): Ye Zheng ◽ Siqi Shen ◽ Sündüz Keleş

Keyword(s): Single Cell ◽ Long Range ◽ High Throughput ◽ State Of The Art ◽ Cell Types ◽ Chromatin Conformation ◽ Modeling Framework ◽ Technical Noise ◽ Generative Modeling ◽ Downstream Analysis

Abstract: Single-cell high-throughput chromatin conformation capture methodologies (scHi-C) enable profiling long-range genomic interactions at single-cell resolution; however, data from these technologies are prone to technical noise and bias that, when unaccounted for, hinder downstream analysis. Here we developed a fast band normalization approach, BandNorm, and a deep generative modeling framework, 3DVI, to explicitly account for scHi-C specific technical biases. We demonstrate robust performance of BandNorm and 3DVI compared to existing state-of-the-art methods. BandNorm is effective in separating cell types, identifying interaction features, and recovering cell-cell relationships, whereas de-noising by 3DVI enables recovery of 3D compartments and domains, especially for rare cell types.
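
As a rough illustration of what normalizing a contact matrix by band can mean, the sketch below divides each diagonal band (all locus pairs at the same genomic distance) by that band's mean within one cell. The abstract does not spell out BandNorm's actual procedure, so this scaling rule, the function name, and the toy matrices are assumptions rather than the published method.

```python
def normalize_bands(contacts):
    """Divide every diagonal band of a square contact matrix by its mean.

    Band d holds the entries (i, j) with |i - j| == d, i.e. locus pairs at
    the same genomic distance. The matrix is assumed symmetric; the upper
    triangle is read and both triangles are written. Bands whose mean is
    zero are left unchanged.
    """
    n = len(contacts)
    normalized = [row[:] for row in contacts]
    for d in range(n):
        band = [contacts[i][i + d] for i in range(n - d)]
        mean = sum(band) / len(band)
        if mean == 0:
            continue
        for i in range(n - d):
            value = contacts[i][i + d] / mean
            normalized[i][i + d] = value
            normalized[i + d][i] = value  # keep the matrix symmetric
    return normalized


# Two toy cells with the same contact pattern but a threefold depth difference.
cell_a = [[4, 2, 0], [2, 8, 2], [0, 2, 6]]
cell_b = [[12, 6, 0], [6, 24, 6], [0, 6, 18]]
norm_a = normalize_bands(cell_a)
norm_b = normalize_bands(cell_b)
print(norm_a)
print(norm_b)
```

In this toy case the two cells become directly comparable after scaling, since the depth difference is absorbed into the per-band means.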

Artificial-cell-type aware cell-type classification in CITE-seq

Bioinformatics ◽ DOI: 10.1093/bioinformatics/btaa467 ◽ 2020 ◽ Vol 36 (Supplement_1) ◽ pp. i542-i550 ◽ Cited By ~ 1

Author(s): Qiuyu Lian ◽ Hongyi Xin ◽ Jianzhu Ma ◽ Liza Konnikova ◽ Wei Chen ◽ ...

Keyword(s): Cell Surface ◽ Single Cell ◽ Domain Knowledge ◽ Cell Types ◽ Surface Marker ◽ Supplementary Information ◽ Clustering Methods ◽ Cell Type ◽ Artificial Cell ◽ Marker Proteins

Abstract

Motivation: Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) couples the measurement of surface marker proteins with simultaneous sequencing of mRNA at the single-cell level, which brings accurate cell surface phenotyping to single-cell transcriptomics. Unfortunately, multiplets in CITE-seq datasets create artificial cell types (ACT) and complicate the automation of cell surface phenotyping.

Results: We propose CITE-sort, an artificial-cell-type aware surface marker clustering method for CITE-seq. CITE-sort is aware of and is robust to multiplet-induced ACT. We benchmarked CITE-sort with real and simulated CITE-seq datasets and compared CITE-sort against canonical clustering methods. We show that CITE-sort produces the best clustering performance across the board. CITE-sort not only accurately identifies real biological cell types (BCT) but also consistently and reliably separates multiplet-induced artificial-cell-type droplet clusters from real BCT droplet clusters. In addition, CITE-sort organizes its clustering process with a binary tree, which facilitates easy interpretation and verification of its clustering result and simplifies cell-type annotation with domain knowledge in CITE-seq.

Availability and implementation: http://github.com/QiuyuLian/CITE-sort.

Supplementary information: Supplementary data are available at Bioinformatics online.

GMM-Demux: sample demultiplexing, multiplet detection, experiment planning, and novel cell-type verification in single cell sequencing

Genome Biology ◽ DOI: 10.1186/s13059-020-02084-2 ◽ 2020 ◽ Vol 21 (1) ◽ Cited By ~ 2

Author(s): Hongyi Xin ◽ Qiuyu Lian ◽ Yale Jiang ◽ Jiadi Luo ◽ Xinjun Wang ◽ ...

Keyword(s): Single Cell ◽ Experiment Planning ◽ Cell Type ◽ Detection Experiment ◽ Single Cell Sequencing

Artificial-Cell-Type Aware Cell Type Classification in CITE-seq

DOI: 10.1101/2020.01.31.928010 ◽ 2020

Author(s): Qiuyu Lian ◽ Hongyi Xin ◽ Jianzhu Ma ◽ Liza Konnikova ◽ Wei Chen ◽ ...

Keyword(s): Cell Surface ◽ Single Cell ◽ Domain Knowledge ◽ Cell Types ◽ Surface Marker ◽ Biological Cell ◽ Clustering Methods ◽ Cell Type ◽ Artificial Cell ◽ Marker Proteins

Abstract: Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) couples the measurement of surface marker proteins with simultaneous sequencing of mRNA at the single-cell level, which brings accurate cell surface phenotyping to single cell transcriptomics. Unfortunately, multiplets in CITE-seq datasets create artificial cell types and complicate the automation of cell surface phenotyping. We propose CITE-sort, an artificial-cell-type aware surface marker clustering method for CITE-seq. CITE-sort is aware of and is robust to multiplet-induced artificial cell types. We benchmarked CITE-sort with real and simulated CITE-seq datasets and compared CITE-sort against canonical clustering methods. We show that CITE-sort produces the best clustering performance across the board. CITE-sort not only accurately identifies real biological cell types but also consistently and reliably separates multiplet-induced artificial-cell-type droplet clusters from real biological-cell-type droplet clusters. In addition, CITE-sort organizes its clustering process with a binary tree, which facilitates easy interpretation and verification of its clustering result and simplifies cell type annotation with domain knowledge in CITE-seq.

Network-Based Single-Cell RNA-Seq Data Imputation Enhances Cell Type Identification

Genes ◽ DOI: 10.3390/genes11040377 ◽ 2020 ◽ Vol 11 (4) ◽ pp. 377 ◽ Cited By ~ 2

Author(s): Maryam Zand ◽ Jianhua Ruan

Keyword(s): Single Cell ◽ Network Performance ◽ Expression Profiles ◽ Simulated Data ◽ Cell Types ◽ Ppi Network ◽ Cell Type ◽ Protein Protein Interaction ◽ Gene Level ◽ Downstream Analysis

Single-cell RNA sequencing is a powerful technology for obtaining transcriptomes at single-cell resolution. However, it suffers from dropout events (i.e., excess zero counts) since only a small fraction of transcripts get sequenced in each cell during the sequencing process. This inherent sparsity of expression profiles hinders further characterizations at the cell or gene level, such as cell type identification and downstream analysis. To alleviate this dropout issue we introduce a network-based method, netImpute, which leverages the hidden information in gene co-expression networks to recover real signals. netImpute employs Random Walk with Restart (RWR) to adjust the gene expression level in a given cell by borrowing information from its neighbors in a gene co-expression network. Performance evaluation and comparison with existing tools on simulated data and seven real datasets show that netImpute substantially enhances clustering accuracy and data visualization clarity, thanks to its effective treatment of dropouts. While the idea of netImpute is general and can be applied with other types of networks, such as a cell co-expression network or a protein–protein interaction (PPI) network, evaluation results show that a gene co-expression network is consistently more beneficial, presumably because a PPI network usually lacks cell type context, while a cell co-expression network can cause information loss for rare cell types. Evaluation results on several biological datasets show that netImpute recovers missing transcripts in scRNA-seq data and enhances the identification and visualization of heterogeneous cell types more effectively than existing methods.
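
The Random Walk with Restart step mentioned above can be sketched in a few lines. This is the generic RWR iteration on a small weighted graph, not netImpute's code: the restart probability, the convergence tolerance, the function name, and the three-gene path network are assumptions chosen for illustration.

```python
def random_walk_with_restart(adjacency, start, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * start until convergence.

    W is the column-normalized adjacency matrix, so each step spreads a
    node's probability over its neighbors in proportion to edge weight.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    p = start[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            walk = sum(
                adjacency[i][j] / col_sums[j] * p[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            nxt.append((1 - restart) * walk + restart * start[i])
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy co-expression network: gene 0 - gene 1 - gene 2 (a simple path).
network = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
# All signal starts on gene 0; the walk shares it with nearby genes.
scores = random_walk_with_restart(network, [1.0, 0.0, 0.0])
print(scores)
```

In an imputation setting the start vector would presumably hold one cell's normalized expression values, so that genes with dropout zeros receive signal from their co-expressed neighbors. How netImpute builds the network and chooses the restart probability is not described in the abstract.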

SSBER: removing batch effect for single-cell RNA sequencing data

BMC Bioinformatics ◽ DOI: 10.1186/s12859-021-04165-w ◽ 2021 ◽ Vol 22 (1)

Author(s): Yin Zhang ◽ Fei Wang

Keyword(s): Single Cell ◽ Cell Types ◽ Batch Effect ◽ Batch Effects ◽ Cell Type ◽ Sequencing Data ◽ Cell Type Composition ◽ Type Composition ◽ Downstream Analysis ◽ Sequencing Platforms

Abstract

Background: With the continuing maturation of sequencing technology, different laboratories and sequencing platforms have generated a large amount of single-cell transcriptome sequencing data for the same or different tissues. Because of batch effects and the high dimensionality of scRNA data, downstream analysis often faces challenges. Although a number of algorithms and tools have been proposed for removing batch effects, current mainstream algorithms overcorrect the data when the cell type composition varies greatly between batches.

Results: In this paper, we propose a novel method named SSBER, which uses biological prior knowledge to guide the correction and aims to solve the problem of poor batch-effect correction when the cell type composition differs greatly between batches.

Conclusions: SSBER effectively solves the above problems and outperforms other algorithms when the cell type structure among batches or the distribution of cell populations varies considerably, or when some similar cell types exist across batches.
