Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

Abstract Background The research landscape of single-cell and single-nuclei RNA-sequencing is evolving rapidly. In particular, the area for the detection of rare cells was highly facilitated by this technology. However, an automated, unbiased, and accurate annotation of rare subpopulations is challenging. Once rare cells are identified in one dataset, it is usually necessary to generate further specific datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises from the fact that rare-cell subpopulations constitute an imbalanced classification problem. We here introduce a Machine Learning (ML)-based oversampling method that uses gene expression counts of already identified rare cells as an input to generate synthetic cells to then identify similar (rare) cells in other publicly available experiments. We utilize single-cell synthetic oversampling (sc-SynO), which is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm. The algorithm corrects for the overall imbalance ratio of the minority and majority class. Results We demonstrate the effectiveness of our method for three independent use cases, each consisting of already published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8635). This use case was designed to take a larger imbalance ratio (~1 to 500) into account and only uses single-nuclei data. The second use case was designed to jointly use snRNA-Seq data and scRNA-Seq on a lower imbalance ratio (~1 to 26) for the training step to likewise investigate the potential of the algorithm to consider both single-cell capture procedures and the impact of “less” rare-cell types. The third dataset refers to the murine data of the Allen Brain Atlas, including more than 1 million cells. For validation purposes only, all datasets have also been analyzed traditionally using common data analysis approaches, such as the Seurat workflow. Conclusions In comparison to baseline testing without oversampling, our approach identifies rare-cells with a robust precision-recall balance, including a high accuracy and low false positive detection rate. A practical benefit of our algorithm is that it can be readily implemented in other and existing workflows. The code basis in R and Python is publicly available at FairdomHub, as well as GitHub, and can easily be transferred to identify other rare-cell types.

Download Full-text

Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

10.1101/2021.01.20.427486 ◽

2021 ◽

Author(s):

Saptarshi Bej ◽

Anne-Marie Galow ◽

Robert David ◽

Markus Wolfien ◽

Olaf Wolkenhauer

Keyword(s):

Machine Learning ◽

Single Cell ◽

Rna Sequencing ◽

Cell Types ◽

Classification Problem ◽

Use Case ◽

Cell Capture ◽

Sequencing Data ◽

Rare Cells ◽

The Impact

AbstractThe research landscape of single-cell and single-nuclei RNA sequencing is evolving rapidly, and one area that is enabled by this technology, is the detection of rare cells. An automated, unbiased and accurate annotation of rare subpopulations is challenging. Once rare cells are identified in one dataset, it will usually be necessary to generate other datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises from the fact that rare cell subpopulations constitute an imbalanced classification problem.We here introduce a Machine Learning (ML)-based oversampling method that uses gene expression counts of already identified rare cells as an input to generate synthetic cells to then identify similar (rare) cells in other publicly available experiments. We utilize single-cell synthetic oversampling (sc-SynO), which is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm. The algorithm corrects for the overall imbalance ratio of the minority and majority class.We demonstrate the effectiveness of the method for two independent use cases, each consisting of two published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8,635). This use case was designed to take a larger imbalance ratio (∼1 to 500) into account and only uses single-nuclei data. The second use case was designed to jointly use snRNA-Seq data and scRNA-Seq on a lower imbalance ratio (∼1 to 26) for the training step to likewise investigate the potential of the algorithm to consider both single cell capture procedures and the impact of “less” rare-cell types. For validation purposes, all datasets have also been analyzed in a traditional manner using common data analysis approaches, such as the Seurat3 workflow.Our algorithm identifies rare-cell populations with a high accuracy and low false positive detection rate. A striking benefit of our algorithm is that it can be readily implemented in other and existing workflows. The code basis is publicly available at FairdomHub (https://fairdomhub.org/assays/1368) and can easily be transferred to train other customized approaches.

Download Full-text

Single-cell data clustering based on sparse optimization and low-rank matrix factorization

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab098 ◽

2021 ◽

Author(s):

Yinlei Hu ◽

Bin Li ◽

Falai Chen ◽

Kun Qu

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Matrix Factorization ◽

Data Clustering ◽

Cell Types ◽

Low Rank ◽

Sequencing Data ◽

Rank Matrix ◽

Single Cell Rna Sequencing ◽

Low Rank Matrix

Abstract Unsupervised clustering is a fundamental step of single-cell RNA sequencing data analysis. This issue has inspired several clustering methods to classify cells in single-cell RNA sequencing data. However, accurate prediction of the cell clusters remains a substantial challenge. In this study, we propose a new algorithm for single-cell RNA sequencing data clustering based on Sparse Optimization and low-rank matrix factorization (scSO). We applied our scSO algorithm to analyze multiple benchmark datasets and showed that the cluster number predicted by scSO was close to the number of reference cell types and that most cells were correctly classified. Our scSO algorithm is available at https://github.com/QuKunLab/scSO. Overall, this study demonstrates a potent cell clustering approach that can help researchers distinguish cell types in single-cell RNA sequencing data.

Download Full-text

Quality assessment of single-cell RNA sequencing data by coverage skewness analysis

10.1101/2019.12.31.890269 ◽

2019 ◽

Author(s):

Imad Abugessaisa ◽

Shuhei Noguchi ◽

Melissa Cardon ◽

Akira Hasegawa ◽

Kazuhide Watanabe ◽

...

Keyword(s):

Quality Assessment ◽

Single Cell ◽

Rna Sequencing ◽

Single Cells ◽

Assessment Method ◽

Poor Quality ◽

Sequencing Data ◽

Single Cell Rna Sequencing ◽

Gene Coverage ◽

The Impact

AbstractAnalysis and interpretation of single-cell RNA-sequencing (scRNA-seq) experiments are compromised by the presence of poor quality cells. For meaningful analyses, such poor quality cells should be excluded to avoid biases and large variation. However, no clear guidelines exist. We introduce SkewC, a novel quality-assessment method to identify poor quality single-cells in scRNA-seq experiments. The method is based on the assessment of gene coverage for each single cell and its skewness as a quality measure. To validate the method, we investigated the impact of poor quality cells on downstream analyses and compared biological differences between typical and poor quality cells. Moreover, we measured the ratio of intergenic expression, suggesting genomic contamination, and foreign organism contamination of single-cell samples. SkewC is tested in 37,993 single-cells generated by 15 scRNA-seq protocols. We envision SkewC as an indispensable QC method to be incorporated into scRNA-seq experiment to preclude the possibility of scRNA-seq data misinterpretation.

Download Full-text

Transfer learning efficiently maps bone marrow cell types from mouse to human using single-cell RNA sequencing

Communications Biology ◽

10.1038/s42003-020-01463-6 ◽

2020 ◽

Vol 3 (1) ◽

Author(s):

Patrick S. Stumpf ◽

Xin Du ◽

Haruka Imanishi ◽

Yuya Kunisaki ◽

Yuichiro Semba ◽

...

Keyword(s):

Machine Learning ◽

Bone Marrow ◽

Single Cell ◽

Rna Sequencing ◽

Transfer Learning ◽

Biomedical Research ◽

Human Cell ◽

Cell Types ◽

Single Cell Rna Sequencing ◽

Using Data

AbstractBiomedical research often involves conducting experiments on model organisms in the anticipation that the biology learnt will transfer to humans. Previous comparative studies of mouse and human tissues were limited by the use of bulk-cell material. Here we show that transfer learning—the branch of machine learning that concerns passing information from one domain to another—can be used to efficiently map bone marrow biology between species, using data obtained from single-cell RNA sequencing. We first trained a multiclass logistic regression model to recognize different cell types in mouse bone marrow achieving equivalent performance to more complex artificial neural networks. Furthermore, it was able to identify individual human bone marrow cells with 83% overall accuracy. However, some human cell types were not easily identified, indicating important differences in biology. When re-training the mouse classifier using data from human, less than 10 human cells of a given type were needed to accurately learn its representation. In some cases, human cell identities could be inferred directly from the mouse classifier via zero-shot learning. These results show how simple machine learning models can be used to reconstruct complex biology from limited data, with broad implications for biomedical research.

Download Full-text

Evaluation of single-cell classifiers for single-cell RNA sequencing data sets

Briefings in Bioinformatics ◽

10.1093/bib/bbz096 ◽

2019 ◽

Vol 21 (5) ◽

pp. 1581-1595 ◽

Cited By ~ 6

Author(s):

Xinlei Zhao ◽

Shuang Wu ◽

Nan Fang ◽

Xiao Sun ◽

Jue Fan

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Reference Data ◽

Predictive Accuracy ◽

Cell Types ◽

Superior Performance ◽

Marker Genes ◽

Data Sets ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adapted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some lights on potential direction of future improvement on classification tools.

Download Full-text

Single-Cell RNA Sequencing With Combined Use of Bulk RNA Sequencing to Reveal Cell Heterogeneity and Molecular Changes at Acute Stage of Ischemic Stroke in Mouse Cortex Penumbra Area

Frontiers in Cell and Developmental Biology ◽

10.3389/fcell.2021.624711 ◽

2021 ◽

Vol 9 ◽

Author(s):

Kang Guo ◽

Jianing Luo ◽

Dayun Feng ◽

Lin Wu ◽

Xin Wang ◽

...

Keyword(s):

Ischemic Stroke ◽

Single Cell ◽

Rna Sequencing ◽

Therapeutic Targets ◽

Cell Types ◽

Acute Stage ◽

Gene Markers ◽

Molecular Changes ◽

Early Stages ◽

The Impact

Stroke has been the leading cause of adult morbidity and mortality over the past several years. After an ischemic stroke attack, many dormant or reversibly injured brain cells exist in the penumbra area. However, the pathological processes and unique cell information in the penumbra area of an acute ischemic stroke remain elusive. We applied unbiased single cell sequencing in combination with bulk RNA-seq analysis to investigate the heterogeneity of each cell type in the early stages of ischemic stroke and to detect early possible therapeutic targets to help cell survival. We used these analyses to study the mouse brain penumbra during this phase. Our results reveal the impact of ischemic stroke on specific genes and pathways of different cell types and the alterations of cell differentiation trajectories, suggesting potential pathological mechanisms and therapeutic targets. In addition to classical gene markers, single-cell genomics demonstrates unique information on subclusters of several cell types and metabolism changes in an ischemic stroke. These findings suggest that Gadd45b in microglia, Cyr61 in astrocytes, and Sgk3 in oligodendrocytes may play a subcluster-specific role in cell death or survival in the early stages of ischemic stroke. Moreover, RNA-scope multiplex in situ hybridization and immunofluorescence staining were applied to selected target gene markers to validate and confirm the existence of these cell subtypes and molecular changes during acute stage of ischemic stroke.

Download Full-text

Splatter: simulation of single-cell RNA sequencing data

10.1101/133173 ◽

2017 ◽

Cited By ~ 8

Author(s):

Luke Zappia ◽

Belinda Phipson ◽

Alicia Oshlack

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Real Data ◽

Cell Types ◽

Rna Seq ◽

Sequencing Data ◽

Sequencing Technologies ◽

Simulation Based ◽

Single Cell Rna Sequencing ◽

Multiple Cell

AbstractAs single-cell RNA sequencing technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available.Here we present the Splatter Bioconductor package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.

Download Full-text

DoubletFinder: Doublet detection in single-cell RNA sequencing data using artificial nearest neighbors

10.1101/352484 ◽

2018 ◽

Cited By ~ 17

Author(s):

Christopher S. McGinnis ◽

Lyndsay M. Murrow ◽

Zev J. Gartner

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

False Negative ◽

Droplet Microfluidics ◽

Cell Capture ◽

Sequencing Data ◽

Putative Gene ◽

Detection Tool ◽

Single Cell Rna Sequencing

SUMMARYSingle-cell RNA sequencing (scRNA-seq) using droplet microfluidics occasionally produces transcriptome data representing more than one cell. These technical artifacts are caused by cell doublets formed during cell capture and occur at a frequency proportional to the total number of sequenced cells. The presence of doublets can lead to spurious biological conclusions, which justifies the practice of sequencing fewer cells to limit doublet formation rates. Here, we present a computational doublet detection tool – DoubletFinder – that identifies doublets based solely on gene expression features. DoubletFinder infers the putative gene expression profile of real doublets by generating artificial doublets from existing scRNA-seq data. Neighborhood detection in gene expression space then identifies sequenced cells with increased probability of being doublets based on their proximity to artificial doublets. DoubletFinder robustly identifies doublets across scRNA-seq datasets with variable numbers of cells and sequencing depth, and predicts false-negative and false-positive doublets defined using conventional barcoding approaches. We anticipate that DoubletFinder will aid in scRNA-seq data analysis and will increase the throughput and accuracy of scRNA-seq experiments.

Download Full-text

The single-cell eQTLGen consortium

eLife ◽

10.7554/elife.52155 ◽

2020 ◽

Vol 9 ◽

Cited By ~ 18

Author(s):

MGP van der Wijst ◽

DH de Vries ◽

HE Groot ◽

G Trynka ◽

CC Hon ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Large Scale ◽

Cell Types ◽

Eqtl Analysis ◽

Sequencing Data ◽

Scale Population ◽

Trait Locus ◽

Different Cell Types ◽

Affect Gene Expression

In recent years, functional genomics approaches combining genetic information with bulk RNA-sequencing data have identified the downstream expression effects of disease-associated genetic risk factors through so-called expression quantitative trait locus (eQTL) analysis. Single-cell RNA-sequencing creates enormous opportunities for mapping eQTLs across different cell types and in dynamic processes, many of which are obscured when using bulk methods. Rapid increase in throughput and reduction in cost per cell now allow this technology to be applied to large-scale population genetics studies. To fully leverage these emerging data resources, we have founded the single-cell eQTLGen consortium (sc-eQTLGen), aimed at pinpointing the cellular contexts in which disease-causing genetic variants affect gene expression. Here, we outline the goals, approach and potential utility of the sc-eQTLGen consortium. We also provide a set of study design considerations for future single-cell eQTL studies.

Download Full-text

Distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data

10.1101/234872 ◽

2018 ◽

Cited By ~ 7

Author(s):

Aaron T. L. Lun ◽

Samantha Riesenfeld ◽

Tallulah Andrews ◽

Tomas Gomes ◽

John C. Marioni ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Real Data ◽

Cell Types ◽

Sequencing Data ◽

Minimum Threshold ◽

False Discovery ◽

Distinct Cell ◽

Single Cell Rna Sequencing ◽

Unique Molecular Identifier

AbstractDroplet-based single-cell RNA sequencing protocols have dramatically increased the throughput and efficiency of single-cell transcriptomics studies. A key computational challenge when processing these data is to distinguish libraries for real cells from empty droplets. Existing methods for cell calling set a minimum threshold on the total unique molecular identifier (UMI) count for each library, which indiscriminately discards cell libraries with low UMI counts. Here, we describe a new statistical method for calling cells from droplet-based data, based on detecting significant deviations from the expression profile of the ambient solution. Using simulations, we demonstrate that our method has greater power than existing approaches for detecting cell libraries with low UMI counts, while controlling the false discovery rate among detected cells. We also apply our method to real data, where we show that the use of our method results in the retention of distinct cell types that would otherwise have been discarded.

Download Full-text