Machine learning and statistical methods for clustering single-cell RNA-sequencing data

Raphael Petegrosso; Zhuliu Li; Rui Kuang

doi:10.1093/bib/bbz063

Briefings in Bioinformatics ◽

10.1093/bib/bbz063 ◽

2019 ◽

Vol 21 (4) ◽

pp. 1209-1223 ◽

Cited By ~ 13

Author(s):

Raphael Petegrosso ◽

Zhuliu Li ◽

Rui Kuang

Keyword(s):

Machine Learning ◽

Single Cell ◽

Statistical Methods ◽

Large Scale ◽

Time Series Data ◽

Single Cells ◽

Transcriptome Profiling ◽

Cell Types ◽

Series Data ◽

Sequencing Data

Abstract: Single-cell RNA-sequencing (scRNA-seq) technologies have enabled large-scale whole-transcriptome profiling of each individual cell in a cell population. A core analysis of scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, $k$-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNAs from single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We also introduce more advanced approaches to cluster scRNA-seq transcriptomes in time series data and multiple cell populations and to detect rare cell types. Several software packages developed to support the cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics. Availability: All the source code and data are available at https://github.com/kuanglab/single-cell-review.
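
To make the idea of cell-specific normalization followed by $k$-means concrete, here is a minimal sketch using only the Python standard library. It is not taken from the reviewed article or its repository: the function names `normalize_log` and `kmeans`, the scale factor of 10,000 and the toy count matrix are all assumptions chosen for illustration, and real pipelines typically add dimension reduction and more careful initialization.

```python
import math
import random


def normalize_log(counts, scale=1e4):
    """Scale each cell (row) to a common library size, then apply log1p.

    The scale of 1e4 is an assumed convention, not a value from the article.
    """
    out = []
    for cell in counts:
        total = sum(cell)
        factor = scale / total if total > 0 else 0.0
        out.append([math.log1p(c * factor) for c in cell])
    return out


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data (hypothetical): two cell types, each sequenced shallow and deep.
cells = [[90, 10, 0], [850, 150, 0], [0, 10, 90], [0, 150, 850]]
print(kmeans(normalize_log(cells), k=2))
```

In this toy matrix the two cell types differ in which gene dominates, while sequencing depth differs roughly tenfold within each type; the normalization step is what lets the distance computation reflect expression profile rather than depth.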

Scirpy: A Scanpy extension for analyzing single-cell T-cell receptor sequencing data

10.1101/2020.04.10.035865 ◽

2020 ◽

Author(s):

Gregor Sturm ◽

Tamas Szabo ◽

Georgios Fotakis ◽

Marlene Haider ◽

Dietmar Rieder ◽

...

Keyword(s):

T Cell ◽

Single Cell ◽

Large Scale ◽

Single Cells ◽

Cell Receptor ◽

Sequencing Data ◽

Seamless Integration ◽

T Cell Phenotypes ◽

Cell Phenotypes

Abstract. Summary: Advances in single-cell technologies have enabled the investigation of T cell phenotypes and repertoires at unprecedented resolution and scale. Bioinformatic methods for the efficient analysis of these large-scale datasets are instrumental for advancing our understanding of adaptive immune responses in cancer, but also in infectious diseases like COVID-19. However, while well-established solutions are accessible for the processing of single-cell transcriptomes, no streamlined pipelines are available for the comprehensive characterization of T cell receptors. Here we propose Scirpy, a scalable Python toolkit that provides simplified access to the analysis and visualization of immune repertoires from single cells and seamless integration with transcriptomic data. Availability and implementation: Scirpy source code and documentation are available at https://github.com/icbi-lab/scirpy.

Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

10.1101/2021.01.20.427486 ◽

2021 ◽

Author(s):

Saptarshi Bej ◽

Anne-Marie Galow ◽

Robert David ◽

Markus Wolfien ◽

Olaf Wolkenhauer

Keyword(s):

Machine Learning ◽

Single Cell ◽

Rna Sequencing ◽

Cell Types ◽

Classification Problem ◽

Use Case ◽

Cell Capture ◽

Sequencing Data ◽

Rare Cells ◽

The Impact

Abstract: The research landscape of single-cell and single-nuclei RNA sequencing is evolving rapidly, and one area enabled by this technology is the detection of rare cells. An automated, unbiased and accurate annotation of rare subpopulations is challenging. Once rare cells are identified in one dataset, it will usually be necessary to generate other datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises from the fact that rare cell subpopulations constitute an imbalanced classification problem. We here introduce a Machine Learning (ML)-based oversampling method that uses gene expression counts of already identified rare cells as input to generate synthetic cells, which are then used to identify similar (rare) cells in other publicly available experiments. We utilize single-cell synthetic oversampling (sc-SynO), which is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm. The algorithm corrects for the overall imbalance ratio of the minority and majority class. We demonstrate the effectiveness of the method for two independent use cases, each consisting of two published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8,635). This use case was designed to take a larger imbalance ratio (∼1 to 500) into account and only uses single-nuclei data. The second use case was designed to jointly use snRNA-Seq data and scRNA-Seq on a lower imbalance ratio (∼1 to 26) for the training step to likewise investigate the potential of the algorithm to consider both single cell capture procedures and the impact of “less” rare-cell types. For validation purposes, all datasets have also been analyzed in a traditional manner using common data analysis approaches, such as the Seurat3 workflow. Our algorithm identifies rare-cell populations with high accuracy and a low false positive detection rate. A striking benefit of our algorithm is that it can be readily implemented in other and existing workflows. The code base is publicly available at FairdomHub (https://fairdomhub.org/assays/1368) and can easily be transferred to train other customized approaches.
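
The oversampling idea described above can be illustrated with a simplified sketch in the spirit of LoRAS: noisy copies ("shadowsamples") of neighbouring minority cells are combined with random convex weights to form synthetic cells. This is not the sc-SynO code from FairdomHub; the function name `loras_like_oversample`, every parameter name and default, and the toy data are assumptions, and the published algorithm may differ in detail (for example in how neighbourhoods and noise scales are chosen).

```python
import random


def loras_like_oversample(minority, n_new, k=3, n_shadow=5,
                          sigma=0.05, n_aff=3, seed=0):
    """Generate n_new synthetic samples from a list of minority-class vectors.

    Simplified, assumed variant of shadowsampling; requires n_aff <= k * n_shadow.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority cell and its neighbourhood: the k closest minority
        # cells by squared Euclidean distance (the anchor itself is included).
        anchor = rng.choice(minority)
        hood = sorted(
            minority,
            key=lambda q: sum((a - b) ** 2 for a, b in zip(anchor, q)),
        )[:k]
        # Shadowsamples: Gaussian-perturbed copies of each neighbourhood cell.
        shadows = [
            [x + rng.gauss(0.0, sigma) for x in q]
            for q in hood
            for _rep in range(n_shadow)
        ]
        # Random convex weights (positive, summing to 1) over a few shadows.
        chosen = rng.sample(shadows, n_aff)
        raw = [1.0 - rng.random() for _w in chosen]  # values in (0, 1]
        total = sum(raw)
        weights = [w / total for w in raw]
        synthetic.append([
            sum(w * s[d] for w, s in zip(weights, chosen))
            for d in range(len(anchor))
        ])
    return synthetic


# Toy minority class (hypothetical): four rare cells, two features each.
rare_cells = [[1.0, 2.0], [1.2, 2.1], [0.9, 1.8], [1.1, 2.2]]
synthetic_cells = loras_like_oversample(rare_cells, n_new=10)
print(len(synthetic_cells))
```

Because each synthetic cell is a convex combination of shadowsamples, it stays close to the local neighbourhood of real rare cells, which is the property that makes the synthetic set usable as extra training data for the minority class.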

Phenotypic convergence in the brain: distinct transcription factors regulate common terminal neuronal characters

10.1101/243113 ◽

2018 ◽

Cited By ~ 2

Author(s):

Nikos Konstantinides ◽

Katarina Kapuralin ◽

Chaimaa Fadil ◽

Luendreo Barboza ◽

Rahul Satija ◽

...

Keyword(s):

Transcription Factors ◽

Single Cell ◽

Large Scale ◽

Single Cells ◽

Deep Understanding ◽

Cell Types ◽

Marker Genes ◽

Cell Type ◽

Functional Specification ◽

Phenotypic Convergence

Summary: Transcription factors regulate the molecular, morphological, and physiological characters of neurons and generate their impressive cell type diversity. To gain insight into general principles that govern how transcription factors regulate cell type diversity, we used large-scale single-cell mRNA sequencing to characterize the extensive cellular diversity in the Drosophila optic lobes. We sequenced 55,000 single optic lobe neurons and glia and assigned them to 52 clusters of transcriptionally distinct single cells. We validated the clustering and annotated many of the clusters using RNA sequencing of characterized FACS-sorted single cell types, as well as marker genes specific to given clusters. To identify transcription factors responsible for inducing specific terminal differentiation features, we used machine-learning to generate a ‘random forest’ model. The predictive power of the model was confirmed by showing that two transcription factors expressed specifically in cholinergic (apterous) and glutamatergic (traffic-jam) neurons are necessary for the expression of ChAT and VGlut in many, but not all, cholinergic or glutamatergic neurons, respectively. We used a transcriptome-wide approach to show that the same terminal characters, including but not restricted to neurotransmitter identity, can be regulated by different transcription factors in different cell types, arguing for extensive phenotypic convergence. Our data provide a deep understanding of the developmental and functional specification of a complex brain structure.

The single-cell eQTLGen consortium

eLife ◽

10.7554/elife.52155 ◽

2020 ◽

Vol 9 ◽

Cited By ~ 18

Author(s):

MGP van der Wijst ◽

DH de Vries ◽

HE Groot ◽

G Trynka ◽

CC Hon ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Large Scale ◽

Cell Types ◽

Eqtl Analysis ◽

Sequencing Data ◽

Scale Population ◽

Trait Locus ◽

Different Cell Types ◽

Affect Gene Expression

In recent years, functional genomics approaches combining genetic information with bulk RNA-sequencing data have identified the downstream expression effects of disease-associated genetic risk factors through so-called expression quantitative trait locus (eQTL) analysis. Single-cell RNA-sequencing creates enormous opportunities for mapping eQTLs across different cell types and in dynamic processes, many of which are obscured when using bulk methods. Rapid increase in throughput and reduction in cost per cell now allow this technology to be applied to large-scale population genetics studies. To fully leverage these emerging data resources, we have founded the single-cell eQTLGen consortium (sc-eQTLGen), aimed at pinpointing the cellular contexts in which disease-causing genetic variants affect gene expression. Here, we outline the goals, approach and potential utility of the sc-eQTLGen consortium. We also provide a set of study design considerations for future single-cell eQTL studies.

Massively parallel single cell lineage tracing using CRISPR/Cas9 induced genetic scars

10.1101/205971 ◽

2017 ◽

Cited By ~ 6

Author(s):

Bastiaan Spanjaard ◽

Bo Hu ◽

Nina Mitic ◽

Jan Philipp Junker

Keyword(s):

Single Cell ◽

Computational Analysis ◽

Systematic Approach ◽

Single Cells ◽

Cell Lineage ◽

Transcriptome Profiling ◽

Cell Types ◽

Lineage Tracing ◽

Lineage Trees ◽

Different Cell Types

A key goal of developmental biology is to understand how a single cell transforms into a full-grown organism consisting of many different cell types. Single-cell RNA-sequencing (scRNA-seq) has become a widely used method due to its ability to identify all cell types in a tissue or organ in a systematic manner [1–3]. However, a major challenge is to organize the resulting taxonomy of cell types into lineage trees revealing the developmental origin of cells. Here, we present a strategy for simultaneous lineage tracing and transcriptome profiling in thousands of single cells. By combining scRNA-seq with computational analysis of lineage barcodes generated by genome editing of transgenic reporter genes, we reconstruct developmental lineage trees in zebrafish larvae and adult fish. In future analyses, LINNAEUS (LINeage tracing by Nuclease-Activated Editing of Ubiquitous Sequences) can be used as a systematic approach for identifying the lineage origin of novel cell types, or of known cell types under different conditions.

Scirpy: a Scanpy extension for analyzing single-cell T-cell receptor-sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa611 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4817-4818 ◽

Cited By ~ 2

Author(s):

Gregor Sturm ◽

Tamas Szabo ◽

Georgios Fotakis ◽

Marlene Haider ◽

Dietmar Rieder ◽

...

Keyword(s):

T Cell ◽

Single Cell ◽

Large Scale ◽

Single Cells ◽

Cell Receptor ◽

Supplementary Information ◽

Sequencing Data ◽

Seamless Integration ◽

T Cell Phenotypes ◽

Cell Phenotypes

Abstract. Summary: Advances in single-cell technologies have enabled the investigation of T-cell phenotypes and repertoires at unprecedented resolution and scale. Bioinformatic methods for the efficient analysis of these large-scale datasets are instrumental for advancing our understanding of adaptive immune responses. However, while well-established solutions are accessible for the processing of single-cell transcriptomes, no streamlined pipelines are available for the comprehensive characterization of T-cell receptors. Here, we propose single-cell immune repertoires in Python (Scirpy), a scalable Python toolkit that provides simplified access to the analysis and visualization of immune repertoires from single cells and seamless integration with transcriptomic data. Availability and implementation: Scirpy source code and documentation are available at https://github.com/icbi-lab/scirpy. Supplementary information: Supplementary data are available at Bioinformatics online.

Saturating Single-Cell atlas Datasets

10.1101/218370 ◽

2017 ◽

Cited By ~ 2

Author(s):

Aparna Bhaduri ◽

Tomasz J. Nowakowski ◽

Alex A. Pollen ◽

Arnold R. Kriegstein

Keyword(s):

Population Structure ◽

Single Cell ◽

Mouse Brain ◽

Large Scale ◽

Single Cells ◽

Cost Effective ◽

Cell Types ◽

Cell Number ◽

Cell Type ◽

The Relationship

Abstract: High-throughput methods for profiling the transcriptomes of single cells have recently emerged as transformative approaches for large-scale population surveys of cellular diversity in heterogeneous primary tissues. Efficient generation of such an atlas will depend on sufficient sampling of the diverse cell types while remaining cost-effective to enable a comprehensive examination of organs, developmental stages, and individuals. To examine the relationship between cell number and transcriptional heterogeneity in the context of unbiased cell type classification, we explicitly explored the population structure of a publicly available 1.3 million cell dataset from the E18.5 mouse brain. We propose a computational framework for inferring the saturation point of cluster discovery in a single cell mRNA-seq experiment, centered around cluster preservation in downsampled datasets. In addition, we introduce a “complexity index”, which characterizes the heterogeneity of cells in a given dataset. Using Cajal-Retzius cells as an example of a limited complexity dataset, we explored whether biological distinctions relate to technical clustering. Surprisingly, we found that clustering distinctions carrying biologically interpretable meaning are achieved with far fewer cells (20,000). Together, these findings suggest that most of the biologically interpretable insights from the 1.3 million cells can be recapitulated by analyzing 50,000 randomly selected cells, indicating that instead of profiling few individuals at high “cellular coverage”, the much anticipated cell atlasing studies may instead benefit from profiling more individuals, or many time points at lower cellular coverage.

Recent efforts seek to create a comprehensive cell atlas of the human body [1,2]. Current technology, however, makes it prohibitively expensive to perform analysis of every cell. Therefore, designing effective sampling strategies will be critical to generate a working atlas in an efficient, cost-effective, and streamlined manner. The advent of single cell and single nucleus mRNA sequencing (RNAseq) in droplet format [3,4] now enables large scale sampling of cells from any tissue, and a recently released publicly available dataset of 1.3 million single cells from the E18.5 mouse brain generated with the 10X Chromium [5] provides an opportunity to explore the relationship between population structure and the number of sampled cells necessary to reveal the underlying diversity of cell types. Here, we present a framework for how researchers can evaluate whether a dataset has reached saturation, and we estimate how many cells would be required to generate an atlas of the sample analyzed here. This framework can be applied to any organ or cell type specific atlas for any organism.
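
As a rough illustration of checking cluster preservation under downsampling, the sketch below draws random subsets of cells and counts how many of the original clusters are still represented by a minimum number of cells. This is a deliberately simplified proxy and an assumption on my part: the framework described above re-clusters the downsampled data, whereas this sketch only reuses existing labels. The function names, the `min_cells` threshold and the toy labels are hypothetical.

```python
import random
from collections import Counter


def clusters_retained(labels, n_cells, min_cells=5, seed=0):
    """Count clusters that keep at least min_cells cells in a random subsample."""
    rng = random.Random(seed)
    sample = rng.sample(labels, n_cells)  # sampling without replacement
    counts = Counter(sample)
    return sum(1 for c in counts.values() if c >= min_cells)


def saturation_curve(labels, sizes, min_cells=5, seed=0):
    """Return (subsample size, clusters retained) pairs for each size."""
    return [(n, clusters_retained(labels, n, min_cells, seed)) for n in sizes]


# Toy labels (hypothetical): two abundant clusters and one rare cluster.
labels = ["a"] * 500 + ["b"] * 480 + ["rare"] * 20
for size, retained in saturation_curve(labels, [50, 200, 1000]):
    print(size, retained)
```

Plotting retained clusters against subsample size gives a curve whose plateau suggests where additional cells stop revealing new clusters; rare populations are the ones that drop out first at low coverage.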

Comparison of computational methods for imputing single-cell RNA-sequencing data

10.1101/241190 ◽

2017 ◽

Cited By ~ 10

Author(s):

Lihua Zhang ◽

Shihua Zhang

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Large Scale ◽

Real Data ◽

Cell Types ◽

Biological Functions ◽

Sequencing Data ◽

Imputation Methods ◽

Future Studies ◽

Single Cell Rna Sequencing

Abstract: Single-cell RNA-sequencing (scRNA-seq) is a recent breakthrough technology that paves the way for measuring RNA levels at single-cell resolution to study precise biological functions. One of the main challenges when analyzing scRNA-seq data is the presence of zeros or dropout events, which may mislead downstream analyses. To compensate for the dropout effect, several methods have been developed to impute gene expression since the first Bayesian method was proposed in 2016. However, these methods have shown very diverse characteristics in terms of model hypothesis and imputation performance. Thus, a large-scale comparison and evaluation of these methods is urgently needed. To this end, we compared eight imputation methods, evaluated their power in recovering original real data, and performed broad analyses to explore their effects on clustering cell types, detecting differentially expressed genes, and reconstructing lineage trajectories in the context of both simulated and real data. Simulated datasets and case studies highlight that no single method performs best in all situations. Some defects of these methods, such as scalability, robustness and unavailability in some situations, need to be addressed in future studies.
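
One common way to measure how well an imputation method recovers original values is to hide known entries, impute, and compare. The sketch below assumes that setup; it is not the evaluation protocol of the study above. The gene-mean imputer is a naive baseline and not one of the eight compared methods, and all function names and the toy matrix are hypothetical.

```python
import math
import random


def mask_entries(matrix, frac, seed=0):
    """Zero out a random fraction of the non-zero entries to mimic dropouts."""
    rng = random.Random(seed)
    nonzero = [(i, j) for i, row in enumerate(matrix)
               for j, v in enumerate(row) if v != 0]
    hidden = rng.sample(nonzero, int(frac * len(nonzero)))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = 0.0
    return masked, hidden


def impute_gene_mean(matrix):
    """Replace zeros with the mean of the gene's non-zero values.

    Rows are cells and columns are genes (an assumed layout).
    """
    imputed = [list(row) for row in matrix]
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] != 0]
        fill = sum(observed) / len(observed) if observed else 0.0
        for row in imputed:
            if row[j] == 0:
                row[j] = fill
    return imputed


def rmse_on_hidden(truth, estimate, hidden):
    """Root-mean-square error restricted to the entries that were hidden."""
    sq = sum((truth[i][j] - estimate[i][j]) ** 2 for i, j in hidden)
    return math.sqrt(sq / len(hidden))


# Toy matrix (hypothetical): 4 cells x 2 genes, each gene constant across cells.
truth = [[2.0, 5.0], [2.0, 5.0], [2.0, 5.0], [2.0, 5.0]]
masked, hidden = mask_entries(truth, frac=0.25)
imputed = impute_gene_mean(masked)
print(rmse_on_hidden(truth, masked, hidden), rmse_on_hidden(truth, imputed, hidden))
```

Note that this baseline treats every zero as a dropout, which is exactly the kind of modelling assumption on which imputation methods differ; true biological zeros would be overwritten as well.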

A Scalable Strand-Specific Protocol Enabling Full-Length Total RNA Sequencing From Single Cells

Frontiers in Genetics ◽

10.3389/fgene.2021.665888 ◽

2021 ◽

Vol 12 ◽

Author(s):

Simon Haile ◽

Richard D. Corbett ◽

Veronique G. LeBlanc ◽

Lisa Wei ◽

Stephen Pleasance ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

High Throughput Sequencing ◽

Single Cells ◽

Cell Types ◽

Full Length ◽

Sequencing Data ◽

Total Rna ◽

Specific Protocol

RNA sequencing (RNAseq) has been widely used to generate bulk gene expression measurements collected from pools of cells. Only relatively recently have single-cell RNAseq (scRNAseq) methods provided opportunities for gene expression analyses at the single-cell level, allowing researchers to study heterogeneous mixtures of cells at unprecedented resolution. Tumors tend to be composed of heterogeneous cellular mixtures and are frequently the subjects of such analyses. Extensive method development has led to several protocols for scRNAseq but, owing to the small amounts of RNA in single cells, technical constraints have required compromises. For example, the majority of scRNAseq methods are limited to sequencing only the 3′ or 5′ termini of transcripts. Other protocols that facilitate full-length transcript profiling tend to capture only polyadenylated mRNAs and are generally limited to processing only 96 cells at a time. Here, we address these limitations and present a novel protocol that allows for the high-throughput sequencing of full-length, total RNA at single-cell resolution. We demonstrate that our method produced strand-specific sequencing data for both polyadenylated and non-polyadenylated transcripts, enabled the profiling of transcript regions beyond only transcript termini, and yielded data rich enough to allow identification of cell types from heterogeneous biological samples.
