A Systematic Evaluation of Methods for Cell Phenotype Classification Using Single-Cell RNA Sequencing Data

Abstract Background: Single-cell RNA sequencing (scRNA-seq) yields valuable insights into gene expression and provides critical information about the cellular composition of complex tissues. In scRNA-seq analysis, cell subtypes are often annotated manually, which is time-consuming and irreproducible. Garnett is a cell-type annotation tool based on the elastic net method. Besides cell-type annotation, supervised machine learning methods can also be applied to predict other cell phenotypes from genomic data. Despite the popularity of such applications, no existing study has systematically investigated the performance of these supervised algorithms on scRNA-seq data sets of various sizes. Methods and Results: This study evaluates 13 popular supervised machine learning algorithms for classifying cell phenotypes, using published real and simulated data sets containing different numbers of cells. The benchmark consisted of two parts. In the first part, we used real data sets to assess the algorithms' computing speed and cell phenotype classification performance. Classification performance was evaluated using AUC, F1-score, precision, recall, and false-positive rate. In the second part, we evaluated gene selection performance using published simulated data sets with a known list of real genes. Conclusion: The results showed that ElasticNet with interactions performed best on small and medium data sets. NB was another appropriate method for medium data sets. On large data sets, XGB performed very well. Ensemble algorithms were not significantly superior to individual machine learning methods. Adding interactions to ElasticNet can help, and the improvement was significant on small data sets.
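
The threshold-based metrics named in the abstract above (precision, recall, F1-score, and false-positive rate) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch rather than the authors' evaluation code; the function name `binary_metrics` and the toy labels are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Return precision, recall, F1 and false-positive rate for 0/1 labels.

    Hypothetical helper for illustration; not taken from the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Toy example: 1 = cell has the phenotype, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

For multi-class phenotypes, one common convention is to compute these per class (one class versus the rest) and average; whether the study did so is not stated in the abstract.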

Red panda: a novel method for detecting variants in single-cell RNA sequencing

BMC Genomics, DOI: 10.1186/s12864-020-07224-3, 2020, Vol 21 (S11)

Author(s): Adam Cornish, Shrabasti Roychoudhury, Krishna Sarma, Suravi Pramanik, Kishor Bhakat, ...

Keyword(s): Single Cell, RNA Sequencing, Articular Chondrocytes, Genetic Diseases, Simulated Data, Single Nucleotide, Single Cell RNA Sequencing, Red Panda, Novel Method, Rare Cells

Abstract Background: Single-cell sequencing enables us to better understand genetic diseases, such as cancer or autoimmune disorders, which are often affected by changes in rare cells. Currently, no existing software is aimed at identifying single nucleotide variations or micro (1-50 bp) insertions and deletions in single-cell RNA sequencing (scRNA-seq) data. Generating high-quality variant data is vital to the study of the aforementioned diseases, among others. Results: In this study, we report the design and implementation of Red Panda, a novel method to accurately identify variants in scRNA-seq data. Variants were called on scRNA-seq data from human articular chondrocytes, mouse embryonic fibroblasts (MEFs), and simulated data stemming from the MEF alignments. Red Panda had the highest Positive Predictive Value at 45.0%, while other tools (FreeBayes, GATK HaplotypeCaller, GATK UnifiedGenotyper, Monovar, and Platypus) ranged from 5.8% to 41.53%. On the simulated data, Red Panda had the highest sensitivity at 72.44%. Conclusions: We show that our method provides a novel and improved mechanism to identify variants in scRNA-seq data compared with currently existing software. However, methods for identifying genomic variants using scRNA-seq data can still be improved.
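
Positive predictive value and sensitivity, the two figures reported above, both follow from comparing a set of called variants against a truth set. The sketch below assumes variants are represented as `(chromosome, position, ref, alt)` tuples and matched exactly; this representation, the function name, and the toy coordinates are illustrative assumptions, not Red Panda's implementation.

```python
def ppv_and_sensitivity(called, truth):
    """Compare called variants with a truth set.

    Both arguments are iterables of hashable variant records, assumed
    here to be (chrom, pos, ref, alt) tuples matched exactly.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # PPV: fraction of calls that are real variants.
    ppv = true_positives / len(called) if called else 0.0
    # Sensitivity: fraction of real variants that were called.
    sensitivity = true_positives / len(truth) if truth else 0.0
    return ppv, sensitivity


# Made-up coordinates, for illustration only.
truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 40, "G", "GA"), ("chr3", 77, "T", "C")]
called = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
          ("chr2", 40, "G", "GA"), ("chr2", 90, "A", "C"),
          ("chr5", 12, "G", "T")]
ppv, sensitivity = ppv_and_sensitivity(called, truth)
```

Real benchmarks usually normalize indel representation before matching, since the same insertion or deletion can be written at different positions; exact tuple matching is the simplest possible stand-in.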

Software Benchmark—Classification Tree Algorithms for Cell Atlases Annotation Using Single-Cell RNA-Sequencing Data

Microbiology Research, DOI: 10.3390/microbiolres12020022, 2021, Vol 12 (2), pp. 317-334

Author(s): Omar Alaqeeli, Li Xing, Xuekui Zhang

Keyword(s): Single Cell, RNA Sequencing, Classification Tree, Area Under The Curve, Data Sets, Sequencing Data, Single Cell RNA Sequencing, Tree Algorithms, R Packages

The classification tree is a widely used machine learning method. It has multiple implementations as R packages: rpart, ctree, evtree, tree, and C5.0. The details of these implementations differ, and hence their performance varies from one application to another. We are interested in their performance in classifying cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compared the packages' prediction performance based on their Precision, Recall, F1-score, and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score, and AUC; C5.0 prefers more complex trees; and tree is consistently much faster than the others, although its complexity is often higher.
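
Of the metrics listed above, AUC is the only one computed from predicted scores rather than hard labels. A minimal sketch in Python (the benchmark itself uses R packages, so this is not the authors' code): the AUC equals the probability that a randomly chosen positive cell receives a higher score than a randomly chosen negative cell, with ties counted as one half. The function name `roc_auc` is an assumption made for illustration.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    example is scored higher; ties contribute one half. Runs in
    O(n_pos * n_neg), which is fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In a cross-validation setting, this would be evaluated on the held-out fold of each split and then averaged across folds.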

Modeling cellular crosstalk and organotypic vasculature development with human iPSC-derived endothelial cells and cardiomyocytes

DOI: 10.1101/2020.05.04.075846, 2020

Author(s): Emmi Helle, Minna Ampuja, Alexandra Dainis, Laura Antola, Elina Temmes, ...

Keyword(s): Endothelial Cells, Shear Stress, Endothelial Cell, Single Cell, RNA Sequencing, Tissue Growth, Cell Phenotype, Cell Populations, Pump System, Single Cell RNA Sequencing

Abstract Rationale: Cell-cell interactions are crucial for the development and function of organs. Endothelial cells act as essential regulators of tissue growth and regeneration. In the heart, endothelial cells engage in delicate bidirectional communication with cardiomyocytes. The mechanisms and mediators of this crosstalk are still poorly understood. Furthermore, endothelial cells in vivo are exposed to blood flow, and their phenotype is greatly affected by shear stress. Objective: We aimed to elucidate how cardiomyocytes regulate the development of organotypic phenotype in endothelial cells. In addition, the effects of flow-induced shear stress on endothelial cell phenotype were studied. Methods and results: Human induced pluripotent stem cell (hiPSC)-derived cardiomyocytes and endothelial cells were grown either as a monoculture or as a coculture. hiPS-endothelial cells were exposed to flow using the Ibidi-pump system. Single-cell RNA sequencing was performed to define cell populations and to uncover the effects on their transcriptomic phenotypes. The hiPS-cardiomyocyte differentiation resulted in two distinct populations: atrial and ventricular. Coculture had a more pronounced effect on hiPS-endothelial cells than on hiPS-cardiomyocytes. Coculture increased hiPS-endothelial cell expression of transcripts related to vascular development and maturation, cardiac development, and the expression of cardiac endothelial cell-specific genes. Exposure to flow significantly reprogrammed the hiPS-endothelial cell transcriptome and, surprisingly, promoted the appearance of both venous and arterial clusters. Conclusions: Single-cell RNA sequencing revealed distinct atrial and ventricular cell populations in hiPS-cardiomyocytes, and arterial and venous-like cell populations in flow-exposed hiPS-endothelial cells. hiPS-endothelial cells acquired cardiac endothelial cell identity in coculture. Our study demonstrated that hiPS-cardiomyocytes and hiPS-endothelial cells readily adapt to coculture and flow in a consistent and relevant manner, indicating that the methods used represent improved physiological cell culturing conditions that are potentially more relevant in disease modelling. In addition, novel cardiomyocyte-endothelial cell crosstalk mediators were revealed.

Transfer learning efficiently maps bone marrow cell types from mouse to human using single-cell RNA sequencing

Communications Biology, DOI: 10.1038/s42003-020-01463-6, 2020, Vol 3 (1)

Author(s): Patrick S. Stumpf, Xin Du, Haruka Imanishi, Yuya Kunisaki, Yuichiro Semba, ...

Keyword(s): Machine Learning, Bone Marrow, Single Cell, RNA Sequencing, Transfer Learning, Biomedical Research, Human Cell, Cell Types, Single Cell RNA Sequencing, Using Data

Abstract Biomedical research often involves conducting experiments on model organisms in the anticipation that the biology learnt will transfer to humans. Previous comparative studies of mouse and human tissues were limited by the use of bulk-cell material. Here we show that transfer learning (the branch of machine learning that concerns passing information from one domain to another) can be used to efficiently map bone marrow biology between species, using data obtained from single-cell RNA sequencing. We first trained a multiclass logistic regression model to recognize different cell types in mouse bone marrow, achieving performance equivalent to that of more complex artificial neural networks. Furthermore, the model was able to identify individual human bone marrow cells with 83% overall accuracy. However, some human cell types were not easily identified, indicating important differences in biology. When re-training the mouse classifier using human data, fewer than 10 human cells of a given type were needed to accurately learn its representation. In some cases, human cell identities could be inferred directly from the mouse classifier via zero-shot learning. These results show how simple machine learning models can be used to reconstruct complex biology from limited data, with broad implications for biomedical research.

Evaluation of single-cell classifiers for single-cell RNA sequencing data sets

Briefings in Bioinformatics, DOI: 10.1093/bib/bbz096, 2019, Vol 21 (5), pp. 1581-1595, Cited By ~ 6

Author(s): Xinlei Zhao, Shuang Wu, Nan Fang, Xiao Sun, Jue Fan

Keyword(s): Single Cell, RNA Sequencing, Reference Data, Predictive Accuracy, Cell Types, Superior Performance, Marker Genes, Data Sets, Sequencing Data, Single Cell RNA Sequencing

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes does not scale to an increasingly large number of scRNA-seq data sets because of its complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis, and CaSTLe based on XGBoost performed better than the others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning 'unassigned' labels, it is still challenging to catch novel cell types using any single one of the classifiers alone. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
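
The abstract above mentions a simple ensemble vote across tools and an 'unassigned' label. The sketch below shows one way such a vote could work; the abstract does not specify the scheme, so the strict-majority rule, the `min_agreement` parameter, and the function name are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions, min_agreement=0.5, unassigned="unassigned"):
    """Combine per-tool cell-type labels by majority vote.

    predictions: list of label lists, one per tool, all the same length
    (one label per cell). A cell keeps the most common label only if
    strictly more than `min_agreement` of the tools agree on it;
    otherwise it is marked as unassigned. This rule is an assumption,
    not the scheme used in the paper.
    """
    n_tools = len(predictions)
    consensus = []
    for labels in zip(*predictions):  # labels for one cell across tools
        label, count = Counter(labels).most_common(1)[0]
        if count / n_tools > min_agreement:
            consensus.append(label)
        else:
            consensus.append(unassigned)
    return consensus


# Three hypothetical tools labelling four cells.
tool_a = ["T", "B", "NK", "T"]
tool_b = ["T", "B", "T", "B"]
tool_c = ["T", "Mono", "NK", "NK"]
consensus = majority_vote([tool_a, tool_b, tool_c])
```

Requiring a strict majority means ties never produce a label, so the result does not depend on the order in which the tools are listed.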

Stem Cell Pluripotency Genes Klf4 and Oct4 Regulate Complex SMC Phenotypic Changes Critical in Late-Stage Atherosclerotic Lesion Pathogenesis

Circulation, DOI: 10.1161/circulationaha.120.046672, 2020, Vol 142 (21), pp. 2045-2059, Cited By ~ 8

Author(s): Gabriel F. Alencar, Katherine M. Owsiany, Santosh Karnewar, Katyayani Sukhavasi, Giuseppe Mocci, ...

Keyword(s): Stem Cell, Single Cell, RNA Sequencing, Late Stage, Lineage Tracing, Stem Cell Marker, Cell Phenotype, Phenotypic Changes, Atherosclerotic Lesions, Single Cell RNA Sequencing

Background: Rupture and erosion of advanced atherosclerotic lesions with a resultant myocardial infarction or stroke are the leading worldwide cause of death. However, we have a limited understanding of the identity, origin, and function of many cells that make up late-stage atherosclerotic lesions, as well as the mechanisms by which they control plaque stability. Methods: We conducted comprehensive single-cell RNA sequencing of advanced human carotid endarterectomy samples and compared these with single-cell RNA sequencing from murine microdissected advanced atherosclerotic lesions with smooth muscle cell (SMC) and endothelial lineage tracing to survey all plaque cell types and rigorously determine their origin. We further used chromatin immunoprecipitation sequencing (ChIP-seq), bulk RNA sequencing, and an innovative dual lineage tracing mouse to understand the mechanism by which SMC phenotypic transitions affect lesion pathogenesis. Results: We provide evidence that SMC-specific Klf4- versus Oct4-knockout showed virtually opposite genomic signatures, and that their putative target genes play an important role in regulating SMC phenotypic changes. Single-cell RNA sequencing revealed remarkable similarity of transcriptomic clusters between mouse and human lesions and extensive plasticity of SMC- and endothelial cell-derived cells, including 7 distinct clusters, most negative for traditional markers. In particular, SMCs contributed to a Myh11−, Lgals3+ population with a chondrocyte-like gene signature that was markedly reduced with SMC-Klf4 knockout. We observed that SMCs that activate Lgals3 compose up to two thirds of all SMCs in lesions. However, initial activation of Lgals3 in these cells does not represent conversion to a terminally differentiated state, but rather represents transition of these cells to a unique stem cell marker gene–positive, extracellular matrix-remodeling, "pioneer" cell phenotype that is the first to invest within lesions and subsequently gives rise to at least 3 other SMC phenotypes within advanced lesions, including Klf4-dependent osteogenic phenotypes likely to contribute to plaque calcification and plaque destabilization. Conclusions: Taken together, these results provide evidence that SMC-derived cells within advanced mouse and human atherosclerotic lesions exhibit far greater phenotypic plasticity than generally believed, with Klf4 regulating transition to multiple phenotypes including Lgals3+ osteogenic cells likely to be detrimental for late-stage atherosclerosis plaque pathogenesis.

DrivAER: Identification of driving transcriptional programs in single-cell RNA sequencing data

DOI: 10.1101/864165, 2019

Author(s): Lukas M. Simon, Fangfang Yan, Zhongming Zhao

Keyword(s): Single Cell, RNA Sequencing, Disease Status, Data Sets, Sequencing Data, Functional Interpretation, Recent Success, Gene Sets, Single Cell RNA Sequencing, Cellular Maps

Abstract Single cell RNA sequencing (scRNA-seq) unfolds complex transcriptomic data sets into detailed cellular maps. Despite recent success, there is a pressing need for specialized methods tailored towards the functional interpretation of these cellular maps. Here, we present DrivAER, a machine learning approach that scores annotated gene sets based on their relevance to user-specified outcomes such as pseudotemporal ordering or disease status. We demonstrate that DrivAER extracts the key driving pathways and transcription factors that regulate complex biological processes from scRNA-seq data.

Disease-relevant single cell photonic signatures identify S100β stem cells and their myogenic progeny in vascular lesions

DOI: 10.1101/2020.05.13.093518, 2020

Author(s): Claire Molony, Damien King, Mariana Di Luca, Abidemi Olayinka, Roya Hakimjavadi, ...

Keyword(s): Machine Learning, Stem Cells, Single Cell, Ex Vivo, Lineage Tracing, Supervised Machine Learning, Vascular Lesions, Cell Phenotype, Genetic Lineage, Collagen III

Abstract A hallmark of subclinical atherosclerosis is the accumulation of vascular smooth muscle cell (SMC)-like cells leading to intimal thickening and lesion formation. While medial SMCs contribute to vascular lesions, the involvement of resident vascular stem cells (vSCs) remains unclear. We evaluated single cell photonics as a discriminator of cell phenotype in vitro before the presence of vSC within vascular lesions was assessed ex vivo using supervised machine learning and further validated using lineage tracing analysis. Using a novel lab-on-a-Disk (Load) platform, label-free single cell photonic emissions from normal and injured vessels ex vivo were interrogated and compared to freshly isolated aortic SMCs, cultured Movas SMCs, macrophages, B-cells, S100β+ mVSc, bone marrow derived mesenchymal stem cells (MSC) and their respective myogenic progeny across five broadband light wavelengths (λ465 - λ670 ± 20 nm). We found that profiles were of sufficient coverage, specificity, and quality to clearly distinguish medial SMCs from different vascular beds (carotid vs aorta), discriminate normal carotid medial SMCs from lesional SMC-like cells ex vivo following flow restriction, and identify SMC differentiation of a series of multipotent stem cells following treatment with transforming growth factor beta 1 (TGF-β1), the Notch ligand Jagged1, and Sonic Hedgehog using multivariate analysis, due in part to photonic emissions from enhanced collagen III and elastin expression. Supervised machine learning supported genetic lineage tracing analysis of S100β+ vSCs and identified the presence of S100β+ vSC-derived myogenic progeny within vascular lesions. We conclude that disease-relevant photonic signatures may have predictive value for vascular disease.

Controlling for confounding effects in single cell RNA sequencing studies using both control and target genes

DOI: 10.1101/045070, 2016

Author(s): Mengjie Chen, Xiang Zhou

Keyword(s): Single Cell, RNA Sequencing, Target Genes, Expectation Maximization Algorithm, Data Sets, Single Cell RNA Sequencing, Sequencing Studies, Order Of Magnitude, The Rich, Downstream Analysis

Single cell RNA sequencing (scRNAseq) is becoming increasingly popular for unbiased, high-resolution transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the influence of confounding effects. Controlling for confounding effects in scRNAseq data is thus a crucial step for proper data normalization and accurate downstream analysis. Several recent methodological studies have demonstrated the use of control genes for this purpose in scRNAseq studies: the control genes are used to infer the confounding effects, which are then used to normalize target genes of primary interest. However, these methods can be suboptimal because they ignore the rich information contained in the target genes. Here, we develop an alternative statistical method, which we refer to as scPLS, for more accurate inference of confounding effects. Our method is based on partial least squares and models control and target genes jointly to better infer and control for confounding effects. To accompany our method, we develop a novel expectation maximization algorithm for scalable inference. Our algorithm is an order of magnitude faster than standard ones, making scPLS applicable to hundreds of cells and hundreds of thousands of genes. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS. Finally, we apply scPLS to two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as cell cycle effects.
