Clustering and visualization of single-cell RNA-seq data using path metrics

Mapping Intimacies ◽

10.1101/2021.12.14.472627 ◽

2021 ◽

Author(s):

Andriana Manousidaki ◽

Anna Little ◽

Yuying Xie

Keyword(s):

Dimension Reduction ◽

Single Cell ◽

Cluster Structure ◽

Data Sets ◽

Sequencing Data ◽

Local Cluster ◽

Wide Range ◽

Clustering Quality ◽

Low Dimensional

Recent advances in single-cell technologies have enabled high-resolution characterization of tissue and cancer compositions. Although numerous tools for dimension reduction and clustering are available for single-cell data analyses, these methods often fail to simultaneously preserve local cluster structure and global data geometry. This article explores the application of power-weighted path metrics to the analysis of single-cell RNA sequencing data. Extensive experiments on single-cell RNA sequencing data sets confirm the usefulness of path metrics for dimension reduction and clustering. Distances between cells are measured in a data-driven way that is both density sensitive (distances decrease across high-density regions) and respectful of the underlying data geometry. By combining path metrics with multidimensional scaling, a low-dimensional embedding of the data is obtained that respects the global geometry of the data and preserves cluster structure. We evaluate the method for both clustering quality and geometric fidelity, and it outperforms other algorithms on a wide range of benchmarking data sets.
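As a rough illustration of the path metrics described in this abstract, the sketch below computes a power-weighted shortest-path distance between all pairs of points. It assumes the common definition in which each hop costs the Euclidean distance raised to a power p and the path total is raised to 1/p. The function name, the use of a complete graph, and the Floyd-Warshall search are illustrative assumptions, not the authors' implementation.

```python
import math


def power_weighted_path_distances(points, p=2.0):
    """Return the matrix of power-weighted path distances (hypothetical helper).

    Each hop between two points costs (Euclidean distance) ** p, the cheapest
    route over all intermediate points is found, and the total is raised
    to the power 1 / p.
    """
    n = len(points)
    # Direct hop cost between every pair of points.
    cost = [[math.dist(a, b) ** p for b in points] for a in points]
    # Floyd-Warshall: allow each point k in turn as an intermediate stop.
    for k in range(n):
        for i in range(n):
            cost_ik = cost[i][k]
            for j in range(n):
                candidate = cost_ik + cost[k][j]
                if candidate < cost[i][j]:
                    cost[i][j] = candidate
    return [[c ** (1.0 / p) for c in row] for row in cost]


# Five evenly spaced points on a line, standing in for a densely sampled region.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
distances = power_weighted_path_distances(chain, p=2.0)
# Hopping through the chain is cheaper than the direct Euclidean jump of 4.
print(distances[0][4])
```

With p = 1 the metric reduces to the Euclidean distance. Larger values of p make routes through densely sampled regions cheaper than direct jumps, which is the density sensitivity the abstract refers to. The resulting matrix could then be handed to a multidimensional scaling routine to obtain an embedding.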

Dimension reduction and denoising of single-cell RNA sequencing data in the presence of observed confounding variables

10.1101/2020.08.03.234765 ◽

2020 ◽

Author(s):

Mo Huang ◽

Zhaojun Zhang ◽

Nancy R. Zhang

Keyword(s):

Gene Expression ◽

Dimension Reduction ◽

Single Cell ◽

RNA Sequencing ◽

Cell Alignment ◽

Sequencing Data ◽

Library Size ◽

Batch Correction ◽

Single Cell RNA Sequencing ◽

Low Dimensional

Confounding variation, such as batch effects, is a pervasive issue in single-cell RNA sequencing experiments. While methods exist for aligning cells across batches, it is still unclear how to correct for other types of confounding variation, which may be observed at the subject level, such as age and sex, and at the cell level, such as library size and other measures of cell quality. On the specific problem of batch alignment, many questions persist despite recent advances: existing methods can effectively align batches in low-dimensional representations of cells, yet their effectiveness in aligning the original gene expression matrices is unclear. Nor is it clear how batch correction can be performed alongside data denoising, the former addressing technical biases due to experimental stratification and the latter addressing technical variation inherent to the random sampling that occurs during library construction and sequencing. Here, we propose SAVERCAT, a method for dimension reduction and denoising of single-cell gene expression data that can flexibly adjust for arbitrary observed covariates. We benchmark SAVERCAT against existing single-cell batch correction methods and show that while it matches the best of the field in low-dimensional cell alignment, it significantly improves upon existing methods on the task of batch correction in the high-dimensional expression matrix. We also demonstrate the ability of SAVERCAT to effectively integrate batch correction and denoising through a data down-sampling experiment. Finally, we apply SAVERCAT to a single-cell study of Alzheimer’s disease where batch is confounded with the contrast of interest, and demonstrate how adjusting for covariates other than batch allows for more interpretable analysis.

Software Benchmark—Classification Tree Algorithms for Cell Atlases Annotation Using Single-Cell RNA-Sequencing Data

Microbiology Research ◽

10.3390/microbiolres12020022 ◽

2021 ◽

Vol 12 (2) ◽

pp. 317-334

Author(s):

Omar Alaqeeli ◽

Li Xing ◽

Xuekui Zhang

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Classification Tree ◽

Area Under The Curve ◽

Data Sets ◽

Sequencing Data ◽

Single Cell RNA Sequencing ◽

Tree Algorithms ◽

R Packages

The classification tree is a widely used machine learning method. It has multiple implementations as R packages: rpart, ctree, evtree, tree and C5.0. The details of these implementations differ, and hence their performance varies from one application to another. We are interested in their performance in classifying cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compare the packages’ prediction performance based on their Precision, Recall, F1-score, and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; and tree is consistently much faster than the others, although its complexity is often higher.
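For readers unfamiliar with the metrics named in this abstract, the following minimal sketch computes Precision, Recall and F1-score for a binary cell-type label. The function name and the toy labels are hypothetical; the benchmark itself was run with R packages, so this only illustrates the standard formulas.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute Precision, Recall and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels: 1 marks the cell type of interest, 0 marks every other cell.
truth = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
print(precision_recall_f1(truth, predicted))
```

In a cross-validation setting, these values would be computed on each held-out fold and then averaged.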

Bayesian Classification of Microbial Communities Based on 16S rRNA Metagenomic Data

10.1101/340653 ◽

2018 ◽

Cited By ~ 1

Author(s):

Arghavan Bahadorinejad ◽

Ivan Ivanov ◽

Johanna W Lampe ◽

Meredith AJ Hullar ◽

Robert S Chapkin ◽

...

Keyword(s):

16S rRNA ◽

Sample Size ◽

Microbial Communities ◽

State Of The Art ◽

Metagenomic Data ◽

Data Sets ◽

Sequencing Data ◽

Sample Data

We propose a Bayesian method for the classification of 16S rRNA metagenomic profiles of bacterial abundance by introducing a Poisson-Dirichlet-Multinomial hierarchical model for the sequencing data, constructing a prior distribution from sample data, calculating the posterior distribution in closed form, and deriving an Optimal Bayesian Classifier (OBC). The proposed algorithm is compared to state-of-the-art classification methods for 16S rRNA metagenomic data, including Random Forests and the phylogeny-based Metaphyl algorithm, for varying sample size, classification difficulty, and dimensionality (number of OTUs), using both synthetic and real metagenomic data sets. The results demonstrate that the proposed OBC method, with either noninformative or constructed priors, is competitive with or superior to the other methods. In particular, when the ratio of sample size to dimensionality is small, the proposed method can vastly outperform the others.

Author summary: Recent studies have highlighted the interplay between host genetics, gut microbes, and colorectal tumor initiation/progression. The characterization of microbial communities using metagenomic profiling has therefore received renewed interest. In this paper, we propose a method for classification, i.e., prediction of different outcomes, based on 16S rRNA metagenomic data. The proposed method employs a Bayesian approach, which is suitable for data sets with a small ratio of the number of available instances to the dimensionality. Results using both synthetic and real metagenomic data show that the proposed method can outperform other state-of-the-art metagenomic classification algorithms.
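To make the idea of a closed-form Bayesian classifier for count data more concrete, here is a deliberately simplified sketch. It swaps the paper's Poisson-Dirichlet-Multinomial model for a plain Dirichlet-multinomial model with a symmetric prior and equal class probabilities, so it is not the OBC described above. All function names and the toy OTU counts are hypothetical.

```python
import math


def fit(training, prior=1.0):
    """Posterior Dirichlet parameters per class: prior plus summed OTU counts.

    `training` maps a class label to a list of count vectors (one per sample).
    """
    posterior = {}
    for label, samples in training.items():
        n_otus = len(samples[0])
        posterior[label] = [
            prior + sum(sample[k] for sample in samples) for k in range(n_otus)
        ]
    return posterior


def log_marginal(counts, alpha):
    """Log Dirichlet-multinomial likelihood of `counts`.

    The multinomial coefficient is omitted because it is the same for every class.
    """
    total = sum(counts)
    alpha_sum = sum(alpha)
    value = math.lgamma(alpha_sum) - math.lgamma(alpha_sum + total)
    for x, a in zip(counts, alpha):
        value += math.lgamma(a + x) - math.lgamma(a)
    return value


def classify(counts, posterior):
    """Pick the class with the highest predictive likelihood (equal class priors assumed)."""
    return max(posterior, key=lambda label: log_marginal(counts, posterior[label]))


# Toy data: three OTUs, two samples per class.
training = {
    "A": [[10, 1, 1], [8, 2, 0]],
    "B": [[1, 9, 2], [0, 10, 1]],
}
model = fit(training)
print(classify([7, 1, 0], model))
```

Because the Dirichlet prior is conjugate to the multinomial, the predictive likelihood needs only log-gamma evaluations, which is one reason such models remain usable when samples are few relative to the number of OTUs.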

CAMISIM: Simulating metagenomes and microbial communities

10.1101/300970 ◽

2018 ◽

Cited By ~ 4

Author(s):

Adrian Fritz ◽

Peter Hofmann ◽

Stephan Majda ◽

Eik Dahms ◽

Johannes Dröge ◽

...

Keyword(s):

Microbial Communities ◽

De Novo ◽

Real Data ◽

Small Data ◽

Data Sets ◽

Sequencing Data ◽

Taxonomic Profiling ◽

Benchmark Data ◽

Sequencing Technologies ◽

Wide Range

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets is required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM

Panoramic stitching of heterogeneous single-cell transcriptomic data

10.1101/371179 ◽

2018 ◽

Cited By ~ 17

Author(s):

Brian Hie ◽

Bryan Bryson ◽

Bonnie Berger

Keyword(s):

Single Cell ◽

Cell Types ◽

Data Sets ◽

Cell Type ◽

Data Set ◽

Wide Range ◽

Data Set Integration ◽

Biological Patterns ◽

Insight Into ◽

Comprehensive Reference

Researchers are generating single-cell RNA sequencing (scRNA-seq) profiles of diverse biological systems [1–4] and every cell type in the human body [5]. Leveraging this data to gain unprecedented insight into biology and disease will require assembling heterogeneous cell populations across multiple experiments, laboratories, and technologies. Although methods for scRNA-seq data integration exist [6,7], they often naively merge data sets together even when the data sets have no cell types in common, leading to results that do not correspond to real biological patterns. Here we present Scanorama, inspired by algorithms for panorama stitching, that overcomes the limitations of existing methods to enable accurate, heterogeneous scRNA-seq data set integration. Our strategy identifies and merges the shared cell types among all pairs of data sets and is orders of magnitude faster than existing techniques. We use Scanorama to combine 105,476 cells from 26 diverse scRNA-seq experiments across 9 different technologies into a single comprehensive reference, demonstrating how Scanorama can be used to obtain a more complete picture of cellular function across a wide range of scRNA-seq experiments.
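One common way to find populations shared between two data sets, without forcing a match for cells that have no counterpart, is mutual nearest neighbor matching. The sketch below assumes that approach on toy two-dimensional points; the function names are hypothetical and this is not Scanorama's code or API.

```python
import math


def nearest(point, candidates):
    """Index of the candidate closest to `point` in Euclidean distance."""
    return min(range(len(candidates)), key=lambda j: math.dist(point, candidates[j]))


def mutual_nearest_neighbors(a, b):
    """Return index pairs (i, j) where a[i] and b[j] are each other's nearest neighbor."""
    a_to_b = [nearest(p, b) for p in a]
    b_to_a = [nearest(q, a) for q in b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Data set A and data set B share one population near the origin.
# The remaining cell in each data set has no counterpart in the other.
dataset_a = [(0.0, 0.0), (10.0, 10.0)]
dataset_b = [(0.1, 0.0), (50.0, 50.0)]
print(mutual_nearest_neighbors(dataset_a, dataset_b))
```

Requiring the match to hold in both directions is what keeps unrelated cells from being paired: the second cell of data set A points at the first cell of data set B, but that cell's own nearest neighbor is elsewhere, so no pair is formed.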

Graph representation learning: a survey

APSIPA Transactions on Signal and Information Processing ◽

10.1017/atsip.2020.13 ◽

2020 ◽

Vol 9 ◽

Author(s):

Fenxiao Chen ◽

Yun-Cheng Wang ◽

Bin Wang ◽

C.-C. Jay Kuo

Keyword(s):

Graph Embedding ◽

Large Data ◽

Representation Learning ◽

Graph Representation ◽

Data Sets ◽

Graph Data ◽

Graph Properties ◽

Wide Range ◽

Regular Lattices ◽

Low Dimensional

Research on graph representation learning has received great attention in recent years since most data in real-world applications come in the form of graphs. High-dimensional graph data are often in irregular forms. They are more difficult to analyze than image/video/audio data defined on regular lattices. Various graph embedding techniques have been developed to convert the raw graph data into a low-dimensional vector representation while preserving the intrinsic graph properties. In this review, we first explain the graph embedding task and its challenges. Next, we review a wide range of graph embedding techniques with insights. Then, we evaluate several state-of-the-art methods against small and large data sets and compare their performance. Finally, potential applications and future directions are presented.

Evaluation of single-cell classifiers for single-cell RNA sequencing data sets

Briefings in Bioinformatics ◽

10.1093/bib/bbz096 ◽

2019 ◽

Vol 21 (5) ◽

pp. 1581-1595 ◽

Cited By ~ 6

Author(s):

Xinlei Zhao ◽

Shuang Wu ◽

Nan Fang ◽

Xiao Sun ◽

Jue Fan

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Reference Data ◽

Predictive Accuracy ◽

Cell Types ◽

Superior Performance ◽

Marker Genes ◽

Data Sets ◽

Sequencing Data ◽

Single Cell RNA Sequencing

Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
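The ensemble voting mentioned in this abstract can be sketched as a per-cell majority vote over the labels returned by several tools. The abstract does not specify the voting rule, so the handling of 'unassigned' labels and of ties below is an assumption, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def ensemble_vote(predictions, unassigned="unassigned"):
    """Majority vote per cell over several tools' predicted labels.

    `predictions` is a list with one label list per tool, all of equal length.
    Votes for `unassigned` are ignored; a tie or no votes yields `unassigned`.
    """
    consensus = []
    for labels in zip(*predictions):
        counts = Counter(label for label in labels if label != unassigned)
        ranked = counts.most_common(2)
        if not ranked:
            consensus.append(unassigned)
        elif len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            consensus.append(unassigned)  # no clear winner among the tools
        else:
            consensus.append(ranked[0][0])
    return consensus


# Labels from three hypothetical tools for the same four cells.
tool_1 = ["T", "B", "unassigned", "T"]
tool_2 = ["T", "NK", "unassigned", "B"]
tool_3 = ["B", "NK", "unassigned", "unassigned"]
print(ensemble_vote([tool_1, tool_2, tool_3]))
```

Returning 'unassigned' on a tie is a conservative choice: it avoids committing to a label when the tools disagree evenly, at the cost of leaving more cells unlabeled.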

High-throughput single-cell DNA sequencing of AML tumors with droplet microfluidics

10.1101/203158 ◽

2017 ◽

Cited By ~ 2

Author(s):

Maurizio Pellegrino ◽

Adam Sciambi ◽

Sebastian Treusch ◽

Robert Durruthy-Durruthy ◽

Kaustubh Gokhale ◽

...

Keyword(s):

Single Cell ◽

Clonal Evolution ◽

Droplet Microfluidics ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Selection For ◽

Genetic Profiles ◽

Generation Sequencing

To enable the characterization of genetic heterogeneity in tumor cell populations, we developed a novel microfluidic approach that barcodes amplified genomic DNA from thousands of individual cancer cells confined to droplets. The barcodes are then used to reassemble the genetic profiles of cells from next-generation sequencing data. Using this approach, we sequenced longitudinally collected AML tumor populations from two patients and genotyped up to 62 disease-relevant loci across more than 16,000 individual cells. Targeted single-cell sequencing was able to sensitively identify tumor cells during complete remission and uncovered complex clonal evolution within AML tumors that was not observable with bulk sequencing. We anticipate that this approach will make feasible the routine analysis of heterogeneity in AML, leading to improved stratification and therapy selection for the disease.

Scirpy: A Scanpy extension for analyzing single-cell T-cell receptor sequencing data

10.1101/2020.04.10.035865 ◽

2020 ◽

Author(s):

Gregor Sturm ◽

Tamas Szabo ◽

Georgios Fotakis ◽

Marlene Haider ◽

Dietmar Rieder ◽

...

Keyword(s):

T Cell ◽

Single Cell ◽

Large Scale ◽

Single Cells ◽

Cell Receptor ◽

Sequencing Data ◽

Seamless Integration ◽

T Cell Phenotypes ◽

Cell Phenotypes

Summary: Advances in single-cell technologies have enabled the investigation of T cell phenotypes and repertoires at unprecedented resolution and scale. Bioinformatic methods for the efficient analysis of these large-scale datasets are instrumental for advancing our understanding of adaptive immune responses in cancer, but also in infectious diseases like COVID-19. However, while well-established solutions are accessible for the processing of single-cell transcriptomes, no streamlined pipelines are available for the comprehensive characterization of T cell receptors. Here we propose Scirpy, a scalable Python toolkit that provides simplified access to the analysis and visualization of immune repertoires from single cells and seamless integration with transcriptomic data.

Availability and implementation: Scirpy source code and documentation are available at https://github.com/icbi-lab/scirpy.

DrivAER: Identification of driving transcriptional programs in single-cell RNA sequencing data

10.1101/864165 ◽

2019 ◽

Author(s):

Lukas M. Simon ◽

Fangfang Yan ◽

Zhongming Zhao

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Disease Status ◽

Data Sets ◽

Sequencing Data ◽

Functional Interpretation ◽

Recent Success ◽

Gene Sets ◽

Single Cell RNA Sequencing ◽

Cellular Maps

Single-cell RNA sequencing (scRNA-seq) unfolds complex transcriptomic data sets into detailed cellular maps. Despite recent success, there is a pressing need for specialized methods tailored towards the functional interpretation of these cellular maps. Here, we present DrivAER, a machine learning approach that scores annotated gene sets based on their relevance to user-specified outcomes such as pseudotemporal ordering or disease status. We demonstrate that DrivAER extracts the key driving pathways and transcription factors that regulate complex biological processes from scRNA-seq data.
