Differential abundance testing on single-cell data using k-nearest neighbor graphs

In high-dimensional single-cell data, changes in functional markers between conditions are typically compared across manual or algorithm-derived partitions based on population-defining markers. These partitions are commonly visualized on low-dimensional embeddings (e.g., t-SNE), colored by per-partition changes. Here, we provide an analysis and visualization tool that performs these comparisons across overlapping k-nearest neighbor (KNN) groupings. This allows one to color low-dimensional embeddings by marker changes without the hard boundaries imposed by partitioning. We devised an objective optimization of k based on minimizing functional marker KNN imputation error. Proof-of-concept work visualized the exact location of an IL-7 responsive subset in a B cell developmental trajectory on a t-SNE map, independent of clustering. Per-condition cell frequency analysis revealed that KNN groupings are sensitive to artifacts due to marker shift, and can therefore also be valuable in a quality control pipeline. Overall, we found that KNN groupings lead to useful multiple-condition visualizations and efficiently extract a large amount of information from mass cytometry data. Our software is publicly available through the Bioconductor package Sconify.
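
The per-neighborhood comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the Sconify implementation: the function names, the `"basal"`/`"stim"` condition labels, and the toy data are assumptions made for illustration, and neighbors are found by brute force. For each cell, it takes the k nearest neighbors in surface-marker space and reports the difference in mean functional-marker value between the two conditions within that neighborhood.

```python
import math

def knn_indices(points, k):
    """For each point, return the indices of its k nearest neighbors
    (brute-force Euclidean distance, the point itself excluded)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

def knn_marker_change(points, marker, condition, k):
    """Per cell: mean marker value of 'stim' neighbors minus mean of 'basal'
    neighbors. Returns None where a neighborhood lacks one of the conditions.
    The labels 'stim' and 'basal' are hypothetical."""
    changes = []
    for nbrs in knn_indices(points, k):
        stim = [marker[j] for j in nbrs if condition[j] == "stim"]
        basal = [marker[j] for j in nbrs if condition[j] == "basal"]
        if stim and basal:
            changes.append(sum(stim) / len(stim) - sum(basal) / len(basal))
        else:
            changes.append(None)
    return changes

# Toy data: two groups of cells; only the second group responds to stimulation.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
condition = ["basal", "stim"] * 4
marker = [1, 1, 1, 1, 1, 5, 1, 5]

changes = knn_marker_change(points, marker, condition, k=3)
print(changes)
```

Because every cell gets its own value, the result can be used directly to color an embedding, with no cluster boundaries involved.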

SPRING: a kinetic interface for visualizing high dimensional single-cell expression data

10.1101/090332 ◽

2016 ◽

Cited By ~ 10

Author(s):

Caleb Weinreb ◽

Samuel Wolock ◽

Allon Klein

Keyword(s):

Gene Expression ◽

Single Cell ◽

Nearest Neighbor ◽

High Dimensional ◽

K Nearest Neighbor ◽

Link Type ◽

Cell Gene Expression ◽

Graph Layouts ◽

Cell Expression ◽

Cell Data

Motivation: Single-cell gene expression profiling technologies can map the cell states in a tissue or organism. As these technologies become more common, there is a need for computational tools to explore the data they produce. In particular, existing data visualization approaches are imperfect for studying continuous gene expression topologies. Results: Force-directed layouts of k-nearest-neighbor graphs can visualize continuous gene expression topologies in a manner that preserves high-dimensional relationships and allows manual exploration of different stable two-dimensional representations of the same data. We implemented an interactive web tool, called SPRING, to visualize single-cell data using force-directed graph layouts. SPRING reveals more detailed biological relationships than existing approaches when applied to branching gene expression trajectories from hematopoietic progenitor cells. Visualizations from SPRING are also more reproducible than those of stochastic visualization methods such as tSNE, a state-of-the-art tool. Availability: https://kleintools.hms.harvard.edu/tools/spring.html, https://github.com/AllonKleinLab/SPRING/

Milo: differential abundance testing on single-cell data using k-NN graphs

10.1101/2020.11.23.393769 ◽

2020 ◽

Author(s):

Emma Dann ◽

Neil C. Henderson ◽

Sarah A. Teichmann ◽

Michael D. Morgan ◽

John C. Marioni

Keyword(s):

Single Cell ◽

Linear Models ◽

R Software ◽

Experimental Conditions ◽

Differential Abundance ◽

Biological Condition ◽

Precursor State ◽

Statistical Framework ◽

Wide Range ◽

Cell Data

Abstract: Single-cell omic protocols applied to disease, development or mechanistic studies can reveal the emergence of aberrant cell states or changes in differentiation. These perturbations can manifest as a shift in the abundance of cells associated with a biological condition. Current computational workflows for comparative analyses typically use discrete clusters as input when testing for differential abundance between experimental conditions. However, clusters are not always an optimal representation of the biological manifold on which cells lie, especially in the context of continuous differentiation trajectories. To overcome these barriers to discovery, we present Milo, a flexible and scalable statistical framework that performs differential abundance testing by assigning cells to partially overlapping neighbourhoods on a k-nearest neighbour graph. Our method samples and refines neighbourhoods across the graph and leverages the flexibility of generalized linear models, making it applicable to a wide range of experimental settings. Using simulations, we show that Milo is both robust and sensitive, and can reveal subtle but important cell state perturbations that are obscured by discretizing cells into clusters. We illustrate the power of Milo by identifying the perturbed differentiation during ageing of a lineage-biased thymic epithelial precursor state and by uncovering extensive perturbation to multiple lineages in human cirrhotic liver. Milo is provided as an open-source R software package with documentation and tutorials at https://github.com/MarioniLab/miloR.
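
The counting step behind neighbourhood-based differential abundance testing can be sketched in a few lines of Python. This is an illustrative sketch only and does not use the miloR API: the function name, the choice of index cells, and the `"young"`/`"old"` sample labels are assumptions, and the neighbourhood refinement and generalized linear model testing described in the abstract are omitted. Each neighbourhood is taken as an index cell plus its k nearest neighbours, and cells are counted per sample within it.

```python
import math
from collections import Counter

def neighbourhood_counts(points, sample_of, index_cells, k):
    """For each index cell, count cells per sample among the index cell
    and its k nearest neighbours (brute-force Euclidean distance)."""
    counts = []
    for i in index_cells:
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(points[i], points[j]), j),
        )
        members = [i] + others[:k]
        counts.append(Counter(sample_of[j] for j in members))
    return counts

# Toy data: two groups of cells with opposite sample composition.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
sample_of = ["young", "young", "young", "old", "old", "old", "old", "young"]

# Index cells 0 and 3 lie in the same group, so their neighbourhoods overlap.
counts = neighbourhood_counts(points, sample_of, index_cells=[0, 3, 4], k=3)
for c in counts:
    print(dict(c))
```

A count table of this shape (neighbourhoods by samples) is the kind of input a count-based model would then test for abundance differences between conditions.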

Optimal tuning of weighted kNN- and diffusion-based methods for denoising single cell genomics data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008569 ◽

2021 ◽

Vol 17 (1) ◽

pp. e1008569

Author(s):

Andreas Tjärnberg ◽

Omar Mahmood ◽

Christopher A. Jackson ◽

Giuseppe-Antonio Saldi ◽

Kyunghyun Cho ◽

...

Keyword(s):

Objective Function ◽

Single Cell ◽

Missing Values ◽

Nearest Neighbor ◽

Projection Methods ◽

Specific Gene ◽

K Nearest Neighbor ◽

Single Cell Genomics ◽

Nearest Neighbor Graph ◽

And Diffusion

The analysis of single-cell genomics data presents several statistical challenges, and extensive efforts have been made to produce methods for the analysis of this data that impute missing values, address sampling issues and quantify and correct for noise. In spite of such efforts, no consensus on best practices has been established and all current approaches vary substantially based on the available data and empirical tests. The k-Nearest Neighbor Graph (kNN-G) is often used to infer the identities of, and relationships between, cells and is the basis of many widely used dimensionality-reduction and projection methods. The kNN-G has also been the basis for imputation methods using, e.g., neighbor averaging and graph diffusion. However, due to the lack of an agreed-upon optimal objective function for choosing hyperparameters, these methods tend to oversmooth data, thereby resulting in a loss of information with regard to cell identity and the specific gene-to-gene patterns underlying regulatory mechanisms. In this paper, we investigate the tuning of kNN- and diffusion-based denoising methods with a novel non-stochastic method for optimally preserving biologically relevant informative variance in single-cell data. The framework, Denoising Expression data with a Weighted Affinity Kernel and Self-Supervision (DEWÄKSS), uses a self-supervised technique to tune its parameters. We demonstrate that denoising with optimal parameters selected by our objective function (i) is robust to preprocessing methods using data from established benchmarks, (ii) disentangles cellular identity and maintains robust clusters over dimension-reduction methods, (iii) maintains variance along several expression dimensions, unlike previous heuristic-based methods that tend to oversmooth data variance, and (iv) rarely involves diffusion but rather uses a fixed weighted kNN graph for denoising. Together, these findings provide a new understanding of kNN- and diffusion-based denoising methods. 
Code and example data for DEWÄKSS are available at https://gitlab.com/Xparx/dewakss/-/tree/Tjarnberg2020branch.
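
The idea of tuning k with a self-supervised objective can be illustrated with a small Python sketch. It is not the DEWÄKSS implementation: it uses an unweighted neighbor average, omits diffusion entirely, and the function names and toy data are assumptions. Each cell's value is predicted from its k nearest neighbors with the cell itself left out, and the k with the lowest mean squared prediction error is selected.

```python
import math

def leave_one_out_error(features, values, k):
    """Mean squared error when each value is predicted as the unweighted
    mean of its k nearest neighbors, excluding the cell itself."""
    total = 0.0
    for i, f in enumerate(features):
        others = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: (math.dist(f, features[j]), j),
        )
        nbrs = others[:k]
        pred = sum(values[j] for j in nbrs) / len(nbrs)
        total += (values[i] - pred) ** 2
    return total / len(features)

def choose_k(features, values, candidates):
    """Return the candidate k with the lowest leave-one-out error,
    together with the error for every candidate."""
    errors = {k: leave_one_out_error(features, values, k) for k in candidates}
    return min(errors, key=errors.get), errors

# Toy data: two well-separated groups of four cells, each with noisy expression.
features = [(0.0,), (1.0,), (2.0,), (3.0,), (100.0,), (101.0,), (102.0,), (103.0,)]
values = [1.0, 1.2, 0.8, 1.0, 5.0, 5.2, 4.8, 5.0]

best_k, errors = choose_k(features, values, candidates=[1, 3, 6])
print(best_k, errors)
```

In this toy case k = 1 underuses the available neighbors and k = 6 averages across the two groups (oversmoothing), so the intermediate value gives the lowest error.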

K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data

10.1101/217737 ◽

2017 ◽

Cited By ~ 39

Author(s):

Florian Wagner ◽

Yun Yan ◽

Itai Yanai

Keyword(s):

Single Cell ◽

High Throughput ◽

Nearest Neighbor ◽

Expression Profiles ◽

Simulated Data ◽

T Cell Subsets ◽

Cell Populations ◽

Rna Seq ◽

K Nearest Neighbor ◽

Heterogeneous Tissues

High-throughput single-cell RNA-Seq (scRNA-Seq) is a powerful approach for studying heterogeneous tissues and dynamic cellular processes. However, compared to bulk RNA-Seq, single-cell expression profiles are extremely noisy, as they only capture a fraction of the transcripts present in the cell. Here, we propose the k-nearest neighbor smoothing (kNN-smoothing) algorithm, designed to reduce noise by aggregating information from similar cells (neighbors) in a computationally efficient and statistically tractable manner. The algorithm is based on the observation that across protocols, the technical noise exhibited by UMI-filtered scRNA-Seq data closely follows Poisson statistics. Smoothing is performed by first identifying the nearest neighbors of each cell in a step-wise fashion, based on partially smoothed and variance-stabilized expression profiles, and then aggregating their transcript counts. We show that kNN-smoothing greatly improves the detection of clusters of cells and co-expressed genes, and clearly outperforms other smoothing methods on simulated data. To accurately perform smoothing for datasets containing highly similar cell populations, we propose the kNN-smoothing 2 algorithm, in which neighbors are determined after projecting the partially smoothed data onto the first few principal components. We show that unlike its predecessor, kNN-smoothing 2 can accurately distinguish between cells from different T cell subsets, and enables their identification in peripheral blood using unsupervised methods. Our work facilitates the analysis of scRNA-Seq data across a broad range of applications, including the identification of cell populations in heterogeneous tissues and the characterization of dynamic processes such as cellular differentiation. Reference implementations of our algorithms can be found at https://github.com/yanailab/knn-smoothing.
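
The aggregation at the core of neighbor smoothing can be sketched in Python as follows. This is a simplified single-pass illustration, not the reference implementation linked above: it skips the step-wise refinement and normalization, uses a plain square-root transform as a stand-in for variance stabilization, and the function name and toy counts are assumptions.

```python
import math

def knn_smooth(counts, k):
    """counts: one list of UMI counts per cell (same genes in every list).
    Returns, for each cell, its counts summed with those of its k nearest
    neighbors, where distance is Euclidean on square-root-transformed counts."""
    transformed = [[math.sqrt(c) for c in cell] for cell in counts]
    n_genes = len(counts[0])
    smoothed = []
    for i, t in enumerate(transformed):
        others = sorted(
            (j for j in range(len(counts)) if j != i),
            key=lambda j: (math.dist(t, transformed[j]), j),
        )
        group = [i] + others[:k]  # the cell itself plus its k neighbors
        smoothed.append([sum(counts[j][g] for j in group) for g in range(n_genes)])
    return smoothed

# Toy data: three cells expressing gene 0 and three cells expressing gene 1.
counts = [[10, 0], [8, 1], [9, 0], [0, 12], [1, 10], [0, 9]]

smoothed = knn_smooth(counts, k=2)
print(smoothed)
```

Summing raw counts, rather than averaging transformed values, keeps the smoothed profiles on a count scale, which fits the abstract's point that the technical noise closely follows Poisson statistics.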

A Hybrid Clustering Algorithm for Identifying Cell Types from Single-Cell RNA-Seq Data

Genes ◽

10.3390/genes10020098 ◽

2019 ◽

Vol 10 (2) ◽

pp. 98 ◽

Cited By ~ 11

Author(s):

Xiaoshu Zhu ◽

Hong-Dong Li ◽

Yunpei Xu ◽

Lilu Guo ◽

Fang-Xiang Wu ◽

...

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Nonnegative Matrix ◽

Cell Types ◽

Clustering Methods ◽

K Nearest Neighbor ◽

Functional Variation ◽

Minimization Principle ◽

Structural Entropy

Single-cell RNA sequencing (scRNA-seq) has recently brought new insight into cell differentiation processes and functional variation in cell subtypes from homogeneous cell populations. A lack of prior knowledge makes unsupervised machine learning methods, such as clustering, suitable for analyzing scRNA-seq data. However, there are several limitations to overcome, including high dimensionality, clustering result instability, and parameter adjustment complexity. In this study, we propose a method that combines structure entropy and k nearest neighbors to identify cell subpopulations in scRNA-seq data. In contrast to existing clustering methods for identifying cell subtypes, minimized structure entropy results in natural communities without specifying the number of clusters. To investigate the performance of our model, we applied it to eight scRNA-seq datasets and compared our method with three existing methods (nonnegative matrix factorization, single-cell interpretation via multikernel learning, and structural entropy minimization principle). The experimental results showed that our approach achieves, on average, better performance on these datasets than the benchmark methods.

Evaluating single-cell cluster stability using the Jaccard similarity index

Bioinformatics ◽

10.1093/bioinformatics/btaa956 ◽

2020 ◽

Author(s):

Ming Tang ◽

Yasin Kaymaz ◽

Brandon L Logeman ◽

Stephen Eichhorn ◽

Zhengzheng S Liang ◽

...

Keyword(s):

Single Cell ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Similarity Index ◽

Cell Types ◽

R Package ◽

Clustering Methods ◽

K Nearest Neighbor ◽

Jaccard Similarity ◽

Cluster Stability

Motivation: One major goal of single-cell RNA sequencing (scRNAseq) experiments is to identify novel cell types. With increasingly large scRNAseq datasets, unsupervised clustering methods can now produce detailed catalogues of transcriptionally distinct groups of cells in a sample. However, the interpretation of these clusters is challenging for both technical and biological reasons. Popular clustering algorithms are sensitive to parameter choices, and can produce different clustering solutions with even small changes in the number of principal components used, the k nearest neighbor and the resolution parameters, among others. Results: Here, we present a set of tools to evaluate cluster stability by subsampling, which can guide parameter choice and aid in biological interpretation. The R package scclusteval and the accompanying Snakemake workflow implement all steps of the pipeline: subsampling the cells, repeating the clustering with Seurat, estimating cluster stability using the Jaccard similarity index, and providing rich visualizations. Availability and implementation: R package scclusteval: https://github.com/crazyhottommy/scclusteval; Snakemake workflow: https://github.com/crazyhottommy/pyflow_seuratv3_parameter; Tutorial: https://crazyhottommy.github.io/EvaluateSingleCellClustering/.
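
The stability measure named in this abstract can be illustrated with a short Python sketch. It is not the scclusteval API (which is an R package): the function names and toy assignments are assumptions. For one subsampling round, each original cluster is restricted to the cells that survived subsampling and scored by its best Jaccard overlap with any cluster from the re-clustering.

```python
def jaccard(a, b):
    """Jaccard similarity index of two collections of cell IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def max_jaccard_per_cluster(original, resampled):
    """original, resampled: dicts mapping cell ID -> cluster label.
    Only cells present in both are compared. Returns, for each original
    cluster, its highest Jaccard index against any resampled cluster."""
    def members(assignments, cells):
        groups = {}
        for cell in cells:
            groups.setdefault(assignments[cell], set()).add(cell)
        return groups

    shared = set(original) & set(resampled)
    orig_groups = members(original, shared)
    new_groups = members(resampled, shared)
    return {
        label: max(jaccard(cells, other) for other in new_groups.values())
        for label, cells in orig_groups.items()
    }

# Toy example: cell c3 was dropped by subsampling and c6 switched clusters.
original = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B", "c6": "B"}
resampled = {"c1": "x", "c2": "x", "c4": "y", "c5": "y", "c6": "x"}

stability = max_jaccard_per_cluster(original, resampled)
print(stability)
```

Repeating this over many subsampling rounds gives a distribution of scores per cluster; clusters whose scores stay close to 1 are the ones that are robust to the perturbation.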

Triku: a feature selection method based on nearest neighbors for single-cell data

10.1101/2021.02.12.430764 ◽

2021 ◽

Author(s):

Alex M. Ascensión ◽

Olga Ibañez-Solé ◽

Inaki Inza ◽

Ander Izeta ◽

Marcos J. Araúzo-Bravo

Keyword(s):

Feature Selection ◽

Single Cell ◽

Nearest Neighbor ◽

Feature Selection Method ◽

Selection Method ◽

Cell Populations ◽

Neighbor Graph ◽

Gene Sets ◽

Nearest Neighbor Graph ◽

Cell Data

Abstract: Feature selection is a relevant step in the analysis of single-cell RNA sequencing datasets. Triku is a feature selection method that favours genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the nearest neighbor graph. Triku efficiently recovers cell populations present in artificial and biological benchmarking datasets, based on mutual information and silhouette coefficient measurements. Additionally, gene sets selected by triku are more likely to be related to relevant Gene Ontology terms, and contain fewer ribosomal and mitochondrial genes. Triku is available at https://gitlab.com/alexmascension/triku.

Optimal tuning of weighted kNN- and diffusion-based methods for denoising single cell genomics data

10.1101/2020.02.28.970202 ◽

2020 ◽

Author(s):

Andreas Tjärnberg ◽

Omar Mahmood ◽

Christopher A Jackson ◽

Giuseppe-Antonio Saldi ◽

Kyunghyun Cho ◽

...

Keyword(s):

Objective Function ◽

Single Cell ◽

Missing Values ◽

Nearest Neighbor ◽

Projection Methods ◽

Future Research ◽

Specific Gene ◽

K Nearest Neighbor ◽

Single Cell Genomics ◽

And Diffusion

Abstract: The analysis of single-cell genomics data presents several statistical challenges, and extensive efforts have been made to produce methods for the analysis of this data that impute missing values, address sampling issues and quantify and correct for noise. In spite of such efforts, no consensus on best practices has been established and all current approaches vary substantially based on the available data and empirical tests. The k-Nearest Neighbor Graph (kNN-G) is often used to infer the identities of, and relationships between, cells and is the basis of many widely used dimensionality-reduction and projection methods. The kNN-G has also been the basis for imputation methods using, e.g., neighbor averaging and graph diffusion. However, due to the lack of an agreed-upon optimal objective function for choosing hyperparameters, these methods tend to oversmooth data, thereby resulting in a loss of information with regard to cell identity and the specific gene-to-gene patterns underlying regulatory mechanisms. In this paper, we investigate the tuning of kNN- and diffusion-based denoising methods with a novel non-stochastic method for optimally preserving biologically relevant informative variance in single-cell data. The framework, Denoising Expression data with a Weighted Affinity Kernel and Self-Supervision (DEWÄKSS), uses a self-supervised technique to tune its parameters. We demonstrate that denoising with optimal parameters selected by our objective function (i) is robust to preprocessing methods using data from established benchmarks, (ii) disentangles cellular identity and maintains robust clusters over dimension-reduction methods, (iii) maintains variance along several expression dimensions, unlike previous heuristic-based methods that tend to oversmooth data variance, and (iv) rarely involves diffusion but rather uses a fixed weighted kNN graph for denoising. Together, these findings provide a new understanding of kNN- and diffusion-based denoising methods and serve as a foundation for future research. Code and example data for DEWÄKSS are available at https://gitlab.com/Xparx/dewakss/-/tree/Tjarnberg2020branch.

HGC: fast hierarchical clustering for large-scale single-cell data

10.1101/2021.02.07.430106 ◽

2021 ◽

Author(s):

Ziheng Zou ◽

Kui Hua ◽

Xuegong Zhang

Keyword(s):

Single Cell ◽

Hierarchical Clustering ◽

Large Scale ◽

Nearest Neighbor ◽

Linear Time ◽

Fixed Number ◽

Large Datasets ◽

Clustering Methods ◽

Shared Nearest Neighbor ◽

Cell Data

Abstract: Clustering is a key step in revealing heterogeneities in single-cell data. Cell heterogeneity can be explored at different resolutions, and the resulting cell states are inherently nested. However, most existing single-cell clustering methods output a fixed number of clusters without the hierarchical information. Classical hierarchical clustering provides a dendrogram of cells, but cannot scale to large datasets due to its high computational complexity. We present HGC, a fast Hierarchical Graph-based Clustering method to address both problems. It combines the advantages of graph-based clustering and hierarchical clustering. On the shared nearest neighbor graph of cells, HGC constructs the hierarchical tree with linear time complexity. Experiments showed that HGC enables multiresolution exploration of the biological hierarchy underlying the data, achieves state-of-the-art accuracy on benchmark data, and can scale to large datasets. HGC is freely available for academic use at https://www.github.com/XuegongLab/
