Graph Drawing-based Dimensionality Reduction to Identify Hidden Communities in Single-Cell Sequencing Spatial Representation

10.1101/2020.05.05.078550 ◽

2020 ◽

Author(s):

Alireza Khodadadi-Jamayran ◽

Aristotelis Tsirigos

Keyword(s):

Dimensionality Reduction ◽

Single Cell ◽

Large Scale ◽

Graph Drawing ◽

Dimensional Space ◽

K Nearest Neighbor ◽

Network Graph ◽

Gene Expressions ◽

Single Cell Sequencing ◽

Spring Force

Summary: With the rapid growth of single cell sequencing technologies, finding cell communities with high accuracy has become crucial for large scale projects. Employing the current commonly used dimensionality reduction techniques such as tSNE and UMAP, it is often difficult to clearly distinguish cell communities in high dimensional space. Usually cell communities with similar origin and trajectories cluster so closely to each other that their subtle but important differences do not become readily apparent. This creates a problem for clustering, as clustering is also performed on dimensionality reduction results. In order to identify such communities, scientists either perform broad clustering and then extract each cluster and perform re-clustering to identify sub-populations, or they over-cluster the data and then merge the clusters with similar gene expression. This is an incredibly cumbersome and time-consuming process. To solve this problem, we propose K-nearest-neighbor-based Network graph drawing Layout (KNetL, pronounced like ‘nettle’) for dimensionality reduction. In our method, we use force-directed graph drawing, whereby the attractive force (analogous to a spring force) and the repulsive force (analogous to an electrical force in atomic particles) between the cells are evaluated, and the cell communities are organized in a structural visualization. The coordinates of the force-compacted nodes are then extracted, and we employ dimensionality reduction methods, such as tSNE and UMAP, to unpack the nodes. The final plot, a KNetL map, shows a visually appealing and distinctive separation between cell communities. Our results show that KNetL maps bring significant resolution to visualizing and identifying otherwise hidden cell communities. All the algorithms are implemented in the iCellR package and available through the CRAN repository. The Single (i) Cell R package (iCellR) provides great flexibility at every step of the analysis pipeline, including normalization, clustering, dimensionality reduction, interactive 2D and 3D visualizations, batch alignment or data integration, imputation, and interactive cell gating tools, which allow users to manually gate around the cells.
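The first two stages described in this summary (a K-nearest-neighbor graph, then a force-directed layout with spring-like attraction and pairwise repulsion) can be sketched roughly as follows. This is an illustrative Python sketch only, not the iCellR implementation: the function names `knn_edges` and `force_layout`, the Fruchterman-Reingold-style force formulas, the step size and cooling rate, and the toy `cells` data are all assumptions. The final tSNE/UMAP "unpacking" step is omitted.

```python
import math
import random


def knn_edges(points, k):
    """Link every point to its k nearest neighbors (Euclidean) as undirected edges."""
    edges = set()
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _, j in ranked[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def force_layout(n_nodes, edges, iterations=200, seed=0):
    """Place nodes in 2D using spring attraction on edges and pairwise repulsion."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n_nodes)]
    ideal = 1.0 / math.sqrt(n_nodes)  # preferred edge length (assumed heuristic)
    step = 0.1                        # largest move allowed per iteration
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n_nodes)]
        # Repulsion: every pair of nodes pushes apart with magnitude ideal^2 / distance.
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = ideal * ideal / dist
                disp[i][0] += dx / dist * push
                disp[i][1] += dy / dist * push
                disp[j][0] -= dx / dist * push
                disp[j][1] -= dy / dist * push
        # Attraction: each edge pulls its endpoints together with magnitude distance^2 / ideal.
        for i, j in edges:
            dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist * dist / ideal
            disp[i][0] -= dx / dist * pull
            disp[i][1] -= dy / dist * pull
            disp[j][0] += dx / dist * pull
            disp[j][1] += dy / dist * pull
        # Move each node along its net force, capped at the current step size.
        for i in range(n_nodes):
            length = math.hypot(disp[i][0], disp[i][1]) or 1e-9
            scale = min(length, step) / length
            pos[i][0] += disp[i][0] * scale
            pos[i][1] += disp[i][1] * scale
        step *= 0.98  # cool down so the layout settles
    return pos


# Toy data (hypothetical): two well-separated groups of six "cells" each.
cells = [[0.1 * i, 0.0, 0.0] for i in range(6)] + [[100.0 + 0.1 * i, 0.0, 0.0] for i in range(6)]
edges = knn_edges(cells, k=2)
coords = force_layout(len(cells), edges)
```

In this toy input the KNN graph has no edges between the two groups, so only repulsion acts across them while the springs hold each group together. In a pipeline like the one the summary describes, `coords` would then be passed to tSNE or UMAP.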

A statistical test on single-cell data reveals widespread recurrent mutations in tumor evolution

10.1101/094722 ◽

2016 ◽

Cited By ~ 3

Author(s):

Jack Kuipers ◽

Katharina Jahn ◽

Benjamin J. Raphael ◽

Niko Beerenwinkel

Keyword(s):

Single Cell ◽

Large Scale ◽

Tumor Evolution ◽

Sequencing Data ◽

General Validity ◽

Genomic Deletions ◽

Single Cell Sequencing ◽

Statistical Framework ◽

Recurrent Mutations ◽

Complex Models

The infinite sites assumption, which states that every genomic position mutates at most once over the lifetime of a tumor, is central to current approaches for reconstructing mutation histories of tumors, but has never been tested explicitly. We developed a rigorous statistical framework to test the assumption with single-cell sequencing data. The framework accounts for the high noise and contamination present in such data. We found strong evidence for recurrent mutations at the same site in 8 out of 9 single-cell sequencing datasets from human tumors. Six cases involved the loss of earlier mutations, five of which occurred at sites unaffected by large scale genomic deletions. Two cases exhibited parallel mutation, including the dataset with the strongest evidence of recurrence. Our results refute the general validity of the infinite sites assumption and indicate that more complex models are needed to adequately quantify intra-tumor heterogeneity.
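To make the infinite sites assumption concrete: with error-free binary genotypes and an unmutated root, the assumption implies a perfect phylogeny, which holds exactly when no pair of sites shows all three patterns (1,0), (0,1) and (1,1) across cells. The sketch below checks that textbook condition. It is not the authors' statistical framework, which is designed precisely because real single-cell data are noisy; the function name `conflicting_site_pairs` and the toy matrix are hypothetical.

```python
from itertools import combinations


def conflicting_site_pairs(matrix):
    """Return pairs of sites that cannot both fit a perfect phylogeny.

    matrix[c][s] is 1 if cell c carries a mutation at site s, else 0.
    Assuming an unmutated root, sites a and b conflict when the cells
    jointly show all three patterns (1, 0), (0, 1) and (1, 1).
    """
    n_sites = len(matrix[0]) if matrix else 0
    conflicts = []
    for a, b in combinations(range(n_sites), 2):
        seen = {(row[a], row[b]) for row in matrix}
        if {(1, 0), (0, 1), (1, 1)} <= seen:
            conflicts.append((a, b))
    return conflicts


# Toy genotype matrix (hypothetical): 4 cells x 3 sites.
genotypes = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
print(conflicting_site_pairs(genotypes))  # [(0, 1)]
```

In noise-free data, a conflicting pair such as sites 0 and 1 here would point to either a parallel mutation or the loss of an earlier mutation, the two kinds of recurrence the abstract reports.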

Integrating single-cell datasets with ambiguous batch information by incorporating molecular network features

Briefings in Bioinformatics ◽

10.1093/bib/bbab366 ◽

2021 ◽

Author(s):

Ji Dong ◽

Peijie Zhou ◽

Yichong Wu ◽

Yidong Chen ◽

Haoling Xie ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Developmental Stages ◽

Rapid Development ◽

Molecular Network ◽

Rna Seq ◽

Single Cell Sequencing ◽

The World ◽

Information Score ◽

Simple Network

Abstract: With the rapid development of single-cell sequencing techniques, several large-scale cell atlas projects have been launched across the world. However, it is still challenging to integrate single-cell RNA-seq (scRNA-seq) datasets with diverse tissue sources, developmental stages and/or few overlaps, due to the ambiguity in determining the batch information, which is particularly important for current batch-effect correction methods. Here, we present SCORE, a simple network-based integration methodology, which incorporates curated molecular network features to infer cellular states and generate a unified workflow for integrating scRNA-seq datasets. Validating on real single-cell datasets, we showed that regardless of batch information, SCORE outperforms existing methods in accuracy, robustness, scalability and data integration.

Deep soft K-means clustering with self-training for single-cell RNA sequence data

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa039 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 2

Author(s):

Liang Chen ◽

Weinan Wang ◽

Yuyao Zhai ◽

Minghua Deng

Keyword(s):

Deep Learning ◽

Single Cell ◽

Large Scale ◽

Sequence Data ◽

Dimensional Space ◽

Expression Profiles ◽

Single Cells ◽

Clustering Algorithms ◽

Training Procedure ◽

Latent Space

Abstract: Single-cell RNA sequencing (scRNA-seq) allows researchers to study cell heterogeneity at the cellular level. A crucial step in analyzing scRNA-seq data is to cluster cells into subpopulations to facilitate subsequent downstream analysis. However, frequent dropout events and the increasing size of scRNA-seq data make clustering such high-dimensional, sparse and massive transcriptional expression profiles challenging. Although some existing deep learning-based clustering algorithms for single cells combine dimensionality reduction with clustering, they either ignore the distance and affinity constraints between similar cells or make additional latent space assumptions, such as a Gaussian mixture distribution, and so fail to learn a cluster-friendly low-dimensional space. Therefore, in this paper, we combine the deep learning technique with the use of a denoising autoencoder to characterize scRNA-seq data, and propose a soft self-training K-means algorithm to cluster the cell population in the learned latent space. The self-training procedure can effectively aggregate similar cells and pursue a more cluster-friendly latent space. Our method, called ‘scziDesk’, alternates between data compression, data reconstruction and soft clustering, and the results exhibit excellent compatibility and robustness in both simulated and real data. Moreover, our proposed method has perfect scalability in line with cell size on large-scale datasets.
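For readers unfamiliar with soft K-means, the generic idea is that each point receives a weight for every cluster center, and the centers are updated as weighted means. The sketch below is a minimal, generic version, not the scziDesk objective: it skips the denoising autoencoder and the self-training procedure, works on raw coordinates instead of a learned latent space, and the function names, `temperature` parameter and toy points are all assumptions.

```python
import math


def soft_assignments(points, centers, temperature=1.0):
    """Softmax over negative squared distances: one weight per (point, center)."""
    weights = []
    for p in points:
        scores = [-sum((a - b) ** 2 for a, b in zip(p, c)) / temperature for c in centers]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def update_centers(points, weights, centers):
    """Move each center to the weighted mean of all points."""
    dim = len(points[0])
    updated = []
    for k, old in enumerate(centers):
        mass = sum(w[k] for w in weights)
        if mass == 0.0:
            updated.append(list(old))  # keep a center that attracts no weight
            continue
        updated.append(
            [sum(w[k] * p[d] for p, w in zip(points, weights)) / mass for d in range(dim)]
        )
    return updated


def soft_kmeans(points, centers, iterations=20, temperature=1.0):
    """Alternate soft assignment and center updates for a fixed number of rounds."""
    for _ in range(iterations):
        weights = soft_assignments(points, centers, temperature)
        centers = update_centers(points, weights, centers)
    return centers, soft_assignments(points, centers, temperature)


# Toy 2D points (hypothetical), standing in for cells in a latent space.
latent = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, weights = soft_kmeans(latent, [[0.0, 0.0], [10.0, 10.0]])
```

A lower `temperature` makes the assignments harder (closer to ordinary K-means), while a higher value spreads each point's weight across more centers.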

Large-scale simultaneous measurement of epitopes and transcriptomes in single cells

10.1101/113068 ◽

2017 ◽

Cited By ~ 10

Author(s):

Marlon Stoeckius ◽

Christoph Hafemeister ◽

William Stephenson ◽

Brian Houck-Loomis ◽

Pratip K. Chattopadhyay ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Single Cells ◽

Cellular Proteins ◽

Surface Markers ◽

Cell Surface Markers ◽

Complex Cell ◽

Single Cell Sequencing ◽

Protein Levels ◽

Phenotypic Information

Recent high-throughput single-cell sequencing approaches have been transformative for understanding complex cell populations, but are unable to provide additional phenotypic information, such as protein levels of cell-surface markers. Using oligonucleotide-labeled antibodies, we integrate measurements of cellular proteins and transcriptomes into an efficient, sequencing-based readout of single cells. This method is compatible with existing single-cell sequencing approaches and will readily scale as the throughput of these methods increases.

scGAE: topology-preserving dimensionality reduction for single-cell RNA-seq data using graph autoencoder

10.1101/2021.02.16.431357 ◽

2021 ◽

Author(s):

Zixiang Luo ◽

Chenyu Xu ◽

Zhen Zhang ◽

Wenfei Jin

Keyword(s):

Dimensionality Reduction ◽

Single Cell ◽

Topological Structure ◽

Dimensional Space ◽

Simulated Data ◽

Oriented Graph ◽

Developmental Trajectory ◽

Structure Information ◽

Low Dimensional ◽

Cell Graph

Abstract: Dimensionality reduction is crucial for the visualization and interpretation of high-dimensional single-cell RNA sequencing (scRNA-seq) data. However, preserving the topological structure among cells in low-dimensional space remains a challenge. Here, we present the single-cell graph autoencoder (scGAE), a dimensionality reduction method that preserves topological structure in scRNA-seq data. scGAE builds a cell graph and uses a multitask-oriented graph autoencoder to preserve topological structure information and feature information in scRNA-seq data simultaneously. We further extended scGAE for scRNA-seq data visualization, clustering, and trajectory inference. Analyses of simulated data showed that scGAE accurately reconstructs developmental trajectory and separates discrete cell clusters under different scenarios, outperforming recently developed deep learning methods. Furthermore, implementation of scGAE on empirical data showed that scGAE provided novel insights into cell developmental lineages and preserved inter-cluster distances.

RobustClone: A robust PCA method of tumor clone and evolution inference from single-cell sequencing data

10.1101/666271 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ziwei Chen ◽

Fuzhou Gong ◽

Liang Ma ◽

Lin Wan

Keyword(s):

Single Cell ◽

Large Scale ◽

Principal Component ◽

Low Rank ◽

Breast Cancer Dataset ◽

Sequencing Data ◽

Cancer Dataset ◽

Large Reservoir ◽

Single Cell Sequencing ◽

Model Free

Abstract: Single-cell sequencing (SCS) data provide unprecedented insights into intratumoral heterogeneity. With SCS, we can better characterize clonal genotypes and build phylogenetic relationships of tumor cells/clones. However, high technical errors bring much noise into the genetic data, thus limiting the application of evolutionary tools in the large reservoir. To recover the low-dimensional subspace of tumor subpopulations from error-prone SCS data in the presence of corrupted and/or missing elements, we developed an efficient computational framework, termed RobustClone, to recover the true genotypes of subclones based on the low-rank matrix factorization method of extended robust principal component analysis (RPCA) and reconstruct the subclonal evolutionary tree. RobustClone is a model-free method, fast and scalable to large-scale datasets. We conducted a set of systematic evaluations on simulated datasets and demonstrated that RobustClone outperforms state-of-the-art methods, both in accuracy and efficiency. We further validated RobustClone on 2 single-cell SNV and 2 single-cell CNV datasets and demonstrated that RobustClone could recover genotype matrix and infer the subclonal evolution tree accurately under various scenarios. In particular, RobustClone revealed the spatial progression patterns of subclonal evolution on the large-scale 10X Genomics scCNV breast cancer dataset. RobustClone software is available at https://github.com/ucasdp/RobustClone.

D-EE: Distributed software for visualizing intrinsic structure of large-scale single-cell data

GigaScience ◽

10.1093/gigascience/giaa126 ◽

2020 ◽

Vol 9 (11) ◽

Cited By ~ 1

Author(s):

Shaokun An ◽

Jizu Huang ◽

Lin Wan

Keyword(s):

Time Series ◽

Dimensionality Reduction ◽

Single Cell ◽

High Performance ◽

Large Scale ◽

Distributed Storage ◽

Distributed Computation ◽

Low Dimensional ◽

Cell Data ◽

Performance Computing

Abstract: Background: Dimensionality reduction and visualization play vital roles in single-cell RNA sequencing (scRNA-seq) data analysis. While they have been extensively studied, state-of-the-art dimensionality reduction algorithms are often unable to preserve the global structures underlying data. Elastic embedding (EE), a nonlinear dimensionality reduction method, has shown promise in revealing low-dimensional intrinsic local and global data structure. However, the current implementation of the EE algorithm lacks scalability to large-scale scRNA-seq data. Results: We present a distributed optimization implementation of the EE algorithm, termed distributed elastic embedding (D-EE). D-EE reveals the low-dimensional intrinsic structures of data with accuracy equal to that of elastic embedding, and it is scalable to large-scale scRNA-seq data. It leverages distributed storage and distributed computation, achieving memory efficiency and high-performance computing simultaneously. In addition, an extended version of D-EE, termed distributed optimization implementation of time-series elastic embedding (D-TSEE), enables the user to visualize large-scale time-series scRNA-seq data by incorporating experimentally temporal information. Results with large-scale scRNA-seq data indicate that D-TSEE can uncover oscillatory gene expression patterns by using experimentally temporal information. Conclusions: D-EE is a distributed dimensionality reduction and visualization tool. Its distributed storage and distributed computation technique allow us to efficiently analyze large-scale single-cell data at the cost of constant time speedup. The source code for D-EE algorithm based on C and MPI tailored to a high-performance computing cluster is available at https://github.com/ShaokunAn/D-EE.

RobustClone: a robust PCA method for tumor clone and evolution inference from single-cell sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa172 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3299-3306

Author(s):

Ziwei Chen ◽

Fuzhou Gong ◽

Lin Wan ◽

Liang Ma

Keyword(s):

Single Cell ◽

Large Scale ◽

Clonal Evolution ◽

Low Rank ◽

Supplementary Information ◽

Breast Cancer Dataset ◽

Sequencing Data ◽

Cancer Dataset ◽

Single Cell Sequencing ◽

Model Free

Abstract: Motivation: Single-cell sequencing (SCS) data provide unprecedented insights into intratumoral heterogeneity. With SCS, we can better characterize clonal genotypes and reconstruct phylogenetic relationships of tumor cells/clones. However, SCS data are often error-prone, making their computational analysis challenging. Results: To infer the clonal evolution in tumor from the error-prone SCS data, we developed an efficient computational framework, termed RobustClone. It recovers the true genotypes of subclones based on the extended robust principal component analysis, a low-rank matrix decomposition method, and reconstructs the subclonal evolutionary tree. RobustClone is a model-free method, which can be applied to both single-cell single nucleotide variation (scSNV) and single-cell copy-number variation (scCNV) data. It is efficient and scalable to large-scale datasets. We conducted a set of systematic evaluations on simulated datasets and demonstrated that RobustClone outperforms state-of-the-art methods in large-scale data both in accuracy and efficiency. We further validated RobustClone on two scSNV and two scCNV datasets and demonstrated that RobustClone could recover genotype matrix and infer the subclonal evolution tree accurately under various scenarios. In particular, RobustClone revealed the spatial progression patterns of subclonal evolution on the large-scale 10X Genomics scCNV breast cancer dataset. Availability and implementation: RobustClone software is available at https://github.com/ucasdp/RobustClone. Supplementary information: Supplementary data are available at Bioinformatics online.

Immunology Driven by Large-Scale Single-Cell Sequencing

Trends in Immunology ◽

10.1016/j.it.2019.09.004 ◽

2019 ◽

Vol 40 (11) ◽

pp. 1011-1021 ◽

Cited By ~ 27

Author(s):

Tomás Gomes ◽

Sarah A. Teichmann ◽

Carlos Talavera-López

Keyword(s):

Single Cell ◽

Large Scale ◽

Single Cell Sequencing

MultiMAP: Dimensionality Reduction and Integration of Multimodal Data

10.1101/2021.02.16.431421 ◽

2021 ◽

Author(s):

Mika Sarkin Jain ◽

Krzysztof Polanski ◽

Cecilia Dominguez Conde ◽

Xi Chen ◽

Jongeun Park ◽

...

Keyword(s):

Transcription Factor ◽

Dimensionality Reduction ◽

Single Cell ◽

Spatial Data ◽

Cell Biology ◽

Dimensional Space ◽

Linear Mapping ◽

Structure Alignment ◽

Dimensional Structure ◽

Multimodal Data

Abstract: Multimodal data is rapidly growing in many fields of science and engineering, including single-cell biology. We introduce MultiMAP, an approach for dimensionality reduction and integration of multiple datasets. MultiMAP recovers a single manifold on which all of the data resides and then projects the data into a single low-dimensional space so as to preserve the structure of the manifold. It is based on a framework of Riemannian geometry and algebraic topology, and generalizes the popular UMAP algorithm [1] to the multimodal setting. MultiMAP can be used for visualization of multimodal data, and as an integration approach that enables joint analyses. MultiMAP has several advantages over existing integration strategies for single-cell data, including that MultiMAP can integrate any number of datasets, leverages features that are not present in all datasets (i.e. datasets can be of different dimensionalities), is not restricted to a linear mapping, can control the influence of each dataset on the embedding, and is extremely scalable to large datasets. We apply MultiMAP to the integration of a variety of single-cell transcriptomics, chromatin accessibility, methylation, and spatial data, and show that it outperforms current approaches in preservation of high-dimensional structure, alignment of datasets, visual separation of clusters, transfer learning, and runtime. On a newly generated single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) and single-cell RNA-seq (scRNA-seq) dataset of the human thymus, we use MultiMAP to integrate cells along a temporal trajectory. This enables the quantitative comparison of transcription factor expression and binding site accessibility over the course of T cell differentiation, revealing patterns of transcription factor kinetics.
