scGAE: topology-preserving dimensionality reduction for single-cell RNA-seq data using graph autoencoder

Abstract: Dimensionality reduction is crucial for the visualization and interpretation of high-dimensional single-cell RNA sequencing (scRNA-seq) data. However, preserving the topological structure among cells in low-dimensional space remains a challenge. Here, we present the single-cell graph autoencoder (scGAE), a dimensionality reduction method that preserves topological structure in scRNA-seq data. scGAE builds a cell graph and uses a multitask-oriented graph autoencoder to preserve topological structure information and feature information in scRNA-seq data simultaneously. We further extended scGAE for scRNA-seq data visualization, clustering, and trajectory inference. Analyses of simulated data showed that scGAE accurately reconstructs developmental trajectories and separates discrete cell clusters under different scenarios, outperforming recently developed deep learning methods. Furthermore, applying scGAE to empirical data provided novel insights into cell developmental lineages and preserved inter-cluster distances.
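
The cell-graph step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how scGAE constructs its graph, so the k-nearest-neighbour rule, the Euclidean metric, the function name `knn_graph`, and the toy expression profiles below are all assumptions made for illustration; the autoencoder itself is not shown.

```python
import math

def knn_graph(cells, k):
    """Return a symmetric 0/1 adjacency matrix linking each cell to its k nearest neighbours."""
    n = len(cells)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Euclidean distance from cell i to every other cell, closest first.
        dists = sorted(
            (math.dist(cells[i], cells[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # symmetrize: keep an edge if either cell selects the other
    return adj

# Hypothetical toy expression profiles: two well-separated groups of three cells.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
adj = knn_graph(cells, k=2)
```

In this toy case every cell is linked only to the two other cells of its own group, so the graph keeps the two clusters disconnected. A graph autoencoder would then be trained to reconstruct such an adjacency matrix from a low-dimensional embedding.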

A topology-preserving dimensionality reduction method for single-cell RNA-seq data using graph autoencoder

Scientific Reports ◽ 10.1038/s41598-021-99003-7 ◽ 2021 ◽ Vol 11 (1)

Author(s): Zixiang Luo ◽ Chenyu Xu ◽ Zhen Zhang ◽ Wenfei Jin

Keyword(s): Dimensionality Reduction ◽ Single Cell ◽ Topological Structure ◽ Reduction Method ◽ Dimensional Space ◽ Oriented Graph ◽ Developmental Trajectory ◽ Structure Information ◽ Dimensionality Reduction Method ◽ Cell Graph

A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data

10.1101/2021.11.15.468695 ◽ 2021

Author(s): Snehalika Lall ◽ Sumanta Ray ◽ Sanghamitra Bandyopadhyay

Keyword(s): Single Cell ◽ Dimensional Space ◽ Homogeneous Grouping ◽ Cell Clustering ◽ A Cell ◽ Clustering Approach ◽ Low Dimensional ◽ Cell To Cell Variability ◽ Cell Cell ◽ Cell Graph

Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. Various issues in single-cell sequencing affect the homogeneous grouping (clustering) of cells, such as the small amount of starting RNA, limited per-cell sequenced reads, cell-to-cell variability due to cell cycle, cellular morphology, and variable reagent concentrations. Moreover, single-cell data is susceptible to technical noise, which affects the quality of genes (or features) selected or extracted prior to clustering. Here we introduce sc-CGconv (copula based graph convolution network for single cell clustering), a stepwise, robust, unsupervised feature extraction and clustering approach that formulates and aggregates cell-cell relationships using copula correlation (Ccor), followed by a graph convolution network based clustering approach. sc-CGconv formulates a cell-cell graph using Ccor, and this graph is learned by a graph convolution network, a graph-based artificial intelligence model. The learned representation (low-dimensional embedding) is utilized for cell clustering. sc-CGconv offers the following advantages: (a) it works with substantially smaller sample sizes to identify homogeneous clusters; (b) it can model the expression co-variability of a large number of genes, thereby outperforming state-of-the-art gene selection/extraction methods for clustering; (c) it preserves the cell-to-cell variability within the selected gene set by constructing a cell-cell graph through the copula correlation measure; (d) it provides a topology-preserving embedding of cells in low-dimensional space. The source code and usage information are available at https://github.com/Snehalikalall/CopulaGCN.
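
To make the graph convolution step concrete, here is a minimal sketch of a single propagation layer on a cell-cell graph. It uses the common symmetric normalization with self-loops and a ReLU; whether sc-CGconv uses exactly this form is an assumption, and the copula correlation step that would produce the graph is not shown. The adjacency matrix, features, weights, and function names are hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(normalized_adjacency @ features @ weights)."""
    propagated = matmul(matmul(normalize_adjacency(adj), features), weights)
    return [[max(0.0, v) for v in row] for row in propagated]

# Hypothetical 3-cell path graph (0 - 1 - 2), 2 features per cell, 1 output unit.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0], [1.0]]
embedding = gcn_layer(adj, features, weights)
```

Each output row mixes a cell's own features with those of its graph neighbours, which is why the learned embedding reflects the graph topology as well as the expression values.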

Ensemble dimensionality reduction and feature gene extraction for single-cell RNA-seq data

Nature Communications ◽ 10.1038/s41467-020-19465-7 ◽ 2020 ◽ Vol 11 (1)

Author(s): Xiaoxiao Sun ◽ Yiwen Liu ◽ Lingling An

Keyword(s): Dimensionality Reduction ◽ Single Cell ◽ Dimensional Space ◽ Essential Feature ◽ Empirical Studies ◽ Expression Patterns ◽ Cell Types ◽ Stochastic Gradient Descent ◽ Reduction Techniques ◽ Low Dimensional

Abstract: Single-cell RNA sequencing (scRNA-seq) technologies allow researchers to uncover the biological states of a single cell at high resolution. For computational efficiency and easy visualization, dimensionality reduction is necessary to capture gene expression patterns in low-dimensional space. Here we propose an ensemble method for simultaneous dimensionality reduction and feature gene extraction (EDGE) of scRNA-seq data. Unlike existing dimensionality reduction techniques, the proposed method implements an ensemble learning scheme that uses a massive number of weak learners for an accurate similarity search. Based on the similarity matrix constructed by those weak learners, the low-dimensional embedding of the data is estimated and optimized through spectral embedding and stochastic gradient descent. Comprehensive simulation and empirical studies show that EDGE is well suited for finding meaningful organization of cells, detecting rare cell types, and identifying essential feature genes associated with certain cell types.
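
The idea of building a similarity matrix from many weak learners can be sketched as follows. The abstract does not specify what EDGE's weak learners are, so the random single-gene threshold split used here is an assumption chosen for simplicity, as are the function name `ensemble_similarity`, its parameters, and the toy data. Similarity between two cells is taken as the fraction of learners that place them on the same side of the split.

```python
import random

def ensemble_similarity(cells, n_learners=200, seed=0):
    """Fraction of random one-gene splits that put each pair of cells on the same side."""
    rng = random.Random(seed)
    n, n_genes = len(cells), len(cells[0])
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_learners):
        # Weak learner (assumed form): pick one gene and one random threshold.
        gene = rng.randrange(n_genes)
        values = [cell[gene] for cell in cells]
        threshold = rng.uniform(min(values), max(values))
        side = [v > threshold for v in values]
        for i in range(n):
            for j in range(n):
                if side[i] == side[j]:
                    counts[i][j] += 1
    return [[c / n_learners for c in row] for row in counts]

# Hypothetical expression values: cells 0 and 1 are alike, cell 2 is very different.
cells = [(1.0, 1.0, 1.0), (1.2, 1.1, 1.3), (5.0, 6.0, 7.0)]
similarity = ensemble_similarity(cells)
```

Each individual split is a poor classifier, but averaging over many of them yields a similarity matrix in which nearby cells score higher than distant ones. Such a matrix could then be passed to a spectral embedding step.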

A Statistical Approach to Dimensionality Reduction Reveals Scale and Structure in scRNA-seq Data

10.1101/2020.11.18.389031 ◽ 2020

Author(s): Eric Johnson ◽ William Kath ◽ Madhav Mani

Keyword(s): Dimensionality Reduction ◽ Statistical Approach ◽ Dimensional Space ◽ Developmental Trajectory ◽ Data Sets ◽ Cell Type ◽ Biologically Relevant ◽ Trajectory Reconstruction ◽ Statistical Framework ◽ Low Dimensional

Abstract: Single-cell RNA sequencing (scRNA-seq) experiments often measure thousands of genes, making them high-dimensional data sets. As a result, dimensionality reduction (DR) algorithms such as t-SNE and UMAP are necessary for data visualization. However, the use of DR methods in other tasks, such as for cell-type detection or developmental trajectory reconstruction, is stymied by unquantified non-linear and stochastic deformations in the mapping from the high- to low-dimensional space. In this work, we present a statistical framework for the quantification of embedding quality so that DR algorithms can be used with confidence in unsupervised applications. Specifically, this framework generates a local assessment of embedding quality by statistically integrating information across embeddings. Furthermore, the approach separates biological signal from noise via the construction of an empirical null hypothesis. Using this approach on scRNA-seq data reveals biologically relevant structure and suggests a novel “spectral” decomposition of data. We apply the framework to several data sets and DR methods, illustrating its robustness and flexibility as well as its widespread utility in several quantitative applications.

Optimizing the structure and movement of a robotic bat with biological kinematic synergies

The International Journal of Robotics Research ◽ 10.1177/0278364918804654 ◽ 2018 ◽ Vol 37 (10) ◽ pp. 1233-1252 ◽ Cited By ~ 2

Author(s): Jonathan Hoff ◽ Alireza Ramezani ◽ Soon-Jo Chung ◽ Seth Hutchinson

Keyword(s): Principal Component Analysis ◽ Topological Structure ◽ Degrees Of Freedom ◽ Dimensional Space ◽ Principal Component ◽ Biologically Inspired ◽ Wing Kinematics ◽ Wing Motion ◽ Kinematic Synergies ◽ Low Dimensional

In this article, we present methods to optimize the design and flight characteristics of a biologically inspired bat-like robot. In previous work, we designed the topological structure for the wing kinematics of this robot; here we present methods to optimize the geometry of this structure and to compute actuator trajectories such that its wingbeat pattern closely matches biological counterparts. Our approach is motivated by recent studies on biological bat flight that have shown that the salient aspects of wing motion can be accurately represented in a low-dimensional space. Although bats have over 40 degrees of freedom (DoFs), our robot possesses several biologically meaningful morphing specializations. We use principal component analysis (PCA) to characterize the two most dominant modes of biological bat flight kinematics, and we optimize our robot’s parametric kinematics to mimic these. The method yields a robot that is reduced from five degrees of actuation (DoAs) to just three, and that actively folds its wings within a wingbeat period. As a result of mimicking synergies, the robot produces an average net lift improvement of 89% over the same robot when its wings cannot fold.
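
The PCA step described above, extracting the two most dominant modes from joint-angle recordings, can be sketched with a small dependency-free example. The joint-angle data below are synthetic (two hidden periodic signals driving four hypothetical joints), not the authors' bat measurements, and the power-iteration solver and all function names are illustrative choices rather than the paper's implementation.

```python
import math

def covariance(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def power_iteration(mat, iters=200):
    """Dominant eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    d = len(mat)
    start_norm = math.sqrt(sum((k + 1) ** 2 for k in range(d)))
    v = [(k + 1) / start_norm for k in range(d)]
    for _ in range(iters):
        w = [sum(mat[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigval = sum(v[a] * mat[a][b] * v[b] for a in range(d) for b in range(d))
    return eigval, v

def two_dominant_modes(rows):
    """First two principal directions and the fraction of variance they explain."""
    cov = covariance(rows)
    d = len(cov)
    val1, vec1 = power_iteration(cov)
    # Deflation: remove the first mode so the next iteration finds the second.
    deflated = [[cov[a][b] - val1 * vec1[a] * vec1[b] for b in range(d)]
                for a in range(d)]
    val2, vec2 = power_iteration(deflated)
    total = sum(cov[k][k] for k in range(d))
    return vec1, vec2, (val1 + val2) / total

# Synthetic wingbeat: four joint angles driven by two underlying synergies.
steps = 50
angles = []
for t in range(steps):
    s = math.sin(2 * math.pi * t / steps)
    c = math.cos(2 * math.pi * t / steps)
    angles.append([2.0 * s, 1.0 * s, 0.5 * c, 0.25 * c])

mode1, mode2, explained = two_dominant_modes(angles)
```

Because the synthetic joints are driven by only two signals, the two modes capture essentially all of the variance. Real kinematic data would leave some residual variance, and the explained fraction indicates how much is lost by keeping two modes.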

Network Embedding via a Bi-Mode and Deep Neural Network Model

10.20944/preprints201712.0156.v1 ◽ 2017

Author(s): Yang Fang ◽ Xiang Zhao ◽ Zhen Tan

Keyword(s): Neural Network ◽ Deep Neural Network ◽ Semantic Information ◽ Dimensional Space ◽ Relation Extraction ◽ Network Embedding ◽ Structure Information ◽ Second Mode ◽ Real World Datasets ◽ Low Dimensional

Network Embedding (NE) is an important method for learning representations of a network in a low-dimensional space. Conventional NE models focus on capturing the structure information and semantic information of vertices while neglecting such information for edges. In this work, we propose a novel NE model named BimoNet to capture both the structure and semantic information of edges. BimoNet is composed of two parts: the bi-mode embedding part and the deep neural network part. In the bi-mode embedding part, the first mode, named add-mode, is used to express the entity-shared features of edges, and the second mode, named subtract-mode, is employed to represent the entity-specific features of edges. These features reflect the semantic information. In the deep neural network part, we first regard the edges in a network as nodes and the vertices as links, which does not change the overall structure of the network. Then we take the nodes' adjacency matrix as the input of the deep neural network, as it can obtain similar representations for nodes with similar structure. Afterwards, by jointly optimizing the objective functions of these two parts, BimoNet can preserve both the semantic and structure information of edges. In experiments, we evaluate BimoNet on three real-world datasets and the task of relation extraction, and BimoNet is shown to outperform state-of-the-art baseline models consistently and significantly.
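
The step of treating edges as nodes and vertices as links can be sketched as building the adjacency matrix of the edge-to-edge graph, in which two edges are connected when they share a vertex. This is an illustrative reading of the abstract for an undirected toy network; the function name `edge_adjacency` and the example edges are hypothetical, and BimoNet's actual construction may differ (for example, in how edge direction is handled).

```python
def edge_adjacency(edges):
    """Adjacency matrix over edges: entry (i, j) is 1 if edges i and j share a vertex."""
    n = len(edges)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if set(edges[i]) & set(edges[j]):
                adj[i][j] = adj[j][i] = 1
    return adj

# Hypothetical network with four vertices and four edges.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
line_adj = edge_adjacency(edges)
```

Each row of `line_adj` describes one edge by the other edges it touches. Rows like these are the kind of input from which a deep network could learn similar representations for structurally similar edges.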

Discovering a sparse set of pairwise discriminating features in high-dimensional data

Bioinformatics ◽ 10.1093/bioinformatics/btaa690 ◽ 2020

Author(s): Samuel Melton ◽ Sharad Ramanathan

Keyword(s): Single Cell ◽ Dimensional Space ◽ Cell Types ◽ Dimensional Subspace ◽ Supplementary Information ◽ High Dimensional ◽ Technological Advances ◽ Data Points ◽ Low Dimensional ◽ Sparse Set

Abstract. Motivation: Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions. Results: We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace. Availability and implementation: https://github.com/smelton/SMD. Supplementary information: Supplementary data are available at Bioinformatics online.

A Bayesian nonparametric semi-supervised model for integration of multiple single-cell experiments

10.1101/2020.01.14.906313 ◽ 2020

Author(s): Archit Verma ◽ Barbara Engelhardt

Keyword(s): Single Cell ◽ Latent Variable ◽ Environmental Variability ◽ Simulated Data ◽ Joint Analysis ◽ Variable Model ◽ Manifold Alignment ◽ Multiple Data Sets ◽ Sequencing Platforms ◽ Low Dimensional

Joint analysis of multiple single-cell RNA-sequencing (scRNA-seq) data sets is confounded by technical batch effects across experiments, biological or environmental variability across cells, and different capture processes across sequencing platforms. Manifold alignment is a principled, effective tool for integrating multiple data sets and controlling for confounding factors. We demonstrate that the semi-supervised t-distributed Gaussian process latent variable model (sstGPLVM), which projects the data onto a mixture of fixed and latent dimensions, can learn a unified low-dimensional embedding for multiple single-cell experiments with minimal assumptions. We show the efficacy of the model as compared with state-of-the-art methods for single-cell data integration on simulated data, pancreas cells from four sequencing technologies, induced pluripotent stem cells from male and female donors, and mouse brain cells from both spatial seqFISH+ and traditional scRNA-seq. Code and data are available at https://github.com/architverma1/sc-manifold-alignment

Complex Moment-Based Supervised Eigenmap for Dimensionality Reduction

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v33i01.33013910 ◽ 2019 ◽ Vol 33 ◽ pp. 3910-3918 ◽ Cited By ~ 1

Author(s): Akira Imakura ◽ Momo Matsuda ◽ Xiucai Ye ◽ Tetsuya Sakurai

Keyword(s): Dimensionality Reduction ◽ Parallel Implementation ◽ Dimensional Space ◽ Recognition Performance ◽ Optimization Methods ◽ Original Data ◽ Dimensional Subspace ◽ Reduction Methods ◽ Low Dimensional ◽ Matrix Trace

Dimensionality reduction methods that project high-dimensional data to a low-dimensional space by matrix trace optimization are widely used for clustering and classification. The matrix trace optimization problem leads to an eigenvalue problem for constructing a low-dimensional subspace that preserves certain properties of the original data. However, most existing methods use only a few eigenvectors to construct the low-dimensional space, which may lose information that is useful for successful classification. Herein, to overcome this information loss, we propose a novel complex moment-based supervised eigenmap that includes multiple eigenvectors for dimensionality reduction. Furthermore, the proposed method provides a general formulation that allows matrix trace optimization methods to incorporate ridge regression, which models the linear dependency between covariate variables and univariate labels. To reduce the computational complexity, we also propose an efficient and parallel implementation of the proposed method. Numerical experiments indicate that the proposed method is competitive with existing dimensionality reduction methods in recognition performance. Additionally, the proposed method exhibits high parallel efficiency.

Continuous visualization of differences between biological conditions in single-cell data

10.1101/337485 ◽ 2018 ◽ Cited By ~ 1

Author(s): Tyler J. Burns ◽ Garry P. Nolan ◽ Nikolay Samusik

Keyword(s): Single Cell ◽ Nearest Neighbor ◽ Developmental Trajectory ◽ Functional Markers ◽ Mass Cytometry ◽ K Nearest Neighbor ◽ Cell Frequency ◽ Low Dimensional ◽ Marker Shift ◽ Cell Data

In high-dimensional single-cell data, comparing changes in functional markers between conditions is typically done across manual or algorithm-derived partitions based on population-defining markers. Visualization of these partitions is commonly done on low-dimensional embeddings (e.g., t-SNE), colored by per-partition changes. Here, we provide an analysis and visualization tool that performs these comparisons across overlapping k-nearest neighbor (KNN) groupings. This allows one to color low-dimensional embeddings by marker changes without hard boundaries imposed by partitioning. We devised an objective optimization of k based on minimizing functional marker KNN imputation error. Proof-of-concept work visualized the exact location of an IL-7 responsive subset in a B cell developmental trajectory on a t-SNE map independent of clustering. Per-condition cell frequency analysis revealed that KNN is sensitive to artifacts due to marker shift, and therefore can also be valuable in a quality control pipeline. Overall, we found that KNN groupings lead to useful multiple-condition visualizations and efficiently extract a large amount of information from mass cytometry data. Our software is publicly available through the Bioconductor package Sconify.
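
The per-cell comparison across overlapping KNN groupings can be sketched as follows. This is not Sconify's API; the function name `knn_marker_shift`, the condition labels "basal" and "stim", and the toy data are assumptions for illustration. For each cell, the sketch finds its k nearest neighbours on population-defining markers (pooled across conditions) and reports the difference in mean functional-marker value between the two conditions inside that neighbourhood.

```python
import math

def knn_marker_shift(positions, conditions, marker, k):
    """Per-cell difference (stim minus basal) of the mean marker value among k nearest cells."""
    shifts = []
    for p in positions:
        # Rank all cells by distance to p; the cell itself comes first (distance 0).
        order = sorted(range(len(positions)), key=lambda j: math.dist(p, positions[j]))
        hood = order[:k]
        stim = [marker[j] for j in hood if conditions[j] == "stim"]
        basal = [marker[j] for j in hood if conditions[j] == "basal"]
        if stim and basal:
            shifts.append(sum(stim) / len(stim) - sum(basal) / len(basal))
        else:
            shifts.append(None)  # neighbourhood lacks one condition; no comparison possible
    return shifts

# Hypothetical data: two populations; only the second responds to stimulation.
positions = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
conditions = ["basal", "stim", "basal", "stim", "basal", "stim", "basal", "stim"]
marker = [1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 1.0, 3.0]
shifts = knn_marker_shift(positions, conditions, marker, k=4)
```

The resulting per-cell values could be used to color a low-dimensional embedding, so that responsive regions appear without any hard cluster boundaries.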
