Fast and precise single-cell data analysis using a hierarchical autoencoder

A primary challenge in single-cell RNA sequencing (scRNA-seq) studies comes from the massive amount of data and the excess noise level. To address this challenge, we introduce an analysis framework, named single-cell Decomposition using Hierarchical Autoencoder (scDHA), that reliably extracts representative information of each cell. The scDHA pipeline consists of two core modules. The first module is a non-negative kernel autoencoder able to remove genes or components that have insignificant contributions to the part-based representation of the data. The second module is a stacked Bayesian autoencoder that projects the data onto a compressed, low-dimensional space. To reduce the tendency of neural networks to overfit, we repeatedly perturb the compressed space to learn a more generalized representation of the data. In an extensive analysis, we demonstrate that scDHA outperforms state-of-the-art techniques in many research sub-fields of scRNA-seq analysis, including cell segregation through unsupervised learning, visualization of transcriptome landscape, cell classification, and pseudo-time inference.

Fast and precise single-cell data analysis using hierarchical autoencoder

10.1101/799817 ◽ 2019 ◽ Cited By ~ 2

Author(s): Duc Tran ◽ Hung Nguyen ◽ Bang Tran ◽ Carlo La Vecchia ◽ Hung N. Luu ◽ ...

Keyword(s): Data Analysis ◽ Single Cell ◽ RNA Sequencing ◽ Noise Level ◽ State Of The Art ◽ Excess Noise ◽ Extensive Analysis ◽ Single Cell RNA Sequencing ◽ Cell Segregation ◽ Cell Data

A primary challenge in single-cell RNA sequencing (scRNA-seq) studies comes from the massive amount of data and the excess noise level. To address this challenge, we introduce a hierarchical autoencoder that reliably extracts representative information of each cell. In an extensive analysis, we demonstrate that the approach vastly outperforms state-of-the-art techniques in many research sub-fields of scRNA-seq analysis, including cell segregation through unsupervised learning, visualization of transcriptome landscape, cell classification, and pseudo-time inference.

Metric Multidimensional Scaling for Large Single-Cell Data Sets using Neural Networks

10.1101/2021.06.24.449725 ◽ 2021

Author(s): Stefan Canzar ◽ Van Hoan Do ◽ Slobodan Jelic ◽ Soeren Laue ◽ Domagoj Matijevic ◽ ...

Keyword(s): Multidimensional Scaling ◽ Single Cell ◽ State Of The Art ◽ Dimensional Space ◽ Linear Mapping ◽ Alternative Methods ◽ Dimensional Euclidean Space ◽ Data Sets ◽ Metric Multidimensional Scaling ◽ Low Dimensional

Metric multidimensional scaling is one of the classical methods for embedding data into low-dimensional Euclidean space. It creates the low-dimensional embedding by approximately preserving the pairwise distances between the input points. However, current state-of-the-art approaches only scale to a few thousand data points. For larger data sets such as those occurring in single-cell RNA sequencing experiments, the running time becomes prohibitively large and thus alternative methods such as PCA are widely used instead. Here, we propose a neural network based approach for solving the metric multidimensional scaling problem that is orders of magnitude faster than previous state-of-the-art approaches, and hence scales to data sets with up to a few million cells. At the same time, it provides a non-linear mapping between high- and low-dimensional space that can place previously unseen cells in the same embedding.
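
To make the objective concrete, the following is a minimal sketch of metric multidimensional scaling as plain gradient descent on the raw stress, i.e. the sum of squared differences between target and embedded pairwise distances. It is not the neural network approach proposed in the abstract; the function names (`pairwise`, `stress`, `random_init`, `metric_mds`), the step size, and the toy data are assumptions chosen for illustration.

```python
import math
import random


def pairwise(points):
    """Matrix of Euclidean distances between all pairs of points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def stress(target, embedding):
    """Raw stress: squared mismatch between target and embedded distances."""
    d = pairwise(embedding)
    n = len(target)
    return sum((d[i][j] - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def random_init(n, dim=2, seed=0):
    """Random starting coordinates in [-1, 1]^dim (illustrative choice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


def metric_mds(target, init, steps=500, lr=0.05):
    """Minimize the raw stress by fixed-step gradient descent."""
    y = [list(p) for p in init]  # copy so the caller's list is not mutated
    n, dim = len(y), len(y[0])
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dij = math.dist(y[i], y[j])
                if dij == 0.0:
                    continue  # gradient undefined for coincident points
                coeff = 2.0 * (dij - target[i][j]) / dij
                for k in range(dim):
                    grad[i][k] += coeff * (y[i][k] - y[j][k])
        for i in range(n):
            for k in range(dim):
                y[i][k] -= lr * grad[i][k]
    return y


# Toy input: the four corners of a unit square.
square = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
target = pairwise(square)
embedding = metric_mds(target, random_init(len(square)))
```

Each gradient step costs time quadratic in the number of points, which illustrates why distance-preserving embeddings of this kind become expensive for large data sets.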

Season- and Trend-aware Symbolic Approximation for Accurate and Efficient Time Series Matching

Datenbank-Spektrum ◽ 10.1007/s13222-021-00389-5 ◽ 2021

Author(s): Lars Kegel ◽ Claudio Hartmann ◽ Maik Thiele ◽ Wolfgang Lehner

Keyword(s): Time Series ◽ State Of The Art ◽ Dimensional Space ◽ Symbolic Aggregate Approximation ◽ Current State ◽ Optimal Representation ◽ Symbolic Approximation ◽ Low Dimensional ◽ Deterministic Behavior ◽ Support Time

Processing and analyzing time series datasets have become a central issue in many domains, requiring data management systems to support time series as a native data type. A core access primitive of time series is matching, which requires efficient algorithms on top of appropriate representations such as the symbolic aggregate approximation (SAX), which represents the current state of the art. This technique reduces a time series to a low-dimensional space by segmenting it and discretizing each segment into a small symbolic alphabet. Unfortunately, SAX ignores the deterministic behavior of time series, such as cyclical repeating patterns or a trend component affecting all segments, which may lead to sub-optimal representation accuracy. We therefore introduce a novel season- and trend-aware symbolic approximation and demonstrate an improved representation accuracy without increasing the memory footprint. Most importantly, our techniques also enable more efficient time series matching by providing a match up to three orders of magnitude faster than SAX.
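
The segment-and-discretize step of plain SAX can be sketched as follows. This is a minimal illustration of standard SAX only, not of the season- and trend-aware variant the abstract introduces; the function names `paa` and `sax`, the four-letter alphabet, and the toy series are assumptions made for the example.

```python
from statistics import NormalDist, mean, pstdev


def paa(series, n_segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment."""
    n = len(series)
    if n % n_segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by n_segments")
    width = n // n_segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(n_segments)]


def sax(series, n_segments, alphabet="abcd"):
    """Z-normalize, reduce with PAA, then map each segment mean to a symbol."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    z = [(x - mu) / sigma for x in series]
    k = len(alphabet)
    # Breakpoints split the standard normal into k equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    word = []
    for value in paa(z, n_segments):
        index = sum(value > b for b in breakpoints)  # number of breakpoints below value
        word.append(alphabet[index])
    return "".join(word)


# A purely trending series: every segment lands in a different symbol.
word = sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4)
print(word)  # abcd
```

For the linear series above, the word is determined entirely by the trend, which is the kind of deterministic component the abstract says plain SAX does not model separately.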

Discovering a sparse set of pairwise discriminating features in high-dimensional data

Bioinformatics ◽ 10.1093/bioinformatics/btaa690 ◽ 2020

Author(s): Samuel Melton ◽ Sharad Ramanathan

Keyword(s): Single Cell ◽ Dimensional Space ◽ Cell Types ◽ Dimensional Subspace ◽ Supplementary Information ◽ High Dimensional ◽ Technological Advances ◽ Data Points ◽ Low Dimensional ◽ Sparse Set

Motivation: Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions.

Results: We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace.

Availability and implementation: https://github.com/smelton/SMD.

Supplementary information: Supplementary data are available at Bioinformatics online.

Continuous visualization of differences between biological conditions in single-cell data

10.1101/337485 ◽ 2018 ◽ Cited By ~ 1

Author(s): Tyler J. Burns ◽ Garry P. Nolan ◽ Nikolay Samusik

Keyword(s): Single Cell ◽ Nearest Neighbor ◽ Developmental Trajectory ◽ Functional Markers ◽ Mass Cytometry ◽ K Nearest Neighbor ◽ Cell Frequency ◽ Low Dimensional ◽ Marker Shift ◽ Cell Data

In high-dimensional single cell data, comparing changes in functional markers between conditions is typically done across manual or algorithm-derived partitions based on population-defining markers. Visualization of these partitions is commonly done on low-dimensional embeddings (e.g., t-SNE), colored by per-partition changes. Here, we provide an analysis and visualization tool that performs these comparisons across overlapping k-nearest neighbor (KNN) groupings. This allows one to color low-dimensional embeddings by marker changes without hard boundaries imposed by partitioning. We devised an objective optimization of k based on minimizing functional marker KNN imputation error. Proof-of-concept work visualized the exact location of an IL-7 responsive subset in a B cell developmental trajectory on a t-SNE map independent of clustering. Per-condition cell frequency analysis revealed that KNN is sensitive to detecting artifacts due to marker shift, and therefore can also be valuable in a quality control pipeline. Overall, we found that KNN groupings lead to useful multiple condition visualizations and efficiently extract a large amount of information from mass cytometry data. Our software is publicly available through the Bioconductor package Sconify.
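
The idea of comparing conditions within overlapping KNN groupings, rather than within fixed partitions, can be sketched as below. This is a simplified illustration under stated assumptions and does not reproduce the Sconify package or its API: the function name `knn_marker_shift`, the condition labels `"control"` and `"treated"`, the brute-force neighbor search, and the toy data are all hypothetical.

```python
import math


def knn_marker_shift(coords, condition, marker, k):
    """Per-cell difference in mean marker value (treated minus control)
    computed over each cell's k nearest neighbors, pooled across conditions."""
    n = len(coords)
    shifts = []
    for i in range(n):
        # Brute-force neighbor search; the cell itself is included (distance 0).
        order = sorted(range(n), key=lambda j: math.dist(coords[i], coords[j]))
        neighbors = order[:k]
        control = [marker[j] for j in neighbors if condition[j] == "control"]
        treated = [marker[j] for j in neighbors if condition[j] == "treated"]
        if control and treated:
            shifts.append(sum(treated) / len(treated) - sum(control) / len(control))
        else:
            shifts.append(None)  # neighborhood lacks one of the conditions
    return shifts


# Two well-separated groups of cells, defined by population markers (coords).
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
condition = ["control", "control", "treated", "treated"] * 2
# The functional marker responds to treatment only in the second group.
marker = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0]

shifts = knn_marker_shift(coords, condition, marker, k=4)
print(shifts)  # [0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0]
```

The per-cell values could then be used to color a low-dimensional embedding, so that a responsive subset shows up without first drawing cluster boundaries.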

Unsupervised Deep Video Hashing with Balanced Rotation

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2017/429 ◽ 2017 ◽ Cited By ~ 10

Author(s): Gengshen Wu ◽ Li Liu ◽ Yuchen Guo ◽ Guiguang Ding ◽ Jungong Han ◽ ...

Keyword(s): State Of The Art ◽ Dimensional Space ◽ Image Hashing ◽ Neighborhood Structure ◽ Function Learning ◽ Video Hashing ◽ Real World Datasets ◽ Low Dimensional ◽ Balanced Code ◽ Hash Codes

Recently, hashing video contents for fast retrieval has received increasing attention due to the enormous growth of online videos. As the extension of image hashing techniques, traditional video hashing methods mainly focus on seeking the appropriate video features but pay little attention to how the video-specific features can be leveraged to achieve optimal binarization. In this paper, an end-to-end hashing framework, namely Unsupervised Deep Video Hashing (UDVH), is proposed, where feature extraction, balanced code learning and hash function learning are integrated and optimized in a self-taught manner. Particularly, distinguished from previous work, our framework enjoys two novelties: 1) an unsupervised hashing method that integrates the feature clustering and feature binarization, enabling the neighborhood structure to be preserved in the binary space; 2) a smart rotation applied to the video-specific features that are widely spread in the low-dimensional space such that the variance of dimensions can be balanced, thus generating more effective hash codes. Extensive experiments have been performed on two real-world datasets and the results demonstrate its superiority, compared to the state-of-the-art video hashing methods. To bootstrap further developments, the source code will be made publicly available.

SCDRHA: A scRNA-Seq Data Dimensionality Reduction Algorithm Based on Hierarchical Autoencoder

Frontiers in Genetics ◽ 10.3389/fgene.2021.733906 ◽ 2021 ◽ Vol 12

Author(s): Jianping Zhao ◽ Na Wang ◽ Haiyun Wang ◽ Chunhou Zheng ◽ Yansen Su

Keyword(s): Dimensionality Reduction ◽ Data Visualization ◽ State Of The Art ◽ Dimensional Space ◽ High Dimensional ◽ Reduction Algorithm ◽ Cell Clustering ◽ Data Dimensionality Reduction ◽ Single Cell RNA Sequencing ◽ Low Dimensional

Dimensionality reduction of high-dimensional data is crucial for single-cell RNA sequencing (scRNA-seq) visualization and clustering. One prominent challenge in scRNA-seq studies comes from the dropout events, which lead to zero-inflated data. To address this issue, in this paper, we propose a scRNA-seq data dimensionality reduction algorithm based on a hierarchical autoencoder, termed SCDRHA. The proposed SCDRHA consists of two core modules, where the first module is a deep count autoencoder (DCA) that is used to denoise data, and the second module is a graph autoencoder that projects the data into a low-dimensional space. Experimental results demonstrate that SCDRHA has better performance than existing state-of-the-art algorithms on dimension reduction and noise reduction in five real scRNA-seq datasets. Besides, SCDRHA can also dramatically improve the performance of data visualization and cell clustering.

Unsupervised cell functional annotation for single-cell RNA-Seq

10.1101/2021.11.20.469410 ◽ 2021

Author(s): Dongshunyi Li ◽ Jun Ding ◽ Ziv Bar-Joseph

Keyword(s): Single Cell ◽ Dimensional Space ◽ Cell Types ◽ Specific Gene ◽ RNA Seq ◽ Cell Type ◽ Sequencing Data ◽ Gene Sets ◽ Supervised Methods ◽ Low Dimensional

One of the first steps in the analysis of single cell RNA-Sequencing (scRNA-Seq) data is the assignment of cell types. While a number of supervised methods have been developed for this, in most cases such assignment is performed by first clustering cells in low-dimensional space and then assigning cell types to different clusters. To overcome noise and to improve cell type assignments we developed UNIFAN, a neural network method that simultaneously clusters and annotates cells using known gene sets. UNIFAN combines both a low-dimensional representation of all genes and cell-specific gene set activity scores to determine the clustering. We applied UNIFAN to human and mouse scRNA-Seq datasets from several different organs. As we show, by using knowledge of gene sets, UNIFAN greatly outperforms prior methods developed for clustering scRNA-Seq data. The gene sets assigned by UNIFAN to different clusters provide strong evidence for the cell type represented by each cluster, making annotation easier.

Prediction of Disease-related microRNAs through Integrating Attributes of microRNA Nodes and Multiple Kinds of Connecting Edges

Molecules ◽ 10.3390/molecules24173099 ◽ 2019 ◽ Vol 24 (17) ◽ pp. 3099 ◽ Cited By ~ 3

Author(s): Xuan ◽ Li ◽ Zhang ◽ Song

Keyword(s): State Of The Art ◽ Dimensional Space ◽ Nonnegative Matrix ◽ Superior Performance ◽ Pancreatic Cancers ◽ Node Attribute ◽ Disease Associations ◽ Node Attributes ◽ Novel Method ◽ Low Dimensional

Identifying disease-associated microRNAs (disease miRNAs) contributes to the understanding of disease pathogenesis. Most previous computational biology studies focused on multiple kinds of connecting edges of miRNAs and diseases, including miRNA–miRNA similarities, disease–disease similarities, and miRNA–disease associations. Few methods exploited the node attribute information related to miRNA family and cluster. The previous methods do not completely consider the sparsity of node attributes. Additionally, it is challenging to deeply integrate the node attributes of miRNAs and the similarities and associations related to miRNAs and diseases. In the present study, we propose a novel method, known as MDAPred, based on nonnegative matrix factorization to predict candidate disease miRNAs. MDAPred integrates the node attributes of miRNAs and the related similarities and associations of miRNAs and diseases. Since a miRNA is typically subordinate to a family or a cluster, the node attributes of miRNAs are sparse. Similarly, the data for miRNA and disease similarities are sparse. Projecting the miRNA and disease similarities and miRNA node attributes into a common low-dimensional space contributes to estimating miRNA–disease associations. Simultaneously, the possibility that a miRNA is associated with a disease depends on the miRNA’s neighbour information. Therefore, MDAPred deeply integrates projections of multiple kinds of connecting edges, projections of miRNA node attributes, and neighbour information of miRNAs. The cross-validation results showed that MDAPred achieved superior performance compared to other state-of-the-art methods for predicting miRNA–disease associations. MDAPred can also retrieve more actual miRNA–disease associations at the top of prediction results, which is very important for biologists. Additionally, case studies of breast, lung, and pancreatic cancers further confirmed the ability of MDAPred to discover potential miRNA–disease associations.
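
As background for the factorization step, the sketch below shows generic nonnegative matrix factorization with the standard multiplicative update rules for the squared (Frobenius) error. It is not MDAPred: the abstract does not give the model's objective, so the function names (`matmul`, `transpose`, `frobenius_error`, `nmf`), the rank, and the toy association matrix are assumptions made purely for illustration.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w @ h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, steps=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix v (n x m) into w (n x rank) and h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(steps):
        # h <- h * (w^T v) / (w^T w h), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # w <- w * (v h^T) / (w h h^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# Toy binary association matrix (rows and columns are hypothetical entities).
associations = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
w, h = nmf(associations, rank=2)
scores = matmul(w, h)  # low-rank reconstruction, usable as association scores
```

In this generic setting, the rows of `w` and the columns of `h` play the role of coordinates in a shared low-dimensional space, and their product gives a score for every row-column pair, including pairs not observed in the input.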

MOFA+: a probabilistic framework for comprehensive integration of structured single-cell data

10.1101/837104 ◽ 2019 ◽ Cited By ~ 8

Author(s): Ricard Argelaguet ◽ Damien Arnol ◽ Danila Bredikhin ◽ Yonatan Deloro ◽ Britta Velten ◽ ...

Keyword(s): Factor Analysis ◽ Single Cell ◽ Cell Fate ◽ Joint Analysis ◽ Experimental Conditions ◽ Joint Modelling ◽ Stochastic Variational Inference ◽ Technological Advances ◽ Low Dimensional ◽ Cell Data

Technological advances have enabled the joint analysis of multiple molecular layers at single cell resolution. At the same time, increased experimental throughput has facilitated the study of larger numbers of experimental conditions. While methods for analysing single-cell data that model the resulting structure of either of these dimensions are beginning to emerge, current methods do not account for complex experimental designs that include both multiple views (modalities or assays) and groups (conditions or experiments). Here we present Multi-Omics Factor Analysis v2 (MOFA+), a statistical framework for the comprehensive and scalable integration of structured single cell multi-modal data. MOFA+ builds upon a Bayesian Factor Analysis framework combined with fast GPU-accelerated stochastic variational inference. Similar to existing factor models, MOFA+ allows for interpreting variation in single-cell datasets by pooling information across cells and features to reconstruct a low-dimensional representation of the data. Uniquely, the model supports flexible group-level sparsity constraints that allow joint modelling of variation across multiple groups and views. To illustrate MOFA+, we applied it to single-cell data sets of different scales and designs, demonstrating practical advantages when analyzing datasets with complex group and/or view structure. In a multi-omics analysis of mouse gastrulation this joint modelling reveals coordinated changes between gene expression and epigenetic variation associated with cell fate commitment.
