SSCC: a novel computational framework for rapid and accurate clustering large single cell RNA-seq data

ABSTRACT Clustering is a prevalent analytical means to analyze single cell RNA sequencing data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are urgently needed. Here we propose a new clustering framework based on random projection and feature construction for large scale single-cell RNA sequencing data, which greatly improves the clustering accuracy, robustness and computational efficiency of various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset with 68,578 human blood cells, our method achieved a 20% improvement in clustering accuracy and a 50-fold acceleration while using only 66% of the memory, compared to the widely-used software package SC3. Compared to k-means, the accuracy improvement can reach 3-fold depending on the dataset. An R implementation of the framework is available from https://github.com/Japrin/sscClust.


SHARP: Single-cell RNA-seq Hyper-fast and Accurate Processing via Ensemble Random Projection

10.1101/461640 ◽

2018 ◽

Cited By ~ 2

Author(s):

Shibiao Wan ◽

Junil Kim ◽

Kyoung Jae Won

Keyword(s):

Dimension Reduction ◽

Single Cell ◽

Rna Sequencing ◽

Large Scale ◽

Random Projection ◽

Rna Seq ◽

Running Speed ◽

Large Size ◽

Single Cell Rna Sequencing ◽

Speed And Accuracy

ABSTRACT To process large-scale single-cell RNA-sequencing (scRNA-seq) data effectively without excessive distortion during dimension reduction, we present SHARP, an ensemble random projection-based algorithm that scales to clustering 10 million cells. Comprehensive benchmarking tests on 17 public scRNA-seq datasets demonstrate that SHARP outperforms existing methods in terms of speed and accuracy. In particular, for large datasets (>40,000 cells), SHARP’s running speed far exceeds that of its competitors while maintaining high clustering accuracy and robustness. To the best of our knowledge, SHARP is the only R-based tool that scales to clustering scRNA-seq data with 10 million cells.
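The idea of ensemble random projection mentioned in this abstract can be illustrated with a minimal sketch. It is not SHARP's implementation (SHARP is an R tool); the function names, the Gaussian projection matrix, the toy Poisson count matrix and all parameter values below are assumptions chosen for illustration. The sketch projects a cells-by-genes matrix into a lower-dimensional space several times and checks that a pairwise distance is roughly preserved.

```python
import numpy as np

def random_projection(X, n_components, rng):
    """Project the rows of X (cells x genes) onto n_components random directions."""
    n_features = X.shape[1]
    # Entries drawn from N(0, 1/n_components), so squared Euclidean
    # distances are preserved in expectation.
    R = rng.normal(0.0, 1.0 / np.sqrt(n_components),
                   size=(n_features, n_components))
    return X @ R

def ensemble_projections(X, n_components, n_runs, seed=0):
    """Return n_runs independent random projections of X (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return [random_projection(X, n_components, rng) for _ in range(n_runs)]

# Toy count matrix standing in for scRNA-seq data: 200 cells, 5000 genes.
data_rng = np.random.default_rng(42)
X = data_rng.poisson(2.0, size=(200, 5000)).astype(float)

projections = ensemble_projections(X, n_components=100, n_runs=5)

# Compare the distance between two cells before and after projection.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = [np.linalg.norm(P[0] - P[1]) for P in projections]
ratios = [d / d_orig for d in d_proj]
print("distance ratios (projected / original):", np.round(ratios, 3))
```

In an ensemble approach, each projected matrix would be clustered separately and the resulting labels combined; that combination step is omitted here.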


A map of tumor–host interactions in glioma at single-cell resolution

GigaScience ◽

10.1093/gigascience/giaa109 ◽

2020 ◽

Vol 9 (10) ◽

Cited By ~ 3

Author(s):

Francesca Pia Caruso ◽

Luciano Garofano ◽

Fulvio D'Angelo ◽

Kai Yu ◽

Fuchou Tang ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Cross Talk ◽

Large Scale ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Sequencing Data ◽

Host Interaction ◽

Receptor Interactions ◽

Single Cell Rna Sequencing

ABSTRACT Background: Single-cell RNA sequencing is the reference technique for characterizing the heterogeneity of the tumor microenvironment. The composition of the various cell types making up the microenvironment can significantly affect the way in which the immune system activates cancer rejection mechanisms. Understanding the cross-talk signals between immune cells and cancer cells is of fundamental importance for the identification of immuno-oncology therapeutic targets. Results: We present a novel method, single-cell Tumor–Host Interaction tool (scTHI), to identify significantly activated ligand–receptor interactions across clusters of cells from single-cell RNA sequencing data. We apply our approach to uncover the ligand–receptor interactions in glioma using 6 publicly available human glioma datasets encompassing 57,060 gene expression profiles from 71 patients. By leveraging this large-scale collection we show that unexpected cross-talk partners are highly conserved across different datasets in the majority of the tumor samples. This suggests that shared cross-talk mechanisms exist in glioma. Conclusions: Our results provide a complete map of the active tumor–host interaction pairs in glioma that can be therapeutically exploited to reduce the immunosuppressive action of the microenvironment in brain tumor.


A bivariate zero-inflated negative binomial model for identifying underlying dependence with application to single cell RNA sequencing data

10.1101/2020.03.06.977728 ◽

2020 ◽

Author(s):

Hunyong Cho ◽

Chuwen Liu ◽

John S. Preisser ◽

Di Wu

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Latent Variable ◽

Large Scale ◽

Negative Binomial ◽

Model Fitting ◽

Sequencing Data ◽

Excess Zeros ◽

Binomial Distributions ◽

Single Cell Rna Sequencing

Summary Measuring gene-gene dependence in single cell RNA sequencing (scRNA-seq) count data is often of interest and remains challenging, because an unidentified portion of the zero counts represent non-detected RNA due to technical reasons. Conventional statistical methods that fail to account for technical zeros incorrectly measure the dependence among genes. To address this problem, we propose a bivariate zero-inflated negative binomial (BZINB) model constructed using a bivariate Poisson-gamma mixture with dropout indicators for the technical (excess) zeros. Parameters are estimated based on the EM algorithm and are used to measure the underlying dependence by decomposing the two sources of zeros. Compared to existing models, the proposed BZINB model is specifically designed for estimating dependence and is more flexible, while preserving the marginal zero-inflated negative binomial distributions. Additionally, it has a simple latent variable framework, allowing parameters to have clear and intuitive interpretations, and its computation is feasible with large scale data. Using a recent scRNA-seq dataset, we illustrate model fitting and how the model-based measures can be different from naive measures. The inferential ability of the proposed model is evaluated in a simulation study. An R package ‘bzinb’ is available on CRAN.
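Why technical zeros distort naive dependence measures can be seen in a small simulation. The sketch below is a simplification and not the BZINB parameterization or the `bzinb` package API: it assumes two genes whose Poisson rates share a common gamma component, and independent dropout indicators with a fixed probability. The function name and all parameter values are hypothetical.

```python
import numpy as np

def simulate_gene_pair(n, shape_shared, shape_own, scale, dropout, seed=0):
    """Simulate two correlated count vectors, with and without technical zeros."""
    rng = np.random.default_rng(seed)
    # A shared gamma component induces dependence between the two rates.
    shared = rng.gamma(shape_shared, scale, size=n)
    rate1 = shared + rng.gamma(shape_own, scale, size=n)
    rate2 = shared + rng.gamma(shape_own, scale, size=n)
    # Poisson counts given gamma rates: a Poisson-gamma mixture.
    y1 = rng.poisson(rate1)
    y2 = rng.poisson(rate2)
    # Dropout indicators (assumed independent here) zero out some counts.
    keep1 = rng.random(n) >= dropout
    keep2 = rng.random(n) >= dropout
    return y1, y2, y1 * keep1, y2 * keep2

y1, y2, obs1, obs2 = simulate_gene_pair(
    n=20000, shape_shared=2.0, shape_own=1.0, scale=2.0, dropout=0.3
)

corr_underlying = np.corrcoef(y1, y2)[0, 1]
corr_observed = np.corrcoef(obs1, obs2)[0, 1]
print(f"correlation without dropout: {corr_underlying:.3f}")
print(f"correlation with dropout:    {corr_observed:.3f}")
```

Under these assumptions the Pearson correlation computed on the observed counts is attenuated relative to the correlation of the counts before dropout, which is the kind of discrepancy a model that separates the two sources of zeros aims to correct.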


Splatter: simulation of single-cell RNA sequencing data

10.1101/133173 ◽

2017 ◽

Cited By ~ 8

Author(s):

Luke Zappia ◽

Belinda Phipson ◽

Alicia Oshlack

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Real Data ◽

Cell Types ◽

Rna Seq ◽

Sequencing Data ◽

Sequencing Technologies ◽

Simulation Based ◽

Single Cell Rna Sequencing ◽

Multiple Cell

Abstract As single-cell RNA sequencing technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here we present the Splatter Bioconductor package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.
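A gamma-Poisson simulation of count data, the distribution family named in this abstract, can be sketched in a few lines. This is not the Splatter package or its interface (Splatter is an R/Bioconductor package); the function name, the gamma prior on gene means and every parameter value are assumptions made for illustration.

```python
import numpy as np

def simulate_counts(n_cells, n_genes, mean_shape=0.6, mean_rate=0.3,
                    dispersion=0.5, seed=0):
    """Simulate a cells x genes count matrix from a gamma-Poisson hierarchy."""
    rng = np.random.default_rng(seed)
    # Gene-level mean expression drawn from a gamma distribution.
    gene_means = rng.gamma(mean_shape, 1.0 / mean_rate, size=n_genes)
    # Cell-level rates: gamma with mean gene_means and
    # variance dispersion * gene_means**2.
    shape = 1.0 / dispersion
    rates = rng.gamma(shape, gene_means * dispersion, size=(n_cells, n_genes))
    # Poisson sampling on gamma rates yields negative binomial counts.
    return rng.poisson(rates), gene_means

counts, gene_means = simulate_counts(n_cells=500, n_genes=200)

# Overdispersion check: for expressed genes the variance exceeds the mean.
emp_mean = counts.mean(axis=0)
emp_var = counts.var(axis=0)
expressed = emp_mean > 1.0
median_ratio = np.median(emp_var[expressed] / emp_mean[expressed])
print("median variance/mean ratio for expressed genes:", round(median_ratio, 2))
```

A pure Poisson model would give a variance-to-mean ratio near 1; the gamma layer adds the extra variability typically seen in real count data.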


SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

10.1101/677740 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alemu Takele Assefa ◽

Jo Vandesompele ◽

Olivier Thas

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Empirical Distribution ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Actual Distribution ◽

Wide Range ◽

Single Cell Rna Sequencing

Summary SPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts. Availability and implementation The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information Supplementary data are available at bioRχiv online.


Comparison of high-throughput single-cell RNA sequencing data processing pipelines

Briefings in Bioinformatics ◽

10.1093/bib/bbaa116 ◽

2020 ◽

Author(s):

Mingxuan Gao ◽

Mingyi Ling ◽

Xinwei Tang ◽

Shun Wang ◽

Xu Xiao ◽

...

Keyword(s):

Data Processing ◽

Single Cell ◽

Rna Sequencing ◽

High Throughput ◽

Large Scale ◽

Evaluation Framework ◽

Integrated Analysis ◽

Sequencing Data ◽

Single Experiment ◽

Single Cell Rna Sequencing

Abstract With the development of single-cell RNA sequencing (scRNA-seq) technology, it has become possible to perform large-scale transcript profiling for tens of thousands of cells in a single experiment. Many analysis pipelines have been developed for data generated from different high-throughput scRNA-seq platforms, bringing a new challenge to users to choose a proper workflow that is efficient, robust and reliable for a specific sequencing platform. Moreover, as the amount of public scRNA-seq data has increased rapidly, integrated analysis of scRNA-seq data from different sources has become increasingly popular. However, it remains unclear whether such integrated analysis would be biassed if the data were processed by different upstream pipelines. In this study, we encapsulated seven existing high-throughput scRNA-seq data processing pipelines with Nextflow, a general integrative workflow management framework, and evaluated their performance in terms of running time, computational resource consumption and data analysis consistency using eight public datasets generated from five different high-throughput scRNA-seq platforms. Our work provides a useful guideline for the selection of scRNA-seq data processing pipelines based on their performance on different real datasets. In addition, these guidelines can serve as a performance evaluation framework for future developments in high-throughput scRNA-seq data processing.


SMaSH: A scalable, general marker gene identification framework for single-cell RNA sequencing and Spatial Transcriptomics

10.1101/2021.04.08.438978 ◽

2021 ◽

Author(s):

Michael E Nelson ◽

Simone G Riva ◽

Ann Cvejic

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Marker Gene ◽

Marker Genes ◽

Sequencing Data ◽

Computational Framework ◽

Data Set ◽

Spatially Resolved ◽

Single Cell Rna Sequencing ◽

The Given

Spatial transcriptomics is revolutionising the study of single-cell RNA and tissue-wide cell heterogeneity, but few robust methods exist for connecting spatially resolved cells to so-called marker genes from single-cell RNA sequencing, which generate significant insight gleaned from spatial methods. Here we present SMaSH, a general computational framework for extracting key marker genes from single-cell RNA sequencing data for spatial transcriptomics approaches. SMaSH extracts robust and biologically well-motivated marker genes, which characterise the given data-set better than existing and limited computational approaches for global marker gene calculation.


Comparison of computational methods for imputing single-cell RNA-sequencing data

10.1101/241190 ◽

2017 ◽

Cited By ~ 10

Author(s):

Lihua Zhang ◽

Shihua Zhang

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Large Scale ◽

Real Data ◽

Cell Types ◽

Biological Functions ◽

Sequencing Data ◽

Imputation Methods ◽

Future Studies ◽

Single Cell Rna Sequencing

Abstract Single-cell RNA-sequencing (scRNA-seq) is a recent breakthrough technology, which paves the way for measuring RNA levels at single cell resolution to study precise biological functions. One of the main challenges when analyzing scRNA-seq data is the presence of zeros or dropout events, which may mislead downstream analyses. To compensate for the dropout effect, several methods have been developed to impute gene expression since the first Bayesian-based method was proposed in 2016. However, these methods have shown very diverse characteristics in terms of model hypothesis and imputation performance. Thus, large-scale comparison and evaluation of these methods is urgently needed. To this end, we compared eight imputation methods, evaluated their power in recovering original real data, and performed broad analyses to explore their effects on clustering cell types, detecting differentially expressed genes, and reconstructing lineage trajectories in the context of both simulated and real data. Simulated datasets and case studies highlight that no single method performs best in all situations. Some defects of these methods, such as scalability, robustness and unavailability in some situations, need to be addressed in future studies.


Comparison of High-Throughput Single-Cell RNA Sequencing Data Processing Pipelines

10.1101/2020.02.09.940221 ◽

2020 ◽

Cited By ~ 2

Author(s):

Mingxuan Gao ◽

Mingyi Ling ◽

Xinwei Tang ◽

Shun Wang ◽

Xu Xiao ◽

...

Keyword(s):

Data Processing ◽

Single Cell ◽

Rna Sequencing ◽

High Throughput ◽

Large Scale ◽

Evaluation Framework ◽

Integrated Analysis ◽

Sequencing Data ◽

Single Experiment ◽

Single Cell Rna Sequencing

Abstract With the development of single-cell RNA sequencing (scRNA-seq) technology, it has become possible to perform large-scale transcript profiling for tens of thousands of cells in a single experiment. Many analysis pipelines have been developed for data generated from different high-throughput scRNA-seq platforms, bringing a new challenge to users to choose a proper workflow that is efficient, robust and reliable for a specific sequencing platform. Moreover, as the amount of public scRNA-seq data has increased rapidly, integrated analysis of scRNA-seq data from different sources has become increasingly popular. However, it remains unclear whether such integrated analysis would be biased if the data were processed by different upstream pipelines. In this study, we encapsulated seven existing high-throughput scRNA-seq data processing pipelines with Nextflow, a general integrative workflow management framework, and evaluated their performances in terms of running time, computational resource consumption, and data processing consistency using nine public datasets generated from five different high-throughput scRNA-seq platforms. Our work provides a useful guideline for the selection of scRNA-seq data processing pipelines based on their performances on different real datasets. In addition, these guidelines can serve as a performance evaluation framework for future developments in high-throughput scRNA-seq data processing.


Deep Transfer Learning of Drug Sensitivity by Integrating Bulk and Single-cell RNA-seq data

10.1101/2021.08.01.454654 ◽

2021 ◽

Author(s):

Junyi Chen ◽

Ren Qi ◽

Zhenyu Wu ◽

Anjun Ma ◽

Lang Li ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Transfer Learning ◽

Large Scale ◽

Drug Response ◽

Single Cells ◽

Cancer Drug ◽

Sequencing Data ◽

Drug Responses ◽

Single Cell Rna Sequencing

Massive bulk RNA sequencing databases incorporating drug screening have opened up an avenue to inform the optimal clinical application of cancer drugs. Meanwhile, the growing body of single-cell RNA sequencing data contributes to improving therapeutic effectiveness by enabling the study of heterogeneous drug responses across cancer cell subpopulations. Yet drug response information for single-cell data is scarce. Thus, there is an urgent need to develop computational pipelines to infer and interpret cancer drug responses in single cells. Here, we developed scDEAL, a deep transfer learning framework integrating large-scale bulk and single-cell RNA sequencing drug response datasets. We benchmarked scDEAL on six single-cell RNA sequencing datasets and demonstrated its model interpretability through several case studies. scDEAL not only achieves accurate and robust performance in single-cell drug response predictions, but can also infer signature genes to reveal potential drug resistance mechanisms based on integrated gradient feature interpretation. This work may help study cell reprogramming, drug selection, and repurposing for improving therapeutic efficacy.
