Integrating single-cell datasets with ambiguous batch information by incorporating molecular network features

Briefings in Bioinformatics ◽

10.1093/bib/bbab366 ◽

2021 ◽

Author(s):

Ji Dong ◽

Peijie Zhou ◽

Yichong Wu ◽

Yidong Chen ◽

Haoling Xie ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Developmental Stages ◽

Rapid Development ◽

Molecular Network ◽

Rna Seq ◽

Single Cell Sequencing ◽

The World ◽

Information Score ◽

Simple Network

Abstract With the rapid development of single-cell sequencing techniques, several large-scale cell atlas projects have been launched across the world. However, it is still challenging to integrate single-cell RNA-seq (scRNA-seq) datasets with diverse tissue sources, developmental stages and/or few overlaps, due to the ambiguity in determining the batch information, which is particularly important for current batch-effect correction methods. Here, we present SCORE, a simple network-based integration methodology, which incorporates curated molecular network features to infer cellular states and generate a unified workflow for integrating scRNA-seq datasets. Validating on real single-cell datasets, we showed that regardless of batch information, SCORE outperforms existing methods in accuracy, robustness, scalability and data integration.

scDIOR: single cell RNA-seq data IO software

BMC Bioinformatics ◽

10.1186/s12859-021-04528-3 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Huijian Feng ◽

Lihui Lin ◽

Jiekai Chen

Keyword(s):

Single Cell ◽

Programming Languages ◽

Large Scale ◽

Developmental Trajectories ◽

Rapid Development ◽

Data Transformation ◽

Rna Seq ◽

Data Types ◽

User Friendly ◽

Cell Data

Abstract Background Single-cell RNA sequencing is becoming a powerful tool to identify cell states, reconstruct developmental trajectories, and deconvolute spatial expression. The rapid development of computational methods promotes the insight of heterogeneous single-cell data. An increasing number of tools have been provided for biological analysts, of which two programming languages- R and Python are widely used among researchers. R and Python are complementary, as many methods are implemented specifically in R or Python. However, the different platforms immediately caused the data sharing and transformation problem, especially for Scanpy, Seurat, and SingleCellExperiemnt. Currently, there is no efficient and user-friendly software to perform data transformation of single-cell omics between platforms, which makes users spend unbearable time on data Input and Output (IO), significantly reducing the efficiency of data analysis. Results We developed scDIOR for single-cell data transformation between platforms of R and Python based on Hierarchical Data Format Version 5 (HDF5). We have created a data IO ecosystem between three R packages (Seurat, SingleCellExperiment, Monocle) and a Python package (Scanpy). Importantly, scDIOR accommodates a variety of data types across programming languages and platforms in an ultrafast way, including single-cell RNA-seq and spatial resolved transcriptomics data, using only a few codes in IDE or command line interface. For large scale datasets, users can partially load the needed information, e.g., cell annotation without the gene expression matrices. scDIOR connects the analytical tasks of different platforms, which makes it easy to compare the performance of algorithms between them. Conclusions scDIOR contains two modules, dior in R and diopy in Python. scDIOR is a versatile and user-friendly tool that implements single-cell data transformation between R and Python rapidly and stably. The software is freely accessible at https://github.com/JiekaiLab/scDIOR.

Comparative analysis of antibody- and lipid-based multiplexing methods for single-cell RNA-seq

10.1101/2020.11.16.384222 ◽

2020 ◽

Author(s):

Viacheslav Mylka ◽

Jeroen Aerts ◽

Irina Matetovici ◽

Suresh Poovathingal ◽

Niels Vandamme ◽

...

Keyword(s):

Genetic Variation ◽

Comparative Analysis ◽

Single Cell ◽

Cell Lines ◽

Clinical Studies ◽

Clinical Samples ◽

Rna Seq ◽

Batch Effects ◽

Single Cell Sequencing ◽

Single Nucleus

ABSTRACTMultiplexing of samples in single-cell RNA-seq studies allows significant reduction of experimental costs, straightforward identification of doublets, increased cell throughput, and reduction of sample-specific batch effects. Recently published multiplexing techniques using oligo-conjugated antibodies or - lipids allow barcoding sample-specific cells, a process called ‘hashing’. Here, we compare the hashing performance of TotalSeq-A and -C antibodies, custom synthesized lipids and MULTI-seq lipid hashes in four cell lines, both for single-cell RNA-seq and single-nucleus RNA-seq. Hashing efficiency was evaluated using the intrinsic genetic variation of the cell lines. Benchmarking of different hashing strategies and computational pipelines indicates that correct demultiplexing can be achieved with both lipid- and antibody-hashed human cells and nuclei, with MULTISeqDemux as the preferred demultiplexing function and antibody-based hashing as the most efficient protocol on cells. Antibody hashing was further evaluated on clinical samples using PBMCs from healthy and SARS-CoV-2 infected patients, where we demonstrate a more affordable approach for large single-cell sequencing clinical studies, while simultaneously reducing batch effects.

Predicting heterogeneity in clone-specific therapeutic vulnerabilities using single-cell transcriptomic signatures

Genome Medicine ◽

10.1186/s13073-021-01000-y ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Chayaporn Suphavilai ◽

Shumei Chia ◽

Ankur Sharma ◽

Lorna Tu ◽

Rafael Peres Da Silva ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Drug Response ◽

Drug Repurposing ◽

High Accuracy ◽

Molecular Heterogeneity ◽

Precision Oncology ◽

Scale Analysis ◽

Rna Seq ◽

Large Scale Analysis

AbstractWhile understanding molecular heterogeneity across patients underpins precision oncology, there is increasing appreciation for taking intra-tumor heterogeneity into account. Based on large-scale analysis of cancer omics datasets, we highlight the importance of intra-tumor transcriptomic heterogeneity (ITTH) for predicting clinical outcomes. Leveraging single-cell RNA-seq (scRNA-seq) with a recommender system (CaDRReS-Sc), we show that heterogeneous gene-expression signatures can predict drug response with high accuracy (80%). Using patient-proximal cell lines, we established the validity of CaDRReS-Sc’s monotherapy (Pearson r>0.6) and combinatorial predictions targeting clone-specific vulnerabilities (>10% improvement). Applying CaDRReS-Sc to rapidly expanding scRNA-seq compendiums can serve as in silico screen to accelerate drug-repurposing studies. Availability: https://github.com/CSB5/CaDRReS-Sc.

Leveraging high-powered RNA-Seq datasets to improve inference of regulatory activity in single-cell RNA-Seq data

10.1101/553040 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ning Wang ◽

Andrew E. Teschendorff

Keyword(s):

Transcription Factors ◽

Single Cell ◽

Cell Fate ◽

Regulatory Networks ◽

Large Scale ◽

Single Cells ◽

Differential Expression Analysis ◽

Dropout Rate ◽

Rna Seq ◽

Regulatory Activity

AbstractInferring the activity of transcription factors in single cells is a key task to improve our understanding of development and complex genetic diseases. This task is, however, challenging due to the relatively large dropout rate and noisy nature of single-cell RNA-Seq data. Here we present a novel statistical inference framework called SCIRA (Single Cell Inference of Regulatory Activity), which leverages the power of large-scale bulk RNA-Seq datasets to infer high-quality tissue-specific regulatory networks, from which regulatory activity estimates in single cells can be subsequently obtained. We show that SCIRA can correctly infer regulatory activity of transcription factors affected by high technical dropouts. In particular, SCIRA can improve sensitivity by as much as 70% compared to differential expression analysis and current state-of-the-art methods. Importantly, SCIRA can reveal novel regulators of cell-fate in tissue-development, even for cell-types that only make up 5% of the tissue, and can identify key novel tumor suppressor genes in cancer at single cell resolution. In summary, SCIRA will be an invaluable tool for single-cell studies aiming to accurately map activity patterns of key transcription factors during development, and how these are altered in disease.

EpiScanpy: integrated single-cell epigenomic analysis

10.1101/648097 ◽

2019 ◽

Cited By ~ 4

Author(s):

Anna Danese ◽

Maria L. Richter ◽

David S. Fischer ◽

Fabian J. Theis ◽

Maria Colomé-Tatché

Keyword(s):

Dna Methylation ◽

Single Cell ◽

Large Scale ◽

Feature Space ◽

Rna Seq ◽

Computational Framework ◽

Learning Techniques ◽

Multiple Feature ◽

The Many ◽

Cell Data

ABSTRACTEpigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics, however single-cell-omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell brain mouse atlases of DNA methylation, ATAC-seq and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels by orthogonal epigenetic information.

K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

10.1101/723833 ◽

2019 ◽

Author(s):

Christina Huan Shi ◽

Kevin Y. Yip

Keyword(s):

Single Cell ◽

State Of The Art ◽

Rna Seq ◽

Sequencing Data ◽

Memory Consumption ◽

Analysis Pipeline ◽

Cell Clusters ◽

Single Cell Sequencing ◽

Sequencing Errors ◽

Full Analysis

AbstractK-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used for a preview of cell clusters for an early detection of potential data problems, before running a much more time-consuming full analysis pipeline.

SCOTCluster: Deep Clustering with Optimal Transport for Large-scale Single-cell RNA-seq Data

10.1109/bibm52615.2021.9669800 ◽

2021 ◽

Author(s):

Faning Long ◽

Xiaojun Ding ◽

Xiaoqing Peng ◽

Jianxin Wang ◽

Xiaoshu Zhu

Keyword(s):

Single Cell ◽

Large Scale ◽

Optimal Transport ◽

Rna Seq

A single-cell Raman-based platform to identify developmental stages of human pluripotent stem cell-derived neurons

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2001906117 ◽

2020 ◽

Vol 117 (31) ◽

pp. 18412-18423 ◽

Cited By ~ 2

Author(s):

Chia-Chen Hsu ◽

Jiabao Xu ◽

Bas Brinkhof ◽

Hui Wang ◽

Zhanfeng Cui ◽

...

Keyword(s):

Stem Cells ◽

Stem Cell ◽

Single Cell ◽

Raman Spectra ◽

Large Scale ◽

Developmental Stages ◽

Single Cells ◽

Classification Model ◽

Label Free ◽

Machine Learning Classification

Stem cells with the capability to self-renew and differentiate into multiple cell derivatives provide platforms for drug screening and promising treatment options for a wide variety of neural diseases. Nevertheless, clinical applications of stem cells have been hindered partly owing to a lack of standardized techniques to characterize cell molecular profiles noninvasively and comprehensively. Here, we demonstrate that a label-free and noninvasive single-cell Raman microspectroscopy (SCRM) platform was able to identify neural cell lineages derived from clinically relevant human induced pluripotent stem cells (hiPSCs). By analyzing the intrinsic biochemical profiles of single cells at a large scale (8,774 Raman spectra in total), iPSCs and iPSC-derived neural cells can be distinguished by their intrinsic phenotypic Raman spectra. We identified a Raman biomarker from glycogen to distinguish iPSCs from their neural derivatives, and the result was verified by the conventional glycogen detection assays. Further analysis with a machine learning classification model, utilizing t-distributed stochastic neighbor embedding (t-SNE)-enhanced ensemble stacking, clearly categorized hiPSCs in different developmental stages with 97.5% accuracy. The present study demonstrates the capability of the SCRM-based platform to monitor cell development using high content screening with a noninvasive and label-free approach. This platform as well as our identified biomarker could be extensible to other cell types and can potentially have a high impact on neural stem cell therapy.

scAIDE: clustering of large-scale single-cell RNA-seq data reveals putative and rare cell types

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa082 ◽

2020 ◽

Vol 2 (4) ◽

Author(s):

Kaikun Xie ◽

Yu Huang ◽

Feng Zeng ◽

Zehua Liu ◽

Ting Chen

Keyword(s):

Single Cell ◽

Large Scale ◽

Developmental Trajectories ◽

Cell Types ◽

Random Projection ◽

Good Representation ◽

Rna Seq ◽

Unsupervised Deep Learning ◽

High Level ◽

Computational Resources

Abstract Recent advancements in both single-cell RNA-sequencing technology and computational resources facilitate the study of cell types on global populations. Up to millions of cells can now be sequenced in one experiment; thus, accurate and efficient computational methods are needed to provide clustering and post-analysis of assigning putative and rare cell types. Here, we present a novel unsupervised deep learning clustering framework that is robust and highly scalable. To overcome the high level of noise, scAIDE first incorporates an autoencoder-imputation network with a distance-preserved embedding network (AIDE) to learn a good representation of data, and then applies a random projection hashing based k-means algorithm to accommodate the detection of rare cell types. We analyzed a 1.3 million neural cell dataset within 30 min, obtaining 64 clusters which were mapped to 19 putative cell types. In particular, we further identified three different neural stem cell developmental trajectories in these clusters. We also classified two subpopulations of malignant cells in a small glioblastoma dataset using scAIDE. We anticipate that scAIDE would provide a more in-depth understanding of cell development and diseases.

SSCC: A Novel Computational Framework for Rapid and Accurate Clustering Large-scale Single Cell RNA-seq Data

Genomics Proteomics & Bioinformatics ◽

10.1016/j.gpb.2018.10.003 ◽

2019 ◽

Vol 17 (2) ◽

pp. 201-210 ◽

Cited By ~ 5

Author(s):

Xianwen Ren ◽

Liangtao Zheng ◽

Zemin Zhang

Keyword(s):

Single Cell ◽

Large Scale ◽

Rna Seq ◽

Computational Framework