Cumulus: a cloud-based data analysis framework for large-scale single-cell and single-nucleus RNA-seq

Massively parallel single-cell and single-nucleus RNA-seq (sc/snRNA-seq) have opened the way to systematic tissue atlases in health and disease, but as the scale of data generation grows, so does the need for computational pipelines for scaled analysis. Here, we developed Cumulus, a cloud-based framework for analyzing large-scale sc/snRNA-seq datasets. Cumulus combines the power of cloud computing with improvements in algorithm implementations to achieve high scalability, low cost, user-friendliness, and integrated support for a comprehensive set of features. We benchmark Cumulus on the Human Cell Atlas Census of Immune Cells dataset of bone marrow cells and show that it substantially improves efficiency over conventional frameworks, while maintaining or improving the quality of results, enabling large-scale studies.

sNucDrop-Seq: Dissecting cell-type composition and neuronal activity state in mammalian brains by massively parallel single-nucleus RNA-Seq

10.1101/154476 ◽

2017 ◽

Author(s):

Peng Hu ◽

Emily Fabyanic ◽

Zhaolan Zhou ◽

Hao Wu

Keyword(s):

Single Cell ◽

Neuronal Activity ◽

Low Cost ◽

Single Cells ◽

High Sensitivity ◽

Droplet Microfluidics ◽

Massively Parallel ◽

Rna Seq ◽

Single Nucleus

Massively parallel single-cell RNA sequencing can precisely resolve cellular diversity in a high-throughput manner at low cost, but unbiased isolation of intact single cells from complex tissues, such as adult mammalian brains, is challenging. Here, we integrate sucrose-gradient assisted nuclear purification with droplet microfluidics to develop a highly scalable single-nucleus RNA-Seq approach (sNucDrop-Seq), which is free of enzymatic dissociation and nucleus sorting. By profiling ~11,000 nuclei isolated from adult mouse cerebral cortex, we demonstrate that sNucDrop-Seq not only accurately reveals neuronal and non-neuronal subtype composition with high sensitivity, but also enables analysis of long non-coding RNAs and transient states such as neuronal activity-dependent transcription at single-cell resolution in vivo.

Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq

Nature Methods ◽

10.1038/s41592-020-0905-x ◽

2020 ◽

Vol 17 (8) ◽

pp. 793-798 ◽

Cited By ~ 4

Author(s):

Bo Li ◽

Joshua Gould ◽

Yiming Yang ◽

Siranush Sarkizova ◽

Marcin Tabaka ◽

...

Keyword(s):

Data Analysis ◽

Single Cell ◽

Large Scale ◽

Rna Seq ◽

Single Nucleus

Scarf: A toolkit for memory efficient analysis of large-scale single-cell genomics data

10.1101/2021.05.02.441899 ◽

2021 ◽

Author(s):

Parashar Dhapola ◽

Johan Rodhe ◽

Rasmus Olofzon ◽

Thomas Bonald ◽

Eva Erlandsson ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Clustering Algorithm ◽

Single Cell Analysis ◽

Low Cost ◽

Rna Seq ◽

Representative Sampling ◽

Lineage Differentiation ◽

Memory Efficiency ◽

Memory Efficient

The increasing capacity to perform large-scale single-cell genomic experiments continues to outpace the ability to efficiently handle growing datasets. Herein we present Scarf, a modularly designed Python package that seamlessly interoperates with other single-cell toolkits and allows for memory efficient single-cell analysis of millions of cells on a laptop or low-cost devices like single board computers. We demonstrate Scarf's memory and compute-time efficiency by applying it to the largest existing single-cell RNA-Seq and ATAC-Seq datasets. Scarf wraps memory efficient implementations of a graph-based t-stochastic neighbour embedding and hierarchical clustering algorithm. Moreover, Scarf performs accurate reference-anchored mapping of datasets while maintaining memory efficiency. By implementing a novel data downsampling algorithm, Scarf additionally has the capacity to generate representative sampling of cells from a given dataset wherein rare cell populations and lineage differentiation trajectories are conserved. Together, Scarf provides a framework wherein any researcher can perform advanced processing, downsampling, reanalysis and integration of atlas-scale datasets on standard laptop computers.

Falco: A quick and flexible single-cell RNA-seq processing framework on the cloud

10.1101/064006 ◽

2016 ◽

Cited By ~ 1

Author(s):

Andrian Yang ◽

Michael Troup ◽

Peijie Lin ◽

Joshua W. K. Ho

Keyword(s):

Single Cell ◽

Large Scale ◽

Low Cost ◽

Supplementary Information ◽

Data Sets ◽

Rna Seq ◽

Single Node ◽

Big Data Technologies ◽

Amazon Web Services ◽

Supplementary Material

Summary: Single-cell RNA-seq (scRNA-seq) is increasingly used in a range of biomedical studies. Nonetheless, current RNA-seq analysis tools are not specifically designed to efficiently process scRNA-seq data due to their limited scalability. Here we introduce Falco, a cloud-based framework to enable parallelisation of existing RNA-seq processing pipelines using the big data technologies Apache Hadoop and Apache Spark for performing massively parallel analysis of large-scale transcriptomic data. Using two public scRNA-seq data sets and two popular RNA-seq alignment/feature quantification pipelines, we show that the same processing pipeline runs 2.6–145.4 times faster using Falco than running on a highly optimised single-node analysis. Falco also allows users to utilise low-cost spot instances of Amazon Web Services (AWS), providing a 65% reduction in cost of analysis. Availability: Falco is available via a GNU General Public License at https://github.com/VCCRI/Falco/. Contact: [email protected]. Supplementary information: Supplementary data are available at bioRxiv online.

FlsnRNA-seq: protoplasting-free full-length single-nucleus RNA profiling in plants

Genome Biology ◽

10.1186/s13059-021-02288-0 ◽

2021 ◽

Vol 22 (1) ◽

Cited By ~ 2

Author(s):

Yanping Long ◽

Zhijian Liu ◽

Jinbu Jia ◽

Weipeng Mo ◽

Liang Fang ◽

...

Keyword(s):

Single Cell ◽

Cell Walls ◽

Large Scale ◽

Full Length ◽

Cell Level ◽

Root Cells ◽

Rna Profiling ◽

Different Types ◽

Long Read ◽

Single Nucleus

The broad application of single-cell RNA profiling in plants has been hindered by the prerequisite of protoplasting that requires digesting the cell walls from different types of plant tissues. Here, we present a protoplasting-free approach, flsnRNA-seq, for large-scale full-length RNA profiling at a single-nucleus level in plants using isolated nuclei. Combined with 10x Genomics and Nanopore long-read sequencing, we validate the robustness of this approach in Arabidopsis root cells and the developing endosperm. Sequencing results demonstrate that it allows for uncovering alternative splicing and polyadenylation-related RNA isoform information at the single-cell level, which facilitates characterizing cell identities.

Farmer Preference to High Elevation Rice Technological Packages for Accelerating Technological Dissemination (A case Study in Humbang Hasundutan Regency)

Agro Ekonomi ◽

10.22146/ae.61367 ◽

2021 ◽

Vol 32 (2) ◽

Author(s):

Setia Sari Girsang ◽

Agung B Santosa ◽

Tommy Purba ◽

Deddy R Siagian ◽

Khadijah E Ramija

Keyword(s):

Low Cost ◽

High Elevation ◽

Primary Data ◽

Survey Method ◽

User Friendliness ◽

Level Of Satisfaction ◽

High Level ◽

Level Of Importance

Accelerating the introduction of a new technological package is needed to increase the productivity of high elevation puddled rice in Humbang Hasundutan. The objectives of the study are to determine how farmers perceive the existing technological packages and what they prefer in a new technological package. The study used a survey method with primary data gathered using questionnaires. Criteria for locations and respondents were used to obtain relevant respondents and data concerning their knowledge of high elevation puddled rice cultivation. The collected data were processed using Importance Performance Analysis to determine the level of importance and satisfaction of the indicators and the valued aspects of the technological package components. The results of the study showed that socio-economic aspects had to be heeded in organizing the technological package. Indicators having a high level of importance and a low level of satisfaction consisted of production cost, quality of seeds, farmer group empowerment, technology information institution, capital cost, agricultural tools and machines, pest control, sales price, irrigation canals, and farm roads. On the other hand, the introduction of new superior seeds, the productivity attribute, and planting age were important indicators for local farmers for improving the quality of existing seeds. Farmer groups expected the technological package to have a high level of productivity, better access to inputs, low cost, and good user-friendliness in its application.
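
The Importance Performance Analysis step mentioned above can be illustrated with a minimal sketch. It assumes the common four-quadrant form of the method, with the mean importance and mean satisfaction used as cut-offs; the abstract does not state which cut-offs the authors used. The function name, the quadrant labels, and all scores are hypothetical, and the last two indicators are invented purely to populate the remaining quadrants.

```python
from statistics import mean


def ipa_quadrants(scores):
    """Assign each indicator to an Importance Performance Analysis quadrant.

    scores maps an indicator name to (importance, satisfaction).
    Cut-offs are the means of each axis (an assumption of this sketch).
    """
    imp_cut = mean(imp for imp, _ in scores.values())
    sat_cut = mean(sat for _, sat in scores.values())
    labels = {}
    for name, (imp, sat) in scores.items():
        if imp >= imp_cut and sat < sat_cut:
            labels[name] = "concentrate here"       # important, unsatisfying
        elif imp >= imp_cut:
            labels[name] = "keep up the good work"  # important, satisfying
        elif sat < sat_cut:
            labels[name] = "low priority"           # unimportant, unsatisfying
        else:
            labels[name] = "possible overkill"      # unimportant, satisfying
    return labels


# Hypothetical scores on a 1-5 scale; not data from the study.
example = {
    "production cost": (4.6, 2.1),
    "quality of seeds": (4.4, 2.5),
    "planting age": (4.2, 4.0),
    "packaging": (2.0, 2.0),       # invented indicator
    "field signage": (2.2, 4.1),   # invented indicator
}

for indicator, quadrant in ipa_quadrants(example).items():
    print(f"{indicator}: {quadrant}")
```

Indicators that land in the "concentrate here" quadrant correspond to the high-importance, low-satisfaction group that the abstract lists.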

Comparative analysis of antibody- and lipid-based multiplexing methods for single-cell RNA-seq

10.1101/2020.11.16.384222 ◽

2020 ◽

Author(s):

Viacheslav Mylka ◽

Jeroen Aerts ◽

Irina Matetovici ◽

Suresh Poovathingal ◽

Niels Vandamme ◽

...

Keyword(s):

Genetic Variation ◽

Comparative Analysis ◽

Single Cell ◽

Cell Lines ◽

Clinical Studies ◽

Clinical Samples ◽

Rna Seq ◽

Batch Effects ◽

Single Cell Sequencing ◽

Single Nucleus

Multiplexing of samples in single-cell RNA-seq studies allows significant reduction of experimental costs, straightforward identification of doublets, increased cell throughput, and reduction of sample-specific batch effects. Recently published multiplexing techniques using oligo-conjugated antibodies or lipids allow barcoding sample-specific cells, a process called ‘hashing’. Here, we compare the hashing performance of TotalSeq-A and -C antibodies, custom synthesized lipids and MULTI-seq lipid hashes in four cell lines, both for single-cell RNA-seq and single-nucleus RNA-seq. Hashing efficiency was evaluated using the intrinsic genetic variation of the cell lines. Benchmarking of different hashing strategies and computational pipelines indicates that correct demultiplexing can be achieved with both lipid- and antibody-hashed human cells and nuclei, with MULTISeqDemux as the preferred demultiplexing function and antibody-based hashing as the most efficient protocol on cells. Antibody hashing was further evaluated on clinical samples using PBMCs from healthy and SARS-CoV-2 infected patients, where we demonstrate a more affordable approach for large single-cell sequencing clinical studies, while simultaneously reducing batch effects.
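
To make the idea of hash demultiplexing concrete, here is a deliberately simple sketch. It is not MULTISeqDemux and not any pipeline benchmarked in the study; it is a naive top-two rule chosen only for illustration. The function name, the thresholds `min_count` and `doublet_ratio`, and the example counts are all hypothetical.

```python
import numpy as np


def demultiplex(hash_counts, min_count=10, doublet_ratio=0.5):
    """Call a sample label per cell from a cells x hashtags count matrix.

    Naive rule (an assumption of this sketch, needs at least two hashtags):
    - "negative" if the strongest hashtag has fewer than min_count reads,
    - "doublet" if the second hashtag reaches doublet_ratio of the strongest,
    - otherwise "sample_<index of strongest hashtag>".
    """
    counts = np.asarray(hash_counts, dtype=float)
    order = np.argsort(counts, axis=1)       # ascending order per cell
    rows = np.arange(counts.shape[0])
    top_idx = order[:, -1]
    top = counts[rows, top_idx]
    second = counts[rows, order[:, -2]]

    calls = []
    for idx, best, runner_up in zip(top_idx, top, second):
        if best < min_count:
            calls.append("negative")
        elif runner_up >= doublet_ratio * best:
            calls.append("doublet")
        else:
            calls.append(f"sample_{idx}")
    return calls


# Hypothetical hashtag counts for four droplets and four hashtags.
example_counts = [
    [200, 3, 5, 1],
    [2, 1, 0, 3],
    [150, 140, 2, 4],
    [4, 2, 300, 1],
]
print(demultiplex(example_counts))
```

Published demultiplexing functions model the background distribution of each hashtag rather than relying on fixed thresholds, which is one reason their performance can differ between cells and nuclei.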

Predicting heterogeneity in clone-specific therapeutic vulnerabilities using single-cell transcriptomic signatures

Genome Medicine ◽

10.1186/s13073-021-01000-y ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Chayaporn Suphavilai ◽

Shumei Chia ◽

Ankur Sharma ◽

Lorna Tu ◽

Rafael Peres Da Silva ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Drug Response ◽

Drug Repurposing ◽

High Accuracy ◽

Molecular Heterogeneity ◽

Precision Oncology ◽

Scale Analysis ◽

Rna Seq ◽

Large Scale Analysis

While understanding molecular heterogeneity across patients underpins precision oncology, there is increasing appreciation for taking intra-tumor heterogeneity into account. Based on large-scale analysis of cancer omics datasets, we highlight the importance of intra-tumor transcriptomic heterogeneity (ITTH) for predicting clinical outcomes. Leveraging single-cell RNA-seq (scRNA-seq) with a recommender system (CaDRReS-Sc), we show that heterogeneous gene-expression signatures can predict drug response with high accuracy (80%). Using patient-proximal cell lines, we established the validity of CaDRReS-Sc’s monotherapy (Pearson r>0.6) and combinatorial predictions targeting clone-specific vulnerabilities (>10% improvement). Applying CaDRReS-Sc to rapidly expanding scRNA-seq compendiums can serve as an in silico screen to accelerate drug-repurposing studies. Availability: https://github.com/CSB5/CaDRReS-Sc.

Leveraging high-powered RNA-Seq datasets to improve inference of regulatory activity in single-cell RNA-Seq data

10.1101/553040 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ning Wang ◽

Andrew E. Teschendorff

Keyword(s):

Transcription Factors ◽

Single Cell ◽

Cell Fate ◽

Regulatory Networks ◽

Large Scale ◽

Single Cells ◽

Differential Expression Analysis ◽

Dropout Rate ◽

Rna Seq ◽

Regulatory Activity

Inferring the activity of transcription factors in single cells is a key task to improve our understanding of development and complex genetic diseases. This task is, however, challenging due to the relatively large dropout rate and noisy nature of single-cell RNA-Seq data. Here we present a novel statistical inference framework called SCIRA (Single Cell Inference of Regulatory Activity), which leverages the power of large-scale bulk RNA-Seq datasets to infer high-quality tissue-specific regulatory networks, from which regulatory activity estimates in single cells can be subsequently obtained. We show that SCIRA can correctly infer regulatory activity of transcription factors affected by high technical dropouts. In particular, SCIRA can improve sensitivity by as much as 70% compared to differential expression analysis and current state-of-the-art methods. Importantly, SCIRA can reveal novel regulators of cell-fate in tissue-development, even for cell-types that only make up 5% of the tissue, and can identify key novel tumor suppressor genes in cancer at single cell resolution. In summary, SCIRA will be an invaluable tool for single-cell studies aiming to accurately map activity patterns of key transcription factors during development, and how these are altered in disease.

Enhancing droplet-based single-nucleus RNA-seq resolution using the semi-supervised machine learning classifier DIEM

10.1101/786285 ◽

2019 ◽

Cited By ~ 4

Author(s):

Marcus Alvarez ◽

Elior Rahmani ◽

Brandon Jew ◽

Kristina M. Garske ◽

Zong Miao ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Supervised Machine Learning ◽

Data Sets ◽

Rna Seq ◽

Novel Approach ◽

Single Nucleus ◽

Downstream Analysis

Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. Contrary to single-cell RNA-seq (scRNA-seq), we observe that snRNA-seq is commonly subject to contamination by high amounts of extranuclear background RNA, which can lead to identification of spurious cell types in downstream clustering analyses if overlooked. We present a novel approach to remove debris-contaminated droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: 1) human differentiating preadipocytes in vitro, 2) fresh mouse brain tissue, and 3) human frozen adipose tissue (AT) from six individuals. All three data sets showed various degrees of extranuclear RNA contamination. We observed that existing methods failed to account for contaminated droplets and led to spurious cell types. When compared to filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq data, we also successfully applied DIEM to single-cell data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data quickly and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.
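
The abstract describes modelling debris and cell-type expression distributions and estimating them with EM. The sketch below shows that general idea on a toy problem. It is not the DIEM implementation. It assumes a mixture of just two multinomial components (debris and a single cell type), initialized by a median split on total counts; the function name, the initialization, the pseudocount, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def em_two_multinomials(X, n_iter=50, pseudocount=1e-2):
    """Fit a two-component multinomial mixture to a droplets x genes matrix.

    Component 0 plays the role of debris, component 1 of nuclei.
    Returns (responsibilities, mixing weights, gene profiles).
    """
    X = np.asarray(X, dtype=float)
    totals = X.sum(axis=1)

    # Assumed initialization: low-count droplets start as debris.
    resp = np.zeros((X.shape[0], 2))
    low = totals < np.median(totals)
    resp[low, 0] = 1.0
    resp[~low, 1] = 1.0

    for _ in range(n_iter):
        # M-step: update mixing weights and per-component gene profiles.
        pi = resp.mean(axis=0)
        profiles = resp.T @ X + pseudocount
        profiles /= profiles.sum(axis=1, keepdims=True)

        # E-step: posterior membership from multinomial log-likelihoods
        # (the multinomial coefficient cancels between components).
        log_r = np.log(pi + 1e-300) + X @ np.log(profiles).T
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)

    return resp, pi, profiles


# Simulated toy data: 100 debris droplets and 100 nuclei over 20 genes.
rng = np.random.default_rng(0)
debris_p = np.r_[np.full(10, 0.09), np.full(10, 0.01)]
cell_p = debris_p[::-1]
X = np.vstack([
    rng.multinomial(50, debris_p, size=100),   # few reads, debris profile
    rng.multinomial(500, cell_p, size=100),    # many reads, cell profile
])

resp, pi, profiles = em_two_multinomials(X)
is_debris = resp[:, 0] > 0.5
print("droplets called debris:", int(is_debris.sum()))
```

Real data contain several cell types and far less distinct profiles than this simulation, so the published method should be consulted for how the components are defined and initialized.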
