Scarf: A toolkit for memory efficient analysis of large-scale single-cell genomics data

2021 ◽  
Author(s):  
Parashar Dhapola ◽  
Johan Rodhe ◽  
Rasmus Olofzon ◽  
Thomas Bonald ◽  
Eva Erlandsson ◽  
...  

The increasing capacity to perform large-scale single-cell genomic experiments continues to outpace the ability to efficiently handle growing datasets. Herein we present Scarf, a modularly designed Python package that seamlessly interoperates with other single-cell toolkits and allows for memory-efficient single-cell analysis of millions of cells on a laptop or low-cost devices like single-board computers. We demonstrate Scarf's memory and compute-time efficiency by applying it to the largest existing single-cell RNA-Seq and ATAC-Seq datasets. Scarf wraps memory-efficient implementations of a graph-based t-stochastic neighbour embedding and hierarchical clustering algorithm. Moreover, Scarf performs accurate reference-anchored mapping of datasets while maintaining memory efficiency. By implementing a novel data downsampling algorithm, Scarf additionally has the capacity to generate a representative sampling of cells from a given dataset wherein rare cell populations and lineage differentiation trajectories are conserved. Together, Scarf provides a framework wherein any researcher can perform advanced processing, downsampling, reanalysis and integration of atlas-scale datasets on standard laptop computers.

2017 ◽  
Author(s):  
Bo Wang ◽  
Daniele Ramazzotti ◽  
Luca De Sano ◽  
Junjie Zhu ◽  
Emma Pierson ◽  
...  

Motivation: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better visualization. Availability and Implementation: SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package. Contact: [email protected] or [email protected]. Supplementary Information: Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Bo Li ◽  
Joshua Gould ◽  
Yiming Yang ◽  
Siranush Sarkizova ◽  
Marcin Tabaka ◽  
...  

Massively parallel single-cell and single-nucleus RNA-seq (sc/snRNA-seq) have opened the way to systematic tissue atlases in health and disease, but as the scale of data generation grows, so does the need for computational pipelines for scaled analysis. Here, we developed Cumulus, a cloud-based framework for analyzing large-scale sc/snRNA-seq datasets. Cumulus combines the power of cloud computing with improvements in algorithm implementations to achieve high scalability, low cost, user-friendliness, and integrated support for a comprehensive set of features. We benchmark Cumulus on the Human Cell Atlas Census of Immune Cells dataset of bone marrow cells and show that it substantially improves efficiency over conventional frameworks, while maintaining or improving the quality of results, enabling large-scale studies.


2016 ◽  
Author(s):  
Jinzhou Yuan ◽  
Peter A. Sims

Recent developments have enabled rapid, inexpensive RNA sequencing of thousands of individual cells from a single specimen, raising the possibility of unbiased and comprehensive expression profiling from complex tissues. Microwell arrays are a particularly attractive microfluidic platform for single-cell analysis due to their scalability, cell capture efficiency, and compatibility with imaging. We report an automated microwell array platform for single-cell RNA-Seq with significantly improved performance over previous implementations. We demonstrate cell capture efficiencies of >50%, compatibility with commercially available barcoded mRNA capture beads, and parallel expression profiling from thousands of individual cells. We evaluate the level of cross-contamination in our platform both by tracking fluorescent cell lysate in sealed microwells and by a human-mouse mixed-species RNA-Seq experiment. Finally, we apply our system to comprehensively assess heterogeneity in gene expression of patient-derived glioma neurospheres and uncover subpopulations similar to those observed in human glioma tissue.


2019 ◽  
Author(s):  
Koki Tsuyuzaki ◽  
Hiroyuki Sato ◽  
Kenta Sato ◽  
Itoshi Nikaido

Principal component analysis (PCA) is an essential method for analyzing single-cell RNA-seq (scRNA-seq) datasets, but large-scale scRNA-seq datasets require long computational times and a large memory capacity. In this work, we review 21 fast and memory-efficient PCA implementations (10 algorithms) and evaluate their application using 4 real and 18 synthetic datasets. Our benchmarking showed that some PCA algorithms are faster, more memory efficient, and more accurate than others. In consideration of the differences in the computational environments of users and developers, we have also developed guidelines to assist with selection of appropriate PCA implementations.


2016 ◽  
Author(s):  
Andrian Yang ◽  
Michael Troup ◽  
Peijie Lin ◽  
Joshua W. K. Ho

Summary: Single-cell RNA-seq (scRNA-seq) is increasingly used in a range of biomedical studies. Nonetheless, current RNA-seq analysis tools are not specifically designed to efficiently process scRNA-seq data due to their limited scalability. Here we introduce Falco, a cloud-based framework that enables parallelisation of existing RNA-seq processing pipelines using the big data technologies Apache Hadoop and Apache Spark for massively parallel analysis of large-scale transcriptomic data. Using two public scRNA-seq data sets and two popular RNA-seq alignment/feature quantification pipelines, we show that the same processing pipeline runs 2.6 to 145.4 times faster using Falco than running on a highly optimised single-node analysis. Falco also allows users to utilise low-cost spot instances of Amazon Web Services (AWS), providing a 65% reduction in cost of analysis. Availability: Falco is available via a GNU General Public License at https://github.com/VCCRI/Falco/. Contact: [email protected]. Supplementary information: Supplementary data are available at bioRxiv online.


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Chayaporn Suphavilai ◽  
Shumei Chia ◽  
Ankur Sharma ◽  
Lorna Tu ◽  
Rafael Peres Da Silva ◽  
...  

While understanding molecular heterogeneity across patients underpins precision oncology, there is increasing appreciation for taking intra-tumor heterogeneity into account. Based on large-scale analysis of cancer omics datasets, we highlight the importance of intra-tumor transcriptomic heterogeneity (ITTH) for predicting clinical outcomes. Leveraging single-cell RNA-seq (scRNA-seq) with a recommender system (CaDRReS-Sc), we show that heterogeneous gene-expression signatures can predict drug response with high accuracy (80%). Using patient-proximal cell lines, we established the validity of CaDRReS-Sc’s monotherapy (Pearson r>0.6) and combinatorial predictions targeting clone-specific vulnerabilities (>10% improvement). Applying CaDRReS-Sc to rapidly expanding scRNA-seq compendiums can serve as an in silico screen to accelerate drug-repurposing studies. Availability: https://github.com/CSB5/CaDRReS-Sc.


2019 ◽  
Author(s):  
Ning Wang ◽  
Andrew E. Teschendorff

Inferring the activity of transcription factors in single cells is a key task to improve our understanding of development and complex genetic diseases. This task is, however, challenging due to the relatively large dropout rate and noisy nature of single-cell RNA-Seq data. Here we present a novel statistical inference framework called SCIRA (Single Cell Inference of Regulatory Activity), which leverages the power of large-scale bulk RNA-Seq datasets to infer high-quality tissue-specific regulatory networks, from which regulatory activity estimates in single cells can be subsequently obtained. We show that SCIRA can correctly infer regulatory activity of transcription factors affected by high technical dropouts. In particular, SCIRA can improve sensitivity by as much as 70% compared to differential expression analysis and current state-of-the-art methods. Importantly, SCIRA can reveal novel regulators of cell-fate in tissue-development, even for cell-types that only make up 5% of the tissue, and can identify key novel tumor suppressor genes in cancer at single cell resolution. In summary, SCIRA will be an invaluable tool for single-cell studies aiming to accurately map activity patterns of key transcription factors during development, and how these are altered in disease.


2019 ◽  
Author(s):  
Anna Danese ◽  
Maria L. Richter ◽  
David S. Fischer ◽  
Fabian J. Theis ◽  
Maria Colomé-Tatché

Epigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics; however, single-cell omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell mouse brain atlases of DNA methylation, ATAC-seq and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels by orthogonal epigenetic information.


2018 ◽  
Vol 9 (1) ◽  
Author(s):  
William Stephenson ◽  
Laura T. Donlin ◽  
Andrew Butler ◽  
Cristina Rozo ◽  
Bernadette Bracken ◽  
...  

2020 ◽  
Author(s):  
Naim Al Mahi ◽  
Erik Y. Zhang ◽  
Susan Sherman ◽  
Jane J. Yu ◽  
Mario Medvedovic

Lymphangioleiomyomatosis (LAM) is a rare pulmonary disease affecting women of childbearing age that is characterized by the aberrant proliferation of smooth-muscle (SM)-like cells and emphysema-like lung remodeling. In LAM, mutations in the TSC1 or TSC2 genes result in the activation of the mechanistic target of rapamycin complex 1 (mTORC1), and thus sirolimus, an mTORC1 inhibitor, has been approved by the FDA to treat LAM patients. Sirolimus stabilizes lung function and improves symptoms. However, the disease recurs with discontinuation of the drug, potentially because of the sirolimus-induced refractoriness of the LAM cells. Therefore, there is a critical need to identify remission-inducing cytocidal treatments for LAM. The recently released Library of Integrated Network-based Cellular Signatures (LINCS) L1000 transcriptional signatures of chemical perturbations have opened new avenues to study cellular responses to existing drugs and new bioactive compounds. Connecting the transcriptional signature of a disease to these chemical perturbation signatures to identify bioactive chemicals that can “revert” the disease signatures can lead to novel drug discovery. We developed methods for constructing disease transcriptional signatures and performing connectivity analysis using single-cell RNA-seq data. The methods were applied in the analysis of scRNA-seq data of naïve and sirolimus-treated LAM cells. The single-cell connectivity analyses implicated mTORC1 inhibitors as capable of reverting the LAM transcriptional signatures, while the corresponding standard bulk analysis did not. This indicates the importance of using single-cell analysis in constructing disease signatures. The analysis also implicated other classes of drugs, CDK, MEK/MAPK and EGFR/JAK inhibitors, as potential therapeutic agents for LAM.

