Falco: A quick and flexible single-cell RNA-seq processing framework on the cloud

AbstractSummarySingle-cell RNA-seq (scRNA-seq) is increasingly used in a range of biomedical studies. Nonetheless, current RNA-seq analysis tools are not specifically designed to efficiently process scRNA-seq data due to their limited scalability. Here we introduce Falco, a cloud-based framework to enable paralellisation of existing RNA-seq processing pipelines using big data technologies of Apache Hadoop and Apache Spark for performing massively parallel analysis of large scale transcriptomic data. Using two public scRNA-seq data sets and two popular RNA-seq alignment/feature quantification pipelines, we show that the same processing pipeline runs 2.6 – 145.4 times faster using Falco than running on a highly optimised single node analysis. Falco also allows user to the utilise low-cost spot instances of Amazon Web Services (AWS), providing a 65% reduction in cost of analysis.AvailabilityFalco is available via a GNU General Public License at https://github.com/VCCRI/Falco/[email protected] informationSupplementary data are available at BioRXiv online.

Download Full-text

SIMLR: a tool for large-scale single-cell analysis by multi-kernel learning

10.1101/118901 ◽

2017 ◽

Cited By ~ 9

Author(s):

Bo Wang ◽

Daniele Ramazzotti ◽

Luca De Sano ◽

Junjie Zhu ◽

Emma Pierson ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Single Cell Analysis ◽

R Package ◽

Supplementary Information ◽

Cell Analysis ◽

Rna Seq ◽

A Cell ◽

Supplementary Material ◽

Public Datasets

AbstractMotivationWe here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better a visualization.Availability and ImplementationSIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on [email protected] or [email protected] InformationSupplementary data are available at Bioinformatics online.

Download Full-text

Single cell network analysis with a mixture of Nested Effects Models

10.1101/258202 ◽

2018 ◽

Author(s):

Martin Pirkl ◽

Niko Beerenwinkel

Keyword(s):

Single Cell ◽

New Technologies ◽

Single Cells ◽

R Package ◽

Supplementary Information ◽

Data Sets ◽

Cell Network ◽

A Cell ◽

Supplementary Material ◽

Cell Data

AbstractMotivationNew technologies allow for the elaborate measurement of different traits of single cells. These data promise to elucidate intra-cellular networks in unprecedented detail and further help to improve treatment of diseases like cancer. However, cell populations can be very heterogeneous.ResultsWe developed a mixture of Nested Effects Models (M&NEM) for single-cell data to simultaneously identify different cellular sub-populations and their corresponding causal networks to explain the heterogeneity in a cell population. For inference, we assign each cell to a network with a certain probability and iteratively update the optimal networks and cell probabilities in an Expectation Maximization scheme. We validate our method in the controlled setting of a simulation study and apply it to three data sets of pooled CRISPR screens generated previously by two novel experimental techniques, namely Crop-Seq and Perturb-Seq.AvailabilityThe mixture Nested Effects Model (M&NEM) is available as the R-package mnem at https://github.com/cbgethz/mnem/[email protected], [email protected] informationSupplementary data are available.online.

Download Full-text

Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement

10.1101/299792 ◽

2018 ◽

Cited By ~ 1

Author(s):

Lucas Czech ◽

Alexandros Stamatakis

Keyword(s):

Large Scale ◽

Sequence Data ◽

Sequence Similarity ◽

Computational Effort ◽

Supplementary Information ◽

Data Sets ◽

Metagenomic Sequencing ◽

Sequencing Studies ◽

Manual Selection ◽

Supplementary Material

AbstractMotivationIn most metagenomic sequencing studies, the initial analysis step consists in assessing the evolutionary provenance of the sequences. Phylogenetic (or Evolutionary) Placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do however face certain limitations: The manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; the number of taxa in the reference phylogeny should be small enough to allow for visually inspecting the results.ResultsWe present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence data sets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results.ImplementationFreely available under GPLv3 at http://github.com/lczech/[email protected] InformationSupplementary data are available at Bioinformatics online.

Download Full-text

Differential transcript usage analysis of bulk and single-cell RNA-seq data with DTUrtle

Bioinformatics ◽

10.1093/bioinformatics/btab629 ◽

2021 ◽

Author(s):

Tobias Tekath ◽

Martin Dugas

Keyword(s):

Single Cell ◽

Transcript Level ◽

R Package ◽

Supplementary Information ◽

Data Sets ◽

Rna Seq ◽

Cell Type ◽

Gene Level ◽

Analysis Workflow ◽

Usage Analysis

Abstract Motivation Each year, the number of published bulk and single-cell RNA-seq data sets is growing exponentially. Studies analyzing such data are commonly looking at gene-level differences, while the collected RNA-seq data inherently represents reads of transcript isoform sequences. Utilizing transcriptomic quantifiers, RNA-seq reads can be attributed to specific isoforms, allowing for analysis of transcript-level differences. A differential transcript usage (DTU) analysis is testing for proportional differences in a gene’s transcript composition, and has been of rising interest for many research questions, such as analysis of differential splicing or cell type identification. Results We present the R package DTUrtle, the first DTU analysis workflow for both bulk and single-cell RNA-seq data sets, and the first package to conduct a ‘classical’ DTU analysis in a single-cell context. DTUrtle extends established statistical frameworks, offers various result aggregation and visualization options and a novel detection probability score for tagged-end data. It has been successfully applied to bulk and single-cell RNA-seq data of human and mouse, confirming and extending key results. Additionally, we present novel potential DTU applications like the identification of cell type specific transcript isoforms as biomarkers. Availability The R package DTUrtle is available at https://github.com/TobiTekath/DTUrtle with extensive vignettes and documentation at https://tobitekath.github.io/DTUrtle/. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

scTIM: seeking cell-type-indicative marker from single cell RNA-seq data by consensus optimization

Bioinformatics ◽

10.1093/bioinformatics/btz936 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2474-2485 ◽

Cited By ~ 2

Author(s):

Zhanying Feng ◽

Xianwen Ren ◽

Yuan Fang ◽

Yining Yin ◽

Chutian Huang ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Cell Types ◽

Mouse Cell ◽

Supplementary Information ◽

Rna Seq ◽

Cell Type ◽

Robust Solution ◽

Development Trajectory ◽

Consensus Optimization

Abstract Motivation Single cell RNA-seq data offers us new resource and resolution to study cell type identity and its conversion. However, data analyses are challenging in dealing with noise, sparsity and poor annotation at single cell resolution. Detecting cell-type-indicative markers is promising to help denoising, clustering and cell type annotation. Results We developed a new method, scTIM, to reveal cell-type-indicative markers. scTIM is based on a multi-objective optimization framework to simultaneously maximize gene specificity by considering gene-cell relationship, maximize gene’s ability to reconstruct cell–cell relationship and minimize gene redundancy by considering gene–gene relationship. Furthermore, consensus optimization is introduced for robust solution. Experimental results on three diverse single cell RNA-seq datasets show scTIM’s advantages in identifying cell types (clustering), annotating cell types and reconstructing cell development trajectory. Applying scTIM to the large-scale mouse cell atlas data identifies critical markers for 15 tissues as ‘mouse cell marker atlas’, which allows us to investigate identities of different tissues and subtle cell types within a tissue. scTIM will serve as a useful method for single cell RNA-seq data mining. Availability and implementation scTIM is freely available at https://github.com/Frank-Orwell/scTIM. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Cloud Computing Enabled Big Multi-Omics Data Analytics

Bioinformatics and Biology Insights ◽

10.1177/11779322211035921 ◽

2021 ◽

Vol 15 ◽

pp. 117793222110359

Author(s):

Saraswati Koppad ◽

Annappa B ◽

Georgios V Gkoutos ◽

Animesh Acharjee

Keyword(s):

Cloud Computing ◽

Data Analytics ◽

Large Scale ◽

Low Cost ◽

Data Sets ◽

Omics Data ◽

Proteomics Data ◽

Phenotypic Data ◽

Multifactorial Diseases ◽

Big Data Technologies

High-throughput experiments enable researchers to explore complex multifactorial diseases through large-scale analysis of omics data. Challenges for such high-dimensional data sets include storage, analyses, and sharing. Recent innovations in computational technologies and approaches, especially in cloud computing, offer a promising, low-cost, and highly flexible solution in the bioinformatics domain. Cloud computing is rapidly proving increasingly useful in molecular modeling, omics data analytics (eg, RNA sequencing, metabolomics, or proteomics data sets), and for the integration, analysis, and interpretation of phenotypic data. We review the adoption of advanced cloud-based and big data technologies for processing and analyzing omics data and provide insights into state-of-the-art cloud bioinformatics applications.

Download Full-text

Cumulus: a cloud-based data analysis framework for large-scale single-cell and single-nucleus RNA-seq

10.1101/823682 ◽

2019 ◽

Cited By ~ 8

Author(s):

Bo Li ◽

Joshua Gould ◽

Yiming Yang ◽

Siranush Sarkizova ◽

Marcin Tabaka ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Bone Marrow Cells ◽

Low Cost ◽

Data Generation ◽

Rna Seq ◽

User Friendliness ◽

Health And Disease ◽

Single Nucleus

AbstractMassively parallel single-cell and single-nucleus RNA-seq (sc/snRNA-seq) have opened the way to systematic tissue atlases in health and disease, but as the scale of data generation is growing, so does the need for computational pipelines for scaled analysis. Here, we developed Cumulus, a cloud-based framework for analyzing large scale sc/snRNA-seq datasets. Cumulus combines the power of cloud computing with improvements in algorithm implementations to achieve high scalability, low cost, user-friendliness, and integrated support for a comprehensive set of features. We benchmark Cumulus on the Human Cell Atlas Census of Immune Cells dataset of bone marrow cells and show that it substantially improves efficiency over conventional frameworks, while maintaining or improving the quality of results, enabling large-scale studies.

Download Full-text

scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R

10.1101/069633 ◽

2016 ◽

Cited By ~ 10

Author(s):

Davis J. McCarthy ◽

Kieran R. Campbell ◽

Aaron T. L. Lun ◽

Quin F. Wills

Keyword(s):

Quality Control ◽

Single Cell ◽

Sequence Data ◽

Supplementary Information ◽

Processing Quality ◽

Rna Seq ◽

Study Gene Expression ◽

Supplementary Material ◽

Downstream Analysis ◽

Cell Data

AbstractMotivationSingle-cell RNA sequencing (scRNA-seq) is increasingly used to study gene expression at the level of individual cells. However, preparing raw sequence data for further analysis is not a straightforward process. Biases, artifacts, and other sources of unwanted variation are present in the data, requiring substantial time and effort to be spent on pre-processing, quality control (QC) and normalisation.ResultsWe have developed the R/Bioconductor package scater to facilitate rigorous pre-processing, quality control, normalisation and visualisation of scRNA-seq data. The package provides a convenient, flexible workflow to process raw sequencing reads into a high-quality expression dataset ready for downstream analysis. scater provides a rich suite of plotting tools for single-cell data and a flexible data structure that is compatible with existing tools and can be used as infrastructure for future software development.AvailabilityThe open-source code, along with installation instructions, vignettes and case studies, is available through Bioconductor at http://bioconductor.org/packages/scater.Supplementary informationSupplementary material is available online at bioRxiv accompanying this manuscript, and all materials required to reproduce the results presented in this paper are available at dx.doi.org/10.5281/zenodo.60139.

Download Full-text

Scarf: A toolkit for memory efficient analysis of large-scale single-cell genomics data

10.1101/2021.05.02.441899 ◽

2021 ◽

Author(s):

Parashar Dhapola ◽

Johan Rodhe ◽

Rasmus Olofzon ◽

Thomas Bonald ◽

Eva Erlandsson ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Clustering Algorithm ◽

Single Cell Analysis ◽

Low Cost ◽

Rna Seq ◽

Representative Sampling ◽

Lineage Differentiation ◽

Memory Efficiency ◽

Memory Efficient

The increasing capacity to perform large-scale single-cell genomic experiments continues to outpace the ability to efficiently handle growing datasets. Herein we present Scarf, a modularly designed Python package that seamlessly interoperates with other single-cell toolkits and allows for memory efficient single-cell analysis of millions of cells on a laptop or low-cost devices like single board computers. We demonstrate Scarf's memory and compute-time efficiency by applying it to the largest existing single-cell RNA-Seq and ATAC-Seq datasets. Scarf wraps memory efficient implementations of a graph-based t-stochastic neighbour embedding and hierarchical clustering algorithm. Moreover, Scarf performs accurate reference-anchored mapping of datasets while maintaining memory efficiency. By implementing a novel data downsampling algorithm, Scarf additionally has the capacity to generate representative sampling of cells from a given dataset wherein rare cell populations and lineage differentiation trajectories are conserved. Together, Scarf provides a framework wherein any researcher can perform advanced processing, downsampling, reanalysis and integration of atlas-scale datasets on standard laptop computers.

Download Full-text

ascend: R package for analysis of single cell RNA-seq data

10.1101/207704 ◽

2017 ◽

Cited By ~ 11

Author(s):

Anne Senabouth ◽

Samuel W Lukowski ◽

Jose Alquicira Hernandez ◽

Stacey Andersen ◽

Xin Mei ◽

...

Keyword(s):

Single Cell ◽

R Package ◽

Computational Genomics ◽

Supplementary Information ◽

Rna Seq ◽

Software Packages ◽

Wide Range ◽

Flexible Framework ◽

Supplementary Material ◽

Data Objects

AbstractSummaryascend is an R package comprised of fast, streamlined analysis functions optimized to address the statistical challenges of single cell RNA-seq. The package incorporates novel and established methods to provide a flexible framework to perform filtering, quality control, normalization, dimension reduction, clustering, differential expression and a wide-range of plotting. ascend is designed to work with scRNA-seq data generated by any high-throughput platform, and includes functions to convert data objects between software packages.AvailabilityThe R package and associated vignettes are freely available at https://github.com/IMB-Computational-Genomics-Lab/[email protected] informationAn example dataset is available at ArrayExpress, accession number E-MTAB-6108

Download Full-text