K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

2019
Author(s): Christina Huan Shi, Kevin Y. Yip

Abstract: K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second-best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data that were highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used to preview cell clusters for early detection of potential data problems before running a much more time-consuming full analysis pipeline.
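
The idea of counting k-mers and discarding likely false ones can be illustrated with a minimal sketch. This is not CQF-deNoise: it uses a plain Python `Counter` instead of a counting quotient filter, and it removes rare k-mers once at the end rather than dynamically during counting. The function names and the `min_count` threshold are assumptions made for illustration only.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_rare_kmers(counts, min_count=2):
    """Keep only k-mers seen at least min_count times (hypothetical threshold).

    K-mers seen very few times are more likely to come from sequencing errors.
    """
    return Counter({kmer: c for kmer, c in counts.items() if c >= min_count})


# Toy reads: the third read differs from the others by one base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
filtered = drop_rare_kmers(counts, min_count=2)
```

In this toy input, the k-mers unique to the third read occur once each and are dropped, which is the memory-saving effect the abstract describes in a much more sophisticated form.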

2021, Vol 22 (1)
Author(s): Boying Gong, Yun Zhou, Elizabeth Purdom

Abstract: A growing number of single-cell sequencing platforms enable joint profiling of multiple omics from the same cells. We present , a novel method that not only allows for analyzing the data from joint-modality platforms, but provides a coherent framework for the integration of multiple datasets measured on different modalities. We demonstrate its performance on multi-modality data of gene expression and chromatin accessibility and illustrate the integration abilities of by jointly analyzing this multi-modality data with single-cell RNA-seq and ATAC-seq datasets.


2017
Author(s): Saskia Freytag, Ingrid Lonnstedt, Milica Ng, Melanie Bahlo

Abstract: The commercially available 10X Genomics protocol for generating droplet-based single-cell RNA-seq (scRNA-seq) data is enjoying growing popularity among researchers. Fundamental to the analysis of such scRNA-seq data is the ability to cluster similar or identical cells into non-overlapping groups. Many competing methods have been proposed for this task, but there is currently little guidance as to which method is most accurate. Answering this question is complicated by the fact that 10X Genomics data lack cell labels that would allow a direct performance evaluation. Thus, in this review, we focused on comparing the clustering solutions of a dozen methods for three datasets on human peripheral mononuclear cells generated with the 10X Genomics technology. While clustering solutions appeared robust, we found that solutions produced by different methods have little in common with each other. They also failed to replicate cell type assignments generated with supervised labeling approaches. Furthermore, we demonstrate that all clustering methods tested clustered cells to a large degree according to the amount of genes coding for ribosomal proteins in each cell.


2020, Vol 8 (Suppl 3), pp. A520-A520
Author(s): Son Pham, Tri Le, Tan Phan, Minh Pham, Huy Nguyen, ...

Background: Single-cell sequencing technology has opened an unprecedented ability to interrogate cancer. It reveals significant insights into intratumoral heterogeneity, metastasis, and therapeutic resistance, which facilitates target discovery and validation in cancer treatment. With rapid advancements in throughput and strategies, a single immuno-oncology study can produce multi-omics profiles for several thousand individual cells. This overflow of single-cell data poses formidable challenges, including standardizing data formats across studies and performing reanalysis of individual datasets and meta-analysis.
Methods: N/A
Results: We present BioTuring Browser, an interactive platform for accessing and reanalyzing published single-cell omics data. The platform currently hosts a curated database of more than 10 million cells from 247 projects, covering more than 120 immune cell types and subtypes and 15 different cancer types. All data are processed and annotated with standardized labels of cell types, diseases, therapeutic responses, etc., so that they can be instantly accessed and explored in a uniform visualization and analytics interface. Based on this massive curated database, BioTuring Browser supports searching for similar expression profiles, querying a target across datasets, and automatic cell type annotation. The platform supports single-cell RNA-seq, CITE-seq, and TCR-seq data. BioTuring Browser is now available for download at www.bioturing.com.
Conclusions: N/A


2020, Vol 22 (Supplement_2), pp. ii112-ii112
Author(s): Pravesh Gupta, Minghao Dang, Krishna Bojja, Tuan Tran M, Huma Shehwana, ...

Abstract: The brain tumor immune microenvironment (TIME) continuously evolves during glioma progression, and a comprehensive understanding of the glioma-centric immune cell repertoire beyond a priori cell types and/or states is uncharted. Consequently, we performed single-cell RNA-sequencing on ~123,000 tumor-derived immune cells from 17 pathologically stratified, IDH (isocitrate dehydrogenase)-differential primary, recurrent human gliomas, and non-glioma brains. Our analysis delineated 34 predominant myeloid cell clusters (~75%) over 28 lymphoid cell clusters (~25%), reflecting enormous heterogeneity within and across gliomas. The glioma immune diversity spanned functionally imprinted phagocytic, antigen-presenting, hypoxia, angiogenesis, and tumoricidal myeloid to classical cytotoxic lymphoid subpopulations. Specifically, IDH-mutant gliomas were enriched for brain-resident microglial subpopulations, in contrast to enhanced bone marrow-derived infiltrates in IDH-wild type gliomas, especially in a recurrent setting. Microglia attrition in IDH-wild type primary and recurrent gliomas was concomitant with invading monocyte-derived cells with semblance to dendritic cell and macrophage/microglia-like transcriptomic features. Additionally, microglial functional diversification was noted with disease severity and mostly converged to inflammatory states in IDH-wild type recurrent gliomas. Beyond dendritic cells, multiple antigen-presenting cellular states expanded with glioma severity, especially in IDH-wild type primary and recurrent gliomas. Furthermore, we noted differential microglia and dendritic cell inherent antigen presentation axes, viz. osteopontin and classical HLAs, in IDH subtypes, and glioma-wide non-PD1 checkpoint associations in T cells, such as Galectin9 and Tim-3.
As a general utility, our immune cell deconvolution approach with single-cell-matched bulk RNA sequencing data faithfully resolved 58 cell states, providing a glioma-specific immune reference for digital cytometry applications to genomics datasets. As a result, we identified prognosticator immune cell signatures from TCGA cohorts as one of many potential immune responsiveness applications of the curated signatures for basic and translational immune-genomics efforts. Thus, we not only provide unprecedented insight into the glioma TIME but also present an immune data resource that can be exploited to guide pragmatic glioma immunotherapy designs.


2020
Author(s): Viacheslav Mylka, Jeroen Aerts, Irina Matetovici, Suresh Poovathingal, Niels Vandamme, ...

Abstract: Multiplexing of samples in single-cell RNA-seq studies allows significant reduction of experimental costs, straightforward identification of doublets, increased cell throughput, and reduction of sample-specific batch effects. Recently published multiplexing techniques using oligo-conjugated antibodies or lipids allow barcoding of sample-specific cells, a process called 'hashing'. Here, we compare the hashing performance of TotalSeq-A and -C antibodies, custom synthesized lipids, and MULTI-seq lipid hashes in four cell lines, both for single-cell RNA-seq and single-nucleus RNA-seq. Hashing efficiency was evaluated using the intrinsic genetic variation of the cell lines. Benchmarking of different hashing strategies and computational pipelines indicates that correct demultiplexing can be achieved with both lipid- and antibody-hashed human cells and nuclei, with MULTISeqDemux as the preferred demultiplexing function and antibody-based hashing as the most efficient protocol on cells. Antibody hashing was further evaluated on clinical samples using PBMCs from healthy and SARS-CoV-2 infected patients, where we demonstrate a more affordable approach for large single-cell sequencing clinical studies, while simultaneously reducing batch effects.


2019
Author(s): Simone Ciccolella, Murray Patterson, Paola Bonizzoni, Gianluca Della Vedova

Abstract
Background: Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes some approaches and tools infeasible. While this has not inhibited the development of methods for inferring phylogenies from SCS data, the continuing increase in the size and resolution of these data is beginning to put a strain on such methods. One possible solution is to reduce the size of an SCS instance (usually represented as a matrix of presence, absence, and missing values of the mutations found in the different sequenced cells) and infer the tree from this reduced-size instance. Previous approaches have used k-means to this end, clustering groups of mutations and/or cells and using these means as the reduced instance. Such an approach typically uses the Euclidean distance for computing means. However, since the values in these matrices are categorical (with three categories: present, absent, and missing), we explore applying techniques for clustering categorical data, commonly used in data mining and machine learning, to SCS data with this goal in mind.
Results: In this work, we present celluloid, a new procedure for clustering categorical vector or matrix data, here representing SCS instances. We demonstrate that celluloid clusters mutations with high precision, never pairing too many mutations that are unrelated in the ground truth, and also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. Finally, we demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in runtime and considerably raising the upper bound on the size of SCS instances that can be solved in practice.
Availability: Our approach, celluloid: clustering single cell sequencing data around centroids, is available at https://github.com/AlgoLab/celluloid/ under an MIT license.
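
The contrast between k-means and categorical clustering can be sketched with a bare-bones k-modes loop, which replaces the Euclidean distance with a mismatch count and the mean with a per-column mode. This is not celluloid's implementation or its dissimilarity measure. The encoding (0 = absent, 1 = present, 2 = missing), the treatment of "missing" as an ordinary third category, and all function names are assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions where two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category in each column of a group of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, init_modes, max_iter=20):
    """Cluster categorical vectors around modes instead of means."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assign each row to the nearest mode under the mismatch distance.
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(row, modes[j]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each mode from its current members.
        for j in range(len(modes)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes


# Toy instance: each row is one mutation, each column one cell.
rows = [
    (1, 1, 0, 0),
    (1, 1, 2, 0),
    (0, 0, 1, 1),
    (0, 2, 1, 1),
]
labels, modes = k_modes(rows, init_modes=[rows[0], rows[2]])
```

Here the first two mutations end up in one cluster and the last two in another, and each cluster's mode can stand in for its members in a reduced instance. The initial modes are passed in explicitly to keep the sketch deterministic.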


2021
Author(s): Combiz Khozoie, Nurun Fancy, Mahdi Moradi Marjaneh, Alan E. Murphy, Paul M. Matthews, ...

Advances in single-cell RNA-sequencing technology over the last decade have enabled exponential increases in throughput: datasets with over a million cells are becoming commonplace. The burgeoning scale of data generation, combined with the proliferation of alternative analysis methods, led us to develop the scFlow toolkit and the nf-core/scflow pipeline for reproducible, efficient, and scalable analyses of single-cell and single-nuclei RNA-sequencing data. The scFlow toolkit provides a higher level of abstraction on top of popular single-cell packages within an R ecosystem, while the nf-core/scflow Nextflow pipeline is built within the nf-core framework to enable compute infrastructure-independent deployment across all institutions and research facilities. Here we present our flexible pipeline, which leverages the advantages of containerization and the potential of Cloud computing for easy orchestration and scaling of the analysis of large case/control datasets by even non-expert users. We demonstrate the functionality of the analysis pipeline from sparse-matrix quality control through to insight discovery with examples of analysis of four recently published public datasets and describe the extensibility of scFlow as a modular, open-source tool for single-cell and single nuclei bioinformatic analyses.

