CUT&RUNTools 2.0: A pipeline for single-cell and bulk-level CUT&RUN and CUT&Tag data analysis

Abstract: Genome-wide profiling of transcription factor binding and chromatin states is a widely used approach for mechanistic understanding of gene regulation. Recent technology development has enabled such profiling at single-cell resolution. However, an end-to-end computational pipeline for analyzing such data is still lacking. To fill this gap, we have developed a flexible pipeline for analysis and visualization of single-cell CUT&RUN and CUT&Tag data, which provides functions for sequence alignment, quality control, dimensionality reduction, cell clustering, data aggregation, and visualization. Furthermore, it is also seamlessly integrated with the functions in original CUT&RUNTools for population-level analyses. As such, this provides a valuable toolbox for the community.

mbkmeans: Fast clustering for single cell data using mini-batch k-means

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008625 ◽

2021 ◽

Vol 17 (1) ◽

pp. e1008625

Author(s):

Stephanie C. Hicks ◽

Ruoxi Liu ◽

Yuwei Ni ◽

Elizabeth Purdom ◽

Davide Risso

Keyword(s):

Single Cell ◽

Clustering Algorithms ◽

Large Datasets ◽

Clustering Methods ◽

Cell Clustering ◽

Genome Wide ◽

Data Representations ◽

Computing Performance ◽

Cell Data ◽

Genome Wide Gene Expression

Single-cell RNA-Sequencing (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. One of the most common analyses of scRNA-seq data detects distinct subpopulations of cells through the use of unsupervised clustering algorithms. However, recent advances in scRNA-seq technologies result in current datasets ranging from thousands to millions of cells. Popular clustering algorithms, such as k-means, typically require the data to be loaded entirely into memory and therefore can be slow or impossible to run with large datasets. To address this problem, we developed the mbkmeans R/Bioconductor package, an open-source implementation of the mini-batch k-means algorithm. Our package allows for on-disk data representations, such as the common HDF5 file format widely used for single-cell data, that do not require all the data to be loaded into memory at one time. We demonstrate the performance of the mbkmeans package using large datasets, including one with 1.3 million cells. We also highlight and compare the computing performance of mbkmeans against the standard implementation of k-means and other popular single-cell clustering methods. Our software package is available in Bioconductor at https://bioconductor.org/packages/mbkmeans.
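The mini-batch idea described in this abstract can be illustrated with a minimal sketch. It is written in plain Python for readability and is not the mbkmeans R/Bioconductor API; the function names, parameters, and the per-center learning rate of 1/count are assumptions for illustration (the learning rate follows the commonly described mini-batch k-means update). Only one small batch is touched per iteration, which is why the full dataset never has to be processed in a single pass.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda j: squared_distance(point, centers[j]))


def mini_batch_kmeans(data, k, batch_size=10, n_iter=100, seed=0):
    """Illustrative mini-batch k-means (hypothetical helper, not mbkmeans).

    Each iteration draws a small random batch, assigns its points to the
    nearest current center, and nudges each center toward its assigned
    points with a step size of 1 / (points seen so far by that center).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, k)]  # initialize from data
    counts = [0] * k                                  # points seen per center
    for _ in range(n_iter):
        batch = rng.sample(data, min(batch_size, len(data)))
        # Assign the whole batch before updating any center.
        assignments = [nearest_center(p, centers) for p in batch]
        for point, j in zip(batch, assignments):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * x
                          for c, x in zip(centers[j], point)]
    return centers
```

In a real on-disk setting the `rng.sample(data, ...)` call would be replaced by reading a random block of rows from the file (for example an HDF5 dataset), so memory use scales with `batch_size` rather than with the number of cells.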

Correlation imputation in single cell RNA-seq using auxiliary information and ensemble learning

10.1101/2020.09.03.282178 ◽

2020 ◽

Author(s):

Luqin Gan ◽

Giuseppe Vinci ◽

Genevera I. Allen

Keyword(s):

Single Cell ◽

Ensemble Learning ◽

Graphical Model ◽

Auxiliary Information ◽

Rna Seq ◽

Correlation Matrices ◽

Reduction Cell ◽

Cell Clustering ◽

Novel Approach ◽

Gene Correlation

Abstract: Single cell RNA sequencing is a powerful technique that measures the gene expression of individual cells in a high-throughput fashion. However, because of sequencing inefficiency, the data are unreliable owing to dropout events, that is, technical artifacts in which genes erroneously appear to have zero expression. Many data imputation methods have been proposed to alleviate this issue. Yet, effective imputation can be difficult and biased because the data is sparse and high-dimensional, resulting in major distortions in downstream analyses. In this paper, we propose a completely novel approach that imputes the gene-by-gene correlations rather than the data itself. We call this method SCENA: Single cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information. The SCENA gene-by-gene correlation matrix estimate is obtained by model stacking of multiple imputed correlation matrices based on known auxiliary information about gene connections. In an extensive simulation study based on real scRNA-seq data, we demonstrate that SCENA not only accurately imputes gene correlations but also outperforms existing imputation approaches in downstream analyses such as dimension reduction, cell clustering, and graphical model estimation.

Zeros in scRNA-seq data: good or bad? How to embrace or tackle zeros in scRNA-seq data analysis?

10.1101/2020.12.28.424633 ◽

2020 ◽

Author(s):

Ruochen Jiang ◽

Tianyi Sun ◽

Dongyuan Song ◽

Jingyi Jessica Li

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Single Cell ◽

Biomedical Sciences ◽

Data Types ◽

Excess Zeros ◽

Cell Clustering ◽

Open Questions ◽

Genome Wide ◽

Differential Gene

Abstract: Single-cell RNA sequencing (scRNA-seq) technologies have revolutionized biomedical sciences by enabling genome-wide profiling of gene expression levels at an unprecedented single-cell resolution. A distinct characteristic of scRNA-seq data is the vast proportion of zeros unseen in bulk RNA-seq data. Researchers view these zeros differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as false signals or missing data to be corrected. As a result, the scRNA-seq field faces much controversy regarding how to handle zeros in data analysis. In this paper, we first discuss the origins of biological and non-biological zeros in scRNA-seq data. Second, we clarify the definitions of several commonly-used but ambiguous terms, including “dropouts,” “excess zeros,” and “zero inflation.” Third, we evaluate the impacts of non-biological zeros on cell clustering and differential gene expression analysis. Fourth, we summarize the advantages, disadvantages, and suitable users of three input data types: original counts, imputed counts, and binarized counts. Finally, we discuss the open questions regarding non-biological zeros, the need for benchmarking, and the importance of transparent analysis.
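As a small illustration of two of the quantities this abstract refers to, the sketch below computes the proportion of zeros in a count matrix and derives binarized counts from original counts. The function names and the toy matrix are assumptions for illustration only and are not taken from the paper; binarization here simply means recording whether a gene was detected (count greater than zero) in a cell.

```python
def zero_fraction(counts):
    """Fraction of entries equal to zero in a genes-by-cells count matrix."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total


def binarize(counts):
    """Replace each count by 1 if the gene was detected (count > 0), else 0."""
    return [[1 if value > 0 else 0 for value in row] for row in counts]


# Toy matrix: 2 genes (rows) by 3 cells (columns); values are made up.
toy_counts = [[0, 3, 0],
              [5, 0, 0]]
toy_zero_fraction = zero_fraction(toy_counts)
toy_binarized = binarize(toy_counts)
```

Note that binarization discards expression magnitude, so it cannot by itself distinguish a biological zero from a non-biological one; it only changes how the non-zero entries are represented.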

High-throughput single-cell chromatin accessibility CRISPR screens enable unbiased identification of regulatory networks in cancer

Nature Communications ◽

10.1038/s41467-021-23213-w ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Sarah E. Pierce ◽

Jeffrey M. Granja ◽

William J. Greenleaf

Keyword(s):

Transcription Factor ◽

Single Cell ◽

Cancer Cells ◽

High Throughput ◽

Regulatory Networks ◽

Temporal Dynamics ◽

Chromatin Accessibility ◽

Regulatory Regions ◽

Factor Binding ◽

Genome Wide

Abstract: Chromatin accessibility profiling can identify putative regulatory regions genome wide; however, pooled single-cell methods for assessing the effects of regulatory perturbations on accessibility are limited. Here, we report a modified droplet-based single-cell ATAC-seq protocol for perturbing and evaluating dynamic single-cell epigenetic states. This method (Spear-ATAC) enables simultaneous read-out of chromatin accessibility profiles and integrated sgRNA spacer sequences from thousands of individual cells at once. Spear-ATAC profiling of 104,592 cells representing 414 sgRNA knock-down populations reveals the temporal dynamics of epigenetic responses to regulatory perturbations in cancer cells and the associations between transcription factor binding profiles.

A robust single cell clustering method based on subspace learning and partial imputation

2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm49941.2020.9313478 ◽

2020 ◽

Author(s):

Ruiqing Zheng ◽

Zhenlan Liang ◽

Xiangmao Meng ◽

Yu Tian ◽

Min Li

Keyword(s):

Single Cell ◽

Subspace Learning ◽

Clustering Method ◽

Cell Clustering

TPK: a single-cell clustering algorithm based on novel feature selection genes

Journal of Physics Conference Series ◽

10.1088/1742-6596/1738/1/012078 ◽

2021 ◽

Vol 1738 ◽

pp. 012078

Author(s):

Yaxuan Cui ◽

Kunjie Luo ◽

Zheyu Zhang ◽

Saijia Liu

Keyword(s):

Feature Selection ◽

Single Cell ◽

Clustering Algorithm ◽

Cell Clustering

Selecting single cell clustering parameter values using subsampling-based robustness metrics

BMC Bioinformatics ◽

10.1186/s12859-021-03957-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Ryan B. Patterson-Cross ◽

Ariel J. Levine ◽

Vilas Menon

Keyword(s):

Single Cell ◽

Optimal Parameter ◽

Clustering Algorithms ◽

Cell Types ◽

Parameter Selection ◽

Data Set ◽

Biologically Relevant ◽

Cell Clustering ◽

Parameter Values ◽

Robustness Metrics

Abstract:

Background: Generating and analysing single-cell data has become a widespread approach to examine tissue heterogeneity, and numerous algorithms exist for clustering these datasets to identify putative cell types with shared transcriptomic signatures. However, many of these clustering workflows rely on user-tuned parameter values, tailored to each dataset, to identify a set of biologically relevant clusters. Whereas users often develop their own intuition as to the optimal range of parameters for clustering on each data set, the lack of systematic approaches to identify this range can be daunting to new users of any given workflow. In addition, an optimal parameter set does not guarantee that all clusters are equally well-resolved, given the heterogeneity in transcriptomic signatures in most biological systems.

Results: Here, we illustrate a subsampling-based approach (chooseR) that simultaneously guides parameter selection and characterizes cluster robustness. Through bootstrapped iterative clustering across a range of parameters, chooseR was used to select parameter values for two distinct clustering workflows (Seurat and scVI). In each case, chooseR identified parameters that produced biologically relevant clusters from both well-characterized (human PBMC) and complex (mouse spinal cord) datasets. Moreover, it provided a simple “robustness score” for each of these clusters, facilitating the assessment of cluster quality.

Conclusion: chooseR is a simple, conceptually understandable tool that can be used flexibly across clustering algorithms, workflows, and datasets to guide clustering parameter selection and characterize cluster robustness.
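The general idea of judging cluster robustness by repeated subsampling can be sketched as follows. This is not the chooseR implementation: the function names, the 80% subsampling fraction, and the toy one-dimensional clustering rule are assumptions chosen to keep the example self-contained. For every pair of items, the sketch records how often the two land in the same cluster across the rounds in which both were sampled; pairs that co-cluster consistently indicate a stable grouping.

```python
import random
from itertools import combinations


def split_at_largest_gap(values):
    """Toy 1-D clustering rule (hypothetical): cut at the widest gap."""
    order = sorted(values)
    gaps = [(order[n + 1] - order[n], n) for n in range(len(order) - 1)]
    _, cut = max(gaps)
    threshold = (order[cut] + order[cut + 1]) / 2
    return [0 if v < threshold else 1 for v in values]


def co_clustering_frequency(items, cluster_fn, n_rounds=50, fraction=0.8, seed=0):
    """For each pair of item indices, the fraction of subsampling rounds
    (among rounds where both were drawn) in which they shared a cluster."""
    rng = random.Random(seed)
    together = {}  # pair -> rounds in which both items shared a label
    sampled = {}   # pair -> rounds in which both items were drawn
    n_keep = max(2, int(fraction * len(items)))
    for _ in range(n_rounds):
        subset = sorted(rng.sample(range(len(items)), n_keep))
        labels = cluster_fn([items[i] for i in subset])
        for (a, i), (b, j) in combinations(enumerate(subset), 2):
            pair = (i, j)
            sampled[pair] = sampled.get(pair, 0) + 1
            if labels[a] == labels[b]:
                together[pair] = together.get(pair, 0) + 1
    return {pair: together.get(pair, 0) / n for pair, n in sampled.items()}
```

One plausible way to summarize the result is to average the within-cluster frequencies for each cluster and to repeat the procedure over a grid of clustering parameters, preferring settings whose clusters stay stable; `cluster_fn` can be any function that maps a list of items to a list of labels.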

High-resolution single-cell 3D-models of chromatin ensembles during Drosophila embryogenesis

Nature Communications ◽

10.1038/s41467-020-20490-9 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Qiu Sun ◽

Alan Perez-Rathke ◽

Daniel M. Czajkowsky ◽

Zhifeng Shao ◽

Jie Liang

Keyword(s):

High Resolution ◽

Single Cell ◽

3D Models ◽

Computational Method ◽

Midblastula Transition ◽

Genome Wide ◽

Large Ensembles ◽

Chromatin Folding ◽

Function Relationship ◽

Relationship Of

Abstract: Single-cell chromatin studies provide insights into how chromatin structure relates to functions of individual cells. However, balancing high resolution and genome-wide coverage remains challenging. We describe a computational method for the reconstruction of large 3D-ensembles of single-cell (sc) chromatin conformations from population Hi-C that we apply to study embryogenesis in Drosophila. With minimal assumptions of physical properties and without adjustable parameters, our method generates large ensembles of chromatin conformations via deep-sampling. Our method identifies specific interactions, which constitute 5–6% of Hi-C frequencies, but surprisingly are sufficient to drive chromatin folding, giving rise to the observed Hi-C patterns. Modeled sc-chromatins quantify chromatin heterogeneity, revealing significant changes during embryogenesis. Furthermore, >50% of modeled sc-chromatin maintain topologically associating domains (TADs) in early embryos, when no population TADs are perceptible. Domain boundaries become fixated during development, with strong preference at binding-sites of insulator-complexes upon the midblastula transition. Overall, high-resolution 3D-ensembles of sc-chromatin conformations enable further in-depth interpretation of population Hi-C, improving understanding of the structure-function relationship of genome organization.

Reconstructing single-cell karyotype alterations in colorectal cancer identifies punctuated and gradual diversification patterns

Nature Genetics ◽

10.1038/s41588-021-00891-2 ◽

2021 ◽

Author(s):

Yannik Bollen ◽

Ellen Stelloo ◽

Petra van Leenen ◽

Myrna van den Bos ◽

Bas Ponsioen ◽

...

Keyword(s):

Colorectal Cancer ◽

Single Cell ◽

Temporal Dynamics ◽

De Novo ◽

Tumor Evolution ◽

Genome Wide ◽

Aneuploid Tumor ◽

Evolutionary Paths ◽

Tumor Biopsies ◽

Punctuated Evolution

Abstract: Central to tumor evolution is the generation of genetic diversity. However, the extent and patterns by which de novo karyotype alterations emerge and propagate within human tumors are not well understood, especially at single-cell resolution. Here, we present 3D Live-Seq, a protocol that integrates live-cell imaging of tumor organoid outgrowth and whole-genome sequencing of each imaged cell to reconstruct evolving tumor cell karyotypes across consecutive cell generations. Using patient-derived colorectal cancer organoids and fresh tumor biopsies, we demonstrate that karyotype alterations of varying complexity are prevalent and can arise within a few cell generations. Sub-chromosomal acentric fragments were prone to replication and collective missegregation across consecutive cell divisions. In contrast, gross genome-wide karyotype alterations were generated in a single erroneous cell division, providing support that aneuploid tumor genomes can evolve via punctuated evolution. Mapping the temporal dynamics and patterns of karyotype diversification in cancer enables reconstructions of evolutionary paths to malignant fitness.

Deep Denoising Subspace Single-Cell Clustering

Communications in Computer and Information Science - Neural Information Processing ◽

10.1007/978-3-030-63823-8_36 ◽

2020 ◽

pp. 308-315

Author(s):

Yijie Wang ◽

Bo Yang

Keyword(s):

Single Cell ◽

Cell Clustering
