CUT&RUNTools 2.0: A pipeline for single-cell and bulk-level CUT&RUN and CUT&Tag data analysis

2021
Author(s): Fulong Yu ◽ Vijay G. Sankaran ◽ Guo-Cheng Yuan

Abstract: Genome-wide profiling of transcription factor binding and chromatin states is a widely used approach for the mechanistic understanding of gene regulation. Recent technological developments have enabled such profiling at single-cell resolution. However, an end-to-end computational pipeline for analyzing such data is still lacking. To fill this gap, we have developed a flexible pipeline for the analysis and visualization of single-cell CUT&RUN and CUT&Tag data, which provides functions for sequence alignment, quality control, dimensionality reduction, cell clustering, data aggregation, and visualization. It is also seamlessly integrated with the functions of the original CUT&RUNTools for population-level analyses. As such, the pipeline provides a valuable toolbox for the community.

2021 ◽ Vol 17 (1) ◽ pp. e1008625
Author(s): Stephanie C. Hicks ◽ Ruoxi Liu ◽ Yuwei Ni ◽ Elizabeth Purdom ◽ Davide Risso

Single-cell RNA-Sequencing (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. One of the most common analyses of scRNA-seq data detects distinct subpopulations of cells through the use of unsupervised clustering algorithms. However, recent advances in scRNA-seq technologies result in current datasets ranging from thousands to millions of cells. Popular clustering algorithms, such as k-means, typically require the data to be loaded entirely into memory and therefore can be slow or impossible to run with large datasets. To address this problem, we developed the mbkmeans R/Bioconductor package, an open-source implementation of the mini-batch k-means algorithm. Our package allows for on-disk data representations, such as the common HDF5 file format widely used for single-cell data, that do not require all the data to be loaded into memory at one time. We demonstrate the performance of the mbkmeans package using large datasets, including one with 1.3 million cells. We also highlight and compare the computing performance of mbkmeans against the standard implementation of k-means and other popular single-cell clustering methods. Our software package is available in Bioconductor at https://bioconductor.org/packages/mbkmeans.
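
The mini-batch idea described in this abstract can be illustrated with a minimal sketch. This is not the mbkmeans package or its API; it is a plain-Python illustration of a mini-batch k-means update with a per-centre learning rate. The callback `read_rows` is a hypothetical stand-in for reading a slice of rows from an on-disk store such as an HDF5 file, so that only one batch is held in memory at a time.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mini_batch_kmeans(read_rows, n_rows, k, batch_size=16, n_iter=100, seed=0):
    """Mini-batch k-means sketch.

    read_rows(indices) is a hypothetical callback returning the rows for the
    requested indices; it stands in for an on-disk reader, so the full data
    matrix is never loaded at once.
    """
    rng = random.Random(seed)
    # Initialise the centres from k distinct, randomly chosen rows.
    centres = [list(row) for row in read_rows(rng.sample(range(n_rows), k))]
    counts = [0] * k  # number of points each centre has absorbed so far
    for _ in range(n_iter):
        # Draw one small random batch (with replacement).
        batch = read_rows([rng.randrange(n_rows) for _ in range(batch_size)])
        # Assign every batch point while the centres are held fixed.
        nearest = [
            min(range(k), key=lambda j: squared_distance(centres[j], x))
            for x in batch
        ]
        # Move each assigned centre toward its point; the step size shrinks
        # as the centre absorbs more points.
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centres[j] = [(1.0 - eta) * c + eta * xi for c, xi in zip(centres[j], x)]
    return centres
```

Because each update is a convex combination of a centre and a data point, the centres always stay within the range of the data. In practice, `read_rows` would wrap whatever chunked reader the storage format provides.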


2020
Author(s): Luqin Gan ◽ Giuseppe Vinci ◽ Genevera I. Allen

Abstract: Single-cell RNA sequencing is a powerful technique that measures the gene expression of individual cells in a high-throughput fashion. However, because of sequencing inefficiency, the data are unreliable owing to dropout events, technical artifacts in which genes erroneously appear to have zero expression. Many data imputation methods have been proposed to alleviate this issue. Yet effective imputation can be difficult and biased because the data are sparse and high-dimensional, resulting in major distortions in downstream analyses. In this paper, we propose a novel approach that imputes the gene-by-gene correlations rather than the data themselves. We call this method SCENA: Single cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information. The SCENA gene-by-gene correlation matrix estimate is obtained by model stacking of multiple imputed correlation matrices based on known auxiliary information about gene connections. In an extensive simulation study based on real scRNA-seq data, we demonstrate that SCENA not only accurately imputes gene correlations but also outperforms existing imputation approaches in downstream analyses such as dimension reduction, cell clustering, and graphical model estimation.


2020
Author(s): Ruochen Jiang ◽ Tianyi Sun ◽ Dongyuan Song ◽ Jingyi Jessica Li

Abstract: Single-cell RNA sequencing (scRNA-seq) technologies have revolutionized biomedical sciences by enabling genome-wide profiling of gene expression levels at an unprecedented single-cell resolution. A distinct characteristic of scRNA-seq data is the vast proportion of zeros unseen in bulk RNA-seq data. Researchers view these zeros differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as false signals or missing data to be corrected. As a result, the scRNA-seq field faces much controversy regarding how to handle zeros in data analysis. In this paper, we first discuss the origins of biological and non-biological zeros in scRNA-seq data. Second, we clarify the definitions of several commonly used but ambiguous terms, including “dropouts,” “excess zeros,” and “zero inflation.” Third, we evaluate the impacts of non-biological zeros on cell clustering and differential gene expression analysis. Fourth, we summarize the advantages, disadvantages, and suitable users of three input data types: original counts, imputed counts, and binarized counts. Finally, we discuss the open questions regarding non-biological zeros, the need for benchmarking, and the importance of transparent analysis.
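
Two of the quantities mentioned in this abstract, the proportion of zeros and binarized counts, are simple to compute. The sketch below is illustrative only and is not taken from the paper; the toy matrix `counts` (rows are cells, columns are genes) and both function names are assumptions made for the example.

```python
def zero_fraction_per_gene(counts):
    """Fraction of cells (rows) with a zero count, for each gene (column)."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    return [
        sum(1 for row in counts if row[g] == 0) / n_cells
        for g in range(n_genes)
    ]


def binarize(counts):
    """Replace every count by 1 if the gene was detected (> 0), else 0."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


# Toy count matrix (hypothetical data): 4 cells x 3 genes.
counts = [
    [0, 3, 0],
    [5, 0, 0],
    [0, 1, 0],
    [2, 0, 7],
]

zero_fractions = zero_fraction_per_gene(counts)
binary = binarize(counts)
```

Binarization keeps only whether a gene was detected, so it cannot distinguish a biological zero from a non-biological one; that distinction is what the imputation debate is about.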


2021 ◽ Vol 12 (1)
Author(s): Sarah E. Pierce ◽ Jeffrey M. Granja ◽ William J. Greenleaf

Abstract: Chromatin accessibility profiling can identify putative regulatory regions genome-wide; however, pooled single-cell methods for assessing the effects of regulatory perturbations on accessibility are limited. Here, we report a modified droplet-based single-cell ATAC-seq protocol for perturbing and evaluating dynamic single-cell epigenetic states. This method (Spear-ATAC) enables simultaneous read-out of chromatin accessibility profiles and integrated sgRNA spacer sequences from thousands of individual cells at once. Spear-ATAC profiling of 104,592 cells representing 414 sgRNA knock-down populations reveals the temporal dynamics of epigenetic responses to regulatory perturbations in cancer cells and the associations between transcription factor binding profiles.


2021 ◽ Vol 1738 ◽ pp. 012078
Author(s): Yaxuan Cui ◽ Kunjie Luo ◽ Zheyu Zhang ◽ Saijia Liu

2021 ◽ Vol 22 (1)
Author(s): Ryan B. Patterson-Cross ◽ Ariel J. Levine ◽ Vilas Menon

Abstract
Background: Generating and analysing single-cell data has become a widespread approach to examine tissue heterogeneity, and numerous algorithms exist for clustering these datasets to identify putative cell types with shared transcriptomic signatures. However, many of these clustering workflows rely on user-tuned parameter values, tailored to each dataset, to identify a set of biologically relevant clusters. Whereas users often develop their own intuition as to the optimal range of parameters for clustering on each data set, the lack of systematic approaches to identify this range can be daunting to new users of any given workflow. In addition, an optimal parameter set does not guarantee that all clusters are equally well-resolved, given the heterogeneity in transcriptomic signatures in most biological systems.
Results: Here, we illustrate a subsampling-based approach (chooseR) that simultaneously guides parameter selection and characterizes cluster robustness. Through bootstrapped iterative clustering across a range of parameters, chooseR was used to select parameter values for two distinct clustering workflows (Seurat and scVI). In each case, chooseR identified parameters that produced biologically relevant clusters from both well-characterized (human PBMC) and complex (mouse spinal cord) datasets. Moreover, it provided a simple “robustness score” for each of these clusters, facilitating the assessment of cluster quality.
Conclusion: chooseR is a simple, conceptually understandable tool that can be used flexibly across clustering algorithms, workflows, and datasets to guide clustering parameter selection and characterize cluster robustness.
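
The general idea of scoring cluster robustness by repeated subsampling can be sketched as follows. This is not chooseR's implementation, and the scoring rule is an assumption made for illustration: items are repeatedly subsampled and re-clustered, the fraction of rounds in which each pair of items lands in the same cluster is recorded, and a cluster's score is the mean of that fraction over its within-cluster pairs. The function names and the `cluster_fn` callback are hypothetical.

```python
import random
from itertools import combinations


def co_clustering_frequency(items, cluster_fn, n_rounds=50, fraction=0.8, seed=0):
    """For each pair of item indices, the fraction of subsampling rounds in
    which both were drawn AND received the same label from cluster_fn.

    cluster_fn(list_of_items) -> list_of_labels is a hypothetical stand-in
    for any clustering workflow run at one fixed parameter setting.
    """
    rng = random.Random(seed)
    n_sub = max(2, int(fraction * len(items)))
    together = {}  # pair -> rounds in which the pair shared a cluster
    sampled = {}   # pair -> rounds in which both members were drawn
    for _ in range(n_rounds):
        # Sorting keeps every pair keyed as (smaller index, larger index).
        subset = sorted(rng.sample(range(len(items)), n_sub))
        labels = cluster_fn([items[i] for i in subset])
        for (a, label_a), (b, label_b) in combinations(zip(subset, labels), 2):
            sampled[(a, b)] = sampled.get((a, b), 0) + 1
            if label_a == label_b:
                together[(a, b)] = together.get((a, b), 0) + 1
    return {pair: together.get(pair, 0) / n for pair, n in sampled.items()}


def robustness_scores(reference_labels, freq):
    """Mean co-clustering frequency over the within-cluster pairs of each
    cluster in a reference clustering (an assumed, simplified score)."""
    sums, counts = {}, {}
    for (a, b), f in freq.items():
        if reference_labels[a] == reference_labels[b]:
            label = reference_labels[a]
            sums[label] = sums.get(label, 0.0) + f
            counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}
```

Running the two functions over a grid of clustering parameters and comparing the resulting scores is one way such a procedure could guide parameter choice. A cluster whose members rarely co-cluster across subsamples would receive a low score.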


2021 ◽ Vol 12 (1)
Author(s): Qiu Sun ◽ Alan Perez-Rathke ◽ Daniel M. Czajkowsky ◽ Zhifeng Shao ◽ Jie Liang

Abstract: Single-cell chromatin studies provide insights into how chromatin structure relates to functions of individual cells. However, balancing high resolution and genome-wide coverage remains challenging. We describe a computational method for the reconstruction of large 3D-ensembles of single-cell (sc) chromatin conformations from population Hi-C that we apply to study embryogenesis in Drosophila. With minimal assumptions of physical properties and without adjustable parameters, our method generates large ensembles of chromatin conformations via deep-sampling. Our method identifies specific interactions, which constitute 5–6% of Hi-C frequencies, but surprisingly are sufficient to drive chromatin folding, giving rise to the observed Hi-C patterns. Modeled sc-chromatins quantify chromatin heterogeneity, revealing significant changes during embryogenesis. Furthermore, >50% of modeled sc-chromatins maintain topologically associating domains (TADs) in early embryos, when no population TADs are perceptible. Domain boundaries become fixed during development, with a strong preference for binding sites of insulator complexes upon the midblastula transition. Overall, high-resolution 3D-ensembles of sc-chromatin conformations enable further in-depth interpretation of population Hi-C, improving understanding of the structure-function relationship of genome organization.


2021
Author(s): Yannik Bollen ◽ Ellen Stelloo ◽ Petra van Leenen ◽ Myrna van den Bos ◽ Bas Ponsioen ◽ ...

Abstract: Central to tumor evolution is the generation of genetic diversity. However, the extent and patterns by which de novo karyotype alterations emerge and propagate within human tumors are not well understood, especially at single-cell resolution. Here, we present 3D Live-Seq, a protocol that integrates live-cell imaging of tumor organoid outgrowth and whole-genome sequencing of each imaged cell to reconstruct evolving tumor cell karyotypes across consecutive cell generations. Using patient-derived colorectal cancer organoids and fresh tumor biopsies, we demonstrate that karyotype alterations of varying complexity are prevalent and can arise within a few cell generations. Sub-chromosomal acentric fragments were prone to replication and collective missegregation across consecutive cell divisions. In contrast, gross genome-wide karyotype alterations were generated in a single erroneous cell division, providing support that aneuploid tumor genomes can evolve via punctuated evolution. Mapping the temporal dynamics and patterns of karyotype diversification in cancer enables reconstructions of evolutionary paths to malignant fitness.

